How to Migrate from dbt Semantic Layer to Databricks Unity Catalog Metric Views

Typedef Team

Migrating from dbt MetricFlow to Databricks Unity Catalog Metric Views requires translating YAML semantic models into Unity Catalog objects and adapting query patterns from external APIs to native SQL with MEASURE() clauses. This guide covers translation patterns, validation techniques, and implementation strategies. Typedef's Fenic framework provides semantic data processing capabilities for metric validation and quality assurance throughout the migration.


Migration Architecture

dbt MetricFlow vs Databricks Metric Views: Architectural Differences

The migration involves fundamental shifts in metric definition, storage, and access patterns.

Definition storage:

  • dbt MetricFlow: YAML files in git repositories
  • Databricks Metric Views: Unity Catalog objects defined via SQL DDL or YAML import

Execution model:

  • dbt MetricFlow: Query generation engine compiling requests into SQL for external warehouses
  • Databricks Metric Views: Native Spark SQL execution with Photon optimization

Query interface:

  • dbt MetricFlow: JDBC connections via Arrow Flight, GraphQL API
  • Databricks Metric Views: MEASURE() clause in SQL queries, native Databricks integrations

Access control:

  • dbt MetricFlow: Warehouse-level permissions
  • Databricks Metric Views: Metric view-level RBAC through Unity Catalog

Join semantics:

  • dbt MetricFlow: Semantic graph with automatic multi-hop joins
  • Databricks Metric Views: Explicit join specifications in YAML or DDL

Time intelligence:

  • dbt MetricFlow: Automatic time grains (metric__day, metric__month, metric__year)
  • Databricks Metric Views: Explicit dimension definitions per time granularity

What Transfers Directly

These elements require minimal translation:

  • Simple aggregation metrics (SUM, COUNT, AVG, MIN, MAX)
  • Basic ratio metrics with numerator/denominator logic
  • Dimension definitions and column expressions
  • Star schema relationship patterns
  • Metric descriptions and business documentation

What Requires Manual Translation

These elements need significant rework:

  • Conversion metrics (cohort-based, time-windowed) - No native equivalent in Databricks
  • Cumulative metrics - Require window function implementations
  • Automatic time grains - Must define separate dimensions for each grain
  • GraphQL API queries - Convert to SQL with MEASURE() clauses
  • dbt-specific metric types - Translate to Spark SQL expressions

What Does Not Transfer

These capabilities have no direct equivalent:

  • GraphQL API access for programmatic queries
  • dbt Cloud infrastructure and service layer
  • JDBC Arrow Flight connections
  • dbt test framework for metrics
  • Automatic semantic graph join resolution

Pre-Migration Assessment

Metric Inventory

Catalog all metrics in your dbt project to assess migration scope.

Locate semantic models:

bash
find models/ -name "*.yml" -exec grep -l "semantic_models:" {} \;

Count metrics by type:

bash
grep -r "type: simple" models/ | wc -l
grep -r "type: ratio" models/ | wc -l
grep -r "type: derived" models/ | wc -l
grep -r "type: conversion" models/ | wc -l

Categorize by translation difficulty:

Category  | Characteristics                                                  | Examples
Simple    | Direct aggregations, basic filters                               | SUM(revenue), COUNT(orders)
Moderate  | Ratios, derived metrics, standard joins                          | avg_order_value, profit_margin
Complex   | Conversion funnels, cumulative calculations, custom expressions  | signup_to_purchase_rate, lifetime_value
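
To speed up this triage, a short script can bucket metrics by their declared type. This is a minimal sketch using PyYAML; it assumes metric definitions live under models/ and treats any unrecognized type as Complex.

python
# Bucket dbt metrics by migration difficulty based on their declared type.
# Assumes metric definitions live in models/**/*.yml; adjust paths for your project.
from collections import Counter
from pathlib import Path

import yaml  # PyYAML

DIFFICULTY = {
    "simple": "Simple",
    "ratio": "Moderate",
    "derived": "Moderate",
    "cumulative": "Complex",
    "conversion": "Complex",
}

counts = Counter()
for path in Path("models").rglob("*.yml"):
    doc = yaml.safe_load(path.read_text()) or {}
    for metric in doc.get("metrics", []):
        counts[DIFFICULTY.get(metric.get("type", ""), "Complex")] += 1

for category, total in sorted(counts.items()):
    print(f"{category}: {total} metrics")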

Unity Catalog Readiness

Verify Databricks environment prerequisites.

Check Unity Catalog status:

sql
SHOW CATALOGS;

If results show only hive_metastore, Unity Catalog migration is required before proceeding.
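
The same check can be scripted as a preflight step. A minimal sketch using the databricks-sql-connector, assuming the standard DATABRICKS_* environment variables are set.

python
# Preflight: confirm catalogs beyond hive_metastore exist before planning metric views.
import os

from databricks import sql

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SHOW CATALOGS")
        catalogs = {row[0] for row in cursor.fetchall()}

if catalogs <= {"hive_metastore", "samples"}:
    print("Only legacy catalogs found -- complete Unity Catalog setup before migrating")
else:
    print(f"Unity Catalog catalogs available: {sorted(catalogs)}")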

Design namespace structure:

<catalog>.<schema>.<metric_view>

Examples:
production.finance.revenue_metrics
production.marketing.campaign_metrics
production.sales.pipeline_metrics

Align namespaces with organizational domains for discoverability.

Plan access control mapping:

sql
-- Map dbt consumer access to Unity Catalog permissions
GRANT SELECT ON METRIC VIEW production.finance.revenue_metrics TO `finance_analysts`;
GRANT SELECT ON METRIC VIEW production.marketing.campaign_metrics TO `marketing_team`;

Document which roles, users, and teams need access to which metric views.
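
One lightweight way to keep this documentation actionable is to store the access matrix as data and generate the GRANT statements from it. The view-to-group mapping below is illustrative; replace it with your own inventory.

python
# Generate GRANT statements from a documented access matrix.
access_matrix = {
    "production.finance.revenue_metrics": ["finance_analysts", "executive_dashboard_service"],
    "production.marketing.campaign_metrics": ["marketing_team"],
    "production.sales.pipeline_metrics": ["sales_ops"],
}

for metric_view, groups in access_matrix.items():
    for group in groups:
        print(f"GRANT SELECT ON METRIC VIEW {metric_view} TO `{group}`;")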

BI Tool Integration Assessment

Identify how current tools connect to dbt MetricFlow:

  • Tools with native integrations (Hex, Mode) require reconfiguration
  • JDBC-connected tools (Tableau, Power BI) need connector updates
  • Custom applications using GraphQL API need SQL rewrites

Translation Patterns

Simple Metrics

Direct aggregations translate with minimal changes.

dbt definition:

yaml
semantic_models:
  - name: orders
    measures:
      - name: order_total
        agg: sum
      - name: order_count
        agg: count

metrics:
  - name: total_revenue
    type: simple
    measure: order_total

Databricks YAML:

yaml
metric_view:
  name: order_metrics
  source_table: main.sales.orders
  measures:
    - name: total_revenue
      expr: SUM(order_amount)
      description: "Total order revenue"

    - name: order_count
      expr: COUNT(DISTINCT order_id)
      description: "Total number of orders"

Databricks SQL DDL:

sql
CREATE METRIC VIEW production.sales.order_metrics AS
SELECT
  SUM(order_amount) AS total_revenue,
  COUNT(DISTINCT order_id) AS order_count,
  DATE(order_timestamp) AS order_date,
  customer_id
FROM main.sales.orders;

Ratio Metrics

Convert numerator/denominator specifications to explicit expressions.

dbt definition:

yaml
metrics:
  - name: avg_order_value
    type: ratio
    numerator: total_revenue
    denominator: order_count

Databricks YAML:

yaml
measures:
  - name: avg_order_value
    expr: SUM(order_amount) / COUNT(DISTINCT order_id)
    description: "Average revenue per order"

Databricks SQL DDL:

sql
CREATE METRIC VIEW production.sales.order_metrics AS
SELECT
  SUM(order_amount) / COUNT(DISTINCT order_id) AS avg_order_value,
  customer_region,
  DATE(order_timestamp) AS order_date
FROM main.sales.orders
JOIN main.sales.customers ON orders.customer_id = customers.customer_id;

The key difference: dbt aggregates the numerator and denominator separately so the ratio re-aggregates correctly at any grain, while Databricks requires an explicit expression that aggregates first and then divides to preserve the same semantics.
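
A toy calculation shows why the distinction matters: averaging pre-computed per-group ratios gives a different (and usually wrong) answer than dividing aggregated totals, which is what the measure expression above does.

python
# Two regions with very different order volumes.
orders = [
    {"region": "US-West", "revenue": 300.0, "orders": 2},   # per-region AOV: 150
    {"region": "US-East", "revenue": 100.0, "orders": 10},  # per-region AOV: 10
]

# Correct: aggregate numerator and denominator first, then divide
# (mirrors SUM(order_amount) / COUNT(DISTINCT order_id) evaluated at query time).
ratio_of_sums = sum(r["revenue"] for r in orders) / sum(r["orders"] for r in orders)

# Incorrect: averaging pre-computed ratios weights each region equally.
average_of_ratios = sum(r["revenue"] / r["orders"] for r in orders) / len(orders)

print(f"ratio of sums:     {ratio_of_sums:.2f}")      # 33.33
print(f"average of ratios: {average_of_ratios:.2f}")  # 80.00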

Derived Metrics

Combine multiple metrics into formulas.

dbt definition:

yaml
metrics:
  - name: profit
    type: derived
    expr: total_revenue - total_cost

  - name: profit_margin
    type: derived
    expr: profit / total_revenue

Databricks YAML:

yaml
measures:
  - name: profit
    expr: SUM(order_amount) - SUM(cost_amount)
    description: "Revenue minus costs"

  - name: profit_margin
    expr: (SUM(order_amount) - SUM(cost_amount)) / SUM(order_amount)
    description: "Profit as percentage of revenue"

Databricks SQL DDL:

sql
CREATE METRIC VIEW production.finance.profit_metrics AS
SELECT
  SUM(order_amount) - SUM(cost_amount) AS profit,
  (SUM(order_amount) - SUM(cost_amount)) / SUM(order_amount) AS profit_margin,
  product_category,
  DATE(order_timestamp) AS order_date
FROM main.sales.orders
JOIN main.sales.products ON orders.product_id = products.product_id;

Conversion Metrics (Manual Implementation Required)

dbt's conversion metric type has no Databricks equivalent. Implement the time-window logic manually with CTEs and conditional aggregation.

dbt definition:

yaml
metrics:
  - name: signup_to_purchase_rate
    type: conversion
    conversion_type_params:
      entity: user_id
      calculation: conversion_rate
      window: 30 days
      base_measure: signups
      conversion_measure: purchases

Databricks SQL implementation:

sql
CREATE METRIC VIEW production.growth.conversion_metrics AS
WITH user_events AS (
  SELECT
    user_id,
    MIN(CASE WHEN event_type = 'signup' THEN event_timestamp END) AS signup_time,
    MIN(CASE WHEN event_type = 'purchase' THEN event_timestamp END) AS purchase_time
  FROM main.events.user_events
  GROUP BY user_id
),
conversions AS (
  SELECT
    user_id,
    signup_time,
    purchase_time,
    CASE
      WHEN purchase_time IS NOT NULL
        AND purchase_time <= signup_time + INTERVAL 30 DAYS
      THEN 1
      ELSE 0
    END AS converted
  FROM user_events
  WHERE signup_time IS NOT NULL
)
SELECT
  DATE_TRUNC('month', signup_time) AS signup_month,
  SUM(converted) AS conversions,
  COUNT(*) AS signups,
  SUM(converted) / COUNT(*) * 100 AS conversion_rate
FROM conversions
GROUP BY DATE_TRUNC('month', signup_time);

This manual implementation requires careful validation to ensure its results match the original dbt logic.
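
One practical validation aid is a small in-memory reference implementation of the window logic that you can run against hand-built fixture data and compare with the SQL output. This is a sketch; the event names and 30-day window match the example above.

python
# Reference implementation of the 30-day signup-to-purchase window for fixture data.
from datetime import datetime, timedelta

events = [
    ("u1", "signup", datetime(2024, 1, 1)),
    ("u1", "purchase", datetime(2024, 1, 15)),  # converts (14 days)
    ("u2", "signup", datetime(2024, 1, 5)),
    ("u2", "purchase", datetime(2024, 3, 1)),   # outside the 30-day window
    ("u3", "signup", datetime(2024, 1, 10)),    # never purchases
]

# First occurrence of each event type per user, mirroring the MIN(CASE ...) CTE.
first_event = {}
for user_id, event_type, ts in events:
    key = (user_id, event_type)
    first_event[key] = min(first_event.get(key, ts), ts)

signups = {u for (u, e) in first_event if e == "signup"}
conversions = sum(
    1
    for u in signups
    if (u, "purchase") in first_event
    and first_event[(u, "purchase")] <= first_event[(u, "signup")] + timedelta(days=30)
)

print(f"signups={len(signups)} conversions={conversions} rate={conversions / len(signups):.1%}")
# Expected: signups=3 conversions=1 rate=33.3%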

Dimension Definitions

Translate dimension expressions and time dimensions.

dbt definition:

yaml
semantic_models:
  - name: orders
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day

      - name: customer_tier
        expr: |
          CASE
            WHEN lifetime_value > 10000 THEN 'VIP'
            WHEN lifetime_value > 1000 THEN 'Standard'
            ELSE 'Basic'
          END

Databricks YAML:

yaml
dimensions:
  - name: order_date_day
    expr: DATE(order_timestamp)

  - name: order_date_week
    expr: DATE_TRUNC('week', order_timestamp)

  - name: order_date_month
    expr: DATE_TRUNC('month', order_timestamp)

  - name: customer_tier
    expr: |
      CASE
        WHEN lifetime_value > 10000 THEN 'VIP'
        WHEN lifetime_value > 1000 THEN 'Standard'
        ELSE 'Basic'
      END

Key difference: dbt auto-generates time grains from a single time dimension, while Databricks requires an explicit dimension definition for each grain.
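
Because each grain must be spelled out, a small generator can keep the boilerplate consistent across metric views. A sketch using PyYAML; the naming convention (order_date_day, order_date_week, ...) follows the examples above.

python
# Generate explicit per-grain dimension entries from a single time column.
import yaml  # PyYAML

def time_grain_dimensions(column, base_name, grains):
    dims = [{"name": f"{base_name}_day", "expr": f"DATE({column})"}]
    dims += [
        {"name": f"{base_name}_{grain}", "expr": f"DATE_TRUNC('{grain}', {column})"}
        for grain in grains
    ]
    return dims

dimensions = time_grain_dimensions(
    column="order_timestamp",
    base_name="order_date",
    grains=["week", "month", "quarter", "year"],
)
print(yaml.safe_dump({"dimensions": dimensions}, sort_keys=False))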

Relationship and Join Definitions

Translate entity relationships to explicit joins.

dbt semantic graph:

yaml
semantic_models:
  - name: orders
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
      - name: product_id
        type: foreign

  - name: customers
    entities:
      - name: customer_id
        type: primary

  - name: products
    entities:
      - name: product_id
        type: primary

dbt automatically resolves joins via entities. Databricks requires explicit join specifications.

Databricks YAML:

yaml
metric_view:
  name: order_metrics
  source_table: main.sales.orders
  joins:
    - table: main.sales.customers
      on: orders.customer_id = customers.customer_id

    - table: main.sales.products
      on: orders.product_id = products.product_id

  measures:
    - name: total_revenue
      expr: SUM(order_amount)

  dimensions:
    - name: customer_region
      expr: customers.region

    - name: product_category
      expr: products.category

Databricks SQL DDL:

sql
CREATE METRIC VIEW production.sales.order_metrics AS
SELECT
  SUM(order_amount) AS total_revenue,
  COUNT(DISTINCT order_id) AS order_count,
  customers.region AS customer_region,
  products.category AS product_category,
  DATE(order_timestamp) AS order_date
FROM main.sales.orders
JOIN main.sales.customers
  ON orders.customer_id = customers.customer_id
JOIN main.sales.products
  ON orders.product_id = products.product_id;

Implementation Steps

Step 1: Export dbt Semantic Models

Extract metric definitions from dbt project.

Locate all semantic models:

bash
# Find semantic model files
find models/ -type f -name "*.yml" | xargs grep -l "semantic_models:"

# Extract semantic model files into a single file for review
find models/ -type f -name "*.yml" -exec grep -l "semantic_models:" {} \; | xargs cat > all_metrics.yml

Document dependencies:

Create inventory spreadsheet:

Metric Name     | Type       | Source Tables | Dimensions             | Consumers
total_revenue   | simple     | orders        | region, date           | Finance dashboard
avg_order_value | ratio      | orders        | region, date, product  | Executive KPIs
conversion_rate | conversion | users, orders | cohort_month           | Growth analytics
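
Rather than filling the spreadsheet entirely by hand, the first columns can be extracted from the YAML directly; consumers usually still need manual annotation from dashboard and report audits. A minimal sketch using PyYAML and the same models/ layout as above.

python
# Build a starting inventory CSV from dbt metric definitions.
import csv
from pathlib import Path

import yaml  # PyYAML

rows = []
for path in Path("models").rglob("*.yml"):
    doc = yaml.safe_load(path.read_text()) or {}
    for metric in doc.get("metrics", []):
        rows.append({
            "metric_name": metric.get("name", ""),
            "type": metric.get("type", ""),
            "source_file": str(path),
            "consumers": "",  # fill in manually from dashboard/report audits
        })

with open("metric_inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["metric_name", "type", "source_file", "consumers"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} metrics to metric_inventory.csv")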

Step 2: Create Unity Catalog Structure

Set up catalog hierarchy for metric views.

Create catalog and schemas:

sql
-- Create catalog if needed
CREATE CATALOG IF NOT EXISTS production;

-- Create schemas by domain
CREATE SCHEMA IF NOT EXISTS production.finance;
CREATE SCHEMA IF NOT EXISTS production.marketing;
CREATE SCHEMA IF NOT EXISTS production.sales;
CREATE SCHEMA IF NOT EXISTS production.product;

Set ownership and permissions:

sql
-- Assign schema ownership
ALTER SCHEMA production.finance OWNER TO `finance_team`;
ALTER SCHEMA production.marketing OWNER TO `marketing_team`;

-- Grant catalog-level access
GRANT USE CATALOG ON CATALOG production TO `all_users`;
GRANT USE SCHEMA ON SCHEMA production.finance TO `finance_analysts`;

Step 3: Register Metric Views

Create metric views using translated definitions.

Option 1: YAML-based registration

Create YAML file (production_finance_revenue_metrics.yaml):

yaml
metric_view:
  name: revenue_metrics
  source_table: main.sales.orders

  joins:
    - table: main.sales.customers
      on: orders.customer_id = customers.customer_id
    - table: main.sales.products
      on: orders.product_id = products.product_id

  measures:
    - name: total_revenue
      expr: SUM(order_amount)
      description: "Total order revenue before discounts"

    - name: total_discounts
      expr: SUM(discount_amount)
      description: "Total discount amount applied"

    - name: net_revenue
      expr: SUM(order_amount) - SUM(discount_amount)
      description: "Revenue after discounts"

    - name: avg_order_value
      expr: SUM(order_amount) / COUNT(DISTINCT order_id)
      description: "Average revenue per order"

    - name: order_count
      expr: COUNT(DISTINCT order_id)
      description: "Total number of orders"

  dimensions:
    - name: customer_region
      expr: customers.region

    - name: customer_segment
      expr: customers.segment

    - name: product_category
      expr: products.category

    - name: order_date_day
      expr: DATE(order_timestamp)

    - name: order_date_month
      expr: DATE_TRUNC('month', order_timestamp)

    - name: order_date_quarter
      expr: DATE_TRUNC('quarter', order_timestamp)

Upload to DBFS and register:

bash
databricks fs cp production_finance_revenue_metrics.yaml \
  dbfs:/metrics/finance/revenue_metrics.yaml

sql
CREATE METRIC VIEW production.finance.revenue_metrics
FROM 'dbfs:/metrics/finance/revenue_metrics.yaml';

Option 2: SQL DDL registration

sql
CREATE METRIC VIEW production.finance.revenue_metrics AS
SELECT
  -- Measures
  SUM(order_amount) AS total_revenue,
  SUM(discount_amount) AS total_discounts,
  SUM(order_amount) - SUM(discount_amount) AS net_revenue,
  SUM(order_amount) / COUNT(DISTINCT order_id) AS avg_order_value,
  COUNT(DISTINCT order_id) AS order_count,

  -- Dimensions
  customers.region AS customer_region,
  customers.segment AS customer_segment,
  products.category AS product_category,
  DATE(order_timestamp) AS order_date_day,
  DATE_TRUNC('month', order_timestamp) AS order_date_month,
  DATE_TRUNC('quarter', order_timestamp) AS order_date_quarter

FROM main.sales.orders
JOIN main.sales.customers
  ON orders.customer_id = customers.customer_id
JOIN main.sales.products
  ON orders.product_id = products.product_id;

Add metadata:

sql
ALTER METRIC VIEW production.finance.revenue_metrics
SET TBLPROPERTIES (
  'description' = 'Core revenue metrics for financial reporting and analysis',
  'owner' = 'finance_team@company.com',
  'certified' = 'true',
  'domain' = 'finance',
  'last_validated' = '2024-12-15'
);

COMMENT ON METRIC VIEW production.finance.revenue_metrics
IS 'Revenue metrics including gross revenue, discounts, net revenue, and order-level aggregations';

Configure access:

sql
-- Grant read access
GRANT SELECT ON METRIC VIEW production.finance.revenue_metrics
TO `finance_analysts`;

GRANT SELECT ON METRIC VIEW production.finance.revenue_metrics
TO `executive_dashboard_service`;

-- Verify grants
SHOW GRANTS ON METRIC VIEW production.finance.revenue_metrics;

Step 4: Validate Metric Accuracy

Compare results between dbt and Databricks to ensure correctness.

Query both systems identically:

dbt MetricFlow query (via the Semantic Layer JDBC interface):

sql
SELECT *
FROM {{
  semantic_layer.query(
    metrics=['total_revenue', 'avg_order_value', 'order_count'],
    group_by=['customer_region'],
    where="{{ TimeDimension('metric_time', 'DAY') }} >= '2024-01-01' AND {{ TimeDimension('metric_time', 'DAY') }} < '2024-02-01'"
  )
}}

Databricks query:

sql
SELECT
  customer_region,
  MEASURE(total_revenue) AS total_revenue,
  MEASURE(avg_order_value) AS avg_order_value,
  MEASURE(order_count) AS order_count
FROM production.finance.revenue_metrics
WHERE order_date_day >= '2024-01-01'
  AND order_date_day < '2024-02-01'
GROUP BY customer_region;

Automated validation with Fenic:

python
import fenic as fc

# Configure session
config = fc.SessionConfig(
    default_model="gpt-4o-mini",
    max_retries=3,
    timeout=30
)
session = fc.FenicSession(config)
fc.set_session(session)

# Load comparison datasets
dbt_results = fc.read_parquet("s3://validation/dbt_metrics_jan2024.parquet")
databricks_results = fc.read_parquet("s3://validation/databricks_metrics_jan2024.parquet")

# Join and calculate variance
comparison = (
    dbt_results
    .join(
        databricks_results,
        on=["customer_region"],
        how="full_outer"
    )
    .with_column(
        "revenue_diff",
        fc.col("dbt_total_revenue") - fc.col("databricks_total_revenue")
    )
    .with_column(
        "revenue_variance_pct",
        fc.abs(fc.col("revenue_diff") / fc.col("dbt_total_revenue") * 100)
    )
    .with_column(
        "aov_diff",
        fc.col("dbt_avg_order_value") - fc.col("databricks_avg_order_value")
    )
)

# Flag significant discrepancies
issues = comparison.filter(
    (fc.col("revenue_variance_pct") > 0.01) |  # >0.01% variance
    (fc.abs(fc.col("aov_diff")) > 0.01)        # >$0.01 difference
)

results = issues.collect()

if len(results) > 0:
    print(f"Found {len(results)} discrepancies requiring investigation")
    for row in results:
        print(f"Region: {row['customer_region']}")
        print(f"  Revenue variance: {row['revenue_variance_pct']:.4f}%")
        print(f"  AOV difference: ${row['aov_diff']:.2f}")
else:
    print("All metrics validated successfully")

Root cause analysis for mismatches:

Common discrepancy causes:

  • Join cardinality differences: Fan-out from incorrect join relationships
  • Time zone conversions: Date/timestamp handling varies between systems
  • Null handling: Aggregation behavior with NULL values
  • Filter timing: WHERE clauses applied at different query stages
  • Rounding differences: Floating point precision in calculations
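
Null handling is the easiest of these to reproduce in isolation: AVG() skips NULLs, while SUM()/COUNT(*) divides by every row, so two systems can disagree even on identical data. A toy illustration:

python
# One order has a NULL amount; the two "average order value" formulas diverge.
order_amounts = [100.0, 200.0, None, 300.0]

non_null = [x for x in order_amounts if x is not None]

avg_skipping_nulls = sum(non_null) / len(non_null)      # behaves like AVG(order_amount)
avg_over_all_rows = sum(non_null) / len(order_amounts)  # behaves like SUM(order_amount) / COUNT(*)

print(f"AVG(order_amount):            {avg_skipping_nulls:.2f}")  # 200.00
print(f"SUM(order_amount) / COUNT(*): {avg_over_all_rows:.2f}")   # 150.00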

Investigate specific issues:

python
# Note: Fenic's semantic.map works on a single column and applies a transformation.
# For complex multi-field analysis, you need to first combine fields into a single text column

investigation = (
    comparison
    .filter(fc.col("customer_region") == "US-West")
    .with_column(
        "analysis_input",
        fc.concat_ws(" | ",
            fc.lit("Metric: "), fc.col("metric_name"),
            fc.lit("Dimension: "), fc.col("customer_region"),
            fc.lit("Date: "), fc.col("order_date"),
            fc.lit("dbt value: "), fc.col("dbt_value"),
            fc.lit("Databricks value: "), fc.col("databricks_value"),
            fc.lit("Variance: "), fc.col("variance_pct")
        )
    )
    .with_column(
        "analysis",
        fc.semantic.map(
            fc.col("analysis_input"),
            """Analyze this metric discrepancy and identify the likely cause. 
            Consider: Join cardinality (duplicates), Time zone handling, 
            Null handling, Filter order. Provide specific hypothesis."""
        )
    )
)

investigation.show()

Step 5: Update BI Tool Connections

Reconfigure dashboards to query Databricks Metric Views.

Tableau migration:

  1. Open Tableau workbook
  2. Data → New Data Source → Databricks
  3. Configure connection:
    • Server: your-workspace.cloud.databricks.com
    • HTTP Path: /sql/1.0/warehouses/your-warehouse-id
    • Authentication: Personal Access Token or OAuth
  4. Update custom SQL:
sql
-- Old dbt query
SELECT *
FROM {{
  semantic_layer.query(
    metrics=['total_revenue', 'avg_order_value'],
    group_by=['customer_region'],
    where="{{ TimeDimension('metric_time', 'DAY') }} >= '2024-01-01'"
  )
}}

-- New Databricks query
SELECT
  customer_region,
  MEASURE(total_revenue) AS total_revenue,
  MEASURE(avg_order_value) AS avg_order_value
FROM production.finance.revenue_metrics
WHERE order_date_day >= DATE('2024-01-01')
GROUP BY customer_region
  1. Republish workbook to Tableau Server

Power BI migration:

  1. Get Data → More → Databricks
  2. Enter connection details
  3. Update Power Query M code:
powerquery
// Old dbt connection
let
    Source = Odbc.Query("dsn=dbt_metrics",
        "SELECT customer_region, total_revenue FROM semantic_layer.metrics")
in
    Source

// New Databricks connection
let
    Source = Databricks.Catalogs("your-workspace.cloud.databricks.com",
        "/sql/1.0/warehouses/your-warehouse-id",
        [Catalog = "production", Schema = "finance"]),
    revenue_metrics = Source{[Name="revenue_metrics"]}[Data],
    custom_query = Databricks.Query("your-workspace.cloud.databricks.com",
        "/sql/1.0/warehouses/your-warehouse-id",
        "SELECT customer_region, MEASURE(total_revenue) AS total_revenue
         FROM production.finance.revenue_metrics
         GROUP BY customer_region")
in
    custom_query
  1. Refresh data model
  2. Update scheduled refreshes

Mode Analytics migration:

Replace native dbt integration with custom SQL:

sql
-- Mode report query
WITH monthly_metrics AS (
  SELECT
    order_date_month AS month,
    customer_region,
    MEASURE(total_revenue) AS revenue,
    MEASURE(order_count) AS orders,
    MEASURE(avg_order_value) AS aov
  FROM production.finance.revenue_metrics
  WHERE order_date_day >= '2024-01-01'
  GROUP BY order_date_month, customer_region
)
SELECT
  month,
  customer_region,
  revenue,
  orders,
  aov,
  revenue - LAG(revenue, 1) OVER (PARTITION BY customer_region ORDER BY month) AS revenue_growth
FROM monthly_metrics
ORDER BY month DESC, revenue DESC;

Update report parameters and filters to match new schema.

Custom applications using GraphQL:

Replace GraphQL queries with SQL via Databricks connectors.

Before (dbt GraphQL):

python
import requests

response = requests.post(
    'https://semantic-layer.cloud.getdbt.com/api/graphql',
    json={
        'query': '''
            query {
              metrics(
                metrics: ["total_revenue", "order_count"]
                dimensions: ["customer_region"]
                where: "order_date >= '2024-01-01'"
              ) {
                customer_region
                total_revenue
                order_count
              }
            }
        '''
    },
    headers={'Authorization': f'Bearer {token}'}
)
data = response.json()

After (Databricks SQL connector):

python
from databricks import sql
import os

def query_metrics(metrics, dimensions, filters):
    """Query Databricks Metric Views via SQL"""

    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_TOKEN")
    )

    # Build MEASURE clauses
    measure_clauses = [f"MEASURE({m}) AS {m}" for m in metrics]
    measures_sql = ", ".join(measure_clauses)

    # Build dimensions
    dims_sql = ", ".join(dimensions) if dimensions else ""

    # Build WHERE clause
    where_sql = " AND ".join([f"{k} {v}" for k, v in filters.items()])

    # Build GROUP BY
    group_by_sql = f"GROUP BY {dims_sql}" if dims_sql else ""

    query = f"""
        SELECT
            {dims_sql + ',' if dims_sql else ''}
            {measures_sql}
        FROM production.finance.revenue_metrics
        WHERE {where_sql}
        {group_by_sql}
    """

    cursor = connection.cursor()
    cursor.execute(query)

    columns = [desc[0] for desc in cursor.description]
    results = [dict(zip(columns, row)) for row in cursor.fetchall()]

    cursor.close()
    connection.close()

    return results

# Usage
data = query_metrics(
    metrics=["total_revenue", "order_count"],
    dimensions=["customer_region"],
    filters={"order_date_day >=": "'2024-01-01'"}
)

For REST API requirements, wrap in FastAPI:

python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Dict
import os

from databricks import sql

app = FastAPI()

class MetricRequest(BaseModel):
    metrics: List[str]
    dimensions: List[str] = []
    filters: Dict[str, str] = {}

@app.post("/api/metrics/query")
async def query_metrics_endpoint(request: MetricRequest):
    try:
        results = query_metrics(
            metrics=request.metrics,
            dimensions=request.dimensions,
            filters=request.filters
        )
        return {"data": results, "row_count": len(results)}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/metrics/list")
async def list_metrics():
    """List available metrics from Unity Catalog"""
    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_TOKEN")
    )

    cursor = connection.cursor()
    cursor.execute("""
        SELECT column_name, comment
        FROM system.information_schema.columns
        WHERE table_catalog = 'production'
          AND table_schema = 'finance'
          AND table_name = 'revenue_metrics'
    """)

    metrics = [{"name": row[0], "description": row[1]} for row in cursor.fetchall()]
    cursor.close()
    connection.close()

    return {"metrics": metrics}

Step 6: Implement Version Control

Establish git-based workflows for metric definitions.

Export metric views to YAML:

sql
-- Export metric view definition
DESCRIBE METRIC VIEW EXTENDED production.finance.revenue_metrics;

Store exported YAML in repository:

metrics/
├── README.md
├── finance/
│   ├── revenue_metrics.yaml
│   ├── cost_metrics.yaml
│   └── margin_metrics.yaml
├── marketing/
│   ├── campaign_metrics.yaml
│   └── attribution_metrics.yaml
└── sales/
    ├── pipeline_metrics.yaml
    └── quota_metrics.yaml

CI/CD pipeline for deployments:

yaml
# .github/workflows/deploy_metrics.yml
name: Deploy Metric Views

on:
  push:
    branches: [main]
    paths: ['metrics/**/*.yaml']

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          pip install databricks-cli

      - name: Configure Databricks
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          databricks configure --token <<EOF
          $DATABRICKS_HOST
          $DATABRICKS_TOKEN
          EOF

      - name: Upload YAMLs to DBFS
        run: |
          databricks fs cp --recursive --overwrite \
            metrics/ dbfs:/metrics/

      - name: Deploy metric views
        run: |
          shopt -s globstar
          for yaml_file in metrics/**/*.yaml; do
            # Path layout is metrics/<schema>/<metric>.yaml; catalog is fixed to production
            schema=$(echo $yaml_file | cut -d'/' -f2)
            metric_name=$(basename $yaml_file .yaml)

            databricks sql execute \
              --statement "CREATE OR REPLACE METRIC VIEW production.${schema}.${metric_name} FROM 'dbfs:/${yaml_file}'"
          done

      - name: Validate deployments
        run: |
          databricks sql execute \
            --statement "SELECT COUNT(*) FROM system.information_schema.tables WHERE table_type = 'METRIC_VIEW'"

Change tracking:

sql
-- Query object modification history
SELECT
  table_catalog,
  table_schema,
  table_name,
  table_owner,
  created_at,
  updated_at,
  table_properties
FROM system.information_schema.tables
WHERE table_type = 'METRIC_VIEW'
  AND table_catalog = 'production'
ORDER BY updated_at DESC;

Combine with git commit history for complete audit trail.


Validation with Fenic

Automated Metric Comparison

Fenic semantic operators enable sophisticated metric validation.

python
import fenic as fc
from pydantic import BaseModel
from typing import Literal

class ValidationResult(BaseModel):
    metric_name: str
    dimension_values: str
    status: Literal["pass", "fail", "warning"]
    variance_pct: float
    details: str

# ...

# First, create a combined text column with all the context
validation = (
    comparison
    .filter(fc.col("variance_pct") > 0.01)
    .with_column(
        "discrepancy_context",
        fc.concat_ws("\n",
            fc.concat(fc.lit("Metric: "), fc.col("metric_name")),
            fc.concat(fc.lit("Dimension: "), fc.col("customer_region")),
            fc.concat(fc.lit("Date: "), fc.col("order_date")),
            fc.concat(fc.lit("dbt value: "), fc.col("dbt_value").cast("string")),
            fc.concat(fc.lit("Databricks value: "), fc.col("databricks_value").cast("string")),
            fc.concat(fc.lit("Variance: "), fc.col("variance_pct").cast("string"), fc.lit("%"))
        )
    )
    .with_column(
        "root_cause_analysis",
        fc.semantic.extract(
            fc.col("discrepancy_context"),
            ValidationResult
        )
    )
)

# Generate validation report
report = validation.collect()

print(f"Validation Results: {len(report)} discrepancies found")
for item in report:
    result = item['root_cause_analysis']
    print(f"\n{result['status'].upper()}: {result['metric_name']}")
    print(f"  {result['details']}")
    print(f"  Variance: {result['variance_pct']:.2f}%")

Data Quality Monitoring

Establish ongoing quality checks with Fenic semantic operators.

python
import fenic as fc
from pydantic import BaseModel
from typing import Literal
from datetime import datetime, timedelta

# Define anomaly detection schema
class AnomalyCheck(BaseModel):
    check_type: Literal["range", "trend", "completeness", "consistency"]
    is_anomalous: bool
    severity: Literal["critical", "warning", "info"]
    explanation: str
    recommended_action: str

# Note: Since Fenic doesn't support SQL reading, you need to export data first
# Assuming data has been exported to parquet
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

metrics_df = (fc.read_parquet("s3://temp/recent_metrics.parquet")
    .filter(fc.col("order_date_day") >= yesterday)
)

# Create a combined context column for anomaly detection
metrics_with_context = metrics_df.with_column(
    "anomaly_context",
    fc.concat_ws("\n",
        fc.concat(fc.lit("Date: "), fc.col("order_date_day").cast("string")),
        fc.concat(fc.lit("Region: "), fc.col("customer_region")),
        fc.concat(fc.lit("Revenue: $"), fc.col("revenue").cast("string")),
        fc.concat(fc.lit("Orders: "), fc.col("orders").cast("string")),
        fc.concat(fc.lit("AOV: $"), fc.col("aov").cast("string")),
        fc.lit("Expected ranges:"),
        fc.lit("- Revenue: $50K-$500K per region per day"),
        fc.lit("- Orders: 100-5000 per region per day"),
        fc.lit("- AOV: $20-$200")
    )
)

# Semantic anomaly detection
anomalies = (
    metrics_with_context
    .with_column(
        "anomaly_analysis",
        fc.semantic.extract(
            fc.col("anomaly_context"),
            AnomalyCheck
        )
    )
    # Note: Accessing nested fields may require different syntax in actual Fenic
    # This is a conceptual representation
)

# Generate alerts
issues = anomalies.collect()

if len(issues) > 0:
    for issue in issues:
        analysis = issue['anomaly_analysis']
        print(f"ALERT [{analysis['severity'].upper()}]: {analysis['check_type']}")
        print(f"  Date: {issue['order_date_day']}, Region: {issue['customer_region']}")
        print(f"  {analysis['explanation']}")
        print(f"  Action: {analysis['recommended_action']}")

Regression Testing

Implement automated tests for metric correctness.

python
import fenic as fc
from pydantic import BaseModel
from typing import Literal

class TestResult(BaseModel):
    test_name: str
    passed: bool
    actual_value: float
    expected_value: float
    tolerance_pct: float
    message: str

def test_metric_values():
    """Test metric calculations against known values"""

    # Note: Fenic doesn't support SQL reading. Export data from Databricks to Parquet first.
    # Example: Use databricks-sql-connector to export, then read with Fenic
    
    # Test 1: Revenue for Q1 2024 by region
    # Assuming data has been exported to parquet
    q1_revenue = fc.read_parquet("s3://validation/q1_2024_revenue_by_region.parquet")

    # Expected values (from dbt baseline)
    expected = {
        'US-West': 5_000_000,
        'US-East': 4_000_000,
        'EU': 3_000_000
    }

    results = []
    for row in q1_revenue.collect():
        region = row['customer_region']
        actual = row['revenue']
        expect = expected.get(region, 0)
        variance = abs((actual - expect) / expect * 100)

        results.append(TestResult(
            test_name=f"q1_2024_revenue_{region}",
            passed=variance < 0.1,  # <0.1% tolerance
            actual_value=actual,
            expected_value=expect,
            tolerance_pct=variance,
            message=f"Expected ${expect:,.0f}, got ${actual:,.0f}"
        ))

    return results

def test_metric_relationships():
    """Test relationships between metrics"""

    # Note: Fenic doesn't support SQL reading. Export data from Databricks to Parquet first.
    
    # Test: Revenue = Orders * AOV (within rounding tolerance)
    # Assuming data has been exported with the diff calculation done in Databricks
    relationship_test = fc.read_parquet("s3://validation/metric_relationships_test.parquet")

    max_diff = relationship_test.select(
        fc.max(fc.abs(fc.col("diff")))
    ).collect()[0][0]

    return [TestResult(
        test_name="revenue_equals_orders_times_aov",
        passed=max_diff < 0.01,  # $0.01 tolerance
        actual_value=max_diff,
        expected_value=0.0,
        tolerance_pct=0,
        message=f"Max difference between revenue and orders*aov: ${max_diff:.2f}"
    )]

# Run all tests
all_tests = test_metric_values() + test_metric_relationships()

failed = [t for t in all_tests if not t.passed]

if failed:
    print(f"FAILED: {len(failed)}/{len(all_tests)} tests")
    for test in failed:
        print(f"  {test.test_name}: {test.message}")
else:
    print(f"PASSED: All {len(all_tests)} tests")

Common Migration Challenges

Challenge: Conversion Metrics Without Native Support

Problem: dbt's conversion metric type calculates cohort-based conversion rates over time windows. Databricks has no equivalent native construct.

Solution: Implement the conversion logic manually with CTEs and conditional aggregation over the time window.

sql
CREATE METRIC VIEW production.growth.conversion_metrics AS
WITH user_events AS (
  -- Get first occurrence of each event type per user
  SELECT
    user_id,
    MIN(CASE WHEN event_type = 'signup' THEN event_timestamp END) AS signup_time,
    MIN(CASE WHEN event_type = 'purchase' THEN event_timestamp END) AS purchase_time,
    MIN(CASE WHEN event_type = 'subscription' THEN event_timestamp END) AS subscription_time
  FROM main.events.user_events
  GROUP BY user_id
),
conversion_windows AS (
  -- Calculate time differences and conversion flags
  SELECT
    user_id,
    signup_time,
    purchase_time,
    subscription_time,
    CASE
      WHEN purchase_time IS NOT NULL
        AND purchase_time <= signup_time + INTERVAL 7 DAYS
      THEN 1 ELSE 0
    END AS converted_7d,
    CASE
      WHEN purchase_time IS NOT NULL
        AND purchase_time <= signup_time + INTERVAL 30 DAYS
      THEN 1 ELSE 0
    END AS converted_30d,
    CASE
      WHEN subscription_time IS NOT NULL
        AND subscription_time <= signup_time + INTERVAL 30 DAYS
      THEN 1 ELSE 0
    END AS subscribed_30d
  FROM user_events
  WHERE signup_time IS NOT NULL
)
SELECT
  DATE_TRUNC('month', signup_time) AS signup_month,
  COUNT(*) AS signups,
  SUM(converted_7d) AS conversions_7d,
  SUM(converted_30d) AS conversions_30d,
  SUM(subscribed_30d) AS subscriptions_30d,
  SUM(converted_7d) * 100.0 / COUNT(*) AS conversion_rate_7d,
  SUM(converted_30d) * 100.0 / COUNT(*) AS conversion_rate_30d,
  SUM(subscribed_30d) * 100.0 / COUNT(*) AS subscription_rate_30d
FROM conversion_windows
GROUP BY DATE_TRUNC('month', signup_time);

Validate conversion logic correctness:

python
import fenic as fc

# Load dbt conversion results
dbt_conversions = fc.read_parquet("s3://validation/dbt_conversion_metrics.parquet")

# Note: Fenic doesn't support SQL reading. Export data from Databricks first.
# Load Databricks conversion results from exported parquet
db_conversions = fc.read_parquet("s3://validation/databricks_conversion_metrics.parquet")

# Compare with tolerance
validation = (
    dbt_conversions
    .join(db_conversions, on="signup_month")
    .with_column(
        "rate_diff_7d",
        fc.abs(fc.col("dbt_conversion_rate_7d") - fc.col("db_conversion_rate_7d"))
    )
    .with_column(
        "rate_diff_30d",
        fc.abs(fc.col("dbt_conversion_rate_30d") - fc.col("db_conversion_rate_30d"))
    )
    .filter(
        (fc.col("rate_diff_7d") > 0.1) | (fc.col("rate_diff_30d") > 0.1)
    )
)

issues = validation.collect()

if len(issues) == 0:
    print("Conversion metrics validated successfully")
else:
    print(f"Found {len(issues)} conversion metric discrepancies")
    for issue in issues:
        print(f"Month: {issue['signup_month']}, "
              f"7d diff: {issue['rate_diff_7d']:.2f}%, "
              f"30d diff: {issue['rate_diff_30d']:.2f}%")

Challenge: Time Grain Proliferation

Problem: dbt auto-generates time grains from a single time dimension. Databricks requires an explicit dimension per grain.

Solution: Define commonly used grains as dimensions and compute ad-hoc grains with SQL date expressions in queries.

yaml
# Metric view YAML with multiple time grains
dimensions:
  - name: order_date_day
    expr: DATE(order_timestamp)
    description: "Daily grain"

  - name: order_date_week
    expr: DATE_TRUNC('week', order_timestamp)
    description: "Weekly grain (week starts Monday)"

  - name: order_date_month
    expr: DATE_TRUNC('month', order_timestamp)
    description: "Monthly grain"

  - name: order_date_quarter
    expr: DATE_TRUNC('quarter', order_timestamp)
    description: "Quarterly grain"

  - name: order_date_year
    expr: DATE_TRUNC('year', order_timestamp)
    description: "Yearly grain"

  - name: fiscal_year
    expr: |
      CASE
        WHEN MONTH(order_timestamp) >= 4
        THEN YEAR(order_timestamp)
        ELSE YEAR(order_timestamp) - 1
      END
    description: "Fiscal year (April start)"

  - name: fiscal_quarter
    expr: |
      CASE
        WHEN MONTH(order_timestamp) BETWEEN 4 AND 6 THEN 'Q1'
        WHEN MONTH(order_timestamp) BETWEEN 7 AND 9 THEN 'Q2'
        WHEN MONTH(order_timestamp) BETWEEN 10 AND 12 THEN 'Q3'
        ELSE 'Q4'
      END
    description: "Fiscal quarter"

For custom time periods, compute in queries:

sql
-- Custom 4-week rolling windows
SELECT
  DATE_TRUNC('week', order_date_day) AS week_start,
  customer_region,
  SUM(MEASURE(total_revenue)) OVER (
    PARTITION BY customer_region
    ORDER BY DATE_TRUNC('week', order_date_day)
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS rolling_4wk_revenue
FROM production.finance.revenue_metrics
WHERE order_date_day >= CURRENT_DATE - 90
GROUP BY DATE_TRUNC('week', order_date_day), customer_region;

Challenge: Testing Framework Gap

Problem: dbt's testing framework doesn't carry over, so an alternative validation approach is needed.

Solution: Implement tests with Fenic semantic operators and Databricks SQL.

python
import fenic as fc
from pydantic import BaseModel
from typing import Literal, List

class MetricTest(BaseModel):
    test_name: str
    test_type: Literal["not_null", "range", "relationship", "trend", "uniqueness"]
    passed: bool
    severity: Literal["critical", "warning", "info"]
    details: str

def test_no_null_metrics():
    """Test: No NULL values in key metrics"""

    # Note: Fenic doesn't support SQL reading. Export aggregated null check from Databricks first.
    null_check_df = fc.read_parquet("s3://validation/null_check_results.parquet")
    null_check = null_check_df.collect()[0]

    has_nulls = (
        null_check["null_revenue"] > 0 or
        null_check["null_orders"] > 0 or
        null_check["null_aov"] > 0
    )

    return MetricTest(
        test_name="no_null_metrics",
        test_type="not_null",
        passed=not has_nulls,
        severity="critical" if has_nulls else "info",
        details=f"Null counts - Revenue: {null_check['null_revenue']}, Orders: {null_check['null_orders']}, AOV: {null_check['null_aov']}"
    )

def test_metric_relationships():
    """Test: Revenue = Orders * AOV within tolerance"""

    # Note: Fenic doesn't support SQL reading. Export data from Databricks first.
    relationship_check = fc.read_parquet("s3://validation/relationship_check.parquet")

    max_diff = relationship_check.select(
        fc.max(fc.col("diff"))
    ).collect()[0][0]

    passed = max_diff < 0.01  # $0.01 tolerance

    return MetricTest(
        test_name="revenue_aov_relationship",
        test_type="relationship",
        passed=passed,
        severity="critical" if not passed else "info",
        details=f"Max difference: ${max_diff:.2f}"
    )

def test_metric_ranges():
    """Test: Metrics within expected ranges"""

    # Note: Fenic doesn't support SQL reading. Export data from Databricks first.
    range_check = fc.read_parquet("s3://validation/range_check_last_7_days.parquet")

    # Check for outliers
    outliers = (
        range_check
        .filter(
            (fc.col("revenue") < 0) |
            (fc.col("revenue") > 10_000_000) |
            (fc.col("orders") < 0) |
            (fc.col("orders") > 50_000)
        )
        .count()
    )

    return MetricTest(
        test_name="metric_ranges",
        test_type="range",
        passed=outliers == 0,
        severity="warning" if outliers > 0 else "info",
        details=f"Found {outliers} out-of-range values"
    )

def test_metric_trends():
    """Test: No extreme day-over-day changes"""

    # Note: Fenic doesn't support SQL reading. Export data from Databricks first.
    # The window function calculation should be done in Databricks before export
    trend_check = fc.read_parquet("s3://validation/trend_check_with_pct_change.parquet")

    extreme_changes = (
        trend_check
        .filter(fc.col("pct_change") > 50)  # >50% day-over-day change
        .count()
    )

    return MetricTest(
        test_name="trend_stability",
        test_type="trend",
        passed=extreme_changes == 0,
        severity="warning" if extreme_changes > 0 else "info",
        details=f"Found {extreme_changes} days with >50% change"
    )

# Run all tests
def run_all_tests():
    tests = [
        test_no_null_metrics(),
        test_metric_relationships(),
        test_metric_ranges(),
        test_metric_trends()
    ]

    failed = [t for t in tests if not t.passed]
    critical = [t for t in failed if t.severity == "critical"]

    print(f"Test Results: {len(tests) - len(failed)}/{len(tests)} passed")

    if critical:
        print(f"\nCRITICAL FAILURES ({len(critical)}):")
        for test in critical:
            print(f"  {test.test_name}: {test.details}")

    if failed and not critical:
        print(f"\nWARNINGS ({len(failed)}):")
        for test in failed:
            print(f"  {test.test_name}: {test.details}")

    return len(critical) == 0

# Execute
if __name__ == "__main__":
    import sys
    success = run_all_tests()
    sys.exit(0 if success else 1)

Integrate into CI/CD:

yaml
# .github/workflows/test_metrics.yml
name: Test Databricks Metric Views

on:
  pull_request:
    paths: ['metrics/**']
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install fenic databricks-sql-connector pydantic

      - name: Run metric tests
        env:
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
          DATABRICKS_SERVER_HOSTNAME: ${{ secrets.DATABRICKS_SERVER_HOSTNAME }}
          DATABRICKS_HTTP_PATH: ${{ secrets.DATABRICKS_HTTP_PATH }}
        run: |
          python tests/test_metrics.py

Post-Migration Optimization

Performance Tuning

Optimize metric query performance in Databricks.

Enable Photon:

sql
-- Check the warehouse runtime version (Photon itself is toggled per SQL warehouse, not via SQL)
SELECT current_version();

-- Photon provides 2-3x performance for analytical queries
-- Enable in SQL warehouse settings: Photon acceleration = ON

Optimize underlying Delta tables:

sql
-- Z-order frequently filtered columns
OPTIMIZE main.sales.orders ZORDER BY (order_date, customer_id);

-- Analyze tables for query planning
ANALYZE TABLE main.sales.orders COMPUTE STATISTICS FOR ALL COLUMNS;

-- Vacuum old files
VACUUM main.sales.orders RETAIN 168 HOURS;

Benchmark query performance:

python
import os
import time

from databricks import sql

queries = [
    ("daily_by_region", """
        SELECT
            order_date_day,
            customer_region,
            MEASURE(total_revenue) AS revenue
        FROM production.finance.revenue_metrics
        WHERE order_date_day >= CURRENT_DATE - 30
        GROUP BY order_date_day, customer_region
    """),
    ("monthly_totals", """
        SELECT
            order_date_month,
            MEASURE(total_revenue) AS revenue,
            MEASURE(order_count) AS orders
        FROM production.finance.revenue_metrics
        WHERE order_date_day >= CURRENT_DATE - 365
        GROUP BY order_date_month
    """)
]

# Fenic doesn't query Databricks SQL warehouses directly, so benchmark with the
# Databricks SQL connector and export the results if further analysis is needed.
results = []
for query_name, query_sql in queries:
    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_TOKEN")
    )

    start = time.time()
    cursor = connection.cursor()
    cursor.execute(query_sql)
    rows = cursor.fetchall()
    duration = time.time() - start

    results.append({
        "query": query_name,
        "duration_seconds": duration,
        "rows": len(rows)
    })

    cursor.close()
    connection.close()

    print(f"{query_name}: {duration:.2f}s ({len(rows)} rows)")

# Persist benchmark results (for example to s3://benchmarks/metric_queries/) to track trends over time.

Usage Monitoring

Track metric view adoption and query patterns.

sql
-- Query history for metric views
CREATE OR REPLACE TEMP VIEW metric_usage AS
SELECT
  user_name,
  DATE(start_time) AS query_date,
  HOUR(start_time) AS query_hour,
  statement_text,
  execution_duration_ms / 1000.0 AS execution_seconds,
  rows_produced,
  CASE
    WHEN statement_text LIKE '%production.finance.revenue_metrics%' THEN 'revenue_metrics'
    WHEN statement_text LIKE '%production.marketing.campaign_metrics%' THEN 'campaign_metrics'
    ELSE 'other'
  END AS metric_view
FROM system.query.history
WHERE statement_text LIKE '%MEASURE(%'
  AND start_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
  AND statement_type = 'SELECT';

-- Usage summary
SELECT
  metric_view,
  COUNT(*) AS query_count,
  COUNT(DISTINCT user_name) AS unique_users,
  AVG(execution_seconds) AS avg_duration,
  MAX(execution_seconds) AS max_duration,
  SUM(rows_produced) AS total_rows
FROM metric_usage
GROUP BY metric_view
ORDER BY query_count DESC;

Analyze adoption with Fenic:

python
import fenic as fc

# Note: Fenic doesn't support SQL reading. Export query logs from Databricks first.
# Load query logs that have been pre-exported to parquet
query_logs = fc.read_parquet("s3://validation/query_logs_last_30_days.parquet")

adoption = (
    query_logs
    .group_by("user_name")
    .agg(
        fc.sum("query_count").alias("total_queries"),
        fc.count_distinct("query_date").alias("active_days")
    )
    .with_column(
        "user_activity_summary",
        fc.concat_ws(", ",
            fc.concat(fc.lit("Queries: "), fc.col("total_queries").cast("string")),
            fc.concat(fc.lit("Active days: "), fc.col("active_days").cast("string"))
        )
    )
    .with_column(
        "user_segment",
        fc.semantic.classify(
            fc.col("user_activity_summary"),
            categories=["power_user", "regular_user", "occasional_user", "new_user"]
        )
    )
)

# Segment summary
segments = (
    adoption
    .group_by("user_segment")
    .agg(fc.count("*").alias("user_count"))
    .order_by("user_count", ascending=False)
)

segments.show()

Conclusion

Migrating from dbt Semantic Layer to Databricks Unity Catalog Metric Views consolidates metric logic within a lakehouse-native platform. The migration requires systematic translation of YAML definitions, careful validation of calculation correctness, and reconfiguration of BI tool connections.

Critical success factors:

Systematic translation - Follow consistent patterns for simple, ratio, derived, and conversion metrics. Document edge cases and custom logic.

Rigorous validation - Compare results between dbt and Databricks using automated frameworks. Investigate and resolve all discrepancies before production cutover.

Semantic data processing - Leverage Fenic's semantic operators for validation, quality monitoring, and documentation generation throughout migration.

Incremental deployment - Migrate metrics in phases by domain or complexity tier. Maintain parallel systems temporarily for rollback capability.

The investment yields operational efficiency through simplified architecture, unified governance across analytics and ML, and elimination of external infrastructure dependencies. Organizations committed to Databricks gain consistency and performance by consolidating metrics into Unity Catalog.
