Migrating from dbt MetricFlow to Databricks Unity Catalog Metric Views requires translating YAML semantic models into Unity Catalog objects and adapting query patterns from external APIs to native SQL with MEASURE() clauses. This guide covers translation patterns, validation techniques, and implementation strategies. Typedef's Fenic framework provides semantic data processing capabilities for metric validation and quality assurance throughout the migration.
Migration Architecture
dbt MetricFlow vs Databricks Metric Views: Architectural Differences
The migration involves fundamental shifts in metric definition, storage, and access patterns.
Definition storage:
- dbt MetricFlow: YAML files in git repositories
- Databricks Metric Views: Unity Catalog objects defined via SQL DDL or YAML import
Execution model:
- dbt MetricFlow: Query generation engine compiling requests into SQL for external warehouses
- Databricks Metric Views: Native Spark SQL execution with Photon optimization
Query interface:
- dbt MetricFlow: JDBC connections via Arrow Flight, GraphQL API
- Databricks Metric Views: MEASURE() clause in SQL queries, native Databricks integrations (see the query sketch after this list)
Access control:
- dbt MetricFlow: Warehouse-level permissions
- Databricks Metric Views: Metric view-level RBAC through Unity Catalog
Join semantics:
- dbt MetricFlow: Semantic graph with automatic multi-hop joins
- Databricks Metric Views: Explicit join specifications in YAML or DDL
Time intelligence:
- dbt MetricFlow: Automatic time grains (`metric__day`, `metric__month`, `metric__year`)
- Databricks Metric Views: Explicit dimension definitions per time granularity
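In practice, the MEASURE() interface looks like the sketch below — a minimal query against a metric view of the kind created later in this guide (view and column names are illustrative):

```sql
-- Measures are wrapped in MEASURE() and aggregated at whatever grain the query groups by
SELECT
  customer_region,
  MEASURE(total_revenue) AS total_revenue
FROM production.finance.revenue_metrics
GROUP BY customer_region;
```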
What Transfers Directly
These elements require minimal translation:
- Simple aggregation metrics (SUM, COUNT, AVG, MIN, MAX)
- Basic ratio metrics with numerator/denominator logic
- Dimension definitions and column expressions
- Star schema relationship patterns
- Metric descriptions and business documentation
What Requires Manual Translation
These elements need significant rework:
- Conversion metrics (cohort-based, time-windowed) - No native equivalent in Databricks
- Cumulative metrics - Require window function implementations (see the sketch after this list)
- Automatic time grains - Must define separate dimensions for each grain
- GraphQL API queries - Convert to SQL with MEASURE() clauses
- dbt-specific metric types - Translate to Spark SQL expressions
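For cumulative metrics, one workable pattern — shown here as a sketch, not an official equivalent — is to keep a plain aggregation in the metric view and apply the running total at query time with a window function, mirroring the rolling-window example later in this guide (names are illustrative):

```sql
-- Hedged sketch: running-total revenue computed over a metric view
SELECT
  order_date_month,
  SUM(MEASURE(total_revenue)) OVER (
    ORDER BY order_date_month
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_revenue
FROM production.finance.revenue_metrics
GROUP BY order_date_month;
```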
What Does Not Transfer
These capabilities have no direct equivalent:
- GraphQL API access for programmatic queries
- dbt Cloud infrastructure and service layer
- JDBC Arrow Flight connections
- dbt test framework for metrics
- Automatic semantic graph join resolution
Pre-Migration Assessment
Metric Inventory
Catalog all metrics in your dbt project to assess migration scope.
Locate semantic models:
```bash
find models/ -name "*.yml" -exec grep -l "semantic_models:" {} \;
```
Count metrics by type:
```bash
grep -r "type: simple" models/ | wc -l
grep -r "type: ratio" models/ | wc -l
grep -r "type: derived" models/ | wc -l
grep -r "type: conversion" models/ | wc -l
```
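For a more structured inventory, a short script can parse the YAML directly. This is a minimal sketch assuming PyYAML and a standard dbt project layout; adjust paths and keys to your project:

```python
# Count dbt metrics by type across all model YAML files (sketch, not exhaustive)
import glob
from collections import Counter

import yaml

counts = Counter()
for path in glob.glob("models/**/*.yml", recursive=True):
    with open(path) as f:
        doc = yaml.safe_load(f) or {}
    for metric in doc.get("metrics", []):
        counts[metric.get("type", "unknown")] += 1

for metric_type, n in counts.most_common():
    print(f"{metric_type}: {n}")
```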
Categorize by translation difficulty:
| Category | Characteristics | Examples |
|---|---|---|
| Simple | Direct aggregations, basic filters | SUM(revenue), COUNT(orders) |
| Moderate | Ratios, derived metrics, standard joins | avg_order_value, profit_margin |
| Complex | Conversion funnels, cumulative calculations, custom expressions | signup_to_purchase_rate, lifetime_value |
Unity Catalog Readiness
Verify Databricks environment prerequisites.
Check Unity Catalog status:
```sql
SHOW CATALOGS;
```
If results show only hive_metastore, Unity Catalog migration is required before proceeding.
Design namespace structure:
```
<catalog>.<schema>.<metric_view>
```
Examples:
```
production.finance.revenue_metrics
production.marketing.campaign_metrics
production.sales.pipeline_metrics
```
Align namespaces with organizational domains for discoverability.
Plan access control mapping:
```sql
-- Map dbt consumer access to Unity Catalog permissions
GRANT SELECT ON METRIC VIEW production.finance.revenue_metrics TO `finance_analysts`;
GRANT SELECT ON METRIC VIEW production.marketing.campaign_metrics TO `marketing_team`;
```
Document which roles, users, and teams need access to which metric views.
BI Tool Integration Assessment
Identify how current tools connect to dbt MetricFlow:
- Tools with native integrations (Hex, Mode) require reconfiguration
- JDBC-connected tools (Tableau, Power BI) need connector updates
- Custom applications using GraphQL API need SQL rewrites
Translation Patterns
Simple Metrics
Direct aggregations translate with minimal changes.
dbt definition:
```yaml
semantic_models:
  - name: orders
    measures:
      - name: order_total
        agg: sum
      - name: order_count
        agg: count

metrics:
  - name: total_revenue
    type: simple
    measure: order_total
```
Databricks YAML:
```yaml
metric_view:
  name: order_metrics
  source_table: main.sales.orders
  measures:
    - name: total_revenue
      expr: SUM(order_amount)
      description: "Total order revenue"
    - name: order_count
      expr: COUNT(DISTINCT order_id)
      description: "Total number of orders"
```
Databricks SQL DDL:
```sql
CREATE METRIC VIEW production.sales.order_metrics AS
SELECT
  SUM(order_amount) AS total_revenue,
  COUNT(DISTINCT order_id) AS order_count,
  DATE(order_timestamp) AS order_date,
  customer_id
FROM main.sales.orders;
```
Ratio Metrics
Convert numerator/denominator specifications to explicit expressions.
dbt definition:
```yaml
metrics:
  - name: avg_order_value
    type: ratio
    numerator: total_revenue
    denominator: order_count
```
Databricks YAML:
```yaml
measures:
  - name: avg_order_value
    expr: SUM(order_amount) / COUNT(DISTINCT order_id)
    description: "Average revenue per order"
```
Databricks SQL DDL:
```sql
CREATE METRIC VIEW production.sales.order_metrics AS
SELECT
  SUM(order_amount) / COUNT(DISTINCT order_id) AS avg_order_value,
  customer_region,
  DATE(order_timestamp) AS order_date
FROM main.sales.orders
JOIN main.sales.customers
  ON orders.customer_id = customers.customer_id;
```
The key difference: dbt handles the numerator and denominator separately to guarantee correct aggregation, while Databricks requires an explicit expression that maintains proper aggregation semantics.
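Because the Databricks expression is built from aggregate functions, querying it with MEASURE() re-derives the numerator and denominator at whatever grain the query groups by rather than averaging pre-computed ratios — a minimal illustration using the names from the DDL above:

```sql
-- The ratio is recomputed from SUM and COUNT at the region grain,
-- not averaged across finer-grained rows
SELECT
  customer_region,
  MEASURE(avg_order_value) AS avg_order_value
FROM production.sales.order_metrics
GROUP BY customer_region;
```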
Derived Metrics
Combine multiple metrics into formulas.
dbt definition:
```yaml
metrics:
  - name: profit
    type: derived
    expr: total_revenue - total_cost
  - name: profit_margin
    type: derived
    expr: profit / total_revenue
```
Databricks YAML:
```yaml
measures:
  - name: profit
    expr: SUM(order_amount) - SUM(cost_amount)
    description: "Revenue minus costs"
  - name: profit_margin
    expr: (SUM(order_amount) - SUM(cost_amount)) / SUM(order_amount)
    description: "Profit as percentage of revenue"
```
Databricks SQL DDL:
```sql
CREATE METRIC VIEW production.finance.profit_metrics AS
SELECT
  SUM(order_amount) - SUM(cost_amount) AS profit,
  (SUM(order_amount) - SUM(cost_amount)) / SUM(order_amount) AS profit_margin,
  product_category,
  DATE(order_timestamp) AS order_date
FROM main.sales.orders
JOIN main.sales.products
  ON orders.product_id = products.product_id;
```
Conversion Metrics (Manual Implementation Required)
dbt's conversion metric type has no Databricks equivalent. Implement with window functions.
dbt definition:
```yaml
metrics:
  - name: signup_to_purchase_rate
    type: conversion
    conversion_type_params:
      entity: user_id
      calculation: conversion_rate
      window: 30 days
      base_measure: signups
      conversion_measure: purchases
```
Databricks SQL implementation:
```sql
CREATE METRIC VIEW production.growth.conversion_metrics AS
WITH user_events AS (
  SELECT
    user_id,
    MIN(CASE WHEN event_type = 'signup' THEN event_timestamp END) AS signup_time,
    MIN(CASE WHEN event_type = 'purchase' THEN event_timestamp END) AS purchase_time
  FROM main.events.user_events
  GROUP BY user_id
),
conversions AS (
  SELECT
    user_id,
    signup_time,
    purchase_time,
    CASE
      WHEN purchase_time IS NOT NULL
       AND purchase_time <= signup_time + INTERVAL 30 DAYS THEN 1
      ELSE 0
    END AS converted
  FROM user_events
  WHERE signup_time IS NOT NULL
)
SELECT
  DATE_TRUNC('month', signup_time) AS signup_month,
  SUM(converted) AS conversions,
  COUNT(*) AS signups,
  SUM(converted) / COUNT(*) * 100 AS conversion_rate
FROM conversions
GROUP BY DATE_TRUNC('month', signup_time);
```
This manual implementation requires careful validation to ensure correctness matches dbt logic.
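A quick sanity check against the raw events table helps confirm the cohort logic before comparing against dbt — a sketch assuming the same table and column names as the DDL above:

```sql
-- Monthly signup counts straight from the events table; these should match
-- the signups reported by the conversion metric view
SELECT
  DATE_TRUNC('month', event_timestamp) AS signup_month,
  COUNT(DISTINCT user_id) AS signups
FROM main.events.user_events
WHERE event_type = 'signup'
GROUP BY DATE_TRUNC('month', event_timestamp)
ORDER BY signup_month;
```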
Dimension Definitions
Translate dimension expressions and time dimensions.
dbt definition:
```yaml
semantic_models:
  - name: orders
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
      - name: customer_tier
        expr: |
          CASE
            WHEN lifetime_value > 10000 THEN 'VIP'
            WHEN lifetime_value > 1000 THEN 'Standard'
            ELSE 'Basic'
          END
```
Databricks YAML:
```yaml
dimensions:
  - name: order_date_day
    expr: DATE(order_timestamp)
  - name: order_date_week
    expr: DATE_TRUNC('week', order_timestamp)
  - name: order_date_month
    expr: DATE_TRUNC('month', order_timestamp)
  - name: customer_tier
    expr: |
      CASE
        WHEN lifetime_value > 10000 THEN 'VIP'
        WHEN lifetime_value > 1000 THEN 'Standard'
        ELSE 'Basic'
      END
```
Key difference: dbt auto-generates time grains from a single time dimension, while Databricks requires an explicit dimension definition for each grain.
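In practice, a dbt query that referenced an auto-generated grain such as `metric_time__month` becomes a Databricks query grouped by the explicitly defined month dimension — a sketch using the dimension names from the YAML above:

```sql
-- Month-grain query against the explicit order_date_month dimension
SELECT
  order_date_month,
  MEASURE(total_revenue) AS total_revenue
FROM production.sales.order_metrics
GROUP BY order_date_month;
```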
Relationship and Join Definitions
Translate entity relationships to explicit joins.
dbt semantic graph:
```yaml
semantic_models:
  - name: orders
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
      - name: product_id
        type: foreign
  - name: customers
    entities:
      - name: customer_id
        type: primary
  - name: products
    entities:
      - name: product_id
        type: primary
```
dbt automatically resolves joins via entities. Databricks requires explicit join specifications.
Databricks YAML:
```yaml
metric_view:
  name: order_metrics
  source_table: main.sales.orders
  joins:
    - table: main.sales.customers
      on: orders.customer_id = customers.customer_id
    - table: main.sales.products
      on: orders.product_id = products.product_id
  measures:
    - name: total_revenue
      expr: SUM(order_amount)
  dimensions:
    - name: customer_region
      expr: customers.region
    - name: product_category
      expr: products.category
```
Databricks SQL DDL:
```sql
CREATE METRIC VIEW production.sales.order_metrics AS
SELECT
  SUM(order_amount) AS total_revenue,
  COUNT(DISTINCT order_id) AS order_count,
  customers.region AS customer_region,
  products.category AS product_category,
  DATE(order_timestamp) AS order_date
FROM main.sales.orders
JOIN main.sales.customers
  ON orders.customer_id = customers.customer_id
JOIN main.sales.products
  ON orders.product_id = products.product_id;
```
Implementation Steps
Step 1: Export dbt Semantic Models
Extract metric definitions from dbt project.
Locate all semantic models:
```bash
# Find semantic model files
find models/ -type f -name "*.yml" | xargs grep -l "semantic_models:"

# Extract to single file for review
find models/ -type f -name "*.yml" -exec cat {} \; > all_metrics.yml
```
Document dependencies:
Create inventory spreadsheet:
| Metric Name | Type | Source Tables | Dimensions | Consumers |
|---|---|---|---|---|
| total_revenue | simple | orders | region, date | Finance dashboard |
| avg_order_value | ratio | orders | region, date, product | Executive KPIs |
| conversion_rate | conversion | users, orders | cohort_month | Growth analytics |
Step 2: Create Unity Catalog Structure
Set up catalog hierarchy for metric views.
Create catalog and schemas:
```sql
-- Create catalog if needed
CREATE CATALOG IF NOT EXISTS production;

-- Create schemas by domain
CREATE SCHEMA IF NOT EXISTS production.finance;
CREATE SCHEMA IF NOT EXISTS production.marketing;
CREATE SCHEMA IF NOT EXISTS production.sales;
CREATE SCHEMA IF NOT EXISTS production.product;
```
Set ownership and permissions:
```sql
-- Assign schema ownership
ALTER SCHEMA production.finance OWNER TO `finance_team`;
ALTER SCHEMA production.marketing OWNER TO `marketing_team`;

-- Grant catalog-level access
GRANT USE CATALOG ON CATALOG production TO `all_users`;
GRANT USE SCHEMA ON SCHEMA production.finance TO `finance_analysts`;
```
Step 3: Register Metric Views
Create metric views using translated definitions.
Option 1: YAML-based registration
Create YAML file (production_finance_revenue_metrics.yaml):
```yaml
metric_view:
  name: revenue_metrics
  source_table: main.sales.orders
  joins:
    - table: main.sales.customers
      on: orders.customer_id = customers.customer_id
    - table: main.sales.products
      on: orders.product_id = products.product_id
  measures:
    - name: total_revenue
      expr: SUM(order_amount)
      description: "Total order revenue before discounts"
    - name: total_discounts
      expr: SUM(discount_amount)
      description: "Total discount amount applied"
    - name: net_revenue
      expr: SUM(order_amount) - SUM(discount_amount)
      description: "Revenue after discounts"
    - name: avg_order_value
      expr: SUM(order_amount) / COUNT(DISTINCT order_id)
      description: "Average revenue per order"
    - name: order_count
      expr: COUNT(DISTINCT order_id)
      description: "Total number of orders"
  dimensions:
    - name: customer_region
      expr: customers.region
    - name: customer_segment
      expr: customers.segment
    - name: product_category
      expr: products.category
    - name: order_date_day
      expr: DATE(order_timestamp)
    - name: order_date_month
      expr: DATE_TRUNC('month', order_timestamp)
    - name: order_date_quarter
      expr: DATE_TRUNC('quarter', order_timestamp)
```
Upload to DBFS and register:
```bash
databricks fs cp production_finance_revenue_metrics.yaml \
  dbfs:/metrics/finance/revenue_metrics.yaml
```
```sql
CREATE METRIC VIEW production.finance.revenue_metrics
FROM 'dbfs:/metrics/finance/revenue_metrics.yaml';
```
Option 2: SQL DDL registration
```sql
CREATE METRIC VIEW production.finance.revenue_metrics AS
SELECT
  -- Measures
  SUM(order_amount) AS total_revenue,
  SUM(discount_amount) AS total_discounts,
  SUM(order_amount) - SUM(discount_amount) AS net_revenue,
  SUM(order_amount) / COUNT(DISTINCT order_id) AS avg_order_value,
  COUNT(DISTINCT order_id) AS order_count,
  -- Dimensions
  customers.region AS customer_region,
  customers.segment AS customer_segment,
  products.category AS product_category,
  DATE(order_timestamp) AS order_date_day,
  DATE_TRUNC('month', order_timestamp) AS order_date_month,
  DATE_TRUNC('quarter', order_timestamp) AS order_date_quarter
FROM main.sales.orders
JOIN main.sales.customers
  ON orders.customer_id = customers.customer_id
JOIN main.sales.products
  ON orders.product_id = products.product_id;
```
Add metadata:
```sql
ALTER METRIC VIEW production.finance.revenue_metrics SET TBLPROPERTIES (
  'description' = 'Core revenue metrics for financial reporting and analysis',
  'owner' = 'finance_team@company.com',
  'certified' = 'true',
  'domain' = 'finance',
  'last_validated' = '2024-12-15'
);

COMMENT ON METRIC VIEW production.finance.revenue_metrics IS
  'Revenue metrics including gross revenue, discounts, net revenue, and order-level aggregations';
```
Configure access:
```sql
-- Grant read access
GRANT SELECT ON METRIC VIEW production.finance.revenue_metrics TO `finance_analysts`;
GRANT SELECT ON METRIC VIEW production.finance.revenue_metrics TO `executive_dashboard_service`;

-- Verify grants
SHOW GRANTS ON METRIC VIEW production.finance.revenue_metrics;
```
Step 4: Validate Metric Accuracy
Compare results between dbt and Databricks to ensure correctness.
Query both systems identically:
dbt MetricFlow query (via JDBC):
```sql
SELECT
  customer_region,
  total_revenue,
  avg_order_value,
  order_count
FROM semantic_layer.metrics
WHERE order_date >= '2024-01-01'
  AND order_date < '2024-02-01'
```
Databricks query:
```sql
SELECT
  customer_region,
  MEASURE(total_revenue) AS total_revenue,
  MEASURE(avg_order_value) AS avg_order_value,
  MEASURE(order_count) AS order_count
FROM production.finance.revenue_metrics
WHERE order_date_day >= '2024-01-01'
  AND order_date_day < '2024-02-01'
GROUP BY customer_region;
```
Automated validation with Fenic:
```python
import fenic as fc

# Configure session
config = fc.SessionConfig(
    default_model="gpt-4o-mini",
    max_retries=3,
    timeout=30
)
session = fc.FenicSession(config)
fc.set_session(session)

# Load comparison datasets
dbt_results = fc.read_parquet("s3://validation/dbt_metrics_jan2024.parquet")
databricks_results = fc.read_parquet("s3://validation/databricks_metrics_jan2024.parquet")

# Join and calculate variance
comparison = (
    dbt_results
    .join(
        databricks_results,
        on=["customer_region"],
        how="full_outer"
    )
    .with_column(
        "revenue_diff",
        fc.col("dbt_total_revenue") - fc.col("databricks_total_revenue")
    )
    .with_column(
        "revenue_variance_pct",
        fc.abs(fc.col("revenue_diff") / fc.col("dbt_total_revenue") * 100)
    )
    .with_column(
        "aov_diff",
        fc.col("dbt_avg_order_value") - fc.col("databricks_avg_order_value")
    )
)

# Flag significant discrepancies
issues = comparison.filter(
    (fc.col("revenue_variance_pct") > 0.01) |  # >0.01% variance
    (fc.abs(fc.col("aov_diff")) > 0.01)        # >$0.01 difference
)

results = issues.collect()

if len(results) > 0:
    print(f"Found {len(results)} discrepancies requiring investigation")
    for row in results:
        print(f"Region: {row['customer_region']}")
        print(f"  Revenue variance: {row['revenue_variance_pct']:.4f}%")
        print(f"  AOV difference: ${row['aov_diff']:.2f}")
else:
    print("All metrics validated successfully")
```
Root cause analysis for mismatches:
Common discrepancy causes:
- Join cardinality differences: Fan-out from incorrect join relationships (a quick check follows this list)
- Time zone conversions: Date/timestamp handling varies between systems
- Null handling: Aggregation behavior with NULL values
- Filter timing: WHERE clauses applied at different query stages
- Rounding differences: Floating point precision in calculations
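A quick cardinality check helps rule out join fan-out before digging into the other causes — a sketch against the source tables used throughout this guide:

```sql
-- If joined_rows exceeds base_rows, the joins are duplicating order rows
-- and inflating SUM-based measures
SELECT
  (SELECT COUNT(*) FROM main.sales.orders) AS base_rows,
  (SELECT COUNT(*)
   FROM main.sales.orders o
   JOIN main.sales.customers c ON o.customer_id = c.customer_id
   JOIN main.sales.products p ON o.product_id = p.product_id) AS joined_rows;
```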
Investigate specific issues:
```python
# Note: Fenic's semantic.map works on a single column and applies a transformation.
# For complex multi-field analysis, first combine fields into a single text column
investigation = (
    comparison
    .filter(fc.col("customer_region") == "US-West")
    .with_column(
        "analysis_input",
        fc.concat_ws(
            " | ",
            fc.lit("Metric: "), fc.col("metric_name"),
            fc.lit("Dimension: "), fc.col("customer_region"),
            fc.lit("Date: "), fc.col("order_date"),
            fc.lit("dbt value: "), fc.col("dbt_value"),
            fc.lit("Databricks value: "), fc.col("databricks_value"),
            fc.lit("Variance: "), fc.col("variance_pct")
        )
    )
    .with_column(
        "analysis",
        fc.semantic.map(
            fc.col("analysis_input"),
            """Analyze this metric discrepancy and identify the likely cause.
            Consider: Join cardinality (duplicates), Time zone handling,
            Null handling, Filter order. Provide specific hypothesis."""
        )
    )
)

investigation.show()
```
Step 5: Update BI Tool Connections
Reconfigure dashboards to query Databricks Metric Views.
Tableau migration:
- Open Tableau workbook
- Data → New Data Source → Databricks
- Configure connection:
- Server: your-workspace.cloud.databricks.com
- HTTP Path: /sql/1.0/warehouses/your-warehouse-id
- Authentication: Personal Access Token or OAuth
- Update custom SQL:
```sql
-- Old dbt query
SELECT customer_region, total_revenue, avg_order_value
FROM semantic_layer.metrics
WHERE order_date >= DATE('2024-01-01')

-- New Databricks query
SELECT
  customer_region,
  MEASURE(total_revenue) AS total_revenue,
  MEASURE(avg_order_value) AS avg_order_value
FROM production.finance.revenue_metrics
WHERE order_date_day >= DATE('2024-01-01')
GROUP BY customer_region
```
- Republish workbook to Tableau Server
Power BI migration:
- Get Data → More → Databricks
- Enter connection details
- Update Power Query M code:
```m
// Old dbt connection
let
    Source = Odbc.Query("dsn=dbt_metrics", "SELECT customer_region, total_revenue FROM semantic_layer.metrics")
in
    Source

// New Databricks connection
let
    Source = Databricks.Catalogs(
        "your-workspace.cloud.databricks.com",
        "/sql/1.0/warehouses/your-warehouse-id",
        [Catalog = "production", Schema = "finance"]
    ),
    revenue_metrics = Source{[Name = "revenue_metrics"]}[Data],
    custom_query = Databricks.Query(
        "your-workspace.cloud.databricks.com",
        "/sql/1.0/warehouses/your-warehouse-id",
        "SELECT customer_region, MEASURE(total_revenue) AS total_revenue FROM production.finance.revenue_metrics GROUP BY customer_region"
    )
in
    custom_query
```
- Refresh data model
- Update scheduled refreshes
Mode Analytics migration:
Replace native dbt integration with custom SQL:
```sql
-- Mode report query
WITH monthly_metrics AS (
  SELECT
    order_date_month AS month,
    customer_region,
    MEASURE(total_revenue) AS revenue,
    MEASURE(order_count) AS orders,
    MEASURE(avg_order_value) AS aov
  FROM production.finance.revenue_metrics
  WHERE order_date_day >= '2024-01-01'
  GROUP BY order_date_month, customer_region
)
SELECT
  month,
  customer_region,
  revenue,
  orders,
  aov,
  revenue - LAG(revenue, 1) OVER (
    PARTITION BY customer_region ORDER BY month
  ) AS revenue_growth
FROM monthly_metrics
ORDER BY month DESC, revenue DESC;
```
Update report parameters and filters to match new schema.
Custom applications using GraphQL:
Replace GraphQL queries with SQL via Databricks connectors.
Before (dbt GraphQL):
```python
import requests

response = requests.post(
    'https://semantic-layer.cloud.getdbt.com/api/graphql',
    json={
        'query': '''
            query {
              metrics(
                metrics: ["total_revenue", "order_count"]
                dimensions: ["customer_region"]
                where: "order_date >= '2024-01-01'"
              ) {
                customer_region
                total_revenue
                order_count
              }
            }
        '''
    },
    headers={'Authorization': f'Bearer {token}'}
)

data = response.json()
```
After (Databricks SQL connector):
```python
from databricks import sql
import os

def query_metrics(metrics, dimensions, filters):
    """Query Databricks Metric Views via SQL"""
    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_TOKEN")
    )

    # Build MEASURE clauses
    measure_clauses = [f"MEASURE({m}) AS {m}" for m in metrics]
    measures_sql = ", ".join(measure_clauses)

    # Build dimensions
    dims_sql = ", ".join(dimensions) if dimensions else ""

    # Build WHERE clause
    where_sql = " AND ".join([f"{k} {v}" for k, v in filters.items()])

    # Build GROUP BY
    group_by_sql = f"GROUP BY {dims_sql}" if dims_sql else ""

    query = f"""
        SELECT {dims_sql + ',' if dims_sql else ''} {measures_sql}
        FROM production.finance.revenue_metrics
        WHERE {where_sql}
        {group_by_sql}
    """

    cursor = connection.cursor()
    cursor.execute(query)

    columns = [desc[0] for desc in cursor.description]
    results = [dict(zip(columns, row)) for row in cursor.fetchall()]

    cursor.close()
    connection.close()

    return results

# Usage
data = query_metrics(
    metrics=["total_revenue", "order_count"],
    dimensions=["customer_region"],
    filters={"order_date_day >=": "'2024-01-01'"}
)
```
For REST API requirements, wrap in FastAPI:
```python
import os

from databricks import sql
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Dict

app = FastAPI()

class MetricRequest(BaseModel):
    metrics: List[str]
    dimensions: List[str] = []
    filters: Dict[str, str] = {}

@app.post("/api/metrics/query")
async def query_metrics_endpoint(request: MetricRequest):
    try:
        # query_metrics is the helper defined in the connector example above
        results = query_metrics(
            metrics=request.metrics,
            dimensions=request.dimensions,
            filters=request.filters
        )
        return {"data": results, "row_count": len(results)}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/metrics/list")
async def list_metrics():
    """List available metrics from Unity Catalog"""
    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_TOKEN")
    )

    cursor = connection.cursor()
    cursor.execute("""
        SELECT column_name, comment
        FROM system.information_schema.columns
        WHERE table_catalog = 'production'
          AND table_schema = 'finance'
          AND table_name = 'revenue_metrics'
    """)

    metrics = [{"name": row[0], "description": row[1]} for row in cursor.fetchall()]

    cursor.close()
    connection.close()

    return {"metrics": metrics}
```
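A hedged usage example for the wrapper above, assuming it runs locally on the default uvicorn port:

```bash
curl -X POST http://localhost:8000/api/metrics/query \
  -H "Content-Type: application/json" \
  -d "{\"metrics\": [\"total_revenue\", \"order_count\"], \"dimensions\": [\"customer_region\"], \"filters\": {\"order_date_day >=\": \"'2024-01-01'\"}}"
```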
Step 6: Implement Version Control
Establish git-based workflows for metric definitions.
Export metric views to YAML:
```sql
-- Export metric view definition
DESCRIBE METRIC VIEW EXTENDED production.finance.revenue_metrics;
```
Store exported YAML in repository:
```
metrics/
├── README.md
├── finance/
│   ├── revenue_metrics.yaml
│   ├── cost_metrics.yaml
│   └── margin_metrics.yaml
├── marketing/
│   ├── campaign_metrics.yaml
│   └── attribution_metrics.yaml
└── sales/
    ├── pipeline_metrics.yaml
    └── quota_metrics.yaml
```
CI/CD pipeline for deployments:
```yaml
# .github/workflows/deploy_metrics.yml
name: Deploy Metric Views

on:
  push:
    branches: [main]
    paths: ['metrics/**/*.yaml']

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          pip install databricks-cli

      - name: Configure Databricks
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          databricks configure --token <<EOF
          $DATABRICKS_HOST
          $DATABRICKS_TOKEN
          EOF

      - name: Upload YAMLs to DBFS
        run: |
          databricks fs cp --recursive --overwrite \
            metrics/ dbfs:/metrics/

      - name: Deploy metric views
        run: |
          # Repository layout is metrics/<schema>/<metric_name>.yaml;
          # all views deploy into the production catalog
          catalog=production
          for yaml_file in metrics/**/*.yaml; do
            schema=$(echo $yaml_file | cut -d'/' -f2)
            metric_name=$(basename $yaml_file .yaml)
            databricks sql execute \
              --statement "CREATE OR REPLACE METRIC VIEW ${catalog}.${schema}.${metric_name} FROM 'dbfs:/metrics/${schema}/${metric_name}.yaml'"
          done

      - name: Validate deployments
        run: |
          databricks sql execute \
            --statement "SELECT COUNT(*) FROM system.information_schema.tables WHERE table_type = 'METRIC_VIEW'"
```
Change tracking:
```sql
-- Query object modification history
SELECT
  table_catalog,
  table_schema,
  table_name,
  table_owner,
  created_at,
  updated_at,
  table_properties
FROM system.information_schema.tables
WHERE table_type = 'METRIC_VIEW'
  AND table_catalog = 'production'
ORDER BY updated_at DESC;
```
Combine with git commit history for complete audit trail.
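For example, pairing a metric view's updated_at timestamp with the git history of its YAML file shows who changed what and when (path is illustrative):

```bash
git log --oneline --follow -- metrics/finance/revenue_metrics.yaml
```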
Validation with Fenic
Automated Metric Comparison
Fenic semantic operators enable sophisticated metric validation.
```python
import fenic as fc
from pydantic import BaseModel
from typing import Literal

class ValidationResult(BaseModel):
    metric_name: str
    dimension_values: str
    status: Literal["pass", "fail", "warning"]
    variance_pct: float
    details: str

# ... (comparison DataFrame built as in the earlier validation step)

# First, create a combined text column with all the context
validation = (
    comparison
    .filter(fc.col("variance_pct") > 0.01)
    .with_column(
        "discrepancy_context",
        fc.concat_ws(
            "\n",
            fc.concat(fc.lit("Metric: "), fc.col("metric_name")),
            fc.concat(fc.lit("Dimension: "), fc.col("customer_region")),
            fc.concat(fc.lit("Date: "), fc.col("order_date")),
            fc.concat(fc.lit("dbt value: "), fc.col("dbt_value").cast("string")),
            fc.concat(fc.lit("Databricks value: "), fc.col("databricks_value").cast("string")),
            fc.concat(fc.lit("Variance: "), fc.col("variance_pct").cast("string"), fc.lit("%"))
        )
    )
    .with_column(
        "root_cause_analysis",
        fc.semantic.extract(
            fc.col("discrepancy_context"),
            ValidationResult
        )
    )
)

# Generate validation report
report = validation.collect()

print(f"Validation Results: {len(report)} discrepancies found")
for item in report:
    result = item['root_cause_analysis']
    print(f"\n{result['status'].upper()}: {result['metric_name']}")
    print(f"  {result['details']}")
    print(f"  Variance: {result['variance_pct']:.2f}%")
```
Data Quality Monitoring
Establish ongoing quality checks with Fenic semantic operators.
```python
import fenic as fc
from pydantic import BaseModel
from typing import Literal
from datetime import datetime, timedelta

# Define anomaly detection schema
class AnomalyCheck(BaseModel):
    check_type: Literal["range", "trend", "completeness", "consistency"]
    is_anomalous: bool
    severity: Literal["critical", "warning", "info"]
    explanation: str
    recommended_action: str

# Note: Since Fenic doesn't support SQL reading, export data first
# Assuming data has been exported to parquet
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

metrics_df = (
    fc.read.parquet("s3://temp/recent_metrics.parquet")
    .filter(fc.col("order_date_day") >= yesterday)
)

# Create a combined context column for anomaly detection
metrics_with_context = metrics_df.with_column(
    "anomaly_context",
    fc.concat_ws(
        "\n",
        fc.concat(fc.lit("Date: "), fc.col("order_date_day").cast("string")),
        fc.concat(fc.lit("Region: "), fc.col("customer_region")),
        fc.concat(fc.lit("Revenue: $"), fc.col("revenue").cast("string")),
        fc.concat(fc.lit("Orders: "), fc.col("orders").cast("string")),
        fc.concat(fc.lit("AOV: $"), fc.col("aov").cast("string")),
        fc.lit("Expected ranges:"),
        fc.lit("- Revenue: $50K-$500K per region per day"),
        fc.lit("- Orders: 100-5000 per region per day"),
        fc.lit("- AOV: $20-$200")
    )
)

# Semantic anomaly detection
anomalies = (
    metrics_with_context
    .with_column(
        "anomaly_analysis",
        fc.semantic.extract(
            fc.col("anomaly_context"),
            AnomalyCheck
        )
    )
    # Note: Accessing nested fields may require different syntax in actual Fenic
    # This is a conceptual representation
)

# Generate alerts
issues = anomalies.collect()

if len(issues) > 0:
    for issue in issues:
        analysis = issue['anomaly_analysis']
        print(f"ALERT [{analysis['severity'].upper()}]: {analysis['check_type']}")
        print(f"  Date: {issue['order_date_day']}, Region: {issue['customer_region']}")
        print(f"  {analysis['explanation']}")
        print(f"  Action: {analysis['recommended_action']}")
```
Regression Testing
Implement automated tests for metric correctness.
```python
import fenic as fc
from pydantic import BaseModel
from typing import Literal

class TestResult(BaseModel):
    test_name: str
    passed: bool
    actual_value: float
    expected_value: float
    tolerance_pct: float
    message: str

def test_metric_values():
    """Test metric calculations against known values"""
    # Note: Fenic doesn't support SQL reading. Export data from Databricks to Parquet first.
    # Example: Use databricks-sql-connector to export, then read with Fenic

    # Test 1: Revenue for Q1 2024 by region
    # Assuming data has been exported to parquet
    q1_revenue = fc.read_parquet("s3://validation/q1_2024_revenue_by_region.parquet")

    # Expected values (from dbt baseline)
    expected = {
        'US-West': 5_000_000,
        'US-East': 4_000_000,
        'EU': 3_000_000
    }

    results = []
    for row in q1_revenue.collect():
        region = row['customer_region']
        actual = row['revenue']
        expect = expected.get(region, 0)
        variance = abs((actual - expect) / expect * 100)

        results.append(TestResult(
            test_name=f"q1_2024_revenue_{region}",
            passed=variance < 0.1,  # <0.1% tolerance
            actual_value=actual,
            expected_value=expect,
            tolerance_pct=variance,
            message=f"Expected ${expect:,.0f}, got ${actual:,.0f}"
        ))

    return results

def test_metric_relationships():
    """Test relationships between metrics"""
    # Note: Fenic doesn't support SQL reading. Export data from Databricks to Parquet first.
    # Test: Revenue = Orders * AOV (within rounding tolerance)
    # Assuming data has been exported with the diff calculation done in Databricks
    relationship_test = fc.read_parquet("s3://validation/metric_relationships_test.parquet")

    max_diff = relationship_test.select(
        fc.max(fc.abs(fc.col("diff")))
    ).collect()[0][0]

    return [TestResult(
        test_name="revenue_equals_orders_times_aov",
        passed=max_diff < 0.01,  # $0.01 tolerance
        actual_value=max_diff,
        expected_value=0.0,
        tolerance_pct=0,
        message=f"Max difference between revenue and orders*aov: ${max_diff:.2f}"
    )]

# Run all tests
all_tests = test_metric_values() + test_metric_relationships()
failed = [t for t in all_tests if not t.passed]

if failed:
    print(f"FAILED: {len(failed)}/{len(all_tests)} tests")
    for test in failed:
        print(f"  {test.test_name}: {test.message}")
else:
    print(f"PASSED: All {len(all_tests)} tests")
```
Common Migration Challenges
Challenge: Conversion Metrics Without Native Support
Problem: dbt's conversion metric type calculates cohort-based conversion rates over time windows. Databricks has no equivalent native construct.
Solution: Implement conversion logic manually with window clauses and CTEs.
```sql
CREATE METRIC VIEW production.growth.conversion_metrics AS
WITH user_events AS (
  -- Get first occurrence of each event type per user
  SELECT
    user_id,
    MIN(CASE WHEN event_type = 'signup' THEN event_timestamp END) AS signup_time,
    MIN(CASE WHEN event_type = 'purchase' THEN event_timestamp END) AS purchase_time,
    MIN(CASE WHEN event_type = 'subscription' THEN event_timestamp END) AS subscription_time
  FROM main.events.user_events
  GROUP BY user_id
),
conversion_windows AS (
  -- Calculate time differences and conversion flags
  SELECT
    user_id,
    signup_time,
    purchase_time,
    subscription_time,
    CASE
      WHEN purchase_time IS NOT NULL
       AND purchase_time <= signup_time + INTERVAL 7 DAYS THEN 1
      ELSE 0
    END AS converted_7d,
    CASE
      WHEN purchase_time IS NOT NULL
       AND purchase_time <= signup_time + INTERVAL 30 DAYS THEN 1
      ELSE 0
    END AS converted_30d,
    CASE
      WHEN subscription_time IS NOT NULL
       AND subscription_time <= signup_time + INTERVAL 30 DAYS THEN 1
      ELSE 0
    END AS subscribed_30d
  FROM user_events
  WHERE signup_time IS NOT NULL
)
SELECT
  DATE_TRUNC('month', signup_time) AS signup_month,
  COUNT(*) AS signups,
  SUM(converted_7d) AS conversions_7d,
  SUM(converted_30d) AS conversions_30d,
  SUM(subscribed_30d) AS subscriptions_30d,
  SUM(converted_7d) * 100.0 / COUNT(*) AS conversion_rate_7d,
  SUM(converted_30d) * 100.0 / COUNT(*) AS conversion_rate_30d,
  SUM(subscribed_30d) * 100.0 / COUNT(*) AS subscription_rate_30d
FROM conversion_windows
GROUP BY DATE_TRUNC('month', signup_time);
```
Validate conversion logic correctness:
```python
import fenic as fc

# Load dbt conversion results
dbt_conversions = fc.read.parquet("s3://validation/dbt_conversion_metrics.parquet")

# Note: Fenic doesn't support SQL reading. Export data from Databricks first.
# Load Databricks conversion results from exported parquet
db_conversions = fc.read_parquet("s3://validation/databricks_conversion_metrics.parquet")

# Compare with tolerance
validation = (
    dbt_conversions
    .join(db_conversions, on="signup_month")
    .with_column(
        "rate_diff_7d",
        fc.abs(fc.col("dbt_conversion_rate_7d") - fc.col("db_conversion_rate_7d"))
    )
    .with_column(
        "rate_diff_30d",
        fc.abs(fc.col("dbt_conversion_rate_30d") - fc.col("db_conversion_rate_30d"))
    )
    .filter(
        (fc.col("rate_diff_7d") > 0.1) |
        (fc.col("rate_diff_30d") > 0.1)
    )
)

issues = validation.collect()

if len(issues) == 0:
    print("Conversion metrics validated successfully")
else:
    print(f"Found {len(issues)} conversion metric discrepancies")
    for issue in issues:
        print(f"Month: {issue['signup_month']}, "
              f"7d diff: {issue['rate_diff_7d']:.2f}%, "
              f"30d diff: {issue['rate_diff_30d']:.2f}%")
```
Challenge: Time Grain Proliferation
Problem: dbt auto-generates time grains from single time dimension. Databricks requires explicit dimension per grain.
Solution: Define commonly-used grains as dimensions, use SQL date clauses for ad-hoc grains.
```yaml
# Metric view YAML with multiple time grains
dimensions:
  - name: order_date_day
    expr: DATE(order_timestamp)
    description: "Daily grain"
  - name: order_date_week
    expr: DATE_TRUNC('week', order_timestamp)
    description: "Weekly grain (week starts Monday)"
  - name: order_date_month
    expr: DATE_TRUNC('month', order_timestamp)
    description: "Monthly grain"
  - name: order_date_quarter
    expr: DATE_TRUNC('quarter', order_timestamp)
    description: "Quarterly grain"
  - name: order_date_year
    expr: DATE_TRUNC('year', order_timestamp)
    description: "Yearly grain"
  - name: fiscal_year
    expr: |
      CASE
        WHEN MONTH(order_timestamp) >= 4 THEN YEAR(order_timestamp)
        ELSE YEAR(order_timestamp) - 1
      END
    description: "Fiscal year (April start)"
  - name: fiscal_quarter
    expr: |
      CASE
        WHEN MONTH(order_timestamp) BETWEEN 4 AND 6 THEN 'Q1'
        WHEN MONTH(order_timestamp) BETWEEN 7 AND 9 THEN 'Q2'
        WHEN MONTH(order_timestamp) BETWEEN 10 AND 12 THEN 'Q3'
        ELSE 'Q4'
      END
    description: "Fiscal quarter"
```
For custom time periods, compute in queries:
```sql
-- Custom 4-week rolling windows
SELECT
  DATE_TRUNC('week', order_date_day) AS week_start,
  customer_region,
  SUM(MEASURE(total_revenue)) OVER (
    PARTITION BY customer_region
    ORDER BY DATE_TRUNC('week', order_date_day)
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS rolling_4wk_revenue
FROM production.finance.revenue_metrics
WHERE order_date_day >= CURRENT_DATE - 90
GROUP BY DATE_TRUNC('week', order_date_day), customer_region;
```
Challenge: Testing Framework Gap
Problem: dbt's testing framework doesn't translate. Need alternative validation approach.
Solution: Implement tests with Fenic semantic operators and Databricks SQL.
```python
import fenic as fc
from pydantic import BaseModel
from typing import Literal, List

class MetricTest(BaseModel):
    test_name: str
    test_type: Literal["not_null", "range", "relationship", "trend", "uniqueness"]
    passed: bool
    severity: Literal["critical", "warning", "info"]
    details: str

def test_no_null_metrics():
    """Test: No NULL values in key metrics"""
    # Note: Fenic doesn't support SQL reading. Export aggregated null check from Databricks first.
    null_check_df = fc.read_parquet("s3://validation/null_check_results.parquet")
    null_check = null_check_df.collect()[0]

    has_nulls = (
        null_check["null_revenue"] > 0 or
        null_check["null_orders"] > 0 or
        null_check["null_aov"] > 0
    )

    return MetricTest(
        test_name="no_null_metrics",
        test_type="not_null",
        passed=not has_nulls,
        severity="critical" if has_nulls else "info",
        details=f"Null counts - Revenue: {null_check['null_revenue']}, Orders: {null_check['null_orders']}, AOV: {null_check['null_aov']}"
    )

def test_metric_relationships():
    """Test: Revenue = Orders * AOV within tolerance"""
    # Note: Fenic doesn't support SQL reading. Export data from Databricks first.
    relationship_check = fc.read_parquet("s3://validation/relationship_check.parquet")

    max_diff = relationship_check.select(
        fc.max(fc.col("diff"))
    ).collect()[0][0]

    passed = max_diff < 0.01  # $0.01 tolerance

    return MetricTest(
        test_name="revenue_aov_relationship",
        test_type="relationship",
        passed=passed,
        severity="critical" if not passed else "info",
        details=f"Max difference: ${max_diff:.2f}"
    )

def test_metric_ranges():
    """Test: Metrics within expected ranges"""
    # Note: Fenic doesn't support SQL reading. Export data from Databricks first.
    range_check = fc.read_parquet("s3://validation/range_check_last_7_days.parquet")

    # Check for outliers
    outliers = (
        range_check
        .filter(
            (fc.col("revenue") < 0) |
            (fc.col("revenue") > 10_000_000) |
            (fc.col("orders") < 0) |
            (fc.col("orders") > 50_000)
        )
        .count()
    )

    return MetricTest(
        test_name="metric_ranges",
        test_type="range",
        passed=outliers == 0,
        severity="warning" if outliers > 0 else "info",
        details=f"Found {outliers} out-of-range values"
    )

def test_metric_trends():
    """Test: No extreme day-over-day changes"""
    # Note: Fenic doesn't support SQL reading. Export data from Databricks first.
    # The window function calculation should be done in Databricks before export
    trend_check = fc.read_parquet("s3://validation/trend_check_with_pct_change.parquet")

    extreme_changes = (
        trend_check
        .filter(fc.col("pct_change") > 50)  # >50% day-over-day change
        .count()
    )

    return MetricTest(
        test_name="trend_stability",
        test_type="trend",
        passed=extreme_changes == 0,
        severity="warning" if extreme_changes > 0 else "info",
        details=f"Found {extreme_changes} days with >50% change"
    )

# Run all tests
def run_all_tests():
    tests = [
        test_no_null_metrics(),
        test_metric_relationships(),
        test_metric_ranges(),
        test_metric_trends()
    ]

    failed = [t for t in tests if not t.passed]
    critical = [t for t in failed if t.severity == "critical"]

    print(f"Test Results: {len(tests) - len(failed)}/{len(tests)} passed")

    if critical:
        print(f"\nCRITICAL FAILURES ({len(critical)}):")
        for test in critical:
            print(f"  {test.test_name}: {test.details}")

    if failed and not critical:
        print(f"\nWARNINGS ({len(failed)}):")
        for test in failed:
            print(f"  {test.test_name}: {test.details}")

    return len(critical) == 0

# Execute
if __name__ == "__main__":
    import sys
    success = run_all_tests()
    sys.exit(0 if success else 1)
```
Integrate into CI/CD:
```yaml
# .github/workflows/test_metrics.yml
name: Test Databricks Metric Views

on:
  pull_request:
    paths: ['metrics/**']
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install fenic databricks-sql-connector pydantic

      - name: Run metric tests
        env:
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
          DATABRICKS_SERVER_HOSTNAME: ${{ secrets.DATABRICKS_SERVER_HOSTNAME }}
          DATABRICKS_HTTP_PATH: ${{ secrets.DATABRICKS_HTTP_PATH }}
        run: |
          python tests/test_metrics.py
```
Post-Migration Optimization
Performance Tuning
Optimize metric query performance in Databricks.
Enable Photon:
```sql
-- Verify Photon is enabled on SQL warehouse
SELECT current_version();

-- Photon provides 2-3x performance for analytical queries
-- Enable in SQL warehouse settings: Photon acceleration = ON
```
Optimize underlying Delta tables:
```sql
-- Z-order frequently filtered columns
OPTIMIZE main.sales.orders
ZORDER BY (order_date, customer_id);

-- Analyze tables for query planning
ANALYZE TABLE main.sales.orders COMPUTE STATISTICS FOR ALL COLUMNS;

-- Vacuum old files
VACUUM main.sales.orders RETAIN 168 HOURS;
```
Benchmark query performance:
```python
import os
import time

import fenic as fc
from databricks import sql

queries = [
    ("daily_by_region", """
        SELECT order_date_day, customer_region, MEASURE(total_revenue) AS revenue
        FROM production.finance.revenue_metrics
        WHERE order_date_day >= CURRENT_DATE - 30
        GROUP BY order_date_day, customer_region
    """),
    ("monthly_totals", """
        SELECT order_date_month, MEASURE(total_revenue) AS revenue, MEASURE(order_count) AS orders
        FROM production.finance.revenue_metrics
        WHERE order_date_day >= CURRENT_DATE - 365
        GROUP BY order_date_month
    """)
]

# Note: Fenic doesn't support SQL reading from Databricks directly.
# For true query benchmarking, run the queries through databricks-sql-connector
# and keep Fenic for processing the collected results.
results = []
for query_name, query_sql in queries:
    # Use Databricks SQL connector for benchmarking
    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_TOKEN")
    )

    start = time.time()
    cursor = connection.cursor()
    cursor.execute(query_sql)
    rows = cursor.fetchall()
    duration = time.time() - start

    results.append({
        "query": query_name,
        "duration_seconds": duration,
        "rows": len(rows)
    })

    cursor.close()
    connection.close()

    print(f"{query_name}: {duration:.2f}s ({len(rows)} rows)")

# Store benchmark results for tracking
fc.DataFrame(results).write.parquet("s3://benchmarks/metric_queries/")
```
Usage Monitoring
Track metric view adoption and query patterns.
```sql
-- Query history for metric views
CREATE OR REPLACE TEMP VIEW metric_usage AS
SELECT
  user_name,
  DATE(start_time) AS query_date,
  HOUR(start_time) AS query_hour,
  statement_text,
  execution_duration_ms / 1000.0 AS execution_seconds,
  rows_produced,
  CASE
    WHEN statement_text LIKE '%production.finance.revenue_metrics%' THEN 'revenue_metrics'
    WHEN statement_text LIKE '%production.marketing.campaign_metrics%' THEN 'campaign_metrics'
    ELSE 'other'
  END AS metric_view
FROM system.query.history
WHERE statement_text LIKE '%MEASURE(%'
  AND start_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
  AND statement_type = 'SELECT';

-- Usage summary
SELECT
  metric_view,
  COUNT(*) AS query_count,
  COUNT(DISTINCT user_name) AS unique_users,
  AVG(execution_seconds) AS avg_duration,
  MAX(execution_seconds) AS max_duration,
  SUM(rows_produced) AS total_rows
FROM metric_usage
GROUP BY metric_view
ORDER BY query_count DESC;
```
Analyze adoption with Fenic:
```python
import fenic as fc

# Note: Fenic doesn't support SQL reading. Export query logs from Databricks first.
# Load query logs that have been pre-exported to parquet
query_logs = fc.read_parquet("s3://validation/query_logs_last_30_days.parquet")

adoption = (
    query_logs
    .group_by("user_name")
    .agg(
        fc.sum("query_count").alias("total_queries"),
        fc.count_distinct("query_date").alias("active_days")
    )
    .with_column(
        "user_activity_summary",
        fc.concat_ws(
            ", ",
            fc.concat(fc.lit("Queries: "), fc.col("total_queries").cast("string")),
            fc.concat(fc.lit("Active days: "), fc.col("active_days").cast("string"))
        )
    )
    .with_column(
        "user_segment",
        fc.semantic.classify(
            fc.col("user_activity_summary"),
            categories=["power_user", "regular_user", "occasional_user", "new_user"]
        )
    )
)

# Segment summary
segments = (
    adoption
    .group_by("user_segment")
    .agg(fc.count("*").alias("user_count"))
    .order_by("user_count", ascending=False)
)

segments.show()
```
Conclusion
Migrating from dbt Semantic Layer to Databricks Unity Catalog Metric Views consolidates metric logic within a lakehouse-native platform. The migration requires systematic translation of YAML definitions, careful validation of calculation correctness, and reconfiguration of BI tool connections.
Critical success factors:
Systematic translation - Follow consistent patterns for simple, ratio, derived, and conversion metrics. Document edge cases and custom logic.
Rigorous validation - Compare results between dbt and Databricks using automated frameworks. Investigate and resolve all discrepancies before production cutover.
Semantic data processing - Leverage Fenic's semantic operators for validation, quality monitoring, and documentation generation throughout migration.
Incremental deployment - Migrate metrics in phases by domain or complexity tier. Maintain parallel systems temporarily for rollback capability.
The investment yields operational efficiency through simplified architecture, unified governance across analytics and ML, and elimination of external infrastructure dependencies. Organizations committed to Databricks gain consistency and performance by consolidating metrics into Unity Catalog.
Additional Typedef resources:
