
How to Migrate from Cube to Databricks Unity Catalog Metric Views (Databricks)

Typedef Team


Databricks Unity Catalog Metric Views provide a lakehouse-native semantic layer that defines metrics once and queries them flexibly across any dimension at runtime. This guide covers the complete workflow from raw data preparation using Fenic's semantic operators through metric view creation, querying patterns, and production deployment strategies.


Why Databricks Metric Views

Databricks Unity Catalog Metric Views address the metric definition problem at the lakehouse layer. Rather than maintaining metric logic in BI tools or external services, you define measures and dimensions as first-class catalog objects that integrate directly with Unity Catalog's governance framework.

The architecture provides three core benefits:

Native execution - Metric view queries execute in Spark SQL with Photon's vectorized engine. No external caching layers or middleware latency.

Runtime flexibility - Unlike materialized views that lock dimensions at creation time, metric views separate measure definitions from dimension groupings. Query the same metric across any available dimension without rebuilding aggregations.

Lakehouse governance - RBAC, row-level security, column masking, audit logs, and lineage work identically for metric views and tables. Access control policies apply consistently across all data assets.

Organizations using Databricks for data engineering, ML, and analytics consolidate metric definitions at the catalog layer. Data scientists, analysts, and BI tools consume the same governed metrics through SQL, Python notebooks, and native integrations.


Architecture Foundations

Metric View Components

A metric view specification includes five elements:

Source - The base table or SQL query providing fact data. Fully qualified three-level names (catalog.schema.table) identify sources.

Measures - Aggregation expressions using SQL. Each measure defines how to compute a metric across any dimension grouping.

Dimensions - Column expressions for grouping and filtering. Dimensions can reference source columns or joined table attributes.

Joins - Relationships to dimension tables. Star and snowflake schemas are supported with explicit join conditions.

Semantic metadata - Descriptions, synonyms, display names, and format specifications for AI/BI tool integration.

The YAML definition is registered in Unity Catalog as a queryable object. When you query a metric view, Spark SQL's Catalyst optimizer rewrites the request into physical SQL that executes on Delta Lake tables.
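
For example, a measure query is expanded into ordinary aggregation SQL over the source table. The sketch below uses the orders_metrics view defined later in this guide; the second statement approximates what the optimizer produces, not the literal physical plan:

sql
-- Query against the metric view
SELECT
  status,
  MEASURE(total_revenue) as revenue
FROM catalog.schema.orders_metrics
GROUP BY status;

-- Roughly equivalent SQL after measure expansion
SELECT
  status,
  SUM(amount) as revenue
FROM catalog.schema.orders
GROUP BY status;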

Execution Model

Query execution follows this path:

  1. User issues SQL with MEASURE() clauses referencing metric view measures
  2. Catalyst optimizer parses the MEASURE() references
  3. Query planner expands measures into their defined aggregation expressions
  4. Join logic is injected based on the requested dimensions
  5. Optimizer applies standard transformations (predicate pushdown, join reordering)
  6. Photon executes vectorized aggregations on Delta Lake data
  7. Results return through standard Spark result sets

Performance scales with SQL warehouse sizing. Small warehouses handle sub-second queries on moderate datasets. Large warehouses process billions of rows with distributed parallelism.

Delta Lake's data skipping prunes files based on min/max statistics. Z-ordering on common filter columns further optimizes scan performance. Result caching serves identical queries instantly on subsequent executions.


Data Preparation with Fenic

Raw data often requires transformation before metric view definitions become straightforward. Fenic provides semantic operators for preprocessing tasks that traditional ETL struggles with.

Schema Extraction from Unstructured Sources

Many datasets contain unstructured fields where metric-relevant attributes hide in text. Fenic's semantic extraction transforms these into structured columns:

python
import fenic as fc
from pydantic import BaseModel
from typing import List, Literal

class OrderMetadata(BaseModel):
    product_category: Literal["electronics", "clothing", "home", "other"]
    priority: Literal["standard", "expedited", "overnight"]
    tags: List[str]

config = fc.SessionConfig(
    app_name="order_preprocessing",
    semantic=fc.SemanticConfig(
        language_models={
            "extractor": fc.OpenAILanguageModel(
                model_name="gpt-4o-mini",
                rpm=500,
                tpm=200_000
            )
        }
    )
)

session = fc.Session.get_or_create(config)

# Load raw orders with unstructured descriptions
orders = session.table("catalog.schema.raw_orders")
# Extract structured attributes
structured = (
    orders
    .with_column(
        "metadata",
        fc.semantic.extract(
            fc.col("order_notes"),
            OrderMetadata,
            model_alias="extractor"
        )
    )
    .unnest("metadata")
)

# Write enriched table for metric view source
structured.write.save_as_table(
    "catalog.schema.orders_enriched",
    mode="overwrite"
)

The metric view then references the enriched table with clean categorical dimensions:

yaml
version: 1.1
source: catalog.schema.orders_enriched

measures:
  - name: order_count
    expr: COUNT(*)

  - name: total_revenue
    expr: SUM(amount)

dimensions:
  - name: product_category
    expr: product_category

  - name: priority
    expr: priority

This pattern eliminates complex CASE WHEN logic in metric definitions by preprocessing dimensions at the table layer.
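
For example, a query against the enriched view can group directly on the extracted attributes. The view name below (orders_enriched_metrics) is an assumed registration of the YAML above:

sql
SELECT
  product_category,
  priority,
  MEASURE(total_revenue) as revenue,
  MEASURE(order_count) as orders
FROM catalog.schema.orders_enriched_metrics
GROUP BY product_category, priority;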

Semantic Classification for Dimension Enrichment

When source data lacks categorical dimensions, semantic classification adds them:

python
feedback = session.table("catalog.schema.customer_feedback")
classified = feedback.with_column(
    "category",
    fc.semantic.classify(
        fc.col("feedback_text"),
        classes=["product_issue", "billing_question", "feature_request", "complaint"],
        model_alias="extractor"
    )
)

classified.write.save_as_table(
    "catalog.schema.feedback_classified",
    mode="overwrite"
)

Metric views aggregate on the classified dimension without runtime inference:

yaml
version: 1.1
source: catalog.schema.feedback_classified

measures:
  - name: feedback_count
    expr: COUNT(*)

dimensions:
  - name: category
    expr: category

  - name: submitted_date
    expr: DATE(submitted_at)

Queries group by category instantly since classification happened during preprocessing.
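
A typical query groups on the precomputed label directly. The view name below (feedback_metrics) is an assumed registration of the YAML above:

sql
SELECT
  category,
  submitted_date,
  MEASURE(feedback_count) as feedback_count
FROM catalog.schema.feedback_metrics
GROUP BY category, submitted_date
ORDER BY submitted_date;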

Joining Tables with Semantic Matching

When foreign keys don't exist or are unreliable, semantic joins link related records:

python
customers = session.table("catalog.schema.customers")
accounts = session.table("catalog.schema.accounts")

join_predicate = """
Are these the same entity?
Customer: {{left_name}}, {{left_city}}
Account: {{right_name}}, {{right_location}}
"""

linked = customers.semantic.join(
    other=accounts,
    predicate=join_predicate,
    left_on=fc.text.concat(fc.col("name"), fc.lit(" "), fc.col("city")),
    right_on=fc.text.concat(fc.col("account_name"), fc.lit(" "), fc.col("location"))
)

linked.write.save_as_table(
    "catalog.schema.customer_account_mapping",
    mode="overwrite"
)

The mapping table enables proper joins in metric views where relationships were previously ambiguous.
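
Before wiring the mapping into a metric view join, a quick sanity check on match coverage catches obvious problems. This is a minimal sketch; customer_id and account_id are assumed identifier columns carried through the semantic join:

sql
SELECT
  COUNT(*) as mapped_pairs,
  COUNT(DISTINCT customer_id) as customers_matched,  -- assumed id column from customers
  COUNT(DISTINCT account_id) as accounts_matched     -- assumed id column from accounts
FROM catalog.schema.customer_account_mapping;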

For complete API reference and examples, see the Fenic semantic operators guide.


Creating Metric Views

Basic Metric View Structure

Start with a simple metric view defining core aggregations:

yaml
version: 1.1
source: catalog.schema.orders

measures:
  - name: order_count
    expr: COUNT(*)
    comment: "Total number of orders"

  - name: total_revenue
    expr: SUM(amount)
    comment: "Sum of order amounts"

  - name: average_order_value
    expr: AVG(amount)
    comment: "Mean order amount"

dimensions:
  - name: status
    expr: status
    comment: "Order status"

  - name: order_date
    expr: DATE(created_at)
    comment: "Order date"

  - name: order_month
    expr: DATE_TRUNC('month', created_at)
    comment: "Order month"

Register via SQL:

sql
CREATE METRIC VIEW catalog.schema.orders_metrics
WITH METRICS LANGUAGE YAML AS $$
-- paste YAML here
$$;

Or use the Databricks Catalog Explorer UI:

  1. Navigate to Catalog
  2. Select target schema
  3. Click "Create" > "Metric View"
  4. Paste YAML in editor
  5. Save

Advanced Measure Types

Filtered measures use CASE WHEN for conditional aggregations:

yaml
measures:
  - name: high_value_orders
    expr: COUNT(CASE WHEN amount > 1000 THEN 1 END)
    comment: "Orders exceeding $1000"

  - name: completed_orders
    expr: COUNT(CASE WHEN status = 'completed' THEN 1 END)
    comment: "Successfully completed orders"

Ratio measures reference other measures by name:

yaml
measures:
  - name: completed_count
    expr: COUNT(CASE WHEN status = 'completed' THEN 1 END)

  - name: total_count
    expr: COUNT(*)

  - name: completion_rate
    expr: CAST(completed_count AS DOUBLE) / NULLIF(total_count, 0)
    comment: "Percentage of completed orders"

The query engine resolves measure dependencies correctly regardless of dimension grouping.
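
For example, assuming these ratio measures are added to the orders_metrics view, completion_rate can be grouped by any dimension, and the engine computes the ratio from the aggregated counts within each group rather than averaging per-row ratios:

sql
SELECT
  order_month,
  MEASURE(completion_rate) as completion_rate
FROM catalog.schema.orders_metrics
GROUP BY order_month
ORDER BY order_month;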

Distinct count measures handle cardinality across joins properly:

yaml
measures:
  - name: unique_customers
    expr: COUNT(DISTINCT customer_id)
    comment: "Number of distinct customers"

When querying with dimensions from joined tables, Spark SQL ensures the distinct count evaluates after join completion.
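
For example, assuming unique_customers is defined on the orders_metrics view along with the customer join from the next section, grouping by a joined dimension still yields correct distinct counts even if the join fans out order rows:

sql
SELECT
  customer_region,
  MEASURE(unique_customers) as customers
FROM catalog.schema.orders_metrics
GROUP BY customer_region;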

Star Schema Joins

Metric views support multi-table joins following star schema patterns:

yaml
version: 1.1
source: catalog.schema.orders

joins:
  - name: customers
    source: catalog.schema.customers
    on: source.customer_id = customers.id

  - name: products
    source: catalog.schema.products
    on: source.product_id = products.id

measures:
  - name: total_revenue
    expr: SUM(amount)

dimensions:
  - name: customer_region
    expr: customers.region

  - name: customer_tier
    expr: customers.tier

  - name: product_category
    expr: products.category

  - name: product_brand
    expr: products.brand

Queries mix dimensions from any joined table. Spark SQL generates appropriate joins and handles cardinality automatically.

Snowflake Schema Multi-Hop Joins

Databricks Runtime 17.1+ supports multi-hop joins through normalized dimension tables:

yaml
version: 1.1
source: catalog.schema.orders

joins:
  - name: stores
    source: catalog.schema.stores
    on: source.store_id = stores.id

  - name: regions
    source: catalog.schema.regions
    on: stores.region_id = regions.id

measures:
  - name: total_revenue
    expr: SUM(amount)

dimensions:
  - name: store_name
    expr: stores.name

  - name: region_name
    expr: regions.name

  - name: region_country
    expr: regions.country

The query engine traverses orders -> stores -> regions automatically when you reference region dimensions.
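
A query that touches only region-level attributes triggers both hops. The view name below (store_sales_metrics) is an assumed registration of the YAML above:

sql
SELECT
  region_country,
  region_name,
  MEASURE(total_revenue) as revenue
FROM catalog.schema.store_sales_metrics
GROUP BY region_country, region_name
ORDER BY revenue DESC;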

Semantic Metadata for AI Integration

Add metadata that helps AI assistants like Databricks Genie map natural language to metrics:

yaml
measures:
  - name: total_revenue
    expr: SUM(amount)
    comment: "Total order revenue excluding refunds and discounts"
    synonym: ["sales", "revenue", "income"]
    format: currency
    display_name: "Total Revenue"

dimensions:
  - name: customer_tier
    expr: tier
    comment: "Customer classification: VIP, Standard, or Basic"
    display_name: "Customer Tier"
    synonym: ["customer level", "customer class"]

Genie uses synonyms to match user questions like "What's our sales by customer level?" to the correct metric and dimension.


Querying Metric Views

Basic Query Patterns

Query metric views with standard SQL using the MEASURE() clause:

sql
SELECT
  status,
  MEASURE(order_count) as order_count,
  MEASURE(total_revenue) as total_revenue
FROM catalog.schema.orders_metrics
GROUP BY status;

The MEASURE() clause tells Spark SQL to evaluate the measure definition rather than treating it as a regular column.

Important: Metric views don't support SELECT * queries. Explicitly name dimensions and measures:

sql
-- This works
SELECT
  status,
  order_date,
  MEASURE(total_revenue)
FROM catalog.schema.orders_metrics
GROUP BY status, order_date;

-- This fails
SELECT * FROM catalog.schema.orders_metrics;

Time-Based Analysis

Query at different time grains using pre-defined time dimensions:

sql
-- Daily revenue
SELECT
  order_date,
  MEASURE(total_revenue) as revenue
FROM catalog.schema.orders_metrics
WHERE order_date >= '2025-01-01'
GROUP BY order_date
ORDER BY order_date;

-- Monthly aggregation
SELECT
  order_month,
  MEASURE(total_revenue) as revenue,
  MEASURE(order_count) as orders
FROM catalog.schema.orders_metrics
GROUP BY order_month
ORDER BY order_month;

Define multiple time grain dimensions in the metric view YAML to support different aggregation levels.

Multi-Dimension Grouping

Combine dimensions from fact and joined tables:

sql
SELECT
  customer_region,
  product_category,
  order_month,
  MEASURE(total_revenue) as revenue,
  MEASURE(unique_customers) as customer_count
FROM catalog.schema.orders_metrics
WHERE order_month >= '2025-01-01'
GROUP BY customer_region, product_category, order_month
ORDER BY revenue DESC;

Spark SQL generates optimized joins and aggregations based on the metric view definition.

Filtering Strategies

Apply filters on dimensions before aggregation:

sql
SELECT
  product_category,
  MEASURE(total_revenue) as revenue
FROM catalog.schema.orders_metrics
WHERE customer_region = 'US'
  AND order_date BETWEEN '2025-01-01' AND '2025-03-31'
  AND status = 'completed'
GROUP BY product_category;

Predicate pushdown occurs automatically. Delta Lake's data skipping prunes files based on filter predicates.

Window Functions with Measures

Combine measures with window functions for advanced analytics:

sql
SELECT
  order_month,
  product_category,
  MEASURE(total_revenue) as revenue,
  SUM(MEASURE(total_revenue)) OVER (
    PARTITION BY product_category
    ORDER BY order_month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) as rolling_3month_revenue
FROM catalog.schema.orders_metrics
GROUP BY order_month, product_category
ORDER BY product_category, order_month;

This computes 3-month rolling revenue by category.

Common Table Expressions with Metric Views

Use CTEs to build layered analytics:

sql
WITH monthly_metrics AS (
  SELECT
    order_month,
    customer_region,
    MEASURE(total_revenue) as revenue,
    MEASURE(order_count) as orders
  FROM catalog.schema.orders_metrics
  GROUP BY order_month, customer_region
)
SELECT
  customer_region,
  AVG(revenue) as avg_monthly_revenue,
  SUM(orders) as total_orders
FROM monthly_metrics
GROUP BY customer_region;

CTEs work naturally with metric views for multi-stage calculations.


Production Deployment Patterns

SQL Warehouse Configuration

Choose warehouse sizing based on query patterns and data volume:

Serverless SQL - Auto-scales based on query complexity. Ideal for variable workloads with unpredictable concurrency.

Classic SQL warehouses - Fixed cluster sizes (Small, Medium, Large, X-Large). Better cost predictability for steady workloads.

Enable Delta cache on warehouses querying metric views frequently. Hot data stays in local SSD cache for faster subsequent queries.

Performance Optimization

Z-ordering organizes data files for common filter columns:

sql
OPTIMIZE catalog.schema.orders
ZORDER BY (order_date, customer_region);

Run after bulk loads or periodically for tables with frequent inserts.

Materialized views cache expensive metric view queries at fixed dimensions:

sql
CREATE MATERIALIZED VIEW catalog.schema.monthly_revenue_cache AS
SELECT
  order_month,
  customer_region,
  product_category,
  MEASURE(total_revenue) as revenue,
  MEASURE(order_count) as orders
FROM catalog.schema.orders_metrics
GROUP BY order_month, customer_region, product_category;

Schedule refreshes via Databricks jobs. Query the materialized view for cached results or the metric view for ad-hoc dimensions.
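
A scheduled job can refresh the cache with a single statement; a minimal sketch against the materialized view defined above:

sql
REFRESH MATERIALIZED VIEW catalog.schema.monthly_revenue_cache;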

Query result cache serves identical queries instantly. No configuration needed - Spark SQL caches automatically when underlying data hasn't changed.

Access Control Configuration

Grant permissions at metric view level:

sql
GRANT SELECT ON METRIC VIEW catalog.schema.orders_metrics
TO `analyst_role`;

GRANT SELECT ON METRIC VIEW catalog.schema.revenue_metrics
TO `finance_team`;

Users need SELECT on metric views but not necessarily on source tables. This enables governed self-service where analysts query metrics without raw data access.
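
Note that Unity Catalog also requires USE privileges on the containing catalog and schema before the SELECT grant takes effect. A minimal grant set for the analyst role might look like this (adjust the placeholder catalog and schema names):

sql
GRANT USE CATALOG ON CATALOG catalog TO `analyst_role`;
GRANT USE SCHEMA ON SCHEMA catalog.schema TO `analyst_role`;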

Row-level security applies at the table layer:

sql
CREATE FUNCTION catalog.schema.region_filter(
  user_region STRING,
  row_region STRING
)
RETURNS BOOLEAN
RETURN user_region = row_region;

ALTER TABLE catalog.schema.orders
SET ROW FILTER catalog.schema.region_filter(
  current_user_region(),
  region
);

Row filters propagate through metric view queries automatically.

Column masking protects sensitive attributes:

sql
CREATE FUNCTION catalog.schema.mask_customer_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('admin') THEN email
  ELSE 'REDACTED'
END;

ALTER TABLE catalog.schema.customers
ALTER COLUMN email
SET MASK catalog.schema.mask_customer_email;

Masked columns in joined tables remain masked in metric view results.


BI Tool Integration

Native SQL Connectivity

Most BI tools connect via JDBC/ODBC to Databricks SQL warehouses. Configure connections with:

  • Server hostname from SQL warehouse configuration
  • HTTP path from SQL warehouse details
  • Personal access token or OAuth for authentication

Query metric views using standard SQL in the BI tool's SQL editor:

sql
SELECT
  customer_region,
  product_category,
  MEASURE(total_revenue) as revenue
FROM catalog.schema.orders_metrics
WHERE order_date >= '2025-01-01'
GROUP BY customer_region, product_category;

Tableau Integration

Tableau connects to Databricks as a data source:

  1. Add new data source
  2. Select "Databricks" connector
  3. Enter SQL warehouse connection details
  4. Browse to metric view in catalog
  5. Drag dimensions and measures to canvas

Use custom SQL for complex metric view queries:

sql
SELECT
  DATE_TRUNC('month', order_date) as month,
  customer_region,
  MEASURE(total_revenue) as revenue
FROM catalog.schema.orders_metrics
WHERE order_date >= CURRENT_DATE - INTERVAL 12 MONTHS
GROUP BY month, customer_region

Power BI Integration

Power BI uses the Databricks connector:

  1. Get Data > Databricks
  2. Enter server hostname and HTTP path
  3. Select DirectQuery mode for live queries
  4. Choose metric view from catalog
  5. Build visualizations with DAX or SQL

DirectQuery mode issues SQL to Databricks for each visual refresh. Import mode caches results locally.

Python Notebooks

Query metric views directly from notebooks:

python
# Using Spark SQL
result = spark.sql("""
    SELECT
        customer_region,
        MEASURE(total_revenue) as revenue,
        MEASURE(order_count) as orders
    FROM catalog.schema.orders_metrics
    GROUP BY customer_region
""")

display(result)

# Using DataFrame API
from pyspark.sql import functions as F

df = spark.table("catalog.schema.orders_metrics")
aggregated = df.groupBy("customer_region").agg(
    F.expr("MEASURE(total_revenue)").alias("revenue"),
    F.expr("MEASURE(order_count)").alias("orders")
)

display(aggregated)

Agent Preprocessing Workflows

AI agents querying metric views benefit from structured context prepared offline. Fenic's semantic operators handle the heavy preprocessing outside request paths.

Document Processing Pipeline

Process documents into agent-queryable knowledge bases:

python
documents = session.table("catalog.schema.support_docs")
processed = (
    documents
    .with_column(
        "parsed",
        fc.col("content").cast(fc.MarkdownType)
    )
    .with_column(
        "chunks",
        fc.markdown.extract_header_chunks(
            "parsed",
            header_level=2
        )
    )
    .explode("chunks")
    .with_column(
        "embedding",
        fc.semantic.embed(fc.col("chunks").content)
    )
    .with_column(
        "summary",
        fc.semantic.map(
            "Summarize in 100 words: {{content}}",
            content=fc.col("chunks").content,
            model_alias="extractor"
        )
    )
)

processed.write.save_as_table(
    "catalog.schema.agent_knowledge_base",
    mode="overwrite"
)

Agents query the preprocessed embeddings and summaries alongside metric views for context-aware analytics.

Semantic Search Integration

Combine vector search with metric views:

python
from pydantic import BaseModel
from typing import List

class QueryResult(BaseModel):
    relevant_docs: List[str]
    metric_context: str

# User query: "What was revenue in Q4?"
query_embedding = fc.semantic.embed(fc.lit("Q4 revenue analysis"))

# Find relevant documents
relevant = (
    session.table("catalog.schema.agent_knowledge_base")
    .with_column(
        "similarity",
        fc.embedding.compute_similarity(
            "embedding",
            query_embedding,
            metric="cosine"
        )
    )
    .filter(fc.col("similarity") > 0.7)
    .order_by(fc.col("similarity").desc())
    .limit(5)
)

# Agent queries metric view for actual data
revenue = spark.sql("""
    SELECT
        order_month,
        MEASURE(total_revenue) as revenue
    FROM catalog.schema.orders_metrics
    WHERE order_month >= '2024-10-01'
        AND order_month < '2025-01-01'
    GROUP BY order_month
""")

This pattern separates expensive semantic search (batch preprocessing) from real-time metric queries (fast SQL execution).

For complete agent workflow patterns, see Fenic's AI pipeline architecture guide.


Monitoring and Observability

Query Performance Tracking

Monitor metric view query performance through Unity Catalog audit logs:

sql
SELECT
    request_id,
    user_identity.email as user,
    request_params.sql_statement,
    response.result.duration_ms as duration_ms,
    request_params.warehouse_id,
    request_params.start_time
FROM system.access.audit
WHERE service_name = 'databricksSql'
    AND action_name = 'query'
    AND request_params.sql_statement LIKE '%MEASURE(%'
ORDER BY duration_ms DESC
LIMIT 100;

Identify slow queries for optimization. Analyze execution plans with EXPLAIN:

sql
EXPLAIN EXTENDED
SELECT
  customer_region,
  product_category,
  MEASURE(total_revenue)
FROM catalog.schema.orders_metrics
GROUP BY customer_region, product_category;

Review physical plan for excessive shuffles, full table scans, or inefficient join strategies.

Usage Analytics

Track which metrics teams query most frequently:

sql
WITH metric_usage AS (
  SELECT
    user_identity.email as user,
    request_params.sql_statement,
    COUNT(*) as query_count
  FROM system.access.audit
  WHERE service_name = 'databricksSql'
    AND action_name = 'query'
    AND request_params.sql_statement LIKE '%catalog.schema.orders_metrics%'
    AND request_params.start_time >= CURRENT_DATE - INTERVAL 30 DAYS
  GROUP BY user_identity.email, request_params.sql_statement
)
SELECT
  user,
  query_count,
  sql_statement
FROM metric_usage
ORDER BY query_count DESC
LIMIT 20;

Popular metrics inform which to optimize via materialized views or table tuning.

Data Quality Validation

Validate metric view results against expected ranges:

python
# Query metrics
result = spark.sql("""
    SELECT
        MEASURE(total_revenue) as revenue,
        MEASURE(order_count) as orders,
        MEASURE(unique_customers) as customers
    FROM catalog.schema.orders_metrics
    WHERE order_date = CURRENT_DATE - 1
""").collect()[0]

# Validate ranges
assert result.revenue > 0, "Revenue must be positive"
assert result.orders > 0, "Order count must be positive"
assert result.customers > 0, "Customer count must be positive"
assert result.revenue / result.orders < 10000, "Average order value exceeds threshold"

# Log to monitoring system
print(f"Daily validation passed: {result}")

Integrate validation into Databricks jobs that run post-ETL to catch data quality issues before they reach dashboards.


Version Control and CI/CD

Repository Structure

Organize metric view definitions in version control:

repository/
├── tables/
│   ├── orders.sql
│   ├── customers.sql
│   └── products.sql
├── metric_views/
│   ├── sales/
│   │   ├── orders_metrics.yaml
│   │   └── revenue_metrics.yaml
│   └── marketing/
│       └── campaign_metrics.yaml
└── deployment/
    ├── deploy_metric_views.py
    └── test_metrics.py

Store YAML definitions alongside table DDL for complete infrastructure as code.

Automated Deployment

Deploy metric views via Python scripts:

python
import os
from pathlib import Path

from databricks import sql

def deploy_metric_view(yaml_path, catalog, schema, connection):
    with open(yaml_path) as f:
        yaml_content = f.read()

    view_name = Path(yaml_path).stem

    cursor = connection.cursor()
    cursor.execute(f"""
        CREATE OR REPLACE METRIC VIEW {catalog}.{schema}.{view_name}
        WITH METRICS LANGUAGE YAML AS $${yaml_content}$$
    """)

    print(f"Deployed {view_name}")

# Deploy all metric views
connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_HOST"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN")
)

for yaml_file in Path("metric_views").rglob("*.yaml"):
    deploy_metric_view(yaml_file, "production", "analytics", connection)

Testing Strategy

Validate metric view logic before deployment:

python
def test_metric_view(view_name, expected_results):
    result = spark.sql(f"""
        SELECT
            MEASURE(total_revenue) as revenue,
            MEASURE(order_count) as orders
        FROM {view_name}
        WHERE order_date = '2025-01-01'
    """).collect()[0]

    assert abs(result.revenue - expected_results['revenue']) < 0.01
    assert result.orders == expected_results['orders']

    print(f"Test passed for {view_name}")

# Run tests
test_metric_view(
    "dev.analytics.orders_metrics",
    {'revenue': 150000.00, 'orders': 500}
)

Deploy to dev environment first, run test suite, then promote to production after validation.


Best Practices

Organize by Business Domain

Structure metric views around business domains rather than technical table structure:

catalog.analytics schema:
├── finance_revenue_metrics
├── finance_cost_metrics
├── marketing_campaign_metrics
├── sales_pipeline_metrics
└── product_usage_metrics

Domain organization improves discoverability in Unity Catalog search and aligns with team ownership.

Define Time Dimensions at Multiple Grains

Support different aggregation levels without query-time DATE_TRUNC calls:

yaml
dimensions:
  - name: order_date
    expr: DATE(created_at)

  - name: order_week
    expr: DATE_TRUNC('week', created_at)

  - name: order_month
    expr: DATE_TRUNC('month', created_at)

  - name: order_quarter
    expr: DATE_TRUNC('quarter', created_at)

  - name: order_year
    expr: YEAR(created_at)

Pre-defined grains optimize query execution and simplify user queries.
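
Users then select the grain by name instead of wrapping columns in functions, assuming these dimensions are added to the orders_metrics view:

sql
SELECT
  order_quarter,
  MEASURE(total_revenue) as revenue
FROM catalog.schema.orders_metrics
GROUP BY order_quarter
ORDER BY order_quarter;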

Document Measure Calculations

Add clear comments explaining measure logic:

yaml
measures:
  - name: net_revenue
    expr: SUM(amount - discount - refund)
    comment: |
      Net revenue after subtracting discounts and refunds.
      Excludes canceled orders.
      Used for financial reporting - do not modify without finance approval.

Documentation appears in Unity Catalog UI and helps users understand metric definitions.

Separate Fact and Dimension Concerns

Keep measure definitions simple by handling complex dimension logic at the table layer. Use Fenic preprocessing to create clean dimensions rather than nested CASE WHEN expressions in the metric view.

Validate Join Cardinality

Test that the join logic produces the expected row counts:

sql
-- Check join doesn't cause row explosion
WITH base_count AS (
  SELECT COUNT(*) as cnt
  FROM catalog.schema.orders
),
joined_count AS (
  SELECT MEASURE(order_count) as cnt
  FROM catalog.schema.orders_metrics
)
SELECT
  b.cnt as base_rows,
  j.cnt as joined_rows,
  j.cnt - b.cnt as row_difference
FROM base_count b, joined_count j;

Unexpected row differences indicate join issues causing double-counting in measures.


Conclusion

Databricks Unity Catalog Metric Views provide lakehouse-native metric definitions that execute with Spark SQL's full optimization capabilities. The runtime flexibility of querying metrics across any dimension without pre-aggregation, combined with Unity Catalog's governance features, makes metric views a production-ready semantic layer for lakehouse architectures.

Data preparation tasks benefit from Fenic's semantic operators that extract structure from unstructured sources, classify content for dimensional analysis, and join tables through semantic matching. These preprocessing workflows create clean source tables that simplify metric view definitions and improve query performance.

The workflow progresses from raw data transformation through Fenic pipelines, metric view creation with YAML specifications, SQL querying with the MEASURE() clause, and production deployment with proper access control and monitoring. Organizations consolidate metric logic at the catalog layer where governance integrates naturally and all analytical tools query consistent definitions.

For semantic data transformation patterns, explore Fenic's composable operators guide. For AI pipeline architecture that complements metric views, see the reliable AI pipelines guide. For integration with agent frameworks, review the LangChain with Fenic tutorial.
