What Is data.world and What Are Databricks Metric Views?
Semantic layers create a governed API for business metrics, defining calculations once and exposing them consistently across all analytics consumers. Two distinct approaches have emerged: catalog-based semantic layers like data.world that sit on top of warehouses, and lakehouse-native solutions like Databricks Unity Catalog Metric Views that embed semantics directly into the data platform.
data.world's approach: data.world operates as a collaborative data catalog with semantic modeling capabilities. The platform ingests metadata from various data warehouses, creating a unified catalog where teams document datasets, define SQL-based metrics as saved queries, and share knowledge across the organization. Users query metrics through data.world's JDBC connection, which translates requests to the underlying warehouse.
The value proposition centers on breaking down data silos through cataloging and collaboration. Teams catalog datasets from Snowflake, Redshift, or BigQuery into data.world, add business context through documentation, and create "semantic models" as collections of saved SQL queries that encapsulate business logic.
Databricks Unity Catalog Metric Views: Unity Catalog Metric Views take a fundamentally different architectural approach. Metrics become first-class catalog objects that live alongside tables and views within the Databricks lakehouse. Instead of routing through an external catalog layer, metrics execute natively on Spark SQL and Photon compute with full access to Delta Lake optimization features.
The system uses YAML or SQL DDL to define metrics with measures (aggregations), dimensions (slicing attributes), and join relationships. When analysts query a metric view using the MEASURE() clause, Unity Catalog compiles this into optimized Spark SQL that leverages the full power of distributed compute, columnar storage, and adaptive query execution.
The core difference: data.world abstracts metrics through saved queries in an external catalog, while Unity Catalog embeds metrics as native database objects that integrate deeply with lakehouse compute, governance, and ML workflows.
How Teams Use data.world for Semantic Modeling Today
Organizations adopted data.world primarily as a data discovery and documentation platform that evolved to include lightweight semantic modeling capabilities.
The Catalog-First Workflow
Data teams connect data.world to their existing warehouses (Snowflake, Redshift, BigQuery) and initiate metadata ingestion. The platform imports table schemas, column names, and relationships, creating a searchable inventory of data assets. Business analysts add human-readable descriptions, tag datasets by domain, and link terms to a business glossary.
This catalog becomes the "source of truth" for what data exists and what it means. A marketing analyst searching for "customer acquisition cost" finds the relevant tables through data.world's search interface, reads the documentation other analysts added, and understands which fields to use.
SQL Queries as Semantic Models
Metric definitions in data.world take the form of saved SQL queries. An analyst creates a new query that joins tables, applies filters, and calculates aggregations:
```sql
SELECT
    c.region,
    p.category,
    DATE_TRUNC('month', o.order_date) AS month,
    SUM(o.amount * (1 - o.discount_rate)) AS net_revenue,
    COUNT(DISTINCT o.order_id) AS order_count
FROM warehouse.orders o
JOIN warehouse.customers c ON o.customer_id = c.customer_id
JOIN warehouse.products p ON o.product_id = p.product_id
WHERE o.status = 'completed'
GROUP BY c.region, p.category, DATE_TRUNC('month', o.order_date)
```
This query gets saved in a data.world project with a name like "Monthly Revenue by Region and Category," along with documentation explaining the business logic. Other team members can now reference this query instead of writing their own joins and filters.
BI Tool Consumption Pattern
Tableau and Power BI users connect to data.world via JDBC, treating saved queries as virtual tables. Instead of connecting directly to the warehouse and navigating dozens of raw tables, analysts see a curated list of "semantic models" (saved queries) with friendly names.
When a dashboard queries the "Monthly Revenue by Region" model, data.world forwards the SQL to the underlying warehouse, retrieves results, and passes them back to the BI tool. The catalog acts as a pass-through layer that adds metadata context but doesn't fundamentally change how queries execute.
Collaboration and Version Control
Multiple analysts contribute to the same data.world project, iterating on query definitions. When business logic changes—say, the finance team decides to exclude returns from revenue calculations—someone updates the saved query. In theory, all downstream dashboards automatically reflect this change since they reference the centralized definition.
Teams use data.world's commenting and discussion features to debate metric definitions, ask questions about edge cases, and document decisions. The platform provides a social layer around data assets that traditional warehouses lack.
The Architectural Problems with Catalog-Based Semantic Layers
While data.world solved early data discovery challenges, its catalog-over-warehouse architecture creates fundamental limitations that become apparent at scale.
Metadata Synchronization Lag
data.world's catalog depends on periodic metadata syncs from source warehouses. This batch synchronization creates inherent staleness:
Tables get renamed, columns change data types, or entire schemas evolve in the warehouse, but data.world doesn't reflect these changes until the next sync runs. Analysts work with outdated schema information, leading to failed queries when they reference columns that no longer exist or have different names.
Saved queries break silently when schemas drift. The revenue query that worked yesterday fails today because someone renamed discount_rate to discount_pct in the warehouse. data.world has no way to validate saved queries against live schemas, so these breaks only surface when someone tries to run the query.
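Catching this class of breakage requires comparing what a saved query references against the live schema, which data.world cannot do. A minimal sketch of such a drift check, assuming a naive regex-based column extraction (a real validator would parse the SQL against the warehouse catalog; the query and schema below are illustrative):

```python
import re

def find_missing_columns(query_sql: str, live_columns: set) -> set:
    """Return qualified column names referenced in the query that no
    longer exist in the live schema. Naive: grabs identifiers that
    appear after an alias like 'o.' or 'c.'."""
    referenced = set(re.findall(r"\b\w+\.(\w+)", query_sql))
    return {col for col in referenced if col not in live_columns}

query = "SELECT o.amount, o.discount_rate FROM orders o"
live = {"amount", "discount_pct", "order_id"}  # discount_rate was renamed

print(find_missing_columns(query, live))  # → {'discount_rate'}
```

The saved query would fail at runtime; a check like this surfaces the break at schema-change time instead.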
Manual documentation maintenance becomes a full-time burden. Unlike systems that extract metadata from code comments or git commits, data.world requires humans to update descriptions whenever logic changes. In practice, documentation quickly falls out of sync with reality.
Query-Based Semantics Lack Type Intelligence
Treating metrics as saved SQL queries provides reusability but misses the semantic richness modern analytics demands:
No metric type awareness: The system can't distinguish between simple aggregations (SUM of revenue), ratios (revenue per customer), or derived metrics (profit margin calculated from other metrics). Every metric is just SQL text, so the platform can't ensure correct aggregation behavior when users slice by different dimensions.
Complex time-based calculations require verbose SQL: Metrics like "30-day retention rate" or "conversion within 7 days of signup" need complex window functions and date logic. Without native support for cohort-based or time-windowed metric types, analysts write error-prone SQL that's hard to maintain and easy to implement incorrectly.
Dimension grain handling is entirely manual: Want to see revenue by day, week, month, quarter, and year? Create five separate saved queries with nearly identical logic except for the DATE_TRUNC grain. Each query duplicates joins, filters, and business logic, creating maintenance burden and opportunities for subtle inconsistencies.
Aggregation semantics aren't enforced: A saved query computes average order value as SUM(revenue) / COUNT(orders). This works fine when querying at the customer level. But if someone adds a product category dimension to their dashboard, the ratio calculation breaks—they're now computing the average of averages across categories, not the true average order value. The system has no way to know this is wrong.
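The pitfall is easy to reproduce with toy data (the order values below are illustrative):

```python
# (category, order revenue) — three small book orders, one large electronics order
orders = [
    ("books", 10), ("books", 10), ("books", 10),
    ("electronics", 100),
]

# Correct: recompute the ratio from totals at the final grain
true_aov = sum(r for _, r in orders) / len(orders)  # 130 / 4 = 32.5

# Wrong: average the per-category averages
by_cat = {}
for cat, r in orders:
    by_cat.setdefault(cat, []).append(r)
cat_avgs = [sum(v) / len(v) for v in by_cat.values()]  # [10.0, 100.0]
avg_of_avgs = sum(cat_avgs) / len(cat_avgs)            # 55.0

print(true_aov, avg_of_avgs)  # → 32.5 55.0
```

A dashboard silently reporting 55.0 instead of 32.5 is exactly the failure mode a type-aware semantic layer exists to prevent.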
Governance Gaps in the Catalog Layer
As organizations mature their analytics practices, data.world's governance model reveals limitations:
Access control operates at the warehouse level, not the metric level: If an analyst needs access to revenue metrics, they need SELECT permissions on the underlying orders table in the warehouse. You can't grant someone access to "aggregate revenue by region" without also giving them access to see individual order amounts. This all-or-nothing approach prevents truly governed self-service.
No certification or trust signals: All saved queries live in the same namespace with no native way to mark some as "production certified" and others as "experimental draft." Analysts see a flat list of queries without knowing which ones finance has officially blessed as the "true" revenue metric versus someone's test query from six months ago.
Audit logging lives in the warehouse, not the catalog: You can see which SQL queries executed against Snowflake, but not which data.world semantic models users actually consumed. This makes usage analytics incomplete—you know raw SQL ran, but not which business metrics were requested or how users arrived at those queries through the catalog interface.
Lineage tracking is limited: data.world can show that a saved query reads from certain tables, but it can't trace how that metric flows into downstream ML models, operational systems, or external APIs. The lineage stops at the catalog boundary.
The Lakehouse Integration Problem
Organizations running Databricks face specific architectural friction with catalog-based semantic layers:
Metrics live in a separate system from compute: Your semantic definitions exist in data.world while query execution happens in Databricks. This separation requires constant synchronization of schemas, permissions, and query patterns. When someone adds a column to a Delta table, that change must propagate to data.world before analysts can use it in metrics.
No native notebook integration: Data scientists working in Databricks notebooks query raw Delta tables directly, completely bypassing the semantic layer. The "customer lifetime value" metric carefully defined in data.world gets reimplemented (differently) in Python for ML feature engineering. BI dashboards and ML models end up with divergent metric definitions despite leadership's intent to maintain consistency.
ML workflows can't reference governed metrics: Training a churn prediction model requires features like "total purchase amount last 90 days" and "days since last order." These are metrics that exist in data.world as saved queries, but there's no way to call them from a notebook. Data scientists write their own feature logic, and now the same metric exists in two places with inevitable drift over time.
Streaming analytics aren't supported: Real-time dashboards built on Delta Live Tables can't leverage data.world metrics, which assume batch query patterns. Teams building streaming pipelines end up duplicating metric logic in Spark Structured Streaming code.
Performance optimization happens at the wrong layer: data.world can't optimize query execution because it just forwards SQL to the warehouse. Databricks-specific optimizations like liquid clustering, Delta cache, or adaptive query execution don't integrate with the semantic layer. You're paying for advanced lakehouse features but can't leverage them through the metrics interface.
Why Databricks Unity Catalog Metric Views Solve These Problems
Unity Catalog Metric Views address the fundamental architectural issues through lakehouse-native integration.
Native Catalog Objects, Not External Metadata
Metric views are first-class Unity Catalog objects that live alongside tables and views in your lakehouse. There's no synchronization lag because the catalog and the data share the same metadata store.
When someone modifies a Delta table schema, Unity Catalog immediately knows about it. Metric views that reference that table can be automatically validated. If a column rename would break a metric definition, the system prevents the rename or flags the conflict before it causes downstream failures.
Schema evolution becomes manageable. Add a new column to the orders table, and you can immediately reference it in metric view dimensions without waiting for external systems to sync. Drop a column, and Unity Catalog identifies which metric views depend on it, preventing silent breakage.
Documentation lives with the metrics. Descriptions, ownership information, and business glossary terms are native properties of metric view objects:
```sql
CREATE METRIC VIEW finance.revenue_metrics
COMMENT 'Official revenue metrics certified by finance team for reporting and forecasting'
AS ...
```
These descriptions appear everywhere the metric appears—in SQL autocomplete, UI explorers, API responses, and even in Databricks Assistant when answering natural language queries.
Type-Aware Metric Semantics
Unity Catalog understands measures as aggregation expressions, not just saved SQL text:
```yaml
measures:
  - name: total_revenue
    expr: SUM(order_amount)
    description: "Sum of all completed order amounts"
  - name: order_count
    expr: COUNT(DISTINCT order_id)
  - name: average_order_value
    expr: SUM(order_amount) / COUNT(DISTINCT order_id)
    description: "Revenue per order, computed correctly at any grain"
```
The platform knows total_revenue is a simple aggregation, while average_order_value is a ratio that requires special handling. When someone queries revenue by customer region, then drills down by product category, the system recomputes the ratio correctly at each grain—it doesn't average the regional averages.
Derived measures reference other measures:
```yaml
measures:
  - name: gross_profit
    expr: total_revenue - total_cost
  - name: profit_margin
    expr: gross_profit / total_revenue
```
The platform ensures these compute in the correct order. profit_margin can't execute until both gross_profit and total_revenue have been calculated, and the system handles this dependency automatically.
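Resolving that dependency order is a classic topological-sort problem. A sketch of how any such system could order measure evaluation, using the standard library's `graphlib` (illustrative, not Databricks internals):

```python
from graphlib import TopologicalSorter

# measure -> measures it depends on (definitions mirror the YAML above)
deps = {
    "total_revenue": [],
    "total_cost": [],
    "gross_profit": ["total_revenue", "total_cost"],
    "profit_margin": ["gross_profit", "total_revenue"],
}

# static_order() yields each measure only after all its dependencies
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Base measures always come out before the derived measures that reference them, so `profit_margin` is guaranteed to be last.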
Dimensions can be computed expressions:
```yaml
dimensions:
  - name: order_date
    expr: DATE(order_timestamp)
  - name: order_month
    expr: DATE_TRUNC('month', order_timestamp)
  - name: customer_tier
    expr: CASE WHEN lifetime_value > 10000 THEN 'VIP' WHEN lifetime_value > 1000 THEN 'Standard' ELSE 'Basic' END
```
Instead of maintaining separate queries for daily versus monthly aggregations, define multiple time grain dimensions once. Analysts choose which grain to query without needing separate "metric definitions" for each time period.
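The same idea in miniature: one dataset, one aggregation definition, with the grain selected at query time rather than baked into five near-duplicate queries (dates, amounts, and helper names below are illustrative):

```python
from collections import defaultdict
from datetime import date

orders = [(date(2025, 1, 3), 100), (date(2025, 1, 20), 50), (date(2025, 2, 5), 75)]

def truncate(grain: str, d: date) -> date:
    """Snap a date to the start of the requested grain."""
    if grain == "day":
        return d
    if grain == "month":
        return d.replace(day=1)
    if grain == "year":
        return d.replace(month=1, day=1)
    raise ValueError(f"unknown grain: {grain}")

def revenue_by(grain: str) -> dict:
    totals = defaultdict(int)
    for d, amount in orders:
        totals[truncate(grain, d)] += amount  # single aggregation definition
    return dict(totals)

print(revenue_by("month"))  # → {date(2025, 1, 1): 150, date(2025, 2, 1): 75}
```

Changing the revenue logic changes it for every grain at once, which is the maintenance property the saved-query approach lacks.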
True Metric-Level Governance
Unity Catalog's RBAC operates at the metric view level, enabling governed self-service:
```sql
GRANT SELECT ON METRIC VIEW finance.revenue_metrics TO `analyst_role`;
```
Analysts in analyst_role can now query revenue metrics without needing access to the underlying orders table. They see aggregated data only—no individual order amounts, customer IDs, or other sensitive fields. The base tables can have row-level security or column masking that automatically applies through metric views.
Certification flags distinguish trusted metrics:
```sql
ALTER METRIC VIEW finance.revenue_metrics
SET TBLPROPERTIES ('certified' = 'true', 'owner' = 'finance_lead@company.com');
```
Users browsing the catalog see trust badges on certified metrics. Databricks Assistant prioritizes certified metrics when generating queries from natural language. Governance teams can enforce policies like "production dashboards must only use certified metrics."
Domain tagging enables organized discovery:
```sql
ALTER METRIC VIEW finance.revenue_metrics
SET TAGS ('domain' = 'finance', 'pii' = 'false', 'refresh' = 'daily');
```
The three-level namespace (catalog.schema.object) combined with tags creates intuitive organization. Finance metrics live under metrics_catalog.finance.*, marketing metrics under metrics_catalog.marketing.*. Users search by domain, filter by tags, and quickly find relevant metrics without wading through hundreds of unorganized objects.
Lineage visualization spans the full pipeline:
Unity Catalog UI shows end-to-end lineage from raw data ingestion through transformation to metric views to consuming dashboards and ML models. Click on the revenue_metrics object and see:
- Upstream: Which Delta tables feed it
- Transformations: What join logic and aggregations are applied
- Downstream: Which Tableau dashboards, notebooks, and ML pipelines consume it
- Schema evolution: How the metric definition has changed over time
This complete lineage view enables impact analysis. Before modifying the orders table schema, see exactly which metrics and downstream consumers would be affected.
Lakehouse-Native Execution Architecture
Metric view queries execute on Databricks compute with full access to performance optimizations:
Spark SQL compilation: When you query MEASURE(total_revenue), Unity Catalog translates this into optimized Spark SQL:
```sql
-- User writes:
SELECT customer_region, MEASURE(total_revenue)
FROM finance.revenue_metrics
WHERE order_month >= '2025-01-01'
GROUP BY customer_region;

-- Unity Catalog generates:
SELECT
    c.region AS customer_region,
    SUM(o.order_amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE DATE_TRUNC('month', o.order_timestamp) >= '2025-01-01'
GROUP BY c.region;
```
The generated SQL goes through Catalyst optimizer, which applies:
- Predicate pushdown (filters apply before joins)
- Projection pruning (only read necessary columns)
- Join reordering based on table statistics
- Broadcast joins for small dimension tables
Photon acceleration: Metric queries benefit from Photon's vectorized execution engine. Aggregations like SUM, COUNT, and AVG process data in columnar batches using SIMD instructions, delivering 2-3× faster performance than standard Spark for analytical workloads.
Delta Lake optimizations automatically apply:
- Liquid clustering on base tables reduces data scanning
- Delta cache keeps frequently accessed data in local SSD
- Z-ordering on commonly filtered columns speeds up selective queries
- File pruning skips irrelevant data files based on metadata
These optimizations work transparently through metric views. When you cluster the orders table by customer_id, queries for revenue by customer automatically benefit without changing the metric view definition.
Adaptive Query Execution (AQE): Spark adjusts query plans during execution based on actual data characteristics. If a dimension table is smaller than expected, AQE switches from shuffle join to broadcast join mid-query. If aggregation skew is detected, AQE redistributes work across executors.
None of this is possible with catalog-based semantic layers that forward SQL to external warehouses. Unity Catalog leverages the full power of Databricks compute because metrics are native lakehouse objects.
Unified ML and BI Metric Definitions
The most significant architectural advantage: data scientists and business analysts use identical metric definitions.
Python notebooks query metric views directly:
```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.getOrCreate()

# Feature engineering using certified business metrics
features_df = spark.sql("""
    SELECT customer_id,
           MEASURE(total_lifetime_revenue) AS ltv,
           MEASURE(avg_days_between_orders) AS purchase_frequency,
           MEASURE(days_since_last_order) AS recency
    FROM marketing.customer_metrics
    WHERE signup_date < '2025-01-01'
    GROUP BY customer_id
""")

# Train churn model
gbt = GBTClassifier(labelCol='churned_within_90_days')
model = gbt.fit(features_df)
```
Now when the marketing team updates how "lifetime revenue" gets calculated (e.g., excluding refunds), both the executive dashboard and the churn prediction model automatically use the new logic. Metric consistency across BI and ML becomes architecturally guaranteed rather than a coordination challenge.
MLflow experiments link to metric definitions: When logging model training runs, reference the metric view that provided features:
```python
import mlflow

with mlflow.start_run():
    mlflow.log_param('feature_source', 'marketing.customer_metrics')
    mlflow.log_param('metrics_version', '2025-01-15')
    mlflow.spark.log_model(model, 'churn_model')
```
Now model lineage traces back to the specific metric definitions used. If metrics change, you can identify which models need retraining.
Streaming pipelines use the same metrics: Delta Live Tables can reference metric view logic for real-time aggregations:
```python
import dlt
from pyspark.sql.functions import window, sum as sum_

@dlt.table
def real_time_revenue():
    return (
        spark.readStream
        .table("orders_stream")
        .join(spark.table("customers"), "customer_id")
        .groupBy("customer_region", window("order_timestamp", "1 hour"))
        .agg(sum_("order_amount").alias("hourly_revenue"))
    )
```
The aggregation logic matches the batch revenue_metrics definition, ensuring real-time dashboards show numbers consistent with overnight reports.
The Future of Lakehouse Semantic Layers
Several architectural trends will shape how organizations build and consume metrics over the next few years.
Agent-Native Interfaces Replace SQL
Natural language will become the primary way users interact with metrics. Instead of writing MEASURE() clauses, analysts ask "show me Q4 revenue trends by product category" and AI agents generate appropriate queries.
Databricks Assistant already demonstrates this pattern. It reads metric view descriptions and synonyms, maps natural language to measures and dimensions, and generates correct SQL. As language models improve, the abstraction layer between user intent and metric query will become increasingly transparent.
This shift makes semantic layer architecture even more critical. Agents need deterministic metric definitions to avoid hallucination. A metric view that defines "monthly recurring revenue" with precise business logic gives the agent something concrete to reference. Without semantic layers, agents write arbitrary SQL that might be confidently wrong.
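The determinism argument can be made concrete: an agent that resolves user phrasing against a closed vocabulary of governed measures either finds an exact definition or declines and asks for clarification, instead of improvising SQL. A minimal sketch, where the measure and synonym tables are invented for illustration:

```python
# Closed vocabulary of governed measures (illustrative names)
MEASURES = {
    "monthly recurring revenue": "MEASURE(mrr)",
    "net revenue": "MEASURE(net_revenue)",
}
SYNONYMS = {
    "mrr": "monthly recurring revenue",
    "recurring revenue": "monthly recurring revenue",
}

def resolve(phrase: str):
    """Map a user's phrase to a governed measure, or None if unknown."""
    key = phrase.strip().lower()
    key = SYNONYMS.get(key, key)
    # None signals the agent to ask for clarification rather than guess
    return MEASURES.get(key)

print(resolve("MRR"))           # → MEASURE(mrr)
print(resolve("gross margin"))  # → None
```

The point is the failure mode: an unrecognized phrase produces a refusal, never a confidently wrong ad-hoc calculation.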
Cross-Platform Federation Enables Multi-Lakehouse Metrics
Databricks recently launched federation capabilities that allow Unity Catalog to query external data sources like Snowflake, BigQuery, and PostgreSQL. This enables metric views that span multiple platforms while maintaining lakehouse performance characteristics.
A future metric view might join Databricks Delta tables with Snowflake dimension tables, computing aggregations across both:
```yaml
metric_view:
  name: cross_platform_revenue
  source_table: databricks.sales.orders
  joins:
    - table: snowflake.warehouse.customers
      on: orders.customer_id = customers.customer_id
```
Query pushdown optimization ensures filters and projections execute close to the data, minimizing data movement. The semantic layer abstracts the complexity of multi-cloud data access while Unity Catalog handles federation mechanics.
Real-Time Metrics Become Standard
Batch-computed metrics refreshed overnight are giving way to streaming architectures where metrics update continuously as events arrive. Delta Live Tables' streaming mode combined with metric views enables real-time dashboards that stay current.
Imagine a metric view over a streaming event table:
```yaml
metric_view:
  name: live_website_metrics
  source_table: streaming.clickstream_events
  streaming: true
  measures:
    - name: active_sessions
      expr: COUNT(DISTINCT session_id)
    - name: page_views
      expr: COUNT(*)
```
Dashboards query this metric view and see session counts and page views with sub-second freshness. The semantic layer abstracts whether data is streaming or batch—consumers use the same query patterns for both.
Automated Optimization Through Usage Patterns
Future versions of Unity Catalog will analyze metric query patterns and automatically apply optimizations. If revenue_metrics gets queried by customer region thousands of times per day, the system might suggest:
- Liquid clustering the base table by customer_id
- Creating a materialized view for the region-level aggregation
- Pre-computing and caching the most common dimension combinations
These recommendations would appear in the Unity Catalog UI with one-click implementation. The semantic layer becomes self-tuning based on actual workload characteristics.
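A toy version of that workload analysis: count which dimensions metric queries group or filter by, and surface clustering candidates above a frequency threshold (the query-log format and threshold are invented for illustration):

```python
from collections import Counter

# Hypothetical query log: dimensions each metric query touched
query_log = [
    ["customer_region"],
    ["customer_region", "order_month"],
    ["customer_region"],
    ["product_category"],
    ["customer_region"],
]

def clustering_candidates(log, min_share=0.5):
    """Suggest dimensions that appear in at least min_share of queries."""
    counts = Counter(dim for dims in log for dim in dims)
    return [dim for dim, n in counts.items() if n / len(log) >= min_share]

print(clustering_candidates(query_log))  # → ['customer_region']
```

Real systems would weigh scan volume and data skew, not just query counts, but the shape of the recommendation loop is the same.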
Integration with External BI Tools Deepens
While Unity Catalog Metric Views currently require tools to use custom SQL with MEASURE() clauses, announced partnerships with Tableau, Power BI, and Sigma will enable native "drag and drop" experiences.
Connect Tableau to Unity Catalog, and it discovers available metric views. Measures appear as draggable fields in the data pane alongside dimensions. Tableau generates the appropriate MEASURE() queries transparently, giving analysts the same experience as connecting to an OLAP cube.
This native integration pattern will expand across the BI ecosystem as Open Semantic Interchange (OSI) standards mature. The goal: define metrics once in Unity Catalog, consume from any OSI-compatible tool without custom SQL.
Building Reliable Data Foundations for Metric Accuracy
Semantic layers depend on high-quality data pipelines that ensure metrics compute correctly. Teams migrating from catalog-based tools often underestimate the pipeline complexity required for lakehouse architectures.
Typedef helps organizations build production-grade data processing that maintains metric accuracy through semantic validation and type-safe transformations. When your Unity Catalog Metric Views depend on clean Delta tables, having reliable data pipelines becomes critical for trust in your metrics.
Modern AI-native data processing requires more than traditional ETL. Typedef's approach ensures data flowing into your metric views maintains quality through composable semantic operators and built-in reliability patterns for distributed systems.
Lakehouse-native semantic layers represent a fundamental architectural shift from catalog-based tools. Unity Catalog Metric Views eliminate external dependencies while unlocking capabilities impossible when metrics live outside your data platform. For organizations committed to Databricks, native metric views provide the governed, performant, and unified analytics foundation modern teams require.
