Blog

Data orchestration: how it works and what it can't do alone

Clean DAGs and passing dbt tests don't automatically mean clean decisions.

The Hex Team

Data

May 21, 2026

Your Airflow DAGs are green. dbt models are tested. Fivetran syncs are humming along on schedule. By every pipeline metric, things are working. And yet your Slack is still flooded with "quick pull" requests, finance and marketing are quoting different revenue numbers in the same meeting, and half the dashboards in your BI tool haven't been opened in months.

This is the paradox a lot of data teams live inside: the infrastructure works, but the value doesn't land. Understanding why starts with understanding what data orchestration does, and where its responsibilities end.

What data orchestration is and how it works

Data orchestration is the automated coordination of data workflows across systems. It handles scheduling, execution order, dependency management, and monitoring for the tasks that collect, transform, and deliver your data.

The difference from simple automation is scope. Automation runs a single task on its own, while orchestration coordinates multiple automated tasks into a full workflow, with decisions, retries, and alerts baked in. A Salesforce sync runs as part of a wider workflow: after an upstream schema migration completes and before the dbt models that depend on it kick off.

Most orchestration platforms use Directed Acyclic Graphs (DAGs) to define these dependencies. Each node is a task, each edge is a dependency, and the orchestrator makes sure nothing runs out of order. When Salesforce's API hits a rate limit at 2 a.m., the system waits and retries. You sleep through it.

Orchestration typically spans three stages of a data workflow:

Collection. Gathering data from disparate sources (databases, APIs, cloud platforms, SaaS tools) and routing it into centralized storage.
Transformation. Converting raw data into usable formats. Deduplication, standardization, and aggregation turn scattered records into something an analyst can query. In modern ELT architectures, this happens inside the warehouse using tools like dbt, with orchestration coordinating when those transformations run.
Activation. Delivering processed data to the consumption layer: BI platforms, notebooks, data apps, or APIs where people use it. This is the stage that matters most to everyone downstream, and the stage where value most often breaks down.

All three stages run on a single dependency graph, managed end to end by the orchestrator.

How orchestration differs from ETL, data integration, and workflow automation

ETL moves data from A to B. Data integration is the strategy for unifying data across systems. Data orchestration coordinates both, plus quality checks, transformations, and delivery, across a single dependency graph. The confusion between them shapes how you invest in your stack.

ETL

ETL is a specific data movement process: extract a dataset, reshape it, and load it somewhere else. It solves a defined problem in isolation, with no dependency management across other pipelines, no lineage visibility, and no automated quality testing.

Data integration

Data integration is the broader strategy: ETL plus synchronization, replication, and virtualization, often operating in real time.

Workflow automation

Workflow automation tools like Zapier handle point-to-point triggers (an action fires when a condition is met), but they aren't designed for complex dependency graphs or pipeline-level monitoring.

Orchestration manages the complete lifecycle: ETL processes, ELT workflows, data quality checks, ML model training, reverse ETL, and reporting tasks, all within a unified dependency graph. ETL moves data into the warehouse. Orchestration makes sure that data arrives reliably, on schedule, with dependencies resolved, so the analyst downstream isn't debugging pipeline failures when they should be answering business questions.

Why orchestration alone doesn't solve the "last mile" problem

Orchestration solves the supply side of data, but it can't fix problems on the demand side, where people actually use it. You can have every DAG green, every dbt test passing, data landing in Snowflake exactly on schedule, and still have a data team drowning in ad hoc requests, dashboards nobody trusts, and business users bypassing the whole stack for ChatGPT.

This is the "last mile" problem: the gap between data being technically available and data driving decisions. You'll usually recognize the gap by a few consistent signals.

Dashboards nobody trusts. If business users routinely question dashboard numbers or avoid using them entirely, the problem usually sits downstream: metric consistency, staleness, or lack of context. When people stop trusting dashboards, they route around them, which only compounds the problem.

Request backlogs growing despite clean pipelines. This is the clearest tell. Your DAGs are green, your dbt tests pass, data is fresh, and yet the analytics team's ticket queue keeps growing. You can usually trace it to a combination of data quality concerns and unclear metric ownership. Both are downstream problems that orchestration doesn't address.

Business users bypassing governed data for ChatGPT or spreadsheets. When people choose unvetted external tools over your carefully built stack, it signals that the governed path is too slow, too confusing, or too restrictive. According to Hex's State of Data Teams 2026 report, 31% of data leaders cite trust as their top concern with AI — nearly twice any other concern. That trust gap drives workarounds.

Multiple versions of the same metric circulating. If finance, marketing, and product each maintain their own definition of "revenue" or "active user," orchestration can't fix it. The data arrives consistently, but the interpretations diverge downstream. Three competing versions of a metric erode confidence in everything.

Analysts spending most of their time on maintenance. If your team is maintaining dashboards, rebuilding queries, and fielding "can you pull this?" requests instead of doing deep work, the activation gap is real. The orchestration side is working. The analytics layer isn't pulling its weight.

Extending governance into the analytics layer

To carry governance through to the point where someone gets a reliable answer, you'll typically build it out progressively rather than all at once. Orchestration covers the upstream pieces: lineage, scheduling metadata, and error handling. But it doesn't carry definitions and controls into the tools where people ask questions.

Metric definitions in one place

A semantic layer acts as the shared dictionary for your most important metrics, the place where "active users" or "churn rate" carries one reliable meaning. When that definition lives in a centralized layer rather than scattered across individual dashboards, every tool pulling from it uses the same calculation.

You don't need a fully built-out semantic model on day one. You can start by endorsing a handful of core tables and adding warehouse descriptions that give agents and analysts useful context. Workspace rules handle organizational standards. Semantic models come next, and you can author them directly in Hex or sync them from dbt, Snowflake Semantic Views, or Cube. The pattern is to start with what you have and deepen as your team invests further, not to treat the semantic layer as a prerequisite before anything works.

Access controls that apply across tools

Configuring permissions tool-by-tool creates gaps in ways that are easy to miss until something goes wrong. A user without BI access finds the same data through a CSV someone shared in Slack. An AI interface queries tables that the user shouldn't see because the permissions were set on the BI tool but not on the underlying model.

Access policies defined at the semantic layer apply to every query, from the BI tool, from Threads, from an analyst's notebook — regardless of which interface makes the request. And as the number of surfaces querying your data grows, keeping that consistent manually gets harder.

Lineage beyond pipeline execution

Pipeline-level lineage tells you which tasks ran and when, useful for debugging whether data arrived late or a task failed. But when a dashboard number looks wrong, pipeline lineage often isn't enough. If the DAG ran clean and the data landed on time, the problem probably lives one layer up: someone changed the churn calculation last Tuesday, or "active user" now excludes trial accounts, or a dbt model was updated without cascading the change to the BI tool that depends on it.

Semantic lineage traces how business metrics are defined, calculated, and changed over time. It tells you which dashboards are downstream of a given metric definition, so you can assess the blast radius of a change before you make it. Teams with both layers of lineage can isolate problems faster and communicate impact more precisely when something breaks.

Version control for governance changes

The same PR review you'd apply to a schema migration should apply to a metric definition change. dbt makes that discipline consistent: changes to governance policies go through the same workflow as data transformations, version-controlled, automatically tested, and auditable.

The combined effect is governance that travels with the data from the pipeline all the way to the chart. Data quality remains a top concern for data teams and is now the primary barrier cited for AI adoption. That focus can't stop at the warehouse boundary.

What changes when orchestration meets a governed analytics layer

Reduced manual handoffs

Metrics defined once in a semantic layer flow automatically to every downstream tool. The analyst doesn't rebuild the revenue calculation in each dashboard. The analytics engineer doesn't field tickets asking "which version of churn is correct?" One definition, used everywhere. When you pair orchestrated data with governed self-serve analytics, the ad hoc request volume tends to shift over the following months: business users self-serve more, and analysts move from reactive ticket-taking toward more strategic work.

Self-serve that actually works

Self-service analytics fails when business users are handed a login and told to figure it out. It works when the definitions are already baked in, when "revenue" in the self-serve tool matches what the CFO trusts. In Hex, conversational queries through Threads generate SQL that respects the semantic models and access controls the data team has defined. The business user gets an answer in minutes. The data team doesn't get a Slack ping.

This played out at PandaDoc's team, where roughly 80% of analyst time had gone to repetitive requests. After implementing Threads backed by governed semantic models, business users could self-serve answers that previously needed analyst involvement, and analysts spent less time fielding reactive requests and more on work that required their judgment. Self-serve scales when the definitions are governed and easy to reuse; it stalls when they're not.

AI answers grounded in defined context

AI assistants querying raw tables will hallucinate. Those grounded in semantic models, where "revenue" is explicitly defined, table relationships are mapped, and business rules are codified, produce answers you can trust. The semantic layer makes those definitions explicit and enforceable rather than implicit and invisible.

Context Studio gives you visibility into exactly this: which questions your users are actually asking, where the context is thin, and where the underlying definitions need refinement. It closes the feedback loop so the governed context keeps improving as usage grows.

What works consistently is starting small: three to five core metrics, validated against trusted sources before you roll them out, then expanded from there. One error in a widely-trusted metric destroys confidence in everything downstream, which is exactly why a small, verified foundation matters more than a complete one.

Closing the gap

Data orchestration is an essential infrastructure. It keeps your pipelines running, your dependencies resolved, and your data fresh. But it doesn't deliver the thing everyone actually wants: trusted answers, quickly, without a ticket queue in between.

Closing that gap means extending governance from the pipeline all the way to the analytics layer: consistent metric definitions, access controls that work across tools, and AI analytics that queries governed context instead of raw tables. A practical starting point is endorsing your most-queried tables and adding descriptions that give agents and analysts useful context. The semantic layer deepens from there as your team defines the metrics that matter most.

Hex is built for exactly that handoff: connecting the orchestrated data your pipelines produce to the workspace where analysts can explore it, business users can self-serve on it, and AI agents can reason over it with full context. Request a demo or sign up to see how the workflow feels.

Frequently Asked Questions

How do orchestration tools like Airflow and Dagster connect to analytics platforms?

Most modern orchestration platforms can trigger downstream actions when a pipeline run completes. An Airflow DAG or Dagster job, for example, can kick off a data app refresh, set a cache for business users, or send a notification that fresh data is available. The key architectural detail is that the orchestrator handles scheduling and dependency resolution upstream, while the analytics platform handles consumption and governance downstream. Choosing platforms with native warehouse integrations at this boundary reduces the manual glue code your team has to maintain.

Do you need a semantic layer if your team is small?

Even small teams benefit from defining core metrics in one place, though the implementation can stay lightweight. You don't need a fully built-out semantic model on day one. Endorsing a handful of core tables, adding warehouse descriptions, and documenting three to five key metrics (revenue, active users, churn) with explicit calculation logic goes a long way toward consistency. Hex's Modeling Agent lets platform owners author semantic models directly within the analytics workspace, or sync existing definitions from dbt, Snowflake, or Cube, so you can start with what's most critical and expand as your use cases grow.

What's the biggest mistake teams make when implementing self-serve analytics?

The most common failure is treating self-serve as a one-time deployment: configure the tool, run a training session, and tell users to explore. Self-service doesn't work that way. It needs curated data models, consistent metric definitions, and an ongoing feedback loop between the data team and business users. The teams that succeed treat self-serve as a product they actively maintain: they monitor which questions people ask, identify where context gaps exist, and iterate on the definitions and documentation that make accurate self-service possible. The infrastructure question is the easy part. The curation question never stops.

Get "The Data Leader’s Guide to Agentic Analytics" — a practical roadmap for understanding and implementing AI to accelerate your data team.

Download

Request a demo