AI Workflows Need Topological Sort

Every AI workflow is a dependency problem. You have steps that produce outputs, other steps that consume those outputs, and a hard constraint: consumers cannot run before their producers finish. Get the order wrong and you read stale data, call a tool with missing context, or trigger an agent before its inputs are ready.

Directed acyclic graphs (DAGs) are the right model for this. Topological sort turns a DAG into an execution order. Together they form a primitive in applied AI system execution, and understanding them at a first-principles level is important when you design, debug, and scale workflows.

The Dependency Problem in AI Workflows

Take a realistic document processing pipeline:

Fetch raw documents from storage
Chunk and clean the text
Embed each chunk
Store embeddings in a vector index
Run a retrieval query against the index
Pass retrieved chunks into a prompt template
Call the LLM
Parse and validate the output

Each step depends on the one before it. If you run step 3 before step 2 finishes, you embed dirty text. If you run step 5 before step 4, you query an incomplete index. The dependencies are not optional constraints - they are correctness constraints.

In a simple linear pipeline you can just run steps 1 through 8 in order. But real workflows branch. Some steps are independent of each other. Some steps fan out into parallel subtasks and fan back in. A linear list breaks down fast.

This is where a DAG becomes the right abstraction.

Modeling Workflows as DAGs

A DAG represents a workflow as a set of nodes (tasks) and directed edges (dependencies). An edge from A to B means “A must complete before B starts.” The acyclicity constraint means there are no circular dependencies - a task cannot transitively depend on itself.

Here is a branching RAG pipeline modeled as a DAG:

Workflow-RAG

keyword_filter and retrieve are independent of each other once store_index finishes. They can run in parallel. merge_context cannot start until both finish. A linear list cannot express this. A DAG can.

Topological Sort

A topological ordering of a DAG is a sequence of all nodes such that for every edge $u \to v$ , node $u$ appears before $v$ . Producers always precede consumers. Kahn’s algorithm computes this in $O(V + E)$ time.

Applied to the pipeline above, this produces an order where fetch_documents runs first, call_llm runs last, and keyword_filter and retrieve appear before merge_context but without any ordering constraint between each other. That freedom is what enables parallelism.

Topological Sort Beyond Ordering

Parallelism for Free

Nodes at the same “level” of the topological order have no dependency between them. They can run concurrently. A smarter scheduler groups nodes by their earliest possible start time:

def execution_levels(graph: dict[str, list[str]]) -> list[list[str]]:
    in_degree = defaultdict(int)
    for node in graph:
        for dep in graph[node]:
            in_degree[dep] += 1

    levels = []
    ready = [n for n in graph if in_degree[n] == 0]

    while ready:
        levels.append(ready)
        next_ready = []
        for node in ready:
            for dep in graph[node]:
                in_degree[dep] -= 1
                if in_degree[dep] == 0:
                    next_ready.append(dep)
        ready = next_ready

    return levels

Each list in levels is a batch of tasks that can execute in parallel. Tools like Prefect and Airflow compute exactly this to maximize executor throughput.

Cycle Detection Before Execution

If a user or a config declares a circular dependency, len(order) != len(graph) catches it before a single task runs. This is not a nice-to-have. A cycle means a deadlock: A waits for B, B waits for A, nothing makes progress. Detecting it at definition time rather than runtime is the difference between a clear error message and a hung pipeline at 2am.

Multi-Agent Systems Are DAGs

Multi-agent orchestration frameworks like LangGraph model agent interactions as DAGs for the same reason: to enforce correct execution order across agents that produce and consume each other’s outputs.

Consider a research workflow with four agents:

Workflow - Multi-agent

The orchestrator cannot dispatch writer_agent until critic_agent finishes, and cannot dispatch critic_agent until summarizer_agent finishes. Topological sort produces this order automatically from the dependency declarations. The orchestrator does not need to hardcode sequencing logic.

Now add a parallel branch:

Workflow - Pipeline

search_agent and retrieval_agent are independent. They can run simultaneously. summarizer_agent waits on both. The topological sort respects this: both search agents appear in the same execution level, and summarizer_agent appears in the next level only after both have zero remaining dependencies.

This scales. Add ten agents, add conditional branches, add fan-outs - the DAG model and topological sort handle the complexity. Hardcoded sequential dispatch does not.

Incremental Re-execution

When an upstream task fails or an input changes, you do not need to re-run the entire pipeline. Traverse the graph forward from the affected node and recompute only the nodes in its subgraph. Every node outside that subgraph has valid cached output.

This is standard in build systems (Bazel, Buck) and is increasingly common in AI pipeline frameworks. The DAG structure is what makes it tractable. Without explicit dependency edges you cannot know which downstream nodes are affected.

What to Watch Out For

Fan-in bottlenecks are real. If ten parallel tasks all feed into one merge node, the merge node cannot start until the slowest of the ten finishes. Topological sort tells you the order correctly but does not automatically balance work. Profile the critical path - the longest chain of dependent tasks - to find where parallelism gains are actually limited.

Cycles in configuration are a user error, not a framework bug. Build validation that catches them at workflow definition time, surfaces a clear error with the offending cycle identified, and rejects the workflow before any execution begins.

The Mental Model to Keep

A DAG is a contract. It says: here are the tasks, here are their dependencies, and here is the guarantee that there are no circular waits. Topological sort is the mechanism that converts that contract into an actionable execution schedule.

When you design an AI workflow, draw the dependency graph first. Identify which tasks are truly sequential and which are independent. That graph is your specification. The topological ordering is your scheduler’s input. Everything else - parallelism, cycle safety, incremental recomputation - falls out of the structure you have already declared.

Footnote: DAGs model AI workflows and multi-agent systems as dependency graphs where edges encode execution order constraints. Topological sort converts this graph into a valid task schedule in $O(V + E)$ time, detects circular dependencies before execution begins, and reveals which tasks can run in parallel. For multi-agent orchestration, this means agents are dispatched only when their inputs are ready, with no hardcoded sequencing logic required.