Architectural decisions, trade-offs, and what 16 years of enterprise systems taught me about production AI.
Every AI framework promises you “agents in five lines of code.” And they deliver — for demos. But the moment you need agents that talk to each other, fall back gracefully when a model rate-limits you at 2 AM, remember what you told them last Tuesday, and stream responses in real time to a Slack channel — you realize those five lines were the easy part.
That’s why I built Sutra, an AI orchestration platform, from scratch. Not because existing tools are bad, but because the hard problems in production AI aren’t about calling an LLM. They’re about everything around it.
After 16 years of building enterprise systems, here’s what I’ve learned: the gap between a working demo and a production AI orchestration platform is never about the core technology. It’s about routing, resilience, state, and all the plumbing nobody talks about at conferences. Sutra is what happens when you take that lesson seriously.
Here’s what the build taught me.
Why an AI orchestration platform is not just an inference wrapper
Most AI frameworks treat the LLM call as the center of the universe. In production, the LLM call is maybe 30% of the work. The other 70% is:
Context assembly: What memories, project context, and conversation history should this agent see right now?
Model routing: Which provider and model should handle this specific request, given cost, latency, and rate limits?
Error resilience: What happens when Claude is rate-limited and GPT-4 is overloaded simultaneously?
State management: How do you keep agents running across server restarts?
Here’s how I think about it: if you’ve ever designed a microservices platform — with service discovery, circuit breakers, and load balancing — you already know most of the patterns that matter in AI orchestration. The LLM is just another service in the mesh.
Sutra’s orchestrator handles all of this in a single routing layer. Every message flows through it:
```
User message
  → token guard (does this fit the context window?)
  → memory injection (core memories + semantic recall + project context)
  → model acquisition (smart routing picks the optimal model)
  → agent execution (LangGraph ReAct loop with tools)
  → usage tracking + execution trace
  → stream back to client
```
This pipeline runs identically whether the message comes from the web UI, Telegram, Slack, or WhatsApp. The orchestrator doesn’t care about the transport — it cares about routing the right message to the right agent with the right context. That separation is a design choice I’ve applied to every distributed system I’ve built. It pays off every time.
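The shape of that transport-agnostic pipeline can be sketched in a few lines. This is a minimal illustration, not Sutra's actual code: the `Context` dataclass and stage names are hypothetical stand-ins for the real orchestrator internals.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Everything the orchestrator threads through the pipeline."""
    message: str
    agent_id: str
    memories: list[str] = field(default_factory=list)
    trace: list[str] = field(default_factory=list)

def run_pipeline(ctx: Context, stages) -> Context:
    # Each stage is a plain callable: Context -> Context.
    # The transport (web, Slack, Telegram) only builds the initial
    # Context; everything downstream is identical.
    for stage in stages:
        ctx = stage(ctx)
        ctx.trace.append(stage.__name__)
    return ctx

def guard_tokens(ctx: Context) -> Context:
    # Placeholder: the real guard estimates tokens and trims history.
    return ctx

def inject_memory(ctx: Context) -> Context:
    ctx.memories.append("core: user prefers concise answers")
    return ctx

result = run_pipeline(
    Context(message="hello", agent_id="a1"),
    [guard_tokens, inject_memory],
)
print(result.trace)  # ['guard_tokens', 'inject_memory']
```

The payoff of this shape is that adding a new transport never touches the pipeline: a new channel only has to construct the initial `Context`.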
Purpose-based model routing
Early on, I hardcoded models per agent. Agent A uses Claude, Agent B uses GPT-4. Simple, but wasteful. A research query doesn’t need the same model as a code generation task, even within the same agent. And from a cost perspective, it makes no sense to burn premium tokens on work a lighter model handles just as well.
So I built purpose-based routing. Each agent gets tagged with a purpose — “research,” “coding,” “writing” — and each purpose maps to a priority list of models with token budgets:
```python
# At request time, not at startup
model, provider = await llm_queue.acquire_model(
    purpose_id=agent.purpose_id,
    estimated_tokens=token_count,
    excluded_models=failed_models,
)

# Build a fresh executor for this specific request
executor = build_agent(config, llm=create_chat_model(provider, model))
```
The key insight: agents are stateless executables. Instead of binding an agent to a model at startup, I build a fresh executor per request. This enables load balancing across providers, automatic fallback when one provider hits rate limits, and cost optimization by routing simple queries to cheaper models.
The real question is why more frameworks don’t do this by default. The answer, I think, is that most frameworks optimize for developer experience in the first ten minutes, not operational efficiency in month ten. When you’re running agents at scale and the token bill matters, purpose-based routing isn’t a nice-to-have — it’s table stakes.
When a model fails, the router excludes it and tries the next one in the priority chain — up to five fallback attempts before giving up.
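The exclude-and-retry chain is simple to sketch. The names below (`acquire_with_fallback`, `flaky`, `RuntimeError` as a stand-in for a 429) are illustrative, not Sutra's real API:

```python
import asyncio

class AllModelsFailedError(Exception):
    pass

async def acquire_with_fallback(priority_chain, try_model, max_attempts=5):
    """Walk the purpose's priority chain, excluding models that fail,
    until one succeeds or the attempt budget is exhausted."""
    excluded: set[str] = set()
    for model in priority_chain[:max_attempts]:
        if model in excluded:
            continue
        try:
            return await try_model(model)
        except RuntimeError:  # stand-in for a 429 or provider outage
            excluded.add(model)
    raise AllModelsFailedError(f"all {len(excluded)} candidates failed")

async def flaky(model):
    # Hypothetical scenario: only the cheapest model is healthy right now.
    if model != "gpt-4o-mini":
        raise RuntimeError("429 Too Many Requests")
    return f"ok:{model}"

print(asyncio.run(acquire_with_fallback(
    ["claude-sonnet", "gpt-4o", "gpt-4o-mini"], flaky)))  # ok:gpt-4o-mini
```

Note that the exclusion set is scoped to a single request, which matches the behavior described above: a model that rate-limits one request is still eligible for the next.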
The tool factory pattern
Sutra supports 100+ tools across categories — file I/O, GitHub, web scraping, email, memory management, task decomposition, multi-agent delegation, and more. Loading all of them for every agent would be wasteful and dangerous (you don’t want every agent to have shell access).
The solution is a two-tier tool system:
Static tools are pre-instantiated singletons — things like web search and clipboard that don’t need per-agent context.
Factory tools are generated on-demand with agent-specific closures:
```python
def create_memory_tools(agent_id: str) -> list[StructuredTool]:
    async def save_memory(content: str, tier: str = "recall") -> str:
        # This closure captures agent_id — each agent
        # can only access its own memories
        await memory_store.save(agent_id=agent_id, content=content, tier=tier)
        return f"Memory saved to {tier} tier."

    return [StructuredTool.from_function(save_memory, ...)]
```
When an agent starts, get_tools_by_ids() checks which tool categories are needed, instantiates only those factories, and returns the merged list. An agent with ["web_search", "memory_save", "ask_agent"] enabled gets exactly three tools — not a hundred.
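The resolution logic itself is a small dispatch over the two tiers. This is a toy sketch under the assumptions above: the registries hold placeholder strings where real tool objects would live, and the names are illustrative rather than Sutra's internals.

```python
# Static tools: pre-built singletons shared by all agents.
STATIC_TOOLS = {
    "web_search": "<web_search tool>",
    "clipboard": "<clipboard tool>",
}

# Factory tools: built per agent so the closure captures agent_id.
TOOL_FACTORIES = {
    "memory_save": lambda agent_id: f"<memory tool bound to {agent_id}>",
    "ask_agent": lambda agent_id: f"<delegation tool bound to {agent_id}>",
}

def get_tools_by_ids(tool_ids, agent_id):
    tools = []
    for tool_id in tool_ids:
        if tool_id in STATIC_TOOLS:
            tools.append(STATIC_TOOLS[tool_id])
        elif tool_id in TOOL_FACTORIES:
            tools.append(TOOL_FACTORIES[tool_id](agent_id))
        # Unknown ids are simply skipped — the agent never sees
        # a capability it wasn't granted.
    return tools

print(len(get_tools_by_ids(["web_search", "memory_save", "ask_agent"], "a1")))  # 3
```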
If you’ve worked with enterprise permission systems, this pattern will look familiar. It’s the principle of least privilege, applied to AI tooling. Each agent gets exactly the capabilities it needs and nothing more. That’s not just good security — it keeps the model focused and reduces the chance of tool-choice errors.
Memory as a first-class system
Most chat applications treat memory as “append the last N messages.” That breaks down fast. A conversation with 200 messages can’t fit in any context window, and naive truncation loses critical context.
Here’s how I think about it: memory in an AI system should work the way it works for a strong team member. They remember the important decisions without being asked, they recall relevant details when the topic comes up, and they don’t bring up everything they’ve ever heard in every conversation. That’s the mental model behind Sutra’s tiered memory architecture:
Core memories: Always injected. The agent’s persistent knowledge about the user and itself — the equivalent of things a colleague just knows about you.
Recall memories: Semantically searched at query time using pgvector embeddings. The agent remembers what’s relevant to this question, not everything it’s ever seen.
Project context: Scoped to the active project — decisions, constraints, and facts that matter for the current work.
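The recall tier boils down to a top-k nearest-neighbour search over embeddings. In Sutra this runs inside Postgres via pgvector (roughly `ORDER BY embedding <=> :query LIMIT :k` in SQL), but the idea can be sketched in pure Python with toy two-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall(query_embedding, memories, k=2):
    """memories: list of (text, embedding) pairs. Returns the k most
    semantically similar memories — the 'recall' tier."""
    ranked = sorted(
        memories,
        key=lambda m: cosine_similarity(query_embedding, m[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

memories = [
    ("user deploys on Fridays", [1.0, 0.0]),
    ("user prefers Python", [0.0, 1.0]),
    ("project uses pgvector", [0.7, 0.7]),
]
print(recall([1.0, 0.1], memories, k=2))
```

The point of the tier split is visible even in this toy: only the memories relevant to the query make it into the context window, while core memories ride along unconditionally.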
On top of this, conversation history uses windowed loading: recent messages are kept at full fidelity, while older messages are summarized. The orchestrator calculates token budgets to ensure the assembled context always fits the model’s window:
```python
async def _guard_token_limit(self, messages, agent_id):
    total = estimate_tokens(messages)
    if total > model_limit * 0.9:
        messages = await emergency_trim(messages)
    return messages
```
Memory extraction happens asynchronously after each response — a fire-and-forget task that analyzes the conversation for facts worth remembering, without blocking the user’s response. This is where a lot of the long-term value lives in any orchestration platform. An agent that gets smarter about your context over time is categorically different from one that starts fresh every session.
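The fire-and-forget pattern is worth spelling out, because getting it wrong in asyncio is easy (tasks with no strong reference can be garbage-collected mid-flight). A minimal sketch, with hypothetical names standing in for Sutra's extraction pass:

```python
import asyncio

extracted = []
background_tasks = set()

async def extract_memories(conversation: str):
    # Stand-in for the LLM-based fact-extraction pass.
    await asyncio.sleep(0.01)
    extracted.append(f"facts from: {conversation}")

async def respond(message: str) -> str:
    reply = f"echo: {message}"
    # Fire-and-forget: schedule extraction but don't await it,
    # so the user's response is never blocked on memory work.
    task = asyncio.create_task(extract_memories(message))
    # Keep a strong reference so the loop can't GC the task mid-flight.
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return reply

async def main():
    reply = await respond("we ship on Fridays")
    assert not extracted  # the response returned before extraction ran
    await asyncio.gather(*background_tasks)  # drain here for the demo only
    return reply, extracted

print(asyncio.run(main()))
```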
If you’re building production AI systems, I write about architecture decisions like these every week. Subscribe to the newsletter → to get new posts on agentic AI, system design, and enterprise tech strategy.
Resilience: how the AI orchestration platform handles failure
Production AI systems fail in creative ways. Models get rate-limited. Providers have outages. Context windows overflow. Tokens get expensive. Anyone who has run distributed systems at scale knows the pattern: it’s not about preventing failure, it’s about failing gracefully. I built resilience into every layer:
Rate limit handling: When a model returns 429, the router marks it as excluded for this request and acquires the next priority model. The user sees a brief “falling back to alternate model” event in the stream, then the response continues seamlessly.
Context overflow: If the assembled message exceeds the model’s context window, the orchestrator triggers emergency trimming — aggressively summarizing history while preserving the current query and system prompt.
Circuit breaker: If a provider fails repeatedly, a circuit breaker opens and blocks further requests for a cooldown period. This keeps one degraded provider from cascading failures across the entire system.
Provider abstraction: The LLM registry wraps eight providers (OpenAI, Anthropic, Google Gemini, Groq, OpenRouter, Perplexity, Ollama, and Clod.io) behind a unified interface. Adding a new provider means implementing one create_chat_model() method and one list_models() method.
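Of these, the circuit breaker is the most mechanical to get right. Here's a minimal sketch; the threshold, cooldown, and injectable clock are illustrative choices, not Sutra's actual parameters:

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures;
    reject calls until `cooldown` seconds have passed."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the circuit and let traffic through.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

# Demo with a fake clock so the cooldown is deterministic.
now = [0.0]
cb = CircuitBreaker(threshold=2, cooldown=10.0, clock=lambda: now[0])
cb.record_failure(); cb.record_failure()
print(cb.allow())   # False: circuit is open
now[0] = 11.0
print(cb.allow())   # True: cooldown elapsed
```

A production version would typically add a half-open state that lets a single probe request through before fully closing, but the open/cooldown core is the part that stops the cascade.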
This is the kind of engineering that never shows up in demos but defines whether a platform survives its first month in production.
Agent-to-agent delegation
Single agents hit a ceiling. A coding agent shouldn’t also be your research agent. But they need to collaborate — and the collaboration model matters more than most people realize.
Every agent in Sutra automatically gets an ask_agent tool that lets it delegate to any other running agent:
```python
@tool
async def ask_agent(agent_name: str, message: str) -> str:
    """Ask another agent for help with a specific task."""
    target = await agent_manager.find_by_name(agent_name)
    response = await orchestrator.route_message(
        agent_id=target.id,
        message=message,
    )
    return response
```
This enables emergent workflows. A research agent can ask a coding agent to write a script, which can ask a review agent to check it. The orchestrator detects @mentions in messages and adds delegation hints, making collaboration feel natural.
For structured collaboration, Sutra supports group discussions — multiple agents in a moderated conversation with turn-taking, voting, and consensus mechanisms. Useful for design reviews, brainstorming, and decision-making. This is where agentic AI starts to look less like a tool and more like a team — which, if you ask me, is where the real leverage sits for enterprises in 2026. Gartner predicts that 40% of enterprise applications will integrate task-specific AI agents by the end of 2026, up from less than 5% today. The ones that succeed will need exactly this kind of orchestration layer underneath.
Extending the AI orchestration platform with MCP
The Model Context Protocol (MCP) was a game-changer for extensibility. Instead of building every integration natively, Sutra connects to MCP servers and dynamically wraps their tools:
```python
async def get_langchain_tool(self, server_id, tool_schema):
    # Convert MCP tool schema to LangChain StructuredTool
    # with dynamic Pydantic model for input validation
    fields = {}
    for name, prop in tool_schema.input_schema.items():
        fields[name] = (type_map[prop["type"]], ...)
    InputModel = create_model(f"{tool_name}Input", **fields)
    return StructuredTool(name=tool_name, func=invoke, args_schema=InputModel)
```
On startup, Sutra connects to all configured MCP servers, discovers their tools, and makes them available to agents. An agent with a Neon database MCP server can run SQL queries, manage branches, and analyze schemas — without any custom integration code. This is the kind of composability that makes a platform genuinely extensible rather than just configurable.
The frontend: streaming done right
The chat interface uses Server-Sent Events for streaming. Each SSE event carries a type — token, tool_start, tool_end, fallback, error, done — that the frontend maps to UI states:
Tokens animate in as they arrive.
Tool calls show as collapsible badges with inputs and outputs.
Model fallbacks display a subtle notification.
Errors surface inline without breaking the stream.
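On the wire, each of these is just a typed SSE frame. A minimal Python sketch of the emitting side, assuming hypothetical helper names rather than Sutra's actual code:

```python
import json

def sse_event(event_type: str, data: dict) -> str:
    """Serialize one event in Server-Sent Events wire format.
    event_type is one of: token, tool_start, tool_end,
    fallback, error, done."""
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"

def stream_response(chunks):
    # A generator like this can back a FastAPI StreamingResponse
    # with media_type="text/event-stream".
    for chunk in chunks:
        yield sse_event("token", {"text": chunk})
    yield sse_event("done", {})

wire = "".join(stream_response(["Hel", "lo"]))
print(wire)
```

Because every frame carries its type, the frontend is a straightforward switch over `event_type`, and new event kinds can be added without breaking existing clients.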
The conversation sidebar updates in real time via WebSocket, so new conversations appear instantly when created from Telegram or Slack. Small detail, but it’s the kind of polish that separates a platform from a prototype.
What I’d do differently
Start with async from day one. Retrofitting async into a sync codebase is painful. Every database call, every HTTP request, every file operation needs to be async in an agent orchestration system. I was fortunate to start with FastAPI and asyncpg, but even then, some early synchronous patterns had to be refactored.
Invest in execution traces early. I added the ExecutionTrace model late, and I wish I’d had it from the start. Knowing exactly what input an agent received, what it produced, how long it took, and how many tokens it consumed is invaluable for debugging and optimization. This is the equivalent of distributed tracing in a microservices stack — you don’t realize how much you need it until you’re debugging a failure at 2 AM.
Don’t underestimate conversation management. Windowed history, summarization, token counting, context assembly — this is where most of the complexity lives. The LLM call itself is straightforward. Everything you feed into it is the hard part. After building this, I’d argue that context engineering is the single most underrated skill in production AI.
The stack
For those who want the specifics:
Backend: FastAPI + LangGraph + SQLAlchemy (async) + PostgreSQL + Redis
Frontend: Next.js 14 (App Router) + Zustand + TanStack Query
Agent framework: LangGraph’s create_react_agent with custom tool injection
Vector search: pgvector with HNSW indexes
Task queue: Celery with Redis broker
Integrations: Slack (Bolt), Telegram (python-telegram-bot), WhatsApp (Twilio)
Infrastructure: Docker Compose with hot-reload for development
The takeaway
Building an AI orchestration platform taught me something I’ve seen over and over across 16 years of enterprise systems: the interesting problems are never in the headline technology — they’re in the systems engineering around it. Model routing, error resilience, memory management, tool orchestration, multi-agent coordination — these are distributed systems problems wearing an AI hat.
If you’re building your own AI orchestration platform, start with the orchestration layer. Get the message routing, context assembly, and error handling right. The LLM call is the easy part. Everything else is the platform. And that “everything else” is where the real competitive advantage lives — for the platform builder and for the enterprise deploying it. Deloitte’s 2026 TMT Predictions estimate the autonomous AI agent market at $8.5 billion this year, growing to $45 billion by 2030 — but only if enterprises get orchestration right. The plumbing is the product.
Want to go deeper? Read What Is Generative AI? A Clear 2026 Guide for the foundational context behind the models that power platforms like Sutra, or explore more posts on Agentic AI and System Design.
FAQs
What is an AI orchestration platform?
An AI orchestration platform is the coordination layer that sits between your users and your AI models. It handles everything the LLM call itself does not — context assembly, model routing, error fallbacks, memory management, tool access, and multi-agent delegation. Think of it as the control plane for production AI: it decides which model handles a request, what context that model sees, and what happens when something fails. Without orchestration, you have a chatbot. With it, you have a system.
How is an AI orchestration platform different from an AI framework like LangChain?
Frameworks like LangChain give you building blocks — chains, agents, tool interfaces. An AI orchestration platform uses those blocks but adds the production layer on top: smart model routing across multiple providers, tiered memory that persists across sessions, rate limit handling with automatic fallbacks, transport-agnostic message routing (web, Slack, Telegram), and multi-agent coordination. The framework helps you call an LLM. The orchestration platform helps you run that LLM reliably at scale, across agents, users, and channels.
Sutra is an open-source multi-agent orchestration platform. Explore the code on GitHub → and see how these architectural decisions come together in practice.