Agentic AI in the Enterprise: From Chatbots to Autonomous Workflows


AI Architecture · Feb 2025


Sariph Shrestha · 12 min read

Beyond the Chatbot

The first generation of enterprise LLM adoption was the chatbot: a text box connected to GPT-4 with some system prompt engineering. Useful, but limited. The chatbot answers questions. It does not do work.

The second generation — what we are building now — is agentic AI: systems where an LLM reasons about a goal, decomposes it into subtasks, selects and invokes tools, interprets the results, and iterates until the objective is achieved. The shift from "answer my question" to "accomplish my goal" is the most significant architectural change in enterprise AI since the transformer.

At Exosolve, we have deployed agentic systems for contract review, financial analysis, incident response, and compliance monitoring. This post describes the patterns and architectures we have converged on.

The Model Context Protocol (MCP)

One of the most important developments in the agentic AI space is the Model Context Protocol, originally introduced by Anthropic. MCP standardizes how LLMs connect to external tools and data sources, solving the fragmentation problem that previously required custom integrations for every tool.

We have adopted MCP as our standard integration layer. The protocol defines a clean client-server architecture: the LLM runtime acts as an MCP client, and each tool or data source exposes an MCP server. The server declares its capabilities — tools, resources, and prompts — and the client can discover and invoke them dynamically.

```typescript
// Example MCP server exposing a contract search tool
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({
  name: "contract-search",
  version: "1.0.0",
});

server.tool(
  "search_contracts",
  "Search executed contracts by party name, date range, or clause type",
  {
    query: z.string().describe("Natural language search query"),
    contract_type: z.enum(["nda", "msa", "sow", "all"]).optional(),
    date_after: z.string().optional().describe("ISO date string"),
    max_results: z.number().default(10),
  },
  async ({ query, contract_type, date_after, max_results }) => {
    const results = await contractIndex.search(query, {
      filter: { type: contract_type, date_gte: date_after },
      limit: max_results,
    });
    return {
      content: [{
        type: "text",
        text: JSON.stringify(results, null, 2),
      }],
    };
  }
);
```

The power of MCP is composability. An agent can discover available tools at runtime, which means you can add new capabilities to a deployed agent without changing the agent's code. We maintain a registry of MCP servers — contract search, CRM lookup, calendar management, code execution, database queries — and agents compose them as needed.
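The runtime-discovery idea can be illustrated with a minimal in-process sketch. The `ToolRegistry` and `ToolSpec` names here are hypothetical stand-ins; a real deployment would use the MCP SDK, where a client calls `list_tools` on each connected server rather than a local dictionary.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolSpec:
    name: str
    description: str
    handler: Callable[..., Any]

@dataclass
class ToolRegistry:
    """In-process stand-in for MCP's dynamic capability discovery."""
    _tools: dict[str, ToolSpec] = field(default_factory=dict)

    def register(self, spec: ToolSpec) -> None:
        # New capabilities can be added without touching agent code.
        self._tools[spec.name] = spec

    def discover(self) -> list[str]:
        # An MCP client would call list_tools() on each connected server.
        return sorted(self._tools)

    def invoke(self, name: str, **kwargs: Any) -> Any:
        return self._tools[name].handler(**kwargs)

registry = ToolRegistry()
registry.register(ToolSpec(
    "search_contracts",
    "Search executed contracts",
    lambda query: [{"id": "c-001", "score": 0.92}],
))
```

Because the agent only sees names and descriptions at discovery time, registering a new server is sufficient to extend a deployed agent.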

Multi-Agent Orchestration

For complex workflows, a single agent with many tools becomes unwieldy. The LLM's context window fills with tool descriptions, the reasoning becomes muddled, and reliability drops. We use multi-agent architectures for any workflow with more than 5-6 distinct capabilities.

Our standard pattern is the supervisor-worker architecture:

Supervisor agent: Receives the high-level goal, decomposes it into subtasks, delegates to specialized workers, synthesizes results, and handles error recovery. The supervisor has access to a workflow graph that defines valid task sequences and dependencies.

Worker agents: Each worker is a focused agent with 2-3 tools and a narrow system prompt. A contract analysis workflow might have: a Document Retrieval Worker (searches and fetches contracts), a Clause Analysis Worker (extracts and classifies clauses), a Risk Assessment Worker (evaluates risk factors against a rubric), and a Report Generation Worker (compiles findings into a structured report).

```python
class SupervisorAgent:
    def __init__(self, workers: dict[str, WorkerAgent], workflow: WorkflowGraph):
        self.workers = workers
        self.workflow = workflow
        self.state = WorkflowState()

    async def execute(self, goal: str) -> WorkflowResult:
        plan = await self.plan(goal)
        for step in plan.steps:
            worker = self.workers[step.worker_id]
            try:
                result = await worker.execute(
                    task=step.task,
                    context=self.state.get_context(step.dependencies),
                )
                self.state.record(step.id, result)
            except WorkerError as e:
                recovery = await self.handle_failure(step, e)
                if recovery.action == "retry":
                    result = await worker.execute(step.task, recovery.modified_context)
                    self.state.record(step.id, result)
                elif recovery.action == "escalate":
                    return WorkflowResult(
                        status="needs_human_review",
                        partial_results=self.state,
                    )
        return await self.synthesize(self.state)
```

Communication between agents happens through structured messages, not free-form text. Each worker returns a typed result object. This makes the system debuggable and testable — you can unit test each worker in isolation and integration test the full workflow.
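A typed worker message might look like the following sketch. The `WorkerResult` schema and the worker function are illustrative, not our production types.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class WorkerResult:
    """Structured message a worker returns to the supervisor (illustrative schema)."""
    worker_id: str
    status: str            # "ok" or "error"
    payload: dict[str, Any]

def clause_analysis_worker(document_id: str) -> WorkerResult:
    # A real worker would invoke its tools and an LLM; here the
    # classification is fixed so the message shape is the focus.
    clauses = [{"type": "indemnification", "risk": "high"}]
    return WorkerResult(
        worker_id="clause_analysis",
        status="ok",
        payload={"document_id": document_id, "clauses": clauses},
    )

result = clause_analysis_worker("doc-42")
```

Because every worker emits the same typed envelope, the supervisor can record, route, and assert on results without parsing free-form text.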

Tool Use Patterns

We have converged on several principles for designing agent tools:

Tools should be atomic and idempotent. A tool should do one thing. "Search contracts" is a good tool. "Search contracts and summarize the results" is two tools merged — the agent should handle the summarization. Idempotency matters because agents retry. If a tool call fails mid-execution and the agent retries, the system state should not be corrupted.
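One common way to get idempotency is a caller-supplied idempotency key, sketched below. The `TicketTool` class and the key format are hypothetical.

```python
class TicketTool:
    """Write tool made idempotent via a caller-supplied idempotency key."""

    def __init__(self) -> None:
        self._by_key: dict[str, dict] = {}
        self._next_id = 1

    def create_ticket(self, title: str, idempotency_key: str) -> dict:
        # A retried call with the same key returns the original ticket
        # instead of creating a duplicate, so agent retries are safe.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        ticket = {"id": self._next_id, "title": title}
        self._next_id += 1
        self._by_key[idempotency_key] = ticket
        return ticket

tool = TicketTool()
first = tool.create_ticket("Review MSA clause 7", idempotency_key="run-1/step-3")
retry = tool.create_ticket("Review MSA clause 7", idempotency_key="run-1/step-3")
```

Deriving the key from the workflow run and step identifiers ties deduplication to the agent's retry loop.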

Return structured data, not prose. Tools should return JSON, not natural language summaries. The agent can interpret structured data; it cannot reliably parse another LLM's prose output. When a search tool returns results, it returns document IDs, relevance scores, and metadata — not a paragraph describing what it found.

Implement dry-run modes for write operations. Any tool that modifies state (sending emails, updating records, creating tickets) should support a preview mode that shows what would happen without executing. This is essential for human-in-the-loop workflows.
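A dry-run mode can be as simple as a flag that defaults to preview. The `send_email` helper and its return shape below are assumptions for illustration.

```python
def send_email(to: str, subject: str, body: str, dry_run: bool = True) -> dict:
    """Write operation with a preview mode (hypothetical helper)."""
    action = {"tool": "send_email", "to": to, "subject": subject, "body": body}
    if dry_run:
        # Return the action the agent *would* take, for human approval.
        return {"status": "preview", "action": action}
    # ... actually deliver via the mail gateway here ...
    return {"status": "sent", "action": action}

preview = send_email("counsel@example.com", "Flagged clauses", "See findings below.")
```

Defaulting `dry_run` to `True` means an agent must opt in explicitly, usually after an approval gate, before anything is actually sent.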

Guardrails Architecture

Enterprise agentic AI without guardrails is a liability. We implement guardrails at three levels:

Input guardrails: Before the agent processes a request, we classify the intent and reject out-of-scope requests. A contract analysis agent should not be answering general knowledge questions. We use a lightweight classifier (DistilBERT fine-tuned on ~2,000 examples) that adds less than 10 ms of latency.

Execution guardrails: Every tool call is validated against a policy engine before execution. Policies define which tools an agent can use, what parameters are allowed, and rate limits. For example: the contract search agent can query the contract database but cannot access the HR system. The financial analysis agent can read portfolio data but cannot execute trades without human approval.
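A policy check of this kind might look like the sketch below. The agent names, tool names, and `ToolPolicy` shape are illustrative, not our actual policy engine.

```python
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    allowed_tools: set[str]
    requires_approval: set[str] = field(default_factory=set)

# Per-agent policies: what each agent may call, and which calls are gated.
POLICIES = {
    "contract_search_agent": ToolPolicy(
        allowed_tools={"search_contracts", "fetch_document"},
    ),
    "financial_analysis_agent": ToolPolicy(
        allowed_tools={"read_portfolio", "execute_trade"},
        requires_approval={"execute_trade"},
    ),
}

def check_call(agent: str, tool: str) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for a proposed tool call."""
    policy = POLICIES[agent]
    if tool not in policy.allowed_tools:
        return "deny"
    if tool in policy.requires_approval:
        return "needs_approval"
    return "allow"
```

Running this check on every proposed call, before the tool executes, is what turns a policy document into an enforced boundary.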

Output guardrails: Before returning results to the user, we run the output through content filters, PII detection, and domain-specific validators. For legal outputs, this includes verifying that cited clauses actually exist in the referenced documents.
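The clause-verification idea can be approximated with a substring check, as in this deliberately naive sketch; a production validator would normalize whitespace and match against parsed clause spans rather than raw text.

```python
def unverified_citations(cited: list[str], source_text: str) -> list[str]:
    """Output guardrail: flag cited clause text that does not appear in the source."""
    return [quote for quote in cited if quote not in source_text]

contract = "Section 7.2: Supplier shall indemnify Customer against third-party claims."
flags = unverified_citations(
    [
        "Supplier shall indemnify Customer",   # present in the document
        "Customer waives all audit rights",    # hallucinated citation
    ],
    contract,
)
```

Any flagged citation blocks the output from reaching the user until the claim is re-grounded or removed.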

Human-in-the-Loop

Full autonomy is rarely appropriate in enterprise contexts. We design our systems with explicit escalation points:

Confidence-based escalation. When the agent's confidence in a decision drops below a threshold, it pauses and requests human review. The human sees the agent's reasoning, the evidence it considered, and its tentative conclusion.

Approval gates. Certain actions — sending external communications, modifying financial records, approving contracts — always require human approval regardless of confidence. The agent prepares the action and presents it for one-click approval.
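The two escalation rules above, confidence thresholds and unconditional approval gates, can be combined into a single routing function. The threshold value and action names here are illustrative.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # assumption: tuned per workflow in practice

# Actions that always require a human, regardless of confidence.
ALWAYS_GATED = {"send_external_email", "modify_financial_record", "approve_contract"}

@dataclass(frozen=True)
class Decision:
    action: str
    confidence: float

def route(decision: Decision) -> str:
    """Decide whether to act autonomously or hand off to a human."""
    if decision.action in ALWAYS_GATED:
        return "await_approval"   # approval gate wins even at high confidence
    if decision.confidence < CONFIDENCE_THRESHOLD:
        return "human_review"     # confidence-based escalation
    return "execute"
```

Checking the gate before the threshold ensures a highly confident agent can never bypass a mandatory approval.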

Supervision dashboards. For long-running workflows, we provide real-time visibility into what the agent is doing, what tools it has called, what decisions it has made, and where it is in the workflow graph. Operators can intervene at any point.

The goal is not to replace human judgment. The goal is to handle the 80% of routine work autonomously so that humans can focus their judgment on the 20% that actually requires expertise. In our contract review deployments, lawyers review AI-flagged issues rather than reading entire contracts. Their expertise is concentrated where it matters most.

Lessons Learned

After deploying agentic systems across multiple enterprise clients, our key lessons are:

Start with a narrow, well-defined workflow before attempting general-purpose agents.

Invest heavily in tool design — the quality of your tools determines the ceiling of your agent.

Structured communication between agents is non-negotiable for reliability.

Guardrails are not a feature; they are a requirement.

Human-in-the-loop is not a weakness of the system; it is a design choice that builds trust and enables adoption.

Agentic AI · MCP · Multi-Agent · LLM · Orchestration
