ServiceNow, Vanguard, and Arklex.ai on AI Agent Engineering Bottlenecks

The AI Agent Conference in New York draws practitioners who are shipping production AI systems at serious enterprise scale. A panel featuring Rama Krishna Raju Samantapudi from ServiceNow, Karun Appapogu from Vanguard, and Zhou Yu, co-founder of Arklex.ai, spent their session on something most AI content skips entirely: what actually breaks when you try to move an agent from prototype to production workflow.

Their answer, consistently and from three different vantage points: the bottlenecks are almost never the model. They're testing, simulation, governance, context quality, observability, and deployment safety. The engineering disciplines that make distributed systems reliable — not the AI-specific work everyone focuses on.

Most Teams Build First and Test Later. That's Backwards.

The panel opened on simulation, and the room felt it: "Most people build agents first and test later. That is backwards."

The reason this matters more for agents than for traditional software: agent behavior is genuinely hard to predict. Workflows branch dynamically. Hidden failure modes emerge from combinations of inputs that nobody anticipated during development. "The unknown coverage problem is huge," one panelist said. "Edge cases are impossible to predict manually."

The solution isn't just more testing — it's large-scale simulation before any production deployment. Running agents against synthetic and representative scenarios at volume, specifically to surface the behavior you didn't design for. "Before launching to production, we need simulations." This is table stakes in mature software engineering. It's still not standard practice in AI agent development, and that gap is where a lot of production failures are coming from.
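As a concrete illustration, a pre-production harness along these lines is one way to run that kind of simulation. This is a minimal sketch, not anything the panelists described: the `agent.run` interface, the scenario source, and the `check` oracle are all hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class SimulationResult:
    scenario: str
    output: str
    passed: bool
    error: str | None = None

def run_simulation(agent, scenarios, check):
    """Run the agent against synthetic scenarios at volume before any
    production deployment, collecting failures for triage."""
    results = []
    for scenario in scenarios:
        try:
            output = agent.run(scenario)  # hypothetical agent interface
            results.append(SimulationResult(scenario, output, check(scenario, output)))
        except Exception as exc:  # crashes are findings, not noise
            results.append(SimulationResult(scenario, "", False, str(exc)))
    failures = [r for r in results if not r.passed]
    print(f"{len(failures)}/{len(results)} scenarios surfaced a failure")
    return failures
```

The shape is the point: failures, including crashes, are collected as findings rather than discarded, which turns the unknown coverage problem into a triage list.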

Observability Has to Include Reasoning, Not Just Outputs

Traditional software observability logs what happened. For agents, the panel argued that's insufficient. You need to understand why.

"We need to understand not only what happened, but why. Reasoning traces matter."

This means capturing reasoning traces, planning traces, and full execution history — not just final outputs. An agent that reaches the right answer through faulty reasoning is a production risk. An agent that reaches the wrong answer and you can't trace why is an undebuggable one. Semantic observability — understanding the agent's decision path, not just its result — is the difference between a system you can improve and a system you can only restart.
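None of the panelists described an implementation, but the idea reduces to something like this hypothetical tracer, which records planning and reasoning events alongside the final output so a wrong answer can be traced back to the decision that produced it:

```python
import json
import time
import uuid

class AgentTracer:
    """Capture the agent's decision path, not just its result."""

    def __init__(self, run_id=None):
        self.run_id = run_id or str(uuid.uuid4())
        self.events = []

    def record(self, kind, content):
        # kind: "plan", "reasoning", "tool_call", "output", ...
        self.events.append({
            "run_id": self.run_id,
            "ts": time.time(),
            "kind": kind,
            "content": content,
        })

    def dump(self):
        return json.dumps(self.events, indent=2)

tracer = AgentTracer()
tracer.record("plan", "1) look up account  2) compute balance  3) reply")
tracer.record("reasoning", "Balance questions map to the billing tool.")
tracer.record("output", "Your balance is $42.")
print(tracer.dump())
```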

"We need access controls and observability. Agents acting independently create risk." Governance without observability is just hope.

Context Quality Beats Context Size

One of the clearest ideas from the session, and one that cuts against a common instinct. When agents underperform, the reflex is to give them more context. More documents, more history, more data fed into the prompt. The panel pushed back directly.

"The quality of context matters more than the quantity. More context is not always better."

The issue is retrieval and organization. Poorly structured, low-quality, or irrelevant context doesn't help an agent reason better — it introduces noise, increases latency, and can actively degrade output quality. "The hardest part is understanding the data. Context engineering is the real challenge." Getting the right information to the agent at the right time, in a form it can actually use, is harder and more important than maximizing context window utilization.
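In code, the instinct being rejected is "concatenate everything retrieved." A hedged sketch of the alternative, where `scorer` stands in for whatever relevance model a team uses, and the threshold and budget values are arbitrary:

```python
def build_context(chunks, scorer, min_score=0.5, token_budget=2000):
    """Prefer a few high-relevance chunks over everything available:
    rank retrieved chunks, drop low-relevance noise, and stop at a
    fixed budget instead of filling the whole context window."""
    scored = sorted(((scorer(c), c) for c in chunks), reverse=True)
    selected, used = [], 0
    for score, chunk in scored:
        if score < min_score:
            break  # everything below this point is noise
        cost = len(chunk.split())  # crude token estimate
        if used + cost > token_budget:
            continue
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)
```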

Human Checkpoints Are Not Optional

Despite the push toward autonomy, the panel was clear that human intervention points aren't a design weakness — they're a design requirement. "Human checkpoints are still necessary. You need places where humans can intervene."

This is partly a safety argument and partly a practical one. Autonomous agents operating without intervention points create compounding failure risk — errors that propagate through multi-step workflows before anyone notices. Checkpoints interrupt that propagation. They also create the feedback loops needed to improve agent behavior over time. A system that never pauses for human review is a system that can't learn from its mistakes in any systematic way.
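One plausible shape for such a checkpoint, sketched with a terminal prompt standing in for whatever review surface a real system would use; the action names and `Decision` type are invented for illustration:

```python
from dataclasses import dataclass

RISKY_ACTIONS = {"refund", "delete_record", "send_external_email"}

@dataclass
class Decision:
    approved: bool
    reason: str = ""

def cli_approval(action, payload):
    """Simplest possible checkpoint: a human at a terminal. In production
    this would be an approval queue, a ticket, or a review UI."""
    answer = input(f"Approve {action} with {payload}? [y/N] ")
    return Decision(approved=answer.strip().lower() == "y")

def execute_step(action, payload, perform, request_approval=cli_approval):
    """Gate high-impact actions behind a human checkpoint so a bad step
    stops here instead of compounding through the rest of the workflow."""
    if action in RISKY_ACTIONS:
        decision = request_approval(action, payload)
        if not decision.approved:
            return {"status": "rejected", "reason": decision.reason}
    return {"status": "executed", "result": perform(action, payload)}
```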

Don't Use the Largest Model for Everything

The panel's discussion of model orchestration was practical and worth sitting with. Not every task in an agent workflow requires frontier model reasoning. Classification, extraction, routing decisions, and simple automation tasks have very different requirements than complex multi-step reasoning or planning.

"There is no reason to use the largest model for every task. Complex reasoning requires different models than extraction."

The corollary: workflow design often matters more than model selection or fine-tuning. Getting the task routing right — matching each step in a workflow to the appropriate model and inference strategy — produces better results and better economics than trying to solve every problem with a single large model. "Production AI is not just prompting. Engineering discipline matters."
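The routing itself can be as unglamorous as a lookup table. A sketch with made-up model names; the tiers are the point, not the identifiers:

```python
ROUTES = {  # hypothetical task-to-model mapping
    "classify": "small-fast-model",
    "extract": "small-fast-model",
    "route": "small-fast-model",
    "plan": "frontier-model",
    "reason": "frontier-model",
}

def pick_model(task_type, default="mid-tier-model"):
    """Match each workflow step to the cheapest model that handles it
    well, instead of sending every task to the largest model."""
    return ROUTES.get(task_type, default)

assert pick_model("extract") == "small-fast-model"
assert pick_model("reason") == "frontier-model"
```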

Agent Engineering Is Becoming Systems Engineering

The panel's closing framing was worth noting directly. AI agent engineering is converging with DevOps, distributed systems, observability, and infrastructure engineering. The skills required to run reliable production AI systems are the skills required to run reliable production software systems — with an additional layer of complexity introduced by probabilistic, non-deterministic agents.

That's not a diminishment of AI engineering. It's a maturation of it. The field is moving from "who can build the most impressive agent" to "who can build agents that are testable, observable, governable, and improvable over time." Those are different standards, and the teams that meet them will be the ones with durable production deployments.


This session was presented at the AI Agent Conference 2026 in New York. Panelists represented ServiceNow, Vanguard, and Arklex.ai.
