Oversee’s BTN Group webinar brought together Christine Sykes (COO, Direct Travel), Steve Clagg (Founder, Claghouse Consulting), and Oded Zilinsky (CPO, Oversee) to define these metrics with real operational examples.
Why the Pass/Fail Mindset Breaks Down for AI
For decades, TMC operations teams measured new technology the same way: implement a process, check if it worked, move on. A rebooking rule either fired or it failed.
AI operates differently. Instead of replacing one workflow end to end, it touches many workflows at once, each at a different depth. A single flight booking request might involve AI catching a missing departure city, pulling the traveler’s policy, initiating a GDS search, drafting a response, and confirming the booking. On average, that conversation spans seven email exchanges, and almost every step offers an opportunity for AI to remove friction.
The gains on each step feel small. But they compound over thousands of interactions. “AI isn’t a single-use key switch,” as Christine Sykes, COO at Direct Travel, put it. “It improves many parts of many workflows at once.”
From the buyer side, the framing is even simpler. Travel managers don’t care whether AI or a human handled the task. They care about outcome fidelity: did the right thing happen, for the right traveler, at the right time, within the rules of the program? Measuring that requires a new scorecard.
The Five-Dimension AI Scorecard for TMC Operations
1. Productivity: How Much More Work Can the Same Team Handle?
The old scoreboard tracked tickets in, tickets out. That measured how busy your agency was, not whether anyone was better off.
Productivity in the AI era means capacity creation. A 20% reduction in time to resolution effectively creates 25% more capacity per advisor, as Sykes described it. The same team completes more work, overtime drops, and the operation absorbs the 30 to 50% disruption spikes that have become the norm, without immediately hiring.
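The arithmetic behind that claim is worth making explicit: if each task takes 80% of its old time, the same team covers 1 / 0.8 = 1.25 times the work. A minimal sketch (the 20% figure is from the webinar; the function name is ours):

```python
def capacity_multiplier(time_reduction: float) -> float:
    """Capacity gain when each task takes (1 - time_reduction) of its old time."""
    return 1 / (1 - time_reduction)

# A 20% cut in time to resolution yields 25% more capacity per advisor.
print(round((capacity_multiplier(0.20) - 1) * 100))  # → 25
```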
KPIs: time saved per interaction, emails automated as a percentage of total volume, productivity per agent, cost to serve quarter over quarter.
2. Quality: How Accurate and Consistent Are AI-Assisted Outputs?
Speed without accuracy creates rework. And rework erodes every productivity gain.
The bar: AI-assisted outputs need to match or exceed the accuracy of fully manual workflows. First-time yield (getting it right without back-and-forth) surfaced as a priority metric across the panel. Quality measurement also acts as an early warning system: if first-time yield declines in month 8 or 15, the model may need retraining before travelers feel the drift.
KPIs: first-time yield, error rates on automated versus agent-handled interactions, consistency across request types.
3. Containment: How Often Do Cases Escalate or Reopen?
Containment measures whether AI carries a task to completion or generates more handoffs than it resolves.
Two modes affect containment differently. In coded workflows, AI follows exact steps: a traveler asking for a resent itinerary gets an automated response in under a minute. TMCs can fully automate 5 to 20% of requests that way. In the second mode, AI acts as a reasoning layer across the remaining 80 to 95%, pulling policies and drafting responses for agent review. Containment here tracks how much manual work AI absorbed, even when the agent made the final call.
KPIs: touchless resolution rate, escalation rate, reopen rate.
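The three containment KPIs fall out of the same case records. A hypothetical sketch, where the field names (resolved_by_ai, escalated, reopened) are our assumptions rather than any real system's schema:

```python
from dataclasses import dataclass

@dataclass
class Case:
    resolved_by_ai: bool   # closed with no agent touch
    escalated: bool        # handed off to a human
    reopened: bool         # traveler came back on the same issue

def containment_kpis(cases: list[Case]) -> dict[str, float]:
    """Share of cases that were touchless, escalated, or reopened."""
    n = len(cases)
    return {
        "touchless_resolution_rate": sum(c.resolved_by_ai for c in cases) / n,
        "escalation_rate": sum(c.escalated for c in cases) / n,
        "reopen_rate": sum(c.reopened for c in cases) / n,
    }

sample = [Case(True, False, False), Case(False, True, False),
          Case(True, False, True), Case(False, False, False)]
print(containment_kpis(sample))
# → {'touchless_resolution_rate': 0.5, 'escalation_rate': 0.25, 'reopen_rate': 0.25}
```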
4. Traveler Experience: Are Travelers Actually Happier?
Every other dimension is internal plumbing. Traveler experience closes the loop. If productivity and quality improve but satisfaction stays flat, the gains aren’t reaching the end user.
AI has access to booking history, traveler profiles, and full policy sets. With that context, every interaction can feel tailored. The goal, as Zilinsky described it, is for travelers to feel genuinely impressed, not just adequately served. NPS and CSAT segmented by AI-assisted versus manual interactions give the clearest signal, particularly during disruption events.
KPIs: NPS and CSAT by interaction type, satisfaction during disruptions, response time as experienced by the traveler.
5. Resilience: Does Performance Hold Under Surge?
Resilience separates AI that runs as a feature from AI that functions as infrastructure.
When AI absorbs routine requests during disruption spikes, agents focus on high-empathy interactions with stranded travelers. Without that layer, overtime spikes, SLAs break, and cost to serve balloons. The structural test is volatility. Embedded AI smooths the curve: first-contact resolution (FCR), CSAT, and compliance hold in a tight band week over week. As Clagg put it, “reduced volatility means the floor got raised permanently, not just the ceiling occasionally.”
KPIs: SLA adherence during disruption versus baseline, time to clear backlog, volatility in service metrics quarter over quarter.
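One simple way to put a number on quarter-over-quarter volatility is the coefficient of variation of a weekly KPI series. The weekly FCR figures below are illustrative, not from the webinar:

```python
import statistics

def volatility(weekly_values: list[float]) -> float:
    """Std dev relative to the mean; lower means a tighter band week over week."""
    return statistics.stdev(weekly_values) / statistics.mean(weekly_values)

before_ai = [0.78, 0.52, 0.85, 0.48, 0.80]   # FCR swinging during disruptions
with_ai   = [0.82, 0.79, 0.84, 0.80, 0.83]   # same metric, tighter band

print(f"{volatility(before_ai):.2f} vs {volatility(with_ai):.2f}")  # prints "0.25 vs 0.03"
```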
Governance and the Long View
A scorecard only works if the AI behind it stays trustworthy. Two risks erode value quietly: governance drift and agent resistance.
Drift accumulates slowly: one edge case, one client-specific override, one missed policy update. Clagg’s recommendation: run adversarial testing quarterly. Feed synthetic cases through AI that probe policy boundaries and duty-of-care triggers. Catch drift before a real traveler does.
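A minimal sketch of what that quarterly suite could look like: synthetic cases that deliberately probe policy boundaries, run through the AI layer and checked against the expected outcome. The ai_handle function, the case fields, and the expected labels are all placeholders, not a real API:

```python
# Synthetic cases probing policy boundaries and duty-of-care triggers.
SYNTHETIC_CASES = [
    {"request": "book first class; policy allows economy only",
     "expect": "flag_policy_violation"},
    {"request": "traveler stranded overnight with no hotel",
     "expect": "trigger_duty_of_care"},
]

def ai_handle(request: str) -> str:
    """Placeholder for the production AI layer under test."""
    raise NotImplementedError

def run_adversarial_suite(handle=ai_handle) -> list[dict]:
    """Return the cases the AI got wrong; empty means no drift detected."""
    return [c for c in SYNTHETIC_CASES if handle(c["request"]) != c["expect"]]
```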
Agent trust matters equally: if frontline teams don’t adopt the AI, no dashboard metric compensates. The approach that works is to start with pilots proving AI saves time on tedious tasks. When agents see faster resolution and less overtime, adoption follows naturally.
From Scorecard to Operating Layer
Any system can show a spike in month one. The real proof comes in month 7, month 11, month 15. If productivity gains hold or compound at that point, AI has become an operating model.
The five-dimension scorecard gives TMC operations leaders the framework to track exactly that. Not whether AI passed or failed on one use case, but whether the operation got faster, more accurate, and more resilient quarter after quarter. The scorecard is how you justify ongoing investment, and how you know when to tune, expand, or pull back.
Oversee’s AgentSee reduces average handling time by up to 50%, automates over 70% of ticket handling, and drives productivity improvements of up to 90% in selected processes. Enterprise-grade infrastructure, private LLMs, full audit logging, no third-party data sharing.
If you want to see how AgentSee fits into your existing workflows, we’d welcome the conversation.
Book a walkthrough here.