Governed Intelligence per Dollar: What Healthcare Already Knows About Metrics That Lie

Healthcare already ran this experiment. We called it fee-for-service, and it took us twenty years to fix.

Jul 02, 2026

A new metric just earned legitimacy in the AI industry, and almost nobody writing about it has spent a career in healthcare. That’s the gap this piece is here to close.

Weeks ago, Microsoft published its official model card for MAI-Code-1-Flash, and buried in the release notes was a detail worth pausing on: average token usage sitting next to the benchmark score, for the first time on a card like this. Microsoft reports MAI-Code-1-Flash beating Claude Haiku 4.5 on every core coding benchmark tested, including a 16-point lead on SWE-Bench Pro, while solving harder problems on SWE-Bench Verified using up to 60 percent fewer tokens. Same class of task, a fraction of the token spend.

Artificial Analysis is already running this exact comparison across frontier models, not just efficient ones. Claude Opus 4.8 scores 56 on their Intelligence Index against GPT-5.5’s 55, essentially a tie. But running the full Index costs $4,011.58 on Opus 4.8 versus $2,818.79 on GPT-5.5, a 42 percent gap for a single point of capability. Same answer, meaningfully different bill.
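The arithmetic behind that gap is worth seeing once. Here is a minimal sketch using only the four figures quoted above; the "cost per index point" ratio is my own illustrative framing, not a number Artificial Analysis publishes.

```python
# Scores and run costs are the figures quoted above.
# "Cost per index point" is an illustrative derived ratio, not a published metric.
runs = {
    "Claude Opus 4.8": {"index_score": 56, "run_cost_usd": 4011.58},
    "GPT-5.5": {"index_score": 55, "run_cost_usd": 2818.79},
}


def cost_per_point(run):
    """Dollars spent per point of Intelligence Index score."""
    return run["run_cost_usd"] / run["index_score"]


opus = runs["Claude Opus 4.8"]
gpt = runs["GPT-5.5"]

# Relative cost gap: how much more the pricier run costs than the cheaper one.
cost_gap = opus["run_cost_usd"] / gpt["run_cost_usd"] - 1

for name, run in runs.items():
    print(f"{name}: ${cost_per_point(run):.2f} per index point")
print(f"Cost gap: {cost_gap:.1%}")
```

On these inputs the gap works out to about 42 percent, and the per-point cost is roughly $72 versus $51.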

The read across the industry is consistent, and correct as far as it goes. The era of reaching for the biggest model regardless of cost is ending. The question shifting inside every enterprise isn’t “which model is smartest,” it’s “which model clears the bar for this task at the lowest cost.” Enterprises burning through annual token budgets in months are learning this in real time, with CFOs forcing the lesson.

What that framing leaves out is what happens when the cheap, efficient answer is wrong.

The metric healthcare already tried once

I’ve spent thirty years building healthcare technology, and I recognized intelligence per dollar the moment I saw it. In my GE and IDX years, I watched the incentive up close: the pull was always toward more. More procedures, more encounters, more billable units, because that’s what the metric rewarded. Nobody in those rooms was acting in bad faith. The measurement itself was doing the steering. I watched a version of that number dominate an entire industry for decades, and I watched what it cost us to unwind.

Fee-for-service reimbursement optimized for one thing: volume per dollar. More procedures, more visits, more billable units, all rewarded, none of it weighted against whether the patient actually got better. It took healthcare roughly twenty years and a genuine movement, value-based care, to force outcomes and risk back into that equation. The lesson wasn’t that efficiency is the wrong goal. It’s that efficiency measured without an outcome or a risk term isn’t efficiency. It’s volume with better math.

Intelligence per dollar, as the AI industry is defining it today, is fee-for-service wearing a new unit. Capability per token, throughput per dollar. Nothing in the numerator or denominator asks what the answer cost when it was wrong.

In a coding benchmark, that’s a rounding error. In a hospital, it’s the whole story.

The risk premium hiding in the number

The clinical data makes the stakes concrete. Recent peer-reviewed work puts hallucination rates for state-of-the-art medical LLMs at 15 to 40 percent on clinical tasks, with drug interaction and treatment questions among the hardest because the underlying knowledge shifts constantly and precision matters. Here’s the detail that should reframe how you read that number: a study in npj Digital Medicine measured the same class of models on a narrow, clinician-reviewed summarization task and found a hallucination rate of 1.47 percent. Same technology. The difference between 1.47 and 40 isn’t the model. It’s the governance around the task. That gap is the entire argument of this piece, expressed as data.

The safety community has already called it. In January 2026, ECRI, the independent patient safety organization, ranked misuse of AI chatbots as the number one health technology hazard of the year in its annual Top 10 report, ahead of any hardware or infrastructure risk. ECRI cited OpenAI’s own analysis that more than 40 million people turn to ChatGPT for health information every day, on tools that aren’t regulated as medical devices and haven’t been validated for clinical use.

Now put that next to the adoption data. Eliciting Insights’ second annual survey of 120 US health systems found 75 percent now using or planning at least one AI solution, with multi-solution deployment up 67 percent year over year. Their companion readiness research puts formal governance at roughly 18 percent of those organizations, most lacking data policies and staff with the skills to evaluate what they’ve deployed. That’s not a rounding gap. That’s most of the industry running production AI with no floor under the tools reaching clinical and financial workflows.

CFOs are already feeling this, even where they can’t name it yet. Sixty percent call revenue cycle the single biggest AI opportunity in front of them, but only 39 percent believe AI will actually reduce their overall costs. That gap is a risk premium. It’s the cost of operating without a governance floor, priced in before anyone runs a token calculation. A single unmitigated event, such as a biased denial model or a hallucinated entry in clinical documentation, can erase multiple years of ROI in a single quarter, and none of that shows up in a capability-per-token ratio. It shows up on the P&L.

Intelligence per dollar without a risk term isn’t a cost metric. It’s a liability with better math, the same trade healthcare already made once and spent two decades unwinding.

The resolved metric: Governed Intelligence per Dollar

The instinct behind intelligence per dollar is right. Cost discipline in AI deployment is overdue, and defaulting to the most expensive frontier model for every task was never going to last. Healthcare doesn’t need to reject the metric. It needs the same correction it already made to fee-for-service: put the outcome and the risk back into the equation before optimizing the cost.

Call it Governed Intelligence per Dollar: capability delivered per dollar spent, where the denominator counts not just token price but also clinical risk exposure and audit posture. Whether the system’s outputs are traceable. Whether a human is positioned to catch the failure mode before it reaches a patient or a claim. Whether the deployment sits inside a governance structure that can show what happened and why.

Two models with identical benchmark scores and identical token costs aren’t equivalent if one runs inside an audited, governed agent framework and the other is a wrapper around an API call with no oversight layer. The cheaper model on paper isn’t the cheaper model in practice once you price in what happens the first time it’s wrong in front of a regulator, a plaintiff’s attorney, or a patient.
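One way to make that comparison concrete is to add an expected-risk term to the denominator. What follows is a sketch under stated assumptions, not a standard formula: the `Deployment` fields, the simple additive cost model, and every number in the example are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical inputs; none of these values come from a real system."""
    name: str
    capability_score: float             # benchmark score, any consistent scale
    token_cost_usd: float               # annual token spend
    outputs_per_year: int
    error_rate: float                   # share of outputs that are wrong
    interception_rate: float            # share of wrong outputs caught before harm
    cost_per_uncaught_error_usd: float  # assumed average cost of one uncaught error


def expected_risk_cost(d):
    """Expected annual cost of wrong outputs that nobody caught."""
    uncaught = d.outputs_per_year * d.error_rate * (1 - d.interception_rate)
    return uncaught * d.cost_per_uncaught_error_usd


def governed_intelligence_per_dollar(d):
    """Capability divided by token cost plus expected risk cost (assumed model)."""
    return d.capability_score / (d.token_cost_usd + expected_risk_cost(d))


# Same model, same score, same token bill. Only the oversight layer differs.
shared = dict(capability_score=55, token_cost_usd=100_000, outputs_per_year=50_000,
              error_rate=0.02, cost_per_uncaught_error_usd=500)
governed = Deployment(name="governed agent", interception_rate=0.95, **shared)
wrapper = Deployment(name="bare API wrapper", interception_rate=0.0, **shared)

for d in (governed, wrapper):
    total = d.token_cost_usd + expected_risk_cost(d)
    print(f"{d.name}: total cost ${total:,.0f}, "
          f"score per $1,000 = {1000 * governed_intelligence_per_dollar(d):.3f}")
```

With these made-up inputs the two deployments show the same token bill, but the wrapper's expected risk cost is far larger than its token spend. A real model would also need to count the cost of running the governance layer itself, which this sketch leaves out.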

The governance floor: five questions before you optimize a single token

Run any AI deployment in your organization against these. A “no” on any of them means your intelligence-per-dollar number for that system is fiction.

Can you trace any given output back to the model, the prompt, and the data that produced it?
Is a human positioned to catch the failure mode before it reaches a patient or a claim?
Can you produce the approval trail for this workflow on demand: who authorized it, when, and for what scope?
Is anyone monitoring for drift, and would they know within days, not quarters, if behavior changed?
When an output gets flagged, is there a defined path from flag to resolution, or does it go to an inbox?

Five yeses is a governance floor. Anything less, and the honest math says your denominator is missing its largest term.
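The five questions translate directly into a pass/fail check you can keep next to each deployment record. This is a minimal sketch; the `GovernanceFloor` class and its field names are hypothetical labels for the questions above, and the example answers are invented.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GovernanceFloor:
    """One boolean per question above; field names are illustrative."""
    output_traceable: bool             # model, prompt, and data can be traced
    human_positioned_to_catch: bool    # a human can catch the failure before harm
    approval_trail_on_demand: bool     # who authorized it, when, for what scope
    drift_detected_within_days: bool   # monitoring would notice in days, not quarters
    flag_to_resolution_path: bool      # flagged outputs follow a defined path

    def gaps(self):
        """Names of the questions answered 'no'."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_met(self):
        """The floor is met only when all five answers are yes."""
        return not self.gaps()


# Hypothetical deployment with one "no".
prior_auth_agent = GovernanceFloor(
    output_traceable=True,
    human_positioned_to_catch=True,
    approval_trail_on_demand=True,
    drift_detected_within_days=False,
    flag_to_resolution_path=True,
)

print(prior_auth_agent.is_met())
print(prior_auth_agent.gaps())
```

A single "no" fails the floor by design, which mirrors the rule above: there is no partial credit for four out of five.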

This is also, for what it’s worth, the economics of what I’ve been calling the Fourth Tier: the layer of domain-governed agents that sits above the SDKs, the frameworks, and the coding agents. Governed Intelligence per Dollar is the number that layer competes on.

What this means operationally

Sequencing matters more than the metric itself. Health systems and life sciences organizations chasing intelligence per dollar as currently defined are optimizing cost inside an ungoverned environment, the same mistake fee-for-service made at scale. The correction is simple to state, even if it’s hard to execute: govern the agent layer first, then optimize cost inside that envelope, not the reverse.

In practice that means treating governance as infrastructure rather than a policy document. Know what your agents are doing, who approved the workflow, what happens when an output gets flagged, and whether you can produce that trail on demand. Once that floor exists, the intelligence-per-dollar conversation becomes genuinely useful. You can shop for the cheapest model that clears your quality and safety bar, the same way the rest of the industry is learning to, because you’ve already fenced off the downside that makes an ungoverned “cheap” model expensive.
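The sequencing can be sketched as a two-step selection: filter on governance and the quality bar first, then minimize cost among what survives. The `Candidate` type, the `cheapest_eligible` helper, and the candidate values below are all hypothetical, used only to show the order of operations.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical deployment option; values below are invented."""
    name: str
    quality_score: float
    annual_cost_usd: float
    governance_floor_met: bool


def cheapest_eligible(candidates, quality_bar):
    """Govern first, then optimize: return the lowest-cost candidate that
    meets the governance floor and clears the quality bar, or None."""
    eligible = [c for c in candidates
                if c.governance_floor_met and c.quality_score >= quality_bar]
    return min(eligible, key=lambda c: c.annual_cost_usd, default=None)


candidates = [
    Candidate("frontier model, governed", 56, 400_000, True),
    Candidate("efficient model, governed", 55, 280_000, True),
    Candidate("efficient model, ungoverned", 55, 150_000, False),
]

choice = cheapest_eligible(candidates, quality_bar=55)
print(choice.name if choice else "no eligible deployment")
```

The ungoverned option has the lowest sticker price in this example, but it never enters the comparison. If nothing clears both filters, the function returns `None`, which is the signal to fix governance before shopping on cost.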

We’ve done this before, and we know how the story ends if we skip the governance step. The organizations that get this right in 2026 won’t be the ones with the lowest token bill. They’ll be the ones who already knew, from two decades of reimbursement reform, that an efficiency metric without a risk term is just volume wearing a new unit.

If you’re pricing governance into your AI budget for FY27, or arguing with someone who isn’t, I want to compare notes. Reply here or find me on LinkedIn.

Paul J. Swider is CEO and Chief AI Officer at RealActivity, a Microsoft Partner specializing in mission-critical AI for healthcare systems. He has 30+ years in healthcare technology, has trained over 3,000 engineers across GE, IDX, and Microsoft, and is the founder of BOSHUG, the Boston Healthcare Cloud & AI Community spanning 50+ countries.
