The Clinical App Store Has No Building Inspector.

Your health system just approved four AI vendors. Your patient just experienced them as one.

Jul 03, 2026

Those two sentences are the whole article. Everything below is the argument for why they should change how you read the most important healthcare AI story of the year, and why the company at the center of it is the one best positioned to fix what’s missing.

The platform moment

At HIMSS 2026, Microsoft announced that Dragon Copilot had evolved from an ambient documentation tool into a unified clinical AI assistant, with more than 100,000 clinicians using it daily across nine countries. The bigger story sat one layer down: health systems can now deploy partner-built AI agents from companies like Canary Speech, Humata Health, Optum, and Regard directly inside the Dragon Copilot workflow through Microsoft Marketplace. Voice biomarker analysis, prior authorization, clinical decision support, revenue cycle intelligence, all running inside one ambient session, invoked wherever the clinician’s cursor happens to be.

Let me say clearly what this gets right, because it gets a lot right. For years, the standard CIO complaint about clinical AI was point-solution sprawl: a dozen vendors, a dozen interfaces, a dozen contracts, and clinicians toggling between all of them. Microsoft consolidated that into a single workflow with a single ambient capture layer, and the market has validated the move. A January 2026 KLAS report described ambient AI as having graduated from experimental pilot to operational necessity. Microsoft even extended the platform to rural hospitals at a steep discount, putting clinical AI within reach of facilities that had been priced out entirely. As platform architecture, this is the most consequential and, frankly, the most thoughtful consolidation healthcare AI has seen.

And Microsoft has invested in governance more seriously than any platform vendor in this market. The healthcare agent service that underpins partner integrations ships with clinical, chat, and compliance safeguards, including healthcare-adapted filters that verify clinical evidence, detect hallucinations and omissions, and enforce credible sources. Every agent admitted to this ecosystem passes through more scrutiny than most standalone clinical AI deployments ever see.

Which is exactly why the remaining gap is worth naming precisely. Every agent in the marketplace passed inspection. Nobody inspected the building.

The composition problem

Walk through one plausible session. A patient speaks with their physician. A voice biomarker agent analyzes the conversation and flags markers consistent with depression. That context is available to a decision support agent, which surfaces a possible comorbidity. That output shapes documentation, which a prior authorization agent uses to pre-clear a treatment pathway before the encounter ends.

Four agents. Four vendors. Each individually vetted, individually safeguarded, and individually approved by your governance committee, if you have one. As I wrote in my last piece, only about 18 percent of health systems do. Each agent did its job correctly by its own definition.

But the patient didn’t experience four agents. The patient experienced one clinical encounter whose outcome was shaped by a sequence no one reviewed, no one approved, and no one is monitoring. The voice agent’s confidence became the decision agent’s input became the prior auth agent’s justification. If the first agent’s model drifts, the error doesn’t stay in its lane. It propagates downstream, laundered into legitimacy at each step, wearing the credibility of every agent it passes through.

Here’s what makes this an architecture question rather than speculation: Microsoft’s own engineering guidance already sees it. The Copilot Studio multi-agent documentation notes that connected agents require careful governance, and that conversation history passes between agents by default. The handoff is the product. It’s also the risk surface.

Healthcare has a name for this distinction in the physical world. When a hospital builds a new wing, it doesn’t just verify that the electrician, the plumber, and the steelworker are each licensed. Someone inspects the building, because a structure can fail in ways no individual trade caused. Marketplace admission is a permit. Nobody is yet doing the inspection.

The five questions, asked of the stack

In my last piece I proposed a governance floor: five questions any AI deployment must answer before its cost math means anything. Agent composition doesn’t just inherit those questions. It compounds them.

Traceability. Single deployment: can you trace an output to the model, prompt, and data that produced it? Composed: can you trace it across a shared context window with four authors, where vendor B’s input was vendor A’s output? Today, each vendor can show you their segment. Nobody can show you the sequence.
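To make "show me the sequence" concrete, here is a minimal sketch of a session-level trace that links each agent's output to the next agent's input. Everything in it is hypothetical: the `SessionTrace` and `Step` names, the agent and vendor labels, and the simplifying assumption that each agent consumes the previous agent's output verbatim. It is not how Dragon Copilot or any vendor records provenance.

```python
import hashlib
from dataclasses import dataclass, field


def digest(text: str) -> str:
    """Short content fingerprint used to link one step's output to the next step's input."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass
class Step:
    agent: str          # hypothetical agent label
    vendor: str         # hypothetical vendor label
    input_digest: str   # fingerprint of what the agent consumed
    output_digest: str  # fingerprint of what the agent produced


@dataclass
class SessionTrace:
    steps: list = field(default_factory=list)

    def record(self, agent: str, vendor: str, consumed: str, produced: str) -> None:
        self.steps.append(Step(agent, vendor, digest(consumed), digest(produced)))

    def lineage(self, produced: str) -> list:
        """Walk backwards from an output to every upstream step that fed it."""
        target = digest(produced)
        chain = []
        for step in reversed(self.steps):
            if step.output_digest == target:
                chain.append(step)
                target = step.input_digest
        return list(reversed(chain))


# The session from earlier in the article, with placeholder vendors.
trace = SessionTrace()
trace.record("voice_biomarker", "vendor_a", "encounter audio", "depression markers flagged")
trace.record("decision_support", "vendor_b", "depression markers flagged", "possible comorbidity")
trace.record("documentation", "vendor_c", "possible comorbidity", "draft note")
trace.record("prior_auth", "vendor_d", "draft note", "pathway pre-cleared")

for step in trace.lineage("pathway pre-cleared"):
    print(step.agent, step.vendor)
```

The point of the sketch is who holds the record. Each vendor can populate one `Step`; only something sitting above all four can answer `lineage()`.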

Human positioning. Single: is a human placed to catch the failure before it reaches a patient or a claim? Composed: the clinician reviews the final note, but the prior auth was shaped three agents upstream. The human is positioned at the end of a pipeline whose interior they cannot see.

Approval trail. Single: who authorized this workflow, when, for what scope? Composed: scope creep no longer happens by expansion. It happens by combination. You approved the voice agent for wellness screening and the decision agent for documentation support. You never approved the pair.
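A combination approval model could be as simple as the check below. This is a sketch under my own assumptions: the agent names, the idea of approving pairs rather than longer chains, and the `unapproved_pairs` helper are all illustrative, not a feature of any platform.

```python
from itertools import combinations

# Hypothetical governance records: agents approved individually,
# and the pairs a committee has explicitly approved to run together.
APPROVED_AGENTS = {"voice_screening", "decision_support", "prior_auth"}
APPROVED_COMBINATIONS = {frozenset({"decision_support", "prior_auth"})}


def unapproved_agents(session_agents):
    """Agents in the session that were never approved on their own."""
    return sorted(set(session_agents) - APPROVED_AGENTS)


def unapproved_pairs(session_agents):
    """Pairs running together in a session that nobody approved as a pair."""
    pairs = combinations(sorted(set(session_agents)), 2)
    return [pair for pair in pairs if frozenset(pair) not in APPROVED_COMBINATIONS]


session = ["voice_screening", "decision_support"]
print(unapproved_agents(session))  # every agent is individually approved
print(unapproved_pairs(session))   # but the pair was never reviewed
```

Both agents clear the individual check, and the session still surfaces a pair that no one signed off on. That is scope creep by combination, caught at the level where it happens.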

Drift monitoring. Single: would you know within days if behavior changed? Composed: drift in one agent surfaces as puzzling behavior in another. Your evaluation set tests each agent alone, and every agent passes while the stack degrades.
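A back-of-envelope sketch shows how every agent can pass while the stack fails. The accuracy figures and thresholds are invented for illustration, and the model assumes agent errors are independent and that the stack is only right when every agent in the chain is right. Real stacks will not be that tidy, but the direction of the arithmetic holds.

```python
import math

PER_AGENT_FLOOR = 0.95  # illustrative per-agent acceptance threshold
STACK_FLOOR = 0.85      # illustrative end-to-end acceptance threshold


def evaluate(accuracies):
    """Return (all agents pass alone, end-to-end accuracy, stack passes)."""
    per_agent_ok = all(a >= PER_AGENT_FLOOR for a in accuracies.values())
    # Independence assumption: end-to-end accuracy is the product of the parts.
    stack_accuracy = math.prod(accuracies.values())
    return per_agent_ok, stack_accuracy, stack_accuracy >= STACK_FLOOR


baseline = {"voice": 0.97, "decision": 0.97, "documentation": 0.97, "prior_auth": 0.97}
drifted = {"voice": 0.95, "decision": 0.96, "documentation": 0.96, "prior_auth": 0.96}

for label, accuracies in (("baseline", baseline), ("drifted", drifted)):
    per_agent_ok, stack_accuracy, stack_ok = evaluate(accuracies)
    print(f"{label}: per-agent pass={per_agent_ok}, stack={stack_accuracy:.3f}, stack pass={stack_ok}")
```

In the drifted case each agent still clears its own floor, yet the product falls to roughly 0.84, below the stack floor. An evaluation set that only tests agents alone never sees that number.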

Incident path. Single: does a flag have an owner and a clock? Composed: a flagged output has four vendors, each of whom can plausibly and even correctly say their component performed as designed. Accountability without an owner isn’t accountability. It’s a routing problem wearing accountability’s badge.

None of this is an argument against composition. Composition is where the clinical value lives; a prior auth agent that can’t see clinical context is exactly the dumb automation we’re trying to leave behind. It’s an argument that the unit of governance has to move up a level, from the agent to the stack, because that’s where the risk moved.

The inspector’s job is open

So who inspects the building? Today, by default, the health system owns composition risk, because liability follows the license and the patient relationship, and because no one else has claimed the job. Most health systems don’t know they own it. That should change this budget cycle: any organization deploying stacked agents should be asking vendors, and their platform, for stack-level answers to those five questions, not per-agent attestations.

But the honest answer is that the platform is best positioned to do the inspecting, and this is where I’d rather hand Microsoft a compliment and a challenge than a critique. Microsoft already built the hard parts. Agent identities exist. Per-agent safeguards exist. A healthcare-adapted orchestrator, already being explored for multi-agent tumor board workflows at institutions like Stanford, Johns Hopkins, and Mass General Brigham, exists. The distance from there to composition-level governance (session-level audit trails that cross vendor boundaries, stack-level drift monitoring, a combination approval model) is shorter for Microsoft than for anyone else in the market. The company that turned clinical AI into a platform is the natural company to turn composition governance into a platform feature.

And there’s a business case, not just a safety case. The first platform that lets a CIO answer all five questions at the stack level hasn’t just reduced risk. It has built the feature every governed health system will eventually be required to buy. Inspection isn’t a tax on the app store. It’s the moat.

The through-line from my last piece holds. Governed intelligence per dollar is still the metric that matters; composition just changes the arithmetic. Risk in a stack isn’t additive, it’s multiplicative, and the denominator knows it even when the dashboard doesn’t. Govern the composition first, then optimize everything inside it.

If you’re running stacked agents in production today, or deciding this quarter whether to, I want to hear how you’re answering the five questions at the stack level. Reply here or find me on LinkedIn.


Paul J. Swider is CEO and Chief AI Officer at RealActivity, a Microsoft Partner specializing in mission-critical AI for healthcare systems. He has 30+ years in healthcare technology, has trained over 3,000 engineers across GE, IDX, and Microsoft, and is the founder of BOSHUG, the Boston Healthcare Cloud & AI Community spanning 50+ countries.
