Who Insures the Machine?

February 9, 2026

In February 2024, Air Canada found itself in a small claims tribunal over $812. A grieving passenger named Jake Moffatt had used the airline's website chatbot to ask about bereavement fares. The chatbot told him he could book a flight at full price and then retroactively apply for a bereavement discount within 90 days. It wasn't true: Air Canada's actual policy didn't allow retroactive applications, and when Moffatt tried to claim the discount after his flight, the airline refused.

When Moffatt disputed this, Air Canada argued, with apparent sincerity, that its chatbot was "a separate legal entity that is responsible for its own actions." The tribunal rejected this as absurd, ruling that Air Canada was obviously responsible for all information on its website, whether it came from a static page or a chatbot. The company was ordered to pay the refund.

Stories of chatbot hallucinations are common. But this case exposes something deeper than a customer service mishap: a structural gap in how we price and manage the risk of AI systems. The $812 was a rounding error for Air Canada. The principle that companies cannot outsource accountability to their tools will compound as AI systems become more capable and more autonomous. And if companies are liable for their AI's mistakes, someone will need to insure that liability. The question is whether anyone can price it accurately.

I think about this question a lot. I've spent the last four years building AI systems for healthcare and have helped hundreds of doctors automate medical documentation. My dad is a neurosurgeon, and my mum is a radiologist, so I grew up with differential diagnoses and discectomies over the dinner table and, as a result, have a visceral sense of how high the stakes are when a mistake is made. A hallucinated response in a patient chart is dramatically different from a chatbot quoting the wrong bereavement policy.

I believe AI insurance is going to be systematically mispriced for years, probably in both directions simultaneously. Insurers will underestimate AI's reliability in general use, overpricing coverage for the mundane, high-frequency errors, while underestimating the catastrophic tail risks that arise when AI systems fail in unexpected ways. This mispricing will be driven by technical opacity, the fundamental difficulty of evaluating AI performance, and a market that lacks the institutional knowledge to tell the difference between a robust system and a fragile one. In regulated industries like healthcare and law, where the consequences of getting it wrong are most severe, this dynamic will act as a drag on adoption, not because the technology isn't ready, but because the insurance infrastructure isn't.

The evaluation problem

To insure something, you need to know how often it goes wrong and how badly. Car insurers have decades of actuarial data. They segment crash rates by age, geography, and vehicle type. Medical malpractice insurers can draw on historical patterns of clinical error. When it comes to AI systems, we lack the data required to make a risk assessment. And even if we collect it, the rapid pace of AI progress means it is likely to be rendered obsolete by the next model release or agentic framework.

The first issue is benchmarks. A benchmark is a standardised test used to measure how well an AI system performs. Think of it as an exam. The problem is that models are judged largely on their benchmark scores, so model developers have a huge incentive to train on the benchmarks directly, or on tasks closely resembling them, producing score improvements even when general capability doesn't advance. The AI community calls this "benchmaxing": optimising for the test rather than for the underlying capability the test is meant to measure. As an insurance provider trying to assess risk, you can't rely on a model's published benchmark scores, because those scores may not reflect real-world performance.

The second issue is that the real world is far messier than any benchmark. Consider the difference between a closed-loop medical device like an insulin pump that adjusts glucose delivery based on a sensor reading, and an AI agent making diagnostic suggestions in a hospital. The insulin pump operates in a constrained environment with relatively clean data and well-defined outcomes. The diagnostic agent encounters an enormous long tail of possible symptom combinations, patient histories, and contextual factors. Traditional insurance frameworks were built for the former kind of system. The latter is harder to evaluate, not because the maths is fundamentally different, but because the variance is much greater and the failure modes are less predictable.

The third issue is pace. Models improve, but they don't become risk-free. Often, they develop new, subtler failure modes that are harder to measure precisely because the error rates are smaller. A system that hallucinates five percent of the time is easy to spot in testing. A system that hallucinates 0.3 percent of the time is much harder to catch, but at scale, that 0.3 percent can still produce significant harm.
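To see why smaller error rates are so much harder to catch, consider how many independent samples you need before a failure is even likely to show up in testing. This is a simplified sketch that assumes errors are independent and identically distributed, which real deployments rarely guarantee:

```python
import math

def samples_to_observe_failure(error_rate: float, confidence: float = 0.95) -> int:
    """Number of independent samples needed to observe at least one
    failure with the given confidence, assuming i.i.d. errors."""
    # P(no failure in n samples) = (1 - p)^n; solve (1 - p)^n <= 1 - confidence
    return math.ceil(math.log(1 - confidence) / math.log(1 - error_rate))

print(samples_to_observe_failure(0.05))   # 5% hallucination rate  -> 59
print(samples_to_observe_failure(0.003))  # 0.3% hallucination rate -> 998
```

Going from a five percent to a 0.3 percent error rate multiplies the testing burden roughly seventeenfold just to see one failure, and characterising the failure mode takes far more than one observation.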

These three problems (gamed benchmarks, messy deployment environments, and rapid model evolution) compound to create a situation where even well-intentioned insurers are working with unreliable signals. Unreliable signals lead to unreliable pricing.

The mispricing thesis

I believe AI insurance will be systematically mispriced, and, importantly, not in just one direction.

In the general case, insurers are likely to overestimate the risk. AI systems that handle routine, well-scoped tasks (processing standard documents, answering common customer queries, generating boilerplate code) tend to perform reliably and consistently. But because insurers lack the technical literacy to distinguish between a well-constrained system and a general-purpose one, they're likely to apply blanket risk premiums that don't reflect how good the technology actually is at the median task. This effectively taxes responsible AI deployment, making it more expensive than it needs to be and slowing adoption.

At the tails, insurers are likely to underestimate the risk. The failure modes of AI systems are not normally distributed. They cluster in unpredictable ways, especially at the edges of the system's training distribution. A coding assistant might ship production-grade code 98 percent of the time, but the remaining two percent isn't just slightly wrong code. It could be code that deletes production data, introduces a security vulnerability, or silently corrupts a database in ways that aren't discovered for months. The distance between the median failure and the worst-case failure is much larger for AI systems than for most insured risks, and that fat tail is precisely what's hardest to model with limited data.

To put it in insurance terms: the expected loss is probably lower than most insurers think, but the variance around that expectation is probably higher. This is a dangerous combination. It means insurers will charge too much for low-risk deployments (driving away good customers) and too little for high-risk ones (accumulating exposure they haven't priced for). The technical opacity of AI systems makes it difficult for underwriters to tell which category a given deployment falls into, and the current generation of evaluation tools isn't good enough to close that gap reliably.
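A toy Monte Carlo makes the shape of this problem concrete. The numbers below are invented for illustration, not calibrated to any real book of business: frequent cheap errors set the expected loss, while a rare catastrophic failure mode sets the tail.

```python
import random
import statistics

random.seed(42)

def annual_loss(tasks: int = 20_000) -> float:
    """One simulated year of losses for a single insured AI deployment."""
    loss = 0.0
    for _ in range(tasks):
        if random.random() < 0.01:       # routine error on ~1% of tasks
            loss += 200                   # small remediation cost
        if random.random() < 0.00002:     # rare catastrophic failure
            loss += 5_000_000             # e.g. silently corrupted data
    return loss

losses = sorted(annual_loss() for _ in range(100))
print(f"mean annual loss: {statistics.mean(losses):>12,.0f}")
print(f"99th percentile:  {losses[98]:>12,.0f}")
```

With a distribution shaped like this, the mean is a poor guide to the capital an insurer needs to hold; the upper percentiles carry the story, and they are exactly the part that sparse deployment data pins down worst.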

What better evaluation could look like

If mispricing is the core problem, then better evaluation is the core solution. The question is what that actually looks like in practice.

I think the most promising current paradigm combines automated verification with AI-as-judge evaluation, and the balance between these two depends on the domain. In some cases, you can design tests that autonomously verify whether a given AI action was good or bad. In software engineering, for instance, you can measure whether AI-generated code passes a test suite, whether it introduces regressions, whether pull requests get merged without modification, or whether it triggers errors in production monitoring tools like Sentry. These are deterministic, instrumentable signals. In other domains, such as legal analysis, medical decision-making, and creative work, the outputs aren't so easily reduced to pass/fail. This is where AI-as-judge becomes necessary: using a separate AI system (or a human expert, or both) to evaluate the quality of the primary system's outputs against a structured rubric.

The rubric is where it gets interesting. One of the most effective approaches I've seen is to anthropomorphise the evaluation. Rather than trying to specify every possible failure mode in advance, you ask: how would you evaluate a junior professional doing this same task? If you're evaluating an AI medical scribe, you design a rubric that mirrors how a senior clinician would review a junior doctor's notes. Is the clinical terminology accurate? Are symptoms correctly attributed to the patient's history? Is the assessment logically consistent with the documented findings? Are relevant negatives noted? Is the plan appropriate given the diagnosis? This framing is transferable across domains. For legal AI, you evaluate the way a senior partner would review a junior associate's research memo. For financial AI, the way a portfolio manager would review an analyst's recommendation. The advantage of this approach is that it draws on existing professional standards, rubrics that domain experts already carry in their heads, rather than trying to invent entirely new evaluation frameworks for AI.
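As a sketch of what such a rubric might look like in code: the criteria, weights, and scoring stubs below are hypothetical, and in practice each score would come from an AI-judge call or a human expert rather than a keyword check.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One dimension of a 'junior professional' review rubric."""
    name: str
    weight: float
    score: Callable[[str], float]  # returns 0.0-1.0 for a given note

def contains_section(keyword: str) -> Callable[[str], float]:
    """Stub scorer: stands in for an AI judge or expert reviewer."""
    return lambda note: 1.0 if keyword.lower() in note.lower() else 0.0

# Hypothetical rubric mirroring how a senior clinician reviews
# a junior doctor's notes.
rubric = [
    Criterion("terminology accuracy", 0.3, contains_section("assessment")),
    Criterion("history attribution", 0.3, contains_section("history")),
    Criterion("plan appropriateness", 0.4, contains_section("plan")),
]

def judge(note: str, rubric: list[Criterion]) -> float:
    """Weighted rubric score for a single AI-generated note."""
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * c.score(note) for c in rubric) / total_weight

note = "History: back pain x2 weeks. Assessment: sciatica. Plan: MRI lumbar spine."
print(round(judge(note, rubric), 2))  # -> 1.0
```

The structure is what transfers across domains: swap in a senior partner's criteria for legal memos or a portfolio manager's for analyst recommendations, and the aggregation machinery stays the same.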

Of course, these rubrics have limitations. They capture the known dimensions of quality, but AI systems can fail in ways that no junior professional ever would: hallucinating case citations that don't exist, confidently mixing up two patients' histories, producing outputs that are internally consistent but detached from reality. Any evaluation framework needs adversarial testing specifically designed to probe for AI-specific failure modes.

The ultimate goal, I think, is something closer to how autonomous vehicles are tested at scale: massive deployment with continuous, granular monitoring across a wide range of conditions, building a statistical picture of reliability that is both deep and current. We're a long way from that for most AI applications, but the trajectory points in that direction.

What's already happening

Despite the difficulty of the problem, companies are already building in this space.

Munich Re, one of the world's largest reinsurers, has been writing AI performance insurance since 2018, notably before ChatGPT and the current wave of public attention. Their product, aiSure, offers insurance-backed performance guarantees for AI systems. Their technical team evaluates an AI model's robustness, quantifies the probability and severity of underperformance, and prices a premium accordingly. If the AI then fails to perform as promised, the policy pays out. What's notable about their approach is that it's architecture-agnostic: they don't need to understand every detail of how the model works internally, they need to understand how reliably it performs. Their head of AI insurance, Michael Berger, has described the premium calculation as being based on the robustness of the AI model, which is essentially a statistical assessment of performance variance.

On the Lloyd's of London side, a Y Combinator-backed startup called Armilla launched AI liability policies in 2025, covering businesses facing legal claims from customers or third parties harmed by underperforming AI. Their model has an interesting feature: payouts are contingent on whether the AI's performance significantly declines below initial expectations. If a chatbot's accuracy drops from 95 percent to 85 percent, the policy compensates for the shortfall. This is essentially a formalised version of post-deployment monitoring with an insurance wrapper. There's also Testudo, another Lloyd's-backed startup, which aggregates real-time data on AI-related litigation, regulatory developments, and incidents to inform underwriting. Their data paints a stark picture: in the first quarter of 2025, generative AI lawsuits increased 23 percent year-over-year, and filings from January through April surged 81 percent, with average settlements running around $4 million.
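The mechanics of such a performance-decline trigger are simple to sketch. The formula below is purely illustrative (Armilla's actual contract terms are not public in this detail): it pays out in proportion to the accuracy shortfall below a contracted baseline, after a small deductible band absorbs ordinary noise.

```python
def shortfall_payout(baseline_accuracy: float,
                     observed_accuracy: float,
                     insured_value: float,
                     deductible: float = 0.02) -> float:
    """Illustrative payout rule: compensate in proportion to the
    accuracy shortfall below the contracted baseline, after a
    deductible band that absorbs normal measurement noise."""
    shortfall = baseline_accuracy - observed_accuracy
    if shortfall <= deductible:
        return 0.0
    return (shortfall - deductible) / baseline_accuracy * insured_value

# A chatbot contracted at 95% accuracy that degrades to 85%:
print(round(shortfall_payout(0.95, 0.85, 1_000_000)))
```

Even a rule this crude shows why the insurer's real exposure lives in the monitoring: the payout is only as trustworthy as the accuracy measurement feeding it.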

These early players are doing important work, but they're all, to varying degrees, constrained by the same evaluation problem. The question of whether an AI system is "robust" depends entirely on the quality of the testing used to assess it, and as I've argued, that testing is not yet reliable enough to support confident pricing. Munich Re's approach, treating AI as a probabilistic system and drawing on their reinsurance modelling expertise, is the most sophisticated, but even they are working with limited deployment data in a fast-moving landscape.

The liability question

Air Canada's chatbot gaffe was small, but the underlying legal question scales. When an AI system causes harm, where does liability sit?

The current answer, in most jurisdictions, is relatively simple: the company that deploys the AI is responsible for its outputs, just as it would be responsible for the actions of an employee or the accuracy of information on its website. The tribunal in the Air Canada case was explicit about this. But things get more complicated as you move up the value chain. If a law firm deploys an AI system to assist with legal research and the AI hallucinates a case citation, the firm is clearly liable to its client. But does the firm have recourse against the AI vendor? Did the vendor's disclaimers adequately communicate the risk? Was the model's performance within the range the vendor promised? These questions are already generating real litigation.

The EU is building the most comprehensive regulatory framework here. The new Product Liability Directive, set for implementation by December 2026, classifies standalone and integrated AI tools as "products" under a strict-liability regime, meaning victims don't need to prove fault, only that the product was defective and caused damage. The proposed AI Liability Directive would create a "presumption of causality," making it easier for people harmed by AI to establish that the system's output caused their damage. Recent discussions have even considered extending liability along the entire AI value chain, from the foundation model provider through the deployer to the end user.

These frameworks matter for insurance because they define what can be claimed and by whom. Strict liability regimes generate more claims than fault-based ones. The more claims are filed, the bigger the insurance market becomes, and the more important accurate risk pricing becomes. Which brings us back to the evaluation problem.

Healthcare: the hard case

If you want to understand why AI insurance is so difficult to get right, healthcare is the most instructive domain. The stakes are highest, the feedback loops are weakest, and the structural barriers to evaluation are most severe.

The core challenge is closing the loop. In software engineering, you can instrument nearly every stage of the development pipeline. Code lives on GitHub, tasks live on Linear, communication lives on Slack. A post-deployment monitoring tool can combine these data sources to analyse whether a change was good or bad. You can measure pull request merge rates, error rates in production, the frequency of rollbacks. The feedback loop is tight, deterministic, and well-instrumented.

Healthcare is different. Consider an AI system that recommends a treatment plan. Did the recommendation lead to a good outcome? To answer that, you need to track what treatment was actually administered, how the patient responded, whether complications arose, and how those complications were managed. In a well-integrated health system with a modern electronic health record, this might be feasible. But in many healthcare settings (and this is the majority globally) IT systems are fragmented, interoperability is limited, and documentation practices are inconsistent. The treatment outcome might be recorded in a different system than the one where the AI made its suggestion. Follow-up assessments might not be documented at all. If the patient was discussed in a multidisciplinary team meeting, the details of that discussion are often poorly filed, with limited linkage to the patient record. The data plumbing simply isn't there to close the loop.

This matters because outcome-based insurance depends on the ability to measure outcomes. If you can't reliably track whether an AI's recommendation led to a good or bad result, you can't build the feedback loop that outcome-based pricing requires. The implication is that the scope of insurable AI in healthcare will, at least for now, be constrained by the scope of observability. An AI system that handles a narrow, well-instrumented task, such as processing radiology reports within a single integrated system, can be meaningfully monitored and therefore meaningfully insured. A general-purpose AI diagnostician operating across fragmented systems with patchy documentation cannot, at least not with the same confidence.

There's also the human-in-the-loop question. Most healthcare AI systems are deployed with a clinician reviewing and approving the AI's outputs. When a doctor reviews and signs off on AI-generated clinical documentation, does that transfer liability to the doctor? And if so, does it make the AI system paradoxically harder to insure, because the insurer might argue that the human checkpoint, not the AI, is the proximate cause of any harm? In practice, I think outcome-based frameworks cut through this: what matters is the rate of adverse outcomes in the combined human-AI system, regardless of where you attribute the individual decision. But the legal and actuarial frameworks for thinking about shared human-AI liability are still immature, and their development will significantly shape how healthcare AI is insured.

The case for outcome-based insurance

Given all of this (the evaluation challenges, the mispricing risk, the structural barriers in domains like healthcare) I think the most promising approach to AI insurance is one that is resolutely focused on outcomes rather than mechanisms. We shouldn't try to understand every layer of the model's architecture or every decision in its training pipeline. We should measure what matters: the rate and severity of adverse events in real-world deployment.

This is, in some ways, analogous to the black box in a young driver's car. You fit a monitoring device, set some parameters (don't exceed the speed limit, don't drive between midnight and 5am), and then adjust premiums based on actual driving behaviour. The insurer doesn't need to understand the biomechanics of the driver's reaction times or the engineering of the car's braking system. They just need the data on outcomes.

For AI, this means setting an initial premium based on pre-deployment evaluation, a structured set of scenarios designed to stress-test the system including adversarial probes for AI-specific failure modes, and then adjusting that premium based on post-deployment performance data. If the system maintains its accuracy across different demographics, use cases, and edge cases, the premium stays stable or decreases. If performance degrades, the premium adjusts upward.
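Concretely, the premium update might look something like this. The rule, cap, and rates are hypothetical choices for illustration, not any insurer's actual formula:

```python
def adjust_premium(base_premium: float,
                   expected_adverse_rate: float,
                   observed_adverse_rate: float,
                   max_swing: float = 0.5) -> float:
    """Illustrative outcome-based premium update: scale the premium by
    observed vs expected adverse-event rates, capped so one noisy
    monitoring period can't move the price more than +/-50%."""
    ratio = observed_adverse_rate / expected_adverse_rate
    ratio = max(1 - max_swing, min(1 + max_swing, ratio))
    return base_premium * ratio

print(adjust_premium(10_000, 0.01, 0.004))  # better than expected -> 5000.0
print(adjust_premium(10_000, 0.01, 0.03))   # degradation detected -> 15000.0
```

The cap matters: adverse-event counts are noisy at low rates, so an uncapped multiplicative rule would whipsaw premiums on small samples rather than tracking genuine drift.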

This approach is architecture-agnostic, which means it scales across different types of AI systems. It's grounded in real data rather than theoretical risk models. And it creates a natural incentive structure: companies that invest in monitoring, testing, and maintaining their AI systems get rewarded with lower premiums. Munich Re's Michael von Gablenz has described this dynamic as insurance functioning as "soft regulation," where the premium price becomes a market signal about model quality. A company whose AI carries a low insurance premium is, in effect, carrying a third-party endorsement of its reliability, a potentially significant commercial differentiator in high-stakes domains where clients are cautious about trusting AI.

But outcome-based insurance is only as good as the outcome data. In well-instrumented domains like software engineering, where feedback loops are tight and signals are clear, the model works well. In under-instrumented domains like healthcare, it works only to the extent that observability infrastructure allows. This means the development of AI insurance and the development of health IT infrastructure are, in a sense, the same problem. You can't properly insure what you can't observe.

Silent AI: the hidden risk

There's one more dimension to this that I find particularly important. Swiss Re has introduced the concept of "silent AI," a direct parallel to the "silent cyber" problem that cost the insurance industry billions in unexpected payouts over the past decade.

The issue is this: many existing insurance policies (professional indemnity, product liability, general commercial) were written before AI was widely deployed. Their language may not explicitly include or exclude AI-related risks. This means that when an AI system causes harm, the resulting claim might be covered under a policy that was never designed or priced to handle it. The insurer gets hit with a loss they didn't anticipate, from a risk they didn't model.

With silent cyber, the industry learned this lesson the hard way: cyber incidents spanning traditional policy boundaries created accumulation risk that blindsided insurers who hadn't proactively assessed their exposure. AI incidents could do exactly the same, spanning cyber, liability, and business interruption lines in ways that no single policy was designed to cover. Insurers who don't proactively audit which AI risks their existing policies silently cover may find themselves in the same position.

This is the other side of the AI insurance story. It's not just about creating new products for new risks. It's about understanding how existing products interact with a technology that is being embedded into virtually every industry.

Where this is headed

The most useful frame for thinking about where AI insurance goes from here is the trajectory of cyber insurance. Twenty years ago, cyber risk was exotic, a niche concern for a small number of technology companies. Today, cyber insurance is a mainstream product that most businesses carry as standard. Munich Re explicitly draws this parallel, forecasting that AI insurance will follow the same path.

The similarities are striking. In the early days of cyber insurance, the industry struggled with the same problems: limited historical data, rapidly evolving threats, difficulty modelling tail risks, and a tendency for losses to correlate in ways that traditional actuarial models didn't capture. What made cyber insurance work, eventually, was a combination of better data, standardised frameworks, and market pressure. The same dynamic seems likely for AI, though the timeline is uncertain.

I want to be clear that predicting the future of a market this young requires significant humility. But a few things seem likely to me.

The insurers who win in this space will be software companies as much as they are insurance companies. They'll invest heavily in pre-deployment evaluation and post-deployment monitoring, building or acquiring sophisticated testing frameworks that can keep pace with the technology. The traditional insurance model of assess-once-and-renew-annually is too slow for a technology that can drift in weeks.

Market forces will drive adoption faster than regulation, at least in the near term. The calculus for a company deploying AI is straightforward: if the adverse events are expensive enough and frequent enough, you buy insurance. As AI systems take on higher-stakes tasks (making medical decisions, providing legal advice, managing financial portfolios) the cost of getting it wrong will push more companies toward coverage. Reputable firms will opt into insurance and performance guarantees not because they're required to, but because it signals trustworthiness to clients.

And the evaluation problem, how to accurately assess the rate at which an AI system produces adverse outcomes, will remain the core intellectual challenge. It's not solved. Benchmarks are gamed, real-world performance data is sparse and often proprietary, and the landscape changes fast enough that evaluations have a short shelf life. But this is exactly why the opportunity is so large. Whoever solves evaluation well enough to price premiums accurately will have an enormous competitive advantage, and in doing so, will help build the infrastructure of trust that allows AI to be deployed responsibly in the domains where it matters most.

Ultimately, the $812 that Air Canada paid Jake Moffatt was insignificant, but the precedent it set will have implications for the coming decades as thousands of similar lawsuits unfold. Companies are liable for their tools, whether those tools are human or algorithmic. As AI systems grow more capable and more autonomous, the question of who insures the machine becomes the question of how much we trust it, and how honestly we reckon with what happens when that trust is misplaced.
