When AI Hallucination Reached a Federal Courtroom
- Daniel Perkins
- Apr 1
AI Hallucination | Confabulation Risk | Human Oversight Failure
A generative AI tool produced fictitious legal citations. An attorney submitted them to a federal judge. When questioned, he asked the AI to verify them. It said they were real. They were not.
At a Glance
Case | Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y. 2023) |
Attorneys | Steven A. Schwartz and Peter LoDuca, Levidow, Levidow & Oberman P.C. |
AI Tool Used | ChatGPT (OpenAI) |
Year | Filed 2022. Sanctions issued June 22, 2023. |
The Issue | ChatGPT generated six or more completely fictitious legal cases, which were submitted to a federal court as real legal precedent. |
The Compounding Error | When the court questioned the citations, Schwartz asked ChatGPT to confirm they were real. ChatGPT said yes. |
Outcome | $5,000 penalty, jointly and severally imposed. Mandatory letters of notice to the plaintiff and to each real judge whose name was falsely attached to a fabricated opinion. The court declined to compel an apology, reasoning that a forced apology is not a sincere one. |
Primary Risk | Confabulation (NIST AI 600-1, Section 2.2) - confidently stated, entirely false output. |
Governance Failure | No acceptable use policy. No verification process. AI treated as an authoritative source. |
Who Should Care | Every executive deploying AI in any professional, legal, compliance, or consequential decision context. |
The Facts
Roberto Mata filed a personal injury lawsuit in New York state court in February 2022 against Avianca, a Latin American airline, claiming he was struck in the knee by a metal serving cart during an international flight. A routine case. The kind that gets filed every day.
The case moved to federal court after Avianca removed it on jurisdictional grounds related to the Montreal Convention, an international treaty governing liability for international air travel. Avianca then moved to dismiss, arguing the claim was time-barred under the Convention's two-year statute of limitations. Mata's legal team needed case law to argue against that dismissal.
That's where Steven Schwartz, the attorney handling the substantive work, turned to ChatGPT. Schwartz later described his reasoning: he couldn't access certain legal databases his firm used, and he thought of ChatGPT as a "super search engine." He asked it to find cases supporting the legal argument he needed. ChatGPT complied, producing a list of cases complete with citations, docket numbers, party names, quoted holdings, and internal cross-references. Everything a real legal citation looks like. None of those cases existed.
How ChatGPT produces fake citations
The ChatGPT instance used in this case was not a legal research platform. It was not independently checking Westlaw, LexisNexis, PACER, or any court docket. It generated text that looked like legal research because legal citations have recognizable patterns. That is the danger: the output had the form of legal authority without the substance of legal authority.
NIST AI 600-1 calls this confabulation: the production of confidently stated but erroneous or false content. ChatGPT didn't flag uncertainty. It produced its fake citations in exactly the same tone it uses when it produces accurate information. There is no visual, tonal, or structural difference between a hallucinated ChatGPT response and an accurate one. That's the risk.
Schwartz submitted the fake cases in his opposition brief. Avianca's attorneys, reviewing the filing, couldn't locate several of the cited cases. They flagged this to the court. The court couldn't locate them either and ordered Mata's attorneys to produce copies of the cited decisions.
Here is where the situation became significantly worse. Rather than immediately investigate or disclose the problem, Schwartz went back to ChatGPT and asked it directly: are these cases real? ChatGPT confirmed they were real and could be found on reputable legal databases, including LexisNexis and Westlaw. Schwartz then submitted copies of the "decisions" to the court — documents that ChatGPT had also fabricated. At no point did anyone independently verify the cases against an actual legal database before submitting them to a federal judge.
The compounding error
The single most important detail in this case is not that ChatGPT hallucinated. That's a known, documented behavior. The compounding error was what happened next: the attorneys asked ChatGPT to verify its own output and treated that confirmation as authoritative.
This is the equivalent of asking a witness if they're telling the truth and accepting "yes" as evidence. ChatGPT cannot verify its own outputs. It has no external connection to check against. When it confirmed the cases were real, it was producing another confabulation — a confident, plausible-sounding response with no basis in fact. Using an AI tool to verify an AI tool's output, without any independent human check against actual sources, is not a verification process. It's a loop.
When Judge P. Kevin Castel reviewed the submitted "opinions," he found that one of the fake cases itself cited other cases that also didn't exist. He had the clerk of the U.S. Court of Appeals for the Eleventh Circuit confirm that no case involving the party name "Varghese" had been filed since 2010. The docket number attributed to Varghese actually belonged to a completely different, real case involving an unrelated party. Judge Castel described one of the legal analyses as "gibberish."
On June 22, 2023, Judge Castel issued his sanctions order. He found that the attorneys had acted with "subjective bad faith," had continued to stand by the fake opinions even after the court raised questions, and had failed to disclose the full truth until faced with formal sanctions proceedings. The court noted explicitly that there is "nothing inherently improper about using a reliable artificial intelligence tool for assistance." The sanctions were not about using AI. They were about abandoning professional responsibility.
The attorneys and their firm were required to pay a $5,000 penalty. They were required to send the sanctions order and a transcript of the hearing to their own client. They were required to send letters to the real judges whose names had been falsely attached to fabricated opinions, letting those judges know that their professional identity had been used to give credibility to something they never wrote.
The AI Risk at the Center: Confabulation
Framework source: NIST AI 600-1, Section 2.2
NIST AI 600-1 identifies confabulation as one of twelve risks unique to or exacerbated by generative AI. The formal definition is the production of confidently stated but erroneous or false content by which users may be misled or deceived. Three things in that definition matter here.
First: confidently stated. ChatGPT didn't hedge. It didn't say it was unsure or that the citations should be checked. It produced what looked exactly like correct legal citations. Confidence is not the same as accuracy in a generative AI system, but users who don't understand how these models work have no way to tell the difference from the output alone.
Second: by which users may be misled. This isn't the model being malicious. It's the natural result of how generative models work. They predict statistically plausible outputs based on training data. A legal citation is a specific, structured format. The model learned that format. When asked to produce a legal citation, it generates something that looks right, whether or not it is.
Third: context. Confabulation is especially dangerous in healthcare, legal, and other consequential decision-making contexts. NIST AI 600-1 also notes that generative AI outputs may include confabulated logic or citations that purport to justify or explain the system's answer, further misleading humans into inappropriately trusting the output. That is exactly what happened here. The fake cases had internal citations. Those internal citations cited other fake cases. The fabrication was layered, structured, and internally consistent, which made it look legitimate to someone who wasn't independently verifying it.
Where Human Oversight Failed
Framework source: NIST AI 100-1 (GOVERN Function); NIST AI 600-1 (Human-AI Configuration)
This case is not primarily about ChatGPT failing. ChatGPT did exactly what it was designed to do: generate plausible text in response to prompts. The case is about the complete absence of human oversight at every point where oversight should have existed.
Oversight Failure 1 — No Tool Assessment Before Use
The attorneys deployed ChatGPT for legal research without understanding what ChatGPT is and isn't capable of. Schwartz described it as a "super search engine." It isn't a search engine at all. It doesn't retrieve information from databases. It generates text. That distinction, retrieval versus generation, is the most important thing anyone needs to understand before using a generative AI tool for any professional purpose. Nobody asked that question before using the tool in federal court filings.
Oversight Failure 2 — No Independent Verification Process
Every professional context that involves consequential outputs needs a verification step that is independent of the tool that produced those outputs. In legal research, that means checking every citation against Westlaw, LexisNexis, or the actual court's docket before filing it with a judge. The attorneys skipped this step entirely. When they needed to verify the citations, they asked ChatGPT to verify them. That's not independent verification. That's a closed loop between the tool and itself.
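What "independent" means here can be made concrete with a small sketch. It is only an illustration under stated assumptions: the names `Citation`, `verify_citations`, and `docket_lookup` are hypothetical, and the in-memory set stands in for a real check against Westlaw, LexisNexis, or a court docket. It does not reflect any vendor's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Citation:
    case_name: str
    reporter_cite: str
    source_tool: str  # the tool that produced the citation, e.g. "chatgpt"


def verify_citations(
    citations: Iterable[Citation],
    lookup: Callable[[Citation], bool],
    lookup_name: str,
) -> list[Citation]:
    """Return the citations that the independent lookup could not confirm."""
    unverified = []
    for c in citations:
        # A tool cannot vouch for its own output: that is the closed loop.
        if lookup_name == c.source_tool:
            raise ValueError(f"{lookup_name} cannot verify its own output")
        if not lookup(c):
            unverified.append(c)
    return unverified


# Hypothetical stand-in for a primary-source database (placeholder data only).
KNOWN_CASES = {("Placeholder A v. B", "cite-001")}


def docket_lookup(c: Citation) -> bool:
    return (c.case_name, c.reporter_cite) in KNOWN_CASES


draft = [
    Citation("Placeholder A v. B", "cite-001", "chatgpt"),
    Citation("Placeholder C v. D", "cite-999", "chatgpt"),
]
# Filing should be blocked while this list is non-empty.
blocked = verify_citations(draft, docket_lookup, lookup_name="docket")
```

The design choice worth noting is the explicit refusal when the verifier and the generator are the same tool. The rest is an ordinary lookup.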
Oversight Failure 3 — Continued Inaction After Warning
When Avianca's attorneys flagged that the cases couldn't be located, that was a warning signal that should have triggered immediate investigation and disclosure. Instead, the attorneys submitted fabricated documents generated by the same tool that fabricated the original citations, and did not disclose the role of ChatGPT until formal sanctions proceedings were underway. The court found this sequence particularly telling. The error compounded because no one was exercising professional judgment over the AI's output at any point in the chain.
NIST AI RMF Analysis
Framework source: NIST AI 100-1
GOVERN — Failed Before the Tool Was Opened
What should have existed: an acceptable use policy defining which AI tools are permitted for which professional tasks, clear categorization of legal research as a high-stakes function requiring verification protocols, named accountability for the accuracy of AI-assisted work product, and training on the capabilities and limitations of generative AI tools before professional use. What existed: no policy, no categorization, no protocol, and one attorney's intuition that a text generation tool worked like a database search. GOVERN is foundational. Without it, the other three functions have nothing to land on.
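An acceptable use policy of this kind can be small enough to write down as a table. The sketch below is one hypothetical way to encode it. The tool types, task names, roles, and the `check_use` function are all illustrative assumptions, not drawn from NIST or from any real firm's policy.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Rule:
    permitted: bool
    risk: Risk
    verification_required: bool
    accountable_role: str  # who answers for the accuracy of the work product


# Hypothetical policy table keyed by (tool type, task).
POLICY = {
    ("text_generator", "brainstorming"): Rule(True, Risk.LOW, False, "author"),
    ("text_generator", "legal_research"): Rule(
        True, Risk.HIGH, True, "supervising_attorney"
    ),
}


def check_use(tool_type: str, task: str) -> Rule:
    """Look up the rule for a tool/task pair; unlisted pairs are denied."""
    # Default-deny: a use nobody has assessed is treated as not permitted.
    return POLICY.get(
        (tool_type, task), Rule(False, Risk.HIGH, True, "unassigned")
    )
```

Under this sketch, `check_use("text_generator", "legal_research")` returns a rule that permits the use but requires verification and names an accountable role, while any pair missing from the table is denied by default.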
MAP — The Risk Was Identifiable and Identified by the Vendor
What should have existed: a risk assessment that identified the deployment context (federal court filings), the stakes (professional sanctions, client harm, judicial integrity), and the known limitations of the specific tool being used. What was available and ignored: OpenAI's own documentation clearly states that ChatGPT can produce incorrect information and should not be relied upon for factual claims without independent verification. This was not proprietary knowledge. It was publicly available guidance from the vendor of the tool being used. The MAP function requires organizations to understand the context and risks of AI deployment before deployment. That work wasn't done.
MEASURE — No Verification, No Testing, No Check
What should have existed: a defined verification process for AI-generated outputs used in professional filings. At minimum, every citation independently verified against a primary source before submission. What existed: nothing. The output of a text generation tool was submitted to a federal judge as verified legal precedent with no intermediate check against reality. The MEASURE function isn't always about running complex performance evaluations. Sometimes it's just: did someone independently confirm this is true before it went out the door? The answer here was no.
MANAGE — No Response Protocol, No Disclosure Trigger
What should have existed: a defined process for what to do when AI output is questioned, including immediate investigation, escalation, and disclosure protocols. When Avianca flagged the missing cases, that should have triggered a defined response: stop, investigate, disclose, remediate. What happened: the attorneys submitted fabricated documents and continued to defend the filings. There was no incident response process, no disclosure protocol, and no escalation trigger that would have changed the trajectory before sanctions.
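The stop, investigate, disclose, remediate sequence can be sketched as a tiny state machine. This is a hedged illustration only: `Stage`, `Incident`, and the role names are hypothetical, and a real protocol would carry deadlines, evidence, and sign-offs that are omitted here.

```python
from enum import Enum


class Stage(Enum):
    FLAGGED = 0
    STOPPED = 1
    INVESTIGATED = 2
    DISCLOSED = 3
    REMEDIATED = 4


class Incident:
    """Tracks a questioned AI output through a fixed response sequence."""

    def __init__(self, description: str, author: str, owner: str):
        # The response owner must not be the person whose work is in question.
        if owner == author:
            raise ValueError("incident owner must be independent of the author")
        self.description = description
        self.author = author
        self.owner = owner
        self.stage = Stage.FLAGGED

    def advance(self, to: Stage) -> None:
        # Steps cannot be skipped, e.g. no remediation before disclosure.
        if to.value != self.stage.value + 1:
            raise ValueError(f"cannot go from {self.stage.name} to {to.name}")
        self.stage = to
```

The two checks carry the point: ownership sits with someone other than the author of the questioned work, and no stage can be skipped on the way to remediation.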
Framework Mapping
Framework | Relevance |
NIST AI 600-1 | Confabulation (Section 2.2) — the primary AI risk in this case. Confidently stated, entirely false output in a high-stakes professional context. |
NIST AI RMF (AI 100-1) | All four functions failed: GOVERN (no policy, no accountability), MAP (no risk assessment), MEASURE (no verification), MANAGE (no incident response). |
ISO 42001:2023 | Clause 5 (Leadership) and Clause 8 (Operation): no AI system lifecycle governance, no use policy, no accountability structure for AI-assisted professional work. |
EU AI Act | Annex III classifies certain AI systems used by or on behalf of judicial authorities to assist in researching or interpreting facts and law, or applying law to concrete facts, as high-risk. Mata v. Avianca was a U.S. case involving private counsel, so the EU AI Act does not directly govern the facts. The governance logic still applies: legal and adjudicative AI workflows require human oversight, traceability, and verification before reliance. |
OWASP LLM Top 10 | LLM09:2025 Misinformation: AI-generated content presented as factual without verification. |
What Good Looks Like
Before the Tool Was Opened
An acceptable use policy for AI tools in legal research would have defined which tools are permitted, for which tasks, with which verification requirements. Training on generative AI capabilities and limitations, specifically that ChatGPT is not a database and cannot retrieve court records, would have prevented the foundational misunderstanding. Risk classification would have flagged legal research for court filings as a high-stakes function requiring human expert verification of all outputs.
When the Tool Was Used
A requirement to independently verify every citation against Westlaw, LexisNexis, or the relevant court's docket before inclusion in any filing. A clear process owner, someone responsible for confirming that AI-assisted research had been verified before the document went out the door.
When the Problem Was Flagged
An immediate stop on the filing, independent investigation of all citations, and proactive disclosure to the court. A defined escalation path that didn't depend on the same attorney who made the error to self-report.
None of this is technically complex. It doesn't require a PhD in machine learning. It requires organizational discipline: clear policies, defined processes, assigned accountability, and the understanding that AI outputs in professional contexts need human verification before they reach anyone with authority over real decisions.
"The attorneys were not sanctioned for using AI. They abandoned the gatekeeping role that every professional in every consequential context still owns."
Four Questions for Your Organization
Do you know which employees are currently using generative AI tools for professional work that produces outputs used in filings, reports, audits, or decisions?
Does your acceptable use policy distinguish between AI tools that retrieve verified information and AI tools that generate plausible text? If your people can't answer that question, they can't make safe decisions about when to verify AI output.
Which of your current AI-assisted workflows produce outputs that reach a regulator, court, client, or board without an independent verification step between the AI and the destination?
If an AI output failure were discovered after the fact, what process would activate, and who would own it?
If the answer is "whoever made the error would figure it out," that's not a process. That's the same situation Schwartz was in when Avianca flagged the missing cases.
Read the Executive Brief
This case study has a companion Executive Brief, a six-section one-pager distilling the governance signal, risk snapshot, recommended actions, and board discussion questions for executives who need the lesson without the full analysis.
Sources
Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y. 2023)
NIST AI 600-1, Generative AI Profile, Section 2.2, Confabulation
NIST AI 100-1, AI Risk Management Framework
ISO/IEC 42001:2023
EU AI Act, Regulation (EU) 2024/1689, Annex III
OWASP LLM Top 10 2025, LLM09: Misinformation
OpenAI documentation / Help Center on output accuracy limitations
UNITI CYBER MEDIA — BUILT IN, NOT BOLTED ON.
We do not rise to the level of our AI capabilities. We fall to the level of our governance. Build it in from the start. Everything else is just damage control.
