AI can analyse safety reports in seconds, but it's a master improviser. Learn how structured workflows turn promise into reliable insight.


Picture yourself facing a mountain of incident reports. Heck, if you're reading this, there's a good chance that no imagination is necessary! Each report tells a small story: a maintenance worker's boot catches a loose cable, a forklift operator misjudges a turn, a supervisor notices a guardrail that should have been replaced months ago. You're convinced these narratives hold the key to preventing the next serious accident, but finding patterns in hundreds of these accounts is like trying to recognise a tune in a crowded bar.
You've got your hands in your hair, your coffee's gone cold - but then it dawns on you like the first ray of sunshine on a cloudy day: AI! And not the finicky tech-whizz machine-learning something-something AI of yesteryear. No, those modern do-all large language models (LLMs) that you've already used to write many a tedious email in the last few years. Surely they can cut through that cacophony and deliver a clear chorus of insights at a simple request? And I'll admit, the first time I watched ChatGPT digest a month's worth of incident reports in seconds and spit out trend summaries, I was impressed. What would take a human analyst weeks appeared on my screen in moments.
But my enthusiasm hit a wall. Beneath this impressive performance lurks a subtle danger I've come to respect.
These systems don't think like analysts; they predict what you'll approve of. Like master improvisers, they weave together plausible narratives that can feel indistinguishable from expert work. Where a spreadsheet either calculates correctly or throws a visible error (or at least an answer that seems wildly wrong), an LLM can produce fluent answers that sound right even when they are wrong. I've learned that success requires more than good prompts; it demands a rigorous framework that keeps these powerful tools honest.
To understand why AI can make a dangerous safety analyst, let me walk you through how (I've come to understand) it actually functions. Large language models succeed by predicting one token after another rather than by reasoning about cause and effect. They do not "read" a safety report the way you or I would; they statistically estimate the most likely continuation of your prompt based on patterns in training text. And yes, this training text will have contained many good examples of cause-and-effect reasoning - however, these models also go through extensive post-training, where they are moulded and tamed into complying with instructions and producing affable responses. That's the mass-market appeal.
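If you want to see that mechanism with your own eyes, here is a toy sketch using the open Hugging Face transformers library and the small GPT-2 model, chosen purely for illustration: instead of asking for an answer, it asks the model which handful of tokens it considers most likely to come next.

```python
# A toy look at next-token prediction, using the small open GPT-2 model purely
# for illustration (pip install transformers torch).
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The worker slipped because"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]      # scores for the *next* token only

top5 = torch.topk(logits, 5).indices
print([tok.decode(int(t)) for t in top5])        # likely continuations, not verified causes
```

The point of the toy is simply that "because" is completed with whatever reads most naturally, not with anything the model has verified.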
This mechanism creates three specific risks I've watched safety professionals (and others!) stumble into: false confidence, loss of nuance, and inconsistency.
The Risk of False Confidence. The primary danger is that an LLM prioritises sounding logical over being logical. When context is incomplete or messy, the model will often "smooth" the story to keep its output coherent, even if reality is incoherent. This bothers me more than any other limitation.
Let me go over a healthcare study that makes this risk tangible. I make a point of citing it because, when I read it, it finally helped me put my finger on something that had been bothering me about LLMs for a while. On the surface, the results of the study looked promising: the model's classification of topics aligned with human themes roughly 80% of the time [1]. But the more important finding was what happened in the instances when the data it was tasked with processing didn't make sense. When human reviewers marked a topic as logically incoherent, the model still labelled 96% of those incoherent topics as coherent [1].
Take a moment to appreciate what this means. The AI was constructing a cohesive theme where the sub-themes of the text were in fact divergent - it so desperately "wants" to be helpful, even when being helpful backfires! For a safety manager, that's the smoking gun: the model can confidently impose structure on noise by presenting half-formed patterns as if they were solid findings.
The Speed vs. Nuance Trade-off. The second issue is one I see teams discover the hard way: while AI is fast, it often struggles to read between the lines. In one experiment analysing text messages about medication adherence, the AI completed the task in minutes (about 97% faster than human researchers), but agreement with human coding was only "fair" (about 37-47%) [2]. Upon inspection, the reason wasn't mysterious: the models interpreted messages literally and missed implied meaning, slang, or subtext [2].
At first I didn't grasp the relevance to a safety context, but the parallel is obvious to me now. A report that says someone was "a bit off today" could signal fatigue, impairment, or stress. A human reads the workplace context, they dig a little, they follow up. A single-pass LLM may file it under a bland health note and move on.
Inconsistency in the "Black Box". Finally, here's something that catches people off guard because it runs counter to what we are used to getting out of computers: LLM outputs can vary across runs because generation involves stochastic variables (read "random stuff"). I've seen this myself - rerunning the same dataset can yield different summaries, themes, or priorities, especially when prompts are broad or ambiguous. Most people don't discover this because performing a second run and doing a side-by-side is so much more tedious than just accepting the beautifully formatted response you got the first time. This variability is one reason qualitative research reviews emphasise reproducibility challenges and the need for careful methodology and validation when using LLMs [8].
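A repeatability check is cheap to run if you want to see this for yourself. The sketch below assumes the OpenAI Python client; the model name is a placeholder and the prompt is deliberately broad to provoke variation.

```python
# A cheap repeatability check: send the identical prompt twice and compare.
# Assumes the OpenAI Python client (v1+); the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = "List the three most common hazard themes in these incident notes: ..."

def ask(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

run_a = ask(PROMPT, temperature=1.0)
run_b = ask(PROMPT, temperature=1.0)
print("Identical at default temperature?", run_a == run_b)  # often False

# Temperature 0 reduces variation, but does not guarantee identical analyses.
run_c = ask(PROMPT, temperature=0.0)
run_d = ask(PROMPT, temperature=0.0)
print("Identical at temperature 0?", run_c == run_d)
```

Lowering the temperature narrows the randomness but does not turn the model into a deterministic calculator, which is exactly why the side-by-side is worth doing.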
So, if LLMs are prone to hallucination and literalism, are they then useless for safety? Not necessarily. After working through scholarly articles on the topic, the research trend seems consistent to me: structured use can be valuable; unstructured use is risky. Allow me to walk you through some examples in a way that I hope will make clear my conclusions at the end.
The Strength: High-Volume Sorting. When the task is tightly defined, especially classification, LLMs can perform extremely well. In construction safety, researchers fine-tuned multiple lightweight language models to classify OSHA accident narratives and found strong performance. Ensemble methods performed best, with a top F1 score around 0.93 [4]. That is a practical result, and one I do find genuinely exciting: for the "grunt work" of screening and tagging large volumes of reports, LLMs can be a reliable force multiplier, provided the categories are clear and the workflow is controlled.
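To make the ensemble idea concrete, here is a toy sketch with invented labels (not real OSHA data): several classifiers label the same narratives, a majority vote decides, and the result is scored against human "gold" labels.

```python
# A toy majority-vote ensemble with invented labels (not real OSHA data):
# three classifiers label the same narratives; the vote is scored against
# human "gold" labels with a macro F1.
from collections import Counter
from sklearn.metrics import f1_score

gold    = ["fall", "struck_by", "fall", "caught_in", "fall"]
model_a = ["fall", "struck_by", "fall", "caught_in", "struck_by"]
model_b = ["fall", "struck_by", "caught_in", "caught_in", "fall"]
model_c = ["fall", "fall", "fall", "caught_in", "fall"]

def majority_vote(*label_lists):
    # For each narrative, take the most common label across models.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*label_lists)]

ensemble = majority_vote(model_a, model_b, model_c)
print("Ensemble macro F1:", round(f1_score(gold, ensemble, average="macro"), 2))
```

The design choice is the interesting part: each individual model can be wrong in its own way, and the vote smooths over their idiosyncratic errors.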
"But wait...", I hear you say in confusion. Didn't I tell you just a few paragraphs ago that the models tend to miss subtle content subtext? The key here is that rigour is achieved by running the same data through a funnel of hard-coded steps that each try to combat known LLM shortcomings. More on that below.
The Weakness: Determining Causality. Here's where my optimism gets checked again. Performance drops when the model must explain why an incident happened. In a study applying the Gemma-2 model to Swiss healthcare incident reports, event extraction was strong (about 92% accuracy), but identification of contributing factors dropped to about 72%. The authors noted hallucination and interpretation issues in that deeper layer of analysis [5]. This is the boundary safety teams must respect: AI can be a scanner, but it is not a good diagnostician. I've seen too many people miss this distinction, probably because it is hard to resist: after all, if you ask it to diagnose, it'll try to do so... and make a good show of it too!
The Solution: Structured Prompting. Now, the gap between "good sorting" and "bad reasoning" can often be narrowed by changing how we ask the model to work. Instead of dumping data into a single prompt ("Summarise these incidents"), structured workflows force the model through stages. This includes drafting candidate themes, refining them, checking evidence, and only then summarising. Research on LLM-assisted qualitative analysis highlights that stepwise or staged prompting can materially improve coverage and reduce brittle, overly-polished outputs [3].
The practical directive for workplace safety is simple, and it's one I return to constantly: don't accept conclusions that appear in one jump. Make the model grind through the paces with explicit steps: identify, justify with quotes, score, aggregate, and then summarise. In the academic work, several other techniques are mentioned (model ensembles are particularly interesting) but I won't go into a detailed exposition here.
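Here is roughly what that grinding looks like in practice. This is a minimal sketch, assuming the OpenAI Python client and a placeholder model name; the prompts are illustrative, not a tested recipe.

```python
# A minimal staged-prompting sketch: each call gets one narrow job plus the
# previous stage's output. OpenAI Python client assumed; model name and prompt
# wording are illustrative, not a tested recipe.
from openai import OpenAI

client = OpenAI()

def step(instruction: str, material: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": f"{instruction}\n\n{material}"}],
        temperature=0,
    )
    return resp.choices[0].message.content

reports = "...batch of incident narratives..."

themes   = step("Identify candidate themes only. One per line, no commentary.", reports)
evidence = step("For each theme, quote the exact sentences that support it. "
                "Write 'NO EVIDENCE' where none exists.",
                f"Themes:\n{themes}\n\nReports:\n{reports}")
scored   = step("Score each evidenced theme 1-5 for potential severity, one-line rationale each.",
                evidence)
summary  = step("Summarise only themes scored 3 or higher, citing their quotes.", scored)
print(summary)
```

Notice that the evidence stage gives the model an explicit escape hatch ("NO EVIDENCE"); that is where a theme invented out of noise is most likely to be caught.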
The Safety Net: Human Oversight. Even with structure, the literature consistently emphasises something I believe deeply: triangulation. This involves continuous human validation to mitigate bias, preserve context, and catch subtle misreadings [6]. And for classification tasks, using multiple models (or model "ensembles") can increase stability and reduce overgeneralisation compared with relying on a single model [4]. In short: AI can highlight patterns at scale, but human expertise is the safeguard against confident invention.
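A simple way to operationalise that triangulation is to auto-accept only where independent models agree and queue everything else for a person. A minimal sketch, with invented records:

```python
# Triangulation in its simplest form: auto-accept only where independent models
# agree; everything else goes to a human review queue. Records are invented.
records = [
    {"id": 101, "model_a": "fatigue",   "model_b": "fatigue"},
    {"id": 102, "model_a": "equipment", "model_b": "procedure"},
]

auto_accepted, human_queue = [], []
for rec in records:
    (auto_accepted if rec["model_a"] == rec["model_b"] else human_queue).append(rec)

print(f"{len(auto_accepted)} auto-accepted, {len(human_queue)} queued for a human decision")
```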

Even with improved structures, certain tasks remain high-risk. I already touched on some of these points but I want to clearly list the specific pain points now that the shape of my overall argument has been laid out.
Subtle Interpretation (The "Slang" Problem). Remember that medication text-message study? It showed that models often miss implied meaning and interpret language too literally [2]. In safety reporting, this worries me because culture signals and "between-the-lines" cues (fatigue normalisation, quiet pressure to rush, informal workarounds) can vanish if you rely on surface keywords.
Quantitative Aggregation (The "Math" Problem). Here's something that surprises people, but really shouldn't: LLMs are language engines, not counting machines. Without external tooling and explicit methods, they can distort frequencies, overweight common phrases, and underplay rare-but-catastrophic events. This is especially dangerous because fluent summaries can sound statistically grounded when they're not. I've caught this mistake more times than I can count (I guess I'm not a counting machine either). Use them to score qualitative data, then use good old-fashioned calculation to do the statistics. Some LLM providers' native interfaces now include tool use for their flagship models, e.g. letting them compose and execute Python scripts, but this does not automatically carry over to simple API usage (that is, plugging the model into your system in a standard prompt-in, answer-out setup).
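In practice, that separation can be as mundane as the sketch below: the model assigns the qualitative labels (invented here for illustration), and plain Python does the counting.

```python
# Keep the arithmetic outside the model: the LLM assigns qualitative labels
# (invented here), and plain Python does the counting and percentages.
from collections import Counter

llm_labels = ["slip", "slip", "manual_handling", "slip", "electrical",
              "manual_handling", "slip"]

counts = Counter(llm_labels)
total = len(llm_labels)
for label, n in counts.most_common():
    print(f"{label:16s} {n:3d}  ({n / total:.0%})")
# These verified numbers can then be handed back to the model for interpretation.
```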
Large, Noisy Datasets. Context limits force the model to see only fragments of long logs at once. Reviews of LLM use in qualitative research highlight contextual limitations and occasional inaccuracies as persistent barriers [8]. In practice, this can mean missed "slow-burn" trends and invented glue that stitches partial context into a confident story. It's like asking someone to understand a novel by reading a random sampling of pages. They might miss the critical event that explains the antagonist's villain arc.
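One common workaround is a map-reduce pass: summarise manageable chunks first, then summarise the summaries. The sketch below reuses the hypothetical `step` helper from the staged-prompting example; the chunk size, overlap, and prompts are illustrative.

```python
# A map-reduce pass over a long log: summarise chunks, then summarise the
# summaries. Reuses the hypothetical `step` helper from the staged-prompting
# sketch; chunk size, overlap, and prompts are illustrative.
def chunks(text: str, max_chars: int = 8000, overlap: int = 500):
    start = 0
    while start < len(text):
        yield text[start:start + max_chars]
        start += max_chars - overlap

long_log = "...months of incident entries..."

partials = [
    step("Summarise recurring hazards in this excerpt. Quote supporting lines.", part)
    for part in chunks(long_log)
]
overview = step("Merge these partial summaries and flag any slow-burn trends that "
                "recur across excerpts.", "\n\n".join(partials))
print(overview)
```

The overlap between chunks is deliberate: it is a cheap hedge against a slow-burn trend being split exactly at a chunk boundary, though it is no guarantee the model will connect the pieces.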
Bias-Sensitive Analysis. Models reflect dominant perspectives (absorbed either from the training data or from the reinforcement tuning), and this matters more than most people realise. Bias analyses of LLM outputs show measurable skews that can intensify as models scale [7]. For safety teams, the consequences are concrete: bias can shape which risks feel "typical," whose hazards get emphasised, and which groups' issues are under-identified - unless humans actively validate and correct.
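One lightweight sanity check is to compare how often the model flags incidents as serious across groups or shifts and treat large unexplained gaps as a cue to audit. The figures below are invented purely for illustration.

```python
# A lightweight bias sanity check with invented figures: compare how often the
# model flags incidents as serious across groups (here, day vs night shift).
from collections import defaultdict

model_flags = [("night", True), ("night", True), ("day", False),
               ("day", True), ("night", True), ("day", False)]

totals, serious = defaultdict(int), defaultdict(int)
for group, flagged in model_flags:
    totals[group] += 1
    serious[group] += int(flagged)

for group in sorted(totals):
    print(f"{group}: {serious[group] / totals[group]:.0%} flagged as serious")
# Large unexplained gaps are a cue to audit prompts, data, and model behaviour.
```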
These vulnerabilities mean that unassisted LLMs can simulate analytical rigour while delivering inconsistent, occasionally fabricated results. This creates a dangerous combination when lives depend on accurate conclusions.
Let me reaffirm that the solution isn't abandoning these tools, but using structure that matches risk. Think in tiers: simple prompts for basic tasks, multi-step workflows for depth, and agentic systems for complex oversight [9]. This transforms AI from a liability into a disciplined assistant. Let me put it all together - briefly in the next few paragraphs, and then in a worked scenario at the end of this article.
Start with hybrid workflows. Let the AI draft candidate insights, then require evidence (quotes, counts, examples) and bring humans in to validate. So-called "reasoning models" can already give you improved results on a single-prompt framework if the prompt is carefully structured and well-tested, as the "reasoning" step essentially provides a scratch pad for the model to walk through a multi-step workflow in a single run. For statistical rigour, compute metrics outside the LLM and ask the model to interpret the verified numbers rather than invent them. In short, a disciplined structure turns impressive mimicry into dependable analysis.
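Concretely, "interpret the verified numbers" means computing the figures first and asking the model to explain them only, as in this sketch (again reusing the hypothetical `step` helper; the numbers are invented).

```python
# Compute first, interpret second: the statistics are produced outside the
# model, and the model is only asked to explain them. Reuses the hypothetical
# `step` helper; the figures are invented.
verified_stats = {
    "total_reports": 214,
    "slip_trip_fall": 61,
    "manual_handling": 48,
    "electrical": 9,
}

briefing = step(
    "Interpret these pre-computed incident counts for a monthly safety briefing. "
    "Do not recalculate or introduce any figures of your own.",
    str(verified_stats),
)
print(briefing)
```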
Language models tempt us with shortcuts, transforming weeks of qualitative analysis into minutes. Yet their inner workings, particularly the "improviser" tendency, demand respect and structure. We've seen how they falter on nuance and causality but can excel at high-volume sorting when tightly constrained. The path forward is not blind trust or blanket rejection; it's disciplined design. We need stepwise workflows, ensemble checks where appropriate, and continuous human validation.
The folks at myosh aim to help system users squeeze real value from AI technology without offloading trust and responsibility from people onto language models (more details in a future article). This matters to me personally: safety depends on salience, judgement, and accountability - qualities that remain uniquely human. Used correctly, LLM-based tools can improve the efficiency of human attention by highlighting patterns faster, while humans remain the final authority on what those patterns mean and what actions follow.
Sources:
Originally working in the UK and now living in Australia, Martin is a retired health and safety professional whose interest in OHS critical thinking grew from hands-on inspections through to innovative training and thoughtful policy work. From a perspective humbled by real-world workplace exposure, he shares practical insights on making safety aspiration and theory workable and human-centred.
Scenario: A near-miss is reported on a construction site involving a scaffold platform that shifted under load. No injury occurred, but the potential severity is high.
Raw narrative
“During façade installation, the scaffold platform at Level 3 dropped ~30–50 mm on one side when two workers moved a pallet of stone veneer. A support bracket appeared to rotate. No injuries. Area was cleared. Scaffold tag status unclear. Photos taken by supervisor. Install team mentioned the bracket was ‘a bit loose yesterday’ but work continued due to schedule pressure.”
Step 1 — Tier 1: Rapid triage (3 independent agents)
Tier 1 goal: Quickly flag severity and decide whether the report requires deeper analysis. Each agent works independently and must back its call with evidence quoted directly from the narrative (a code sketch of this step follows the outputs below).
Tier 1 outputs (one line per agent)
Agent 1 - Evidence: “platform … dropped ~30–50 mm”, “support bracket … rotate”, “work continued due to schedule pressure”
Agent 2 - Evidence: “two workers moved a pallet”, “tag status unclear”, “bracket was ‘a bit loose yesterday’”
Agent 3 - Evidence: “area was cleared”, “photos taken”, “continued due to schedule pressure”
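For the curious, here is a minimal sketch of what Tier 1 could look like in code: the same narrative goes to three independent triage calls, and any single escalation flag wins. The client, model name, prompt wording, and JSON shape are all illustrative assumptions, not a production design.

```python
# A sketch of Tier 1: the same narrative goes to three independent triage calls;
# any single escalation flag escalates. Client, model name, prompt wording, and
# the JSON shape are illustrative assumptions, not a production design.
import json
from openai import OpenAI

client = OpenAI()
NARRATIVE = "During façade installation, the scaffold platform at Level 3 dropped ..."

TRIAGE_PROMPT = (
    "You are one of several independent triage reviewers. Read the incident "
    "narrative and reply as JSON with keys: potential_severity (low/medium/high), "
    "escalate (true/false), evidence (list of verbatim quotes).\n\n"
)

def triage_once() -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                           # placeholder
        messages=[{"role": "user", "content": TRIAGE_PROMPT + NARRATIVE}],
        response_format={"type": "json_object"},       # JSON mode, where supported
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

agents = [triage_once() for _ in range(3)]
escalate = any(a.get("escalate") for a in agents)      # conservative: any flag wins
print("Escalate to Tier 2?", escalate)
```

The "any flag escalates" rule is deliberately conservative: at triage, a false escalation costs minutes, while a missed one costs much more.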
Step 2 — Tier 2: Domain-focused review (controls + causal hypotheses)
Tier 2 goal: Identify likely failure modes, map them to control categories, and propose safe immediate actions. Tier 2 must keep direct observations separate from hypotheses that still need verification, and restrict its recommendations to low-regret containment actions (a sketch of this prompt shape follows the synthesis below).
Tier 2 synthesis
Observed (from narrative):
Hypotheses (must be verified):
Immediate containment actions (low-regret):
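Again purely as a sketch, the Tier 2 prompt shape might force those three sections apart so a human can tell evidence from conjecture. It reuses the hypothetical `step` helper and `NARRATIVE` from the earlier sketches; the wording is illustrative.

```python
# A sketch of the Tier 2 prompt shape: observations, hypotheses, and containment
# actions are forced into separate, labelled sections so a human can tell
# evidence from conjecture. Reuses the hypothetical `step` helper and NARRATIVE.
TIER2_PROMPT = """You are reviewing an escalated scaffold incident.
Using ONLY the narrative below, produce three sections:

OBSERVED - verbatim quotes only.
HYPOTHESES - each marked 'must be verified', with the quote that prompted it.
IMMEDIATE CONTAINMENT ACTIONS - low-regret actions only; no permanent fixes.

Do not merge sections or add facts that are not quoted.

Narrative:
{narrative}
"""

print(step(TIER2_PROMPT.format(narrative=NARRATIVE), ""))
```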
Step 3 — Tier 3: Specialist check + “human handoff” decision
Tier 3 goal: Decide what can be safely automated versus what must be escalated to a human, and produce a final structured summary (a sketch of the handoff rule follows at the end).
Tier 3 decision
Final “ready-to-use” summary (for the safety team):
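And a final sketch of the Tier 3 handoff rule: automate only the low-risk bookkeeping and force a human decision whenever severity or uncertainty is high. The thresholds and field names are illustrative assumptions.

```python
# A sketch of the Tier 3 handoff rule: automate only low-risk bookkeeping and
# force a human decision when severity or uncertainty is high. Thresholds and
# field names are illustrative assumptions.
def handoff_decision(triage: dict, unverified_hypotheses: bool) -> str:
    if triage.get("potential_severity") == "high" or unverified_hypotheses:
        return "ESCALATE: a human safety professional signs off before closure"
    return "AUTOMATE: file the summary, schedule routine follow-up, notify the supervisor"

print(handoff_decision({"potential_severity": "high"}, unverified_hypotheses=True))
```

In this scenario the rule lands where you would expect: high potential severity plus unverified hypotheses about the bracket means a human, not the workflow, makes the call.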