Can LLMs support safety assessment?

In brief

LLMs can help organise information, draft tables, summarise material and generate questions for safety analysis.
They should not be treated as stand-alone safety assessors or independent sources of assurance.
Their strongest use is often as a structured assistant: useful for preparation, reflection and documentation, but only under expert review.
Their biggest risk is that incomplete or weak analysis can sound fluent, confident and methodologically convincing.
Good use requires clear boundaries, evidence checking, traceability, data governance and human accountability.

Why this topic matters

Large language models are now easy to access and increasingly capable of producing structured, professional-looking text. It is therefore unsurprising that people are beginning to test them in safety assessment, hazard analysis, incident learning and assurance work. They can generate draft tables quickly. They can summarise long documents. They can suggest candidate hazards, unsafe control actions, failure modes, loss scenarios or safety requirements. For busy analysts and organisations facing complex systems, this is attractive.

However, safety assessment is not only a text-generation task. It is a disciplined process of understanding a system, deciding what is inside and outside the analysis boundary, identifying what matters to stakeholders, examining controls and feedback, and making evidence-based judgements about risk, uncertainty and action. A fluent table is not the same as a valid analysis.

This distinction matters because LLM output can look more complete than it really is. A model may produce a plausible STPA table while missing the control structure that makes the analysis meaningful. It may produce a FRAM-style list of functions without understanding how variability propagates across everyday work. It may suggest HAZOP safeguards that are procedurally easy to write but weak as risk controls. In safety-critical settings, such gaps are not just academic problems. They can shape design decisions, operational procedures, safety cases and organisational confidence.

The core idea

The safest way to think about LLMs in safety assessment is to treat them as assistive tools, not as assessors. They can support parts of the work, but they do not own the judgement. They do not know the local system unless the analyst provides the relevant context. They do not interview operators, observe work-as-done, inspect equipment, validate assumptions or take responsibility for a safety claim.

Used well, an LLM can act like a tireless drafting assistant or a second reader. It can help an analyst organise messy notes, convert workshop outputs into a clearer format, suggest questions that may have been overlooked, or check whether a table is internally consistent. It can also help translate technical findings into plain language for a broader audience.

Used poorly, an LLM can become a source of false confidence. One-shot safety assessments are particularly risky. A single prompt such as “apply STPA to this system” or “generate a HAZOP” may produce something that resembles expert work, but resemblance is not reliability. The analyst still needs to ask: What evidence supports this? What has been assumed? What has been omitted? Which parts are generic? Which parts are wrong? Who is accountable for using this output?

What LLMs can help with

LLMs are most useful when the task is bounded, reviewable and not safety-decisive on its own. They can help with the surrounding work of safety assessment: preparation, organisation, drafting, sense-checking and communication.


Possible use	How it can help	What still needs expert review
Preparing for analysis	Generate workshop questions, stakeholder prompts or checklists for information gathering.	Whether the questions fit the actual system, domain and purpose of the analysis.
Organising evidence	Summarise non-sensitive documents, meeting notes or previous analysis outputs.	Whether the summary is complete, accurate and traceable to the source material.
Drafting method artefacts	Prepare draft STPA, FRAM, FMEA or HAZOP-style tables for review.	Whether the method has been applied correctly and whether entries are valid.
Challenging assumptions	Suggest alternative scenarios, missing stakeholders, degraded modes or follow-up questions.	Whether the suggestions are relevant, feasible and supported by evidence.
Communicating findings	Turn technical outputs into clearer language for different audiences.	Whether the simplified text preserves the safety meaning and caveats.

This makes LLMs potentially valuable as part of a human-led workflow. Their contribution is not that they “know” the answer. It is that they can help analysts see, structure and communicate material that still needs to be checked by people with system knowledge.

Where LLMs can mislead

LLMs can make mistakes in ordinary ways, such as inventing facts, misquoting sources or overlooking important information. In safety assessment, there are also more specific problems.

First, they may lose the systems perspective. A systems-based method requires attention to relationships: controls, feedback, timing, responsibilities, interfaces, constraints and adaptation. LLMs may instead drift toward generic component failure lists or broad recommendations that sound sensible but do not follow from the modelled system.

Second, they may be inconsistent across prompts or across repeated runs. The same task can produce different hazards, different levels of detail or different recommendations. This is not always a problem for brainstorming, but it is a serious problem if the output is treated as a stable assessment.
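One rough way to make this inconsistency visible is to run the same prompt several times and measure how much the resulting hazard lists overlap. The sketch below uses Jaccard similarity on normalised strings. The hazard lists and function names are hypothetical, and exact string matching is a crude assumption: two runs may describe the same hazard in different words, so a low score is a prompt for human comparison rather than a verdict.

```python
from itertools import combinations


def normalise(item: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(item.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of items common to both sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_overlap(runs: list[list[str]]) -> list[float]:
    """Jaccard similarity for every pair of runs, in the order produced by combinations()."""
    sets = [{normalise(item) for item in run} for run in runs]
    return [jaccard(x, y) for x, y in combinations(sets, 2)]


# Hypothetical hazard lists returned by three runs of the same prompt.
runs = [
    ["Loss of separation", "Runway incursion", "Loss of datalink"],
    ["loss of separation", "Runway incursion", "Battery depletion"],
    ["Loss of separation", "Bird strike"],
]

scores = pairwise_overlap(runs)
print(scores)  # [0.5, 0.25, 0.25]
```

In this made-up example only one hazard appears in all three runs, which is the kind of variation that may be acceptable for brainstorming but not for a stable assessment.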

Third, they may fail to carry earlier analysis forward. A method such as STPA or FRAM is cumulative: earlier decisions shape later outputs. If losses, hazards, functions, control actions or variabilities are not used consistently, the final recommendations may become generic rather than methodologically grounded.
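A simple structural check can flag some of these breaks in an STPA-style draft before expert review. The sketch below assumes a hypothetical representation in which each hazard lists the loss identifiers it relates to and each unsafe control action (UCA) lists its hazard identifiers. It only verifies that the identifiers exist; it says nothing about whether a link is meaningful, which remains a matter for the analyst.

```python
def broken_links(losses: dict[str, str],
                 hazards: dict[str, dict],
                 ucas: dict[str, dict]) -> list[str]:
    """Report hazards and UCAs whose upstream references are missing or undefined."""
    problems = []
    for hid, hazard in hazards.items():
        linked = hazard.get("losses", [])
        if not linked:
            problems.append(f"{hid}: no loss linked")
        for lid in linked:
            if lid not in losses:
                problems.append(f"{hid}: unknown loss {lid}")
    for uid, uca in ucas.items():
        linked = uca.get("hazards", [])
        if not linked:
            problems.append(f"{uid}: no hazard linked")
        for hid in linked:
            if hid not in hazards:
                problems.append(f"{uid}: unknown hazard {hid}")
    return problems


# Hypothetical draft content, as it might look after several LLM-assisted steps.
losses = {"L-1": "Loss of life or injury", "L-2": "Damage to aircraft"}
hazards = {
    "H-1": {"text": "Aircraft encounters debris on the runway", "losses": ["L-1"]},
    "H-2": {"text": "Inspection vehicle enters an active runway", "losses": ["L-3"]},
}
ucas = {
    "UCA-1": {"text": "Runway released before inspection completes", "hazards": ["H-1"]},
    "UCA-2": {"text": "Alert not acknowledged", "hazards": []},
    "UCA-3": {"text": "Inspection started too late", "hazards": ["H-9"]},
}

problems = broken_links(losses, hazards, ucas)
for line in problems:
    print(line)
```

Here the check reports that H-2 points to an undefined loss, UCA-2 has no hazard, and UCA-3 points to an undefined hazard. Each of these is a sign that a later step may not have used the earlier analysis.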

Fourth, they may appear more authoritative than they are. A polished table, formal terminology and confident wording can hide missing evidence. This creates a human factors problem: people may under-check AI-generated outputs because they look organised and professional.

A safer workflow

Define the analysis purpose, boundary and method before involving the LLM.
Use only data that can be shared under the relevant confidentiality, security and data-protection rules.
Ask the model to support a specific task, not to complete the whole safety assessment.
Keep the prompts, model version, date, inputs and outputs as part of the analysis record.
Review each output against source evidence and mark it as accepted, amended or rejected.
Use domain experts and operational stakeholders to validate assumptions and findings.
Make sure final recommendations remain traceable to hazards, scenarios, evidence and expert judgement.

For example, it may be reasonable to ask an LLM to turn workshop notes into a draft table of possible control actions, provided each entry is checked by the analyst. It may be useful to ask for questions to explore during a FRAM workshop, provided the questions are treated as prompts for discussion rather than findings. It may be helpful to ask for a plain-language summary of a safety report, provided the summary is checked against the report.

It is much less defensible to ask an LLM to generate a final STPA, FRAM, HAZOP, FMEA or safety case and then use that output as evidence of safety. The more consequential the decision, the stronger the need for independent review, stakeholder involvement and documented traceability.
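The record-keeping and review steps above can be sketched as a small data structure. This is a minimal illustration, not a prescribed schema: the class names, field names, model identifier and example entries are all hypothetical. It assumes that every kept entry needs a named reviewer and at least one evidence reference before it can be used.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Review(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    AMENDED = "amended"
    REJECTED = "rejected"


@dataclass
class Entry:
    text: str                                          # one LLM-generated item
    status: Review = Review.PENDING
    evidence: list[str] = field(default_factory=list)  # source references checked by the reviewer
    reviewer: str = ""


@dataclass
class LLMInteraction:
    task: str            # the specific, bounded task given to the model
    prompt: str
    model_version: str
    run_date: date
    inputs: list[str]    # documents shared with the model
    entries: list[Entry]

    def unresolved(self) -> list[Entry]:
        """Entries not yet usable: unreviewed, or kept without evidence or a named reviewer."""
        kept = (Review.ACCEPTED, Review.AMENDED)
        return [
            e for e in self.entries
            if e.status is Review.PENDING
            or (e.status in kept and not (e.evidence and e.reviewer))
        ]


# Hypothetical record of one bounded drafting task.
record = LLMInteraction(
    task="Draft candidate control actions from workshop notes",
    prompt="List the control actions mentioned in the attached notes.",
    model_version="example-model-2025-01",
    run_date=date(2025, 1, 15),
    inputs=["workshop-notes-v2.txt"],
    entries=[
        Entry("Issue go-around instruction", Review.ACCEPTED,
              ["workshop-notes-v2.txt, p. 3"], "A. Analyst"),
        Entry("Close runway for inspection", Review.AMENDED, [], "A. Analyst"),
        Entry("Dispatch inspection vehicle"),
        Entry("Reboot tower systems", Review.REJECTED, [], "A. Analyst"),
    ],
)

for entry in record.unresolved():
    print(entry.status.value, "-", entry.text)
```

In this example the amended entry is still flagged because no evidence has been recorded for it, and the third entry because nobody has reviewed it. The rejected entry stays in the record but needs no further action.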

A practical way to use LLMs

A cautious workflow starts with the human analyst, not the model. Before using an LLM, define the purpose of the analysis, the system boundary, the relevant stakeholders, the intended method and the evidence base. Decide what the model is allowed to do and what it is not allowed to decide.

Questions to ask before relying on an output

Before using LLM-generated material in safety work, it helps to pause and ask a few practical questions.

What exactly was the model asked to do?
What information did it have, and what information did it not have?
Which assumptions did it introduce?
Can each hazard, scenario or recommendation be traced to evidence?
Does the output follow the method, or does it only imitate the format?
Has someone with domain and operational knowledge reviewed it?
Could the output create automation bias, overconfidence or misplaced accountability?

These questions are deliberately simple. They help keep the focus on safety judgement rather than model performance alone. A model that produces a convincing answer is not necessarily producing a usable safety assessment.

Limitations and cautions

LLMs should be used with particular care in safety-critical work because the costs of being wrong can be high. They can produce plausible but false information, omit context, flatten uncertainty, introduce bias, repeat patterns, misunderstand terminology, and generate recommendations that are not feasible in the real operating environment. They can also create security, privacy and intellectual property risks if sensitive information is entered into unsuitable tools.

There is also a governance issue. If an LLM helps draft a safety argument, a risk assessment or an incident-learning report, the organisation still needs to know who reviewed it, what evidence was used, what limitations were identified and who is accountable for the final decision. The model cannot carry that responsibility.

The practical conclusion is not to avoid LLMs completely. It is to use them in proportion to their maturity and the consequence of the decision. They may be useful for preparation, exploration, drafting and communication. They should not replace expert judgement, operational knowledge, stakeholder engagement, method competence or independent assurance.

Related publication(s)

Abdouramane Attou Bounou, N., Kaya, G.K. and Camelia, F. (2026). A system-based safety analysis of AI-enabled UAV operations for runway FOD detection: application of STPA and PROMETHEE. Safety Science, 202, 107280. DOI: 10.1016/j.ssci.2026.107280.
Kaya, G.K., Bovell, D., Sujan, M. and Braithwaite, G. (2025). Large language models powered system safety assessment: applying STPA and FRAM. Safety Science, 191, 106960. DOI: 10.1016/j.ssci.2025.106960.

Selected references

Charalampidou, S., Zeleskidis, A. and Dokas, I.M. (2024). Hazard analysis in the era of AI: Assessing the usefulness of ChatGPT4 in STPA hazard analysis. Safety Science, 178, 106608. DOI: 10.1016/j.ssci.2024.106608.
Dokas, I.M. (2026). From hallucinations to hazards: benchmarking LLMs for hazard analysis in safety-critical systems. Safety Science, 194, 107056. DOI: 10.1016/j.ssci.2025.107056.
Graydon, M.S. and Lehman, S.M. (2025). Examining Proposed Uses of LLMs to Produce or Assess Assurance Arguments. NASA Technical Memorandum NASA/TM–20250001849.
Hollnagel, E. (2012). FRAM: The Functional Resonance Analysis Method: Modelling Complex Socio-technical Systems. Ashgate.
Lee, J., Park, S., Oh, S. and Ma, B. (2026). Can large language models automate the HAZOP process without human intervention? Safety Science, 194, 107039. DOI: 10.1016/j.ssci.2025.107039.
Leveson, N.G. and Thomas, J.P. (2018). STPA Handbook. MIT Partnership for Systems Approaches to Safety and Security.
Qi, Y., Zhao, X., Khastgir, S. and Huang, X. (2025). Safety analysis in the era of large language models: A case study of STPA using ChatGPT. Machine Learning with Applications, 19, 100622. DOI: 10.1016/j.mlwa.2025.100622.
