Model Guardrails Are Getting Better. That Doesn’t Mean Your Product Is Safe.

Over the past few years, model providers have invested heavily in “guardrails”: safety layers around large language models that detect risky content, block some harmful queries, and make systems harder to jailbreak.

Hemraj Bedassee , Testing Delivery Practitioner, Testlio

November 27th, 2025

OpenAI has formalized this into a configurable Guardrails framework with checks like PII detection, moderation, jailbreak detection, hallucination detection, off-topic filters, prompt-injection detection, and URL filtering. These checks can be chained in a pipeline that runs before a model call (preflight), during the call (input), and after the final response (output).

Anthropic has built a layered safeguards stack around Claude: usage policies and a unified harm framework, safety-focused training (including “harmlessness” and Constitutional AI), pre-deployment evaluations, real-time classifier guards on inputs and outputs, and specialized high-risk domain classifiers, such as its nuclear safeguards.

This is real progress, but it’s also where many teams are now making a dangerous assumption:

“If the provider has guardrails, our product must be safe enough.”

In practice, that assumption is wrong.

OpenAI Guardrails

OpenAI Guardrails is a safety framework that sits around model calls and runs configurable checks in a staged pipeline:

Preflight: runs before the model (for example, PII masking or input moderation)
Input: runs in parallel with the model generation (for example, jailbreak detection)
Output: runs over the generated content (for example, factuality or compliance checks)

The current catalogue of published checks includes, among others:

Contains PII
Moderation (safety policy categories)
Jailbreak detection
Off-topic prompts
URL filter
Hallucination detection (vector-store grounding)
NSFW text detection
Prompt injection detection

Each check uses some combination of detectors, LLM-based “judge” models, pattern matching, and thresholds. When a check fires, it can block, redact, or flag content or raise an exception back to the calling application.

This is useful infrastructure, but even OpenAI’s own ecosystem and independent research make it clear these controls are probabilistic and bypassable. Security researchers and vendors have already shown ways to defeat jailbreak and prompt-injection detectors, including in OpenAI’s Guardrails, by crafting novel prompt patterns. Guardrails reduce risk, but they do not eliminate it.

Anthropic’s Safeguards

Anthropic takes a “defence in depth” approach to Claude, combining:

A usage policy and unified harm framework
Safety-focused training and Constitutional AI
Pre-deployment evaluations of misuse, bias, and high-risk domains
Real-time classifier guards on inputs and outputs
Enhanced safety filters for repeat policy violators
Specialized high-risk classifiers (for example, the nuclear classifier deployed on Claude traffic at 96% accuracy in preliminary testing)

In 2025, Anthropic also introduced “Constitutional Classifiers”: dedicated input/output classifiers trained on synthetic data derived from a rules “constitution”, to defend against universal jailbreaks.

This is a major step forward in making systematic jailbreaks harder. But even Anthropic’s own reports and external jailbreak research stress that:

These classifiers are focused on specific threat models
They block a narrow class of harmful information, not all possible harmful or misleading outputs.

So again: they reduce certain classes of risk; they do not guarantee global safety.

Guardrails Solve One Problem: Misuse. Your Product Has Many More.

Provider guardrails are mainly designed to reduce misuse and policy-violating content at the model level.

They are not designed to understand:

Your specific domain rules (for example, what “safe” means in healthcare, finance, or public sector workflows)
Your business logic and constraints
Your retrieval pipeline or knowledge base structure
Your agent/tool orchestration and downstream actions
Your UX patterns, tone, or escalation rules
Your jurisdiction-specific regulatory duties

That means a model can be “well-guarded” according to the provider’s policy and still:

Give misleading or partially incorrect answers
Offer unsafe domain-specific advice that is not explicitly banned by the usage policy
Behave inconsistently across languages, personas, or regions
Lose context across long multi-turn conversations
Call the wrong tools or take unsafe actions in an agentic workflow
Drift in tone, memory, or personalization in ways that matter to your users and regulators

Where Testing Needs To Probe Beyond Guardrails

For teams building AI-powered products, the real work happens after model guardrails have done their part.

Here are the areas where testing needs to go deeper.

Factuality and hallucination: Guardrails like OpenAI’s hallucination detector help, but they typically work against a specific knowledge base or vector store and do not claim to guarantee global factual accuracy. Testing has to probe unsupported claims, partial truths, outdated documents, and reasoning failures in the real domain data you use.
Domain-specific safety: Neither OpenAI nor Anthropic state that their guardrails certify domain-correct behaviour in high-stakes areas such as medicine, law, finance, or safety-critical industrial use. Their focus is on broad harm categories, not on sector-specific correctness. Domain-focused testing has to cover harmful-but-subtle failures that still comply with provider policy.
Multi-turn behaviour and memory: Anthropic explicitly uses real-time classifier guards on inputs and outputs; OpenAI supports conversation-aware guardrails. But long-horizon behaviours, slow escalation, and personalization drift remain hard problems. Testing needs to simulate realistic, multi-step conversations and look for behaviour that changes over time.
Agent and tool-use safety: Provider guardrails focus mainly on what the model says, not everything it does when embedded in an agent that can call tools, APIs, or workflows. OpenAI includes prompt injection and tool-call checks; Anthropic’s adds CBRN-focused jailbreak defence, but neither claims to validate every downstream action in your app. Testing has to verify tool selection, parameter safety, fallback behaviour, and data exposure across end-to-end flows.
Bias, fairness, and user equity: Anthropic and OpenAI both run internal bias and harm evaluations, but they do not guarantee per-client fairness across your specific user groups, languages, and use cases. Independent testing needs to uncover systematic differences in how the system treats different demographics, segments, or regions.
Configuration and integration risk: Guardrails are configurable. Thresholds, enabled checks, logging, and safety modes can all be changed at the client side. Misconfiguration is a real failure mode and it often doesn’t show up until someone deliberately tests it.

How a Dedicated Testing Partner Like Testlio Fits In

This is where a focused AI testing practice becomes critical. A provider gives you model-level guardrails, while a testing partner helps you understand what actually happens when that model is dropped into your real product.

The role Testlio plays in this landscape looks like this:

Map provider guardrails to your real use cases: We start by identifying which provider safeguards are actually enabled and mapping them to your flows and context. The goal is to see where those guardrails apply and where they do not.
Design tests that deliberately target known and likely gaps: Using what the providers themselves publish about their safeguards and what the research community continues to show with new jailbreak and prompt-injection techniques.
Run structured red teaming and regression: We do not just test “happy path” behaviour. We use structured adversarial prompts, long-horizon conversations, and realistic misuse patterns to see how your application behaves with guardrails on and how that behaviour changes when models or configurations are updated.
Translate findings into product-level decisions: The output is not just a list of raw issues but also a set of product decisions you can act on: which guardrail settings to tighten, where to add UX-level warnings or confirmations or where to add non-AI fallbacks or human review.
Continuously test as models and safeguards evolve: AI Model Providers are iterating quickly, introducing new guardrail features, safety levels, classifiers, and models. As they evolve, your risk profile changes too. A guardrail improvement in one dimension can create new failure modes elsewhere. Ongoing testing is how you keep the real-world experience aligned with your risk tolerance.

Final Thoughts

A few years ago, we were asking: “Does this model have guardrails?” But today, the better question is:

What exactly do those guardrails cover?
What do they explicitly not cover?
And how are we testing everything that falls outside their scope in our actual product?

Model guardrails are necessary. They meaningfully reduce some classes of risk, but they are not the same thing as product safety, domain correctness, or regulatory readiness. That gap is now where the real work of AI testing lives.