Guardrail Assessment

Do your guardrails hold up
against a real attack?

Content filters, jailbreak detectors, PII masking - your guardrails only protect your AI system as well as they perform against real attacks. We measure what they actually deliver: quantitatively, reproducibly, audit-ready.

Request Guardrail Assessment View test dimensions

Bypass Rate False-Positive Rate Latency Metrics GES Score

GUARDRAIL EFFECTIVENESS SCORE - EXAMPLE

Jailbreak Bypass Rate

34 % critical

PII Leakage Rate

18 % high

False-Positive Rate

12 % medium

Indirect Injection Bypass

61 % critical

Adversarial Latency

+840 ms medium

OVERALL - GUARDRAIL EFFECTIVENESS SCORE

31 / 100 - CRITICAL

HARDENING REQUIRED

Fixed-price quote: from EUR 10,000; Fixed-price quote
Quote turnaround: 48h (business days); Quote turnaround
Guardrail systems tested: 6+; Guardrail systems tested
Subcontractors: 0; Subcontractors

The Problem

Guardrails you think are secure - but aren't

Most AI guardrails are configured, not tested. They are switched on, the out-of-the-box defaults adopted and considered "secure". The reality: jailbreak techniques evolve daily. What was blocked yesterday gets through today with a minimal reformulation. And nobody measures it.

The guardrail dilemma

Too restrictive leads to user frustration and guardrail deactivation. Too tolerant lets attackers through. Correct calibration requires systematic testing, not intuition.

Jailbreaking is industrialised

Public databases with thousands of jailbreak prompts, automated bypass tools and community forums make it trivially easy to circumvent unpatched guardrails.

Regulatory requirements

EU AI Act Art. 15 and the GPAI Code of Practice require demonstrable robustness against misuse. A Guardrail Assessment provides measurable evidence of this robustness.

Latency as an attack surface

External guardrail services add latency. Under adversarial load - complex requests deliberately saturating the classifiers - response time can rise to several seconds and cause timeouts.

GUARDRAIL LAYERS - WHAT MUST BE TESTED

Input Classifier

Detects malicious inputs before LLM processing

System Prompt Guards

Protects system prompt from being overwritten by user inputs

Output Classifier

Filters harmful outputs after LLM processing

PII Masking

Anonymises personal data in outputs

Topic Restriction

Limits discussion scope to permitted topics

Constitutional Classifier

Checks outputs against ethics guidelines and policies

Monitoring & Logging

Detects bypass patterns in real time

FROM OUR PRACTICE

Significant bypass rates

Our assessments regularly reveal that a significant proportion of tested guardrails can be bypassed - particularly with uncalibrated out-of-the-box deployments.

After hardening based on our recommendations, bypass rates can be substantially reduced while maintaining an acceptable false-positive rate.

What we test

Five test dimensions - one score

The Guardrail Assessment measures all relevant quality dimensions of your AI protection layers - quantitatively and comparably.

01 False-Negative Rate

Bypass Resistance

What percentage of all adversarial requests are correctly blocked? We test with 500+ curated bypass techniques: roleplay prompts, token smuggling, adversarial suffixes, multilingual exploits, many-shot jailbreaking, contextual circumvention and novel, non-publicly known techniques from our research.

False-Negative Rate500+ test cases

02 Usability Impact

False-Positive Rate

A guardrail that blocks every other legitimate request is not protection - it is a productivity killer. We measure the false-positive rate with realistic, harmless requests from your use case and identify the optimal calibration point between security and usability.

Usability BalanceROC Curve

03 Performance Degradation

Latency Under Adversarial Load

Attackers can saturate guardrails with deliberately complex requests - a timing attack on your AI application. We measure latency under normal operation and under adversarial load: P50, P95, P99 response times and timeout behaviour.

P95 LatencyTimeout Resistance

04 Portability

Cross-Model Effectiveness

A guardrail configured for GPT-4 may be ineffective for Claude or Llama - each model has different tokenisation properties and behaviour patterns. We test whether your guardrails work model-independently or need model-specific calibration.

Multi-ModelTokeniser Differences

05 Conversation Robustness

Multi-Turn Manipulation Resistance

Many guardrails only check individual messages, not the conversation history. Attackers use multi-turn sequences to gradually condition guardrails: a harmless request sets up the next one until the classifier is outwitted. We test robustness over 10-, 20- and 50-turn conversations.

Multi-TurnContext Attacks

06 Privacy Compliance

PII and Data Leakage

Does your system disclose personal data despite guardrails - from context, from training or from connected data sources? We test PII masking effectiveness, membership inference resistance and contextual data exfiltration per GDPR requirements.

GDPRPII Masking

Tested Systems

We know your guardrail system

Every guardrail system has its own vulnerability classes and bypass techniques - generic tests are not sufficient.

Azure AI Content Safety

Microsoft

Specific bypass techniques for Azure AI Content Safety: severity threshold exploits, category-specific circumventions (Hate, Violence, Sexual, Self-Harm), Prompt Shield bypass and multi-modal attacks. We test all four harm categories and groundedness detection.

Amazon Bedrock Guardrails

AWS

Testing of all Bedrock guardrail functions: topic denial, content filter, word filter, sensitive information filter and grounding checks. Specific attacks on topic restriction bypasses and PII entity recognition gaps in non-English text.

NVIDIA NeMo Guardrails

NVIDIA

Colang-based guardrail flows have specific logic exploits: flow manipulation through adversarial inputs, rail bypass via uncovered conversation paths and input/output rail inconsistencies. We test both predefined and custom flows.

Anthropic Constitutional AI

Anthropic

Training-based guardrails have different vulnerabilities than external classifiers: contextual conditioning, reasoning exploits, cross-lingual bypasses and adversarial roleplaying techniques that circumvent Constitutional AI checks. We test Claude models in their production context.

Lakera Guard

Open Source / SaaS

Real-time guardrail API with specialised prompt injection detectors: we test detection rates against current jailbreak databases, latency under load and effectiveness for non-English language inputs that may be underrepresented in the training set.

Custom Implementations

Proprietary

Many organisations build their own guardrails based on regex, keyword lists or fine-tuned classifiers. We analyse your specific implementation, identify gaps in coverage and develop bespoke bypass tests and hardening recommendations.

Your deliverable

Guardrail Effectiveness Score

The Guardrail Effectiveness Score (GES) is the central deliverable of the assessment. It compresses the security performance of your guardrails into an understandable, comparable metric - as the basis for management decisions and compliance evidence.

80-100

Very Good

Guardrails are effectively calibrated. Bypass rate < 5%, false-positive rate < 8%. Targeted optimisation recommended.

60-79

Good

Basic protection in place, but bypass rate 5-15% or elevated false-positive rate. Specific hardening measures recommended.

40-59

Needs Improvement

Significant vulnerabilities. Bypass rate 15-30%. Structural reconfiguration required.

0-39

Critical

Guardrails provide no reliable protection. Bypass rate > 30%. Immediate action required before production use.

COMPONENTS OF THE GUARDRAIL EFFECTIVENESS SCORE

35 %

Bypass Resistance (FNR)

Weighted proportion of successful bypasses across all attack categories

25 %

Usability (FPR)

Proportion of incorrectly blocked legitimate requests in the use-case context

15 %

Latency Robustness

Performance degradation under adversarial load (P95)

15 %

PII Protection

Effectiveness of PII masking and data exfiltration prevention

10 %

Multi-Turn Resistance

Robustness against stepwise context manipulation attacks

COMPLIANCE USE OF THE GES

EU AI Act Art. 15: evidence of robustness against adversarial inputs
GPAI Code of Practice: quantitative safety metrics for general-purpose AI
ISO 42001: evidence for Control A-6.1 (AI System Risk Management)
GDPR: evidence of technical safeguards for AI systems processing personal data

Methodology

How AWARE7 tests guardrails

Quantitative testing with 500+ curated test cases - combined with manual expert analysis for novel bypass techniques.

1 day

Guardrail Inventory

Complete mapping of all guardrail layers: which systems are active? How are they configured? Which harm categories are covered? Which thresholds are set? Which models do they run on? Result: guardrail architecture diagram.

1-2 days

Baseline Measurement

Establishing the initial measurement: false-positive rate with 200+ legitimate requests from your use case. Latency baseline under normal operation. Performance profile of the guardrail system as reference for all subsequent tests.

3-5 days

Systematic Bypass Testing

Testing with 500+ curated bypass techniques from our proprietary test set - broken down by attack category: jailbreaking, roleplays, token smuggling, encoding tricks, multilingual exploits, many-shot conditioning and adversarial suffixes. False-negative rate measured per category.

2-3 days

Multi-Turn & Contextual Attacks

Tests not covered by single-message analysis: stepwise context conditioning over 10-, 20-, 50-turn conversations. Guardrail exhaustion attacks. Cross-session persistence tests for systems with persistent conversation memory.

1-2 days

Latency & Resilience Tests

Quantitative latency measurement under adversarial load: P50, P95, P99 response times. Timeout behaviour with complex requests. Guardrail system behaviour under overload - does it fail open or fail closed?

2-3 days

Reporting & Calibration Recommendations

Guardrail Effectiveness Score (GES) with breakdown by test dimension. Concrete calibration recommendations for each threshold. Hardening roadmap with prioritisation. Compliance mapping to EU AI Act, ISO 42001 and GDPR.

Typical total duration: 8-15 days - depending on the number of guardrail layers and desired test depth.
You receive a binding fixed-price quote within 48 business hours from EUR 10,000.

Warum AWARE7

Was uns von anderen Anbietern unterscheidet

Reine Awareness-Plattformen testen keine Systeme. Reine Beratungskonzerne sind zu weit weg. AWARE7 verbindet beides: Wir hacken Ihre Infrastruktur und schulen Ihre Mitarbeiter - mittelstandsgerecht, persönlich, ohne Enterprise-Overhead.

Forschung und Lehre als Fundament

Rund 20% unseres Umsatzes stammen aus Forschungsprojekten für BSI und BMBF. Unsere Studien analysieren Millionen von Websites und Zehntausende Phishing-E-Mails - publiziert auf ACM- und Springer-Konferenzen. Drei unserer Führungskräfte sind gleichzeitig Professoren an deutschen Hochschulen.

Digitale Souveränität - keine Kompromisse

Alle Daten werden ausschließlich in Deutschland gespeichert und verarbeitet - ohne US-Cloud-Anbieter. Keine Freelancer, keine Subunternehmer in der Wertschöpfung. Alle Mitarbeiter sind sozialversicherungspflichtig angestellt und einheitlich rechtlich verpflichtet. Auf Anfrage VS-NfD-konform.

Festpreis in 24h - planbare Projektzeiträume

Innerhalb von 24 Stunden erhalten Sie ein verbindliches Festpreisangebot - kein Stundensatz-Risiko, keine Nachforderungen, keine Überraschungen. Durch eingespieltes Team und standardisierte Prozesse erhalten Sie einen klaren Zeitplan mit definiertem Starttermin und Endtermin.

Ihr fester Ansprechpartner - jederzeit erreichbar

Ein persönlicher Projektleiter begleitet Sie vom Erstgespräch bis zum Re-Test. Sie buchen Termine direkt bei Ihrem Ansprechpartner - keine Ticket-Systeme, kein Callcenter, kein Wechsel zwischen wechselnden Beratern. Kontinuität schafft Vertrauen.

Für wen sind wir der richtige Partner?

Mittelstand mit 50–2.000 MA

Unternehmen, die echte Security brauchen - ohne einen DAX-Konzern-Dienstleister zu bezahlen. Festpreis, klarer Scope, ein Ansprechpartner.

IT-Verantwortliche & CISOs

Die intern überzeugend argumentieren müssen - und dafür einen Bericht mit Vorstandssprache brauchen, nicht nur technische Findings.

Regulierte Branchen

KRITIS, Gesundheitswesen, Finanzdienstleister: NIS-2, ISO 27001, DORA - wir kennen die Anforderungen und liefern Nachweise, die Auditoren akzeptieren.

Mitwirkung an Industriestandards

LLM

OWASP · 2023

OWASP Top 10 for Large Language Models

Prof. Dr. Matteo Große-Kampmann als Contributor im Core-Team des international anerkannten OWASP LLM-Sicherheitsstandards.

BSI

BSI · Allianz für Cyber-Sicherheit

Management von Cyber-Risiken

Prof. Dr. Matteo Große-Kampmann als Mitwirkender des offiziellen BSI-Handbuchs für die Unternehmensleitung (dt. Version).

Frequently asked questions about Guardrail Assessments

Everything about guardrail bypasses, false-positive rates and the Guardrail Effectiveness Score.

What are AI guardrails?

AI guardrails are protective layers that limit and control the behaviour of a Large Language Model. They include: content filters (block harmful or inappropriate outputs), jailbreak detectors (detect attempts to circumvent security policies), PII masking (anonymise personal data in outputs), output validators (ensure responses conform to a defined schema or format), constitutional classifiers (check outputs against ethical guidelines) and topic-restriction filters (limit the scope of discussion to permitted topics). Guardrails can be implemented model-internally (training-based, such as RLHF), model-externally (separate classifier models) or rule-based (regex, keyword lists) - and all three layers have different vulnerability profiles.

What is a guardrail bypass?

A guardrail bypass is any technique by which an attacker or user circumvents the protective measures of an AI system and provokes unwanted outputs. Bypass techniques include: roleplay prompts (the system is asked to play a fictional character with no restrictions), token smuggling (special characters or unusual encodings circumvent keyword filters), multilingual exploits (switching to less securely trained languages), many-shot jailbreaking (the model is conditioned with examples), adversarial suffixes (mathematically optimised token sequences), contextual circumvention (the request is reformulated so that the classifier does not recognise it as harmful) and encoding tricks (Base64, ROT13, Leet-Speak). In the Guardrail Assessment we measure the false-negative rate - the proportion of successful bypasses out of all bypass attempts - as a quantitative indicator of your guardrail effectiveness.

How do I measure the effectiveness of my guardrails?

Guardrail effectiveness can be quantified by two complementary metrics: The False-Negative Rate (FNR) measures what percentage of harmful requests slip through the guardrails - a high value indicates insufficient protection. The False-Positive Rate (FPR) measures what percentage of legitimate requests are incorrectly blocked - a high value indicates UX problems and user frustration. The tension between FNR and FPR is the central challenge in guardrail design: too restrictive leads to too many false positives; too tolerant leads to too many bypasses. Our Guardrail Assessment provides both metrics in a standardised Guardrail Effectiveness Score, broken down by attack category and guardrail layer.

What is the difference between a content filter and Constitutional AI?

A content filter is a rule-based or classifier-based system that checks inputs or outputs against a list of prohibited content - fast, deterministic, but easily circumvented by rephrasing. Constitutional AI (Anthropic) is a training-based approach: the model is trained with a "constitution" - a rule set of principles - to evaluate and correct its own outputs. Constitutional AI is harder to bypass than keyword filters because the restrictions are anchored in the model itself, but has its own vulnerabilities: multi-step reasoning, contextual manipulation and cross-linguistic exploits. In the Guardrail Assessment we test both paradigms with specific attack techniques tailored to each implementation.

Which guardrail systems do you test?

We test all leading guardrail platforms: Azure AI Content Safety (Microsoft), Amazon Bedrock Guardrails, NVIDIA NeMo Guardrails (Colang-based), Anthropic Constitutional AI and Claude Guardrails, OpenAI Moderation API, Lakera Guard, LLM Guard (open source), and custom guardrail implementations based on proprietary classifier models. For each platform we know the specific bypass techniques and vulnerability classes. Our assessment covers both cloud-hosted guardrail services and self-hosted open-source implementations.

What does a Guardrail Assessment cost?

A Guardrail Assessment starts from EUR 10,000. The price depends on the number of guardrail layers, the complexity of the systems to be tested and the desired test scope (single guardrail component vs. complete guardrail architecture). For complex systems with multiple guardrail layers, custom classifiers and quantitative effectiveness measurement across multiple attack categories, the typical effort is between EUR 12,000 and EUR 20,000. You receive a binding fixed-price quote within 48 business hours - no hourly rates, no additional charges.

Can guardrails achieve a false-positive rate of 0%?

No - that is the fundamental guardrail dilemma. Any guardrail sensitive enough to detect all bypasses will inevitably also block legitimate requests. Highly restrictive guardrails achieve low false-negative rates (few bypasses) but have high false-positive rates (many legitimate requests blocked), leading to user complaints, productivity losses and ultimately the deactivation of the guardrails. The optimal operating point on the ROC curve depends on your use case: a public chatbot requires different guardrail thresholds than an internal analysis tool. Our assessment not only quantifies the current performance but also recommends the optimal configuration point for your specific deployment context.

How often should guardrails be retested?

Guardrails are not static security measures - they are challenged daily by new bypass techniques. The jailbreak community continuously publishes new attack methods; model updates change the behaviour of the underlying classifiers; and new deployment contexts open up new attack surfaces. Recommendation: after every major model update or guardrail reconfiguration run a quick assessment, and at least every six months a full Guardrail Assessment. For systems in regulated industries (finance, healthcare, critical infrastructure) we recommend a retainer model with quarterly effectiveness measurement and threshold calibration.

What is your real guardrail bypass rate?

We measure the effectiveness of your AI safety filters quantitatively - with 500+ bypass techniques and the Guardrail Effectiveness Score. Fixed-price commitment from EUR 10,000.

Request Guardrail Assessment

Kostenlos · 30 Minuten · Unverbindlich