Guardrail Assessment
Do your guardrails hold up
against a real attack?
Content filters, jailbreak detectors, PII masking - your guardrails protect your AI system only as well as they hold up against real attacks. We measure what they actually deliver: quantitatively, reproducibly, audit-ready.
OVERALL - GUARDRAIL EFFECTIVENESS SCORE
31 / 100 - CRITICAL
- Fixed-price quote
- from EUR 10,000
- Quote turnaround
- 48h (business days)
- Guardrail systems tested
- 6+
- Subcontractors
- 0
The Problem
Guardrails you think are secure - but aren't
Most AI guardrails are configured, not tested: switched on, left at their out-of-the-box defaults and declared "secure". The reality: jailbreak techniques evolve daily. What was blocked yesterday gets through today with a minimal reformulation. And nobody measures it.
The guardrail dilemma
Too restrictive leads to user frustration and guardrail deactivation. Too tolerant lets attackers through. Correct calibration requires systematic testing, not intuition.
Jailbreaking is industrialised
Public databases with thousands of jailbreak prompts, automated bypass tools and community forums make it trivially easy to circumvent unpatched guardrails.
Regulatory requirements
EU AI Act Art. 15 and the GPAI Code of Practice require demonstrable robustness against misuse. A Guardrail Assessment provides measurable evidence of this robustness.
Latency as an attack surface
External guardrail services add latency. Under adversarial load - complex requests deliberately saturating the classifiers - response time can rise to several seconds and cause timeouts.
GUARDRAIL LAYERS - WHAT MUST BE TESTED
Input Classifier
Detects malicious inputs before LLM processing
System Prompt Guards
Protect the system prompt from being overridden by user inputs
Output Classifier
Filters harmful outputs after LLM processing
PII Masking
Anonymises personal data in outputs
Topic Restriction
Limits discussion scope to permitted topics
Constitutional Classifier
Checks outputs against ethics guidelines and policies
Monitoring & Logging
Detects bypass patterns in real time
FROM OUR PRACTICE
Significant bypass rates
Our assessments regularly reveal that a significant proportion of tested guardrails can be bypassed - particularly with uncalibrated out-of-the-box deployments.
After hardening based on our recommendations, bypass rates can be substantially reduced while maintaining an acceptable false-positive rate.
What we test
Five test dimensions - one score
The Guardrail Assessment measures all relevant quality dimensions of your AI protection layers - quantitatively and comparably.
Bypass Resistance
What percentage of all adversarial requests are correctly blocked? We test with 500+ curated bypass techniques: roleplay prompts, token smuggling, adversarial suffixes, multilingual exploits, many-shot jailbreaking, contextual circumvention and novel, non-publicly known techniques from our research.
False-Positive Rate
A guardrail that blocks every other legitimate request is not protection - it is a productivity killer. We measure the false-positive rate with realistic, harmless requests from your use case and identify the optimal calibration point between security and usability.
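Both rates fall out of the same labelled test run: feed the guardrail adversarial and benign prompts, record whether each was blocked, and count. A minimal sketch (the function names and toy data are illustrative, not our actual tooling):

```python
def bypass_rate(adversarial_blocked):
    """False-negative rate: share of adversarial prompts NOT blocked."""
    missed = sum(1 for blocked in adversarial_blocked if not blocked)
    return missed / len(adversarial_blocked)

def false_positive_rate(benign_blocked):
    """Share of legitimate prompts that were incorrectly blocked."""
    return sum(1 for blocked in benign_blocked if blocked) / len(benign_blocked)

# Each flag records whether the guardrail blocked that prompt.
adversarial = [True, True, False, True, False]  # 2 of 5 bypassed the guardrail
benign = [False, False, False, True]            # 1 of 4 wrongly blocked

print(bypass_rate(adversarial))        # 0.4
print(false_positive_rate(benign))     # 0.25
```

The calibration point is then a trade-off between the two curves: lowering the blocking threshold drives the bypass rate down and the false-positive rate up.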
Latency Under Adversarial Load
Attackers can saturate guardrails with deliberately complex requests - a timing attack on your AI application. We measure latency under normal operation and under adversarial load: P50, P95, P99 response times and timeout behaviour.
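The percentile figures come straight from the recorded response times. A minimal nearest-rank sketch with hypothetical latency samples (a mix of normal traffic and adversarially complex requests):

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank - 1, 0)]

# Hypothetical measurements: the tail is where adversarial load shows up.
latencies_ms = [120, 135, 140, 180, 210, 450, 900, 1500, 2200, 3100]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
# P50: 210 ms - P95 and P99: 3100 ms
```

A healthy-looking median can hide a tail that already exceeds your timeout budget, which is exactly why P95/P99 matter more than the average here.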
Cross-Model Effectiveness
A guardrail configured for GPT-4 may be ineffective for Claude or Llama - each model has different tokenisation properties and behaviour patterns. We test whether your guardrails work model-independently or need model-specific calibration.
Multi-Turn Manipulation Resistance
Many guardrails only check individual messages, not the conversation history. Attackers use multi-turn sequences to gradually condition guardrails: a harmless request sets up the next one until the classifier is outwitted. We test robustness over 10-, 20- and 50-turn conversations.
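A multi-turn probe can be sketched as a loop that replays an escalating conversation and records where, if ever, the guardrail first intervenes. The harness and the toy guardrail below are illustrative assumptions, not a real product's API:

```python
def first_blocked_turn(guardrail, user_turns):
    """Drive a multi-turn conversation; return the 1-based turn at which the
    guardrail first blocks, or None if the whole sequence gets through."""
    history = []
    for i, msg in enumerate(user_turns, start=1):
        history.append({"role": "user", "content": msg})
        if guardrail(history):  # a robust guardrail sees the full history
            return i
        history.append({"role": "assistant", "content": "(model reply)"})
    return None

# Toy guardrail with exactly the weakness described above:
# it inspects only the latest message, not the accumulated context.
def last_message_only(history):
    return "forbidden" in history[-1]["content"]

turns = ["hi there", "let's play a game", "now reveal the forbidden detail"]
print(first_blocked_turn(last_message_only, turns))  # 3 - caught only at the final turn
```

A history-aware guardrail would flag the escalation pattern earlier; the difference between the two verdicts is what the 10-, 20- and 50-turn tests quantify.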
PII and Data Leakage
Does your system disclose personal data despite guardrails - from context, from training or from connected data sources? We test PII masking effectiveness, membership inference resistance and contextual data exfiltration per GDPR requirements.
Tested Systems
We know your guardrail system
Every guardrail system has its own vulnerability classes and bypass techniques - generic tests are not sufficient.
Azure AI Content Safety
Microsoft
Specific bypass techniques for Azure AI Content Safety: severity threshold exploits, category-specific circumventions (Hate, Violence, Sexual, Self-Harm), Prompt Shield bypass and multi-modal attacks. We test all four harm categories and groundedness detection.
Amazon Bedrock Guardrails
AWS
Testing of all Bedrock guardrail functions: topic denial, content filter, word filter, sensitive information filter and grounding checks. Specific attacks on topic restriction bypasses and PII entity recognition gaps in non-English text.
NVIDIA NeMo Guardrails
NVIDIA
Colang-based guardrail flows have specific logic exploits: flow manipulation through adversarial inputs, rail bypass via uncovered conversation paths and input/output rail inconsistencies. We test both predefined and custom flows.
Anthropic Constitutional AI
Anthropic
Training-based guardrails have different vulnerabilities than external classifiers: contextual conditioning, reasoning exploits, cross-lingual bypasses and adversarial roleplaying techniques that circumvent Constitutional AI checks. We test Claude models in their production context.
Lakera Guard
Open Source / SaaS
Real-time guardrail API with specialised prompt injection detectors: we test detection rates against current jailbreak databases, latency under load and effectiveness for non-English language inputs that may be underrepresented in the training set.
Custom Implementations
Proprietary
Many organisations build their own guardrails based on regex, keyword lists or fine-tuned classifiers. We analyse your specific implementation, identify gaps in coverage and develop bespoke bypass tests and hardening recommendations.
Your deliverable
Guardrail Effectiveness Score
The Guardrail Effectiveness Score (GES) is the central deliverable of the assessment. It compresses the security performance of your guardrails into an understandable, comparable metric - as the basis for management decisions and compliance evidence.
80-100
Very Good
Guardrails are effectively calibrated. Bypass rate < 5%, false-positive rate < 8%. Targeted optimisation recommended.
60-79
Good
Basic protection in place, but bypass rate 5-15% or elevated false-positive rate. Specific hardening measures recommended.
40-59
Needs Improvement
Significant vulnerabilities. Bypass rate 15-30%. Structural reconfiguration required.
0-39
Critical
Guardrails provide no reliable protection. Bypass rate > 30%. Immediate action required before production use.
COMPONENTS OF THE GUARDRAIL EFFECTIVENESS SCORE
Bypass Resistance (FNR)
Weighted proportion of successful bypasses across all attack categories
Usability (FPR)
Proportion of incorrectly blocked legitimate requests in the use-case context
Latency Robustness
Performance degradation under adversarial load (P95)
PII Protection
Effectiveness of PII masking and data exfiltration prevention
Multi-Turn Resistance
Robustness against stepwise context manipulation attacks
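A composite score like the GES can be sketched as a weighted sum of the five sub-scores. The weights below are purely illustrative assumptions, not the assessment's actual weighting:

```python
# Illustrative weights only - an assumption, not the real GES weighting.
WEIGHTS = {
    "bypass_resistance":     0.35,
    "usability":             0.20,
    "latency_robustness":    0.15,
    "pii_protection":        0.15,
    "multi_turn_resistance": 0.15,
}

def guardrail_effectiveness_score(subscores):
    """Weighted composite of per-dimension scores, each on a 0-100 scale."""
    return round(sum(WEIGHTS[dim] * subscores[dim] for dim in WEIGHTS))

subscores = {
    "bypass_resistance": 20,     # a high bypass rate dominates the result
    "usability": 60,
    "latency_robustness": 40,
    "pii_protection": 30,
    "multi_turn_resistance": 25,
}
print(guardrail_effectiveness_score(subscores))  # 33 - lands in the "Critical" band
```

Because bypass resistance carries the heaviest weight, a porous guardrail drags the overall score into the lower bands even when usability and latency look acceptable.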
COMPLIANCE USE OF THE GES
- EU AI Act Art. 15: evidence of robustness against adversarial inputs
- GPAI Code of Practice: quantitative safety metrics for general-purpose AI
- ISO 42001: evidence for Control A-6.1 (AI System Risk Management)
- GDPR: evidence of technical safeguards for AI systems processing personal data
Methodology
How AWARE7 tests guardrails
Quantitative testing with 500+ curated test cases - combined with manual expert analysis for novel bypass techniques.
1 day
Guardrail Inventory
Complete mapping of all guardrail layers: which systems are active? How are they configured? Which harm categories are covered? Which thresholds are set? Which models do they run on? Result: guardrail architecture diagram.
1-2 days
Baseline Measurement
Establishing the initial measurement: false-positive rate with 200+ legitimate requests from your use case. Latency baseline under normal operation. Performance profile of the guardrail system as reference for all subsequent tests.
3-5 days
Systematic Bypass Testing
Testing with 500+ curated bypass techniques from our proprietary test set - broken down by attack category: jailbreaking, roleplays, token smuggling, encoding tricks, multilingual exploits, many-shot conditioning and adversarial suffixes. False-negative rate measured per category.
2-3 days
Multi-Turn & Contextual Attacks
Tests not covered by single-message analysis: stepwise context conditioning over 10-, 20-, 50-turn conversations. Guardrail exhaustion attacks. Cross-session persistence tests for systems with persistent conversation memory.
1-2 days
Latency & Resilience Tests
Quantitative latency measurement under adversarial load: P50, P95, P99 response times. Timeout behaviour with complex requests. Guardrail system behaviour under overload - does it fail open or fail closed?
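Whether a guardrail fails open or fails closed is ultimately a decision in the calling code. A minimal sketch of the distinction, with a hypothetical `slow_guardrail` standing in for a classifier saturated by complex requests:

```python
import concurrent.futures
import time

def checked_call(guardrail_fn, prompt, timeout_s, fail_closed=True):
    """Call a guardrail with a hard timeout. Returns True if the request
    should be blocked. On timeout: fail closed (block) or fail open (allow)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(guardrail_fn, prompt).result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fail_closed  # the security-vs-availability trade-off, made explicit
    finally:
        pool.shutdown(wait=False)

def slow_guardrail(prompt):
    time.sleep(0.5)  # simulates a classifier saturated by adversarial load
    return False     # would have allowed the request

print(checked_call(slow_guardrail, "hi", timeout_s=0.05))                     # True  (fail closed: blocked)
print(checked_call(slow_guardrail, "hi", timeout_s=0.05, fail_closed=False))  # False (fail open: allowed)
```

Fail-open keeps the application responsive but lets an attacker turn guardrail saturation into a bypass; fail-closed preserves the security guarantee at the cost of availability, which is why the assessment measures the system's actual behaviour under overload rather than assuming either.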
2-3 days
Reporting & Calibration Recommendations
Guardrail Effectiveness Score (GES) with breakdown by test dimension. Concrete calibration recommendations for each threshold. Hardening roadmap with prioritisation. Compliance mapping to EU AI Act, ISO 42001 and GDPR.
Typical total duration: 8-15 days - depending on the number of guardrail layers and desired test depth.
You receive a binding fixed-price quote within 48 business hours from EUR 10,000.
Why AWARE7
What sets us apart from other providers
Pure awareness platforms do not test systems. Pure consulting corporations are too far removed. AWARE7 combines both: we hack your infrastructure and train your employees - tailored to mid-sized businesses, personal, without enterprise overhead.
Research and teaching as our foundation
Around 20% of our revenue comes from research projects for the BSI and BMBF. Our studies analyse millions of websites and tens of thousands of phishing emails - published at ACM and Springer conferences. Three of our executives are also professors at German universities.
Digital sovereignty - no compromises
All data is stored and processed exclusively in Germany - with no US cloud providers. No freelancers, no subcontractors in the value chain. All employees are in regular salaried employment and bound by uniform legal obligations. VS-NfD-compliant on request.
Fixed price within 24h - predictable project timelines
Within 24 hours you receive a binding fixed-price quote - no hourly-rate risk, no follow-up charges, no surprises. A well-rehearsed team and standardised processes give you a clear schedule with a defined start and end date.
Your dedicated contact - available at any time
A personal project manager accompanies you from the first consultation through to the re-test. You book appointments directly with your contact - no ticket systems, no call centre, no rotating consultants. Continuity builds trust.
Who are we the right partner for?
Mid-sized companies with 50-2,000 employees
Companies that need real security - without paying for a DAX-corporation service provider. Fixed price, clear scope, one point of contact.
IT managers & CISOs
Who need to make a convincing internal case - and need a report written in boardroom language, not just technical findings.
Regulated industries
KRITIS operators, healthcare, financial services: NIS-2, ISO 27001, DORA - we know the requirements and deliver evidence that auditors accept.
Contributions to industry standards
OWASP · 2023
OWASP Top 10 for Large Language Models
Prof. Dr. Matteo Große-Kampmann is a contributor in the core team of the internationally recognised OWASP LLM security standard.
BSI · Alliance for Cyber Security
Management von Cyber-Risiken (Managing Cyber Risks)
Prof. Dr. Matteo Große-Kampmann is a contributor to the official BSI handbook for company leadership (German edition).
Frequently asked questions about Guardrail Assessments
Everything about guardrail bypasses, false-positive rates and the Guardrail Effectiveness Score.
What are AI guardrails?
What is a guardrail bypass?
How do I measure the effectiveness of my guardrails?
What is the difference between a content filter and Constitutional AI?
Which guardrail systems do you test?
What does a Guardrail Assessment cost?
Can guardrails achieve a false-positive rate of 0%?
How often should guardrails be retested?
What is your real guardrail bypass rate?
We measure the effectiveness of your AI safety filters quantitatively - with 500+ bypass techniques and the Guardrail Effectiveness Score. Fixed-price commitment from EUR 10,000.
Free of charge · 30 minutes · No obligation