Adversarial Machine Learning - KI-Angriffe und Gegenmaßnahmen
Adversarial machine learning refers to attack techniques that specifically manipulate or deceive machine learning models. These include adversarial examples (minimal changes to input data that cause models to misclassify), data poisoning (tainted training data), model inversion (extraction of training data), prompt injection (in LLMs), and model stealing. MITRE ATLAS documents these attack techniques.
Adversarial Machine Learning (AML) is the field of research and attack focused on the security of AI/ML systems. With the use of ML in security-critical applications (malware detection, facial recognition, autonomous systems, LLM agents), AML is becoming a central security issue.
Attack Categories
MITRE ATLAS - Adversarial Threat Landscape for AI Systems
Attack Phase 1: Reconnaissance (Information Gathering)
- Discover ML Models: Detecting whether the target system uses ML
- API Probing: Black-box queries to understand model behavior
- Training Data Collection: What was used to train the model?
Attack Phase 2: Model Attacks
A) Evasion Attacks (Inference Phase)
- Adversarial Examples: Minimal input changes
- Example: STOP sign with patches stuck on it → Recognized as "Speed Limit"
- Example: Spam email with invisible characters → Bypass AI filters
- Example: Malware with adversarial feature manipulation → EDR bypass
B) Data Poisoning (Training Phase)
- Trojan/Backdoor Attack: Poisoned training data with a trigger - "If image contains a 3x3 pixel pattern at position X → always classify as benign"
- Surreptitious Training: Supply chain attack on ML training pipeline
- Label Flipping: Targeted corruption of correct labels
C) Model Inversion / Extraction
- Membership Inference: "Was this dataset used in training?"
- Model Stealing: API queries → own model with identical behavior
- Training Data Extraction: Language model outputs training data
- GPT-2: 67 million PII records extractable from training data
D) LLM-Specific - Prompt Injection
- Direct: User enters malicious prompt
- Indirect: Malicious content in retrieved document
- Jailbreaking: Bypassing system instructions
- Prompt Leaking: Extracting system prompts
- Tool Misuse: LLM agent performs unwanted actions
Known real-world examples
| Year | Incident |
|---|---|
| 2019 | Microsoft Azure Face API - Adversarial patches fooled facial recognition |
| 2020 | Skylight (anti-malware) - ML bypass demonstrated via feature manipulation |
| 2022 | GPT-2/GPT-3 - Membership inference confirmed on training data |
| 2023 | ChatGPT - DAN jailbreak, indirect prompt injection in web browsing mode |
| 2024 | Autonomous AI Agents - Tool misuse via poisoned documents |
OWASP LLM Top 10 (2025) - AML-relevant categories
LLM01: Prompt Injection
- Direct: "Ignore previous instructions and..."
- Indirect: Malicious content in Retrieved Context (RAG)
Protection:
- Input Sanitization (but not complete protection!)
- Privilege Separation: LLM must not have direct access to the database
- Output Validation: Check results before execution
- Principle of Least Privilege for tool calls
LLM02: Insecure Output Handling
- LLM output is used unvalidated as HTML/SQL/shell commands
- Cross-prompt injection → XSS
- SQL via LLM → SQL injection
Protection:
- Treat LLM output as untrusted input
- Sanitize before HTML rendering, parameterize before SQL
LLM06: Sensitive Information Disclosure
- System prompts can be leaked
- PII in responses (memorized training data)
Protection: Differential privacy in training, output filtering
LLM08: Excessive Agency
- LLM agent has too many permissions
- Tool call without confirmation: "Delete all files in /tmp"
Protection:
- Minimal permissions for tools
- Human-in-the-loop for destructive actions
- Confirmation before irreversible operations
Practical Countermeasures
1. Adversarial Training
- Expand training set with adversarial examples
- Improves robustness, but increases training effort
- Certified Robustness: mathematical proof of robustness
2. Input Preprocessing
- Feature Squeezing: Reduce bit depth, remove noise
- Randomized Smoothing: Add random noise before classification
- Detectron: Adversarial example detection pipeline
3. Ensemble and Uncertainty
- Multiple models → Majority decision
- Uncertainty quantification: Model "knows when it doesn’t know"
- Monte Carlo dropout: Confidence estimation
4. ML Supply Chain Security
- Provenance: Where does the model come from? (MLflow, DVC)
- Model Cards: transparent documentation
- Trusted Model Repositories (Hugging Face + Signing)
- Dependency scan for ML libraries (PyTorch, TensorFlow)
5. LLM-specific
- Guardrails: NVIDIA NeMo Guardrails, LlamaGuard
- Prompt Templates instead of free-form input
- RAG Isolation: Retrieved Context separated from System Prompt
- Output Monitor: LLM checks its own output (Meta-LLM)
6. Monitoring and Detection
- Anomaly Detection on ML Inputs
- Distribution Shift: Is the input distribution changing?
- Shadow Model: A second model validates the main model
- Logging of all adversarial queries
Regulatory and Frameworks
| Standard | Description |
|---|---|
| MITRE ATLAS | mitre-atlas.mitre.org - 100+ AML techniques, TTP matrix similar to ATT&CK; |
| NIST AI RMF | AI Risk Management Framework (2023) - Govern, Map, Measure, Manage |
| EU AI Act (2024) | Risk-based approach; high-risk AI (biometrics, critical infrastructure, law enforcement) → mandatory security testing |
| ISO/IEC 42001 | AI Management System Standard (2023) - first certification standard for AI governance |
| ENISA | "Securing Machine Learning Algorithms" (2021) - Good practices for secure ML development |