Adversarial Machine Learning - KI-Angriffe und Gegenmaßnahmen

Adversarial Machine Learning (AML) is the field of research and attack focused on the security of AI/ML systems. With the use of ML in security-critical applications (malware detection, facial recognition, autonomous systems, LLM agents), AML is becoming a central security issue.

Attack Categories

MITRE ATLAS - Adversarial Threat Landscape for AI Systems

Attack Phase 1: Reconnaissance (Information Gathering)

Discover ML Models: Detecting whether the target system uses ML
API Probing: Black-box queries to understand model behavior
Training Data Collection: What was used to train the model?

Attack Phase 2: Model Attacks

A) Evasion Attacks (Inference Phase)

Adversarial Examples: Minimal input changes
Example: STOP sign with patches stuck on it → Recognized as "Speed Limit"
Example: Spam email with invisible characters → Bypass AI filters
Example: Malware with adversarial feature manipulation → EDR bypass

B) Data Poisoning (Training Phase)

Trojan/Backdoor Attack: Poisoned training data with a trigger - "If image contains a 3x3 pixel pattern at position X → always classify as benign"
Surreptitious Training: Supply chain attack on ML training pipeline
Label Flipping: Targeted corruption of correct labels

C) Model Inversion / Extraction

Membership Inference: "Was this dataset used in training?"
Model Stealing: API queries → own model with identical behavior
Training Data Extraction: Language model outputs training data
GPT-2: 67 million PII records extractable from training data

D) LLM-Specific - Prompt Injection

Direct: User enters malicious prompt
Indirect: Malicious content in retrieved document
Jailbreaking: Bypassing system instructions
Prompt Leaking: Extracting system prompts
Tool Misuse: LLM agent performs unwanted actions

Known real-world examples

Year	Incident
2019	Microsoft Azure Face API - Adversarial patches fooled facial recognition
2020	Skylight (anti-malware) - ML bypass demonstrated via feature manipulation
2022	GPT-2/GPT-3 - Membership inference confirmed on training data
2023	ChatGPT - DAN jailbreak, indirect prompt injection in web browsing mode
2024	Autonomous AI Agents - Tool misuse via poisoned documents

OWASP LLM Top 10 (2025) - AML-relevant categories

LLM01: Prompt Injection

Direct: "Ignore previous instructions and..."
Indirect: Malicious content in Retrieved Context (RAG)

Protection:

Input Sanitization (but not complete protection!)
Privilege Separation: LLM must not have direct access to the database
Output Validation: Check results before execution
Principle of Least Privilege for tool calls

LLM02: Insecure Output Handling

LLM output is used unvalidated as HTML/SQL/shell commands
Cross-prompt injection → XSS
SQL via LLM → SQL injection

Protection:

Treat LLM output as untrusted input
Sanitize before HTML rendering, parameterize before SQL

LLM06: Sensitive Information Disclosure

System prompts can be leaked
PII in responses (memorized training data)

Protection: Differential privacy in training, output filtering

LLM08: Excessive Agency

LLM agent has too many permissions
Tool call without confirmation: "Delete all files in /tmp"

Protection:

Minimal permissions for tools
Human-in-the-loop for destructive actions
Confirmation before irreversible operations

Practical Countermeasures

1. Adversarial Training

Expand training set with adversarial examples
Improves robustness, but increases training effort
Certified Robustness: mathematical proof of robustness

2. Input Preprocessing

Feature Squeezing: Reduce bit depth, remove noise
Randomized Smoothing: Add random noise before classification
Detectron: Adversarial example detection pipeline

3. Ensemble and Uncertainty

Multiple models → Majority decision
Uncertainty quantification: Model "knows when it doesn’t know"
Monte Carlo dropout: Confidence estimation

4. ML Supply Chain Security

Provenance: Where does the model come from? (MLflow, DVC)
Model Cards: transparent documentation
Trusted Model Repositories (Hugging Face + Signing)
Dependency scan for ML libraries (PyTorch, TensorFlow)

5. LLM-specific

Guardrails: NVIDIA NeMo Guardrails, LlamaGuard
Prompt Templates instead of free-form input
RAG Isolation: Retrieved Context separated from System Prompt
Output Monitor: LLM checks its own output (Meta-LLM)

6. Monitoring and Detection

Anomaly Detection on ML Inputs
Distribution Shift: Is the input distribution changing?
Shadow Model: A second model validates the main model
Logging of all adversarial queries

Regulatory and Frameworks

Standard	Description
MITRE ATLAS	mitre-atlas.mitre.org - 100+ AML techniques, TTP matrix similar to ATT&CK;
NIST AI RMF	AI Risk Management Framework (2023) - Govern, Map, Measure, Manage
EU AI Act (2024)	Risk-based approach; high-risk AI (biometrics, critical infrastructure, law enforcement) → mandatory security testing
ISO/IEC 42001	AI Management System Standard (2023) - first certification standard for AI governance
ENISA	"Securing Machine Learning Algorithms" (2021) - Good practices for secure ML development

Attack Categories

MITRE ATLAS - Adversarial Threat Landscape for AI Systems

A) Evasion Attacks (Inference Phase)

B) Data Poisoning (Training Phase)

C) Model Inversion / Extraction

D) LLM-Specific - Prompt Injection

Known real-world examples

OWASP LLM Top 10 (2025) - AML-relevant categories

LLM01: Prompt Injection

LLM02: Insecure Output Handling

LLM06: Sensitive Information Disclosure

LLM08: Excessive Agency

Practical Countermeasures

1. Adversarial Training

2. Input Preprocessing

3. Ensemble and Uncertainty

4. ML Supply Chain Security

5. LLM-specific

6. Monitoring and Detection

Regulatory and Frameworks

AWARE7 Services on This Topic