
Web scraping: techniques, legal situation and defensive measures

Web scraping refers to the automated extraction of web content. This article provides a concise overview of the techniques involved, the legal framework under the GDPR and the German Copyright Act (UrhG), detection methods, defense strategies, and its relevance to OSINT.


Web scraping refers to the automated extraction of data from websites using software bots or scripts. The term is derived from the English verb "to scrape" and aptly describes how structured or unstructured data is harvested from websites. Web scraping is technically neutral—it is used for both legitimate purposes (price comparison, SEO analysis, research) and malicious activities (spam harvesting, content theft, OSINT attacks).

Techniques and Tools

HTTP-based Scraping

The simplest form: HTTP requests are sent to a web server, the HTML response is downloaded and parsed.

Tools:

  • Python + Requests + BeautifulSoup: Classic combination for simple, static websites. BeautifulSoup parses HTML and allows elements to be selected via CSS selectors or its navigation API (XPath support requires lxml instead).
  • Scrapy: Python framework for extensive scraping projects with built-in request scheduling, rate limiting, pipeline architecture, and middleware system.
  • curl / wget: Command-line tools for simple single-page downloads; suitable for quick tests.

Limitation: Works only with static HTML pages. JavaScript-rendered content (SPAs, React, Angular) is not accessible this way.
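The Requests/BeautifulSoup combination described above can be sketched as follows. The URL, the User-Agent string, and the CSS selector are illustrative placeholders, not part of any real site:

```python
# Minimal static-page scraper: fetch HTML over HTTP, parse it with
# BeautifulSoup, select elements via a CSS selector.
import requests
from bs4 import BeautifulSoup

def extract_headlines(html: str) -> list[str]:
    """Return the text of all <h2> elements found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h2")]

if __name__ == "__main__":
    # Placeholder URL; a polite scraper identifies itself and sets a timeout.
    resp = requests.get(
        "https://example.com",
        headers={"User-Agent": "demo-scraper/0.1"},
        timeout=10,
    )
    resp.raise_for_status()
    print(extract_headlines(resp.text))
```

Separating the parsing logic from the fetch (as above) also makes the scraper testable against saved HTML fixtures.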

Browser Automation (Headless Browser)

For JavaScript-rendered websites, browser automation tools are used that control a real browser (or a headless version):

  • Selenium: Controls real browsers (Chrome, Firefox, Edge) via the WebDriver API. Can simulate complex interactions (login, form filling, scrolling). Slower and more resource-intensive.
  • Playwright: A more modern alternative to Selenium. Supports Chromium, Firefox, and WebKit. Better async support, more reliable selectors.
  • Puppeteer: A Google library for controlling Chromium via the DevTools protocol. Very good performance, tight Chrome integration.

API-based scraping

Many websites communicate internally via APIs (REST, GraphQL). By analyzing network traffic (Browser DevTools → Network tab), these undocumented APIs can be addressed directly—often more efficiently than HTML parsing.
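As a sketch of this approach, the snippet below fetches a JSON endpoint of the kind one might discover in the Network tab and extracts fields from the response. The endpoint URL and the field names (`products`, `name`, `price`) are hypothetical:

```python
# Sketch: querying an internal JSON API directly instead of parsing HTML.
# Endpoint and payload structure are hypothetical placeholders.
import json
from urllib.request import Request, urlopen

def extract_products(payload: dict) -> list[tuple[str, float]]:
    """Pull (name, price) pairs out of a decoded JSON response body."""
    return [(item["name"], item["price"]) for item in payload.get("products", [])]

def fetch_json(url: str) -> dict:
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = fetch_json("https://shop.example.com/api/v2/products?page=1")
    print(extract_products(data))
```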

Distributed Scraping

For large-scale scraping operations, proxy pools are used to bypass IP blocks. Services such as ScraperAPI, Bright Data, or Oxylabs manage pools of millions of residential IPs. Distributed scraping also spreads the load across many source IPs and circumvents rate limiting.
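A simple proxy rotation can be sketched with a round-robin iterator over the pool; each request then goes out through a different source IP. The proxy addresses below are placeholders (real pools come from the providers named above):

```python
# Sketch: round-robin rotation over a proxy pool. Addresses are placeholders.
from itertools import cycle

PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
_rotation = cycle(PROXIES)

def next_proxy_config() -> dict:
    """Return a proxies mapping in the format the requests library expects."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}

# Usage with requests (not executed here):
#   requests.get(url, proxies=next_proxy_config(), timeout=10)
```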

Legal Situation

The legality of web scraping is complex and depends on several areas of law. A blanket statement—"scraping is legal/illegal"—is incorrect.

GDPR (General Data Protection Regulation)

If scraped data contains personal information (names, email addresses, phone numbers, profiles), the GDPR applies:

  • Art. 6 GDPR (Lawfulness of processing): There must be a legal basis—consent, legitimate interest, or performance of a contract. Legitimate interest (Art. 6(1)(f)) may apply in individual cases, but must be carefully weighed.
  • Art. 14 GDPR (information obligations): Information duties apply even where data is not collected directly from the data subject; for mass scraping, these duties are often practically impossible to fulfill.
  • GDPR fines: Unlawful scraping of personal data can trigger fines under Art. 83 GDPR. In practice, platforms such as LinkedIn have repeatedly taken legal action against scrapers, although the prominent cases were litigated largely under contract and computer-misuse law rather than the GDPR.
Copyright Law (UrhG)

  • §87b UrhG (database protection): Databases whose compilation required a substantial investment (obtaining, verifying, or presenting the contents) are protected by a sui generis database producer's right. Extracting or re-using substantial parts of such a database without the rights holder's permission is prohibited.
  • BGH "Freunde finden" (I ZR 135/12): The Federal Court of Justice (BGH) ruled that the automated extraction of user data from a platform violates database protection and the terms of use.
  • §44b UrhG (text and data mining): Since 2021, the UrhG has permitted text and data mining for scientific research (§60d UrhG) and, under §44b, for other purposes, provided the rights holder has not reserved a prohibition in machine-readable form (e.g., robots.txt or terms of service). How far this privileges commercial scraping in practice remains contested.

Terms of Service

Most major websites exclude web scraping in their Terms of Service. Violating the Terms of Service is not a criminal offense, but it can have civil-law consequences (injunctions, damages). §202a StGB (data espionage) comes into play only where genuine access protection, such as a login wall or a technical block, is circumvented; whether a given measure qualifies as access protection is assessed case by case, and robots.txt alone generally does not.

robots.txt

The robots.txt file is a technical standard, not a legal one. It indicates which parts of a website should not be crawled. Ignoring robots.txt is not a criminal offense per se, but it can be interpreted as an indicator of intentional, abusive behavior.
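Whether for honoring the standard or for auditing a site's crawl policy, robots.txt rules can be evaluated directly with Python's standard library. The rule set and URLs below are illustrative:

```python
# Evaluate robots.txt rules with the standard library's robotparser.
from urllib.robotparser import RobotFileParser

# Illustrative rule set; normally loaded from https://<host>/robots.txt.
rules = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # allowed
print(rp.can_fetch("MyBot", "https://example.com/private/data"))  # disallowed
```

In production, `rp.set_url(...)` plus `rp.read()` fetches the live file instead of parsing a literal string.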

Detection and Defense

From a website operator’s perspective, there are various methods to detect and hinder scraping:

Detection Methods

Traffic Analysis:

  • Unnaturally high request rates from individual IP addresses
  • Missing or unusual HTTP headers (no User-Agent, no Referer)
  • Accessing pages in an order atypical for humans
  • Very short dwell time per page
  • Missing JavaScript execution (detectable via server-side rendering indicators)
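The first of these signals, an unnaturally high request rate per IP, can be detected with a sliding-window counter. The window size and threshold below are illustrative; production systems tune them per endpoint:

```python
# Sketch of server-side rate anomaly detection: flag any client IP that
# exceeds a request threshold within a sliding time window.
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # illustrative window length
MAX_REQUESTS = 20     # illustrative threshold

_history: dict[str, deque] = defaultdict(deque)

def is_suspicious(ip: str, timestamp: float) -> bool:
    """Record a request and report whether the IP exceeded the rate limit."""
    window = _history[ip]
    window.append(timestamp)
    # Drop requests that have fallen out of the sliding window.
    while window and window[0] <= timestamp - WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```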

Honeypot links: Invisible links (e.g., display: none or white text on a white background) that are never clicked by real users. Scrapers that follow all links give themselves away by accessing these honeypot URLs.
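The server-side half of this technique can be sketched as follows: any client that requests a honeypot path is flagged and subsequently blocked. The paths, status codes, and in-memory ban store are illustrative simplifications:

```python
# Sketch: flag clients that request honeypot URLs, i.e. links invisible
# to human visitors. Paths and the in-memory ban store are illustrative.
HONEYPOT_PATHS = {"/internal/do-not-follow", "/archive-1997.html"}

flagged_ips: set[str] = set()

def handle_request(ip: str, path: str) -> int:
    """Return an HTTP status code; flag and block clients hitting honeypots."""
    if path in HONEYPOT_PATHS:
        flagged_ips.add(ip)
    if ip in flagged_ips:
        return 403  # blocked as a suspected scraper
    return 200
```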

Browser fingerprinting: Analysis of browser characteristics (Canvas fingerprint, WebGL, available fonts, JavaScript engine specifics) to distinguish bot traffic from real user traffic.

CAPTCHA: Conventional (reCAPTCHA v2) or behavior-based CAPTCHAs (reCAPTCHA v3, hCaptcha) as an access barrier for bots. Modern scraping services can bypass many CAPTCHAs using CAPTCHA-solving services (human solvers for ~$2 per 1,000 CAPTCHAs).

Defense Measures

Measure                                  Effort   Effectiveness
IP-based rate limiting                   Low      Medium (bypassed by proxy pools)
Bot detection (Cloudflare, PerimeterX)   Medium   High
CAPTCHA                                  Medium   Medium (CAPTCHA-solving services)
Mandatory JavaScript rendering           Low      Medium
Honeypot links                           Low      High against simple scrapers
API keys for structured access           Medium   High (creates accountability)
Browser fingerprinting                   High     High
Dynamic content obfuscation              High     Medium to high

Recommendation: The most effective strategy combines bot detection services (Cloudflare Bot Management, Akamai Bot Manager), rate limiting, and honeypot links. Complete prevention is not possible, but the profitability for scrapers can be reduced to such an extent that they switch to easier targets.

Web Scraping in the OSINT Context

In the context of Open Source Intelligence (OSINT), web scraping is a key tool:

Attacker’s perspective:

  • Collecting email addresses from company websites for phishing campaigns
  • Extracting employee profiles from LinkedIn for social engineering attacks
  • Analyzing job postings to identify technology stacks and vulnerabilities
  • Collecting prices, product data, and business information for competitive intelligence

Defensive Use:

  • Monitoring one’s own external attack surface (which data about the company is publicly accessible)
  • Brand monitoring: detecting typosquatting and phishing domains that mimic the company name
  • Threat intelligence: Monitoring darknet forums and Pastebin services for leaks

Penetration testing: During the OSINT phases of penetration tests, web scraping is used to gather information—always with the client’s explicit permission.

Conclusion

Web scraping is technically a neutral capability; its legality and ethics depend on the specific use case. It is important for companies to understand their own attack surface from a scraper’s perspective—which data is publicly accessible, and which of it could be misused for phishing, social engineering, or competitive intelligence? At the same time, websites should be protected with bot detection measures to make data misuse more difficult and to minimize regulatory risks arising from the unauthorized processing of personal data.

Sources & References

  1. EuGH C-238/20 - Scraping and database protection (Sulafikari) - European Court of Justice
  2. BGH I ZR 135/12 - Database protection ("Freunde finden") - Bundesgerichtshof
  3. GDPR Art. 6 - Lawfulness of processing - DSGVO-Gesetz.de
  4. OWASP: Automated Threats to Web Applications - OWASP Foundation

This article was last edited on 08.03.2026. License: CC BY 4.0 - free use with attribution: "AWARE7 GmbH, https://a7.de"
