The trap
The instinct when you have 20 keywords to scrape is to fire them all at once. Parallel requests, faster results, done in seconds instead of minutes.
Anti-bot systems exploit exactly this instinct. When they detect burst traffic from a single session, they don't always block you — they serve you empty results. No error. No log entry. Just zero products, every time.
Your code works. Your metrics look fine. Your data is silently wrong.
Why this happens
Modern anti-bot systems (DataDome, Cloudflare, PerimeterX) use behavioural fingerprinting, not just IP blocking. A human browsing a marketplace makes one request, pauses, makes another. A bot sends ten requests in 200ms. The system doesn't ban you — it feeds you garbage until you go away.
This is called a shadow ban in scraping terminology. The session remains valid, the HTTP status codes stay 200, but the response payloads are sanitised to contain nothing useful.
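Because the status code is clean, the only way to spot a shadow ban is to inspect the payload itself. A minimal sketch of that check, assuming your scraper already has some `parse_products` function that extracts items from the page (the name and signature here are illustrative, not a prescribed API):

```python
import requests
from typing import Callable

def looks_shadow_banned(
    response: requests.Response,
    parse_products: Callable[[str], list],
) -> bool:
    # The shadow-ban signature: transport-level success, content-level emptiness.
    # A hard block usually announces itself with a 403, a 429, or a CAPTCHA page.
    return response.status_code == 200 and len(parse_products(response.text)) == 0
```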
The fix isn't a library — it's a constraint
The correct architecture is:
```python
import time

for keyword in searches:
    results = scrape_keyword(keyword)
    store_results(results)
    time.sleep(3)  # non-negotiable
```
That sleep is load-bearing. It is not a politeness convention. Removing it will silently corrupt your dataset.
Key rules:
- Never use `asyncio.gather()` or `ThreadPoolExecutor` for scrape calls
- Delay between requests: 3s minimum; increase it if empty results start coming back
- One scrape session: do not spawn multiple instances (a sketch implementing these rules follows)
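Here is one way to follow all three rules at once: a sequential loop with a small random jitter on top of the base delay, and a backoff whenever a page comes back empty. `scrape_keyword` and `store_results` stand in for your existing functions; their signatures are assumptions for the sketch, not a prescribed API:

```python
import random
import time

import requests

BASE_DELAY = 3.0  # seconds; the floor, never go below it

def scrape_all(keywords, scrape_keyword, store_results):
    session = requests.Session()          # one session for the whole job
    delay = BASE_DELAY
    for keyword in keywords:
        results = scrape_keyword(session, keyword)
        store_results(results)
        if not results:
            delay = min(delay * 2, 60.0)  # empty page: back off, suspect a shadow ban
        else:
            delay = BASE_DELAY            # healthy page: return to the floor
        time.sleep(delay + random.uniform(0.0, 1.0))  # jitter breaks the metronome
```

The jitter matters for the same reason the delay does: a perfectly regular 3.000s cadence is itself a behavioural fingerprint no human produces.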
The signal that something is wrong
If you're seeing empty results and no errors, assume anti-bot interference before assuming a bug. The diagnostic is:
- Run a single manual request from your browser on the same IP
- Run the same keyword via your scraper
- Compare the counts
If the browser returns 40 products and the scraper returns 0, you've been shadow-banned. Increase the delay and restart with a fresh session.
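That comparison is easy to script once you have noted the browser count. A sketch, where `expected_min` is the count you just saw in the browser and `scrape_keyword` is your existing fetch-and-parse function (both names are illustrative):

```python
def spot_check(keyword, expected_min, scrape_keyword):
    results = scrape_keyword(keyword)
    if not results and expected_min > 0:
        print(f"{keyword}: browser saw {expected_min}, scraper saw 0 -> likely shadow-banned")
        return False
    print(f"{keyword}: scraper returned {len(results)} items (browser baseline: {expected_min})")
    return True
```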
The architectural implication
Sequential scraping means your scrape job takes minutes, not seconds. Design around this — don't treat it as a performance problem to optimise away. A scheduler that runs every 4 hours with a 3-minute sequential job is correct. A parallel job that completes in 10 seconds with corrupted data is wrong.
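The shape of that correct design is a dumb outer loop around a slow inner job. A minimal in-process sketch; in practice cron or your platform's scheduler does the same work:

```python
import time

def run_every(job, interval_seconds=4 * 60 * 60):
    # Run the sequential scrape job on a fixed cadence. The job taking
    # minutes is expected; the interval absorbs it.
    while True:
        started = time.monotonic()
        job()
        time.sleep(max(0.0, interval_seconds - (time.monotonic() - started)))
```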
The right question is not "how do I make this faster?" but "how do I make this reliable?"