Bypass Cloudflare & DataDome When Scraping in 2026 (TLS Fingerprints & curl_cffi)

By Serpent API Team · 13 min read

You write a clean scraper, set a perfect Chrome User-Agent, copy every header out of the browser dev tools — and Cloudflare still hands you a 403 or an endless “Checking your browser” loop. Meanwhile the exact same URL opens instantly in your own Chrome. Something is reading more than your headers.

That something is the TLS handshake. Modern anti-bot edges from Cloudflare and DataDome fingerprint the way your client negotiates the encrypted connection, long before they look at a single HTTP header. Get that layer wrong and you are flagged as a bot no matter how realistic everything above it looks.

This guide explains exactly how that detection works — JA3/JA4 TLS fingerprints, the JavaScript challenge, and behavioral scoring — then walks through the three tools that actually get through: curl_cffi browser impersonation, a real-browser stack for the heavy challenges, and clean residential IPs. All with runnable Python.

TL;DR: Cloudflare and DataDome inspect your TLS handshake (JA3/JA4) first, so plain requests is flagged before any header or JavaScript is read. Use curl_cffi with impersonate="chrome" to forge a browser-identical TLS and HTTP/2 fingerprint for low-friction checks; fall back to a real browser stack only when you hit a JavaScript managed challenge (see does stealth still work in 2026). Route everything through clean residential IPs, detect challenge pages by status code and markers so you can react, and scrape only public data within a site's terms.

How Cloudflare & DataDome actually detect you

Most “why am I blocked” tutorials stop at the User-Agent string, which is why most scrapers stay blocked. A modern anti-bot edge scores you across four layers, and any one of them can sink you. Understanding all four is the difference between guessing and fixing.

| Layer | What it inspects | When it fires |
| --- | --- | --- |
| TLS fingerprint (JA3/JA4) | The Client Hello: cipher order, extensions, curves, ALPN | During the handshake, before any HTTP byte |
| HTTP/2 fingerprint | Frame settings, header order, pseudo-header order, priority | Right after TLS, before your request is served |
| JavaScript challenge | Canvas, WebGL, fonts, timing, automation flags | On the “Checking your browser” interstitial |
| Behavioral scoring | Request cadence, mouse/scroll signals, IP reputation | Continuously, across a session |

The first two layers are passive: the edge reads how your TCP/TLS/HTTP2 stack behaves and compares it to known browsers. The JavaScript challenge is active: it ships code that runs in a real engine and reports back hundreds of environment signals. Behavioral scoring then watches the whole session over time.

The crucial insight is the ordering. The TLS check happens first, so a request from a non-browser HTTP client is judged before it ever announces a User-Agent. This is why a header-only disguise is doomed — you are decorating a layer the edge has already decided to distrust. The same fingerprinting philosophy drives the browser-side signals covered in how to beat headless Chrome detection.

Why requests and got fail at the TLS layer

When Python's requests opens an HTTPS connection, it negotiates TLS through OpenSSL. OpenSSL advertises a specific, ordered list of cipher suites and TLS extensions in its Client Hello. Node's got and the standard library do the same through their own TLS stacks. None of those orderings match what Chrome or Firefox send.

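A JA3 fingerprint is simply an MD5 hash of those handshake values: TLS version, the ordered cipher list, the extensions, the elliptic curves, and the curve formats. requests on OpenSSL produces a value that no browser emits, and to a Cloudflare or DataDome edge that mismatch is a billboard reading “automated client.” One caveat: Chrome has shuffled the order of its TLS extensions on every connection since version 110, so its raw JA3 is no longer a single stable value. JA4, the newer successor, sorts ciphers and extensions before hashing and adds details such as ALPN, which keeps it stable across that shuffling and makes it harder to spoof.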
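To make the hashing concrete, here is a minimal sketch of how a JA3 value is derived. It assumes the Client Hello fields have already been parsed into lists of integers; the handshake values below are hypothetical and chosen only to illustrate the format, not captured from any real client.

```python
import hashlib

# GREASE placeholder values (RFC 8701) are dropped before hashing, because
# browsers deliberately vary them from connection to connection.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}

def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3 string from Client Hello fields."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)
    return ",".join([
        str(version), join(ciphers), join(extensions),
        join(curves), join(point_formats),
    ])

def ja3_hash(*fields):
    """JA3 is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3_string(*fields).encode("ascii")).hexdigest()

# Hypothetical handshake values, for illustration only.
client_a = (771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
# The same values with two cipher suites swapped.
client_b = (771, [4866, 4865, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(ja3_string(*client_a))
print(ja3_hash(*client_a) == ja3_hash(*client_b))  # False: order changes the hash
```

The two clients advertise identical capabilities, yet their hashes differ because only the ordering changed. That is why copying a browser's headers does nothing for a client whose TLS library orders its handshake differently.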

The practical consequence is brutal and counter-intuitive: your headers can be flawless and you are still blocked, because the verdict was reached during the handshake. You cannot patch this from inside requests — the ordering is baked into the underlying TLS library. The only fix is to send a handshake that matches a real browser, which is exactly what the next tool does.

# This looks perfect and still gets a 403 from a protected site.
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

resp = requests.get("https://protected.example.com/", headers=headers, timeout=20)
print(resp.status_code)   # often 403 / 503 — the TLS fingerprint gave you away

No header tweak rescues this. The connection was scored before requests sent a single one of those headers. That is the whole problem in one snippet.

Tool #1: curl_cffi browser impersonation

The cleanest fix for the passive layers is curl_cffi, a Python binding to curl-impersonate — a build of libcurl patched to reproduce the exact TLS and HTTP/2 fingerprint of real browsers. Instead of OpenSSL's ordering, it sends Chrome's ordering, so the JA3/JA4 hash the edge computes is one a browser actually emits.

The API is a near drop-in for requests, with one extra argument — impersonate:

# pip install curl_cffi
from curl_cffi import requests

# impersonate="chrome" forges Chrome's TLS + HTTP/2 fingerprint
resp = requests.get(
    "https://protected.example.com/",
    impersonate="chrome",
    headers={"Accept-Language": "en-US,en;q=0.9"},
    timeout=20,
)

print(resp.status_code)        # 200 if the site only runs passive TLS/HTTP2 checks
print(len(resp.text), "bytes")

# Pin a specific build for stability, e.g. impersonate="chrome124".
# A Session reuses the connection and fingerprint across requests:
session = requests.Session(impersonate="chrome")
home = session.get("https://protected.example.com/")
page = session.get("https://protected.example.com/listings?page=2")
print(home.status_code, page.status_code)

Because curl_cffi is still an HTTP client, with no browser engine and no rendering, it is fast and cheap: it downloads only the document you ask for and needs a fraction of the time, memory, and bandwidth of a full browser. For sites whose protection stops at the passive TLS/HTTP2 layer, this single change can take you from a wall of 403s to clean 200s. Bright Data's walkthrough of web scraping with curl_cffi covers more impersonation targets and async usage.

One honest caveat: impersonation targets drift. As Chrome ships new versions, its fingerprint changes, and the library adds new targets to match. Pin a version you have tested (for example impersonate="chrome124") and revisit it periodically, the same maintenance discipline you would apply to selectors in scraping an independent index like Brave.

Tool #2: a real-browser stack for JS challenges

A forged TLS fingerprint defeats the passive layer, but it cannot execute JavaScript. When a site escalates to a managed challenge — the “Checking your browser” interstitial, a DataDome interstitial, or an interactive widget — it serves code that must run in a real engine and report canvas, WebGL, timing, and automation signals back. No HTTP client can answer that, so you escalate to a real browser.

The pattern is a two-tier fetch: try the cheap curl_cffi request first, and only spin up the expensive browser when you detect a challenge response. That keeps the slow, resource-heavy path for the minority of requests that genuinely need it:

# pip install curl_cffi playwright   (then: playwright install chromium)
from curl_cffi import requests
from playwright.sync_api import sync_playwright

def looks_like_challenge(resp):
    if resp.status_code in (403, 503):
        return True
    body = resp.text.lower()
    return any(m in body for m in (
        "checking your browser", "cf-challenge", "datadome", "captcha-delivery",
    ))

def fetch(url):
    # Tier 1: cheap impersonated HTTP request
    resp = requests.get(url, impersonate="chrome", timeout=20)
    if not looks_like_challenge(resp):
        return resp.text

    # Tier 2: a real browser runs the JavaScript challenge
    with sync_playwright() as p:
        browser = p.chromium.launch(
            args=["--disable-blink-features=AutomationControlled"]
        )
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=45000)
        # let the challenge clear, then read the settled DOM
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()
        return html

print(len(fetch("https://protected.example.com/")), "bytes")

Spinning up a real browser is not automatically enough — a default headless Chrome leaks its own automation signals that the challenge scores. You harden it with a stealth layer and the anti-detection techniques covered in does puppeteer-extra-plugin-stealth still work in 2026 and in the broader headless Chrome detection guide. The browser is the heavy tool; reach for it only when tier one comes back challenged.

Tool #3: clean residential IPs

A perfect TLS fingerprint and a clean browser still fail if the request arrives from an IP the edge already distrusts. Behavioral and reputation scoring weigh the network you come from heavily — datacenter ranges from the big cloud providers are pre-scored as suspicious because that is where most automation originates.

Residential IPs, by contrast, belong to real consumer ISPs and carry the reputation of ordinary home connections, so they clear reputation checks that datacenter IPs trip. You can source them from any residential proxy provider — Bright Data, Oxylabs, Decodo, IPRoyal, SOAX, NetNut, and DataImpulse are all peers in this market — and wire them straight into curl_cffi:

from curl_cffi import requests

# Use any residential proxy provider; format is the provider's gateway + creds
proxies = {
    "http":  "http://USER:PASS@gateway.example-proxy.com:7000",
    "https": "http://USER:PASS@gateway.example-proxy.com:7000",
}

resp = requests.get(
    "https://protected.example.com/",
    impersonate="chrome",
    proxies=proxies,
    timeout=30,
)
print(resp.status_code)

Two practical notes. First, residential bandwidth is billed per gigabyte, so combine it with the byte-trimming techniques in scraping Google for free to avoid paying for resources you never parse. Second, do not assume a residential IP makes you invisible — abuse a clean IP with aggressive cadence and it gets banned like any other, so rate-limit yourself and respect the target.
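One way to rate-limit yourself is a small per-host throttle that enforces a minimum, slightly randomized gap between requests. The sketch below is an illustration under stated assumptions: the class name `PoliteThrottle` and its default gap values are hypothetical, not a recommendation tuned for any particular site.

```python
import random
import time

class PoliteThrottle:
    """Enforce a minimum, jittered gap between requests to each host."""

    def __init__(self, min_gap=2.0, jitter=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap    # seconds between requests to one host
        self.jitter = jitter      # extra random delay, up to this many seconds
        self._clock = clock       # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed = {}   # host -> earliest time of the next request

    def wait(self, host):
        """Block until a request to `host` is allowed; return seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay:
            self._sleep(delay)
        gap = self.min_gap + random.uniform(0, self.jitter)
        self._next_allowed[host] = now + delay + gap
        return delay
```

Call `throttle.wait(hostname)` immediately before each request. Keeping the state per host means a slow, careful crawl of one site does not hold up requests to another.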

Detecting a challenge page so you can react

The single most important habit when fighting anti-bot systems is to know when you have been challenged rather than silently saving a useless page. A challenge response is often a valid HTTP 200 whose body is the interstitial, not your data — a naive scraper logs success and parses garbage.

Build one detector and run every response through it. Check the status codes that anti-bot edges use (typically 403 and 503; Cloudflare's distinctive error 1020 “access denied” is an error code printed in the page body and served with HTTP 403, not a status code of its own), then scan the body for the markers each system leaves behind:

# Markers seen in Cloudflare / DataDome challenge and block pages
CHALLENGE_MARKERS = (
    "checking your browser",      # Cloudflare interstitial
    "cf-chl-",                    # Cloudflare challenge token id
    "cf-challenge",               # Cloudflare challenge platform
    "ray id",                     # Cloudflare error footer
    "datadome",                   # DataDome script / cookie
    "captcha-delivery.com",       # DataDome CAPTCHA host
    "enable javascript and cookies to continue",
)

def detect_challenge(status_code, body, headers=None):
    """Return a string describing the challenge, or None if the page is clean."""
    headers = headers or {}

    # Cloudflare's error 1020 (access denied by a firewall rule) is an error
    # code shown in the page body and served with HTTP 403, not a status code
    if status_code in (403, 503):
        return f"blocked: HTTP {status_code}"

    # DataDome stamps a response header on many of its responses
    if "x-datadome" in {k.lower() for k in headers}:
        return "datadome: x-datadome header present"

    low = body.lower()
    for marker in CHALLENGE_MARKERS:
        if marker in low:
            return f"challenge marker: {marker!r}"

    # Cloudflare's managed challenge often ships a tiny HTML shell
    if status_code == 200 and len(body) < 2000 and "<title>just a moment" in low:
        return "challenge: cloudflare 'Just a moment' shell"

    return None

# usage
verdict = detect_challenge(resp.status_code, resp.text, dict(resp.headers))
if verdict:
    print("Challenged ->", verdict)   # escalate to the browser tier or back off
else:
    parse(resp.text)                  # safe to parse; parse() is your own function

With this in place your scraper reacts instead of failing blind: a clean page goes to the parser, a challenged page escalates to the browser tier or backs off. Treating a challenge as a recoverable signal — not a crash — is the same fail-loud discipline that keeps a scraper alive in a real free-scraping pipeline. Pair it with proper back-off when you see repeated blocks so you do not hammer an edge that has already said no.
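A back-off policy for repeated blocks might look like the following minimal sketch. It assumes you pass in your own `fetch` callable and a challenge check; the function name, the delay values, and the choice of exponential back-off with full jitter are illustrative assumptions rather than anything prescribed above.

```python
import random
import time

def fetch_with_backoff(fetch, is_challenged, url, max_attempts=4,
                       base_delay=5.0, cap=120.0, sleep=time.sleep):
    """Call fetch(url) until is_challenged(response) is falsy.

    Between attempts, waits a random time of up to base_delay * 2**attempt
    seconds (capped at `cap`). Returns None if every attempt was challenged.
    """
    for attempt in range(max_attempts):
        response = fetch(url)
        if not is_challenged(response):
            return response
        if attempt < max_attempts - 1:
            ceiling = min(cap, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    return None
```

Returning `None` after the last attempt, instead of raising or retrying forever, leaves the decision to the caller: escalate to the browser tier, park the URL for later, or give up.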

Ethics & terms of service

Getting past a challenge is a technical capability, not a permission slip. The fact that a request can be made to look like a browser does not mean a site has agreed to be scraped, and the responsible — and usually lawful — line is narrow and worth respecting.

Scrape public data only: pages that load without a login and that the site serves to any visitor. Do not circumvent authentication, paywalls, or access controls on private material — that crosses from “reading a public page” into territory that draws real legal risk. Read and honor a site's robots.txt and its terms of service, identify yourself honestly where appropriate, and rate-limit so you never degrade the service for real users.
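Honoring robots.txt can be automated with Python's standard library. The sketch below parses an inline, hypothetical robots.txt so it runs without network access; the user-agent token `example-research-bot` is likewise an assumption. Against a live site you would call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

def build_robots(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""

robots = build_robots(SAMPLE)
AGENT = "example-research-bot"   # hypothetical user-agent token

print(robots.can_fetch(AGENT, "https://protected.example.com/listings"))    # True
print(robots.can_fetch(AGENT, "https://protected.example.com/account/me"))  # False
print(robots.crawl_delay(AGENT))                                             # 5
```

Checking `can_fetch()` before queueing a URL, and feeding `crawl_delay()` into your rate limiter when a site declares one, keeps the policy in one place instead of scattered through the scraper.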

Be especially careful with personal data, which is governed by laws like the GDPR and CCPA regardless of how you obtained it. DataDome's own threat research is a useful window into how the defending side thinks about bot traffic. For a fuller treatment of where the lines sit, see is scraping Google legal in 2026 and our guide to legal and ethical search-data collection.

The verdict

The chain that beats Cloudflare and DataDome in 2026 is layered, and you climb it only as far as a given site forces you. Start at the cheapest rung and escalate on evidence, never by default.

| Tier | Tool | Beats | Cost per request |
| --- | --- | --- | --- |
| 0 | Plain requests / got | Nothing (flagged at TLS) | Tiny, but blocked |
| 1 | curl_cffi impersonation | JA3/JA4 + HTTP/2 passive checks | Low (HTTP client) |
| 2 | Hardened real browser | JavaScript managed challenges | High (full engine) |
| + | Clean residential IPs | Reputation / behavioral scoring | Per-GB bandwidth |

For most protected sites, curl_cffi plus a clean residential IP clears the bar, and you reserve the browser for the minority that throw a real JavaScript challenge. The detector decides when to climb. That is an efficient, honest architecture — and it is also a meaningful amount of moving machinery to build, monitor, and keep current as fingerprints drift.
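The "climb only on evidence" idea can be expressed as a small, tier-agnostic loop. This is a sketch under assumptions: `climb` is a hypothetical helper, and the two fetchers are stand-ins that return canned HTML so the example runs offline. In practice they would wrap an impersonated HTTP request and a hardened browser session.

```python
def climb(url, tiers, is_challenged):
    """Try each (name, fetcher) tier in order; stop at the first clean body.

    Returns (tier_name, body), or (None, None) if every tier was challenged.
    """
    for name, fetcher in tiers:
        body = fetcher(url)
        if not is_challenged(body):
            return name, body
    return None, None

# Stand-in fetchers with canned responses, for illustration only.
def fake_http_tier(url):
    return "<title>Just a moment...</title>"

def fake_browser_tier(url):
    return "<html><body>listing data</body></html>"

tiers = [("curl_cffi", fake_http_tier), ("browser", fake_browser_tier)]
name, body = climb(
    "https://protected.example.com/",
    tiers,
    is_challenged=lambda html: "just a moment" in html.lower(),
)
print(name)   # browser
```

Because the tiers are an ordered list, adding a rung or reordering the ladder is a one-line change, and logging which tier served each URL shows how often the expensive path is actually needed.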

If the anti-bot arms race is not the product you are trying to ship, the alternative is to let someone else absorb it. A managed SERP API hands you clean JSON and keeps the TLS, challenge, and IP work off your plate entirely.

Skip the anti-bot arms race entirely.

Serpent's SERP API returns clean JSON from Google, Bing, Yahoo & DuckDuckGo — no proxies, no CAPTCHAs, no parser maintenance. Get 10 free searches on signup, then pay-as-you-go from $0.03 per 10,000 searches at scale, no subscription.

FAQ

Is bypassing Cloudflare legal?

Passing an anti-bot check is not automatically unlawful, but it is not automatically lawful either, and the rules vary by jurisdiction. What matters most is what you access and how. Scraping public data that requires no login is generally treated differently from circumventing access controls on private or paywalled material, and you remain bound by a site's terms of service and by data-protection law for any personal data. The technical ability to get past a challenge does not grant permission, so scrape only public pages, respect robots.txt, rate-limit yourself, and avoid logged-in or paywalled content.

What is a JA3 fingerprint?

JA3 is an MD5 hash of the values in a TLS Client Hello: the protocol version, the ordered list of cipher suites, the extensions, the elliptic curves, and the curve formats. Because each HTTP client builds that handshake differently, the hash acts like a signature for the client software. Python's requests with OpenSSL produces a value that no real browser emits, so an anti-bot system can flag the request before a single byte of your HTTP headers is read. JA4 is the newer, more granular successor: it adds details like ALPN and sorts ciphers and extensions before hashing, so it stays stable even though recent Chrome versions shuffle their extension order, and it is harder to spoof.

Does curl_cffi beat DataDome?

curl_cffi forges a browser-identical TLS and HTTP/2 fingerprint, which is enough to pass the passive TLS layer that both Cloudflare and DataDome inspect first, so it gets you past many low-friction checks at a fraction of a browser's cost. But DataDome and Cloudflare also run an active JavaScript challenge that collects canvas, WebGL, timing, and behavioral signals. curl_cffi cannot execute that JavaScript, so when a site escalates to a managed challenge you need a real browser stack. Use curl_cffi first; fall back to a browser only when you actually hit a challenge page.

Why does requests get blocked when a browser isn't?

Because the block happens at the TLS layer, before your HTTP request is even processed. Python's requests negotiates TLS through OpenSSL, which orders its cipher suites and extensions differently from Chrome, producing a JA3/JA4 fingerprint that no real browser ever emits. An anti-bot edge can drop or challenge the connection on that mismatch alone, no matter how perfect your User-Agent and headers are. A real browser, or a client like curl_cffi that copies the browser's exact handshake, presents a fingerprint the edge expects and is allowed through.