How to solve CAPTCHAs: the scraper's guide

The internet does not want you to read it.

You write a Python script. It works for five minutes. Then you see it: the “I am not a robot” checkbox, or worse, a grid of traffic lights.

CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are the gatekeepers of the web. For data scientists and developers, they are the primary bottleneck to gathering intelligence at scale.

If you are building a scraper, you have two choices: give up, or learn to solve them.

This guide covers automating the solution so your scripts can run 24/7 without you clicking traffic lights manually.

Often, CAPTCHAs appear alongside other restrictions like geofencing or firewall blocks.

Solve or avoid? The first decision

Before you write a single line of solver code, answer this: do you actually need to solve the CAPTCHA, or can you avoid triggering it?

In our testing, around 80% of CAPTCHAs are triggered by a fixable upstream signal: a datacenter IP, a missing Accept-Language header, a stale TLS fingerprint, or burst request rates. Fix those and the CAPTCHA never appears. Solve the CAPTCHAs instead and you pay $0.50 to $3.00 per 1k solves indefinitely.
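A minimal sketch of fixing two of those upstream signals, the missing headers and the burst request rate, assuming a plain HTTP client. The function names, the header values, and the 2 to 6 second window are illustrative assumptions, not tuned values. The IP and TLS fingerprint need a proxy or a real browser and are not covered here.

```python
import random

def browser_like_headers(language="en-US,en;q=0.9"):
    """Return headers a desktop browser would normally send (illustrative values)."""
    return {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/124.0.0.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": language,
    }

def paced_delays(n, low=2.0, high=6.0, seed=None):
    """Yield n randomised pauses (seconds) so requests do not arrive in a burst."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.uniform(low, high)
```

Pass the headers to your client (for example `requests.get(url, headers=browser_like_headers())`) and call `time.sleep(delay)` with each yielded value between requests.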

Avoid the CAPTCHA when:

  • Volume is high (>100k requests/day) and CAPTCHA cost would dominate your scraping bill.
  • Latency matters (real-time monitoring, price tracking).
  • The site uses behavioural CAPTCHAs (reCAPTCHA v3, Turnstile) where the “score” is the gate. Solving these is unreliable; preventing the trigger is more durable.

Solve the CAPTCHA when:

  • The CAPTCHA is unconditional (every request triggers it, regardless of fingerprint).
  • Volume is low (under 10k requests/month) and engineering time is more expensive than $30 in solver fees.
  • You need a real session token (logged-in scraping) and the CAPTCHA gates the login form itself.

The major families of CAPTCHAs

Before you can solve a CAPTCHA, you need to identify it. Different vendors require different bypass strategies.

1. reCAPTCHA (Google)

The most common.

  • v2. The classic “I’m not a robot” checkbox. Sometimes triggers an image challenge.
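  • v3. Invisible. It scores each request from 0.0 (likely a bot) to 1.0 (likely a human) using behavioural signals such as mouse movements and browsing history. The site picks the threshold, so a low score usually means a block or a fallback challenge.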

2. hCaptcha

The privacy-focused alternative, common on sites that want to avoid Google. Known for slightly harder image challenges (e.g., “Select the seaplane”).

3. Cloudflare Turnstile

The “smart” CAPTCHA. It often doesn’t show a puzzle at all. It inspects your browser environment (TLS fingerprint, canvas, fonts) to verify you are a real browser.

4. Geetest

Popular in Asia. Often involves sliding a puzzle piece or clicking characters in order.
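Identification can be automated with a simple substring check on the page source. This is a rough sketch: the `detect_captcha_families` name is hypothetical, and the marker hosts are the ones this guide lists in its FAQ. A page that loads its CAPTCHA lazily through JavaScript will not match until the script tag is actually in the HTML you fetched.

```python
# Script/iframe hosts per family, as listed in this guide's FAQ.
CAPTCHA_MARKERS = {
    "reCAPTCHA": "google.com/recaptcha",
    "hCaptcha": "hcaptcha.com",
    "Cloudflare Turnstile": "challenges.cloudflare.com",
    "Geetest": "geetest.com",
}

def detect_captcha_families(html):
    """Return the families whose marker host appears anywhere in the HTML."""
    lowered = html.lower()
    return [name for name, marker in CAPTCHA_MARKERS.items() if marker in lowered]
```

An empty list means no known marker was found, not that the page is unprotected.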

The decision framework: family to strategy to tradeoff

Match the CAPTCHA family to the strategy that beats it. This is the cheat sheet we hand to engineers when they ask “what should I use?”

| CAPTCHA family | Best primary strategy | Reliability | Cost per 1k | Avg. latency | When to switch |
| --- | --- | --- | --- | --- | --- |
| reCAPTCHA v2 | API solving service | 95–99% | $1.00–$3.00 | 15–45 s | Drop to AI vision if latency matters more than cost |
| reCAPTCHA v3 | Avoid via clean fingerprint | n/a | $0 | 0 s | If still gated, use AI scoring services (~$2.00/1k) |
| hCaptcha | AI solver (CapSolver, NopeCHA) | 90–95% | $1.00–$2.00 | 5–15 s | Fall back to human-solved 2Captcha for hard variants |
| Cloudflare Turnstile | Residential proxies + headless browser | 70–85% | $0.50–$2.00 | 2–8 s | Add fingerprint patches if the site upgrades to managed challenges |
| Geetest v3/v4 | Specialised solver (CapSolver) | 80–90% | $1.50–$3.00 | 10–30 s | Browser automation if API fails consistently |
| Image OCR | Tesseract / EasyOCR | 60–95% | ~$0 | ~1 s | Switch to API if accuracy drops below 80% on your set |

How to read this table. “Reliability” assumes a clean residential IP. With datacenter IPs, expect 10 to 20 percentage-point drops across the board. “Latency” is the time from challenge appearing to token returned, not end-to-end request time.
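To turn a reliability figure into a retry budget, a back-of-the-envelope sketch follows. It assumes each attempt succeeds independently with the same probability, which real solvers only approximate. The function names are hypothetical, and the 0.15 default is just the midpoint of the 10 to 20 point datacenter penalty quoted above.

```python
def adjusted_reliability(base, datacenter_ip=False, drop=0.15):
    """Apply the datacenter-IP penalty to a reliability given as a fraction (0 to 1)."""
    value = base - drop if datacenter_ip else base
    return max(value, 0.0)

def expected_attempts(reliability):
    """Mean attempts per successful solve, assuming independent attempts."""
    if reliability <= 0:
        raise ValueError("reliability must be greater than 0")
    return 1 / reliability
```

For example, a strategy that succeeds half the time needs two attempts per success on average, so both its cost and its latency roughly double.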

Strategy 1: API solving services

This is the most reliable method when you actually have to solve the challenge, including at high volume.

You send the CAPTCHA data (like the sitekey and URL) to a third-party service. They route it to a human worker or a specialized AI model. The worker solves it, and the service sends you back a “token.”

You inject this token into the website’s form, and the server thinks you solved it.

Top services:

  • 2Captcha. The veteran. Reliable, with a huge pool of human workers. Slower but solves almost anything.
  • CapSolver. AI-focused. Fast and cheaper than humans. Good for reCAPTCHA and hCaptcha.
  • Anti-Captcha. Another solid human-based service with good API libraries.

Code example: solving reCAPTCHA v2 with Python

Here is how to implement this in Python with raw requests against the 2Captcha HTTP API (the 2captcha-python library wraps the same calls).

Scenario: a website has a reCAPTCHA v2 lock on its login form.

Step 1: find the sitekey. Inspect the HTML source of the target page. Look for data-sitekey="6Ld..." inside the CAPTCHA div or iframe.
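If you would rather not read the HTML by hand, step 1 can be scripted with the standard-library parser. This is a sketch: `SiteKeyFinder` and `find_sitekeys` are hypothetical names, and it only sees sitekeys present in the static HTML. A key passed to the widget from JavaScript will not show up here.

```python
from html.parser import HTMLParser

class SiteKeyFinder(HTMLParser):
    """Collect the value of every data-sitekey attribute in a page."""

    def __init__(self):
        super().__init__()
        self.sitekeys = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases attribute names before calling this method.
        for name, value in attrs:
            if name == "data-sitekey" and value:
                self.sitekeys.append(value)

def find_sitekeys(html):
    """Return all data-sitekey values found in the HTML, in document order."""
    finder = SiteKeyFinder()
    finder.feed(html)
    return finder.sitekeys
```

Feed it the body of a normal GET on the login page and use the first result as SITE_KEY in step 2.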

Step 2: the Python script.

import time
import requests

# Configuration
API_KEY = 'YOUR_2CAPTCHA_API_KEY'
SITE_KEY = '6Ld_TARGET_SITE_KEY'
URL = 'https://target-website.com/login'

def solve_recaptcha(timeout=180, poll_interval=5):
    print("Sending CAPTCHA to 2Captcha...")

    # 1. Send the request to the solving service (HTTPS, so the API key is not sent in clear text)
    response = requests.post('https://2captcha.com/in.php', data={
        'key': API_KEY,
        'method': 'userrecaptcha',
        'googlekey': SITE_KEY,
        'pageurl': URL,
        'json': 1
    }, timeout=30)
    response.raise_for_status()
    submit_json = response.json()

    # A rejected submission (bad key, zero balance) has status 0 and an error code in 'request'
    if submit_json.get('status') != 1:
        print(f"Error: {submit_json.get('request')}")
        return None

    request_id = submit_json.get('request')
    print(f"Task ID: {request_id}")

    # 2. Wait for the solution, giving up after `timeout` seconds
    print("Waiting for solution...")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        time.sleep(poll_interval)
        result = requests.get('https://2captcha.com/res.php', params={
            'key': API_KEY,
            'action': 'get',
            'id': request_id,
            'json': 1
        }, timeout=30)
        result_json = result.json()

        if result_json.get('status') == 1:
            print("CAPTCHA Solved!")
            return result_json.get('request')  # This is the token

        # 'CAPCHA_NOT_READY' (sic) is the literal string the API returns while a task is pending
        if result_json.get('request') != 'CAPCHA_NOT_READY':
            print(f"Error: {result_json.get('request')}")
            return None

    print("Timed out waiting for a solution")
    return None

# 3. Use the token
token = solve_recaptcha()
if token:
    # Now you submit this token with your form data
    # usually in a field named 'g-recaptcha-response'
    login_data = {
        'username': 'myuser',
        'password': 'mypassword',
        'g-recaptcha-response': token
    }
    # requests.post(URL, data=login_data)

Code example: solving Cloudflare Turnstile

Turnstile is harder because there is often no visible puzzle to hand to a worker. As with reCAPTCHA, you pass the page URL and sitekey to a solver and get a token back. Most modern solvers expose a Turnstile-specific task type.

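import requests
import time

API_KEY = 'YOUR_CAPSOLVER_KEY'
SITE_URL = 'https://target-site.com/'
SITE_KEY = '0x4AAAAAAA...'  # from data-sitekey on the cf-turnstile div

# 1. Create the task
task = requests.post('https://api.capsolver.com/createTask', json={
    'clientKey': API_KEY,
    'task': {
        'type': 'AntiTurnstileTaskProxyLess',
        'websiteURL': SITE_URL,
        'websiteKey': SITE_KEY
    }
}, timeout=30).json()

# A non-zero errorId means the task was rejected (bad key, no balance, bad parameters)
if task.get('errorId'):
    raise RuntimeError(f"createTask failed: {task.get('errorDescription')}")

task_id = task['taskId']

# 2. Poll for the result, giving up after two minutes instead of looping forever
token = None
deadline = time.monotonic() + 120
while time.monotonic() < deadline:
    time.sleep(2)
    res = requests.post('https://api.capsolver.com/getTaskResult', json={
        'clientKey': API_KEY,
        'taskId': task_id
    }, timeout=30).json()

    # Stop on an API error or a failed task rather than polling a dead task
    if res.get('errorId') or res.get('status') == 'failed':
        raise RuntimeError(f"Solve failed: {res.get('errorDescription')}")

    if res.get('status') == 'ready':
        token = res['solution']['token']
        break

if token is None:
    raise TimeoutError("Turnstile solve timed out")

# 3. Submit the token in the cf-turnstile-response form field
payload = {'cf-turnstile-response': token, 'email': 'me@example.com'}
requests.post(SITE_URL, data=payload, timeout=30)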

Key gotcha: Turnstile tokens expire after 5 minutes and can be redeemed only once. If your scraping pipeline queues requests, solve the CAPTCHA just before you need to submit, not at the start of a long job.
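One way to enforce that in a queued pipeline is to carry the token's age with it and refuse to submit a stale one. The sketch below is illustrative: `ExpiringToken` is a hypothetical helper, the 300 second lifetime is the 5 minutes quoted above, and the 30 second safety margin is an assumed buffer for network delay.

```python
import time

class ExpiringToken:
    """A solved token plus the moment it stops being usable."""

    def __init__(self, value, ttl_seconds=300, clock=time.monotonic):
        self.value = value
        self._clock = clock
        self._expires_at = clock() + ttl_seconds

    def is_fresh(self, safety_margin=30):
        """True while more than `safety_margin` seconds of lifetime remain."""
        return self._clock() + safety_margin < self._expires_at
```

Check `is_fresh()` right before the submit call. If it returns False, solve again instead of sending a token the site will reject.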

Strategy 2: Browser automation plugins

If you are using Puppeteer, Playwright, or Selenium, you are controlling a real browser.

Instead of making API calls, you can install extensions that solve CAPTCHAs automatically inside the browser session.

Tools:

  • Puppeteer-extra-plugin-recaptcha. A well-known plugin for Puppeteer that detects CAPTCHAs on the page, forwards them to a solving provider (2Captcha support is built in), and injects the returned token automatically.
  • Buster. A browser extension that solves reCAPTCHA audio challenges using speech-to-text APIs.

Pros: easier to integrate if you are already using a browser.

Cons: slower than direct requests. Detecting the “I am not a robot” iframe can be flaky.
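When no plugin fits your stack, the injection step can be done by hand in the same browser session. The sketch below only builds the JavaScript; `build_token_injection_js` is a hypothetical helper, and the `g-recaptcha-response` element id is an assumption based on the reCAPTCHA v2 form field mentioned earlier. Some sites also expect a callback to fire, which this does not handle.

```python
import json

def build_token_injection_js(token, field_id="g-recaptcha-response"):
    """Return a JavaScript expression that writes `token` into the response field.

    The expression evaluates to true if the field was found, false otherwise.
    """
    # json.dumps yields correctly quoted and escaped JavaScript string literals.
    return (
        "(() => {"
        f"const el = document.getElementById({json.dumps(field_id)});"
        "if (!el) return false;"
        f"el.value = {json.dumps(token)};"
        "return true;"
        "})()"
    )
```

In Playwright the string can be passed to `page.evaluate(js)`. In Selenium, prefix it with `return ` and pass it to `driver.execute_script(...)`.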

Strategy 3: AI vision models

For simple image CAPTCHAs (text on a distorted background), you don’t need a service. You can use Optical Character Recognition (OCR).

Libraries:

  • Tesseract (via pytesseract). Good for clean text.
  • EasyOCR. Deep-learning based, handles distortion better.
  • YOLO. For object detection CAPTCHAs (e.g., “Click all the buses”).

This approach is nearly free but requires significant development time to train or tune models for specific CAPTCHA types.
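A minimal Tesseract sketch for the simplest case, a single line of distorted text. It assumes Pillow, pytesseract, and the tesseract binary are installed. The threshold of 140 and the function names are illustrative assumptions that will need tuning against your own sample set.

```python
import re

def normalise_guess(raw_text, expected_length=None):
    """Strip whitespace and punctuation noise from an OCR guess.

    Returns None when the cleaned guess does not have the expected length,
    a cheap signal to request a fresh image rather than submit a bad answer.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw_text)
    if expected_length is not None and len(cleaned) != expected_length:
        return None
    return cleaned

def read_text_captcha(image_path, expected_length=None):
    """OCR one image file with Tesseract and return the cleaned guess."""
    # Imported here so normalise_guess works without the optional OCR dependencies.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path).convert("L")           # greyscale
    image = image.point(lambda p: 255 if p > 140 else 0)  # crude threshold (assumed value)
    raw = pytesseract.image_to_string(image, config="--psm 7")  # treat as one text line
    return normalise_guess(raw, expected_length)
```

Measure accuracy on a few hundred labelled images before relying on it. The length check alone often filters out a large share of bad guesses.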

The cost of scraping

Solving CAPTCHAs is not free.

  • Financial cost. API services charge per 1,000 solutions (e.g., $0.50 to $3.00 per 1k). If you scrape 1 million pages and each one triggers a CAPTCHA, that’s $500 to $3,000 just in CAPTCHA fees.
  • Latency cost. A human worker takes 15 to 45 seconds to solve a reCAPTCHA. This kills high-frequency trading or real-time monitoring scripts.
  • Maintenance cost. Websites change their CAPTCHA providers. Today it’s reCAPTCHA; tomorrow it’s Cloudflare. Your script breaks, and you spend hours rewriting the solver logic.
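The first two costs are simple arithmetic, sketched below. The function names are hypothetical. `challenge_rate` is the fraction of requests that actually hit a CAPTCHA (1.0 reproduces the worst case above), and the concurrency estimate is Little's law, assuming a steady request rate.

```python
import math

def captcha_fees(pages, price_per_1k, challenge_rate=1.0):
    """Solver fees in dollars for `pages` requests."""
    return pages * challenge_rate * price_per_1k / 1000

def workers_needed(solves_per_minute, solve_seconds):
    """Concurrent solves required to sustain a rate (Little's law: L = rate x time)."""
    return math.ceil(solves_per_minute / 60 * solve_seconds)
```

`captcha_fees(1_000_000, 0.50)` and `captcha_fees(1_000_000, 3.00)` give the $500 and $3,000 bounds quoted above. Sustaining 60 solves per minute at 45 seconds each means keeping 45 solves in flight at once.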

The automated alternative

If your goal is the data, not the engineering challenge of breaking bot protection, building your own solver is often a waste of resources.

Advanced scraping platforms handle this natively.

cloro integrates CAPTCHA solving directly into the request pipeline.

When you send a request through cloro to scrape Google Search or monitor ChatGPT, we detect the CAPTCHA, solve it (using a blend of AI and premium proxies), and return the clean HTML.

You don’t manage API keys. You don’t handle retries. You don’t wait 45 seconds for a human to click traffic lights.

Stop fighting the gatekeepers. Walk right past them.

Frequently asked questions

Can AI solve CAPTCHAs?

Yes, modern vision models (like GPT-4V or specialized solvers) can solve image CAPTCHAs with high accuracy.

What is the hardest CAPTCHA to solve?

Behavioral CAPTCHAs (like reCAPTCHA v3 or Cloudflare Turnstile) are hardest because they analyze browsing history and TLS fingerprints, not just a puzzle.

Is it illegal to bypass CAPTCHAs?

It depends on jurisdiction and intent. Bypassing access controls to access public data is often a legal grey area; bypassing them to commit fraud is illegal.

What is the cost of scraping with CAPTCHAs?

Solving CAPTCHAs adds significant financial and latency costs. API services charge per solution, and human-solved CAPTCHAs introduce delays of 15-45 seconds per challenge.

Are there automated alternatives to manual CAPTCHA solving?

Yes, dedicated scraping platforms like cloro integrate CAPTCHA solving directly into their request pipelines, handling detection, solving, and token injection automatically, saving engineering time and cost.

Should I solve the CAPTCHA or avoid triggering it in the first place?

Avoid first. A clean residential IP, realistic TLS fingerprint, and natural request pacing will prevent most CAPTCHAs from triggering. Solving is a fallback for the 5-15% of requests that still get challenged.

How do I tell which CAPTCHA family a site is using?

Inspect the iframe src or embedded script. reCAPTCHA loads from google.com/recaptcha, hCaptcha from hcaptcha.com, Turnstile from challenges.cloudflare.com, and Geetest from geetest.com. The container's class (g-recaptcha, h-captcha, or cf-turnstile) confirms the provider, and its data-sitekey attribute holds the key a solver needs.