How to scrape Google Gemini: parse internal API confidence scores
Google Gemini generates hundreds of millions of responses daily. The web interface delivers rich structured data with confidence scoring that direct APIs miss: detailed sources, proper markdown formatting, and real-time web integration.
Gemini wasn’t built for programmatic access. The platform uses anti-bot systems, internal API endpoints with nested JSON, and session validation that most scraping tools can’t handle.
After analyzing millions of Gemini responses, we’ve reverse-engineered the process. This guide shows how to scrape Gemini and extract the structured data with confidence scoring that makes it useful for research and product work.
Why scrape Google Gemini responses?
Gemini’s API responses look nothing like what users see in the UI.
What you miss with the API:
- The actual interface experience users get
- Source citations with confidence levels
- Markdown formatting and structure
- Real-time web integration and context
Without scraping, you can't verify what Gemini tells users or assess source reliability. Scraping can also be up to 12x cheaper than direct API usage, and it returns the same output users see.
Use cases:
- Verification. Check what Gemini tells users, with source confidence.
- SEO. Track how Gemini sources and cites information.
- Market research. Pull full responses with formatted markdown.
- Content analysis. Study how Gemini structures information and ranks reliability.
You might also be interested in how to scrape Google AI Mode for a different perspective on Google’s AI search.
Understanding Gemini’s architecture
A quick look at what makes Gemini scraping awkward.
Response generation:
- Query processing. Your prompt is analyzed and sent to the Bard backend.
- Internal API calls. Gemini makes HTTP POST requests to Bard frontend endpoints.
- JSON array responses. Responses come as nested JSON data.
- Dynamic rendering. Content renders client-side with source citations.
- Confidence scoring. Sources are assigned confidence levels based on reliability.
Key technical challenges:
Internal API format:
// Gemini uses complex nested JSON arrays
const response = {
  0: [
    2,
    "response_data",
    {
      4: [0, "text_content", [{ 1: "confidence_levels" }]],
    },
  ],
};
// Content isn't available in standard API format
Response structure:
[0, [2, "nested_response_data", {"4": [0, "content", [sources]]}]]
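To make those positions concrete, here's a minimal sketch of walking that structure in Python. The shape is the simplified example above, not a guaranteed schema (the bare `sources` placeholder is replaced with a dummy value so the JSON parses):

```python
import json

# Simplified shape from the example above -- not a guaranteed schema
raw = '[0, [2, "nested_response_data", {"4": [0, "content", [["source"]]]}]]'
data = json.loads(raw)

payload = data[1][2]        # the dict keyed by string indices
content = payload["4"][1]   # "content"
sources = payload["4"][2]   # the nested sources array
print(content, sources)
```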
Anti-bot detection:
- Canvas fingerprinting
- Request pattern monitoring
- CAPTCHA challenges
- Cookie-based session validation
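None of these checks are defeated by a single switch, but as a hedged sketch, the Playwright launch options below are commonly used to reduce obvious automation signals. The Chromium flag is a standard switch; whether it is sufficient against Gemini's detection is not guaranteed:

```python
from playwright.async_api import async_playwright

# A sketch of launch options that reduce obvious automation signals.
# --disable-blink-features=AutomationControlled is a standard Chromium
# switch; it is not a guarantee against Gemini's detection.
async def launch_browser():
    p = await async_playwright().start()
    browser = await p.chromium.launch(
        headless=False,  # headed browsers are harder to flag as bots
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = await browser.new_context(
        locale="en-US",
        viewport={"width": 1366, "height": 768},
    )
    return p, browser, context
```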
Dynamic source loading:
- Confidence-level based source ordering
- Real-time web integration
- Nested JSON parsing requirements
The internal API parsing challenge
Most of the work in Gemini scraping is parsing nested JSON arrays from internal Bard endpoints. What makes it tricky:
Event stream structure:
# Raw Gemini internal API response example
[0, [2, "response_data", {
    "4": [0, "Hello", [
        {"1": 85, "2": ["https://example.com", "Source Title", "Description"]}
    ]]
}]]
Parsing challenges:
- Nested arrays. Response data sits deep inside JSON arrays.
- Mixed indexing. Content and sources use different array positions.
- Confidence extraction. Pulling source confidence levels requires specific path navigation.
- Error handling. Network issues can corrupt the nested structure.
Python parsing implementation:
import json
from typing import Any, Dict, List, Optional

def get_final_response(event_stream_body: str) -> Optional[Any]:
    """Extract the final complete response from an event stream."""
    lines: List[str] = event_stream_body.strip().split("\n")
    largest_response: Optional[Any] = None
    largest_size: int = 0
    for line in lines:
        try:
            data: Any = json.loads(line)
            line_size: int = len(line)
            # The longest streamed line carries the most complete response
            if line_size > largest_size:
                largest_size = line_size
                largest_response = data
        except (json.JSONDecodeError, IndexError, TypeError):
            continue
    if not largest_response:
        return None
    # The payload is itself a JSON string nested at [0][2] of the envelope
    return json.loads(largest_response[0][2])

def extract_response_text(response_object: Any) -> str:
    """Extract the main text content from the nested response."""
    # Path: candidates block -> first candidate -> text parts -> first part
    return response_object[4][0][1][0]

def extract_sources(response_object: Any) -> List[Dict[str, Any]]:
    """Extract sources with confidence levels from the response."""
    sources: List[Dict[str, Any]] = []
    try:
        citations_objects = response_object[4][0][2][1]
        for idx, citation_object in enumerate(citations_objects, start=1):
            confidence_level = citation_object[1][2]
            url = citation_object[2][0][0]
            label = citation_object[2][0][1]
            description = citation_object[2][0][3]
            sources.append({
                "position": idx,
                "label": label,
                "url": url,
                "description": description,
                "confidence_level": confidence_level,
            })
    except (IndexError, TypeError, KeyError, AttributeError):
        # A missing or reshaped citation block shouldn't kill the whole parse
        pass
    return sources
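A quick way to sanity-check these helpers is to run them against a mock event stream shaped the way the parsers expect. The payload below is hand-built to match the index paths in the code above, not a captured response:

```python
import json

# Hand-built mock matching the index paths used above -- not a real capture
citation = [None, [None, None, 85],
            [["https://example.com", "Example Source", None, "A description"]]]
candidate = [None, ["Hello from Gemini"], [None, [citation]]]
inner_payload = [None, None, None, None, [candidate]]

# Outer envelope: get_final_response expects the payload as a JSON string at [0][2]
event_stream = json.dumps([["wrb.fr", None, json.dumps(inner_payload)]])

final = get_final_response(event_stream)
print(extract_response_text(final))  # Hello from Gemini
print(extract_sources(final))        # [{'position': 1, ..., 'confidence_level': 85}]
```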
Building the scraping infrastructure
Here’s the full scraping system, step by step.
Required components:
- Browser automation. Playwright, since the interface is JavaScript-heavy.
- Network interception. To capture internal Bard API calls.
- JSON parser. To process the nested response arrays.
- Content extractor. To parse HTML and pull structured data.
Complete scraper implementation:
import asyncio
from playwright.async_api import async_playwright, Page
import json
from typing import Dict, Any, List, Optional

class GeminiScraper:
    def __init__(self):
        self.captured_responses = []

    async def setup_page_interceptor(self, page: Page):
        """Set up network response interception for Bard endpoints."""
        async def handle_response(response):
            # Capture Bard frontend API responses
            if 'BardChatUi/data/assistant.lamda.BardFrontendService/StreamGenerate' in response.url:
                response_body = await response.text()
                self.captured_responses.append(response_body)
        page.on('response', handle_response)

    async def scrape_gemini(self, prompt: str) -> Dict[str, Any]:
        """Main scraping function."""
        async with async_playwright() as p:
            # Headless browsers are easier to fingerprint, so run headed
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()
            # Set up response interception before navigating
            await self.setup_page_interceptor(page)
            try:
                # Navigate to Gemini
                await page.goto('https://gemini.google.com/app')
                # Wait for the prompt textbox, then submit the prompt
                await page.wait_for_selector('[role="textbox"]')
                await page.fill('[role="textbox"]', prompt)
                await page.press('[role="textbox"]', 'Enter')
                # Wait for response completion
                await self.wait_for_response(page)
                # Parse the captured response
                if self.captured_responses:
                    raw_response = self.captured_responses[0]
                    return self.parse_gemini_response(raw_response)
                else:
                    raise Exception("No response captured")
            finally:
                await browser.close()

    async def wait_for_response(self, page: Page, timeout: int = 60):
        """Wait for Gemini response completion."""
        for _ in range(timeout * 2):  # Check every 500ms
            # Done as soon as the interceptor has captured a response
            if self.captured_responses:
                return
            # Otherwise check whether content is accumulating in the DOM
            content_div = page.locator('message-content').first
            if await content_div.count() > 0:
                content_text = await content_div.text_content()
                if content_text and len(content_text.strip()) > 50:
                    await asyncio.sleep(2)  # Allow for final updates
                    continue
            await asyncio.sleep(0.5)
        raise Exception("Response timeout")

    def parse_gemini_response(self, raw_response: str) -> Dict[str, Any]:
        """Parse the raw Gemini response into structured data."""
        # Reuse the parsing helpers defined earlier
        final_response = get_final_response(raw_response)
        text = extract_response_text(final_response)
        sources = extract_sources(final_response)
        return {
            'text': text,
            'sources': sources,
        }
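Assuming Gemini answers your prompts without a logged-in session, running the scraper is a short driver:

```python
# Minimal driver for the class above; assumes Gemini answers without login
async def main():
    scraper = GeminiScraper()
    result = await scraper.scrape_gemini("What is quantum computing?")
    print(result["text"][:200])
    for source in result["sources"]:
        print(source["confidence_level"], source["url"])

asyncio.run(main())
```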
Parsing the streaming response data
A closer look at the extraction process.
Extracting markdown with inline sources:
async def extract_markdown_with_sources(page: Page, sources: List[Dict]) -> str:
    """Extract markdown content with inline source citations."""
    try:
        # Wait for source chips to be visible
        chip_locator = "source-inline-chip .button"
        if await page.locator(chip_locator).count() > 0:
            await page.locator(chip_locator).first.wait_for(state="visible")
        # Get the main content HTML
        content_html = await page.locator("message-content").first.inner_html()
        # Convert HTML to markdown with source links (helper defined elsewhere)
        markdown = convert_html_to_markdown_with_links(
            content_html,
            [[source] for source in sources],
            chip_locator,
        )
        return markdown
    except Exception as e:
        print(f"Markdown extraction failed: {e}")
        return ""

async def extract_html_content(page: Page, request_id: str) -> str:
    """Extract full HTML content for upload."""
    try:
        full_html = await page.content()
        # Upload to a storage service (helper defined elsewhere)
        uploaded_url = await upload_html(request_id, full_html)
        return uploaded_url
    except Exception as e:
        print(f"HTML extraction failed: {e}")
        return ""
Complete response parsing with all data types:
# NotRequired is in typing from Python 3.11; import from typing_extensions on older versions
from typing import Any, Dict, List, NotRequired, Optional, TypedDict

class GeminiLinkData(TypedDict):
    position: int
    label: str
    url: str
    description: str
    confidence_level: int

class GeminiResult(TypedDict):
    text: str
    sources: List[GeminiLinkData]
    markdown: NotRequired[str]
    html: NotRequired[Optional[str]]

async def parse_complete_gemini_response(
    page: Page,
    request_data: Dict[str, Any],
    event_stream_body: str,
) -> GeminiResult:
    """Parse a Gemini response with all optional data types."""
    include_markdown = request_data.get("include", {}).get("markdown", False)
    include_html = request_data.get("include", {}).get("html", False)
    # Extract core data
    final_response = get_final_response(event_stream_body)
    text = extract_response_text(final_response)
    sources = extract_sources(final_response)
    result: GeminiResult = {
        "text": text,
        "sources": sources,
    }
    # Add optional data
    if include_markdown:
        result["markdown"] = await extract_markdown_with_sources(page, sources)
    if include_html:
        result["html"] = await extract_html_content(page, request_data["requestId"])
    return result
Using cloro’s managed Gemini scraper

Building and maintaining a reliable Gemini scraper is expensive in time and infrastructure. That’s why we built cloro, a managed API that handles it for you.
API integration:
import requests
import json

# Your prompt
prompt = "What are the latest developments in renewable energy in 2026?"

# API request to cloro
response = requests.post(
    'https://api.cloro.dev/v1/monitor/gemini',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'prompt': prompt,
        'country': 'US',
        'include': {
            'markdown': True,
            'html': True,
            'sources': True,
        },
    },
)

result = response.json()
print(json.dumps(result, indent=2))
What cloro handles for you:
- Browser management. Rotating browsers, user agents, and fingerprints.
- Anti-bot evasion. CAPTCHA solving and detection avoidance.
- Rate limiting. Request scheduling and backoff strategies.
- Data parsing. Structured data extracted from responses.
- Error handling. Retry logic and recovery.
- Scalability. Distributed infrastructure for high-volume requests.
Structured output:
{
  "status": "success",
  "result": {
    "text": "The renewable energy sector has seen remarkable developments in 2026...",
    "sources": [
      {
        "position": 1,
        "url": "https://energy.gov/solar-innovations",
        "label": "DOE Solar Innovations Report",
        "description": "Latest breakthroughs in solar panel efficiency and storage technology",
        "confidence_level": 92
      }
    ],
    "markdown": "**The renewable energy sector** has seen remarkable developments in 2026...",
    "html": "https://storage.cloud.html/uploaded-gemini-response.html"
  }
}
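Because the response is plain JSON, downstream filtering is trivial. Continuing from the request example above, here's one way to keep only high-confidence sources (field names follow the output shown):

```python
# Field names follow the structured output above
sources = result.get("result", {}).get("sources", [])
high_confidence = [s for s in sources if s["confidence_level"] >= 80]
for s in sorted(high_confidence, key=lambda s: s["confidence_level"], reverse=True):
    print(f"{s['confidence_level']:>3}  {s['label']}  {s['url']}")
```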
Why teams use cloro:
- 99.9% uptime, vs. DIY solutions that break often.
- P50 latency under 45s, vs. manual scraping that takes hours.
- No infrastructure costs. We handle browsers, proxies, and maintenance.
- Structured data, with sources, confidence levels, and markdown parsed for you.
- Rate limiting and ethical scraping practices.
- Distributed infrastructure for high-volume requests.
Conclusion
Gemini data is worth pulling. Researchers studying AI behavior, businesses tracking their competitive landscape, and developers building AI-powered tools all benefit from structured Gemini responses with confidence scoring.
For most teams, cloro’s Gemini scraper is the shortest path. You get:
- Reliable scraping infrastructure on day one
- Data parsing with confidence scoring
- Anti-bot evasion and rate limiting built in
- Retry logic and error recovery
- Structured JSON output with all metadata
Building this yourself typically costs $5,000-10,000/month in development time, browser instances, proxy services, and maintenance.
If you need a custom solution, the approach above is a starting point. Expect ongoing maintenance: Gemini updates its anti-bot measures and response formats often.
As more teams discover the value of AI monitoring, competition for visibility in AI responses grows. Companies that start tracking their Gemini presence now will build a lead that’s hard to close later.
Ready to pull Gemini data? Get started with cloro’s API.
Frequently asked questions
What is unique about scraping Gemini?
Gemini's responses are often delivered via internal Google APIs with complex, nested JSON arrays that are difficult to parse compared to standard JSON.
Does Gemini provide confidence scores?
Yes, internally Gemini assigns confidence scores to its sources. Advanced scrapers can extract these hidden metrics from the response data.
How do I handle Google login for scraping?
You generally shouldn't. Automating logins is risky and leads to bans. It's better to use methods that don't require personal authentication, or to rely on ephemeral sessions.
What is the internal API parsing challenge in Gemini?
Gemini's responses come as deeply nested JSON arrays with mixed indexing for content and sources, requiring specific path navigation and robust error handling to extract data reliably.
What infrastructure is needed to scrape Gemini?
You need browser automation (Playwright), network interception to capture internal Bard API calls, and a custom JSON parser capable of handling complex nested arrays.
Related reading
How to scrape Google Search results effortlessly
Complete technical guide to scraping Google Search results, parsing organic results, sponsored ads, AI Overview, and extracting structured data for SEO monitoring and competitive analysis.
Google search parameters: the complete guide
Stop searching like a novice. Master the hidden URL parameters like uule, gl, and udm to control location, language, and AI features.
How to scrape ChatGPT: parse SSE streams + bypass Cloudflare
Scrape ChatGPT in 2026: parse the Server-Sent Events stream, expand lazy-loaded citations, and bypass Cloudflare with residential proxies.