How to Extract Google AI Overviews at Scale (2026 API Guide + JSON Schema)
Google AI Overviews now appear on roughly 89 percent of brand searches. For any SEO or AEO team in 2026, the big question is no longer whether they appear; it is which domains are getting cited inside them, and how often. To answer that you need to extract the AIO block at scale, parse it, and diff it over time.
This guide walks through the exact JSON schema returned by modern SERP APIs, a Python recipe that runs in under thirty lines, the edge cases that broke our first three attempts, and a weekly diff job that emails you when the source list for a tracked keyword changes.
Why Extract AI Overviews?
Three reasons teams pull AIO data programmatically:
- Brand citation tracking. Is your domain inside the answer Google synthesises for your money keywords? If not, who is?
- AEO measurement. Citation rate inside AIO is the new "ranking position." You cannot improve what you do not measure.
- Competitive intel. A domain you have never heard of suddenly cited in 14 of your top 50 keywords is signal.
You cannot answer any of these by clicking through SERPs by hand. You need an API call, parsed JSON, and a database.
How Google AIO Works (and What You Can Pull)
An AI Overview is a synthesised answer rendered above the organic results for queries Google decides will benefit from a summarised response. Three things to know:
- The AIO has a body and a source list. The body is the prose summary. The source list is an array of URLs Google cited — usually 5 to 15 domains.
- The source list is what matters for AEO. The body changes wording often; the source list changes domains less often. Tracking the domains gives you a stable signal.
- Not every query gets one. Pure transactional queries ("buy iPhone 17 case") often skip AIO. Informational queries ("what is mesothelioma") almost always trigger it.
The AIO JSON Schema
Here is the canonical shape of the AIO block as returned by Serpent API (similar shape on SerpApi.com and DataForSEO — field names differ slightly):
{
  "ai_overview": {
    "text": "Mesothelioma is a rare cancer affecting the mesothelium, the protective lining around organs. It is most commonly caused by exposure to asbestos and typically takes 20 to 50 years to develop after exposure...",
    "sources": [
      {
        "title": "What is Mesothelioma? - Mayo Clinic",
        "url": "https://www.mayoclinic.org/diseases-conditions/mesothelioma/symptoms-causes/syc-20375022",
        "domain": "mayoclinic.org",
        "position": 1
      },
      {
        "title": "Mesothelioma - National Cancer Institute",
        "url": "https://www.cancer.gov/types/mesothelioma",
        "domain": "cancer.gov",
        "position": 2
      }
    ],
    "follow_up_questions": [
      "What are the early symptoms of mesothelioma?",
      "How is mesothelioma diagnosed?"
    ]
  }
}
Key fields:
- ai_overview.text — the prose summary. Plain text, sometimes with inline citation markers like [1].
- ai_overview.sources[] — the citation list. Each entry has a title, URL, domain, and the position inside the AIO block.
- ai_overview.follow_up_questions[] — the suggested follow-ups Google shows under the overview. Useful for keyword expansion.
If ai_overview is null, the query did not trigger an AIO — record that fact and move on. Do not fail the pipeline.
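Before wiring up any HTTP calls, it helps to have that null-handling as a standalone parser. This is a minimal sketch assuming the field names shown in the schema above; parse_aio_block and the has_aio flag are names introduced here, not part of any provider's API.

```python
def parse_aio_block(response: dict) -> dict:
    """Pull the AIO fields out of a raw SERP API response.

    Returns has_aio=False instead of raising when the block is null,
    so a missing overview is recorded as a fact, not a failure.
    """
    aio = response.get("ai_overview")
    if not aio:
        return {"has_aio": False, "domains": []}
    return {
        "has_aio": True,
        "text": aio.get("text", ""),
        "domains": [s["domain"] for s in aio.get("sources", [])],
        "follow_ups": aio.get("follow_up_questions", []),
    }
```

Keeping the null branch in one place means every downstream consumer sees the same shape whether or not Google rendered an overview.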
The 30-Line Python Recipe
Install requests with pip if you need it, and replace API_KEY with your own key.
import requests
import json

API_KEY = "sk_live_your_key_here"
BASE_URL = "https://apiserpent.com/api/search"

def fetch_aio(keyword, country="us"):
    params = {
        "q": keyword,
        "country": country,
        "engine": "google",
        "api_key": API_KEY,
    }
    r = requests.get(BASE_URL, params=params, timeout=30)
    r.raise_for_status()
    data = r.json()

    aio = data.get("ai_overview")
    if not aio:
        return {"keyword": keyword, "has_aio": False}

    return {
        "keyword": keyword,
        "has_aio": True,
        "text": aio["text"],
        "sources": [s["domain"] for s in aio.get("sources", [])],
        "source_urls": [s["url"] for s in aio.get("sources", [])],
        "follow_ups": aio.get("follow_up_questions", []),
    }

if __name__ == "__main__":
    for kw in ["mesothelioma", "best ergonomic chair", "react vs vue"]:
        result = fetch_aio(kw)
        print(json.dumps(result, indent=2))
Run it. You will see three blocks of JSON, one per keyword, with the cited domains pulled out into a clean list. That is the entire pipeline. Everything else — storage, scheduling, alerting — is built on top of this 30-line core.
Scaling to 10,000 Queries a Day
One query per second is too slow for a real keyword list. Here is how to scale:
1. Concurrency, not parallelism
Use asyncio + httpx.AsyncClient with a semaphore set to your provider's rate limit. On Serpent's Scale tier that is 600 req/min, so a semaphore of ~10 with a 1-second budget per request keeps you well inside the cap.
import asyncio
import httpx

SEMAPHORE = asyncio.Semaphore(10)

async def fetch_one(client, keyword):
    async with SEMAPHORE:
        r = await client.get(BASE_URL, params={
            "q": keyword, "engine": "google",
            "country": "us", "api_key": API_KEY,
        }, timeout=30)
        return r.json()

async def fetch_many(keywords):
    async with httpx.AsyncClient() as client:
        tasks = [fetch_one(client, kw) for kw in keywords]
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(fetch_many(my_keyword_list))  # my_keyword_list: your tracked keywords
2. Batch by country
Run all US queries together, then GB, then DE. AIO results vary by country — do not mix them in the same response payload or you will get inconsistent data.
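One way to sketch that batching step, assuming your tracker holds (keyword, country) pairs (batch_by_country is a name introduced here):

```python
from collections import defaultdict

def batch_by_country(pairs):
    """Group (keyword, country) pairs so each country runs as its own batch."""
    batches = defaultdict(list)
    for keyword, country in pairs:
        batches[country].append(keyword)
    return dict(batches)

pairs = [("mesothelioma", "us"), ("mesothelioma", "gb"), ("react vs vue", "us")]
# batch_by_country(pairs) → {"us": ["mesothelioma", "react vs vue"], "gb": ["mesothelioma"]}
```

Each batch then gets passed to fetch_many with a single country value, so no response payload mixes markets.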
3. Deduplicate before you call
Hash (keyword, country) and skip ones already fetched today. AIO results are fairly stable within a 24-hour window for the same query.
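A minimal version of that dedup check, keyed on keyword, country, and the current date (query_key, should_fetch, and the in-memory set are illustrative names; in production you would back the set with your database):

```python
import hashlib
from datetime import date

def query_key(keyword, country):
    """Stable hash for one (keyword, country, day) combination."""
    raw = f"{keyword.lower()}|{country.lower()}|{date.today().isoformat()}"
    return hashlib.sha256(raw.encode()).hexdigest()

fetched_today = set()

def should_fetch(keyword, country):
    """True the first time a pair is seen today, False on repeats."""
    key = query_key(keyword, country)
    if key in fetched_today:
        return False
    fetched_today.add(key)
    return True
```

Because the date is part of the hash, the cache resets itself at midnight with no cleanup job.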
4. Store the raw AIO text, not just the parsed fields
Future-you will want the original prose to re-parse with new logic. Keep both: a structured row in Postgres for analytics, and the raw JSON in S3 (or Cloud Storage) for replay.
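A sketch of the raw-side store, writing to local disk with a date/country/keyword layout (store_raw is a name introduced here; the same key layout maps directly onto S3 or Cloud Storage object keys if you swap Path for a bucket client):

```python
import json
from datetime import date
from pathlib import Path

def store_raw(keyword, country, payload, root="raw_aio"):
    """Write the untouched API response to disk for later re-parsing.

    Layout: raw_aio/<YYYY-MM-DD>/<country>/<keyword>.json
    """
    folder = Path(root) / date.today().isoformat() / country
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{keyword.replace(' ', '_')}.json"
    path.write_text(json.dumps(payload))
    return path
```

The structured Postgres row answers today's questions; this raw copy answers the questions you have not thought of yet.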
Edge Cases You Will Hit
Six things that broke our first three pipelines:
- Multi-section AIO. Sometimes Google renders the AIO as multiple expandable blocks. Modern SERP APIs concatenate them into a single text field, but you may want to keep section boundaries. Check for \n\n in ai_overview.text.
- "Show more" truncation. The visible AIO is sometimes shorter than the full extracted text. Trust the API's full text, not what you see in your browser.
- Source list partial loads. On rare occasions Google ships an AIO without a source list. Treat sources == [] as a valid state, not an error.
- Non-English AIOs. If you track country=de, the AIO text is in German. Run language detection before downstream NLP.
- Personalised results. Logged-in browsers see slightly different AIOs. SERP APIs hit Google logged-out and from a clean IP, so what you get back is the "neutral" AIO, which is what you actually want for tracking.
- The AIO disappeared. Some queries had AIOs last week but not this week. Always check has_aio before processing; never assume the structure exists.
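For the multi-section case, a small helper recovers the boundaries from the concatenated text (split_sections is a name introduced here, and it assumes the provider joins sections with blank lines as described above):

```python
def split_sections(aio_text):
    """Split a concatenated multi-section AIO body back into sections,
    dropping empty fragments left by stray blank lines."""
    return [part.strip() for part in aio_text.split("\n\n") if part.strip()]
```

Store the list alongside the full text so section-level diffs stay possible later.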
Tracking AIO Source Drift Week-Over-Week
The most useful thing you can do with AIO data is measure how the source list changes. Here is the diff job:
import sqlite3
from datetime import date

def store_snapshot(keyword, sources):
    conn = sqlite3.connect("aio_tracker.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS aio_snapshots (
            keyword TEXT, snap_date TEXT,
            sources TEXT, PRIMARY KEY (keyword, snap_date)
        )
    """)
    conn.execute("INSERT OR REPLACE INTO aio_snapshots VALUES (?, ?, ?)",
                 (keyword, date.today().isoformat(), ",".join(sources)))
    conn.commit()
    conn.close()

def diff_last_two(keyword):
    conn = sqlite3.connect("aio_tracker.db")
    rows = conn.execute("""
        SELECT snap_date, sources FROM aio_snapshots
        WHERE keyword = ? ORDER BY snap_date DESC LIMIT 2
    """, (keyword,)).fetchall()
    conn.close()
    if len(rows) < 2:
        return None
    new_set = set(rows[0][1].split(","))
    old_set = set(rows[1][1].split(","))
    return {
        "added": list(new_set - old_set),
        "removed": list(old_set - new_set),
        "stable": list(new_set & old_set),
    }
Run this weekly. Any keyword where added or removed is non-empty is something to look at. We saw 22 percent of source lists change week-over-week in our test.
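To turn those diffs into the weekly email mentioned at the top, something like this builds the message body from the diff output; sending it is then a single smtplib call. format_alert is a name introduced here, not part of the tracker above.

```python
def format_alert(keyword, diff):
    """Build the body of a weekly alert email; None means nothing changed."""
    if not diff or (not diff["added"] and not diff["removed"]):
        return None
    lines = [f"AIO source list changed for '{keyword}':"]
    if diff["added"]:
        lines.append("  added:   " + ", ".join(sorted(diff["added"])))
    if diff["removed"]:
        lines.append("  removed: " + ", ".join(sorted(diff["removed"])))
    return "\n".join(lines)
```

Returning None for the no-change case keeps the inbox quiet: only keywords with real drift generate mail.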
Cost Math
How much does it cost to run a serious AIO tracker?
- 500 keywords × 1 query per week × 4 weeks = 2,000 queries/month
- At Serpent API Scale tier ($0.30 per 1,000 queries): $0.60/month
- At Serper.dev volume pricing ($0.30 per 1,000): $0.60/month
- At DataForSEO Standard ($0.60 per 1,000): $1.20/month
- At SerpApi Developer ($15.00 per 1,000): $30.00/month
For a serious enterprise tracking 10,000 keywords weekly — 40,000 queries/month — it is $12 on Scale tier vs $600 on a legacy SerpApi plan. The maths is brutal.
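The arithmetic behind those numbers, as a one-liner you can rerun against your own keyword count (monthly_cost is a helper introduced here; it assumes four runs per month at weekly cadence):

```python
def monthly_cost(keywords, runs_per_week, price_per_1k):
    """Queries per month (4 weeks) times the per-1,000-query rate, in dollars."""
    queries = keywords * runs_per_week * 4
    return round(queries / 1000 * price_per_1k, 2)

# The scenarios from the list above:
monthly_cost(500, 1, 0.30)     # 2,000 queries at $0.30/1k → $0.60
monthly_cost(10_000, 1, 0.30)  # 40,000 queries at $0.30/1k → $12.00
monthly_cost(10_000, 1, 15.0)  # 40,000 queries at $15/1k → $600.00
```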
Build Your AIO Tracker on Serpent API
Serpent's Google SERP API returns the full AI Overview block (text + sources + follow-ups) in every response, with flat per-call pricing from $0.03 per 10,000 pages. 10 free Google searches on signup — no credit card.
Get Your Free API Key. Explore: Google SERP API · Playground · Pricing · AI Citation Tracker tutorial
FAQ
How do I extract Google AI Overviews via API?
Call a SERP API that returns the AIO block as a structured field. Serpent API, SerpApi.com, and DataForSEO all expose ai_overview.text and ai_overview.sources inside their Google response. Loop your keyword list, call the API, store the parsed fields. Full code is above.
Which queries trigger an AI Overview?
Roughly 89 percent of brand searches and 60 percent of informational queries triggered an AIO in our May 2026 sample. Transactional queries with explicit purchase intent are less likely to trigger one. Always check has_aio before processing.
How often do AI Overview source lists change?
More than people think. Tracking 200 queries weekly for two months, we saw 22 percent of source lists change between consecutive weeks even when the AIO text stayed identical. Run your tracker weekly at minimum.
Is scraping the AI Overview legal?
Reading public SERP data through a SERP API is consistent with public-data court precedent in major markets. The AIO block, like organic results, appears on a public results page. Use a SERP API provider rather than scraping directly if you want a clean compliance boundary.
Can I extract AI Mode answers the same way?
AI Mode is a separate Google surface from AI Overviews. It currently does not appear in the standard SERP HTML, so most SERP APIs do not yet return AI Mode responses. AIO extraction is the right starting point.
How do I detect when my domain stops being cited?
Run the diff job in the section above. Any keyword where your domain shows up in removed is a regression to investigate. Pair this with the AI Ranking API (which queries ChatGPT, Gemini, Claude, Perplexity for the same keywords) for a full citation picture.

