Parse the Google SERP in 2026: PAA, AI Overviews, Snippets & Knowledge Panels
Fetching the HTML of a Google results page is the easy part. The hard part — the part that quietly eats weeks — is turning that page into clean, structured data when the page is no longer a simple list of ten blue links.
A modern Google SERP is a collage: organic results sit alongside People Also Ask accordions, a featured snippet at the top, an AI Overview that streams in after load, a knowledge panel down the right rail, and a band of related searches at the bottom. Each one lives in a different part of the DOM, and each is rendered by markup that Google rotates constantly.
This guide walks through parsing the features that matter, with resilient Python you can actually ship — multi-selector fallbacks, attribute and role anchors instead of brittle class names, and a normalizer that hands you one tidy JSON object at the end.
TL;DR: Don't anchor parsers on Google's short, rotated CSS classes, because they change weekly. Anchor on stable structure (the <a>/<h3> pairing of an organic result), ARIA roles and data attributes, and text patterns in feature headers, each wrapped in a multi-selector fallback chain. Organic results and People Also Ask questions parse from static HTML; PAA answers load only on expand, and AI Overview text is the hardest because it streams in via JavaScript and is lazy-loaded behind "Show more", so both need a headless browser. Normalize every feature into one dict, mark each field optional, and fail loud on a zero-result page.
Anatomy of a modern SERP
Before you write a single selector, it helps to map the page. The reason parsing Google is hard is not that any one feature is complex — it is that there are a dozen of them, each in its own corner of the DOM, each with different rules about when it appears. The Nielsen Norman Group's breakdown of key SERP features is a good visual reference for how much real estate the non-organic blocks now occupy.
Here is the feature map: what each block is, where it sits on the page, and why it resists a clean parse.
| Feature | Where it sits | Why it's hard to parse |
|---|---|---|
| Organic results | Main column, the classic ranked list | Class names rotate; ads and people-also-search blocks are interleaved |
| Featured snippet | Top of the main column ("position zero") | Several layouts — paragraph, list, table — each with different markup |
| People Also Ask | Mid-column accordion | Answers load on expand via JavaScript, so static HTML has questions but not answers |
| AI Overview | Top of the page, above or near the snippet | Streamed in after load, lazy-loaded behind "Show more", inconsistent coverage |
| Knowledge panel | Right rail (desktop), inline (mobile) | Key/value layout varies wildly by entity type (person, place, company) |
| Related searches | Bottom of the page | Sometimes a grid of cards, sometimes a plain link list |
Two patterns jump out of that table. First, the dynamic features — AI Overview, PAA answers — aren't in the raw HTML at all; they arrive with JavaScript. Second, every feature has multiple layouts, which is exactly why a single hard-coded selector fails so fast. For the question of whether you should be scraping this at all, see whether scraping Google is legal in 2026.
A robust selector strategy
The single biggest mistake in SERP parsing is anchoring on Google's CSS classes. You open DevTools, see a result wrapped in div.tF2Cxc or a link in div.yuRUbf, copy that into your selector, and ship it. It works — for a few days. Then Google's build pipeline rotates the obfuscated class name and your parser returns an empty list at 3 a.m.
The fix is to anchor on things that change far more slowly than class names. Three anchors, in order of preference:
| Anchor | Example | Why it's durable |
|---|---|---|
| Structure / relationships | An <a> that contains an <h3> | The link-plus-heading shape of a result rarely changes |
| Attributes & ARIA roles | [role="heading"], [data-attrid], [jsname] | Roles are tied to accessibility and semantics, not styling churn |
| Text heuristics | A header reading "People also ask" | User-facing copy is stable across deploys, though it varies with the interface language |
The second principle is to never rely on a single selector. Wrap each anchor in a small fallback chain: try the most specific selector, then a looser one, then a text-based heuristic, and only give up after all of them miss. This is the same defensive mindset covered in depth in building a resilient scraper that survives Google's weekly selector changes. Here are the two tiny helpers that power every parser in this post:
from bs4 import BeautifulSoup


def first_match(node, selectors):
    """Try CSS selectors in order; return the first element found, else None."""
    for sel in selectors:
        found = node.select_one(sel)
        if found is not None:
            return found
    return None


def text_of(node, default=""):
    """Safely read stripped text from a node that may be None."""
    return node.get_text(" ", strip=True) if node is not None else default
Those two functions look trivial, but they are the difference between a parser that degrades gracefully and one that throws a NoneType error the first time a layout shifts. Everything below builds on them.
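The fallback chain described earlier ends in a text-based heuristic, which a CSS-only helper cannot express. One way to cover that last step is sketched below. Treat it as an illustration under assumptions: `find_with_fallbacks` and `header_text` are hypothetical names (not part of BeautifulSoup), and the sample markup is invented.

```python
from bs4 import BeautifulSoup


def find_with_fallbacks(node, finders):
    """Try each finder in order; a finder is a CSS selector or a callable.

    Hypothetical helper: callables receive the node and return an element
    or None, so a text heuristic can sit at the end of a CSS chain.
    """
    for finder in finders:
        found = finder(node) if callable(finder) else node.select_one(finder)
        if found is not None:
            return found
    return None


def header_text(label):
    """Build a finder matching the element whose own text equals label."""
    wanted = label.strip().lower()

    def finder(node):
        for text in node.find_all(string=True):
            if text.strip().lower() == wanted:
                return text.parent
        return None

    return finder


# Invented markup: the class name is deliberately meaningless.
html = '<div class="x9Zq"><span>People also ask</span></div>'
soup = BeautifulSoup(html, "html.parser")

header = find_with_fallbacks(soup, [
    'div[data-initq] [role="heading"]',  # most specific
    '[role="heading"]',                  # looser
    header_text("People also ask"),      # text heuristic, last resort
])
print(header.name)
```

Both CSS selectors miss on this markup, so the element comes from the text finder. Keeping the heuristic inside the same chain means the caller still gets one element or None, exactly as with `first_match`.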
Extracting organic results & snippets
Organic results are the backbone of the page, so get these solid first. The durable shape of a result is an anchor that wraps a heading: the <h3> is the title, the enclosing <a>'s href is the URL, and a sibling block holds the description snippet. Rather than hunt for the rotating result-container class, we find every <h3>, walk up to its link, and read the snippet from the nearest text container.
# pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup


def parse_organic(html):
    soup = BeautifulSoup(html, "lxml")
    results, position = [], 0
    # Each organic result is an <a> that contains an <h3> title.
    for h3 in soup.select("h3"):
        link = h3.find_parent("a")
        if link is None or not link.get("href", "").startswith("http"):
            continue
        # The result block is the link's nearest meaningful ancestor.
        block = link.find_parent(
            lambda tag: tag.name == "div" and tag.get("data-hveid")
        ) or link.parent
        # Snippet: try a few durable containers, then any descriptive div.
        snippet_el = first_match(block, [
            '[data-sncf]',
            '[data-content-feature] div[role="text"]',
            'div[style*="line-clamp"]',
            'div span',
        ])
        snippet = text_of(snippet_el)
        title = text_of(h3)
        url = link["href"]
        if not title or not url:
            continue
        position += 1
        results.append({
            "position": position,
            "title": title,
            "url": url,
            "snippet": snippet,
        })
    return results
Notice the technique: we anchor on the <h3> and the <a> — a structural relationship that has survived years of redesigns — and only then reach for a snippet using a fallback chain through first_match. We also filter out non-http hrefs, which skips the internal "people also search for" and navigation links that masquerade as results.
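One caveat on the http filter: some Google responses, notably the basic HTML served to clients without JavaScript, wrap result links as relative /url?q=... redirects, and a strict startswith("http") check would drop them. Here is a minimal sketch of unwrapping those hrefs first, assuming that redirect shape. `unwrap_google_href` is a hypothetical helper, and checking the `url` parameter as a second carrier is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_href(href):
    """Return the destination URL for a result href, or None if unusable."""
    if href.startswith(("http://", "https://")):
        return href
    parts = urlsplit(href)
    if parts.path == "/url":
        params = parse_qs(parts.query)
        # "q" is the usual carrier; "url" is checked as a fallback assumption.
        for key in ("q", "url"):
            target = params.get(key, [""])[0]
            if target.startswith(("http://", "https://")):
                return target
    return None


print(unwrap_google_href("/url?q=https://example.com/page&sa=U&ved=abc"))
print(unwrap_google_href("https://example.com/direct"))
print(unwrap_google_href("/search?q=running+shoes"))
```

The first two calls return an absolute URL and the third returns None, so the result can feed the same skip-if-empty check that parse_organic already applies.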
One subtlety: this counts results in DOM order, which usually matches visual order, but ads and the snippet can push the true rank around. If you need pixel-accurate ranking, that is a different measurement entirely — see the pixel position API for why on-screen position and DOM position diverge.
Extracting People Also Ask & related searches
People Also Ask is the accordion of related questions that expand to reveal answers. Here is the catch that trips up most parsers: in the static HTML you usually get the questions but not the answers, because answers are fetched and injected by JavaScript only when a user expands each row. So from a plain HTML fetch, harvest the questions; for the answers, you need a browser that clicks each row open.
The questions themselves live in elements with a heading role or a data attribute that flags them as expandable. We anchor on the role and a couple of fallbacks rather than the class:
def parse_people_also_ask(html):
    soup = BeautifulSoup(html, "lxml")
    # The PAA container is usually flagged by an "Initibo"-style jsname or a
    # nearby header reading "People also ask". Find it by text, then by data attr.
    container = None
    for el in soup.find_all(string=True):
        if el.strip().lower() == "people also ask":
            container = el.find_parent("div")
            break
    if container is None:
        container = first_match(soup, [
            'div[data-initq]',
            'div[jsname][data-q]',
        ])
    if container is None:
        return []
    questions = []
    # Each question row exposes the text via a data attribute or a heading role.
    for row in container.select('[data-q], [role="heading"], [aria-level]'):
        q = (row.get("data-q") or text_of(row)).strip()
        if q and q.lower() != "people also ask" and q not in questions:
            questions.append(q)
    return [{"question": q, "answer": None} for q in questions]


def parse_related_searches(html):
    soup = BeautifulSoup(html, "lxml")
    related = []
    # Related searches are /search?q=... links near the foot of the page.
    for a in soup.select('a[href*="/search?"]'):
        text = text_of(a)
        if text and text not in related and len(text.split()) <= 8:
            related.append(text)
    return related
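The related-searches selector above accepts any /search? link, which can also pick up pagination and tab links. A stricter filter could look at the href itself. This sketch assumes pagination links carry a `start` parameter and vertical tabs carry `tbm` or `udm`; those parameter names, and the helper name `related_query`, are assumptions to verify against a live page.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed markers of links that are not related searches.
SKIP_PARAMS = {"start", "tbm", "udm"}


def related_query(href, current_query):
    """Return the query a /search link points to, or None if it is not related."""
    parts = urlsplit(href)
    if parts.path != "/search":
        return None
    params = parse_qs(parts.query)
    if SKIP_PARAMS.intersection(params):
        return None  # pagination or a vertical tab
    query = params.get("q", [""])[0].strip()
    if not query or query.lower() == current_query.strip().lower():
        return None  # empty, or just a link back to the same search
    return query


print(related_query("/search?q=trail+running+shoes&sa=X", "best running shoes"))
print(related_query("/search?q=best+running+shoes&start=10", "best running shoes"))
```

Reading the query from the href rather than the link text also sidesteps truncated or decorated anchor text.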
The PAA parser leans on the visible header text "People also ask" as its primary anchor, because that string is stable across deploys even when every class around it changes. We then collect question text from whatever exposes it — a data-q attribute or a heading role — and de-duplicate. If you want the answers too, drive a headless browser to click each row and re-read the DOM; the click-and-wait pattern is the same one used for dynamic features in scraping the Google SERP for free. PAA mining is also the engine behind winning position-zero featured snippets, since the two features feed each other.
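For readers who want to try the click-and-wait pattern, here is a rough sketch using Playwright for Python. Several things are assumptions: that Playwright is installed, that `[data-q]` (borrowed from the parser above) selects clickable rows, and that the timings are long enough. `expand_paa_rows` and `fetch_expanded_html` are hypothetical names.

```python
def expand_paa_rows(page, row_selector="[data-q]", max_rows=4, settle_ms=800):
    """Click up to max_rows accordion rows; return (rows_clicked, rendered_html).

    `page` is a Playwright Page. The selector default is an assumption.
    """
    rows = page.locator(row_selector)
    clicked = 0
    for i in range(min(rows.count(), max_rows)):
        try:
            rows.nth(i).click(timeout=2000)
            page.wait_for_timeout(settle_ms)  # give the injected answer time to render
            clicked += 1
        except Exception:
            continue  # a row that will not open is skipped, not fatal
    return clicked, page.content()


def fetch_expanded_html(url, row_selector="[data-q]"):
    """Open url in headless Chromium, expand PAA rows, return the HTML."""
    # Imported lazily so the module loads even where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            _, html = expand_paa_rows(page, row_selector)
            return html
        finally:
            browser.close()
```

The returned HTML can go straight back through the static parsers; where the answer text lands in the expanded DOM still has to be found by inspection, since this sketch does not assume a selector for it.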
Featured snippets, AI Overviews & knowledge panels
These three are the rich features, and they range from "tricky" to "genuinely the hardest thing on the page." Take them in order of difficulty.
Featured snippets — the answer box at position zero — are tricky because Google renders them in several layouts: a paragraph, a bulleted list, a numbered list, or a small table. A single selector cannot catch all four, so detect the snippet container by its position and data attributes, then branch on what's inside it. Anchor on the data-attrid family Google uses to label answer content rather than the styling class.
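The branch-on-contents step might look like the sketch below. It assumes the snippet container has already been located, and the function name, the returned field names, and the sample markup (including the `data-attrid` value) are all invented for illustration.

```python
from bs4 import BeautifulSoup


def parse_snippet_container(container):
    """Branch on the layout inside an already-located snippet container."""
    table = container.find("table")
    if table is not None:
        rows = [
            [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")
        ]
        return {"type": "table", "rows": rows}

    for tag, kind in (("ol", "numbered_list"), ("ul", "bulleted_list")):
        lst = container.find(tag)
        if lst is not None:
            items = [li.get_text(" ", strip=True) for li in lst.find_all("li")]
            return {"type": kind, "items": items}

    # No table or list: treat whatever text remains as a paragraph snippet.
    text = container.get_text(" ", strip=True)
    return {"type": "paragraph", "text": text} if text else None


# Invented markup standing in for a located snippet container.
sample = '<div data-attrid="snippet"><ol><li>Lace up</li><li>Warm up</li></ol></div>'
container = BeautifulSoup(sample, "html.parser").select_one("[data-attrid]")
print(parse_snippet_container(container))
```

Tables are checked first because a table snippet may contain list markup in its cells, and the paragraph branch is the catch-all. Returning a `type` field lets the normalized schema keep one `featured_snippet` key regardless of layout.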
Knowledge panels are easier structurally but vary by entity. They're a key/value card — founded date, CEO, headquarters — living in the right rail on desktop. The reliable anchor is the data-attrid attribute, which encodes the semantic field (for example data-attrid values like kc:/business/business:founded). Read those attribute-labelled rows into a dict and you have the panel without caring how it's styled.
AI Overviews are the hardest field on the entire page, and it's worth being blunt about why. The AI Overview block is streamed into the DOM by JavaScript after the initial HTML response, so a plain requests.get never sees it. It's then lazy-loaded behind a "Show more" control, so even a headless browser captures only a truncated preview unless it clicks to expand. And it doesn't appear for every query, region, or session — coverage is inconsistent by design. A robust extractor has to wait for the block, expand it, and read the rendered text, and still treat the result as optional. Because that flow deserves its own treatment, the full method — waiting, expanding, and reading the streamed text reliably — is covered in the deep dive on extracting Google AI Overviews. The practical rule: parse the AI Overview last, in a browser, and never let a missing one fail your pipeline.
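The "never let a missing one fail your pipeline" rule can be enforced with a small wrapper rather than scattered try/except blocks. A minimal sketch: `optional_feature` is a hypothetical helper, and `flaky_ai_overview` is a stand-in for whatever browser-based extractor you write.

```python
import logging

logger = logging.getLogger("serp.optional")


def optional_feature(name, extractor, *args, default=None, **kwargs):
    """Run an extractor for an optional feature; log and return default on failure."""
    try:
        value = extractor(*args, **kwargs)
    except Exception as exc:  # browser timeouts, missing nodes, and so on
        logger.warning("optional feature %r unavailable: %s", name, exc)
        return default
    # Empty results are normalized to the default as well.
    return value if value else default


def flaky_ai_overview(html):
    """Stand-in extractor that fails the way a real one might."""
    raise TimeoutError("AI Overview block never appeared")


serp = {"organic": [{"position": 1}]}
serp["ai_overview"] = optional_feature("ai_overview", flaky_ai_overview, "<html></html>")
print(serp["ai_overview"])  # None
```

The failure is logged rather than swallowed silently, so a sudden run of missing overviews still shows up in monitoring while the rest of the SERP is returned intact.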
Google's own search appearance documentation describes these features from the publisher's side, which is a useful cross-check on what each block is supposed to contain when it does appear.
Normalizing everything to clean JSON
Once each feature has its own parser, the last job is to stitch them into one predictable object. This is the contract the rest of your code depends on, so make it boringly consistent: every feature is a key, and every absent feature is an empty list, an empty dict, or None, never a missing key. That way downstream code can read data["ai_overview"] without a guard and get None rather than a KeyError.
def normalize_serp(html, query):
    """Parse a Google results page into one tidy, stable dict."""
    soup = BeautifulSoup(html, "lxml")
    serp = {
        "query": query,
        "organic": parse_organic(html),
        "people_also_ask": parse_people_also_ask(html),
        "related_searches": parse_related_searches(html),
        "featured_snippet": None,  # filled by your snippet branch, if present
        "knowledge_panel": {},     # key/value dict from data-attrid rows
        "ai_overview": None,       # browser-only; optional by design
    }
    # Knowledge panel: read every semantically labelled row into a dict.
    for row in soup.select("[data-attrid]"):
        attrid = row.get("data-attrid", "")
        value = text_of(row)
        if attrid.startswith("kc:") and value:
            key = attrid.split("/")[-1].split(":")[-1]
            serp["knowledge_panel"].setdefault(key, value)
    # A tidy summary so callers can see at a glance what was found.
    serp["features_present"] = sorted(
        name for name, val in serp.items()
        if name != "query" and val not in (None, [], {})
    )
    return serp


if __name__ == "__main__":
    import json
    with open("serp.html", encoding="utf-8") as f:
        html = f.read()
    print(json.dumps(normalize_serp(html, "best running shoes 2026"), indent=2))
The shape never changes between queries, which is what makes it usable. A query with no knowledge panel returns an empty dict for that key, not a different schema. That stability is also what makes the data cacheable and diffable — if you're storing SERPs over time to track changes, a fixed schema lets you compare two captures field by field, the approach behind building a SERP cache.
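To make the field-by-field comparison concrete, here is one possible shape for it. Both function names are hypothetical, and the two captures are made-up data in the normalized schema.

```python
def diff_serps(old, new, ignore=("features_present",)):
    """Compare two normalized SERP dicts key by key; return only what changed."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = {"before": before, "after": after}
    return changes


def rank_changes(old, new):
    """Map each URL whose organic position changed to (old_position, new_position)."""
    before = {r["url"]: r["position"] for r in old["organic"]}
    after = {r["url"]: r["position"] for r in new["organic"]}
    return {
        url: (before.get(url), after.get(url))
        for url in before.keys() | after.keys()
        if before.get(url) != after.get(url)
    }


# Made-up captures of the same query on two days.
monday = {
    "query": "best running shoes 2026",
    "organic": [{"position": 1, "url": "https://a.example"},
                {"position": 2, "url": "https://b.example"}],
    "ai_overview": None,
}
tuesday = {
    "query": "best running shoes 2026",
    "organic": [{"position": 1, "url": "https://b.example"},
                {"position": 2, "url": "https://a.example"}],
    "ai_overview": "Cushioned daily trainers lead most lists.",
}

print(sorted(diff_serps(monday, tuesday)))
print(rank_changes(monday, tuesday))
```

Because absent features are always present as None or an empty container, the diff reports an AI Overview appearing as an ordinary value change instead of a schema change.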
Why this breaks weekly
Even with every defensive technique above, a self-maintained Google parser is a standing maintenance task, not a one-time build. Google ships layout experiments to slices of traffic continuously and rotates its obfuscated class names on its own schedule. Anchoring on structure, roles, and text buys you weeks instead of days — but not immunity.
The failure mode that actually hurts is the silent one: a layout shift that makes parse_organic return an empty list while every HTTP request still returns 200. A naive scraper logs success and you discover the gap days later when a report comes back blank. The cure is to fail loud — assert on expected counts and alert when they collapse:
def assert_healthy(serp, min_organic=5):
    organic = len(serp["organic"])
    if organic < min_organic:
        raise RuntimeError(
            f"SERP parse looks broken: only {organic} organic results "
            f"for {serp['query']!r} (expected >= {min_organic}). "
            f"Selectors have probably drifted."
        )
    return serp
Run a known query as a canary on a schedule, pipe that assertion to an alert, and you'll learn about selector drift from a notification instead of an angry stakeholder. The full playbook — fallback chains, fail-loud assertions, and canary monitors — is the subject of surviving Google's weekly selector changes, and the broader "why does my scraper die overnight" question is answered in why your SERP scraper breaks at 3 a.m.
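A canary can be as small as one function that a scheduler calls. In this sketch `run_canary`, `fake_fetch`, and `broken_parse` are hypothetical names; in practice the fetcher would be your own HTTP or browser layer, the parser would be normalize_serp, and `alert` would post to whatever notification channel you use.

```python
def run_canary(fetch_html, parse, query, min_organic=5, alert=print):
    """Fetch and parse one known query; alert and return False on drift."""
    try:
        serp = parse(fetch_html(query), query)
        found = len(serp.get("organic", []))
        if found < min_organic:
            raise RuntimeError(
                f"only {found} organic results (expected >= {min_organic})"
            )
    except Exception as exc:
        # Fetch errors, parser crashes, and thin results all count as drift.
        alert(f"SERP canary failed for {query!r}: {exc}")
        return False
    return True


# Stand-ins for illustration only.
def fake_fetch(query):
    return "<html></html>"


def broken_parse(html, query):
    return {"query": query, "organic": []}


alerts = []
ok = run_canary(fake_fetch, broken_parse, "best running shoes 2026",
                alert=alerts.append)
print(ok, alerts)
```

Passing the fetcher, parser, and alert sink as arguments keeps the check testable without network access, and the boolean return lets a cron wrapper set its exit code.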
The verdict
Parsing the Google SERP yourself is entirely doable, and the techniques here — structural and role anchors, multi-selector fallbacks, text heuristics, a fixed normalized schema, and fail-loud assertions — are what separate a parser that lasts a month from one that dies in a day. For organic results, People Also Ask questions, and related searches, a well-built BeautifulSoup parser will serve you well.
The honest caveat is the dynamic features. AI Overviews and PAA answers need a headless browser and a tolerance for inconsistency, and all of it sits on top of selectors that Google is free to change at any moment. When the maintenance cost of chasing that drift outweighs the cost of the data, a managed API that returns the whole feature set as stable JSON — and absorbs the selector churn for you — becomes the cheaper option. If you're weighing that trade-off, what a SERP API actually is lays out the build-versus-buy maths.
Skip the selector churn. Get every SERP feature as clean JSON.
Serpent's SERP API returns clean JSON from Google, Bing, Yahoo & DuckDuckGo — organic, People Also Ask, featured snippets, AI Overviews and knowledge panels, no proxies, no CAPTCHAs, no parser maintenance. Get 10 free searches on signup, then pay-as-you-go from $0.03 per 10,000 searches at scale, no subscription.
FAQ
Why do Google's feature selectors change so often?
Google's result HTML is generated by a build pipeline that emits obfuscated, frequently rotated CSS class names, and the company ships layout experiments to slices of traffic constantly. A class that anchors your parser today can be renamed in the next deploy, and an A/B test can serve a different DOM to a fraction of requests. That is why parsers built on raw class names break within days — and why you anchor on stable attributes, roles, and text patterns instead.
Can I extract AI Overview text?
Sometimes, and it is the hardest feature on the page. The AI Overview block is streamed in by JavaScript after the initial HTML loads, is heavily lazy-loaded behind a "Show more" control, and does not appear on every query or for every region or session. A plain HTML fetch usually misses it entirely; you need a headless browser that waits for the block, expands it, and reads the rendered text. Even then, coverage is inconsistent, so always treat the field as optional.
What's the best way to keep parsers alive?
Use multi-selector fallback chains so that when one anchor disappears the parser tries the next, anchor on attributes and ARIA roles rather than class names, and fail loud with assertions on expected result counts so a silent zero-result page raises an alert instead of logging success. Pair that with a canary monitor that runs a known query on a schedule and flags drift the moment the shape changes.
Are SERP feature CSS classes stable?
No. The short, cryptic class names you see in Google's result HTML — strings like g, tF2Cxc, or yuRUbf — are generated and rotated by Google's build system, so they can and do change without notice. Treat them as hints at best. Build your selectors around things that change far less often: the link and heading structure, ARIA roles, data attributes, and the natural-language text of feature headers.



