Build a Deep Research Agent in Python (Plan → Search → Synthesize → Cite)

By Serpent API Team · 12 min read

Gemini Deep Research and OpenAI’s deep research feel like magic. You ask one hard question, walk away for ten minutes, and come back to a long, well-organized report with footnotes.

Here’s the good news: the pattern behind them is not magic. It is a simple loop you can build yourself in an afternoon.

In this tutorial we’ll build a working deep research agent in Python. It plans sub-questions, runs real web searches, reads the sources, writes a cited answer, then checks itself for gaps and goes again.

You bring your own LLM (Claude, GPT, or Gemini — your choice) and your own search. For search we’ll use a clean, controllable SERP API so you can see and log every single source the agent touches.

TL;DR: A deep research agent is a loop: plan the question into sub-questions, fan out a web search for each, read the results, synthesize a cited answer, then reflect on gaps and loop again. We build it in ~120 lines of Python using your own LLM for reasoning and the Serpent SERP API for retrieval, so every source is visible and loggable, and retrieval stays cheap.

The deep-research pattern (what the big tools actually do)

Every serious deep-research system runs the same core loop: Plan → Search → Read → Reflect → Iterate → Synthesize. Once you see it, you can’t unsee it.

Google describes Gemini Deep Research as a tool that “browses the web the way you do: searching, finding interesting pieces of information and then starting a new search based on what it’s learned,” repeating this many times before writing a report (Google blog).

OpenAI’s deep research follows the same shape. It clarifies the task, decomposes the high-level query into sub-questions, searches iteratively (refining each query based on what it just read), then compiles a report where “every factual claim is accompanied by an inline citation.”

Anthropic’s engineering team describes their Research feature as a lead agent that plans a strategy, then spawns sub-agents to explore different angles in parallel before the lead compiles the final answer (Anthropic engineering blog).

Strip away the branding and the parts are always these:

| Step | What it does | Who does it in our build |
| --- | --- | --- |
| Plan | Break the question into 3–6 sub-questions | Your LLM |
| Search | Fire one web search per sub-question | Serpent SERP API |
| Read | Collect titles, URLs, snippets as context | Your code |
| Synthesize | Write a cited answer from the context | Your LLM |
| Reflect | Find what’s still missing | Your LLM |
| Iterate | Search the gaps, then re-synthesize | The loop |

That’s the whole machine. We’ll build each box and wire them together.

How this differs from single-shot RAG

The short version: RAG retrieves once and answers once; a deep research agent loops.

In classic retrieval-augmented generation, you take the user’s question, fetch some matching context, stuff it into one prompt, and generate one answer. It’s fast and great for “what do the docs say about X.” We covered that exact flow in our RAG with real-time search tutorial.

A deep research agent is the multi-step cousin. It plans several lines of inquiry, searches each one, notices what it still doesn’t know, and runs more rounds. It trades latency for depth and coverage.

| | Single-shot RAG | Deep research agent |
| --- | --- | --- |
| Search calls | Usually 1 | Many (one per sub-question, multiple rounds) |
| LLM calls | 1 | Plan + synthesize + reflect, repeated |
| Best for | Direct lookups, chat grounding | Open-ended “research this for me” tasks |
| Latency | Seconds | Tens of seconds to minutes |
| Output | Short grounded answer | Long cited report |

Rule of thumb: if one good search answers it, use RAG. If the question has parts — comparisons, timelines, “pros and cons,” market scans — you want the loop.

Why a controllable SERP API beats opaque built-in search

A SERP API hands you the raw results, so you control the queries, see every source, log everything, and pay a flat, known price. A model’s built-in search hides all of that.

Three concrete reasons this matters for an agent:

Transparency. You can print the exact query the agent ran and the exact URLs it read. When the answer is wrong, you can see which source misled it. With opaque built-in search you get a black box. This is the same observability argument we make in SERP APIs for AI agents.

Cost control. Search calls are a line item you can count and cap. With predictable per-call pricing you know a 30-search run costs a fraction of a cent in retrieval, separate from your token spend. We break the math down further in how to reduce LLM web-search grounding cost.

Freshness. You decide the query, the country, and how many results to pull, so you can target today’s news or a specific market. And there’s no proxy pool or headless browser to manage — the API handles access for you, so the agent just gets clean JSON back.
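To make the transparency and cost-control points concrete, here is a minimal sketch of a wrapper that logs every query and caps the number of calls. The names SearchLedger, SearchBudgetExceeded, and fake_search are illustrative assumptions, not part of any Serpent SDK; the wrapper works with any function that takes a query and returns a list of dicts with a url field.

```python
import time
from typing import Callable, Dict, List


class SearchBudgetExceeded(RuntimeError):
    """Raised when the agent tries to search past its cap."""


class SearchLedger:
    """Wrap a search function so each call is counted, capped, and logged."""

    def __init__(self, search_fn: Callable[..., List[Dict]], max_calls: int = 30):
        self.search_fn = search_fn
        self.max_calls = max_calls
        self.log: List[Dict] = []   # one entry per successful search call

    def __call__(self, query: str, **kwargs) -> List[Dict]:
        if len(self.log) >= self.max_calls:
            raise SearchBudgetExceeded(f"cap of {self.max_calls} searches reached")
        started = time.time()
        hits = self.search_fn(query, **kwargs)
        self.log.append({
            "query": query,
            "urls": [h.get("url") for h in hits],
            "seconds": round(time.time() - started, 3),
        })
        return hits


# Hypothetical stand-in so the sketch runs without a network call.
def fake_search(query: str, **kwargs) -> List[Dict]:
    url = "https://example.com/" + query.replace(" ", "-")
    return [{"title": "Example", "url": url, "snippet": "..."}]


search = SearchLedger(fake_search, max_calls=2)
search("heat pump rebates")
search("heat pump cost")
for entry in search.log:
    print(entry["query"], "->", entry["urls"])
```

In the agent you would wrap the real search helper once and pass the wrapped callable around. After a run, the log holds the exact queries and URLs the agent touched, and the cap turns a runaway loop into an explicit error.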

Setup: keys, install, and the search helper

You need two things: an LLM (your model and key) and a search key. Grab a free Serpent key at the dashboard. Signup includes 10 free Google searches, enough for a first small test of the loop.

Install the one dependency we actually need:

pip install requests

Now the search helper. This is the only piece that talks to the web. It hits GET /api/search with the X-API-Key header and returns clean organic results.

import requests

SERPENT_KEY = "sk_live_your_key"   # your Serpent key
SEARCH_URL  = "https://api.apiserpent.com/api/search"

def web_search(query, num=10, country="us"):
    """Fire one search at Serpent and return [{title, url, snippet}]."""
    resp = requests.get(
        SEARCH_URL,
        headers={"X-API-Key": SERPENT_KEY},
        params={"q": query, "engine": "google", "num": num, "country": country},
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()
    organic = data.get("results", {}).get("organic", [])
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("snippet")}
        for r in organic
    ]

One call, up to 100 results, and the page depth never changes the price. If you want broader recall per sub-question, raise num to 20 or 30. Full parameters are in the API docs.

Step 1 — The planner (decompose into sub-questions)

The planner is a single LLM call that turns the user’s question into a short list of sub-questions. That list is the research plan.

We keep the LLM call generic so you can plug in any provider. Replace llm_json() with your own SDK call — the only requirement is that it returns the model’s text, and we ask the model to output JSON.

import json

def llm_json(prompt):
    """
    Call YOUR model and return its text reply.
    Swap this body for your provider's SDK (Claude, GPT, Gemini...).
    The prompt asks for JSON, so just return the raw string.
    """
    raise NotImplementedError("Plug in your LLM call here")

def plan(question, n=5):
    prompt = f"""You are a research planner. Break the user's question into
{n} focused, non-overlapping sub-questions that, answered together, fully
cover it. Return ONLY a JSON array of strings.

Question: {question}"""
    raw = llm_json(prompt)
    return json.loads(raw)

For the question “Is a heat pump worth it for a cold-climate home in 2026?” a good planner returns something like:

[
  "How well do cold-climate heat pumps perform below freezing in 2026?",
  "What is the upfront install cost of a cold-climate heat pump?",
  "What rebates or tax credits are available for heat pumps in 2026?",
  "How do heat pump running costs compare to gas furnaces?",
  "What are common reliability complaints about cold-climate heat pumps?"
]

Notice the planner has already decided what to investigate: performance, upfront cost, incentives, running costs, and reliability. That decomposition is the step the big tools perform implicitly.
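One practical wrinkle: models often wrap JSON in a Markdown code fence or add a sentence of chatter, which makes a bare json.loads() call fail. The sketch below is one tolerant way to parse the planner's reply. The helper name parse_json_list is an assumption for illustration, not something from a library.

```python
import json
import re
from typing import List


def parse_json_list(raw: str) -> List[str]:
    """Extract a JSON array of strings from a model reply.

    Handles a bare array, an array inside a Markdown code fence, and an
    array preceded or followed by prose. Raises ValueError otherwise.
    """
    # Grab the span from the first "[" to the last "]", which skips
    # code fences and surrounding chatter.
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if not match:
        raise ValueError("no JSON array found in model reply")
    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # catch ValueError for both failure modes.
    items = json.loads(match.group(0))
    # Keep only non-empty strings, trimmed of stray whitespace.
    return [s.strip() for s in items if isinstance(s, str) and s.strip()]


print(parse_json_list('Sure, here is the plan:\n["q1", " q2 "]'))
```

With this in place, plan() and reflect() could return parse_json_list(llm_json(prompt)) and catch ValueError to retry the prompt once before giving up.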

Step 2 — The retrieval loop (fan-out searches)

The retrieval step fires one search per sub-question and collects every source into a numbered list. Numbering now makes citations trivial later.

def gather(subquestions, per_q=8):
    """Search each sub-question, return a numbered source list."""
    sources = []
    for sq in subquestions:
        for hit in web_search(sq, num=per_q):
            if not hit.get("url"):
                continue
            sources.append({
                "n": len(sources) + 1,
                "title": hit["title"],
                "url": hit["url"],
                "snippet": hit["snippet"],
                "subq": sq,
            })
    return sources

This is the “fan-out” the deep-research papers talk about: each sub-question becomes its own query, so you cover the topic widely instead of betting everything on one search. If you want to run the searches in parallel, wrap web_search in a ThreadPoolExecutor — the API is happy with concurrent calls.
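Here is a rough sketch of that parallel variant. It assumes a search function that takes one query string and returns a list of hit dicts. The name gather_parallel and the max_workers default are illustrative choices, so check your plan's rate limits before raising the worker count.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def gather_parallel(subquestions: List[str],
                    search_fn: Callable[[str], List[Dict]],
                    max_workers: int = 5) -> List[Dict]:
    """Run one search per sub-question concurrently, then number the hits."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so numbering stays stable
        # from run to run. An exception in any search is re-raised here.
        batches = list(pool.map(search_fn, subquestions))

    sources: List[Dict] = []
    for sq, hits in zip(subquestions, batches):
        for hit in hits:
            if not hit.get("url"):
                continue   # skip results without a usable URL
            sources.append({
                "n": len(sources) + 1,
                "title": hit.get("title"),
                "url": hit["url"],
                "snippet": hit.get("snippet"),
                "subq": sq,
            })
    return sources
```

To use it with the helper above, pass a small adapter such as gather_parallel(subqs, lambda q: web_search(q, num=8)). Threads suit this job because each search spends nearly all its time waiting on the network.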

Want to understand the fan-out idea more deeply? It’s the same technique Google now uses inside AI Mode — see our breakdown of query fan-out.

Tip: Because Serpent returns results.organic with clean title, url, and snippet fields, you usually don’t even need to fetch the full pages to get a solid answer — the snippets alone give the model strong grounding. Fetch full text only for the few URLs that matter most.

Step 3 — Synthesis with inline citations

The synthesizer feeds the numbered sources to the LLM and asks for an answer with bracketed citations like [3] that map back to source numbers.

def llm_text(prompt):
    """Same idea as llm_json, but the answer is free-form prose."""
    raise NotImplementedError("Plug in your LLM call here")

def synthesize(question, sources):
    context = "\n".join(
        f"[{s['n']}] {s['title']} - {s['snippet']} ({s['url']})"
        for s in sources
    )
    prompt = f"""Answer the question using ONLY the sources below.
Add a bracketed citation like [n] after every claim, matching the
source numbers. If the sources don't cover something, say so.

QUESTION: {question}

SOURCES:
{context}"""
    return llm_text(prompt)

To turn the markers into real links, render each [n] as a link to the URL of source n and print a numbered reference list underneath, the same as the footnotes you see in Gemini and OpenAI reports.

import re

def linkify(answer, sources):
    by_n = {s["n"]: s["url"] for s in sources}
    def repl(m):
        n = int(m.group(1))
        return f'[{n}]({by_n[n]})' if n in by_n else m.group(0)
    body = re.sub(r"\[(\d+)\]", repl, answer)
    refs = "\n".join(f"{s['n']}. {s['title']} - {s['url']}" for s in sources)
    return body + "\n\n## Sources\n" + refs

Now every claim is traceable to a URL you can click. That auditability is the entire point — and it’s only possible because you hold the source list.
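Holding the source list also lets you check the model's citations mechanically. The sketch below flags markers that point at no source and sources that were never cited. The name audit_citations is an illustrative assumption, and the check only confirms that numbers line up, not that a source actually supports the claim.

```python
import re
from typing import Dict, List


def audit_citations(answer: str, sources: List[Dict]) -> Dict[str, List[int]]:
    """Report citation markers with no matching source, and uncited sources."""
    known = {s["n"] for s in sources}
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {
        "dangling": sorted(cited - known),   # [n] in the text, no such source
        "unused": sorted(known - cited),     # collected but never cited
    }


sources = [{"n": 1, "url": "https://example.com/a"},
           {"n": 2, "url": "https://example.com/b"}]
draft = "Heat pumps work below freezing [1]. Rebates exist [7]."
report = audit_citations(draft, sources)
print(report)   # {'dangling': [7], 'unused': [2]}
```

A non-empty dangling list is a cheap signal that the model invented a citation, which is a reasonable trigger for re-running synthesis.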

Step 4 — Reflect and iterate on gaps

The reflection step asks the model what the current answer still doesn’t cover, and turns those gaps into fresh sub-questions for another round.

def reflect(question, answer):
    prompt = f"""Given this draft answer to the question, list up to 3
specific gaps, missing angles, or unverified claims that another round
of web search should fill. Return ONLY a JSON array of search-ready
sub-questions. If the answer is already thorough, return [].

QUESTION: {question}

DRAFT ANSWER:
{answer}"""
    return json.loads(llm_json(prompt))

If reflect() returns an empty list, you’re done. If it returns new sub-questions, you search those, add them to your sources, and synthesize again. That empty-list check is your stopping rule — the same “coverage is sufficient” signal the production systems use, alongside a hard cap on rounds.

Putting it together and running it

Here’s the full orchestrator. It runs the plan, gathers sources, synthesizes, then reflects and iterates up to max_rounds times.

def deep_research(question, max_rounds=3):
    subqs = plan(question)
    sources = gather(subqs)
    answer = synthesize(question, sources)

    for _ in range(max_rounds - 1):
        gaps = reflect(question, answer)
        if not gaps:
            break                      # coverage is good, stop
        sources += gather(gaps)        # search the gaps
        for i, s in enumerate(sources, start=1):
            s["n"] = i                 # gather() restarts at 1, so renumber
        answer = synthesize(question, sources)  # re-synthesize with more

    return linkify(answer, sources)

if __name__ == "__main__":
    q = "Is a cold-climate heat pump worth it for a home in 2026?"
    print(deep_research(q))

That’s a complete deep research agent. Plug your model into llm_json and llm_text, drop in your Serpent key, and run it. With the defaults above (5 planned sub-questions, up to 3 gap questions per reflection, max_rounds=3) a run fires roughly 5 to 11 searches across one to three rounds and hands you a cited report; raise those settings for wider coverage.
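As a rough check on how many calls one run can make, the arithmetic can be written down directly. This sketch assumes the planner returns exactly n sub-questions and the model respects the "up to 3" gap limit in the reflect prompt. The function name worst_case_calls is made up for illustration.

```python
def worst_case_calls(n_subq: int = 5, max_gaps: int = 3, max_rounds: int = 3) -> dict:
    """Upper bound on searches and LLM calls for one deep_research() run."""
    extra_rounds = max(max_rounds - 1, 0)
    # Round 1 searches every planned sub-question; each later round
    # searches at most max_gaps new gap questions.
    searches = n_subq + extra_rounds * max_gaps
    # plan + first synthesize, then reflect + re-synthesize per extra round
    llm_calls = 2 + extra_rounds * 2
    return {"searches": searches, "llm_calls": llm_calls}


print(worst_case_calls())                                    # {'searches': 11, 'llm_calls': 6}
print(worst_case_calls(n_subq=8, max_gaps=6, max_rounds=4))  # {'searches': 26, 'llm_calls': 8}
```

These are ceilings: if reflect() returns an empty list early, the run stops with fewer calls.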

From here you can layer on the production touches: run sub-questions concurrently, dedupe URLs across rounds, fetch full page text for your top few sources, and add a token budget. Anthropic’s post on their multi-agent research system is a great map of where to go next, including running sub-agents in parallel.
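Of those touches, deduping is the quickest to add. A minimal sketch, assuming each source is a dict with url and n keys as in gather(): the name dedupe_sources and the matching rule (host, path, and query string, ignoring scheme, fragment, host case, and a trailing slash) are illustrative choices, not a full URL canonicalizer.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def dedupe_sources(sources: List[Dict]) -> List[Dict]:
    """Drop repeat URLs (first occurrence wins) and renumber from 1."""
    seen = set()
    unique: List[Dict] = []
    for s in sources:
        parts = urlsplit(s["url"])
        # Heuristic key: same host, path, and query count as the same page.
        key = (parts.netloc.lower(), parts.path.rstrip("/"), parts.query)
        if key in seen:
            continue
        seen.add(key)
        # Copy the dict so the caller's list is left untouched.
        unique.append({**s, "n": len(unique) + 1})
    return unique
```

Call it on the merged list before each synthesize() so the numbers the model sees are the same ones linkify() resolves.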

Cost, freshness, and going further

Running this agent costs two things: your LLM tokens and your search calls — and the search half is cheap and predictable.

Serpent’s Google search is $0.60 per 10,000 searches pay-as-you-go, dropping to $0.06 / 10K after a single $100 deposit and $0.03 / 10K at the Scale tier. Crucially, page depth does not multiply the price — pulling 100 results for a sub-question costs the same as pulling 10. There’s no subscription, and signup gives you 10 free searches. See the pricing page for the full breakdown.

So a research run that fires 30 searches costs well under a cent in retrieval. The expensive half is almost always the LLM, which is exactly why a cheap, transparent search layer is the right call — you can search more aggressively without watching a meter spin.
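The arithmetic behind that claim is short enough to check yourself. The sketch below uses the three per-10K prices quoted above; the function name retrieval_cost_usd is an illustrative assumption.

```python
def retrieval_cost_usd(searches: int, price_per_10k: float = 0.60) -> float:
    """Search spend in US dollars at a flat per-call price."""
    return searches * price_per_10k / 10_000


tiers = [("pay-as-you-go", 0.60), ("after $100 deposit", 0.06), ("Scale", 0.03)]
for tier, price in tiers:
    print(f"{tier}: 30 searches = ${retrieval_cost_usd(30, price):.5f}")
```

At the pay-as-you-go rate, 30 searches come to $0.0018, or about a fifth of a cent.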

For multi-engine coverage you can point the same helper at Bing, Yahoo, or DuckDuckGo by changing the engine param, and combine sources from several engines per sub-question. If you’re curious about wiring this into an editor or assistant, our guide to a SERP MCP server for Claude and Cursor shows the agent-tool version of the same idea.
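One way to combine engines is to interleave their results by rank and drop repeat URLs. This is a sketch under assumptions: search_fn is a hypothetical two-argument callable (query, engine) that you would build on top of the search helper, and the exact engine strings should be taken from the API docs rather than from this example.

```python
from itertools import zip_longest
from typing import Callable, Dict, List, Sequence


def multi_engine_search(query: str,
                        engines: Sequence[str],
                        search_fn: Callable[[str, str], List[Dict]]) -> List[Dict]:
    """Query several engines and interleave their hits, dropping repeat URLs."""
    per_engine = [search_fn(query, engine) for engine in engines]
    merged: List[Dict] = []
    seen = set()
    # zip_longest walks rank 1 of every engine, then rank 2, and so on,
    # padding shorter result lists with None.
    for rank_row in zip_longest(*per_engine):
        for engine, hit in zip(engines, rank_row):
            if hit is None or not hit.get("url") or hit["url"] in seen:
                continue
            seen.add(hit["url"])
            merged.append({**hit, "engine": engine})   # tag where it came from
    return merged
```

Interleaving by rank keeps each engine's top hits near the front, so trimming the merged list later does not silently favor one engine.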

That’s the whole thing. The “deep research” everyone’s talking about is a loop you now own end to end — with every source visible and every dollar accounted for.

Power your research agent with Serpent

Serpent gives your agent clean Google, Bing, Yahoo and DuckDuckGo results in one API call — up to 100 results per call, every source URL visible. Start with 10 free Google searches, then pay as little as $0.03 per 10K. No subscription.


FAQ

What is a deep research agent?

A deep research agent is an LLM-driven loop that decomposes a question into sub-questions, runs web searches for each, reads the sources, synthesizes a cited answer, then reflects on gaps and searches again. It is the pattern behind Gemini Deep Research and OpenAI deep research.

How is a deep research agent different from RAG?

Single-shot RAG retrieves once and answers once. A deep research agent loops: it plans multiple sub-questions, searches repeatedly, checks for gaps, and runs more rounds until coverage is good. It trades speed for depth and breadth.

Why use a SERP API instead of a model’s built-in search?

A SERP API gives you the raw results, so you control which queries run, see every source URL, log everything, and pay a flat known price per call. Built-in search is opaque and harder to audit, and you cannot inspect or cache the retrieval step.

Which LLM should I use for the planner and synthesizer?

Any capable model works. You supply your own model and key, such as Claude, GPT, or Gemini. The agent only needs a chat completion call that follows instructions and can output JSON, so the code stays vendor-neutral.

How much does it cost to run a deep research agent?

Cost is your LLM tokens plus the search calls. Serpent’s Google search is $0.60 per 10,000 searches pay-as-you-go, dropping to $0.03 per 10,000 at scale, with page depth not multiplying the price. Even a run firing 15 to 40 searches costs a fraction of a cent in retrieval.

How do I add inline citations to the answer?

Number every source URL you collect, pass the numbered list into the synthesis prompt, and instruct the model to add bracketed markers like [1] next to each claim. Then render those numbers as links back to the source URLs.