Build Your Own Answer Engine (Open-Source Perplexity Alternative) on a SERP API
Perplexity made one idea famous: you ask a question, and instead of ten blue links you get a written answer with little numbered citations you can click.
People love it. But it is a closed product with a monthly bill, and you do not control the model, the sources, or the data.
The good news is that the whole pattern is simple enough to build yourself in an afternoon. The only hard part is the search layer, and a SERP API hands you that on a plate.
This guide shows you the full architecture and gives you a complete, copy-pasteable backend in both Python and Node. You bring your own LLM. We will handle the search.
TL;DR: An answer engine is four steps — search the live web, build context from the top results, synthesize a short answer with your LLM, and stream it back with numbered citations. Use a SERP API for the search step so you get fresh, ranked, real sources without running a scraper. Grounding costs as little as $0.03 per 10K searches; you choose and pay for the LLM. This is the one-shot, user-facing cousin of a deep research agent.
The open-source answer-engine landscape
There is already a healthy crop of open-source Perplexity clones, and studying them tells you exactly what to build.
The well-known ones are Fireplexity (a Next.js engine from the Firecrawl team), Morphic (Next.js with a generative UI on the Vercel AI SDK), Perplexica (works with local models through Ollama), and llm-answer-engine (Next.js, Groq, and a search provider).
Look closely and you notice they all share the same skeleton. A question comes in, the app searches the web, it stuffs the best results into a prompt, an LLM writes the answer, and the UI shows citations.
You also notice they all wrestle with the same bottleneck: the search and grounding layer. Every one of them needs fresh, ranked, real sources, and that is the part that breaks when you try to scrape Google yourself. Solve grounding cleanly and the rest is glue code.
The architecture in four steps
An answer engine is a short, mostly linear pipeline — not a complicated agent loop.
Here is the whole flow, end to end:
| Step | What happens | Tool |
|---|---|---|
| 1. Search | Turn the user question into a query and fetch fresh, ranked sources from the live web | SERP API (/api/search) |
| 2. Build context | Take the top organic results — title, URL, snippet — and assemble a numbered source list | Your backend |
| 3. Synthesize | Send the question + numbered sources to your LLM with a tight prompt: answer briefly, cite with [n] | Your chosen LLM |
| 4. Stream | Stream the answer token-by-token; render [n] markers as links back to the source URLs | SSE + your frontend |
That is it. No vector store to maintain, no embedding pipeline, no nightly re-index job. The freshness comes from the search call, which runs anew on every question.
Step 1 is where most home-grown projects stall. You can spin up a headless browser and a proxy pool, but Google fights back with blocks and CAPTCHAs, and your engine breaks at 3 a.m. when nobody is watching. A SERP API removes that whole class of problem: there is no proxy pool or headless browser to manage, because the API handles access for you.
Why SERP grounding beats a static vector DB
For an answer engine, live search grounding is almost always the right choice over a static vector database.
The reason is freshness. A vector DB serves your documents, which means someone has to keep loading and re-indexing them. The moment the world changes — a price, a release date, a news event — your index is stale, and the model confidently gives users yesterday's news.
As one practitioner guide puts it, dynamic web data is “fresh and comprehensive, essential for fast-changing contexts such as regulations, market news, or product updates,” while static knowledge “is at risk of becoming stale.” That trade-off is the whole game for a public-facing answer engine.
| Concern | Static vector DB | SERP API grounding |
|---|---|---|
| Freshness | As old as your last re-index | Live, every query |
| Setup work | Embed, chunk, store, sync | One HTTP call |
| Source ranking | You build relevance yourself | Search engine already ranked them |
| Coverage | Only what you loaded | The whole indexed web |
| Time-sensitive queries | Weak | Strong |
This is the same insight behind real-time-search RAG: for anything that changes over time, retrieval from a live index beats retrieval from a frozen one. A vector DB still has its place for private internal docs — many teams run a hybrid — but the public answer engine you are building wants the open web.
Build the backend (Python and Node)
The backend is one endpoint: take a query, call the SERP API, build a grounded prompt, call your LLM, and return { answer, citations[] }.
First, the search call. You hit GET /api/search with your key in the X-API-Key header. The response has a results.organic array, each item carrying position, title, url, and snippet — exactly the fields you want for grounding.
Here is a complete Python (FastAPI) backend. It is provider-neutral on purpose: swap the call_llm body for whichever model you prefer.
import os, requests
from fastapi import FastAPI
from pydantic import BaseModel

SERPENT_KEY = os.environ["SERPENT_API_KEY"]
app = FastAPI()

class Ask(BaseModel):
    q: str

def search(query: str, n: int = 8):
    r = requests.get(
        "https://api.apiserpent.com/api/search",
        headers={"X-API-Key": SERPENT_KEY},
        params={"q": query, "engine": "google", "country": "us", "num": n},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["results"]["organic"][:n]

def build_prompt(query: str, sources: list) -> str:
    block = "\n".join(
        f"[{i+1}] {s['title']} ({s['url']})\n{s.get('snippet','')}"
        for i, s in enumerate(sources)
    )
    return (
        "Answer the question using ONLY the sources below.\n"
        "Be concise (3-5 sentences). After each claim add a citation "
        "like [1] or [2] pointing to the source you used.\n\n"
        f"SOURCES:\n{block}\n\nQUESTION: {query}\nANSWER:"
    )

def call_llm(prompt: str) -> str:
    # Plug in your own LLM here (Claude, GPT, Gemini, a local model...).
    # Return the model's text answer as a string.
    ...

@app.post("/ask")
def ask(body: Ask):
    sources = search(body.q)
    answer = call_llm(build_prompt(body.q, sources))
    citations = [
        {"n": i + 1, "title": s["title"], "url": s["url"]}
        for i, s in enumerate(sources)
    ]
    return {"answer": answer, "citations": citations}
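One optional refinement to the /ask handler above: a model will sometimes cite a number that is not in the source list, or ignore half the sources it was given. The sketch below is an assumption on our part, not something the API does for you. The helper name used_citations is hypothetical. It scans the answer for [n] markers, returns only the citations that were actually used, and reports any dangling numbers so you can log or strip them.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def used_citations(answer: str, citations: list[dict]) -> tuple[list[dict], list[int]]:
    """Return (citations referenced in the answer, cited numbers with no source)."""
    by_n = {c["n"]: c for c in citations}
    seen: list[int] = []
    for match in CITATION.finditer(answer):
        n = int(match.group(1))
        if n not in seen:  # keep first-appearance order, drop repeats
            seen.append(n)
    used = [by_n[n] for n in seen if n in by_n]
    dangling = [n for n in seen if n not in by_n]
    return used, dangling

if __name__ == "__main__":
    cites = [
        {"n": 1, "title": "A", "url": "https://a.example"},
        {"n": 2, "title": "B", "url": "https://b.example"},
    ]
    used, dangling = used_citations("X holds [2]. So does Y [2][5].", cites)
    print([c["n"] for c in used], dangling)
```

In the handler you could return used in place of the full citations list, which keeps the rendered source list in step with what the answer actually cites.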
Prefer JavaScript? Here is the same thing as a tiny Node / Express endpoint, again leaving the LLM call for you to fill in.
import express from "express";

const app = express();
app.use(express.json());
const KEY = process.env.SERPENT_API_KEY;

async function search(query, n = 8) {
  const url = new URL("https://api.apiserpent.com/api/search");
  url.search = new URLSearchParams({
    q: query, engine: "google", country: "us", num: String(n),
  });
  const r = await fetch(url, { headers: { "X-API-Key": KEY } });
  if (!r.ok) throw new Error(`search failed: ${r.status}`);
  const data = await r.json();
  return data.results.organic.slice(0, n);
}

function buildPrompt(query, sources) {
  const block = sources
    .map((s, i) => `[${i + 1}] ${s.title} (${s.url})\n${s.snippet || ""}`)
    .join("\n");
  return `Answer using ONLY these sources. Be concise (3-5 sentences). ` +
    `Cite each claim with [n].\n\nSOURCES:\n${block}\n\nQUESTION: ${query}\nANSWER:`;
}

async function callLLM(prompt) {
  // Plug in your own LLM here. Return the answer text.
}

app.post("/ask", async (req, res) => {
  try {
    const sources = await search(req.body.q);
    const answer = await callLLM(buildPrompt(req.body.q, sources));
    const citations = sources.map((s, i) => ({
      n: i + 1, title: s.title, url: s.url,
    }));
    res.json({ answer, citations });
  } catch (err) {
    // Express 4 does not forward rejections from async handlers on its own,
    // so answer with an error instead of leaving the request hanging.
    res.status(502).json({ error: err.message });
  }
});

app.listen(3000);
Notice what is not here: no proxy config, no browser automation, no retry logic for blocked pages. The search step is one clean request because the API absorbs the messy parts. (For the deeper grounding playbook, see our guide on SERP APIs for AI agents and LLMs.)
Grab a key first: the code above needs a SERPENT_API_KEY. New accounts get 10 free Google searches — enough to build and test your engine before you spend a cent. Try queries live in the playground to see the exact JSON shape.
Streaming the answer with citations
To feel like Perplexity, the answer must appear as it is written — so stream it.
Most LLM SDKs can stream tokens as they are generated. You forward those tokens to the browser over Server-Sent Events (SSE). Here is the FastAPI version, streaming the synthesis step; stream_llm stands for whatever generator wraps your model's streaming call.
import json
from fastapi.responses import StreamingResponse

@app.get("/ask/stream")
def ask_stream(q: str):
    sources = search(q)
    prompt = build_prompt(q, sources)

    def gen():
        # 1) send the sources up front so the UI can render the list
        yield f"event: sources\ndata: {json.dumps(sources)}\n\n"
        # 2) stream the answer tokens as they arrive from your LLM
        for token in stream_llm(prompt):  # your generator
            yield f"event: token\ndata: {json.dumps(token)}\n\n"
        yield "event: done\ndata: {}\n\n"

    return StreamingResponse(gen(), media_type="text/event-stream")
The trick that makes citations work: send the numbered source list before the answer text. The frontend then has everything it needs to turn a [3] in the streaming text into a live link the instant it appears.
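To try the event ordering without a model attached, here is a small sketch of the same framing as plain functions. The names sse_event and frames are ours, and stream_llm here is a hypothetical stand-in that yields a canned answer. It also shows why each payload goes through json.dumps: an SSE data field ends at a newline, so a raw line break inside a token would split the frame, while the JSON-encoded form keeps it on one line.

```python
import json
from typing import Iterable, Iterator

def sse_event(event: str, data) -> str:
    """Format one Server-Sent Events frame with a JSON-encoded payload."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_llm(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a real model: yields a canned answer in chunks."""
    for chunk in ["Paris is the capital ", "of France [1].", "\nIt sits on the Seine [2]."]:
        yield chunk

def frames(sources: list, tokens: Iterable[str]) -> Iterator[str]:
    """Emit frames in the order the frontend relies on: sources, tokens, done."""
    yield sse_event("sources", sources)
    for token in tokens:
        yield sse_event("token", token)
    yield sse_event("done", {})

if __name__ == "__main__":
    demo_sources = [{"title": "Example", "url": "https://example.com"}]
    for frame in frames(demo_sources, stream_llm("capital of France?")):
        print(repr(frame))
```

Swapping the stand-in for your SDK's streaming call should leave frames untouched, which makes the ordering easy to unit test.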
Rendering numbered citations and follow-ups
On the frontend, you listen to the stream, accumulate the text, and replace [n] markers with links into your source list.
This small browser snippet does the whole job — receive sources, append tokens, and linkify citations on the fly.
const sources = {}; // n -> { title, url }
const out = document.getElementById("answer");
let text = "";

// Escape anything headed for innerHTML: model output and source titles
// are untrusted text.
const esc = (s) =>
  String(s).replace(/[&<>"']/g, (c) => `&#${c.charCodeAt(0)};`);

// `query` is the user's question string, set elsewhere on the page.
const es = new EventSource(`/ask/stream?q=${encodeURIComponent(query)}`);

es.addEventListener("sources", (e) => {
  JSON.parse(e.data).forEach((s, i) => (sources[i + 1] = s));
});

es.addEventListener("token", (e) => {
  text += JSON.parse(e.data);
  // turn [3] into a clickable superscript citation
  out.innerHTML = esc(text).replace(/\[(\d+)\]/g, (m, n) => {
    const s = sources[n];
    return s
      ? `<sup><a href="${esc(s.url)}" target="_blank" rel="noopener" title="${esc(s.title)}">[${n}]</a></sup>`
      : m;
  });
});

es.addEventListener("done", () => es.close());
Want follow-up questions like Perplexity shows at the bottom? Add one line to your synthesis prompt: “After the answer, suggest three short follow-up questions on separate lines prefixed with Q:.” Then split those off in the backend and render them as clickable chips that re-run the pipeline.
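The "split those off in the backend" step can be as small as the sketch below. It assumes the model follows the instruction and puts each follow-up on its own line starting with Q:. The function name split_followups is hypothetical, and a real model may need a more forgiving parser.

```python
def split_followups(raw: str, prefix: str = "Q:") -> tuple[str, list[str]]:
    """Separate the answer text from follow-up lines that start with `prefix`."""
    answer_lines: list[str] = []
    followups: list[str] = []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            question = stripped[len(prefix):].strip()
            if question:  # ignore a bare "Q:" with nothing after it
                followups.append(question)
        else:
            answer_lines.append(line)
    return "\n".join(answer_lines).strip(), followups

if __name__ == "__main__":
    raw = (
        "Paris is the capital of France [1].\n\n"
        "Q: How big is Paris?\n"
        "Q: What river runs through it?\n"
        "Q: When was it founded?"
    )
    answer, followups = split_followups(raw)
    print(answer)
    print(followups)
```

This works on the complete model output. With the streaming endpoint you would need to hold back lines that begin with Q: rather than forwarding them as answer tokens, for example by buffering each line until its first few characters are known.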
That is a fully working answer engine: a search box, a streamed answer, numbered citations, and follow-ups. Everything else — chat history, theming, a generative UI — is polish you can borrow from Morphic or build at your own pace.
Cost, engines, and design choices
The two running costs are search grounding and LLM tokens, and grounding is the cheaper, more predictable of the two.
On Serpent, Google search is flat per call: $0.60 per 10,000 searches pay-as-you-go, dropping to $0.06 per 10K with a single $100 deposit and $0.03 per 10K at $500. Crucially, page depth does not multiply the price — pulling a richer 100-result search costs the same as a 10-result one, so you can ground generously without watching a meter. There is no subscription, and the minimum deposit is $10. See the pricing page for the live numbers.
A few design choices worth making early:
- How many sources? Six to eight organic results is the sweet spot — enough coverage without drowning the model in tokens.
- Which engine? Google is the default and broadest. But Serpent also exposes Bing, Yahoo, and DuckDuckGo with the same response shape, so you can let users pick or blend engines. A multi-engine aggregator gives noticeably wider source coverage for tough queries.
- Localize. Pass country and language so a user in Berlin gets German-relevant sources.
- Trim tokens. Snippets, not full pages, keep the LLM bill low. Our notes on reducing grounding cost go deeper.
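If you do blend engines, the merge step might look like the sketch below. It makes no API calls: it assumes you have already fetched one organic list per engine, each item carrying a url field as in the response shape described earlier. The helper names normalize and blend are ours, and the URL normalization is deliberately crude.

```python
from itertools import zip_longest
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    """Crude dedup key: host without 'www.' plus path without a trailing slash.
    The scheme and query string are ignored, which is a simplification."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    return host + parts.path.rstrip("/")

def blend(*result_lists: list[dict], limit: int = 8) -> list[dict]:
    """Interleave ranked lists round-robin, skipping URLs already taken."""
    merged: list[dict] = []
    seen: set[str] = set()
    for tier in zip_longest(*result_lists):  # rank 1 of each list, then rank 2, ...
        for item in tier:
            if item is None:  # a shorter list has run out
                continue
            key = normalize(item["url"])
            if key in seen:
                continue
            seen.add(key)
            merged.append(item)
            if len(merged) == limit:
                return merged
    return merged

if __name__ == "__main__":
    google = [{"url": "https://www.example.com/a/", "title": "A"},
              {"url": "https://b.example/x", "title": "B"}]
    bing = [{"url": "http://example.com/a", "title": "A (dup)"},
            {"url": "https://c.example/", "title": "C"}]
    print([r["title"] for r in blend(google, bing)])
```

Round-robin interleaving keeps each engine's top hits near the top of the blended list, and the limit keeps the source count in the six-to-eight range suggested above.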
Because the grounding price is flat and the source count is yours to set, your per-answer cost is easy to forecast — one search call plus the tokens your chosen model charges for a short synthesis.
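As a back-of-the-envelope sketch of that forecast, the function below adds one search call to the token cost of one synthesis. The search price is the pay-as-you-go figure quoted above. The token counts and the LLM prices are placeholders we made up for illustration, so substitute your own model's rates.

```python
def cost_per_answer(
    search_price_per_10k: float,
    prompt_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of one answer: one search call plus the LLM tokens."""
    search = search_price_per_10k / 10_000
    llm = (
        prompt_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return search + llm

if __name__ == "__main__":
    # Placeholder assumptions: ~1,500 prompt tokens for eight snippets,
    # ~150 output tokens, and a hypothetical model at $1 / $4 per million tokens.
    total = cost_per_answer(0.60, 1_500, 150, 1.00, 4.00)
    print(f"${total:.5f} per answer")
```

Under those placeholder numbers the search call is a small fraction of the total, which matches the point above that grounding is the cheaper and more predictable of the two costs.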
Answer engine vs. deep research agent
An answer engine and a deep research agent look similar but solve different problems — do not confuse them.
An answer engine — what you just built — is one-shot and user-facing. One question in, one cited answer out, in a couple of seconds. It is optimized for speed and a clean UI, like a chat box on your site.
A deep research agent runs a loop. It plans sub-questions, searches several times, reads more deeply, reflects, and assembles a long report over many steps. It is optimized for thoroughness, not latency. If that is what you need, follow our companion tutorial on building a deep research agent in Python.
| | Answer engine (this post) | Deep research agent |
|---|---|---|
| Pattern | One-shot pipeline | Multi-step loop |
| Latency | Seconds | Tens of seconds to minutes |
| Search calls | One per question | Many per question |
| Output | Short answer + citations | Long structured report |
| Best for | User-facing chat / Q&A | Analyst-style deep dives |
Pick the answer engine when you want a fast, friendly product surface. Both sit on the exact same grounding layer, so once you have the search call working, you can graduate from one to the other whenever you like.
Ground your answer engine on the cheapest SERP API
Serpent returns up to 100 fresh, ranked organic results in a single call: the grounding layer your Perplexity alternative needs, with no proxy pool or headless browser to manage. Start with 10 free Google searches; paid usage starts at $0.03 per 10K, with no subscription.
FAQ
What is an answer engine?
An answer engine takes a natural-language question, searches the live web for relevant sources, feeds the top results to an LLM, and returns one concise written answer with numbered citations linking back to the source URLs. Perplexity is the best-known example.
How is this different from a deep research agent?
An answer engine is a one-shot, user-facing app: ask a question, get a cited answer in seconds. A deep research agent runs a multi-step loop, planning sub-questions and searching many times to produce a long report. This tutorial builds the fast one-shot version.
Why use a SERP API instead of a vector database?
A vector database serves your own static documents and goes stale fast. A SERP API returns fresh, ranked sources from the live web every time, so the engine answers time-sensitive questions like prices, news, and releases correctly without any re-indexing work.
How much does it cost to run an answer engine?
Search grounding on Serpent is flat per call: $0.60 per 10,000 Google searches pay-as-you-go, dropping to $0.06 and $0.03 per 10,000 with deposits. Page depth does not multiply the price. Your other cost is the LLM tokens you choose to spend on synthesis.
Which LLM should I use to synthesize the answer?
Any model you like. The architecture is provider-neutral: you pass the search context plus the question to your chosen LLM, such as Claude, GPT, Gemini, or a local model. Pick based on cost, latency, and quality for your audience.
Can I add streaming and follow-up questions?
Yes. Stream the LLM output token-by-token over Server-Sent Events so the answer appears as it is written, then ask the model for two or three follow-up questions in the same call. The tutorial shows both patterns with copy-pasteable code.



