A non-technical explanation of how this assistant searches the document corpus, retrieves relevant passages, and generates sourced answers.
This page explains, in plain terms, how the assistant on this site produces an answer when you type a question. The technique is called retrieval-augmented generation, often abbreviated RAG. The goal of this section is to demystify the process so that you can interpret answers critically and understand both the strengths and limits of the approach.
A traditional search engine returns a list of links: you click one, you read it, you decide if it answered your question. A traditional chat assistant generates an answer from its training data: you read the answer, but you cannot easily check what document it came from, and the training data is fixed at some past date.
Retrieval-augmented generation combines the two. When you ask a question, the system:

1. converts your question into a numeric representation of its meaning;
2. searches an index of the document corpus for the passages most similar to that representation;
3. re-ranks those candidate passages by how well they actually answer the question; and
4. passes the best passages to a language model, which writes an answer that cites them.

Each of these steps is described below.
The crucial property is that the answer is grounded in specific source documents that you
can open and verify yourself. Every answer on this site includes one or more EFTA document
identifiers that link back to the original PDF on justice.gov.
Your question is a string of words. To match it against documents, the system needs to
convert the question into a numeric representation called an embedding — a list of numbers
(in our case, 1,536 of them) that captures the semantic meaning of the question. Two
questions that mean the same thing produce similar embeddings, even if they use different
words. We use OpenAI’s text-embedding-3-small model for this step.
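For readers who want to see what this step looks like in code, here is a minimal sketch using the standard OpenAI Python client. The question string is a placeholder; only the model name comes from the description above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "your question, as typed into the search box"  # placeholder

# text-embedding-3-small returns a 1,536-dimensional vector for each input string.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=question,
)
question_embedding = response.data[0].embedding  # a list of 1,536 floats
```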
Each chunk of text in the corpus has been pre-embedded the same way. The collection of embeddings lives in a vector database — in our case, Pinecone — which is optimized to find, among the 2.2 million chunks indexed, the ones whose embeddings are most similar to the question embedding. The result is a list of candidate passages, typically the top ~50.
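Conceptually, the retrieval query looks something like the sketch below, which continues from the embedding example. The index name and the metadata field holding the chunk text are assumptions made for illustration; the real configuration may differ.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")      # credential is a placeholder
index = pc.Index("efta-chunks")   # index name is an assumption

# Ask Pinecone for the ~50 stored chunks whose embeddings are closest
# to the question embedding computed in the previous step.
results = index.query(
    vector=question_embedding,
    top_k=50,
    include_metadata=True,        # metadata is assumed to carry the chunk text
)
candidates = [match.metadata["text"] for match in results.matches]
```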
Vector similarity is fast but imperfect — sometimes a chunk is similar in meaning but not the
best answer to the actual question. A second model, the Cohere reranker (rerank-v3.5),
takes the top candidates and re-orders them based on a more careful comparison between the
question and each candidate. The top ~10 reranked chunks become the context for the answer.
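A sketch of the reranking step with the Cohere Python SDK, continuing from the retrieval sketch above; only the model name is taken from the description:

```python
import cohere

co = cohere.ClientV2(api_key="...")  # credential is a placeholder

# Re-score each candidate against the question and keep the best ten.
rerank = co.rerank(
    model="rerank-v3.5",
    query=question,
    documents=candidates,
    top_n=10,
)
top_chunks = [candidates[result.index] for result in rerank.results]
```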
Finally, the question and the top reranked chunks are passed as input to a large language
model — Llama 3.3 70B, hosted on Groq. The model is instructed to answer the question using
only the provided chunks, and to cite each chunk by its EFTA identifier. After the answer
is generated, a small post-processing step replaces each EFTA identifier with a clickable
link to the original PDF.
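And a sketch of the generation step using the Groq Python client, continuing from the sketches above. The exact model identifier and prompt wording are assumptions, and the post-processing that turns EFTA identifiers into links is only noted in a comment.

```python
from groq import Groq

groq_client = Groq()  # reads GROQ_API_KEY from the environment

context = "\n\n".join(top_chunks)

completion = groq_client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # Groq-hosted Llama 3.3 70B (assumed model id)
    messages=[
        {
            "role": "system",
            "content": "Answer using only the excerpts provided. "
                       "Cite each excerpt you rely on by its EFTA identifier.",
        },
        {
            "role": "user",
            "content": f"Excerpts:\n{context}\n\nQuestion: {question}",
        },
    ],
)
answer = completion.choices[0].message.content
# A separate post-processing step then replaces each EFTA identifier in `answer`
# with a link to the corresponding PDF on justice.gov.
```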
For a frank discussion of failure modes and known limitations, see the Limitations page.