A non-technical explanation of how this assistant searches the document corpus, retrieves relevant passages, and generates sourced answers.
This page explains, in plain terms, how the assistant on this site produces an answer when you type a question. The technique is called retrieval-augmented generation, often abbreviated RAG. The goal of this section is to demystify the process so that you can interpret answers critically and understand both the strengths and limits of the approach.
A traditional search engine returns a list of links: you click one, you read it, you decide if it answered your question. A traditional chat assistant generates an answer from its training data: you read the answer, but you cannot easily check what document it came from, and the training data is fixed at some past date.
Retrieval-augmented generation combines the two. When you ask a question, the system:

1. converts your question into a numeric representation of its meaning;
2. searches an index of the document corpus for the passages most similar to that representation;
3. re-ranks those candidate passages by how well they actually answer the question; and
4. passes the best passages to a language model, which writes an answer that cites them.

Each of these steps is described below.
The crucial property is that the answer is grounded in specific source documents that you
can open and verify yourself. Every answer on this site includes one or more EFTA document
identifiers that link back to the original PDF on justice.gov.
Your question is a string of words. To match it against documents, the system needs to
convert the question into a numeric representation called an embedding — a list of numbers
(in our case, 1,536 of them) that captures the semantic meaning of the question. Two
questions that mean the same thing produce similar embeddings, even if they use different
words. We use OpenAI’s text-embedding-3-small model for this step.
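For readers who want to see what this step looks like in code, here is a minimal sketch using the standard OpenAI Python client. The question string is a placeholder; only the model name comes from the description above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "your question, as typed into the search box"  # placeholder

# text-embedding-3-small returns a 1,536-dimensional vector for each input string.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=question,
)
question_embedding = response.data[0].embedding  # a list of 1,536 floats
```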
Each chunk of text in the corpus has been pre-embedded the same way. The collection of embeddings lives in a vector database — in our case, Pinecone — which is optimized to find, among the 2.2 million chunks indexed, the ones whose embeddings are most similar to the question embedding. The result is a list of candidate passages, typically the top ~50.
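Conceptually, the retrieval query looks something like the sketch below, which continues from the embedding example. The index name and the metadata field holding the chunk text are assumptions made for illustration; the real configuration may differ.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")      # credential is a placeholder
index = pc.Index("efta-chunks")   # index name is an assumption

# Ask Pinecone for the ~50 stored chunks whose embeddings are closest
# to the question embedding computed in the previous step.
results = index.query(
    vector=question_embedding,
    top_k=50,
    include_metadata=True,        # metadata is assumed to carry the chunk text
)
candidates = [match.metadata["text"] for match in results.matches]
```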
Vector similarity is fast but imperfect — sometimes a chunk is similar in meaning but not the
best answer to the actual question. A second model, the Cohere reranker (rerank-v3.5),
takes the top candidates and re-orders them based on a more careful comparison between the
question and each candidate. The top ~10 reranked chunks become the context for the answer.
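A sketch of the reranking step with the Cohere Python SDK, continuing from the retrieval sketch above; only the model name is taken from the description:

```python
import cohere

co = cohere.ClientV2(api_key="...")  # credential is a placeholder

# Re-score each candidate against the question and keep the best ten.
rerank = co.rerank(
    model="rerank-v3.5",
    query=question,
    documents=candidates,
    top_n=10,
)
top_chunks = [candidates[result.index] for result in rerank.results]
```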
Finally, the question and the top reranked chunks are passed as input to a large language
model — Llama 3.3 70B, hosted on Groq. The model is instructed to answer the question using
only the provided chunks, and to cite each chunk by its EFTA identifier. After the answer
is generated, a small post-processing step replaces each EFTA identifier with a clickable
link to the original PDF.
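And a sketch of the generation step using the Groq Python client, continuing from the sketches above. The exact model identifier and prompt wording are assumptions, and the post-processing that turns EFTA identifiers into links is only noted in a comment.

```python
from groq import Groq

groq_client = Groq()  # reads GROQ_API_KEY from the environment

context = "\n\n".join(top_chunks)

completion = groq_client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # Groq-hosted Llama 3.3 70B (assumed model id)
    messages=[
        {
            "role": "system",
            "content": "Answer using only the excerpts provided. "
                       "Cite each excerpt you rely on by its EFTA identifier.",
        },
        {
            "role": "user",
            "content": f"Excerpts:\n{context}\n\nQuestion: {question}",
        },
    ],
)
answer = completion.choices[0].message.content
# A separate post-processing step then replaces each EFTA identifier in `answer`
# with a link to the corresponding PDF on justice.gov.
```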
For a frank discussion of failure modes and known limitations, see the Limitations page.