The Document Corpus

An overview of the dataset behind this research tool: where the documents come from, how they were processed, and what types of records are searchable.

A research tool is only as good as the corpus it indexes. This page describes what is in the dataset behind this assistant, where the materials came from, and, just as importantly, what is not in the dataset.

Source dataset

The corpus is derived from the Nikity/Epstein-Files dataset hosted on Hugging Face, which aggregates publicly released documents from the U.S. Department of Justice and related government sources. The dataset itself is a continuation of independent archival work by journalists, researchers, and FOIA requesters who have systematically downloaded and catalogued public releases over a period of years.

Each document is identified by an EFTA document identifier: the prefix EFTA followed by eight digits (e.g., EFTA00009654). This identifier corresponds to the official document number assigned by DOJ at the time of public release. Citations in the chat link directly to the original PDF on justice.gov when an online URL is associated with the document.
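
As a small illustration of the identifier format described above, the following Python sketch finds and validates EFTA identifiers in free text. The pattern is inferred only from the description on this page (the prefix EFTA plus eight digits), and the function names find_efta_ids and is_efta_id are hypothetical helpers, not part of this tool.

```python
import re

# Assumption: an identifier is the literal prefix "EFTA" followed by exactly
# eight ASCII digits, as described in the text. Word boundaries prevent a
# match inside a longer run of letters or digits.
EFTA_ID_IN_TEXT = re.compile(r"\bEFTA[0-9]{8}\b")
EFTA_ID_EXACT = re.compile(r"EFTA[0-9]{8}")


def find_efta_ids(text: str) -> list[str]:
    """Return every well-formed identifier in text, in order of appearance."""
    return EFTA_ID_IN_TEXT.findall(text)


def is_efta_id(value: str) -> bool:
    """Return True only if value is exactly one well-formed identifier."""
    return EFTA_ID_EXACT.fullmatch(value) is not None
```

A check like is_efta_id("EFTA00009654") can be useful before looking up a citation, since a mistyped identifier (for example, one with seven digits) would otherwise silently match nothing.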

What types of documents are included

The corpus spans several distinct document categories:

  • Court filings, including indictments, motions, briefs, and orders from the federal cases described in other topics on this site
  • Deposition transcripts from civil litigation that has been unsealed
  • Trial exhibits authenticated and admitted in the 2021 Maxwell criminal trial
  • Government reports, including the 2020 OPR report, the 2023 OIG report, and related inspector-general work
  • Congressional materials, including released document collections from House and Senate committees
  • Correspondence and email records entered as exhibits in civil or criminal proceedings
  • Flight logs and travel records described in the corresponding topic page
  • Financial records to the extent included in public exhibits
  • Settlement and compensation documents filed publicly in civil suits

What is not in the corpus

Equally important for honest research is what is not indexed:

  • Sealed documents. Anything that remains under seal by court order is not in the dataset. This includes some discovery materials in cases that have not been fully unsealed.
  • Grand jury materials. Federal Rule of Criminal Procedure 6(e) makes most grand jury proceedings confidential; these are not part of any public release.
  • Privileged communications. Attorney-client and other privileged materials, where protected, are not in the public record.
  • Investigative materials in active matters. If a federal investigation is ongoing, related investigative records are typically not yet public.
  • Materials from other jurisdictions. Litigation and inquiries brought by other governments, such as the U.S. Virgin Islands government’s separate civil suits or proceedings in other countries, may have their own document sets that are not fully covered.
  • Original media reporting. Newspaper articles, books, and journalistic investigations are not part of the corpus, although they may cite documents that are indexed. See the Resources page for media references.

Document processing

For technical readers, a brief note on processing:

  1. Documents are downloaded from the source dataset in their original format (mostly PDF).
  2. Text content is extracted using OCR where necessary and stored together with metadata (document ID, source URL, date if available).
  3. Text is split into semantic chunks suitable for vector embedding.
  4. Each chunk is embedded using OpenAI’s text-embedding-3-small model and stored in a Pinecone vector index.
  5. At query time, the user’s question is embedded, the index is searched for the most semantically similar chunks, results are reranked using Cohere’s reranker, and the top chunks are passed to a Llama 3.3 70B language model (via Groq) to generate a sourced answer.
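
The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration of the chunk, embed, search, rerank, and prompt flow, not the tool's actual code. Every component is a stand-in chosen so the example runs with the standard library alone: fixed-size word windows replace semantic chunking, term-count vectors replace the text-embedding-3-small embeddings, an in-memory list replaces the Pinecone index, a term-overlap score replaces Cohere's reranker, and the final prompt is only built, not sent to a language model. All function names and the placeholder documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (stand-in for semantic chunking)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap  # assumes size > overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy embedding: a sparse term-count vector, NOT a neural embedding."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs: dict[str, str]) -> list[dict]:
    """Chunk and embed each document (in-memory stand-in for a vector index)."""
    index = []
    for doc_id, text in docs.items():
        for chunk in chunk_words(text):
            index.append({"doc_id": doc_id, "text": chunk, "vector": embed(chunk)})
    return index


def search(index: list[dict], question: str, top_k: int = 5) -> list[dict]:
    """Return the top_k chunks most similar to the embedded question."""
    q = embed(question)
    return sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)[:top_k]


def rerank(question: str, hits: list[dict], top_n: int = 2) -> list[dict]:
    """Toy reranker: order hits by how many distinct query terms they contain."""
    q_terms = set(tokenize(question))
    return sorted(hits, key=lambda e: len(q_terms & set(e["vector"])), reverse=True)[:top_n]


def build_prompt(question: str, hits: list[dict]) -> str:
    """Assemble the sourced prompt a language model would receive."""
    sources = "\n".join(f"[{h['doc_id']}] {h['text']}" for h in hits)
    return (
        "Answer using only these sources and cite their IDs.\n"
        f"{sources}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    # Placeholder documents with made-up IDs, purely for demonstration.
    docs = {
        "DOC-A": "The budget memo lists travel expenses for the quarter.",
        "DOC-B": "The maintenance log records engine inspections.",
    }
    question = "Which memo lists travel expenses?"
    hits = rerank(question, search(build_index(docs), question))
    print(build_prompt(question, hits))
```

Keeping the document ID attached to every chunk is the design point worth noting: it is what lets the final answer cite its sources, whichever embedding model, index, or reranker sits behind each function.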

For a more detailed walk-through of the retrieval pipeline, see the Methodology section.

Limitations to keep in mind

  • OCR errors. Documents released as image-only PDFs (common for older materials) must be run through OCR, and misrecognized characters can cause a search to miss relevant passages.
  • Redactions. Many documents contain redactions in the original; the text of redacted passages is not recoverable from the dataset.
  • Coverage gaps. Although the corpus is large, it does not include every document ever released about this matter. New materials continue to be unsealed periodically.

For questions about specific documents, the assistant returns the EFTA identifiers and, where an online URL is available, links to the original PDFs so you can verify the underlying source directly.

Suggested research questions

Open the chat and ask any of these to explore the topic in the document corpus:

  • Where do the documents in this dataset come from?
  • What types of documents are in the indexed corpus?
  • How recent is the document collection?
  • What documents are NOT in the corpus and why?