An overview of the dataset behind this research tool: where the documents come from, how they were processed, and what types of records are searchable.
A research tool is only as good as the corpus it indexes. This page describes what is in the
dataset behind this assistant, where the materials came from, and — equally important — what
is not in the dataset.
Source dataset
The corpus is derived from the Nikity/Epstein-Files dataset hosted on Hugging Face, which
aggregates publicly released documents from the U.S. Department of Justice and related
government sources. The dataset itself is a continuation of independent archival work by
journalists, researchers, and FOIA requesters who have systematically downloaded and
catalogued public releases over a period of years.
Each document is identified by an EFTA document identifier — the literal prefix EFTA followed
by 8 digits (e.g., EFTA00009654). This identifier corresponds to the official document number
assigned by DOJ at the time of public release. Citations in the chat link directly to the
original PDF on justice.gov whenever a source URL is recorded for the document.
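As a concrete illustration of that identifier shape, a simple regular expression can validate it. This is a minimal sketch; the helper name and the strict-match behavior are illustrative assumptions, not part of the tool itself:

```python
import re

# EFTA identifier: the literal prefix "EFTA" followed by exactly 8 digits,
# e.g. "EFTA00009654".
EFTA_PATTERN = re.compile(r"^EFTA\d{8}$")

def is_efta_id(candidate: str) -> bool:
    """Return True if the string is a well-formed EFTA document identifier."""
    return bool(EFTA_PATTERN.fullmatch(candidate))

print(is_efta_id("EFTA00009654"))  # True
print(is_efta_id("EFTA1234"))      # False: only 4 digits after the prefix
```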
What types of documents are included
The corpus spans several distinct document categories:
- Court filings, including indictments, motions, briefs, and orders from the federal
cases described in other topics on this site
- Deposition transcripts from civil litigation that has been unsealed
- Trial exhibits authenticated and admitted in the 2021 Maxwell criminal trial
- Government reports, including the 2020 OPR report, the 2023 OIG report, and related
inspector-general work
- Congressional materials, including released document collections from House and Senate
committees
- Correspondence and email records entered as exhibits in civil or criminal proceedings
- Flight logs and travel records described in the corresponding topic page
- Financial records, to the extent they appear in public exhibits
- Settlement and compensation documents filed publicly in civil suits
What is not in the corpus
Equally important for honest research is what is not indexed:
- Sealed documents. Anything that remains under seal by court order is not in the
dataset. This includes some discovery materials in cases that have not been fully unsealed.
- Grand jury materials. Federal Rule of Criminal Procedure 6(e) makes most grand jury
proceedings confidential; these are not part of any public release.
- Privileged communications. Attorney-client and other privileged materials, where
protected, are not in the public record.
- Investigative materials in active matters. If a federal investigation is ongoing,
related investigative records are typically not yet public.
- Materials from other jurisdictions. Litigation and inquiries outside the federal releases
covered here (e.g., the U.S. Virgin Islands government's separate civil suits) may have their
own document sets that are not fully indexed.
- Original media reporting. Newspaper articles, books, and journalistic investigations
are not part of the corpus, although they may cite documents that are indexed. See the
Resources page for media references.
Document processing
For technical readers, a brief note on processing:
- Documents are downloaded from the source dataset in their original format (mostly PDF).
- Text content is extracted using OCR where necessary and stored together with metadata
(document ID, source URL, date if available).
- Text is split into semantic chunks suitable for vector embedding.
- Each chunk is embedded using OpenAI’s
text-embedding-3-small model and stored in a
Pinecone vector index.
- At query time, the user’s question is embedded, the index is searched for the most
semantically similar chunks, results are reranked using Cohere’s reranker, and the top
chunks are passed to a Llama 3.3 70B language model (via Groq) to generate a sourced
answer.
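The query-time steps above can be sketched in miniature. The snippet below is a toy stand-in rather than the production code: a small in-memory list of vectors replaces the Pinecone index, a plain similarity sort replaces Cohere's reranker, and the hard-coded query vector is a placeholder for an embedding produced by text-embedding-3-small:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "index" of (chunk text, embedding) pairs. In the real pipeline these
# embeddings come from OpenAI's text-embedding-3-small and live in Pinecone;
# the three-dimensional vectors here are purely illustrative.
INDEX = [
    ("Flight log excerpt, EFTA00009654", [0.9, 0.1, 0.0]),
    ("2020 OPR report summary",          [0.1, 0.8, 0.1]),
    ("Maxwell trial exhibit list",       [0.2, 0.2, 0.9]),
]

def retrieve(query_vector: list[float], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in INDEX]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:top_k]]

# A query whose (placeholder) embedding points toward flight-log content.
print(retrieve([1.0, 0.0, 0.1]))
```

In the real pipeline the retrieved chunks would then be reranked and passed, along with the question, to the Llama 3.3 70B model to generate the sourced answer.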
For a more detailed walk-through of the retrieval pipeline, see the Methodology section.
Limitations to keep in mind
- OCR errors. Documents released as image-only PDFs (which is common for older
materials) are subject to OCR errors that can affect search quality.
- Redactions. Many documents contain redactions in the original; the text of redacted
passages is not recoverable from the dataset.
- Coverage gaps. Although the corpus is large, it does not include every document ever
released about this matter. New materials continue to be unsealed periodically.
For questions about specific documents, the assistant will return the EFTA identifiers and
links to the original PDFs so you can verify the underlying source directly.
Suggested research questions
Open the chat and ask any of these to explore the topic in the document corpus:
- Where do the documents in this dataset come from?
- What types of documents are in the indexed corpus?
- How recent is the document collection?
- What documents are NOT in the corpus and why?