What Is This Project?
Epstein Files is a free, open-source research tool that provides AI-powered access to publicly available documents related to the Jeffrey Epstein case. These documents were released by the U.S. Department of Justice, the U.S. House Oversight Committee, and the Southern District of New York through court proceedings and congressional actions.
The goal is to make these public records more accessible to journalists, researchers, and the general public by allowing natural-language searches instead of manually browsing thousands of PDF files.
Document Sources
The document corpus is sourced from the Nikity/Epstein-Files dataset on Hugging Face, which aggregates publicly released DOJ documents. The collection includes:
- Court filings and legal motions from civil and criminal proceedings
- Unsealed depositions and witness testimonies
- Communications and correspondence entered as exhibits
- Flight logs and travel records submitted as court evidence
- Government investigation documents released under FOIA
Every document is identified by its official DOJ document ID (format: EFTA followed by 8 digits) and links directly to the original PDF hosted on justice.gov.
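As a minimal sketch of the ID format described above, the following Python snippet finds identifiers of the form EFTA plus eight digits in free text. The function name and the sample string are illustrative assumptions and are not taken from the project's code.

```python
import re

# Assumed pattern: the literal prefix "EFTA" followed by exactly 8 digits,
# not embedded in a longer run of letters or digits.
EFTA_ID = re.compile(r"(?<![A-Za-z0-9])EFTA\d{8}(?![0-9])")


def find_document_ids(text: str) -> list[str]:
    """Return every EFTA-style document ID in text, in order of appearance."""
    return EFTA_ID.findall(text)


# Placeholder IDs for illustration only; they do not refer to real documents.
sample = "See EFTA00012345 and EFTA00067890; EFTA123 is too short to match."
print(find_document_ids(sample))  # ['EFTA00012345', 'EFTA00067890']
```

The lookarounds keep the pattern from matching inside longer tokens, so a nine-digit string is rejected rather than truncated to eight digits.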
How It Works
The tool uses a technique called Retrieval-Augmented Generation (RAG) to combine document search with AI-generated answers:
- Document processing: Over 2.2 million document segments were converted into numerical representations (embeddings) using OpenAI's text-embedding-3-small model and stored in a Pinecone vector database.
- Search: When you ask a question, it is converted into an embedding and compared against the stored document embeddings to find the most relevant passages.
- Reranking: The initial results are reordered using Cohere's rerank model so that the most relevant passages surface first.
- Response generation: The relevant document excerpts are passed to a large language model (Llama 3.3 70B via Groq) along with your question to generate a sourced, coherent answer.
- Citation linking: Document references in the response are automatically converted into clickable links to the original PDFs on justice.gov.
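The steps above can be illustrated with a small, self-contained Python sketch. It is not the project's implementation: an in-memory list and plain cosine similarity stand in for Pinecone and the OpenAI embeddings, the reranking and LLM calls are omitted, and every function name, toy vector, placeholder ID, and the URL template are assumptions made for illustration.

```python
import math
import re


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k (doc_id, text) pairs whose vectors are closest to the query."""
    ranked = sorted(
        index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True
    )
    return [(item["id"], item["text"]) for item in ranked[:k]]


def build_prompt(question, passages):
    """Assemble the retrieved excerpts and the question into one prompt string."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return f"Answer using only these excerpts:\n{context}\n\nQuestion: {question}"


def link_citations(answer, url_template):
    """Turn each EFTA ID into a Markdown link (url_template is hypothetical)."""
    def to_link(match):
        doc_id = match.group(0)
        return f"[{doc_id}]({url_template.format(doc_id=doc_id)})"
    return re.sub(r"EFTA\d{8}", to_link, answer)


# Toy index: placeholder IDs, made-up 3-dimensional vectors, dummy text.
index = [
    {"id": "EFTA00000001", "vector": [1.0, 0.0, 0.0], "text": "Toy passage A"},
    {"id": "EFTA00000002", "vector": [0.0, 1.0, 0.0], "text": "Toy passage B"},
    {"id": "EFTA00000003", "vector": [0.9, 0.1, 0.0], "text": "Toy passage C"},
]

passages = top_k([1.0, 0.0, 0.0], index, k=2)
prompt = build_prompt("What does passage A say?", passages)
linked = link_citations("See EFTA00000001.", "https://example.org/{doc_id}.pdf")
print(linked)  # See [EFTA00000001](https://example.org/EFTA00000001.pdf).
```

In a production pipeline the similarity search would be delegated to the vector database and the prompt would be sent to the language model; the sketch only shows how the pieces fit together.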
Technology Stack
- Frontend: React 19, HeroUI v3, Tailwind CSS v4
- Backend: Python, FastAPI, LangChain
- Vector Database: Pinecone (2.2M+ vectors)
- AI Models: Llama 3.3 70B, OpenAI Embeddings, Cohere Reranker
Limitations
- Responses are AI-generated and may contain inaccuracies, hallucinations, or misinterpretations of the source documents.
- The tool searches a specific dataset and may not include all publicly available documents related to the case.
- This is not a legal tool. Responses should not be treated as legal advice or used as evidence in any proceeding.
- Some documents in the source dataset are redacted or partially illegible, which may affect the quality of retrieved information.
Open Source
This project is fully open source. You can review the code, report issues, or contribute on GitHub.