Editorial standards

Who runs this site, how content is written and reviewed, and how to report errors. Last reviewed: May 2026.

Who maintains this archive

Epstein Files is maintained by Juan, an independent software engineer. The project began as a public-interest experiment in applying retrieval-augmented generation (RAG) to a body of court records that, in its raw form, is too large for a single person to browse — 2.2 million indexed passages drawn from DOJ filings, depositions, exhibits, and congressional materials related to the Jeffrey Epstein and Ghislaine Maxwell cases.

I am not a journalist or a lawyer. I am the sole author and reviewer of the editorial content on this site (topic guides, methodology pages, glossary, FAQ, this page). My background is in software, and the tool exists because the underlying records are already public — anyone can read them on justice.gov — but few people have the time to do so. A searchable interface lowers the cost of looking.

This is a non-commercial project. There is no investor, no sponsoring publication, and no editorial board outside of me. Hosting, models, and the vector database are paid for out of pocket. If the site begins running advertising through Google AdSense, that will be disclosed here, and ad revenue will only be used to offset infrastructure costs. There is no paid content, no affiliate placement, and no relationship — paid or otherwise — with any person or entity named in the underlying documents.

What this site publishes

The site is built around two kinds of content, and the distinction matters:

  1. Editorial pages — the topic guides at /topics, the methodology series at /methodology, the glossary, the resources list, the FAQ, and this page. These are written by a human (me), reviewed by a human (me), and revised when I find errors. They cite the same DOJ records the assistant retrieves, but they are not generated text — they are summaries written to give a reader unfamiliar with the case enough context to ask useful questions of the chat.
  2. Chat answers — text the assistant produces in response to a user question. These are generated by a large language model (Groq-hosted Llama 3.3 70B at the time of writing) from a small set of document passages retrieved by vector similarity. Every claim is intended to be supported by a citation in the form [EFTA00000000] linking back to the original PDF. Chat answers are not human-reviewed before they are shown to you; they should be read as a research tool, not as edited prose.

What this site does NOT publish

  • Private or leaked material. Every document in the corpus is already part of the public record — released by the DOJ, the Southern District of New York, the House Committee on Oversight and Government Reform, or unsealed in civil litigation. Nothing on this site is hacked, leaked, or obtained outside the public record.
  • Original allegations. The site reports what the documents say. It does not allege new wrongdoing, develop independent factual claims, or speculate about matters that are not present in the documents. Allegations made in pleadings are described as allegations; findings of fact are described as findings.
  • Insinuations against named individuals. Many people are mentioned in the underlying documents — in contact directories, flight manifests, witness lists, and other materials — who are not accused of any crime. The mere appearance of a name in a document is not evidence of wrongdoing, and the assistant is instructed to surface what the documents say without endorsing any inference. Users should do the same.
  • Sponsored or paid content. No third party has paid for placement on this site, and no editorial content is influenced by advertising. The Google AdSense ad slots (if and when active) are filled by an automated network with no editorial input from me.

How chat answers are produced

When you ask a question, the system does the following:

  1. Embeds your question with OpenAI’s text-embedding-3-small model.
  2. Retrieves the passages most similar to the question from a Pinecone vector index of the corpus.
  3. Re-ranks those passages with a Cohere reranker for relevance.
  4. Sends the question plus the retrieved passages to a Groq-hosted Llama 3.3 70B model with an instruction to answer only from the passages and to cite the originating documents.
  5. Post-processes the model’s output to convert each [EFTA…] identifier into a clickable link to justice.gov.
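The five steps above can be sketched roughly as follows. This is an illustrative outline, not the site's actual code: the stage functions (`embed`, `search`, `rerank`, `generate`) are hypothetical callables standing in for the OpenAI, Pinecone, Cohere, and Groq calls, the identifier pattern assumes "EFTA" followed by eight digits as in [EFTA00000000], and `URL_TEMPLATE` is a placeholder because the real justice.gov path layout is not stated on this page.

```python
import re
from dataclasses import dataclass
from html import escape
from typing import Callable, List, Sequence


@dataclass
class Passage:
    doc_id: str  # e.g. "EFTA00000000" (assumed identifier format)
    text: str


# Assumption: citations look like [EFTA########] with exactly eight digits.
CITATION_RE = re.compile(r"\[(EFTA\d{8})\]")

# Hypothetical placeholder; the real link target is not given on this page.
URL_TEMPLATE = "https://example.org/documents/{doc_id}.pdf"


def linkify_citations(answer: str, url_template: str = URL_TEMPLATE) -> str:
    """Step 5: turn each [EFTA########] marker into an HTML link."""

    def to_link(match: re.Match) -> str:
        doc_id = match.group(1)
        href = escape(url_template.format(doc_id=doc_id), quote=True)
        return f'<a href="{href}">[{doc_id}]</a>'

    return CITATION_RE.sub(to_link, answer)


def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], List[Passage]],
    rerank: Callable[[str, List[Passage]], List[Passage]],
    generate: Callable[[str, List[Passage]], str],
) -> str:
    """Run the five stages in order; each stage is injected by the caller."""
    vector = embed(question)                # 1. embed the question
    candidates = search(vector)             # 2. vector-similarity retrieval
    ranked = rerank(question, candidates)   # 3. rerank for relevance
    raw_answer = generate(question, ranked) # 4. answer only from the passages
    return linkify_citations(raw_answer)    # 5. make citations clickable
```

Passing the stages in as plain functions keeps the sketch independent of any vendor SDK, so each stage can be swapped or stubbed out when testing.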

This pipeline has known failure modes, documented honestly on the limitations page. The most important: language models can produce plausible-sounding citations that don’t actually exist, can misattribute claims to documents that don’t support them, and can be confidently wrong about specifics. Treat every assistant answer as a starting point, not a settled fact, and follow the citation to the original PDF before relying on anything important.
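One of these failure modes, a citation to a document that was never retrieved, can in principle be checked mechanically. The sketch below is a minimal illustration under assumptions: the function name is hypothetical, the identifier pattern assumes "EFTA" followed by eight digits, and nothing on this page states that the site runs such a check.

```python
import re

# Assumption: citations look like [EFTA########] with exactly eight digits.
CITATION_RE = re.compile(r"\[(EFTA\d{8})\]")


def unsupported_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return cited identifiers that were not among the retrieved passages.

    Identifiers are reported once each, in order of first appearance.
    """
    seen: set[str] = set()
    missing: list[str] = []
    for doc_id in CITATION_RE.findall(answer):
        if doc_id not in retrieved_ids and doc_id not in seen:
            seen.add(doc_id)
            missing.append(doc_id)
    return missing
```

A check like this only catches identifiers the retriever never returned. It cannot tell whether a real, retrieved document actually supports the claim attached to it, which is why following the citation to the original PDF remains necessary.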

Corrections and feedback

If you find an error — a misstated fact in a topic guide, a broken citation, an incorrect glossary definition, a documented historical detail that I’ve gotten wrong — please report it. I take corrections seriously and will fix issues promptly, attributing the correction in the page footer with a date.

For corrections, factual disputes, takedown requests, or general feedback, email [email protected]. Include a link to the page in question and a citation to the source you believe is correct. I read every email, though I cannot guarantee a response time.

Editorial decisions — what to summarize, how to characterize a legal proceeding, how to handle the appearance of an uncharged individual in a document — are mine. I am open to being persuaded by a well-sourced correction, but I do not take down accurate characterizations of public proceedings on request.

Independence and disclosure

I have no professional, financial, or personal relationship with any party named in the documents in the corpus — survivors, defendants, accused individuals, prosecutors, attorneys, judges, or institutions. I have not been retained, contacted, or compensated by any law firm, advocacy group, journalist, or political organization connected to this case. If that changes, I will disclose it on this page.

The dataset on which the assistant is built, Nikity/Epstein-Files on Hugging Face, is a third-party aggregation of public materials. I did not assemble the corpus; I indexed it. If the upstream dataset is corrected or expanded, this site will re-index in due course. See the dataset and corpus page for a longer discussion.

Code and reproducibility

This site is built on open infrastructure: Python (FastAPI, LangChain) for the backend, Astro and React for the frontend, Pinecone for the vector index, OpenAI for embeddings, Cohere for reranking, and Groq for inference. The application code is independent of the dataset itself. If you want to verify how chat answers are produced (what prompt the model sees, what passages are retrieved, how citations are injected), read the methodology pages, which describe the system without marketing.