🦆🔍 Learn SQL by Exploring the Epstein Files

Mar 10, 2026 · 20 min read

📚 DuckDB for Cloud Security

This series explores how to leverage DuckDB for cloud security analysis, taking you from basic concepts to advanced security monitoring techniques. Learn to build your own CSPM solutions, analyze security tool outputs, and create powerful DIY security workflows.

Why This Dataset

SQL tutorials usually suck. You get a fake employees table with 12 rows, you learn SELECT * FROM employees WHERE salary > 50000, and you immediately forget it because you had zero reason to care about the results.

What if the dataset was 1.78 million emails from a convicted sex trafficker’s inbox - emails between Jeffrey Epstein and the powerful people who surrounded him - released to the public by Congress and the Department of Justice?

Now you have a reason to care about the results.

The Jmail Data API serves the entire Epstein Files Transparency Act (EFTA) archive as Parquet files. No API keys. No auth. No rate limits. Just static files on a CDN, ready to be queried. And DuckDB - the in-process analytics database I’ve been writing about in this series - can query those files directly over HTTP, including from WebAssembly in your browser.

Every query in this post runs live. The widget below isn’t a screenshot - it’s a full DuckDB engine compiled to WASM, executing SQL against real Parquet files from data.jmail.world. Your queries never leave your machine.

Your Workbench

Go ahead and hit “Run Query” on the pre-loaded starter. Everything below this point is meant to be run interactively.

The first time you run a query, DuckDB-WASM needs to download the Parquet file (the slim email dataset is ~41 MB). After that first pull, subsequent queries against the same file are fast because the data is cached in your browser’s memory.

Level 0: Look Before You Leap

Before writing any analysis, figure out what you’re working with. The DESCRIBE statement shows you the schema of a dataset without loading all the data:

DESCRIBE SELECT *
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet');

Paste that into the widget above. You’ll see columns like sender, subject, to_recipients, sent_at, epstein_is_sender, and more. This is your reconnaissance phase - same as any OSINT workflow. Understand the data before you start pulling threads.

The key columns to know:

  • sender - who sent the email
  • subject - the subject line
  • to_recipients - a JSON array of recipient email addresses
  • cc_recipients / bcc_recipients - same, for CC and BCC
  • sent_at - timestamp of when the email was sent
  • epstein_is_sender - boolean flag, true when Epstein sent the email
  • account_email - which of Epstein’s accounts (Gmail or Yahoo) the email belongs to
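With those columns in mind, a quick way to ground the schema in reality is to pull a handful of actual rows before aggregating anything. A minimal peek, using only the columns listed above:

```sql
-- Eyeball a few real rows before writing any analysis
SELECT sender, subject, sent_at, epstein_is_sender
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
LIMIT 5;
```

Five rows is enough to spot formatting quirks - how sender addresses are written, what sent_at values look like - that will matter when you start filtering and casting later.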

Level 1: SELECT, WHERE, ORDER BY

The foundation of SQL. Every query starts with what you want (SELECT), where it lives (FROM), and optionally what conditions filter it (WHERE).

Who sent the most emails?

SELECT sender, COUNT(*) as email_count
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY sender
ORDER BY email_count DESC
LIMIT 20;

COUNT(*) counts rows per group. GROUP BY sender creates one group per unique sender. ORDER BY email_count DESC sorts highest first. LIMIT 20 caps the output. These four clauses are the bread and butter of exploratory analysis.

Filter by subject keywords

SELECT sender, subject, sent_at
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE subject ILIKE '%legal%'
ORDER BY sent_at DESC
LIMIT 30;

ILIKE is case-insensitive pattern matching. The % wildcards match any characters before and after “legal”. This is your grep for structured data. Try swapping in '%flight%', '%island%', '%donation%', or any keyword that catches your eye.
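If one keyword isn't enough, ILIKE conditions combine with OR. A sketch that searches several of those suggested terms in a single pass:

```sql
-- Match any of several subject keywords at once
SELECT sender, subject, sent_at
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE subject ILIKE '%flight%'
   OR subject ILIKE '%island%'
   OR subject ILIKE '%donation%'
ORDER BY sent_at DESC
LIMIT 30;
```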

Emails from a specific sender

SELECT subject, sent_at, to_recipients
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE sender ILIKE '%groff%'
ORDER BY sent_at DESC
LIMIT 20;

Lesley Groff was Epstein’s executive assistant and one of the most prolific correspondents in the archive. Swap in other names as you discover them.

Level 2: Aggregation and GROUP BY

Aggregation is where SQL goes from “search engine” to “analysis tool.” Instead of looking at individual rows, you’re computing summaries.

Email volume over time

SELECT 
  YEAR(TRY_CAST(sent_at AS TIMESTAMP)) as year,
  MONTH(TRY_CAST(sent_at AS TIMESTAMP)) as month,
  COUNT(*) as emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY year, month
ORDER BY year, month;

This reveals the temporal shape of the archive. Where are the spikes? Where are the gaps? Are there periods where email volume drops off suspiciously?

Busiest days of the week

SELECT 
  DAYNAME(TRY_CAST(sent_at AS TIMESTAMP)) as day_of_week,
  DAYOFWEEK(TRY_CAST(sent_at AS TIMESTAMP)) as day_num,
  COUNT(*) as emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY day_of_week, day_num
ORDER BY day_num;

DAYNAME and DAYOFWEEK are DuckDB date functions that extract parts of a timestamp. Useful for behavioral profiling - does the communication pattern look like a normal business operation?
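The same idea extends to hour of day. DuckDB's HOUR() extracts the hour from a timestamp just like DAYNAME() extracts the day name, so a sketch of an hourly activity profile looks like:

```sql
-- When during the day was email sent? Odd hours are a profiling signal.
SELECT
  HOUR(TRY_CAST(sent_at AS TIMESTAMP)) AS hour_of_day,
  COUNT(*) AS emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY hour_of_day
ORDER BY hour_of_day;
```

One caveat: timestamps in the archive may not all share a timezone, so treat hour-of-day patterns as a rough signal rather than a precise one.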

Epstein’s send vs receive ratio

SELECT 
  epstein_is_sender,
  COUNT(*) as email_count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) as pct
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY epstein_is_sender;

That SUM(COUNT(*)) OVER () is a window function - it computes the total count across all groups without collapsing them. We use it here to calculate percentages. Window functions are one of the most powerful features in SQL and we’ll use them more later.

Level 3: JSON Unnesting and Network Analysis

The to_recipients column is a JSON array - something like ["person1@email.com", "person2@email.com"]. To analyze the communication network, we need to “unnest” this array so each recipient gets its own row.

Who did Epstein email the most?

SELECT 
  recipient,
  COUNT(*) as times_contacted
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
CROSS JOIN UNNEST(from_json(to_recipients, '["VARCHAR"]')) AS t(recipient)
WHERE epstein_is_sender = true
GROUP BY recipient
ORDER BY times_contacted DESC
LIMIT 20;

from_json(to_recipients, '["VARCHAR"]') parses the JSON string into a DuckDB list. unnest() explodes that list so each element becomes its own row. This is a pattern you’ll use constantly when working with semi-structured data in security logs, API responses, and OSINT dumps.
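The same pattern applies to the other recipient columns. Assuming cc_recipients is JSON-encoded the same way as to_recipients (the schema from Level 0 suggests it is), counting who got CC'd is a one-name swap:

```sql
-- Who was most often CC'd? Same unnest pattern, different column.
SELECT cc, COUNT(*) AS times_ccd
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
CROSS JOIN UNNEST(from_json(cc_recipients, '["VARCHAR"]')) AS t(cc)
GROUP BY cc
ORDER BY times_ccd DESC
LIMIT 20;
```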

Bidirectional communication pairs

Who had the most back-and-forth with Epstein?

WITH sent AS (
  SELECT 
    other_party,
    COUNT(*) as sent_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  CROSS JOIN UNNEST(from_json(to_recipients, '["VARCHAR"]')) AS t(other_party)
  WHERE epstein_is_sender = true
  GROUP BY other_party
),
received AS (
  SELECT 
    sender as other_party,
    COUNT(*) as received_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE epstein_is_sender = false
  GROUP BY sender
)
SELECT 
  COALESCE(s.other_party, r.other_party) as contact,
  COALESCE(s.sent_count, 0) as epstein_sent,
  COALESCE(r.received_count, 0) as they_sent,
  COALESCE(s.sent_count, 0) + COALESCE(r.received_count, 0) as total
FROM sent s
FULL OUTER JOIN received r ON s.other_party = r.other_party
ORDER BY total DESC
LIMIT 20;

Welcome to Common Table Expressions (CTEs) - the WITH clause. Think of each CTE as a named subquery you can reference later. Here we compute sent and received counts separately, then FULL OUTER JOIN them to get the complete picture. COALESCE handles NULLs when someone appears in only one direction.

This is the kind of network analysis that OSINT practitioners do manually with link charts. SQL does it in seconds.

Level 4: Joins Across Datasets

The Jmail Data API serves more than just emails. Let’s bring in other datasets.

Cross-reference emails with iMessage threads

WITH email_contacts AS (
  SELECT 
    sender as contact,
    COUNT(*) as email_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE epstein_is_sender = false
  GROUP BY sender
),
imessage_contacts AS (
  SELECT name, slug, message_count
  FROM read_parquet('https://data.jmail.world/v1/imessage_conversations.parquet')
)
SELECT 
  im.name,
  im.slug,
  im.message_count as imessages,
  COALESCE(ec.email_count, 0) as emails,
  im.message_count + COALESCE(ec.email_count, 0) as total_comms
FROM imessage_contacts im
LEFT JOIN email_contacts ec 
  ON LOWER(im.name) = LOWER(ec.contact)
ORDER BY total_comms DESC;

LEFT JOIN keeps all rows from the left table (iMessages) and attaches matching email data where it exists. Here we join on name, but depending on how a person appears in the email archive you may need to normalize further or join against slug instead.
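A minimal sketch of that normalization: lowercase, trim, and collapse runs of whitespace on both sides before comparing. The exact cleanup you need depends on how names actually appear in each file, so treat the regex here as a starting point:

```sql
WITH email_contacts AS (
  SELECT sender AS contact, COUNT(*) AS email_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE epstein_is_sender = false
  GROUP BY sender
)
SELECT im.name, im.message_count, COALESCE(ec.email_count, 0) AS emails
FROM read_parquet('https://data.jmail.world/v1/imessage_conversations.parquet') im
LEFT JOIN email_contacts ec
  -- normalize both join keys: lowercase, trim, squash internal whitespace
  ON TRIM(regexp_replace(LOWER(im.name), '\s+', ' ', 'g'))
   = TRIM(regexp_replace(LOWER(ec.contact), '\s+', ' ', 'g'))
ORDER BY im.message_count DESC;
```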

Explore the photo metadata

SELECT source, COUNT(*) as photo_count
FROM read_parquet('https://data.jmail.world/v1/photos.parquet')
GROUP BY source
ORDER BY photo_count DESC;

SELECT image_description, original_filename, source
FROM read_parquet('https://data.jmail.world/v1/photos.parquet')
WHERE image_description ILIKE '%office%'
LIMIT 20;

The image_description column contains AI-generated descriptions of each photo - another searchable surface for investigation.

Level 5: Window Functions

Window functions let you compute values across sets of rows related to the current row - without collapsing the result into groups. They’re the difference between intermediate and advanced SQL.

Running total of emails over time

SELECT 
  DATE_TRUNC('month', TRY_CAST(sent_at AS TIMESTAMP)) as month,
  COUNT(*) as monthly_count,
  SUM(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', TRY_CAST(sent_at AS TIMESTAMP))) as running_total
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY month
ORDER BY month;

SUM(...) OVER (ORDER BY ...) accumulates the count across rows ordered by month. This shows the growth of the archive over time.

Rank senders within each year

SELECT year, sender, email_count, rank
FROM (
  SELECT 
    YEAR(TRY_CAST(sent_at AS TIMESTAMP)) as year,
    sender,
    COUNT(*) as email_count,
    ROW_NUMBER() OVER (PARTITION BY YEAR(TRY_CAST(sent_at AS TIMESTAMP)) ORDER BY COUNT(*) DESC) as rank
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
  GROUP BY year, sender
)
WHERE rank <= 5
ORDER BY year, rank;

ROW_NUMBER() OVER (PARTITION BY year ORDER BY count DESC) assigns a rank within each year. PARTITION BY creates separate windows per year, and ORDER BY determines the ranking within each window. This reveals who the top correspondents were during different periods - useful for building a timeline of relationships.

Level 6: Full-Text Search (Advanced)

One important caveat: emails.parquet is not a drop-in “full body text” dataset. At the moment, the richer text content lives in separate sharded files under documents-full/, not directly in emails.parquet.

So if you want true keyword-in-context searching across document or email body text, use the documents-full/ shards described in the Jmail Data API docs rather than assuming a body column exists in emails.parquet.

The practical takeaway: use the browser widget and emails-slim.parquet for fast metadata exploration, then move to the larger document shards when you want deeper full-text work.
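You can still do lightweight keyword-in-context work against the metadata you do have. POSITION finds where a term appears and SUBSTRING extracts a window around it - here applied to subject lines as a sketch of the technique, which carries over unchanged to body text once you move to the documents-full/ shards:

```sql
-- Keyword-in-context: a ~60-character window around each match
SELECT
  sender,
  SUBSTRING(
    subject,
    GREATEST(POSITION('island' IN LOWER(subject)) - 20, 1),
    60
  ) AS context
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE subject ILIKE '%island%'
LIMIT 20;
```

GREATEST(... , 1) keeps the start index from going below 1 when the match sits near the beginning of the string.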

Available Datasets

Everything served from https://data.jmail.world/v1/:

| Dataset | File | Records | Size |
|---|---|---|---|
| Emails (slim) | emails-slim.parquet | 1.78M | 41 MB |
| Emails (full) | emails.parquet | 1.78M | 334 MB |
| Documents | documents.parquet | 1.41M | 25 MB |
| Photos | photos.parquet | 18K | ~1 MB |
| People | people.parquet | 473 | <100 KB |
| Face data | photo_faces.parquet | 975 | <100 KB |
| iMessage threads | imessage_conversations.parquet | 15 | <10 KB |
| iMessages | imessage_messages.parquet | 4.5K | ~100 KB |
| Star counts | star_counts.parquet | 414K | ~2 MB |

Full document text is available in sharded files under documents-full/ - see the Jmail Data API docs for details. That’s the place to go for body-text style searching.

Taking It Local

The browser widget is great for exploration, but you do not need to stay in the browser. If you want to save files, rerun heavier queries, build views, or keep a persistent local database, install DuckDB and run the exact same SQL from your terminal.

DuckDB installation instructions live here: DuckDB install guide. On macOS that's typically brew install duckdb; on Linux, the official install script (curl https://install.duckdb.org | sh) is the most reliable route, since distro packages often lag behind or are missing entirely.

Once DuckDB is installed, you can run the queries from this post locally:

# Download once
curl -o emails-slim.parquet https://data.jmail.world/v1/emails-slim.parquet
curl -o emails.parquet https://data.jmail.world/v1/emails.parquet

# Query locally
duckdb -c "
  SELECT sender, COUNT(*) as n
  FROM read_parquet('emails-slim.parquet')
  GROUP BY sender ORDER BY n DESC LIMIT 20;
"

Or create a persistent database for repeated analysis:

CREATE TABLE emails AS 
  SELECT * FROM read_parquet('emails.parquet');
CREATE TABLE docs AS 
  SELECT * FROM read_parquet('documents.parquet');
CREATE TABLE photos AS 
  SELECT * FROM read_parquet('photos.parquet');

SELECT sender, COUNT(*) FROM emails GROUP BY sender ORDER BY 2 DESC LIMIT 10;
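Once the tables exist, views make repeated patterns reusable. For example, a view over the unnest pattern from Level 3 (assuming the same JSON encoding of to_recipients as before) turns the email archive into a sender-to-recipient edge list you can query like any other table:

```sql
-- One row per (sender, recipient) pair - a communication-graph edge list
CREATE VIEW email_edges AS
SELECT sender, recipient, sent_at
FROM emails
CROSS JOIN UNNEST(from_json(to_recipients, '["VARCHAR"]')) AS t(recipient);

SELECT recipient, COUNT(*) AS n
FROM email_edges
GROUP BY recipient
ORDER BY n DESC
LIMIT 10;
```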

For the Python folks:

import duckdb

conn = duckdb.connect('epstein.db')
df = conn.sql("""
  SELECT sender, COUNT(*) as n
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  GROUP BY sender ORDER BY n DESC LIMIT 20
""").df()

print(df)

What You Learned

If you followed along and ran the queries, you now know how to:

  • Explore schemas with DESCRIBE
  • Filter and search with WHERE, ILIKE, and wildcards
  • Aggregate with GROUP BY, COUNT, SUM
  • Work with dates using YEAR(), MONTH(), DAYNAME(), DATE_TRUNC()
  • Parse JSON with from_json() and unnest()
  • Join datasets with LEFT JOIN, FULL OUTER JOIN
  • Use CTEs to break complex queries into readable steps
  • Apply window functions like ROW_NUMBER(), SUM() OVER()
  • Extract context from text with SUBSTRING and POSITION

These aren’t toy skills. This is the same SQL that powers threat hunting in SIEMs, log analysis in Splunk/Elastic, and data correlation in every OSINT toolkit. The difference is you learned it on data you actually wanted to explore.

Further Reading

~jared gore