Why This Dataset
SQL tutorials usually suck. You get a fake employees table with 12 rows, you learn SELECT * FROM employees WHERE salary > 50000, and you immediately forget it because you had zero reason to care about the results.
What if the dataset was 1.78 million emails from a convicted sex trafficker’s inbox - emails between Jeffrey Epstein and the powerful people who surrounded him - released to the public by Congress and the Department of Justice?
Now you have a reason to care about the results.
The Jmail Data API serves the entire Epstein Files Transparency Act (EFTA) archive as Parquet files. No API keys. No auth. No rate limits. Just static files on a CDN, ready to be queried. And DuckDB - the in-process analytics database I’ve been writing about in this series - can query those files directly over HTTP, including from WebAssembly in your browser.
Every query in this post runs live. The widget below isn’t a screenshot - it’s a full DuckDB engine compiled to WASM, executing SQL against real Parquet files from data.jmail.world. Your queries never leave your machine.
Your Workbench
Go ahead and hit “Run Query” on the pre-loaded starter. Everything below this point is meant to be run interactively.
The first time you run a query, DuckDB-WASM needs to download the Parquet file (the slim email dataset is ~41 MB). After that first pull, subsequent queries against the same file are fast because the data is cached in your browser’s memory.
Level 0: Look Before You Leap
Before writing any analysis, figure out what you’re working with. The DESCRIBE statement shows you the schema of a dataset without loading all the data:
```sql
DESCRIBE SELECT *
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet');
```
Paste that into the widget above. You’ll see columns like sender, subject, to_recipients, sent_at, epstein_is_sender, and more. This is your reconnaissance phase - same as any OSINT workflow. Understand the data before you start pulling threads.
The key columns to know:
- `sender` - who sent the email
- `subject` - the subject line
- `to_recipients` - a JSON array of recipient email addresses
- `cc_recipients` / `bcc_recipients` - same, for CC and BCC
- `sent_at` - timestamp of when the email was sent
- `epstein_is_sender` - boolean flag, `true` when Epstein sent the email
- `account_email` - which of Epstein's accounts (Gmail or Yahoo) the email belongs to
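Once you know the column names, it's worth eyeballing a few raw rows before aggregating anything - a minimal sanity check against the same file:

```sql
-- Peek at a handful of rows to see what real values look like
SELECT sender, subject, sent_at, epstein_is_sender
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
LIMIT 5;
```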
Level 1: SELECT, WHERE, ORDER BY
The foundation of SQL. Every query starts with what you want (SELECT), where it lives (FROM), and optionally what conditions filter it (WHERE).
Who sent the most emails?
```sql
SELECT sender, COUNT(*) as email_count
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY sender
ORDER BY email_count DESC
LIMIT 20;
```
COUNT(*) counts rows per group. GROUP BY sender creates one group per unique sender. ORDER BY email_count DESC sorts highest first. LIMIT 20 caps the output. These four clauses are the bread and butter of exploratory analysis.
Filter by subject keywords
```sql
SELECT sender, subject, sent_at
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE subject ILIKE '%legal%'
ORDER BY sent_at DESC
LIMIT 30;
```
ILIKE is case-insensitive pattern matching. The % wildcards match any characters before and after “legal”. This is your grep for structured data. Try swapping in '%flight%', '%island%', '%donation%', or any keyword that catches your eye.
Emails from a specific sender
```sql
SELECT subject, sent_at, to_recipients
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE sender ILIKE '%groff%'
ORDER BY sent_at DESC
LIMIT 20;
```
Lesley Groff was Epstein’s executive assistant and one of the most prolific correspondents in the archive. Swap in other names as you discover them.
Level 2: Aggregation and GROUP BY
Aggregation is where SQL goes from “search engine” to “analysis tool.” Instead of looking at individual rows, you’re computing summaries.
Email volume over time
```sql
SELECT
  YEAR(TRY_CAST(sent_at AS TIMESTAMP)) as year,
  MONTH(TRY_CAST(sent_at AS TIMESTAMP)) as month,
  COUNT(*) as emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY year, month
ORDER BY year, month;
```
This reveals the temporal shape of the archive. Where are the spikes? Where are the gaps? Are there periods where email volume drops off suspiciously?
Busiest days of the week
```sql
SELECT
  DAYNAME(TRY_CAST(sent_at AS TIMESTAMP)) as day_of_week,
  DAYOFWEEK(TRY_CAST(sent_at AS TIMESTAMP)) as day_num,
  COUNT(*) as emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY day_of_week, day_num
ORDER BY day_num;
```
DAYNAME and DAYOFWEEK are DuckDB date functions that extract parts of a timestamp. Useful for behavioral profiling - does the communication pattern look like a normal business operation?
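The same profiling idea works at finer granularity. A sketch using DuckDB's `HOUR()` function to bucket by hour of day, with the same `TRY_CAST` guard as the queries above:

```sql
-- Emails by hour of day: does activity track business hours or not?
SELECT
  HOUR(TRY_CAST(sent_at AS TIMESTAMP)) as hour_of_day,
  COUNT(*) as emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY hour_of_day
ORDER BY hour_of_day;
```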
Epstein’s send vs receive ratio
```sql
SELECT
  epstein_is_sender,
  COUNT(*) as email_count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) as pct
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY epstein_is_sender;
```
That SUM(COUNT(*)) OVER () is a window function - it computes the total count across all groups without collapsing them. We use it here to calculate percentages. Window functions are one of the most powerful features in SQL and we’ll use them more later.
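One more aggregation tool worth having before we move on: `HAVING` filters on aggregated values, which `WHERE` cannot do because it runs before grouping. For example, to keep only senders above a volume threshold (the cutoff here is arbitrary, for illustration):

```sql
-- HAVING filters groups after aggregation; WHERE filters rows before it
SELECT sender, COUNT(*) as email_count
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY sender
HAVING COUNT(*) > 1000
ORDER BY email_count DESC;
```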
Level 3: JSON Unnesting and Network Analysis
The to_recipients column is a JSON array - something like ["person1@email.com", "person2@email.com"]. To analyze the communication network, we need to “unnest” this array so each recipient gets its own row.
Who did Epstein email the most?
```sql
SELECT
  recipient,
  COUNT(*) as times_contacted
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
CROSS JOIN UNNEST(from_json(to_recipients, '["VARCHAR"]')) AS t(recipient)
WHERE epstein_is_sender = true
GROUP BY recipient
ORDER BY times_contacted DESC
LIMIT 20;
```
from_json(to_recipients, '["VARCHAR"]') parses the JSON string into a DuckDB list. unnest() explodes that list so each element becomes its own row. This is a pattern you’ll use constantly when working with semi-structured data in security logs, API responses, and OSINT dumps.
Bidirectional communication pairs
Who had the most back-and-forth with Epstein?
```sql
WITH sent AS (
  SELECT
    other_party,
    COUNT(*) as sent_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  CROSS JOIN UNNEST(from_json(to_recipients, '["VARCHAR"]')) AS t(other_party)
  WHERE epstein_is_sender = true
  GROUP BY other_party
),
received AS (
  SELECT
    sender as other_party,
    COUNT(*) as received_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE epstein_is_sender = false
  GROUP BY sender
)
SELECT
  COALESCE(s.other_party, r.other_party) as contact,
  COALESCE(s.sent_count, 0) as epstein_sent,
  COALESCE(r.received_count, 0) as they_sent,
  COALESCE(s.sent_count, 0) + COALESCE(r.received_count, 0) as total
FROM sent s
FULL OUTER JOIN received r ON s.other_party = r.other_party
ORDER BY total DESC
LIMIT 20;
```
Welcome to Common Table Expressions (CTEs) - the WITH clause. Think of each CTE as a named subquery you can reference later. Here we compute sent and received counts separately, then FULL OUTER JOIN them to get the complete picture. COALESCE handles NULLs when someone appears in only one direction.
This is the kind of network analysis that OSINT practitioners do manually with link charts. SQL does it in seconds.
Level 4: Joins Across Datasets
The Jmail Data API serves more than just emails. Let’s bring in other datasets.
Cross-reference emails with iMessage threads
```sql
WITH email_contacts AS (
  SELECT
    sender as contact,
    COUNT(*) as email_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE epstein_is_sender = false
  GROUP BY sender
),
imessage_contacts AS (
  SELECT name, slug, message_count
  FROM read_parquet('https://data.jmail.world/v1/imessage_conversations.parquet')
)
SELECT
  im.name,
  im.slug,
  im.message_count as imessages,
  COALESCE(ec.email_count, 0) as emails,
  im.message_count + COALESCE(ec.email_count, 0) as total_comms
FROM imessage_contacts im
LEFT JOIN email_contacts ec
  ON LOWER(im.name) = LOWER(ec.contact)
ORDER BY total_comms DESC;
```
LEFT JOIN keeps all rows from the left table (iMessages) and attaches matching email data where it exists. Here we join on name, but depending on how a person appears in the email archive you may need to normalize further or join against slug instead.
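As a sketch of that normalization - assuming some sender values are display strings like `Jane Doe <jane@example.com>`, which you should verify against the actual data - DuckDB's `regexp_extract` can pull out the bare address before joining:

```sql
-- Hypothetical normalization: extract the address from "Name <addr>" forms
-- and fall back to the raw value when there are no angle brackets.
-- regexp_extract returns '' on no match; NULLIF turns that into NULL
-- so COALESCE can supply the fallback.
SELECT
  sender,
  COALESCE(NULLIF(regexp_extract(sender, '<([^>]+)>', 1), ''), sender) as bare_address
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
LIMIT 10;
```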
Explore the photo metadata
```sql
SELECT source, COUNT(*) as photo_count
FROM read_parquet('https://data.jmail.world/v1/photos.parquet')
GROUP BY source
ORDER BY photo_count DESC;
```

```sql
SELECT image_description, original_filename, source
FROM read_parquet('https://data.jmail.world/v1/photos.parquet')
WHERE image_description ILIKE '%office%'
LIMIT 20;
```
The image_description column contains AI-generated descriptions of each photo - another searchable surface for investigation.
Level 5: Window Functions
Window functions let you compute values across sets of rows related to the current row - without collapsing the result into groups. They’re the difference between intermediate and advanced SQL.
Running total of emails over time
```sql
SELECT
  DATE_TRUNC('month', TRY_CAST(sent_at AS TIMESTAMP)) as month,
  COUNT(*) as monthly_count,
  SUM(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', TRY_CAST(sent_at AS TIMESTAMP))) as running_total
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY month
ORDER BY month;
```
SUM(...) OVER (ORDER BY ...) accumulates the count across rows ordered by month. This shows the growth of the archive over time.
Rank senders within each year
```sql
SELECT year, sender, email_count, rank
FROM (
  SELECT
    YEAR(TRY_CAST(sent_at AS TIMESTAMP)) as year,
    sender,
    COUNT(*) as email_count,
    ROW_NUMBER() OVER (PARTITION BY YEAR(TRY_CAST(sent_at AS TIMESTAMP)) ORDER BY COUNT(*) DESC) as rank
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
  GROUP BY year, sender
)
WHERE rank <= 5
ORDER BY year, rank;
```
ROW_NUMBER() OVER (PARTITION BY year ORDER BY count DESC) assigns a rank within each year. PARTITION BY creates separate windows per year, and ORDER BY determines the ranking within each window. This reveals who the top correspondents were during different periods - useful for building a timeline of relationships.
Level 6: Full-Text Search (Advanced)
One important caveat: emails.parquet is not a drop-in “full body text” dataset. At the moment, the richer text content lives in separate sharded files under documents-full/, not directly in emails.parquet.
So if you want true keyword-in-context searching across document or email body text, use the documents-full/ shards described in the Jmail Data API docs rather than assuming a body column exists in emails.parquet.
The practical takeaway: use the browser widget and emails-slim.parquet for fast metadata exploration, then move to the larger document shards when you want deeper full-text work.
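As a sketch only - the shard file names below are hypothetical, so look up the real paths in the Jmail Data API docs - you can hand `read_parquet` an explicit list of shard URLs. Note that plain HTTP endpoints can't be globbed with `*`, since there's no directory listing for DuckDB to expand the pattern against.

```sql
-- Hypothetical shard names; replace with the real ones from the API docs
DESCRIBE SELECT *
FROM read_parquet([
  'https://data.jmail.world/v1/documents-full/shard-0000.parquet',
  'https://data.jmail.world/v1/documents-full/shard-0001.parquet'
]);
```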
Available Datasets
Everything served from https://data.jmail.world/v1/:
| Dataset | File | Records | Size |
|---|---|---|---|
| Emails (slim) | emails-slim.parquet | 1.78M | 41 MB |
| Emails (full) | emails.parquet | 1.78M | 334 MB |
| Documents | documents.parquet | 1.41M | 25 MB |
| Photos | photos.parquet | 18K | ~1 MB |
| People | people.parquet | 473 | <100 KB |
| Face data | photo_faces.parquet | 975 | <100 KB |
| iMessage threads | imessage_conversations.parquet | 15 | <10 KB |
| iMessages | imessage_messages.parquet | 4.5K | ~100 KB |
| Star counts | star_counts.parquet | 414K | ~2 MB |
Full document text is available in sharded files under documents-full/ - see the Jmail Data API docs for details. That’s the place to go for body-text style searching.
Taking It Local
The browser widget is great for exploration, but you do not need to stay in the browser. If you want to save files, rerun heavier queries, build views, or keep a persistent local database, install DuckDB and run the exact same SQL from your terminal.
DuckDB installation instructions live here: DuckDB install guide. On macOS that usually means `brew install duckdb`; on Linux and Windows, grab the prebuilt binary or use the official install script from the DuckDB site, since DuckDB is generally not in default distro package repositories.
Once DuckDB is installed, you can run the queries from this post locally:
```bash
# Download once
curl -o emails-slim.parquet https://data.jmail.world/v1/emails-slim.parquet
curl -o emails.parquet https://data.jmail.world/v1/emails.parquet

# Query locally
duckdb -c "
SELECT sender, COUNT(*) as n
FROM read_parquet('emails-slim.parquet')
GROUP BY sender ORDER BY n DESC LIMIT 20;
"
```
Or create a persistent database for repeated analysis:
```sql
CREATE TABLE emails AS
SELECT * FROM read_parquet('emails.parquet');

CREATE TABLE docs AS
SELECT * FROM read_parquet('documents.parquet');

CREATE TABLE photos AS
SELECT * FROM read_parquet('photos.parquet');

SELECT sender, COUNT(*) FROM emails GROUP BY sender ORDER BY 2 DESC LIMIT 10;
```
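If you'd rather not copy the data at all, define a view instead: a view stores just the query text, so each time you select from it DuckDB re-reads the underlying file, and this works against the remote URLs too:

```sql
-- A view over the remote file: nothing is fetched until you query it
CREATE VIEW emails_remote AS
SELECT * FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet');

SELECT COUNT(*) FROM emails_remote;
```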
For the Python folks:
```python
import duckdb

conn = duckdb.connect('epstein.db')
df = conn.sql("""
    SELECT sender, COUNT(*) as n
    FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
    GROUP BY sender ORDER BY n DESC LIMIT 20
""").df()
print(df)
```
What You Learned
If you followed along and ran the queries, you now know how to:
- Explore schemas with `DESCRIBE`
- Filter and search with `WHERE`, `ILIKE`, and wildcards
- Aggregate with `GROUP BY`, `COUNT`, and `SUM`
- Work with dates using `YEAR()`, `MONTH()`, `DAYNAME()`, and `DATE_TRUNC()`
- Parse JSON with `from_json()` and `unnest()`
- Join datasets with `LEFT JOIN` and `FULL OUTER JOIN`
- Use CTEs to break complex queries into readable steps
- Apply window functions like `ROW_NUMBER()` and `SUM() OVER ()`
These aren’t toy skills. This is the same SQL that powers threat hunting in SIEMs, log analysis in Splunk/Elastic, and data correlation in every OSINT toolkit. The difference is you learned it on data you actually wanted to explore.
Further Reading
- Jmail Data API docs - full dataset documentation and Python client
- DuckDB documentation - the official reference
- Jmail - the Gmail-style browsing interface built by Riley Walz and Luke Igel
- Part 1: DuckDB for Security Engineers - the first post in this series, covering security tool output analysis
- Part 4: Tailpipe - DuckDB-powered SIEM for cloud logs