🦆🔍 Learn SQL by Exploring the Epstein Files

Why This Dataset

SQL tutorials usually suck. You get a fake employees table with 12 rows, you learn SELECT * FROM employees WHERE salary > 50000, and you immediately forget it because you had zero reason to care about the results.

What if the dataset was 1.78 million emails from a convicted sex trafficker’s inbox - emails between Jeffrey Epstein and the powerful people who surrounded him - released to the public by Congress and the Department of Justice?

Now you have a reason to care about the results.

The Jmail Data API serves the entire Epstein Files Transparency Act (EFTA) archive as Parquet files. No API keys. No auth. No rate limits. Just static files on a CDN, ready to be queried. And DuckDB - the in-process analytics database I’ve been writing about in this series - can query those files directly over HTTP, including from WebAssembly in your browser.

Every query in this post runs live. The widget below isn’t a screenshot - it’s a full DuckDB engine compiled to WASM, executing SQL against real Parquet files from data.jmail.world. Your queries never leave your machine.

But this isn’t just an SQL tutorial. Each section teaches a SQL technique and the analyst thinking behind it - the questions an investigator asks, the patterns they notice, and how those same patterns show up when you’re hunting threats in cloud environments at work. The Epstein archive is the training ground. Your AWS account is the final exam.

Your Workbench

Go ahead and hit “Run Query” on the pre-loaded starter. Everything below this point is meant to be run interactively.

The first time you run a query, DuckDB-WASM needs to download the Parquet file (the slim email dataset is ~41 MB). After that first pull, subsequent queries against the same file are fast because the data is cached in your browser’s memory.

Why Parquet Matters (and Why You Keep Seeing It)

You’ll notice the Jmail API doesn’t serve JSON or CSV. It serves .parquet files. That’s not an accident, and it’s worth understanding why because you’re going to encounter Parquet constantly in your career.

Parquet is a columnar file format. A CSV stores data row by row - all of row 1, then all of row 2, and so on. Parquet stores data column by column - all of the sender values together, then all of the subject values, then all of the sent_at values. This sounds like a trivial difference but it changes everything for analytical workloads.

When you write SELECT sender, COUNT(*) FROM emails GROUP BY sender, you only need two columns out of maybe fifteen. With CSV, you have to read the entire file - every column of every row - to get at the sender column. With Parquet, DuckDB reads only the sender column and skips the rest. On a 1.78 million row dataset, that’s the difference between downloading 334 MB (the full email file with body text) and 41 MB (the slim file, which is itself Parquet doing its job).

Parquet also compresses aggressively within each column because similar values are stored together. A column of email addresses compresses far better than a row that mixes email addresses with timestamps, subject lines, and JSON arrays.

Where you’ll see Parquet at work:

AWS CloudTrail Lake stores audit logs as Parquet in S3. When you query “who called AssumeRole in the last 90 days,” Athena is reading Parquet files under the hood.
AWS Cost and Usage Reports land in S3 as Parquet. That billing analysis your finance team runs? Parquet.
Azure Monitor and Microsoft Defender export logs as Parquet to ADLS Gen2.
GCP BigQuery uses a columnar format internally that’s conceptually identical to Parquet, and BigQuery export jobs write Parquet to Cloud Storage.
Security tools like CrowdStrike Falcon LogScale, Panther, and Amazon Security Lake all use Parquet as their storage layer.
Any data lake or lakehouse - Delta Lake, Apache Iceberg, Apache Hudi - is Parquet files with a metadata layer on top.

The skill you’re building right now - read_parquet() in DuckDB - is the same skill that lets you point DuckDB at an S3 bucket full of CloudTrail logs and start hunting without spinning up an Athena cluster or paying per-query fees. Every query in this tutorial works the same way whether the Parquet file is on data.jmail.world, in an S3 bucket, or on your local disk.

Level 0: Reconnaissance - Look Before You Leap

An analyst never starts by asking questions. They start by understanding what’s available to ask questions about. In an incident response, the first thing you do is inventory your log sources. What do I have? What’s the time range? What fields exist? The DESCRIBE statement is that inventory step.

DESCRIBE SELECT *
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet');

Paste that into the widget above. You’ll see columns like sender, subject, to_recipients, sent_at, epstein_is_sender, and more. This is your reconnaissance phase - same as any OSINT workflow. Understand the data before you start pulling threads.

The key columns to know:

sender - who sent the email
subject - the subject line
to_recipients - a JSON array of recipient email addresses
cc_recipients / bcc_recipients - same, for CC and BCC
sent_at - timestamp of when the email was sent
epstein_is_sender - boolean flag, true when Epstein sent the email
account_email - which of Epstein’s accounts (Gmail or Yahoo) the email belongs to

At work: You’d do the same thing with a fresh CloudTrail export: DESCRIBE SELECT * FROM read_parquet('s3://your-bucket/AWSLogs/**/*.parquet'). Before you hunt for anything, you need to know what fields are available - userIdentity, eventSource, sourceIPAddress, requestParameters - and what types they are. Reconnaissance before exploitation applies to data analysis too.

Level 1: Identify the Players (SELECT, WHERE, ORDER BY)

The foundation of SQL. Every query starts with what you want (SELECT), where it lives (FROM), and optionally what conditions filter it (WHERE). But the analyst question driving this level is simple: who are the key actors in this dataset?

In any investigation - whether it’s a congressional inquiry or a compromised AWS account - the first thing you want to know is who’s involved and how active they are.

Who sent the most emails?

SELECT sender, COUNT(*) as email_count
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY sender
ORDER BY email_count DESC
LIMIT 20;

COUNT(*) counts rows per group. GROUP BY sender creates one group per unique sender. ORDER BY email_count DESC sorts highest first. LIMIT 20 caps the output. These four clauses are the bread and butter of exploratory analysis.

Now actually read that list. The first non-Epstein name is Lesley Groff with ~126K emails - she was his executive assistant, which is exactly the volume profile of someone managing his calendar. The next non-Epstein name is Richard Kahn (~29K) - his longtime accountant. Then Karyna Shuliak (~17K), his on-and-off girlfriend. None of those people are surprising in retrospect, but you didn’t know who they were when you ran the query - and that’s the point. The data named them for you before you knew to look.

Analyst’s notebook: Three of the top five senders in any executive inbox will be: an assistant, a money person, and a romantic partner. That’s the same shape you’d see in the Google Workspace logs of any CEO. The interesting names are the ones that don’t fit that pattern. Make a mental list of any name in the top 20 you can’t immediately classify - those are the threads you’ll pull in Level 3.

Filter by subject keywords

SELECT sender, subject, sent_at
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE subject ILIKE '%legal%'
  AND TRY_CAST(sent_at AS TIMESTAMP) BETWEEN '2001-01-01' AND '2020-01-01'
  AND sender NOT ILIKE '%redacted%'
ORDER BY sent_at DESC
LIMIT 30;

ILIKE is case-insensitive pattern matching. The % wildcards match any characters before and after “legal”. This is your grep for structured data. The NOT ILIKE '%redacted%' filter drops a class of inbound messages where the sender field was stripped by DOJ before release - those tend to dominate keyword searches and crowd out the substantive hits. Read the results that survive: BOP legal hold notices, FBI victim services briefings, subpoena coordination. The query is the same grep you’d run; the discipline is knowing which categories of result are noise vs. signal.

Emails from a specific sender

Pick the highest-volume name from the top-senders list that you don’t immediately recognize and pull their thread:

SELECT subject, sent_at, to_recipients
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE sender ILIKE '%groff%'
  AND TRY_CAST(sent_at AS TIMESTAMP) BETWEEN '2001-01-01' AND '2020-01-01'
ORDER BY sent_at ASC
LIMIT 20;

Sorting ASC instead of DESC shows you Groff’s earliest emails - the moment she enters the archive. The BETWEEN filter throws out rows with sentinel timestamps like 1990-01-01T08:00:00Z, which the archive uses as a placeholder for emails whose true send date wasn’t recoverable. That kind of garbage-date filter is a real-world habit you should build now - every messy dataset has its own version of these sentinels, and they always end up at the top of an ASC sort unless you exclude them. Run the same pattern for '%kahn%', '%shuliak%'. Each name gets you a first-contact date and a flavor of what they handled (logistics, money, personal).

At work: This is exactly how you start a CloudTrail investigation. SELECT userIdentity.arn, COUNT(*) FROM cloudtrail GROUP BY 1 ORDER BY 2 DESC tells you which principals are noisiest. Then you filter by keyword - WHERE eventName ILIKE '%Delete%' or WHERE eventSource = 's3.amazonaws.com' - to narrow down to the activity that matters. The SQL is identical. The only difference is the column names.

Level 2: Establish the Timeline (Aggregation and GROUP BY)

Aggregation is where SQL goes from “search engine” to “analysis tool.” Instead of looking at individual rows, you’re computing summaries. The analyst question here is: when did things happen, and does the temporal pattern tell us anything?

Every investigation is built on a timeline. Before you can understand what happened, you need to understand when.

Email volume over time

SELECT
  YEAR(TRY_CAST(sent_at AS TIMESTAMP)) as year,
  MONTH(TRY_CAST(sent_at AS TIMESTAMP)) as month,
  COUNT(*) as emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) BETWEEN '2001-01-01' AND '2020-01-01'
GROUP BY year, month
ORDER BY year, month;

Scroll through the output and look at the shape. A few things should jump out without you having to do any historical research first:

2005 starts at single digits per month. That same year (March 2005), Palm Beach PD opened the investigation that became his first prosecution. The data didn’t tell you why his email volume collapsed - but it told you that it did, and to the month.
The archive stays thin through 2006, 2007, and most of 2008. This is the window that contains his 2008 plea deal and the 13 months he spent in the Palm Beach County Stockade.
April 2009 jumps from ~440 emails to ~3,900 emails in a single month, then keeps growing. That’s not a gradual ramp - that’s a step function. He got out of jail in July 2009; the email account jeevacation@gmail.com came online roughly when he started preparing for life after incarceration.

You just used SQL to derive a public figure’s incarceration timeline from email metadata alone. No Wikipedia, no news archive - just GROUP BY year, month and reading the shape.

Analyst’s notebook: Gaps and step changes are the two highest-signal patterns in any temporal dataset. A flat-line gap means the account was paused, abandoned, or under legal hold. A sharp step up means a new tool, a new role, or a new account coming online. The trick is that you can identify both before you know what caused them - and then go hunting for the cause. The data leads, the news confirms.

Busiest days of the week

SELECT
  DAYNAME(TRY_CAST(sent_at AS TIMESTAMP)) as day_of_week,
  DAYOFWEEK(TRY_CAST(sent_at AS TIMESTAMP)) as day_num,
  COUNT(*) as emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) IS NOT NULL
GROUP BY day_of_week, day_num
ORDER BY day_num;

DAYNAME and DAYOFWEEK are DuckDB date functions that extract parts of a timestamp. This is behavioral profiling - does the communication pattern look like a normal business operation, or is there significant weekend/off-hours activity?

Epstein’s send vs receive ratio

SELECT
  epstein_is_sender,
  COUNT(*) as email_count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) as pct
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY epstein_is_sender;

That SUM(COUNT(*)) OVER () is a window function - it computes the total count across all groups without collapsing them. We use it here to calculate percentages. Window functions are one of the most powerful features in SQL and we’ll use them more later.

At work: Temporal analysis is the backbone of cloud security monitoring. SELECT DATE_TRUNC('hour', eventTime), COUNT(*) FROM cloudtrail GROUP BY 1 ORDER BY 1 is one of the first queries you’d run during an incident. A spike at 3 AM in API calls from an IAM user that normally works 9-to-5? That’s your signal. The day-of-week analysis catches the same thing - a service account that suddenly becomes active on weekends might have been compromised. And the send/receive ratio translates directly to “is this principal mostly reading data or mostly writing it?” - a shift in that ratio is a behavioral change worth investigating.

Level 3: Map the Network (JSON Unnesting and Relationship Analysis)

The to_recipients column is a JSON array - something like ["person1@email.com", "person2@email.com"]. To analyze the communication network, we need to “unnest” this array so each recipient gets its own row.

The analyst question here is the one that drives every network investigation: who is connected to whom, and how strongly?

Who did Epstein email the most?

SELECT
  recipient,
  COUNT(*) as times_contacted
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
CROSS JOIN UNNEST(from_json(to_recipients, '["VARCHAR"]')) AS t(recipient)
WHERE epstein_is_sender = true
GROUP BY recipient
ORDER BY times_contacted DESC
LIMIT 25;

from_json(to_recipients, '["VARCHAR"]') parses the JSON string into a DuckDB list. unnest() explodes that list so each element becomes its own row. This is a pattern you’ll use constantly when working with semi-structured data - and semi-structured data is everywhere in cloud environments.

Now read the output. The assistant and accountant are at the top - expected. But look further down. Kathy Ruemmler is in the top 5, with roughly 3,600 emails from Epstein. Boris Nikolic (Bill Gates’ former science advisor), Peter Mandelson (the British Lord and former EU commissioner), Tom Pritzker (Hyatt chairman), Joi Ito (former director of the MIT Media Lab). Some of these names you know. Some you don’t. The ones you don’t are the ones worth pulling on.

Pulling the thread: when does a name first appear?

Pick a name from the list that surprised you - we’ll use Kathy Ruemmler because she’s the highest-volume non-staffer in the top 5. The first question to ask about any new actor is when do they show up?

SELECT YEAR(TRY_CAST(sent_at AS TIMESTAMP)) as year, COUNT(*) as emails
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE all_participants ILIKE '%ruemmler%'
GROUP BY year
ORDER BY year;

Look at what the data tells you: a single stray email in 2010, another in 2013, then 2,756 in 2014. That’s not a relationship that grew organically - that’s a switch flipping. So when in 2014 did it flip?

SELECT sender, subject, sent_at
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE all_participants ILIKE '%ruemmler%'
  AND TRY_CAST(sent_at AS TIMESTAMP) BETWEEN '2014-01-01' AND '2014-12-31'
ORDER BY sent_at ASC
LIMIT 10;

The first sustained thread starts July 1, 2014. Kathy Ruemmler resigned as White House Counsel to President Obama in May 2014. The data didn’t tell you that part - but it told you that whatever caused her to become one of Epstein’s most frequent correspondents happened in the gap between those two months. That’s the rabbit hole. SQL got you to the edge of it; the public record explains the rest.

This pattern - top-N list → pick the unfamiliar name → ask when they first appeared → ask what they did right before that → look at the news for that window - is the entire OSINT analyst workflow. You just executed it.

Bidirectional communication pairs

Who had the most back-and-forth with Epstein? One-way communication tells you who received orders. Two-way communication tells you who had a relationship.

WITH sent AS (
  SELECT
    other_party,
    COUNT(*) as sent_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  CROSS JOIN UNNEST(from_json(to_recipients, '["VARCHAR"]')) AS t(other_party)
  WHERE epstein_is_sender = true
  GROUP BY other_party
),
received AS (
  SELECT
    sender as other_party,
    COUNT(*) as received_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE epstein_is_sender = false
  GROUP BY sender
)
SELECT
  COALESCE(s.other_party, r.other_party) as contact,
  COALESCE(s.sent_count, 0) as epstein_sent,
  COALESCE(r.received_count, 0) as they_sent,
  COALESCE(s.sent_count, 0) + COALESCE(r.received_count, 0) as total
FROM sent s
FULL OUTER JOIN received r ON s.other_party = r.other_party
WHERE COALESCE(s.other_party, r.other_party) IS NOT NULL
  AND COALESCE(s.other_party, r.other_party) NOT ILIKE '%jeevacation%'
  AND COALESCE(s.other_party, r.other_party) NOT ILIKE '%epstein%'
  AND COALESCE(s.other_party, r.other_party) NOT ILIKE 'jeffrey%'
  AND COALESCE(s.other_party, r.other_party) NOT ILIKE '%redacted%'
  AND COALESCE(s.other_party, r.other_party) NOT IN ('Unknown', '')
ORDER BY total DESC
LIMIT 20;

Welcome to Common Table Expressions (CTEs) - the WITH clause. Think of each CTE as a named subquery you can reference later. Here we compute sent and received counts separately, then FULL OUTER JOIN them to get the complete picture. COALESCE handles NULLs when someone appears in only one direction. The NOT ILIKE chain in the outer WHERE strips out Epstein’s own aliases (jeevacation@gmail.com, “Jeffrey Epstein”, etc.) - without it, the top of the leaderboard is just Epstein emailing himself across his Gmail and Yahoo accounts, which is real but uninteresting.

This is the kind of network analysis that OSINT practitioners do manually with link charts. SQL does it in seconds.

Analyst’s notebook: Look at the ratio between epstein_sent and they_sent for each contact. A contact where Epstein sent 500 emails but received only 3 is a very different relationship than one where the counts are roughly equal. The asymmetry tells you about power dynamics, information flow direction, and who was driving the communication.

At work: This is the same pattern you’d use to map IAM role assumption chains in CloudTrail. Replace sender with userIdentity.arn and to_recipients with requestParameters.roleArn and you’re building the same bidirectional graph - which principals assume which roles, how often, and is it reciprocal? A service account that assumes an admin role 500 times but the admin role never calls back to that service account’s APIs? That’s expected. But if you see two IAM users assuming each other’s roles back and forth? That’s worth investigating - it could be privilege laundering to obscure an audit trail.

Level 4: Correlate Across Sources (Joins Across Datasets)

Single-source analysis only gets you so far. Real investigations require correlation - checking whether patterns in one dataset are confirmed, contradicted, or expanded by another. The analyst question: does this person’s activity in one channel match their activity in another?

The Jmail Data API serves more than just emails. Let’s bring in other datasets.

Cross-reference emails with iMessage threads

WITH email_contacts AS (
  SELECT
    sender as contact,
    COUNT(*) as email_count
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE epstein_is_sender = false
  GROUP BY sender
),
imessage_contacts AS (
  SELECT name, slug, message_count
  FROM read_parquet('https://data.jmail.world/v1/imessage_conversations.parquet')
)
SELECT
  im.name,
  im.slug,
  im.message_count as imessages,
  COALESCE(ec.email_count, 0) as emails,
  im.message_count + COALESCE(ec.email_count, 0) as total_comms
FROM imessage_contacts im
LEFT JOIN email_contacts ec
  ON LOWER(im.name) = LOWER(ec.contact)
ORDER BY total_comms DESC;

LEFT JOIN keeps all rows from the left table (iMessages) and attaches matching email data where it exists. Here we join on name, but depending on how a person appears in the email archive you may need to normalize further or join against slug instead.

Analyst’s notebook: Someone who appears in iMessages but not in email is interesting. It might mean the name doesn’t match between systems (a normalization problem), or it might mean this person deliberately used a different channel to communicate - they’re in the text messages but stayed out of the email inbox. The absence of a match is itself a finding.

Explore the photo metadata

SELECT source, COUNT(*) as photo_count
FROM read_parquet('https://data.jmail.world/v1/photos.parquet')
GROUP BY source
ORDER BY photo_count DESC;

SELECT image_description, original_filename, source
FROM read_parquet('https://data.jmail.world/v1/photos.parquet')
WHERE image_description ILIKE '%office%'
LIMIT 20;

The image_description column contains AI-generated descriptions of each photo - another searchable surface for investigation.

At work: Cross-source correlation is the entire job description of a cloud security analyst. You’re joining CloudTrail events with VPC Flow Logs, with GuardDuty findings, with IAM policy documents, with Config snapshots. “This IAM user made 47 GetObject calls against an S3 bucket at 2 AM” is one finding. Joining that against VPC Flow Logs to confirm data was exfiltrated to an external IP makes it an incident. The SQL JOIN pattern you just learned - CTEs to prepare each source, then join on a shared key - is the exact same pattern, just with eventTime and sourceIPAddress instead of name and sender.

Level 5: Detect Change Over Time (Window Functions)

Window functions let you compute values across sets of rows related to the current row - without collapsing the result into groups. They’re the difference between intermediate and advanced SQL.

The analyst question here is about trends and shifts: not just what happened, but how did the pattern change? Who became more important over time? Who disappeared?

Running total of emails over time

SELECT
  DATE_TRUNC('month', TRY_CAST(sent_at AS TIMESTAMP)) as month,
  COUNT(*) as monthly_count,
  SUM(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', TRY_CAST(sent_at AS TIMESTAMP))) as running_total
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE TRY_CAST(sent_at AS TIMESTAMP) BETWEEN '2001-01-01' AND '2020-01-01'
GROUP BY month
ORDER BY month;

SUM(...) OVER (ORDER BY ...) accumulates the count across rows ordered by month. This shows the growth of the archive over time.

Rank senders within each year

SELECT year, sender, email_count, rank
FROM (
  SELECT
    YEAR(TRY_CAST(sent_at AS TIMESTAMP)) as year,
    sender,
    COUNT(*) as email_count,
    ROW_NUMBER() OVER (PARTITION BY YEAR(TRY_CAST(sent_at AS TIMESTAMP)) ORDER BY COUNT(*) DESC) as rank
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  WHERE TRY_CAST(sent_at AS TIMESTAMP) BETWEEN '2001-01-01' AND '2020-01-01'
  GROUP BY year, sender
)
WHERE rank <= 5
ORDER BY year, rank;

ROW_NUMBER() OVER (PARTITION BY year ORDER BY count DESC) assigns a rank within each year. PARTITION BY creates separate windows per year, and ORDER BY determines the ranking within each window.

This is where you should pause and actually read the output year by year. You’re looking at a per-year leaderboard of the people closest to Epstein’s daily communication. The assistants will dominate every year - filter them mentally and look at who else cycles in. You’ll see correspondents appear in the top 5 for two or three years and then disappear. You’ll see names you recognize from headlines show up in years that line up with specific phases of his post-prison reputation rehabilitation. Pick any name that surprises you and run the same when do they first appear query we used on Ruemmler in Level 3. That’s the loop. The leaderboard is just the index into the rabbit hole.

Analyst’s notebook: Two of the most useful signals in this output are new arrivals to the top 5 and departures from it. Someone who was #2 in 2008 and gone by 2009 either changed email addresses, stopped communicating, or had a falling out. Someone who appears at #1 in a year they weren’t even in the top 50 the year before is a relationship that formed under specific circumstances - and the year boundary tells you when to start looking for those circumstances.

At work: Window functions are what turn a log search into a detection. ROW_NUMBER() OVER (PARTITION BY userIdentity.arn ORDER BY eventTime) lets you sequence every action a principal took during an incident. LAG(eventTime) OVER (PARTITION BY sourceIPAddress ORDER BY eventTime) lets you calculate the time between consecutive API calls from the same IP - useful for detecting automated tooling (sub-second gaps between calls = a script, not a human). And ranking principals by data volume per month is exactly how you’d catch a slow data exfiltration that stays under daily thresholds but accumulates over weeks.

Level 6: Search for Specific Evidence (Full-Text Search)

If you want to search email bodies, you’ll need the full 334 MB dataset. This is heavier but unlocks keyword-in-context analysis. The analyst question at this level is the most direct one: does the evidence contain what I think it contains?

All the previous levels gave you the shape of the data - who, when, how much, how connected. This level is where you go looking for the substance.

SELECT sender, subject, sent_at,
  SUBSTRING(content_markdown,
    GREATEST(1, POSITION('island' IN LOWER(content_markdown)) - 50),
    120
  ) as context
FROM read_parquet('https://data.jmail.world/v1/emails.parquet')
WHERE content_markdown ILIKE '%island%'
  AND sent_at IS NOT NULL
ORDER BY sent_at DESC
LIMIT 20;

This extracts a 120-character window around each keyword match - similar to how grep -C works. Swap 'island' for whatever you’re investigating. Note: the full email file is large, so this query will take longer on the first run as DuckDB downloads it.

At work: Full-text search across cloud logs is how you confirm hypotheses. You found a suspicious PutBucketPolicy call in CloudTrail - now search the requestParameters JSON for "Effect": "Allow" combined with "Principal": "*" to confirm someone opened a bucket to the world. The SUBSTRING + POSITION pattern gives you the context window around the match so you can read what happened around the suspicious event without dumping entire log entries.

From the Archive to Your Cloud

If you’ve been running these queries, you’ve been doing something more important than learning SQL syntax. You’ve been practicing the analyst workflow: start broad, identify actors, build a timeline, map relationships, correlate sources, detect changes, and search for evidence. That workflow doesn’t change whether you’re investigating a public archive or responding to a 2 AM PagerDuty alert.

Here’s how each technique maps to the cloud security work you do (or want to do):

What you did here	The SQL technique	The same thing in cloud security
Inventoried the schema	`DESCRIBE`	Inventorying your log sources - what fields does CloudTrail give you? What’s in VPC Flow Logs? What does GuardDuty export?
Found the top senders	`GROUP BY`, `COUNT`, `ORDER BY`	Finding the noisiest IAM principals - service accounts making millions of API calls, or a single user who suddenly starts calling `ListBuckets` across every region
Searched by keyword	`WHERE`, `ILIKE`	Hunting for specific API calls - `WHERE eventName ILIKE '%Delete%'` to find destructive actions, or `WHERE errorCode = 'AccessDenied'` to find permission boundary violations
Built a timeline	`YEAR()`, `MONTH()`, `DATE_TRUNC()`	Constructing the incident timeline - when did the compromise start? When did lateral movement begin? Where are the gaps in logging?
Profiled behavior patterns	`DAYNAME()`, `DAYOFWEEK()`	Detecting off-hours activity - API calls at 3 AM from a principal that normally operates 9-to-5, or weekend activity from a CI/CD service account that only runs on weekday deploys
Parsed JSON arrays	`from_json()`, `unnest()`	Extracting nested CloudTrail fields - `requestParameters` and `responseElements` are JSON objects containing the actual details of what was requested and what happened
Mapped the network	CTEs, `FULL OUTER JOIN`	Building IAM relationship graphs - which principals assume which roles, which roles have access to which resources, and where are the trust relationships?
Correlated across sources	`LEFT JOIN` across datasets	Joining CloudTrail with VPC Flow Logs, GuardDuty findings, Config snapshots, or billing data to build the complete picture
Ranked actors over time	`ROW_NUMBER() OVER (PARTITION BY ...)`	Detecting behavioral shifts - a principal that was #47 in API volume last month and is #1 this month has either changed roles or been compromised
Searched email bodies	`SUBSTRING`, `POSITION`, `ILIKE`	Keyword-in-context search across log payloads - confirming a `PutBucketPolicy` actually set `Principal: *`, or finding the specific S3 keys that were exfiltrated

The Parquet files change. The SQL doesn’t.

Available Datasets

Everything served from https://data.jmail.world/v1/:

Dataset	File	Records	Size
Emails (slim)	`emails-slim.parquet`	1.78M	41 MB
Emails (full)	`emails.parquet`	1.78M	334 MB
Documents	`documents.parquet`	1.41M	25 MB
Photos	`photos.parquet`	18K	~1 MB
People	`people.parquet`	473	<100 KB
Face data	`photo_faces.parquet`	975	<100 KB
iMessage threads	`imessage_conversations.parquet`	15	<10 KB
iMessages	`imessage_messages.parquet`	4.5K	~100 KB
Star counts	`star_counts.parquet`	414K	~2 MB

Full document text is available in sharded files under documents-full/ - see the Jmail Data API docs for details.

Taking It Local

The browser widget is great for exploration, but you do not need to stay in the browser. If you want to save files, rerun heavier queries, build views, or keep a persistent local database, install DuckDB and run the exact same SQL from your terminal.

DuckDB installation instructions live here: DuckDB install guide. On common systems, that usually means brew install duckdb on macOS or apt install duckdb on Debian/Ubuntu.

Once DuckDB is installed, you can run the queries from this post locally:

# Download once
curl -o emails-slim.parquet https://data.jmail.world/v1/emails-slim.parquet
curl -o emails.parquet https://data.jmail.world/v1/emails.parquet

# Query locally
duckdb -c "
  SELECT sender, COUNT(*) as n
  FROM read_parquet('emails-slim.parquet')
  GROUP BY sender ORDER BY n DESC LIMIT 20;
"

Or create a persistent database for repeated analysis:

CREATE TABLE emails AS
  SELECT * FROM read_parquet('emails.parquet');
CREATE TABLE docs AS
  SELECT * FROM read_parquet('documents.parquet');
CREATE TABLE photos AS
  SELECT * FROM read_parquet('photos.parquet');

SELECT sender, COUNT(*) FROM emails GROUP BY sender ORDER BY 2 DESC LIMIT 10;

For the Python folks:

import duckdb

conn = duckdb.connect('epstein.db')
df = conn.sql("""
  SELECT sender, COUNT(*) as n
  FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
  GROUP BY sender ORDER BY n DESC LIMIT 20
""").df()

print(df)

What You Learned

If you followed along and ran the queries, you picked up two things: SQL techniques and the analyst instincts for when to use them.

The SQL skill	When an analyst reaches for it
`DESCRIBE` to explore schemas	Before touching anything - inventory your data sources and understand what’s available
`WHERE`, `ILIKE`, and wildcards to filter and search	When you have a lead and need to pull the thread - a name, a keyword, an API call
`GROUP BY`, `COUNT`, `SUM` to aggregate	When you need to go from “individual events” to “patterns” - who’s the busiest, what’s the most common, where’s the volume?
`YEAR()`, `MONTH()`, `DATE_TRUNC()` for date analysis	When you’re building or validating a timeline - the spine of every investigation
`DAYNAME()`, `DAYOFWEEK()` for behavioral profiling	When you’re looking for anomalies in when things happen, not just what happens
`from_json()` and `unnest()` for JSON parsing	When your data has nested structures - and in cloud environments, it almost always does
`LEFT JOIN`, `FULL OUTER JOIN` for cross-source correlation	When one data source isn’t enough and you need to confirm findings against a second source
CTEs (`WITH` clause) for readable complex queries	When your analysis has multiple steps and you need to keep your thinking organized
`ROW_NUMBER()`, `SUM() OVER()` window functions	When you need to detect change - shifts in ranking, accumulation over time, behavioral drift
`SUBSTRING` and `POSITION` for text extraction	When you’ve found something suspicious and need to read the surrounding context

These aren’t toy skills. This is the same SQL that powers threat hunting in SIEMs, log analysis in Splunk/Elastic, and data correlation in every OSINT toolkit. The difference is you learned it on data you actually wanted to explore .

📚 DuckDB for Cloud Security