
Raw Data Guide

Date: 2026-04-10
Purpose: map every file in the scrape bundle so you can run your own analysis later.


/Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/
├── sitemaps/ # Original Fiverr sitemaps (gzip + extracted)
│ ├── sitemap_categories.xml # 945 category URLs
│ ├── sitemap_gigs[1-7].xml # 307,985 gig URLs (7 shards)
│ ├── sitemap_users[1-7].xml # 310,422 seller URLs
│ └── sitemap_tags[1-2].xml # 55,736 tag pages
├── parsed/ # JSONL parsed from sitemaps
│ ├── categories.jsonl # 945 category entries
│ ├── categories-tree.json # hierarchical taxonomy
│ ├── categories-tree.md # human-readable taxonomy
│ ├── gigs.jsonl # 307,985 gig URLs with seller, slug, keywords
│ ├── sellers.jsonl # 310,422 seller usernames
│ ├── top-words.tsv # word frequency in 307K slugs (200 rows)
│ ├── top-bigrams.tsv # 2-word combinations
│ ├── top-trigrams.tsv # 3-word combinations (strong signal of service types)
│ ├── top-first-words.tsv # action verbs ("create", "design", "make", ...)
│ ├── longtail-words.tsv # words appearing 2-5 times (rare)
│ ├── top-sellers-by-gig-count.tsv # sellers with most gigs (up to 30)
│ └── seller-distribution.json # stats (median 1 gig/seller, 60% have 1)
├── scraped/ # Data scraped from live listing pages
│ ├── gigs-raw.jsonl # ~25K+ enriched gig records with price, rating, etc.
│ ├── listing-snapshots.jsonl # per-page snapshot (subcategory + pagination total)
│ ├── keyword-gigs.jsonl # ~4K gigs from keyword-targeted search
│ ├── keyword-snapshots.jsonl # per-keyword search snapshots
│ ├── progress.json, progress-2.json, progress-3.json # resume state per worker
│ ├── scrape.log, scrape-2.log, scrape-3.log # worker logs
│ ├── scrape-keywords.log # keyword scraper log
│ │
│ └── analysis/ # Aggregated analytics
│ ├── summary.json # headline stats (median, top countries, etc.)
│ ├── global-pricing.json # price distribution
│ ├── subcategory-analysis.json # per-subcategory pricing, seller levels, %pro
│ ├── top-sellers.json # top sellers by cumulative review count
│ ├── country-distribution.tsv # country → gig count
│ ├── seller-level-distribution.tsv
│ ├── keyword-analysis.json # per-keyword pricing stats
│ ├── keyword-analysis.md # human-readable keyword table
│ ├── western-vs-cheap.json # Western vs cheap-country competition per subcat
│ ├── western-vs-cheap.md # ranked by Western %
│ └── long-tail-report.md # competition tier breakdown
├── initial-data/ # Early sample extracts (gig / listing / seller pages)
├── recon/ # Raw HTML from initial recon
├── patchright-test/ # Initial patchright validation
└── stealth-test/ # Patchright stealth validation
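The URL counts in the tree come straight from the sitemap XML: each URL is a `<loc>` element. A sketch of the sanity check, using an inline two-URL sample in place of a real extracted shard (e.g. sitemaps/sitemap_gigs1.xml after gunzipping):

```shell
# Count <loc> entries in a sitemap shard; the sample file stands in for a real one.
cat > /tmp/sample-sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.fiverr.com/u1/gig-one</loc></url>
  <url><loc>https://www.fiverr.com/u2/gig-two</loc></url>
</urlset>
EOF
grep -c '<loc>' /tmp/sample-sitemap.xml   # prints 2
```

Summing this count across the seven gig shards should reproduce the 307,985 figure above.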

See /Users/ivanprotsko/upready/docs-projects/fiverr-research/DATA-MODEL.md for full schemas.

Quick reference:

  • gigs-raw.jsonl rows: one gig per line, JSON with ~35 fields (title, price, seller country, rating, reviews, Pro flag, consultation, etc.)
  • listing-snapshots.jsonl rows: one listing page per line (category_path, page, total_gigs, gigs_in_page)
  • keyword-gigs.jsonl rows: same as gigs-raw but tagged with keyword and search_total
  • keyword-snapshots.jsonl rows: one search-page per line

Example analysis commands:
# Count organic gigs in scraped data
grep -v '"type":"promoted_gigs"' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | wc -l
# Gig count by seller country
jq -r '.seller_country' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | sort | uniq -c | sort -rn | head -20
# Price histogram in $50 buckets
jq -r '.price_usd' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | awk '{print int($1/50)*50}' | sort -n | uniq -c
# Gigs priced above $1,000 (price, country, category, title)
jq -r 'select(.price_usd > 1000) | "\(.price_usd)\t\(.seller_country)\t\(.category_path)\t\(.title)"' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl
# Subcategory name list from sitemap
jq -r 'select(.depth == 2) | "\(.category)/\(.subcategory)"' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/parsed/categories.jsonl | sort
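One statistic the commands above don't cover is a median price. A sketch of the `jq | sort | awk` chain over inline sample rows (substitute `jq -r '.price_usd' scraped/gigs-raw.jsonl` for the printf to run it on the real file; price_usd is the field from the schema above):

```shell
# Median of price_usd over a JSONL stream. The three inline rows are sample data.
printf '%s\n' '{"price_usd":50}' '{"price_usd":150}' '{"price_usd":300}' \
| jq -r '.price_usd' | sort -n \
| awk '{a[NR]=$1} END{m = (NR % 2) ? a[(NR+1)/2] : (a[NR/2]+a[NR/2+1])/2; print m}'   # prints 150
```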

Source code: /Users/ivanprotsko/upready/projects/fiverr-research/

  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-listings.ts — main listing scraper (category + subcategory walker)
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-listings-2.ts — worker 2 (separate Chrome profile)
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-listings-3.ts — worker 3
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-keywords.ts — keyword-targeted scraper for modern stack
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/parse-sitemaps.ts — sitemap → JSONL parser
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-categories.ts — taxonomy tree analysis
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-gigs.ts — pricing/country/level stats
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-keywords.ts — per-keyword stats
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-western.ts — Western vs cheap-country split
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-longtail.ts — competition-tier report
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-sitemap-keywords.ts — keyword frequency in slugs
To run the listing scraper on a limited scope (all knobs are env vars):
cd /Users/ivanprotsko/upready/projects/fiverr-research
SCRAPE_CATEGORIES=graphics-design,programming-tech \
MAX_PAGES_PER_SUBCAT=5 \
MIN_DEPTH=2 MAX_DEPTH=2 \
DELAY_MS=2500 \
npx tsx scripts/scrape-listings.ts

Edit KEYWORDS array in /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-keywords.ts, then:

cd /Users/ivanprotsko/upready/projects/fiverr-research
npx tsx scripts/scrape-keywords.ts

Results are appended to /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/keyword-gigs.jsonl.
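To sanity-check a keyword run, tally the appended rows per keyword (the keyword field per the quick reference above), using the same `sort | uniq -c` idiom as the country example. Inline rows with hypothetical keyword values stand in for the real file:

```shell
# Rows-per-keyword tally; use scraped/keyword-gigs.jsonl in place of the printf.
printf '%s\n' \
  '{"keyword":"nextjs developer"}' \
  '{"keyword":"nextjs developer"}' \
  '{"keyword":"shopify expert"}' \
| jq -r '.keyword' | sort | uniq -c | sort -rn
```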


Fiverr serves a PerimeterX / HUMAN Security bot challenge on every page. Access methods tested:

Method                           Works?  Notes
curl + user-agent                ❌      “It needs a human touch” challenge
playwright headless              ❌      Same challenge
playwright-extra + stealth       ❌      Same
rebrowser-playwright headed      ❌      Same
patchright + Chrome channel      ✅      Works with persistent context + real Chrome
Googlebot user-agent             ❌      IP-based detection, not UA
Wayback Machine                  ⚠️      Few gig URLs archived
Public sitemaps (static files)   ✅      Unprotected, no auth needed

Key finding: patchright (a Playwright fork with anti-detection patches) + channel: 'chrome' (drives the real installed Chrome rather than the bundled Chromium) + launchPersistentContext() (reuses a persistent profile) bypasses the challenge. Verified headed only; headless is untested because headed already suffices.