
Raw Data Guide

Date: 2026-04-10
Purpose: map every file in the scrape bundle so you can run your own analysis later.


/Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/
├── sitemaps/ # Original Fiverr sitemaps (gzip + extracted)
│ ├── sitemap_categories.xml # 945 category URLs
│ ├── sitemap_gigs[1-7].xml # 307,985 gig URLs (7 shards)
│ ├── sitemap_users[1-7].xml # 310,422 seller URLs
│ └── sitemap_tags[1-2].xml # 55,736 tag pages
├── parsed/ # JSONL parsed from sitemaps
│ ├── categories.jsonl # 945 category entries
│ ├── categories-tree.json # hierarchical taxonomy
│ ├── categories-tree.md # human-readable taxonomy
│ ├── gigs.jsonl # 307,985 gig URLs with seller, slug, keywords
│ ├── sellers.jsonl # 310,422 seller usernames
│ ├── top-words.tsv # word frequency in 307K slugs (200 rows)
│ ├── top-bigrams.tsv # 2-word combinations
│ ├── top-trigrams.tsv # 3-word combinations (strong signal of service types)
│ ├── top-first-words.tsv # action verbs ("create", "design", "make", ...)
│ ├── longtail-words.tsv # words appearing 2-5 times (rare)
│ ├── top-sellers-by-gig-count.tsv # sellers with most gigs (up to 30)
│ └── seller-distribution.json # stats (median 1 gig/seller, 60% have 1)
├── scraped/ # Data scraped from live listing pages
│ ├── gigs-raw.jsonl # ~25K+ enriched gig records with price, rating, etc.
│ ├── listing-snapshots.jsonl # per-page snapshot (subcategory + pagination total)
│ ├── keyword-gigs.jsonl # ~4K gigs from keyword-targeted search
│ ├── keyword-snapshots.jsonl # per-keyword search snapshots
│ ├── progress.json, progress-2.json, progress-3.json # resume state per worker
│ ├── scrape.log, scrape-2.log, scrape-3.log # worker logs
│ ├── scrape-keywords.log # keyword scraper log
│ │
│ └── analysis/ # Aggregated analytics
│ ├── summary.json # headline stats (median, top countries, etc.)
│ ├── global-pricing.json # price distribution
│ ├── subcategory-analysis.json # per-subcategory pricing, seller levels, %pro
│ ├── top-sellers.json # top sellers by cumulative review count
│ ├── country-distribution.tsv # country → gig count
│ ├── seller-level-distribution.tsv
│ ├── keyword-analysis.json # per-keyword pricing stats
│ ├── keyword-analysis.md # human-readable keyword table
│ ├── western-vs-cheap.json # Western vs cheap-country competition per subcat
│ ├── western-vs-cheap.md # ranked by Western %
│ └── long-tail-report.md # competition tier breakdown
├── initial-data/ # Early sample extracts (gig / listing / seller pages)
├── recon/ # Raw HTML from initial recon
├── patchright-test/ # Initial patchright validation
└── stealth-test/ # Patchright stealth validation
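The URL counts in the tree come straight from the sitemap XML: each URL is a `<loc>` element. A sketch of the sanity check, using an inline two-URL sample in place of a real extracted shard (e.g. sitemaps/sitemap_gigs1.xml after gunzipping):

```shell
# Count <loc> entries in a sitemap shard; the sample file stands in for a real one.
cat > /tmp/sample-sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.fiverr.com/u1/gig-one</loc></url>
  <url><loc>https://www.fiverr.com/u2/gig-two</loc></url>
</urlset>
EOF
grep -c '<loc>' /tmp/sample-sitemap.xml   # prints 2
```

Summing this count across the seven gig shards should reproduce the 307,985 figure above.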

See /Users/ivanprotsko/upready/docs-projects/fiverr-research/DATA-MODEL.md for full schemas.

Quick reference:

  • gigs-raw.jsonl rows: one gig per line, JSON with ~35 fields (title, price, seller country, rating, reviews, Pro flag, consultation, etc.)
  • listing-snapshots.jsonl rows: one listing page per line (category_path, page, total_gigs, gigs_in_page)
  • keyword-gigs.jsonl rows: same as gigs-raw but tagged with keyword and search_total
  • keyword-snapshots.jsonl rows: one search-page per line

Example analysis commands:
# Count organic gigs in scraped data
grep -v '"type":"promoted_gigs"' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | wc -l
# Gig count by seller country
jq -r '.seller_country' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | sort | uniq -c | sort -rn | head -20
# Price histogram in $50 buckets
jq -r '.price_usd' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | awk '{print int($1/50)*50}' | sort -n | uniq -c
# Gigs priced above $1,000 (price, country, category, title)
jq -r 'select(.price_usd > 1000) | "\(.price_usd)\t\(.seller_country)\t\(.category_path)\t\(.title)"' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl
# Subcategory name list from sitemap
jq -r 'select(.depth == 2) | "\(.category)/\(.subcategory)"' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/parsed/categories.jsonl | sort
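One statistic the commands above don't cover is a median price. A sketch of the `jq | sort | awk` chain over inline sample rows (substitute `jq -r '.price_usd' scraped/gigs-raw.jsonl` for the printf to run it on the real file; price_usd is the field from the schema above):

```shell
# Median of price_usd over a JSONL stream. The three inline rows are sample data.
printf '%s\n' '{"price_usd":50}' '{"price_usd":150}' '{"price_usd":300}' \
| jq -r '.price_usd' | sort -n \
| awk '{a[NR]=$1} END{m = (NR % 2) ? a[(NR+1)/2] : (a[NR/2]+a[NR/2+1])/2; print m}'   # prints 150
```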

Source code: /Users/ivanprotsko/upready/projects/fiverr-research/

  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-listings.ts — main listing scraper (category + subcategory walker)
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-listings-2.ts — worker 2 (separate Chrome profile)
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-listings-3.ts — worker 3
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-keywords.ts — keyword-targeted scraper for modern stack
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/parse-sitemaps.ts — sitemap → JSONL parser
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-categories.ts — taxonomy tree analysis
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-gigs.ts — pricing/country/level stats
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-keywords.ts — per-keyword stats
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-western.ts — Western vs cheap-country split
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-longtail.ts — competition-tier report
  • /Users/ivanprotsko/upready/projects/fiverr-research/scripts/analyze-sitemap-keywords.ts — keyword frequency in slugs
To run the listing scraper on a limited scope (all knobs are env vars):
cd /Users/ivanprotsko/upready/projects/fiverr-research
SCRAPE_CATEGORIES=graphics-design,programming-tech \
MAX_PAGES_PER_SUBCAT=5 \
MIN_DEPTH=2 MAX_DEPTH=2 \
DELAY_MS=2500 \
npx tsx scripts/scrape-listings.ts

Edit KEYWORDS array in /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-keywords.ts, then:

cd /Users/ivanprotsko/upready/projects/fiverr-research
npx tsx scripts/scrape-keywords.ts

Results are appended to /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/keyword-gigs.jsonl.
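To sanity-check a keyword run, tally the appended rows per keyword (the keyword field per the quick reference above), using the same `sort | uniq -c` idiom as the country example. Inline rows with hypothetical keyword values stand in for the real file:

```shell
# Rows-per-keyword tally; use scraped/keyword-gigs.jsonl in place of the printf.
printf '%s\n' \
  '{"keyword":"nextjs developer"}' \
  '{"keyword":"nextjs developer"}' \
  '{"keyword":"shopify expert"}' \
| jq -r '.keyword' | sort | uniq -c | sort -rn
```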


Fiverr serves a PerimeterX / HUMAN Security bot challenge on every page. Access methods tested:

Method                           Works?  Notes
curl + user-agent                ❌      “It needs a human touch” challenge
playwright headless              ❌      Same challenge
playwright-extra + stealth       ❌      Same
rebrowser-playwright headed      ❌      Same
patchright + Chrome channel      ✅      Works with persistent context + real Chrome
Googlebot user-agent             ❌      IP-based detection, not UA
Wayback Machine                  ⚠️      Few gig URLs archived
Public sitemaps (static files)   ✅      Unprotected, no auth needed

Key finding: patchright (a Playwright fork with anti-detection patches) + channel: 'chrome' (drives the real installed Chrome rather than the bundled Chromium) + launchPersistentContext() (reuses a persistent profile) bypasses the challenge. Verified headed only; headless is untested because headed already suffices.