Raw Data Guide — Fiverr Research
Date: 2026-04-10
Purpose: map every file in the scrape bundle so you can run your own analysis later.
Directory layout
```
/Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/
├── sitemaps/                          # Original Fiverr sitemaps (gzip + extracted)
│   ├── sitemap_categories.xml         # 945 category URLs
│   ├── sitemap_gigs[1-7].xml          # 307,985 gig URLs (7 shards)
│   ├── sitemap_users[1-7].xml         # 310,422 seller URLs
│   └── sitemap_tags[1-2].xml          # 55,736 tag pages
│
├── parsed/                            # JSONL parsed from sitemaps
│   ├── categories.jsonl               # 945 category entries
│   ├── categories-tree.json           # hierarchical taxonomy
│   ├── categories-tree.md             # human-readable taxonomy
│   ├── gigs.jsonl                     # 307,985 gig URLs with seller, slug, keywords
│   ├── sellers.jsonl                  # 310,422 seller usernames
│   ├── top-words.tsv                  # word frequency in 307K slugs (200 rows)
│   ├── top-bigrams.tsv                # 2-word combinations
│   ├── top-trigrams.tsv               # 3-word combinations (strong signal of service types)
│   ├── top-first-words.tsv            # action verbs ("create", "design", "make", ...)
│   ├── longtail-words.tsv             # words appearing 2-5 times (rare)
│   ├── top-sellers-by-gig-count.tsv   # sellers with most gigs (up to 30)
│   └── seller-distribution.json       # stats (median 1 gig/seller, 60% have 1)
│
├── scraped/                           # Data scraped from live listing pages
│   ├── gigs-raw.jsonl                 # ~25K+ enriched gig records with price, rating, etc.
│   ├── listing-snapshots.jsonl        # per-page snapshot (subcategory + pagination total)
│   ├── keyword-gigs.jsonl             # ~4K gigs from keyword-targeted search
│   ├── keyword-snapshots.jsonl        # per-keyword search snapshots
│   ├── progress.json, progress-2.json, progress-3.json   # resume state per worker
│   ├── scrape.log, scrape-2.log, scrape-3.log            # worker logs
│   ├── scrape-keywords.log            # keyword scraper log
│   │
│   └── analysis/                      # Aggregated analytics
│       ├── summary.json               # headline stats (median, top countries, etc.)
│       ├── global-pricing.json        # price distribution
│       ├── subcategory-analysis.json  # per-subcategory pricing, seller levels, %pro
│       ├── top-sellers.json           # top sellers by cumulative review count
│       ├── country-distribution.tsv   # country → gig count
│       ├── seller-level-distribution.tsv
│       ├── keyword-analysis.json      # per-keyword pricing stats
│       ├── keyword-analysis.md        # human-readable keyword table
│       ├── western-vs-cheap.json      # Western vs cheap-country competition per subcat
│       ├── western-vs-cheap.md        # ranked by Western %
│       └── long-tail-report.md        # competition tier breakdown
│
├── initial-data/                      # Early sample extracts (gig / listing / seller pages)
├── recon/                             # Raw HTML from initial recon
├── patchright-test/                   # Initial patchright validation
└── stealth-test/                      # Patchright stealth validation
```

Entity schemas
See /Users/ivanprotsko/upready/docs-projects/fiverr-research/DATA-MODEL.md for full schemas.
Quick reference:
- `gigs-raw.jsonl` rows: one gig per line, JSON with ~35 fields (title, price, seller country, rating, reviews, Pro flag, consultation, etc.)
- `listing-snapshots.jsonl` rows: one listing page per line (`category_path`, `page`, `total_gigs`, `gigs_in_page`)
- `keyword-gigs.jsonl` rows: same as `gigs-raw.jsonl` but tagged with `keyword` and `search_total`
- `keyword-snapshots.jsonl` rows: one search page per line
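For analysis beyond what jq handles comfortably, the JSONL rows can be consumed directly in TypeScript. A minimal sketch, assuming only the `price_usd` field used elsewhere in this guide; the inline sample rows are hypothetical stand-ins for real records (in practice, read `gigs-raw.jsonl` with `fs.readFileSync`):

```typescript
// Compute the median price from JSONL text (one JSON object per line).
// The sample rows below are hypothetical stand-ins for real gig records.
const sample = [
  '{"title":"design a logo","price_usd":25,"seller_country":"PK"}',
  '{"title":"build a react app","price_usd":150,"seller_country":"US"}',
  '{"title":"seo audit","price_usd":60,"seller_country":"IN"}',
].join("\n");

function medianPrice(jsonl: string): number {
  const prices = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line).price_usd as number)
    .sort((a, b) => a - b);
  const mid = Math.floor(prices.length / 2);
  return prices.length % 2 ? prices[mid] : (prices[mid - 1] + prices[mid]) / 2;
}

console.log(medianPrice(sample)); // 60
```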
Useful one-liners
```sh
# Count organic gigs in the scraped data
grep -v '"type":"promoted_gigs"' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | wc -l

# Gig count by seller country (top 20)
jq -r '.seller_country' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | sort | uniq -c | sort -rn | head -20

# Price histogram ($50 buckets)
jq -r '.price_usd' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | awk '{print int($1/50)*50}' | sort -n | uniq -c
```
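The `int($1/50)*50` awk trick in the price-histogram one-liner maps each price to the lower edge of its $50 bucket. The same bucketing, sketched in TypeScript with hypothetical prices:

```typescript
// Map a price to the lower edge of its bucket, then tally counts per bucket.
function bucket(priceUsd: number, width = 50): number {
  return Math.floor(priceUsd / width) * width;
}

const prices = [5, 49, 50, 120, 975, 1010]; // hypothetical prices
const hist = new Map<number, number>();
for (const p of prices) {
  hist.set(bucket(p), (hist.get(bucket(p)) ?? 0) + 1);
}
// Bucket counts: 0 → 2, 50 → 1, 100 → 1, 950 → 1, 1000 → 1
console.log([...hist.entries()]);
```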
```sh
# Find all gigs priced above $1,000
jq -c 'select(.price_usd > 1000)' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/gigs-raw.jsonl | jq -r '"\(.price_usd)\t\(.seller_country)\t\(.category_path)\t\(.title)"'

# Subcategory name list from the sitemap
jq -r 'select(.depth == 2) | "\(.category)/\(.subcategory)"' /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/parsed/categories.jsonl | sort
```

How to re-run or extend the scrape
Source code: /Users/ivanprotsko/upready/projects/fiverr-research/
All paths below are relative to the source directory:

- `scripts/scrape-listings.ts` — main listing scraper (category + subcategory walker)
- `scripts/scrape-listings-2.ts` — worker 2 (separate Chrome profile)
- `scripts/scrape-listings-3.ts` — worker 3
- `scripts/scrape-keywords.ts` — keyword-targeted scraper for the modern stack
- `scripts/parse-sitemaps.ts` — sitemap → JSONL parser
- `scripts/analyze-categories.ts` — taxonomy tree analysis
- `scripts/analyze-gigs.ts` — pricing/country/level stats
- `scripts/analyze-keywords.ts` — per-keyword stats
- `scripts/analyze-western.ts` — Western vs cheap-country split
- `scripts/analyze-longtail.ts` — competition-tier report
- `scripts/analyze-sitemap-keywords.ts` — keyword frequency in slugs
Example — re-run the listing scraper
```sh
cd /Users/ivanprotsko/upready/projects/fiverr-research
SCRAPE_CATEGORIES=graphics-design,programming-tech \
  MAX_PAGES_PER_SUBCAT=5 \
  MIN_DEPTH=2 MAX_DEPTH=2 \
  DELAY_MS=2500 \
  npx tsx scripts/scrape-listings.ts
```

Example — new keyword search
Edit the `KEYWORDS` array in /Users/ivanprotsko/upready/projects/fiverr-research/scripts/scrape-keywords.ts, then:
```sh
cd /Users/ivanprotsko/upready/projects/fiverr-research
npx tsx scripts/scrape-keywords.ts
```

Results append to /Users/ivanprotsko/upready/docs/research/raw/2026-04-10-fiverr-scrape/scraped/keyword-gigs.jsonl.
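Because results append on every run, repeated runs can leave duplicate rows in keyword-gigs.jsonl. A dedupe sketch that keeps the most recent row per key; the `url` key field is an assumption here, so swap in whatever unique identifier the rows actually carry:

```typescript
// Keep the last-seen row for each key (later rows overwrite earlier ones).
// The "url" field name is hypothetical; adjust to the real unique id.
function dedupeJsonl(jsonl: string, key = "url"): string {
  const byKey = new Map<string, string>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    byKey.set(String(JSON.parse(line)[key]), line);
  }
  return [...byKey.values()].join("\n");
}

const appended = [
  '{"url":"/a/gig-1","price_usd":20}',
  '{"url":"/b/gig-2","price_usd":45}',
  '{"url":"/a/gig-1","price_usd":25}', // same gig re-scraped with a newer price
].join("\n");
console.log(dedupeJsonl(appended).split("\n").length); // 2
```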
Anti-bot bypass notes
Fiverr runs a PerimeterX / HUMAN Security challenge on every page. Results of the bypass attempts:
| Method | Works? | Notes |
|---|---|---|
| curl + user-agent | ❌ | “It needs a human touch” challenge |
| playwright headless | ❌ | Same challenge |
| playwright-extra + stealth | ❌ | Same |
| rebrowser-playwright headed | ❌ | Same |
| patchright + Chrome channel | ✅ | Works with persistent context + real Chrome |
| Googlebot user-agent | ❌ | IP-based detection, not UA |
| Wayback Machine | ⚠️ | Few gig URLs archived |
| Public sitemaps (static files) | ✅ | Unprotected, no auth needed |
Key: patchright (a fork of Playwright with patches that resist detection) + `channel: 'chrome'` (uses the real installed Chrome) + `launchPersistentContext()` (persistent profile) bypasses the challenge. Verified headed; headless is untested because headed already suffices.
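The working recipe, sketched as a launch call. This assumes the Node package `patchright` (which mirrors Playwright's API) is installed and a real Chrome is present locally; the profile path and target URL are illustrative, not taken from the scraper source:

```typescript
// Sketch of the working bypass: patchright + real Chrome + persistent profile.
// Assumes `npm install patchright` and a local Chrome install.
import { chromium } from "patchright";

async function main(): Promise<void> {
  const context = await chromium.launchPersistentContext(
    "/tmp/fiverr-profile", // any writable dir; persists cookies between runs
    { channel: "chrome", headless: false } // real Chrome, headed (headless untested)
  );
  const page = await context.newPage();
  await page.goto("https://www.fiverr.com/categories");
  console.log(await page.title());
  await context.close();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```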