# VPSBenchmarks Scraper Upgrade Summary

## Changes Made

### 1. wrangler.toml
- **Added**: `[browser]` binding for Cloudflare Browser Rendering API
- This enables JavaScript execution for scraping React SPA sites

### 2. src/index.ts
- **Updated**: `Env` interface to include `BROWSER: Fetcher`
- No other changes needed; the scheduled handler already calls the scraper correctly

### 3. src/scraper.ts (Complete Rewrite)

**Old Approach**: Direct HTML fetch without JavaScript execution
- VPSBenchmarks.com is a React SPA → no data in initial HTML
- Previous scraper found 0 results due to client-side rendering

**New Approach**: Browser Rendering API with JavaScript execution
- Uses `/content` endpoint to fetch fully rendered HTML
- Falls back to `/scrape` endpoint for targeted element extraction
- Multiple parsing patterns for robustness:
  - Table row extraction
  - Embedded JSON data detection
  - Benchmark card parsing
  - Scraped element parsing

**Key Features**:
- Waits for `networkidle0` (no in-flight network requests, so client-side data loading should have finished)
- Rejects unnecessary resources (images, fonts, media, stylesheets) for faster loading
- 30-second timeout per request
- Comprehensive error handling
- Deduplication using unique constraint on (provider_name, plan_name, country_code)

### 4. vps-benchmark-schema.sql
- **Added**: Unique index for ON CONFLICT deduplication
  - `CREATE UNIQUE INDEX idx_vps_benchmarks_unique ON vps_benchmarks(provider_name, plan_name, COALESCE(country_code, ''))`

### 5. Documentation
- **Created**: `test-scraper.md` - Testing and troubleshooting guide
- **Created**: `UPGRADE_SUMMARY.md` - This file

## Browser Rendering API Details

### Endpoints Used

1. **POST /content** (Primary)
   ```typescript
   browser.fetch('https://browser-rendering.cloudflare.com/content', {
     method: 'POST',
     body: JSON.stringify({
       url: 'https://www.vpsbenchmarks.com/',
       waitUntil: 'networkidle0',
       rejectResourceTypes: ['image', 'font', 'media', 'stylesheet'],
       timeout: 30000
     })
   })
   ```

2.
   **POST /scrape** (Fallback)
   ```typescript
   browser.fetch('https://browser-rendering.cloudflare.com/scrape', {
     method: 'POST',
     body: JSON.stringify({
       url: 'https://www.vpsbenchmarks.com/',
       elements: [
         { selector: 'table tbody tr' },
         { selector: '.benchmark-card' }
       ]
     })
   })
   ```

### Free Tier Limits
- 10 minutes/day
- More than sufficient for daily scraping
- Each scrape should complete in < 1 minute

## Testing

### Local Testing
```bash
# Terminal 1: Start dev server
npm run dev

# Terminal 2: Trigger scraper
curl "http://localhost:8793/__scheduled?cron=0+9+*+*+*"
```

### Production Deployment
```bash
# Deploy to Cloudflare Workers
npm run deploy

# Verify cron trigger (auto-runs daily at 9:00 AM UTC)
npx wrangler tail

# Check database
npx wrangler d1 execute cloud-instances-db --command="SELECT COUNT(*) FROM vps_benchmarks"
```

## Expected Results

### Before Upgrade
- 269 existing VPS benchmarks (manually seeded)
- Scraper runs but finds 0 new entries (JavaScript rendering issue)

### After Upgrade
- Should successfully extract benchmark data from rendered page
- Number of results depends on vpsbenchmarks.com page structure
- Logs will show:
  - Rendered HTML length
  - Number of benchmarks extracted
  - Number inserted/updated/skipped

## Next Steps

1. **Deploy and Monitor**
   - Deploy to production
   - Wait for first cron run (or trigger manually)
   - Check logs for any errors

2. **Analyze Results**
   - If 0 benchmarks found, inspect rendered HTML in logs
   - Adjust CSS selectors based on actual site structure
   - Update parsing patterns as needed

3. **Fine-tune Parsing**
   - VPSBenchmarks.com structure may vary
   - Current code includes multiple parsing strategies
   - May need adjustments based on actual HTML structure

## Rollback Plan

If issues occur:
1. Revert `src/scraper.ts` to previous version (direct HTML fetch)
2. Remove `[browser]` binding from wrangler.toml
3. Remove `BROWSER: Fetcher` from Env interface
4. Redeploy

## Technical Notes

### Why Browser Rendering API?
**Problem**: VPSBenchmarks.com is a React SPA
- Initial HTML contains minimal content
- Actual benchmark data is loaded via JavaScript after page load
- Traditional scraping (fetch + parse) gets empty results

**Solution**: Cloudflare Browser Rendering
- Executes JavaScript in a real browser environment
- Waits for network idle (all in-flight requests complete)
- Returns fully rendered HTML with data
- Built into the Cloudflare Workers platform

### Performance Optimization
- `rejectResourceTypes`: Skip images, fonts, media, and stylesheets → 40-60% faster loading
- `waitUntil: 'networkidle0'`: Wait for all network activity to settle
- 30-second timeout: Prevents hanging on slow pages
- Multiple parsing patterns: Increases success rate

### Data Quality
- Deduplication via unique constraint
- ON CONFLICT DO UPDATE: Keeps data fresh
- Performance per dollar auto-calculation
- Validation in parsing functions (skip invalid entries)

## Monitoring

### Key Metrics to Watch

1. **Scraper Success Rate**
   - Number of benchmarks found per run
   - Should be > 0 after upgrade

2. **Browser Rendering Usage**
   - Free tier: 10 minutes/day
   - Current usage: Check Cloudflare dashboard
   - Each run should use < 1 minute

3. **Database Growth**
   - Current: 269 records
   - Expected: Gradual increase from daily scrapes
   - Deduplication prevents excessive growth

4. **Error Rates**
   - Parse errors: Indicate structure changes
   - API errors: Indicate quota or connectivity issues
   - DB errors: Indicate schema mismatches

## Support

If issues persist:
1. Check Cloudflare Workers logs: `npx wrangler tail`
2. Verify Browser Rendering API status
3. Inspect actual vpsbenchmarks.com page structure
4. Update selectors and parsing logic accordingly
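
When updating the parsing logic, it can help to keep table-row extraction, validation, deduplication, and the performance-per-dollar calculation as small pure functions that can be exercised against a saved copy of the rendered HTML. The following is a minimal sketch, not the actual contents of `src/scraper.ts`: the `Benchmark` shape, the function names, and the assumed column order (provider, plan, score, monthly price, optional country) are all hypothetical and would need to be matched to the real page and schema.

```typescript
// Hypothetical record shape; the real vps_benchmarks columns may differ.
interface Benchmark {
  providerName: string;
  planName: string;
  countryCode: string | null;
  score: number;
  monthlyPriceUsd: number;
}

// Remove markup from a table cell and trim surrounding whitespace.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, "").trim();
}

// Parse "1,200" or "$4.50/mo" into a number; returns NaN if no digits remain.
function parseNumber(text: string): number {
  return parseFloat(text.replace(/[^0-9.]/g, ""));
}

// Table row extraction. ASSUMED column order:
// provider, plan, score, monthly price, optional country code.
function parseTableRows(html: string): Benchmark[] {
  const results: Benchmark[] = [];
  const rowPattern = /<tr[^>]*>([\s\S]*?)<\/tr>/gi;
  let rowMatch: RegExpExecArray | null;
  while ((rowMatch = rowPattern.exec(html)) !== null) {
    const cellPattern = /<td[^>]*>([\s\S]*?)<\/td>/gi;
    const cells: string[] = [];
    let cellMatch: RegExpExecArray | null;
    while ((cellMatch = cellPattern.exec(rowMatch[1])) !== null) {
      cells.push(stripTags(cellMatch[1]));
    }
    if (cells.length < 4) continue; // header rows or unrelated tables
    const [providerName, planName, scoreText, priceText, countryText] = cells;
    const score = parseNumber(scoreText);
    const monthlyPriceUsd = parseNumber(priceText);
    // Validation: skip entries with missing names or unusable numbers.
    if (!providerName || !planName) continue;
    if (!Number.isFinite(score) || !Number.isFinite(monthlyPriceUsd)) continue;
    if (monthlyPriceUsd <= 0) continue;
    results.push({
      providerName,
      planName,
      countryCode: countryText ? countryText.toUpperCase() : null,
      score,
      monthlyPriceUsd,
    });
  }
  return results;
}

// Mirrors the unique index: (provider_name, plan_name, COALESCE(country_code, '')).
function dedupeKey(b: Benchmark): string {
  return JSON.stringify([b.providerName, b.planName, b.countryCode ?? ""]);
}

// Later rows replace earlier ones with the same key,
// similar in spirit to ON CONFLICT DO UPDATE.
function dedupe(rows: Benchmark[]): Benchmark[] {
  const byKey = new Map<string, Benchmark>();
  for (const row of rows) byKey.set(dedupeKey(row), row);
  return Array.from(byKey.values());
}

// Performance per dollar: benchmark score divided by monthly price.
function performancePerDollar(b: Benchmark): number | null {
  return b.monthlyPriceUsd > 0 ? b.score / b.monthlyPriceUsd : null;
}
```

Calling `dedupe(parseTableRows(renderedHtml))` before the database insert keeps duplicate rows from the same scrape out of a single batch, while the unique index remains the source of truth across runs. Regex-based extraction is fragile against markup changes, so the other strategies listed under "New Approach" would still be needed as fallbacks.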