VPSBenchmarks Scraper Upgrade Summary
Changes Made
1. wrangler.toml
- Added: [browser] binding for Cloudflare Browser Rendering API
- This enables JavaScript execution for scraping React SPA sites
2. src/index.ts
- Updated: Env interface to include BROWSER: Fetcher
- No other changes needed; the scheduled handler already calls the scraper correctly
3. src/scraper.ts (Complete Rewrite)
Old Approach: Direct HTML fetch without JavaScript execution
- VPSBenchmarks.com is a React SPA → no data in initial HTML
- Previous scraper found 0 results due to client-side rendering
New Approach: Browser Rendering API with JavaScript execution
- Uses the /content endpoint to fetch fully rendered HTML
- Falls back to the /scrape endpoint for targeted element extraction
- Multiple parsing patterns for robustness:
- Table row extraction
- Embedded JSON data detection
- Benchmark card parsing
- Scraped element parsing
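The parsing strategies listed above can be chained so that the first one returning any results wins. The following is a minimal sketch of that idea, not the code in src/scraper.ts: the Benchmark shape, the function names, and the assumption that provider and plan sit in the first two table cells (or in `provider`/`plan` JSON fields) are all hypothetical.

```typescript
// Hypothetical record shape; the real scraper's fields may differ.
interface Benchmark {
  providerName: string;
  planName: string;
}

type Parser = (html: string) => Benchmark[];

// Remove nested tags (links, spans) from a cell and trim whitespace.
function stripTags(fragment: string): string {
  return fragment.replace(/<[^>]+>/g, "").trim();
}

// Strategy 1: read provider and plan from the first two <td> cells of each row.
// Header rows built from <th> cells produce no <td> matches and are skipped.
function parseTableRows(html: string): Benchmark[] {
  const results: Benchmark[] = [];
  const rowPattern = /<tr[^>]*>([\s\S]*?)<\/tr>/gi;
  let row: RegExpExecArray | null;
  while ((row = rowPattern.exec(html)) !== null) {
    const cells: string[] = [];
    const cellPattern = /<td[^>]*>([\s\S]*?)<\/td>/gi;
    let cell: RegExpExecArray | null;
    while ((cell = cellPattern.exec(row[1])) !== null) {
      cells.push(stripTags(cell[1]));
    }
    if (cells.length >= 2 && cells[0] !== "" && cells[1] !== "") {
      results.push({ providerName: cells[0], planName: cells[1] });
    }
  }
  return results;
}

// Strategy 2: look for a JSON array embedded in a <script type="application/json"> tag.
function parseEmbeddedJson(html: string): Benchmark[] {
  const match = /<script[^>]*type="application\/json"[^>]*>([\s\S]*?)<\/script>/i.exec(html);
  if (match === null) return [];
  try {
    const data: unknown = JSON.parse(match[1]);
    if (!Array.isArray(data)) return [];
    const results: Benchmark[] = [];
    for (const item of data) {
      if (item && typeof item.provider === "string" && typeof item.plan === "string") {
        results.push({ providerName: item.provider, planName: item.plan });
      }
    }
    return results;
  } catch {
    return []; // Malformed JSON: let the next strategy try.
  }
}

// Try each strategy in order and return the first non-empty result.
function parseBenchmarks(
  html: string,
  parsers: Parser[] = [parseTableRows, parseEmbeddedJson]
): Benchmark[] {
  for (const parse of parsers) {
    const found = parse(html);
    if (found.length > 0) return found;
  }
  return [];
}
```

Regex-based extraction is fragile against markup changes, which is one reason to keep several independent strategies behind a single entry point.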
Key Features:
- Waits for networkidle0 (no open network connections for at least 500 ms, used as a signal that client-side rendering has finished)
- Rejects unnecessary resources (images, fonts, media, stylesheets) for faster loading
- 30-second timeout per request
- Comprehensive error handling
- Deduplication using unique constraint on (provider_name, plan_name, country_code)
4. vps-benchmark-schema.sql
- Added: Unique index for ON CONFLICT deduplication
CREATE UNIQUE INDEX idx_vps_benchmarks_unique ON vps_benchmarks(provider_name, plan_name, COALESCE(country_code, ''))
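SQLite treats NULL values as distinct in a unique index, so without COALESCE two rows with the same provider and plan and a NULL country_code would not conflict. The sketch below mirrors that key in TypeScript to collapse duplicates within one scrape before they reach the database. The BenchmarkRow type and both function names are hypothetical; only the three column names come from the index above.

```typescript
// Only the three key columns are modelled; the real table has more.
interface BenchmarkRow {
  provider_name: string;
  plan_name: string;
  country_code: string | null;
}

// Build the same key the unique index uses: NULL country_code becomes ''.
// JSON.stringify avoids collisions that a plain delimiter join could cause.
function dedupKey(row: BenchmarkRow): string {
  return JSON.stringify([row.provider_name, row.plan_name, row.country_code ?? ""]);
}

// Keep one row per key; a later row replaces an earlier one,
// similar in spirit to ON CONFLICT DO UPDATE.
function dedupe(rows: BenchmarkRow[]): BenchmarkRow[] {
  const byKey = new Map<string, BenchmarkRow>();
  for (const row of rows) {
    byKey.set(dedupKey(row), row);
  }
  return Array.from(byKey.values());
}
```

Deduplicating in memory first also reduces the number of statements sent to the database per run.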
5. Documentation
- Created: test-scraper.md - Testing and troubleshooting guide
- Created: UPGRADE_SUMMARY.md - This file
Browser Rendering API Details
Endpoints Used
- POST /content (Primary)
  browser.fetch('https://browser-rendering.cloudflare.com/content', {
    method: 'POST',
    body: JSON.stringify({
      url: 'https://www.vpsbenchmarks.com/',
      waitUntil: 'networkidle0',
      rejectResourceTypes: ['image', 'font', 'media', 'stylesheet'],
      timeout: 30000
    })
  })
- POST /scrape (Fallback)
  browser.fetch('https://browser-rendering.cloudflare.com/scrape', {
    method: 'POST',
    body: JSON.stringify({
      url: 'https://www.vpsbenchmarks.com/',
      elements: [
        { selector: 'table tbody tr' },
        { selector: '.benchmark-card' }
      ]
    })
  })
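The primary/fallback relationship between the two endpoints can be expressed independently of how each request is made. The sketch below assumes each endpoint call is wrapped in a function that returns parsed items; `scrapeWithFallback` and the `Source` type are hypothetical names, and the rule "fall back when the primary throws or returns nothing" is an assumption about the intended behaviour.

```typescript
// A source is any async function that yields parsed items,
// e.g. one wrapping the /content call and one wrapping /scrape.
type Source<T> = () => Promise<T[]>;

interface ScrapeOutcome<T> {
  items: T[];
  source: "primary" | "fallback";
}

// Use the primary source; if it throws or returns nothing, use the fallback.
// Errors from the fallback itself propagate to the caller.
async function scrapeWithFallback<T>(
  primary: Source<T>,
  fallback: Source<T>
): Promise<ScrapeOutcome<T>> {
  try {
    const items = await primary();
    if (items.length > 0) {
      return { items, source: "primary" };
    }
  } catch (err) {
    console.warn("primary source failed, trying fallback:", err);
  }
  return { items: await fallback(), source: "fallback" };
}
```

Returning which source produced the data makes it easy to log how often the fallback is actually needed.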
Free Tier Limits
- 10 minutes/day
- More than sufficient for daily scraping
- Each scrape should complete in < 1 minute
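As a rough check on these numbers: with a 30-second timeout per request, a run that needs both the primary and the fallback request takes at most about 60 seconds, which is one tenth of the daily allowance. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and it assumes usage is counted simply as seconds of browser time.

```typescript
// 10 minutes per day, from the free tier limit stated above.
const FREE_TIER_SECONDS_PER_DAY = 10 * 60;

// How many complete runs fit in the daily budget at a given cost per run.
function maxRunsPerDay(
  secondsPerRun: number,
  budgetSeconds: number = FREE_TIER_SECONDS_PER_DAY
): number {
  if (secondsPerRun <= 0) {
    throw new RangeError("secondsPerRun must be positive");
  }
  return Math.floor(budgetSeconds / secondsPerRun);
}

// Worst case for one run: two requests, each hitting the 30-second timeout.
const worstCaseSecondsPerRun = 2 * 30;
console.log(maxRunsPerDay(worstCaseSecondsPerRun));
```

Even in the worst case there is headroom for several manual re-runs per day on top of the scheduled one.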
Testing
Local Testing
# Terminal 1: Start dev server
npm run dev
# Terminal 2: Trigger scraper
curl "http://localhost:8793/__scheduled?cron=0+9+*+*+*"
Production Deployment
# Deploy to Cloudflare Workers
npm run deploy
# Verify cron trigger (auto-runs daily at 9:00 AM UTC)
npx wrangler tail
# Check database
npx wrangler d1 execute cloud-instances-db --command="SELECT COUNT(*) FROM vps_benchmarks"
Expected Results
Before Upgrade
- 269 existing VPS benchmarks (manually seeded)
- Scraper runs but finds 0 new entries (JavaScript rendering issue)
After Upgrade
- Should successfully extract benchmark data from rendered page
- Number of results depends on vpsbenchmarks.com page structure
- Logs will show:
- Rendered HTML length
- Number of benchmarks extracted
- Number inserted/updated/skipped
Next Steps
- Deploy and Monitor
  - Deploy to production
  - Wait for first cron run (or trigger manually)
  - Check logs for any errors
- Analyze Results
  - If 0 benchmarks found, inspect rendered HTML in logs
  - Adjust CSS selectors based on actual site structure
  - Update parsing patterns as needed
- Fine-tune Parsing
  - VPSBenchmarks.com structure may vary
  - Current code includes multiple parsing strategies
  - May need adjustments based on actual HTML structure
Rollback Plan
If issues occur:
- Revert src/scraper.ts to previous version (direct HTML fetch)
- Remove [browser] binding from wrangler.toml
- Remove BROWSER: Fetcher from Env interface
- Redeploy
Technical Notes
Why Browser Rendering API?
Problem: VPSBenchmarks.com uses React SPA
- Initial HTML contains minimal content
- Actual benchmark data loaded via JavaScript after page load
- Traditional scraping (fetch + parse) gets empty results
Solution: Cloudflare Browser Rendering
- Executes JavaScript in real browser environment
- Waits for network idle (all AJAX complete)
- Returns fully rendered HTML with data
- Built into Cloudflare Workers platform
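The difference between the two approaches can be made visible with a cheap check on the returned HTML: an unrendered SPA shell contains almost no visible text once scripts and tags are removed. The sketch below illustrates this; the function names and the 200-character threshold are arbitrary assumptions, not values taken from the scraper.

```typescript
// Count visible characters after removing scripts, styles, and tags.
function visibleTextLength(html: string): number {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim().length;
}

// Heuristic: very little visible text suggests the page was not rendered.
// The threshold is a guess and should be tuned against real responses.
function looksLikeUnrenderedShell(html: string, minTextLength: number = 200): boolean {
  return visibleTextLength(html) < minTextLength;
}
```

Logging this flag next to the rendered HTML length would help distinguish "rendering failed" from "rendering worked but the selectors no longer match".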
Performance Optimization
- rejectResourceTypes: Skip images/fonts → 40-60% faster loading
- waitUntil: 'networkidle0': Wait for all network activity to settle
- 30-second timeout: Prevents hanging on slow pages
- Multiple parsing patterns: Increases success rate
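Keeping these settings in one place makes them easier to tune. The sketch below collects them into a default options object with per-call overrides. The field names are copied from the /content request shown in this document and are not verified against Cloudflare's API reference; `RenderOptions` and `buildRenderOptions` are hypothetical names.

```typescript
// Field names follow the request body used in this document;
// check them against Cloudflare's current documentation before relying on them.
interface RenderOptions {
  url: string;
  waitUntil: "load" | "domcontentloaded" | "networkidle0" | "networkidle2";
  rejectResourceTypes: string[];
  timeout: number; // milliseconds
}

type RenderOverrides = Partial<Omit<RenderOptions, "url">>;

const DEFAULT_RENDER_OPTIONS: Omit<RenderOptions, "url"> = {
  waitUntil: "networkidle0",
  rejectResourceTypes: ["image", "font", "media", "stylesheet"],
  timeout: 30000,
};

// Merge defaults with overrides; the resource list is copied so callers
// cannot mutate the shared default array by accident.
function buildRenderOptions(url: string, overrides: RenderOverrides = {}): RenderOptions {
  const merged = { ...DEFAULT_RENDER_OPTIONS, ...overrides, url };
  return { ...merged, rejectResourceTypes: merged.rejectResourceTypes.slice() };
}

// Example: a shorter timeout for a quick health check.
const quick = buildRenderOptions("https://www.vpsbenchmarks.com/", { timeout: 10000 });
console.log(JSON.stringify(quick));
```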
Data Quality
- Deduplication via unique constraint
- ON CONFLICT DO UPDATE: Keeps data fresh
- Performance per dollar auto-calculation
- Validation in parsing functions (skip invalid entries)
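The validation and performance-per-dollar steps might look like the sketch below. It assumes performance per dollar means benchmark score divided by monthly price in USD, rounded to two decimals; that definition, the ParsedBenchmark fields, and the function names are assumptions rather than the scraper's actual code.

```typescript
// Hypothetical parsed record; field names are illustrative only.
interface ParsedBenchmark {
  providerName: string;
  planName: string;
  score: number;
  monthlyPriceUsd: number;
}

// Reject entries with missing names or non-positive, non-finite numbers.
function isValidBenchmark(b: ParsedBenchmark): boolean {
  return (
    b.providerName.trim() !== "" &&
    b.planName.trim() !== "" &&
    isFinite(b.score) && b.score > 0 &&
    isFinite(b.monthlyPriceUsd) && b.monthlyPriceUsd > 0
  );
}

// Score per dollar, rounded to two decimals; null for invalid entries
// so callers can skip them instead of storing Infinity or NaN.
function performancePerDollar(b: ParsedBenchmark): number | null {
  if (!isValidBenchmark(b)) return null;
  return Math.round((b.score / b.monthlyPriceUsd) * 100) / 100;
}
```

Returning null for invalid input keeps the division-by-zero case out of the database, which matches the "skip invalid entries" behaviour described above.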
Monitoring
Key Metrics to Watch
- Scraper Success Rate
  - Number of benchmarks found per run
  - Should be > 0 after upgrade
- Browser Rendering Usage
  - Free tier: 10 minutes/day
  - Current usage: Check Cloudflare dashboard
  - Each run should use < 1 minute
- Database Growth
  - Current: 269 records
  - Expected: Gradual increase from daily scrapes
  - Deduplication prevents excessive growth
- Error Rates
  - Parse errors: Indicate structure changes
  - API errors: Indicate quota or connectivity issues
  - DB errors: Indicate schema mismatches
Support
If issues persist:
- Check Cloudflare Workers logs: npx wrangler tail
- Verify Browser Rendering API status
- Inspect actual vpsbenchmarks.com page structure
- Update selectors and parsing logic accordingly