cloud-orchestrator/UPGRADE_SUMMARY.md
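kappa 4cb9da06dc feat: add bandwidth estimation and DAU display
- Automatically estimate monthly bandwidth from concurrent users
- Show estimated DAU (daily active users), computed as concurrent users × 10-14
- Bandwidth-based automatic selection between Linode and Vultr
- Include bandwidth cost in the cost analysis
- Show Seoul/Tokyo/Osaka/Singapore by default when no region is selected
- Show servers separately per region (GROUP BY instance + region)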

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 09:40:36 +09:00


VPSBenchmarks Scraper Upgrade Summary

Changes Made

1. wrangler.toml

  • Added: [browser] binding for Cloudflare Browser Rendering API
  • This enables JavaScript execution for scraping React SPA sites

2. src/index.ts

  • Updated: Env interface to include BROWSER: Fetcher
  • No other changes needed; the scheduled handler already calls the scraper correctly
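
As a rough sketch of the wiring described above: the scheduled handler receives `env` and hands it to the scraper. The names `makeWorker` and `Scraper`, and the local `FetcherLike` / `ExecutionContextLike` types, are assumptions made so the snippet is self-contained. In the real project `BROWSER` would use the `Fetcher` type from the Workers type definitions, and the actual scraper signature in src/scraper.ts is not shown in this summary.

```typescript
// Local stand-ins for Workers runtime types so the sketch is self-contained.
interface FetcherLike {
  fetch(url: string, init?: { method?: string; body?: string }): Promise<unknown>;
}
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

interface Env {
  BROWSER: FetcherLike; // the binding added in wrangler.toml
}

// Hypothetical signature for the scraper entry point in src/scraper.ts.
type Scraper = (env: Env) => Promise<number>;

// Builds a Worker-style handler object whose scheduled() passes env to the scraper.
function makeWorker(scrape: Scraper) {
  return {
    scheduled(_event: unknown, env: Env, ctx: ExecutionContextLike): void {
      // waitUntil keeps the invocation alive until the scrape settles.
      ctx.waitUntil(scrape(env));
    },
  };
}
```

Because the handler only forwards `env`, adding the `BROWSER` binding to `Env` is enough for the scraper to reach it, which matches the note that no other changes were needed in this file.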

3. src/scraper.ts (Complete Rewrite)

Old Approach: Direct HTML fetch without JavaScript execution

  • VPSBenchmarks.com is a React SPA → no data in initial HTML
  • Previous scraper found 0 results due to client-side rendering

New Approach: Browser Rendering API with JavaScript execution

  • Uses the /content endpoint to fetch fully rendered HTML
  • Falls back to the /scrape endpoint for targeted element extraction
  • Multiple parsing patterns for robustness:
    • Table row extraction
    • Embedded JSON data detection
    • Benchmark card parsing
    • Scraped element parsing
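
The idea of trying several parsing patterns in turn can be sketched as follows. This is a minimal illustration, not the code in src/scraper.ts: the `Benchmark` fields, the assumed column order (provider, plan, score), and the `application/json` script tag are all hypothetical, since the summary does not describe the actual page structure.

```typescript
interface Benchmark {
  provider: string;
  plan: string;
  score: number;
}

// Strip tags and collapse whitespace inside a table cell.
function cellText(html: string): string {
  return html.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Pattern 1: table rows. Assumes columns are provider, plan, score (hypothetical layout).
function parseTableRows(html: string): Benchmark[] {
  const out: Benchmark[] = [];
  const rowRe = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
  let row: RegExpExecArray | null;
  while ((row = rowRe.exec(html)) !== null) {
    const cells: string[] = [];
    const cellRe = /<td\b[^>]*>([\s\S]*?)<\/td>/gi;
    let cell: RegExpExecArray | null;
    while ((cell = cellRe.exec(row[1] || '')) !== null) {
      cells.push(cellText(cell[1] || ''));
    }
    const [provider, plan, rawScore] = cells;
    const score = Number(rawScore);
    // Skip rows with missing names or a non-numeric score.
    if (provider && plan && rawScore && isFinite(score)) {
      out.push({ provider, plan, score });
    }
  }
  return out;
}

// Pattern 2: a JSON array embedded in a <script type="application/json"> tag (hypothetical).
function parseEmbeddedJson(html: string): Benchmark[] {
  const m = /<script[^>]*type="application\/json"[^>]*>([\s\S]*?)<\/script>/i.exec(html);
  if (!m) return [];
  let data: unknown;
  try {
    data = JSON.parse(m[1] || '');
  } catch {
    return [];
  }
  if (!Array.isArray(data)) return [];
  const out: Benchmark[] = [];
  for (const item of data) {
    if (item && typeof item.provider === 'string' && typeof item.plan === 'string' && typeof item.score === 'number') {
      out.push({ provider: item.provider, plan: item.plan, score: item.score });
    }
  }
  return out;
}

// Try each strategy in order and return the first non-empty result.
function extractBenchmarks(html: string): Benchmark[] {
  const strategies = [parseTableRows, parseEmbeddedJson];
  for (const strategy of strategies) {
    const found = strategy(html);
    if (found.length > 0) return found;
  }
  return [];
}
```

Returning the first non-empty result keeps the strategies independent, so a new pattern can be appended to the list without touching the existing ones.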

Key Features:

  • Waits for networkidle0 (no open network connections for at least 500 ms), used as a signal that the SPA has finished loading its data
  • Rejects unnecessary resource types (images, fonts, media, stylesheets) for faster loading
  • 30-second timeout per request
  • Comprehensive error handling
  • Deduplication using a unique index on (provider_name, plan_name, COALESCE(country_code, ''))

4. vps-benchmark-schema.sql

  • Added: Unique index for ON CONFLICT deduplication
  • CREATE UNIQUE INDEX idx_vps_benchmarks_unique ON vps_benchmarks(provider_name, plan_name, COALESCE(country_code, ''))
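
To illustrate how an upsert can target this index, here is a hedged sketch. The conflict target repeats the index expression, including `COALESCE(country_code, '')`, so SQLite can match it to `idx_vps_benchmarks_unique`. The `score` column, the `BenchmarkRow` shape, and the `upsertBenchmark` helper are assumptions for illustration. `DatabaseLike` is a local stand-in that mirrors the `prepare().bind().run()` call shape of a D1 database.

```typescript
// Local stand-ins mirroring the prepare/bind/run call shape of a D1 database.
interface StatementLike {
  bind(...values: unknown[]): StatementLike;
  run(): Promise<unknown>;
}
interface DatabaseLike {
  prepare(sql: string): StatementLike;
}

// Hypothetical row shape; only the three key columns are confirmed by the schema above.
interface BenchmarkRow {
  providerName: string;
  planName: string;
  countryCode: string | null;
  score: number; // assumed column name
}

// The conflict target must repeat the index expression exactly.
const UPSERT_SQL = `
  INSERT INTO vps_benchmarks (provider_name, plan_name, country_code, score)
  VALUES (?, ?, ?, ?)
  ON CONFLICT (provider_name, plan_name, COALESCE(country_code, ''))
  DO UPDATE SET score = excluded.score
`;

async function upsertBenchmark(db: DatabaseLike, row: BenchmarkRow): Promise<void> {
  await db
    .prepare(UPSERT_SQL)
    .bind(row.providerName, row.planName, row.countryCode, row.score)
    .run();
}
```

Wrapping `country_code` in `COALESCE` matters because SQLite treats NULLs as distinct in unique indexes. Without it, rows with no country code would never conflict and would be inserted again on every run.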

5. Documentation

  • Created: test-scraper.md - Testing and troubleshooting guide
  • Created: UPGRADE_SUMMARY.md - This file

Browser Rendering API Details

Endpoints Used

  1. POST /content (Primary)

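    // `browser` is assumed to be the BROWSER binding from Env (env.BROWSER)
    const response = await browser.fetch('https://browser-rendering.cloudflare.com/content', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        url: 'https://www.vpsbenchmarks.com/',
        waitUntil: 'networkidle0', // wait until the network is idle
        rejectResourceTypes: ['image', 'font', 'media', 'stylesheet'], // skip non-essential assets
        timeout: 30000 // 30 seconds, in milliseconds
      })
    });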
    
  2. POST /scrape (Fallback)

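    // Fallback: extract only the elements matching the given CSS selectors
    const response = await browser.fetch('https://browser-rendering.cloudflare.com/scrape', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        url: 'https://www.vpsbenchmarks.com/',
        elements: [
          { selector: 'table tbody tr' },  // table-based listings
          { selector: '.benchmark-card' }  // card-based listings
        ]
      })
    });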
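
The primary/fallback relationship between the two endpoints can be sketched as below. The request bodies are taken from the calls above. The fallback conditions (a non-OK status, an empty body, or a thrown error), the `renderPage` and `post` helpers, and the local `FetcherLike` / `ResponseLike` types are assumptions, since this summary does not state exactly when the scraper switches to /scrape or what the response body format is.

```typescript
// Local stand-ins for the Workers Fetcher and Response types.
interface ResponseLike {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}
interface FetcherLike {
  fetch(
    url: string,
    init: { method: string; headers: Record<string, string>; body: string }
  ): Promise<ResponseLike>;
}

interface RenderResult {
  source: 'content' | 'scrape';
  body: string; // raw response body; its format is not specified in this summary
}

const BASE_URL = 'https://browser-rendering.cloudflare.com';

// POST a JSON payload; return the body text, or null on any failure (assumed policy).
async function post(browser: FetcherLike, path: string, payload: unknown): Promise<string | null> {
  try {
    const res = await browser.fetch(BASE_URL + path, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
    if (!res.ok) return null;
    const text = await res.text();
    return text.length > 0 ? text : null;
  } catch {
    return null;
  }
}

// Try /content first; fall back to /scrape only if /content yields nothing.
async function renderPage(browser: FetcherLike, url: string): Promise<RenderResult | null> {
  const content = await post(browser, '/content', {
    url,
    waitUntil: 'networkidle0',
    rejectResourceTypes: ['image', 'font', 'media', 'stylesheet'],
    timeout: 30000,
  });
  if (content !== null) return { source: 'content', body: content };

  const scraped = await post(browser, '/scrape', {
    url,
    elements: [{ selector: 'table tbody tr' }, { selector: '.benchmark-card' }],
  });
  return scraped !== null ? { source: 'scrape', body: scraped } : null;
}
```

Tagging the result with its `source` lets the caller pick the matching parser, since the two endpoints presumably return differently shaped bodies.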
    

Free Tier Limits

  • 10 minutes/day
  • More than sufficient for daily scraping
  • Each scrape should complete in < 1 minute
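
The headroom claim can be checked with simple arithmetic. This sketch assumes one run per day and takes the stated upper bound of one minute per run; `budgetShare` is a hypothetical helper name.

```typescript
// Free tier allowance stated above: 10 minutes of browser time per day.
const FREE_TIER_SECONDS_PER_DAY = 10 * 60;

// Fraction of the daily allowance consumed by a given schedule.
function budgetShare(runsPerDay: number, secondsPerRun: number): number {
  return (runsPerDay * secondsPerRun) / FREE_TIER_SECONDS_PER_DAY;
}

// Maximum number of runs per day that still fit in the allowance.
function maxRunsPerDay(secondsPerRun: number): number {
  return Math.floor(FREE_TIER_SECONDS_PER_DAY / secondsPerRun);
}
```

With a single daily run capped at 60 seconds, `budgetShare(1, 60)` comes to one tenth of the allowance, and `maxRunsPerDay(60)` shows that up to ten such runs would still fit.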

Testing

Local Testing

# Terminal 1: Start dev server
npm run dev

# Terminal 2: Trigger scraper
curl "http://localhost:8793/__scheduled?cron=0+9+*+*+*"

Production Deployment

# Deploy to Cloudflare Workers
npm run deploy

# Stream live logs to confirm the cron trigger fires (runs daily at 09:00 UTC)
npx wrangler tail

# Check database
npx wrangler d1 execute cloud-instances-db --command="SELECT COUNT(*) FROM vps_benchmarks"

Expected Results

Before Upgrade

  • 269 existing VPS benchmarks (manually seeded)
  • Scraper runs but finds 0 new entries, because the benchmark data is rendered client-side

After Upgrade

  • Should successfully extract benchmark data from rendered page
  • Number of results depends on vpsbenchmarks.com page structure
  • Logs will show:
    • Rendered HTML length
    • Number of benchmarks extracted
    • Number inserted/updated/skipped

Next Steps

  1. Deploy and Monitor

    • Deploy to production
    • Wait for first cron run (or trigger manually)
    • Check logs for any errors
  2. Analyze Results

    • If 0 benchmarks found, inspect rendered HTML in logs
    • Adjust CSS selectors based on actual site structure
    • Update parsing patterns as needed
  3. Fine-tune Parsing

    • VPSBenchmarks.com structure may vary
    • Current code includes multiple parsing strategies
    • May need adjustments based on actual HTML structure
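
When a run finds 0 benchmarks, a quick structural count of the rendered HTML can show which parsing pattern has something to work with before any selectors are changed. The sketch below is a hypothetical diagnostic helper, not part of the scraper described here. It counts table rows, elements carrying the `benchmark-card` class, and embedded JSON script tags.

```typescript
// Count non-overlapping matches of a pattern, case-insensitively.
function countMatches(html: string, pattern: RegExp): number {
  const re = new RegExp(pattern.source, 'gi');
  let count = 0;
  let match: RegExpExecArray | null;
  while ((match = re.exec(html)) !== null) {
    // Guard against zero-length matches looping forever.
    if (match[0] === '') re.lastIndex++;
    count++;
  }
  return count;
}

interface HtmlDiagnostics {
  htmlLength: number;
  tableRows: number;
  benchmarkCards: number;
  jsonScripts: number;
}

// Summarize what the rendered HTML contains for each parsing strategy.
function diagnose(html: string): HtmlDiagnostics {
  return {
    htmlLength: html.length,
    tableRows: countMatches(html, /<tr\b/),
    // Matches benchmark-card as a whole class token, not as a prefix of another class.
    benchmarkCards: countMatches(html, /class="(?:[^"]*\s)?benchmark-card(?:\s[^"]*)?"/),
    jsonScripts: countMatches(html, /<script[^>]*type="application\/json"/),
  };
}
```

A large `htmlLength` with all three counts at zero would suggest the page rendered but uses markup the current patterns do not cover. A very small `htmlLength` points instead at the rendering step.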

Rollback Plan

If issues occur:

  1. Revert src/scraper.ts to previous version (direct HTML fetch)
  2. Remove [browser] binding from wrangler.toml
  3. Remove BROWSER: Fetcher from Env interface
  4. Redeploy

Technical Notes

Why Browser Rendering API?

Problem: VPSBenchmarks.com is a React SPA

  • Initial HTML contains minimal content
  • Actual benchmark data loaded via JavaScript after page load
  • Traditional scraping (fetch + parse) gets empty results

Solution: Cloudflare Browser Rendering

  • Executes JavaScript in a real browser environment
  • Waits for the network to go idle (no in-flight requests), so AJAX-loaded data is present
  • Returns fully rendered HTML with data
  • Built into Cloudflare Workers platform

Performance Optimization

  • rejectResourceTypes: Skips images, fonts, media, and stylesheets → 40-60% faster loading
  • waitUntil: 'networkidle0': Wait for all network activity to settle
  • 30-second timeout: Prevents hanging on slow pages
  • Multiple parsing patterns: Increases success rate
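
The 30-second timeout above is sent in the request body, so it is enforced by the rendering service. As a complementary technique, the caller can also guard against a hung request on its own side. The sketch below shows such a client-side guard using `Promise.race`. It is an assumption for illustration and is not confirmed to be part of the scraper. `withTimeout` is a hypothetical helper name.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(label + ' timed out after ' + ms + ' ms'));
    }, ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([work, timeout]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (err) => {
      clearTimeout(timer);
      throw err;
    }
  );
}
```

Note that this only stops the caller from waiting; it does not cancel the underlying request.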

Data Quality

  • Deduplication via the unique index idx_vps_benchmarks_unique
  • ON CONFLICT DO UPDATE: Keeps data fresh
  • Performance per dollar auto-calculation
  • Validation in parsing functions (skip invalid entries)
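
A minimal sketch of validation combined with a performance-per-dollar calculation follows. The summary does not give the formula or field names, so this assumes performance per dollar is the benchmark score divided by the monthly price in USD, rounded to two decimals. `RawEntry`, `ValidEntry`, and `toValidEntry` are hypothetical names.

```typescript
// Hypothetical shape of a parsed entry before validation.
interface RawEntry {
  providerName: string;
  planName: string;
  score: number;
  monthlyPriceUsd: number;
}

interface ValidEntry extends RawEntry {
  perfPerDollar: number;
}

// Return a validated entry with perfPerDollar filled in, or null to skip it.
function toValidEntry(raw: RawEntry): ValidEntry | null {
  const namesOk = raw.providerName.trim() !== '' && raw.planName.trim() !== '';
  const numbersOk =
    isFinite(raw.score) && raw.score > 0 &&
    isFinite(raw.monthlyPriceUsd) && raw.monthlyPriceUsd > 0;
  if (!namesOk || !numbersOk) return null;

  // Assumed formula: score per USD per month, rounded to two decimals.
  const perfPerDollar = Math.round((raw.score / raw.monthlyPriceUsd) * 100) / 100;
  return {
    providerName: raw.providerName.trim(),
    planName: raw.planName.trim(),
    score: raw.score,
    monthlyPriceUsd: raw.monthlyPriceUsd,
    perfPerDollar,
  };
}
```

Rejecting a zero or negative price before dividing keeps `Infinity` and `NaN` out of the database.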

Monitoring

Key Metrics to Watch

  1. Scraper Success Rate

    • Number of benchmarks found per run
    • Should be > 0 after upgrade
  2. Browser Rendering Usage

    • Free tier: 10 minutes/day
    • Current usage: Check Cloudflare dashboard
    • Each run should use < 1 minute
  3. Database Growth

    • Current: 269 records
    • Expected: Gradual increase from daily scrapes
    • Deduplication prevents excessive growth
  4. Error Rates

    • Parse errors: Indicates structure changes
    • API errors: Indicates quota or connectivity issues
    • DB errors: Indicates schema mismatches
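
One way to make these error rates observable is to tag each failure with its category and tally them per run. The sketch below is a hypothetical illustration, and the report format is invented for the example. The likely causes are the ones listed above.

```typescript
type ErrorKind = 'parse' | 'api' | 'db';

interface ScrapeFailure {
  kind: ErrorKind;
  message: string;
}

// Likely cause per category, as listed in this section.
const LIKELY_CAUSE: Record<ErrorKind, string> = {
  parse: 'site structure changed',
  api: 'quota or connectivity issue',
  db: 'schema mismatch',
};

// Count failures per category.
function tallyFailures(failures: ScrapeFailure[]): Record<ErrorKind, number> {
  const counts: Record<ErrorKind, number> = { parse: 0, api: 0, db: 0 };
  for (const failure of failures) {
    counts[failure.kind] += 1;
  }
  return counts;
}

// One report line per category that actually occurred.
function reportFailures(failures: ScrapeFailure[]): string[] {
  const counts = tallyFailures(failures);
  const kinds: ErrorKind[] = ['parse', 'api', 'db'];
  return kinds
    .filter((kind) => counts[kind] > 0)
    .map((kind) => kind + ' errors: ' + counts[kind] + ' (' + LIKELY_CAUSE[kind] + ')');
}
```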

Support

If issues persist:

  1. Check Cloudflare Workers logs: npx wrangler tail
  2. Verify Browser Rendering API status
  3. Inspect actual vpsbenchmarks.com page structure
  4. Update selectors and parsing logic accordingly