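kappa 4cb9da06dc feat: add bandwidth estimation and DAU display

- Automatically estimate monthly bandwidth from concurrent users
- Display estimated DAU (daily active users), computed as concurrent users × 10-14
- Automatic Linode/Vultr selection logic based on bandwidth
- Include bandwidth cost in the cost analysis
- Show Seoul/Tokyo/Osaka/Singapore by default when no region is selected
- Show servers separately per region (GROUP BY instance + region)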

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-25 09:40:36 +09:00


VPSBenchmarks Scraper Upgrade Summary

Changes Made

1. wrangler.toml

Added: [browser] binding for Cloudflare Browser Rendering API
This enables JavaScript execution for scraping React SPA sites

2. src/index.ts

Updated: Env interface to include BROWSER: Fetcher
No other changes needed - scheduled handler already calls scraper correctly

3. src/scraper.ts (Complete Rewrite)

Old Approach: Direct HTML fetch without JavaScript execution

VPSBenchmarks.com is a React SPA → no data in initial HTML
Previous scraper found 0 results due to client-side rendering

New Approach: Browser Rendering API with JavaScript execution

Uses /content endpoint to fetch fully rendered HTML
Fallback to /scrape endpoint for targeted element extraction
Multiple parsing patterns for robustness:
- Table row extraction
- Embedded JSON data detection
- Benchmark card parsing
- Scraped element parsing
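
The parsing strategies listed above can be tried in order until one of them yields results. The following is a minimal sketch of that idea, not the actual src/scraper.ts: the ParsedBenchmark shape, the three-column table layout, the application/json script tag, and the provider/plan/score JSON keys are all assumptions made for illustration.

```typescript
// Hypothetical record shape; the real scraper may track more fields.
interface ParsedBenchmark {
  providerName: string;
  planName: string;
  score: number;
}

function stripTags(fragment: string): string {
  return fragment.replace(/<[^>]+>/g, "").trim();
}

// Strategy 1: read the first three <td> cells of every table row.
function parseTableRows(html: string): ParsedBenchmark[] {
  const results: ParsedBenchmark[] = [];
  const rowPattern = /<tr[^>]*>([\s\S]*?)<\/tr>/gi;
  let row: RegExpExecArray | null;
  while ((row = rowPattern.exec(html)) !== null) {
    const cells: string[] = [];
    const cellPattern = /<td[^>]*>([\s\S]*?)<\/td>/gi;
    let cell: RegExpExecArray | null;
    while ((cell = cellPattern.exec(row[1])) !== null) {
      cells.push(stripTags(cell[1]));
    }
    const score = Number(cells[2]);
    // Skip header rows and rows with missing or non-numeric values.
    if (cells.length < 3 || !cells[0] || !cells[1] || !cells[2] || !isFinite(score)) continue;
    results.push({ providerName: cells[0], planName: cells[1], score });
  }
  return results;
}

// Strategy 2: look for a JSON array embedded in a <script type="application/json"> tag.
function parseEmbeddedJson(html: string): ParsedBenchmark[] {
  const match = /<script[^>]*type="application\/json"[^>]*>([\s\S]*?)<\/script>/i.exec(html);
  if (match === null) return [];
  let data: unknown;
  try {
    data = JSON.parse(match[1]);
  } catch {
    return []; // not valid JSON, let the next strategy try
  }
  if (!Array.isArray(data)) return [];
  const results: ParsedBenchmark[] = [];
  for (const item of data) {
    if (item && typeof item.provider === "string" && typeof item.plan === "string" && typeof item.score === "number") {
      results.push({ providerName: item.provider, planName: item.plan, score: item.score });
    }
  }
  return results;
}

// Try each strategy in order and return the first non-empty result.
function parseBenchmarks(html: string): ParsedBenchmark[] {
  const strategies = [parseTableRows, parseEmbeddedJson];
  for (const strategy of strategies) {
    const found = strategy(html);
    if (found.length > 0) return found;
  }
  return [];
}
```

Returning an empty array instead of throwing keeps a failed strategy from aborting the run, so the caller can log "0 benchmarks found" and move on.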

Key Features:

Waits for networkidle0 (no open network connections for at least 500 ms, so client-side data fetches have settled)
Rejects unnecessary resources (images, fonts, media, stylesheets) for faster loading
30-second timeout per request
Comprehensive error handling
Deduplication using unique constraint on (provider_name, plan_name, country_code)

4. vps-benchmark-schema.sql

Added: Unique index for ON CONFLICT deduplication
CREATE UNIQUE INDEX idx_vps_benchmarks_unique ON vps_benchmarks(provider_name, plan_name, COALESCE(country_code, ''))
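
As an illustration of how an upsert could target this index, here is a hedged sketch that builds the statement and its bind parameters. Only the table name and the three indexed columns come from this document; the score column, the BenchmarkRow shape, and the buildUpsert helper are hypothetical. The conflict target repeats the COALESCE expression so that it matches the expression index; this matters because SQLite treats NULL values as distinct in unique indexes, so rows with a NULL country_code would otherwise never conflict.

```typescript
// Hypothetical row shape; `score` is an assumed column used only for illustration.
interface BenchmarkRow {
  providerName: string;
  planName: string;
  countryCode: string | null;
  score: number;
}

interface UpsertStatement {
  sql: string;
  params: Array<string | number | null>;
}

// Builds an INSERT ... ON CONFLICT DO UPDATE statement whose conflict target
// mirrors idx_vps_benchmarks_unique, including the COALESCE expression.
function buildUpsert(row: BenchmarkRow): UpsertStatement {
  const sql =
    "INSERT INTO vps_benchmarks (provider_name, plan_name, country_code, score) " +
    "VALUES (?, ?, ?, ?) " +
    "ON CONFLICT (provider_name, plan_name, COALESCE(country_code, '')) " +
    "DO UPDATE SET score = excluded.score";
  return { sql, params: [row.providerName, row.planName, row.countryCode, row.score] };
}
```

With a D1 binding (assumed here to be named DB), the result could be executed as env.DB.prepare(sql).bind(...params).run().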

5. Documentation

Created: test-scraper.md - Testing and troubleshooting guide
Created: UPGRADE_SUMMARY.md - This file

Browser Rendering API Details

Endpoints Used

POST /content (Primary)

// Request the fully rendered HTML of the page
browser.fetch('https://browser-rendering.cloudflare.com/content', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://www.vpsbenchmarks.com/',
    waitUntil: 'networkidle0',
    rejectResourceTypes: ['image', 'font', 'media', 'stylesheet'],
    timeout: 30000
  })
})

POST /scrape (Fallback)

// Request only the elements matching the given CSS selectors
browser.fetch('https://browser-rendering.cloudflare.com/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://www.vpsbenchmarks.com/',
    elements: [
      { selector: 'table tbody tr' },
      { selector: '.benchmark-card' }
    ]
  })
})
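
The primary/fallback relationship between the two endpoints can be sketched as a loop over request attempts. This is a sketch under stated assumptions rather than the code in src/scraper.ts: FetcherLike is a minimal stand-in for the Workers Fetcher type, fetchRendered is a hypothetical helper name, and the base URL and request payloads are copied from the calls shown above.

```typescript
// Minimal stand-in for the Workers `Fetcher` binding so the sketch type-checks on its own.
interface FetcherLike {
  fetch(
    url: string,
    init: { method: string; headers: Record<string, string>; body: string }
  ): Promise<{ ok: boolean; status: number; text(): Promise<string> }>;
}

type Source = "content" | "scrape";

interface Rendered {
  source: Source;
  body: string;
}

// Base URL as written in this document; treat it as an assumption.
const BASE_URL = "https://browser-rendering.cloudflare.com";

async function fetchRendered(browser: FetcherLike, targetUrl: string): Promise<Rendered> {
  const attempts: Array<{ source: Source; payload: object }> = [
    {
      source: "content",
      payload: {
        url: targetUrl,
        waitUntil: "networkidle0",
        rejectResourceTypes: ["image", "font", "media", "stylesheet"],
        timeout: 30000,
      },
    },
    {
      source: "scrape",
      payload: {
        url: targetUrl,
        elements: [{ selector: "table tbody tr" }, { selector: ".benchmark-card" }],
      },
    },
  ];

  const failures: string[] = [];
  for (const attempt of attempts) {
    try {
      const response = await browser.fetch(BASE_URL + "/" + attempt.source, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(attempt.payload),
      });
      if (!response.ok) throw new Error("HTTP " + response.status);
      const body = await response.text();
      if (body.length === 0) throw new Error("empty response body");
      return { source: attempt.source, body };
    } catch (err) {
      // Record the failure and fall through to the next endpoint.
      failures.push(attempt.source + ": " + (err instanceof Error ? err.message : String(err)));
    }
  }
  throw new Error("All rendering attempts failed (" + failures.join("; ") + ")");
}
```

Returning the source alongside the body lets the caller pick the matching parser, since /content yields HTML while /scrape yields extracted elements.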

Free Tier Limits

10 minutes/day
More than sufficient for daily scraping
Each scrape should complete in < 1 minute
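
The headroom claim above is simple arithmetic, sketched below. The 10 minutes/day quota and the one-minute upper bound per scrape come from this section; the function names are hypothetical.

```typescript
// Free-tier quota stated above: 10 minutes of browser time per day.
const FREE_TIER_SECONDS_PER_DAY = 10 * 60;

// Fraction of the daily quota consumed by a given schedule.
function budgetShare(runsPerDay: number, secondsPerRun: number): number {
  return (runsPerDay * secondsPerRun) / FREE_TIER_SECONDS_PER_DAY;
}

// Number of whole runs that fit in the daily quota at a given duration.
function maxRunsPerDay(secondsPerRun: number): number {
  return Math.floor(FREE_TIER_SECONDS_PER_DAY / secondsPerRun);
}
```

With one run per day at the 60-second upper bound, budgetShare(1, 60) evaluates to 0.1, i.e. at most 10% of the quota, and maxRunsPerDay(60) evaluates to 10.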

Testing

Local Testing

# Terminal 1: Start dev server
npm run dev

# Terminal 2: Trigger scraper
curl "http://localhost:8793/__scheduled?cron=0+9+*+*+*"

Production Deployment

# Deploy to Cloudflare Workers
npm run deploy

# Verify cron trigger (auto-runs daily at 9:00 AM UTC)
npx wrangler tail

# Check database
npx wrangler d1 execute cloud-instances-db --remote --command="SELECT COUNT(*) FROM vps_benchmarks"

Expected Results

Before Upgrade

269 existing VPS benchmarks (manually seeded)
Scraper runs but finds 0 new entries (JavaScript rendering issue)

After Upgrade

Should successfully extract benchmark data from rendered page
Number of results depends on vpsbenchmarks.com page structure
Logs will show:
- Rendered HTML length
- Number of benchmarks extracted
- Number inserted/updated/skipped
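
One way to produce these log values is to tally per-row outcomes into a single summary object. The sketch below is illustrative only: the RunSummary shape, the helper names, and the log line format are assumptions, not the scraper's actual output.

```typescript
type UpsertOutcome = "inserted" | "updated" | "skipped";

interface RunSummary {
  renderedHtmlLength: number;
  extracted: number;
  inserted: number;
  updated: number;
  skipped: number;
}

// Tally the outcome of each row into the three counters listed above.
function summarizeRun(renderedHtml: string, extracted: number, outcomes: UpsertOutcome[]): RunSummary {
  const summary: RunSummary = {
    renderedHtmlLength: renderedHtml.length,
    extracted,
    inserted: 0,
    updated: 0,
    skipped: 0,
  };
  for (const outcome of outcomes) {
    summary[outcome] += 1;
  }
  return summary;
}

// Hypothetical single-line log format, convenient to filter in `wrangler tail`.
function formatSummary(s: RunSummary): string {
  return (
    "[scraper] html=" + s.renderedHtmlLength +
    " extracted=" + s.extracted +
    " inserted=" + s.inserted +
    " updated=" + s.updated +
    " skipped=" + s.skipped
  );
}
```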

Next Steps

1. Deploy and Monitor
   - Deploy to production
   - Wait for first cron run (or trigger manually)
   - Check logs for any errors
2. Analyze Results
   - If 0 benchmarks found, inspect rendered HTML in logs
   - Adjust CSS selectors based on actual site structure
   - Update parsing patterns as needed
3. Fine-tune Parsing
   - VPSBenchmarks.com structure may vary
   - Current code includes multiple parsing strategies
   - May need adjustments based on actual HTML structure

Rollback Plan

If issues occur:

Revert src/scraper.ts to previous version (direct HTML fetch)
Remove [browser] binding from wrangler.toml
Remove BROWSER: Fetcher from Env interface
Redeploy

Technical Notes

Why Browser Rendering API?

Problem: VPSBenchmarks.com uses React SPA

Initial HTML contains minimal content
Actual benchmark data loaded via JavaScript after page load
Traditional scraping (fetch + parse) gets empty results

Solution: Cloudflare Browser Rendering

Executes JavaScript in real browser environment
Waits for network idle (all AJAX complete)
Returns fully rendered HTML with data
Built into Cloudflare Workers platform

Performance Optimization

rejectResourceTypes: Skip images, fonts, media, and stylesheets → 40-60% faster loading
waitUntil: 'networkidle0': Wait for all network activity to settle
30-second timeout: Prevents hanging on slow pages
Multiple parsing patterns: Increases success rate

Data Quality

Deduplication via unique constraint
ON CONFLICT DO UPDATE: Keeps data fresh
Performance per dollar auto-calculation
Validation in parsing functions (skip invalid entries)
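
The document does not spell out how performance per dollar is computed, so the following sketch assumes the simplest definition: benchmark score divided by monthly price in USD, rounded to two decimals. The field names and both helpers are hypothetical; the validation step mirrors the "skip invalid entries" rule above.

```typescript
// Hypothetical input shape; field names are assumptions for illustration.
interface BenchmarkInput {
  providerName: string;
  planName: string;
  score: number;
  monthlyPriceUsd: number;
}

// Assumed formula: score / monthly price, rounded to two decimals.
// Returns null when either value is missing, non-finite, or not positive.
function performancePerDollar(score: number, monthlyPriceUsd: number): number | null {
  if (!isFinite(score) || !isFinite(monthlyPriceUsd) || score <= 0 || monthlyPriceUsd <= 0) {
    return null;
  }
  return Math.round((score / monthlyPriceUsd) * 100) / 100;
}

// An entry is kept only if it has non-empty names and a computable ratio.
function isValidEntry(entry: BenchmarkInput): boolean {
  return (
    entry.providerName.trim().length > 0 &&
    entry.planName.trim().length > 0 &&
    performancePerDollar(entry.score, entry.monthlyPriceUsd) !== null
  );
}
```

Rejecting non-positive prices up front avoids division by zero and keeps Infinity or NaN out of the database.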

Monitoring

Key Metrics to Watch

1. Scraper Success Rate
   - Number of benchmarks found per run
   - Should be > 0 after upgrade
2. Browser Rendering Usage
   - Free tier: 10 minutes/day
   - Current usage: Check Cloudflare dashboard
   - Each run should use < 1 minute
3. Database Growth
   - Current: 269 records
   - Expected: Gradual increase from daily scrapes
   - Deduplication prevents excessive growth
4. Error Rates
   - Parse errors: Indicates structure changes
   - API errors: Indicates quota or connectivity issues
   - DB errors: Indicates schema mismatches

Support

If issues persist:

Check Cloudflare Workers logs: npx wrangler tail
Verify Browser Rendering API status
Inspect actual vpsbenchmarks.com page structure
Update selectors and parsing logic accordingly
