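feat: Add bandwidth estimation and DAU display

- Automatically estimate monthly bandwidth from concurrent users
- Show estimated DAU (daily active users), calculated as concurrent users × 10-14
- Automatic Linode/Vultr selection logic based on bandwidth
- Include bandwidth cost in the cost analysis
- Show Seoul/Tokyo/Osaka/Singapore by default when no region is selected
- Show servers separately per region (GROUP BY instance + region)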

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
kappa
2026-01-25 09:40:36 +09:00
commit 4cb9da06dc
3337 changed files with 1048645 additions and 0 deletions

UPGRADE_SUMMARY.md Normal file

@@ -0,0 +1,198 @@
# VPSBenchmarks Scraper Upgrade Summary
## Changes Made
### 1. wrangler.toml
- **Added**: `[browser]` binding for Cloudflare Browser Rendering API
- This enables JavaScript execution for scraping React SPA sites
### 2. src/index.ts
- **Updated**: `Env` interface to include `BROWSER: Fetcher`
- No other changes needed - scheduled handler already calls scraper correctly
### 3. src/scraper.ts (Complete Rewrite)
**Old Approach**: Direct HTML fetch without JavaScript execution
- VPSBenchmarks.com is a React SPA → no data in initial HTML
- Previous scraper found 0 results due to client-side rendering
**New Approach**: Browser Rendering API with JavaScript execution
- Uses `/content` endpoint to fetch fully rendered HTML
- Fallback to `/scrape` endpoint for targeted element extraction
- Multiple parsing patterns for robustness:
  - Table row extraction
  - Embedded JSON data detection
  - Benchmark card parsing
  - Scraped element parsing
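
As an illustration of the table row extraction pattern listed above, the following is a minimal sketch rather than the code in `src/scraper.ts`. The helper name `extractTableRows` is hypothetical, and it assumes the rendered HTML contains plain `<tr>`/`<td>` markup without nested tables.

```typescript
// Hypothetical helper (not taken from src/scraper.ts): collect the text of
// every <td> cell, grouped by <tr>, from already-rendered HTML.
function extractTableRows(html: string): string[][] {
  const rows: string[][] = [];
  const rowPattern = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
  let rowMatch: RegExpExecArray | null;
  while ((rowMatch = rowPattern.exec(html)) !== null) {
    const cells: string[] = [];
    // A fresh regex per row keeps lastIndex state from leaking between rows.
    const cellPattern = /<td\b[^>]*>([\s\S]*?)<\/td>/gi;
    let cellMatch: RegExpExecArray | null;
    while ((cellMatch = cellPattern.exec(rowMatch[1])) !== null) {
      // Strip nested tags, then collapse runs of whitespace.
      const text = cellMatch[1].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
      cells.push(text);
    }
    // Rows without <td> cells (for example header rows using <th>) are skipped.
    if (cells.length > 0) {
      rows.push(cells);
    }
  }
  return rows;
}
```

Regex-based extraction is fragile against markup changes, which is presumably why the scraper keeps several patterns side by side.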
**Key Features**:
- Waits for `networkidle0` (no open network connections for at least 500 ms), so client-side rendering has normally finished
- Rejects unnecessary resources (images, fonts, media, stylesheets) for faster loading
- 30-second timeout per request
- Comprehensive error handling
- Deduplication using a unique index on (provider_name, plan_name, COALESCE(country_code, ''))
### 4. vps-benchmark-schema.sql
- **Added**: Unique index for ON CONFLICT deduplication
  - `CREATE UNIQUE INDEX idx_vps_benchmarks_unique ON vps_benchmarks(provider_name, plan_name, COALESCE(country_code, ''))`
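
To show how an upsert can target this index, here is a hedged sketch. The `score` column and the `BenchmarkRow` shape are hypothetical placeholders, since the full table schema is not shown here. The conflict target repeats the index expression, including `COALESCE(country_code, '')`, so that rows with a NULL country code are matched as well.

```typescript
// Hypothetical row shape; only provider_name, plan_name and country_code
// are known column names. `score` is a placeholder.
interface BenchmarkRow {
  providerName: string;
  planName: string;
  countryCode: string | null;
  score: number;
}

// The ON CONFLICT target mirrors the expression used by the unique index.
const UPSERT_SQL = [
  'INSERT INTO vps_benchmarks (provider_name, plan_name, country_code, score)',
  'VALUES (?, ?, ?, ?)',
  "ON CONFLICT (provider_name, plan_name, COALESCE(country_code, ''))",
  'DO UPDATE SET score = excluded.score',
].join('\n');

// Positional parameters in the same order as the VALUES placeholders.
function upsertParams(row: BenchmarkRow): Array<string | number | null> {
  return [row.providerName, row.planName, row.countryCode, row.score];
}
```

With a D1 binding (assumed here to be named `DB`), the statement could be run as `env.DB.prepare(UPSERT_SQL).bind(...upsertParams(row)).run()`.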
### 5. Documentation
- **Created**: `test-scraper.md` - Testing and troubleshooting guide
- **Created**: `UPGRADE_SUMMARY.md` - This file
## Browser Rendering API Details
### Endpoints Used
1. **POST /content** (Primary)
```typescript
browser.fetch('https://browser-rendering.cloudflare.com/content', {
  method: 'POST',
  body: JSON.stringify({
    url: 'https://www.vpsbenchmarks.com/',
    waitUntil: 'networkidle0',
    rejectResourceTypes: ['image', 'font', 'media', 'stylesheet'],
    timeout: 30000
  })
})
```
2. **POST /scrape** (Fallback)
```typescript
browser.fetch('https://browser-rendering.cloudflare.com/scrape', {
  method: 'POST',
  body: JSON.stringify({
    url: 'https://www.vpsbenchmarks.com/',
    elements: [
      { selector: 'table tbody tr' },
      { selector: '.benchmark-card' }
    ]
  })
})
```
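
The primary/fallback ordering of the two endpoints can be sketched as follows. This is an illustration under stated assumptions, not the actual scraper code. The `BrowserBinding` and `RenderResponse` interfaces are minimal stand-ins for the `BROWSER` binding, `fetchRenderedPage` and `jsonPost` are hypothetical names, and the endpoint URLs and request fields are copied from the examples above rather than verified against the API reference.

```typescript
// Minimal stand-ins for the binding types, so the sketch is self-contained.
interface RenderResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}
interface JsonPostInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}
interface BrowserBinding {
  fetch(url: string, init: JsonPostInit): Promise<RenderResponse>;
}
interface RenderedPage {
  source: 'content' | 'scrape';
  body: string;
}

// Base URL as used in the examples above.
const API_BASE = 'https://browser-rendering.cloudflare.com';

// Build a JSON POST request for either endpoint.
function jsonPost(payload: object): JsonPostInit {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  };
}

// Try /content first; use /scrape only if the primary call fails.
async function fetchRenderedPage(browser: BrowserBinding, targetUrl: string): Promise<RenderedPage> {
  try {
    const primary = await browser.fetch(API_BASE + '/content', jsonPost({
      url: targetUrl,
      waitUntil: 'networkidle0',
      rejectResourceTypes: ['image', 'font', 'media', 'stylesheet'],
      timeout: 30000,
    }));
    if (primary.ok) {
      return { source: 'content', body: await primary.text() };
    }
  } catch (err) {
    // Network-level failure on the primary endpoint: fall through to /scrape.
  }
  const fallback = await browser.fetch(API_BASE + '/scrape', jsonPost({
    url: targetUrl,
    elements: [{ selector: 'table tbody tr' }, { selector: '.benchmark-card' }],
  }));
  if (!fallback.ok) {
    throw new Error('Both endpoints failed (last status ' + fallback.status + ')');
  }
  return { source: 'scrape', body: await fallback.text() };
}
```

Returning the `source` alongside the body lets the caller pick the matching parser, since `/scrape` output is structured differently from full-page HTML.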
### Free Tier Limits
- 10 minutes/day
- More than sufficient for daily scraping
- Each scrape should complete in < 1 minute
## Testing
### Local Testing
```bash
# Terminal 1: Start dev server
npm run dev
# Terminal 2: Trigger scraper
curl "http://localhost:8793/__scheduled?cron=0+9+*+*+*"
```
### Production Deployment
```bash
# Deploy to Cloudflare Workers
npm run deploy
# Stream live logs to confirm the cron trigger fires (runs daily at 9:00 AM UTC)
npx wrangler tail
# Check database
npx wrangler d1 execute cloud-instances-db --command="SELECT COUNT(*) FROM vps_benchmarks"
```
## Expected Results
### Before Upgrade
- 269 existing VPS benchmarks (manually seeded)
- Scraper runs but finds 0 new entries (JavaScript rendering issue)
### After Upgrade
- Should successfully extract benchmark data from rendered page
- Number of results depends on vpsbenchmarks.com page structure
- Logs will show:
  - Rendered HTML length
  - Number of benchmarks extracted
  - Number inserted/updated/skipped
## Next Steps
1. **Deploy and Monitor**
   - Deploy to production
   - Wait for first cron run (or trigger manually)
   - Check logs for any errors
2. **Analyze Results**
   - If 0 benchmarks found, inspect rendered HTML in logs
   - Adjust CSS selectors based on actual site structure
   - Update parsing patterns as needed
3. **Fine-tune Parsing**
   - VPSBenchmarks.com structure may vary
   - Current code includes multiple parsing strategies
   - May need adjustments based on actual HTML structure
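
One way to organize several parsing strategies so that new ones can be added during fine-tuning is a simple ordered chain. The sketch below is an assumption about structure, not a description of the current code. `Benchmark`, `Parser`, and `parseWithFallbacks` are hypothetical names.

```typescript
// Hypothetical minimal result shape.
interface Benchmark {
  provider: string;
  plan: string;
}
type Parser = (html: string) => Benchmark[];

interface ParseOutcome {
  strategy: string | null; // name of the strategy that produced results
  results: Benchmark[];
}

// Run each named strategy in order and stop at the first one that yields data.
function parseWithFallbacks(html: string, parsers: Array<[string, Parser]>): ParseOutcome {
  for (const [name, parse] of parsers) {
    try {
      const results = parse(html);
      if (results.length > 0) {
        return { strategy: name, results };
      }
    } catch (err) {
      // A strategy that throws must not block the strategies after it.
    }
  }
  return { strategy: null, results: [] };
}
```

Logging the returned `strategy` name per run would show which pattern matched, and a `null` value signals that the selectors need adjusting.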
## Rollback Plan
If issues occur:
1. Revert `src/scraper.ts` to previous version (direct HTML fetch)
2. Remove `[browser]` binding from wrangler.toml
3. Remove `BROWSER: Fetcher` from Env interface
4. Redeploy
## Technical Notes
### Why Browser Rendering API?
**Problem**: VPSBenchmarks.com is a React SPA
- Initial HTML contains minimal content
- Actual benchmark data loaded via JavaScript after page load
- Traditional scraping (fetch + parse) gets empty results
**Solution**: Cloudflare Browser Rendering
- Executes JavaScript in a real browser environment
- Waits for network idle (no open connections for at least 500 ms, so AJAX requests have normally completed)
- Returns fully rendered HTML with data
- Built into Cloudflare Workers platform
### Performance Optimization
- `rejectResourceTypes`: Skip images, fonts, media, and stylesheets → roughly 40-60% faster loading
- `waitUntil: 'networkidle0'`: Wait for all network activity to settle
- 30-second timeout: Prevents hanging on slow pages
- Multiple parsing patterns: Increases success rate
### Data Quality
- Deduplication via unique constraint
- ON CONFLICT DO UPDATE: Keeps data fresh
- Performance per dollar auto-calculation
- Validation in parsing functions (skip invalid entries)
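
The validation and performance-per-dollar steps could look roughly like this. It is a sketch under assumptions: the field names are hypothetical, and performance per dollar is assumed to mean benchmark score divided by monthly price in USD, which may differ from the formula actually used.

```typescript
// Hypothetical field names; the real schema may differ.
interface RawBenchmark {
  providerName: string;
  planName: string;
  score: number;
  monthlyPriceUsd: number;
}
interface ScoredBenchmark extends RawBenchmark {
  performancePerDollar: number;
}

// Reject entries with empty names or non-positive / non-finite numbers.
function isValidBenchmark(b: RawBenchmark): boolean {
  return (
    b.providerName.trim() !== '' &&
    b.planName.trim() !== '' &&
    isFinite(b.score) && b.score > 0 &&
    isFinite(b.monthlyPriceUsd) && b.monthlyPriceUsd > 0
  );
}

// Drop invalid entries, then attach score per dollar rounded to 2 decimals.
function withPerformancePerDollar(entries: RawBenchmark[]): ScoredBenchmark[] {
  return entries.filter(isValidBenchmark).map((b) => ({
    ...b,
    performancePerDollar: Math.round((b.score / b.monthlyPriceUsd) * 100) / 100,
  }));
}
```

Filtering before the division avoids divide-by-zero results for entries whose price failed to parse.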
## Monitoring
### Key Metrics to Watch
1. **Scraper Success Rate**
   - Number of benchmarks found per run
   - Should be > 0 after upgrade
2. **Browser Rendering Usage**
   - Free tier: 10 minutes/day
   - Current usage: Check Cloudflare dashboard
   - Each run should use < 1 minute
3. **Database Growth**
   - Current: 269 records
   - Expected: Gradual increase from daily scrapes
   - Deduplication prevents excessive growth
4. **Error Rates**
   - Parse errors: Indicates structure changes
   - API errors: Indicates quota or connectivity issues
   - DB errors: Indicates schema mismatches
## Support
If issues persist:
1. Check Cloudflare Workers logs: `npx wrangler tail`
2. Verify Browser Rendering API status
3. Inspect actual vpsbenchmarks.com page structure
4. Update selectors and parsing logic accordingly