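feat: Add bandwidth estimation and DAU display
- Automatically estimate monthly bandwidth from the concurrent user count
- Display the estimated DAU (daily active users), computed as concurrent users × 10-14
- Select Linode/Vultr automatically based on bandwidth
- Include bandwidth costs in the cost analysis
- Show Seoul/Tokyo/Osaka/Singapore by default when no region is selected
- List servers separately per region (GROUP BY instance + region)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>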
198
UPGRADE_SUMMARY.md
Normal file
@@ -0,0 +1,198 @@
# VPSBenchmarks Scraper Upgrade Summary

## Changes Made

### 1. wrangler.toml
- **Added**: `[browser]` binding for Cloudflare Browser Rendering API
  - This enables JavaScript execution for scraping React SPA sites

### 2. src/index.ts
- **Updated**: `Env` interface to include `BROWSER: Fetcher`
- No other changes needed; the scheduled handler already calls the scraper correctly

### 3. src/scraper.ts (Complete Rewrite)
**Old Approach**: Direct HTML fetch without JavaScript execution
- VPSBenchmarks.com is a React SPA, so the initial HTML contains no benchmark data
- The previous scraper found 0 results because the page is rendered client-side

**New Approach**: Browser Rendering API with JavaScript execution
- Uses the `/content` endpoint to fetch fully rendered HTML
- Falls back to the `/scrape` endpoint for targeted element extraction
- Multiple parsing patterns for robustness:
  - Table row extraction
  - Embedded JSON data detection
  - Benchmark card parsing
  - Scraped element parsing
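
Two of the parsing patterns listed above (table row extraction and embedded JSON detection) could look roughly like the following. This is a minimal sketch, not the code in `src/scraper.ts`: the function names, the `ParsedRow` shape, and the choice of regular expressions over a DOM parser are all assumptions made for illustration.

```typescript
// Hypothetical result shape; the real scraper's fields are not shown in this summary.
interface ParsedRow {
  cells: string[];
}

// Strip tags and collapse whitespace in an HTML fragment.
function textOf(fragment: string): string {
  return fragment.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Pattern 1: collect the cell text of every <tr> that contains <td> cells.
function extractTableRows(html: string): ParsedRow[] {
  const rows: ParsedRow[] = [];
  const rowRe = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
  for (const rowMatch of html.matchAll(rowRe)) {
    const cells: string[] = [];
    const cellRe = /<td\b[^>]*>([\s\S]*?)<\/td>/gi;
    for (const cellMatch of rowMatch[1].matchAll(cellRe)) {
      cells.push(textOf(cellMatch[1]));
    }
    // Header-only rows (no <td>) are skipped.
    if (cells.length > 0) rows.push({ cells });
  }
  return rows;
}

// Pattern 2: parse every <script type="application/json"> payload that is valid JSON.
function extractEmbeddedJson(html: string): unknown[] {
  const found: unknown[] = [];
  const scriptRe =
    /<script\b[^>]*type=["']application\/json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const m of html.matchAll(scriptRe)) {
    try {
      found.push(JSON.parse(m[1]));
    } catch {
      // Ignore payloads that are not valid JSON.
    }
  }
  return found;
}
```

Regex-based extraction is brittle against markup changes, which is presumably why the scraper keeps several patterns and tries each in turn.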

**Key Features**:
- Waits for `networkidle0` (no open network connections for at least 500 ms), by which point client-side rendering has normally finished
- Rejects unnecessary resources (images, fonts, media, stylesheets) for faster loading
- 30-second timeout per request
- Comprehensive error handling
- Deduplication using a unique index on (provider_name, plan_name, country_code)
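
The deduplication rule can also be applied in memory before anything is written to the database. The sketch below assumes a record shape with the three key fields named as in the index and treats a missing `country_code` as an empty string. The helper names are hypothetical and not taken from the scraper.

```typescript
// Only the three key fields are known from this summary; other fields are omitted.
interface BenchmarkKeyFields {
  provider_name: string;
  plan_name: string;
  country_code: string | null;
}

// Build a key that treats a null country_code like '' (same idea as COALESCE).
function dedupKey(b: BenchmarkKeyFields): string {
  return JSON.stringify([b.provider_name, b.plan_name, b.country_code ?? '']);
}

// Keep one record per key; a later record replaces an earlier one with the same key.
function dedupe<T extends BenchmarkKeyFields>(items: T[]): T[] {
  const byKey = new Map<string, T>();
  for (const item of items) {
    byKey.set(dedupKey(item), item);
  }
  return Array.from(byKey.values());
}
```

Deduplicating before the write reduces the number of statements sent to the database, while the unique index remains the final safeguard.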

### 4. vps-benchmark-schema.sql
- **Added**: Unique index for ON CONFLICT deduplication
  - `CREATE UNIQUE INDEX idx_vps_benchmarks_unique ON vps_benchmarks(provider_name, plan_name, COALESCE(country_code, ''))`
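
Because the index is defined on an expression, an upsert that relies on it has to name the same expression in its conflict target. The sketch below builds such a statement as a string. The function name is hypothetical, the non-key columns are supplied by the caller because the full schema is not shown here, and it assumes SQLite matches the expression conflict target to the index.

```typescript
// Must match the expression list of idx_vps_benchmarks_unique exactly.
const CONFLICT_TARGET = "provider_name, plan_name, COALESCE(country_code, '')";

// Build an INSERT ... ON CONFLICT statement for the given non-key columns.
// Column names are interpolated, so they must be trusted identifiers, never user input.
function buildUpsertSql(valueColumns: string[]): string {
  const columns = ['provider_name', 'plan_name', 'country_code', ...valueColumns];
  const placeholders = columns.map(() => '?').join(', ');
  const updates = valueColumns.map((c) => `${c} = excluded.${c}`).join(', ');
  const action = valueColumns.length > 0 ? `DO UPDATE SET ${updates}` : 'DO NOTHING';
  return (
    `INSERT INTO vps_benchmarks (${columns.join(', ')}) ` +
    `VALUES (${placeholders}) ON CONFLICT(${CONFLICT_TARGET}) ${action}`
  );
}
```

With D1, the resulting string would typically be passed to `prepare(...)` and the values supplied through `bind(...)` in the same column order.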

### 5. Documentation
- **Created**: `test-scraper.md` - Testing and troubleshooting guide
- **Created**: `UPGRADE_SUMMARY.md` - This file

## Browser Rendering API Details

### Endpoints Used

1. **POST /content** (Primary)
```typescript
browser.fetch('https://browser-rendering.cloudflare.com/content', {
  method: 'POST',
  body: JSON.stringify({
    url: 'https://www.vpsbenchmarks.com/',
    waitUntil: 'networkidle0',
    rejectResourceTypes: ['image', 'font', 'media', 'stylesheet'],
    timeout: 30000
  })
})
```

2. **POST /scrape** (Fallback)
```typescript
browser.fetch('https://browser-rendering.cloudflare.com/scrape', {
  method: 'POST',
  body: JSON.stringify({
    url: 'https://www.vpsbenchmarks.com/',
    elements: [
      { selector: 'table tbody tr' },
      { selector: '.benchmark-card' }
    ]
  })
})
```
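
The primary/fallback ordering of the two endpoints can be expressed as a small helper that tries each strategy in turn and stops at the first one that yields data. This is a sketch under assumptions: `firstNonEmpty` and the `Extractor` type are hypothetical names, and the actual requests are injected as functions, so nothing here depends on the Browser Rendering API itself.

```typescript
// A strategy is any async function that returns the extracted items.
type Extractor<T> = () => Promise<T[]>;

interface StrategyResult<T> {
  source: string | null; // name of the strategy that produced the items, or null
  items: T[];
}

// Run strategies in order; return the first non-empty result.
// A strategy that throws is logged and treated as empty, so later ones still run.
async function firstNonEmpty<T>(
  strategies: Array<[string, Extractor<T>]>
): Promise<StrategyResult<T>> {
  for (const [name, run] of strategies) {
    try {
      const items = await run();
      if (items.length > 0) return { source: name, items };
    } catch (err) {
      console.warn(`${name} strategy failed:`, err);
    }
  }
  return { source: null, items: [] };
}
```

A caller would pass something like `[['content', fetchViaContent], ['scrape', fetchViaScrape]]`, where both functions are placeholders for the requests shown above followed by parsing.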

### Free Tier Limits
- 10 minutes/day
- More than sufficient for daily scraping
- Each scrape should complete in < 1 minute
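
The headroom implied by these numbers is easy to check: one daily run of at most 60 seconds uses at most a tenth of the 600-second daily allowance. The helper below is an illustrative sketch with hypothetical names and does not query any Cloudflare API.

```typescript
// Free tier allowance stated above: 10 minutes of browser time per day.
const FREE_TIER_SECONDS_PER_DAY = 10 * 60;

interface UsageEstimate {
  usedSeconds: number;
  fractionOfBudget: number;
  withinBudget: boolean;
}

// Estimate daily usage from the number of runs and the duration of each run.
function estimateDailyUsage(runsPerDay: number, secondsPerRun: number): UsageEstimate {
  const usedSeconds = runsPerDay * secondsPerRun;
  return {
    usedSeconds,
    fractionOfBudget: usedSeconds / FREE_TIER_SECONDS_PER_DAY,
    withinBudget: usedSeconds <= FREE_TIER_SECONDS_PER_DAY,
  };
}
```

For example, `estimateDailyUsage(1, 60)` reports 60 seconds used, a fraction of 0.1, and `withinBudget: true`.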

## Testing

### Local Testing
```bash
# Terminal 1: Start dev server
# (the /__scheduled route is only served when wrangler dev runs with --test-scheduled)
npm run dev

# Terminal 2: Trigger scraper
curl "http://localhost:8793/__scheduled?cron=0+9+*+*+*"
```

### Production Deployment
```bash
# Deploy to Cloudflare Workers
npm run deploy

# Verify cron trigger (auto-runs daily at 9:00 AM UTC)
npx wrangler tail

# Check database
# (on Wrangler v3+, add --remote to query the deployed database rather than the local one)
npx wrangler d1 execute cloud-instances-db --command="SELECT COUNT(*) FROM vps_benchmarks"
```

## Expected Results

### Before Upgrade
- 269 existing VPS benchmarks (manually seeded)
- Scraper runs but finds 0 new entries (JavaScript rendering issue)

### After Upgrade
- Should successfully extract benchmark data from rendered page
- Number of results depends on vpsbenchmarks.com page structure
- Logs will show:
  - Rendered HTML length
  - Number of benchmarks extracted
  - Number inserted/updated/skipped

## Next Steps

1. **Deploy and Monitor**
   - Deploy to production
   - Wait for first cron run (or trigger manually)
   - Check logs for any errors

2. **Analyze Results**
   - If 0 benchmarks found, inspect rendered HTML in logs
   - Adjust CSS selectors based on actual site structure
   - Update parsing patterns as needed

3. **Fine-tune Parsing**
   - VPSBenchmarks.com structure may vary
   - Current code includes multiple parsing strategies
   - May need adjustments based on actual HTML structure

## Rollback Plan

If issues occur:
1. Revert `src/scraper.ts` to previous version (direct HTML fetch)
2. Remove `[browser]` binding from wrangler.toml
3. Remove `BROWSER: Fetcher` from Env interface
4. Redeploy

## Technical Notes

### Why Browser Rendering API?

**Problem**: VPSBenchmarks.com is a React SPA
- Initial HTML contains minimal content
- Actual benchmark data is loaded via JavaScript after page load
- Traditional scraping (fetch + parse) gets empty results

**Solution**: Cloudflare Browser Rendering
- Executes JavaScript in a real browser environment
- Waits for network idle (no requests still in flight)
- Returns fully rendered HTML with data
- Built into the Cloudflare Workers platform

### Performance Optimization

- `rejectResourceTypes`: Skip images, fonts, media, and stylesheets → an estimated 40-60% faster loading
- `waitUntil: 'networkidle0'`: Wait for all network activity to settle
- 30-second timeout: Prevents hanging on slow pages
- Multiple parsing patterns: Increases success rate

### Data Quality

- Deduplication via the unique index
- ON CONFLICT DO UPDATE: Keeps data fresh
- Performance per dollar auto-calculation
- Validation in parsing functions (skip invalid entries)
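
The performance-per-dollar calculation and the skip-invalid rule can be combined in one small function. The sketch below assumes the metric is a benchmark score divided by the monthly price. The field names `score` and `monthlyPriceUsd` are hypothetical, and rounding to two decimals is an arbitrary choice for illustration.

```typescript
// Hypothetical input shape; the real column names are not shown in this summary.
interface PricedBenchmark {
  score: number;
  monthlyPriceUsd: number;
}

// Return score per dollar rounded to two decimals, or null for invalid input
// (non-finite numbers, or a score or price that is zero or negative).
function performancePerDollar(b: PricedBenchmark): number | null {
  if (!Number.isFinite(b.score) || !Number.isFinite(b.monthlyPriceUsd)) return null;
  if (b.score <= 0 || b.monthlyPriceUsd <= 0) return null;
  return Math.round((b.score / b.monthlyPriceUsd) * 100) / 100;
}
```

Returning `null` rather than throwing lets the caller skip an invalid entry and continue with the rest of the batch.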

## Monitoring

### Key Metrics to Watch

1. **Scraper Success Rate**
   - Number of benchmarks found per run
   - Should be > 0 after upgrade

2. **Browser Rendering Usage**
   - Free tier: 10 minutes/day
   - Current usage: Check Cloudflare dashboard
   - Each run should use < 1 minute

3. **Database Growth**
   - Current: 269 records
   - Expected: Gradual increase from daily scrapes
   - Deduplication prevents excessive growth

4. **Error Rates**
   - Parse errors: Indicates structure changes
   - API errors: Indicates quota or connectivity issues
   - DB errors: Indicates schema mismatches

## Support

If issues persist:
1. Check Cloudflare Workers logs: `npx wrangler tail`
2. Verify Browser Rendering API status
3. Inspect actual vpsbenchmarks.com page structure
4. Update selectors and parsing logic accordingly