cloud-orchestrator/test-scraper.md
kappa 4cb9da06dc feat: add bandwidth estimation and DAU display
- Automatic monthly bandwidth estimation based on concurrent users
- DAU (daily active users) estimate displayed (concurrent users × 10-14)
- Bandwidth-based automatic Linode/Vultr selection logic
- Bandwidth cost included in the cost analysis
- Seoul/Tokyo/Osaka/Singapore shown by default when no region is selected
- Servers listed per region (GROUP BY instance + region)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 09:40:36 +09:00


# Testing VPSBenchmarks Scraper with Browser Rendering API
## Local Testing
### 1. Start the development server
```bash
npm run dev
```
### 2. Trigger the scheduled handler manually
In a separate terminal, run:
```bash
curl "http://localhost:8793/__scheduled?cron=0+9+*+*+*"
```
### 3. Check the logs
The scraper will:
- Use Browser Rendering API to fetch rendered HTML from vpsbenchmarks.com
- Extract benchmark data from the rendered page
- Insert/update records in the D1 database
- Log the number of benchmarks found and inserted
Expected output:
```
[Scraper] Starting VPSBenchmarks.com scrape with Browser Rendering API
[Scraper] Fetching rendered HTML from vpsbenchmarks.com
[Scraper] Rendered HTML length: XXXXX
[Scraper] Extracted X benchmarks from HTML
[Scraper] Found X benchmark entries
[DB] Inserted/Updated: Provider PlanName
[Scraper] Completed in XXXms: X inserted, X skipped, X errors
```
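The `__scheduled` endpoint above works because the Worker exports a `scheduled` handler. A minimal self-contained sketch of that shape follows; the `runScraper` helper, the binding names, and the local runtime-type stand-ins are illustrative assumptions, not the project's actual identifiers:

```typescript
// Minimal sketch of a cron-driven Cloudflare Worker.
// Helper and binding names are assumptions for illustration.
interface Env {
  DB?: unknown;      // D1 binding (assumed name)
  BROWSER?: unknown; // Browser Rendering binding (assumed name)
}

// Local stand-ins for the Workers runtime types, so the sketch is self-contained.
interface ScheduledController { cron: string }
interface ExecutionContext { waitUntil(p: Promise<unknown>): void }

async function runScraper(env: Env): Promise<string> {
  // ... fetch rendered HTML, parse benchmarks, upsert into D1 ...
  return "[Scraper] Completed";
}

const worker = {
  // Invoked by the cron trigger in production, or via
  // GET /__scheduled?cron=... under `wrangler dev`.
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    ctx.waitUntil(runScraper(env));
  },
};

export default worker;
```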
## Production Deployment
### 1. Deploy to Cloudflare Workers
```bash
npm run deploy
```
### 2. Verify the cron trigger
The scraper will run automatically daily at 9:00 AM UTC.
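That schedule (and the `cron=0+9+*+*+*` used in local testing) corresponds to a trigger along these lines in `wrangler.toml`; the exact file contents are an assumption to check against the project's config:

```toml
# Cron trigger: minute 0, hour 9, every day (times are UTC).
[triggers]
crons = ["0 9 * * *"]
```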
### 3. Check D1 database for new records
```bash
npx wrangler d1 execute cloud-instances-db --command="SELECT COUNT(*) as total FROM vps_benchmarks"
npx wrangler d1 execute cloud-instances-db --command="SELECT * FROM vps_benchmarks ORDER BY created_at DESC LIMIT 10"
```
## Browser Rendering API Usage
### Free Tier Limits
- 10 minutes per day
- Sufficient for daily scraping (each run should take < 1 minute)
### API Endpoints Used
1. **POST /content** - Fetch fully rendered HTML
- Waits for JavaScript to execute
- Returns complete DOM after rendering
- Options: `waitUntil`, `rejectResourceTypes`, `timeout`
2. **POST /scrape** (fallback) - Extract specific elements
- Target specific CSS selectors
- Returns structured element data
- Useful if full HTML extraction fails
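A sketch of how a `/content` request might be assembled, kept as a pure request builder. The account-scoped endpoint path and the option names follow this document; treat both as assumptions to verify against Cloudflare's Browser Rendering documentation:

```typescript
// Build the POST /content request for the Browser Rendering API.
// Endpoint path and option names are assumptions based on this doc.
interface ContentRequest {
  url: string;
  waitUntil?: "load" | "domcontentloaded" | "networkidle0";
  rejectResourceTypes?: string[];
  timeout?: number; // milliseconds
}

function buildContentRequest(
  accountId: string,
  target: string,
): { endpoint: string; body: ContentRequest } {
  return {
    endpoint: `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/content`,
    body: {
      url: target,
      waitUntil: "networkidle0",              // wait for JS-rendered content
      rejectResourceTypes: ["image", "font"], // skip heavy assets to save quota
      timeout: 30_000,                        // matches the 30 s default noted below
    },
  };
}

const req = buildContentRequest("<ACCOUNT_ID>", "https://www.vpsbenchmarks.com/");
console.log(req.endpoint);
```

The request body would then be sent with `fetch(req.endpoint, { method: "POST", ... })` plus an API token header.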
### Error Handling
The scraper includes multiple fallback strategies:
1. Try `/content` endpoint first (full HTML)
2. If no data found, try `/scrape` endpoint (targeted extraction)
3. Multiple parsing patterns for different HTML structures
4. Graceful degradation if API fails
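The fallback order above can be sketched as a simple chain, with the two fetchers injected so the strategy itself is visible; the function names and `Benchmark` shape are illustrative, not the scraper's actual code:

```typescript
type Benchmark = { provider: string; plan: string };

// Try the full-HTML path first, then the targeted /scrape path.
// Returns [] (graceful degradation) if both fail or find nothing.
async function scrapeWithFallback(
  fetchContent: () => Promise<Benchmark[]>,
  fetchScrape: () => Promise<Benchmark[]>,
): Promise<Benchmark[]> {
  try {
    const fromContent = await fetchContent();
    if (fromContent.length > 0) return fromContent;
  } catch (err) {
    console.warn("[Scraper] /content failed, falling back to /scrape", err);
  }
  try {
    return await fetchScrape();
  } catch (err) {
    console.error("[Scraper] /scrape also failed, returning no data", err);
    return [];
  }
}
```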
### Troubleshooting
**No benchmarks found:**
- Check if vpsbenchmarks.com changed their HTML structure
- Examine the rendered HTML output in logs
- Adjust CSS selectors in `scrapeBenchmarksWithScrapeAPI()`
- Update parsing patterns in `extractBenchmarksFromHTML()`
**Browser Rendering API errors:**
- Check daily quota usage
- Verify BROWSER binding is configured in wrangler.toml
- Check network connectivity to Browser Rendering API
- Review timeout settings (default: 30 seconds)
**Database insertion errors:**
- Verify vps_benchmarks table schema
- Check unique constraint on (provider_name, plan_name, country_code)
- Ensure all required fields are not null
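Given that unique key, the insert/update step presumably relies on SQLite's upsert syntax. A minimal sketch of such a statement; only the three key columns come from this document, and the remaining column names are assumptions:

```typescript
// SQLite upsert keyed on the assumed unique constraint
// (provider_name, plan_name, country_code). Non-key columns are illustrative.
function buildUpsertSQL(): string {
  return [
    "INSERT INTO vps_benchmarks (provider_name, plan_name, country_code, score, created_at)",
    "VALUES (?1, ?2, ?3, ?4, CURRENT_TIMESTAMP)",
    "ON CONFLICT(provider_name, plan_name, country_code)",
    "DO UPDATE SET score = excluded.score",
  ].join(" ");
}

const sql = buildUpsertSQL();
console.log(sql);
```

A statement of this form would be bound and run via D1's `env.DB.prepare(sql).bind(...).run()`.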
## Next Steps
After successful testing:
1. **Analyze the rendered HTML** to identify exact CSS selectors
2. **Update parsing logic** based on actual site structure
3. **Test with real data** to ensure accuracy
4. **Monitor logs** after deployment for any issues
5. **Validate data quality** in the database