cloud-orchestrator/test-scraper.md
kappa 4cb9da06dc feat: add bandwidth estimation and DAU display
- Automatic monthly bandwidth estimation based on concurrent users
- DAU (daily active users) estimate displayed (concurrent users × 10-14)
- Bandwidth-based automatic Linode/Vultr selection logic
- Bandwidth cost included in the cost analysis
- Seoul/Tokyo/Osaka/Singapore shown by default when no region is selected
- Servers listed per region (GROUP BY instance + region)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 09:40:36 +09:00


# Testing VPSBenchmarks Scraper with Browser Rendering API
## Local Testing
### 1. Start the development server
```bash
npm run dev
```
### 2. Trigger the scheduled handler manually
In a separate terminal, run:
```bash
curl "http://localhost:8793/__scheduled?cron=0+9+*+*+*"
```
### 3. Check the logs
The scraper will:
- Use Browser Rendering API to fetch rendered HTML from vpsbenchmarks.com
- Extract benchmark data from the rendered page
- Insert/update records in the D1 database
- Log the number of benchmarks found and inserted
Expected output:
```
[Scraper] Starting VPSBenchmarks.com scrape with Browser Rendering API
[Scraper] Fetching rendered HTML from vpsbenchmarks.com
[Scraper] Rendered HTML length: XXXXX
[Scraper] Extracted X benchmarks from HTML
[Scraper] Found X benchmark entries
[DB] Inserted/Updated: Provider PlanName
[Scraper] Completed in XXXms: X inserted, X skipped, X errors
```
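The `__scheduled` endpoint above works because the Worker exports a `scheduled` handler. A minimal self-contained sketch of that shape follows; the `runScraper` helper, the binding names, and the local runtime-type stand-ins are illustrative assumptions, not the project's actual identifiers:

```typescript
// Minimal sketch of a cron-driven Cloudflare Worker.
// Helper and binding names are assumptions for illustration.
interface Env {
  DB?: unknown;      // D1 binding (assumed name)
  BROWSER?: unknown; // Browser Rendering binding (assumed name)
}

// Local stand-ins for the Workers runtime types, so the sketch is self-contained.
interface ScheduledController { cron: string }
interface ExecutionContext { waitUntil(p: Promise<unknown>): void }

async function runScraper(env: Env): Promise<string> {
  // ... fetch rendered HTML, parse benchmarks, upsert into D1 ...
  return "[Scraper] Completed";
}

const worker = {
  // Invoked by the cron trigger in production, or via
  // GET /__scheduled?cron=... under `wrangler dev`.
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    ctx.waitUntil(runScraper(env));
  },
};

export default worker;
```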
## Production Deployment
### 1. Deploy to Cloudflare Workers
```bash
npm run deploy
```
### 2. Verify the cron trigger
The scraper will run automatically daily at 9:00 AM UTC.
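That schedule (and the `cron=0+9+*+*+*` used in local testing) corresponds to a trigger along these lines in `wrangler.toml`; the exact file contents are an assumption to check against the project's config:

```toml
# Cron trigger: minute 0, hour 9, every day (times are UTC).
[triggers]
crons = ["0 9 * * *"]
```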
### 3. Check D1 database for new records
```bash
npx wrangler d1 execute cloud-instances-db --command="SELECT COUNT(*) as total FROM vps_benchmarks"
npx wrangler d1 execute cloud-instances-db --command="SELECT * FROM vps_benchmarks ORDER BY created_at DESC LIMIT 10"
```
## Browser Rendering API Usage
### Free Tier Limits
- 10 minutes per day
- Sufficient for daily scraping (each run should take < 1 minute)
### API Endpoints Used
1. **POST /content** - Fetch fully rendered HTML
- Waits for JavaScript to execute
- Returns complete DOM after rendering
- Options: `waitUntil`, `rejectResourceTypes`, `timeout`
2. **POST /scrape** (fallback) - Extract specific elements
- Target specific CSS selectors
- Returns structured element data
- Useful if full HTML extraction fails
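A sketch of how a `/content` request might be assembled, kept as a pure request builder. The account-scoped endpoint path and the option names follow this document; treat both as assumptions to verify against Cloudflare's Browser Rendering documentation:

```typescript
// Build the POST /content request for the Browser Rendering API.
// Endpoint path and option names are assumptions based on this doc.
interface ContentRequest {
  url: string;
  waitUntil?: "load" | "domcontentloaded" | "networkidle0";
  rejectResourceTypes?: string[];
  timeout?: number; // milliseconds
}

function buildContentRequest(
  accountId: string,
  target: string,
): { endpoint: string; body: ContentRequest } {
  return {
    endpoint: `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/content`,
    body: {
      url: target,
      waitUntil: "networkidle0",              // wait for JS-rendered content
      rejectResourceTypes: ["image", "font"], // skip heavy assets to save quota
      timeout: 30_000,                        // matches the 30 s default noted below
    },
  };
}

const req = buildContentRequest("<ACCOUNT_ID>", "https://www.vpsbenchmarks.com/");
console.log(req.endpoint);
```

The request body would then be sent with `fetch(req.endpoint, { method: "POST", ... })` plus an API token header.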
### Error Handling
The scraper includes multiple fallback strategies:
1. Try `/content` endpoint first (full HTML)
2. If no data found, try `/scrape` endpoint (targeted extraction)
3. Multiple parsing patterns for different HTML structures
4. Graceful degradation if API fails
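The fallback order above can be sketched as a simple chain, with the two fetchers injected so the strategy itself is visible; the function names and `Benchmark` shape are illustrative, not the scraper's actual code:

```typescript
type Benchmark = { provider: string; plan: string };

// Try the full-HTML path first, then the targeted /scrape path.
// Returns [] (graceful degradation) if both fail or find nothing.
async function scrapeWithFallback(
  fetchContent: () => Promise<Benchmark[]>,
  fetchScrape: () => Promise<Benchmark[]>,
): Promise<Benchmark[]> {
  try {
    const fromContent = await fetchContent();
    if (fromContent.length > 0) return fromContent;
  } catch (err) {
    console.warn("[Scraper] /content failed, falling back to /scrape", err);
  }
  try {
    return await fetchScrape();
  } catch (err) {
    console.error("[Scraper] /scrape also failed, returning no data", err);
    return [];
  }
}
```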
### Troubleshooting
**No benchmarks found:**
- Check if vpsbenchmarks.com changed their HTML structure
- Examine the rendered HTML output in logs
- Adjust CSS selectors in `scrapeBenchmarksWithScrapeAPI()`
- Update parsing patterns in `extractBenchmarksFromHTML()`
**Browser Rendering API errors:**
- Check daily quota usage
- Verify BROWSER binding is configured in wrangler.toml
- Check network connectivity to Browser Rendering API
- Review timeout settings (default: 30 seconds)
**Database insertion errors:**
- Verify vps_benchmarks table schema
- Check unique constraint on (provider_name, plan_name, country_code)
- Ensure all required fields are not null
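Given that unique key, the insert/update step presumably relies on SQLite's upsert syntax. A minimal sketch of such a statement; only the three key columns come from this document, and the remaining column names are assumptions:

```typescript
// SQLite upsert keyed on the assumed unique constraint
// (provider_name, plan_name, country_code). Non-key columns are illustrative.
function buildUpsertSQL(): string {
  return [
    "INSERT INTO vps_benchmarks (provider_name, plan_name, country_code, score, created_at)",
    "VALUES (?1, ?2, ?3, ?4, CURRENT_TIMESTAMP)",
    "ON CONFLICT(provider_name, plan_name, country_code)",
    "DO UPDATE SET score = excluded.score",
  ].join(" ");
}

const sql = buildUpsertSQL();
console.log(sql);
```

A statement of this form would be bound and run via D1's `env.DB.prepare(sql).bind(...).run()`.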
## Next Steps
After successful testing:
1. **Analyze the rendered HTML** to identify exact CSS selectors
2. **Update parsing logic** based on actual site structure
3. **Test with real data** to ensure accuracy
4. **Monitor logs** after deployment for any issues
5. **Validate data quality** in the database