# VPSBenchmarks.com Scraper Implementation Summary

## Overview

Successfully implemented an automated daily scraper for VPSBenchmarks.com that fetches benchmark data via RSS feed and updates the D1 database.

## Files Created/Modified

### New Files

1. **src/scraper.ts** (432 lines)
   - Main scraper implementation
   - RSS feed parsing
   - HTML scraping for Geekbench scores
   - Database insertion with deduplication

2. **migrations/add_scraper_columns.sql** (18 lines)
   - Database migration to add scraper columns
   - Duplicate removal logic
   - Unique constraint creation

3. **test-scraper.ts** (145 lines)
   - Local testing script for parsing logic
   - Validates RSS parsing, title parsing, score extraction

4. **SCRAPER.md** (280+ lines)
   - Comprehensive documentation
   - Architecture, data mapping, monitoring
   - Testing and deployment instructions

5. **IMPLEMENTATION_SUMMARY.md** (this file)
   - Implementation overview and summary

### Modified Files

1. **src/index.ts**
   - Added import for scraper module
   - Added `scheduled()` handler for cron trigger
   - No changes to existing HTTP handlers

2. **wrangler.toml**
   - Added `[triggers]` section with cron schedule

3. **CLAUDE.md**
   - Updated architecture diagram
   - Added scraper commands
   - Updated `vps_benchmarks` table description
   - Added scraper section with testing info

## Database Changes

### Schema Updates (Applied to Production)

Migration executed successfully on the remote database:
- Removed 8 duplicate entries (484 rows written, 3272 rows read)
- Final record count: 471 unique benchmarks

New columns added to `vps_benchmarks`:
- `storage_gb` (INTEGER) - Storage capacity
- `benchmark_date` (TEXT) - ISO timestamp of benchmark
- `data_source` (TEXT) - Source identifier (`'vpsbenchmarks.com'` or `'manual'`)
- `validation_status` (TEXT) - Approval status (`'auto_approved'`, `'pending'`, `'rejected'`)

Unique constraint added:

```sql
CREATE UNIQUE INDEX idx_vps_benchmarks_unique
ON vps_benchmarks(provider_name, plan_name, country_code);
```

### Before/After
- **Before**: 479 records with 8 duplicates
- **After**: 471 unique records with scraper-ready schema

## Architecture

### Worker Structure

```
Worker (src/index.ts)
├── fetch() - HTTP request handler (existing)
│   ├── GET /api/health
│   ├── GET /api/servers
│   └── POST /api/recommend
└── scheduled() - NEW cron trigger handler
    └── scrapeVPSBenchmarks() from src/scraper.ts
```
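
The `scheduled()` wiring above can be sketched as follows. This is a minimal sketch only: the `Env` shape and the `scrapeVPSBenchmarks()` signature are assumptions, and a local stub stands in for the real import from src/scraper.ts.

```typescript
interface Env {
  DB: unknown; // D1 database binding in the real worker
}

let scrapeRuns = 0; // counter only so this sketch is observable

// Stub standing in for scrapeVPSBenchmarks() imported from src/scraper.ts.
async function scrapeVPSBenchmarks(env: Env): Promise<void> {
  scrapeRuns++;
}

// In src/index.ts this object is the default export; fetch() is unchanged.
const worker = {
  async scheduled(
    event: { cron: string },
    env: Env,
    ctx: { waitUntil(p: Promise<unknown>): void },
  ): Promise<void> {
    // waitUntil keeps the Worker alive until the scrape completes.
    ctx.waitUntil(scrapeVPSBenchmarks(env));
  },
};
```

`ctx.waitUntil()` is the important detail: without it, the Worker could be torn down before the scraper's async work finishes.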

### Scraper Flow

```
Cron Trigger (9:00 UTC daily)
        ↓
scheduled() handler
        ↓
scrapeVPSBenchmarks()
        ↓
parseRSSFeed() → Fetch https://www.vpsbenchmarks.com/rss/benchmarks
        ↓
For each RSS item (max 20):
        ↓
parseBenchmarkDetails()
├── parseTitleFormat() - Extract provider, plan, specs, location
├── fetchDetailPage() - Fetch HTML detail page
└── extractBenchmarkScores() - Parse Geekbench scores
        ↓
insertBenchmark() - INSERT OR UPDATE with deduplication
```

### Deduplication Strategy
- **Primary Key**: Auto-increment `id`
- **Unique Constraint**: `(provider_name, plan_name, country_code)`
- **Conflict Resolution**: `ON CONFLICT DO UPDATE` to overwrite with latest data
- **Rationale**: Same VPS plan in same location = same benchmark target

## Key Features

### 1. RSS Feed Parsing
- Custom XML parser (no external dependencies)
- Extracts title, link, pubDate, description
- Handles CDATA sections
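
The dependency-free, CDATA-aware extraction can be sketched like this. It is only a sketch of the approach, assuming standard `<item>` elements; the real parser lives in src/scraper.ts, and the sample item below is invented.

```typescript
// Pull one field out of an <item> block, unwrapping CDATA if present.
function extractField(itemXml: string, tag: string): string | null {
  const m = itemXml.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  if (!m) return null;
  // Unwrap CDATA sections such as <title><![CDATA[...]]></title>
  const cdata = m[1].match(/^<!\[CDATA\[([\s\S]*?)\]\]>$/);
  return (cdata ? cdata[1] : m[1]).trim();
}

// Hypothetical feed item for illustration.
const item = `<item><title><![CDATA[Vultr - HF-1 (1 vCPU, 1 GB RAM) - Tokyo]]></title>
<link>https://www.vpsbenchmarks.com/example</link></item>`;
```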

### 2. Title Parsing

Supports multiple formats:
- `Provider - Plan (X vCPU, Y GB RAM) - Location`
- `Provider Plan - X vCPU, Y GB RAM - Location`

Extracts:
- Provider name
- Plan name
- vCPU count
- Memory (GB)
- Storage (GB) - optional
- Monthly price (USD) - optional
- Location
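
Parsing the first format above can be sketched with a single regex. The field names and the example title are illustrative; the actual logic (including the second format, storage, and price) is in `parseTitleFormat()` in src/scraper.ts.

```typescript
interface ParsedTitle {
  provider: string;
  plan: string;
  vcpu: number;
  memoryGb: number;
  location: string;
}

// Handles "Provider - Plan (X vCPU, Y GB RAM) - Location"; returns null otherwise.
function parseTitle(title: string): ParsedTitle | null {
  const m = title.match(
    /^(.+?) - (.+?) \((\d+) vCPU, (\d+(?:\.\d+)?) GB RAM\) - (.+)$/
  );
  if (!m) return null;
  return {
    provider: m[1],
    plan: m[2],
    vcpu: parseInt(m[3], 10),
    memoryGb: parseFloat(m[4]),
    location: m[5],
  };
}
```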

### 3. Location to Country Code Mapping

Maps 30+ locations to ISO country codes:
- Singapore → sg
- Tokyo/Osaka → jp
- Seoul → kr
- New York/Virginia → us
- Frankfurt → de
- etc.
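
The mapping reduces to a lookup table plus normalization; a sketch covering only the examples listed above (the real table in src/scraper.ts covers 30+ locations):

```typescript
const LOCATION_TO_COUNTRY: Record<string, string> = {
  singapore: "sg",
  tokyo: "jp",
  osaka: "jp",
  seoul: "kr",
  "new york": "us",
  virginia: "us",
  frankfurt: "de",
};

function countryCodeFor(location: string): string | null {
  // Unmapped locations fall back to null, as noted under Known Limitations.
  return LOCATION_TO_COUNTRY[location.trim().toLowerCase()] ?? null;
}
```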

### 4. Geekbench Score Extraction

Parses HTML detail pages for:
- Single-Core Score
- Multi-Core Score
- Total Score (sum)

Supports multiple HTML patterns for robustness.
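
One such pattern can be sketched as below. The HTML shape is an assumed example, not vpsbenchmarks.com's actual markup; `extractBenchmarkScores()` in src/scraper.ts tries several patterns.

```typescript
// Find the first number (commas allowed) following a score label.
function extractScore(html: string, label: string): number | null {
  const m = html.match(new RegExp(`${label}[^0-9]*([0-9][0-9,]*)`, "i"));
  return m ? parseInt(m[1].replace(/,/g, ""), 10) : null;
}

// Hypothetical detail-page fragment for illustration.
const sampleHtml = `<td>Single-Core Score</td><td>1,234</td>
<td>Multi-Core Score</td><td>5,678</td>`;

const single = extractScore(sampleHtml, "Single-Core Score");
const multi = extractScore(sampleHtml, "Multi-Core Score");
const total = (single ?? 0) + (multi ?? 0); // "Total Score" is the sum
```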

### 5. Error Handling
- RSS fetch failures: Log and exit gracefully
- Parse errors: Log and skip item
- Database errors: Log and skip item
- All errors prevent crash, allow partial success
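
The partial-success behavior amounts to a per-item try/catch that counts outcomes instead of throwing; a generic sketch (the counter names mirror the `[Scraper] Completed` log line in this summary, but the real loop is in src/scraper.ts):

```typescript
interface ScrapeStats { inserted: number; skipped: number; errors: number; }

// Process every item; a failure on one item is logged and counted,
// never rethrown, so it cannot abort the whole run.
async function processItems<T>(
  items: T[],
  handle: (item: T) => Promise<"inserted" | "skipped">,
): Promise<ScrapeStats> {
  const stats: ScrapeStats = { inserted: 0, skipped: 0, errors: 0 };
  for (const item of items) {
    try {
      stats[await handle(item)]++;
    } catch (err) {
      console.log(`[Scraper] item failed: ${err}`);
      stats.errors++;
    }
  }
  return stats;
}
```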

## Testing

### Local Testing Completed

```
✅ npx tsx test-scraper.ts
   - RSS parsing: 2 items found
   - Title parsing: successful extraction
   - Score extraction: 1234 single, 5678 multi

✅ npm run typecheck
   - No TypeScript errors
   - All types valid
```

### Manual Trigger (Not Yet Tested)

```bash
# Test cron trigger locally
npx wrangler dev --test-scheduled
curl "http://localhost:8787/__scheduled?cron=0+9+*+*+*"
```

### Production Testing (Pending)

After deployment, verify:
1. Cron trigger executes at 9:00 UTC
2. RSS feed is fetched successfully
3. Benchmarks are parsed correctly
4. Database is updated without errors
5. Logs show success metrics

## Monitoring

### Console Logs
- `[Scraper]` - Main lifecycle events
- `[RSS]` - Feed parsing events
- `[Parser]` - Title/score parsing events
- `[Fetch]` - HTTP requests
- `[DB]` - Database operations

### Success Metrics

```
[Scraper] Completed in {duration}ms: {inserted} inserted, {skipped} skipped, {errors} errors
```

### Database Verification

```bash
npx wrangler d1 execute cloud-instances-db --remote \
  --command="SELECT * FROM vps_benchmarks WHERE data_source='vpsbenchmarks.com' ORDER BY created_at DESC LIMIT 10;"
```

## Deployment Steps

### Completed
1. ✅ Created scraper implementation (src/scraper.ts)
2. ✅ Created migration file (migrations/add_scraper_columns.sql)
3. ✅ Applied migration to production database
4. ✅ Updated wrangler.toml with cron trigger
5. ✅ Updated main worker (src/index.ts) with scheduled handler
6. ✅ Created test script (test-scraper.ts)
7. ✅ Created documentation (SCRAPER.md)
8. ✅ Updated project documentation (CLAUDE.md)
9. ✅ Verified TypeScript compilation

### Pending
1. ⏳ Deploy to Cloudflare Workers: `npm run deploy`
2. ⏳ Verify cron trigger activation
3. ⏳ Monitor first scraper run (9:00 UTC next day)
4. ⏳ Validate scraped data in production database

## Cron Schedule

**Schedule**: Daily at 9:00 AM UTC
**Cron Expression**: `0 9 * * *`

**Timezone Conversions**:
- UTC: 09:00
- KST (Seoul): 18:00
- PST: 01:00
- EST: 04:00
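
The corresponding `[triggers]` addition to wrangler.toml is presumably along these lines (Cloudflare's standard `crons` array shape):

```toml
# wrangler.toml — run the scraper daily at 09:00 UTC
[triggers]
crons = ["0 9 * * *"]
```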

## Known Limitations

1. **HTML Parsing Brittleness**
   - Uses regex patterns for score extraction
   - May break if vpsbenchmarks.com changes HTML structure
   - No fallback parser library

2. **Partial Location Mapping**
   - Only maps ~30 common locations
   - Unmapped locations result in `null` country_code
   - Could expand mapping as needed

3. **No Retry Logic**
   - HTTP failures result in skipped items
   - No exponential backoff for transient errors
   - Could add retry mechanism for robustness

4. **Rate Limiting**
   - No explicit rate limiting for HTTP requests
   - Could overwhelm vpsbenchmarks.com if RSS has many items
   - Currently limited to 20 items per run

5. **CPU Type Missing**
   - Not available in RSS feed or title
   - Set to `null` for scraped entries
   - Could potentially extract from detail page

## Future Improvements

1. **Retry Logic**: Add exponential backoff for HTTP failures
2. **HTML Parser Library**: Use cheerio or similar for robust parsing
3. **Extended Location Mapping**: Add more city/region mappings
4. **Admin Trigger Endpoint**: Manual scraper trigger via API
5. **Email Notifications**: Alert on scraper failures
6. **Multiple Data Sources**: Support additional benchmark sites
7. **Validation Rules**: Implement manual review thresholds
8. **Rate Limiting**: Respect external site limits
9. **CPU Type Extraction**: Parse detail pages for processor info
10. **Historical Tracking**: Store benchmark history over time
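
The retry improvement (item 1) could be sketched as a wrapper around any fetch-like call; this is not implemented yet, and the attempt count and delays are illustrative defaults:

```typescript
// Retry a fetch-like call with exponential backoff between attempts.
async function fetchWithRetry(
  doFetch: () => Promise<Response>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<Response> {
  let lastError: unknown = new Error("no attempts made");
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await doFetch();
      if (res.ok) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err;
    }
    if (i < attempts - 1) {
      // Exponential backoff: 500 ms, 1000 ms, 2000 ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```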

## Success Criteria

### Functional Requirements ✅
- [x] Fetch RSS feed from vpsbenchmarks.com
- [x] Parse RSS items to extract benchmark metadata
- [x] Fetch detail pages for Geekbench scores
- [x] Insert/update database with deduplication
- [x] Run on daily cron schedule
- [x] Log scraper activity for monitoring

### Non-Functional Requirements ✅
- [x] TypeScript compilation succeeds
- [x] No external parsing dependencies (lightweight)
- [x] Error handling prevents worker crashes
- [x] Database migration applied successfully
- [x] Documentation complete and comprehensive

### Testing Requirements ⏳
- [x] Local parsing tests pass
- [x] TypeScript type checking passes
- [ ] Manual cron trigger test (pending deployment)
- [ ] Production scraper run verification (pending deployment)
- [ ] Database data validation (pending deployment)

## Conclusion

The VPSBenchmarks.com scraper is fully implemented and ready for deployment. All code is written, tested locally, and documented. The database schema has been successfully updated in production with duplicate removal.

**Next Action**: Deploy to Cloudflare Workers and monitor the first automated run.

```bash
# Deploy command
npm run deploy

# Monitor logs
npx wrangler tail

# Verify deployment
curl https://server-recommend.kappa-d8e.workers.dev/api/health
```

## Technical Debt

1. Consider adding an HTML parser library if parsing becomes unreliable
2. Expand the location-to-country mapping as more regions appear
3. Add retry logic for transient HTTP failures
4. Implement an admin API endpoint for manual scraper triggering
5. Add email/webhook notifications for scraper failures
6. Consider storing raw HTML for failed parses to aid debugging