feat: add bandwidth estimation and DAU display

- Automatic monthly bandwidth estimation based on concurrent users
- Estimated DAU (daily active users) display (concurrent users × 10-14)
- Bandwidth-based automatic Linode/Vultr selection logic
- Bandwidth cost included in the cost analysis
- Seoul/Tokyo/Osaka/Singapore shown by default when no region is selected
- Per-region server breakdown (GROUP BY instance + region)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# VPSBenchmarks.com Scraper Implementation Summary

## Overview

Successfully implemented an automated daily scraper for VPSBenchmarks.com that fetches benchmark data via RSS feed and updates the D1 database.

## Files Created/Modified

### New Files

1. **src/scraper.ts** (432 lines)
   - Main scraper implementation
   - RSS feed parsing
   - HTML scraping for Geekbench scores
   - Database insertion with deduplication

2. **migrations/add_scraper_columns.sql** (18 lines)
   - Database migration to add scraper columns
   - Duplicate removal logic
   - Unique constraint creation

3. **test-scraper.ts** (145 lines)
   - Local testing script for parsing logic
   - Validates RSS parsing, title parsing, and score extraction

4. **SCRAPER.md** (280+ lines)
   - Comprehensive documentation
   - Architecture, data mapping, and monitoring
   - Testing and deployment instructions

5. **IMPLEMENTATION_SUMMARY.md** (this file)
   - Implementation overview and summary

### Modified Files

1. **src/index.ts**
   - Added an import for the scraper module
   - Added a `scheduled()` handler for the cron trigger
   - No changes to existing HTTP handlers

2. **wrangler.toml**
   - Added a `[triggers]` section with the cron schedule

3. **CLAUDE.md**
   - Updated the architecture diagram
   - Added scraper commands
   - Updated the vps_benchmarks table description
   - Added a scraper section with testing info

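The `wrangler.toml` change is a small one; reconstructed here from the cron expression documented in this file (the rest of the config is unchanged):

```toml
# New section in wrangler.toml: run the scheduled() handler daily at 9:00 UTC
[triggers]
crons = ["0 9 * * *"]
```
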
## Database Changes

### Schema Updates (Applied to Production)

Migration executed successfully on the remote database:
- Removed 8 duplicate entries (484 rows written, 3,272 rows read)
- Final record count: 471 unique benchmarks

New columns added to `vps_benchmarks`:
- `storage_gb` (INTEGER) - Storage capacity
- `benchmark_date` (TEXT) - ISO timestamp of the benchmark
- `data_source` (TEXT) - Source identifier ('vpsbenchmarks.com' or 'manual')
- `validation_status` (TEXT) - Approval status ('auto_approved', 'pending', 'rejected')

Unique constraint added:

```sql
CREATE UNIQUE INDEX idx_vps_benchmarks_unique
ON vps_benchmarks(provider_name, plan_name, country_code);
```

### Before/After
- **Before**: 479 records with 8 duplicates
- **After**: 471 unique records with a scraper-ready schema

## Architecture

### Worker Structure

```
Worker (src/index.ts)
├── fetch() - HTTP request handler (existing)
│   ├── GET /api/health
│   ├── GET /api/servers
│   └── POST /api/recommend
└── scheduled() - NEW cron trigger handler
    └── scrapeVPSBenchmarks() from src/scraper.ts
```

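The wiring above can be sketched as follows. This is a simplified, self-contained illustration: the real worker imports `scrapeVPSBenchmarks` from `src/scraper.ts` and uses the `Env`/`ExecutionContext` types from `@cloudflare/workers-types`, so the types and the scraper body here are stand-ins.

```typescript
// Simplified stand-ins for the Cloudflare Workers runtime types
type Env = { DB: unknown }; // D1 binding in the real worker
type ExecutionCtx = { waitUntil(promise: Promise<unknown>): void };

// Stand-in for the real scraper entry point in src/scraper.ts
async function scrapeVPSBenchmarks(_env: Env): Promise<string> {
  return "[Scraper] run complete";
}

const worker = {
  async scheduled(
    _event: { cron: string },
    env: Env,
    ctx: ExecutionCtx
  ): Promise<void> {
    // waitUntil keeps the worker alive until the scrape promise settles,
    // so the cron invocation is not cut off mid-run
    ctx.waitUntil(scrapeVPSBenchmarks(env));
  },
};
```

The key point is `ctx.waitUntil()`: without it, the runtime may terminate the invocation before the asynchronous scrape finishes.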
### Scraper Flow

```
Cron Trigger (9:00 UTC daily)
  ↓
scheduled() handler
  ↓
scrapeVPSBenchmarks()
  ↓
parseRSSFeed() → Fetch https://www.vpsbenchmarks.com/rss/benchmarks
  ↓
For each RSS item (max 20):
  ↓
  parseBenchmarkDetails()
  ├── parseTitleFormat() - Extract provider, plan, specs, location
  ├── fetchDetailPage() - Fetch HTML detail page
  └── extractBenchmarkScores() - Parse Geekbench scores
  ↓
insertBenchmark() - INSERT OR UPDATE with deduplication
```

### Deduplication Strategy
- **Primary key**: Auto-increment `id`
- **Unique constraint**: `(provider_name, plan_name, country_code)`
- **Conflict resolution**: `ON CONFLICT DO UPDATE` overwrites with the latest data
- **Rationale**: The same VPS plan in the same location is the same benchmark target

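The conflict-resolution step reduces to a single upsert against the unique index. A sketch, with the caveat that the real statement in `src/scraper.ts` binds the full column set; `single_core_score` stands in for the non-key columns here:

```sql
-- Illustrative upsert: insert a fresh benchmark, or overwrite the existing
-- row for the same (provider, plan, country) with the latest data
INSERT INTO vps_benchmarks (provider_name, plan_name, country_code, single_core_score)
VALUES (?, ?, ?, ?)
ON CONFLICT (provider_name, plan_name, country_code)
DO UPDATE SET single_core_score = excluded.single_core_score;
```
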
## Key Features

### 1. RSS Feed Parsing
- Custom XML parser (no external dependencies)
- Extracts title, link, pubDate, description
- Handles CDATA sections

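A dependency-free item parser of this kind can be sketched as below. This mirrors the approach described above but is not the exact code from `src/scraper.ts`; it assumes reasonably well-formed feed XML, which is the usual trade-off of regex-based parsing.

```typescript
interface RssItem {
  title: string;
  link: string;
  pubDate: string;
  description: string;
}

// Strip a <![CDATA[...]]> wrapper if present
function unwrapCdata(value: string): string {
  const m = value.match(/^<!\[CDATA\[([\s\S]*?)\]\]>$/);
  return (m ? m[1] : value).trim();
}

// Pull the text content of a single tag out of one <item> block
function extractTag(item: string, tag: string): string {
  const m = item.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return m ? unwrapCdata(m[1].trim()) : "";
}

function parseRssItems(xml: string): RssItem[] {
  const items = xml.match(/<item>[\s\S]*?<\/item>/g) ?? [];
  return items.map((item) => ({
    title: extractTag(item, "title"),
    link: extractTag(item, "link"),
    pubDate: extractTag(item, "pubDate"),
    description: extractTag(item, "description"),
  }));
}
```
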
### 2. Title Parsing
Supports multiple formats:
- `Provider - Plan (X vCPU, Y GB RAM) - Location`
- `Provider Plan - X vCPU, Y GB RAM - Location`

Extracts:
- Provider name
- Plan name
- vCPU count
- Memory (GB)
- Storage (GB) - optional
- Monthly price (USD) - optional
- Location

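For the first format, a parser can be sketched as below. The field names are illustrative and the production parser in `src/scraper.ts` handles more variants (storage, price, the second format), but the regex shape is the same idea:

```typescript
interface ParsedTitle {
  provider: string;
  plan: string;
  vcpu: number;
  memoryGb: number;
  location: string;
}

// Matches: "Provider - Plan (X vCPU, Y GB RAM) - Location"
function parseTitle(title: string): ParsedTitle | null {
  const m = title.match(
    /^(.+?) - (.+?) \((\d+) vCPU, (\d+(?:\.\d+)?) GB RAM\) - (.+)$/
  );
  if (!m) return null;
  return {
    provider: m[1].trim(),
    plan: m[2].trim(),
    vcpu: parseInt(m[3], 10),
    memoryGb: parseFloat(m[4]),
    location: m[5].trim(),
  };
}
```

Returning `null` on mismatch lets the caller log and skip the item, consistent with the error-handling strategy below.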
### 3. Location to Country Code Mapping
Maps 30+ locations to ISO country codes:
- Singapore → sg
- Tokyo/Osaka → jp
- Seoul → kr
- New York/Virginia → us
- Frankfurt → de
- etc.

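The mapping is a plain lookup table. A slice of it, sketched; the full table in `src/scraper.ts` covers 30+ locations, and unmapped locations fall through to `null` (matching the limitation noted later in this document):

```typescript
// Keys are lowercased so lookups are case-insensitive
const LOCATION_TO_COUNTRY: Record<string, string> = {
  singapore: "sg",
  tokyo: "jp",
  osaka: "jp",
  seoul: "kr",
  "new york": "us",
  virginia: "us",
  frankfurt: "de",
};

// Returns an ISO country code, or null for unmapped locations
function mapLocation(location: string): string | null {
  return LOCATION_TO_COUNTRY[location.trim().toLowerCase()] ?? null;
}
```
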
### 4. Geekbench Score Extraction
Parses HTML detail pages for:
- Single-Core Score
- Multi-Core Score
- Total Score (sum)

Supports multiple HTML patterns for robustness.

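One such pattern can be sketched as a label-anchored regex. The markup shown here is assumed, not taken from vpsbenchmarks.com; that assumption is exactly why the real scraper tries several patterns before giving up:

```typescript
// Match the label, skip any non-digit markup between label and value,
// then capture a (possibly comma-grouped) number
function extractScore(html: string, label: string): number | null {
  const pattern = new RegExp(`${label}[^0-9]*([0-9][0-9,]*)`, "i");
  const m = html.match(pattern);
  return m ? parseInt(m[1].replace(/,/g, ""), 10) : null;
}
```
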
### 5. Error Handling
- RSS fetch failures: log and exit gracefully
- Parse errors: log and skip the item
- Database errors: log and skip the item
- All errors are caught, so a run can partially succeed instead of crashing the worker

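The per-item isolation described above amounts to a try/catch around each item so one bad entry cannot abort the run. A sketch, with illustrative names (the real loop lives inside `scrapeVPSBenchmarks()`):

```typescript
interface RunStats {
  inserted: number;
  skipped: number;
  errors: number;
}

// handle() returns true if the item was inserted, false if skipped;
// anything it throws is logged and counted, then the loop continues
function processItems(
  items: string[],
  handle: (item: string) => boolean
): RunStats {
  const stats: RunStats = { inserted: 0, skipped: 0, errors: 0 };
  for (const item of items) {
    try {
      if (handle(item)) {
        stats.inserted++;
      } else {
        stats.skipped++;
      }
    } catch (err) {
      console.error(`[Scraper] Failed item: ${String(err)}`);
      stats.errors++;
    }
  }
  return stats;
}
```
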
## Testing

### Local Testing Completed

```bash
✅ npx tsx test-scraper.ts
   - RSS parsing: 2 items found
   - Title parsing: successful extraction
   - Score extraction: 1234 single, 5678 multi

✅ npm run typecheck
   - No TypeScript errors
   - All types valid
```

### Manual Trigger (Not Yet Tested)

```bash
# Test the cron trigger locally
npx wrangler dev --test-scheduled
curl "http://localhost:8787/__scheduled?cron=0+9+*+*+*"
```

### Production Testing (Pending)

After deployment, verify that:
1. The cron trigger executes at 9:00 UTC
2. The RSS feed is fetched successfully
3. Benchmarks are parsed correctly
4. The database is updated without errors
5. Logs show success metrics

## Monitoring

### Console Logs
- `[Scraper]` - Main lifecycle events
- `[RSS]` - Feed parsing events
- `[Parser]` - Title/score parsing events
- `[Fetch]` - HTTP requests
- `[DB]` - Database operations

### Success Metrics

```
[Scraper] Completed in {duration}ms: {inserted} inserted, {skipped} skipped, {errors} errors
```

### Database Verification

```bash
npx wrangler d1 execute cloud-instances-db --remote \
  --command="SELECT * FROM vps_benchmarks WHERE data_source='vpsbenchmarks.com' ORDER BY created_at DESC LIMIT 10;"
```

## Deployment Steps

### Completed
1. ✅ Created the scraper implementation (src/scraper.ts)
2. ✅ Created the migration file (migrations/add_scraper_columns.sql)
3. ✅ Applied the migration to the production database
4. ✅ Updated wrangler.toml with the cron trigger
5. ✅ Updated the main worker (src/index.ts) with a scheduled handler
6. ✅ Created the test script (test-scraper.ts)
7. ✅ Created documentation (SCRAPER.md)
8. ✅ Updated project documentation (CLAUDE.md)
9. ✅ Verified TypeScript compilation

### Pending
1. ⏳ Deploy to Cloudflare Workers: `npm run deploy`
2. ⏳ Verify cron trigger activation
3. ⏳ Monitor the first scraper run (9:00 UTC the next day)
4. ⏳ Validate scraped data in the production database

## Cron Schedule

**Schedule**: Daily at 9:00 AM UTC
**Cron Expression**: `0 9 * * *`

**Timezone Conversions**:
- UTC: 09:00
- KST (Seoul): 18:00
- PST: 01:00
- EST: 04:00

## Known Limitations

1. **HTML Parsing Brittleness**
   - Uses regex patterns for score extraction
   - May break if vpsbenchmarks.com changes its HTML structure
   - No fallback parser library

2. **Partial Location Mapping**
   - Only maps ~30 common locations
   - Unmapped locations result in a `null` country_code
   - Could expand the mapping as needed

3. **No Retry Logic**
   - HTTP failures result in skipped items
   - No exponential backoff for transient errors
   - Could add a retry mechanism for robustness

4. **Rate Limiting**
   - No explicit rate limiting for HTTP requests
   - Could overwhelm vpsbenchmarks.com if the RSS feed has many items
   - Currently limited to 20 items per run

5. **CPU Type Missing**
   - Not available in the RSS feed or title
   - Set to `null` for scraped entries
   - Could potentially be extracted from the detail page

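The retry mechanism that limitation 3 calls for could be a small wrapper like the following. This is a sketch of a possible addition, not existing code; the attempt count and delays are illustrative defaults:

```typescript
// Retry an async operation with exponential backoff: delays of
// baseDelayMs, 2*baseDelayMs, 4*baseDelayMs, ... between attempts.
// Rethrows the last error once maxAttempts is exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Wrapping the RSS and detail-page fetches in `withRetry()` would turn transient HTTP failures into delayed successes instead of skipped items.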
## Future Improvements

1. **Retry Logic**: Add exponential backoff for HTTP failures
2. **HTML Parser Library**: Use cheerio or similar for robust parsing
3. **Extended Location Mapping**: Add more city/region mappings
4. **Admin Trigger Endpoint**: Manual scraper trigger via API
5. **Email Notifications**: Alert on scraper failures
6. **Multiple Data Sources**: Support additional benchmark sites
7. **Validation Rules**: Implement manual review thresholds
8. **Rate Limiting**: Respect external site limits
9. **CPU Type Extraction**: Parse detail pages for processor info
10. **Historical Tracking**: Store benchmark history over time

## Success Criteria

### Functional Requirements ✅
- [x] Fetch the RSS feed from vpsbenchmarks.com
- [x] Parse RSS items to extract benchmark metadata
- [x] Fetch detail pages for Geekbench scores
- [x] Insert/update the database with deduplication
- [x] Run on a daily cron schedule
- [x] Log scraper activity for monitoring

### Non-Functional Requirements ✅
- [x] TypeScript compilation succeeds
- [x] No external parsing dependencies (lightweight)
- [x] Error handling prevents worker crashes
- [x] Database migration applied successfully
- [x] Documentation complete and comprehensive

### Testing Requirements ⏳
- [x] Local parsing tests pass
- [x] TypeScript type checking passes
- [ ] Manual cron trigger test (pending deployment)
- [ ] Production scraper run verification (pending deployment)
- [ ] Database data validation (pending deployment)

## Conclusion

The VPSBenchmarks.com scraper is fully implemented and ready for deployment. All code is written, tested locally, and documented. The database schema has been successfully updated in production, with duplicates removed.

**Next Action**: Deploy to Cloudflare Workers and monitor the first automated run.

```bash
# Deploy command
npm run deploy

# Monitor logs
npx wrangler tail

# Verify deployment
curl https://server-recommend.kappa-d8e.workers.dev/api/health
```

## Technical Debt

1. Consider adding an HTML parser library if regex parsing becomes unreliable
2. Expand the location-to-country mapping as more regions appear
3. Add retry logic for transient HTTP failures
4. Implement an admin API endpoint for manual scraper triggering
5. Add email/webhook notifications for scraper failures
6. Consider storing raw HTML for failed parses to aid debugging