feat: add bandwidth estimation and DAU display

- Automatic monthly bandwidth estimation based on concurrent users
- Estimated DAU (daily active users) display (concurrent users × 10-14)
- Bandwidth-based automatic Linode/Vultr selection logic
- Bandwidth cost included in the cost analysis
- Seoul/Tokyo/Osaka/Singapore shown by default when no region is selected
- Per-region server breakdown (GROUP BY instance + region)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# VPSBenchmarks.com Scraper Implementation Summary

## Overview

Successfully implemented an automated daily scraper for VPSBenchmarks.com that fetches benchmark data via RSS feed and updates the D1 database.

## Files Created/Modified

### New Files

1. **src/scraper.ts** (432 lines)
   - Main scraper implementation
   - RSS feed parsing
   - HTML scraping for Geekbench scores
   - Database insertion with deduplication

2. **migrations/add_scraper_columns.sql** (18 lines)
   - Database migration to add scraper columns
   - Duplicate removal logic
   - Unique constraint creation

3. **test-scraper.ts** (145 lines)
   - Local testing script for parsing logic
   - Validates RSS parsing, title parsing, and score extraction

4. **SCRAPER.md** (280+ lines)
   - Comprehensive documentation
   - Architecture, data mapping, and monitoring
   - Testing and deployment instructions

5. **IMPLEMENTATION_SUMMARY.md** (this file)
   - Implementation overview and summary

### Modified Files

1. **src/index.ts**
   - Added an import for the scraper module
   - Added a `scheduled()` handler for the cron trigger
   - No changes to existing HTTP handlers

2. **wrangler.toml**
   - Added a `[triggers]` section with the cron schedule

3. **CLAUDE.md**
   - Updated the architecture diagram
   - Added scraper commands
   - Updated the vps_benchmarks table description
   - Added a scraper section with testing info

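The `wrangler.toml` change is a small one; reconstructed here from the cron expression documented in this file (the rest of the config is unchanged):

```toml
# New section in wrangler.toml: run the scheduled() handler daily at 9:00 UTC
[triggers]
crons = ["0 9 * * *"]
```
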
## Database Changes

### Schema Updates (Applied to Production)

Migration executed successfully on the remote database:
- Removed 8 duplicate entries (484 rows written, 3,272 rows read)
- Final record count: 471 unique benchmarks

New columns added to `vps_benchmarks`:
- `storage_gb` (INTEGER) - Storage capacity
- `benchmark_date` (TEXT) - ISO timestamp of the benchmark
- `data_source` (TEXT) - Source identifier ('vpsbenchmarks.com' or 'manual')
- `validation_status` (TEXT) - Approval status ('auto_approved', 'pending', 'rejected')

Unique constraint added:

```sql
CREATE UNIQUE INDEX idx_vps_benchmarks_unique
ON vps_benchmarks(provider_name, plan_name, country_code);
```

### Before/After
- **Before**: 479 records with 8 duplicates
- **After**: 471 unique records with a scraper-ready schema

## Architecture

### Worker Structure

```
Worker (src/index.ts)
├── fetch() - HTTP request handler (existing)
│   ├── GET /api/health
│   ├── GET /api/servers
│   └── POST /api/recommend
└── scheduled() - NEW cron trigger handler
    └── scrapeVPSBenchmarks() from src/scraper.ts
```

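The wiring above can be sketched as follows. This is a simplified, self-contained illustration: the real worker imports `scrapeVPSBenchmarks` from `src/scraper.ts` and uses the `Env`/`ExecutionContext` types from `@cloudflare/workers-types`, so the types and the scraper body here are stand-ins.

```typescript
// Simplified stand-ins for the Cloudflare Workers runtime types
type Env = { DB: unknown }; // D1 binding in the real worker
type ExecutionCtx = { waitUntil(promise: Promise<unknown>): void };

// Stand-in for the real scraper entry point in src/scraper.ts
async function scrapeVPSBenchmarks(_env: Env): Promise<string> {
  return "[Scraper] run complete";
}

const worker = {
  async scheduled(
    _event: { cron: string },
    env: Env,
    ctx: ExecutionCtx
  ): Promise<void> {
    // waitUntil keeps the worker alive until the scrape promise settles,
    // so the cron invocation is not cut off mid-run
    ctx.waitUntil(scrapeVPSBenchmarks(env));
  },
};
```

The key point is `ctx.waitUntil()`: without it, the runtime may terminate the invocation before the asynchronous scrape finishes.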
### Scraper Flow

```
Cron Trigger (9:00 UTC daily)
  ↓
scheduled() handler
  ↓
scrapeVPSBenchmarks()
  ↓
parseRSSFeed() → Fetch https://www.vpsbenchmarks.com/rss/benchmarks
  ↓
For each RSS item (max 20):
  ↓
  parseBenchmarkDetails()
  ├── parseTitleFormat() - Extract provider, plan, specs, location
  ├── fetchDetailPage() - Fetch HTML detail page
  └── extractBenchmarkScores() - Parse Geekbench scores
  ↓
insertBenchmark() - INSERT OR UPDATE with deduplication
```

### Deduplication Strategy
- **Primary key**: Auto-increment `id`
- **Unique constraint**: `(provider_name, plan_name, country_code)`
- **Conflict resolution**: `ON CONFLICT DO UPDATE` overwrites with the latest data
- **Rationale**: The same VPS plan in the same location is the same benchmark target

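The conflict-resolution step reduces to a single upsert against the unique index. A sketch, with the caveat that the real statement in `src/scraper.ts` binds the full column set; `single_core_score` stands in for the non-key columns here:

```sql
-- Illustrative upsert: insert a fresh benchmark, or overwrite the existing
-- row for the same (provider, plan, country) with the latest data
INSERT INTO vps_benchmarks (provider_name, plan_name, country_code, single_core_score)
VALUES (?, ?, ?, ?)
ON CONFLICT (provider_name, plan_name, country_code)
DO UPDATE SET single_core_score = excluded.single_core_score;
```
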
## Key Features

### 1. RSS Feed Parsing
- Custom XML parser (no external dependencies)
- Extracts title, link, pubDate, description
- Handles CDATA sections

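A dependency-free item parser of this kind can be sketched as below. This mirrors the approach described above but is not the exact code from `src/scraper.ts`; it assumes reasonably well-formed feed XML, which is the usual trade-off of regex-based parsing.

```typescript
interface RssItem {
  title: string;
  link: string;
  pubDate: string;
  description: string;
}

// Strip a <![CDATA[...]]> wrapper if present
function unwrapCdata(value: string): string {
  const m = value.match(/^<!\[CDATA\[([\s\S]*?)\]\]>$/);
  return (m ? m[1] : value).trim();
}

// Pull the text content of a single tag out of one <item> block
function extractTag(item: string, tag: string): string {
  const m = item.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return m ? unwrapCdata(m[1].trim()) : "";
}

function parseRssItems(xml: string): RssItem[] {
  const items = xml.match(/<item>[\s\S]*?<\/item>/g) ?? [];
  return items.map((item) => ({
    title: extractTag(item, "title"),
    link: extractTag(item, "link"),
    pubDate: extractTag(item, "pubDate"),
    description: extractTag(item, "description"),
  }));
}
```
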
### 2. Title Parsing
Supports multiple formats:
- `Provider - Plan (X vCPU, Y GB RAM) - Location`
- `Provider Plan - X vCPU, Y GB RAM - Location`

Extracts:
- Provider name
- Plan name
- vCPU count
- Memory (GB)
- Storage (GB) - optional
- Monthly price (USD) - optional
- Location

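For the first format, a parser can be sketched as below. The field names are illustrative and the production parser in `src/scraper.ts` handles more variants (storage, price, the second format), but the regex shape is the same idea:

```typescript
interface ParsedTitle {
  provider: string;
  plan: string;
  vcpu: number;
  memoryGb: number;
  location: string;
}

// Matches: "Provider - Plan (X vCPU, Y GB RAM) - Location"
function parseTitle(title: string): ParsedTitle | null {
  const m = title.match(
    /^(.+?) - (.+?) \((\d+) vCPU, (\d+(?:\.\d+)?) GB RAM\) - (.+)$/
  );
  if (!m) return null;
  return {
    provider: m[1].trim(),
    plan: m[2].trim(),
    vcpu: parseInt(m[3], 10),
    memoryGb: parseFloat(m[4]),
    location: m[5].trim(),
  };
}
```

Returning `null` on mismatch lets the caller log and skip the item, consistent with the error-handling strategy below.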
### 3. Location to Country Code Mapping
Maps 30+ locations to ISO country codes:
- Singapore → sg
- Tokyo/Osaka → jp
- Seoul → kr
- New York/Virginia → us
- Frankfurt → de
- etc.

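The mapping is a plain lookup table. A slice of it, sketched; the full table in `src/scraper.ts` covers 30+ locations, and unmapped locations fall through to `null` (matching the limitation noted later in this document):

```typescript
// Keys are lowercased so lookups are case-insensitive
const LOCATION_TO_COUNTRY: Record<string, string> = {
  singapore: "sg",
  tokyo: "jp",
  osaka: "jp",
  seoul: "kr",
  "new york": "us",
  virginia: "us",
  frankfurt: "de",
};

// Returns an ISO country code, or null for unmapped locations
function mapLocation(location: string): string | null {
  return LOCATION_TO_COUNTRY[location.trim().toLowerCase()] ?? null;
}
```
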
### 4. Geekbench Score Extraction
Parses HTML detail pages for:
- Single-Core Score
- Multi-Core Score
- Total Score (sum)

Supports multiple HTML patterns for robustness.

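One such pattern can be sketched as a label-anchored regex. The markup shown here is assumed, not taken from vpsbenchmarks.com; that assumption is exactly why the real scraper tries several patterns before giving up:

```typescript
// Match the label, skip any non-digit markup between label and value,
// then capture a (possibly comma-grouped) number
function extractScore(html: string, label: string): number | null {
  const pattern = new RegExp(`${label}[^0-9]*([0-9][0-9,]*)`, "i");
  const m = html.match(pattern);
  return m ? parseInt(m[1].replace(/,/g, ""), 10) : null;
}
```
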
### 5. Error Handling
- RSS fetch failures: log and exit gracefully
- Parse errors: log and skip the item
- Database errors: log and skip the item
- All errors are caught, so a run can partially succeed instead of crashing the worker

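The per-item isolation described above amounts to a try/catch around each item so one bad entry cannot abort the run. A sketch, with illustrative names (the real loop lives inside `scrapeVPSBenchmarks()`):

```typescript
interface RunStats {
  inserted: number;
  skipped: number;
  errors: number;
}

// handle() returns true if the item was inserted, false if skipped;
// anything it throws is logged and counted, then the loop continues
function processItems(
  items: string[],
  handle: (item: string) => boolean
): RunStats {
  const stats: RunStats = { inserted: 0, skipped: 0, errors: 0 };
  for (const item of items) {
    try {
      if (handle(item)) {
        stats.inserted++;
      } else {
        stats.skipped++;
      }
    } catch (err) {
      console.error(`[Scraper] Failed item: ${String(err)}`);
      stats.errors++;
    }
  }
  return stats;
}
```
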
## Testing

### Local Testing Completed

```bash
✅ npx tsx test-scraper.ts
   - RSS parsing: 2 items found
   - Title parsing: successful extraction
   - Score extraction: 1234 single, 5678 multi

✅ npm run typecheck
   - No TypeScript errors
   - All types valid
```

### Manual Trigger (Not Yet Tested)

```bash
# Test the cron trigger locally
npx wrangler dev --test-scheduled
curl "http://localhost:8787/__scheduled?cron=0+9+*+*+*"
```

### Production Testing (Pending)

After deployment, verify that:
1. The cron trigger executes at 9:00 UTC
2. The RSS feed is fetched successfully
3. Benchmarks are parsed correctly
4. The database is updated without errors
5. Logs show success metrics

## Monitoring

### Console Logs
- `[Scraper]` - Main lifecycle events
- `[RSS]` - Feed parsing events
- `[Parser]` - Title/score parsing events
- `[Fetch]` - HTTP requests
- `[DB]` - Database operations

### Success Metrics

```
[Scraper] Completed in {duration}ms: {inserted} inserted, {skipped} skipped, {errors} errors
```

### Database Verification

```bash
npx wrangler d1 execute cloud-instances-db --remote \
  --command="SELECT * FROM vps_benchmarks WHERE data_source='vpsbenchmarks.com' ORDER BY created_at DESC LIMIT 10;"
```

## Deployment Steps

### Completed
1. ✅ Created the scraper implementation (src/scraper.ts)
2. ✅ Created the migration file (migrations/add_scraper_columns.sql)
3. ✅ Applied the migration to the production database
4. ✅ Updated wrangler.toml with the cron trigger
5. ✅ Updated the main worker (src/index.ts) with a scheduled handler
6. ✅ Created the test script (test-scraper.ts)
7. ✅ Created documentation (SCRAPER.md)
8. ✅ Updated project documentation (CLAUDE.md)
9. ✅ Verified TypeScript compilation

### Pending
1. ⏳ Deploy to Cloudflare Workers: `npm run deploy`
2. ⏳ Verify cron trigger activation
3. ⏳ Monitor the first scraper run (9:00 UTC the next day)
4. ⏳ Validate scraped data in the production database

## Cron Schedule

**Schedule**: Daily at 9:00 AM UTC
**Cron Expression**: `0 9 * * *`

**Timezone Conversions**:
- UTC: 09:00
- KST (Seoul): 18:00
- PST: 01:00
- EST: 04:00

## Known Limitations

1. **HTML Parsing Brittleness**
   - Uses regex patterns for score extraction
   - May break if vpsbenchmarks.com changes its HTML structure
   - No fallback parser library

2. **Partial Location Mapping**
   - Only maps ~30 common locations
   - Unmapped locations result in a `null` country_code
   - Could expand the mapping as needed

3. **No Retry Logic**
   - HTTP failures result in skipped items
   - No exponential backoff for transient errors
   - Could add a retry mechanism for robustness

4. **Rate Limiting**
   - No explicit rate limiting for HTTP requests
   - Could overwhelm vpsbenchmarks.com if the RSS feed has many items
   - Currently limited to 20 items per run

5. **CPU Type Missing**
   - Not available in the RSS feed or title
   - Set to `null` for scraped entries
   - Could potentially be extracted from the detail page

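The retry mechanism that limitation 3 calls for could be a small wrapper like the following. This is a sketch of a possible addition, not existing code; the attempt count and delays are illustrative defaults:

```typescript
// Retry an async operation with exponential backoff: delays of
// baseDelayMs, 2*baseDelayMs, 4*baseDelayMs, ... between attempts.
// Rethrows the last error once maxAttempts is exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Wrapping the RSS and detail-page fetches in `withRetry()` would turn transient HTTP failures into delayed successes instead of skipped items.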
## Future Improvements

1. **Retry Logic**: Add exponential backoff for HTTP failures
2. **HTML Parser Library**: Use cheerio or similar for robust parsing
3. **Extended Location Mapping**: Add more city/region mappings
4. **Admin Trigger Endpoint**: Manual scraper trigger via API
5. **Email Notifications**: Alert on scraper failures
6. **Multiple Data Sources**: Support additional benchmark sites
7. **Validation Rules**: Implement manual review thresholds
8. **Rate Limiting**: Respect external site limits
9. **CPU Type Extraction**: Parse detail pages for processor info
10. **Historical Tracking**: Store benchmark history over time

## Success Criteria

### Functional Requirements ✅
- [x] Fetch the RSS feed from vpsbenchmarks.com
- [x] Parse RSS items to extract benchmark metadata
- [x] Fetch detail pages for Geekbench scores
- [x] Insert/update the database with deduplication
- [x] Run on a daily cron schedule
- [x] Log scraper activity for monitoring

### Non-Functional Requirements ✅
- [x] TypeScript compilation succeeds
- [x] No external parsing dependencies (lightweight)
- [x] Error handling prevents worker crashes
- [x] Database migration applied successfully
- [x] Documentation complete and comprehensive

### Testing Requirements ⏳
- [x] Local parsing tests pass
- [x] TypeScript type checking passes
- [ ] Manual cron trigger test (pending deployment)
- [ ] Production scraper run verification (pending deployment)
- [ ] Database data validation (pending deployment)

## Conclusion

The VPSBenchmarks.com scraper is fully implemented and ready for deployment. All code is written, tested locally, and documented. The database schema has been successfully updated in production, with duplicates removed.

**Next Action**: Deploy to Cloudflare Workers and monitor the first automated run.

```bash
# Deploy command
npm run deploy

# Monitor logs
npx wrangler tail

# Verify deployment
curl https://server-recommend.kappa-d8e.workers.dev/api/health
```

## Technical Debt

1. Consider adding an HTML parser library if regex parsing becomes unreliable
2. Expand the location-to-country mapping as more regions appear
3. Add retry logic for transient HTTP failures
4. Implement an admin API endpoint for manual scraper triggering
5. Add email/webhook notifications for scraper failures
6. Consider storing raw HTML for failed parses to aid debugging