
VPSBenchmarks.com Scraper Implementation Summary

Overview

Successfully implemented an automated daily scraper for VPSBenchmarks.com that fetches benchmark data via RSS feed and updates the D1 database.

Files Created/Modified

New Files

  1. src/scraper.ts (432 lines)

    • Main scraper implementation
    • RSS feed parsing
    • HTML scraping for Geekbench scores
    • Database insertion with deduplication
  2. migrations/add_scraper_columns.sql (18 lines)

    • Database migration to add scraper columns
    • Duplicate removal logic
    • Unique constraint creation
  3. test-scraper.ts (145 lines)

    • Local testing script for parsing logic
    • Validates RSS parsing, title parsing, score extraction
  4. SCRAPER.md (280+ lines)

    • Comprehensive documentation
    • Architecture, data mapping, monitoring
    • Testing and deployment instructions
  5. IMPLEMENTATION_SUMMARY.md (this file)

    • Implementation overview and summary

Modified Files

  1. src/index.ts

    • Added import for scraper module
    • Added scheduled() handler for cron trigger
    • No changes to existing HTTP handlers
  2. wrangler.toml

    • Added [triggers] section with cron schedule
  3. CLAUDE.md

    • Updated architecture diagram
    • Added scraper commands
    • Updated vps_benchmarks table description
    • Added scraper section with testing info

Database Changes

Schema Updates (Applied to Production)

Migration executed successfully on remote database:

  • Removed 8 duplicate entries (484 rows written, 3272 rows read)
  • Final record count: 471 unique benchmarks

New columns added to vps_benchmarks:

  • storage_gb (INTEGER) - Storage capacity
  • benchmark_date (TEXT) - ISO timestamp of benchmark
  • data_source (TEXT) - Source identifier ('vpsbenchmarks.com' or 'manual')
  • validation_status (TEXT) - Approval status ('auto_approved', 'pending', 'rejected')

Unique constraint added:

CREATE UNIQUE INDEX idx_vps_benchmarks_unique
ON vps_benchmarks(provider_name, plan_name, country_code);

Before/After

  • Before: 479 records with 8 duplicates
  • After: 471 unique records with scraper-ready schema

Architecture

Worker Structure

Worker (src/index.ts)
├── fetch() - HTTP request handler (existing)
│   ├── GET /api/health
│   ├── GET /api/servers
│   └── POST /api/recommend
└── scheduled() - NEW cron trigger handler
    └── scrapeVPSBenchmarks() from src/scraper.ts
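
The new `scheduled()` wiring can be sketched as follows. This is a minimal illustration, not the actual `src/index.ts`: the `Env`/context shapes are simplified stand-ins for Cloudflare's ambient Workers types, the `DB` binding name is an assumption, and `scrapeVPSBenchmarks` is a placeholder for the real `src/scraper.ts` export.

```typescript
// Simplified stand-ins for Cloudflare's ambient types (assumptions).
type Env = { DB: unknown }; // D1 binding; the binding name is assumed
type Ctx = { waitUntil(p: Promise<unknown>): void };

// Placeholder for the real export in src/scraper.ts.
async function scrapeVPSBenchmarks(env: Env): Promise<void> {
  // Real implementation: fetch RSS, parse items, upsert into D1.
}

const worker = {
  // fetch() handler omitted; scheduled() is the new entry point.
  async scheduled(event: { cron: string }, env: Env, ctx: Ctx): Promise<void> {
    // waitUntil keeps the Worker alive until the scrape finishes.
    ctx.waitUntil(scrapeVPSBenchmarks(env));
  },
};

export default worker;
```

The key design point is `ctx.waitUntil()`: without it, the Worker could be terminated before the asynchronous scrape completes.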

Scraper Flow

Cron Trigger (9:00 UTC daily)
  ↓
scheduled() handler
  ↓
scrapeVPSBenchmarks()
  ↓
parseRSSFeed() → Fetch https://www.vpsbenchmarks.com/rss/benchmarks
  ↓
For each RSS item (max 20):
  ↓
  parseBenchmarkDetails()
    ├── parseTitleFormat() - Extract provider, plan, specs, location
    ├── fetchDetailPage() - Fetch HTML detail page
    └── extractBenchmarkScores() - Parse Geekbench scores
  ↓
  insertBenchmark() - INSERT OR UPDATE with deduplication

Deduplication Strategy

  • Primary Key: Auto-increment id
  • Unique Constraint: (provider_name, plan_name, country_code)
  • Conflict Resolution: ON CONFLICT DO UPDATE to overwrite with latest data
  • Rationale: Same VPS plan in same location = same benchmark target
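
The conflict-resolving insert described above could take the following shape. Only the three constraint columns and the four new scraper columns come from this summary; the rest of the column list and the positional-binding style are assumptions, not the actual `insertBenchmark()` statement.

```typescript
// Sketch of an INSERT ... ON CONFLICT DO UPDATE (SQLite/D1 upsert).
// Columns beyond those documented in this summary are assumed.
const UPSERT_SQL = `
  INSERT INTO vps_benchmarks
    (provider_name, plan_name, country_code,
     storage_gb, benchmark_date, data_source, validation_status)
  VALUES (?1, ?2, ?3, ?4, ?5, 'vpsbenchmarks.com', 'auto_approved')
  ON CONFLICT (provider_name, plan_name, country_code) DO UPDATE SET
    storage_gb        = excluded.storage_gb,
    benchmark_date    = excluded.benchmark_date,
    data_source       = excluded.data_source,
    validation_status = excluded.validation_status
`;
```

The `excluded.*` pseudo-table refers to the row that failed to insert, so a re-scraped benchmark overwrites the existing row with the latest data.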

Key Features

1. RSS Feed Parsing

  • Custom XML parser (no external dependencies)
  • Extracts title, link, pubDate, description
  • Handles CDATA sections
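
A dependency-free item extractor in this spirit might look like the sketch below. It is illustrative only; the real `parseRSSFeed()` in `src/scraper.ts` may differ in structure and field handling.

```typescript
interface RSSItem {
  title: string;
  link: string;
  pubDate: string;
  description: string;
}

// Unwrap an optional <![CDATA[ ... ]]> section around a tag's content.
function stripCDATA(value: string): string {
  const m = value.match(/^<!\[CDATA\[([\s\S]*?)\]\]>$/);
  return (m ? m[1] : value).trim();
}

// Extract the text content of the first <tag>...</tag> in a block.
function tagContent(block: string, tag: string): string {
  const m = block.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return m ? stripCDATA(m[1].trim()) : "";
}

function parseRSSFeed(xml: string): RSSItem[] {
  const items: RSSItem[] = [];
  for (const m of xml.matchAll(/<item>([\s\S]*?)<\/item>/g)) {
    items.push({
      title: tagContent(m[1], "title"),
      link: tagContent(m[1], "link"),
      pubDate: tagContent(m[1], "pubDate"),
      description: tagContent(m[1], "description"),
    });
  }
  return items;
}
```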

2. Title Parsing

Supports multiple formats:

  • Provider - Plan (X vCPU, Y GB RAM) - Location
  • Provider Plan - X vCPU, Y GB RAM - Location

Extracts:

  • Provider name
  • Plan name
  • vCPU count
  • Memory (GB)
  • Storage (GB) - optional
  • Monthly price (USD) - optional
  • Location
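
A parser for the first title format could be sketched as below. This handles only `Provider - Plan (X vCPU, Y GB RAM) - Location`; the real `parseTitleFormat()` also handles the second, dash-separated format and the optional storage and price fields.

```typescript
interface ParsedTitle {
  provider: string;
  plan: string;
  vcpu: number;
  memoryGb: number;
  location: string;
}

// Parse "Provider - Plan (X vCPU, Y GB RAM) - Location"; returns null
// when the title does not match this format.
function parseTitleFormat(title: string): ParsedTitle | null {
  const m = title.match(
    /^(.+?) - (.+?) \((\d+) vCPU, (\d+(?:\.\d+)?) GB RAM\) - (.+)$/
  );
  if (!m) return null;
  return {
    provider: m[1].trim(),
    plan: m[2].trim(),
    vcpu: parseInt(m[3], 10),
    memoryGb: parseFloat(m[4]),
    location: m[5].trim(),
  };
}
```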

3. Location to Country Code Mapping

Maps 30+ locations to ISO country codes:

  • Singapore → sg
  • Tokyo/Osaka → jp
  • Seoul → kr
  • New York/Virginia → us
  • Frankfurt → de
  • etc.
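
The mapping amounts to a lookup table plus a null fallback for unmapped locations. The sketch below shows only the examples listed above; the real table covers 30+ locations.

```typescript
// Partial location → ISO country code table (illustrative subset).
const LOCATION_TO_COUNTRY: Record<string, string> = {
  singapore: "sg",
  tokyo: "jp",
  osaka: "jp",
  seoul: "kr",
  "new york": "us",
  virginia: "us",
  frankfurt: "de",
};

// Unmapped locations yield null, matching the documented behavior
// (null country_code for unknown locations).
function locationToCountryCode(location: string): string | null {
  return LOCATION_TO_COUNTRY[location.trim().toLowerCase()] ?? null;
}
```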

4. Geekbench Score Extraction

Parses HTML detail pages for:

  • Single-Core Score
  • Multi-Core Score
  • Total Score (sum)

Supports multiple HTML patterns for robustness.
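
The multi-pattern approach can be sketched as a list of regexes tried in order per score. The patterns below are illustrative assumptions, not the actual expressions in `src/scraper.ts`.

```typescript
interface Scores {
  singleCore: number;
  multiCore: number;
  total: number;
}

// Try each pattern in order; return the first captured number.
function firstNumber(html: string, patterns: RegExp[]): number | null {
  for (const p of patterns) {
    const m = html.match(p);
    if (m) return parseInt(m[1].replace(/,/g, ""), 10);
  }
  return null;
}

function extractBenchmarkScores(html: string): Scores | null {
  const single = firstNumber(html, [
    /Single[- ]Core Score[^0-9]*([\d,]+)/i, // table/label layout (assumed)
    /"single_core"\s*:\s*(\d+)/,            // embedded JSON (assumed)
  ]);
  const multi = firstNumber(html, [
    /Multi[- ]Core Score[^0-9]*([\d,]+)/i,
    /"multi_core"\s*:\s*(\d+)/,
  ]);
  if (single === null || multi === null) return null;
  return { singleCore: single, multiCore: multi, total: single + multi };
}
```

Falling back through several patterns is what gives the scraper some resilience when the page markup shifts, though any major redesign would still break it (see Known Limitations).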

5. Error Handling

  • RSS fetch failures: Log and exit gracefully
  • Parse errors: Log and skip item
  • Database errors: Log and skip item
  • Errors are caught per item, so the worker never crashes and a run can partially succeed
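
The log-and-skip pattern implied above can be sketched as a loop that isolates each item's failure. `processItem` here is a hypothetical stand-in for the per-item parse and insert steps.

```typescript
// Run processItem over every item, counting successes and failures;
// one bad item never aborts the whole run.
async function runWithPartialSuccess(
  items: string[],
  processItem: (item: string) => Promise<void>,
): Promise<{ ok: number; errors: number }> {
  let ok = 0;
  let errors = 0;
  for (const item of items) {
    try {
      await processItem(item);
      ok++;
    } catch (e) {
      errors++; // log and skip, allowing partial success
      console.log(`[Scraper] Skipping item after error: ${e}`);
    }
  }
  return { ok, errors };
}
```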

Testing

Local Testing Completed

✅ npx tsx test-scraper.ts
   - RSS parsing: 2 items found
   - Title parsing: Successful extraction
   - Score extraction: 1234 single, 5678 multi

✅ npm run typecheck
   - No TypeScript errors
   - All types valid

Manual Trigger (Not Yet Tested)

# Test cron trigger locally
npx wrangler dev --test-scheduled
curl "http://localhost:8787/__scheduled?cron=0+9+*+*+*"

Production Testing (Pending)

After deployment, verify:

  1. Cron trigger executes at 9:00 UTC
  2. RSS feed is fetched successfully
  3. Benchmarks are parsed correctly
  4. Database is updated without errors
  5. Logs show success metrics

Monitoring

Console Logs

  • [Scraper] - Main lifecycle events
  • [RSS] - Feed parsing events
  • [Parser] - Title/score parsing events
  • [Fetch] - HTTP requests
  • [DB] - Database operations

Success Metrics

[Scraper] Completed in {duration}ms: {inserted} inserted, {skipped} skipped, {errors} errors

Database Verification

npx wrangler d1 execute cloud-instances-db --remote \
  --command="SELECT * FROM vps_benchmarks WHERE data_source='vpsbenchmarks.com' ORDER BY created_at DESC LIMIT 10;"

Deployment Steps

Completed

  1. Created scraper implementation (src/scraper.ts)
  2. Created migration file (migrations/add_scraper_columns.sql)
  3. Applied migration to production database
  4. Updated wrangler.toml with cron trigger
  5. Updated main worker (src/index.ts) with scheduled handler
  6. Created test script (test-scraper.ts)
  7. Created documentation (SCRAPER.md)
  8. Updated project documentation (CLAUDE.md)
  9. Verified TypeScript compilation

Pending

  1. Deploy to Cloudflare Workers: npm run deploy
  2. Verify cron trigger activation
  3. Monitor first scraper run (9:00 UTC next day)
  4. Validate scraped data in production database

Cron Schedule

Schedule: Daily at 9:00 AM UTC
Cron Expression: 0 9 * * *

Timezone Conversions:

  • UTC: 09:00
  • KST (Seoul): 18:00
  • PST: 01:00
  • EST: 04:00
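
This schedule corresponds to the `[triggers]` section added to wrangler.toml:

```toml
[triggers]
crons = ["0 9 * * *"]
```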

Known Limitations

  1. HTML Parsing Brittleness

    • Uses regex patterns for score extraction
    • May break if vpsbenchmarks.com changes HTML structure
    • No fallback parser library
  2. Partial Location Mapping

    • Only maps ~30 common locations
    • Unmapped locations result in null country_code
    • Could expand mapping as needed
  3. No Retry Logic

    • HTTP failures result in skipped items
    • No exponential backoff for transient errors
    • Could add retry mechanism for robustness
  4. Rate Limiting

    • No explicit rate limiting for HTTP requests
    • Could overwhelm vpsbenchmarks.com if RSS has many items
    • Currently limited to 20 items per run
  5. CPU Type Missing

    • Not available in RSS feed or title
    • Set to null for scraped entries
    • Could potentially extract from detail page

Future Improvements

  1. Retry Logic: Add exponential backoff for HTTP failures
  2. HTML Parser Library: Use cheerio or similar for robust parsing
  3. Extended Location Mapping: Add more city/region mappings
  4. Admin Trigger Endpoint: Manual scraper trigger via API
  5. Email Notifications: Alert on scraper failures
  6. Multiple Data Sources: Support additional benchmark sites
  7. Validation Rules: Implement manual review thresholds
  8. Rate Limiting: Respect external site limits
  9. CPU Type Extraction: Parse detail pages for processor info
  10. Historical Tracking: Store benchmark history over time
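
Improvement 1 (retry with exponential backoff) could be sketched as a small wrapper; the attempt count and delays below are illustrative defaults, not values from this project.

```typescript
// Retry an async operation with exponential backoff between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e;
      if (i < attempts - 1) {
        // Waits 500ms, 1000ms, 2000ms, ... between attempts.
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrapping the RSS fetch and detail-page fetches in such a helper would turn transient HTTP failures into delayed successes instead of skipped items.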

Success Criteria

Functional Requirements

  • Fetch RSS feed from vpsbenchmarks.com
  • Parse RSS items to extract benchmark metadata
  • Fetch detail pages for Geekbench scores
  • Insert/update database with deduplication
  • Run on daily cron schedule
  • Log scraper activity for monitoring

Non-Functional Requirements

  • TypeScript compilation succeeds
  • No external parsing dependencies (lightweight)
  • Error handling prevents worker crashes
  • Database migration applied successfully
  • Documentation complete and comprehensive

Testing Requirements

  • Local parsing tests pass
  • TypeScript type checking passes
  • Manual cron trigger test (pending deployment)
  • Production scraper run verification (pending deployment)
  • Database data validation (pending deployment)

Conclusion

The VPSBenchmarks.com scraper is fully implemented and ready for deployment. All code is written, tested locally, and documented. The database schema has been successfully updated in production with duplicate removal.

Next Action: Deploy to Cloudflare Workers and monitor first automated run.

# Deploy command
npm run deploy

# Monitor logs
npx wrangler tail

# Verify deployment
curl https://server-recommend.kappa-d8e.workers.dev/api/health

Technical Debt

  1. Consider adding HTML parser library if parsing becomes unreliable
  2. Expand location-to-country mapping as more regions appear
  3. Add retry logic for transient HTTP failures
  4. Implement admin API endpoint for manual scraper triggering
  5. Add email/webhook notifications for scraper failures
  6. Consider storing raw HTML for failed parses to aid debugging