
VPSBenchmarks.com Scraper Implementation Summary

Overview

Successfully implemented an automated daily scraper for VPSBenchmarks.com that fetches benchmark data via RSS feed and updates the D1 database.

Files Created/Modified

New Files

  1. src/scraper.ts (432 lines)

    • Main scraper implementation
    • RSS feed parsing
    • HTML scraping for Geekbench scores
    • Database insertion with deduplication
  2. migrations/add_scraper_columns.sql (18 lines)

    • Database migration to add scraper columns
    • Duplicate removal logic
    • Unique constraint creation
  3. test-scraper.ts (145 lines)

    • Local testing script for parsing logic
    • Validates RSS parsing, title parsing, score extraction
  4. SCRAPER.md (280+ lines)

    • Comprehensive documentation
    • Architecture, data mapping, monitoring
    • Testing and deployment instructions
  5. IMPLEMENTATION_SUMMARY.md (this file)

    • Implementation overview and summary

Modified Files

  1. src/index.ts

    • Added import for scraper module
    • Added scheduled() handler for cron trigger
    • No changes to existing HTTP handlers
  2. wrangler.toml

    • Added [triggers] section with cron schedule
  3. CLAUDE.md

    • Updated architecture diagram
    • Added scraper commands
    • Updated vps_benchmarks table description
    • Added scraper section with testing info

Database Changes

Schema Updates (Applied to Production)

Migration executed successfully on remote database:

  • Removed 8 duplicate entries (484 rows written, 3272 rows read)
  • Final record count: 471 unique benchmarks

New columns added to vps_benchmarks:

  • storage_gb (INTEGER) - Storage capacity
  • benchmark_date (TEXT) - ISO timestamp of benchmark
  • data_source (TEXT) - Source identifier ('vpsbenchmarks.com' or 'manual')
  • validation_status (TEXT) - Approval status ('auto_approved', 'pending', 'rejected')

Unique constraint added:

CREATE UNIQUE INDEX idx_vps_benchmarks_unique
ON vps_benchmarks(provider_name, plan_name, country_code);

Before/After

  • Before: 479 records with 8 duplicates
  • After: 471 unique records with scraper-ready schema

Architecture

Worker Structure

Worker (src/index.ts)
├── fetch() - HTTP request handler (existing)
│   ├── GET /api/health
│   ├── GET /api/servers
│   └── POST /api/recommend
└── scheduled() - NEW cron trigger handler
    └── scrapeVPSBenchmarks() from src/scraper.ts
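
The new `scheduled()` wiring can be sketched as follows. This is a minimal illustration, not the actual `src/index.ts`: the `Env`/context shapes are simplified stand-ins for Cloudflare's ambient Workers types, the `DB` binding name is an assumption, and `scrapeVPSBenchmarks` is a placeholder for the real `src/scraper.ts` export.

```typescript
// Simplified stand-ins for Cloudflare's ambient types (assumptions).
type Env = { DB: unknown }; // D1 binding; the binding name is assumed
type Ctx = { waitUntil(p: Promise<unknown>): void };

// Placeholder for the real export in src/scraper.ts.
async function scrapeVPSBenchmarks(env: Env): Promise<void> {
  // Real implementation: fetch RSS, parse items, upsert into D1.
}

const worker = {
  // fetch() handler omitted; scheduled() is the new entry point.
  async scheduled(event: { cron: string }, env: Env, ctx: Ctx): Promise<void> {
    // waitUntil keeps the Worker alive until the scrape finishes.
    ctx.waitUntil(scrapeVPSBenchmarks(env));
  },
};

export default worker;
```

The key design point is `ctx.waitUntil()`: without it, the Worker could be terminated before the asynchronous scrape completes.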

Scraper Flow

Cron Trigger (9:00 UTC daily)
  ↓
scheduled() handler
  ↓
scrapeVPSBenchmarks()
  ↓
parseRSSFeed() → Fetch https://www.vpsbenchmarks.com/rss/benchmarks
  ↓
For each RSS item (max 20):
  ↓
  parseBenchmarkDetails()
    ├── parseTitleFormat() - Extract provider, plan, specs, location
    ├── fetchDetailPage() - Fetch HTML detail page
    └── extractBenchmarkScores() - Parse Geekbench scores
  ↓
  insertBenchmark() - INSERT OR UPDATE with deduplication

Deduplication Strategy

  • Primary Key: Auto-increment id
  • Unique Constraint: (provider_name, plan_name, country_code)
  • Conflict Resolution: ON CONFLICT DO UPDATE to overwrite with latest data
  • Rationale: Same VPS plan in same location = same benchmark target
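
The conflict-resolving insert described above could take the following shape. Only the three constraint columns and the four new scraper columns come from this summary; the rest of the column list and the positional-binding style are assumptions, not the actual `insertBenchmark()` statement.

```typescript
// Sketch of an INSERT ... ON CONFLICT DO UPDATE (SQLite/D1 upsert).
// Columns beyond those documented in this summary are assumed.
const UPSERT_SQL = `
  INSERT INTO vps_benchmarks
    (provider_name, plan_name, country_code,
     storage_gb, benchmark_date, data_source, validation_status)
  VALUES (?1, ?2, ?3, ?4, ?5, 'vpsbenchmarks.com', 'auto_approved')
  ON CONFLICT (provider_name, plan_name, country_code) DO UPDATE SET
    storage_gb        = excluded.storage_gb,
    benchmark_date    = excluded.benchmark_date,
    data_source       = excluded.data_source,
    validation_status = excluded.validation_status
`;
```

The `excluded.*` pseudo-table refers to the row that failed to insert, so a re-scraped benchmark overwrites the existing row with the latest data.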

Key Features

1. RSS Feed Parsing

  • Custom XML parser (no external dependencies)
  • Extracts title, link, pubDate, description
  • Handles CDATA sections
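
A dependency-free item extractor in this spirit might look like the sketch below. It is illustrative only; the real `parseRSSFeed()` in `src/scraper.ts` may differ in structure and field handling.

```typescript
interface RSSItem {
  title: string;
  link: string;
  pubDate: string;
  description: string;
}

// Unwrap an optional <![CDATA[ ... ]]> section around a tag's content.
function stripCDATA(value: string): string {
  const m = value.match(/^<!\[CDATA\[([\s\S]*?)\]\]>$/);
  return (m ? m[1] : value).trim();
}

// Extract the text content of the first <tag>...</tag> in a block.
function tagContent(block: string, tag: string): string {
  const m = block.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return m ? stripCDATA(m[1].trim()) : "";
}

function parseRSSFeed(xml: string): RSSItem[] {
  const items: RSSItem[] = [];
  for (const m of xml.matchAll(/<item>([\s\S]*?)<\/item>/g)) {
    items.push({
      title: tagContent(m[1], "title"),
      link: tagContent(m[1], "link"),
      pubDate: tagContent(m[1], "pubDate"),
      description: tagContent(m[1], "description"),
    });
  }
  return items;
}
```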

2. Title Parsing

Supports multiple formats:

  • Provider - Plan (X vCPU, Y GB RAM) - Location
  • Provider Plan - X vCPU, Y GB RAM - Location

Extracts:

  • Provider name
  • Plan name
  • vCPU count
  • Memory (GB)
  • Storage (GB) - optional
  • Monthly price (USD) - optional
  • Location
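
A parser for the first title format could be sketched as below. This handles only `Provider - Plan (X vCPU, Y GB RAM) - Location`; the real `parseTitleFormat()` also handles the second, dash-separated format and the optional storage and price fields.

```typescript
interface ParsedTitle {
  provider: string;
  plan: string;
  vcpu: number;
  memoryGb: number;
  location: string;
}

// Parse "Provider - Plan (X vCPU, Y GB RAM) - Location"; returns null
// when the title does not match this format.
function parseTitleFormat(title: string): ParsedTitle | null {
  const m = title.match(
    /^(.+?) - (.+?) \((\d+) vCPU, (\d+(?:\.\d+)?) GB RAM\) - (.+)$/
  );
  if (!m) return null;
  return {
    provider: m[1].trim(),
    plan: m[2].trim(),
    vcpu: parseInt(m[3], 10),
    memoryGb: parseFloat(m[4]),
    location: m[5].trim(),
  };
}
```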

3. Location to Country Code Mapping

Maps 30+ locations to ISO country codes:

  • Singapore → sg
  • Tokyo/Osaka → jp
  • Seoul → kr
  • New York/Virginia → us
  • Frankfurt → de
  • etc.
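
The mapping amounts to a lookup table plus a null fallback for unmapped locations. The sketch below shows only the examples listed above; the real table covers 30+ locations.

```typescript
// Partial location → ISO country code table (illustrative subset).
const LOCATION_TO_COUNTRY: Record<string, string> = {
  singapore: "sg",
  tokyo: "jp",
  osaka: "jp",
  seoul: "kr",
  "new york": "us",
  virginia: "us",
  frankfurt: "de",
};

// Unmapped locations yield null, matching the documented behavior
// (null country_code for unknown locations).
function locationToCountryCode(location: string): string | null {
  return LOCATION_TO_COUNTRY[location.trim().toLowerCase()] ?? null;
}
```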

4. Geekbench Score Extraction

Parses HTML detail pages for:

  • Single-Core Score
  • Multi-Core Score
  • Total Score (sum)

Supports multiple HTML patterns for robustness.
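
The multi-pattern approach can be sketched as a list of regexes tried in order per score. The patterns below are illustrative assumptions, not the actual expressions in `src/scraper.ts`.

```typescript
interface Scores {
  singleCore: number;
  multiCore: number;
  total: number;
}

// Try each pattern in order; return the first captured number.
function firstNumber(html: string, patterns: RegExp[]): number | null {
  for (const p of patterns) {
    const m = html.match(p);
    if (m) return parseInt(m[1].replace(/,/g, ""), 10);
  }
  return null;
}

function extractBenchmarkScores(html: string): Scores | null {
  const single = firstNumber(html, [
    /Single[- ]Core Score[^0-9]*([\d,]+)/i, // table/label layout (assumed)
    /"single_core"\s*:\s*(\d+)/,            // embedded JSON (assumed)
  ]);
  const multi = firstNumber(html, [
    /Multi[- ]Core Score[^0-9]*([\d,]+)/i,
    /"multi_core"\s*:\s*(\d+)/,
  ]);
  if (single === null || multi === null) return null;
  return { singleCore: single, multiCore: multi, total: single + multi };
}
```

Falling back through several patterns is what gives the scraper some resilience when the page markup shifts, though any major redesign would still break it (see Known Limitations).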

5. Error Handling

  • RSS fetch failures: Log and exit gracefully
  • Parse errors: Log and skip item
  • Database errors: Log and skip item
  • Errors are caught per item, so the worker never crashes and a run can partially succeed
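
The log-and-skip pattern implied above can be sketched as a loop that isolates each item's failure. `processItem` here is a hypothetical stand-in for the per-item parse and insert steps.

```typescript
// Run processItem over every item, counting successes and failures;
// one bad item never aborts the whole run.
async function runWithPartialSuccess(
  items: string[],
  processItem: (item: string) => Promise<void>,
): Promise<{ ok: number; errors: number }> {
  let ok = 0;
  let errors = 0;
  for (const item of items) {
    try {
      await processItem(item);
      ok++;
    } catch (e) {
      errors++; // log and skip, allowing partial success
      console.log(`[Scraper] Skipping item after error: ${e}`);
    }
  }
  return { ok, errors };
}
```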

Testing

Local Testing Completed

✅ npx tsx test-scraper.ts
   - RSS parsing: 2 items found
   - Title parsing: Successful extraction
   - Score extraction: 1234 single, 5678 multi

✅ npm run typecheck
   - No TypeScript errors
   - All types valid

Manual Trigger (Not Yet Tested)

# Test cron trigger locally
npx wrangler dev --test-scheduled
curl "http://localhost:8787/__scheduled?cron=0+9+*+*+*"

Production Testing (Pending)

After deployment, verify:

  1. Cron trigger executes at 9:00 UTC
  2. RSS feed is fetched successfully
  3. Benchmarks are parsed correctly
  4. Database is updated without errors
  5. Logs show success metrics

Monitoring

Console Logs

  • [Scraper] - Main lifecycle events
  • [RSS] - Feed parsing events
  • [Parser] - Title/score parsing events
  • [Fetch] - HTTP requests
  • [DB] - Database operations

Success Metrics

[Scraper] Completed in {duration}ms: {inserted} inserted, {skipped} skipped, {errors} errors

Database Verification

npx wrangler d1 execute cloud-instances-db --remote \
  --command="SELECT * FROM vps_benchmarks WHERE data_source='vpsbenchmarks.com' ORDER BY created_at DESC LIMIT 10;"

Deployment Steps

Completed

  1. Created scraper implementation (src/scraper.ts)
  2. Created migration file (migrations/add_scraper_columns.sql)
  3. Applied migration to production database
  4. Updated wrangler.toml with cron trigger
  5. Updated main worker (src/index.ts) with scheduled handler
  6. Created test script (test-scraper.ts)
  7. Created documentation (SCRAPER.md)
  8. Updated project documentation (CLAUDE.md)
  9. Verified TypeScript compilation

Pending

  1. Deploy to Cloudflare Workers: npm run deploy
  2. Verify cron trigger activation
  3. Monitor first scraper run (9:00 UTC next day)
  4. Validate scraped data in production database

Cron Schedule

Schedule: Daily at 9:00 AM UTC
Cron Expression: 0 9 * * *

Timezone Conversions:

  • UTC: 09:00
  • KST (Seoul): 18:00
  • PST: 01:00
  • EST: 04:00
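
This schedule corresponds to the `[triggers]` section added to wrangler.toml:

```toml
[triggers]
crons = ["0 9 * * *"]
```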

Known Limitations

  1. HTML Parsing Brittleness

    • Uses regex patterns for score extraction
    • May break if vpsbenchmarks.com changes HTML structure
    • No fallback parser library
  2. Partial Location Mapping

    • Only maps ~30 common locations
    • Unmapped locations result in null country_code
    • Could expand mapping as needed
  3. No Retry Logic

    • HTTP failures result in skipped items
    • No exponential backoff for transient errors
    • Could add retry mechanism for robustness
  4. Rate Limiting

    • No explicit rate limiting for HTTP requests
    • Could overwhelm vpsbenchmarks.com if RSS has many items
    • Currently limited to 20 items per run
  5. CPU Type Missing

    • Not available in RSS feed or title
    • Set to null for scraped entries
    • Could potentially extract from detail page

Future Improvements

  1. Retry Logic: Add exponential backoff for HTTP failures
  2. HTML Parser Library: Use cheerio or similar for robust parsing
  3. Extended Location Mapping: Add more city/region mappings
  4. Admin Trigger Endpoint: Manual scraper trigger via API
  5. Email Notifications: Alert on scraper failures
  6. Multiple Data Sources: Support additional benchmark sites
  7. Validation Rules: Implement manual review thresholds
  8. Rate Limiting: Respect external site limits
  9. CPU Type Extraction: Parse detail pages for processor info
  10. Historical Tracking: Store benchmark history over time
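
Improvement 1 (retry with exponential backoff) could be sketched as a small wrapper; the attempt count and delays below are illustrative defaults, not values from this project.

```typescript
// Retry an async operation with exponential backoff between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e;
      if (i < attempts - 1) {
        // Waits 500ms, 1000ms, 2000ms, ... between attempts.
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrapping the RSS fetch and detail-page fetches in such a helper would turn transient HTTP failures into delayed successes instead of skipped items.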

Success Criteria

Functional Requirements

  • Fetch RSS feed from vpsbenchmarks.com
  • Parse RSS items to extract benchmark metadata
  • Fetch detail pages for Geekbench scores
  • Insert/update database with deduplication
  • Run on daily cron schedule
  • Log scraper activity for monitoring

Non-Functional Requirements

  • TypeScript compilation succeeds
  • No external parsing dependencies (lightweight)
  • Error handling prevents worker crashes
  • Database migration applied successfully
  • Documentation complete and comprehensive

Testing Requirements

  • Local parsing tests pass
  • TypeScript type checking passes
  • Manual cron trigger test (pending deployment)
  • Production scraper run verification (pending deployment)
  • Database data validation (pending deployment)

Conclusion

The VPSBenchmarks.com scraper is fully implemented and ready for deployment. All code is written, tested locally, and documented. The database schema has been successfully updated in production with duplicate removal.

Next Action: Deploy to Cloudflare Workers and monitor first automated run.

# Deploy command
npm run deploy

# Monitor logs
npx wrangler tail

# Verify deployment
curl https://server-recommend.kappa-d8e.workers.dev/api/health

Technical Debt

  1. Consider adding HTML parser library if parsing becomes unreliable
  2. Expand location-to-country mapping as more regions appear
  3. Add retry logic for transient HTTP failures
  4. Implement admin API endpoint for manual scraper triggering
  5. Add email/webhook notifications for scraper failures
  6. Consider storing raw HTML for failed parses to aid debugging