Calliope Hub Spawner Scaling Guide
Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.
This document provides a linear scaling chart and recommendations for growing your Calliope Hub deployment from 10 to 10,000+ concurrent servers.
Quick Reference Chart
Vertical Scaling (Single Calliope Hub)
| Target Servers | Calliope Hub vCPU | Calliope Hub Memory | Database | Poll Interval | Monthly Cost* | Notes |
|---|---|---|---|---|---|---|
| 10-50 | 2 | 4GB | SQLite | 5-10s | $60 | Current config, ideal for dev/small teams |
| 50-100 | 2 | 4GB | SQLite | 5-10s | $60 | Good performance, no changes needed |
| 100-150 | 2 | 4GB | PostgreSQL | 10s | $60 + $20** | Approaching CPU limit, DB upgrade recommended |
| 150-300 | 4 | 8GB | PostgreSQL | 10-15s | $120 + $20** | 2x resources, smooth performance |
| 300-500 | 8 | 16GB | PostgreSQL | 15-20s | $245 + $20** | Near ECS API limits |
| 500-800 | 16 | 32GB | PostgreSQL | 20-30s | $490 + $20** | Approaching hard API limits |
*Calliope Hub task only (excludes spawned servers)
**RDS PostgreSQL db.t3.small cost
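The vertical tiers above can be encoded as a small lookup so that sizing decisions are reproducible. This is a minimal sketch, not part of the Hub itself: the `HubTier` type and `recommend_vertical_tier` function are hypothetical names, the rows are copied from the Vertical Scaling table, and each range's upper bound is treated as inclusive (the table's ranges overlap at their edges).

```python
from typing import NamedTuple, Tuple


class HubTier(NamedTuple):
    max_servers: int                  # upper bound of the "Target Servers" range
    vcpu: int
    memory_gb: int
    database: str
    poll_interval_s: Tuple[int, int]  # (min, max) poll interval in seconds


# Rows copied from the Vertical Scaling table (hypothetical encoding).
VERTICAL_TIERS = [
    HubTier(50, 2, 4, "SQLite", (5, 10)),
    HubTier(100, 2, 4, "SQLite", (5, 10)),
    HubTier(150, 2, 4, "PostgreSQL", (10, 10)),
    HubTier(300, 4, 8, "PostgreSQL", (10, 15)),
    HubTier(500, 8, 16, "PostgreSQL", (15, 20)),
    HubTier(800, 16, 32, "PostgreSQL", (20, 30)),
]


def recommend_vertical_tier(target_servers: int) -> HubTier:
    """Return the first tier whose range covers target_servers."""
    if target_servers < 1:
        raise ValueError("target_servers must be positive")
    for tier in VERTICAL_TIERS:
        if target_servers <= tier.max_servers:
            return tier
    raise ValueError("above 800 servers: see Horizontal Scaling (Multiple Hubs)")


tier = recommend_vertical_tier(250)
print(f"{tier.vcpu} vCPU / {tier.memory_gb}GB, {tier.database}")
# prints "4 vCPU / 8GB, PostgreSQL"
```

Raising an error above 800 servers, rather than returning the largest tier, keeps the lookup honest about where a single hub stops being the recommended shape.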
Horizontal Scaling (Multiple Hubs)
| Target Servers | Calliope Hub Count | Calliope Hub Size | Total vCPU | Database | Optimizations | Monthly Cost* | Notes |
|---|---|---|---|---|---|---|---|
| 500-1,000 | 2 | 4 vCPU / 8GB | 8 vCPU | PostgreSQL | None | $240 + $50** | ALB + sticky sessions required |
| 1,000-2,000 | 3 | 4 vCPU / 8GB | 12 vCPU | PostgreSQL | Batching | $360 + $75** | API batching essential |
| 2,000-5,000 | 5 | 8 vCPU / 16GB | 40 vCPU | PostgreSQL | Batching + Caching | $1,200 + $150** | Sharding recommended |
| 5,000-10,000 | 10 | 8 vCPU / 16GB | 80 vCPU | PostgreSQL | Full stack | $2,400 + $300** | Event-driven recommended |
*Calliope Hub tasks only (excludes spawned servers and ALB costs)
**RDS PostgreSQL cost (scales with connections)
Scaling Paths
Path 1: Small to Medium (10 → 500 servers)
Vertical Scaling Only - Simplest approach
Step 1: Start (10-50 servers)
├─ Calliope Hub: 2 vCPU / 4GB
├─ Database: SQLite
└─ Cost: $60/month
Step 2: Growing (50-100 servers)
├─ Calliope Hub: 2 vCPU / 4GB (no change)
├─ Database: Switch to PostgreSQL
└─ Cost: $80/month (+$20 for RDS)
Step 3: Scaling (100-300 servers)
├─ Calliope Hub: 4 vCPU / 8GB
├─ Database: PostgreSQL
└─ Cost: $140/month
Step 4: Large (300-500 servers)
├─ Calliope Hub: 8 vCPU / 16GB
├─ Database: PostgreSQL
├─ Poll interval: 15-20s
└─ Cost: $265/month
Timeline: Can scale in days (just update task definition)
Complexity: Low (no architectural changes)
Path 2: Medium to Large (500 → 2,000 servers)
Add Horizontal Scaling - Moderate complexity
Step 1: Prepare (500 servers)
├─ Calliope Hub: 8 vCPU / 16GB
├─ Database: PostgreSQL (connection pooling)
├─ Implement: API call batching
└─ Cost: $265/month
Step 2: First Scale-Out (500-1,000 servers)
├─ Hubs: 2 × 4 vCPU / 8GB
├─ Load Balancer: ALB with sticky sessions
├─ Database: PostgreSQL (larger instance)
└─ Cost: $340/month
Step 3: Growing (1,000-2,000 servers)
├─ Hubs: 3 × 4 vCPU / 8GB
├─ Optimizations: Batching + caching
└─ Cost: $435/month
Timeline: 2-4 weeks (load balancer setup + testing)
Complexity: Medium (requires coordination between hubs)
Path 3: Large Scale (2,000+ servers)
Enterprise Architecture - High complexity
Step 1: Optimize (2,000 servers)
├─ Hubs: 5 × 8 vCPU / 16GB
├─ Sharding: By user group or service type
├─ Database: PostgreSQL (db.r5.2xlarge or larger)
├─ Caching: Redis cluster for API responses
└─ Cost: $1,500/month
Step 2: Event-Driven (5,000+ servers)
├─ Hubs: 10 × 8 vCPU / 16GB
├─ Monitoring: Separate service with EventBridge
├─ Database: Aurora PostgreSQL (multi-AZ)
├─ Polling: Reduced to 60s (events handle state)
└─ Cost: $3,000+/month
Timeline: 2-3 months (architectural redesign)
Complexity: High (requires distributed systems expertise)
Linear Scaling Formula
Calculate Required Calliope Hub Resources
CPU (vCPU):
Required vCPU = (Target Servers × 0.1s CPU per poll) ÷ Poll Interval
Example for 400 servers with 10s polling:
= (400 × 0.1) ÷ 10
= 4 vCPU
Memory (GB):
Required Memory = 1 GB + (Target Servers × 0.001 GB)
Example for 400 servers:
= 1 + (400 × 0.001)
= 1.4 GB (use 4GB for headroom)
Calliope Hub Count (for horizontal scaling):
Calliope Hub Count = Target Servers ÷ 300 (rounded up)
Example for 1,200 servers:
= 1,200 ÷ 300
= 4 hubs @ 4 vCPU / 8GB each
Migration Checklist
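Before starting any of the migrations, it can help to compute the target size from the three Linear Scaling Formula rules above. The sketch below is one way to do that. The function and constant names are illustrative, the constants (0.1s CPU per poll, 1 GB base plus 0.001 GB per server, 300 servers per hub) are taken from this guide, and rounding the hub count up is an assumption.

```python
import math

# Constants from the Linear Scaling Formula section of this guide.
CPU_SECONDS_PER_POLL = 0.1     # CPU time one server costs per poll
BASE_MEMORY_GB = 1.0           # fixed Hub memory footprint
MEMORY_GB_PER_SERVER = 0.001   # incremental memory per tracked server
SERVERS_PER_HUB = 300          # horizontal scaling divisor


def required_vcpu(target_servers: int, poll_interval_s: float) -> float:
    """vCPU needed to poll every server once per interval."""
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return target_servers * CPU_SECONDS_PER_POLL / poll_interval_s


def required_memory_gb(target_servers: int) -> float:
    """Raw memory estimate; the guide rounds this up generously for headroom."""
    return BASE_MEMORY_GB + target_servers * MEMORY_GB_PER_SERVER


def hub_count(target_servers: int) -> int:
    """Number of hubs, rounded up so no hub exceeds SERVERS_PER_HUB (assumption)."""
    return max(1, math.ceil(target_servers / SERVERS_PER_HUB))


# Reproduces the worked examples: 400 servers at 10s polling, 1,200 servers total.
print(required_vcpu(400, 10), required_memory_gb(400), hub_count(1200))
```

Doubling the poll interval halves the CPU estimate, which is why the larger tiers in the Quick Reference Chart pair more servers with longer poll intervals.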
Upgrading Calliope Hub Resources (Vertical)
- Update task definition with new CPU/memory
- Deploy new task revision
- Monitor CPU/memory usage for 24 hours
- Verify poll performance improved
- Check for API throttling errors (should decrease)
Downtime: None (rolling update)
Rollback: Easy (revert task definition)
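For the first checklist step, the task definition expects CPU in CPU units (1 vCPU = 1024) and memory in MiB, both as strings. The helper below is a hedged sketch of that conversion for the tiers used in this guide. `fargate_task_size` is a hypothetical name, and the allowed memory ranges reflect the Fargate CPU/memory combinations as I understand them, so confirm them against current AWS documentation before relying on the validation.

```python
# Memory (GB) that Fargate accepts for each vCPU size used in this guide.
# Assumption: ranges and increments match the current Fargate task-size table.
FARGATE_MEMORY_GB = {
    2: range(4, 17),          # 4-16 GB in 1 GB steps
    4: range(8, 31),          # 8-30 GB in 1 GB steps
    8: range(16, 61, 4),      # 16-60 GB in 4 GB steps
    16: range(32, 121, 8),    # 32-120 GB in 8 GB steps
}


def fargate_task_size(vcpu: int, memory_gb: int) -> dict:
    """Return the cpu/memory strings for a task definition."""
    allowed = FARGATE_MEMORY_GB.get(vcpu)
    if allowed is None:
        raise ValueError(f"unsupported vCPU size in this sketch: {vcpu}")
    if memory_gb not in allowed:
        raise ValueError(f"{memory_gb}GB is not valid with {vcpu} vCPU")
    return {"cpu": str(vcpu * 1024), "memory": str(memory_gb * 1024)}


print(fargate_task_size(4, 8))
# prints {'cpu': '4096', 'memory': '8192'}
```

Validating the pair up front avoids registering a revision that ECS would reject at deploy time.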
Migrating to PostgreSQL
- Provision RDS PostgreSQL instance (db.t3.small minimum)
- Create database and Calliope Hub user
- Update JUPYTERHUB_DB_URL in Secrets Manager
- Stop Calliope Hub task (will lose session state)
- Start Calliope Hub with new DB URL (creates schema)
- Verify users can log in and spawn servers
- Monitor database connections and query performance
Downtime: ~5-10 minutes (Calliope Hub restart)
Rollback: Difficult (data in new DB)
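The value stored in JUPYTERHUB_DB_URL is a SQLAlchemy-style connection URL, and a password containing characters such as `@` or `/` must be URL-encoded or the URL will be parsed incorrectly. The following is a minimal sketch of building that value. `build_hub_db_url` and the user, host, and database names in the example are hypothetical placeholders, not values from this deployment.

```python
from urllib.parse import quote_plus


def build_hub_db_url(user: str, password: str, host: str,
                     database: str, port: int = 5432) -> str:
    """Build a PostgreSQL URL with the credentials safely URL-encoded."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical values for illustration only.
url = build_hub_db_url("hub", "p@ss/word", "db.example.internal", "hub")
print(url)
# prints "postgresql://hub:p%40ss%2Fword@db.example.internal:5432/hub"
```

Generate the URL once, store it in Secrets Manager, and avoid logging it, since it embeds the database password.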
Adding Horizontal Scaling
- Ensure PostgreSQL is configured (required)
- Store all secrets in Secrets Manager (shared across hubs)
- Create Application Load Balancer (ALB)
- Configure target group with sticky sessions
- Update ECS service to desired_count: 2
- Test user sessions stick to same Calliope Hub
- Configure orphan detection coordination
- Test idle culling coordination
- Gradually increase desired_count
Downtime: None (add hubs incrementally)
Rollback: Medium difficulty (set desired_count: 1)
Performance Optimization Sequence
For optimal scaling, implement in this order:
Stage 1: Foundation (0-100 servers)
- ✅ Stable base configuration
- ✅ Monitoring and alerting
- ✅ Proper healthchecks (DONE in recent updates)
Stage 2: Database (100-200 servers)
- ⬜ Migrate to PostgreSQL
- ⬜ Add connection pooling (pgbouncer)
- ⬜ Index optimization
Stage 3: Vertical Scaling (200-500 servers)
- ⬜ Upgrade to 4 vCPU / 8GB
- ⬜ Increase poll intervals (10s → 15s)
- ⬜ Monitor API throttling
Stage 4: API Optimization (500-1,000 servers)
- ⬜ Implement batch API calls
- ⬜ Add response caching (10-15s TTL)
- ⬜ Upgrade to 8 vCPU / 16GB
Stage 5: Horizontal Scaling (1,000+ servers)
- ⬜ Set up ALB with sticky sessions
- ⬜ Deploy 2-3 Calliope Hub instances
- ⬜ Coordinate background services
- ⬜ Consider sharding strategy
Stage 6: Advanced (5,000+ servers)
- ⬜ Event-driven architecture (EventBridge)
- ⬜ Separate monitoring service
- ⬜ Redis caching cluster
- ⬜ Multi-region deployment
Quick Decision Matrix
“How should I scale?”
| Current Servers | Target Servers | Action | Timeline | Cost Delta |
|---|---|---|---|---|
| 50 | 100 | ✅ No change yet | N/A | $0 |
| 100 | 200 | Upgrade to 4 vCPU + PostgreSQL | 1 week | +$80/mo |
| 200 | 500 | Upgrade to 8 vCPU | 1 day | +$125/mo |
| 500 | 1,000 | Add batching + 2nd Calliope Hub | 2-3 weeks | +$140/mo |
| 1,000 | 2,000 | Add 1-2 more hubs + sharding | 4-6 weeks | +$240/mo |
| 2,000 | 5,000 | Redesign to event-driven | 8-12 weeks | +$1,000/mo |
Testing Your Limits
Load Testing Commands
1. Check current Calliope Hub resources:
aws ecs describe-tasks \
--cluster development \
--tasks <Calliope Hub-task-id> \
--query 'tasks[0].{cpu:cpu,memory:memory}'
2. Monitor CPU during peak:
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ServiceName,Value=dev-Calliope Calliope Hub-service \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Average,Maximum
3. Check API throttling:
# Look for ThrottlingException in logs
aws logs filter-log-events \
--log-group-name /aws/ecs/Calliope Hub-calliope-dev/Calliope Hub-Calliope Calliope Hub-service \
--filter-pattern "ThrottlingException" \
--start-time $(date -d '1 hour ago' +%s)000
4. Count active spawners:
# Check Calliope Hub logs for spawner count
aws logs filter-log-events \
--log-group-name /aws/ecs/Calliope Hub-calliope-dev/Calliope Hub-Calliope Calliope Hub-service \
--filter-pattern "running tasks" \
--max-items 10
Emergency Scaling
If you suddenly need more capacity RIGHT NOW:
Option 1: Quick Vertical Scale (15 minutes)
# Update Calliope Hub task definition
# Change: cpu: "4096", memory: "8192"
# Deploy new revision
# Capacity: 100 → 300 servers
Option 2: Reduce Polling (5 minutes)
# In jupyterhub_config.py
c.Spawner.poll_interval = 30 # from 5-10s
# Restart Calliope Hub
# Capacity: 2x improvement
# Tradeoff: Slower dead server detection
Option 3: Disable Orphan Discovery (2 minutes)
# Temporarily disable in ecs.py poll() method
# Comment out orphan discovery block
# Restart Calliope Hub
# Capacity: 20% improvement
# Tradeoff: No automatic ghost cleanup
Best Practices
Calliope Hub Resource Sizing
Always maintain headroom:
- Target 60-70% CPU utilization (not 90%)
- Keep 30-40% memory free
- Leave room for traffic spikes
Scale up when:
- CPU sustained > 70% for 5+ minutes
- Poll cycles taking > 15 seconds
- API throttling errors appear
Don’t over-provision:
- 2x capacity headroom is wasteful
- Scale in steps (2→4→8 vCPU)
- Monitor for 24 hours before adding more
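The "Scale up when" rules above are simple enough to check mechanically. The sketch below assumes one metrics sample per minute, so "sustained for 5+ minutes" becomes "the last five samples all exceed 70%". `HubSample` and `scale_up_reasons` are hypothetical names, and how the samples are collected (CloudWatch, Hub logs) is left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class HubSample:
    cpu_percent: float     # average Hub CPU over one minute
    poll_cycle_s: float    # duration of the most recent poll cycle
    throttle_errors: int   # API throttling errors seen in that minute


def scale_up_reasons(samples: Sequence[HubSample],
                     sustained_minutes: int = 5) -> List[str]:
    """Return the scale-up rules that currently fire (empty list if none)."""
    reasons = []
    recent = list(samples)[-sustained_minutes:]
    # CPU must exceed 70% for the whole window, not just one spike.
    if len(recent) >= sustained_minutes and all(s.cpu_percent > 70 for s in recent):
        reasons.append(f"CPU sustained above 70% for {sustained_minutes}+ minutes")
    if recent and recent[-1].poll_cycle_s > 15:
        reasons.append("Poll cycles taking longer than 15 seconds")
    if any(s.throttle_errors > 0 for s in recent):
        reasons.append("API throttling errors appear")
    return reasons


spike = [HubSample(40, 8, 0)] * 4 + [HubSample(95, 8, 0)]
print(scale_up_reasons(spike))
# prints [] because a single hot minute is not sustained load
```

Requiring the full window keeps a brief spawn burst from triggering an upgrade, in line with the advice to scale in steps and monitor before adding more.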
Database Sizing
PostgreSQL Instances:
| Concurrent Servers | RDS Instance | Connections | Cost/Month |
|---|---|---|---|
| 100-500 | db.t3.small (2 vCPU, 2GB) | 100 | $25 |
| 500-1,500 | db.t3.medium (2 vCPU, 4GB) | 200 | $50 |
| 1,500-5,000 | db.r5.large (2 vCPU, 16GB) | 500 | $150 |
| 5,000-10,000 | db.r5.2xlarge (8 vCPU, 64GB) | 1,500 | $600 |
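The table above can also be read from the connection side: pick the smallest instance that covers both the server count and the number of database connections the hubs will open. This is a sketch with hypothetical names (`RdsTier`, `recommend_rds_instance`). The rows are copied from the table, and range upper bounds are treated as inclusive.

```python
from typing import NamedTuple


class RdsTier(NamedTuple):
    max_servers: int
    instance: str
    connections: int
    monthly_cost_usd: int


# Rows copied from the PostgreSQL Instances table.
RDS_TIERS = [
    RdsTier(500, "db.t3.small", 100, 25),
    RdsTier(1500, "db.t3.medium", 200, 50),
    RdsTier(5000, "db.r5.large", 500, 150),
    RdsTier(10000, "db.r5.2xlarge", 1500, 600),
]


def recommend_rds_instance(concurrent_servers: int,
                           needed_connections: int = 0) -> RdsTier:
    """Smallest tier satisfying both the server count and connection count."""
    for tier in RDS_TIERS:
        if (concurrent_servers <= tier.max_servers
                and needed_connections <= tier.connections):
            return tier
    raise ValueError("requirements exceed the largest tier in the table")


print(recommend_rds_instance(400).instance)
# prints "db.t3.small"
print(recommend_rds_instance(400, needed_connections=180).instance)
# prints "db.t3.medium"
```

The second call shows why connection pooling matters: without it, several hubs can force a larger instance long before the server count does.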
Connection Pooling:
# Use pgbouncer to reduce connection overhead
c.JupyterHub.db_url = "postgresql://pgbouncer:6432/jupyterhub"
# Pool settings for 1000 servers:
pool_size = 20 # Max active connections
max_overflow = 40 # Additional connections allowed
Cost Optimization
Calliope Hub Infrastructure Cost Breakdown
Formula:
Monthly Cost = (vCPU × $14.64 × 730h) + (GB Memory × $1.61 × 730h)
Examples:
2 vCPU / 4GB = (2 × $14.64 × 730) + (4 × $1.61 × 730) = $61/mo
4 vCPU / 8GB = (4 × $14.64 × 730) + (8 × $1.61 × 730) = $122/mo
8 vCPU / 16GB = (8 × $14.64 × 730) + (16 × $1.61 × 730) = $244/mo
Cost per Server (Calliope Hub overhead only)
| Calliope Hub Config | Servers | Calliope Hub Cost | Cost/Server |
|---|---|---|---|
| 2 vCPU / 4GB | 100 | $61 | $0.61 |
| 4 vCPU / 8GB | 300 | $122 | $0.41 |
| 8 vCPU / 16GB | 600 | $244 | $0.41 |
| 3 × 4 vCPU hubs | 900 | $366 | $0.41 |
Key Insight: Calliope Hub overhead is only about $0.41-0.61 per server per month - don’t under-provision!
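The Cost/Server column is just the total Hub cost divided by the servers it supports. A small sketch (the function name is illustrative) reproduces the table rows from their own inputs:

```python
def cost_per_server(hub_monthly_cost: float, servers: int,
                    hub_count: int = 1) -> float:
    """Monthly Hub overhead attributed to each spawned server."""
    if servers <= 0:
        raise ValueError("servers must be positive")
    return hub_monthly_cost * hub_count / servers


# (monthly cost per hub, hub count, servers), taken from the table above.
rows = [(61, 1, 100), (122, 1, 300), (244, 1, 600), (122, 3, 900)]
for cost, hubs, servers in rows:
    print(f"${cost_per_server(cost, servers, hubs):.2f}")
# prints $0.61, $0.41, $0.41, $0.41 (one per line)
```

The per-server figure drops after the smallest tier and then stays flat, because Hub cost and supported servers grow at the same rate from 4 vCPU upward.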
Growth Scenarios
Scenario 1: Startup (Months 1-6)
Growth: 10 → 100 servers
Month 1-3: 10-50 servers
- Keep current config (2 vCPU / 4GB)
- Monitor usage patterns
- Cost: $60/month
Month 4-6: 50-100 servers
- Migrate to PostgreSQL
- No Calliope Hub upgrade needed yet
- Cost: $80/month (+$20)
Total Investment: $20/month additional
Scenario 2: Rapid Growth (Months 1-12)
Growth: 50 → 500 servers
Month 1-3: 50-100 servers
- Migrate to PostgreSQL immediately
- Cost: $80/month
Month 4-6: 100-200 servers
- Upgrade Calliope Hub to 4 vCPU / 8GB
- Cost: $140/month (+$60)
Month 7-9: 200-400 servers
- Upgrade Calliope Hub to 8 vCPU / 16GB
- Implement batch API calls
- Cost: $265/month (+$125)
Month 10-12: 400-500 servers
- Consider 2nd Calliope Hub if approaching limits
- Cost: $265-380/month
Total Investment: $300/month over 12 months
Scenario 3: Enterprise Scale (12-24 months)
Growth: 100 → 2,000+ servers
Phase 1 (Months 1-6): 100-500 servers
- Vertical scaling (4 → 8 vCPU)
- PostgreSQL migration
- Cost: $265/month
Phase 2 (Months 7-12): 500-1,000 servers
- Implement batching + caching
- Add 2nd Calliope Hub with ALB
- Cost: $340/month
Phase 3 (Months 13-18): 1,000-1,500 servers
- Add 3rd Calliope Hub
- Implement sharding
- Cost: $435/month
Phase 4 (Months 19-24): 1,500-2,000 servers
- Add 4-5th Calliope Hub
- Event-driven architecture
- Cost: $650/month
Total Investment: $650/month infrastructure
Monitoring Your Scale
Key Metrics Dashboard
1. Capacity Metrics
Current Servers / Max Recommended = Capacity %
Target: < 70% capacity
Warning: > 80% capacity
Critical: > 90% capacity
2. Performance Metrics
Poll Cycle Duration: < 10s (good), 10-20s (ok), > 20s (bad)
API Error Rate: < 0.1% (good), 0.1-1% (warning), > 1% (critical)
Calliope Hub CPU: < 70% (good), 70-85% (warning), > 85% (critical)
3. User Experience Metrics
Spawn Time: < 60s (excellent), 60-120s (good), > 120s (poor)
Poll Lag: < 30s (good), 30-60s (ok), > 60s (bad)
UI Response: < 1s (excellent), 1-3s (ok), > 3s (poor)
Scale-Up Triggers
Automatic triggers for scaling:
Upgrade Calliope Hub (Vertical)
IF cpu_utilization > 70% for 10 minutes
OR poll_duration > 15s average
THEN upgrade to next vCPU tier
Add Database (PostgreSQL)
IF servers > 100
OR database_lock_errors > 0
THEN migrate to PostgreSQL
Add Calliope Hub (Horizontal)
IF servers > 400
AND cpu_utilization > 80% with 8+ vCPU
THEN add 2nd Calliope Hub with load balancer
Optimize API Calls
IF api_throttle_errors > 0
OR servers > 300
THEN implement batch API calls
Appendix: AWS Service Limits
Default Quotas (us-west-2)
| Service | Quota | Limit | Requestable To |
|---|---|---|---|
| ECS - Fargate vCPU | On-Demand | 10,000 | 100,000+ |
| ECS - Tasks per service | Running tasks | 5,000 | No limit |
| VPC - ENIs | Network interfaces per Region | 5,000 | 50,000 |
| ECS API - describe_tasks | TPS | 100 | 500 (case-by-case) |
| ECS API - list_tasks | TPS | 100 | 500 (case-by-case) |
| ECS - Task definitions | Active revisions | 1,000,000 | N/A |
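To see how close a deployment sits to the describe_tasks quota above, a rough call-rate estimate is often enough. The sketch below assumes the spawner issues one describe_tasks call per server per poll cycle when unbatched, which is an assumption about the spawner rather than something this guide states, and that batching groups up to 100 task IDs per call, the maximum the ECS DescribeTasks API accepts. The function name is illustrative.

```python
import math

DESCRIBE_TASKS_MAX_IDS = 100   # ECS DescribeTasks accepts up to 100 tasks per call
DEFAULT_QUOTA_TPS = 100        # describe_tasks row in the quota table above


def describe_tasks_calls_per_second(servers: int, poll_interval_s: float,
                                    batch_size: int = 1) -> float:
    """Estimated steady-state describe_tasks call rate."""
    if not 1 <= batch_size <= DESCRIBE_TASKS_MAX_IDS:
        raise ValueError("batch_size must be between 1 and 100")
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    calls_per_cycle = math.ceil(servers / batch_size)
    return calls_per_cycle / poll_interval_s


unbatched = describe_tasks_calls_per_second(500, 15)
batched = describe_tasks_calls_per_second(500, 15, batch_size=100)
print(f"{unbatched:.1f} vs {batched:.1f} calls/s "
      f"({unbatched / DEFAULT_QUOTA_TPS:.0%} of quota unbatched)")
# prints "33.3 vs 0.3 calls/s (33% of quota unbatched)"
```

The estimate ignores list_tasks calls, spawn bursts, and retries, so treat it as a lower bound when deciding whether to request a limit increase.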
Requesting Limit Increases
To request ECS API limit increase:
1. AWS Console → Service Quotas
2. Search "ECS API rate limits"
3. Request increase with justification
4. Typical approval: 1-3 business days
Recommended when:
- Planning for > 500 concurrent servers
- Seeing throttling errors
- Before major launch/growth event
Summary Recommendations
By Scale
| Server Count | Recommended Setup | Estimated Cost |
|---|---|---|
| < 100 | 2 vCPU / 4GB single Calliope Hub | $60/mo |
| 100-300 | 4 vCPU / 8GB + PostgreSQL | $140/mo |
| 300-600 | 8 vCPU / 16GB + PostgreSQL + batching | $265/mo |
| 600-1,500 | 2-3 hubs @ 4 vCPU + ALB | $360/mo |
| 1,500-3,000 | 5 hubs @ 4 vCPU + sharding | $650/mo |
| 3,000+ | Event-driven architecture | Custom |
Critical Thresholds
- 100 servers: Migrate to PostgreSQL
- 200 servers: Upgrade to 4 vCPU
- 400 servers: Implement API batching
- 600 servers: Add horizontal scaling or hit API limits
- 1,000 servers: Architectural optimization required
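When planning a growth step, the thresholds above can be filtered to just the ones crossed between the current and target server counts. A minimal sketch, with an illustrative function name and the threshold list copied from this section:

```python
from typing import List

# (server count, action) pairs from the Critical Thresholds list.
CRITICAL_THRESHOLDS = [
    (100, "Migrate to PostgreSQL"),
    (200, "Upgrade to 4 vCPU"),
    (400, "Implement API batching"),
    (600, "Add horizontal scaling or hit API limits"),
    (1000, "Architectural optimization required"),
]


def actions_between(current_servers: int, target_servers: int) -> List[str]:
    """Actions whose threshold lies in (current_servers, target_servers]."""
    return [
        action
        for threshold, action in CRITICAL_THRESHOLDS
        if current_servers < threshold <= target_servers
    ]


print(actions_between(50, 300))
# prints ['Migrate to PostgreSQL', 'Upgrade to 4 vCPU']
```

For a deployment growing from 50 to 300 servers, this yields the PostgreSQL migration and the 4 vCPU upgrade and nothing else.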
For your current target (“a few 100 containers”):
✅ Recommended: Upgrade to 4 vCPU / 8GB + PostgreSQL
- Handles 300-400 servers comfortably
- Simple vertical scaling
- Cost: ~$140/month
- No architectural changes needed
- Room for growth to 500+ servers
This gives you plenty of headroom without over-engineering!