Calliope Calliope Hub Spawner Scaling Guide

Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.

This document provides a linear scaling chart and recommendations for growing your Calliope Calliope Hub deployment from 10 to 10,000+ concurrent servers.

Quick Reference Chart

Vertical Scaling (Single Calliope Hub)

Target Servers	Calliope Hub vCPU	Calliope Hub Memory	Database	Poll Interval	Monthly Cost*	Notes
10-50	2	4GB	SQLite	5-10s	$60	Current config, ideal for dev/small teams
50-100	2	4GB	SQLite	5-10s	$60	Good performance, no changes needed
100-150	2	4GB	PostgreSQL	10s	$60 + $20**	Approaching CPU limit, DB upgrade recommended
150-300	4	8GB	PostgreSQL	10-15s	$120 + $20**	2x resources, smooth performance
300-500	8	16GB	PostgreSQL	15-20s	$245 + $20**	Near ECS API limits
500-800	16	32GB	PostgreSQL	20-30s	$490 + $20**	Approaching hard API limits

*Calliope Hub task only (excludes spawned servers) **RDS PostgreSQL db.t3.small cost

Horizontal Scaling (Multiple Hubs)

Target Servers	Calliope Hub Count	Calliope Hub Size	Total vCPU	Database	Optimizations	Monthly Cost*	Notes
500-1,000	2	4 vCPU / 8GB	8 vCPU	PostgreSQL	None	$240 + $50**	ALB + sticky sessions required
1,000-2,000	3	4 vCPU / 8GB	12 vCPU	PostgreSQL	Batching	$360 + $75**	API batching essential
2,000-5,000	5	8 vCPU / 16GB	40 vCPU	PostgreSQL	Batching + Caching	$1,200 + $150**	Sharding recommended
5,000-10,000	10	8 vCPU / 16GB	80 vCPU	PostgreSQL	Full stack	$2,400 + $300**	Event-driven recommended

*Calliope Hub tasks only (excludes spawned servers and ALB costs) **RDS PostgreSQL cost (scales with connections)

Scaling Paths

Path 1: Small to Medium (10 → 500 servers)

Vertical Scaling Only - Simplest approach

Step 1: Start (10-50 servers)
├─ Calliope Hub: 2 vCPU / 4GB
├─ Database: SQLite
└─ Cost: $60/month

Step 2: Growing (50-100 servers)
├─ Calliope Hub: 2 vCPU / 4GB (no change)
├─ Database: Switch to PostgreSQL
└─ Cost: $80/month (+$20 for RDS)

Step 3: Scaling (100-300 servers)
├─ Calliope Hub: 4 vCPU / 8GB
├─ Database: PostgreSQL
└─ Cost: $140/month

Step 4: Large (300-500 servers)
├─ Calliope Hub: 8 vCPU / 16GB
├─ Database: PostgreSQL
├─ Poll interval: 15-20s
└─ Cost: $265/month

Timeline: Can scale in days (just update task definition) Complexity: Low (no architectural changes)

Path 2: Medium to Large (500 → 2,000 servers)

Add Horizontal Scaling - Moderate complexity

Step 1: Prepare (500 servers)
├─ Calliope Hub: 8 vCPU / 16GB
├─ Database: PostgreSQL (connection pooling)
├─ Implement: API call batching
└─ Cost: $265/month

Step 2: First Scale-Out (500-1,000 servers)
├─ Hubs: 2 × 4 vCPU / 8GB
├─ Load Balancer: ALB with sticky sessions
├─ Database: PostgreSQL (larger instance)
└─ Cost: $340/month

Step 3: Growing (1,000-2,000 servers)
├─ Hubs: 3 × 4 vCPU / 8GB
├─ Optimizations: Batching + caching
└─ Cost: $435/month

Timeline: 2-4 weeks (load balancer setup + testing) Complexity: Medium (requires coordination between hubs)

Path 3: Large Scale (2,000+ servers)

Enterprise Architecture - High complexity

Step 1: Optimize (2,000 servers)
├─ Hubs: 5 × 8 vCPU / 16GB
├─ Sharding: By user group or service type
├─ Database: PostgreSQL (db.r5.2xlarge or larger)
├─ Caching: Redis cluster for API responses
└─ Cost: $1,500/month

Step 2: Event-Driven (5,000+ servers)
├─ Hubs: 10 × 8 vCPU / 16GB
├─ Monitoring: Separate service with EventBridge
├─ Database: Aurora PostgreSQL (multi-AZ)
├─ Polling: Reduced to 60s (events handle state)
└─ Cost: $3,000+/month

Timeline: 2-3 months (architectural redesign) Complexity: High (requires distributed systems expertise)

Linear Scaling Formula

Calculate Required Calliope Hub Resources

CPU (vCPU):

Required vCPU = (Target Servers × 0.1s CPU) ÷ Poll Interval

Example for 400 servers with 10s polling:
= (400 × 0.1) ÷ 10
= 4 vCPU

Memory (GB):

Required Memory = 1 GB + (Target Servers × 0.001 GB)

Example for 400 servers:
= 1 + (400 × 0.001)
= 1.4 GB (use 4GB for headroom)

Calliope Hub Count (for horizontal scaling):

Calliope Hub Count = Target Servers ÷ 300

Example for 1,200 servers:
= 1,200 ÷ 300
= 4 hubs @ 4 vCPU / 8GB each

Migration Checklist

Upgrading Calliope Hub Resources (Vertical)

Update task definition with new CPU/memory
Deploy new task revision
Monitor CPU/memory usage for 24 hours
Verify poll performance improved
Check for API throttling errors (should decrease)

Downtime: None (rolling update) Rollback: Easy (revert task definition)

Migrating to PostgreSQL

Provision RDS PostgreSQL instance (db.t3.small minimum)
Create database and Calliope Calliope Hub user
Update JUPYTERHUB_DB_URL in Secrets Manager
Stop Calliope Hub task (will lose session state)
Start Calliope Hub with new DB URL (creates schema)
Verify users can log in and spawn servers
Monitor database connections and query performance

Downtime: ~5-10 minutes (Calliope Hub restart) Rollback: Difficult (data in new DB)

Adding Horizontal Scaling

Ensure PostgreSQL is configured (required)
Store all secrets in Secrets Manager (shared across hubs)
Create Application Load Balancer (ALB)
Configure target group with sticky sessions
Update ECS service to desired_count: 2
Test user sessions stick to same Calliope Hub
Configure orphan detection coordination
Test idle culling coordination
Gradually increase desired_count

Downtime: None (add hubs incrementally) Rollback: Medium difficulty (set desired_count: 1)

Performance Optimization Sequence

For optimal scaling, implement in this order:

Stage 1: Foundation (0-100 servers)

✅ Stable base configuration
✅ Monitoring and alerting
✅ Proper healthchecks (DONE in recent updates)

Stage 2: Database (100-200 servers)

⬜ Migrate to PostgreSQL
⬜ Add connection pooling (pgbouncer)
⬜ Index optimization

Stage 3: Vertical Scaling (200-500 servers)

⬜ Upgrade to 4 vCPU / 8GB
⬜ Increase poll intervals (10s → 15s)
⬜ Monitor API throttling

Stage 4: API Optimization (500-1,000 servers)

⬜ Implement batch API calls
⬜ Add response caching (10-15s TTL)
⬜ Upgrade to 8 vCPU / 16GB

Stage 5: Horizontal Scaling (1,000+ servers)

⬜ Set up ALB with sticky sessions
⬜ Deploy 2-3 Calliope Hub instances
⬜ Coordinate background services
⬜ Consider sharding strategy

Stage 6: Advanced (5,000+ servers)

⬜ Event-driven architecture (EventBridge)
⬜ Separate monitoring service
⬜ Redis caching cluster
⬜ Multi-region deployment

Quick Decision Matrix

“How should I scale?”

Current Servers	Target Servers	Action	Timeline	Cost Delta
50	100	✅ No change yet	N/A	$0
100	200	Upgrade to 4 vCPU + PostgreSQL	1 week	+$80/mo
200	500	Upgrade to 8 vCPU	1 day	+$125/mo
500	1,000	Add batching + 2nd Calliope Hub	2-3 weeks	+$140/mo
1,000	2,000	Add 1-2 more hubs + sharding	4-6 weeks	+$240/mo
2,000	5,000	Redesign to event-driven	8-12 weeks	+$1,000/mo

Testing Your Limits

Load Testing Commands

1. Check current Calliope Hub resources:

aws ecs describe-tasks \
  --cluster development \
  --tasks <Calliope Hub-task-id> \
  --query 'tasks[0].{cpu:cpu,memory:memory}'

2. Monitor CPU during peak:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=dev-Calliope Calliope Hub-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average,Maximum

3. Check API throttling:

# Look for ThrottlingException in logs
aws logs filter-log-events \
  --log-group-name /aws/ecs/Calliope Hub-calliope-dev/Calliope Hub-Calliope Calliope Hub-service \
  --filter-pattern "ThrottlingException" \
  --start-time $(date -d '1 hour ago' +%s)000

4. Count active spawners:

# Check Calliope Hub logs for spawner count
aws logs filter-log-events \
  --log-group-name /aws/ecs/Calliope Hub-calliope-dev/Calliope Hub-Calliope Calliope Hub-service \
  --filter-pattern "running tasks" \
  --max-items 10

Emergency Scaling

If you suddenly need more capacity RIGHT NOW:

Option 1: Quick Vertical Scale (15 minutes)

# Update Calliope Hub task definition
# Change: cpu: "4096", memory: "8192"
# Deploy new revision
# Capacity: 100 → 300 servers

Option 2: Reduce Polling (5 minutes)

# In jupyterhub_config.py
c.Spawner.poll_interval = 30  # from 5-10s
# Restart Calliope Hub
# Capacity: 2x improvement
# Tradeoff: Slower dead server detection

Option 3: Disable Orphan Discovery (2 minutes)

# Temporarily disable in ecs.py poll() method
# Comment out orphan discovery block
# Restart Calliope Hub
# Capacity: 20% improvement
# Tradeoff: No automatic ghost cleanup

Best Practices

Calliope Hub Resource Sizing

Always maintain headroom:

Target 60-70% CPU utilization (not 90%)
Keep 30-40% memory free
Leave room for traffic spikes

Scale up when:

CPU sustained > 70% for 5+ minutes
Poll cycles taking > 15 seconds
API throttling errors appear

Don’t over-provision:

2x capacity headroom is wasteful
Scale in steps (2→4→8 vCPU)
Monitor for 24 hours before adding more

Database Sizing

PostgreSQL Instances:

Concurrent Servers	RDS Instance	Connections	Cost/Month
100-500	db.t3.small (2 vCPU, 2GB)	100	$25
500-1,500	db.t3.medium (2 vCPU, 4GB)	200	$50
1,500-5,000	db.r5.large (2 vCPU, 16GB)	500	$150
5,000-10,000	db.r5.2xlarge (8 vCPU, 64GB)	1,500	$600

Connection Pooling:

# Use pgbouncer to reduce connection overhead
c.Calliope Calliope Hub.db_url = "postgresql://pgbouncer:6432/Calliope Calliope Hub"

# Pool settings for 1000 servers:
pool_size = 20  # Max active connections
max_overflow = 40  # Additional connections allowed

Cost Optimization

Calliope Hub Infrastructure Cost Breakdown

Formula:

Monthly Cost = (vCPU × $14.64 × 730h) + (GB Memory × $1.61 × 730h)

Examples:

2 vCPU / 4GB  = (2 × $14.64 × 730) + (4 × $1.61 × 730) = $61/mo
4 vCPU / 8GB  = (4 × $14.64 × 730) + (8 × $1.61 × 730) = $122/mo
8 vCPU / 16GB = (8 × $14.64 × 730) + (16 × $1.61 × 730) = $244/mo

Cost per Server (Calliope Hub overhead only)

Calliope Hub Config	Servers	Calliope Hub Cost	Cost/Server
2 vCPU / 4GB	100	$61	$0.61
4 vCPU / 8GB	300	$122	$0.41
8 vCPU / 16GB	600	$244	$0.41
3 × 4 vCPU hubs	900	$366	$0.41

Key Insight: Calliope Hub overhead is <$0.50 per server - don’t under-provision!

Growth Scenarios

Scenario 1: Startup (Months 1-6)

Growth: 10 → 100 servers

Month 1-3: 10-50 servers

Keep current config (2 vCPU / 4GB)
Monitor usage patterns
Cost: $60/month

Month 4-6: 50-100 servers

Migrate to PostgreSQL
No Calliope Hub upgrade needed yet
Cost: $80/month (+$20)

Total Investment: $20/month additional

Scenario 2: Rapid Growth (Months 1-12)

Growth: 50 → 500 servers

Month 1-3: 50-100 servers

Migrate to PostgreSQL immediately
Cost: $80/month

Month 4-6: 100-200 servers

Upgrade Calliope Hub to 4 vCPU / 8GB
Cost: $140/month (+$60)

Month 7-9: 200-400 servers

Upgrade Calliope Hub to 8 vCPU / 16GB
Implement batch API calls
Cost: $265/month (+$125)

Month 10-12: 400-500 servers

Consider 2nd Calliope Hub if approaching limits
Cost: $265-380/month

Total Investment: $300/month over 12 months

Scenario 3: Enterprise Scale (12-24 months)

Growth: 100 → 2,000+ servers

Phase 1 (Months 1-6): 100-500 servers

Vertical scaling (4 → 8 vCPU)
PostgreSQL migration
Cost: $265/month

Phase 2 (Months 7-12): 500-1,000 servers

Implement batching + caching
Add 2nd Calliope Hub with ALB
Cost: $340/month

Phase 3 (Months 13-18): 1,000-1,500 servers

Add 3rd Calliope Hub
Implement sharding
Cost: $435/month

Phase 4 (Months 19-24): 1,500-2,000 servers

Add 4-5th Calliope Hub
Event-driven architecture
Cost: $650/month

Total Investment: $650/month infrastructure

Monitoring Your Scale

Key Metrics Dashboard

1. Capacity Metrics

Current Servers / Max Recommended = Capacity %

Target: < 70% capacity
Warning: > 80% capacity
Critical: > 90% capacity

2. Performance Metrics

Poll Cycle Duration: < 10s (good), 10-20s (ok), > 20s (bad)
API Error Rate: < 0.1% (good), 0.1-1% (warning), > 1% (critical)
Calliope Hub CPU: < 70% (good), 70-85% (warning), > 85% (critical)

3. User Experience Metrics

Spawn Time: < 60s (excellent), 60-120s (good), > 120s (poor)
Poll Lag: < 30s (good), 30-60s (ok), > 60s (bad)
UI Response: < 1s (excellent), 1-3s (ok), > 3s (poor)

Scale-Up Triggers

Automatic triggers for scaling:

Upgrade Calliope Hub (Vertical)

IF cpu_utilization > 70% for 10 minutes
OR poll_duration > 15s average
THEN upgrade to next vCPU tier

Add Database (PostgreSQL)

IF servers > 100
OR database_lock_errors > 0
THEN migrate to PostgreSQL

Add Calliope Hub (Horizontal)

IF servers > 400
AND cpu_utilization > 80% with 8+ vCPU
THEN add 2nd Calliope Hub with load balancer

Optimize API Calls

IF api_throttle_errors > 0
OR servers > 300
THEN implement batch API calls

Appendix: AWS Service Limits

Default Quotas (us-west-2)

Service	Quota	Limit	Requestable To
ECS - Fargate vCPU	On-Demand	10,000	100,000+
ECS - Tasks per service	Running tasks	5,000	No limit
VPC - ENIs	Elastic IPs	5,000	50,000
ECS API - describe_tasks	TPS	100	500 (case-by-case)
ECS API - list_tasks	TPS	100	500 (case-by-case)
ECS - Task definitions	Active revisions	1,000,000	N/A

Requesting Limit Increases

To request ECS API limit increase:

1. AWS Console → Service Quotas
2. Search "ECS API rate limits"
3. Request increase with justification
4. Typical approval: 1-3 business days

Recommended when:

Planning for > 500 concurrent servers
Seeing throttling errors
Before major launch/growth event

Summary Recommendations

By Scale

Server Count	Recommended Setup	Estimated Cost
< 100	2 vCPU / 4GB single Calliope Hub	$60/mo
100-300	4 vCPU / 8GB + PostgreSQL	$140/mo
300-600	8 vCPU / 16GB + PostgreSQL + batching	$265/mo
600-1,500	2-3 hubs @ 4 vCPU + ALB	$360/mo
1,500-3,000	5 hubs @ 4 vCPU + sharding	$650/mo
3,000+	Event-driven architecture	Custom

Critical Thresholds

100 servers: Migrate to PostgreSQL
200 servers: Upgrade to 4 vCPU
400 servers: Implement API batching
600 servers: Add horizontal scaling or hit API limits
1,000 servers: Architectural optimization required

For your current target (“a few 100 containers”):

✅ Recommended: Upgrade to 4 vCPU / 8GB + PostgreSQL

Handles 300-400 servers comfortably
Simple vertical scaling
Cost: ~$140/month
No architectural changes needed
Room for growth to 500+ servers

This gives you plenty of headroom without over-engineering!

Calliope Calliope Hub Performance Optimization Guide Healthcheck Tuning and Recent Fixes