Calliope Calliope Hub Spawner Scaling Guide

Calliope Calliope Hub Spawner Scaling Guide

Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.

This document provides a linear scaling chart and recommendations for growing your Calliope Calliope Hub deployment from 10 to 10,000+ concurrent servers.


Quick Reference Chart

Vertical Scaling (Single Calliope Hub)

Target ServersCalliope Hub vCPUCalliope Hub MemoryDatabasePoll IntervalMonthly Cost*Notes
10-5024GBSQLite5-10s$60Current config, ideal for dev/small teams
50-10024GBSQLite5-10s$60Good performance, no changes needed
100-15024GBPostgreSQL10s$60 + $20**Approaching CPU limit, DB upgrade recommended
150-30048GBPostgreSQL10-15s$120 + $20**2x resources, smooth performance
300-500816GBPostgreSQL15-20s$245 + $20**Near ECS API limits
500-8001632GBPostgreSQL20-30s$490 + $20**Approaching hard API limits

*Calliope Hub task only (excludes spawned servers) **RDS PostgreSQL db.t3.small cost

Horizontal Scaling (Multiple Hubs)

Target ServersCalliope Hub CountCalliope Hub SizeTotal vCPUDatabaseOptimizationsMonthly Cost*Notes
500-1,00024 vCPU / 8GB8 vCPUPostgreSQLNone$240 + $50**ALB + sticky sessions required
1,000-2,00034 vCPU / 8GB12 vCPUPostgreSQLBatching$360 + $75**API batching essential
2,000-5,00058 vCPU / 16GB40 vCPUPostgreSQLBatching + Caching$1,200 + $150**Sharding recommended
5,000-10,000108 vCPU / 16GB80 vCPUPostgreSQLFull stack$2,400 + $300**Event-driven recommended

*Calliope Hub tasks only (excludes spawned servers and ALB costs) **RDS PostgreSQL cost (scales with connections)


Scaling Paths

Path 1: Small to Medium (10 โ†’ 500 servers)

Vertical Scaling Only - Simplest approach

Step 1: Start (10-50 servers)
โ”œโ”€ Calliope Hub: 2 vCPU / 4GB
โ”œโ”€ Database: SQLite
โ””โ”€ Cost: $60/month

Step 2: Growing (50-100 servers)
โ”œโ”€ Calliope Hub: 2 vCPU / 4GB (no change)
โ”œโ”€ Database: Switch to PostgreSQL
โ””โ”€ Cost: $80/month (+$20 for RDS)

Step 3: Scaling (100-300 servers)
โ”œโ”€ Calliope Hub: 4 vCPU / 8GB
โ”œโ”€ Database: PostgreSQL
โ””โ”€ Cost: $140/month

Step 4: Large (300-500 servers)
โ”œโ”€ Calliope Hub: 8 vCPU / 16GB
โ”œโ”€ Database: PostgreSQL
โ”œโ”€ Poll interval: 15-20s
โ””โ”€ Cost: $265/month

Timeline: Can scale in days (just update task definition) Complexity: Low (no architectural changes)


Path 2: Medium to Large (500 โ†’ 2,000 servers)

Add Horizontal Scaling - Moderate complexity

Step 1: Prepare (500 servers)
โ”œโ”€ Calliope Hub: 8 vCPU / 16GB
โ”œโ”€ Database: PostgreSQL (connection pooling)
โ”œโ”€ Implement: API call batching
โ””โ”€ Cost: $265/month

Step 2: First Scale-Out (500-1,000 servers)
โ”œโ”€ Hubs: 2 ร— 4 vCPU / 8GB
โ”œโ”€ Load Balancer: ALB with sticky sessions
โ”œโ”€ Database: PostgreSQL (larger instance)
โ””โ”€ Cost: $340/month

Step 3: Growing (1,000-2,000 servers)
โ”œโ”€ Hubs: 3 ร— 4 vCPU / 8GB
โ”œโ”€ Optimizations: Batching + caching
โ””โ”€ Cost: $435/month

Timeline: 2-4 weeks (load balancer setup + testing) Complexity: Medium (requires coordination between hubs)


Path 3: Large Scale (2,000+ servers)

Enterprise Architecture - High complexity

Step 1: Optimize (2,000 servers)
โ”œโ”€ Hubs: 5 ร— 8 vCPU / 16GB
โ”œโ”€ Sharding: By user group or service type
โ”œโ”€ Database: PostgreSQL (db.r5.2xlarge or larger)
โ”œโ”€ Caching: Redis cluster for API responses
โ””โ”€ Cost: $1,500/month

Step 2: Event-Driven (5,000+ servers)
โ”œโ”€ Hubs: 10 ร— 8 vCPU / 16GB
โ”œโ”€ Monitoring: Separate service with EventBridge
โ”œโ”€ Database: Aurora PostgreSQL (multi-AZ)
โ”œโ”€ Polling: Reduced to 60s (events handle state)
โ””โ”€ Cost: $3,000+/month

Timeline: 2-3 months (architectural redesign) Complexity: High (requires distributed systems expertise)


Linear Scaling Formula

Calculate Required Calliope Hub Resources

CPU (vCPU):

Required vCPU = (Target Servers ร— 0.1s CPU) รท Poll Interval

Example for 400 servers with 10s polling:
= (400 ร— 0.1) รท 10
= 4 vCPU

Memory (GB):

Required Memory = 1 GB + (Target Servers ร— 0.001 GB)

Example for 400 servers:
= 1 + (400 ร— 0.001)
= 1.4 GB (use 4GB for headroom)

Calliope Hub Count (for horizontal scaling):

Calliope Hub Count = Target Servers รท 300

Example for 1,200 servers:
= 1,200 รท 300
= 4 hubs @ 4 vCPU / 8GB each

Migration Checklist

Upgrading Calliope Hub Resources (Vertical)

  • Update task definition with new CPU/memory
  • Deploy new task revision
  • Monitor CPU/memory usage for 24 hours
  • Verify poll performance improved
  • Check for API throttling errors (should decrease)

Downtime: None (rolling update) Rollback: Easy (revert task definition)


Migrating to PostgreSQL

  • Provision RDS PostgreSQL instance (db.t3.small minimum)
  • Create database and Calliope Calliope Hub user
  • Update JUPYTERHUB_DB_URL in Secrets Manager
  • Stop Calliope Hub task (will lose session state)
  • Start Calliope Hub with new DB URL (creates schema)
  • Verify users can log in and spawn servers
  • Monitor database connections and query performance

Downtime: ~5-10 minutes (Calliope Hub restart) Rollback: Difficult (data in new DB)


Adding Horizontal Scaling

  • Ensure PostgreSQL is configured (required)
  • Store all secrets in Secrets Manager (shared across hubs)
  • Create Application Load Balancer (ALB)
  • Configure target group with sticky sessions
  • Update ECS service to desired_count: 2
  • Test user sessions stick to same Calliope Hub
  • Configure orphan detection coordination
  • Test idle culling coordination
  • Gradually increase desired_count

Downtime: None (add hubs incrementally) Rollback: Medium difficulty (set desired_count: 1)


Performance Optimization Sequence

For optimal scaling, implement in this order:

Stage 1: Foundation (0-100 servers)

  1. โœ… Stable base configuration
  2. โœ… Monitoring and alerting
  3. โœ… Proper healthchecks (DONE in recent updates)

Stage 2: Database (100-200 servers)

  1. โฌœ Migrate to PostgreSQL
  2. โฌœ Add connection pooling (pgbouncer)
  3. โฌœ Index optimization

Stage 3: Vertical Scaling (200-500 servers)

  1. โฌœ Upgrade to 4 vCPU / 8GB
  2. โฌœ Increase poll intervals (10s โ†’ 15s)
  3. โฌœ Monitor API throttling

Stage 4: API Optimization (500-1,000 servers)

  1. โฌœ Implement batch API calls
  2. โฌœ Add response caching (10-15s TTL)
  3. โฌœ Upgrade to 8 vCPU / 16GB

Stage 5: Horizontal Scaling (1,000+ servers)

  1. โฌœ Set up ALB with sticky sessions
  2. โฌœ Deploy 2-3 Calliope Hub instances
  3. โฌœ Coordinate background services
  4. โฌœ Consider sharding strategy

Stage 6: Advanced (5,000+ servers)

  1. โฌœ Event-driven architecture (EventBridge)
  2. โฌœ Separate monitoring service
  3. โฌœ Redis caching cluster
  4. โฌœ Multi-region deployment

Quick Decision Matrix

“How should I scale?”

Current ServersTarget ServersActionTimelineCost Delta
50100โœ… No change yetN/A$0
100200Upgrade to 4 vCPU + PostgreSQL1 week+$80/mo
200500Upgrade to 8 vCPU1 day+$125/mo
5001,000Add batching + 2nd Calliope Hub2-3 weeks+$140/mo
1,0002,000Add 1-2 more hubs + sharding4-6 weeks+$240/mo
2,0005,000Redesign to event-driven8-12 weeks+$1,000/mo

Testing Your Limits

Load Testing Commands

1. Check current Calliope Hub resources:

aws ecs describe-tasks \
  --cluster development \
  --tasks <Calliope Hub-task-id> \
  --query 'tasks[0].{cpu:cpu,memory:memory}'

2. Monitor CPU during peak:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=dev-Calliope Calliope Hub-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average,Maximum

3. Check API throttling:

# Look for ThrottlingException in logs
aws logs filter-log-events \
  --log-group-name /aws/ecs/Calliope Hub-calliope-dev/Calliope Hub-Calliope Calliope Hub-service \
  --filter-pattern "ThrottlingException" \
  --start-time $(date -d '1 hour ago' +%s)000

4. Count active spawners:

# Check Calliope Hub logs for spawner count
aws logs filter-log-events \
  --log-group-name /aws/ecs/Calliope Hub-calliope-dev/Calliope Hub-Calliope Calliope Hub-service \
  --filter-pattern "running tasks" \
  --max-items 10

Emergency Scaling

If you suddenly need more capacity RIGHT NOW:

Option 1: Quick Vertical Scale (15 minutes)

# Update Calliope Hub task definition
# Change: cpu: "4096", memory: "8192"
# Deploy new revision
# Capacity: 100 โ†’ 300 servers

Option 2: Reduce Polling (5 minutes)

# In jupyterhub_config.py
c.Spawner.poll_interval = 30  # from 5-10s
# Restart Calliope Hub
# Capacity: 2x improvement
# Tradeoff: Slower dead server detection

Option 3: Disable Orphan Discovery (2 minutes)

# Temporarily disable in ecs.py poll() method
# Comment out orphan discovery block
# Restart Calliope Hub
# Capacity: 20% improvement
# Tradeoff: No automatic ghost cleanup

Best Practices

Calliope Hub Resource Sizing

Always maintain headroom:

  • Target 60-70% CPU utilization (not 90%)
  • Keep 30-40% memory free
  • Leave room for traffic spikes

Scale up when:

  • CPU sustained > 70% for 5+ minutes
  • Poll cycles taking > 15 seconds
  • API throttling errors appear

Don’t over-provision:

  • 2x capacity headroom is wasteful
  • Scale in steps (2โ†’4โ†’8 vCPU)
  • Monitor for 24 hours before adding more

Database Sizing

PostgreSQL Instances:

Concurrent ServersRDS InstanceConnectionsCost/Month
100-500db.t3.small (2 vCPU, 2GB)100$25
500-1,500db.t3.medium (2 vCPU, 4GB)200$50
1,500-5,000db.r5.large (2 vCPU, 16GB)500$150
5,000-10,000db.r5.2xlarge (8 vCPU, 64GB)1,500$600

Connection Pooling:

# Use pgbouncer to reduce connection overhead
c.Calliope Calliope Hub.db_url = "postgresql://pgbouncer:6432/Calliope Calliope Hub"

# Pool settings for 1000 servers:
pool_size = 20  # Max active connections
max_overflow = 40  # Additional connections allowed

Cost Optimization

Calliope Hub Infrastructure Cost Breakdown

Formula:

Monthly Cost = (vCPU ร— $14.64 ร— 730h) + (GB Memory ร— $1.61 ร— 730h)

Examples:

2 vCPU / 4GB  = (2 ร— $14.64 ร— 730) + (4 ร— $1.61 ร— 730) = $61/mo
4 vCPU / 8GB  = (4 ร— $14.64 ร— 730) + (8 ร— $1.61 ร— 730) = $122/mo
8 vCPU / 16GB = (8 ร— $14.64 ร— 730) + (16 ร— $1.61 ร— 730) = $244/mo

Cost per Server (Calliope Hub overhead only)

Calliope Hub ConfigServersCalliope Hub CostCost/Server
2 vCPU / 4GB100$61$0.61
4 vCPU / 8GB300$122$0.41
8 vCPU / 16GB600$244$0.41
3 ร— 4 vCPU hubs900$366$0.41

Key Insight: Calliope Hub overhead is <$0.50 per server - don’t under-provision!


Growth Scenarios

Scenario 1: Startup (Months 1-6)

Growth: 10 โ†’ 100 servers

Month 1-3: 10-50 servers

  • Keep current config (2 vCPU / 4GB)
  • Monitor usage patterns
  • Cost: $60/month

Month 4-6: 50-100 servers

  • Migrate to PostgreSQL
  • No Calliope Hub upgrade needed yet
  • Cost: $80/month (+$20)

Total Investment: $20/month additional


Scenario 2: Rapid Growth (Months 1-12)

Growth: 50 โ†’ 500 servers

Month 1-3: 50-100 servers

  • Migrate to PostgreSQL immediately
  • Cost: $80/month

Month 4-6: 100-200 servers

  • Upgrade Calliope Hub to 4 vCPU / 8GB
  • Cost: $140/month (+$60)

Month 7-9: 200-400 servers

  • Upgrade Calliope Hub to 8 vCPU / 16GB
  • Implement batch API calls
  • Cost: $265/month (+$125)

Month 10-12: 400-500 servers

  • Consider 2nd Calliope Hub if approaching limits
  • Cost: $265-380/month

Total Investment: $300/month over 12 months


Scenario 3: Enterprise Scale (12-24 months)

Growth: 100 โ†’ 2,000+ servers

Phase 1 (Months 1-6): 100-500 servers

  • Vertical scaling (4 โ†’ 8 vCPU)
  • PostgreSQL migration
  • Cost: $265/month

Phase 2 (Months 7-12): 500-1,000 servers

  • Implement batching + caching
  • Add 2nd Calliope Hub with ALB
  • Cost: $340/month

Phase 3 (Months 13-18): 1,000-1,500 servers

  • Add 3rd Calliope Hub
  • Implement sharding
  • Cost: $435/month

Phase 4 (Months 19-24): 1,500-2,000 servers

  • Add 4-5th Calliope Hub
  • Event-driven architecture
  • Cost: $650/month

Total Investment: $650/month infrastructure


Monitoring Your Scale

Key Metrics Dashboard

1. Capacity Metrics

Current Servers / Max Recommended = Capacity %

Target: < 70% capacity
Warning: > 80% capacity
Critical: > 90% capacity

2. Performance Metrics

Poll Cycle Duration: < 10s (good), 10-20s (ok), > 20s (bad)
API Error Rate: < 0.1% (good), 0.1-1% (warning), > 1% (critical)
Calliope Hub CPU: < 70% (good), 70-85% (warning), > 85% (critical)

3. User Experience Metrics

Spawn Time: < 60s (excellent), 60-120s (good), > 120s (poor)
Poll Lag: < 30s (good), 30-60s (ok), > 60s (bad)
UI Response: < 1s (excellent), 1-3s (ok), > 3s (poor)

Scale-Up Triggers

Automatic triggers for scaling:

Upgrade Calliope Hub (Vertical)

IF cpu_utilization > 70% for 10 minutes
OR poll_duration > 15s average
THEN upgrade to next vCPU tier

Add Database (PostgreSQL)

IF servers > 100
OR database_lock_errors > 0
THEN migrate to PostgreSQL

Add Calliope Hub (Horizontal)

IF servers > 400
AND cpu_utilization > 80% with 8+ vCPU
THEN add 2nd Calliope Hub with load balancer

Optimize API Calls

IF api_throttle_errors > 0
OR servers > 300
THEN implement batch API calls

Appendix: AWS Service Limits

Default Quotas (us-west-2)

ServiceQuotaLimitRequestable To
ECS - Fargate vCPUOn-Demand10,000100,000+
ECS - Tasks per serviceRunning tasks5,000No limit
VPC - ENIsElastic IPs5,00050,000
ECS API - describe_tasksTPS100500 (case-by-case)
ECS API - list_tasksTPS100500 (case-by-case)
ECS - Task definitionsActive revisions1,000,000N/A

Requesting Limit Increases

To request ECS API limit increase:

1. AWS Console โ†’ Service Quotas
2. Search "ECS API rate limits"
3. Request increase with justification
4. Typical approval: 1-3 business days

Recommended when:

  • Planning for > 500 concurrent servers
  • Seeing throttling errors
  • Before major launch/growth event

Summary Recommendations

By Scale

Server CountRecommended SetupEstimated Cost
< 1002 vCPU / 4GB single Calliope Hub$60/mo
100-3004 vCPU / 8GB + PostgreSQL$140/mo
300-6008 vCPU / 16GB + PostgreSQL + batching$265/mo
600-1,5002-3 hubs @ 4 vCPU + ALB$360/mo
1,500-3,0005 hubs @ 4 vCPU + sharding$650/mo
3,000+Event-driven architectureCustom

Critical Thresholds

  • 100 servers: Migrate to PostgreSQL
  • 200 servers: Upgrade to 4 vCPU
  • 400 servers: Implement API batching
  • 600 servers: Add horizontal scaling or hit API limits
  • 1,000 servers: Architectural optimization required

For your current target (“a few 100 containers”):

โœ… Recommended: Upgrade to 4 vCPU / 8GB + PostgreSQL

  • Handles 300-400 servers comfortably
  • Simple vertical scaling
  • Cost: ~$140/month
  • No architectural changes needed
  • Room for growth to 500+ servers

This gives you plenty of headroom without over-engineering!