Calliope Hub Horizontal Scaling Architecture
Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.
This document describes the architecture and implementation details for horizontally scaling Calliope Hub to support thousands of concurrent spawned servers.
Overview
Calliope Hub can be scaled horizontally by running multiple hub instances behind a load balancer. This distributes the polling and management overhead across multiple containers while maintaining a consistent user experience.
Key Requirement: All hubs must share the same PostgreSQL database and secrets.
Architecture Diagrams
Single Calliope Hub (Current)
┌───────────────────────────┐
│     Users (Browsers)      │
└─────────────┬─────────────┘
              │ HTTPS
┌─────────────▼─────────────┐
│       Calliope Hub        │
│        (ECS Task)         │
│  - 2 vCPU / 4GB           │
│  - SQLite DB              │
│  - Spawner polling        │
└─────────────┬─────────────┘
              │ ECS API
┌─────────────▼─────────────┐
│      Spawned Servers      │
│        (ECS Tasks)        │
│  - Lab / Chat / IDE       │
│  - 100-150 max            │
└───────────────────────────┘

Bottleneck: Calliope Hub CPU and SQLite database
Horizontal Scaling (Target)
┌───────────────────────────┐
│     Users (Browsers)      │
└─────────────┬─────────────┘
              │ HTTPS
┌─────────────▼─────────────┐
│   Application LB (ALB)    │
│   - Sticky Sessions       │
│   - Health Checks         │
│   - SSL Termination       │
└────┬───────────────┬──────┘
     │               │
┌────▼─────┐  ┌──────▼───┐  ┌──────────┐
│  Hub 1   │  │  Hub 2   │  │  Hub N   │
│ 4vCPU/8GB│  │ 4vCPU/8GB│  │ 4vCPU/8GB│
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
     └─────────────┼─────────────┘
                   │
        ┌──────────▼──────────┐
        │   RDS PostgreSQL    │
        │  (Shared Database)  │
        │  - User accounts    │
        │  - Spawner state    │
        │  - Server metadata  │
        └──────────┬──────────┘
                   │
        ┌──────────▼──────────┐
        │   Secrets Manager   │
        │  - DB credentials   │
        │  - Crypt key        │
        │  - OAuth secrets    │
        └─────────────────────┘

Capacity: Each hub handles 300-400 servers = 900-1,200+ total
Component Details
1. Application Load Balancer (ALB)
Purpose:
- Distribute user requests across multiple Calliope Hub instances
- Maintain session affinity (users stick to same Calliope Hub)
- Health check Calliope Hub instances
- SSL termination
Configuration:
LoadBalancer:
Type: application
Scheme: internet-facing
IpAddressType: ipv4
TargetGroup:
Name: calliope-hub-hubs
Protocol: HTTP
Port: 8000
VpcId: vpc-xxxxx
HealthCheck:
Path: /hub/health
Protocol: HTTP
Interval: 30
Timeout: 5
HealthyThreshold: 2
UnhealthyThreshold: 3
Matcher: 200
TargetGroupAttributes:
- Key: stickiness.enabled
Value: true
- Key: stickiness.type
Value: app_cookie
- Key: stickiness.app_cookie.cookie_name
Value: jupyterhub-session-id
- Key: stickiness.app_cookie.duration_seconds
Value: 86400 # 24 hours
- Key: deregistration_delay.timeout_seconds
Value: 60 # Drain connections on scale-down
- Key: load_balancing.algorithm.type
Value: least_outstanding_requests # Better than round_robin
Listener:
Port: 443
Protocol: HTTPS
Certificates:
- CertificateArn: arn:aws:acm:...
DefaultActions:
- Type: forward
TargetGroupArn: !Ref TargetGroup
Why Sticky Sessions:
- Calliope Hub maintains in-memory state (spawner objects, user sessions)
- Each user must talk to same Calliope Hub throughout session
- Database only stores persistent state, not runtime objects
Session Cookie: Calliope Hub creates the jupyterhub-session-id cookie on login
- Contains encrypted user info
- Valid for session duration
- Used by ALB for routing
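To confirm stickiness is actually in effect, inspect the cookies returned on a logged-in request. A quick check, assuming the domain used elsewhere in this guide (jupyter.yourdomain.com) and a cookie jar saved from a prior login:

```bash
# Both the application session cookie and the ALB's AWSALBAPP-* stickiness cookie
# should appear once app_cookie stickiness is enabled on the target group.
curl -s -D - -o /dev/null -b cookies.txt -c cookies.txt \
  https://jupyter.yourdomain.com/hub/home | grep -i '^set-cookie'
```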
2. RDS PostgreSQL Database
Purpose:
- Shared state across all hubs
- User accounts and authentication
- Spawner state (task_arn, server_name, etc.)
- OAuth tokens and sessions
Configuration:
DBInstance:
Engine: postgres
EngineVersion: "15.5"
DBInstanceClass: db.t3.medium # Start here, scale up as needed
AllocatedStorage: 100 # GB
StorageType: gp3
MultiAZ: true # High availability
DBName: calliopehub
MasterUsername: jupyterhub_admin
MasterUserPassword: !Ref DBPasswordSecret
VPCSecurityGroups:
- !Ref HubSecurityGroup
BackupRetentionPeriod: 7
PreferredBackupWindow: "03:00-04:00"
PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
EnableCloudwatchLogsExports:
- postgresql
PerformanceInsights:
Enabled: true
RetentionPeriod: 7
Connection Pooling (pgbouncer):
# Deploy pgbouncer as sidecar or separate service
Pgbouncer:
pool_mode: transaction # Best for Calliope Hub's usage pattern
max_client_conn: 500 # Total connections from all hubs
default_pool_size: 20 # Connections to PostgreSQL
reserve_pool_size: 10 # Additional connections for peak load
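If pgbouncer fronts the database, each hub's DB URL should point at the pooler rather than at RDS directly. A minimal sketch, assuming pgbouncer listens on its default port 6432 at a host named pgbouncer.internal (both placeholders):

```python
# jupyterhub_config.py (sketch) -- route hub DB traffic through pgbouncer.
# Host name and credentials are assumptions; 6432 is pgbouncer's default port.
c.JupyterHub.db_url = "postgresql://jupyterhub_admin:password@pgbouncer.internal:6432/calliopehub"
```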
Schema Requirements:
- Calliope Hub tables (auto-created)
- User state tables (created by spawners)
- Proper indexes on taskArn, username, server_name
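A sketch of the index DDL, assuming task_arn and server_name are materialized columns rather than fields inside the serialized spawner state (column names are illustrative; verify against the deployed schema first):

```sql
-- Standard hub tables; these column names exist in the default schema.
CREATE INDEX IF NOT EXISTS idx_spawners_user_id ON spawners (user_id);
CREATE INDEX IF NOT EXISTS idx_users_name       ON users (name);
-- Only if the custom state columns are real columns in this deployment:
-- CREATE INDEX IF NOT EXISTS idx_spawners_task_arn    ON spawners (task_arn);
-- CREATE INDEX IF NOT EXISTS idx_spawners_server_name ON spawners (server_name);
```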
3. Secrets Manager
Shared Secrets (CRITICAL):
All hubs MUST use identical secrets:
Secret: calliope-hub-secrets
JUPYTERHUB_CRYPT_KEY: <32-byte hex> # MUST be identical
JUPYTERHUB_COOKIE_SECRET: <32-byte hex> # MUST be identical
JUPYTERHUB_DB_URL: postgresql://... # Shared database
OAUTH_CLIENT_ID: <oauth-id>
OAUTH_CLIENT_SECRET: <oauth-secret>
OAUTH_CALLBACK_URL: https://your-domain/hub/oauth_callback
Why Critical:
- Crypt key: Decrypt cookies from any Calliope Hub
- Cookie secret: Validate sessions across hubs
- OAuth: Users can authenticate through any Calliope Hub
Task Definition Reference:
containerDefinitions:
- name: calliope-hub
secrets:
- name: JUPYTERHUB_CRYPT_KEY
valueFrom: arn:aws:secretsmanager:...:calliope-hub-secrets:JUPYTERHUB_CRYPT_KEY::
- name: JUPYTERHUB_COOKIE_SECRET
valueFrom: arn:aws:secretsmanager:...:calliope-hub-secrets:JUPYTERHUB_COOKIE_SECRET::
- name: JUPYTERHUB_DB_URL
valueFrom: arn:aws:secretsmanager:...:calliope-hub-secrets:JUPYTERHUB_DB_URL::
4. ECS Service Configuration
Single Service, Multiple Tasks:
Service:
ServiceName: calliope-hub-service
Cluster: development
TaskDefinition: calliope-hub-task:latest
DesiredCount: 3 # Number of Calliope Hub instances
LaunchType: FARGATE
PlatformVersion: LATEST
NetworkConfiguration:
AwsvpcConfiguration:
Subnets:
- subnet-private-1
- subnet-private-2
- subnet-private-3
SecurityGroups:
- sg-hub
AssignPublicIp: DISABLED
LoadBalancers:
- TargetGroupArn: !Ref HubTargetGroup
ContainerName: calliope-hub
ContainerPort: 8000
DeploymentConfiguration:
MinimumHealthyPercent: 100 # Keep all hubs running during deploy
MaximumPercent: 200 # Can double tasks during deploy
DeploymentCircuitBreaker:
Enable: true
Rollback: true
ServiceRegistries: [] # Optional: Service Discovery
Auto Scaling:
ScalableTarget:
ServiceNamespace: ecs
ResourceId: service/development/calliope-hub-service
ScalableDimension: ecs:service:DesiredCount
MinCapacity: 2
MaxCapacity: 10
ScalingPolicy:
PolicyType: TargetTrackingScaling
TargetTrackingScalingPolicyConfiguration:
TargetValue: 60.0 # Target 60% CPU
PredefinedMetricSpecification:
PredefinedMetricType: ECSServiceAverageCPUUtilization
ScaleInCooldown: 300 # 5 minutes
ScaleOutCooldown: 60 # 1 minute
Implementation Strategies
Strategy A: Simple Horizontal (Recommended for 500-2,000 servers)
Architecture: N identical hubs behind ALB
Pros:
- Simple to implement
- Easy to understand
- Automatic failover
- Linear scaling
Cons:
- All hubs do same work (some duplication)
- Orphan discovery may race
- Idle culling needs coordination
Implementation:
- Deploy ALB with sticky sessions
- Update ECS service to desiredCount: 2
- Test session stickiness
- Gradually increase count
Best for: Teams wanting simple horizontal scale without sharding complexity
Strategy B: Sharding by User Group
Architecture: Dedicated Calliope Hub per user group/team
ALB Path-based routing:
/user/team-alpha/* → Hub 1 (Team Alpha users only)
/user/team-beta/*  → Hub 2 (Team Beta users only)
/user/team-gamma/* → Hub 3 (Team Gamma users only)
Pros:
- No sticky sessions needed (deterministic routing)
- Perfect isolation between teams
- Independent scaling per team
- No orphan detection conflicts
Cons:
- More complex routing
- Need to assign users to groups
- Uneven load distribution possible
Implementation:
- Assign users to groups in Calliope Hub
- Configure ALB path-based routing (see the sketch after this list)
- Update jupyterhub_config.py per hub to filter users
- Deploy separate ECS services per hub
Best for: Organizations with distinct teams/departments
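A sketch of one path-based routing rule, assuming per-team target groups already exist (the listener and target group ARNs are placeholders, and rule priority must be unique per listener):

```bash
# Route Team Alpha's user servers to the hub-1 target group.
aws elbv2 create-rule \
  --listener-arn <listener-arn> \
  --priority 10 \
  --conditions Field=path-pattern,Values='/user/team-alpha/*' \
  --actions Type=forward,TargetGroupArn=<hub-1-target-group-arn>
```

Repeat with a distinct priority and path pattern for each team.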
Strategy C: Sharding by Service Type
Architecture: Dedicated Calliope Hub per service (lab/chat/ide)
ALB Path-based routing:
*/lab/*  → Hub 1 (Lab servers only)
*/chat/* → Hub 2 (Chat servers only)
*/ide/*  → Hub 3 (IDE servers only)
Pros:
- Optimize each Calliope Hub for specific service
- Different poll intervals per service
- Independent resource allocation
- Simpler orphan detection (one service type per Calliope Hub)
Cons:
- Users fragmented across multiple hubs
- More complex routing
- Shared authentication state still needed
Implementation:
- Create 3 hub configurations (one per service)
- Configure ALB path-based routing
- Update service_loader to filter available services (see the sketch after this list)
- Deploy separate ECS services
Best for: Heavy usage of specific services where you want to optimize each independently
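A minimal sketch of per-hub service filtering, assuming service_loader exposes the catalog as a list of dicts with a type field (the ALLOWED_SERVICES variable and filter_services helper are illustrative, not the actual API):

```python
# jupyterhub_config.py (sketch) -- each hub only offers the service types it owns.
import os

# Hypothetical environment variable set per hub, e.g. "lab" or "chat,ide".
ALLOWED_SERVICES = set(os.getenv("ALLOWED_SERVICES", "lab").split(","))

def filter_services(all_services):
    """Keep only the catalog entries this hub is allowed to spawn."""
    return [svc for svc in all_services if svc.get("type") in ALLOWED_SERVICES]
```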
Coordination Between Hubs
When running multiple hubs, certain operations need coordination:
1. Orphan Detection
Problem: All hubs discover same orphaned tasks
Solutions:
Option A: Primary Calliope Hub Only
# jupyterhub_config.py
import os

HUB_INSTANCE_ID = os.getenv("HUB_INSTANCE_ID", "hub-1")
IS_PRIMARY = HUB_INSTANCE_ID == "hub-1"
if IS_PRIMARY:
    # Only the primary hub does orphan detection
    c.Spawner.poll_interval = 10
else:
    # Secondary hubs only poll their own tasks;
    # orphan discovery stays disabled because they never look for orphans
    pass
Option B: Distributed Lock
# Use Redis or a database lock
import redis

lock = redis.Redis().lock("orphan-discovery-lock", timeout=300)
if lock.acquire(blocking=False):
    try:
        # This hub won the lock, so it performs orphan discovery
        discover_and_adopt_orphans()
    finally:
        lock.release()
Option C: Task Assignment (Best)
# Assign orphan-detection responsibility by hashing the task ARN
# (use a stable hash; Python's built-in hash() is randomized per process)
import hashlib
import os

task_hash = int(hashlib.sha256(task_arn.encode()).hexdigest(), 16) % HUB_COUNT
my_index = int(os.getenv("HUB_INDEX", "0"))
if task_hash == my_index:
    # This hub is responsible for this task
    check_and_adopt_orphan(task)
2. Idle Server Culling
Problem: Multiple hubs try to cull same idle servers
Solutions:
Option A: Primary Calliope Hub Only
# Only the primary hub runs the idle culler
if IS_PRIMARY:
    from jupyterhub_idle_culler import cull_idle_servers
    c.JupyterHub.services.append({
        'name': 'idle-culler',
        'command': [...],
    })
Option B: Database Lock
# The culler acquires a database lock before culling,
# so only one hub culls at a time
async def cull_with_lock():
    async with db_lock("idle-culler"):
        await cull_idle_servers()
Recommended: Option A (primary hub only) for simplicity
3. Admin Operations
Problem: Admin UI only shows servers managed by current Calliope Hub
Solution: Query database directly for all servers
# In hub config or a custom handler
async def get_all_servers():
    """Get all servers across all hubs from the database."""
    from jupyterhub.orm import Spawner as ORMSpawner
    with db.session() as session:
        all_spawners = session.query(ORMSpawner).filter(
            ORMSpawner.server_id.isnot(None)
        ).all()
        return [
            {
                'user': s.user.name,
                'name': s.name,
                'state': s.state,
                'started': s.started,
            }
            for s in all_spawners
        ]
Implementation Guide
Phase 1: Prerequisites (Week 1)
1.1. Migrate to PostgreSQL
# Create RDS instance
aws rds create-db-instance \
--db-instance-identifier calliope-hub-db \
--db-instance-class db.t3.medium \
--engine postgres \
--engine-version 15.5 \
--master-username jupyterhub_admin \
--master-user-password <secure-password> \
--allocated-storage 100 \
--vpc-security-group-ids sg-xxxx \
--db-subnet-group-name private-subnets \
--backup-retention-period 7 \
--multi-az
1.2. Update Secrets Manager
# Store database URL
aws secretsmanager create-secret \
--name calliope-hub-secrets \
--secret-string '{
"JUPYTERHUB_DB_URL": "postgresql://jupyterhub_admin:password@endpoint:5432/calliopehub",
"JUPYTERHUB_CRYPT_KEY": "'$(openssl rand -hex 32)'",
"JUPYTERHUB_COOKIE_SECRET": "'$(openssl rand -hex 32)'"
}'
1.3. Test Single Hub with PostgreSQL
# In jupyterhub_config.py
import os
c.JupyterHub.db_url = os.getenv("JUPYTERHUB_DB_URL")
Deploy and verify:
- Users can log in
- Servers spawn correctly
- State persists across Calliope Hub restarts
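A quick way to confirm the hub is actually writing to PostgreSQL rather than SQLite (endpoint and credentials are placeholders matching the secret above):

```bash
# The hub tables (users, spawners, oauth_clients, ...) should exist,
# and the users table should accumulate rows as people log in.
psql "postgresql://jupyterhub_admin:password@endpoint:5432/calliopehub" \
  -c '\dt' \
  -c 'SELECT COUNT(*) AS users FROM users;'
```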
Phase 2: Load Balancer Setup (Week 2)
2.1. Create ALB
# Create load balancer
aws elbv2 create-load-balancer \
--name calliope-hub-alb \
--subnets subnet-public-1 subnet-public-2 \
--security-groups sg-alb \
--scheme internet-facing \
--type application \
--ip-address-type ipv4
# Create target group
aws elbv2 create-target-group \
--name calliope-hub-hubs \
--protocol HTTP \
--port 8000 \
--vpc-id vpc-xxxx \
--health-check-path /hub/health \
--health-check-interval-seconds 30 \
--health-check-timeout-seconds 5 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3
# Enable sticky sessions
aws elbv2 modify-target-group-attributes \
--target-group-arn <arn> \
--attributes \
Key=stickiness.enabled,Value=true \
Key=stickiness.type,Value=app_cookie \
Key=stickiness.app_cookie.cookie_name,Value=jupyterhub-session-id \
Key=stickiness.app_cookie.duration_seconds,Value=86400
2.2. Update DNS
# Point your domain to ALB
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "jupyter.yourdomain.com",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "<alb-zone-id>",
"DNSName": "<alb-dns-name>",
"EvaluateTargetHealth": true
}
}
}]
}'
Phase 3: Deploy Multiple Hubs (Week 2-3)
3.1. Update Calliope Hub Task Definition
# Add Calliope Hub instance identification
environment:
- name: HUB_INSTANCE_ID
value: hub-${TASK_ID} # Unique per task
- name: HUB_COUNT
value: "3" # Total number of hubs
- name: HUB_INDEX
value: "0" # Set dynamically or use launch config3.2. Update ECS Service
3.2. Update ECS Service
# Increase desired count
Service:
DesiredCount: 3 # Start with 3 hubs
# Register with load balancer
LoadBalancers:
- TargetGroupArn: <target-group-arn>
ContainerName: calliope-hub
ContainerPort: 8000
3.3. Deploy and Verify
# Update service
aws ecs update-service \
--cluster development \
--service calliope-hub-service \
--desired-count 3 \
--load-balancers targetGroupArn=<arn>,containerName=calliope-hub,containerPort=8000
# Verify all tasks healthy
aws ecs describe-services \
--cluster development \
--services calliope-hub-service \
--query 'services[0].{running:runningCount,desired:desiredCount}'
# Target health is reported by the ALB, not by ECS:
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
Phase 4: Configure Coordination (Week 3)
4.1. Designate Primary Calliope Hub
# jupyterhub_config.py
import os
import logging

logger = logging.getLogger(__name__)

HUB_INSTANCE_ID = os.getenv("HUB_INSTANCE_ID", "hub-1")
IS_PRIMARY = os.getenv("PRIMARY_HUB", "false").lower() == "true"
logger.info(f"Hub instance: {HUB_INSTANCE_ID}, primary: {IS_PRIMARY}")

# Only the primary hub runs background services
if IS_PRIMARY:
    logger.info("This is the primary hub - enabling background services")
    # Idle culler (only on primary)
    from jupyterhub_idle_culler import cull_idle_servers
    c.JupyterHub.services.append({
        'name': 'idle-culler',
        'command': [...],
    })
else:
    logger.info("This is a secondary hub - background services disabled")
4.2. Environment Variables per Task
# Task Definition overrides for each instance
TaskDefinition:
ContainerDefinitions:
- Name: calliope-hub
Environment:
# Hub 1 (primary)
- Name: PRIMARY_HUB
Value: "true"
- Name: HUB_INDEX
Value: "0"
# Hubs 2-N (secondary)
# PRIMARY_HUB: "false" (or omit)
# HUB_INDEX: "1", "2", etc.
Alternatively, set these dynamically at launch (see the sketch below).
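If per-task environment overrides are awkward to maintain, a hedged alternative is to elect the primary at startup: each hub lists the service's running tasks and treats the oldest as primary. A sketch using boto3 (the task role needs ecs:ListTasks and ecs:DescribeTasks; the function name is illustrative, and my_task_arn can come from the metadata sketch in 3.1):

```python
# Sketch: treat the oldest RUNNING task in the service as the primary hub.
import boto3

def elect_primary(my_task_arn, cluster="development", service="calliope-hub-service"):
    """Return (is_primary, hub_index) for the task whose ARN is my_task_arn."""
    ecs = boto3.client("ecs")
    arns = ecs.list_tasks(cluster=cluster, serviceName=service,
                          desiredStatus="RUNNING")["taskArns"]
    tasks = ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
    ordered = sorted(tasks, key=lambda t: t["startedAt"])  # oldest task first
    index = next(i for i, t in enumerate(ordered) if t["taskArn"] == my_task_arn)
    return index == 0, index
```

Note that hubs may briefly disagree during rolling deployments, so background services should still tolerate occasional overlap.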
Phase 5: Testing (Week 3-4)
5.1. Session Affinity Test
# Log in user, note which Calliope Hub handled request
# From hub logs: "User <name> logged in on hub-<id>"
# Make 10 requests, verify all go to the same hub
for i in {1..10}; do
curl -b cookies.txt https://jupyter.yourdomain.com/hub/api/user
sleep 1
done
# Should see a consistent hub instance ID in the ALB logs
5.2. Failover Test
# Stop one Calliope Hub task
aws ecs update-service \
--cluster development \
--service calliope-hub-service \
--desired-count 2
# Users on the stopped hub should:
# - Get logged out (session lost)
# - Be able to log in again (routed to a healthy hub)
# - Still have their servers accessible (state in DB)
5.3. Load Distribution Test
# Log in 30 users (10 per Calliope Hub)
# Check Calliope Hub logs to see distribution
aws logs filter-log-events \
--log-group-name /aws/ecs/calliope-hub \
--filter-pattern "User.*logged in" \
--start-time $(date -d '10 minutes ago' +%s)000
5.4. Database Consistency Test
-- Connect to PostgreSQL
-- Check spawner state
SELECT u.name AS user_name, s.name AS server_name, s.server_id, s.state
FROM spawners s
JOIN users u ON s.user_id = u.id
WHERE s.server_id IS NOT NULL;
-- Should show servers from all hubs
Operational Procedures
Adding a Calliope Hub
# 1. Increase desired count
aws ecs update-service \
--cluster development \
--service calliope-hub-service \
--desired-count 4 # from 3
# 2. Verify new Calliope Hub is healthy
aws ecs describe-services \
--cluster development \
--services calliope-hub-service
# 3. Monitor for 1 hour
# - Check CPU distribution
# - Verify load balancing
# - Check for errors
Rollback: Decrease desired count
Removing a Calliope Hub
# 1. Gracefully drain connections
# Set deregistration delay to 300s (5 minutes)
# 2. Decrease desired count
aws ecs update-service \
--cluster development \
--service calliope-hub-service \
--desired-count 2 # from 3
# 3. Wait for task to drain
# Active users will be logged out when task stops
# They can log back in and be routed to a healthy hub
Calliope Hub Deployment / Update
# 1. Register new task definition revision
aws ecs register-task-definition --cli-input-json file://task-def.json
# 2. Update service with new task definition
aws ecs update-service \
--cluster development \
--service calliope-hub-service \
--task-definition calliope-hub-task:NEW_REVISION
# 3. ECS performs rolling update:
# - Starts new task
# - Waits for health check
# - Drains old task
# - Stops old task
# - Repeat for each Calliope Hub
# 4. Monitor deployment
aws ecs describe-services \
--cluster development \
--services calliope-hub-service \
--query 'services[0].deployments'
Zero downtime: MinimumHealthyPercent: 100, MaximumPercent: 200
Troubleshooting Horizontal Scaling
Problem: Users getting logged out randomly
Cause: Session stickiness not working
Debug:
# Check ALB target group attributes
aws elbv2 describe-target-group-attributes \
--target-group-arn <arn>
# Verify stickiness.enabled = true
# Verify cookie_name = jupyterhub-session-id
Fix:
aws elbv2 modify-target-group-attributes \
--target-group-arn <arn> \
--attributes Key=stickiness.enabled,Value=true
Problem: Some servers not visible in UI
Cause: Calliope Hub showing only its own spawners, not all in DB
Debug:
-- Count servers in database
SELECT COUNT(*) FROM spawners WHERE server_id IS NOT NULL;
-- Count servers shown in UI
-- Compare numbers
Fix: Implement cross-hub server listing (see Admin Operations above)
Problem: Multiple hubs adopting same orphan
Cause: Orphan detection racing
Debug:
# Check logs for orphan adoption
aws logs filter-log-events \
--log-group-name /aws/ecs/calliope-hub \
--filter-pattern "Recorded task adoption"
# Look for duplicate adoptions of the same task_arn
Fix: Implement a distributed lock or primary-only orphan detection
Problem: Database connection pool exhausted
Cause: Too many hubs, each with multiple connections
Debug:
-- Check active connections
SELECT COUNT(*) FROM pg_stat_activity WHERE datname = 'calliopehub';
-- Check connection limit
SHOW max_connections;
Fix:
# Reduce pool size per Calliope Hub
c.JupyterHub.db_pool_size = 5 # from 10
c.JupyterHub.db_max_overflow = 10 # from 20
# OR use pgbouncer for connection pooling
Security Considerations
Network Isolation
┌──────────────────────────────────────────┐
│              Public Subnets              │
│  ┌────────────────────────────────────┐  │
│  │        ALB (internet-facing)       │  │
│  └────────────────┬───────────────────┘  │
└───────────────────┼──────────────────────┘
                    │
┌───────────────────┼──────────────────────┐
│             Private Subnets              │
│     ┌──────────┐     ┌──────────┐        │
│     │  Hub 1   │     │  Hub 2   │        │
│     └────┬─────┘     └────┬─────┘        │
│          └───────┬────────┘              │
│                  │                       │
│          ┌───────▼──────┐                │
│          │     RDS      │                │
│          │ (PostgreSQL) │                │
│          └──────────────┘                │
└──────────────────────────────────────────┘
Security Groups:
ALB-SG:
Ingress:
- Port: 443
Source: 0.0.0.0/0 # Internet
Egress:
- Port: 8000
Destination: Hub-SG
Hub-SG:
Ingress:
- Port: 8000
Source: ALB-SG
Egress:
- Port: 5432
Destination: DB-SG
- Port: 443
Destination: 0.0.0.0/0 # AWS APIs
DB-SG:
Ingress:
- Port: 5432
Source: Hub-SG
Egress: None
Secrets Rotation
Challenge: Must rotate secrets across all hubs simultaneously
Procedure:
- Generate new secrets
- Update Secrets Manager
- Trigger Calliope Hub task refresh (forces secret fetch)
- All hubs pick up new secrets
- No downtime (old sessions work until expiry)
# Update secret
aws secretsmanager update-secret \
--secret-id calliope-hub-secrets \
--secret-string '{...new secrets...}'
# Force Calliope Hub task refresh (ECS will restart with new secrets)
aws ecs update-service \
--cluster development \
--service calliope-hub-service \
--force-new-deployment
Monitoring Multiple Hubs
CloudWatch Dashboard
Metrics to monitor:
Calliope Hub CPU (per instance):
Namespace: AWS/ECS
Metric: CPUUtilization
Dimensions:
- ServiceName: calliope-hub-service
- TaskId: <each-task-id>
Calliope Hub Memory (per instance):
Namespace: AWS/ECS
Metric: MemoryUtilization
ALB Request Count:
Namespace: AWS/ApplicationELB
Metric: RequestCount
Dimensions:
- LoadBalancer: app/calliope-hub-alb/...
ALB Target Health:
Namespace: AWS/ApplicationELB
Metric: HealthyHostCount, UnhealthyHostCount
Database Connections:
Namespace: AWS/RDS
Metric: DatabaseConnections
Dimensions:
- DBInstanceIdentifier: calliope-hub-db
Alerts
HubCPUHigh:
Condition: CPUUtilization > 80% for 5 minutes
Action: Scale up Calliope Hub resources or add another Calliope Hub
HubInstanceUnhealthy:
Condition: UnhealthyHostCount > 0 for 2 minutes
Action: Check Calliope Hub logs, may need to restart
DatabaseConnectionsHigh:
Condition: DatabaseConnections > 80% of max_connections
Action: Add connection pooling or scale database
APIThrottling:
Condition: ThrottlingException in logs
Action: Implement batching or request a limit increase
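A sketch of the HubCPUHigh alert expressed as a CloudWatch alarm (the SNS topic ARN is a placeholder; period and threshold mirror the table above):

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name calliope-hub-cpu-high \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=development Name=ServiceName,Value=calliope-hub-service \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions <sns-topic-arn>
```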
Cost Optimization
Right-Sizing
Don’t over-provision:
- Start with 2 hubs, add more as needed
- Monitor CPU for 1 week before scaling
- Target 60-70% CPU utilization
Calliope Hub sizing examples:
| Scenario | Recommended | Over-Provisioned | Wasted $/mo |
|---|---|---|---|
| 200 servers | 1 × 4 vCPU | 2 × 4 vCPU | $122 |
| 500 servers | 2 × 4 vCPU | 1 × 16 vCPU | $246 |
| 1,000 servers | 3 × 4 vCPU | 5 × 4 vCPU | $244 |
Auto-Scaling Recommendations
For consistent load:
- Don’t use auto-scaling (wastes time on scale-in/out)
- Provision for peak + 20% headroom
- Manually adjust based on growth
For variable load (e.g., classroom):
AutoScaling:
MinCapacity: 2 # Baseline
MaxCapacity: 5 # Peak
TargetCPU: 60%
ScaleInCooldown: 600 # 10 min (avoid thrashing)
ScaleOutCooldown: 60 # 1 min (respond quickly)
Advanced Patterns
Multi-Region Deployment
For global deployments or disaster recovery:
┌──────────────────────────────────────────┐
│              Route 53 (DNS)              │
│      Geolocation / Latency Routing       │
└───────────┬─────────────────┬────────────┘
            │                 │
    ┌───────▼──────┐  ┌───────▼──────┐
    │  us-west-2   │  │  eu-west-1   │
    │              │  │              │
    │   3 Hubs     │  │   3 Hubs     │
    │  PostgreSQL  │  │  PostgreSQL  │
    └───────┬──────┘  └───────┬──────┘
            │                 │
            └────────┬────────┘
                     │
           ┌─────────▼────────┐
           │  Aurora Global   │
           │  (Multi-Region)  │
           └──────────────────┘
Requirements:
- Aurora Global Database (primary + read replicas)
- Regional Calliope Hub deployments
- DNS-based routing
Complexity: Very high
Cost: 2x infrastructure + data transfer
Service-Specific Hubs
Deploy separate hubs optimized for each service type:
# lab-hub-config.py
c.Spawner.allowed_services = ["lab"]
c.Spawner.poll_interval = 30 # Lab can tolerate slower polling
# chat-hub-config.py
c.Spawner.allowed_services = ["chat"]
c.Spawner.poll_interval = 10 # Chat needs faster detection
# ide-hub-config.py
c.Spawner.allowed_services = ["waiide"]
c.Spawner.poll_interval = 15 # IDE moderate polling
Benefits:
- Optimize polling per service type
- Independent scaling
- Failure isolation
Drawback: More complex routing and management
Migration Stories
Example 1: Startup to Scale-Up
Company: Data science team. Timeline: 6 months. Growth: 20 → 300 servers
Month 0: 20 servers
├─ Config: 2 vCPU / 4GB / SQLite
└─ Cost: $60/month
Month 2: 75 servers (hit SQLite limits)
├─ Action: Migrate to PostgreSQL
└─ Cost: $80/month
Month 4: 150 servers (CPU at 80%)
├─ Action: Upgrade to 4 vCPU / 8GB
└─ Cost: $140/month
Month 6: 300 servers (stable)
├─ Final: 4 vCPU / 8GB / PostgreSQL
└─ Cost: $140/month
Outcome: Smooth scaling, no architectural changes needed
Example 2: Enterprise Deployment
Company: University with 5,000 students. Timeline: 12 months. Growth: 100 → 2,000 servers
Month 0-3: 100-300 servers
├─ Config: 8 vCPU / 16GB / PostgreSQL
└─ Cost: $265/month
Month 4-6: 300-600 servers (optimize)
├─ Action: Implement API batching + caching
└─ Cost: $265/month (same)
Month 7-9: 600-1,200 servers (scale out)
├─ Action: Add 2nd hub + ALB
└─ Cost: $340/month (+$75)
Month 10-12: 1,200-2,000 servers (finalize)
├─ Action: Add 3rd hub, implement sharding
└─ Cost: $435/month (+$95)
Outcome: Successfully scaled 20x with 3 hubs
Summary
Key Takeaways
- Vertical scaling first - Simple and gets you to 300-500 servers
- PostgreSQL at 100+ servers - Essential for reliability
- Horizontal scaling at 500+ - Multiple hubs behind ALB
- API optimization at 300+ - Batching and caching essential
- Monitor continuously - CPU, poll duration, API errors
Next Steps
For your target of “a few hundred containers”:
Recommended Path:
- ✅ Week 1: Upgrade to 4 vCPU / 8GB ($60 → $140/month)
- ✅ Week 2: Migrate to PostgreSQL if not already
- ⏸️ Week 3+: Monitor performance, scale as needed
This gives you solid capacity for 300-400 servers with minimal complexity!
See Also
- SPAWN_LIMITS.md - Detailed capacity and limits analysis
- OPTIMIZATION_GUIDE.md - Performance optimization techniques
- HEALTHCHECK_TUNING.md - Recent healthcheck improvements