Calliope Hub Horizontal Scaling Architecture

Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.

This document describes the architecture and implementation details for horizontally scaling Calliope Hub to support thousands of concurrent spawned servers.


Overview

Calliope Hub can be scaled horizontally by running multiple hub instances behind a load balancer. This distributes the polling and management overhead across multiple containers while maintaining a consistent user experience.

Key Requirement: All hubs must share the same PostgreSQL database and secrets.


Architecture Diagrams

Single Calliope Hub (Current)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Users (Browsers)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚
             β”‚ HTTPS
             β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Calliope Hub          β”‚
β”‚   (ECS Task)            β”‚
β”‚   - 2 vCPU / 4GB        β”‚
β”‚   - SQLite DB           β”‚
β”‚   - Spawner polling     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚
             β”‚ ECS API
             β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Spawned Servers       β”‚
β”‚   (ECS Tasks)           β”‚
β”‚   - Lab / Chat / IDE    β”‚
β”‚   - 100-150 max         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Bottleneck: Calliope Hub CPU and SQLite database


Horizontal Scaling (Target)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Users (Browsers)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚
             β”‚ HTTPS
             β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Application LB (ALB)   β”‚
β”‚  - Sticky Sessions      β”‚
β”‚  - Health Checks        β”‚
β”‚  - SSL Termination      β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
     β”‚              β”‚
     β”‚              β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Hub 1   β”‚  β”‚  Hub 2   β”‚  β”‚  Hub N   β”‚
β”‚ 4vCPU/8GBβ”‚  β”‚ 4vCPU/8GBβ”‚  β”‚ 4vCPU/8GBβ”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
     β”‚              β”‚              β”‚
     β”‚              β”‚              β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚   RDS PostgreSQL    β”‚
         β”‚   (Shared Database) β”‚
         β”‚   - User accounts   β”‚
         β”‚   - Spawner state   β”‚
         β”‚   - Server metadata β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚   Secrets Manager   β”‚
         β”‚   - DB credentials  β”‚
         β”‚   - Crypt key       β”‚
         β”‚   - OAuth secrets   β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Capacity: Each Calliope Hub handles 300-400 servers, so three hubs give 900-1,200+ total


Component Details

1. Application Load Balancer (ALB)

Purpose:

  • Distribute user requests across multiple Calliope Hub instances
  • Maintain session affinity (users stick to same Calliope Hub)
  • Health check Calliope Hub instances
  • SSL termination

Configuration:

LoadBalancer:
  Type: application
  Scheme: internet-facing
  IpAddressType: ipv4

TargetGroup:
  Name: calliope-hubs
  Protocol: HTTP
  Port: 8000
  VpcId: vpc-xxxxx

  HealthCheck:
    Path: /hub/health
    Protocol: HTTP
    Interval: 30
    Timeout: 5
    HealthyThreshold: 2
    UnhealthyThreshold: 3
    Matcher: 200

  TargetGroupAttributes:
    - Key: stickiness.enabled
      Value: true
    - Key: stickiness.type
      Value: app_cookie
    - Key: stickiness.app_cookie.cookie_name
      Value: jupyterhub-session-id
    - Key: stickiness.app_cookie.duration_seconds
      Value: 86400  # 24 hours
    - Key: deregistration_delay.timeout_seconds
      Value: 60  # Drain connections on scale-down
    - Key: load_balancing.algorithm.type
      Value: least_outstanding_requests  # Better than round_robin

Listener:
  Port: 443
  Protocol: HTTPS
  Certificates:
    - CertificateArn: arn:aws:acm:...
  DefaultActions:
    - Type: forward
      TargetGroupArn: !Ref TargetGroup

Why Sticky Sessions:

  • Calliope Hub maintains in-memory state (spawner objects, user sessions)
  • Each user must talk to same Calliope Hub throughout session
  • Database only stores persistent state, not runtime objects

Session Cookie: Calliope Hub creates the jupyterhub-session-id cookie on login

  • Contains encrypted user info
  • Valid for session duration
  • Used by ALB for routing

2. RDS PostgreSQL Database

Purpose:

  • Shared state across all hubs
  • User accounts and authentication
  • Spawner state (task_arn, server_name, etc.)
  • OAuth tokens and sessions

Configuration:

DBInstance:
  Engine: postgres
  EngineVersion: "15.5"
  DBInstanceClass: db.t3.medium  # Start here, scale up as needed
  AllocatedStorage: 100  # GB
  StorageType: gp3
  MultiAZ: true  # High availability

  DBName: calliope_hub
  MasterUsername: jupyterhub_admin
  MasterUserPassword: !Ref DBPasswordSecret

  VPCSecurityGroups:
    - !Ref HubSecurityGroup

  BackupRetentionPeriod: 7
  PreferredBackupWindow: "03:00-04:00"
  PreferredMaintenanceWindow: "sun:04:00-sun:05:00"

  EnableCloudwatchLogsExports:
    - postgresql

  PerformanceInsights:
    Enabled: true
    RetentionPeriod: 7

Connection Pooling (pgbouncer):

# Deploy pgbouncer as sidecar or separate service
Pgbouncer:
  pool_mode: transaction  # Best for Calliope Hub's usage pattern
  max_client_conn: 500    # Total connections from all hubs
  default_pool_size: 20   # Connections to PostgreSQL
  reserve_pool_size: 10   # Additional connections for peak load

Schema Requirements:

  • Calliope Calliope Hub tables (auto-created)
  • User state tables (created by spawners)
  • Proper indexes on taskArn, username, server_name

3. Secrets Manager

Shared Secrets (CRITICAL):

All hubs MUST use identical secrets:

Secret: calliope-hub-secrets
  JUPYTERHUB_CRYPT_KEY: <32-byte hex>     # MUST be identical
  JUPYTERHUB_COOKIE_SECRET: <32-byte hex> # MUST be identical
  JUPYTERHUB_DB_URL: postgresql://...     # Shared database
  OAUTH_CLIENT_ID: <oauth-id>
  OAUTH_CLIENT_SECRET: <oauth-secret>
  OAUTH_CALLBACK_URL: https://your-domain/hub/oauth_callback

Why Critical:

  • Crypt key: Decrypt cookies from any Calliope Hub
  • Cookie secret: Validate sessions across hubs
  • OAuth: Users can authenticate through any Calliope Hub

Task Definition Reference:

containerDefinitions:
  - name: calliope-hub
    secrets:
      - name: JUPYTERHUB_CRYPT_KEY
        valueFrom: arn:aws:secretsmanager:...:calliope-hub-secrets:JUPYTERHUB_CRYPT_KEY::
      - name: JUPYTERHUB_COOKIE_SECRET
        valueFrom: arn:aws:secretsmanager:...:calliope-hub-secrets:JUPYTERHUB_COOKIE_SECRET::
      - name: JUPYTERHUB_DB_URL
        valueFrom: arn:aws:secretsmanager:...:calliope-hub-secrets:JUPYTERHUB_DB_URL::

4. ECS Service Configuration

Single Service, Multiple Tasks:

Service:
  ServiceName: calliope-hub-service
  Cluster: development
  TaskDefinition: calliope-hub-task  # latest ACTIVE revision
  DesiredCount: 3  # Number of Calliope Hub instances

  LaunchType: FARGATE
  PlatformVersion: LATEST

  NetworkConfiguration:
    AwsvpcConfiguration:
      Subnets:
        - subnet-private-1
        - subnet-private-2
        - subnet-private-3
      SecurityGroups:
        - sg-hub
      AssignPublicIp: DISABLED

  LoadBalancers:
    - TargetGroupArn: !Ref HubTargetGroup
      ContainerName: calliope-hub
      ContainerPort: 8000

  DeploymentConfiguration:
    MinimumHealthyPercent: 100  # Keep all hubs running during deploy
    MaximumPercent: 200         # Can double tasks during deploy
    DeploymentCircuitBreaker:
      Enable: true
      Rollback: true

  ServiceRegistries: []  # Optional: Service Discovery

Auto Scaling:

ScalableTarget:
  ServiceNamespace: ecs
  ResourceId: service/development/calliope-hub-service
  ScalableDimension: ecs:service:DesiredCount
  MinCapacity: 2
  MaxCapacity: 10

ScalingPolicy:
  PolicyType: TargetTrackingScaling
  TargetTrackingScalingPolicyConfiguration:
    TargetValue: 60.0  # Target 60% CPU
    PredefinedMetricSpecification:
      PredefinedMetricType: ECSServiceAverageCPUUtilization
    ScaleInCooldown: 300   # 5 minutes
    ScaleOutCooldown: 60   # 1 minute

Implementation Strategies

Strategy A: Simple Horizontal (Recommended for 500-2,000 servers)

Architecture: N identical hubs behind ALB

Pros:

  • Simple to implement
  • Easy to understand
  • Automatic failover
  • Linear scaling

Cons:

  • All hubs do same work (some duplication)
  • Orphan discovery may race
  • Idle culling needs coordination

Implementation:

  1. Deploy ALB with sticky sessions
  2. Update ECS service to desiredCount: 2
  3. Test session stickiness
  4. Gradually increase count

Best for: Teams wanting simple horizontal scale without sharding complexity


Strategy B: Sharding by User Group

Architecture: Dedicated Calliope Hub per user group/team

ALB Path-based routing:
/user/team-alpha/* β†’ Calliope Hub 1 (Team Alpha users only)
/user/team-beta/*  β†’ Calliope Hub 2 (Team Beta users only)
/user/team-gamma/* β†’ Calliope Hub 3 (Team Gamma users only)

Pros:

  • No sticky sessions needed (deterministic routing)
  • Perfect isolation between teams
  • Independent scaling per team
  • No orphan detection conflicts

Cons:

  • More complex routing
  • Need to assign users to groups
  • Uneven load distribution possible

Implementation:

  1. Assign users to groups in Calliope Hub
  2. Configure ALB path-based routing
  3. Update jupyterhub_config.py per Calliope Hub to filter users
  4. Deploy separate ECS services per Calliope Hub
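
For step 3, a minimal sketch of restricting one hub to one team, assuming the standard Authenticator.allowed_users setting and an illustrative, hard-coded member list:

# jupyterhub_config.py for the Team Alpha hub (illustrative)
# In practice, load the member list from your identity provider or a shared table.
TEAM_ALPHA_USERS = {"alice", "bob", "carol"}

c.Authenticator.allowed_users = TEAM_ALPHA_USERS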

Best for: Organizations with distinct teams/departments


Strategy C: Sharding by Service Type

Architecture: Dedicated Calliope Hub per service (lab/chat/ide)

ALB Path-based routing:
*/lab/*   β†’ Calliope Hub 1 (Lab servers only)
*/chat/*  β†’ Calliope Hub 2 (Chat servers only)
*/ide/*   β†’ Calliope Hub 3 (IDE servers only)

Pros:

  • Optimize each Calliope Hub for specific service
  • Different poll intervals per service
  • Independent resource allocation
  • Simpler orphan detection (one service type per Calliope Hub)

Cons:

  • User fragmented across multiple hubs
  • More complex routing
  • Shared authentication state still needed

Implementation:

  1. Create 3 Calliope Hub configurations (one per service)
  2. Configure ALB path-based routing
  3. Update service_loader to filter available services
  4. Deploy separate ECS services

Best for: Heavy usage of specific services, want to optimize each independently


Coordination Between Hubs

When running multiple hubs, certain operations need coordination:

1. Orphan Detection

Problem: All hubs discover the same orphaned tasks

Solutions:

Option A: Primary Calliope Hub Only

# jupyterhub_config.py
import os

HUB_INSTANCE_ID = os.getenv("HUB_INSTANCE_ID", "hub-1")
IS_PRIMARY = HUB_INSTANCE_ID == "hub-1"

if IS_PRIMARY:
    # Only the primary hub does orphan detection
    c.Spawner.poll_interval = 10
else:
    # Secondary hubs only poll their own tasks
    # and skip orphan discovery entirely
    pass

Option B: Distributed Lock

# Use Redis (or a DB lock) to elect one hub
import redis

lock = redis.Redis().lock("orphan-discovery-lock", timeout=300)

if lock.acquire(blocking=False):
    try:
        # This hub won the lock; do orphan discovery
        discover_and_adopt_orphans()
    finally:
        lock.release()

Option C: Task Assignment (Best)

# Assign orphan-detection responsibility by a stable hash of the task ARN
# (Python's built-in hash() is salted per process, so it differs between hubs)
import hashlib
import os

HUB_COUNT = int(os.getenv("HUB_COUNT", "3"))
my_index = int(os.getenv("HUB_INDEX", "0"))

task_hash = int(hashlib.sha256(task_arn.encode()).hexdigest(), 16) % HUB_COUNT

if task_hash == my_index:
    # This hub is responsible for this task
    check_and_adopt_orphan(task)

2. Idle Server Culling

Problem: Multiple hubs try to cull the same idle servers

Solutions:

Option A: Primary Calliope Hub Only

# Only the primary hub runs the idle culler
if IS_PRIMARY:
    c.JupyterHub.services.append({
        'name': 'idle-culler',
        'command': [...],
    })

Option B: Database Lock

# Culler acquires DB lock before culling
# Only one Calliope Hub can cull at a time
async def cull_with_lock():
    async with db_lock("idle-culler"):
        await cull_idle_servers()
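
The db_lock helper is not defined above; one way to implement it is with a PostgreSQL session-level advisory lock on the shared database. A sketch using asyncpg (the DSN is a placeholder):

# Sketch of db_lock via a PostgreSQL advisory lock (asyncpg)
import contextlib
import hashlib

import asyncpg


@contextlib.asynccontextmanager
async def db_lock(name, dsn="postgresql://..."):
    """Hold a session-level advisory lock named `name` for the duration of the block."""
    # Derive a stable signed 64-bit key from the lock name
    key = int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big", signed=True)
    conn = await asyncpg.connect(dsn)
    try:
        await conn.execute("SELECT pg_advisory_lock($1)", key)
        try:
            yield
        finally:
            await conn.execute("SELECT pg_advisory_unlock($1)", key)
    finally:
        await conn.close()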

Recommended: Option A (primary Calliope Hub only) for simplicity


3. Admin Operations

Problem: The admin UI only shows servers managed by the current Calliope Hub

Solution: Query database directly for all servers

# In hub config or a custom handler
from jupyterhub import orm


async def get_all_servers(db):
    """Get all servers across all hubs from the shared database.

    `db` is the hub's SQLAlchemy session (e.g. `self.db` in a handler).
    """
    all_spawners = (
        db.query(orm.Spawner)
        .filter(orm.Spawner.server_id.isnot(None))
        .all()
    )

    return [
        {
            'user': s.user.name,
            'name': s.name,
            'state': s.state,
            'started': s.started,
        }
        for s in all_spawners
    ]

Implementation Guide

Phase 1: Prerequisites (Week 1)

1.1. Migrate to PostgreSQL

# Create RDS instance
aws rds create-db-instance \
  --db-instance-identifier calliope-hub-db \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --engine-version 15.5 \
  --master-username jupyterhub_admin \
  --master-user-password <secure-password> \
  --allocated-storage 100 \
  --vpc-security-group-ids sg-xxxx \
  --db-subnet-group-name private-subnets \
  --backup-retention-period 7 \
  --multi-az

1.2. Update Secrets Manager

# Store database URL
aws secretsmanager create-secret \
  --name calliope-hub-secrets \
  --secret-string '{
    "JUPYTERHUB_DB_URL": "postgresql://Calliope Calliope Hub:password@endpoint:5432/Calliope Calliope Hub",
    "JUPYTERHUB_CRYPT_KEY": "'$(openssl rand -hex 32)'",
    "JUPYTERHUB_COOKIE_SECRET": "'$(openssl rand -hex 32)'"
  }'

1.3. Test Single Calliope Hub with PostgreSQL

# In jupyterhub_config.py
import os
c.JupyterHub.db_url = os.getenv("JUPYTERHUB_DB_URL")

Deploy and verify:

  • Users can log in
  • Servers spawn correctly
  • State persists across Calliope Hub restarts

Phase 2: Load Balancer Setup (Week 2)

2.1. Create ALB

# Create load balancer
aws elbv2 create-load-balancer \
  --name calliope-hub-alb \
  --subnets subnet-public-1 subnet-public-2 \
  --security-groups sg-alb \
  --scheme internet-facing \
  --type application \
  --ip-address-type ipv4

# Create target group
aws elbv2 create-target-group \
  --name calliope-hubs \
  --protocol HTTP \
  --port 8000 \
  --vpc-id vpc-xxxx \
  --health-check-path /hub/health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3

# Enable sticky sessions
aws elbv2 modify-target-group-attributes \
  --target-group-arn <arn> \
  --attributes \
    Key=stickiness.enabled,Value=true \
    Key=stickiness.type,Value=app_cookie \
    Key=stickiness.app_cookie.cookie_name,Value=jupyterhub-session-id \
    Key=stickiness.app_cookie.duration_seconds,Value=86400

2.2. Update DNS

# Point your domain to ALB
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "jupyter.yourdomain.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "<alb-zone-id>",
          "DNSName": "<alb-dns-name>",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

Phase 3: Deploy Multiple Hubs (Week 2-3)

3.1. Update Calliope Hub Task Definition

# Add Calliope Hub instance identification
environment:
  - name: HUB_INSTANCE_ID
    value: hub-${TASK_ID}  # Unique per task
  - name: HUB_COUNT
    value: "3"  # Total number of hubs
  - name: HUB_INDEX
    value: "0"  # Set dynamically or use launch config

3.2. Update ECS Service

# Increase desired count
Service:
  DesiredCount: 3  # Start with 3 hubs

  # Register with load balancer
  LoadBalancers:
    - TargetGroupArn: <target-group-arn>
      ContainerName: calliope-hub
      ContainerPort: 8000

3.3. Deploy and Verify

# Update service
aws ecs update-service \
  --cluster development \
  --service calliope-hub-service \
  --desired-count 3 \
  --load-balancers targetGroupArn=<arn>,containerName=calliope-hub,containerPort=8000

# Verify all tasks are running
aws ecs describe-services \
  --cluster development \
  --services calliope-hub-service \
  --query 'services[0].{running:runningCount,desired:desiredCount}'

# Verify targets are healthy in the ALB target group
aws elbv2 describe-target-health --target-group-arn <arn>

Phase 4: Configure Coordination (Week 3)

4.1. Designate Primary Calliope Hub

# jupyterhub_config.py

import os
import logging

logger = logging.getLogger(__name__)

HUB_INSTANCE_ID = os.getenv("HUB_INSTANCE_ID", "hub-1")
IS_PRIMARY = os.getenv("PRIMARY_HUB", "false").lower() == "true"

logger.info(f"Calliope Hub Instance: {HUB_INSTANCE_ID}, Primary: {IS_PRIMARY}")

# Only primary Calliope Hub runs background services
if IS_PRIMARY:
    logger.info("This is the primary Calliope Hub - enabling background services")

    # Idle culler (only on the primary hub)
    c.JupyterHub.services.append({
        'name': 'idle-culler',
        'command': [...],
    })

else:
    logger.info("This is a secondary Calliope Hub - background services disabled")

4.2. Environment Variables per Task

# Task Definition overrides for each instance
TaskDefinition:
  ContainerDefinitions:
    - Name: calliope-hub
      Environment:
        # Hub 1 (primary)
        - Name: PRIMARY_HUB
          Value: "true"
        - Name: HUB_INDEX
          Value: "0"

        # Hubs 2-N (secondary)
        # PRIMARY_HUB: "false" (or omit)
        # HUB_INDEX: "1", "2", etc.

Or set HUB_INDEX and PRIMARY_HUB dynamically at task launch; a sketch follows below.
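
One way to derive these values dynamically is to rank this task's ARN among the service's running tasks, using the ECS task metadata endpoint and boto3. The sketch below uses the cluster and service names from this document; the index shifts when tasks churn, so treat it as illustrative:

# derive_hub_index.py - compute HUB_INDEX / HUB_COUNT at startup
import os

import boto3
import requests


def derive_hub_index(cluster="development", service="calliope-hub-service"):
    # Task metadata endpoint v4 tells this container its own task ARN
    metadata_uri = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    my_task_arn = requests.get(f"{metadata_uri}/task", timeout=5).json()["TaskARN"]

    # Rank this task among all running tasks of the service
    ecs = boto3.client("ecs")
    task_arns = sorted(ecs.list_tasks(cluster=cluster, serviceName=service)["taskArns"])
    return task_arns.index(my_task_arn), len(task_arns)


HUB_INDEX, HUB_COUNT = derive_hub_index()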


Phase 5: Testing (Week 3-4)

5.1. Session Affinity Test

# Log in a user and note which hub handled the request
# From hub logs: "User <name> logged in on hub-<id>"

# Make 10 requests, verify all go to same Calliope Hub
for i in {1..10}; do
  curl -b cookies.txt https://jupyter.yourdomain.com/hub/api/user
  sleep 1
done

# Should see consistent Calliope Hub instance ID in ALB logs

5.2. Failover Test

# Stop one Calliope Hub task
aws ecs update-service \
  --cluster development \
  --service calliope-hub-service \
  --desired-count 2

# Users on stopped Calliope Hub should:
# - Get logged out (session lost)
# - Can log in again (routed to healthy Calliope Hub)
# - Their servers still accessible (state in DB)

5.3. Load Distribution Test

# Log in 30 users (10 per Calliope Hub)
# Check Calliope Hub logs to see distribution
aws logs filter-log-events \
  --log-group-name /aws/ecs/hub \
  --filter-pattern "User.*logged in" \
  --start-time $(date -d '10 minutes ago' +%s)000

5.4. Database Consistency Test

-- Connect to PostgreSQL
-- Check spawner state ("user" is a reserved word, so alias the tables)
SELECT u.name, s.name, s.server_id, s.state
FROM spawners s
JOIN users u ON s.user_id = u.id
WHERE s.server_id IS NOT NULL;

-- Should show servers from all hubs

Operational Procedures

Adding a Calliope Hub

# 1. Increase desired count
aws ecs update-service \
  --cluster development \
  --service calliope-hub-service \
  --desired-count 4  # from 3

# 2. Verify new Calliope Hub is healthy
aws ecs describe-services \
  --cluster development \
  --services calliope-hub-service

# 3. Monitor for 1 hour
# - Check CPU distribution
# - Verify load balancing
# - Check for errors

Rollback: Decrease desired count


Removing a Calliope Hub

# 1. Gracefully drain connections
# Set deregistration delay to 300s (5 minutes)

# 2. Decrease desired count
aws ecs update-service \
  --cluster development \
  --service calliope-hub-service \
  --desired-count 2  # from 3

# 3. Wait for task to drain
# Active users will be logged out when task stops
# They can log back in and be routed to healthy Calliope Hub

Calliope Hub Deployment / Update

# 1. Register new task definition revision
aws ecs register-task-definition --cli-input-json file://task-def.json

# 2. Update service with new task definition
aws ecs update-service \
  --cluster development \
  --service calliope-hub-service \
  --task-definition calliope-hub-task:NEW_REVISION

# 3. ECS performs rolling update:
#    - Starts new task
#    - Waits for health check
#    - Drains old task
#    - Stops old task
#    - Repeat for each Calliope Hub

# 4. Monitor deployment
aws ecs describe-services \
  --cluster development \
  --services calliope-hub-service \
  --query 'services[0].deployments'

Zero downtime: MinimumHealthyPercent: 100, MaximumPercent: 200


Troubleshooting Horizontal Scaling

Problem: Users getting logged out randomly

Cause: Session stickiness not working

Debug:

# Check ALB target group attributes
aws elbv2 describe-target-group-attributes \
  --target-group-arn <arn>

# Verify stickiness.enabled = true
# Verify cookie_name = jupyterhub-session-id

Fix:

aws elbv2 modify-target-group-attributes \
  --target-group-arn <arn> \
  --attributes Key=stickiness.enabled,Value=true

Problem: Some servers not visible in UI

Cause: Each Calliope Hub shows only its own spawners, not all spawners in the DB

Debug:

-- Count servers in database
SELECT COUNT(*) FROM spawners WHERE server_id IS NOT NULL;

-- Count servers shown in UI
-- Compare numbers

Fix: Implement cross-hub server listing (see Admin Operations above)


Problem: Multiple hubs adopting same orphan

Cause: Orphan detection racing

Debug:

# Check logs for orphan adoption
aws logs filter-log-events \
  --log-group-name /aws/ecs/hub \
  --filter-pattern "Recorded task adoption"

# Look for duplicate adoptions of same task_arn

Fix: Implement distributed lock or primary-only orphan detection


Problem: Database connection pool exhausted

Cause: Too many hubs, each with multiple connections

Debug:

-- Check active connections
SELECT COUNT(*) FROM pg_stat_activity WHERE datname = 'calliope_hub';

-- Check connection limit
SHOW max_connections;

Fix:

# Reduce the connection pool size per hub
c.JupyterHub.db_pool_size = 5  # from 10
c.JupyterHub.db_max_overflow = 10  # from 20

# OR use pgbouncer for connection pooling

Security Considerations

Network Isolation

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Public Subnets                          β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚  ALB (internet-facing)              β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Private Subnets                         β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
β”‚ β”‚  Hub 1   β”‚  β”‚  Hub 2   β”‚             β”‚
β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜             β”‚
β”‚      β”‚             β”‚                    β”‚
β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”˜                    β”‚
β”‚                β”‚                        β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚         β”‚     RDS     β”‚                 β”‚
β”‚         β”‚ (PostgreSQL)β”‚                 β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Security Groups:

ALB-SG:
  Ingress:
    - Port: 443
      Source: 0.0.0.0/0  # Internet
  Egress:
    - Port: 8000
      Destination: Hub-SG

Hub-SG:
  Ingress:
    - Port: 8000
      Source: ALB-SG
  Egress:
    - Port: 5432
      Destination: DB-SG
    - Port: 443
      Destination: 0.0.0.0/0  # AWS APIs

DB-SG:
  Ingress:
    - Port: 5432
      Source: Hub-SG
  Egress: None

Secrets Rotation

Challenge: Must rotate secrets across all hubs simultaneously

Procedure:

  1. Generate new secrets
  2. Update Secrets Manager
  3. Trigger Calliope Hub task refresh (forces secret fetch)
  4. All hubs pick up new secrets
  5. No downtime (old sessions work until expiry)

# Update secret
aws secretsmanager update-secret \
  --secret-id calliope-hub-secrets \
  --secret-string '{...new secrets...}'

# Force Calliope Hub task refresh (ECS will restart with new secrets)
aws ecs update-service \
  --cluster development \
  --service calliope-hub-service \
  --force-new-deployment

Monitoring Multiple Hubs

CloudWatch Dashboard

Metrics to monitor:

Calliope Hub CPU (per instance):
  Namespace: AWS/ECS
  Metric: CPUUtilization
  Dimensions:
    - ServiceName: calliope-hub-service
    - TaskId: <each-task-id>

Calliope Hub Memory (per instance):
  Namespace: AWS/ECS
  Metric: MemoryUtilization

ALB Request Count:
  Namespace: AWS/ApplicationELB
  Metric: RequestCount
  Dimensions:
    - LoadBalancer: app/calliope-hub-alb/...

ALB Target Health:
  Namespace: AWS/ApplicationELB
  Metric: HealthyHostCount, UnhealthyHostCount

Database Connections:
  Namespace: AWS/RDS
  Metric: DatabaseConnections
  Dimensions:
    - DBInstanceIdentifier: calliope-hub-db
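
For ad-hoc checks outside the dashboard, a short boto3 script can pull the same service-level CPU metric (cluster and service names as used above):

# cpu_check.py - average hub-service CPU over the last hour
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "development"},
        {"Name": "ServiceName", "Value": "calliope-hub-service"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}%")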

Alerts

HubCPUHigh:
  Condition: CPUUtilization > 80% for 5 minutes
  Action: Scale up Calliope Hub resources or add another Calliope Hub

HubInstanceUnhealthy:
  Condition: UnhealthyHostCount > 0 for 2 minutes
  Action: Check Calliope Hub logs, may need to restart

DatabaseConnectionsHigh:
  Condition: DatabaseConnections > 80% of max_connections
  Action: Add connection pooling or scale database

APIThrottling:
  Condition: ThrottlingException in logs
  Action: Implement batching or request limit increase

Cost Optimization

Right-Sizing

Don’t over-provision:

  • Start with 2 hubs, add more as needed
  • Monitor CPU for 1 week before scaling
  • Target 60-70% CPU utilization

Calliope Hub sizing examples:

Scenario          Recommended    Over-Provisioned    Wasted $/mo
200 servers       1 Γ— 4 vCPU     2 Γ— 4 vCPU          $122
500 servers       2 Γ— 4 vCPU     1 Γ— 16 vCPU         $246
1,000 servers     3 Γ— 4 vCPU     5 Γ— 4 vCPU          $244

Auto-Scaling Recommendations

For consistent load:

  • Don’t use auto-scaling (wastes time on scale-in/out)
  • Provision for peak + 20% headroom
  • Manually adjust based on growth

For variable load (e.g., classroom):

AutoScaling:
  MinCapacity: 2  # Baseline
  MaxCapacity: 5  # Peak
  TargetCPU: 60%
  ScaleInCooldown: 600  # 10 min (avoid thrashing)
  ScaleOutCooldown: 60  # 1 min (respond quickly)

Advanced Patterns

Multi-Region Deployment

For global deployments or disaster recovery:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            Route 53 (DNS)               β”‚
β”‚      Geolocation / Latency Routing      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                 β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”
    β”‚ us-west-2    β”‚  β”‚ eu-west-1    β”‚
    β”‚              β”‚  β”‚              β”‚
    β”‚ 3 Hubs       β”‚  β”‚ 3 Hubs       β”‚
    β”‚ PostgreSQL   β”‚  β”‚ PostgreSQL   β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                    β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Aurora Global  β”‚
          β”‚  (Multi-Region) β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Requirements:

  • Aurora Global Database (primary + read replicas)
  • Regional Calliope Hub deployments
  • DNS-based routing

Complexity: Very high. Cost: roughly 2Γ— the infrastructure plus cross-region data transfer.


Service-Specific Hubs

Deploy separate hubs optimized for each service type:

# lab-hub-config.py
c.Spawner.allowed_services = ["lab"]
c.Spawner.poll_interval = 30  # Lab can tolerate slower polling

# chat-hub-config.py
c.Spawner.allowed_services = ["chat"]
c.Spawner.poll_interval = 10  # Chat needs faster detection

# ide-hub-config.py
c.Spawner.allowed_services = ["waiide"]
c.Spawner.poll_interval = 15  # IDE moderate polling

Benefits:

  • Optimize polling per service type
  • Independent scaling
  • Failure isolation

Drawback: More complex routing and management


Migration Stories

Example 1: Startup to Scale-Up

Company: Data science team. Timeline: 6 months. Growth: 20 β†’ 300 servers.

Month 0: 20 servers
β”œβ”€ Config: 2 vCPU / 4GB / SQLite
└─ Cost: $60/month

Month 2: 75 servers (hit SQLite limits)
β”œβ”€ Action: Migrate to PostgreSQL
└─ Cost: $80/month

Month 4: 150 servers (CPU at 80%)
β”œβ”€ Action: Upgrade to 4 vCPU / 8GB
└─ Cost: $140/month

Month 6: 300 servers (stable)
β”œβ”€ Final: 4 vCPU / 8GB / PostgreSQL
└─ Cost: $140/month

Outcome: Smooth scaling, no architectural changes needed


Example 2: Enterprise Deployment

Company: University with 5,000 students. Timeline: 12 months. Growth: 100 β†’ 2,000 servers.

Month 0-3: 100-300 servers
β”œβ”€ Config: 8 vCPU / 16GB / PostgreSQL
└─ Cost: $265/month

Month 4-6: 300-600 servers (optimize)
β”œβ”€ Action: Implement API batching + caching
└─ Cost: $265/month (same)

Month 7-9: 600-1,200 servers (scale out)
β”œβ”€ Action: Add 2nd Calliope Hub + ALB
└─ Cost: $340/month (+$75)

Month 10-12: 1,200-2,000 servers (finalize)
β”œβ”€ Action: Add 3rd Calliope Hub, implement sharding
└─ Cost: $435/month (+$95)

Outcome: Successfully scaled 20x with 3 hubs


Summary

Key Takeaways

  1. Vertical scaling first - Simple and gets you to 300-500 servers
  2. PostgreSQL at 100+ servers - Essential for reliability
  3. Horizontal scaling at 500+ - Multiple hubs behind ALB
  4. API optimization at 300+ - Batching and caching essential
  5. Monitor continuously - CPU, poll duration, API errors

Next Steps

For your target of β€œa few hundred containers”:

Recommended Path:

  1. βœ… Week 1: Upgrade to 4 vCPU / 8GB ($60 β†’ $140/month)
  2. βœ… Week 2: Migrate to PostgreSQL if not already
  3. ⏸️ Week 3+: Monitor performance, scale as needed

This gives you solid capacity for 300-400 servers with minimal complexity!


See Also