Calliope Hub Performance Optimization Guide
Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.
This document describes techniques for improving Calliope Hub spawner performance, reducing API costs, and supporting higher concurrent server counts.
Quick Wins (No Code Changes)
1. Increase Poll Interval
Current: 5-10 seconds (randomized)
Optimized: 15-30 seconds
# Calliope Hub/config/jupyterhub_config.py
c.Spawner.poll_interval = 20 # from random.randint(5, 10)
c.Spawner.poll_jitter = 0.2 # Add 20% jitter to prevent thundering herd
Impact:
- API calls: -50% to -66%
- Capacity: +2x to +3x servers
- Tradeoff: Dead server detection slower (20s vs 10s)
Recommended for: Deployments with >200 servers
2. Optimize Healthcheck Timing
Recent improvements (October 2025):
- Chat: interval 5s → 15s, retries 5 → 8, start_period 30s → 45s
- WAIIDE: interval 10s → 15s, retries 3 → 8, start_period 30s → 45s
- Lab: Already optimal (30s interval, 8 retries, 60s start_period)
Why this helps:
- Less aggressive polling reduces ECS task overhead
- Longer start_period prevents premature failures
- More retries improve reliability
Files updated:
- Calliope Hub/services/chat.yaml
- Calliope Hub/services/waiide.yaml
- Calliope Hub/config/helpers/chat_helper.py
- Calliope Hub/config/helpers/waiide_helper.py
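For reference, these timings land in the ECS container definition's healthCheck block that the service helpers emit. A minimal sketch of the chat container's health check after the change (the command, port, and constant name are assumptions; the real values live in chat_helper.py / chat.yaml):
# Sketch only: ECS container healthCheck settings for the chat service
# after the October 2025 tuning. Command and port are assumptions.
CHAT_HEALTHCHECK = {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 15,     # was 5
    "timeout": 5,
    "retries": 8,       # was 5
    "startPeriod": 45,  # was 30
}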
3. Increase Calliope Hub Resources
Current: 2 vCPU / 4GB
Recommended: 4 vCPU / 8GB for >100 servers
# Update Calliope Hub task definition
TaskDefinition:
cpu: "4096" # from "2048"
memory: "8192" # from "4096"Impact:
- Capacity: +2x to +3x servers
- Cost: +$60/month
- No code changes
Medium-Effort Optimizations
4. Batch ECS API Calls
Problem: Each spawner polls independently → N API calls per cycle
Solution: Batch all spawners into single API call
Implementation:
# Calliope Hub/spawners/helpers/batch_poller.py (NEW FILE)
import asyncio
import boto3
from typing import Dict, List
from datetime import datetime, timezone
class BatchTaskPoller:
"""Batch multiple task status checks into single ECS API calls."""
def __init__(self, ecs_cluster: str, aws_region: str, logger):
self.ecs_cluster = ecs_cluster
self.aws_region = aws_region
self.log = logger
self.ecs_client = boto3.client("ecs", region_name=aws_region)
# Cache for task status (TTL: 10 seconds)
self._cache = {}
self._cache_ttl = 10
async def get_task_status(self, task_arn: str) -> Dict:
"""
Get task status, using cache if available.
Args:
task_arn: ECS task ARN
Returns:
Task description dict
"""
# Check cache first
if task_arn in self._cache:
cached_data, cached_time = self._cache[task_arn]
age = (datetime.now(timezone.utc) - cached_time).total_seconds()
if age < self._cache_ttl:
self.log.debug(f"Cache hit for {task_arn[-12:]} (age: {age:.1f}s)")
return cached_data
# Cache miss or expired - fetch from ECS
return await self._fetch_task_status(task_arn)
async def batch_get_task_status(self, task_arns: List[str]) -> Dict[str, Dict]:
"""
Batch fetch multiple task statuses in single API call.
Args:
task_arns: List of task ARNs to check
Returns:
Dict mapping task_arn to task description
"""
if not task_arns:
return {}
# Filter out cached tasks
uncached_arns = []
results = {}
for arn in task_arns:
if arn in self._cache:
cached_data, cached_time = self._cache[arn]
age = (datetime.now(timezone.utc) - cached_time).total_seconds()
if age < self._cache_ttl:
results[arn] = cached_data
continue
uncached_arns.append(arn)
# Fetch uncached tasks in batches of 100 (ECS API limit)
if uncached_arns:
self.log.info(f"Batch fetching {len(uncached_arns)} tasks from ECS")
for i in range(0, len(uncached_arns), 100):
batch = uncached_arns[i:i+100]
loop = asyncio.get_running_loop()
response = await loop.run_in_executor(
None,
lambda: self.ecs_client.describe_tasks(
cluster=self.ecs_cluster,
tasks=batch,
include=["TAGS"]
)
)
# Cache and return results
now = datetime.now(timezone.utc)
for task in response.get("tasks", []):
task_arn = task["taskArn"]
self._cache[task_arn] = (task, now)
results[task_arn] = task
return results
async def _fetch_task_status(self, task_arn: str) -> Dict:
"""Fetch single task status and cache it."""
loop = asyncio.get_running_loop()
response = await loop.run_in_executor(
None,
lambda: self.ecs_client.describe_tasks(
cluster=self.ecs_cluster,
tasks=[task_arn]
)
)
tasks = response.get("tasks", [])
if tasks:
task = tasks[0]
self._cache[task_arn] = (task, datetime.now(timezone.utc))
return task
return None
def clear_cache(self):
"""Clear the cache (e.g., after spawning new tasks)."""
self._cache.clear()
Usage in ECS Spawner:
# Calliope Hub/spawners/ecs.py
from typing import Optional

from .helpers.batch_poller import BatchTaskPoller  # import path assumes the helper module above

# At module level, create shared batch poller
_batch_poller = None
def get_batch_poller(spawner):
global _batch_poller
if _batch_poller is None:
_batch_poller = BatchTaskPoller(
spawner.ecs_cluster,
spawner.aws_region,
spawner.log
)
return _batch_poller
# In poll() method, replace direct describe_tasks:
async def poll(self) -> Optional[int]:
if not self.task_arn:
return 0
# Use batch poller instead of direct ECS call
batch_poller = get_batch_poller(self)
task = await batch_poller.get_task_status(self.task_arn)
if not task:
return 0 # Task not found
# Rest of poll logic...
Impact:
- API calls: -90% (10x reduction)
- Capacity: +10x servers
- Cost: Minimal (caching overhead)
- Complexity: Medium
5. Implement Coordinated Polling
Problem: All spawners poll independently
Solution: Coordinate polling in waves
# Calliope Hub/config/jupyterhub_config.py
import asyncio
import random
# Global poll coordinator
class PollCoordinator:
def __init__(self, poll_interval=15):
self.poll_interval = poll_interval
self.last_poll = {}
async def should_poll(self, spawner_id: str) -> bool:
"""
Check if spawner should poll now.
Coordinates polling to prevent simultaneous API calls.
"""
import time
now = time.time()
last_poll = self.last_poll.get(spawner_id, 0)
if now - last_poll < self.poll_interval:
return False # Too soon
# Add jitter to prevent thundering herd
jitter = random.uniform(-2, 2)
if now - last_poll < self.poll_interval + jitter:
return False
self.last_poll[spawner_id] = now
return True
poll_coordinator = PollCoordinator(poll_interval=15)
# In spawner's poll() method:
async def poll(self):
spawner_id = f"{self.user.name}/{self.name}"
if not await poll_coordinator.should_poll(spawner_id):
# Skip this poll cycle
return None # Assume still running
# Proceed with actual poll
...
Impact:
- API calls: -30% to -50%
- CPU: -20% to -30%
- Complexity: Low
6. Response Caching Layer
Problem: Multiple polls within short window fetch same data
Solution: Cache ECS responses for 10-15 seconds
# Calliope Hub/spawners/helpers/ecs_cache.py (NEW FILE)
from datetime import datetime, timezone, timedelta
from typing import Dict, Optional
import asyncio
class ECSResponseCache:
"""Cache ECS API responses to reduce redundant calls."""
def __init__(self, ttl_seconds: int = 15):
self.ttl = timedelta(seconds=ttl_seconds)
self._cache: Dict[str, tuple] = {}
self._lock = asyncio.Lock()
async def get(self, key: str) -> Optional[Dict]:
"""Get cached response if not expired."""
async with self._lock:
if key in self._cache:
data, timestamp = self._cache[key]
if datetime.now(timezone.utc) - timestamp < self.ttl:
return data
else:
# Expired, remove
del self._cache[key]
return None
async def set(self, key: str, data: Dict):
"""Cache response with current timestamp."""
async with self._lock:
self._cache[key] = (data, datetime.now(timezone.utc))
async def clear_expired(self):
"""Background task to clear expired entries."""
while True:
await asyncio.sleep(60) # Every minute
async with self._lock:
now = datetime.now(timezone.utc)
expired = [
k for k, (_, ts) in self._cache.items()
if now - ts >= self.ttl
]
for k in expired:
del self._cache[k]
# Global cache instance
ecs_cache = ECSResponseCache(ttl_seconds=15)
# Start the cleanup task once an event loop is running (e.g. from a startup
# hook): asyncio.create_task() raises RuntimeError outside a running loop
asyncio.create_task(ecs_cache.clear_expired())
Usage:
# In ECS spawner poll() method
cache_key = f"task:{self.task_arn}"
# Try cache first
cached = await ecs_cache.get(cache_key)
if cached:
task = cached
else:
# Fetch from ECS
response = ecs_client.describe_tasks(...)
task = response["tasks"][0]
await ecs_cache.set(cache_key, task)
Impact:
- API calls: -40% to -60%
- Response time: Faster (cache hits)
- Memory: +10-20 MB for cache
- Complexity: Low
High-Effort Optimizations
7. Event-Driven Architecture
Problem: Polling is wasteful - we poll even when nothing changed
Solution: Subscribe to ECS task state change events
Architecture:
┌─────────────────────────┐
│ ECS Task Events │
│ (State Changes) │
└────────────┬────────────┘
│
│ EventBridge
│
┌────────────▼────────────┐
│ Lambda Function │
│ or SNS Topic │
└────────────┬────────────┘
│
│ HTTP POST
│
┌────────────▼────────────┐
│   Calliope Hub Service  │
│  /hub/api/tasks/status  │
│    (Custom endpoint)    │
└─────────────────────────┘
EventBridge Rule:
EventRule:
Name: ecs-task-state-changes
EventPattern:
source:
- aws.ecs
detail-type:
- ECS Task State Change
detail:
clusterArn:
- arn:aws:ecs:us-west-2:xxxx:cluster/development
lastStatus:
- RUNNING
- STOPPED
- DEPROVISIONING
Targets:
- Arn: !GetAtt TaskStatusLambda.Arn
Id: task-status-processor
Lambda Handler:
# lambda/ecs_task_status.py
import json
import requests
import os
JUPYTERHUB_API_TOKEN = os.environ["JUPYTERHUB_API_TOKEN"]
JUPYTERHUB_API_URL = os.environ["JUPYTERHUB_API_URL"]
def lambda_handler(event, context):
"""Process ECS task state change events and notify Calliope Calliope Hub."""
detail = event.get("detail", {})
task_arn = detail.get("taskArn")
last_status = detail.get("lastStatus")
desired_status = detail.get("desiredStatus")
# Extract task metadata
containers = detail.get("containers", [])
# Notify Calliope Hub of state change
response = requests.post(
f"{JUPYTERHUB_API_URL}/Calliope Hub/api/tasks/status",
headers={"Authorization": f"token {JUPYTERHUB_API_TOKEN}"},
json={
"task_arn": task_arn,
"status": last_status,
"desired_status": desired_status,
"containers": containers,
},
timeout=5,
)
return {
"statusCode": 200,
"body": json.dumps(f"Processed {task_arn}: {last_status}"),
}
Calliope Hub Custom Handler:
# Calliope Hub/handlers/task_status_handler.py (NEW FILE)
import asyncio
from datetime import datetime

from jupyterhub.handlers import BaseHandler
from jupyterhub.utils import token_authenticated
class TaskStatusHandler(BaseHandler):
"""Receive ECS task status updates from EventBridge."""
@token_authenticated
async def post(self):
"""Process task status update."""
data = self.get_json_body()
task_arn = data.get("task_arn")
status = data.get("status")
self.log.info(f"Received task status: {task_arn[-12:]} = {status}")
# Find spawner managing this task
for user in self.users.values():
for spawner in user.spawners.values():
if getattr(spawner, "task_arn", None) == task_arn:
# Update spawner's cached state
spawner._cached_task_status = status
spawner._cache_time = datetime.now()
if status == "STOPPED":
# Trigger cleanup
self.log.info(f"Task stopped, scheduling cleanup")
asyncio.create_task(spawner.stop())
self.set_status(200)
return
self.log.warning(f"No spawner found for task {task_arn[-12:]}")
self.set_status(404)
# Register handler in jupyterhub_config.py:
c.JupyterHub.extra_handlers = [
    (r"/hub/api/tasks/status", TaskStatusHandler),
]
Impact:
- API calls: -95% (only when state changes)
- Polling: Can reduce to 60s (events handle updates)
- Capacity: +10x to +20x servers
- Complexity: High (Lambda + EventBridge setup)
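Once events are flowing, poll() can trust the pushed status and only fall back to ECS when no recent event has arrived. A minimal sketch (attribute names match the handler above; the 60-second freshness window and the _poll_ecs fallback name are assumptions):
# In ECSSpawner.poll(): prefer the event-pushed status over calling ECS.
# _cached_task_status / _cache_time are set by TaskStatusHandler above.
from datetime import datetime, timedelta
from typing import Optional

async def poll(self) -> Optional[int]:
    status = getattr(self, "_cached_task_status", None)
    cached_at = getattr(self, "_cache_time", None)
    if status and cached_at and datetime.now() - cached_at < timedelta(seconds=60):
        if status == "STOPPED":
            return 0      # task exited
        return None       # RUNNING / transitioning: treat as alive
    # No fresh event - fall back to the existing describe_tasks-based logic
    return await self._poll_ecs()  # hypothetical name for the original poll path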
8. Separate Monitoring Service
Problem: Calliope Hub CPU wasted on polling instead of serving users
Solution: Dedicated monitoring service for health checks
Architecture:
┌──────────────┐
│ Calliope Hub │ ← Focus on spawning & routing
│ (No polling) │
└──────┬───────┘
│ Database (shared state)
│
┌──────▼────────────┐
│ Monitor Service │ ← Dedicated polling
│ - Polls all tasks │
│ - Updates DB │
│ - Sends events │
└───────────────────┘
Monitor Service (Python):
# services/task_monitor.py
import asyncio
import json
import logging
from datetime import datetime

import boto3
from sqlalchemy import create_engine, text

logger = logging.getLogger(__name__)
class TaskMonitor:
"""Dedicated service for monitoring ECS task health."""
def __init__(self, db_url, ecs_cluster, aws_region):
self.db = create_engine(db_url)
self.ecs = boto3.client("ecs", region_name=aws_region)
self.cluster = ecs_cluster
async def monitor_loop(self):
"""Main monitoring loop."""
while True:
try:
# Get all spawner tasks from database
with self.db.connect() as conn:
    result = conn.execute(text("""
        SELECT id, user_id, name, state
        FROM spawners
        WHERE server_id IS NOT NULL
    """))
spawners = result.fetchall()
# Extract task ARNs from state
task_arns = []
for spawner in spawners:
state = json.loads(spawner.state or "{}")
task_arn = state.get("task_arn")
if task_arn:
task_arns.append((spawner.id, task_arn))
# Batch check all tasks (100 at a time)
for i in range(0, len(task_arns), 100):
batch = [arn for _, arn in task_arns[i:i+100]]
response = self.ecs.describe_tasks(
cluster=self.cluster,
tasks=batch
)
# Update database with status
for task in response.get("tasks", []):
task_arn = task["taskArn"]
status = task["lastStatus"]
health = task.get("healthStatus", "UNKNOWN")
# Find spawner ID
spawner_id = next(
(sid for sid, arn in task_arns if arn == task_arn),
None
)
if spawner_id:
# Update spawner state in DB
self._update_spawner_status(spawner_id, status, health)
# Sleep before next poll
await asyncio.sleep(15)
except Exception as e:
logger.error(f"Monitor loop error: {e}")
await asyncio.sleep(60)
    def _update_spawner_status(self, spawner_id, status, health):
        """Update spawner status in database."""
        # A single UPDATE cannot assign the same column more than once,
        # so nest the jsonb_set calls instead.
        with self.db.begin() as conn:
            conn.execute(
                text("""
                    UPDATE spawners
                    SET state = jsonb_set(
                            jsonb_set(
                                jsonb_set(
                                    COALESCE(state::jsonb, '{}'::jsonb),
                                    '{task_status}', to_jsonb(CAST(:status AS TEXT))
                                ),
                                '{health_status}', to_jsonb(CAST(:health AS TEXT))
                            ),
                            '{last_poll}', to_jsonb(CAST(:last_poll AS TEXT))
                        )
                    WHERE id = :spawner_id
                """),
                {
                    "status": status,
                    "health": health,
                    "last_poll": datetime.now().isoformat(),
                    "spawner_id": spawner_id,
                },
            )
Deploy as ECS Service:
Service:
ServiceName: Calliope Hub-monitor
Cluster: development
TaskDefinition: task-monitor:latest
DesiredCount: 1 # Single instance
# No load balancer - internal service
Impact:
- Calliope Hub CPU: -80% (no polling overhead)
- API calls: Centralized (easier to optimize)
- Capacity: +5x Calliope Hub capacity
- Complexity: Very high
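On the hub side, poll() then only reads what the monitor wrote to the shared database instead of calling ECS itself. A minimal sketch, assuming the monitor stores task_status inside the spawner's persisted JSON state as in the SQL above and that the stock JupyterHub ORM objects (self.db, self.orm_spawner) are available on the spawner:
# Sketch: hub-side poll() that trusts the monitor service's database writes.
from typing import Optional

async def poll(self) -> Optional[int]:
    # Refresh so the hub sees the monitor's latest update, not a cached row
    self.db.refresh(self.orm_spawner)
    state = self.orm_spawner.state or {}
    if state.get("task_status") == "STOPPED":
        return 0      # monitor observed the task exit
    return None       # running, starting, or no status recorded yet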
Database Optimizations
9. Add Indexes
Problem: Slow queries on spawner lookups
Solution: Add strategic indexes
-- Index on user_id for fast user lookups
CREATE INDEX idx_spawners_user_id ON spawners(user_id);
-- Index on server_id for fast server lookups
CREATE INDEX idx_spawners_server_id ON spawners(server_id);
-- GIN index for task ARN lookups inside the JSON state column
CREATE INDEX idx_spawners_state_task_arn ON spawners USING gin ((state::jsonb) jsonb_path_ops);
-- Index on last_activity for idle culling
CREATE INDEX idx_spawners_last_activity ON spawners(last_activity);
-- Index on user name for fast username lookups
CREATE INDEX idx_users_name ON users(name);
Analyze query performance:
-- Enable query logging
ALTER DATABASE jupyterhub SET log_min_duration_statement = 100;
-- Check slow queries
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
10. Connection Pooling
Problem: Each Calliope Hub creates multiple database connections
Solution: Use pgbouncer for connection pooling
Deploy pgbouncer:
# pgbouncer.ini
[databases]
jupyterhub = host=rds-endpoint port=5432 dbname=jupyterhub
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
reserve_pool_size = 10
server_lifetime = 3600
server_idle_timeout = 600
Update Calliope Hub:
# Connect through pgbouncer instead of direct RDS
c.JupyterHub.db_url = "postgresql://user:pass@pgbouncer:6432/jupyterhub"
# Reduce pool size (pgbouncer handles pooling)
c.JupyterHub.db_pool_size = 5 # from 10
c.JupyterHub.db_max_overflow = 5 # from 20
Impact:
- Database connections: 100+ → 20-30
- Database CPU: -50%
- Cost: RDS can use smaller instance
- Complexity: Medium
Code-Level Optimizations
11. Lazy Loading
Problem: Calliope Hub loads all spawner state on startup
Solution: Load spawner state only when needed
# Calliope Hub/spawners/ecs.py
class ECSSpawner(BaseSpawner):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._state_loaded = False
async def _ensure_state_loaded(self):
"""Lazy load spawner state from database."""
if self._state_loaded:
return
# Load state from DB
state = await self._load_state_from_db()
if state:
self.load_state(state)
self._state_loaded = True
async def poll(self):
await self._ensure_state_loaded()
# ... rest of poll logic
Impact:
- Calliope Hub startup: -50% faster
- Memory: -30% (only loaded spawners in memory)
- Complexity: Medium
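The _load_state_from_db helper referenced above isn't defined in this guide. A minimal sketch, assuming the spawner carries its ORM record and a database session as in stock JupyterHub (self.orm_spawner, self.db):
# Sketch: lazy state loader for ECSSpawner. Attribute names assume the
# stock JupyterHub ORM objects are attached to the spawner.
async def _load_state_from_db(self) -> dict:
    if self.orm_spawner is None:
        return {}
    # Re-read the row so we pick up state persisted by other hub processes
    self.db.refresh(self.orm_spawner)
    return self.orm_spawner.state or {}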
12. Async Optimization
Problem: Serial API calls during spawn
Solution: Parallelize independent operations
# Before (serial):
async def start(self):
    await self._create_task_definition()   # 500ms
    await self._run_task()                 # 1000ms
    await self._wait_for_running()         # 2000ms
    await self._register_with_proxy()      # 500ms
    # Total: 4000ms
# After (parallel):
async def start(self):
# Run independent operations in parallel
task_def, proxy_ready = await asyncio.gather(
self._create_task_definition(),
self._prepare_proxy_registration(),
)
task_arn = await self._run_task(task_def)
# Parallel wait
await asyncio.gather(
self._wait_for_running(task_arn),
self._register_with_proxy(task_arn, proxy_ready),
)
# Total: ~3500ms (proxy registration overlaps the wait)
Impact:
- Spawn time: -30% to -40%
- User experience: Significantly better
- Complexity: Medium
13. Reduce Log Verbosity
Problem: Excessive logging increases I/O and CloudWatch costs
Solution: Log levels per environment
# jupyterhub_config.py
import logging
import os
# Environment-based log levels
if os.getenv("ENVIRONMENT") == "production":
c.Calliope Calliope Hub.log_level = "WARNING" # Only warnings and errors
c.Spawner.debug = False
else:
c.Calliope Calliope Hub.log_level = "INFO"
c.Spawner.debug = True
# Reduce SQLAlchemy logging
logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)
logging.getLogger("tornado.access").setLevel(logging.WARNING)
# Keep important loggers at INFO
logging.getLogger("Calliope Calliope Hub.proxy").setLevel(logging.INFO)
logging.getLogger("Calliope Calliope Hub.spawner").setLevel(logging.INFO)Impact:
- CloudWatch costs: -60% to -80%
- I/O overhead: -50%
- Log clarity: Better (signal vs noise)
Monitoring Optimizations
14. Metrics Collection
Use CloudWatch Embedded Metrics Format:
# Calliope Hub/spawners/helpers/metrics.py (NEW FILE)
import json
import time
from datetime import datetime
class MetricsCollector:
"""Collect spawner metrics using EMF for CloudWatch."""
@staticmethod
def emit_spawn_duration(service_type: str, duration: float):
"""Emit spawn duration metric."""
print(json.dumps({
"_aws": {
"Timestamp": int(time.time() * 1000),
"CloudWatchMetrics": [{
"Namespace": "Calliope Calliope Hub/Spawner",
"Dimensions": [["ServiceType"]],
"Metrics": [{"Name": "SpawnDuration", "Unit": "Seconds"}]
}]
},
"ServiceType": service_type,
"SpawnDuration": duration,
}))
@staticmethod
def emit_poll_duration(duration: float, task_count: int):
"""Emit poll cycle metrics."""
print(json.dumps({
"_aws": {
"Timestamp": int(time.time() * 1000),
"CloudWatchMetrics": [{
"Namespace": "Calliope Calliope Hub/Spawner",
"Dimensions": [["Operation"]],
"Metrics": [
{"Name": "PollDuration", "Unit": "Seconds"},
{"Name": "TaskCount", "Unit": "Count"}
]
}]
},
"Operation": "Poll",
"PollDuration": duration,
"TaskCount": task_count,
}))
Usage:
# In spawner
from helpers.metrics import MetricsCollector
async def start(self):
start_time = time.time()
# ... spawn logic ...
duration = time.time() - start_time
MetricsCollector.emit_spawn_duration(self.instance_type, duration)
Create Dashboard:
# CloudWatch dashboard to track
# - Average spawn duration by service
# - Poll duration trends
# - API call counts
# - Active spawner count
Network Optimizations
15. VPC Endpoints
Problem: ECS API calls go through NAT Gateway ($$$)
Solution: VPC Endpoints for AWS services
VPCEndpoints:
- ServiceName: com.amazonaws.us-west-2.ecs
VpcEndpointType: Interface
SubnetIds:
- subnet-private-1
- subnet-private-2
SecurityGroupIds:
- sg-vpc-endpoints
- ServiceName: com.amazonaws.us-west-2.ecs-agent
VpcEndpointType: Interface
- ServiceName: com.amazonaws.us-west-2.ecr.api
VpcEndpointType: Interface
- ServiceName: com.amazonaws.us-west-2.ecr.dkr
VpcEndpointType: Interface
- ServiceName: com.amazonaws.us-west-2.s3
VpcEndpointType: Gateway # Free!
- ServiceName: com.amazonaws.us-west-2.logs
VpcEndpointType: Interface
Impact:
- NAT Gateway costs: -90% (API traffic stays in VPC)
- Latency: -20% to -30% (faster API calls)
- Cost savings: ~$45-90/month (depending on API volume)
- Complexity: Low (one-time setup)
16. Reduce EFS Latency
Problem: EFS mount delays during container startup
Solutions:
A. Use Provisioned Throughput
EFS:
ThroughputMode: provisioned
ProvisionedThroughputInMibps: 10 # from bursting
Cost: ~$60/month for 10 MB/s
Benefit: Consistent performance, faster startup
B. Pre-create User Directories
# In spawner pre_spawn_hook
async def pre_spawn_hook(spawner):
# Pre-create user directory structure on EFS
# Reduces container startup time
await provision_user_directory(spawner.user.name)  # a sketch of this helper appears after the Impact list below
C. Cache Common Files
# In container image, pre-populate common configs
COPY jupyter_config_template.py /opt/jupyter/config/
COPY jupyter_ai_schema.json /opt/jupyter/ai/
Impact:
- Spawn time: -10% to -20%
- EFS costs: +$60/month (if provisioned)
- User experience: Noticeably faster
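The provision_user_directory helper used in item B above is hypothetical; a minimal sketch, assuming the EFS filesystem is mounted on the hub at /efs/users and that containers run as UID/GID 1000 (both assumptions):
# Sketch: pre-create a user's EFS directory tree so the container doesn't
# have to on first start. Mount point, layout, and UID/GID are assumptions.
import asyncio
import os

EFS_ROOT = "/efs/users"   # hypothetical hub-side EFS mount point
DEFAULT_UID = DEFAULT_GID = 1000

async def provision_user_directory(username: str) -> None:
    def _create() -> None:
        for subdir in ("", "notebooks", ".jupyter"):
            path = os.path.join(EFS_ROOT, username, subdir)
            os.makedirs(path, exist_ok=True)
            os.chown(path, DEFAULT_UID, DEFAULT_GID)

    # Keep blocking filesystem calls off the event loop
    await asyncio.get_running_loop().run_in_executor(None, _create)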
Summary: Optimization Priority
For 100-200 Servers
Priority 1: Quick Wins
1. ✅ Fix healthcheck timing (DONE)
2. ✅ Fix orphan detection bug (DONE)
3. ⬜ Migrate to PostgreSQL
4. ⬜ Increase poll interval to 15s
Expected Impact: Stable, reliable system
For 200-500 Servers
Priority 1: Resource Scaling
1. ⬜ Upgrade to 4 vCPU / 8GB Calliope Hub
2. ⬜ PostgreSQL with connection pooling
3. ⬜ Poll interval to 20s
Priority 2: Code Optimization
4. ⬜ Implement response caching
5. ⬜ Coordinated polling with jitter
6. ⬜ VPC endpoints (cost savings)
Expected Impact: 2-3x capacity increase
For 500-1,000 Servers
Priority 1: Architectural Changes
1. ⬜ Horizontal scaling (2-3 hubs + ALB)
2. ⬜ Batch API calls
3. ⬜ Response caching
Priority 2: Advanced Optimization
4. ⬜ Distributed coordination (locks/primary election)
5. ⬜ Database query optimization
6. ⬜ Async parallelization
Expected Impact: 5x capacity increase
For 1,000+ Servers
Priority 1: Advanced Architecture
1. ⬜ Event-driven with EventBridge
2. ⬜ Separate monitoring service
3. ⬜ Redis caching cluster
4. ⬜ 5+ hubs with sharding
Priority 2: Enterprise Features
5. ⬜ Multi-region deployment
6. ⬜ Advanced monitoring and alerting
7. ⬜ Capacity planning automation
Expected Impact: 10x+ capacity increase
Performance Testing
Load Testing Script
# tests/load_test.py
import asyncio
import os
import sys
import time
from typing import List

import aiohttp

# Hub API token (same env var the rest of this guide uses)
API_TOKEN = os.environ["JUPYTERHUB_API_TOKEN"]
async def spawn_server(session: aiohttp.ClientSession, username: str, server_name: str):
"""Spawn a single server."""
start = time.time()
async with session.post(
f"https://your-Calliope Hub/Calliope Hub/api/users/{username}/servers/{server_name}",
headers={"Authorization": f"token {API_TOKEN}"},
json={"instance_type": "lab", "instance_size": "medium"}
) as resp:
duration = time.time() - start
return {
"username": username,
"server_name": server_name,
"status": resp.status,
"duration": duration,
}
async def load_test(num_servers: int):
"""Spawn N servers concurrently and measure performance."""
async with aiohttp.ClientSession() as session:
tasks = []
# Create spawn tasks
for i in range(num_servers):
username = f"loadtest-user-{i:04d}"
server_name = f"lab-{int(time.time())}"
tasks.append(spawn_server(session, username, server_name))
# Execute concurrently
print(f"Spawning {num_servers} servers...")
start = time.time()
results = await asyncio.gather(*tasks, return_exceptions=True)
total_time = time.time() - start
# Analyze results
successful = sum(1 for r in results if isinstance(r, dict) and r["status"] in (201, 202))  # 201 started, 202 pending
failed = len(results) - successful
durations = [r["duration"] for r in results if isinstance(r, dict)]
avg_duration = sum(durations) / len(durations) if durations else 0.0
print(f"""
Load Test Results:
- Total servers: {num_servers}
- Successful: {successful}
- Failed: {failed}
- Total time: {total_time:.1f}s
- Avg spawn time: {avg_duration:.1f}s
- Spawns/second: {successful / total_time:.2f}
""")
if __name__ == "__main__":
    # Server count from the command line, e.g. `python load_test.py 100`
    count = int(sys.argv[1]) if len(sys.argv) > 1 else 50
    asyncio.run(load_test(count))
Run at different scales:
python load_test.py 50 # Baseline
python load_test.py 100 # 2x
python load_test.py 200 # 4x
python load_test.py 500 # 10x
Benchmarking
Key metrics to track:
1. Spawn latency (P50, P95, P99)
Target: P95 < 60s
2. Poll cycle duration
Target: < 10s for all spawners
3. API error rate
Target: < 0.1%
4. Calliope Hub CPU utilization
Target: < 70%
5. Database query time
Target: < 100ms for 95th percentileCollect with:
# In spawner
import time
import logging
logger = logging.getLogger(__name__)
async def poll(self):
start = time.time()
# ... poll logic ...
duration = time.time() - start
if duration > 5:
logger.warning(f"Slow poll: {duration:.2f}s for {self.user.name}/{self.name}")
# Emit metric
emit_metric("PollDuration", duration)Cost Optimization
API Call Reduction
Current (200 servers, 10s polling):
200 servers × (3600s / 10s) × 24h × 30d = 51,840,000 API calls/month
Cost: Free tier covers first 1M, then $0.01 per 10,000 = ~$51/month
With batching (200 servers, 15s polling, batched):
2 batched calls (100 tasks each) × (3600s / 15s) × 24h × 30d = 345,600 API calls/month
Cost: Free tier covers all = $0/month
Savings: ~$51/month (+ API stability)
CloudWatch Logs Reduction
Current (verbose logging):
200 servers × 1 MB logs/day = 200 MB/day = 6 GB/month
Cost: $0.50/GB ingestion + $0.03/GB storage = ~$3.18/month
Optimized (WARNING level in prod):
200 servers × 100 KB logs/day = 20 MB/day = 600 MB/month
Cost: $0.30 ingestion + $0.02 storage = ~$0.32/month
Savings: ~$2.86/month (+ faster log queries)
Appendix: Configuration Examples
Optimized jupyterhub_config.py
# For 300-500 servers on 8 vCPU / 16GB Calliope Hub
import random
import os
# Polling configuration
c.Spawner.poll_interval = 20 # Slower polling for stability
c.Spawner.poll_jitter = 0.2 # 20% jitter
# Activity tracking
c.JupyterHub.last_activity_interval = 300 # 5 minutes
# Database configuration (PostgreSQL)
c.JupyterHub.db_url = os.getenv("JUPYTERHUB_DB_URL")
c.JupyterHub.db_pool_size = 10
c.JupyterHub.db_max_overflow = 20
# Logging (production)
if os.getenv("ENVIRONMENT") == "production":
c.Calliope Calliope Hub.log_level = "WARNING"
else:
c.Calliope Calliope Hub.log_level = "INFO"
# Proxy configuration
c.ConfigurableHTTPProxy.should_start = True
c.ConfigurableHTTPProxy.api_url = "http://127.0.0.1:8001"
# Spawner defaults
c.Spawner.start_timeout = 600 # 10 minutes
c.Spawner.http_timeout = 120 # 2 minutes
Optimized ECS Task Definition
TaskDefinition:
Family: Calliope Hub-calliope-dev-Calliope Hub-task
NetworkMode: awsvpc
RequiresCompatibilities:
- FARGATE
Cpu: "4096" # 4 vCPU for 300-400 servers
Memory: "8192" # 8 GB
ContainerDefinitions:
- Name: Calliope Hub
Image: <your-Calliope Hub-image>
Essential: true
PortMappings:
- ContainerPort: 8000
Protocol: tcp
Environment:
- Name: JUPYTERHUB_SERVICE_MODE
Value: ecs
- Name: ENVIRONMENT
Value: production
- Name: AWS_DEFAULT_REGION
Value: us-west-2
Secrets:
- Name: JUPYTERHUB_DB_URL
ValueFrom: arn:aws:secretsmanager:...:JUPYTERHUB_DB_URL::
- Name: JUPYTERHUB_CRYPT_KEY
ValueFrom: arn:aws:secretsmanager:...:JUPYTERHUB_CRYPT_KEY::
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-group: /aws/ecs/Calliope Hub-calliope-dev
awslogs-region: us-west-2
awslogs-stream-prefix: Calliope Hub
HealthCheck:
Command:
- CMD-SHELL
- curl -f http://localhost:8000/hub/health || exit 1
Interval: 30
Timeout: 5
Retries: 3
StartPeriod: 60
Summary Checklist
Before Optimizing
- Establish baseline metrics (current servers, CPU, poll time)
- Set up CloudWatch dashboards
- Document current performance
Quick Optimizations (Week 1)
- Fix healthcheck timings (DONE)
- Fix orphan detection bug (DONE)
- Increase poll interval to 15-20s
- Reduce log verbosity in production
- Add VPC endpoints
Medium Optimizations (Weeks 2-4)
- Migrate to PostgreSQL
- Implement response caching
- Upgrade Calliope Hub resources (4 vCPU / 8GB)
- Add database indexes
- Implement coordinated polling
Advanced Optimizations (Months 2-3)
- Batch API calls
- Set up pgbouncer
- Horizontal scaling (2-3 hubs + ALB)
- Lazy loading of spawner state
- Async optimization in spawn flow
Enterprise Scale (Months 3-6)
- Event-driven architecture
- Separate monitoring service
- Redis caching cluster
- Multi-region deployment (if needed)
See Also
- SPAWN_LIMITS.md - Capacity and limits analysis
- SPAWNER_SCALING.md - Scaling recommendations
- ARCHITECTURE.md - Horizontal scaling architecture
- HEALTHCHECK_TUNING.md - Recent healthcheck improvements