Calliope Calliope Hub Performance Optimization Guide

Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.

This document describes performance optimization techniques for improving Calliope Calliope Hub spawner performance, reducing API costs, and supporting higher concurrent server counts.


Quick Wins (No Code Changes)

1. Increase Poll Interval

Current: 5-10 seconds (randomized)
Optimized: 15-30 seconds

# Calliope Hub/config/jupyterhub_config.py
c.Spawner.poll_interval = 20  # from random.randint(5, 10)
c.Spawner.poll_jitter = 0.2   # Add 20% jitter to prevent thundering herd

Impact:

  • API calls: -50% to -66%
  • Capacity: +2x to +3x servers
  • Tradeoff: Dead server detection slower (20s vs 10s)

Recommended for: Deployments with >200 servers
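To make the tradeoff concrete, the sketch below estimates the average poll rate for a fleet and shows how a 20% jitter spreads a 20-second interval. It is an illustration only: `jittered_delay` and `polls_per_minute` are hypothetical helpers, not the hub's own scheduling code, and the 7.5-second figure is simply the midpoint of the current 5-10 second range.

```python
import random

def jittered_delay(interval: float, jitter: float, rng: random.Random) -> float:
    """Return the interval scaled by a random factor in [1 - jitter, 1 + jitter]."""
    return interval * (1 + rng.uniform(-jitter, jitter))

def polls_per_minute(servers: int, interval: float) -> float:
    """Average status checks per minute when every server polls on its own."""
    return servers * 60 / interval

rng = random.Random(0)  # seeded so the run is repeatable
delays = [jittered_delay(20, 0.2, rng) for _ in range(1000)]

current = polls_per_minute(200, 7.5)   # midpoint of the 5-10 s range
optimized = polls_per_minute(200, 20)
print(f"{current:.0f} -> {optimized:.0f} polls/minute")
print(f"jittered delays span {min(delays):.1f}s to {max(delays):.1f}s")
```

With 200 servers the average rate drops from 1600 to 600 polls per minute, which sits inside the -50% to -66% range quoted above.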


2. Optimize Healthcheck Timing

Recent improvements (October 2025):

  • Chat: interval 5s → 15s, retries 5 → 8, start_period 30s → 45s
  • WAIIDE: interval 10s → 15s, retries 3 → 8, start_period 30s → 45s
  • Lab: Already optimal (30s interval, 8 retries, 60s start_period)

Why this helps:

  • Less aggressive polling reduces ECS task overhead
  • Longer start_period prevents premature failures
  • More retries improve reliability
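As a rough way to compare the old and new settings, the sketch below estimates how long a container that never passes a check stays alive before it is marked unhealthy. The formula (start period plus interval times retries) is a simplification that ignores the per-check timeout, and the `HealthCheck` class is a hypothetical helper, not part of the service definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HealthCheck:
    interval: int      # seconds between checks
    retries: int       # consecutive failures before the container is unhealthy
    start_period: int  # grace period in which failures are not counted

    def seconds_until_unhealthy(self) -> int:
        """Approximate upper bound for a container that never passes a check."""
        return self.start_period + self.interval * self.retries

chat_before = HealthCheck(interval=5, retries=5, start_period=30)
chat_after = HealthCheck(interval=15, retries=8, start_period=45)

print(chat_before.seconds_until_unhealthy())  # 55
print(chat_after.seconds_until_unhealthy())   # 165
```

Under this approximation the Chat service gets roughly three times longer to come up before ECS replaces it, at the cost of slower detection of a container that is genuinely broken.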

Files updated:

  • Calliope Hub/services/chat.yaml
  • Calliope Hub/services/waiide.yaml
  • Calliope Hub/config/helpers/chat_helper.py
  • Calliope Hub/config/helpers/waiide_helper.py

3. Increase Calliope Hub Resources

Current: 2 vCPU / 4GB
Recommended: 4 vCPU / 8GB for >100 servers

# Update Calliope Hub task definition
TaskDefinition:
  cpu: "4096"      # from "2048"
  memory: "8192"   # from "4096"

Impact:

  • Capacity: +2x to +3x servers
  • Cost: +$60/month
  • No code changes

Medium-Effort Optimizations

4. Batch ECS API Calls

Problem: Each spawner polls independently → N API calls per cycle

Solution: Batch all spawners into single API call

Implementation:

# Calliope Hub/spawners/helpers/batch_poller.py (NEW FILE)

import asyncio
import boto3
from typing import Dict, List
from datetime import datetime, timezone

class BatchTaskPoller:
    """Batch multiple task status checks into single ECS API calls."""

    def __init__(self, ecs_cluster: str, aws_region: str, logger):
        self.ecs_cluster = ecs_cluster
        self.aws_region = aws_region
        self.log = logger
        self.ecs_client = boto3.client("ecs", region_name=aws_region)

        # Cache for task status (TTL: 10 seconds)
        self._cache = {}
        self._cache_ttl = 10

    async def get_task_status(self, task_arn: str) -> Dict:
        """
        Get task status, using cache if available.

        Args:
            task_arn: ECS task ARN

        Returns:
            Task description dict, or None if ECS does not return the task
        """
        # Check cache first
        if task_arn in self._cache:
            cached_data, cached_time = self._cache[task_arn]
            age = (datetime.now(timezone.utc) - cached_time).total_seconds()

            if age < self._cache_ttl:
                self.log.debug(f"Cache hit for {task_arn[-12:]} (age: {age:.1f}s)")
                return cached_data

        # Cache miss or expired - fetch from ECS
        return await self._fetch_task_status(task_arn)

    async def batch_get_task_status(self, task_arns: List[str]) -> Dict[str, Dict]:
        """
        Batch fetch multiple task statuses in single API call.

        Args:
            task_arns: List of task ARNs to check

        Returns:
            Dict mapping task_arn to task description
        """
        if not task_arns:
            return {}

        # Filter out cached tasks
        uncached_arns = []
        results = {}

        for arn in task_arns:
            if arn in self._cache:
                cached_data, cached_time = self._cache[arn]
                age = (datetime.now(timezone.utc) - cached_time).total_seconds()

                if age < self._cache_ttl:
                    results[arn] = cached_data
                    continue

            uncached_arns.append(arn)

        # Fetch uncached tasks in batches of 100 (ECS API limit)
        if uncached_arns:
            self.log.info(f"Batch fetching {len(uncached_arns)} tasks from ECS")

            for i in range(0, len(uncached_arns), 100):
                batch = uncached_arns[i:i+100]

                loop = asyncio.get_running_loop()
                response = await loop.run_in_executor(
                    None,
                    lambda: self.ecs_client.describe_tasks(
                        cluster=self.ecs_cluster,
                        tasks=batch,
                        include=["TAGS"]
                    )
                )

                # Cache and return results
                now = datetime.now(timezone.utc)
                for task in response.get("tasks", []):
                    task_arn = task["taskArn"]
                    self._cache[task_arn] = (task, now)
                    results[task_arn] = task

        return results

    async def _fetch_task_status(self, task_arn: str) -> Dict:
        """Fetch single task status and cache it."""
        loop = asyncio.get_running_loop()
        response = await loop.run_in_executor(
            None,
            lambda: self.ecs_client.describe_tasks(
                cluster=self.ecs_cluster,
                tasks=[task_arn]
            )
        )

        tasks = response.get("tasks", [])
        if tasks:
            task = tasks[0]
            self._cache[task_arn] = (task, datetime.now(timezone.utc))
            return task

        return None

    def clear_cache(self):
        """Clear the cache (e.g., after spawning new tasks)."""
        self._cache.clear()

Usage in ECS Spawner:

# Calliope Hub/spawners/ecs.py

from typing import Optional

# At module level, create shared batch poller
_batch_poller = None

def get_batch_poller(spawner):
    global _batch_poller
    if _batch_poller is None:
        _batch_poller = BatchTaskPoller(
            spawner.ecs_cluster,
            spawner.aws_region,
            spawner.log
        )
    return _batch_poller

# In poll() method, replace direct describe_tasks:
async def poll(self) -> Optional[int]:
    if not self.task_arn:
        return 0

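    # Use the shared poller instead of a direct ECS call. A status cached by a
    # recent batch_get_task_status() call is reused; a miss falls back to a
    # single-task describe_tasks request.
    batch_poller = get_batch_poller(self)
    task = await batch_poller.get_task_status(self.task_arn)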

    if not task:
        return 0  # Task not found

    # Rest of poll logic...

Impact:

  • API calls: -90% (10x reduction)
  • Capacity: +10x servers
  • Cost: Minimal (caching overhead)
  • Complexity: Medium
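The reduction depends on many spawners sharing one request. One way to get that without changing each spawner's poll() call is to coalesce requests that arrive within a short window. The sketch below is a minimal illustration under stated assumptions: `CoalescingPoller` and `fake_describe` are hypothetical names, and `fake_describe` stands in for a real describe_tasks call so the example runs without AWS access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

FetchFn = Callable[[List[str]], Awaitable[Dict[str, dict]]]

class CoalescingPoller:
    """Collect ARNs requested within a short window and fetch them together."""

    def __init__(self, fetch_many: FetchFn, window: float = 0.05, max_batch: int = 100):
        self._fetch_many = fetch_many
        self._window = window
        self._max_batch = max_batch  # describe_tasks accepts up to 100 tasks
        self._pending: Dict[str, asyncio.Future] = {}
        self._flusher: Optional[asyncio.Task] = None

    async def get(self, task_arn: str) -> Optional[dict]:
        loop = asyncio.get_running_loop()
        future = self._pending.get(task_arn)
        if future is None:
            future = loop.create_future()
            self._pending[task_arn] = future
        if self._flusher is None or self._flusher.done():
            self._flusher = loop.create_task(self._flush())
        return await future

    async def _flush(self) -> None:
        # Keep looping so requests that arrive during a fetch are not stranded.
        while self._pending:
            await asyncio.sleep(self._window)
            pending, self._pending = self._pending, {}
            arns = list(pending)
            for i in range(0, len(arns), self._max_batch):
                chunk = arns[i:i + self._max_batch]
                try:
                    found = await self._fetch_many(chunk)
                except Exception as exc:
                    for arn in chunk:
                        if not pending[arn].done():
                            pending[arn].set_exception(exc)
                    continue
                for arn in chunk:
                    if not pending[arn].done():
                        pending[arn].set_result(found.get(arn))

batch_sizes: List[int] = []

async def fake_describe(arns: List[str]) -> Dict[str, dict]:
    """Stand-in for one describe_tasks call (hypothetical, no AWS access)."""
    batch_sizes.append(len(arns))
    return {arn: {"taskArn": arn, "lastStatus": "RUNNING"} for arn in arns}

async def demo() -> List[Optional[dict]]:
    poller = CoalescingPoller(fake_describe, window=0.01)
    arns = [f"arn:aws:ecs:example:task/{i}" for i in range(250)]
    return await asyncio.gather(*(poller.get(arn) for arn in arns))

results = asyncio.run(demo())
print(batch_sizes)  # [100, 100, 50]
```

In this run 250 concurrent polls turn into three requests. The window adds a small delay to every poll, so it should stay well below the poll interval.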

5. Implement Coordinated Polling

Problem: All spawners poll independently

Solution: Coordinate polling in waves

# Calliope Hub/config/jupyterhub_config.py

import asyncio
import random

# Global poll coordinator
class PollCoordinator:
    def __init__(self, poll_interval=15):
        self.poll_interval = poll_interval
        self.last_poll = {}

    async def should_poll(self, spawner_id: str) -> bool:
        """
        Check if spawner should poll now.
        Coordinates polling to prevent simultaneous API calls.
        """
        import time
        now = time.time()

        last_poll = self.last_poll.get(spawner_id, 0)
        if now - last_poll < self.poll_interval:
            return False  # Too soon

        # Add jitter to prevent thundering herd
        jitter = random.uniform(-2, 2)
        if now - last_poll < self.poll_interval + jitter:
            return False

        self.last_poll[spawner_id] = now
        return True

poll_coordinator = PollCoordinator(poll_interval=15)

# In spawner's poll() method:
async def poll(self):
    spawner_id = f"{self.user.name}/{self.name}"

    if not await poll_coordinator.should_poll(spawner_id):
        # Skip this poll cycle
        return None  # Assume still running

    # Proceed with actual poll
    ...

Impact:

  • API calls: -30% to -50%
  • CPU: -20% to -30%
  • Complexity: Low
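Random jitter spreads polls on average, but waves can also be assigned deterministically so that each spawner always polls in the same slot of the interval. The sketch below is one possible approach, not the hub's implementation: `poll_offset` is a hypothetical helper, and the choice of 10 waves is an assumption. It uses hashlib rather than the built-in hash() because the latter is randomized per process for strings.

```python
import hashlib

def poll_offset(spawner_id: str, poll_interval: float, waves: int = 10) -> float:
    """Map a spawner to one of `waves` evenly spaced offsets within the interval."""
    digest = hashlib.sha256(spawner_id.encode("utf-8")).digest()
    wave = int.from_bytes(digest[:4], "big") % waves
    return wave * poll_interval / waves

# Same ID always lands in the same wave, across restarts and processes.
offsets = [poll_offset(f"user-{i}/", 15.0) for i in range(1000)]
print(sorted(set(offsets)))
```

A spawner would then poll when `(now - offset) % poll_interval` wraps around, which keeps each wave to roughly one tenth of the fleet.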

6. Response Caching Layer

Problem: Multiple polls within short window fetch same data

Solution: Cache ECS responses for 10-15 seconds

# Calliope Hub/spawners/helpers/ecs_cache.py (NEW FILE)

from datetime import datetime, timezone, timedelta
from typing import Dict, Optional
import asyncio

class ECSResponseCache:
    """Cache ECS API responses to reduce redundant calls."""

    def __init__(self, ttl_seconds: int = 15):
        self.ttl = timedelta(seconds=ttl_seconds)
        self._cache: Dict[str, tuple] = {}
        self._lock = asyncio.Lock()

    async def get(self, key: str) -> Optional[Dict]:
        """Get cached response if not expired."""
        async with self._lock:
            if key in self._cache:
                data, timestamp = self._cache[key]
                if datetime.now(timezone.utc) - timestamp < self.ttl:
                    return data
                else:
                    # Expired, remove
                    del self._cache[key]
        return None

    async def set(self, key: str, data: Dict):
        """Cache response with current timestamp."""
        async with self._lock:
            self._cache[key] = (data, datetime.now(timezone.utc))

    async def clear_expired(self):
        """Background task to clear expired entries."""
        while True:
            await asyncio.sleep(60)  # Every minute
            async with self._lock:
                now = datetime.now(timezone.utc)
                expired = [
                    k for k, (_, ts) in self._cache.items()
                    if now - ts >= self.ttl
                ]
                for k in expired:
                    del self._cache[k]

# Global cache instance
ecs_cache = ECSResponseCache(ttl_seconds=15)

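# Start the cleanup task from inside a running event loop (for example, on the
# first poll). Calling asyncio.create_task() at import time raises RuntimeError
# because no loop is running yet.
def start_cache_cleanup():
    return asyncio.create_task(ecs_cache.clear_expired())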

Usage:

# In ECS spawner poll() method
cache_key = f"task:{self.task_arn}"

# Try cache first
cached = await ecs_cache.get(cache_key)
if cached:
    task = cached
else:
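    # Fetch from ECS
    response = ecs_client.describe_tasks(...)
    # describe_tasks returns an empty list for unknown or expired tasks
    tasks = response.get("tasks", [])
    task = tasks[0] if tasks else None
    if task:
        await ecs_cache.set(cache_key, task)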

Impact:

  • API calls: -40% to -60%
  • Response time: Faster (cache hits)
  • Memory: +10-20 MB for cache
  • Complexity: Low

High-Effort Optimizations

7. Event-Driven Architecture

Problem: Polling is wasteful - we poll even when nothing changed

Solution: Subscribe to ECS task state change events

Architecture:

┌─────────────────────────┐
│   ECS Task Events       │
│   (State Changes)       │
└────────────┬────────────┘
             │
             │ EventBridge
             │
┌────────────▼────────────┐
│   Lambda Function       │
│   or SNS Topic          │
└────────────┬────────────┘
             │
             │ HTTP POST
             │
┌────────────▼─────────────────────┐
│  Calliope Calliope Hub Service   │
│  /hub/api/tasks/status           │
│  (Custom endpoint)               │
└──────────────────────────────────┘

EventBridge Rule:

EventRule:
  Name: ecs-task-state-changes
  EventPattern:
    source:
      - aws.ecs
    detail-type:
      - ECS Task State Change
    detail:
      clusterArn:
        - arn:aws:ecs:us-west-2:xxxx:cluster/development
      lastStatus:
        - RUNNING
        - STOPPED
        - DEPROVISIONING

  Targets:
    - Arn: !GetAtt TaskStatusLambda.Arn
      Id: task-status-processor

Lambda Handler:

# lambda/ecs_task_status.py

import json
import requests
import os

JUPYTERHUB_API_TOKEN = os.environ["JUPYTERHUB_API_TOKEN"]
JUPYTERHUB_API_URL = os.environ["JUPYTERHUB_API_URL"]

def lambda_handler(event, context):
    """Process ECS task state change events and notify Calliope Calliope Hub."""

    detail = event.get("detail", {})
    task_arn = detail.get("taskArn")
    last_status = detail.get("lastStatus")
    desired_status = detail.get("desiredStatus")

    # Extract task metadata
    containers = detail.get("containers", [])

    # Notify Calliope Calliope Hub of state change
    response = requests.post(
        f"{JUPYTERHUB_API_URL}/hub/api/tasks/status",
        headers={"Authorization": f"token {JUPYTERHUB_API_TOKEN}"},
        json={
            "task_arn": task_arn,
            "status": last_status,
            "desired_status": desired_status,
            "containers": containers,
        },
        timeout=5,
    )
    # Surface delivery failures so the invocation is marked failed and retried
    response.raise_for_status()

    return {
        "statusCode": 200,
        "body": json.dumps(f"Processed {task_arn}: {last_status}"),
    }

Calliope Calliope Hub Custom Handler:

# Calliope Hub/handlers/task_status_handler.py (NEW FILE)

import asyncio
from datetime import datetime

# APIHandler extends BaseHandler and provides get_json_body()
from jupyterhub.apihandlers.base import APIHandler
from jupyterhub.utils import token_authenticated

class TaskStatusHandler(APIHandler):
    """Receive ECS task status updates from EventBridge."""

    @token_authenticated
    async def post(self):
        """Process task status update."""
        data = self.get_json_body()

        task_arn = data.get("task_arn")
        status = data.get("status")

        self.log.info(f"Received task status: {task_arn[-12:]} = {status}")

        # Find spawner managing this task
        for user in self.users.values():
            for spawner in user.spawners.values():
                if getattr(spawner, "task_arn", None) == task_arn:
                    # Update spawner's cached state
                    spawner._cached_task_status = status
                    spawner._cache_time = datetime.now()

                    if status == "STOPPED":
                        # Trigger cleanup
                        self.log.info(f"Task stopped, scheduling cleanup")
                        asyncio.create_task(spawner.stop())

                    self.set_status(200)
                    return

        self.log.warning(f"No spawner found for task {task_arn[-12:]}")
        self.set_status(404)

# Register handler in jupyterhub_config.py.
# The hub prefix is added automatically, so this is served at /hub/api/tasks/status.
c.JupyterHub.extra_handlers = [
    (r"/api/tasks/status", TaskStatusHandler),
]

Impact:

  • API calls: -95% (only when state changes)
  • Polling: Can reduce to 60s (events handle updates)
  • Capacity: +10x to +20x servers
  • Complexity: High (Lambda + EventBridge setup)
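EventBridge can deliver an event more than once or out of order, so the receiving side should not blindly overwrite state with whatever arrives last. ECS task state change events carry a `version` counter in `detail` that increases with each change to the task, which can be used to drop stale deliveries. The sketch below is a minimal illustration: `TaskStateTracker` is a hypothetical helper and the ARN is a placeholder.

```python
from typing import Dict, Optional, Tuple

class TaskStateTracker:
    """Keep the newest known status per task, ignoring stale or duplicate events."""

    def __init__(self) -> None:
        self._latest: Dict[str, Tuple[int, str]] = {}

    def apply(self, event: dict) -> bool:
        """Record the event; return True only if it advanced the task's state."""
        detail = event.get("detail", {})
        task_arn = detail.get("taskArn")
        status = detail.get("lastStatus")
        version = detail.get("version")
        if not task_arn or status is None or version is None:
            return False
        known = self._latest.get(task_arn)
        if known is not None and version <= known[0]:
            return False  # duplicate or out-of-order delivery
        self._latest[task_arn] = (version, status)
        return True

    def status(self, task_arn: str) -> Optional[str]:
        known = self._latest.get(task_arn)
        return known[1] if known else None

def make_event(arn: str, status: str, version: int) -> dict:
    """Build a trimmed-down event with only the fields used here."""
    return {
        "source": "aws.ecs",
        "detail-type": "ECS Task State Change",
        "detail": {"taskArn": arn, "lastStatus": status, "version": version},
    }

ARN = "arn:aws:ecs:example:task/abc123"  # placeholder
tracker = TaskStateTracker()
accepted = [
    tracker.apply(make_event(ARN, "RUNNING", 3)),
    tracker.apply(make_event(ARN, "PENDING", 2)),  # arrives late, ignored
    tracker.apply(make_event(ARN, "STOPPED", 5)),
]
print(accepted, tracker.status(ARN))
```

Keeping a slow background poll (the 60s mentioned above) still makes sense as a safety net for events that never arrive.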

8. Separate Monitoring Service

Problem: Calliope Hub CPU wasted on polling instead of serving users

Solution: Dedicated monitoring service for health checks

Architecture:

┌──────────────┐
│ Calliope Calliope Hub   │  ← Focus on spawning & routing
│ (No polling) │
└──────┬───────┘
       │ Database (shared state)
       │
┌──────▼────────────┐
│ Monitor Service   │  ← Dedicated polling
│ - Polls all tasks │
│ - Updates DB      │
│ - Sends events    │
└───────────────────┘

Monitor Service (Python):

# services/task_monitor.py

import asyncio
import json
import logging
from datetime import datetime, timezone

import boto3
from sqlalchemy import create_engine, text

logger = logging.getLogger(__name__)

class TaskMonitor:
    """Dedicated service for monitoring ECS task health."""

    def __init__(self, db_url, ecs_cluster, aws_region):
        self.db = create_engine(db_url)
        self.ecs = boto3.client("ecs", region_name=aws_region)
        self.cluster = ecs_cluster

    async def monitor_loop(self):
        """Main monitoring loop."""
        while True:
            try:
                # Get all spawner tasks from database
                with self.db.connect() as conn:
                    result = conn.execute(text("""
                        SELECT id, user_id, name, state
                        FROM spawners
                        WHERE server_id IS NOT NULL
                    """))
                    spawners = result.fetchall()

                # Map each task ARN found in spawner state to its spawner ID
                spawner_by_arn = {}
                for spawner in spawners:
                    state = json.loads(spawner.state or "{}")
                    task_arn = state.get("task_arn")
                    if task_arn:
                        spawner_by_arn[task_arn] = spawner.id

                # Batch check all tasks (100 at a time)
                task_arns = list(spawner_by_arn)
                for i in range(0, len(task_arns), 100):
                    batch = task_arns[i:i+100]

                    # boto3 is blocking, so run the call in a worker thread
                    response = await asyncio.to_thread(
                        self.ecs.describe_tasks,
                        cluster=self.cluster,
                        tasks=batch,
                    )

                    # Update database with status
                    for task in response.get("tasks", []):
                        spawner_id = spawner_by_arn.get(task["taskArn"])
                        if spawner_id is not None:
                            self._update_spawner_status(
                                spawner_id,
                                task["lastStatus"],
                                task.get("healthStatus", "UNKNOWN"),
                            )

                # Sleep before next poll
                await asyncio.sleep(15)

            except Exception as e:
                logger.error(f"Monitor loop error: {e}")
                await asyncio.sleep(60)

    def _update_spawner_status(self, spawner_id, status, health):
        """Merge the latest task status fields into the spawner's JSON state."""
        patch = json.dumps({
            "task_status": status,
            "health_status": health,
            "last_poll": datetime.now(timezone.utc).isoformat(),
        })
        # PostgreSQL rejects several assignments to the same column in one
        # UPDATE, so merge all fields with a single jsonb concatenation.
        # engine.begin() commits on success and rolls back on error.
        with self.db.begin() as conn:
            conn.execute(
                text("""
                    UPDATE spawners
                    SET state = COALESCE(CAST(state AS jsonb), CAST('{}' AS jsonb))
                                || CAST(:patch AS jsonb)
                    WHERE id = :spawner_id
                """),
                {"patch": patch, "spawner_id": spawner_id},
            )

Deploy as ECS Service:

Service:
  ServiceName: Calliope Calliope Hub-monitor
  Cluster: development
  TaskDefinition: task-monitor:latest
  DesiredCount: 1  # Single instance

  # No load balancer - internal service

Impact:

  • Calliope Hub CPU: -80% (no polling overhead)
  • API calls: Centralized (easier to optimize)
  • Capacity: +5x Calliope Hub capacity
  • Complexity: Very high

Database Optimizations

9. Add Indexes

Problem: Slow queries on spawner lookups

Solution: Add strategic indexes

-- Index on user_id for fast user lookups
CREATE INDEX idx_spawners_user_id ON spawners(user_id);

-- Index on server_id for fast server lookups
CREATE INDEX idx_spawners_server_id ON spawners(server_id);

-- GIN index for task ARN lookups in state (assumes state is a jsonb column)
CREATE INDEX idx_spawners_state_task_arn ON spawners USING gin (state jsonb_path_ops);

-- Index on last_activity for idle culling
CREATE INDEX idx_spawners_last_activity ON spawners(last_activity);

-- Index on user name for fast username lookups
CREATE INDEX idx_users_name ON users(name);

Analyze query performance:

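-- Log statements that take longer than 100 ms
-- (replace jupyterhub with your database name)
ALTER DATABASE jupyterhub SET log_min_duration_statement = 100;

-- Check slow queries (requires the pg_stat_statements extension)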
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

10. Connection Pooling

Problem: Each Calliope Hub creates multiple database connections

Solution: Use pgbouncer for connection pooling

Deploy pgbouncer:

# pgbouncer.ini
[databases]
jupyterhub = host=rds-endpoint port=5432 dbname=jupyterhub

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
reserve_pool_size = 10
server_lifetime = 3600
server_idle_timeout = 600

Update Calliope Calliope Hub:

# Connect through pgbouncer instead of direct RDS
c.JupyterHub.db_url = "postgresql://user:pass@pgbouncer:6432/jupyterhub"

# Reduce pool size (pgbouncer handles pooling)
# Note: upstream JupyterHub passes pool options through c.JupyterHub.db_kwargs;
# the two settings below assume this build exposes them directly.
c.JupyterHub.db_pool_size = 5  # from 10
c.JupyterHub.db_max_overflow = 5  # from 20

Impact:

  • Database connections: 100+ → 20-30
  • Database CPU: -50%
  • Cost: RDS can use smaller instance
  • Complexity: Medium
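The connection counts follow from how SQLAlchemy pools work: each hub process can open up to pool_size plus max_overflow connections. The sketch below works through the numbers from the settings above. The count of three hubs is an assumption for illustration, and `max_db_connections` is a hypothetical helper.

```python
def max_db_connections(hubs: int, pool_size: int, max_overflow: int) -> int:
    """Upper bound on connections opened by the hubs' SQLAlchemy pools."""
    return hubs * (pool_size + max_overflow)

HUBS = 3  # assumed number of hub processes

# Direct to RDS with the original pool settings (10 + 20 per hub)
direct = max_db_connections(HUBS, pool_size=10, max_overflow=20)

# Through pgbouncer with the reduced settings (5 + 5 per hub)
client_side = max_db_connections(HUBS, pool_size=5, max_overflow=5)

# pgbouncer caps what reaches PostgreSQL at default_pool_size + reserve_pool_size
server_side = min(client_side, 20 + 10)

print(direct, client_side, server_side)  # 90 30 30
```

The database then sees at most 30 connections regardless of how many hubs connect, which matches the 20-30 range quoted above.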

Code-Level Optimizations

11. Lazy Loading

Problem: Calliope Hub loads all spawner state on startup

Solution: Load spawner state only when needed

# Calliope Hub/spawners/ecs.py

class ECSSpawner(BaseSpawner):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._state_loaded = False

    async def _ensure_state_loaded(self):
        """Lazy load spawner state from database."""
        if self._state_loaded:
            return

        # Load state from DB
        state = await self._load_state_from_db()
        if state:
            self.load_state(state)

        self._state_loaded = True

    async def poll(self):
        await self._ensure_state_loaded()
        # ... rest of poll logic

Impact:

  • Calliope Hub startup time: -50%
  • Memory: -30% (only loaded spawners in memory)
  • Complexity: Medium

12. Async Optimization

Problem: Serial API calls during spawn

Solution: Parallelize independent operations

# Before (serial):
async def start(self):
    await self._create_task_definition()  # 500ms
    await self._run_task()                # 1000ms
    await self._wait_for_running()        # 2000ms
    await self._register_with_proxy()     # 500ms
    # Total: 4000ms

# After (parallel):
async def start(self):
    # Run independent operations in parallel
    task_def, proxy_ready = await asyncio.gather(
        self._create_task_definition(),
        self._prepare_proxy_registration(),
    )

    task_arn = await self._run_task(task_def)

    # Parallel wait
    await asyncio.gather(
        self._wait_for_running(task_arn),
        self._register_with_proxy(task_arn, proxy_ready),
    )
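    # Total: ~3500ms with the step timings above (500 + 1000 + 2000), about
    # 12% faster; larger gains need more work overlapped with the 2000ms wait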

Impact:

  • Spawn time: -30% to -40%
  • User experience: Significantly better
  • Complexity: Medium
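When steps run under asyncio.gather, each group takes as long as its slowest member, so the expected total can be checked before refactoring. The sketch below models the spawn flow as sequential stages of concurrent steps, using the step durations from the serial example. `critical_path_ms` is a hypothetical helper, and the 500 ms assigned to `_prepare_proxy_registration` is an assumption because no timing is given for it.

```python
from typing import List

def critical_path_ms(stages: List[List[int]]) -> int:
    """Steps inside a stage run concurrently; stages run one after another."""
    return sum(max(stage) for stage in stages)

# Serial flow: every step is its own stage.
serial = critical_path_ms([[500], [1000], [2000], [500]])

# Overlapped flow: task definition with proxy preparation, then run_task,
# then wait_for_running with proxy registration.
overlapped = critical_path_ms([[500, 500], [1000], [2000, 500]])

saving = 100 * (serial - overlapped) / serial
print(serial, overlapped, f"{saving:.1f}%")  # 4000 3500 12.5%
```

The 2000 ms wait dominates the critical path, so the largest remaining gain comes from shortening that wait or moving more work alongside it.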

13. Reduce Log Verbosity

Problem: Excessive logging increases I/O and CloudWatch costs

Solution: Log levels per environment

# jupyterhub_config.py

import logging
import os

# Environment-based log levels
if os.getenv("ENVIRONMENT") == "production":
    c.JupyterHub.log_level = "WARNING"  # Only warnings and errors
    c.Spawner.debug = False
else:
    c.JupyterHub.log_level = "INFO"
    c.Spawner.debug = True

# Reduce SQLAlchemy logging
logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)
logging.getLogger("tornado.access").setLevel(logging.WARNING)

# Keep important loggers at INFO
logging.getLogger("Calliope Calliope Hub.proxy").setLevel(logging.INFO)
logging.getLogger("Calliope Calliope Hub.spawner").setLevel(logging.INFO)

Impact:

  • CloudWatch costs: -60% to -80%
  • I/O overhead: -50%
  • Log clarity: Better (signal vs noise)

Monitoring Optimizations

14. Metrics Collection

Use CloudWatch Embedded Metrics Format:

# Calliope Hub/spawners/helpers/metrics.py (NEW FILE)

import json
import time
from datetime import datetime

class MetricsCollector:
    """Collect spawner metrics using EMF for CloudWatch."""

    @staticmethod
    def emit_spawn_duration(service_type: str, duration: float):
        """Emit spawn duration metric."""
        print(json.dumps({
            "_aws": {
                "Timestamp": int(time.time() * 1000),
                "CloudWatchMetrics": [{
                    "Namespace": "Calliope Calliope Hub/Spawner",
                    "Dimensions": [["ServiceType"]],
                    "Metrics": [{"Name": "SpawnDuration", "Unit": "Seconds"}]
                }]
            },
            "ServiceType": service_type,
            "SpawnDuration": duration,
        }))

    @staticmethod
    def emit_poll_duration(duration: float, task_count: int):
        """Emit poll cycle metrics."""
        print(json.dumps({
            "_aws": {
                "Timestamp": int(time.time() * 1000),
                "CloudWatchMetrics": [{
                    "Namespace": "Calliope Calliope Hub/Spawner",
                    "Dimensions": [["Operation"]],
                    "Metrics": [
                        {"Name": "PollDuration", "Unit": "Seconds"},
                        {"Name": "TaskCount", "Unit": "Count"}
                    ]
                }]
            },
            "Operation": "Poll",
            "PollDuration": duration,
            "TaskCount": task_count,
        }))

Usage:

# In spawner
import time

from helpers.metrics import MetricsCollector

async def start(self):
    start_time = time.time()
    # ... spawn logic ...
    duration = time.time() - start_time

    MetricsCollector.emit_spawn_duration(self.instance_type, duration)

Create Dashboard:

# CloudWatch dashboard to track
# - Average spawn duration by service
# - Poll duration trends
# - API call counts
# - Active spawner count

Network Optimizations

15. VPC Endpoints

Problem: ECS API calls go through NAT Gateway ($$$)

Solution: VPC Endpoints for AWS services

VPCEndpoints:
  - ServiceName: com.amazonaws.us-west-2.ecs
    VpcEndpointType: Interface
    SubnetIds:
      - subnet-private-1
      - subnet-private-2
    SecurityGroupIds:
      - sg-vpc-endpoints

  - ServiceName: com.amazonaws.us-west-2.ecs-agent
    VpcEndpointType: Interface

  - ServiceName: com.amazonaws.us-west-2.ecr.api
    VpcEndpointType: Interface

  - ServiceName: com.amazonaws.us-west-2.ecr.dkr
    VpcEndpointType: Interface

  - ServiceName: com.amazonaws.us-west-2.s3
    VpcEndpointType: Gateway  # Free!

  - ServiceName: com.amazonaws.us-west-2.logs
    VpcEndpointType: Interface

Impact:

  • NAT Gateway costs: -90% (API traffic stays in VPC)
  • Latency: -20% to -30% (faster API calls)
  • Cost savings: ~$45-90/month (depending on API volume)
  • Complexity: Low (one-time setup)

16. Reduce EFS Latency

Problem: EFS mount delays during container startup

Solutions:

A. Use Provisioned Throughput

EFS:
  ThroughputMode: provisioned
  ProvisionedThroughputInMibps: 10  # from bursting

Cost: ~$60/month for 10 MB/s
Benefit: Consistent performance, faster startup

B. Pre-create User Directories

# In spawner pre_spawn_hook
async def pre_spawn_hook(spawner):
    # Pre-create user directory structure on EFS
    # Reduces container startup time
    await provision_user_directory(spawner.user.name)

C. Cache Common Files

# In container image, pre-populate common configs
COPY jupyter_config_template.py /opt/jupyter/config/
COPY jupyter_ai_schema.json /opt/jupyter/ai/

Impact:

  • Spawn time: -10% to -20%
  • EFS costs: +$60/month (if provisioned)
  • User experience: Noticeably faster

Summary: Optimization Priority

For 100-200 Servers

Priority 1: Quick Wins

  1. ✅ Fix healthcheck timing (DONE)
  2. ✅ Fix orphan detection bug (DONE)
  3. ⬜ Migrate to PostgreSQL
  4. ⬜ Increase poll interval to 15s

Expected Impact: Stable, reliable system


For 200-500 Servers

Priority 1: Resource Scaling

  1. ⬜ Upgrade to 4 vCPU / 8GB Calliope Hub
  2. ⬜ PostgreSQL with connection pooling
  3. ⬜ Poll interval to 20s

Priority 2: Code Optimization

  4. ⬜ Implement response caching
  5. ⬜ Coordinated polling with jitter
  6. ⬜ VPC endpoints (cost savings)

Expected Impact: 2-3x capacity increase


For 500-1,000 Servers

Priority 1: Architectural Changes

  1. ⬜ Horizontal scaling (2-3 hubs + ALB)
  2. ⬜ Batch API calls
  3. ⬜ Response caching

Priority 2: Advanced Optimization

  4. ⬜ Distributed coordination (locks/primary election)
  5. ⬜ Database query optimization
  6. ⬜ Async parallelization

Expected Impact: 5x capacity increase


For 1,000+ Servers

Priority 1: Advanced Architecture

  1. ⬜ Event-driven with EventBridge
  2. ⬜ Separate monitoring service
  3. ⬜ Redis caching cluster
  4. ⬜ 5+ hubs with sharding

Priority 2: Enterprise Features

  5. ⬜ Multi-region deployment
  6. ⬜ Advanced monitoring and alerting
  7. ⬜ Capacity planning automation

Expected Impact: 10x+ capacity increase


Performance Testing

Load Testing Script

# tests/load_test.py

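import asyncio
import os
import time
from typing import List

import aiohttp

# Hub base URL and an admin API token, read from the environment.
# JUPYTERHUB_URL is an assumed variable name; adjust to your deployment.
HUB_URL = os.environ.get("JUPYTERHUB_URL", "https://your-hub")
API_TOKEN = os.environ["JUPYTERHUB_API_TOKEN"]

async def spawn_server(session: aiohttp.ClientSession, username: str, server_name: str):
    """Spawn a single server."""
    start = time.time()

    async with session.post(
        f"{HUB_URL}/hub/api/users/{username}/servers/{server_name}",
        headers={"Authorization": f"token {API_TOKEN}"},
        json={"instance_type": "lab", "instance_size": "medium"}
    ) as resp: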
        duration = time.time() - start
        return {
            "username": username,
            "server_name": server_name,
            "status": resp.status,
            "duration": duration,
        }

async def load_test(num_servers: int):
    """Spawn N servers concurrently and measure performance."""
    async with aiohttp.ClientSession() as session:
        tasks = []

        # Create spawn tasks
        for i in range(num_servers):
            username = f"loadtest-user-{i:04d}"
            server_name = f"lab-{int(time.time())}"
            tasks.append(spawn_server(session, username, server_name))

        # Execute concurrently
        print(f"Spawning {num_servers} servers...")
        start = time.time()

        results = await asyncio.gather(*tasks, return_exceptions=True)

        total_time = time.time() - start

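        # Analyze results (exceptions returned by gather count as failures)
        completed = [r for r in results if isinstance(r, dict)]
        # 201 = server started, 202 = spawn accepted and still pending
        successful = sum(1 for r in completed if r["status"] in (201, 202))
        failed = len(results) - successful
        avg_duration = sum(r["duration"] for r in completed) / max(len(completed), 1)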

        print(f"""
Load Test Results:
- Total servers: {num_servers}
- Successful: {successful}
- Failed: {failed}
- Total time: {total_time:.1f}s
- Avg spawn time: {avg_duration:.1f}s
- Spawns/second: {successful / total_time:.2f}
        """)

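if __name__ == "__main__":
    import sys

    # Server count comes from the command line, defaulting to 50
    asyncio.run(load_test(int(sys.argv[1]) if len(sys.argv) > 1 else 50))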

Run at different scales:

python load_test.py 50   # Baseline
python load_test.py 100  # 2x
python load_test.py 200  # 4x
python load_test.py 500  # 10x
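At the larger scales, firing every request at once mostly measures how the hub handles a burst rather than steady load. One option is to cap the number of requests in flight with a semaphore. The sketch below shows the pattern with a stand-in coroutine so it runs without a hub: `run_bounded` and `fake_spawn` are hypothetical names, and the limit of 10 is an arbitrary example value.

```python
import asyncio
from typing import Awaitable, Callable, List

async def run_bounded(jobs: List[Callable[[], Awaitable[dict]]], limit: int) -> list:
    """Run all jobs, keeping at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[dict]]):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs), return_exceptions=True)

in_flight = 0
peak = 0

async def fake_spawn() -> dict:
    """Stand-in for spawn_server(); tracks how many calls overlap."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.001)
    in_flight -= 1
    return {"status": 202}

results = asyncio.run(run_bounded([fake_spawn] * 50, limit=10))
print(f"{len(results)} requests, peak concurrency {peak}")
```

In the load test, each job would be a small wrapper that calls spawn_server with its own username and server name.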

Benchmarking

Key metrics to track:

1. Spawn latency (P50, P95, P99)
   Target: P95 < 60s

2. Poll cycle duration
   Target: < 10s for all spawners

3. API error rate
   Target: < 0.1%

4. Calliope Hub CPU utilization
   Target: < 70%

5. Database query time
   Target: < 100ms for 95th percentile

Collect with:

# In spawner
import time
import logging

logger = logging.getLogger(__name__)

async def poll(self):
    start = time.time()

    # ... poll logic ...

    duration = time.time() - start
    if duration > 5:
        logger.warning(f"Slow poll: {duration:.2f}s for {self.user.name}/{self.name}")

    # Emit metric
    emit_metric("PollDuration", duration)
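The P50, P95 and P99 targets can be computed directly from collected durations. The sketch below uses the nearest-rank method, which is one common definition; monitoring backends may interpolate and report slightly different values. `percentile` is a hypothetical helper and the sample data is synthetic.

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic spawn durations of 1..100 seconds
durations = [float(n) for n in range(1, 101)]

p50 = percentile(durations, 50)
p95 = percentile(durations, 95)
p99 = percentile(durations, 99)
print(p50, p95, p99)  # 50.0 95.0 99.0

meets_target = p95 < 60  # spawn latency target from the list above
```

With this synthetic data the P95 target is missed, which is the kind of check a nightly benchmark job could assert on.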

Cost Optimization

API Call Reduction

Current (200 servers, 10s polling):

200 servers × (3600s / 10s) × 24h × 30d = 51,840,000 API calls/month

Cost: Free tier covers first 1M, then $0.01 per 10,000 = ~$51/month

With batching (200 servers, 15s polling, batched):

2 batch calls (100 tasks each) × (3600s / 15s) × 24h × 30d = 345,600 API calls/month

Cost: Free tier covers all = $0/month

Savings: ~$51/month (+ API stability)
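The call volume is calls per poll cycle multiplied by poll cycles per month, and it is easy to get wrong by a factor of ten when done by hand. The sketch below reproduces the arithmetic for a 30-day month. `monthly_api_calls` is a hypothetical helper; a batch size of 100 reflects the describe_tasks limit mentioned earlier.

```python
import math

SECONDS_PER_MONTH = 3600 * 24 * 30  # 30-day month, as in the estimate above

def monthly_api_calls(servers: int, poll_interval_s: int, batch_size: int = 1) -> int:
    """Status calls per month; batch_size=1 means every server polls on its own."""
    calls_per_cycle = math.ceil(servers / batch_size)
    cycles_per_month = SECONDS_PER_MONTH // poll_interval_s
    return calls_per_cycle * cycles_per_month

unbatched = monthly_api_calls(200, 10)
batched = monthly_api_calls(200, 15, batch_size=100)

print(f"{unbatched:,} -> {batched:,}")  # 51,840,000 -> 345,600
```

For 200 servers the batched setup issues two calls per cycle, so the volume falls by a factor of 150.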


CloudWatch Logs Reduction

Current (verbose logging):

200 servers × 1 MB logs/day = 200 MB/day = 6 GB/month

Cost: $0.50/GB ingestion + $0.03/GB storage = ~$3.18/month

Optimized (WARNING level in prod):

200 servers × 100 KB logs/day = 20 MB/day = 600 MB/month

Cost: $0.30 + $0.018 = ~$0.32/month

Savings: $2.86/month (+ faster log queries)


Appendix: Configuration Examples

Optimized jupyterhub_config.py

# For 300-500 servers on 8 vCPU / 16GB Calliope Hub

import random
import os

# Polling configuration
c.Spawner.poll_interval = 20  # Slower polling for stability
c.Spawner.poll_jitter = 0.2   # 20% jitter

# Activity tracking
c.JupyterHub.last_activity_interval = 300  # 5 minutes

# Database configuration (PostgreSQL)
c.JupyterHub.db_url = os.getenv("JUPYTERHUB_DB_URL")
c.JupyterHub.db_pool_size = 10
c.JupyterHub.db_max_overflow = 20

# Logging (production)
if os.getenv("ENVIRONMENT") == "production":
    c.JupyterHub.log_level = "WARNING"
else:
    c.JupyterHub.log_level = "INFO"

# Proxy configuration
c.ConfigurableHTTPProxy.should_start = True
c.ConfigurableHTTPProxy.api_url = "http://127.0.0.1:8001"

# Spawner defaults
c.Spawner.start_timeout = 600  # 10 minutes
c.Spawner.http_timeout = 120   # 2 minutes

Optimized ECS Task Definition

TaskDefinition:
  Family: Calliope Hub-calliope-dev-Calliope Calliope Hub-task
  NetworkMode: awsvpc
  RequiresCompatibilities:
    - FARGATE
  Cpu: "4096"      # 4 vCPU for 300-400 servers
  Memory: "8192"   # 8 GB

  ContainerDefinitions:
    - Name: Calliope Calliope Hub
      Image: <your-Calliope Hub-image>
      Essential: true

      PortMappings:
        - ContainerPort: 8000
          Protocol: tcp

      Environment:
        - Name: JUPYTERHUB_SERVICE_MODE
          Value: ecs
        - Name: ENVIRONMENT
          Value: production
        - Name: AWS_DEFAULT_REGION
          Value: us-west-2

      Secrets:
        - Name: JUPYTERHUB_DB_URL
          ValueFrom: arn:aws:secretsmanager:...:JUPYTERHUB_DB_URL::
        - Name: JUPYTERHUB_CRYPT_KEY
          ValueFrom: arn:aws:secretsmanager:...:JUPYTERHUB_CRYPT_KEY::

      LogConfiguration:
        LogDriver: awslogs
        Options:
          awslogs-group: /aws/ecs/Calliope Hub-calliope-dev
          awslogs-region: us-west-2
          awslogs-stream-prefix: Calliope Hub

      HealthCheck:
        Command:
          - CMD-SHELL
          - curl -f http://localhost:8000/hub/health || exit 1
        Interval: 30
        Timeout: 5
        Retries: 3
        StartPeriod: 60

Summary Checklist

Before Optimizing

  • Establish baseline metrics (current servers, CPU, poll time)
  • Set up CloudWatch dashboards
  • Document current performance

Quick Optimizations (Week 1)

  • Fix healthcheck timings (DONE)
  • Fix orphan detection bug (DONE)
  • Increase poll interval to 15-20s
  • Reduce log verbosity in production
  • Add VPC endpoints

Medium Optimizations (Weeks 2-4)

  • Migrate to PostgreSQL
  • Implement response caching
  • Upgrade Calliope Hub resources (4 vCPU / 8GB)
  • Add database indexes
  • Implement coordinated polling

Advanced Optimizations (Months 2-3)

  • Batch API calls
  • Set up pgbouncer
  • Horizontal scaling (2-3 hubs + ALB)
  • Lazy loading of spawner state
  • Async optimization in spawn flow

Enterprise Scale (Months 3-6)

  • Event-driven architecture
  • Separate monitoring service
  • Redis caching cluster
  • Multi-region deployment (if needed)

See Also