Calliope Hub ECS Spawner - Capacity and Limits

Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.

This document describes the theoretical and practical limits for concurrent spawned servers in the Calliope Hub ECS deployment.

Executive Summary

Current Calliope Hub Configuration (2 vCPU / 4GB):

  • Recommended: 100-150 concurrent servers
  • Maximum: ~200 servers before degradation

Optimized Calliope Hub Configuration (4 vCPU / 8GB):

  • Recommended: 300-400 concurrent servers
  • Maximum: ~500 servers (ECS API limits)

With Horizontal Scaling (Multiple hubs + PostgreSQL):

  • Practical: 1,000-3,000 concurrent servers
  • Theoretical: 10,000+ with advanced optimizations

Architecture Overview

Monitoring & Polling

┌──────────────┐
│ Calliope Hub │  Poll interval: 5-10s (randomized)
│              │  Activity updates: 120s
└──────┬───────┘  Orphan discovery: 300s
       │
       │ ECS API Calls
       ├──────────────────┐
       │                  │
┌──────▼───────┐   ┌──────▼───────┐
│describe_tasks│   │  list_tasks  │
│  (per poll)  │   │ (orphan scan)│
└──────────────┘   └──────────────┘

Per-Server Overhead:

  • Poll cycle: 1 describe_tasks call every 5-10s
  • API rate: ~0.1-0.2 requests/second per server
  • Memory: ~1 MB per spawner object
  • CPU: ~0.1s per poll cycle

Bottleneck Analysis

1. AWS ECS API Rate Limits ⚠️ PRIMARY BOTTLENECK

Limits (per AWS Region):

  • describe_tasks: 100 TPS (transactions per second)
  • list_tasks: 100 TPS
  • Max tasks per call: 100 tasks

Impact:

100 TPS ÷ 0.15 requests/second per server = ~666 concurrent servers (hard limit)

With safety margin: ~500-600 servers maximum
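
The arithmetic above can be reproduced with a small helper. This is a minimal sketch, not part of the spawner: the function name and the `safety_margin` parameter are assumptions made for illustration, and the 10% margin used in the example is simply one value that lands inside the 500-600 range quoted above.

```python
def max_servers_for_api_limit(tps_limit: float,
                              requests_per_server: float,
                              safety_margin: float = 0.0) -> int:
    """Estimate how many servers fit under an API rate limit.

    tps_limit: allowed requests per second (100 in this document).
    requests_per_server: requests/second each server generates (~0.15 here).
    safety_margin: fraction of the limit held in reserve (assumed parameter).
    """
    if requests_per_server <= 0:
        raise ValueError("requests_per_server must be positive")
    usable_tps = tps_limit * (1.0 - safety_margin)
    return int(usable_tps / requests_per_server)


if __name__ == "__main__":
    print(max_servers_for_api_limit(100, 0.15))        # no margin
    print(max_servers_for_api_limit(100, 0.15, 0.10))  # 10% held in reserve
```

Rerunning the helper with a measured request rate from your own deployment is likely more useful than relying on the 0.15 estimate.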

Symptoms of hitting this limit:

  • API throttling errors in logs
  • ThrottlingException: Rate exceeded
  • Slow poll responses (>30s)
  • Delayed server status updates

Mitigation:

  • Request limit increase from AWS Support
  • Implement API call batching (see OPTIMIZATION_GUIDE.md)
  • Add response caching layer
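
As an illustration of the batching mitigation, the sketch below groups task ARNs into chunks of 100 (the per-call maximum listed above) so that one `describe_tasks` call covers many servers. It is not the implementation from OPTIMIZATION_GUIDE.md. The function names and the `"MISSING"` sentinel are hypothetical, and `ecs_client` is assumed to behave like a boto3 ECS client, whose `describe_tasks(cluster=..., tasks=[...])` call returns `tasks` and `failures` lists.

```python
from typing import Any, Dict, Iterator, List

MAX_TASKS_PER_CALL = 100  # per-call maximum cited in this document


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def describe_tasks_batched(ecs_client: Any, cluster: str,
                           task_arns: List[str]) -> Dict[str, str]:
    """Return {task_arn: last_status}, issuing one API call per 100 tasks."""
    statuses: Dict[str, str] = {}
    for batch in chunked(task_arns, MAX_TASKS_PER_CALL):
        response = ecs_client.describe_tasks(cluster=cluster, tasks=batch)
        for task in response.get("tasks", []):
            statuses[task["taskArn"]] = task["lastStatus"]
        # Tasks ECS could not find are reported under "failures".
        for failure in response.get("failures", []):
            statuses[failure["arn"]] = "MISSING"  # hypothetical sentinel
    return statuses
```

Under these assumptions, 500 servers need 5 calls per poll cycle instead of 500, which is where most of the API headroom would come from.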

2. Calliope Hub Container CPU ⚠️ YOUR PRIMARY CONSTRAINT

Current Configuration: 2 vCPU

CPU Usage Calculation:

Servers × Poll CPU Time ÷ Poll Interval = Sustained CPU

Example:
1000 servers × 0.1s CPU ÷ 10s interval = 10 CPU cores needed
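
The same formula, written as a sketch you can rerun with your own measurements. The function names and the `target_utilization` parameter are illustrative assumptions. The defaults (0.1s CPU per poll, 10s interval) are the figures used in this document.

```python
def sustained_cpu_cores(servers: int,
                        cpu_seconds_per_poll: float = 0.1,
                        poll_interval_s: float = 10.0) -> float:
    """CPU cores kept busy by polling `servers` spawned servers."""
    return servers * cpu_seconds_per_poll / poll_interval_s


def servers_for_vcpus(vcpus: float,
                      cpu_seconds_per_poll: float = 0.1,
                      poll_interval_s: float = 10.0,
                      target_utilization: float = 1.0) -> float:
    """Servers that keep `vcpus` at the given utilization (1.0 = 100%)."""
    return vcpus * target_utilization * poll_interval_s / cpu_seconds_per_poll


if __name__ == "__main__":
    print(sustained_cpu_cores(1000))                       # cores for 1000 servers
    print(servers_for_vcpus(2, target_utilization=0.75))   # 2 vCPU at 75%
```

Shortening the poll interval to 5s doubles the CPU needed for the same number of servers, so the capacity figures in this section assume the 10s end of the 5-10s range.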

Capacity by CPU:

| Calliope Hub vCPU | Concurrent Servers | CPU Usage |
|---|---|---|
| 2 vCPU (current) | 150 | 75% |
| 2 vCPU | 200 | 100% (maxed) |
| 4 vCPU | 400 | 100% |
| 8 vCPU | 800 | 100% |
| 16 vCPU | 1,600 | 100% |

Symptoms of CPU saturation:

  • Slow web UI responses
  • Poll cycles taking >10s
  • Calliope Hub becomes unresponsive
  • Spawn timeouts

3. Calliope Hub Container Memory

Current Configuration: 4GB RAM

Memory Usage:

Base Calliope Hub: ~1 GB
Per Spawner: ~1 MB (object + state)

Total = 1 GB + (Servers × 1 MB)
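
A minimal sketch of the memory estimate, assuming the ~1 GB base and ~1 MB per spawner figures above. The function names are hypothetical. The sketch uses 1 GB = 1024 MB, so its results come out slightly higher than rounded figures based on 1 GB = 1000 MB.

```python
def hub_memory_mb(servers: int,
                  base_mb: float = 1024.0,
                  per_spawner_mb: float = 1.0) -> float:
    """Estimated hub memory in MB for a given number of spawned servers."""
    return base_mb + servers * per_spawner_mb


def max_servers_for_memory(total_gb: float,
                           base_gb: float = 1.0,
                           per_spawner_mb: float = 1.0) -> int:
    """Servers that fit in `total_gb` after the base footprint is subtracted."""
    available_mb = (total_gb - base_gb) * 1024.0
    return max(0, int(available_mb / per_spawner_mb))


if __name__ == "__main__":
    for gb in (4, 8, 16):
        print(gb, "GB ->", max_servers_for_memory(gb), "servers")
```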

Capacity by Memory:

| Calliope Hub Memory | Concurrent Servers | Usage |
|---|---|---|
| 4GB (current) | 3,000 | 4 GB |
| 8GB | 7,000 | 8 GB |
| 16GB | 15,000 | 16 GB |

Memory is NOT a bottleneck for typical deployments (<1000 servers).


4. Database Performance

SQLite (default):

  • Limit: ~100-200 concurrent servers
  • Issue: Write lock contention on state updates
  • Symptoms:
    • Slow spawn operations
    • “Database is locked” errors
    • Delayed stop operations

PostgreSQL (recommended for scale):

  • Limit: 10,000+ concurrent servers
  • Requirements: Connection pooling, proper indexing
  • No practical limit at your scale

5. ECS Task Limits

Fargate Limits (per region):

  • Tasks per cluster: No hard limit (auto-scales)
  • ENIs (networking): 5,000 per VPC (requestable to 50,000)
  • Fargate vCPU quota: 10,000 vCPUs default (requestable to 100,000+)

Capacity by Quota:

| Quota | Instance Size | Concurrent Servers |
|---|---|---|
| 10,000 vCPU | 4 vCPU (medium) | 2,500 |
| 10,000 vCPU | 8 vCPU (large) | 1,250 |
| 100,000 vCPU | 4 vCPU (medium) | 25,000 |

ECS limits are NOT a bottleneck - API rate limits hit first.


6. Network Bandwidth

Per Poll Cycle:

  • describe_tasks response: ~5-10 KB
  • 1000 servers @ 10s polling: 500 KB/s - 1 MB/s

Not a bottleneck until 10,000+ servers.


Realistic Capacity Matrix

Current Setup (2 vCPU / 4GB)

| Servers | Status | Performance | Bottleneck | Notes |
|---|---|---|---|---|
| 50 | ✅ Excellent | Response <500ms | None | Ideal for small teams |
| 100 | ✅ Good | Response <1s | None | Sweet spot |
| 150 | ⚠️ Degraded | Response 1-2s | CPU 75% | Approaching limit |
| 200 | ❌ Struggling | Response 2-5s | CPU maxed | Not recommended |
| 300 | ❌ Broken | Timeouts common | API throttling | System unstable |

Recommendation: 100-150 servers maximum


Optimized Setup (4 vCPU / 8GB)

| Servers | Status | Performance | Bottleneck | Notes |
|---|---|---|---|---|
| 200 | ✅ Excellent | Response <500ms | None | Recommended |
| 400 | ✅ Good | Response <1s | None | Sweet spot |
| 500 | ⚠️ Degraded | Response 1-2s | API pressure | Approaching API limits |
| 600 | ❌ Throttled | Timeouts | ECS API | Hard limit hit |

Recommendation: 300-400 servers optimal, 500 maximum


Large Scale (8 vCPU / 16GB + PostgreSQL)

| Servers | Status | Performance | Bottleneck | Mitigation Needed |
|---|---|---|---|---|
| 500 | ✅ Good | Response <1s | None | PostgreSQL required |
| 1,000 | ⚠️ API pressure | Response 1-2s | ECS API | Batching recommended |
| 2,000 | ❌ Needs optimization | Degraded | ECS API | Must implement batching + caching |

Requires: Batch API calls, caching, PostgreSQL


Horizontal Scaling (Multiple Hubs)

| Setup | Servers | Cost/Month* | Requirements |
|---|---|---|---|
| 2 hubs @ 4 vCPU | 600-800 | $240 | PostgreSQL, ALB |
| 3 hubs @ 4 vCPU | 900-1,200 | $360 | + Sticky sessions |
| 5 hubs @ 4 vCPU | 1,500-2,000 | $600 | + Coordination |
| 10 hubs @ 4 vCPU | 3,000-4,000 | $1,200 | + Sharding |

*Fargate pricing (us-west-2), Calliope Hub tasks only, excludes spawned servers

Requires: See ARCHITECTURE.md for horizontal scaling guide


Cost Analysis

Calliope Hub Infrastructure Costs (Fargate us-west-2)

| Configuration | vCPU Cost | Memory Cost | Total/Month |
|---|---|---|---|
| 2 vCPU / 4GB | $29.28 | $32.04 | ~$61 |
| 4 vCPU / 8GB | $58.56 | $64.08 | ~$123 |
| 8 vCPU / 16GB | $117.12 | $128.16 | ~$245 |
| 16 vCPU / 32GB | $234.24 | $256.32 | ~$491 |

Assumes 24/7 uptime, 730 hours/month

Spawned Server Costs (Example)

Medium Instance (4 vCPU / 12GB):

  • Cost: ~$120/month per server (24/7)
  • 100 servers: ~$12,000/month
  • 400 servers: ~$48,000/month

Calliope Hub cost is negligible compared to spawned servers - don’t skimp on Calliope Hub resources!


Monitoring & Alerts

Key Metrics to Monitor

1. Calliope Hub CPU Utilization

Alert: > 70% sustained for 5 minutes
Action: Upgrade to more vCPUs

2. Poll Cycle Duration

Alert: > 15 seconds
Action: Check API throttling, increase poll interval

3. API Error Rate

Alert: ThrottlingException > 1% of requests
Action: Implement batching or request limit increase

4. Database Lock Contention (SQLite only)

Alert: "Database is locked" errors
Action: Migrate to PostgreSQL

5. Memory Usage

Alert: > 80% of allocated memory
Action: Upgrade memory or investigate memory leak
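
The five alert rules above can be expressed as a single check. This is a sketch only: the `HubMetrics` fields and the `evaluate_alerts` function are hypothetical names, and how the values are collected (CloudWatch, logs, or the hub itself) is left open. The thresholds and action strings are taken from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HubMetrics:
    """Hypothetical snapshot of hub metrics; field names are illustrative."""
    cpu_percent_5min_avg: float  # sustained CPU over the last 5 minutes
    poll_cycle_seconds: float    # duration of the latest poll cycle
    throttled_requests: int      # ThrottlingException count in the window
    total_requests: int          # all ECS API requests in the window
    db_locked_errors: int        # "Database is locked" errors (SQLite only)
    memory_percent: float        # share of allocated memory in use


def evaluate_alerts(m: HubMetrics) -> List[str]:
    """Return the recommended actions for every threshold that is exceeded."""
    actions: List[str] = []
    if m.cpu_percent_5min_avg > 70:
        actions.append("Upgrade to more vCPUs")
    if m.poll_cycle_seconds > 15:
        actions.append("Check API throttling, increase poll interval")
    if m.total_requests and m.throttled_requests / m.total_requests > 0.01:
        actions.append("Implement batching or request limit increase")
    if m.db_locked_errors > 0:
        actions.append("Migrate to PostgreSQL")
    if m.memory_percent > 80:
        actions.append("Upgrade memory or investigate memory leak")
    return actions
```

For example, a snapshot with 85% CPU and 3% throttled requests would return the vCPU upgrade and the batching actions, in that order.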

CloudWatch Metrics

# Calliope Hub CPU
Namespace: AWS/ECS
Metric: CPUUtilization
Dimension: ServiceName=Calliope Hub-service

# API Calls
Namespace: AWS/ECS
Metric: APICallCount
Statistic: Sum

# Task Health
Namespace: AWS/ECS
Metric: HealthyTaskCount, UnhealthyTaskCount

Troubleshooting

Problem: Slow server spawns (>2 minutes)

Possible Causes:

  1. CPU saturation - Calliope Hub taking too long to poll
  2. Database lock contention (SQLite)
  3. EFS mount latency
  4. Container image pull time

Solutions:

  1. Upgrade Calliope Hub vCPU
  2. Switch to PostgreSQL
  3. Use provisioned EFS throughput
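  4. Reduce image size, or pre-pull images to nodes where the launch type allows it (Fargate does not cache images between tasks)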

Problem: API throttling errors

Error:

ThrottlingException: Rate exceeded

Solutions:

  1. Implement batch API calls (see OPTIMIZATION_GUIDE.md)
  2. Increase poll interval to 15-30s
  3. Request AWS API limit increase
  4. Add response caching layer
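
One possible shape for the response caching layer is a small time-to-live cache in front of the status lookup. This is a sketch under stated assumptions: `TTLCache` and `get_or_fetch` are hypothetical names, not existing spawner APIs, and the TTL should stay below the poll interval so that stopped tasks are still noticed promptly.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache values for `ttl_seconds`, then fetch them again on next access."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        """Return the cached value for `key`, calling `fetch` only if stale."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = fetch()
        self._entries[key] = (now, value)
        return value
```

A caller would wrap the expensive lookup, for example `cache.get_or_fetch(task_arn, lambda: lookup_status(task_arn))`, where `lookup_status` stands in for whatever function issues the ECS call.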

Problem: Ghost servers in UI

Symptoms:

  • Servers show “running” but are actually stopped
  • Can’t connect to server
  • Task not found in ECS

Causes:

  • Poll method bug (FIXED in recent updates)
  • Orphan detection not running
  • Database state mismatch

Solutions:

  1. Update to latest code with poll fixes
  2. Enable orphan detection service
  3. Manual cleanup: Stop server via UI
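
To make the "database state mismatch" cause concrete, the sketch below compares what the hub believes is running with what ECS reports. It illustrates the reconciliation idea only and is not the orphan detection service itself. Both function names are hypothetical, and gathering the two inputs (the hub database and a `list_tasks` scan) is assumed to happen elsewhere.

```python
from typing import Dict, List, Set


def find_ghost_servers(hub_running: Dict[str, str],
                       ecs_running_arns: Set[str]) -> List[str]:
    """Server names the hub shows as running whose task is absent from ECS.

    hub_running maps server name -> task ARN as recorded by the hub.
    """
    return sorted(name for name, arn in hub_running.items()
                  if arn not in ecs_running_arns)


def find_orphan_tasks(hub_running: Dict[str, str],
                      ecs_running_arns: Set[str]) -> List[str]:
    """Task ARNs running in ECS that the hub has no record of."""
    return sorted(ecs_running_arns - set(hub_running.values()))
```

Ghost servers are candidates for being marked stopped in the hub, while orphan tasks are candidates for being stopped in ECS. Either action should be confirmed before it is automated.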

Version History

| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-10-13 | Initial documentation - capacity analysis and limits |

See Also