Calliope Hub ECS Spawner - Capacity and Limits

Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.

This document describes the theoretical and practical limits on the number of concurrently spawned servers in the Calliope Hub ECS deployment.

Executive Summary

Current Calliope Hub Configuration (2 vCPU / 4GB):

Recommended: 100-150 concurrent servers
Maximum: ~200 servers before degradation

Optimized Calliope Hub Configuration (4 vCPU / 8GB):

Recommended: 300-400 concurrent servers
Maximum: ~500 servers (ECS API limits)

With Horizontal Scaling (Multiple hubs + PostgreSQL):

Practical: 1,000-3,000 concurrent servers
Theoretical: 10,000+ with advanced optimizations

Architecture Overview

Monitoring & Polling

┌──────────────┐
│ Calliope Hub │  Poll interval: 5-10s (randomized)
│              │  Activity updates: 120s
└──────┬───────┘  Orphan discovery: 300s
       │
       │ ECS API Calls
       ├──────────────────┐
       │                  │
┌──────▼───────┐   ┌──────▼───────┐
│describe_tasks│   │  list_tasks  │
│  (per poll)  │   │ (orphan scan)│
└──────────────┘   └──────────────┘

Per-Server Overhead:

Poll cycle: 1 describe_tasks call every 5-10s
API rate: ~0.1-0.2 requests/second per server
Memory: ~1 MB per spawner object
CPU: ~0.1s per poll cycle
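
The per-server API rate follows directly from the poll interval, since each poll issues one describe_tasks call. The sketch below only restates that arithmetic; the function name is illustrative and is not part of the spawner.

```python
def requests_per_second(poll_interval_s: float) -> float:
    """One describe_tasks call per poll, so the rate is 1 / interval."""
    if poll_interval_s <= 0:
        raise ValueError("poll interval must be positive")
    return 1.0 / poll_interval_s


# Bounds of the randomized 5-10s poll interval
fastest = requests_per_second(5)    # about 0.2 requests/second per server
slowest = requests_per_second(10)   # about 0.1 requests/second per server
```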

Bottleneck Analysis

1. AWS ECS API Rate Limits ⚠️ PRIMARY BOTTLENECK

Limits (per AWS Region):

describe_tasks: 100 TPS (transactions per second)
list_tasks: 100 TPS
Max tasks per call: 100 tasks

Impact:

100 TPS ÷ 0.15 requests/second per server ≈ 666 concurrent servers (hard limit)

With safety margin: ~500-600 servers maximum
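
The same calculation as a small helper, which makes it easy to re-run with a different rate limit or poll rate. The 0.8 safety margin is an assumption chosen here for illustration (it lands inside the 500-600 range quoted above); the function name is hypothetical.

```python
import math


def max_servers_by_api_limit(tps_limit: float,
                             requests_per_server: float,
                             safety_margin: float = 1.0) -> int:
    """Servers that fit under the API rate limit.

    safety_margin is the fraction of the limit you are willing to use (0-1].
    """
    if not 0 < safety_margin <= 1:
        raise ValueError("safety_margin must be in (0, 1]")
    return math.floor(tps_limit * safety_margin / requests_per_server)


hard_limit = max_servers_by_api_limit(100, 0.15)         # 666
with_margin = max_servers_by_api_limit(100, 0.15, 0.8)   # 533 (assumed 80% margin)
```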

Symptoms of hitting this limit:

API throttling errors in logs
ThrottlingException: Rate exceeded
Slow poll responses (>30s)
Delayed server status updates

Mitigation:

Request limit increase from AWS Support
Implement API call batching (see OPTIMIZATION_GUIDE.md)
Add response caching layer
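
As a rough illustration of the batching idea, the sketch below groups task ARNs into chunks of 100 (the per-call maximum listed above) and issues one describe_tasks call per chunk. It assumes a boto3-style ECS client is passed in and that the hub can poll all tasks in one shared cycle; `describe_all` and `chunked` are hypothetical helpers, not functions from the spawner or from OPTIMIZATION_GUIDE.md.

```python
from typing import Iterator, List

MAX_TASKS_PER_CALL = 100  # describe_tasks limit quoted in this section


def chunked(task_arns: List[str], size: int = MAX_TASKS_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` ARNs."""
    for start in range(0, len(task_arns), size):
        yield task_arns[start:start + size]


def describe_all(ecs_client, cluster: str, task_arns: List[str]) -> List[dict]:
    """Describe every task with one API call per 100 ARNs instead of one per task."""
    tasks: List[dict] = []
    for batch in chunked(task_arns):
        response = ecs_client.describe_tasks(cluster=cluster, tasks=batch)
        tasks.extend(response.get("tasks", []))
    return tasks
```

Under these assumptions, 500 servers need 5 calls per poll cycle rather than 500.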

2. Calliope Hub Container CPU ⚠️ YOUR PRIMARY CONSTRAINT

Current Configuration: 2 vCPU

CPU Usage Calculation:

Servers × Poll CPU Time ÷ Poll Interval = Sustained CPU

Example:
1000 servers × 0.1s CPU ÷ 10s interval = 10 CPU cores needed
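
The formula above can be wrapped in a helper to estimate utilization for a given hub size. The defaults (0.1s CPU per poll, 10s interval) are the figures used in this section; the function names are illustrative.

```python
def sustained_cpu_cores(servers: int,
                        cpu_s_per_poll: float = 0.1,
                        poll_interval_s: float = 10.0) -> float:
    """Servers x poll CPU time / poll interval = cores kept busy by polling."""
    return servers * cpu_s_per_poll / poll_interval_s


def cpu_utilization_pct(servers: int, vcpus: int) -> float:
    """Polling load as a percentage of the hub's vCPU allocation."""
    return 100.0 * sustained_cpu_cores(servers) / vcpus


cores_for_1000 = sustained_cpu_cores(1000)     # about 10 cores
current_hub = cpu_utilization_pct(150, 2)      # about 75% on 2 vCPU
```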

Capacity by CPU:

Calliope Hub vCPU	Concurrent Servers	CPU Usage
2 vCPU (current)	150	75%
2 vCPU	200	100% (maxed)
4 vCPU	400	100%
8 vCPU	800	100%
16 vCPU	1,600	100%

Symptoms of CPU saturation:

Slow web UI responses
Poll cycles taking >10s
Calliope Hub becomes unresponsive
Spawn timeouts

3. Calliope Hub Container Memory

Current Configuration: 4GB RAM

Memory Usage:

Base Calliope Hub: ~1 GB
Per Spawner: ~1 MB (object + state)

Total = 1 GB + (Servers × 1 MB)
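
A small sketch of the same memory formula, plus its inverse (how many servers fit in a given allocation). It assumes 1 GB = 1,000 MB, which matches the round numbers used in this section; the function names are illustrative.

```python
def hub_memory_gb(servers: int,
                  base_gb: float = 1.0,
                  per_spawner_mb: float = 1.0) -> float:
    """Total = base hub memory + servers x per-spawner memory."""
    return base_gb + servers * per_spawner_mb / 1000.0


def max_servers_for_memory(total_gb: float,
                           base_gb: float = 1.0,
                           per_spawner_mb: float = 1.0) -> int:
    """Servers that fit once the base hub footprint is subtracted."""
    return int((total_gb - base_gb) * 1000.0 / per_spawner_mb)


servers_in_4gb = max_servers_for_memory(4)   # 3000
```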

Capacity by Memory:

Calliope Hub Memory	Concurrent Servers	Usage
4GB (current)	3,000	4 GB
8GB	7,000	8 GB
16GB	15,000	16 GB

Memory is NOT a bottleneck for typical deployments (<1000 servers).

4. Database Performance

SQLite (default):

Limit: ~100-200 concurrent servers
Issue: Write lock contention on state updates
Symptoms:
- Slow spawn operations
- “Database is locked” errors
- Delayed stop operations

PostgreSQL (recommended for scale):

Limit: 10,000+ concurrent servers
Requirements: Connection pooling, proper indexing
No practical limit at your scale

5. ECS Task Limits

Fargate Limits (per region):

Tasks per cluster: No hard limit (auto-scales)
ENIs (networking): 5,000 per VPC (requestable to 50,000)
Fargate vCPU quota: 10,000 vCPUs default (requestable to 100,000+)

Capacity by Quota:

Quota	Instance Size	Concurrent Servers
10,000 vCPU	4 vCPU (medium)	2,500
10,000 vCPU	8 vCPU (large)	1,250
100,000 vCPU	4 vCPU (medium)	25,000

ECS limits are NOT a bottleneck - API rate limits hit first.
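
To see why the API rate limit is reached before the Fargate quota, the two ceilings can be compared side by side. This is a minimal sketch using the figures from this document (10,000 vCPU quota, 4 vCPU servers, 100 TPS, 0.15 requests/second per server); the dictionary keys and function name are illustrative.

```python
def servers_by_vcpu_quota(quota_vcpus: int, vcpus_per_server: int) -> int:
    """Whole servers that fit inside the Fargate vCPU quota."""
    return quota_vcpus // vcpus_per_server


limits = {
    "fargate_vcpu_quota": servers_by_vcpu_quota(10_000, 4),  # 2500
    "ecs_api_rate": int(100 / 0.15),                         # 666
}

# The smallest ceiling is the one that is hit first
first_limit = min(limits, key=limits.get)
```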

6. Network Bandwidth

Per Poll Cycle:

describe_tasks response: ~5-10 KB
1000 servers @ 10s polling: 500 KB/s - 1 MB/s

Not a bottleneck until 10,000+ servers.
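
The bandwidth estimate is the response size multiplied by the number of servers and divided by the poll interval. A minimal sketch, with an illustrative function name:

```python
def poll_bandwidth_kb_per_s(servers: int,
                            response_kb: float,
                            poll_interval_s: float = 10.0) -> float:
    """Aggregate describe_tasks response traffic in KB/s."""
    return servers * response_kb / poll_interval_s


low = poll_bandwidth_kb_per_s(1000, 5)     # 500 KB/s
high = poll_bandwidth_kb_per_s(1000, 10)   # 1000 KB/s, i.e. about 1 MB/s
```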

Realistic Capacity Matrix

Current Setup (2 vCPU / 4GB)

Servers	Status	Performance	Bottleneck	Notes
50	✅ Excellent	Response <500ms	None	Ideal for small teams
100	✅ Good	Response <1s	None	Sweet spot
150	⚠️ Degraded	Response 1-2s	CPU 75%	Approaching limit
200	❌ Struggling	Response 2-5s	CPU maxed	Not recommended
300	❌ Broken	Timeouts common	API throttling	System unstable

Recommendation: 100-150 servers maximum

Optimized Setup (4 vCPU / 8GB)

Servers	Status	Performance	Bottleneck	Notes
200	✅ Excellent	Response <500ms	None	Recommended
400	✅ Good	Response <1s	None	Sweet spot
500	⚠️ Degraded	Response 1-2s	API pressure	Approaching API limits
600	❌ Throttled	Timeouts	ECS API	Hard limit hit

Recommendation: 300-400 servers optimal, 500 maximum

Large Scale (8 vCPU / 16GB + PostgreSQL)

Servers	Status	Performance	Bottleneck	Mitigation Needed
500	✅ Good	Response <1s	None	PostgreSQL required
1,000	⚠️ API pressure	Response 1-2s	ECS API	Batching recommended
2,000	❌ Needs optimization	Degraded	ECS API	Must implement batching + caching

Requires: Batch API calls, caching, PostgreSQL

Horizontal Scaling (Multiple Hubs)

Setup	Servers	Cost/Month*	Requirements
2 hubs @ 4 vCPU	600-800	$240	PostgreSQL, ALB
3 hubs @ 4 vCPU	900-1,200	$360	+ Sticky sessions
5 hubs @ 4 vCPU	1,500-2,000	$600	+ Coordination
10 hubs @ 4 vCPU	3,000-4,000	$1,200	+ Sharding

*Fargate pricing (us-west-2), Calliope Hub tasks only, excludes spawned servers

Requires: See ARCHITECTURE.md for horizontal scaling guide
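
For rough planning, the table implies about 300-400 servers per 4 vCPU hub. The sketch below turns that into a hub count; the default of 300 per hub is the conservative end of that range, and the function name is hypothetical. It does not account for the coordination or sharding requirements listed in the table.

```python
import math


def hubs_needed(servers: int, servers_per_hub: int = 300) -> int:
    """Smallest number of hubs that keeps each hub at or under its target load."""
    if servers_per_hub <= 0:
        raise ValueError("servers_per_hub must be positive")
    return max(1, math.ceil(servers / servers_per_hub))


conservative = hubs_needed(1000)        # 4 hubs at 300 servers each
aggressive = hubs_needed(1000, 400)     # 3 hubs at 400 servers each
```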

Cost Analysis

Calliope Hub Infrastructure Costs (Fargate us-west-2)

Configuration	vCPU Cost	Memory Cost	Total/Month
2 vCPU / 4GB	$29.28	$32.04	~$61
4 vCPU / 8GB	$58.56	$64.08	~$123
8 vCPU / 16GB	$117.12	$128.16	~$245
16 vCPU / 32GB	$234.24	$256.32	~$491

Assumes 24/7 uptime, 730 hours/month

Spawned Server Costs (Example)

Medium Instance (4 vCPU / 12GB):

Cost: ~$120/month per server (24/7)
100 servers: ~$12,000/month
400 servers: ~$48,000/month

Calliope Hub cost is negligible compared to spawned servers - don’t skimp on Calliope Hub resources!
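
A quick way to quantify that claim is to compute the hub's share of the total monthly bill. The sketch uses the figures from the tables above (~$61 or ~$123 per hub, ~$120 per spawned server); the function name is illustrative.

```python
def hub_cost_share(hub_monthly_usd: float,
                   servers: int,
                   server_monthly_usd: float = 120.0) -> float:
    """Fraction of the total monthly cost that goes to the hub itself."""
    total = hub_monthly_usd + servers * server_monthly_usd
    return hub_monthly_usd / total


small = hub_cost_share(61, 100)    # roughly 0.5% of the total
large = hub_cost_share(123, 400)   # roughly 0.26% of the total
```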

Monitoring & Alerts

Key Metrics to Monitor

1. Calliope Hub CPU Utilization

Alert: > 70% sustained for 5 minutes
Action: Upgrade to more vCPUs

2. Poll Cycle Duration

Alert: > 15 seconds
Action: Check API throttling, increase poll interval

3. API Error Rate

Alert: ThrottlingException > 1% of requests
Action: Implement batching or request limit increase

4. Database Lock Contention (SQLite only)

Alert: "Database is locked" errors
Action: Migrate to PostgreSQL

5. Memory Usage

Alert: > 80% of allocated memory
Action: Upgrade memory or investigate memory leak
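
The five thresholds above can be expressed as a single check. This is a sketch only: `HubMetrics` and `triggered_alerts` are hypothetical names, and the inputs are assumed to be already aggregated (for example, CPU averaged over the 5-minute window).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HubMetrics:
    cpu_pct: float              # sustained over 5 minutes
    poll_cycle_s: float         # duration of one full poll cycle
    throttle_error_pct: float   # ThrottlingException as % of requests
    db_locked_errors: int       # "Database is locked" count (SQLite only)
    memory_pct: float           # % of allocated memory in use


def triggered_alerts(m: HubMetrics) -> List[str]:
    """Return one 'condition: action' string per threshold that is exceeded."""
    alerts: List[str] = []
    if m.cpu_pct > 70:
        alerts.append("CPU > 70%: upgrade to more vCPUs")
    if m.poll_cycle_s > 15:
        alerts.append("Poll cycle > 15s: check API throttling, increase poll interval")
    if m.throttle_error_pct > 1:
        alerts.append("Throttling > 1%: implement batching or request limit increase")
    if m.db_locked_errors > 0:
        alerts.append("Database is locked: migrate to PostgreSQL")
    if m.memory_pct > 80:
        alerts.append("Memory > 80%: upgrade memory or investigate memory leak")
    return alerts
```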

CloudWatch Metrics

# Calliope Hub CPU
Namespace: AWS/ECS
Metric: CPUUtilization
Dimension: ServiceName=Calliope Hub-service

# API Calls
Namespace: AWS/ECS
Metric: APICallCount
Statistic: Sum

# Task Health
Namespace: AWS/ECS
Metric: HealthyTaskCount, UnhealthyTaskCount

Troubleshooting

Problem: Slow server spawns (>2 minutes)

Possible Causes:

CPU saturation - Calliope Hub taking too long to poll
Database lock contention (SQLite)
EFS mount latency
Container image pull time

Solutions:

Upgrade Calliope Hub vCPU
Switch to PostgreSQL
Use provisioned EFS throughput
Pre-pull images to nodes (EC2 launch type only; Fargate pulls the image on every task start)

Problem: API throttling errors

Error:

ThrottlingException: Rate exceeded

Solutions:

Implement batch API calls (see OPTIMIZATION_GUIDE.md)
Increase poll interval to 15-30s
Request AWS API limit increase
Add response caching layer
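
One possible shape for the response caching layer is a small time-to-live cache in front of the status lookup, so repeated requests for the same task within a short window reuse the last response. This is a sketch under stated assumptions: `TTLCache` is a hypothetical class, not part of the spawner, and cached status can be stale by up to the TTL.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache values per key and refetch only after ttl_s seconds."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl_s:
            return cached[1]          # still fresh: no API call
        value = fetch()               # expired or missing: call the API
        self._entries[key] = (now, value)
        return value
```

A TTL shorter than the poll interval keeps status reasonably current while still absorbing bursts of duplicate lookups.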

Problem: Ghost servers in UI

Symptoms:

Servers show “running” but are actually stopped
Can’t connect to server
Task not found in ECS

Causes:

Poll method bug (FIXED in recent updates)
Orphan detection not running
Database state mismatch

Solutions:

Update to latest code with poll fixes
Enable orphan detection service
Manual cleanup: Stop server via UI
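
The state mismatch behind ghost servers can be detected by comparing what the hub believes is running with what ECS reports. The sketch below shows that comparison as plain set arithmetic; the function names are hypothetical, and how the two lists are obtained (database query, list_tasks) is left out.

```python
from typing import Iterable, Set


def find_ghost_servers(hub_running: Iterable[str],
                       ecs_running: Iterable[str]) -> Set[str]:
    """Task ARNs the hub shows as running that ECS does not report."""
    return set(hub_running) - set(ecs_running)


def find_orphan_tasks(hub_running: Iterable[str],
                      ecs_running: Iterable[str]) -> Set[str]:
    """Task ARNs running in ECS that the hub has no record of."""
    return set(ecs_running) - set(hub_running)
```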

Version History

Version	Date	Changes
1.0	2025-10-13	Initial documentation - capacity analysis and limits
