Calliope Hub ECS Spawner - Capacity and Limits
Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.
This document describes the theoretical and practical limits for concurrent spawned servers in the Calliope Hub ECS deployment.
Executive Summary
Current Calliope Hub Configuration (2 vCPU / 4GB):
- Recommended: 100-150 concurrent servers
- Maximum: ~200 servers before degradation
Optimized Calliope Hub Configuration (4 vCPU / 8GB):
- Recommended: 300-400 concurrent servers
- Maximum: ~500 servers (ECS API limits)
With Horizontal Scaling (Multiple hubs + PostgreSQL):
- Practical: 1,000-3,000 concurrent servers
- Theoretical: 10,000+ with advanced optimizations
Architecture Overview
Monitoring & Polling
┌──────────────┐
│ Calliope Hub │  Poll interval: 5-10s (randomized)
│              │  Activity updates: 120s
└──────┬───────┘  Orphan discovery: 300s
       │
       │ ECS API Calls
       ├──────────────────┐
       │                  │
┌──────▼───────┐   ┌──────▼───────┐
│describe_tasks│   │  list_tasks  │
│  (per poll)  │   │(orphan scan) │
└──────────────┘   └──────────────┘
Per-Server Overhead:
- Poll cycle: 1 describe_tasks call every 5-10s
- API rate: ~0.1-0.2 requests/second per server
- Memory: ~1 MB per spawner object
- CPU: ~0.1s per poll cycle
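The per-server API rate follows from the poll interval: one describe_tasks call per poll means 1/interval requests per second. The sketch below illustrates that arithmetic and the randomized delay; the function names are illustrative assumptions and are not taken from the spawner's code.

```python
import random


def next_poll_delay(min_s: float = 5.0, max_s: float = 10.0) -> float:
    """Pick a randomized delay so spawners do not all poll at the same instant."""
    return random.uniform(min_s, max_s)


def api_rate_per_server(poll_interval_s: float, calls_per_poll: int = 1) -> float:
    """Requests per second generated by one server at a given poll interval."""
    return calls_per_poll / poll_interval_s


print(api_rate_per_server(10.0))  # 0.1
print(api_rate_per_server(5.0))   # 0.2
```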
Bottleneck Analysis
1. AWS ECS API Rate Limits ⚠️ PRIMARY BOTTLENECK
Limits (per AWS Region):
- describe_tasks: 100 TPS (transactions per second)
- list_tasks: 100 TPS
- Max tasks per call: 100 tasks
Impact:
100 TPS ÷ 0.15 requests/second per server = ~666 concurrent servers (hard limit)
With safety margin: ~500-600 servers maximum
Symptoms of hitting this limit:
- API throttling errors in logs (ThrottlingException: Rate exceeded)
- Slow poll responses (>30s)
- Delayed server status updates
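The Impact estimate above can be reproduced with a few lines of Python. This is a sketch under the document's own figures (100 TPS, about 0.15 requests per second per server); the 20% safety margin is an assumed value chosen only to land inside the stated 500-600 range.

```python
import math


def max_servers_for_api_limit(tps_limit: float,
                              requests_per_server: float,
                              safety_margin: float = 0.0) -> int:
    """Largest whole number of servers whose combined polling stays under the limit."""
    usable_tps = tps_limit * (1.0 - safety_margin)
    return math.floor(usable_tps / requests_per_server)


print(max_servers_for_api_limit(100, 0.15))                     # 666
print(max_servers_for_api_limit(100, 0.15, safety_margin=0.2))  # 533
```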
Mitigation:
- Request limit increase from AWS Support
- Implement API call batching (see OPTIMIZATION_GUIDE.md)
- Add response caching layer
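Batching takes advantage of the 100-tasks-per-call limit: instead of one describe_tasks call per server, the hub could describe up to 100 tasks in a single request. The following is a minimal sketch, not the approach from OPTIMIZATION_GUIDE.md. The `describe` callable and `fake_describe` are hypothetical stand-ins so the example runs without AWS access; the response shape (`tasks`, `taskArn`, `lastStatus`) mirrors the ECS DescribeTasks response.

```python
from typing import Callable, Dict, Iterator, List

MAX_TASKS_PER_CALL = 100  # describe_tasks accepts at most 100 tasks per call


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def describe_tasks_batched(task_arns: List[str],
                           describe: Callable[[List[str]], dict]) -> Dict[str, str]:
    """Return {task ARN: last status} using one API call per 100 tasks."""
    statuses: Dict[str, str] = {}
    for batch in chunked(task_arns, MAX_TASKS_PER_CALL):
        response = describe(batch)
        for task in response.get("tasks", []):
            statuses[task["taskArn"]] = task["lastStatus"]
    return statuses


# Hypothetical stand-in for a real ECS client call; records each batch size.
calls: List[int] = []


def fake_describe(batch: List[str]) -> dict:
    calls.append(len(batch))
    return {"tasks": [{"taskArn": arn, "lastStatus": "RUNNING"} for arn in batch]}


arns = [f"task-{i}" for i in range(250)]
statuses = describe_tasks_batched(arns, fake_describe)
print(len(statuses), calls)  # 250 [100, 100, 50]
```

With boto3, `describe` could be something like `lambda batch: ecs.describe_tasks(cluster=cluster_name, tasks=batch)`, where `ecs` and `cluster_name` are assumed to exist in the caller. In this example 250 servers cost 3 API calls per poll cycle instead of 250.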
2. Calliope Hub Container CPU ⚠️ YOUR PRIMARY CONSTRAINT
Current Configuration: 2 vCPU
CPU Usage Calculation:
Servers × Poll CPU Time ÷ Poll Interval = Sustained CPU
Example:
1000 servers × 0.1s CPU ÷ 10s interval = 10 CPU cores needed
Capacity by CPU:
| Calliope Hub vCPU | Concurrent Servers | CPU Usage |
|---|---|---|
| 2 vCPU (current) | 150 | 75% |
| 2 vCPU | 200 | 100% (maxed) |
| 4 vCPU | 400 | 100% |
| 8 vCPU | 800 | 100% |
| 16 vCPU | 1,600 | 100% |
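The table rows follow from the sustained-CPU formula with a 10s poll interval and 0.1s of CPU per poll. A small sketch for checking other configurations (the function names are illustrative):

```python
def sustained_cpu_cores(servers: int,
                        cpu_s_per_poll: float = 0.1,
                        poll_interval_s: float = 10.0) -> float:
    """CPU cores kept busy by polling alone."""
    return servers * cpu_s_per_poll / poll_interval_s


def cpu_utilization_pct(servers: int, hub_vcpu: int) -> float:
    """Polling load as a percentage of the hub's allocated vCPUs."""
    return 100.0 * sustained_cpu_cores(servers) / hub_vcpu


for servers, vcpu in [(150, 2), (200, 2), (400, 4)]:
    print(f"{servers} servers on {vcpu} vCPU: {cpu_utilization_pct(servers, vcpu):.0f}%")
```

This models polling only; web UI traffic and spawn operations add load on top, which is one reason to stay below the 100% rows.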
Symptoms of CPU saturation:
- Slow web UI responses
- Poll cycles taking >10s
- Calliope Hub becomes unresponsive
- Spawn timeouts
3. Calliope Hub Container Memory
Current Configuration: 4GB RAM
Memory Usage:
Base Calliope Hub: ~1 GB
Per Spawner: ~1 MB (object + state)
Total = 1 GB + (Servers × 1 MB)
Capacity by Memory:
| Calliope Hub Memory | Concurrent Servers | Usage |
|---|---|---|
| 4GB (current) | 3,000 | 4 GB |
| 8GB | 7,000 | 8 GB |
| 16GB | 15,000 | 16 GB |
Memory is NOT a bottleneck for typical deployments (<1000 servers).
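The memory table can be derived from the formula above. This sketch assumes 1 GB = 1000 MB, which is the rounding the table appears to use, and the function name is illustrative.

```python
def max_servers_for_memory(total_gb: float,
                           base_gb: float = 1.0,
                           per_spawner_mb: float = 1.0) -> int:
    """Servers that fit once the hub's base footprint is subtracted."""
    available_mb = (total_gb - base_gb) * 1000.0  # assumes 1 GB = 1000 MB
    return int(available_mb / per_spawner_mb)


for total_gb in (4, 8, 16):
    print(f"{total_gb} GB: {max_servers_for_memory(total_gb)} servers")
```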
4. Database Performance
SQLite (default):
- Limit: ~100-200 concurrent servers
- Issue: Write lock contention on state updates
- Symptoms:
- Slow spawn operations
- “Database is locked” errors
- Delayed stop operations
PostgreSQL (recommended for scale):
- Limit: 10,000+ concurrent servers
- Requirements: Connection pooling, proper indexing
- No practical limit at your scale
5. ECS Task Limits
Fargate Limits (per region):
- Tasks per cluster: No hard limit (auto-scales)
- ENIs (networking): 5,000 per VPC (requestable to 50,000)
- Fargate vCPU quota: 10,000 vCPUs default (requestable to 100,000+)
Capacity by Quota:
| Quota | Instance Size | Concurrent Servers |
|---|---|---|
| 10,000 vCPU | 4 vCPU (medium) | 2,500 |
| 10,000 vCPU | 8 vCPU (large) | 1,250 |
| 100,000 vCPU | 4 vCPU (medium) | 25,000 |
ECS limits are NOT a bottleneck - API rate limits hit first.
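The quota table is a straight division of the regional vCPU quota by the vCPU size of each spawned server. A minimal sketch using the figures from the table:

```python
def max_servers_for_vcpu_quota(quota_vcpu: int, vcpu_per_server: int) -> int:
    """Whole servers that fit within a Fargate vCPU quota."""
    return quota_vcpu // vcpu_per_server


print(max_servers_for_vcpu_quota(10_000, 4))   # 2500
print(max_servers_for_vcpu_quota(10_000, 8))   # 1250
print(max_servers_for_vcpu_quota(100_000, 4))  # 25000
```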
6. Network Bandwidth
Per Poll Cycle:
- describe_tasks response: ~5-10 KB
- 1000 servers @ 10s polling: 500 KB/s - 1 MB/s
Not a bottleneck until 10,000+ servers.
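The bandwidth figures come from multiplying the server count by the response size and dividing by the poll interval. A sketch of that arithmetic, assuming one response per server per poll:

```python
def poll_bandwidth_kb_per_s(servers: int,
                            response_kb: float,
                            poll_interval_s: float = 10.0) -> float:
    """Inbound polling traffic in KB/s for a given fleet size."""
    return servers * response_kb / poll_interval_s


print(poll_bandwidth_kb_per_s(1000, 5))   # 500.0
print(poll_bandwidth_kb_per_s(1000, 10))  # 1000.0
```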
Realistic Capacity Matrix
Current Setup (2 vCPU / 4GB)
| Servers | Status | Performance | Bottleneck | Notes |
|---|---|---|---|---|
| 50 | ✅ Excellent | Response <500ms | None | Ideal for small teams |
| 100 | ✅ Good | Response <1s | None | Sweet spot |
| 150 | ⚠️ Degraded | Response 1-2s | CPU 75% | Approaching limit |
| 200 | ❌ Struggling | Response 2-5s | CPU maxed | Not recommended |
| 300 | ❌ Broken | Timeouts common | API throttling | System unstable |
Recommendation: 100-150 servers maximum
Optimized Setup (4 vCPU / 8GB)
| Servers | Status | Performance | Bottleneck | Notes |
|---|---|---|---|---|
| 200 | ✅ Excellent | Response <500ms | None | Recommended |
| 400 | ✅ Good | Response <1s | None | Sweet spot |
| 500 | ⚠️ Degraded | Response 1-2s | API pressure | Approaching API limits |
| 600 | ❌ Throttled | Timeouts | ECS API | Hard limit hit |
Recommendation: 300-400 servers optimal, 500 maximum
Large Scale (8 vCPU / 16GB + PostgreSQL)
| Servers | Status | Performance | Bottleneck | Mitigation Needed |
|---|---|---|---|---|
| 500 | ✅ Good | Response <1s | None | PostgreSQL required |
| 1,000 | ⚠️ API pressure | Response 1-2s | ECS API | Batching recommended |
| 2,000 | ❌ Needs optimization | Degraded | ECS API | Must implement batching + caching |
Requires: Batch API calls, caching, PostgreSQL
Horizontal Scaling (Multiple Hubs)
| Setup | Servers | Cost/Month* | Requirements |
|---|---|---|---|
| 2 hubs @ 4 vCPU | 600-800 | $240 | PostgreSQL, ALB |
| 3 hubs @ 4 vCPU | 900-1,200 | $360 | + Sticky sessions |
| 5 hubs @ 4 vCPU | 1,500-2,000 | $600 | + Coordination |
| 10 hubs @ 4 vCPU | 3,000-4,000 | $1,200 | + Sharding |
*Fargate pricing (us-west-2), Calliope Hub tasks only, excludes spawned servers
Requires: See ARCHITECTURE.md for horizontal scaling guide
Cost Analysis
Calliope Hub Infrastructure Costs (Fargate us-west-2)
| Configuration | vCPU Cost | Memory Cost | Total/Month |
|---|---|---|---|
| 2 vCPU / 4GB | $29.28 | $32.04 | ~$61 |
| 4 vCPU / 8GB | $58.56 | $64.08 | ~$123 |
| 8 vCPU / 16GB | $117.12 | $128.16 | ~$245 |
| 16 vCPU / 32GB | $234.24 | $256.32 | ~$491 |
Assumes 24/7 uptime, 730 hours/month
Spawned Server Costs (Example)
Medium Instance (4 vCPU / 12GB):
- Cost: ~$120/month per server (24/7)
- 100 servers: ~$12,000/month
- 400 servers: ~$48,000/month
Calliope Hub cost is negligible compared to spawned servers - don’t skimp on Calliope Hub resources!
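To put "negligible" in numbers, the sketch below compares hub cost with fleet cost. The per-unit monthly rates are derived from the table rows above ($29.28 / 2 vCPU and $32.04 / 4 GB), not taken from AWS pricing, and the $120 per-server figure is the document's Medium Instance example.

```python
VCPU_MONTHLY_USD = 14.64  # implied by the table: $29.28 for 2 vCPU
GB_MONTHLY_USD = 8.01     # implied by the table: $32.04 for 4 GB


def hub_monthly_cost(vcpu: int, memory_gb: int) -> float:
    """Monthly hub cost at 24/7 uptime, using the table's implied rates."""
    return round(vcpu * VCPU_MONTHLY_USD + memory_gb * GB_MONTHLY_USD, 2)


def hub_cost_share(hub_cost: float, servers: int,
                   cost_per_server: float = 120.0) -> float:
    """Hub cost as a fraction of total monthly spend (hub + spawned servers)."""
    return hub_cost / (hub_cost + servers * cost_per_server)


hub = hub_monthly_cost(4, 8)
print(hub)                                # 122.64
print(f"{hub_cost_share(hub, 400):.2%}")  # 0.25%
```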
Monitoring & Alerts
Key Metrics to Monitor
1. Calliope Hub CPU Utilization
Alert: > 70% sustained for 5 minutes
Action: Upgrade to more vCPUs
2. Poll Cycle Duration
Alert: > 15 seconds
Action: Check API throttling, increase poll interval
3. API Error Rate
Alert: ThrottlingException > 1% of requests
Action: Implement batching or request limit increase
4. Database Lock Contention (SQLite only)
Alert: "Database is locked" errors
Action: Migrate to PostgreSQL
5. Memory Usage
Alert: > 80% of allocated memory
Action: Upgrade memory or investigate memory leak
CloudWatch Metrics
# Calliope Hub CPU
Namespace: ECS/Fargate
Metric: CPUUtilization
Dimension: ServiceName=Calliope Hub-service
# API Calls
Namespace: AWS/ECS
Metric: APICallCount
Statistic: Sum
# Task Health
Namespace: AWS/ECS
Metric: HealthyTaskCount, UnhealthyTaskCount
Troubleshooting
Problem: Slow server spawns (>2 minutes)
Possible Causes:
- CPU saturation - Calliope Hub taking too long to poll
- Database lock contention (SQLite)
- EFS mount latency
- Container image pull time
Solutions:
- Upgrade Calliope Hub vCPU
- Switch to PostgreSQL
- Use provisioned EFS throughput
- Reduce image pull time (use smaller images; pre-pulling images to nodes applies only to EC2-backed clusters, not Fargate)
Problem: API throttling errors
Error:
ThrottlingException: Rate exceeded
Solutions:
- Implement batch API calls (see OPTIMIZATION_GUIDE.md)
- Increase poll interval to 15-30s
- Request AWS API limit increase
- Add response caching layer
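One way to picture the response caching layer is a small time-to-live (TTL) cache in front of the status lookup, so repeated polls within a short window reuse the last answer instead of calling ECS again. This is a sketch only; the `TTLCache` class and its injectable `clock` parameter are assumptions made for illustration, not part of the spawner.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache values for ttl_s seconds; the clock is injectable for testing."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]  # still fresh: no API call
        value = fetch()      # missing or expired: call the API once
        self._entries[key] = (now, value)
        return value


# Demonstration with a fake clock and a fetch function that counts its calls.
fake_now = [0.0]
fetches = []


def fetch_status() -> str:
    fetches.append(1)
    return "RUNNING"


cache = TTLCache(ttl_s=5.0, clock=lambda: fake_now[0])
cache.get_or_fetch("task-1", fetch_status)  # miss: fetches
fake_now[0] = 3.0
cache.get_or_fetch("task-1", fetch_status)  # hit: served from cache
fake_now[0] = 6.0
cache.get_or_fetch("task-1", fetch_status)  # expired: fetches again
print(len(fetches))  # 2
```

The trade-off is staleness: a cached status can lag the real task state by up to the TTL, so the TTL should stay well below the poll interval that the UI relies on.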
Problem: Ghost servers in UI
Symptoms:
- Servers show “running” but are actually stopped
- Can’t connect to server
- Task not found in ECS
Causes:
- Poll method bug (FIXED in recent updates)
- Orphan detection not running
- Database state mismatch
Solutions:
- Update to latest code with poll fixes
- Enable orphan detection service
- Manual cleanup: Stop server via UI
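A database state mismatch can be detected by comparing the tasks the hub believes are running with the tasks ECS actually reports. The sketch below shows that comparison with plain sets; the function name and the task identifiers are hypothetical, and how each set is obtained (hub database query, list_tasks scan) is left to the caller.

```python
from typing import List, Set, Tuple


def reconcile(hub_running: Set[str], ecs_running: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (ghosts, orphans).

    ghosts:  tasks the hub shows as running that ECS does not report
    orphans: tasks running in ECS that the hub has no record of
    """
    ghosts = sorted(hub_running - ecs_running)
    orphans = sorted(ecs_running - hub_running)
    return ghosts, orphans


hub_state = {"task-a", "task-b", "task-c"}
ecs_state = {"task-b", "task-c", "task-d"}
ghosts, orphans = reconcile(hub_state, ecs_state)
print(ghosts)   # ['task-a']
print(orphans)  # ['task-d']
```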
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-10-13 | Initial documentation - capacity analysis and limits |
See Also
- SPAWNER_SCALING.md - Scaling recommendations and capacity planning
- ARCHITECTURE.md - Horizontal scaling architecture
- OPTIMIZATION_GUIDE.md - Performance optimization techniques