Troubleshooting: Successful Build but ECS Runtime Failures

Calliope Integration: This component is integrated into the Calliope AI platform. Some features and configurations may differ from the upstream project.

Symptoms

  • Docker build completes successfully (30+ minutes, no errors)
  • ECS container fails with:
    • Error: Cannot find module '/home/calliope/WAIIDE-server/out/server-main.js'
    • cp: cannot stat '/home/calliope/scripts/jupyter_server_config.py'

Common Causes When Build Succeeds

1. Wrong Image Tag/Version in ECS

The most common issue: ECS is pulling a different image than the one you built.

# Check what image ECS is actually using
aws ecs describe-task-definition --task-definition your-task-name | grep image

# Verify the image digest
docker inspect your-registry/waiide:latest | grep -A 5 "RepoDigests"

# Compare with what ECS pulled
aws ecs describe-tasks --cluster your-cluster --tasks your-task-id | grep imageDigest
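The comparison the three commands above set up can be scripted. The following is a minimal sketch, assuming you paste in the digest strings those commands print; the helper names `digest_of` and `compare_digests` are hypothetical and do only string handling, without calling AWS or Docker.

```shell
# Hypothetical helpers: illustrative names, not part of WAIIDE or the AWS CLI.

# Strip everything up to and including '@' from a "repo@sha256:..." reference.
# A bare "sha256:..." digest is returned unchanged.
digest_of() {
  printf '%s\n' "${1##*@}"
}

# Print MATCH or MISMATCH for two digests; return non-zero on mismatch.
compare_digests() {
  local built ecs
  built="$(digest_of "$1")"
  ecs="$(digest_of "$2")"
  if [ -n "$built" ] && [ "$built" = "$ecs" ]; then
    echo "MATCH"
  else
    echo "MISMATCH: built=$built ecs=$ecs"
    return 1
  fi
}
```

For example, `compare_digests "<RepoDigests entry>" "<imageDigest from describe-tasks>"` prints MATCH only when ECS is running the image you built.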

2. Registry Push/Pull Issues

Incomplete Push

# Verify the pushed image size (should be ~3-4GB; ECR reports the compressed size, which can be smaller than what docker images shows)
aws ecr describe-images --repository-name waiide --image-ids imageTag=latest

# Or for Docker Hub
docker manifest inspect calliopeai/waiide:latest
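The size check above is easier to read with the byte count converted. This is a rough sketch, assuming the byte count comes from the `imageSizeInBytes` field of `aws ecr describe-images`; the helper names and any minimum size you pass in are hypothetical choices, not values defined by WAIIDE.

```shell
# Hypothetical helpers for reading the size reported by the registry.

# Convert a byte count to GiB with one decimal place.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Succeed only if the size ($1, bytes) is at least the minimum ($2, bytes).
size_at_least() {
  [ "$1" -ge "$2" ]
}
```

One way to feed it: capture the output of `aws ecr describe-images --repository-name waiide --image-ids imageTag=latest --query 'imageDetails[0].imageSizeInBytes' --output text` and pass it to `bytes_to_gib`. A result far below what you expect suggests the push did not include all layers.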

Multi-Architecture Confusion

# Check if you accidentally pushed multi-arch manifest
docker buildx imagetools inspect your-registry/waiide:latest

# ECS may be pulling the wrong architecture
# Ensure the pushed platform matches the task's CPU architecture (linux/amd64 unless the task definition sets runtimePlatform to ARM64)
docker buildx build --platform linux/amd64 --push -t your-registry/waiide:latest-amd64 .
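To make the multi-arch check above scriptable, the sketch below filters the text printed by `docker buildx imagetools inspect`. It assumes the output lists each manifest with a `Platform:` line, which is how a manifest list is normally printed; a single-architecture image may show no such lines. The helper names `list_platforms` and `has_platform` are hypothetical.

```shell
# Hypothetical helpers: they read `docker buildx imagetools inspect` text on stdin.

# Print each platform found in the inspect output once, sorted.
list_platforms() {
  awk '$1 == "Platform:" { print $2 }' | sort -u
}

# Succeed if the wanted platform (default linux/amd64) is listed on stdin.
has_platform() {
  list_platforms | grep -qxF "${1:-linux/amd64}"
}
```

Usage would look like `docker buildx imagetools inspect your-registry/waiide:latest | list_platforms`. More than one line of output means the tag points at a multi-arch manifest.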

3. ECS Task Definition Cache

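The service may still be running an older task definition revision. Registering a new revision does not move the service to it; pass --task-definition when updating the service:

# Register a new task definition revision
aws ecs register-task-definition --cli-input-json file://task-def.json

# Point the service at the latest revision of the family and force a new deployment
aws ecs update-service --cluster your-cluster --service your-service --task-definition your-task-name --force-new-deployment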
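To confirm which revision a service is actually running, you can compare revision numbers. The sketch below is illustrative only: `revision_of` and `check_revision` are hypothetical helpers that take task definition ARNs (or `family:revision` strings) you have already fetched.

```shell
# Hypothetical helpers: they compare revision numbers and do not call AWS.

# Print the revision number at the end of a task definition ARN or "family:revision".
revision_of() {
  printf '%s\n' "${1##*:}"
}

# Report whether the service's revision ($1) equals the latest registered one ($2).
check_revision() {
  local running latest
  running="$(revision_of "$1")"
  latest="$(revision_of "$2")"
  if [ "$running" -eq "$latest" ]; then
    echo "service is on revision $running (latest)"
  else
    echo "service is on revision $running, latest is $latest"
    return 1
  fi
}
```

The first argument can come from `aws ecs describe-services --cluster your-cluster --services your-service --query 'services[0].taskDefinition' --output text`, and the second from `aws ecs describe-task-definition --task-definition your-task-name --query 'taskDefinition.taskDefinitionArn' --output text`.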

4. Build vs Runtime Architecture

Your GitHub Actions runner (8GB) should have built for amd64, but verify the architecture and contents of the resulting image:

# In your GitHub Actions workflow, add:
- name: Verify built image
  run: |
    docker run --rm your-registry/waiide:latest uname -m
    docker run --rm your-registry/waiide:latest ls -la /home/calliope/WAIIDE-server/
    docker run --rm your-registry/waiide:latest ls -la /home/calliope/scripts/
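The `uname -m` output from the step above uses kernel machine names rather than Docker platform strings. As a small sketch, the hypothetical helper `platform_for_machine` below maps the two common values so the result can be compared with the `--platform` you built for; other machine names are reported as unknown.

```shell
# Hypothetical helper: map a `uname -m` value to a Docker platform string.
platform_for_machine() {
  case "$1" in
    x86_64|amd64)  echo "linux/amd64" ;;
    aarch64|arm64) echo "linux/arm64" ;;
    *)             echo "unknown: $1"; return 1 ;;
  esac
}
```

For example, `platform_for_machine "$(docker run --rm your-registry/waiide:latest uname -m)"` should print linux/amd64 for an image built with `--platform linux/amd64`.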

Debugging Steps

1. Pull and Test the Exact Image ECS Uses

# Pull the image ECS is using (for an exact match, pull by the digest that describe-tasks reported instead of the tag)
docker pull your-registry/waiide:latest

# Test it locally
docker run -it --rm your-registry/waiide:latest bash -c "
  echo '=== Checking WAIIDE Server ==='
  ls -la /home/calliope/WAIIDE-server/ | head -10
  echo '=== Checking for server-main.js ==='
  find /home/calliope -name 'server-main.js' 2>/dev/null | head -5
  echo '=== Checking scripts ==='
  ls -la /home/calliope/scripts/ | head -10
  echo '=== Image architecture ==='
  uname -m
"

2. Compare Image Layers

# Check if the image has the expected layers (--no-trunc keeps the full build commands so grep can match them)
docker history --no-trunc your-registry/waiide:latest | grep -E "(WAIIDE|scripts|COPY)"

3. ECS Exec Into Container

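# Enable ECS Exec (takes effect only for tasks started afterwards, so force a new deployment; the task role also needs SSM Session Manager permissions)
aws ecs update-service --cluster your-cluster --service your-service --enable-execute-command --force-new-deployment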

# Exec into running container
aws ecs execute-command --cluster your-cluster --task your-task-id --container waiide --interactive --command "/bin/bash"

# Inside container, check:
find / -name "server-main.js" 2>/dev/null
find / -name "jupyter_server_config.py" 2>/dev/null
ls -la /home/

Quick Fix Solutions

1. Use Explicit Image Digest

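Instead of the :latest tag, reference the image by digest in the container definition of your task definition:

{
  "containerDefinitions": [
    {
      "name": "waiide",
      "image": "your-registry/waiide@sha256:abc123..."
    }
  ]
}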

2. Tag with Unique Version

# In GitHub Actions
docker buildx build \
  --platform linux/amd64 \
  --push \
  -t your-registry/waiide:$(git rev-parse --short HEAD) \
  -t your-registry/waiide:latest \
  .

# Update ECS to use specific tag
"image": "your-registry/waiide:a1b2c3d"
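Both quick fixes above amount to not deploying a reference that resolves to :latest. As an illustrative sketch, the hypothetical helpers `ref_kind` and `is_pinned` below classify an image reference string so a deploy script could refuse an unpinned one. Note the assumption: a non-latest tag is treated as pinned, although tags can still be overwritten unless the registry enforces tag immutability.

```shell
# Hypothetical helpers: classify an image reference without contacting a registry.

# Print "digest", "tag:<name>", or "tag:latest" for an image reference.
ref_kind() {
  local ref last
  ref="$1"
  case "$ref" in
    *@sha256:*) echo "digest"; return 0 ;;
  esac
  last="${ref##*/}"            # final path component, so a registry port is ignored
  case "$last" in
    *:*) echo "tag:${last##*:}" ;;
    *)   echo "tag:latest" ;;  # Docker defaults to :latest when no tag is given
  esac
}

# Succeed for digest references and for tags other than latest.
is_pinned() {
  [ "$(ref_kind "$1")" != "tag:latest" ]
}
```

A deploy step could run `is_pinned "$IMAGE" || exit 1` before registering the task definition, where `IMAGE` is whatever variable holds the reference in your pipeline.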

3. Verify Push Completion

# Add to GitHub Actions after push
- name: Verify pushed image
  run: |
    docker pull ${{ env.REGISTRY }}/waiide:latest
    docker run --rm ${{ env.REGISTRY }}/waiide:latest ls -la /home/calliope/WAIIDE-server/

GitHub Actions Build Verification

Add this to your workflow to ensure build artifacts exist:

- name: Build and verify
  run: |
    docker buildx build --platform linux/amd64 --load -t waiide:test .
    
    # Verify critical files before pushing
    docker run --rm waiide:test bash -c "
      test -f /home/calliope/WAIIDE-server/server-main.js || test -f /home/calliope/WAIIDE-server/out/server-main.js || exit 1
      test -f /home/calliope/scripts/jupyter_server_config.py || exit 1
      echo 'Build verification passed!'
    "
    
    # Only push if verification passed
    docker buildx build --platform linux/amd64 --push -t ${{ env.REGISTRY }}/waiide:latest .
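The same file check can be kept as a reusable shell function. The sketch below is one possible shape, assuming the two paths from the ECS error messages are the ones that must exist; `check_required_files` is a hypothetical name, and the optional root argument lets it run against an extracted filesystem as well as inside the container.

```shell
# Hypothetical helper: check the two files named in the ECS errors.
# $1 is an optional root directory prefix (empty means the real filesystem root).
# Prints each missing path and returns non-zero if any is missing.
check_required_files() {
  local root missing path
  root="${1:-}"
  missing=0
  for path in \
    /home/calliope/WAIIDE-server/out/server-main.js \
    /home/calliope/scripts/jupyter_server_config.py
  do
    if [ ! -f "$root$path" ]; then
      echo "MISSING: $path"
      missing=1
    fi
  done
  return "$missing"
}
```

Unlike the workflow step above, this version accepts only the `out/server-main.js` location, since that is the path the runtime error reports; relax it if your build legitimately places the file elsewhere.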

Most Likely Issue

Given that your build was successful, the most likely causes are:

  1. ECS is pulling an older/different image than what you just built
  2. The push to registry was incomplete despite appearing successful
  3. Multi-architecture manifest causing ECS to pull wrong variant

Immediate Action: Pull the exact image URL that ECS is using and test it locally to see what’s actually in it.