 

AWS ECS tasks are being killed by OOM without leaving any trace

I have an ECS cluster where I run a container as a daemon to monitor all other processes. However, I'm seeing these containers being killed by OOM from time to time without leaving a trace; I just happened to spot one of them being killed. This is causing some log duplication, but I wonder if there is a way to trace these restarts, because when I look at the ECS cluster events there is no information about these tasks being restarted at all.

I know Kubernetes better, so here is an analogy: when this happens on Kubernetes, you see a RESTARTS counter when you list all pods (kubectl get pods). Is there any way to find this information for AWS ECS tasks? I'm struggling to find it in the documentation.
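For illustration, this is the kind of restart counter I mean on Kubernetes (pod name and numbers are made up):

kubectl get pods
NAME                  READY   STATUS    RESTARTS   AGE
datadog-agent-x7k2p   1/1     Running   3          2d4h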

I identified the tasks, and I also checked the status of each task to gain more information, but I'm unable to find any hint that the process was restarted or killed before.
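For reference, I pulled those details with something along these lines (using one of the task IDs from the cluster; not necessarily the exact command):

aws ecs describe-tasks --cluster dev --tasks 05d4a402ee274a3ca90a86e46292a63a --output yaml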

This is an example of a task detail:

- attachments: []
  attributes:
  - name: ecs.cpu-architecture
    value: x86_64
  availabilityZone: us-east-2c
  clusterArn: arn:aws:ecs:us-west-2:99999999999:cluster/dev
  connectivity: CONNECTED
  connectivityAt: '2023-01-24T23:03:23.315000-05:00'
  containerInstanceArn: arn:aws:ecs:us-east-2:99999999999:container-instance/dev/eb8875fhfghghghfjyjk88c8f96433b8
  containers:
  - containerArn: arn:aws:ecs:us-east-2:99999999999:container/dev/05d4a402ee274a3ca90a86e46292a63a/e54af51f-2420-47ab-bff6-dcd4f976ad2e
    cpu: '500'
    healthStatus: HEALTHY
    image: public.ecr.aws/datadog/agent:7.36.1
    lastStatus: RUNNING
    memory: '750'
    name: datadog-agent
    networkBindings:
    - bindIP: 0.0.0.0
      containerPort: 8125
      hostPort: 8125
      protocol: udp
    - bindIP: 0.0.0.0
      containerPort: 8126
      hostPort: 8126
      protocol: tcp
    networkInterfaces: []
    runtimeId: 75559b7327258d69fe61cac2dfe58b12d292bdb7b3a720c457231ee9e3e4190a
    taskArn: arn:aws:ecs:us-east-2:99999999999:task/dev/05d4a402ee274a3ca90a86e46292a63a
  cpu: '500'
  createdAt: '2023-01-24T23:03:22.841000-05:00'
  desiredStatus: RUNNING
  enableExecuteCommand: false
  group: service:datadog-agent
  healthStatus: HEALTHY
  lastStatus: RUNNING
  launchType: EC2
  memory: '750'
  overrides:
    containerOverrides:
    - name: datadog-agent
    inferenceAcceleratorOverrides: []
  pullStartedAt: '2023-01-24T23:03:25.471000-05:00'
  pullStoppedAt: '2023-01-24T23:03:39.790000-05:00'
  startedAt: '2023-01-24T23:03:47.514000-05:00'
  startedBy: ecs-svc/1726924224402147943
  tags: []
  taskArn: arn:aws:ecs:us-west-2:99999999999:task/dev/05d4a402ee274a3ca90a86e46292a63a
  taskDefinitionArn: arn:aws:ecs:us-west-2:99999999999:task-definition/datadog-agent-task:5
  version: 2

1 Answer

So, after a lot of debugging with the little information AWS provides for this use case, I ended up with the following process to find the answer:

  1. List all task IDs of a given service with the AWS CLI, using the --desired-status STOPPED flag, and dump the result to a JSON file:

aws ecs list-tasks --cluster dev --service-name datadog-agent --desired-status STOPPED --output json > ecs_tasks.json
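The resulting ecs_tasks.json is just a list of task ARNs, roughly shaped like this (the second ARN is made up for illustration):

{
  "taskArns": [
    "arn:aws:ecs:us-east-2:99999999999:task/dev/05d4a402ee274a3ca90a86e46292a63a",
    "arn:aws:ecs:us-east-2:99999999999:task/dev/1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d"
  ]
}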

  2. Using jq and the AWS CLI, describe all of the previously found task IDs to get further information on each of them:

aws ecs describe-tasks --cluster dev --tasks $(jq -j '.taskArns[] | (.|" ",.)' ./ecs_tasks.json) --output yaml > ecs_tasks_describe.log
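In case the jq expression looks cryptic: it just prints every task ARN from the file separated by spaces so they can be passed to --tasks. An equivalent form that I find easier to read (my rewording, not the command I originally ran) is:

jq -r '.taskArns | join(" ")' ./ecs_tasks.json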

  3. I could have written a script to group and summarize the information, but since I only had to look over about 20 stopped tasks, I ended up dumping the information in YAML format for readability (a rough summarizing sketch follows the bullets below). I found two key properties in the output:

    • For each task object, there is a stoppedReason property explaining why it was stopped. In my case it told me nothing more than that a container within the task exited (it doesn't include the exit code, unfortunately):

stoppedReason: Essential container in task exited

    • For each task object, there is an array of container objects under the **containers** property. There you'll sometimes find a **reason** property which explains a bit more about why the container stopped:

reason: 'OutOfMemoryError: Container killed due to memory usage'
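For anyone dealing with many more stopped tasks, here is a minimal sketch of the kind of summary script I mentioned in step 3. It assumes you re-run the describe step with --output json so jq can parse it, and the file names are arbitrary (note that describe-tasks accepts at most 100 task ARNs per call, so a very long list would need batching):

aws ecs describe-tasks --cluster dev \
  --tasks $(jq -r '.taskArns | join(" ")' ./ecs_tasks.json) \
  --output json > ecs_tasks_describe.json

# count stopped containers per reason; containers with no reason recorded are bucketed together
jq '[.tasks[].containers[] | (.reason // "no reason recorded")]
  | group_by(.) | map({reason: .[0], count: length})' ecs_tasks_describe.json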

Note: this information is available for a given service for at least the last hour. In my case it gave me 8 hours of events, but the AWS documentation only promises 1 hour: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/stopped-task-errors.html

Stopped tasks only appear in the Amazon ECS console, AWS CLI, and AWS SDKs for at least 1 hour after the task stops. After that, the details of the stopped task expire and aren't available in Amazon ECS.
