Issue
As the title says, I'm using capacity providers to scale my instances when a service updates its desired count. I'm new to autoscaling, so I'm probably missing something. The problem is that once I update the desired count, the autoscaling alarm triggers correctly and the instances scale out fine, but meanwhile the service creates new tasks that never change from "PROVISIONING" to "RUNNING".
First I'll show how I configured the launch template, the autoscaling group, and the capacity provider (with some variables replaced by their values for easier reading):
locals {
  ami_id = jsondecode(data.aws_ssm_parameter.ecs_optimized_ami.value)["image_id"]

  # Launch template user_data must be base64-encoded. It runs as root, and
  # ECS-optimized AMIs are Amazon Linux based, so the package manager is
  # yum, not apt-get. The echo registers the instance with the cluster.
  cluster_user_data = base64encode(<<-EOF
    #!/bin/bash
    yum update -y
    echo "ECS_CLUSTER=${aws_ecs_cluster.cluster.name}" >> /etc/ecs/ecs.config
  EOF
  )
}
resource "aws_launch_template" "ecs-public" {
name_prefix = "ecs-${var.cluster_name}-public"
image_id = local.ami_id
instance_type = "t3.small"
iam_instance_profile {
name = aws_iam_instance_profile.cluster.name
}
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = var.cluster_instance_root_block_device_size
volume_type = var.cluster_instance_root_block_device_type
}
}
# TODO use Dynamic here
network_interfaces {
device_index = 0
security_groups = var.security_groups_external
delete_on_termination = true
subnet_id = var.public_subnet_ids[0]
}
network_interfaces {
device_index = 1
security_groups = var.security_groups_external
delete_on_termination = true
subnet_id = var.public_subnet_ids[1]
}
user_data = local.cluster_user_data
key_name = aws_key_pair.generated_key[0].key_name
lifecycle {
create_before_destroy = true
}
depends_on = [
null_resource.iam_wait
]
}
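As an aside, the "TODO use dynamic" comment above refers to collapsing the two hard-coded network_interfaces blocks. Here is a sketch of what that would look like (the resource name is illustrative, and note that the solution below removes subnet_id entirely, so this only documents the original intent):

resource "aws_launch_template" "ecs-public-sketch" {
  name_prefix   = "ecs-${var.cluster_name}-public"
  image_id      = local.ami_id
  instance_type = "t3.small"

  # One interface per public subnet instead of two hard-coded blocks.
  # network_interfaces.key is the list index, so it doubles as device_index.
  dynamic "network_interfaces" {
    for_each = var.public_subnet_ids
    content {
      device_index          = network_interfaces.key
      subnet_id             = network_interfaces.value
      security_groups       = var.security_groups_external
      delete_on_termination = true
    }
  }

  user_data = local.cluster_user_data
}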
resource "aws_autoscaling_group" "cluster_public" {
name_prefix = "asg-public-${var.cluster_name}"
vpc_zone_identifier = var.public_subnet_ids
launch_template {
id = aws_launch_template.ecs-public.id
version = "$Latest"
}
min_size = 1
max_size = 5
desired_capacity = 1
protect_from_scale_in = false
tag {
key = "Name"
value = "worker-public-${var.cluster_name}"
propagate_at_launch = true
}
tag {
key = "ClusterName"
value = var.cluster_name
propagate_at_launch = true
}
tag {
key = "AmazonECSManaged"
value = true
propagate_at_launch = true
}
dynamic "tag" {
for_each = var.tags
content {
key = tag.key
value = tag.value
propagate_at_launch = true
}
}
lifecycle {
create_before_destroy = true
}
}
resource "aws_ecs_capacity_provider" "autoscaling_group_public" {
name = "cp-${var.cluster_name}-public"
auto_scaling_group_provider {
auto_scaling_group_arn = aws_autoscaling_group.cluster_public.arn
managed_termination_protection = "DISABLED"
managed_scaling {
status = "ENABLED"
target_capacity = 100
minimum_scaling_step_size = 1
maximum_scaling_step_size = 100
}
}
}
resource "aws_ecs_cluster_capacity_providers" "cluster_capacity_providers" {
cluster_name = aws_ecs_cluster.cluster.name
capacity_providers = [aws_ecs_capacity_provider.autoscaling_group_private[0].name, aws_ecs_capacity_provider.autoscaling_group_public[0].name]
}
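One thing worth knowing about this resource: it can also carry a default_capacity_provider_strategy, so services that don't declare their own strategy still land on a capacity provider. My setup doesn't use it, but as a sketch, the resource above could be extended like this:

resource "aws_ecs_cluster_capacity_providers" "cluster_capacity_providers" {
  cluster_name = aws_ecs_cluster.cluster.name
  capacity_providers = [
    aws_ecs_capacity_provider.autoscaling_group_public.name,
  ]

  # Services with no explicit strategy fall back to this
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.autoscaling_group_public.name
    weight            = 1
    base              = 0
  }
}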
That is how the autoscaling group and capacity providers are configured on the cluster. Here is the module I'm using for the service and task:
module "container_definition" {
source = "cloudposse/ecs-container-definition/aws"
version = "0.58.1"
container_name = local.container_name
container_image = "${module.global_settings.aws_account_id}.dkr.ecr.${module.global_settings.region}.amazonaws.com/${local.project_name}:Staging-latest"
container_memory = 512
container_memory_reservation = 256
container_cpu = 256
essential = true
readonly_root_filesystem = false
environment = local.task_environment_variables
port_mappings = local.port_mappings
log_configuration = local.container_log_configuration
}
module "ecs_alb_service_task" {
source = "cloudposse/ecs-alb-service-task/aws"
version = "0.66.2"
namespace = var.cluster_name
stage = "Staging"
name = local.project_name
attributes = []
container_definition_json = module.container_definition.sensitive_json_map_encoded_list
#Load Balancer
alb_security_group = var.security_group_id
ecs_load_balancers = local.ecs_load_balancer_config
#VPC
vpc_id = var.vpc_id
subnet_ids = var.subnet_ids
network_mode = "awsvpc"
#Capacity Provider Strategy
capacity_provider_strategies =
[
{
capacity_provider = var.capacity_provider_name
weight = 1
base = 0
}
]
desired_count = 2
launch_type = "EC2"
ignore_changes_desired_count = true
ecs_cluster_arn = var.cluster_arn
security_group_ids = [var.security_group_id]
ignore_changes_task_definition = true
health_check_grace_period_seconds = 200
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
deployment_controller_type = "ECS"
task_memory = 512
task_cpu = 256
force_new_deployment = true
ordered_placement_strategy =
[
{
type = "spread"
field = "attribute:ecs.availability-zone"
},{
type = "spread"
field = "instanceId"
}
]
label_order = local.label_order
labels_as_tags = local.labels_as_tags
propagate_tags = local.propagate_tags
tags = merge(var.tags, local.tags)
task_exec_role_arn = [module.task_excecution_role.task_excecution_role_arn]
task_role_arn = [module.task_excecution_role.task_excecution_role_arn]
}
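For reference, the module's capacity_provider_strategies input maps onto the capacity_provider_strategy blocks of a plain aws_ecs_service. A minimal sketch with illustrative names, in case you're not using the cloudposse module:

resource "aws_ecs_service" "sketch" {
  name            = "my-service"                       # illustrative
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.sketch.arn # illustrative
  desired_count   = 2

  # When a capacity provider strategy is set, launch_type must be
  # omitted: the ECS API rejects specifying both
  capacity_provider_strategy {
    capacity_provider = var.capacity_provider_name
    weight            = 1
    base              = 0
  }

  # Required because the task definition uses network_mode = "awsvpc"
  network_configuration {
    subnets         = var.subnet_ids
    security_groups = [var.security_group_id]
  }
}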
EDIT: I found that the instances are not registering in the cluster at all. That usually happens when user_data is misconfigured, in which case the instance registers in the default cluster instead, but that isn't happening here either. Things I tried:
- I tried changing the desired count multiple times
- I tried deleting the whole service and recreating it
- I set "ignore_changes_desired_count" to true
- I tried setting a base of 1 and a weight of 3 in "capacity_provider_strategies"
- I changed the instance_type from t3.micro to t3.small
Solution
Finally, after all the investigation, I found out that launch templates don't let you set "subnet_id" when the template is used by an EC2 Auto Scaling group: the ASG chooses subnets from its own vpc_zone_identifier, so a subnet pinned in the template's network interface conflicts with it. I'll show how I found the answer first: go to EC2, click Launch Templates, and edit the template you are using.
Click the button "Provide guidance to help me set up a template that I can use with EC2 Auto Scaling" and it will show you whether you have mistakes.
In my case the mistake was that Terraform had filled in the "subnet_id" field, which is part of the advanced network configuration. The code ends up this way:
resource "aws_launch_template" "ecs-public" {
name_prefix = "ecs-${var.cluster_name}-public"
image_id = local.ami_id
instance_type = "t3.small"
iam_instance_profile {
name = aws_iam_instance_profile.cluster.name
}
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = var.cluster_instance_root_block_device_size
volume_type = var.cluster_instance_root_block_device_type
}
}
network_interfaces {
associate_public_ip_address = true
device_index = 0
security_groups = var.security_groups_external
delete_on_termination = true
}
user_data = local.cluster_user_data
key_name = aws_key_pair.generated_key[0].key_name
lifecycle {
create_before_destroy = true
}
depends_on = [
null_resource.iam_wait
]
}
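Note the added associate_public_ip_address = true: once the template no longer pins a public subnet, this keeps the instances publicly reachable the way the old per-subnet configuration was, while the ASG spreads them across var.public_subnet_ids on its own.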