ECS CapacityProvider Bug: Security Groups Must Be Specified

by Alex Johnson

Introduction

This article describes a bug encountered when using the AWS::ECS::CapacityProvider resource in AWS CloudFormation: the CloudFormation documentation and the Elastic Container Service (ECS) API disagree about the SecurityGroups property of ManagedInstancesNetworkConfiguration. The documentation states that SecurityGroups is optional, but the ECS API rejects requests that omit it, so deployments fail. The sections below explain the issue and its impact, show how to reproduce it, and outline possible fixes and workarounds. The problem affects users who write CloudFormation directly as well as those who use tools such as AWS CDK, AWS SAM, and Terraform.

The Core Issue: Documentation vs. API

The central problem lies in the conflicting information provided by the CloudFormation documentation and the actual requirements of the ECS API. According to the CloudFormation documentation for ManagedInstancesNetworkConfiguration, the SecurityGroups property is marked as "Required: No". This suggests that users can create an ECS Capacity Provider with Managed Instances without explicitly specifying security groups. However, in practice, the ECS API rejects any attempt to create a capacity provider that omits this property. This discrepancy leads to failed CloudFormation deployments, as the system encounters validation errors during resource creation. The error message clearly indicates that the Network Configuration must include security groups, directly contradicting the documented behavior.

Replicating the Bug: A Step-by-Step Guide

To illustrate the issue, consider the following steps to reproduce the bug using AWS CDK, a popular infrastructure-as-code framework. The provided CDK code defines a stack that creates a VPC, an ECS cluster, an EC2 instance role and profile, and a managed instance capacity provider. This code snippet is intended to create an ECS Capacity Provider with Managed Instances, relying on the documented optionality of the SecurityGroups property. However, deploying this code will result in a failure due to the API's requirement for security groups.

CDK Code

import { Size, Stack, type StackProps } from "aws-cdk-lib";
import { CpuManufacturer, SubnetType, Vpc } from "aws-cdk-lib/aws-ec2";
import { Cluster, ManagedInstancesCapacityProvider } from "aws-cdk-lib/aws-ecs";
import { InstanceProfile, ManagedPolicy, Role, ServicePrincipal } from "aws-cdk-lib/aws-iam";
import type { Construct } from "constructs";

export class MyStack extends Stack {
  constructor(scope: Construct, id: string, props: StackProps = {}) {
    super(scope, id, props);

    const vpc = new Vpc(this, "Vpc", {
      maxAzs: 2,
      natGateways: 1,
      subnetConfiguration: [
        {
          name: "Public",
          subnetType: SubnetType.PUBLIC,
          cidrMask: 18,
        },
        {
          name: "Private",
          subnetType: SubnetType.PRIVATE_WITH_EGRESS,
          cidrMask: 18,
        },
      ],
    });

    const cluster = new Cluster(this, "ManagedInstancesCluster", {
      vpc,
    });

    const instanceRole = new Role(this, "InstanceRole", {
      assumedBy: new ServicePrincipal("ec2.amazonaws.com"),
      managedPolicies: [
        ManagedPolicy.fromAwsManagedPolicyName("AmazonECSInstanceRolePolicyForManagedInstances"),
      ],
    });

    const instanceProfile = new InstanceProfile(this, "InstanceProfile", {
      role: instanceRole,
    });

    const miCapacityProvider = new ManagedInstancesCapacityProvider(this, "MICapacityProvider", {
      ec2InstanceProfile: instanceProfile,
      subnets: vpc.privateSubnets,
      instanceRequirements: {
        vCpuCountMin: 1,
        memoryMin: Size.gibibytes(2),
        cpuManufacturers: [CpuManufacturer.AMD],
      },
    });

    cluster.addManagedInstancesCapacityProvider(miCapacityProvider);
  }
}

Deployment and the Inevitable Error

To deploy the above stack using the CDK CLI, execute the following command:

cdk deploy

Upon execution, CloudFormation will attempt to create the capacity provider, but the deployment will fail with the following error message:

1:35:46 AM | CREATE_FAILED        | AWS::ECS::CapacityProvider            | MICapacityProviderC44A5890
Resource handler returned message: "Invalid request provided: CreateCapacityProvider error: Managed Instances capacity provider must specify a Network Configuration that contain security groups (Service: Ecs, Status Code: 400, Request ID: 22245e8f-d8d8-4874-af93-2f3c13716ef2) (SDK Attempt Count: 1)" (RequestToken: fa766a71-1b28-7fe6-e26b-9d6025f645ef, HandlerErrorCode: InvalidRequest)

❌  cdk-aws-ecs-managed-instance-dev failed: ToolkitError: The stack named cdk-aws-ecs-managed-instance-dev failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Resource handler returned message: "Invalid request provided: CreateCapacityProvider error: Managed Instances capacity provider must specify a Network Configuration that contain security groups (Service: Ecs, Status Code: 400, Request ID: 22245e8f-d8d8-4874-af93-2f3c13716ef2) (SDK Attempt Count: 1)" (RequestToken: fa766a71-1b28-7fe6-e26b-9d6025f645ef, HandlerErrorCode: InvalidRequest)

This error message unequivocally demonstrates that the ECS API requires the SecurityGroups property, despite the CloudFormation documentation indicating otherwise.
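
Because the failure only surfaces once CloudFormation calls the ECS API, it can help to catch the omission before deploying. The following is a minimal, dependency-free sketch of such a check against a synthesized template that has been loaded as a plain object. The function and type names are hypothetical. The property path ManagedInstancesProvider.InstanceLaunchTemplate.NetworkConfiguration.SecurityGroups is an assumption about the resource layout and should be verified against your own synthesized template. Treating an empty list the same as a missing one is also an assumption.

```typescript
// Minimal structural types for the parts of a synthesized template this check reads.
type CfnResourceEntry = { Type: string; Properties?: Record<string, unknown> };
type CfnTemplate = { Resources?: Record<string, CfnResourceEntry> };

// Walk a list of keys through nested plain objects; returns undefined if any step is absent.
function getPath(root: unknown, path: string[]): unknown {
  let current: unknown = root;
  for (const key of path) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Hypothetical helper: returns the logical IDs of Managed Instances capacity
// providers whose network configuration has no (or an empty) SecurityGroups list.
function findCapacityProvidersMissingSecurityGroups(template: CfnTemplate): string[] {
  const resources = template.Resources ?? {};
  const missing: string[] = [];
  for (const logicalId of Object.keys(resources)) {
    const resource = resources[logicalId];
    if (resource.Type !== "AWS::ECS::CapacityProvider") continue;
    const managed = getPath(resource.Properties, ["ManagedInstancesProvider"]);
    if (managed === undefined) continue; // not a Managed Instances capacity provider
    // Assumed property path; check it against your synthesized template.
    const groups = getPath(managed, [
      "InstanceLaunchTemplate",
      "NetworkConfiguration",
      "SecurityGroups",
    ]);
    if (!Array.isArray(groups) || groups.length === 0) missing.push(logicalId);
  }
  return missing;
}

// Example: a fragment shaped like the failing stack (subnet IDs are placeholders).
const exampleTemplate: CfnTemplate = {
  Resources: {
    MICapacityProviderC44A5890: {
      Type: "AWS::ECS::CapacityProvider",
      Properties: {
        ManagedInstancesProvider: {
          InstanceLaunchTemplate: {
            NetworkConfiguration: { Subnets: ["subnet-aaaa", "subnet-bbbb"] },
          },
        },
      },
    },
  },
};
const offenders = findCapacityProvidersMissingSecurityGroups(exampleTemplate);
// offenders is ["MICapacityProviderC44A5890"]
```

A check like this could run in CI against the output of cdk synth, failing the build when the returned list is not empty.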

Impact Across Multiple Frameworks

This documentation error does not affect only CloudFormation users. It also reaches users of other infrastructure-as-code tools such as AWS CDK, AWS SAM, and Terraform. AWS CDK and AWS SAM build directly on CloudFormation resource definitions, and tool authors often consult the CloudFormation documentation when designing their abstractions. As a result, the incorrect "Required: No" marking can lead to constructs and modules that treat security groups as optional and then fail at deployment.

  • AWS CDK Users: The L2 ManagedInstancesCapacityProvider construct in AWS CDK does not expose the securityGroups property, primarily because the documentation indicates that it is optional. This forces users to resort to L1 constructs or escape hatches to manually configure the security groups.
  • AWS SAM Users: Similarly, AWS SAM users face challenges when creating ECS Managed Instances capacity providers due to the documentation discrepancy.
  • Terraform Users: Terraform users may hit the same validation error if a configuration omits security groups, because the requirement is enforced by the ECS API itself rather than by CloudFormation.

Proposed Solutions

To resolve this issue, one of the following actions must be taken:

Option 1: Correct the Documentation (Preferred)

The most straightforward solution is to update the CloudFormation documentation to accurately reflect the ECS API's behavior. The documentation should be modified to state that the SecurityGroups property in the ManagedInstancesNetworkConfiguration is "Required: Yes". This will align the documentation with the actual API validation rules and prevent further confusion among users.

Option 2: Adjust the API Validation

Alternatively, the ECS API could be updated to make the SecurityGroups property truly optional, as the documentation suggests. This would involve modifying the API to either accept capacity provider creation without security groups or to automatically assign a default security group if none is specified. However, given the current validation behavior, correcting the documentation appears to be the more practical and immediate solution.

Root Cause Analysis

The root cause of this issue stems from a disconnect between the CloudFormation documentation and the ECS API implementation. It is likely that either the documentation was not updated to reflect a change in the API's validation rules, or the API's validation logic was not aligned with the intended behavior described in the documentation. Based on the error message and the API's explicit validation for security groups, it is more probable that the documentation is incorrect.

Corrective Actions

To address this issue, the following steps should be taken:

  1. Update the CloudFormation documentation for AWS::ECS::CapacityProvider at https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-properties-ecs-capacityprovider-managedinstancesnetworkconfiguration.html.

    Change:

    SecurityGroups:
      Required: No  ❌ (Incorrect)
    

    To:

    SecurityGroups:
      Required: Yes  ✅ (Matches actual API behavior)
    
  2. Notify the AWS CDK team to update the ManagedInstancesCapacityProvider L2 construct to expose the securityGroups property once the documentation is corrected. This will enable CDK users to easily configure security groups for their capacity providers.

Workarounds

Until the documentation is updated and the CDK construct is modified, users can employ the following workarounds:

  • Use L1 Constructs: In CDK, use the CfnCapacityProvider L1 construct and explicitly specify the security groups. In a plain CloudFormation template, set SecurityGroups directly on the AWS::ECS::CapacityProvider resource.
  • Use Escape Hatches: Employ escape hatches in CDK to manually add the security groups configuration to the ManagedInstancesCapacityProvider.
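
The escape-hatch workaround can be sketched roughly as follows. To keep the example runnable without aws-cdk-lib installed, it declares small local interfaces that mirror the members it relies on: node.findAll() on constructs, and cfnResourceType and addPropertyOverride() on L1 resources. The helper name attachSecurityGroups is hypothetical, and the property path is an assumption that should be checked against the synthesized template for your CDK version. This is an illustration of the approach, not the CDK team's recommended fix.

```typescript
// Local structural stand-ins so the sketch does not depend on aws-cdk-lib.
// Real CDK constructs and CfnResource objects are expected to satisfy these shapes.
interface OverridableResource {
  cfnResourceType: string;
  addPropertyOverride(propertyPath: string, value: unknown): void;
}
interface ConstructLike {
  node: { findAll(): unknown[] };
}
interface SecurityGroupLike {
  securityGroupId: string;
}

// Assumed CloudFormation property path for the Managed Instances network configuration.
const SECURITY_GROUPS_PATH =
  "ManagedInstancesProvider.InstanceLaunchTemplate.NetworkConfiguration.SecurityGroups";

// Type guard: is this child the underlying AWS::ECS::CapacityProvider L1 resource?
function isCapacityProviderResource(candidate: unknown): candidate is OverridableResource {
  if (typeof candidate !== "object" || candidate === null) return false;
  const resource = candidate as Partial<OverridableResource>;
  return (
    resource.cfnResourceType === "AWS::ECS::CapacityProvider" &&
    typeof resource.addPropertyOverride === "function"
  );
}

// Hypothetical helper: find the L1 resource inside the construct tree and
// override its SecurityGroups property with the given security group IDs.
function attachSecurityGroups(
  capacityProvider: ConstructLike,
  securityGroups: SecurityGroupLike[],
): void {
  if (securityGroups.length === 0) {
    throw new Error("At least one security group is required");
  }
  const resource = capacityProvider.node.findAll().filter(isCapacityProviderResource)[0];
  if (resource === undefined) {
    throw new Error("No AWS::ECS::CapacityProvider resource found under this construct");
  }
  resource.addPropertyOverride(
    SECURITY_GROUPS_PATH,
    securityGroups.map((sg) => sg.securityGroupId),
  );
}
```

In the stack shown earlier, the call might look like attachSecurityGroups(miCapacityProvider, [securityGroup]), placed after the capacity provider is created, where securityGroup is a security group you define in the same VPC. Searching the construct tree by resource type, rather than relying on node.defaultChild, avoids assuming how the L2 construct names its inner resource.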

Conclusion

The discrepancy between the CloudFormation documentation and the ECS API regarding the SecurityGroups property in ManagedInstancesNetworkConfiguration poses a significant challenge for users creating ECS Capacity Providers with Managed Instances. By correcting the documentation and updating the AWS CDK construct, AWS can resolve this issue and provide a more consistent and user-friendly experience. In the meantime, users can leverage the workarounds described in this article to mitigate the problem. Addressing this bug is crucial for ensuring that users can effectively deploy and manage their ECS infrastructure using CloudFormation and related frameworks.