An Elixir Migration to Microservices on AWS ECS Fargate, Using Service Discovery to Interconnect Nodes, with CDK

Felipe López · Published in Towards AWS · Aug 29, 2021

Our load increased to the point where we had to scale our umbrella application (Elixir/Phoenix), which was running as two services on a plain EC2 instance.

The migration had the following requirements:

  • Every deployment caused downtime, which restricted us to dedicated “deployment times”, so we wanted deployments without downtime
  • Each service ran cron jobs, so our cron jobs (Quantum) had to keep working in the new architecture. Scaling them was not a priority, so they can run on a single instance of each service
  • Our business logic sends events and processes wait for specific events to come back (node interaction), so the nodes have to be interconnected for the returning events to reach the waiting processes

This post will not show the business logic, but rather a way to deploy multiple Elixir services on ECS Fargate with multiple interconnected instances and a job scheduler that runs on only one instance of each service.
For the sake of simplicity I created a single Elixir application that is used in two different services, each with multiple instances.

You can find the source code at https://github.com/felipeloha/elixir-ecs

To get everything up and running without further explanation, see the README.

The migration

To achieve our requirements we went through the following steps:

  • Configure Elixir to discover multiple nodes with Docker Compose and to run cron jobs on only one instance
  • Create the AWS infrastructure with service discovery
  • Debug Fargate instances and service discovery

Elixir configuration

The following components were installed and configured in the application (Elixir/Phoenix) to meet our requirements:

  • libcluster with one topology for Docker and another one for AWS service discovery
  • Quantum (cron jobs) with a local run strategy, combined with Highlander.
    Highlander ensures that the Quantum scheduler is started in only one of the connected processes; if that process goes down, it is restarted on another node.
    The Quantum local run strategy ensures that the jobs run only on that one node
  • A Dockerfile and a docker-compose file to build the application and run it as two separate services (a minimal sketch of the compose file follows below)
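
The Dockerfile and docker-compose files live in the repository; purely for illustration, a compose file along the following lines would run two services from the same image with the local Gossip topology. Service names, build context and ports here are assumptions, not necessarily the repository's exact values.

# hypothetical docker-compose.yml sketch; see the repository for the real one
version: "3.8"
services:
  krillin:
    build: ./hi                   # build context is an assumption
    environment:
      TOPOLOGY_TYPE: "local"      # selects the Gossip topology
      RELEASE_COOKIE: "my-cookie"
    ports:
      - "4000:4000"
  vegeta:
    build: ./hi
    environment:
      TOPOLOGY_TYPE: "local"
      RELEASE_COOKIE: "my-cookie"
    ports:
      - "4001:4000"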

Scheduler Configuration
Notice the Quantum.RunStrategy.Local run strategy, which ensures that the jobs run on only one instance:

# hi/config/config.exs
config :hi, Hi.Scheduler,
  debug_logging: true,
  global: true,
  timezone: :utc,
  run_strategy: {Quantum.RunStrategy.Local, :cluster},
  overlap: false
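
The job definitions themselves are not shown in this post. Purely as an illustration, a job could be registered by adding a jobs key to the same config, here pointing at a hypothetical Hi.Jobs.log_node/0 helper (sketched in “The Result” section below):

# hypothetical example, not the repository's actual jobs
config :hi, Hi.Scheduler,
  jobs: [
    # every minute; the run strategy above picks a single node to execute it
    {"* * * * *", {Hi.Jobs, :log_node, []}}
  ]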

Scheduler and libcluster Initialization

Notice that:

  • The environment variable TOPOLOGY_TYPE controls whether libcluster uses Gossip or DNS polling to find the other nodes. Gossip is enough for Docker; DNS polling and its configuration are needed for AWS service discovery
  • Our Hi.Scheduler is wrapped in Highlander so that it runs in only one process

# hi/lib/hi/application.ex
def start(_type, _args) do
  # gossip topology for local testing with docker compose
  local_topology = [
    ecs: [
      strategy: Cluster.Strategy.Gossip
    ]
  ]

  # DNS-poll topology for AWS service discovery
  dns_poll_topology = [
    ecs: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        polling_interval: 1000,
        query: System.get_env("SERVICE_DISCOVERY_ENDPOINT", "ecs.local"),
        node_basename: System.get_env("NODE_NAME_QUERY", "node.local")
      ]
    ]
  ]

  topologies =
    case System.get_env("TOPOLOGY_TYPE") do
      "local" -> local_topology
      _ -> dns_poll_topology
    end

  children = [
    {Cluster.Supervisor, [topologies, [name: Hi.ClusterSupervisor]]},
    HiWeb.Telemetry,
    # only one scheduler per cluster
    {Highlander, Hi.Scheduler},
    {Phoenix.PubSub, name: Hi.PubSub},
    HiWeb.Endpoint
  ]

  opts = [strategy: :one_for_one, name: Hi.Supervisor]
  Supervisor.start_link(children, opts)
end

At this point you can test the application with Docker Compose, where you will see that the nodes are connected and the cron job is running in only one of the services.

docker-compose build && docker-compose up

Service Discovery on AWS ECS

The CDK project is at https://github.com/felipeloha/elixir-ecs/tree/main/infrastructure and the instructions to get it running are in the top-level README.

It creates the following resources:

  • An ECR repository to store the Docker image of the service
  • A load balancer
  • A namespace and a discovery service, which are used to link the service instances
  • Two Fargate services (krillin and vegeta), each with two instances/tasks (both based on the same Docker image)
  • Security groups to connect the services and the instances within each service

Most of these resources are created in the nested stack https://github.com/felipeloha/elixir-ecs/blob/main/infrastructure/lib/app-resources.ts

Namespace and discovery service

// lib/app-resources.ts
const namespace = new servicediscovery.PrivateDnsNamespace(this, 'Namespace', {
  name: `${props.prefix}-ns`,
  vpc: props.vpc,
});

const discoveryService = namespace.createService('DiscoveryService', {
  name: `${props.prefix}-discovery-service`,
  dnsRecordType: servicediscovery.DnsRecordType.A,
  dnsTtl: cdk.Duration.seconds(10),
  routingPolicy: servicediscovery.RoutingPolicy.MULTIVALUE,
});

Fargate service

The Fargate service should be created with the following additional properties:

  • The environment variable TOPOLOGY_TYPE is not passed, so libcluster defaults to DNS polling
  • The service discovery endpoint and the node name query must be passed to the application so that DNS polling works: SERVICE_DISCOVERY_ENDPOINT and NODE_NAME_QUERY
  • NODE_NAME is passed to the container as well and works together with the libcluster parameters: it becomes the node's base name (the nodes in the logs below appear as <NODE_NAME>@<task IP>), so it has to match NODE_NAME_QUERY. If you want to connect all nodes across all services, you can simply give all nodes the same name or extend your topology
  • The service is associated with the discovery service via this.service.associateCloudMapService
// lib/app-resources.ts
class Service extends cdk.NestedStack {
  service: ecs.FargateService;

  constructor(scope: cdk.Construct, id: string, props: ServiceProps) {
    super(scope, id, props);
    const containerPort = 4000;

    const fargateTaskDefinition = new ecs.FargateTaskDefinition(this, `${props.prefix}TaskDefinition`, {
      memoryLimitMiB: 512,
      cpu: 256,
    });

    const container = fargateTaskDefinition.addContainer(`${props.prefix}Container`, {
      image: ecs.ContainerImage.fromEcrRepository(
        ecr.Repository.fromRepositoryName(this, `${props.prefix}ECRRepository`, props.repository),
        props.version,
      ),
      logging: props.ecsLogDriver,
      environment: {
        RELEASE_COOKIE: 'my-cookie',
        SERVICE_DISCOVERY_ENDPOINT: `${props.discoveryService.serviceName}.${props.discoveryService.namespace.namespaceName}`,
        // ##### connect only instances within a service
        NODE_NAME: `${props.serviceName}-dbz`,
        NODE_NAME_QUERY: `${props.serviceName}-dbz`,
      },
    });
    container.addPortMappings({ containerPort });

    this.service = new ecs.FargateService(this, 'Service', {
      cluster: props.cluster,
      assignPublicIp: true,
      serviceName: `${props.prefix}-service-${props.serviceName}`,
      propagateTags: ecs.PropagatedTagSource.SERVICE,
      taskDefinition: fargateTaskDefinition,
      desiredCount: 2,
      enableExecuteCommand: true,
      healthCheckGracePeriod: cdk.Duration.seconds(20),
      circuitBreaker: { rollback: true },
    });
    this.service.associateCloudMapService({ service: props.discoveryService });
  }
}

Create services and allow traffic

Creating the services is not enough to cover the requirements; they also have to be able to communicate with each other.

The following snippet creates the services and allows traffic between the instances within each service AND between the instances of the two services. Be careful to open only the ports you actually need (the example below opens all TCP ports for simplicity). If the services don't have to communicate with each other, you can remove the first two “connections”.

// lib/app-resources.ts
// create services
this.vegetaService = new Service(this, 'vegetaService', {
  ...
  serviceName: 'vegeta-service',
  discoveryService: discoveryService,
}).service;
this.krillinService = new Service(this, 'krillinService', {
  ...
  repository: props.repository,
  serviceName: 'krillin-service',
  discoveryService: discoveryService,
}).service;

// register services in the load balancer
const vegetaTargetGroup = listener.addTargets('vegetaTarget', {
  ...
  targets: [this.vegetaService],
  ...
});
const krillinTargetGroup = listener.addTargets('krillinTarget', {
  ...
  conditions: [elbv2.ListenerCondition.pathPatterns(['/krillin', '/krillin/*'])],
  targets: [this.krillinService],
  ...
});

// connect instances of both services
this.krillinService.connections.allowFrom(
  this.vegetaService,
  ec2.Port.allTcp(),
  `${props.prefix} krillin to vegeta`,
);
this.vegetaService.connections.allowFrom(
  this.krillinService,
  ec2.Port.allTcp(),
  `${props.prefix} vegeta to krillin`,
);

// connect instances within services
this.krillinService.connections.allowFrom(
  this.krillinService,
  ec2.Port.allTcp(),
  `${props.prefix} krillin to krillin`,
);
this.vegetaService.connections.allowFrom(
  this.vegetaService,
  ec2.Port.allTcp(),
  `${props.prefix} vegeta to vegeta`,
);

At this point you can deploy the infrastructure. Make sure to build the Docker image and push it to the ECR repository right after the repository is created and before the services are created, for example as sketched below.
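
As a rough sketch of that step (account ID, repository name, version and build context are placeholders/assumptions; the README has the exact commands), building and pushing the image could look like this:

# log in to ECR, then build, tag and push the image
aws ecr get-login-password --region us-east-2 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-2.amazonaws.com
docker build -t hi ./hi
docker tag hi:latest <account-id>.dkr.ecr.us-east-2.amazonaws.com/<repository>:<version>
docker push <account-id>.dkr.ecr.us-east-2.amazonaws.com/<repository>:<version>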

The Result

When everything is running you can monitor your services in the browser and see that each service discovers only its own instances.

The logs (CloudWatch) will also show that the cron job is running on only one instance of each service:

19:00:10.643 [info] Job running on: :"vegeta-service-dbz@172.31.25.17" with cookie :"my-cookie" from possible hosts: [:"vegeta-service-dbz@172.31.11.177"]
19:00:10.259 [info] Job running on: :"krillin-service-dbz@172.31.22.8" with cookie :"my-cookie" from possible hosts: [:"krillin-service-dbz@172.31.6.25"]
19:00:15.643 [info] Job running on: :"vegeta-service-dbz@172.31.25.17" with cookie :"my-cookie" from possible hosts: [:"vegeta-service-dbz@172.31.11.177"]
19:00:15.259 [info] Job running on: :"krillin-service-dbz@172.31.22.8" with cookie :"my-cookie" from possible hosts: [:"krillin-service-dbz@172.31.6.25"]
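
A log line like the ones above could come from a job along these lines (a sketch of a hypothetical Hi.Jobs module, not the repository's actual code):

# hypothetical job module referenced from the scheduler config
defmodule Hi.Jobs do
  require Logger

  def log_node do
    Logger.info(
      "Job running on: #{inspect(Node.self())} " <>
        "with cookie #{inspect(Node.get_cookie())} " <>
        "from possible hosts: #{inspect(Node.list())}"
    )
  end
end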

Bonus: investigating Fargate services and service discovery

If you run into problems anywhere along the way, here are some useful ways to debug them.

Access a Fargate instance
To access a Fargate instance you can use ECS exec (execute-command), which needs to be enabled when creating the service with enableExecuteCommand: true.

Then you can open a bash shell as follows:

aws ecs execute-command  \
--region us-east-2 \
--cluster elixir-ecs-cluster \
--task aed8db0c83cb4f118701925d8e706812 \
--container VegetaContainer \
--command "/bin/bash" \
--interactive

Validating service discovery
In the Fargate instance you can query the registered services with:

dig elixir-ecs-discovery-service.elixir-ecs-ns +short

You might need bind-tools in your Docker image.

Validating that libcluster really gets the discovered IPs and that you can connect to any node
Beware that, due to missing permissions, you might see the IPs but still not be able to connect to the nodes.
In the Fargate instance you can investigate further as follows:

./prod/rel/hi/bin/hi remote

# in the IEx session
"elixir-ecs-discovery-service.elixir-ecs-ns" |> String.to_charlist() |> Cluster.Strategy.DNSPoll.lookup_all_ips()
# or list the connected nodes
Node.list()
# or connect to a specific node
Node.connect(:"vegeta-service-dbz@172.31.11.177")

Some final takeaways:

  • Distributed job execution: Quantum's cluster/swarm support does not work well, and the community either combines it with Highlander as described here or goes for other solutions. See https://elixirforum.com/t/running-quantum-job-from-single-node/33611/16
  • CDK speeds up development a lot with its “patterns”, but sometimes they can also restrict it. We initially tried to use the ApplicationLoadBalancedFargateService pattern and the code ended up a mess, because it took a lot of effort to pass configuration through to the nested resources

I hope this post saves you some time if you are trying to tackle any of these requirements.

If you have any questions, feedback, or ideas for improving this approach, please feel free to contact me.

If you liked this post, please follow me. I would be very grateful ;)
