I modified the Docker for AWS CloudFormation template to use the AMI at https://aws.amazon.com/marketplace/pp/Amazon-Web-Services-Deep-Learning-AMI-Ubuntu-1604/B077GCH38C, which ships with NVIDIA Docker, and changed the instance type to g3.4xlarge. I made a number of other tweaks as well.
When I create the stack, I can SSH into an instance, and Docker swarm is initialized and can see all the nodes. There are no errors in the logs. However, the EC2 instances are periodically shut down, and the system logs of the terminated instances contain nothing informative.
Does anyone have an idea why this might be happening?
Here is my CloudFormation template:
The stack is supposed to create 3 nodes (3 managers, 0 workers). A few minutes after the stack is created, the EC2 instances begin to shut down, and new instances are created in their place and join the swarm. When I SSH into an EC2 instance, I usually have 2-3 minutes before it is shut down.
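Since the terminated instances are replaced automatically, one place that might hold more information than the instance system logs is the scaling activity history, assuming the nodes are managed by an Auto Scaling group (not confirmed here). The following is a minimal sketch using boto3's `describe_scaling_activities`; the group name in the usage comment is a hypothetical placeholder, and the check on the `Description` prefix is an assumption about how termination activities are worded.

```python
from typing import Dict, Iterable, List


def summarize_terminations(activities: Iterable[Dict]) -> List[str]:
    """Return one summary line per termination activity.

    Each activity is a dict shaped like an entry of the "Activities" list
    returned by describe_scaling_activities.
    """
    lines = []
    for act in activities:
        # Assumption: termination activities have a Description that
        # starts with "Terminating" (e.g. "Terminating EC2 instance: ...").
        if not act.get("Description", "").startswith("Terminating"):
            continue
        lines.append(
            f"{act.get('StartTime')} | {act.get('StatusCode')} | {act.get('Cause')}"
        )
    return lines


def fetch_activities(group_name: str, max_records: int = 20) -> List[Dict]:
    """Fetch recent scaling activities for one Auto Scaling group."""
    # Imported lazily so summarize_terminations can be used without boto3.
    import boto3

    client = boto3.client("autoscaling")
    resp = client.describe_scaling_activities(
        AutoScalingGroupName=group_name,
        MaxRecords=max_records,
    )
    return resp["Activities"]


# Usage (the group name below is a hypothetical placeholder):
# for line in summarize_terminations(fetch_activities("my-stack-ManagerAsg-EXAMPLE")):
#     print(line)
```

The `Cause` field of each activity usually states why the group acted (for example, a failed health check), which would narrow down whether the shutdowns come from the group itself or from something running on the instances.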