This might be a slightly controversial post, but I really enjoy using spot instances with EKS and GKE to reduce costs and keep things a little tidier. To me, you also get some built-in chaos testing when doing this.
The main benefit is cost control. The trade-off is that spot nodes get reclaimed and recycled regularly, but in exchange the cluster costs less to run than it otherwise would. This is great for Dev, QA, and Staging environments that don’t necessarily need guaranteed uptime on their nodes.
According to the AWS documentation, spot instances can save you up to 90% compared to On-Demand pricing, which is a pretty significant chunk of savings for lower-level environments. GKE offers similar savings, but I didn’t find anything that said specifically how much.
Netflix wrote a whole tool, Chaos Monkey, that randomly terminates instances to test resilience. Spot instances give you almost a built-in way of doing the same thing. When a node is reclaimed, its pods have to reschedule onto another node and start from a fresh container. This shows what happens when the service restarts and how easily it is able to resume what it was doing.
This does mean that if you only have a single replica, the entire service will go offline temporarily whenever its node is reclaimed. That can be a useful test in its own right, to confirm that the rest of the system copes and recovers as intended.
Using this in Production
I have done plenty of production deployments on spot instances without problems. If cost is a heavy concern, this can help alleviate it, assuming you do it wisely.
The biggest thing is to ensure that you have some replication: you should never be below 2 replicas of an application. From there, the simple solution is to always ensure that the replicas are scheduled on different hosts:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - app-name
            topologyKey: "kubernetes.io/hostname"
This ensures that the replicas are always scheduled on different nodes, which makes it less likely that they are all taken down at once. GKE goes a little further in its docs and lets you choose whether something is scheduled on a spot instance or not:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: cloud.google.com/gke-spot
                  operator: In
                  values:
                    - "true"
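To show where these two affinity blocks sit in a real manifest, here is a minimal sketch of a Deployment that combines them with 2 replicas. The Deployment name, the `app: app-name` label, and the image are placeholders I made up for illustration, so swap in your own.

```yaml
# Hypothetical Deployment: the name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-name
spec:
  replicas: 2  # never fewer than 2 when running on spot nodes
  selector:
    matchLabels:
      app: app-name
  template:
    metadata:
      labels:
        app: app-name  # must match the podAntiAffinity selector below
    spec:
      affinity:
        # Keep the replicas on different nodes.
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - app-name
              topologyKey: "kubernetes.io/hostname"
        # GKE only: require nodes that carry the Spot label.
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-spot
                    operator: In
                    values:
                      - "true"
      containers:
        - name: app-name
          image: registry.example.com/app-name:1.0.0  # placeholder image
```

One thing to keep in mind: because the anti-affinity rule is a hard requirement, the second replica will sit in Pending if there is only one eligible node. On EKS, my understanding is that managed node groups label their nodes with `eks.amazonaws.com/capacityType` (with the value `SPOT` for spot capacity), so the same nodeAffinity pattern should work there with that key instead, but check the labels on your own nodes before relying on it.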
Overall I will always recommend this for Dev and QA environments. It keeps costs low and helps ensure that the environment more closely matches the higher-level ones. Doing this in production is honestly a scarier thought: it can be tough, and it can cause a lot of issues if you aren’t careful. However, it is a way to keep costs low while still getting the same great Kubernetes service that you need.