Kubernetes – Stupid Human mistakes going to prod

So I figured we would have a little fun with this post. It doesn't all have to be highly technical, right?

As with any new platform, application or service, the opportunity to learn what NOT to do is ever present and, when taken in the right light, can be quite hilarious for others to consume. So without further ado, here is our list of what NOT to do when going to production with Kubernetes:

  1. Disaster Recovery testing should probably be a planned event
  2. Don’t let the Lead make real quick ‘minor’ changes
  3. Don’t let anyone else do it either
  4. Kube-dns resources required
  5. Communication is good
  6. ETCD… just leave it alone
  7. Init 0 is not a good command to ‘restart’ a server

 

I think you will recognize that virtually everything here is related to people. We are the biggest disasters waiting to happen. That said, give your people some slack to make mistakes. It can be more fun that way.

 

Disaster Recovery testing should probably be a planned event

This one is probably my favorite for two reasons: I can neither confirm nor deny who did this, and somehow no customers managed to notice.

Back in Oct/Nov of 2015 we had dev teams working on Kubernetes/PaaS/Bitesize, but their applications weren't in production yet. We, however, were treating the environment as if it were production: semi-scheduled upgrades, no changes without the test process, etc. As you probably know by now, our entire deploy process for Kubernetes and the surrounding applications is in Terraform. But this was before we started using remote state in Terraform. So if someone happened to be in the wrong directory AND happened to run a terraform destroy AND THEN typed yes to confirm that's what they wanted, we can confirm that a production (kinda) environment will go down with reckless abandon.

The cool part is we managed to redeploy and restore the databases and applications running on the platform within 30 minutes, and NO ONE NOTICED (or at least they never said anything… maybe they felt bad for us).

Yes yes yes, our customers didn’t have proper monitoring in place just yet.

Needless to say, the particular individual has been harassed endlessly by teammates and will never live it down.

The term “Beer Mike” exists for a reason.

What we learned: "Beer Mike" will pull off some cool shit or blow it up. Middle ground is rare.

Follow up: Our first customer going to production asked us sometime later if we had ever performed DR testing. We were able to honestly say, 'yes'. 😉
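
Remote state fixed the root cause, since the state no longer lives in whatever directory you happen to be standing in. If you want a belt-and-braces guard on top of that, a tiny wrapper along these lines (purely illustrative, not our actual tooling, and it assumes one directory per environment) at least makes a stray destroy ask an awkward question first:

```python
#!/usr/bin/env python3
"""Hypothetical guard around `terraform destroy` -- illustrative only."""
import pathlib
import subprocess
import sys

args = sys.argv[1:]
env = pathlib.Path.cwd().name  # assumes directories are named after environments, e.g. prod/ or dev/

if "destroy" in args:
    answer = input(f"You are in '{env}'. Type the environment name to destroy it: ")
    if answer != env:
        sys.exit("Refusing to run destroy: environment name did not match.")

# Everything else passes straight through to the real terraform binary.
subprocess.run(["terraform", *args], check=True)
```

Alias terraform to the wrapper and a fat-fingered destroy in the wrong directory at least has to be confirmed against the environment it is about to flatten.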

 

Don’t let the Lead make real quick ‘minor’ changes

As most of you know by now, we automate EVERYTHING. But I've been known to make changes and then go back and add them to the automation afterwards, especially in the early days before we were in production, even though we had various development teams using the platform. I made a change to a security group during troubleshooting that allowed two of our components in the platform to communicate with each other. It was a valid change, it needed to happen and it made our customers happy… until we upgraded and the hand-made rule, which lived nowhere in our automation, disappeared along with it.

What we learned: AUTOMATE FIRST.
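
Our automation is Terraform, but just to illustrate the principle, here is roughly what capturing that kind of "real quick" change in code could look like with boto3 (the group IDs and port below are made up). The point is that the rule lives somewhere that gets re-applied on every rebuild, not in a console click that evaporates at the next upgrade:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical rule: allow component B's security group to reach component A on 8080.
# The IDs and port are placeholders; the real version of this belongs in your automation.
ec2.authorize_security_group_ingress(
    GroupId="sg-aaaaaaaa",  # component A
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        "UserIdGroupPairs": [{"GroupId": "sg-bbbbbbbb"}],  # component B
    }],
)
```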

 

Don’t let anyone else do it either

What's worse about this one is that this particular individual is extremely smart, but made a change and then took off for a week. That was fun.

What we learned: don't let anyone else do it either.

 

Kube-dns resources required

This one pretty much speaks for itself, but here are the details. Our kube-dns container was set to best-effort and got a grand total of 50Mi of memory and 1/10th of a CPU to serve internal DNS for our entire platform. Notice I said container (not plural). We also failed to scale it out. So when one of our customers decided to perform a 6500 concurrent (and I mean concurrent) user load test, things were a tad bit slow.

What we learned: scale kube-dns. Running 1/3 to 1/4 as many kube-dns replicas as you have nodes in the cluster is a good idea. At larger scale, above 100 nodes, 1/8 can be enough. These ratios depend heavily on how many services in your environment use kube-dns. Example: Nginx Ingress Controllers rely on it heavily.
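
These days there is a cluster-proportional autoscaler for exactly this, but the idea fits in a few lines. A rough sketch with the Kubernetes Python client, assuming kube-dns runs as a Deployment called kube-dns in kube-system (older clusters ran it as a ReplicationController, so adjust accordingly):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# One kube-dns replica per ~4 nodes, and never fewer than 2 for redundancy.
node_count = len(core.list_node().items)
replicas = max(2, node_count // 4)

apps.patch_namespaced_deployment(
    name="kube-dns",
    namespace="kube-system",
    body={"spec": {"replicas": replicas}},
)
print(f"{node_count} nodes -> {replicas} kube-dns replicas")
```

And while you are in there, give it real CPU and memory requests instead of best-effort scraps.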

 

Communication is good

Establish good communication channels early on with your customers. In our case, the various teams using our platform. We didn't know until it started that there was a 6500 concurrent user load test. Ouch! What's worse is, it was part of a really large perf testing effort and they thought it was only going to be 450 concurrent users.

What we learned: stay close to our customers. Keep in touch.

 

ETCD… just leave it alone

Yes, this is Kubernetes and yes, it's quite resilient, but don't just restore ETCD: plan it out a bit, test that it's going to work, and have others validate that shit.

We had been playing around with scrubbing ETCD data and then dropping it all into a new Kubernetes cluster, which would successfully start up all the namespaces, containers, replication controllers and everything we needed to run. The problem came when we scrubbed the data and restored it back into the same cluster, whose servers already had labels and config values. You see, node labels live in ETCD. Our scrub script would pull all of that out to make the data generic so it could be deployed into another cluster. Do that to an existing cluster instead of bringing up a new one and it wipes all the labels associated with our minions, which meant NONE of the containers would spin up. Fun for all.

What we learned: If you want to migrate shit, pull data from the API and do it that way. Getting fancy with ETCD leads to DUH moments.
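
For the curious, "pull data from the API" looks roughly like this with the Kubernetes Python client. Back then it was replication controllers; Deployments stand in as the example here. The server-owned fields get stripped so the objects can be re-created cleanly in the target cluster:

```python
import yaml
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
serializer = client.ApiClient()

exported = []
for dep in apps.list_deployment_for_all_namespaces().items:
    manifest = serializer.sanitize_for_serialization(dep)
    manifest["apiVersion"] = "apps/v1"
    manifest["kind"] = "Deployment"
    # Drop fields owned by the source cluster before re-applying elsewhere.
    manifest.pop("status", None)
    for field in ("uid", "resourceVersion", "creationTimestamp", "selfLink", "managedFields"):
        manifest.get("metadata", {}).pop(field, None)
    exported.append(manifest)

with open("deployments.yaml", "w") as f:
    yaml.safe_dump_all(exported, f)
```

Apply that file to the new cluster through the API and the scheduler does the rest; node labels stay exactly where they belong.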

 

Init 0 is not a good command to ‘restart’ a server

An anonymous colleague of mine meant to restart a server running an nginx controller during troubleshooting. Init 0 doesn't work so well for that. Fortunately we run everything in ASGs, so the group just spun up another node, but it's still not the smartest thing in the world if you can avoid it.

 

That's it, folks. I'm sure there are more. We'll add to this list as the team's memory becomes less forgetful.

 

@devoperandi

3 thoughts on “Kubernetes – Stupid Human mistakes going to prod”

  1. Some of the following we have been bitten by, and some are just advice that we have implemented.

    1. DNS is an interesting one in Kubernetes. By default only one replica is created; however, scaling up can also have interesting effects. kube-up.sh uses ephemeral storage, and AWS ephemeral storage is relatively small and easily saturated, even on larger instances.

    SkyDNS is backed by etcd, which requires storage, and DNS is a critical piece of infrastructure on Kubernetes. When we saturated disk space, we hit issues that messed up kube-proxy and caused intermittent failures. Applications in the cluster would intermittently take excessive time to resolve DNS.

    2. Logging, don’t run within the cluster.

    3. Influxdb, as above.

    4. Monitoring, applications exposing a healthcheck should have a meaningful healthcheck.

    5. If using ELBs, ensure you use the correct protocol (you get additional metrics at the ELB: 2xx, 4xx, 5xx, backend connection errors, etc.).

    6. Monitor both inside and outside the cluster.

    1. These are great, man. Thanks for sharing your insights. As you know, we intentionally ship logs off to Kafka for #2 in your list.
