Python Client for Kubernetes

For reasons I’ll divulge in a future post, we needed a Python client to interact with Kubernetes. Our latest and greatest work is going to rely pretty heavily on it, and we’ve had difficulty finding one that is fully functional.

SPOILER: Go to the bottom of the article if you just want the code. 😉

We explored options like libCloud, pykube and even went back to some of the original python-kubernetes clients you would find on PyPI. What we found was they were all either a) out of date, b) still very much in their infancy or c) no longer being contributed to. And we realized sitting around waiting on someone else to build and maintain one just wasn’t going to work.

So with a lot of exploring and a ton of learning (primarily due to my lack of python skillz), we came to realize we could simply generate our own with codegen. You see, Kubernetes uses Swagger for its API, and swagger-codegen allows us to create our own Python client from the Swagger spec.

# on mac install swagger-codegen

brew install swagger-codegen

Download v1.json (the Swagger spec for the v1 API) from the Kubernetes website and run something like:

swagger-codegen generate -l python -o k8sclient -i v1.json

And this was fantastic... until it didn’t work and the build failed.

You see, Kubernetes exposes Swagger spec 1.2 and uses “type”: “any”, which is an undefined custom type that codegen doesn’t know how to handle.

See the GitHub issues referenced here and here for a more detailed explanation.

The end result is that, while custom types are allowed in Swagger spec 1.2, there was no way to document them for codegen to consume. This is fixed in Swagger spec 2.0 with “additionalProperties”, which allows this mapping to occur.

But we still had a problem. We couldn’t easily create a python client from codegen.

So what we have done, right or wrong, is replace everything in the v1.json of

"type": "any"

with

"type": "string"

and it works.
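If you want to make that change yourself, a one-liner does it. This is just a minimal sketch (keep a backup of the original spec):

# back up the spec, then swap every "type": "any" for "type": "string"
cp v1.json v1.json.bak
sed -i 's/"type": "any"/"type": "string"/g' v1.json
# on a Mac, BSD sed wants: sed -i '' 's/"type": "any"/"type": "string"/g' v1.json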

With that, here is a link to the v1.json file with the change.

We also did the same thing for extensions/v1beta1, because we are working on some future endeavors, so here is a link to that as well.

With these v1.json and v1beta1.json files you should be able to create your own Python client for Kubernetes.

Or if you choose, you could just use the clients we created. We intend to keep these clients updated, but if you find we haven’t, feel free to create your own. It’s dead simple.

https://github.com/mward29/python-k8sclient

https://github.com/mward29/python-k8sclient-v1beta1

 

As a final departing note, these Python clients have NOT been fully vetted. We have not run across any issues as of this moment, but if you find an issue before we do, PLEASE be a good Samaritan and let us know.

The beta version, because it’s running against the beta API extensions, may not have everything you would expect in it.

 

Vault in Kubernetes

First off, thanks to Martin for taking this from a POC to a product within Kubernetes.

When it comes to managing secrets inside Kubernetes, Vault is our go-to solution. It is not exposed externally at this time, although we have considered it for external workloads. We are working with it in a couple of areas, including dynamic secrets, and we intend to use it for OTP, SSH, MFA and SSL cert rotation in the near future.

We spin Vault up as a part of our default cluster build, use Consul as its storage backend, automatically unseal the vault and ship the keys off to admins.

Reference Deploying Consul in Kubernetes for more information there.

Let’s start with the Dockerfile. This is a pretty standard Dockerfile. Nothing crazy here.

FROM alpine:latest
MAINTAINER Martin Devlin <martin.devlin@pearson.com>

ENV VAULT_VERSION 0.4.1
ENV VAULT_PORT 7392

COPY config.json /etc/vault/config.json

RUN apk --update add openssl zip \
&& mkdir -p /etc/vault/ssl \
&& wget http://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip \
&& unzip vault_${VAULT_VERSION}_linux_amd64.zip \
&& mv vault /usr/local/bin/ \
&& rm -f vault_${VAULT_VERSION}_linux_amd64.zip

EXPOSE ${VAULT_PORT}

COPY /run.sh /usr/bin/run.sh
RUN chmod +x /usr/bin/run.sh

ENTRYPOINT ["/usr/bin/run.sh"]
CMD []

 

But now let’s take a look at run.sh. This is where the magic happens.

#!/bin/sh


if [ ! -z ${VAULT_SERVICE_PORT} ]; then
  export VAULT_PORT=${VAULT_SERVICE_PORT}
else
  export VAULT_PORT=7392
fi

if [ ! -z ${CONSUL_SERVICE_HOST} ]; then
  export CONSUL_SERVICE_HOST=${CONSUL_SERVICE_HOST}
else
  export CONSUL_SERVICE_HOST="127.0.0.1"
fi

if [ ! -z ${CONSUL_SERVICE_PORT} ]; then
  export CONSUL_PORT=${CONSUL_SERVICE_PORT}
else
  export CONSUL_PORT=8500
fi

openssl req -x509 -newkey rsa:1024 -nodes -keyout /etc/vault/ssl/some-vault-key.key -out /etc/vault/ssl/some-vault-crt.crt -days some_number_of_days -subj "/CN=some-vault-cn-or-other" 

  export VAULT_IP=`hostname -i`

sed -i "s,%%CONSUL_SERVICE_HOST%%,$CONSUL_SERVICE_HOST," /etc/vault/config.json
sed -i "s,%%CONSUL_PORT%%,$CONSUL_PORT,"                 /etc/vault/config.json
sed -i "s,%%VAULT_IP%%,$VAULT_IP,"                       /etc/vault/config.json
sed -i "s,%%VAULT_PORT%%,$VAULT_PORT,"                   /etc/vault/config.json

## Master stuff

master() {

  vault server -config=/etc/vault/config.json $@ &

  if [ ! -f ~/vault_keys.txt ]; then

    export VAULT_SKIP_VERIFY=true
    
    export VAULT_ADDR="https://${VAULT_IP}:${VAULT_PORT}"

    vault init -address=${VAULT_ADDR} > ~/vault_keys.txt

    export VAULT_TOKEN=`grep 'Initial Root Token:' ~/vault_keys.txt | awk '{print $NF}'`
    
    vault unseal `grep 'Key 1:' ~/vault_keys.txt | awk '{print $NF}'`
    vault unseal `grep 'Key 2:' ~/vault_keys.txt | awk '{print $NF}'`
    vault unseal `grep 'Key 3:' ~/vault_keys.txt | awk '{print $NF}'`
    vault unseal `grep 'Key 4:' ~/vault_keys.txt | awk '{print $NF}'`
    vault unseal `grep 'Key 5:' ~/vault_keys.txt | awk '{print $NF}'`
    vault unseal `grep 'Key 6:' ~/vault_keys.txt | awk '{print $NF}'`
    vault unseal `grep 'Key 7:' ~/vault_keys.txt | awk '{print $NF}'`
    vault unseal `grep 'Key 8:' ~/vault_keys.txt | awk '{print $NF}'`
    vault unseal `grep 'Key another_key:' ~/vault_keys.txt | awk '{print $NF}'`

  fi

}

case "$1" in
  master)           master $@;;
  *)                exec vault server -config=/etc/vault/config.json $@;;
esac

### Exec sending keys to admins
exec /tmp/shipit.sh
 

sleep 600

Above we do a few important things:

  1. We use environment variables from within the container to set configs in config.json
  2. We generate an x509 cert
  3. We init and unseal the vault, pulling the unseal keys out of vault_keys.txt with a little grep/awk magic (a quick check of the result is sketched below)
  4. We run shipit.sh to send off the keys and remove the vault_keys.txt file. The shipit script has information on the admins we dynamically created to send keys to.
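Once run.sh has done its thing, a quick sanity check from inside the pod should show the vault as unsealed. A minimal sketch (the pod name is made up, and this assumes the listener ended up on the service port of 8200 as in the Service definition below):

# exec into the vault pod and ask for its seal status (pod name is hypothetical)
kubectl --namespace=your-namespace exec vault-abc12 -- sh -c \
  'VAULT_SKIP_VERIFY=true VAULT_ADDR="https://$(hostname -i):${VAULT_SERVICE_PORT:-8200}" vault status'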

 

Here is what config.json looks like. Nothing major. A basic Vault config.json.

### Vault config

backend "consul" {
 address = "%%CONSUL_SERVICE_HOST%%:%%CONSUL_PORT%%"
 path = "vault"
 advertise_addr = "https://%%VAULT_IP%%:%%VAULT_PORT%%"
}

listener "tcp" {
 address = "%%VAULT_IP%%:%%VAULT_PORT%%"
 tls_key_file = "/etc/vault/ssl/some-key.key"
 tls_cert_file = "/etc/vault/ssl/some-crt.crt"
}

disable_mlock = true

 

Kubernetes Config for Vault. We deploy a service accessible internally to the cluster with proper credentials. And we create a replication controller to ensure a Vault container is always up.

---
apiVersion: v1
kind: Service
metadata:
  name: vault
  namespace: your-namespace
  labels:
    name: vault-svc
spec:
  ports:
    - name: vaultport
      port: 8200
  selector:
    app: vault
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: vault
  namespace: your-namespace
spec:
  replicas: 1
  selector:
    app: vault
  template:
    metadata:
      labels:
        app: vault
    spec:
      containers:
        - name: vault
          image: 'private_repo_url:5000/vault:latest'
          imagePullPolicy: Always
          ports:
            - containerPort: 8200
              name: vaultport

 

Once Vault is up and running, we insert a myriad of policies that Vault uses for its various secret and auth backends. For obvious reasons I won’t be showing those.

 

@devoperandi

 

Note: Some data in code above intentionally changed for security reasons.

 

 

Deploying Consul in Kubernetes

Deploying many distributed clustering technologies in Kubernetes can require some finesse. Not so with Consul. It’s dead simple.

We deploy Consul with Terraform as a part of our Kubernetes cluster deployment strategy. You can read more about it here.

We currently deploy Consul as a 3 node cluster with 2 Kubernetes configuration files. Technically we could narrow it down to one but we tend to keep our service configs separate.

  • consul-svc.yaml – to create a service for other applications to interact with
  • consul.yaml – to create consul servers in a replication controller

When we bring up a new Kubernetes cluster, we push a bunch of files to Amazon S3. Along with those files are the two listed above. The Kubernetes Master pulls these files down from S3 and places them, along with others, in the /etc/kubernetes/addons/ directory. We then execute everything in /etc/kubernetes/addons in a for loop using kubectl create -f.
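That loop is nothing fancy. A minimal sketch of it (the path matches the addons directory above; the rest of our bootstrap is omitted):

# run on the Kubernetes master after the files land in /etc/kubernetes/addons/
for f in /etc/kubernetes/addons/*.yaml ; do
    kubectl create -f "$f"
done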

Let’s take a look at consul-svc.yaml

---
apiVersion: v1
kind: Service
metadata:
  name: svc-consul
  namespace: kube-system
  labels:
    name: consul-svc
spec:
  ports:
    # the port that this service should serve on
    - name: http
      port: 8500
    - name: rpc
      port: 8400
    - name: serflan
      port: 8301
    - name: serfwan
      port: 8302
    - name: server
      port: 8300
    - name: consuldns
      port: 8600
  # label keys and values that must match in order to receive traffic for this service
  selector:
    app: consul

 

Nothing special about consul-svc.yaml. Just a generic service config file.

So what about consul.yaml?

apiVersion: v1
kind: ReplicationController
metadata:
  namespace: kube-system
  name: consul
spec:
  replicas: 3
  selector:
    app: consul
  template:
    metadata:
      labels:
        app: consul
    spec:
      containers:
        - name: consul
          command: [ "/bin/start", "-server", "-bootstrap-expect", "3", "-atlas", "account_user_name/consul", "-atlas-join", "-atlas-token", "%%ATLAS_TOKEN%%" ]
          image: progrium/consul:latest
          imagePullPolicy: Always
          ports:
          - containerPort: 8500
            name: ui-port
          - containerPort: 8400
            name: alt-port
          - containerPort: 53
            name: udp-port
          - containerPort: 443
            name: https-port
          - containerPort: 8080
            name: http-port
          - containerPort: 8301
            name: serflan
          - containerPort: 8302
            name: serfwan
          - containerPort: 8600
            name: consuldns
          - containerPort: 8300
            name: server

 

Of all this code, there is one important line.

 command: [ "/bin/start", "-server", "-bootstrap-expect", "3", "-atlas", "account_user_name/consul", "-atlas-join", "-atlas-token", "%%ATLAS_TOKEN%%" ]

-bootstrap-expect – sets the number of consul servers that need to join before bootstrapping

-atlas – enables atlas integration

-atlas-join – enables auto-join

-atlas-token – sets a token you can get from your Atlas account.

Note: make sure to replace ‘account_user_name’ with your Atlas account user name.

Getting an Atlas account for this purpose is free, so don’t hesitate, but realize that the token you generate should be treated as highly sensitive.

So go sign up for an account. Once done, click on your username in the upper right-hand corner, then click ‘Tokens’.

You’ll see something like this:

[Screenshot: Atlas account ‘Tokens’ page]

Generate a token with a description and use this token in %%ATLAS_TOKEN%% in the consul.yaml above.

In order to populate the token at runtime, we execute a small for loop to perform a sed replace before running ‘kubectl create -f’ on the yamls.

 

for f in ${dest}/*.yaml ; do
    # Consul template
    # Deprecated: we no longer use CONSUL_SERVICE_IP
    # sed -i "s,%%CONSUL_SERVICE_IP%%,${CONSUL_SERVICE_IP}," $f
    sed -i "s,%%ATLAS_TOKEN%%,${ATLAS_TOKEN}," $f
done

 

The variables listed above are derived from a “template_file” resource in Terraform.

Note: I’ve left out volume mounts from the config. Needless to say, we create volume mounts to a default location and back those up to S3.

Good Luck and let me know if you have questions.

 

@devoperandi

Upgraded Nginx-controller for Kubernetes

From my friend Simas. The ever reclusive genius behind the curtains. I’m beginning to feel like I might be repeating myself quite often if he keeps up this pace. I might also have to get my own ass to work so I have something to show.

For those that don’t know, the nginx-controller is basically an alpha external load balancer for Kubernetes that listens on a specified port(s) and routes traffic to applications in Kubernetes. You can read more about that in my post Load Balancing in Kubernetes.

So we’ve come out with an update to the original nginx-controller located (here).

The original nginx-controller was an excellent start so we chose to push the ball forward a little bit. Please understand this has not been thoroughly tested outside our team but we would love your feedback so feel free to have at it.

Here is pretty much the entirety of the code that was added. It creates a map and iterates over it to populate the nginx.conf file.

		knownHosts := make(map[string]extensions.Ingress)
		// loop over all Ingresses and dedupe by host, keeping the oldest entry for each host
		for _, item := range ingresses.Items {
			for _, rule := range item.Spec.Rules {
				if v, ok := knownHosts[rule.Host]; ok {
					knownTS := v.ObjectMeta.CreationTimestamp
					newTS := item.ObjectMeta.CreationTimestamp
					if newTS.Before(knownTS) {
						knownHosts[rule.Host] = item
					}
				} else {
					knownHosts[rule.Host] = item
				}
			}
		}

 

Here is the link to the GitHub repo where the code is located. Feel free to try it out.

 

 

 

How we do builds in Kubernetes

First off. All credit for this goes to my friend Simas. I’m simply relaying what he has accomplished because it would be a shame if others didn’t benefit from his expertise. He is truly talented in this space and provides simple yet elegant designs that just work.

Coming into my current position we have 400+ development teams. Virtually all of which are managing their own build pipelines. This requires significant time and effort to manage, develop and automate. Each team designates their own developer on a rotating basis, or worse, completely dedicates a dev to make sure the build process goes smoothly.

What we found when looking across these teams was they were all basically doing the same thing. Sometimes using a different build server, automating with a different scripting language or running in a different code repo, but all in all, it’s the same basic process with the same basic principles. And because we have so many dev teams, we were bound to run into enough teams developing in a particular language that it would make sense to standardize their process so multiple teams could take advantage of it. Combine this with the power of Docker images and we have a win/win situation.

So let me define what I mean by “build process” just so we can narrow the scope a bit. Build process – The process of building application(s) code using a common build platform. This is our first step in a complete CI/CD workflow.

So why haven’t we finished it already? Along with the Dev teams we have quite a few other engineering teams involved including QA/Performance/CISO etc etc and we haven’t finished laying out how all those teams will work together in the pipeline.

We have questions like:

Do QA/Perf/Security engineers all have access to multiple Kubernetes namespaces, or do they have their own project area and provide a set of endpoints and services which can be called to utilize their capabilities?

Do we mock cross-functional services in each namespace or provide endpoints to be accessed from anywhere?

What about continuous system/integration testing?

Continuous performance testing? How do we do this without adversely affecting our dev efforts?

Those are just a few of the questions we are working through. We have tons of them. Needless to say, we started with the build process.

We create Docker images for each and every Java/NodeJS/Go/Ruby/language_of_the_month our developers choose. These images are very much standardized, allowing for built-in, centrally managed, monitored, secure containers that deploy in very short periods of time. The only deltas are the packages for the actual application. We build those as deb packages and standardize the install process, directory locations, version per language type etc etc.

Dev teams get their own namespace in Kubernetes. In fact, in most cases they get three: Dev, Stage and Prod. For the purpose of this conversation, every dev team is developing an application stack which could consist of one to many microservices. Every namespace has its own Hubot and its own Jenkins build server, which is completely vanilla to start with.

See Integrating Hubot and Kubernetes for more info on Hubot.

Each Jenkins build server connects to at least two repositories: a standard Jenkins job repo that consists of all the standardized builds for each language, and the application code repositories for the applications. EVERY Jenkins server connects to the same Jenkins job repo. Jenkins polls each repo for changes every X minutes depending on the requirements of the team. We thought about web hooks to notify Jenkins when a new build is needed but chose to poll from Jenkins instead, primarily because we treat every external resource as if it has gremlins and we didn’t want to deal with firewalls. We’ve been looking at options to replace this but haven’t settled on anything at this point.

[Diagram: Jenkins build servers and their repositories]

 

jenkins job repo –

  1. all the possible standardized build jobs
  2. dockerfiles for building base images – ie java,nodejs,ruby etc etc
  3. metadata on communicating with the local hubot
  4. sets up kubectl for its namespace

application code repo –

  1. Contains application code
  2. Contains a default.json file

default.json is key to the success of the build process.

It has three primary functions:

  1. Informs Jenkins what type of build it should be setup for. Ex. If XYZ team writes code in Java and NodeJS, it tells Jenkins to configure itself for those build types. This way we aren’t configuring every Jenkins server for build artifacts it will never build.
  2. It tells Jenkins meta-data about the application like application name, version, namespace(s) to deploy to, min/max number of containers to deploy, associated kubernetes services etc etc
  3. Provides Jenkins various build commands and artifacts particular to the application

Here is a very simple example of what that default.json might look like.

{
  "namespace": "someproject",
  "application": {
    "name": "sample-application",
    "type": "http_html",
    "version": "3.x.x"
  },
  "build": {
    "system_setup": {
      "buildfacts": [ // Configure the Jenkins server
        "java",
        "nodejs"
      ]
    },
    "build_steps": [
      {
        "shell": "some shell commands"
      },
      {
        "gradle": {
          "useWrapper": true,
          "tasks": "clean build -Ddeployment.target=???"
        }
      }
    ]
  },
  "build_command": "some command to execute the build",
  "artifacts": "target/",
  "services": [
    {
      "name": "sample-service",
      "external_url": "www.sample-service.com",
      "application": "someproject/sample-application",
      "instances": {
        "min": 2,
        "max": 5
      }
    }
  ]
}
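To give an idea of how a standardized job consumes default.json, here is a minimal sketch using jq. The jq usage is an assumption for illustration, not literally what our Jenkins jobs run, and it assumes the real file is valid JSON (the // comment above is only there to annotate the example):

# pull out the namespace and the buildfacts Jenkins should configure itself for
NAMESPACE=$(jq -r '.namespace' default.json)
BUILDFACTS=$(jq -r '.build.system_setup.buildfacts[]' default.json)
echo "configuring Jenkins in namespace ${NAMESPACE} for: ${BUILDFACTS}"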

 

Ok now for a little more complexity:

 

[Diagram: build and deploy workflow]

So what just happened?

1) Dev commits code to application repository

2) Jenkins polls the jenkins build repo and application repositories for changes

3) If there is a new standard build image (say for Java), Jenkins will build the latest version of the application with this image and push the image to the Docker registry with a specialized tag, then notify the Dev team of the change to provide feedback through Hubot.

When there is a version change in the application code repository, Jenkins runs typical local tests, builds a deb package, ships it to the apt repository, then builds a Docker image combining a standardized image from the jenkins build repo with the application’s deb package and pushes the image to the Docker registry (sketched below, after step 7).

4) Deploy application into namespace with preconfigured kubectl client

5) Execute system/integration tests

6) Feedback loop to Dev team through Hubot

7) Rinse and repeat into Staging/Prod on success
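Here is a rough sketch of what steps 3 and 4 boil down to in shell terms. The image name, tag and registry address are made up for illustration (the registry matches the private_repo_url used in the Vault replication controller earlier):

# step 3: combine the standardized base image with the application deb and push it
docker build -t private_repo_url:5000/sample-application:3.0.1-java .
docker push private_repo_url:5000/sample-application:3.0.1-java

# step 4: deploy into the team namespace with the preconfigured kubectl
kubectl --namespace=someproject create -f sample-application-rc.yaml   # file name is hypothetical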

 

Now you are probably thinking: what about all those extra libraries that some applications may need but others do not?

Answer: If it’s not a common library, it goes in the application build.

 

All in all this is a pretty typical workflow. And for the most part you are absolutely correct. So what value do we get by separating the standard/base build images and placing them into their own repository?

  • App Eng develops standard images for each language and bakes in security/compliance/regulatory concerns
  • Separation of concerns – Devs write code, System/App eng handles the rest including automated feedback loops
  • Security Guarantee – Baked in security, compliance and regulatory requirements ensuring consistency across the platform
  • Devs spend more time doing what they do best
  • Economies of scale – Now we can have a few people creating/managing images while maintaining a distributed build platform
  • Scalable build process – Every Dev team has their own Jenkins without the overhead associated with managing it
  • Jenkins servers can be upgraded, replaced, redeployed, refactored, screwed up, thrown out, crapped on and we can be back to a running state in a matter of minutes. WOOHOO Jenkins is now cattle.
  • Standardized containers means less time spent troubleshooting
  • Less chance of unrecognized security concerns across the landscape
  • Accelerated time to market with even less risk

 

Lets be realistic, there are always benefits and limitations to anything and this design is not the exception.

Here are some difficulties SO FAR:

  • Process challenges in adjusting to change
  • Devs can’t run whatever version for a given language they want
  • Devs could be prevented from taking advantage of new features in the latest versions of say Java IF the App Eng team can’t keep up

 

Worth Mentioning:

  • Both Devs and App Eng don’t have direct access to Jenkins servers
  • Because direct access is discouraged, exceptional logging combined with exceptional analytics is an absolute must

 

Ok, so if you made it this far, I’m either a damn good writer, you’re seriously interested in what I have to say or you’re totally crazy about build pipelines. Somehow I don’t think it’s option 1. Cheers

 

@devoperandi

Load Balancing in Kubernetes

There are two different types of load balancing in Kubernetes. I’m going to label them internal and external.

Internal – aka a “service” – is load balancing across containers of the same type using a label. These services generally expose an internal cluster IP and port(s) that can be referenced internally as an environment variable in each pod.

Ex. Three instances of the same application are running across multiple nodes in a cluster. A service can load balance between these containers with a single endpoint, allowing for container failures and even node failures within the cluster while preserving accessibility of the application.
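For the internal case there is very little to write. A minimal sketch, assuming a replication controller named my-app already exists (the names and ports here are hypothetical):

# create a service that load balances across every pod behind the my-app RC
kubectl expose rc my-app --port=80 --target-port=8080 --name=my-app-svc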

 

External

Services can also act as external load balancers if you wish through a NodePort or LoadBalancer type.

NodePort will expose a high-numbered port externally on every node in the cluster, by default somewhere between 30000-32767. When scaling this up to 100 or more nodes, it becomes less than stellar. It’s also not great because who hits an application over high-numbered ports like this? So now you need another external load balancer to do the port translation for you. Not optimal.

LoadBalancer helps with this somewhat by creating an external load balancer for you if running Kubernetes in GCE, AWS or another supported cloud provider. The pods get exposed on a high range external port and the load balancer routes directly to the pods. This bypasses the concept of a service in Kubernetes, still requires high range ports to be exposed, allows for no segregation of duties, requires all nodes in the cluster to be externally routable (at minimum) and will end up causing real issues if you have more than X number of applications to expose where X is the range created for this task.

Because services were not the long-term answer for external routing, some contributors came out with Ingress and Ingress Controllers. This, in my mind, is the future of external load balancing in Kubernetes. It removes most, if not all, of the issues with NodePort and LoadBalancer, is quite scalable and utilizes some technologies we already know and love like HAProxy, Nginx or Vulcan. So let’s take a high-level look at what this thing does.

Ingress – Collection of rules to reach cluster services.

Ingress Controller – an HAProxy, Vulcan or Nginx pod that listens to the /ingresses endpoint to update itself and acts as a load balancer for Ingresses. It also listens on its assigned port for external requests.

[Diagram: Ingress Controller listening on :443 in front of Ingresses, Services and pods]

 

In the diagram above we have an Ingress Controller listening on :443, consisting of an nginx pod. This pod watches the Kubernetes master for newly created Ingresses. It then parses each Ingress and creates a backend for each one in nginx. Nginx –> Ingress –> Service –> application pod.

With this combination we get the benefits of a full-fledged load balancer listening on normal ports, with traffic routing that is fully automated.

Creating new Ingresses is quite simple. You’ll notice this is a beta extension. It will be GA pretty soon.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress
spec:
  rules:
  - host: ex.domain.io
    http:
      paths:
      - path: /
        backend:
          serviceName: example
          servicePort: 443

Creating the Ingress Controller is also quite easy.

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-ingress
  labels:
    app: nginx-ingress
spec:
  replicas: 1
  selector:
    app: nginx-ingress
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      containers:
      - image: gcr.io/google_containers/nginx-ingress:0.1
        imagePullPolicy: Always
        name: nginx
        ports:
        - containerPort: 80
          hostPort: 80

Here is an ingress controller for nginx. I would use this as a template from which to create your own. The default pod is in its infancy and doesn’t handle multiple backends very well. It’s written in Go, but you could quite easily write this in whatever language you want. It’s a pretty simple little program.

For more information, here is the link to Ingress Controllers in the Kubernetes project.

 

Kubernetes/Terraform – Multiple Availability Zone deployments

While some may disagree, personally I think Kubernetes is becoming the de facto standard for anyone wishing to orchestrate containers in wide-scale deployments. It has good API support, is under active development, is backed by various large companies, is completely open-source, is quite scalable for most workloads and has a pretty good feature set for an initial release.

Having used it for some time now, I’ll be the first to admit it has some shortcomings. Things like external load balancing (Ingresses), autoscaling and persistent storage for things like databases would be pretty good ones to point out. But these also happen to be under active development, and I expect they will have viable solutions in the near future.

Load Balancing – Ingresses and Ingress Controllers

Autoscaling – built into Kubernetes soon

Persistent Storage – Kubernetes-Ceph/Flocker

So what about multi-AZ deployments?

We deploy/manage Kubernetes using Terraform (HashiCorp). Even though Terraform isn’t technically production-ready yet, I expect it to fill a great role in our future deployments, so we were willing to accept its relative immaturity for its ever-expanding feature set.

You can read more about Terraform here. We are using several of their products including Vault and Consul.

Terraform:

  1. Creates a VPC, subnets, routes, security groups, ACLs, Elastic IPs and a NAT machine
  2. Creates IAM users
  3. Creates a public and private hosted zone in route53 and adds dns entries
  4. Pushes data up to an AWS S3 bucket with dynamically generated files from Terraform
  5. Deploys autoscaling groups, launch configurations for master and minions
  6. Sets up an ELB for the Kubernetes Master
  7. Deploys the servers with user-data

 

Create the VPC

resource "aws_vpc" "example" {
    cidr_block = "${var.vpc_cidr}"
    instance_tenancy = "dedicated"
    enable_dns_support  = true
    enable_dns_hostnames  = true

    tags {
        Name = "example-${var.environment}"
    }
}

resource "aws_internet_gateway" "default" {
    vpc_id = "${aws_vpc.example.id}"
}

 

Create Route53 DNS records

resource "aws_route53_record" "kubernetes-master" {
	# domain.io zone id
	zone_id = "${var.zone_id}"
	# Have to limit wildcard to one *
	name    = "master.${var.environment}.domain.io"
	type    = "A"

	alias {
	name = "${aws_elb.kube-master.dns_name}"
	zone_id = "${aws_elb.kube-master.zone_id}"
	evaluate_target_health = true
	}
}

resource "aws_route53_zone" "vpc_zone" {
  name   = "${var.environment}.kube"
  vpc_id = "${aws_vpc.example.id}"
}

resource "aws_route53_record" "kubernetes-master-vpc" {
  zone_id = "${aws_route53_zone.vpc_zone.zone_id}"
  name    = "master.${var.environment}.kube"
  type    = "A"

	alias {
	name = "${aws_elb.kube-master.dns_name}"
	zone_id = "${aws_elb.kube-master.zone_id}"
	evaluate_target_health = true
	}
}

Create Subnets – Example of public subnet

/*
  Public Subnet
*/
resource "aws_subnet" "us-east-1c-public" {
	vpc_id            = "${aws_vpc.example.id}"
	cidr_block        = "${var.public_subnet_cidr_c}"
	availability_zone = "${var.availability_zone_c}"

	tags {
		Name        = "public-subnet-${var.environment}-${var.availability_zone_c}"
		Environment = "${var.environment}"
	}
}

resource "aws_subnet" "us-east-1a-public" {
	vpc_id            = "${aws_vpc.example.id}"
	cidr_block        = "${var.public_subnet_cidr_a}"
	availability_zone = "${var.availability_zone_a}"

	tags {
		Name        = "public-subnet-${var.environment}-${var.availability_zone_a}"
		Environment = "${var.environment}"
	}
}

resource "aws_subnet" "us-east-1b-public" {
	vpc_id            = "${aws_vpc.example.id}"
	cidr_block        = "${var.public_subnet_cidr_b}"
	availability_zone = "${var.availability_zone_b}"

	tags {
		Name        = "public-subnet-${var.environment}-${var.availability_zone_b}"
		Environment = "${var.environment}"
	}
}

resource "aws_route_table" "us-east-1c-public" {
	vpc_id = "${aws_vpc.example.id}"

	route {
		cidr_block = "0.0.0.0/0"
		gateway_id = "${aws_internet_gateway.default.id}"
	}

	tags {
		Name        = "public-subnet-${var.environment}-${var.availability_zone_c}"
		Environment = "${var.environment}"
	}
}
resource "aws_route_table" "us-east-1a-public" {
	vpc_id = "${aws_vpc.example.id}"

	route {
		cidr_block = "0.0.0.0/0"
		gateway_id = "${aws_internet_gateway.default.id}"
	}

	tags {
		Name        = "public-subnet-${var.environment}-${var.availability_zone_a}"
		Environment = "${var.environment}"
	}
}
resource "aws_route_table" "us-east-1b-public" {
	vpc_id = "${aws_vpc.example.id}"

	route {
		cidr_block = "0.0.0.0/0"
		gateway_id = "${aws_internet_gateway.default.id}"
	}

	tags {
		Name        = "public-subnet-${var.environment}-${var.availability_zone_b}"
		Environment = "${var.environment}"
	}
}

resource "aws_route_table_association" "us-east-1c-public" {
	subnet_id      = "${aws_subnet.us-east-1c-public.id}"
	route_table_id = "${aws_route_table.us-east-1c-public.id}"
}
resource "aws_route_table_association" "us-east-1a-public" {
	subnet_id      = "${aws_subnet.us-east-1a-public.id}"
	route_table_id = "${aws_route_table.us-east-1a-public.id}"
}
resource "aws_route_table_association" "us-east-1b-public" {
	subnet_id      = "${aws_subnet.us-east-1b-public.id}"
	route_table_id = "${aws_route_table.us-east-1b-public.id}"
}


Create Security Groups – Notice how the ingress is only open to the VPC CIDR block

/*
  Kubernetes SG
*/
resource "aws_security_group" "kubernetes_sg" {
    name = "kubernetes_sg"
    description = "Allow traffic to pass over any port internal to the VPC"

    ingress {
        from_port = 0
        to_port = 65535
        protocol = "tcp"
        cidr_blocks = ["${var.vpc_cidr}"]
    }

    egress {
        from_port = 0
        to_port = 65535
        protocol = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
    }

    ingress {
        from_port = 0
        to_port = 65535
        protocol = "udp"
        cidr_blocks = ["${var.vpc_cidr}"]
    }

    egress {
        from_port = 0
        to_port = 65535
        protocol = "udp"
        cidr_blocks = ["0.0.0.0/0"]
    }

    vpc_id = "${aws_vpc.example.id}"

    tags {
        Name = "kubernetes-${var.environment}"
    }
}

Create the S3 bucket and add data

resource "aws_s3_bucket" "s3bucket" {
    bucket = "kubernetes-example-${var.environment}"
    acl = "private"
    force_destroy = true

    tags {
        Name = "kubernetes-example-${var.environment}"
        Environment = "${var.environment}"
    }
}

You’ll notice below an example of a file pushed to S3. We add depends_on aws_s3_bucket because, without it, Terraform will attempt to add files to the S3 bucket before the bucket is created. (To be fixed soon, according to HashiCorp.)

resource "aws_s3_bucket_object" "setupetcdsh" {
    bucket = "kubernetes-example-${var.environment}"
    key = "scripts/setup_etcd.sh"
    source = "scripts/setup_etcd.sh"
    depends_on = ["aws_s3_bucket.s3bucket"]
}

 

We distribute the cluster across 3 AZs, with 2 subnets (1 public, 1 private) per AZ. We allow internet access to the cluster through Load Balancer minions and the Kubernetes Master only. This reduces our exposure while maintaining scalability of load throughout.

[Diagram: cluster layout across three availability zones]

Load Balancer minions are just Kubernetes minions with a label of role=loadbalancer that we have chosen to deploy into the DMZ so they have exposure to the internet. They are also in an AutoScaling Group. We have added enough logic into the creation of these minions for them to assign themselves a pre-designated, Terraform created Elastic IP. We do this because we have A records pointing to these Elastic IPs in a public DNS zone and we don’t want to worry about DNS propagation.
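The Elastic IP self-assignment is a couple of AWS CLI calls in user-data. A minimal sketch (the allocation id is made up; in reality it comes out of Terraform):

# grab this instance's id from metadata, then attach the pre-created EIP to it
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 associate-address --region us-east-1 --instance-id ${INSTANCE_ID} --allocation-id eipalloc-00000000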

In order to get Kubernetes into multiple availability zones we had to figure out what to do with etcd, Kubernetes’ k/v store. Many people are attempting to distribute etcd across AZs along with everything else, but we found ourselves questioning the benefit of that. If you have insights into it that we don’t, feel free to comment below. We currently deploy etcd in typical fashion with the Master, the API server, the controller and the scheduler, so there wasn’t much reason to distribute it. If the API or the Master goes down, having etcd around is of little benefit. So we chose to back up etcd on a regular basis and push that out to AWS S3. The etcd files are small, so we can back it up often without incurring any severe penalties. We then deploy our Master into an autoscaling group with a scaling size of min=1 and max=1. When the Master comes up, it automatically attempts to pull in the etcd files from S3 (if available) and starts up its services. This, combined with some deep health checks, allows the autoscaling group to rebuild the master quickly.
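The backup itself is about as simple as it sounds. A minimal sketch, assuming etcd 2.x and the S3 bucket created earlier (paths and schedule are assumptions):

# run periodically (cron) on the master: snapshot etcd and push it to S3
etcdctl backup --data-dir /var/lib/etcd --backup-dir /tmp/etcd-backup
aws s3 sync /tmp/etcd-backup s3://kubernetes-example-${ENVIRONMENT}/etcd-backup/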

We do the same with all our minions. They are created with autoscaling groups and deployed across multiple AZs.

We create a launch configuration that uses an AWS AMI, instance type (size), associates a public IP (for API calls and proxy requests), assigns some security groups, and adds some EBS volumes. Notice the launch configuration calls a user-data file. We utilize user-data heavily to provision the servers.

AWS launch configuration for the Master:

 resource "aws_launch_configuration" "kubernetes-master" {
 image_id = "${var.centos_ami}"
 instance_type = "${var.instance_type}"
 associate_public_ip_address = true
 key_name = "${var.aws_key_name}"
 security_groups = ["${aws_security_group.kubernetes_sg.id}","${aws_security_group.nat.id}"]
 user_data = "${template_file.userdatamaster.rendered}"
 ebs_block_device = {
 device_name = "/dev/xvdf"
 volume_type = "gp2"
 volume_size = 20
 delete_on_termination = true
 }
 ephemeral_block_device = {
 device_name = "/dev/xvdc"
 virtual_name = "ephemeral0"
 }
 ephemeral_block_device = {
 device_name = "/dev/xvde"
 virtual_name = "ephemeral1"
 }
 connection {
 user = "centos"
 agent = true
 }
 }

Then we deploy an autoscaling group that describes the AZs to deploy into, min/max number of servers, health check and the launch configuration above, and adds it to an ELB. We don’t actually use ELBs much in our deployment strategy, but for the Master it made sense.

AutoScaling Group configuration:

resource "aws_autoscaling_group" "kubernetes-master" {
 vpc_zone_identifier = ["${aws_subnet.us-east-1c-public.id}","${aws_subnet.us-east-1b-public.id}","${aws_subnet.us-east-1a-public.id}"]
 name = "kubernetes-master-${var.environment}"
 max_size = 1
 min_size = 1
 health_check_grace_period = 100
 health_check_type = "EC2"
 desired_capacity = 1
 force_delete = false
 launch_configuration = "${aws_launch_configuration.kubernetes-master.name}"
 load_balancers = ["${aws_elb.kube-master.id}"]
 tag {
 key = "Name"
 value = "master.${var.environment}.kube"
 propagate_at_launch = true
 }
 tag {
 key = "Environment"
 value = "${var.environment}"
 propagate_at_launch = true
 }
 depends_on = ["aws_s3_bucket.s3bucket","aws_launch_configuration.kubernetes-master"]
 }

 

I mentioned earlier we use a user-data file to do quite a bit when provisioning a new Kubernetes minion or master. There are four primary things we use this file for:

  1. Polling the AWS API for an initial set of information
  2. Pulling dynamically configured scripts and files from S3 to create Kubernetes
  3. Exporting a list of environment variables for Kubernetes to use
  4. Creating an internal DNS record in Route53.

 

We poll the AWS API for the following:

Notice how we poll for the Master IP address which is then used for a minion to join the cluster.

MASTER_IP=`aws ec2 describe-instances --region=us-east-1 --filters "Name=tag-value,Values=master.${ENVIRONMENT}.kube" "Name=instance-state-code,Values=16" | jq '.Reservations[].Instances[].PrivateIpAddress'`
PUBLIC_IP=`curl http://169.254.169.254/latest/meta-data/public-ipv4`
PRIVATE_IP=`curl http://169.254.169.254/latest/meta-data/local-ipv4`
INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`
AVAIL_ZONE=`curl http://169.254.169.254/latest/meta-data/placement/availability-zone`

 

List of environment variables to export:

MASTER_IP=$MASTER_IP
PRIVATE_IP=$PRIVATE_IP
#required for minions to join the cluster
API_ENDPOINT=https://$MASTER_IP
ENVIRONMENT=${ENVIRONMENT}
#etcd env config 
LISTEN_PEER_URLS=http://localhost:2380
LISTEN_CLIENT_URLS=http://0.0.0.0:2379
ADVERTISE_CLIENT_URLS=http://$PRIVATE_IP:2379
AVAIL_ZONE=$AVAIL_ZONE
#version of Kubernetes to install pulled from Terraform variable
KUBERNETES_VERSION=${KUBERNETES_VERSION}
KUBERNETES_RELEASE=${KUBERNETES_RELEASE}
INSTANCE_ID=$INSTANCE_ID
#zoneid for route53 record retrieved from Terraform
ZONE_ID=${ZONE_ID}

When an AWS server starts up it runs its user-data file with the above preconfigured information.
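The Route53 piece of the user-data is a single UPSERT call. A minimal sketch using the variables above (the record name is just an example):

# register this node's private IP in the internal zone created by Terraform
aws route53 change-resource-record-sets --hosted-zone-id ${ZONE_ID} \
  --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"master.${ENVIRONMENT}.kube\",\"Type\":\"A\",\"TTL\":300,\"ResourceRecords\":[{\"Value\":\"${PRIVATE_IP}\"}]}}]}"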

We deploy Kubernetes using a base CentOS AMI that has been stripped down, with Docker and the AWS CLI installed.

The server then pulls down the Kubernetes files specific to its cluster and role.

aws s3 cp --recursive s3://kubernetes-example-${ENVIRONMENT} /tmp/

It then runs a series of scripts much like what k8s.io runs. These scripts set the server up based on the config variables listed above.

 

Currently we label our Kubernetes minions to guarantee containers are distributed across multiple AZs, but the Kubernetes project has some work currently in progress that will allow minions to be AZ-aware.
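Labeling happens when the minion joins. A minimal sketch of the idea (the label key and value are examples, not necessarily what we use):

# tag the minion with the AZ it landed in, pulled from instance metadata
AVAIL_ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
kubectl label node $(hostname -f) zone=${AVAIL_ZONE}

Pods that need to land in (or spread across) particular zones can then target that label with a nodeSelector.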

 

 

UPDATE: The Ubernetes team has an active working group for their vision of multi-AZ. You can read up on that here and see their meeting notes here. Once complete, I expect we’ll move in that direction as well.