Kubernetes container guarantees (and oversubscription)

Reading through the release notes of Kubernetes 1.4, I came across some fantastical news. News so good, I should have expected it. News that could not come at a better time. I’m talking about container guarantees, or what Kubernetes calls Resource Quality of Service. Let me be frank here: it’s like the Kubernetes team was just trying to confuse me. I’m sure the rest of you immediately knew what they were talking about, but I’m a simpleton. So after reading it five times, I think I finally got a hold of it.

In a nutshell, when resource min and max values (requests and limits) are set, quality of service dictates container priority when a server is oversubscribed.

Let me say this another way: we can oversubscribe server resources AND decide which containers stay alive and which ones get killed off.

Think of it like the Linux OOM killer, but with more fine-grained control. With the Linux OOM killer, the only lever you have to influence what does or does not get killed off is adjusting oom_score_adj per process. Which, as it turns out, is exactly what Kubernetes is doing under the hood.
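
If you want to see that in action, you can look at the score the kubelet set on a container's process right on the node. A minimal sketch, assuming a container process named mywebapp (the process name and the exact numbers are illustrative and depend on the pod's QoS class):

# On the node, find the container's process and read the OOM score adjustment
# the kubelet assigned to it ("mywebapp" is just an example process name).
PID=$(pgrep -f mywebapp | head -n 1)
cat /proc/${PID}/oom_score_adj
# BestEffort pods sit near the top of the kill list (1000), Guaranteed pods near the
# bottom (a large negative value), and Burstable pods land in between based on their
# memory request.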

Here are the details:

There are 3 levels of priority.

BestEffort – These are the containers Kubernetes will kill off first when under memory pressure.

Guaranteed – Takes top priority over everything else. Kubernetes will try everything to keep these alive.

Burstable – Likely to be killed off once no more BestEffort pods exist and they have exceeded their request amount.

 

And there are two parameters you need to consider.

request – the baseline amount of resources (CPU and memory) a container asks for at runtime.

limit – the upper bound the container is allowed to consume, if those resources are not already in use elsewhere.

Notice how I mentioned memory pressure up above. Under CPU pressure, nothing will be killed off. Containers will simply get throttled instead.
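
If you are curious what that throttling looks like, the container's CPU cgroup keeps counters for it. A rough sketch, assuming a Docker runtime with the default cgroupfs layout (the path differs between runtimes and Kubernetes versions):

# Find the container and read its CFS throttling stats from the CPU cgroup.
# The cgroup path is an assumption for a stock Docker setup; adjust for yours.
CONTAINER_ID=$(docker ps --no-trunc -q -f name=mywebapp)
cat /sys/fs/cgroup/cpu/docker/${CONTAINER_ID}/cpu.stat
# nr_throttled and throttled_time keep climbing while the container is pinned at its CPU limit.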

 

So how do we determine which priority level a container will have?

Guaranteed if request == limit OR only limits are set

which looks like:

containers:
  - name: mywebapp
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi

OR

containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi

Setting requests is optional. When only limits are set, Kubernetes defaults the requests to match the limits, which is why this still counts as Guaranteed.

Burstable if request is less than limit OR one of the containers in the pod has nothing set

containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi

  - name: bar

Now recognize there are two containers above: one has nothing specified, so that container gets BestEffort, which makes the pod as a whole Burstable.

OR

containers:
  - name: foo
    resources:
      limits:
        memory: 1Gi

  - name: bar
    resources:
      limits:
        cpu: 100m

The config above sets two different resources: one container has a memory limit and the other a CPU limit. Thus, once again, Burstable.

 

BestEffort if no resources are defined at all.

containers:
  - name: foo
    resources: {}
  - name: bar
    resources: {}
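
Whichever way you set it up, you can check which class a running pod actually landed in. A quick sketch (the pod name is whatever yours happens to be; the jsonpath form only works on clusters new enough to expose qosClass in the pod status):

# Show the QoS class Kubernetes assigned to the pod ("foo" is an example pod name).
kubectl describe pod foo | grep -i qos
# On newer clusters the class is also in the pod status:
kubectl get pod foo -o jsonpath='{.status.qosClass}'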

 

This is just the tip of the iceberg on container guarantees.

There is a lot more there around cgroups, swap, and compressible vs. incompressible resources.

Head over to the GitHub page to read more.

Cluster Consul using Kubernetes API

Recently we had the desire to cluster Consul (Hashicorp’s K/V store) without calling out to Atlas. We deploy many clusters per day as we are constantly testing, and we want Consul to simply bring itself up without having to reach out over the internet.

So we made a few changes to our Consul setup. Here goes:

Dockerfile:

This is a typical Dockerfile for Consul running on Alpine. Nothing out of the ordinary.

FROM alpine:3.2
MAINTAINER 	Martin Devlin <martin.devlin@pearson.com>

ENV CONSUL_VERSION    0.6.3
ENV CONSUL_HTTP_PORT  8500
ENV CONSUL_HTTPS_PORT 8543
ENV CONSUL_DNS_PORT   53

RUN apk --update add openssl zip curl ca-certificates jq \
&& cat /etc/ssl/certs/*.pem > /etc/ssl/certs/ca-certificates.crt \
&& sed -i -r '/^#.+/d' /etc/ssl/certs/ca-certificates.crt \
&& rm -rf /var/cache/apk/* \
&& mkdir -p /etc/consul/ssl /ui /data \
&& wget http://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip \
&& unzip consul_${CONSUL_VERSION}_linux_amd64.zip \
&& mv consul /bin/ \
&& rm -f consul_${CONSUL_VERSION}_linux_amd64.zip \
&& cd /ui \
&& wget http://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_web_ui.zip \
&& unzip consul_${CONSUL_VERSION}_web_ui.zip \
&& rm -f consul_${CONSUL_VERSION}_web_ui.zip

COPY config.json /etc/consul/config.json

EXPOSE ${CONSUL_HTTP_PORT}
EXPOSE ${CONSUL_HTTPS_PORT}
EXPOSE ${CONSUL_DNS_PORT}

COPY run.sh /usr/bin/run.sh
RUN chmod +x /usr/bin/run.sh

ENTRYPOINT ["/usr/bin/run.sh"]
CMD []
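
Building and publishing it is nothing special either. Something along these lines, with the registry and tag being whatever you use:

# Build the Consul image and push it to your registry (names are illustrative).
docker build -t your-registry.example.com/consul:0.6.3 .
docker push your-registry.example.com/consul:0.6.3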

 

config.json:

And here is config.json referenced in the Dockerfile.

{
  "data_dir": "/data",
  "ui_dir": "/ui",
  "client_addr": "0.0.0.0",
  "ports": {
    "http"  : %%CONSUL_HTTP_PORT%%,
    "https" : %%CONSUL_HTTPS_PORT%%,
    "dns"   : %%CONSUL_DNS_PORT%%
  },
  "start_join": [
    %%LIST_PODIPS%%
  ],
  "acl_default_policy": "deny",
  "acl_datacenter": "%%ENVIRONMENT%%",
  "acl_master_token": "%%MASTER_TOKEN%%",
  "key_file" : "/etc/consul/ssl/consul.key",
  "cert_file": "/etc/consul/ssl/consul.crt",
  "recursor": "8.8.8.8",
  "disable_update_check": true,
  "encrypt" : "%%GOSSIP_KEY%%",
  "log_level": "INFO",
  "enable_syslog": false
}

If you have read my past Consul blog, you might notice we have added the following.

  "start_join": [
    %%LIST_PODIPS%%
  ],

This is important because we are going to have each Consul container query the Kubernetes API using a Kubernetes Token to pull in a list of IPs for the Consul cluster to join up.

Important note – if you are running more than just the default service account token per namespace, you need to explicitly grant read access to the pods API for the token associated with the container.
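
How you grant that depends on how your API server does authorization. On clusters with RBAC enabled, something along these lines would cover it; the role and binding names are made up, and we happen to run Consul in kube-system:

# Let the service account the Consul pods run under list pods in their namespace.
# Role and binding names are illustrative.
kubectl create role consul-pod-reader --verb=get,list --resource=pods -n kube-system
kubectl create rolebinding consul-pod-reader --role=consul-pod-reader \
  --serviceaccount=kube-system:default -n kube-system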

And here is run.sh:

#!/bin/sh
KUBE_TOKEN=`cat /var/run/secrets/kubernetes.io/serviceaccount/token`
NAMESPACE=`cat /var/run/secrets/kubernetes.io/serviceaccount/namespace`

if [ -z ${CONSUL_SERVER_COUNT} ]; then
  export CONSUL_SERVER_COUNT=3
fi

if [ -z ${CONSUL_HTTP_PORT} ]; then
  export CONSUL_HTTP_PORT=8500
fi

if [ -z ${CONSUL_HTTPS_PORT} ]; then
  export CONSUL_HTTPS_PORT=8543
fi

if [ -z ${CONSUL_DNS_PORT} ]; then
  export CONSUL_DNS_PORT=53
fi

if [ -z ${CONSUL_SERVICE_HOST} ]; then
  export CONSUL_SERVICE_HOST="127.0.0.1"
fi

if [ -z ${CONSUL_WEB_UI_ENABLE} ]; then
  export CONSUL_WEB_UI_ENABLE="true"
fi

if [ -z ${CONSUL_SSL_ENABLE} ]; then
  export CONSUL_SSL_ENABLE="true"
fi

if [ "${CONSUL_SSL_ENABLE}" = "true" ]; then
  if [ -n "${CONSUL_SSL_KEY}" ] && [ -n "${CONSUL_SSL_CRT}" ]; then
    echo "${CONSUL_SSL_KEY}" > /etc/consul/ssl/consul.key
    echo "${CONSUL_SSL_CRT}" > /etc/consul/ssl/consul.crt
  else
    openssl req -x509 -newkey rsa:2048 -nodes -keyout /etc/consul/ssl/consul.key -out /etc/consul/ssl/consul.crt -days 365 -subj "/CN=consul.kube-system.svc.cluster.local"
  fi
fi

export CONSUL_IP=`hostname -i`

if [ -z ${ENVIRONMENT} ] || [ -z ${MASTER_TOKEN} ] || [ -z ${GOSSIP_KEY} ]; then
  echo "Error: ENVIRONMENT, MASTER_TOKEN and GOSSIP_KEY environment vars must be set"
  exit 1
fi

LIST_IPS=`curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/namespaces/$NAMESPACE/pods | jq '.items[] | select(.status.containerStatuses[].name=="consul") | .status .podIP'`

#basic test to see if we have ${CONSUL_SERVER_COUNT} number of containers alive
VALUE='0'

while [ "$VALUE" != "${CONSUL_SERVER_COUNT}" ]; do
  echo "waiting 10s on all the consul containers to spin up"
  sleep 10
  LIST_IPS=`curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/namespaces/$NAMESPACE/pods | jq '.items[] | select(.status.containerStatuses[].name=="consul") | .status .podIP'`
  echo "$LIST_IPS" | sed -e 's/$/,/' -e '$s/,//' > tester
  VALUE=`cat tester | wc -l`
done

LIST_IPS_FORMATTED=`echo "$LIST_IPS" | sed -e 's/$/,/' -e '$s/,//' | tr '\n' ' '`   # join onto one line so the sed substitution below stays a single-line replacement

sed -i "s,%%ENVIRONMENT%%,$ENVIRONMENT,"             /etc/consul/config.json
sed -i "s,%%MASTER_TOKEN%%,$MASTER_TOKEN,"           /etc/consul/config.json
sed -i "s,%%GOSSIP_KEY%%,$GOSSIP_KEY,"               /etc/consul/config.json
sed -i "s,%%CONSUL_HTTP_PORT%%,$CONSUL_HTTP_PORT,"   /etc/consul/config.json
sed -i "s,%%CONSUL_HTTPS_PORT%%,$CONSUL_HTTPS_PORT," /etc/consul/config.json
sed -i "s,%%CONSUL_DNS_PORT%%,$CONSUL_DNS_PORT,"     /etc/consul/config.json
sed -i "s,%%LIST_PODIPS%%,$LIST_IPS_FORMATTED,"      /etc/consul/config.json

cmd="consul agent -server -config-dir=/etc/consul -dc ${ENVIRONMENT} -bootstrap-expect ${CONSUL_SERVER_COUNT}"

if [ ! -z ${CONSUL_DEBUG} ]; then
  ls -lR /etc/consul
  cat /etc/consul/config.json
  echo "${cmd}"
  sed -i "s,INFO,DEBUG," /etc/consul/config.json
fi

consul agent -server -config-dir=/etc/consul -dc ${ENVIRONMENT} -bootstrap-expect ${CONSUL_SERVER_COUNT}

Let’s go through the options here. Notice the script has some defaults set, so we may or may not include them when starting up the container.

LIST_PODIPS = a list of Consul IPs for the Consul node to join to

CONSUL_WEB_UI_ENABLE = true|false – if you want a web UI

CONSUL_SSL_ENABLE = SSL for cluster communication

If true, it expects:

CONSUL_SSL_KEY – SSL key

CONSUL_SSL_CRT – SSL cert

 

First we pull in the Kubernetes Token and Namespace. This is the default location for this information in every container and should work for your needs.

KUBE_TOKEN=`cat /var/run/secrets/kubernetes.io/serviceaccount/token`
NAMESPACE=`cat /var/run/secrets/kubernetes.io/serviceaccount/namespace`

And then we use those ENV VARS with some fancy jq to get a list of IPs formatted so we can shove them into config.json.

LIST_IPS=`curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/namespaces/$NAMESPACE/pods | jq '.items[] | select(.status.containerStatuses[].name=="consul") | .status .podIP'`
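
To make that concrete, jq spits out one quoted pod IP per line, and the sed pass further down turns it into the comma-separated list that lands in start_join. The addresses here are made up:

# jq output: one quoted IP per line (illustrative addresses)
#   "10.2.1.5"
#   "10.2.2.7"
#   "10.2.3.9"
# Append a comma to every line, then strip it from the last one:
echo "$LIST_IPS" | sed -e 's/$/,/' -e '$s/,//'
#   "10.2.1.5",
#   "10.2.2.7",
#   "10.2.3.9"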

And we wait until CONSUL_SERVER_COUNT containers have started up:

#basic test to see if we have ${CONSUL_SERVER_COUNT} number of containers alive
echo "$LIST_IPS" | sed -e 's/$/,/' -e '$s/,//' > tester
VALUE=`cat tester | wc -l`

while [ "$VALUE" != "${CONSUL_SERVER_COUNT}" ]; do
  echo "waiting 10s on all the consul containers to spin up"
  sleep 10
  echo "$LIST_IPS" | sed -e 's/$/,/' -e '$s/,//' > tester
  VALUE=`cat tester | wc -l`
done

You’ll notice this could certainly be cleaner, but it’s working.

Then we inject the IPs into the config.json:

sed -i "s,%%LIST_PODIPS%%,$LIST_IPS_FORMATTED,"      /etc/consul/config.json

which simplifies our consul runtime command quite nicely:

consul agent -server -config-dir=/etc/consul -dc ${ENVIRONMENT} -bootstrap-expect ${CONSUL_SERVER_COUNT}

 

Alright, so all of that is in for the Consul image.

Now let’s have a look at the Kubernetes config files.

consul.yaml

apiVersion: v1
kind: ReplicationController
metadata:
  namespace: kube-system
  name: consul
spec:
  replicas: ${CONSUL_COUNT}                               # number of consul containers
  # selector identifies the set of Pods that this
  # replication controller is responsible for managing
  selector:
    app: consul
  template:
    metadata:
      labels:
        # Important: these labels need to match the selector above
        # The api server enforces this constraint.
        pool: consulpool
        app: consul
    spec:
      containers:
        - name: consul
          env:
            - name: "ENVIRONMENT"
              value: "SOME_ENVIRONMENT_NAME"             # some name
            - name: "MASTER_TOKEN"
              value: "INITIAL_MASTER_TOKEN_FOR_ACCESS"   # UUID preferable
            - name: "GOSSIP_KEY"
              value: "ENCRYPTION_KEY_FOR_GOSSIP"         # some random key for encryption
            - name: "CONSUL_DEBUG"
              value: "false"                             # to debug or not to debug
            - name: "CONSUL_SERVER_COUNT"
              value: "${CONSUL_COUNT}"                   # integer value for number of containers
          image: 'YOUR_CONSUL_IMAGE_HERE'
          resources:
            limits:
              cpu: ${CONSUL_CPU}                         # how much CPU are you giving the container
              memory: ${CONSUL_RAM}                      # how much RAM are you giving the container
          imagePullPolicy: Always
          ports:
          - containerPort: 8500
            name: ui-port
          - containerPort: 8400
            name: alt-port
          - containerPort: 53
            name: udp-port
          - containerPort: 8543
            name: https-port
          - containerPort: 8301
            name: serflan
          - containerPort: 8302
            name: serfwan
          - containerPort: 8600
            name: consuldns
          - containerPort: 8300
            name: server
#      nodeSelector:                                     # optional
#        role: minion                                    # optional

You might notice we still need to move this to a Deployment/ReplicaSet instead of a ReplicationController.

These vars should look familiar by now.

CONSUL_COUNT = number of consul containers we want to run

CONSUL_HTTP_PORT = set port for http

CONSUL_HTTPS_PORT = set port for https

CONSUL_DNS_PORT = set port for dns

ENVIRONMENT = consul datacenter name

MASTER_TOKEN = the root token you want to have super admin privs to access the cluster

GOSSIP_KEY = an encryption key for cluster communication
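
We render those ${...} placeholders before handing the file to Kubernetes. One way to do it, assuming envsubst is available (the values shown are only examples):

# Fill in the template and create the ReplicationController (values are illustrative).
export CONSUL_COUNT=3 CONSUL_CPU=100m CONSUL_RAM=512Mi
envsubst < consul.yaml | kubectl create -f -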

 

consul-svc.yaml

---
apiVersion: v1
kind: Service
metadata:
  name: consul
  namespace: kube-system
  labels:
    name: consul
spec:
  ports:
    # the port that this service should serve on
    - name: http
      port: 8500
    - name: https
      port: 8543
    - name: rpc
      port: 8400
    - name: serflan
      port: 8301
    - name: serfwan
      port: 8302
    - name: server
      port: 8300
    - name: consuldns
      port: 53
  # label keys and values that must match in order to receive traffic for this service
  selector:
    pool: consulpool

 

consul-ing.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: consul
  namespace: kube-system
  labels:
    ssl: "true"
    httpsBackend: "true"
    httpsOnly: "true"
spec:
  rules:
  - host: consul.%%ENVIRONMENT%%.%%DOMAIN%%
    http:
      paths:
      - backend:
          serviceName: consul
          servicePort: 8543
        path: /

We run ingress controllers, so this will provision an ingress that makes Consul externally available.
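
Once everything is applied and the pods come up, it is worth checking that the cluster actually formed. A quick sanity check, with the pod name being whatever yours ends up as:

# List the Consul pods and ask one of them for its view of the cluster
# (the pod name below is illustrative).
kubectl get pods -n kube-system -l app=consul
kubectl exec -n kube-system consul-abc12 -- consul members
# Expect one "alive" server entry per replica.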


@devoperandi