An Apology to Bill Burke

What can I say? The founder of Keycloak, Bill Burke himself, somehow found this blog about Keycloak integration with Kubernetes way back when, and I completely missed it.

Bill, I’m very sorry; please accept my apology. Many reasons abound, but the bottom line is that I’m honored you even read my post.

For those who don’t know (how dare you), Bill Burke created one of the greatest Identity and Access Management solutions. It was acquired/sponsored by Red Hat some time back.

Thanks, Bill, for all you do, and keep up the great work.

My exit from a great gig

Today is my last day as a Principal Cloud Platform Architect at Pearson. Over the last few weeks I’ve had a bit of a retrospective on my time at Pearson: our accomplishments, our failures, our team, and all the opportunity it has created. I’m amazed how far we have come. I’m thankful for the opportunity and the freedom to reach beyond what many thought possible.

I left nothing on the table. I gave it all I had.

But one question still stands.

What IS Bitesize?

Bitesize is a platform purpose-built for the future of Pearson.

It’s a combination of interwoven technologies and application development methodologies that has resulted in advancements far beyond what any cloud provider can currently attain.

It’s a team of engineers who believe in, support, and challenge themselves and others, both inside and outside the team.

It’s a group of leaders who believe in their team and are willing to stick their necks out for what’s right.

It’s a philosophy of change.

It’s an evolving set of standards that increases the fidelity of these interwoven pieces.

Bitesize is the convergence of disparate teams to make something greater than any of them could individually.

It’s a treadmill of technology.

 

As I began thinking about what has been accomplished, I decided to write a few of them down. I know I will leave many unsaid, but hopefully this will give a view into just how much has been accomplished. Many may see this list and think one of two things: “That’s total bullshit, no way they did all that” or, better yet, “Ok, I buy it, but have you reached nirvana?”

My answer to the former – “test your bullshit meter, it’s a bit off”

My answer to the latter – “as soon as one reaches a star, they realize how many other stars they want to reach”

 

A few achievements to date:

Fully automated platform and application upgrades

Completely scalable CI/CD pipeline treated as cattle (i.e., if the pipelines go away, they self-configure when they come back up)

Built-in log and data aggregation per Kubernetes namespace

Deep cloud provider integrations without lock-in

Fully automated database provisioning (containers and virtual machines)

Dynamic Certificate Management using Hashicorp Vault fully integrated with Load Balancers

100% availability for the platform and applications through critical business periods (to my knowledge this had not been achieved at Pearson until now)

Dynamic Application Configuration

Immutable Application architecture

OAuth into the platform and various infrastructure components

Universal API for single point of use

Audit Control and Compliance throughout the stack

Baked in Enterprise Governance

Highly secure, full BGP mesh across geographic regions, capable of standing up new endpoints in under 10 seconds

8.5 million concurrent (and I do mean concurrent) user performance test, with 150-250ms average response times

Enterprise Chargeback model

Dynamic CIDR provisioning (NSOT) for AWS and Kubernetes

Open Sourced Authz webhook resulting in its adoption by CoreOS

Automated generation of StackStorm AWS packs

Contributed StackStorm Kubernetes Pack to StackStorm Exchange

Contributing next generation (over 106 new packs) of StackStorm AWS Packs to StackStorm Exchange (currently in incubator)

Open Sourced many new technologies including Environment Operator, StackStorm Packs, Kong plugins, Kubernetes Test Harness, Nginx Controller, Jenkins plugin for Environment Operator, and CI/CD pipeline

On-demand Locust (performance testing suite) on Kubernetes using Iron Functions, deployed in under 10 seconds

Integrated Monitoring/Alerting throughout the stack

Self-onboarding of applications through to production with little or no assistance from the Bitesize team

Congrats team. You’ve got this.
@devoperandi

Open Source – Environment Operator

The day has finally come. Today we are announcing our open source project Environment Operator (EO).

Environment Operator is used throughout our project and has rapidly gained a name for itself as being well written and well thought out. Props go out to Simas Cepaitis, Cristian Radu and Ben Somogyi who have all contributed.

At its core, EO enables a seamless application deployment capability for a given environment/namespace within Kubernetes.

Benefits/Features:

  • multi-cluster deployments
  • audit trail
  • status
  • consistent definition of customer environments
  • separate build from deploy
  • minimizes risk and scope of impact
  • simple abstraction from Kubernetes
  • BYO CI/CD
  • empowers our customers (dev teams)
  • API interface
  • multiple forms of authentication
  • deploy through yaml config and API
  • written in Go
  • Docker Registries

 

Multi-Cluster Deployments – With EO running in each namespace and exposed via an API, CI/CD pipelines can simply call the API endpoint regardless of Kubernetes cluster and deploy new services.
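To make that concrete, here is a rough sketch of what such a pipeline step might look like. Everything in it is illustrative: the hostnames, the /deploy path, and the JSON fields are placeholders, not the actual Environment Operator API contract.

#!/bin/bash
# Illustrative sketch only: EO hostnames, the /deploy path and the JSON fields
# below are placeholders, not the real Environment Operator API contract.
set -euo pipefail

SERVICE="frontend"
VERSION="1.4.2"
EO_TOKEN="${EO_TOKEN:?token injected by the CI/CD pipeline}"

# Each environment/namespace runs its own EO behind its own endpoint, so the
# same pipeline step can target any cluster just by switching the URL.
for EO_ENDPOINT in \
  "https://eo-dev.cluster-a.example.com" \
  "https://eo-dev.cluster-b.example.com"; do
  curl -sf -X POST "${EO_ENDPOINT}/deploy" \
    -H "Authorization: Bearer ${EO_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"service\": \"${SERVICE}\", \"version\": \"${VERSION}\"}"
done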

 

API Interface – EO has its own API endpoint for deployments, status, logs and the like. This, combined with a YAML config for its environment, is very powerful.

 

Audit Trail – EO provides an audit trail of all changes to the environment through its logging to stdout.
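Since the audit trail goes to stdout, pulling it back is just a matter of reading EO’s pod logs. A small sketch (the label selector and namespace are assumptions about how EO happens to be deployed):

#!/bin/bash
# The label "app=environment-operator" and the namespace "myenv" are assumed
# names for illustration; adjust them for your environment.
NAMESPACE="myenv"
POD=$(kubectl get pods --namespace="${NAMESPACE}" -l app=environment-operator \
  -o jsonpath='{.items[0].metadata.name}')

# Every change EO makes to the environment shows up here.
kubectl logs "${POD}" --namespace="${NAMESPACE}"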

 

Status – EO provides a /status endpoint for understanding the status of an environment, or of an individual service within the environment via /status/${service}.
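For example (the hostname is made up; /status and /status/${service} are the endpoints described above):

#!/bin/bash
# Hostname is illustrative; substitute the endpoint your EO is exposed on.
EO="https://eo.myenv.example.com"

# Status of the whole environment
curl -s "${EO}/status"

# Status of a single service within the environment
curl -s "${EO}/status/frontend"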

 

Separate Build from Deploy – What we found was that, while our CI/CD pipeline is quite robust, it lacked the real-time feedback and audit capabilities needed by our dev teams. Separating build from deploy allowed us to add these features, simplify its use, and let our dev teams bring their own familiar pipelines to our project.

 

Minimize Risk and Scope of Impact – Because EO runs inside the Kubernetes cluster, we could limit its capabilities through Kubernetes service accounts to only its own namespace. This limits the risk and impact to other dev teams running in the same cluster, since a dev would have to call an entirely wrong API endpoint to affect another environment. Furthermore, authentication is set up for each EO, so separation of concerns between environments is easy to maintain.

 

Simple Abstraction – Because EO is so simple to use, it has enabled our teams to get up and running much faster in Kubernetes. Very little prior knowledge is required: they can keep their existing pipelines by using a common DSL in our Jenkins plugin and get all the real-time information from one place per environment.

 

BYO CI/CD – I think this is pretty self-explanatory, but we have many dev teams at Pearson that already have their own CI/CD pipelines. They can continue using their own pipeline or choose to use ours.

 

Empower our Dev teams – Ultimately EO is about empowering Dev teams to manage their own environments without requiring tons of prior knowledge to get started. Simply deploy EO and go.

 

Authentication – EO currently supports two kinds of authentication: token-based, where the token is pulled from a Kubernetes secret, and OAuth. We currently tie directly into Keycloak for auth.
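A minimal sketch of the token-based flow, assuming the token lives in a Kubernetes secret (the secret name, key, and hostname below are placeholders, not the exact EO contract):

#!/bin/bash
# "eo-token", its "token" key, and the hostname are assumed names for illustration.
TOKEN=$(kubectl get secret eo-token --namespace=myenv \
  -o jsonpath='{.data.token}' | base64 --decode)

# Present the token on every call to the environment's EO endpoint.
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "https://eo.myenv.example.com/status"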

 

Plugin (DSL) for Jenkins – Because most of our Dev teams run Jenkins, we wrote a plugin to hook directly into it. Other plugins could very easily be written.

 

Docker Registries – EO can connect to private, public, Google Cloud, and Docker Hub registries.

 

As you can see, Environment Operator has a fair number of capabilities built in, but we aren’t stopping there.

Near term objectives:

  • Stateful sets
  • Kubernetes Jobs
  • Prometheus

 

Github:

https://github.com/pearsontechnology/environment-operator

https://github.com/pearsontechnology/environment-operator-jenkins-plugin

 

Let us know what you think!

@devoperandi

 

 

 

 

Kubernetes/PaaS: Automated Test Framework

First off, mad props go out to Ben Somogyi and Martin Devlin. They have been digging deep on this and have made great progress. I wanted to make sure I call them out; all the honors go to them. I just have the honor of telling you about it.

You might be thinking right about now, “Why an automated test framework? Doesn’t the Kubernetes team test their own stuff already?” Of course they do, but we have a fair number of apps and integrations of our own, and we need to make sure our platform components all work together with Kubernetes. Take, for example, when we upgrade Kubernetes, deploy a new StackStorm integration, or add some authentication capability. All of these things need to be tested to ensure our platform works every time.

At what point did we decide we needed an automated test framework? Right about the time we realized we were committing so much back to our project that we couldn’t keep up with the testing. Prior to this, we tested each PR by hand, requiring two +1s (minus the author) before a PR could be merged. What we found was we were spending so much time testing (thoroughly?) that we were losing valuable development time. We are a pretty small dev shop, literally 5 (+3 Ops) guys developing new features for our PaaS, so naturally there is a balancing act here. Do we spend more time writing test cases or actually testing ourselves? There comes a tipping point when it makes more sense to write test cases, automate them, and use people for other things. We felt like we had hit that point.

Here is what our current test workflow looks like. It’s subject to change, but this is our most recent iteration.

QA Automation Workflow

Notice we are running TravisCI to kick everything off. If you have read our other blog posts, you know we also have a Jenkins plugin, so you are probably thinking, ‘Why Travis when you have already written your own Jenkins plugin?’ It’s rather simple, really. We use TravisCI to kick off tests through GitHub: it deploys a completely new AWS VPC / Kubernetes cluster from scratch, runs a series of tests to make sure the cluster came up properly and all the endpoints are available, and then deploys Jenkins into a namespace, which kicks off a series of internal tests on the cluster.

Basically, TravisCI handles the external/infrastructure testing, making sure Terraform/Ansible run correctly and all the external dependencies come up, while Jenkins deploys and tests the internal components at the container level.
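In rough terms, the TravisCI side boils down to a script along these lines. The file names, playbook, and directory layout are illustrative, not our actual repo structure:

#!/bin/bash
# Illustrative sketch of the Travis-driven flow; names and paths are placeholders.
set -euo pipefail

# Always tear the environment back down, pass or fail.
trap 'terraform destroy -force' EXIT

# Stand up the VPC and both clusters (paasA/paasB) from scratch.
terraform apply
ansible-playbook provision-paas.yml

# External/infrastructure checks: did the nodes, services and endpoints come up?
bats tests/infrastructure/

# Hand off to Jenkins running inside the cluster for the container-level tests.
kubectl create namespace qa
kubectl apply --namespace=qa -f jenkins/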

If you haven’t already, you may want to read Kubernetes A/B Cluster Deploys, because we are capable of deploying two completely separate clusters inside the same AWS VPC for the purpose of A/B migrations.

Travis watches for pull requests (PRs) made against our dev branch. For each PR, TravisCI runs through the complete QA automation process. Below are the highlights; see the image above for details.

1. Create a branch from the PR and merge in the dev branch

2. Linting/Unit tests

3. Cluster deploy

  • If anything fails during deploy of the VPC, paasA, or paasB, the process fails and tears down the environment, with the output captured in the TravisCI build logs.

Here is an example of one of our failing builds in TravisCI.

[Screenshot: failing TravisCI build]

4. Test paasA with paasB

  • Smoke Test
  • Deploy ‘Testing’ containers into paasB
  • Retrieve tests
  • Execute tests against paasA
  • Capture results
  • Publish back to Travis

5. Destroy environment

 

One massive advantage of having A and B clusters is that we can use one to test the other. This enables a large portion of our testing automation to live in containers, making it parallel, fast, and scalable to a large extent.
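As a sketch of the idea: the test containers are scheduled in paasB but pointed at paasA's endpoints. The image name, context names, and flags below are made up for illustration.

#!/bin/bash
# Hypothetical test-runner image, contexts and arguments; the real containers differ.
kubectl --context=paasB run qa-runner \
  --image=registry.example.com/qa/test-runner:latest \
  --restart=Never \
  -- --target-api=https://paas-a.example.com --suite=smoke

# Once the pod completes, collect the results for publishing back to Travis.
kubectl --context=paasB logs qa-runner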

The entire process takes about 25 minutes. Not too shabby for literally building an entire environment from the ground up and running tests against it, and we don’t expect the length of time to change much, in large part because of the parallel testing. This is a from-scratch, completely automated QA framework for a PaaS. I’m thinking 25-30 minutes is pretty damn good. You?

[Screenshot: TravisCI build timing]

 

Alright get to the testing already.

First is our helper script for setting a few params like timeouts and the number of servers of each type. Anything in ‘${}’ is a Terraform variable that we inject on Terraform deploy.

helpers.bash

#!/bin/bash

## Statics

#Long Timeout (For bootstrap waits)
LONG_TIMEOUT=<integer_seconds>

#Normal Timeout (For kubectl waits)
TIMEOUT=<integer_seconds>

# Should match minion_count in terraform.tfvars
MINION_COUNT=${MINION_COUNT}

LOADBALANCER_COUNT=${LOADBALANCER_COUNT}

ENVIRONMENT=${ENVIRONMENT}

## Functions

# retry_timeout takes 2 args: command [timeout (secs)]
retry_timeout () {
  count=0
  while [[ ! `eval $1` ]]; do
    sleep 1
    count=$((count+1))
    if [[ "$count" -gt $2 ]]; then
      return 1
    fi
  done
}

# values_equal takes 2 values, both must be non-null and equal
values_equal () {
  if [[ "X$1" != "X" ]] && [[ "X$2" != "X" ]] && [[ $1 == $2 ]]; then
    return 0
  else
    return 1
  fi
}

# min_value_met takes 2 values, both must be non-null and 2 must be equal or greater than 1
min_value_met () {
  if [[ "X$1" != "X" ]] && [[ "X$2" != "X" ]] && [[ $2 -ge $1 ]]; then
    return 0
  else
    return 1
  fi
}

 

You will notice we have divided our high-level tests by Kubernetes resource type: Services, Ingresses, Pods, etc.

First we test a few things to make sure our minions and load balancer minions came up. Notice we are using kubectl for much of this. May as well; it’s there and it’s easy.

(If you want to know more about what we mean by load balancer minions, see our earlier posts.)

instance_counts.bats

#!/usr/bin/env bats

set -o pipefail

load ../helpers

# Infrastructure

@test "minion count" {
  MINIONS=`kubectl get nodes --selector=role=minion --no-headers | wc -l`
  min_value_met $MINION_COUNT $MINIONS
}

@test "loadbalancer count" {
  LOADBALANCERS=`kubectl get nodes --selector=role=loadbalancer --no-headers | wc -l`
  values_equal $LOADBALANCER_COUNT $LOADBALANCERS
}

 

pod_counts.bats

#!/usr/bin/env bats

set -o pipefail

load ../helpers

@test "bitesize-registry pods" {
  BITESIZE_REGISTRY_DESIRED=`kubectl get rc bitesize-registry --namespace=default -o jsonpath='{.spec.replicas}'`
  BITESIZE_REGISTRY_CURRENT=`kubectl get rc bitesize-registry --namespace=default -o jsonpath='{.status.replicas}'`
  values_equal $BITESIZE_REGISTRY_DESIRED $BITESIZE_REGISTRY_CURRENT
}

@test "kube-dns pods" {
  KUBE_DNS_DESIRED=`kubectl get rc kube-dns-v18 --namespace=kube-system -o jsonpath='{.spec.replicas}'`
  KUBE_DNS_CURRENT=`kubectl get rc kube-dns-v18 --namespace=kube-system -o jsonpath='{.status.replicas}'`
  values_equal $KUBE_DNS_DESIRED $KUBE_DNS_CURRENT
}

@test "consul pods" {
  CONSUL_DESIRED=`kubectl get rc consul --namespace=kube-system -o jsonpath='{.spec.replicas}'`
  CONSUL_CURRENT=`kubectl get rc consul --namespace=kube-system -o jsonpath='{.status.replicas}'`
  values_equal $CONSUL_DESIRED $CONSUL_CURRENT
}

@test "vault pods" {
  VAULT_DESIRED=`kubectl get rc vault --namespace=kube-system -o jsonpath='{.spec.replicas}'`
  VAULT_CURRENT=`kubectl get rc vault --namespace=kube-system -o jsonpath='{.status.replicas}'`
  values_equal $VAULT_DESIRED $VAULT_CURRENT
}

@test "es-master pods" {
  ES_MASTER_DESIRED=`kubectl get rc es-master --namespace=default -o jsonpath='{.spec.replicas}'`
  ES_MASTER_CURRENT=`kubectl get rc es-master --namespace=default -o jsonpath='{.status.replicas}'`
  values_equal $ES_MASTER_DESIRED $ES_MASTER_CURRENT
}

@test "es-data pods" {
  ES_DATA_DESIRED=`kubectl get rc es-data --namespace=default -o jsonpath='{.spec.replicas}'`
  ES_DATA_CURRENT=`kubectl get rc es-data --namespace=default -o jsonpath='{.status.replicas}'`
  values_equal $ES_DATA_DESIRED $ES_DATA_CURRENT
}

@test "es-client pods" {
  ES_CLIENT_DESIRED=`kubectl get rc es-client --namespace=default -o jsonpath='{.spec.replicas}'`
  ES_CLIENT_CURRENT=`kubectl get rc es-client --namespace=default -o jsonpath='{.status.replicas}'`
  values_equal $ES_CLIENT_DESIRED $ES_CLIENT_CURRENT
}

@test "monitoring-heapster-v6 pods" {
  HEAPSTER_DESIRED=`kubectl get rc monitoring-heapster-v6 --namespace=kube-system -o jsonpath='{.spec.replicas}'`
  HEAPSTER_CURRENT=`kubectl get rc monitoring-heapster-v6 --namespace=kube-system -o jsonpath='{.status.replicas}'`
  values_equal $HEAPSTER_DESIRED $HEAPSTER_CURRENT
}

 

service.bats

#!/usr/bin/env bats

set -o pipefail

load ../helpers

# Services

@test "kubernetes service" {
  retry_timeout "kubectl get svc kubernetes --namespace=default --no-headers" $TIMEOUT
}

@test "bitesize-registry service" {
  retry_timeout "kubectl get svc bitesize-registry --namespace=default --no-headers" $TIMEOUT
}

@test "fabric8 service" {
  retry_timeout "kubectl get svc fabric8 --namespace=default --no-headers" $TIMEOUT
}

@test "kube-dns service" {
  retry_timeout "kubectl get svc kube-dns --namespace=kube-system --no-headers" $TIMEOUT
}

@test "kube-ui service" {
  retry_timeout "kubectl get svc kube-ui --namespace=kube-system --no-headers" $TIMEOUT
}

@test "consul service" {
  retry_timeout "kubectl get svc consul --namespace=kube-system --no-headers" $TIMEOUT
}

@test "vault service" {
  retry_timeout "kubectl get svc vault --namespace=kube-system --no-headers" $TIMEOUT
}

@test "elasticsearch service" {
  retry_timeout "kubectl get svc elasticsearch --namespace=default --no-headers" $TIMEOUT
}

@test "elasticsearch-discovery service" {
  retry_timeout "kubectl get svc elasticsearch-discovery --namespace=default --no-headers" $TIMEOUT
}

@test "monitoring-heapster service" {
  retry_timeout "kubectl get svc monitoring-heapster --namespace=kube-system --no-headers" $TIMEOUT
}

 

ingress.bats

#!/usr/bin/env bats

set -o pipefail

load ../helpers

# Ingress

@test "consul ingress" {
  retry_timeout "kubectl get ing consul --namespace=kube-system --no-headers" $TIMEOUT
}

@test "vault ingress" {
  retry_timeout "kubectl get ing vault --namespace=kube-system --no-headers" $TIMEOUT
}

Now that we have a pretty good level of certainty that the cluster stood up as expected, we can begin deeper testing of the various components and integrations within our platform: StackStorm, Kafka, Elasticsearch, Grafana, Keycloak, Vault, and Consul; AWS endpoints, internal endpoints, port mappings, security... and the list goes on. All core components that our team provides to our customers.
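The deeper integration checks follow the same bats pattern as the resource tests above. Here is a minimal sketch for a couple of them, assuming in-cluster service DNS names and default ports (those are assumptions about our setup, not part of the framework):

#!/usr/bin/env bats

set -o pipefail

load ../helpers

# Vault answers /v1/sys/health with a JSON body whenever its API is reachable.
@test "vault health endpoint" {
  retry_timeout "curl -s http://vault.kube-system.svc.cluster.local:8200/v1/sys/health" $TIMEOUT
}

# Consul returns the current leader's address once the cluster has quorum.
@test "consul leader elected" {
  LEADER=`curl -s http://consul.kube-system.svc.cluster.local:8500/v1/status/leader`
  [[ "X$LEADER" != "X" ]]
}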

Stay tuned for more as it all begins to fall into place.