DistOS 2021F Experience 2
Introduction
In this experience, you will be playing with a multi-node Kubernetes cluster simulated using minikube, running the Cilium CNI, a replicated Apache Cassandra database, and an updated version of our beloved “printerfacts” service that consumes facts stored in the Cassandra database. As usual, you may wish to consult the relevant documentation if you get stuck. Links to documentation will be provided along with hints later on in this document.
With the exception of Part 2, completing this experience shouldn’t take more than a couple of hours. Feel free to collaborate with other students if you get stuck. However, you must acknowledge any collaboration. Additionally, copying and pasting or simply “changing up” each other’s answers will be treated as an academic integrity violation.
Submissions
Your experience report should be submitted as a PDF file on Brightspace, written in paragraph form. Code snippets or screenshots are allowed to augment your prose but are not required. We will post a submission link on Brightspace within the next few days.
The official due date for the experience report is December 10th (the last day of class), but we will continue to accept submissions up until the date of the final exam if you need extra time.
Receiving Your Grade
This experience is broken up into three parts:
- A series of easy tasks designed to get you familiar with Cilium, Cassandra, and a multi-node K8s cluster.
- A harder challenge that involves programming a client to interact with the Cassandra database and writing a custom Cilium security policy to lock down your cluster.
- An opportunity to reflect on this experience and make connections to overall themes in the course.
Students are expected to complete Part 1 and Part 3 to get a grade of at most a B+. Part 2 is optional, but must be completed to receive a grade of A- or higher. Marks will be deducted for insufficient explanations or answers that demonstrate a lack of effort. By this logic, you should have a fairly clear idea of what grade you will receive when you submit this experience, based on how much effort you put in.
Setting Up Your Environment
For this experience, a new VM image has been provided for you. It has a larger disk size, more RAM, and a batteries-included installation of the Cilium and Hubble CLIs. First, delete your old instance using the OpenStack web console, then create a new one. You can follow the same instructions as before, except replace the COMP4000-studentvm-v1 image with COMP4000-studentvm-v2. When selecting your flavour, make sure you pick the flavour with 16GiB of disk space and 8GB of RAM.
Once your VM is set up, SSH into it using your preferred SSH client. You should probably be working with at least 2 terminals for the rest of this experience. After SSHing into your instance, run the following command to spin up a multi-node k8s cluster: minikube start -n3 --cni=cilium. This might take a few minutes to run to completion.

Verify your cluster has started correctly using minikube status. You should see something like the following:

minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured

minikube-m02
type: Worker
host: Running
kubelet: Running

minikube-m03
type: Worker
host: Running
kubelet: Running
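If you would like a second check, listing the nodes directly should show all three minikube nodes reporting a Ready status:

# Each minikube node registers as a Kubernetes node object
kubectl get nodes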
Cilium and Hubble
Once your cluster is running, you can configure the cluster to use Cilium and Hubble. Due to some minikube and Cilium quirks, we need to uninstall the default version of Cilium installed by minikube and reinstall it using the Cilium CLI. You can do so with the following commands:
cilium uninstall && cilium install && cilium hubble enable
This may take a few minutes to run to completion.
You can run the command cilium status to check the status of your Cilium installation. Once everything has installed correctly, you should see something like the following:

student@alpine:~$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:          OK
 \__/¯¯\__/    Operator:        OK
 /¯¯\__/¯¯\    Hubble:          OK
 \__/¯¯\__/    ClusterMesh:     disabled
    \__/

Deployment        cilium-operator    Desired: 1, Ready: 1/1, Available: 1/1
Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet         cilium             Desired: 3, Ready: 3/3, Available: 3/3
Containers:       cilium             Running: 3
                  cilium-operator    Running: 1
                  hubble-relay       Running: 1
Cluster Pods:     2/2 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium:v1.10.5: 3
                  cilium-operator    quay.io/cilium/operator-generic:v1.10.5: 1
                  hubble-relay       quay.io/cilium/hubble-relay:v1.10.5: 1
Cassandra
We will be using the helm package manager to install Cassandra in our k8s cluster. To do so, run the following commands:

wget https://homeostasis.scs.carleton.ca/~will/cassandra.yml
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install cassandra bitnami/cassandra -f cassandra.yml
You can verify the status of your Cassandra installation using kubectl get pods --watch. Wait until both cassandra-0 and cassandra-1 show 1/1 under READY. This might take a few minutes. The output should look something like this:

student@alpine:~$ kubectl get pods
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          4m2s
cassandra-1   1/1     Running   0          2m8s
Printerfacts
The printerfacts service from last experience has received an upgrade to work with our new Cassandra database. First, run the migrations (provided as a batch job) to initialize the Cassandra database and populate it with some facts about printers:
wget https://homeostasis.scs.carleton.ca/~will/migrations.yml
kubectl apply -f migrations.yml
After applying the migrations, you’re ready to deploy the printerfacts service:
wget https://homeostasis.scs.carleton.ca/~will/deploy.yml
kubectl apply -f deploy.yml
Verify that printerfacts has been correctly deployed with kubectl get pods. You should see something like this:

student@alpine:~$ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
cassandra-0               1/1     Running   0          31m
cassandra-1               1/1     Running   0          30m
server-85f9f4465b-9vhn9   1/1     Running   0          38m
server-85f9f4465b-b7s2n   1/1     Running   0          38m
server-85f9f4465b-bvgqs   1/1     Running   0          38m
server-85f9f4465b-bwf42   1/1     Running   0          38m
server-85f9f4465b-gn5sm   1/1     Running   0          38m
Finally, run minikube tunnel in a separate terminal to set up port forwarding for our LoadBalancer API objects. This will enable you to interact with printerfacts from your VM, rather than needing to spin up a client pod like last time.

Verify that it worked correctly by running curl 10.96.0.201 (this is the IP you will use to talk to printerfacts). Note that depending on your specific configuration, you may need to SSH into minikube using minikube ssh before you can access the load balancer service. If your curl command hangs forever, this may be the issue.
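If you are unsure which address the tunnel exposed, listing the cluster's services shows the LoadBalancer and its external IP (the printerfacts service name is not given here, so the command below simply lists everything):

# Look for the service of TYPE LoadBalancer; once minikube tunnel is running,
# its EXTERNAL-IP should match the 10.96.0.201 address used above.
kubectl get svc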
Tasks
Part 1: Multi-Node Kubernetes, Cilium, and Cassandra (Easy)
Follow the instructions for each of the following numbered tasks. Make an effort to answer the accompanying questions, but more importantly please note down all of your observations and describe what you did for each task. You should also feel free to write down whatever questions you may have about a given task.
To achieve the best possible grade in this section, you must demonstrate that you have made an effort to understand the results of each task. (Note that an effort does not strictly mean a full understanding; it is okay to have questions!)
1. Thanks to the Cilium CNI, your Kubernetes cluster has been outfitted with new eBPF superpowers. In particular, Cilium installs a series of eBPF programs into your (VM’s) kernel which can be used to monitor traffic between containers, pods, and nodes, as well as enforce L4–L7 security policy. To test out your new superpowers, run cilium hubble port-forward & followed by hubble observe. What is all that output? Can you make sense of any of it? (A few flags for taming the output are sketched after this task.)
   - Hint 1: Try checking hubble --help and hubble observe --help
   - Hint 2: Check out the Hubble docs
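   If the raw firehose is overwhelming, note that hubble observe accepts filtering flags. The invocations below are suggestions rather than part of the assignment, and the pod name shown is taken from the sample output earlier in this document (yours will differ); check hubble observe --help for the full list of filters.

   # Follow flows involving one particular pod (substitute a pod name from kubectl get pods)
   hubble observe --pod server-85f9f4465b-9vhn9 -f

   # Show only flows that were dropped
   hubble observe --verdict DROPPED

   # Show only flows parsed as HTTP
   hubble observe --protocol http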
2. Unlike last time, our cluster consists of three nodes rather than just one. Try running curl 10.96.0.201 a few times and notice that the output now includes a node name in addition to a pod name. Do you notice any patterns in the output? Compare what you see with the output from running kubectl get pods -o wide (a short command loop is sketched after this task's hints). Try to come up with an explanation for the distribution of printerfacts pods over the nodes in your cluster.
   - Hint 1: Have a look at the affinity section of deploy.yml.
   - Hint 2: Have a look at the relevant documentation.
   - Hint 3: If you still can’t get it, an educated guess will suffice.
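   As referenced in the task above, one quick way to gather data is to fire off a batch of requests and then compare the responses against the scheduler's placement. A minimal sketch:

   # Hit the service ten times and print each response on its own line
   for i in $(seq 1 10); do curl -s 10.96.0.201; echo; done

   # Show which node each pod was scheduled onto (see the NODE column)
   kubectl get pods -o wide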
3. Printerfacts is now a CRUD app that supports creating, reading, updating, and deleting facts from the Cassandra database. In particular, we now support the following endpoints:
   - GET 10.96.0.201/fact: Get a random fact from the database as a JSON object
   - GET 10.96.0.201/fact/keys: Get a list of all fact keys in the database as a JSON object
   - GET 10.96.0.201/fact/<key>: Get the fact with the key <key> in the database as a JSON object
   - POST 10.96.0.201/fact: Create a new fact, where the request body is a JSON object of the form { "fact": "Fact Here", "kind": "Kind of fact (e.g. Cat fact)" }
   - PUT 10.96.0.201/fact/<key>: Modify the fact with key <key> in the database, where the request body is a JSON object of the form { "fact": "Fact Here", "kind": "Kind of fact (e.g. Cat fact)" }
   - DELETE 10.96.0.201/fact/<key>: Delete the fact with key <key> from the database
   Try out each of the printerfacts endpoints. Note that you can specify the HTTP request type using curl -X <type> (for example, curl -X POST to send a POST request). You can send a JSON body as a payload by using curl -H 'Content-type: application/json' -d '{"key": "value"}', where you replace the {"key": "value"} part with your JSON object. A couple of worked curl invocations are sketched at the end of this task.

   Optional: Try restarting the Cassandra pods using kubectl rollout restart statefulset cassandra. While the pods are restarting, watch the output of kubectl get pods --watch and try making requests to the printerfacts service at the same time. Do you notice any downtime?
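   As referenced above, a couple of worked curl invocations might look like the following. The payloads are made-up examples, and <key> is a placeholder for a key returned by the /fact/keys endpoint (the URLs are quoted so the angle brackets don't confuse your shell):

   # Create a new fact (example payload)
   curl -X POST '10.96.0.201/fact' \
     -H 'Content-type: application/json' \
     -d '{"fact": "Printers dislike Mondays.", "kind": "Printer fact"}'

   # Update an existing fact; replace <key> with a real key before running
   curl -X PUT '10.96.0.201/fact/<key>' \
     -H 'Content-type: application/json' \
     -d '{"fact": "Printers are indifferent to Mondays.", "kind": "Printer fact"}'

   # Delete the fact again
   curl -X DELETE '10.96.0.201/fact/<key>'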
4. Let’s use Cilium and Hubble to observe dataflows between our Cassandra database and the printerfacts service. To do this, run the following Hubble command: hubble observe --label component=server --label component=printerfacts -f. Make a few requests to the printerfacts service, exercising various API endpoints. Try and make sense of some of the network traffic you observe.

5. In an effort to secure your cluster, you decide to employ a Cilium network policy that locks down the printerfacts service. First, download the example security policy by running wget https://homeostasis.scs.carleton.ca/~will/printerfacts-policy.yml, then run kubectl apply -f printerfacts-policy.yml to apply it. The example policy allows GET requests to /, /fact, /fact/keys, and /fact/<key>. Examine the policy file and make sure you understand how it works.

   Try out your policy by making a few valid requests followed by some invalid ones. While your policy is applied, try using Hubble to observe the HTTP traffic by running hubble observe --protocol http. Optionally, try extending the policy to make some other valid routes work (e.g. PUT, POST, and DELETE requests); a rough sketch of one way to do this follows this task.
   - Hint: You may wish to consult the Cilium docs
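   As a rough sketch of one way to extend the policy (this is an assumption about the structure of the downloaded file, not its actual contents; Cilium HTTP rules match on a method and a regex path), you might add entries along these lines to the http rules list in printerfacts-policy.yml and re-apply it:

   # Sketch of additional HTTP rules to merge into the existing rules/http list
   # in printerfacts-policy.yml (shown as comments; edit the YAML by hand):
   #
   #   - method: "POST"
   #     path: "/fact"
   #   - method: "PUT"
   #     path: "/fact/.*"
   #   - method: "DELETE"
   #     path: "/fact/.*"
   #
   kubectl apply -f printerfacts-policy.yml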
6. Now that you’re familiar with printerfacts, it’s time to simulate a node failure. Since both printerfacts and Cassandra are replicated across multiple nodes in our cluster, taking one node down should not impact either service.

   Let’s first try a planned node disruption. This kind of disruption might occur when a cluster administrator wants to take a node down for maintenance, for example to perform a kernel update. Start by draining node minikube-m02 with the command kubectl drain minikube-m02 --ignore-daemonsets. Now try interacting with the printerfacts service as before. Make note of any unusual behaviour and try to come up with a best-guess explanation.

   Now you can bring your node back up using kubectl uncordon minikube-m02. Once your node has been uncordoned, it should be ready for scheduling again. Try scaling up the printerfacts deployment to get some pods running on the node again.

   Finally, let’s simulate a total node failure. To do this, we will kill the underlying kubelet for our node, which is running on your system as a Docker container. Use the docker ps command to find the container ID that corresponds to your node, then kill it using docker kill. Repeat the same experiments from before, documenting any unusual behaviour you observe. (The commands involved are collected in a short sketch after this task.)
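   For reference, the whole sequence might look roughly like this. The deployment name server is an inference from the pod names seen earlier, and with minikube's Docker driver the node's backing container is typically named after the node itself; confirm both with kubectl get deployments and docker ps before running anything:

   # Planned disruption: evict pods from the node, then make it schedulable again
   kubectl drain minikube-m02 --ignore-daemonsets
   kubectl uncordon minikube-m02

   # Scale the printerfacts deployment so pods land on the freed-up node again
   kubectl scale deployment server --replicas=7

   # Simulated crash: find the container backing the node, then kill it
   docker ps
   docker kill minikube-m02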
Part 2: Interacting with Cassandra (Hard)
Note that this part of the experience is only required if you wish to achieve a grade of A- or higher. You can also choose to skip one of the two questions here, but doing so will likely impact your grade.
1. Write a Cilium security policy that allows only SELECT and INSERT queries to the pfacts.facts table in Cassandra. All other queries should be denied. Apply your policy and demonstrate how it works using a few example queries. You may wish to consult the Cilium policy docs.
   - Hint: You can use this policy template as a starting point.
2. Write and deploy your own containerized application as a replicated deployment to interact with the Cassandra database in some interesting way. You can choose to extend the printerfacts schema or come up with your own schema, depending on your preference. Be sure to explain how your application deals with Cassandra’s consistency model in a replicated cluster. In order to receive credit for this question, you must make some meaningful modifications to the database. Just consuming the existing data in a new way is not sufficient.
   - Hint 1: Cassandra allows the client to choose their consistency level when making queries.
   - Hint 2: Cassandra queries are made in CQL, a NoSQL query language that is similar to but not totally compatible with SQL.
   - Hint 3: You can spawn a cqlsh session to issue test CQL queries with kubectl exec -it cassandra-0 -- cqlsh (a short example session follows these hints).
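   To make the hints concrete, a minimal cqlsh session might look like the following. The pfacts.facts table comes from the migrations applied earlier; the CQL statements are shown as comments because they are typed at the cqlsh prompt rather than in your shell:

   # Open an interactive CQL shell against the first Cassandra replica
   kubectl exec -it cassandra-0 -- cqlsh

   # At the cqlsh prompt, you could then try something like:
   #   DESCRIBE TABLE pfacts.facts;          -- inspect the schema the migrations created
   #   CONSISTENCY QUORUM;                   -- set the consistency level for this session
   #   SELECT * FROM pfacts.facts LIMIT 5;   -- read a few rows back at that level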
Part 3: Reflection
Summarize your experience with multi-node Kubernetes, Cilium, and Cassandra in a few paragraphs (both the good and the bad). What concepts do you see reflected here from the research papers we have read thus far? After having some hands-on experience with a distributed systems technology, have any of your opinions or initial assumptions changed? Feel free to list any other thoughts you have.
Acknowledgements
The idea for the printerfacts API comes from Christine Dodrill’s wonderful blog post.