DistOS 2021F Experience 2

Introduction

In this experience, you will be playing with a multi-node Kubernetes cluster simulated using minikube, running the Cilium CNI, a replicated Apache Cassandra database, and an updated version of our beloved “printerfacts” service that consumes facts stored in the Cassandra database. As usual, you may wish to consult the relevant documentation if you get stuck. Links to documentation will be provided along with hints later on in this document.

With the exception of Part 2, completing this experience shouldn’t take more than a couple of hours. Feel free to collaborate with other students if you get stuck. However, you must acknowledge any collaboration. Additionally, copying and pasting or simply “changing up” each other’s answers will be treated as an academic integrity violation.

Submissions

Your experience report should be submitted as a PDF file on Brightspace, written in paragraph form. Code snippets or screenshots are allowed to augment your prose but are not required. We will post a submission link on Brightspace within the next few days.

The official due date for the experience report is December 10th (the last day of class), but we will continue to accept submissions up until the date of the final exam if you need extra time.

Receiving Your Grade

This experience is broken up into three parts:

  1. A series of easy tasks designed to get you familiar with Cilium, Cassandra, and a multi-node K8s cluster.
  2. A harder challenge that involves programming a client to interact with the Cassandra database and writing a custom Cilium security policy to lock down your cluster.
  3. An opportunity to reflect on this experience and make connections to overall themes in the course.

All students are expected to complete Part 1 and Part 3; doing so earns a grade of at most a B+. Part 2 is optional, but it must be completed to receive a grade of A- or higher. Marks will be deducted for insufficient explanations or answers that demonstrate a lack of effort. By this logic, you should have a fairly clear idea of what grade you will receive when you submit this experience, based on how much effort you put in.

Setting Up Your Environment

For this experience, a new VM image has been provided for you. It has a larger disk, more RAM, and a batteries-included installation of the Cilium and Hubble CLIs. First, delete your old instance using the OpenStack web console, then create a new one. You can follow the same instructions as before, except replace the COMP4000-studentvm-v1 image with COMP4000-studentvm-v2. When selecting your flavour, make sure you pick the one with 16GiB of disk space and 8GB of RAM.

Once your VM is set up, SSH into it using your preferred SSH client. You should probably be working with at least 2 terminals for the rest of this experience. After SSHing into your instance, run the following command to spin up a multi-node k8s cluster: minikube start -n3 --cni=cilium. This might take a few minutes to run to completion.

Verify your cluster has started correctly using minikube status. You should see something like the following:

minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured

minikube-m02
type: Worker
host: Running
kubelet: Running

minikube-m03
type: Worker
host: Running
kubelet: Running

Cilium and Hubble

Once your cluster is running, you can configure the cluster to use Cilium and Hubble. Due to some minikube and Cilium quirks, we need to uninstall the default version of Cilium installed by minikube and reinstall it using the Cilium CLI. You can do so with the following commands:

cilium uninstall && cilium install && cilium hubble enable

This may take a few minutes to run to completion.

You can run cilium status to check on your Cilium installation. Once everything has installed correctly, you should see something like the following:

student@alpine:~$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         OK
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         OK
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

Deployment        cilium-operator    Desired: 1, Ready: 1/1, Available: 1/1
Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet         cilium             Desired: 3, Ready: 3/3, Available: 3/3
Containers:       cilium             Running: 3
                  cilium-operator    Running: 1
                  hubble-relay       Running: 1
Cluster Pods:     2/2 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium:v1.10.5: 3
                  cilium-operator    quay.io/cilium/operator-generic:v1.10.5: 1
                  hubble-relay       quay.io/cilium/hubble-relay:v1.10.5: 1

Cassandra

We will be using the Helm package manager to install Cassandra in our k8s cluster. To do so, run the following commands:

wget https://homeostasis.scs.carleton.ca/~will/cassandra.yml
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install cassandra bitnami/cassandra -f cassandra.yml

You can verify the status of your Cassandra installation using kubectl get pods --watch. Wait until both cassandra-0 and cassandra-1 show 1/1 under READY. This might take a few minutes. The output should look something like this:

student@alpine:~$ kubectl get pods
NAME                      READY   STATUS    RESTARTS      AGE
cassandra-0               1/1     Running   0             4m2s
cassandra-1               1/1     Running   0             2m8s

Printerfacts

The printerfacts service from last experience has received an upgrade to work with our new Cassandra database. First, run the migrations (provided as a batch job) to initialize the Cassandra database and populate it with some facts about printers:

wget https://homeostasis.scs.carleton.ca/~will/migrations.yml
kubectl apply -f migrations.yml

After applying the migrations, you’re ready to deploy the printerfacts service:

wget https://homeostasis.scs.carleton.ca/~will/deploy.yml
kubectl apply -f deploy.yml

Verify that printerfacts has been correctly deployed with kubectl get pods. You should see something like this:

student@alpine:~$ kubectl get pods
NAME                      READY   STATUS    RESTARTS      AGE
cassandra-0               1/1     Running   0             31m
cassandra-1               1/1     Running   0             30m
server-85f9f4465b-9vhn9   1/1     Running   0             38m
server-85f9f4465b-b7s2n   1/1     Running   0             38m
server-85f9f4465b-bvgqs   1/1     Running   0             38m
server-85f9f4465b-bwf42   1/1     Running   0             38m
server-85f9f4465b-gn5sm   1/1     Running   0             38m

Finally, run minikube tunnel in a separate terminal to set up port forwarding for our LoadBalancer API objects. This will enable you to interact with printerfacts from your VM, rather than needing to spin up a client pod like last time.

Verify that it worked correctly by running curl 10.96.0.201 (this is the IP you will use to talk to printerfacts). Note that depending on your specific configuration, you may need to SSH into minikube using minikube ssh before you can access the load balancer service. If your curl command hangs forever, this may be the issue.
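If your curl command hangs and you want to rule out a Service problem before reaching for minikube ssh, one quick sanity check (a sketch, assuming deploy.yml exposes printerfacts through a LoadBalancer Service as described above) is to list the cluster’s services:

kubectl get svc

The printerfacts Service should list 10.96.0.201 among its IPs once minikube tunnel is running.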

Tasks

Part 1: Multi-Node Kubernetes, Cilium, and Cassandra (Easy)

Follow the instructions for each of the following numbered tasks. Make an effort to answer the accompanying questions, but more importantly please note down all of your observations and describe what you did for each task. You should also feel free to write down whatever questions you may have about a given task.

To achieve the best possible grade in this section, you must demonstrate that you have made an effort to understand the results of each task. (Note that an effort does not strictly mean a full understanding; it is okay to have questions!)

  1. Thanks to the Cilium CNI, your Kubernetes cluster has been outfitted with new eBPF superpowers. In particular, Cilium installs a series of eBPF programs into your (VM’s) kernel which can be used to monitor traffic between containers, pods, and nodes, as well as enforce L4–L7 security policy. To test out your new superpowers, run cilium hubble port-forward & followed by hubble observe. What is all that output? Can you make sense of any of it?
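    If the firehose of output is overwhelming, hubble observe accepts filter flags that can help you narrow things down. A few examples follow; this is a sketch, so run hubble observe --help to confirm what your installed version supports:

    hubble observe --namespace default    # only flows involving the default namespace
    hubble observe --pod cassandra-0      # only flows to or from a particular pod
    hubble observe --protocol http        # only flows parsed as HTTP
    hubble observe -o json                # raw JSON output, one flow per line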

  2. Unlike last time, our cluster consists of three nodes rather than just one. Try running curl 10.96.0.201 a few times and notice that the output now includes a node name in addition to a pod name. Do you notice any patterns in the output? Compare what you see with the output from running kubectl get pods -o wide. Try to come up with an explanation for the distribution of printerfacts pods over the nodes in your cluster.

    • Hint 1: Have a look at the affinity section of deploy.yml (a generic example of such a section is sketched after these hints).
    • Hint 2: Have a look at the relevant documentation.
    • Hint 3: If you still can’t get it, an educated guess will suffice.
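    For reference, pod anti-affinity stanzas in a Deployment spec generally look something like the fragment below. This is a generic sketch only; the label and weight are placeholders, not the actual contents of deploy.yml:

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: example            # placeholder label; check deploy.yml for the real one
            topologyKey: kubernetes.io/hostname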
  3. Printerfacts is now a CRUD app that supports creating, reading, updating, and deleting facts in the Cassandra database. In particular, we now support the following endpoints:

    • GET 10.96.0.201/fact: Get a random fact from the database as a JSON object
    • GET 10.96.0.201/fact/keys: Get a list of all fact keys in the database as a JSON object
    • GET 10.96.0.201/fact/<key>: Get the fact with the key <key> in the database as a JSON object
    • POST 10.96.0.201/fact: Create a new fact where the request body is a JSON object of the form { "fact": "Fact Here", "kind": "Kind of fact (e.g. Cat fact)" }
    • PUT 10.96.0.201/fact/<key>: Modify the fact with key <key> in the database where the request body is a JSON object of the form { "fact": "Fact Here", "kind": "Kind of fact (e.g. Cat fact)" }
    • DELETE 10.96.0.201/fact/<key>: Delete the fact with key <key> from the database

    Try out each of the printerfacts endpoints. Note that you can specify the HTTP request type using curl -X <type> (for example, curl -X POST to send a POST request). You can send a JSON body as a payload by using curl -H 'Content-type: application/json' -d '{"key": "value"}' where you replace the {"key": "value"} part with your JSON object. A sketch of complete example invocations appears at the end of this task.

    Optional: Try restarting the Cassandra pods using kubectl rollout restart statefulset cassandra. While the pods are restarting, watch the output of kubectl get pods --watch and try making requests to the printerfacts service at the same time. Do you notice any downtime?
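    As promised above, here is a sketch of what exercising each endpoint might look like. The fact text is made up for illustration, and <key> stands for a key returned by the /fact/keys endpoint:

    # read endpoints
    curl 10.96.0.201/fact
    curl 10.96.0.201/fact/keys
    curl 10.96.0.201/fact/<key>

    # create a new fact
    curl -X POST -H 'Content-type: application/json' -d '{"fact": "Printers purr when idle.", "kind": "Printer fact"}' 10.96.0.201/fact

    # modify, then delete, an existing fact
    curl -X PUT -H 'Content-type: application/json' -d '{"fact": "Printers do not actually purr.", "kind": "Printer fact"}' 10.96.0.201/fact/<key>
    curl -X DELETE 10.96.0.201/fact/<key>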

  4. Let’s use Cilium and Hubble to observe dataflows between our Cassandra database and the printerfacts service. To do this, run the following Hubble command: hubble observe --label component=server --label component=printerfacts -f. Make a few requests to the printerfacts service, exercising various API endpoints. Try and make sense of some of the network traffic you observe.

  5. In an effort to secure your cluster, you decide to employ a Cilium network policy that locks down the printerfacts service. First, download the example security policy by running wget https://homeostasis.scs.carleton.ca/~will/printerfacts-policy.yml, then run kubectl apply -f printerfacts-policy.yml to apply it. The example policy allows GET requests to /, /fact, /fact/keys, and /fact/<key>. Examine the policy file and make sure you understand how it works.

    Try out your policy by making a few valid requests followed by some invalid ones. While your policy is applied, try using Hubble to observe the HTTP traffic by running hubble observe --protocol http. Optionally, try extending the policy to make some other valid routes work (e.g. PUT, POST, and DELETE requests); a sketch of such an extension appears after the hint below.

    Hint: You may wish to consult the Cilium docs.
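    To give a sense of what extending the policy might involve, the fragment below sketches L7 HTTP rules that would also allow POST, PUT, and DELETE requests. The policy name, endpoint selector labels, and port here are placeholders based on typical Cilium HTTP policies; adapt them to match what you actually find in printerfacts-policy.yml:

    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: printerfacts-policy        # placeholder; reuse the name from the provided policy
    spec:
      endpointSelector:
        matchLabels:
          component: server            # placeholder; use the labels from the provided policy
      ingress:
      - toPorts:
        - ports:
          - port: "80"                 # placeholder; use the port from the provided policy
            protocol: TCP
          rules:
            http:
            - method: "GET"
              path: "/"
            - method: "GET"
              path: "/fact"
            - method: "GET"
              path: "/fact/.*"
            - method: "POST"
              path: "/fact"
            - method: "PUT"
              path: "/fact/.*"
            - method: "DELETE"
              path: "/fact/.*"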

  6. Now that you’re familiar with printerfacts, it’s time to simulate a node failure. Since both printerfacts and Cassandra are replicated across multiple nodes in our cluster, taking a single node down should not impact either service.

    Let’s first try a planned node disruption. This kind of disruption might occur when a cluster administrator wants to take a node down for maintenance, for example to perform a kernel update. Start by draining node minikube-m02 with the command kubectl drain minikube-m02 --ignore-daemonsets. Now try interacting with the printerfacts service as before. Make note of any unusual behaviour and try to come up with a best-guess explanation.

    Now you can bring your node back up using kubectl uncordon minikube-m02. Once your node has been uncordoned, it should be ready for scheduling again. Try scaling up the printerfacts deployment to get some pods running on the node again (a sketch of the relevant commands appears at the end of this task).

    Finally, let’s simulate a total node failure. To do this, we will kill the node’s underlying kubelet by killing the node itself, which runs on your system as a Docker container. Use the docker ps command to find the container ID that corresponds to your node, then kill it using docker kill. Repeat the same experiments from before, documenting any unusual behaviour you observe.
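    A sketch of the commands involved in the last two steps, assuming the printerfacts deployment is named server (as the pod names above suggest); adjust the names and replica count to match your cluster:

    # scale the printerfacts deployment back up after uncordoning
    kubectl get deployments
    kubectl scale deployment server --replicas=8    # any count larger than the current one

    # simulate a total node failure by killing the node's Docker container
    docker ps                     # find the container that corresponds to minikube-m02
    docker kill <container-id>    # replace <container-id> with the ID from docker ps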

Part 2: Interacting with Cassandra (Hard)

Note that this part of the experience is only required if you wish to achieve a grade of A- or higher. You can also choose to skip one of the two questions here, but doing so will likely impact your grade.

  1. Write a Cilium security policy that allows only SELECT and INSERT queries to the pfacts.facts table in Cassandra. All other queries should be denied. Apply your policy and demonstrate how it works using a few example queries. You may wish to consult the Cilium policy docs; a structural sketch of the relevant rule shape appears after the hint below.

    Hint: You can use this policy template as a starting point.
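    For orientation, Cilium’s Cassandra-aware L7 rules generally take the shape sketched below. The policy name and selector labels are placeholders, and you should check the exact field names against the provided template and the Cilium docs rather than taking this fragment as authoritative:

    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: cassandra-lockdown             # placeholder name
    spec:
      endpointSelector:
        matchLabels:
          app: cassandra                   # placeholder; match your Cassandra pods' labels
      ingress:
      - toPorts:
        - ports:
          - port: "9042"                   # Cassandra's CQL native transport port
            protocol: TCP
          rules:
            l7proto: cassandra
            l7:
            - query_action: "select"
              query_table: "pfacts.facts"
            - query_action: "insert"
              query_table: "pfacts.facts"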

  2. Write and deploy your own containerized application as a replicated deployment to interact with the Cassandra database in some interesting way. You can choose to extend the printerfacts schema or come up with your own schema, depending on your preference. Be sure to explain how your application deals with Cassandra’s consistency model in a replicated cluster. To receive credit for this question, you must make some meaningful modifications to the database; just consuming the existing data in a new way is not sufficient.

    • Hint 1: Cassandra allows the client to choose their consistency level when making queries.
    • Hint 2: Cassandra queries are made in CQL, a NoSQL query language that is similar to but not totally compatible with SQL.
    • Hint 3: You can spawn a cqlsh session to issue test CQL queries like kubectl exec -it cassandra-0 -- cqlsh (a short example session is sketched below).
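    As a starting point, a minimal cqlsh session for exploring the hints above might look like the following, assuming the pfacts.facts table created by the migrations. DESCRIBE shows the schema, CONSISTENCY sets the consistency level for the session, and the SELECT reads back a few rows at that level:

    kubectl exec -it cassandra-0 -- cqlsh
    cqlsh> DESCRIBE TABLE pfacts.facts;
    cqlsh> CONSISTENCY QUORUM;
    cqlsh> SELECT * FROM pfacts.facts LIMIT 5;

    Most Cassandra client drivers expose the same per-query consistency choice, which is the knob your own application will need to reason about.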

Part 3: Reflection

Summarize your experience with multi-node Kubernetes, Cilium, and Cassandra in a few paragraphs (both the good and the bad). What concepts do you see reflected here from the research papers we have read thus far? After having some hands-on experience with a distributed systems technology, have any of your opinions or initial assumptions changed? Feel free to list any other thoughts you have.

Acknowledgements

The idea for the printerfacts API comes from Christine Dodrill’s wonderful blog post.