When you run an application, the physical layout of your architecture (CPU, memory, and storage) forms what is known as your topology, and it directly impacts your application's performance. The same is true for container-based applications running on Kubernetes and the storage they use.
In this blog, we will see how topology affects applications running in Kubernetes clusters and take a close look at Kubernetes’ topology-aware volume provisioning feature.
First, let's understand what a topology is in the simplest terms. Topology refers to the physical arrangement of components in a computer, such as the CPU, memory, and disks.
For performance-critical workloads, such as high-performance computing (HPC), financial applications, or the Internet of Things (IoT), topology information is required so that the work those applications perform can be scheduled on co-located CPUs and other devices. This ensures optimal performance.
Now the question is: why does your topology matter to Kubernetes?
As discussed above, high-performance applications need topology information to co-locate CPUs and devices for optimal performance. One of the major challenges is that the default Kubernetes scheduler is not topology aware. This is mainly due to how the Kubernetes scheduler works.
The Kubernetes scheduler is responsible for assigning pods to nodes. Let's elaborate with an example. Say you are running a pod that requires 8 CPUs. The scheduler first filters out the nodes that don't have 8 CPUs available, then scores the remaining candidates and assigns the pod to the highest-scoring node.
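As a minimal sketch of this example (the pod name and image are hypothetical placeholders), such a pod spec might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-heavy-pod        # hypothetical name
spec:
  containers:
  - name: app
    image: nginx             # placeholder image
    resources:
      requests:
        cpu: "8"             # the scheduler only considers nodes with 8 CPUs available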
Note: The Kubernetes scheduler only decides which pod goes to which node. It doesn't place the pod on the node; that's the job of the kubelet, which creates the pod on the assigned node.
Now that you have a brief idea of how the scheduler assigns pods to nodes, let's see why it presents some challenges with topology.
Prior to the introduction of topology-aware volumes, running pods with zonal persistent disks was a significant challenge, because Kubernetes handled dynamic provisioning and pod scheduling independently.
As soon as a persistent volume claim (PVC) was created, the volume was provisioned. The provisioner had no knowledge of which pod would use the volume, which means the volume could be provisioned in one availability zone while the pod was scheduled in another. This would ultimately result in a failed pod.
The workaround for this issue was either to overprovision nodes in all the zones or to manually create volumes in the correct zones, but both of these solutions defeat the whole purpose of dynamic provisioning. This issue has been addressed with the introduction of Kubernetes topology-aware dynamic provisioning.
Starting with Kubernetes 1.12, support was introduced for topology-aware dynamic provisioning of persistent volumes. With this feature, volume provisioning draws on input from the Kubernetes scheduler to make an informed decision about the most suitable location to provision a persistent volume for a pod. This is especially helpful in multi-availability-zone environments, because volumes are provisioned in the same availability zone where the pod is running.
A related component is the topology manager, which ensures that a Kubernetes pod's resources are correctly aligned at runtime. The topology manager is an integral part of the kubelet; it doesn't apply any specific constraint on its own, but collects topology hints from components such as the CPU manager and device manager and enforces resource alignment according to the configured policy.
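As a rough sketch, and assuming you manage the kubelet configuration yourself (on managed services this is typically set through the node group's kubelet configuration), the alignment policy can be set in the KubeletConfiguration:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# ... other kubelet settings ...
topologyManagerPolicy: single-numa-node   # other options: none, best-effort, restricted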
Let's try to understand the use case for this technology. When an application is horizontally scalable, you run multiple replicas of it, and you want those replicas spread across different availability zones so that your application isn't impacted if one zone goes down. Topology-aware provisioning ensures optimal placement by keeping each pod and its persistent volume in the same AZ.
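As a sketch of how you might spread the replicas themselves across zones, a topologySpreadConstraints entry can be added to the pod template (the run: my-nginx label here assumes the label used in the StatefulSet later in this post):

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread replicas evenly across zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        run: my-nginx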
Before we start to set up a Kubernetes topology-aware volume, let us first understand a few key terms:
StorageClass: A Kubernetes object that describes a class of storage and the provisioner used to create volumes of that class.
PersistentVolume (PV): A piece of storage in the cluster, either provisioned dynamically or created by an administrator.
PersistentVolumeClaim (PVC): A user's request for storage, which binds to a matching persistent volume.
volumeBindingMode: A StorageClass setting that controls when volume binding and dynamic provisioning occur. It can be Immediate (the default) or WaitForFirstConsumer.
In the following scenario, we will first define a new storage class and a volume claim template. The volume claim template will create two persistent volumes in different availability zones, which will later be bound to the pods running in those availability zones. But first we will set the volume to be topology-aware.
To set up a topology-aware volume, you can create a storage class with volumeBindingMode set to WaitForFirstConsumer, or use a storage class that already comes with this mode enabled.
As discussed above, the default mode is Immediate. The problem with that mode arises when the storage backend is topology-constrained and not accessible from all nodes in the cluster, i.e., from nodes in different availability zones: the persistent volume is created without any knowledge of the pod's scheduling requirements. Setting volumeBindingMode to WaitForFirstConsumer solves this problem by delaying the provisioning and binding of a persistent volume until a pod that uses the persistent volume claim is created.
For example, EBS volumes are availability zone-specific: a volume in us-west-2a can't be mounted to an instance in us-west-2b or 2c. This is a topology constraint of AWS, intended to reduce latency between instances and the backend storage. The main idea behind WaitForFirstConsumer is to delay provisioning so that the volume is only created and bound once the pod that consumes it has been scheduled.
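For illustration, a claim against such a storage class (the claim name below is hypothetical) stays in Pending until a pod that references it is scheduled; only then is the EBS volume created, in that pod's zone:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim            # hypothetical name
spec:
  storageClassName: gp2       # a WaitForFirstConsumer storage class
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi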
NOTE: All the examples below use Amazon EKS; you can read more about EKS in the AWS documentation. First, check which storage classes already exist in the cluster:
$ kubectl get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  16m
As you can see, the default storage class (gp2) that comes with AWS EKS has volumeBindingMode set to WaitForFirstConsumer. You can also create your own storage class with this mode enabled:
$ cat sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
$ kubectl create -f sc.yaml
storageclass.storage.k8s.io/standard created
$ kubectl get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  15m
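If you also want to restrict which zones the provisioner may use, a storage class can additionally carry an allowedTopologies section. A minimal sketch (the class name is hypothetical):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-zonal        # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-east-1d
    - us-east-1f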
Note: In the example below, we first create a headless service. A headless service is helpful when you don't need a cluster IP or load balancing. This is useful with a StatefulSet, where each pod needs a stable network identity.
apiVersion: v1
kind: Service
metadata:
  name: nginx-headless
  labels:
    run: my-nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    run: my-nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-sts
spec:
  serviceName: "nginx-headless"
  replicas: 2
  selector:
    matchLabels:
      run: my-nginx
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: www
          mountPath: /var/www/
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      storageClassName: standard
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Mi
After applying this manifest and waiting for both pods to come up, we can verify that the volumes were created in different availability zones:
$ kubectl describe pv | grep VolumeID
VolumeID: aws://us-east-1f/vol-0fca6b7845155b415
VolumeID: aws://us-east-1a/vol-0f9dacfb7d7180968
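If you inspect one of these volumes with kubectl get pv -o yaml, you should also see a node affinity section that the provisioner added automatically, constraining the volume to its zone. It looks roughly like this (zone value shown for the first volume):

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1f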
You can also constrain where the pods themselves run by adding a node affinity rule to the pod template:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - us-east-1d
            - us-east-1f
In this case, we are adding the constraint that the pods can only be scheduled in availability zones us-east-1d and us-east-1f. In the StatefulSet above, this block would go under the pod template's spec (spec.template.spec.affinity).
Looking at the node labels, we can see which availability zone each node belongs to (the topology.kubernetes.io/zone label):
$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-192-168-19-67.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-19-67.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
ip-192-168-21-238.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-21-238.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
ip-192-168-48-227.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1f,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-48-227.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1f
Similarly, the persistent volumes are labeled with the zone in which they were provisioned:
$ kubectl get pv --show-labels
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE LABELS
pvc-2e6516a8-731f-4274-bba6-beddc86c95a1 1Gi RWO Delete Bound default/www-nginx-sts-1 standard 26m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1f
pvc-47db9d9b-53fd-475a-bf6f-fb25180c5773 1Gi RWO Delete Bound default/www-nginx-sts-0 standard 26m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
pvc-7fddf3dc-b713-44c6-abf3-c72af53daded 10Gi RWO Delete Bound default/pv-claim standard 32m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
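To confirm the alignment end to end, you can cross-reference where each pod landed with the zones shown above. A couple of quick checks:

$ kubectl get pods -o wide        # the NODE column shows which node (and hence zone) each replica runs on
$ kubectl get pvc                 # each claim should show STATUS Bound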
In this blog, you have learned how topology plays a crucial role in getting optimal performance from your Kubernetes cluster. We then looked at topology-aware volume provisioning, which ensures there is no mismatch between where a pod is scheduled and the zone in which its volume is provisioned. This feature is supported by all the major cloud providers (AWS, GCP, and Azure).
In the scenario above, we created two volumes in two availability zones. In a more generic scenario, if you need to create multiple copies of volumes in multiple availability zones, this is a good case for turning to NetApp Cloud Volumes ONTAP.
To get even more from your Kubernetes cluster’s persistent storage layer, including synchronous cross-AZ deployment, consider NetApp Cloud Volumes ONTAP. Cloud Volumes ONTAP gives Kubernetes users on AWS, Azure, and GCP an automatic and dynamic way to respond to persistent volume claims via the NetApp Astra Trident provisioner, space-reducing storage efficiencies, instant volume cloning, and more.