When you run an application, the physical layout of your architecture (CPU, memory, and storage) forms what is known as your topology, and it directly impacts your application's performance. The same is true for container-based applications running on Kubernetes and the storage they use.
In this blog, we will see how topology affects applications running in Kubernetes clusters and take a close look at Kubernetes’ topology-aware volume provisioning feature.
First, let's understand what a topology is in the simplest terms. Topology refers to the physical arrangement of components in a computer, such as the CPU, memory, and disks.
For performance-critical workloads, such as high-performance computing (HPC), financial applications, or the Internet of Things (IoT), topology information is required so that the work those applications perform can be scheduled on co-located CPUs and other devices. This ensures optimal performance.
Now the question is: why does your topology matter to Kubernetes?
As discussed above, high-performance applications need topology information to co-locate CPUs and devices for optimal performance. One of the major challenges is that the default Kubernetes scheduler is not topology aware. This is mainly due to how the Kubernetes scheduler works.
The Kubernetes scheduler is responsible for assigning pods to nodes. Let's elaborate with an example. Say you are running a pod that requires 8 CPUs. The scheduler first filters out the nodes that don't have 8 CPUs available, then scores the remaining candidates and assigns the pod to the highest-scoring node.
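As a minimal sketch of this example (the pod name and image are hypothetical placeholders), such a pod spec might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-heavy-pod        # hypothetical name
spec:
  containers:
  - name: app
    image: nginx             # placeholder image
    resources:
      requests:
        cpu: "8"             # the scheduler only considers nodes with 8 CPUs available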
Note: The Kubernetes scheduler only decides which pod goes to which node. It doesn't place the pod on the node; that's the job of the kubelet, which creates the pod on the assigned node.
Now that you have a brief idea of how the scheduler assigns pods to nodes, let's see why it presents some challenges with topology.
Prior to the introduction of topology-aware volumes, running pods with zonal persistent disks was a significant challenge, because Kubernetes handled dynamic provisioning and pod scheduling independently.
As soon as a persistent volume claim (PVC) was created, the volume was provisioned. The provisioner had no knowledge of which pod would use the volume, which means the volume could be provisioned in one availability zone while the pod was scheduled in another. This would ultimately result in a failed pod.
The workaround for this issue was either to overprovision nodes in all the zones or to manually create volumes in the correct zones, but both of these solutions defeat the whole purpose of dynamic provisioning. This issue has been addressed with the introduction of Kubernetes topology-aware dynamic provisioning.
Starting with Kubernetes 1.12, support was introduced for topology-aware dynamic provisioning of persistent volumes. With this feature, volume provisioning draws on input from the Kubernetes scheduler to make an informed decision about the most suitable location to provision a persistent volume for a pod. This is especially helpful in multi-availability-zone environments, because volumes are provisioned in the same availability zone where the pod is running.
A related component is the topology manager, which ensures that a Kubernetes pod's resources are correctly aligned at runtime. The topology manager is an integral part of the kubelet; it doesn't apply any specific constraint on its own, but collects topology hints from components such as the CPU manager and device manager and enforces resource alignment according to the configured policy.
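As a rough sketch, and assuming you manage the kubelet configuration yourself (on managed services this is typically set through the node group's kubelet configuration), the alignment policy can be set in the KubeletConfiguration:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# ... other kubelet settings ...
topologyManagerPolicy: single-numa-node   # other options: none, best-effort, restricted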
Let's try to understand the use case for this technology. When an application is horizontally scalable, you run multiple replicas of it, and you want those replicas spread across different availability zones so that your application isn't impacted if one zone goes down. Topology-aware provisioning ensures optimal placement by keeping each pod and its persistent volume in the same AZ.
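As a sketch of how you might spread the replicas themselves across zones, a topologySpreadConstraints entry can be added to the pod template (the run: my-nginx label here assumes the label used in the StatefulSet later in this post):

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread replicas evenly across zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        run: my-nginx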
Before we start to set up a Kubernetes topology-aware volume, let us first understand a few key terms:
StorageClass: A Kubernetes object that describes a class of storage and the provisioner used to create volumes of that class.
PersistentVolume (PV): A piece of storage in the cluster, either provisioned dynamically or created by an administrator.
PersistentVolumeClaim (PVC): A user's request for storage, which binds to a matching persistent volume.
volumeBindingMode: A StorageClass setting that controls when volume binding and dynamic provisioning occur. It can be Immediate (the default) or WaitForFirstConsumer.
In the following scenario, we will first define a new storage class and a volume claim template. The volume claim template will create two persistent volumes in different availability zones, which will later be bound to the pods running in those availability zones. But first we will set the volume to be topology-aware.
To set up a topology-aware volume, you can create a storage class with volumeBindingMode set to WaitForFirstConsumer, or use a storage class that already comes with this mode enabled.
As discussed above, the default mode is Immediate. The problem with that mode arises when the storage backend is topology-constrained and not accessible from all nodes in the cluster, i.e., from nodes in different availability zones: the persistent volume is created without any knowledge of the pod's scheduling requirements. Setting volumeBindingMode to WaitForFirstConsumer solves this problem by delaying the provisioning and binding of a persistent volume until a pod that uses the persistent volume claim is created.
For example, EBS volumes are availability zone-specific: a volume in us-west-2a can't be mounted to an instance in us-west-2b or 2c. This is a topology constraint of AWS, intended to reduce latency between instances and the backend storage. The main idea behind WaitForFirstConsumer is to delay provisioning so that the volume is only created and bound once the pod that consumes it has been scheduled.
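For illustration, a claim against such a storage class (the claim name below is hypothetical) stays in Pending until a pod that references it is scheduled; only then is the EBS volume created, in that pod's zone:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim            # hypothetical name
spec:
  storageClassName: gp2       # a WaitForFirstConsumer storage class
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi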
NOTE: All the examples below use Amazon EKS; you can read more about EKS in the AWS documentation. First, check which storage classes already exist in the cluster:
$ kubectl get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  16m
As you can see, the default storage class (gp2) that comes with AWS EKS has volumeBindingMode set to WaitForFirstConsumer. You can also create your own storage class with this mode enabled:
$ cat sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
$ kubectl create -f sc.yaml
storageclass.storage.k8s.io/standard created
$ kubectl get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  15m
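If you also want to restrict which zones the provisioner may use, a storage class can additionally carry an allowedTopologies section. A minimal sketch (the class name is hypothetical):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-zonal        # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-east-1d
    - us-east-1f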
Note: In the example below, we first create a headless service. A headless service is helpful when you don't need a cluster IP or load balancing. This is useful with a StatefulSet, where each pod needs a stable network identity.
apiVersion: v1
kind: Service
metadata:
  name: nginx-headless
  labels:
    run: my-nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    run: my-nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-sts
spec:
  serviceName: "nginx-headless"
  replicas: 2
  selector:
    matchLabels:
      run: my-nginx
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: www
          mountPath: /var/www/
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      storageClassName: standard
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Mi
After applying this manifest and waiting for both pods to come up, we can verify that the volumes were created in different availability zones:
$ kubectl describe pv | grep VolumeID
VolumeID: aws://us-east-1f/vol-0fca6b7845155b415
VolumeID: aws://us-east-1a/vol-0f9dacfb7d7180968
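If you inspect one of these volumes with kubectl get pv -o yaml, you should also see a node affinity section that the provisioner added automatically, constraining the volume to its zone. It looks roughly like this (zone value shown for the first volume):

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1f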
You can also constrain where the pods themselves run by adding a node affinity rule to the pod template:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - us-east-1d
            - us-east-1f
In this case, we are adding the constraint that the pods can only be scheduled in availability zones us-east-1d and us-east-1f. In the StatefulSet above, this block would go under the pod template's spec (spec.template.spec.affinity).
Looking at the node labels, we can see which availability zone each node belongs to (the topology.kubernetes.io/zone label):
$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-192-168-19-67.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-19-67.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
ip-192-168-21-238.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-21-238.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
ip-192-168-48-227.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1f,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-48-227.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1f
Similarly, the persistent volumes are labeled with the zone in which they were provisioned:
$ kubectl get pv --show-labels
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE LABELS
pvc-2e6516a8-731f-4274-bba6-beddc86c95a1 1Gi RWO Delete Bound default/www-nginx-sts-1 standard 26m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1f
pvc-47db9d9b-53fd-475a-bf6f-fb25180c5773 1Gi RWO Delete Bound default/www-nginx-sts-0 standard 26m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
pvc-7fddf3dc-b713-44c6-abf3-c72af53daded 10Gi RWO Delete Bound default/pv-claim standard 32m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
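To confirm the alignment end to end, you can cross-reference where each pod landed with the zones shown above. A couple of quick checks:

$ kubectl get pods -o wide        # the NODE column shows which node (and hence zone) each replica runs on
$ kubectl get pvc                 # each claim should show STATUS Bound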
In this blog, you have learned how topology plays a crucial role in getting optimal performance from your Kubernetes cluster. We then looked at topology-aware volume provisioning, which ensures there is no mismatch between where a pod is scheduled and the zone in which its volume is provisioned. This feature is supported by all the major cloud providers (AWS, GCP, and Azure).
In the scenario above, we created two volumes in two availability zones. In a more generic scenario, if you need to create multiple copies of volumes in multiple availability zones, this is a good case for turning to NetApp Cloud Volumes ONTAP.
To get even more from your Kubernetes cluster’s persistent storage layer, including synchronous cross-AZ deployment, consider NetApp Cloud Volumes ONTAP. Cloud Volumes ONTAP gives Kubernetes users on AWS, Azure, and GCP an automatic and dynamic way to respond to persistent volume claims via the NetApp Astra Trident provisioner, space-reducing storage efficiencies, instant volume cloning, and more.