When you run an application, the physical layout of your architecture (CPU, memory, and storage) forms what is known as your topology, and it directly impacts your application's performance. The same is true for container-based applications running on Kubernetes storage.
In this blog, we will see how topology affects applications running in Kubernetes clusters and take a close look at Kubernetes’ topology-aware volume provisioning feature.
Jump down to the how-to steps using the links below:
- What Is Topology and Why Does It Matter?
- The Topology Challenge with Kubernetes
- Kubernetes Topology-Aware Volumes
- How to Set Up a Kubernetes Topology-Aware Volume
What Is Topology and Why Does It Matter?
First, let's define topology in the simplest terms. Topology refers to the physical arrangement of the components in a computer system, including devices such as the CPU, memory, and disks.
For performance-critical workloads, such as high performance computing (HPC), financial applications, or the Internet of Things (IoT), topology information is required so that the work for those applications can be scheduled on co-located CPUs and other devices. This ensures optimal performance.
Now the question is: why does your topology matter to Kubernetes?
The Topology Challenge with Kubernetes
As we discussed above, high-performance applications need topology information to co-locate CPU or devices for optimal performance. One of the major challenges is that the default Kubernetes scheduler is not topology aware. This is mainly due to how the Kubernetes scheduler works.
How the Kubernetes Scheduler Works
The Kubernetes scheduler is responsible for assigning pods to nodes. Let's elaborate on this with the help of an example. Say you are running a pod that requires 8 CPUs. Scheduling then happens in two steps:
- Filtering: The scheduler filters out the nodes that don't meet this requirement. For example, say you have 3 worker nodes: one with 6 CPUs, another with 9 CPUs, and a third with 12 CPUs. The scheduler filters out the node with 6 CPUs, as it doesn't meet the pod's request for 8 CPUs.
- Scoring: Once the scheduler has filtered out the nodes that don't meet the requirement, the next step is scoring the ones that do. The scheduler ranks the nodes based on how much of the resource would be left after placing the pod. In the above case, after placing the pod on the node with 9 CPUs, only 1 CPU is left free, while the node with 12 CPUs would have 4 free CPUs after placing the pod. Since the node with 12 CPUs gets a better score due to its higher number of free CPUs, the scheduler decides it is the ideal location for the pod.
Note: The Kubernetes scheduler only decides which pod goes to which node. It doesn't place the pod on the node; that's the job of the kubelet, which creates the pod.
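The filtering and scoring steps above can be sketched in a few lines of Python. This is an illustrative model of the "least allocated" scoring idea, not the actual kube-scheduler code, and the node names and CPU counts are made up:

```python
def schedule(pod_cpus, nodes):
    """Pick a node for a pod requesting pod_cpus CPUs.

    nodes maps node name -> free CPUs. Returns the chosen node name,
    or None if no node can fit the pod.
    """
    # Filtering: drop nodes that can't satisfy the CPU request.
    feasible = {name: cpus for name, cpus in nodes.items() if cpus >= pod_cpus}
    if not feasible:
        return None
    # Scoring: prefer the node with the most CPUs left free after placement.
    return max(feasible, key=lambda name: feasible[name] - pod_cpus)

# Three worker nodes with 6, 9, and 12 free CPUs, as in the example above.
nodes = {"node-6cpu": 6, "node-9cpu": 9, "node-12cpu": 12}
print(schedule(8, nodes))  # node-12cpu (4 CPUs left free vs. 1 on node-9cpu)
```

The 6-CPU node is filtered out, and the 12-CPU node wins scoring because it would have more CPUs free after the pod lands.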
Kubernetes Scheduler and Topology Challenges
Now that you have a brief idea of how the scheduler assigns pods to nodes, let's see why it presents some challenges with topology.
Prior to the introduction of topology-aware volumes, running pods with zonal persistent disks was a significant challenge, because Kubernetes handled dynamic provisioning and pod scheduling independently.
As soon as a persistent volume claim (PVC) was created, the volume got provisioned. The provisioner had no knowledge of the pod that would use the volume, which meant the volume could be provisioned in one availability zone while the pod was scheduled in another. This would ultimately result in a failed pod.
The workarounds for this issue were to overprovision nodes in all zones or to manually create volumes in the correct zones, but both of these solutions defeat the whole purpose of dynamic provisioning. This issue has been addressed with the introduction of Kubernetes topology-aware dynamic provisioning.
Kubernetes Topology-Aware Volumes
Starting with Kubernetes 1.12, support has been available for topology-aware dynamic provisioning of persistent volumes. Kubernetes now draws on inputs from the Kubernetes scheduler to make an informed decision about the ideal location to provision a persistent volume for a pod. This is especially helpful in multi-availability zone environments, because volumes are provisioned in the same availability zone where the pod is running.
A related component, the topology manager, ensures that a Kubernetes pod is provided with resources that are correctly aligned at runtime. The topology manager is an integral part of the kubelet; the scheduler doesn't apply these alignment constraints itself, so it's the topology manager's responsibility to enforce resource alignment.
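As a side note, the topology manager's policy is set in the kubelet's configuration. A minimal sketch of a KubeletConfiguration fragment (assuming a cluster version where the Topology Manager feature is available; best-effort is just one of the possible policies):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Possible policies: none (the default), best-effort, restricted, single-numa-node
topologyManagerPolicy: best-effort
```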
Let's look at a use case for this technology. When an application is horizontally scalable, you run multiple replicas of that application, and you want those replicas in different availability zones so that if one availability zone goes down, your application is not impacted. Topology-aware provisioning ensures optimal placement, with both the pod and its persistent volume in the same AZ.
How to Set Up a Kubernetes Topology-Aware Volume
Before we start to set up a Kubernetes topology-aware volume, let us first understand a few key terms:
- Storage Class: The storage class allows dynamic provisioning, where volumes are created only when the application requires them.
- Volume Binding Mode: This field has control over exactly when dynamic provisioning and volume binding will take place. The default mode is Immediate, which indicates that volume binding and dynamic provisioning occur once the persistent volume claim (PVC) is created.
- StatefulSet: A StatefulSet is similar to a Deployment but adds advantages such as persistent storage, a unique network identifier, and graceful deployment and scaling. Read more about StatefulSet here.
- TopologyKey: A topologyKey is the key of a node label. If two nodes have identical values for that label, the scheduler treats them as belonging to the same topology domain, and it tries to balance pods across topology domains.
Walkthrough on Creating a Topology-Aware Persistent Volume
In the following scenario we will first define a new storage class and a volume claim template. The volume claim template will create two persistent volumes in different availability zones that will later be bound to pods running in those availability zones. But first we will set the volume to be topology-aware.
To set up a topology-aware volume, you can create a storage class with volumeBindingMode set to WaitForFirstConsumer, or use a storage class that already comes with this mode enabled.
As discussed above, the default mode is Immediate. The problem with that mode arises when the storage backend is topology-constrained and not accessible from all nodes in the cluster, i.e., from nodes in a different availability zone. The persistent volume then gets created without any knowledge of the pod's scheduling requirements. Setting volumeBindingMode to WaitForFirstConsumer solves this problem by delaying the provisioning and binding of a persistent volume until a pod that uses the persistent volume claim is created.
For example, EBS volumes are availability zone-specific: a volume in us-west-2a can't be mounted to an instance in us-west-2b or 2c. This is a topology constraint of AWS, intended to reduce latency between instances and the backend storage. The main idea behind WaitForFirstConsumer is to delay provisioning so that the volume is created and bound only once the pod that consumes it has been scheduled.
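You can see delayed binding in action by creating a standalone PVC (a minimal sketch; the claim name and size here are arbitrary) against a WaitForFirstConsumer storage class, such as the standard class created below. The claim stays in Pending, with an event along the lines of "waiting for first consumer to be created before binding", until a pod actually references it:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim
spec:
  storageClassName: standard   # a class with volumeBindingMode: WaitForFirstConsumer
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```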
NOTE: All the below examples use AWS EKS. You can read more about AWS EKS here.
- To verify an existing storage class in AWS, use the command below:
$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 16m
As you can see, the default storage class (gp2) that comes with AWS has volumeBindingMode set to WaitForFirstConsumer.
- You can create your storage class by using the below yaml definition and set volumeBindingMode: WaitForFirstConsumer.
$ cat sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
- Create the storage class using the command below:
$ kubectl create -f sc.yaml
- To verify the newly created storage class, use the following command:
$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 15m
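Beyond volumeBindingMode, a storage class can also restrict the zones in which volumes may be provisioned through the allowedTopologies field. Here is a sketch (the class name and zone values are illustrative; older clusters may use the failure-domain.beta.kubernetes.io/zone label key instead):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
# Restrict provisioning to these zones only
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-east-1d
    - us-east-1f
```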
- To use the StorageClass we just created, we will create a StatefulSet using that storage class. For this we will need to define volumeClaimTemplates and the storageClassName.
Note: In the below example, we first create a headless service. A headless service is helpful when you don't need a cluster IP or load balancing, which makes it useful in a StatefulSet, where it provides the network identity for your pods.
apiVersion: v1
kind: Service
metadata:
  name: nginx-headless
  labels:
    run: my-nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    run: my-nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-sts
spec:
  serviceName: "nginx-headless"
  replicas: 2
  selector:
    matchLabels:
      run: my-nginx
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: www
          mountPath: /var/www/
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      storageClassName: standard
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Mi
- Now, if you try to get the VolumeIDs, you will see the volumes are provisioned in different availability zones.
$ kubectl describe pv | grep VolumeID
- You can also add other pod scheduling constraints that aren’t directly related to topology, such as pod affinity and anti-affinity, taints and tolerations, resource requirements, and node selector:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - us-east-1d
            - us-east-1f
For example, in this case we are adding the constraint that the pod can only be created in availability zones us-east-1d and us-east-1f.
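As a sketch of one such additional constraint, pod anti-affinity can spread the StatefulSet's replicas across zones by using the zone label as the topologyKey (this assumes the run: my-nginx label from the StatefulSet above):

```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            run: my-nginx
        # Replicas carrying this label won't be co-located in the same zone
        topologyKey: topology.kubernetes.io/zone
```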
- To find the zone label key (failure-domain.beta.kubernetes.io/zone), you can use the following command (you'll also see that AWS has automatically added labels to your worker nodes):
$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-192-168-19-67.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-19-67.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
ip-192-168-21-238.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-21-238.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
ip-192-168-48-227.ec2.internal Ready <none> 46m v1.21.5-eks-bc4871b alpha.eksctl.io/cluster-name=dev,alpha.eksctl.io/nodegroup-name=standard-workers,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-00836a7940260f6dd,eks.amazonaws.com/nodegroup=standard-workers,eks.amazonaws.com/sourceLaunchTemplateId=lt-080ecc2d75fd5e564,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1f,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-48-227.ec2.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.micro,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1f
- You can also check labels for the persistent volumes. These labels were added via the AWS EBS container storage interface (CSI) drivers:
$ kubectl get pv --show-labels
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE LABELS
pvc-2e6516a8-731f-4274-bba6-beddc86c95a1 1Gi RWO Delete Bound default/www-nginx-sts-1 standard 26m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1f
pvc-47db9d9b-53fd-475a-bf6f-fb25180c5773 1Gi RWO Delete Bound default/www-nginx-sts-0 standard 26m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
pvc-7fddf3dc-b713-44c6-abf3-c72af53daded 10Gi RWO Delete Bound default/pv-claim standard 32m topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1d
In this blog, you have learned how topology plays a crucial role in getting optimal performance from your Kubernetes cluster. We then looked at topology-aware volume provisioning, which is critical to ensure there is no mismatch between the zone where a pod is placed and the zone where its volume is available. This feature is supported by all the major cloud providers (AWS, GCP, and Azure).
In the scenario above, we created two volumes in two availability zones. In a more generic scenario, if you need to create multiple copies of volumes in multiple availability zones, this is a good case for turning to NetApp Cloud Volumes ONTAP.
To get even more from your Kubernetes cluster’s persistent storage layer, including synchronous cross-AZ deployment, consider NetApp Cloud Volumes ONTAP. Cloud Volumes ONTAP gives Kubernetes users on AWS, Azure, and GCP an automatic and dynamic way to respond to persistent volume claims via the NetApp Astra Trident provisioner, space-reducing storage efficiencies, instant volume cloning, and more.