Azure Data Lake is a big data solution based on multiple cloud services in the Microsoft Azure ecosystem. It allows organizations to ingest multiple data sets, whether structured, semi-structured, or unstructured, into a massively scalable data lake for storage, processing, and analytics.
Azure provides a range of analytics services, allowing you to process, query, and analyze data using Spark, MapReduce, SQL querying, NoSQL data models, and more. We’ll focus on four key components: core data lake infrastructure, Azure Data Lake Storage (ADLS), Azure Data Lake Analytics (ADLA), and Azure HDInsight.
This is part of our series of articles on Azure big data.
A data lake solution in Azure typically consists of four building blocks. All data lakes are built on Azure’s core infrastructure, including Azure Blob Storage, Azure Data Factory, and Hadoop YARN.
Beyond this, organizations can optionally use Azure Data Lake Storage, a specialized storage service for large-scale datasets, and Azure Data Lake Analytics, a compute service that processes large-scale data sets using U-SQL. Another optional component is Azure HDInsight, which lets you run distributed big data jobs using tools like Hadoop and Spark.
Azure Data Lake is based on Azure Blob Storage, an elastic object storage solution that provides low-cost tiered storage, high availability, and robust disaster recovery capabilities.
The solution integrates Blob Storage with Azure Data Factory, a tool for creating and running extract, transform, load (ETL) and extract, load, transform (ELT) processes. It also uses Apache Hadoop YARN as its cluster resource management platform, scheduling and scaling the compute resources consumed by analytics engines such as Azure Data Lake Analytics and HDInsight.
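To make the tiered storage concrete, here is a minimal Python sketch that uploads a raw data file to Blob Storage in the low-cost Cool access tier using the azure-storage-blob SDK; the storage account URL, container name, and file names are placeholders.

```python
# Sketch: upload a raw dataset to Azure Blob Storage in the "Cool" tier.
# Assumes azure-storage-blob and azure-identity are installed; the account
# URL, container, and file names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("raw-data")

with open("events-2023-01-01.json", "rb") as data:
    container.upload_blob(
        name="landing/events-2023-01-01.json",
        data=data,
        overwrite=True,
        standard_blob_tier="Cool",  # low-cost tier for infrequently accessed data
    )
```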
Related content: read our guide to Azure Big Data solutions
Azure Data Lake Storage is a repository that can store massive datasets. It lets you store data in two ways: through Azure Data Lake Storage Gen1, a dedicated hierarchical file system built for analytics workloads, or through Azure Data Lake Storage Gen2, which adds a hierarchical namespace on top of Azure Blob Storage.
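The sketch below shows the Gen2 hierarchical namespace in practice: it creates a file system, a nested directory, and a file using the azure-storage-file-datalake SDK. The account URL, file system name, and paths are illustrative placeholders.

```python
# Sketch: create a directory hierarchy in ADLS Gen2 and upload a file into it.
# Assumes azure-storage-file-datalake and azure-identity are installed.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# A "file system" in ADLS Gen2 corresponds to a blob container.
fs = service.create_file_system(file_system="datalake")

# With the hierarchical namespace, directories are first-class objects,
# not just name prefixes on flat blobs.
directory = fs.create_directory("raw/sales/2023/01")
file_client = directory.create_file("transactions.csv")
file_client.upload_data(b"id,amount\n1,9.99\n", overwrite=True)
```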
Azure Data Lake Analytics is a compute service that lets you connect and process data from ADLS. It provides a platform for .NET developers to effectively process up to petabytes of data. Azure Data Lake Analytics allows users to run analytics jobs of any size, leveraging U-SQL to perform analytics tasks that combine C# and SQL.
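As a hedged illustration, the following sketch submits a small U-SQL job with the older azure-mgmt-datalake-analytics Python package; model names vary between versions of that package, and the service principal credentials, ADLA account name, and file paths are placeholders. The U-SQL script itself simply copies a CSV from one ADLS path to another.

```python
# Sketch: submit a U-SQL job to an Azure Data Lake Analytics account.
# Assumes an older release of azure-mgmt-datalake-analytics; names in
# angle brackets are placeholders.
import uuid
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, USqlJobProperties

credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<app-secret>", tenant="<tenant-id>"
)
job_client = DataLakeAnalyticsJobManagementClient(
    credentials, "azuredatalakeanalytics.net"
)

# U-SQL: SQL-like set operations with C# types and expressions.
script = """
@rows =
    EXTRACT id int, amount decimal
    FROM "/raw/sales/2023/01/transactions.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

OUTPUT @rows
TO "/curated/sales/2023-01.csv"
USING Outputters.Csv();
"""

job_client.job.create(
    "<adla-account-name>",
    str(uuid.uuid4()),  # job ID
    JobInformation(name="copy-sales", type="USql",
                   properties=USqlJobProperties(script=script)),
)
```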
Azure HDInsight is a managed service for running distributed big data jobs on Azure infrastructure. It allows users to run popular open source frameworks such as Apache Hadoop, Spark, and Kafka. It lets you leverage these open source projects, with fully managed infrastructure and cluster management, and no need for installation and customization.
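One common way to run a job on an HDInsight Spark cluster programmatically is through its Apache Livy endpoint. The sketch below posts a PySpark batch to that endpoint; the cluster name, credentials, and job file location are placeholders.

```python
# Sketch: submit a PySpark batch job to an HDInsight Spark cluster via Livy.
# Assumes the job script has already been uploaded to the cluster's primary
# storage; cluster name and credentials are placeholders.
import requests

cluster = "https://<cluster-name>.azurehdinsight.net"
livy_batches = f"{cluster}/livy/batches"

payload = {
    "file": "wasbs://<container>@<storage-account>.blob.core.windows.net/jobs/aggregate_sales.py",
    "args": ["--input", "/raw/sales/2023/01", "--output", "/curated/sales"],
}

response = requests.post(
    livy_batches,
    json=payload,
    auth=("admin", "<cluster-login-password>"),  # cluster login (HTTP user) credentials
    headers={"X-Requested-By": "admin"},         # required by Livy's CSRF protection
)
response.raise_for_status()
print("Batch id:", response.json()["id"])
```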
The following table shows the main Azure services you can use to build your data lake architecture.
| Service | Description | Function in a Data Lake |
| --- | --- | --- |
| Azure Blob Storage | Managed object storage | Storing unstructured data |
| Azure Databricks | Serverless analytics based on Apache Spark | Batch processing of large datasets |
| Cosmos DB | Managed serverless NoSQL data store, supporting Cassandra and MongoDB APIs | Storing key-value pairs with no fixed schema |
| Azure SQL Database | Cloud-based managed SQL Server | Storing relational datasets with SQL querying |
| Azure SQL Data Warehouse | Cloud-based enterprise data warehouse (EDW) | Storing large volumes of structured data, enabling massively parallel processing (MPP) |
| Azure Analysis Services | Analytics engine based on SQL Server Analysis Services | Building ad-hoc semantic models for tabular data |
| Azure Data Factory | Cloud-based ETL service | Integrating the data lake with over 50 storage systems and databases, transforming data |
Related content: read our guide to Azure Analytics Services
Here are some best practices that will help you make the most of your data lake deployment on Azure.
Azure Data Lake Storage Gen2 provides Portable Operating System Interface (POSIX) access control lists (ACLs) for users, groups, and service principals defined in Azure Active Directory (Azure AD). These access controls can be set on existing files and directories. Use default ACLs on directories to define permissions that are automatically applied to new files and subdirectories created beneath them.
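For example, the following sketch (using the azure-storage-file-datalake SDK) grants an Azure AD group read and execute permissions on a directory and adds a default entry so that new children inherit the same access; the group object ID and paths are placeholders.

```python
# Sketch: set POSIX ACLs, including default (inherited) entries, on an ADLS
# Gen2 directory. The group object ID, account URL, and paths are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeDirectoryClient

directory = DataLakeDirectoryClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    file_system_name="datalake",
    directory_name="raw/sales",
    credential=DefaultAzureCredential(),
)

group_oid = "<azure-ad-group-object-id>"
acl = (
    "user::rwx,group::r-x,other::---,"
    f"group:{group_oid}:r-x,"                       # access ACL: applies to this directory
    "default:user::rwx,default:group::r-x,default:other::---,"
    f"default:group:{group_oid}:r-x"                # default ACL: inherited by new children
)
directory.set_access_control(acl=acl)
```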
When designing a system with Data Lake Storage or other cloud services, you need to consider availability requirements and how to deal with potential service outages. It is important to plan for outages affecting a specific compute instance, an availability zone, or an entire region.
Consider the workload’s target recovery time objective (RTO) and recovery point objective (RPO), and choose accordingly from Azure’s storage redundancy options, which range from Locally Redundant Storage (LRS) to Read-Access Geo-Redundant Storage (RA-GRS).
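As one possible implementation, the sketch below uses the azure-mgmt-storage SDK to create a StorageV2 account with the hierarchical namespace enabled and the Standard_RAGRS redundancy SKU; the subscription ID, resource group, account name, and region are placeholders.

```python
# Sketch: create an ADLS Gen2-capable storage account with read-access
# geo-redundant storage. Assumes azure-mgmt-storage and azure-identity are
# installed; names in angle brackets are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="<resource-group>",
    account_name="<storageaccountname>",
    parameters={
        "location": "eastus2",
        "kind": "StorageV2",
        "sku": {"name": "Standard_RAGRS"},  # read-access geo-redundant storage
        "is_hns_enabled": True,             # hierarchical namespace for ADLS Gen2
    },
)
account = poller.result()
print(account.primary_endpoints.dfs)
```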
Related content: read our guide to Azure High Availability
When ingesting data into a data lake, you should plan data structure to facilitate security, efficient processing and partitioning. Plan the directory structure to account for elements like organizational unit, data source, timeframe, and processing requirements.
In most cases, you should put the region at the beginning of your directory structure and the date at the end. This lets you use POSIX permissions to lock down specific regions or data time frames to certain users. Putting the date at the end also means you can restrict specific date ranges without having to traverse many subdirectories unnecessarily.
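A small helper like the following (the zone, region, and source names are purely illustrative) can keep ingestion paths consistent with this layout:

```python
# Sketch: build ingestion paths with the region first (easy to lock down via
# ACLs) and the date last (easy to prune by time range). Names are examples only.
from datetime import date

def landing_path(region: str, source: str, day: date) -> str:
    """Return a directory path like 'raw/emea/pos-system/2023/01/15'."""
    return f"raw/{region}/{source}/{day:%Y/%m/%d}"

print(landing_path("emea", "pos-system", date(2023, 1, 15)))
# raw/emea/pos-system/2023/01/15
```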
Learn more about building a cloud data lake here: Cloud Data Lake in 5 Steps
NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure, and Google Cloud. Cloud Volumes ONTAP provides storage capacity of up to 368TB and supports use cases such as file services, databases, DevOps, and other enterprise workloads, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.
Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.
In addition, Cloud Volumes ONTAP provides storage efficiency features, including thin provisioning, data compression, and deduplication, reducing the storage footprint and costs by up to 70%.