Azure Data Lake is a big data solution based on multiple cloud services in the Microsoft Azure ecosystem. It allows organizations to ingest multiple data sets, whether structured, semi-structured, or unstructured, into a massively scalable data lake for storage, processing, and analytics.
Azure provides a range of analytics services, allowing you to process, query, and analyze data using Spark, MapReduce, SQL querying, NoSQL data models, and more. We’ll focus on four key components: core data lake infrastructure, Azure Data Lake Storage (ADLS), Azure Data Lake Analytics (ADLA), and Azure HDInsight.
This is part of our series of articles on Azure big data.
A data lake solution in Azure typically consists of four building blocks. All data lakes are built on Azure’s core infrastructure, including Azure Blob Storage, Azure Data Factory, and Hadoop YARN.
Beyond this, organizations can optionally use Azure Data Lake Storage, a specialized storage service for large-scale datasets, and Azure Data Lake Analytics, a compute service that processes large-scale data sets using U-SQL. Another optional component is Azure HDInsight, which lets you run distributed big data jobs using tools like Hadoop and Spark.
Azure Data Lake is based on Azure Blob Storage, an elastic object storage solution that provides low-cost tiered storage, high availability, and robust disaster recovery capabilities.
The solution integrates Blob Storage with Azure Data Factory, a tool for creating and running extract, transform, load (ETL) and extract, load, transform (ELT) processes. It also uses Apache Hadoop YARN as its cluster resource management platform, scheduling and scaling the compute resources consumed by analytics engines such as Azure Data Lake Analytics and HDInsight.
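To make the tiered storage concrete, here is a minimal Python sketch that uploads a raw data file to Blob Storage in the low-cost Cool access tier using the azure-storage-blob SDK; the storage account URL, container name, and file names are placeholders.

```python
# Sketch: upload a raw dataset to Azure Blob Storage in the "Cool" tier.
# Assumes azure-storage-blob and azure-identity are installed; the account
# URL, container, and file names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("raw-data")

with open("events-2023-01-01.json", "rb") as data:
    container.upload_blob(
        name="landing/events-2023-01-01.json",
        data=data,
        overwrite=True,
        standard_blob_tier="Cool",  # low-cost tier for infrequently accessed data
    )
```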
Related content: read our guide to Azure Big Data solutions
Azure Data Lake Storage is a repository that can store massive datasets. It lets you store data in two ways: through Azure Data Lake Storage Gen1, a dedicated hierarchical file system built for analytics workloads, or through Azure Data Lake Storage Gen2, which adds a hierarchical namespace on top of Azure Blob Storage.
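The sketch below shows the Gen2 hierarchical namespace in practice: it creates a file system, a nested directory, and a file using the azure-storage-file-datalake SDK. The account URL, file system name, and paths are illustrative placeholders.

```python
# Sketch: create a directory hierarchy in ADLS Gen2 and upload a file into it.
# Assumes azure-storage-file-datalake and azure-identity are installed.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# A "file system" in ADLS Gen2 corresponds to a blob container.
fs = service.create_file_system(file_system="datalake")

# With the hierarchical namespace, directories are first-class objects,
# not just name prefixes on flat blobs.
directory = fs.create_directory("raw/sales/2023/01")
file_client = directory.create_file("transactions.csv")
file_client.upload_data(b"id,amount\n1,9.99\n", overwrite=True)
```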
Azure Data Lake Analytics is a compute service that lets you connect and process data from ADLS. It provides a platform for .NET developers to effectively process up to petabytes of data. Azure Data Lake Analytics allows users to run analytics jobs of any size, leveraging U-SQL to perform analytics tasks that combine C# and SQL.
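As a hedged illustration, the following sketch submits a small U-SQL job with the older azure-mgmt-datalake-analytics Python package; model names vary between versions of that package, and the service principal credentials, ADLA account name, and file paths are placeholders. The U-SQL script itself simply copies a CSV from one ADLS path to another.

```python
# Sketch: submit a U-SQL job to an Azure Data Lake Analytics account.
# Assumes an older release of azure-mgmt-datalake-analytics; names in
# angle brackets are placeholders.
import uuid
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, USqlJobProperties

credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<app-secret>", tenant="<tenant-id>"
)
job_client = DataLakeAnalyticsJobManagementClient(
    credentials, "azuredatalakeanalytics.net"
)

# U-SQL: SQL-like set operations with C# types and expressions.
script = """
@rows =
    EXTRACT id int, amount decimal
    FROM "/raw/sales/2023/01/transactions.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

OUTPUT @rows
TO "/curated/sales/2023-01.csv"
USING Outputters.Csv();
"""

job_client.job.create(
    "<adla-account-name>",
    str(uuid.uuid4()),  # job ID
    JobInformation(name="copy-sales", type="USql",
                   properties=USqlJobProperties(script=script)),
)
```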
Azure HDInsight is a managed service for running distributed big data jobs on Azure infrastructure. It allows users to run popular open source frameworks such as Apache Hadoop, Spark, and Kafka. It lets you leverage these open source projects, with fully managed infrastructure and cluster management, and no need for installation and customization.
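One common way to run a job on an HDInsight Spark cluster programmatically is through its Apache Livy endpoint. The sketch below posts a PySpark batch to that endpoint; the cluster name, credentials, and job file location are placeholders.

```python
# Sketch: submit a PySpark batch job to an HDInsight Spark cluster via Livy.
# Assumes the job script has already been uploaded to the cluster's primary
# storage; cluster name and credentials are placeholders.
import requests

cluster = "https://<cluster-name>.azurehdinsight.net"
livy_batches = f"{cluster}/livy/batches"

payload = {
    "file": "wasbs://<container>@<storage-account>.blob.core.windows.net/jobs/aggregate_sales.py",
    "args": ["--input", "/raw/sales/2023/01", "--output", "/curated/sales"],
}

response = requests.post(
    livy_batches,
    json=payload,
    auth=("admin", "<cluster-login-password>"),  # cluster login (HTTP user) credentials
    headers={"X-Requested-By": "admin"},         # required by Livy's CSRF protection
)
response.raise_for_status()
print("Batch id:", response.json()["id"])
```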
The following table shows the main Azure services you can use to build your data lake architecture.
| Service | Description | Function in a Data Lake |
| --- | --- | --- |
| Azure Blob Storage | Managed object storage | Storing unstructured data |
| Azure Databricks | Serverless analytics based on Apache Spark | Batch processing of large datasets |
| Cosmos DB | Managed serverless NoSQL data store, supporting Cassandra and MongoDB APIs | Storing key-value pairs with no fixed schema |
| Azure SQL Database | Cloud-based managed SQL Server | Storing relational datasets with SQL querying |
| Azure SQL Data Warehouse | Cloud-based enterprise data warehouse (EDW) | Storing large volumes of structured data, enabling massively parallel processing (MPP) |
| Azure Analysis Services | Analytics engine based on SQL Server Analysis Services | Building ad-hoc semantic models for tabular data |
| Azure Data Factory | Cloud-based ETL service | Integrating the data lake with over 50 storage systems and databases, transforming data |
Related content: read our guide to Azure Analytics Services
Here are some best practices that will help you make the most of your data lake deployment on Azure.
Azure Data Lake Storage Gen2 provides Portable Operating System Interface (POSIX) access control lists (ACLs) for users, groups, and service principals defined in Azure Active Directory (Azure AD). These access controls can be set on existing files and directories. Use default ACLs on directories to define permissions that are automatically applied to new files and subdirectories created beneath them.
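For example, the following sketch (using the azure-storage-file-datalake SDK) grants an Azure AD group read and execute permissions on a directory and adds a default entry so that new children inherit the same access; the group object ID and paths are placeholders.

```python
# Sketch: set POSIX ACLs, including default (inherited) entries, on an ADLS
# Gen2 directory. The group object ID, account URL, and paths are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeDirectoryClient

directory = DataLakeDirectoryClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    file_system_name="datalake",
    directory_name="raw/sales",
    credential=DefaultAzureCredential(),
)

group_oid = "<azure-ad-group-object-id>"
acl = (
    "user::rwx,group::r-x,other::---,"
    f"group:{group_oid}:r-x,"                       # access ACL: applies to this directory
    "default:user::rwx,default:group::r-x,default:other::---,"
    f"default:group:{group_oid}:r-x"                # default ACL: inherited by new children
)
directory.set_access_control(acl=acl)
```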
When designing a system with Data Lake Storage or other cloud services, you need to consider availability requirements and how to deal with potential service outages. It is important to plan for outages affecting a specific compute instance, an availability zone, or an entire region.
Consider the workload’s target recovery time objective (RTO) and recovery point objective (RPO), and choose accordingly from Azure’s storage redundancy options, which range from Locally Redundant Storage (LRS) to Read-Access Geo-Redundant Storage (RA-GRS).
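As one possible implementation, the sketch below uses the azure-mgmt-storage SDK to create a StorageV2 account with the hierarchical namespace enabled and the Standard_RAGRS redundancy SKU; the subscription ID, resource group, account name, and region are placeholders.

```python
# Sketch: create an ADLS Gen2-capable storage account with read-access
# geo-redundant storage. Assumes azure-mgmt-storage and azure-identity are
# installed; names in angle brackets are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="<resource-group>",
    account_name="<storageaccountname>",
    parameters={
        "location": "eastus2",
        "kind": "StorageV2",
        "sku": {"name": "Standard_RAGRS"},  # read-access geo-redundant storage
        "is_hns_enabled": True,             # hierarchical namespace for ADLS Gen2
    },
)
account = poller.result()
print(account.primary_endpoints.dfs)
```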
Related content: read our guide to Azure High Availability
When ingesting data into a data lake, you should plan data structure to facilitate security, efficient processing and partitioning. Plan the directory structure to account for elements like organizational unit, data source, timeframe, and processing requirements.
In most cases, you should put the region at the beginning of your directory structure and the date at the end. This lets you use POSIX permissions to lock down specific regions or data time frames to certain users. Putting the date at the end also means you can restrict specific date ranges without having to traverse many subdirectories unnecessarily.
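A small helper like the following (the zone, region, and source names are purely illustrative) can keep ingestion paths consistent with this layout:

```python
# Sketch: build ingestion paths with the region first (easy to lock down via
# ACLs) and the date last (easy to prune by time range). Names are examples only.
from datetime import date

def landing_path(region: str, source: str, day: date) -> str:
    """Return a directory path like 'raw/emea/pos-system/2023/01/15'."""
    return f"raw/{region}/{source}/{day:%Y/%m/%d}"

print(landing_path("emea", "pos-system", date(2023, 1, 15)))
# raw/emea/pos-system/2023/01/15
```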
Learn more about building a cloud data lake here: Cloud Data Lake in 5 Steps
NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure, and Google Cloud. Cloud Volumes ONTAP provides storage capacity of up to 368TB and supports use cases such as file services, databases, DevOps, and other enterprise workloads, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.
Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.
In addition, Cloud Volumes ONTAP provides storage efficiency features, including thin provisioning, data compression, and deduplication, reducing the storage footprint and costs by up to 70%.