
Azure Data Lake: 4 Building Blocks and Best Practices


What is Azure Data Lake?

Azure Data Lake is a big data solution based on multiple cloud services in the Microsoft Azure ecosystem. It allows organizations to ingest multiple data sets, including structured, unstructured, and semi-structured data, into an infinitely scalable data lake, enabling storage, processing, and analytics.

Azure provides a range of analytics services, allowing you to process, query, and analyze data using Spark, MapReduce, SQL querying, NoSQL data models, and more. We’ll focus on four key components—core data lake infrastructure, Azure Data Lake Storage (ADLS), Azure Data Lake Analytics (ADLA), and Azure HDInsight.

This is part of our series of articles on Azure big data.


4 Building Blocks of Data Lakes on Azure

A data lake solution in Azure typically consists of four building blocks. All data lakes are based on Azure’s core infrastructure, including blob storage, Azure Data Factory, and Hadoop YARN.

Beyond this, organizations can optionally use Azure Data Lake Storage, a specialized storage service for large-scale datasets, and Azure Data Lake Analytics, a compute service that processes large-scale datasets using U-SQL. Another optional component is Azure HDInsight, which lets you run distributed big data jobs using tools like Hadoop and Spark.

Core Infrastructure

Azure Data Lake is based on Azure Blob Storage, an elastic object storage solution that provides low-cost tiered storage, high availability, and robust disaster recovery capabilities.

The solution integrates Blob Storage with Azure Data Factory, a tool for creating and running extract, transform, load (ETL) and extract, load, transform (ELT) processes. It also uses Apache Hadoop YARN as a cluster management platform, scheduling and scaling the compute resources that analytics jobs consume.

Related content: read our guide to Azure Big Data solutions

Azure Data Lake Storage (ADLS)

Azure Data Lake Storage is a repository that can store massive datasets. It lets you store data in two ways:

  • WebHDFS storage—a hierarchical data store with strong security capabilities, compatible with the Hadoop Distributed File System (HDFS) API
  • Data lake blob storage—you can store data as blobs, with the full functionality of Azure Blob Storage, including encryption, data tiering, integration with Azure AD, and data lifecycle automation.
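Analytics engines such as Spark and HDInsight address files in the ADLS hierarchical namespace through abfss:// URIs against the account’s dfs endpoint. Below is a minimal sketch of how such a URI is composed; the account, container, and file names are hypothetical, for illustration only.

```python
def adls_uri(account: str, container: str, path: str) -> str:
    """Build an abfss:// URI for a file or directory in ADLS Gen2.

    ADLS Gen2 exposes its hierarchical namespace on the account's
    `dfs` endpoint; analytics engines address it via the abfss scheme.
    """
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

# Hypothetical account and container names, for illustration only.
print(adls_uri("contosodatalake", "raw", "/sales/2024/01/orders.csv"))
# abfss://raw@contosodatalake.dfs.core.windows.net/sales/2024/01/orders.csv
```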

Azure Data Lake Analytics (ADLA)

Azure Data Lake Analytics is a compute service that lets you connect to and process data stored in ADLS. It provides a platform for .NET developers to process up to petabytes of data. Azure Data Lake Analytics lets users run analytics jobs of any size, using U-SQL, a query language that combines C# and SQL.

Azure HDInsight

Azure HDInsight is a managed service for running distributed big data jobs on Azure infrastructure. It allows users to run popular open source frameworks such as Apache Hadoop, Spark, and Kafka. It lets you leverage these open source projects, with fully managed infrastructure and cluster management, and no need for installation and customization.

Building Your Azure Data Lake: Complementary Services

The following table shows the main Azure services you can use to build your data lake architecture.



| Service | Description | Function in a Data Lake |
|---------|-------------|-------------------------|
| Azure Blob Storage | Managed object storage | Storing unstructured data |
| Azure Databricks | Serverless analytics based on Apache Spark | Batch processing of large datasets |
| Cosmos DB | Managed serverless NoSQL data store, supporting Cassandra and MongoDB APIs | Storing key-value pairs with no fixed schema |
| Azure SQL Database | Cloud-based managed SQL Server | Storing relational datasets with SQL querying |
| Azure SQL Data Warehouse | Cloud-based enterprise data warehouse (EDW) | Storing large volumes of structured data, enabling massively parallel processing (MPP) |
| Azure Analysis Services | Analytics engine based on SQL Server Analysis Services | Building ad-hoc semantic models for tabular data |
| Azure Data Factory | Cloud-based ETL service | Integrating the data lake with over 50 storage systems and databases, transforming data |

Related content: read our guide to Azure Analytics Services

Azure Data Lake Best Practices

Here are some best practices that will help you make the most of your data lake deployment on Azure.


Access Control

Azure Data Lake Storage Gen2 provides Portable Operating System Interface (POSIX) access control for users, groups, and service principals defined in Azure Active Directory (Azure AD). These access controls can be set on existing files and directories. Use default ACLs to define permissions that are automatically applied to new files and directories.
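Conceptually, a default ACL on a directory becomes the access ACL of items created inside it. The sketch below models this inheritance in plain Python; it is an illustration of the POSIX-style behavior, not the Azure SDK, and the group name is hypothetical.

```python
# Illustrative model of POSIX-style ACL inheritance in a hierarchical
# store: "default" ACL entries set on a directory become the access
# ACL of new files and directories created inside it.

def child_access_acl(parent_default_acl: dict) -> dict:
    """New items inherit a copy of the parent's default ACL entries."""
    return dict(parent_default_acl)

raw_zone_default = {
    "user::": "rwx",
    "group:data-engineers:": "r-x",  # hypothetical Azure AD group
    "other::": "---",
}

new_file_acl = child_access_acl(raw_zone_default)
print(new_file_acl["group:data-engineers:"])  # r-x
```

Because defaults are applied at creation time, setting them once at the top of a zone saves per-file permission management later.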


Plan for Availability

When designing a system with Data Lake Storage or other cloud services, you need to consider availability requirements and how to deal with potential service outages. It is important to plan for outages affecting a specific compute instance, an availability zone, or an entire region.

Consider the workload's target recovery time objective (RTO) and recovery point objective (RPO), and leverage Azure’s range of storage redundancy options, from locally redundant storage (LRS) to read-access geo-redundant storage (RA-GRS).
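The trade-off between the redundancy options can be summarized as a choice of failure domain. The sketch below encodes my understanding of the main options from general Azure documentation (an assumption, not an official API) and a simple helper for picking one from RTO/RPO-driven requirements.

```python
# Illustrative summary of Azure Storage redundancy options (assumed
# from general Azure documentation): each tier trades cost for
# resilience to a larger failure domain.
REDUNDANCY = {
    "LRS":    {"copies": 3, "survives_region_outage": False, "readable_secondary": False},
    "ZRS":    {"copies": 3, "survives_region_outage": False, "readable_secondary": False},
    "GRS":    {"copies": 6, "survives_region_outage": True,  "readable_secondary": False},
    "RA-GRS": {"copies": 6, "survives_region_outage": True,  "readable_secondary": True},
}

def pick_redundancy(need_region_failover: bool, need_secondary_reads: bool) -> str:
    """Pick the simplest option that meets the workload's requirements."""
    if need_secondary_reads:
        return "RA-GRS"  # read from the secondary region during an outage
    if need_region_failover:
        return "GRS"
    return "LRS"

print(pick_redundancy(need_region_failover=True, need_secondary_reads=False))  # GRS
```

A workload with a tight RTO during a regional outage generally needs RA-GRS, since the secondary stays readable while failover is in progress.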

Related content: read our guide to Azure High Availability

Directory Layout

When ingesting data into a data lake, you should plan data structure to facilitate security, efficient processing and partitioning. Plan the directory structure to account for elements like organizational unit, data source, timeframe, and processing requirements.

In most cases, you should put the region at the beginning of your directory structure and the date at the end. This lets you use POSIX permissions to lock down specific regions or data time frames to certain users. Putting the date at the end means you can restrict specific date ranges without having to traverse many subdirectories unnecessarily.
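The layout above can be sketched as a small path-building helper. The region and data-source names are hypothetical, for illustration only.

```python
from datetime import date

def lake_path(region: str, source: str, day: date) -> str:
    """Build an ingestion directory path: region first (so POSIX ACLs
    can lock down whole regions), date last (so date-range restrictions
    don't fan out across many subdirectories)."""
    return f"{region}/{source}/{day:%Y/%m/%d}/"

# Hypothetical region and source names, for illustration.
print(lake_path("emea", "pos-transactions", date(2024, 3, 15)))
# emea/pos-transactions/2024/03/15/
```

Splitting the date into year/month/day subdirectories also gives downstream engines natural partition boundaries for pruning.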

Learn more about building a cloud data lake here: Cloud Data Lake in 5 Steps

Azure Data Lake with NetApp Cloud Volumes ONTAP

NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure, and Google Cloud. Cloud Volumes ONTAP supports a capacity of up to 368TB and a range of use cases, including file services, databases, DevOps, and other enterprise workloads, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.

Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering to NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.

In addition, Cloud Volumes ONTAP provides storage efficiency features, including thin provisioning, data compression, and deduplication, reducing the storage footprint and costs by up to 70%.



Yifat Perry, Technical Content Manager
