A data lake is a flexible, cost-effective data store that can hold very large quantities of structured and unstructured data. It allows organizations to store data in its original form and perform search and analytics, transforming the data as needed on an ad hoc basis.
Amazon Web Services (AWS) data lake architectures are typically based on Amazon Simple Storage Service (S3), which provides the storage layer for the data lake. Several AWS big data services can then help manage and make use of the data.
While there are many possible data lake architectures, Amazon provides a reference architecture built from three key components.
How is the reference architecture deployed?
Amazon provides ready-made AWS CloudFormation templates you can use to deploy this architecture in your AWS account.
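Such a deployment can also be scripted with the AWS SDK. The following is a minimal sketch using boto3; the template URL and parameter names are placeholders, not the actual values from Amazon's published templates:

```python
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# Launch the data lake stack from a template.
# The template URL and parameter names below are hypothetical placeholders.
response = cloudformation.create_stack(
    StackName="my-data-lake",
    TemplateURL="https://example-bucket.s3.amazonaws.com/data-lake-template.yaml",
    Parameters=[
        {"ParameterKey": "AdminEmail", "ParameterValue": "admin@example.com"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles
)

# Wait until the stack finishes creating before using the data lake.
waiter = cloudformation.get_waiter("stack_create_complete")
waiter.wait(StackName=response["StackId"])
```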
The following best practices will help you make the most of your AWS data lake deployment.
Amazon recommends ingesting data in its original form and retaining it. Any transformation of the data should be saved to a separate S3 bucket, which makes it possible to revisit the original data and process it in different ways.
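As a minimal sketch of this pattern (all bucket and key names are hypothetical), a transformation job reads from the raw bucket and writes its output to a separate bucket, leaving the original object untouched:

```python
import json
import boto3

s3 = boto3.client("s3")

# Read a raw object; the original stays untouched in its bucket.
raw = s3.get_object(Bucket="my-lake-raw", Key="events/2024-01-15.json")
records = json.loads(raw["Body"].read())

# Example transformation: keep only well-formed records.
cleaned = [r for r in records if "user_id" in r and "timestamp" in r]

# Write the transformed output to a separate bucket.
s3.put_object(
    Bucket="my-lake-processed",
    Key="events/2024-01-15.cleaned.json",
    Body=json.dumps(cleaned),
)
```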
Retaining raw data is a good practice, but it means a lot of old data will accumulate in S3. Use object lifecycle policies to define when this data should move to an archival storage tier, such as Amazon S3 Glacier. This reduces costs while still giving you access to the data if and when it is needed.
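A lifecycle rule can be configured in the S3 console or with the SDK. The sketch below transitions raw objects to S3 Glacier after 90 days; the bucket name, prefix, and retention period are illustrative values, not recommendations:

```python
import boto3

s3 = boto3.client("s3")

# Illustrative lifecycle rule: archive raw data to S3 Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-lake-raw",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```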
Take organization into account right at the beginning of a data lake project, deciding early how buckets, prefixes, and partitions will be structured.
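One common convention, sketched below with hypothetical names, is Hive-style partitioned key prefixes, which analytics engines such as Athena can prune efficiently:

```python
from datetime import date

def partitioned_key(source: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key, e.g.
    clickstream/year=2024/month=01/day=15/events.json"""
    return (
        f"{source}/year={event_date.year}"
        f"/month={event_date.month:02d}"
        f"/day={event_date.day:02d}/{filename}"
    )

print(partitioned_key("clickstream", date(2024, 1, 15), "events.json"))
```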
Treat and process different types of data differently, rather than running everything through one pipeline.
Learn more in Cloud Data Lake in 5 Steps.
In the previous sections, we discussed a simple data lake architecture, which you can set up automatically using a CloudFormation template, along with best practices for different stages of the data pipeline. To let you customize your deployment and enable continuous data management, Amazon provides AWS Lake Formation.
Lake Formation is a fully managed service that makes it easy to build, secure, and manage your data lake. It simplifies the complex manual steps typically required to create a data lake: Lake Formation crawls data sources and automatically moves the data into Amazon Simple Storage Service (Amazon S3) to form the lake.
Lake Formation performs its tasks either directly or through other AWS services, including AWS Glue, Amazon S3, and the AWS database services.
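Before Lake Formation can manage existing data, the S3 location must be registered with the service. A minimal sketch with boto3 (the bucket ARN is a placeholder):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register an S3 location so Lake Formation can manage access to it.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::my-lake-raw",  # hypothetical bucket ARN
    UseServiceLinkedRole=True,  # let Lake Formation use its service-linked role
)
```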
Once data is stored in the data lake, end users can use the analytics service of their choice, such as Amazon Athena, Amazon Redshift, or Amazon EMR, to access and work with the data.
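For example, a query could be submitted to Athena through the SDK, as in this sketch; the database, table, and output location are hypothetical:

```python
import time
import boto3

athena = boto3.client("athena")

# Submit a query against the data lake; names and paths are placeholders.
query = athena.start_query_execution(
    QueryString="SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id LIMIT 10",
    QueryExecutionContext={"Database": "my_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    print(results["ResultSet"]["Rows"][:5])
```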
Related content: read our guide to AWS data analytics
Lake Formation integrates with AWS Identity and Access Management (IAM), automatically mapping users and roles to data protection policies in the Data Catalog. You can also use SAML-based identity federation to integrate the data lake with Active Directory or LDAP.
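As a sketch of how a principal maps to catalog data, Lake Formation's GrantPermissions API can grant an IAM role read access to a table; the role ARN, database, and table names below are hypothetical:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant an IAM role SELECT access to one table in the Data Catalog.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"},
    Resource={"Table": {"DatabaseName": "my_lake_db", "Name": "events"}},
    Permissions=["SELECT"],
)
```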
Lake Formation organizes data using blueprints. These let you ingest data, create AWS Glue workflows that crawl and transform source tables, and load the results into S3. Within S3, Lake Formation partitions the data and manages its formats, and it maintains a data catalog with a user interface that lets you search data by type, classification, or free text.
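The same catalog can be searched programmatically through AWS Glue, which backs the Lake Formation catalog. A minimal sketch (the search text is arbitrary):

```python
import boto3

glue = boto3.client("glue")

# Free-text search across the Glue Data Catalog.
response = glue.search_tables(SearchText="events")
for table in response["TableList"]:
    print(table["DatabaseName"], table["Name"])
```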
NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure, and Google Cloud. Cloud Volumes ONTAP supports capacities of up to 368TB and use cases such as file services, databases, DevOps, and other enterprise workloads, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.
Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.
In addition, Cloud Volumes ONTAP storage efficiency features, including thin provisioning, data compression and deduplication, and data tiering, can reduce your data lake storage footprint and costs by up to 70%.