Analytics and machine learning are transforming the way companies do business, and those insights are built on data lakes.
A data lake can be thought of as a centralized pool of structured and unstructured data at any scale. It is a place where you can dump data from different data sources without any kind of data pre-processing. And because of their tremendous size, these data lakes are generally built in the cloud.
In this post we’ll walk through the five-step process of building a cloud data lake.
A data lake is any repository that stores unprocessed data from both structured and unstructured sources for use in analytics and in training machine learning models.
Cloud data lakes:
A cloud data lake is the same as any other data lake, except it is stored in the cloud. Since data lakes generally store extremely large amounts of data, this is where services such as Amazon S3, Azure Blob storage, Google Cloud Storage, and other lower-cost data storage platforms come into the picture.
There are many advantages to choosing the cloud over on-prem storage for a data lake. One example is that because the cloud’s services are elastic in nature, you don’t have to plan the storage size in advance. Also, you only have to pay for what you use.
Data lakes and data warehouses sound and act similar, but they are different. A data warehouse is where you store structured data coming from transactional systems. Data warehouses are highly optimized for SQL queries as they drive operational decision making and reporting. Another difference is that a data warehouse has a clearly defined schema.
The data in a data lake is more raw and unstructured, which makes it easy to make mistakes when you are conceptualizing one. Because the data arrives without a schema, you need to know what data you actually need before you can add any structure to it: you remove the unwanted data and retain only what you need.
Not every cloud data lake is built exactly alike, but there are some key steps that almost all of them take along the way. Let’s take a look at each of them and their related challenges.
The nature of the data stored in a data lake is unstructured. But that doesn’t mean you dump any and all data you have into it. It’s important to only store the data that you need and discard the rest. Otherwise, you are not only confusing your analytics and machine learning models with unnecessary features, but you’re also paying for excess storage.
In the beginning you will not want to lose any data that is being produced by your data sources. But you’ll soon realize that with all that excess data, you won’t be able to determine which data points will have the greatest impact on your business. You need to decide on the features you absolutely need when you are designing your data lake architecture.
For example, if you are building a solution for chip manufacturing in an EDA company, there is data on a number of relevant topics you might want to have on hand. The chip design process can produce large amounts of extremely complex information, and each design project goes through rigorous testing over numerous builds. Past chip designs, test results, and modelling information are all points to include in your data lake.
As this example shows, you want to make sure you have the required data to answer any such questions. At the same time, you want to make sure that the data for those answers isn’t being polluted with unnecessary noise from excess data. The adage “garbage in, garbage out” is apt when it comes to data.
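As a minimal sketch of this filtering step, the snippet below keeps only an agreed-upon allow-list of fields before records land in the lake. The field names and sample records are hypothetical placeholders for your own schema decisions.

```python
# Minimal sketch: keep only the fields you have decided you need
# before writing records into the data lake. Field names here are
# hypothetical placeholders for your own schema decisions.
import json

KEEP_FIELDS = {"design_id", "build_number", "test_result", "timestamp"}

def trim_record(record: dict) -> dict:
    """Drop everything except the agreed-upon features."""
    return {key: value for key, value in record.items() if key in KEEP_FIELDS}

raw_records = [
    {"design_id": "chip-42", "build_number": 17, "test_result": "pass",
     "timestamp": "2023-01-01T00:00:00Z", "debug_blob": "unneeded noise"},
]

clean_records = [trim_record(r) for r in raw_records]
print(json.dumps(clean_records, indent=2))
```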
Now that you know what data you want to store, it’s time to look at how you’ll store that data. In a modern application, data flows in from many directions: mobile apps constantly pushing data to servers, IoT devices constantly uploading sensor or machine data, and users creating data by way of interacting with your applications. In data science terms, we have the volume, the variety, and the velocity of data to tackle. So, you need to have a robust system, capable of ingesting all this data into storage.
All the major cloud providers offer services for high-speed data ingestion and processing alongside their storage: Amazon S3 with Amazon Kinesis, AWS Glue, and Amazon Athena; Google Cloud Storage with Google Cloud Dataflow and BigQuery; and Azure Data Factory, Azure Databricks, and Azure Functions.
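As a rough illustration of how application events might be pushed into one of these ingestion services, the sketch below sends records to Amazon Kinesis with boto3. The stream name, region, and event fields are assumptions, and any of the services listed above could play the same role.

```python
# Rough sketch: pushing application events into an ingestion stream
# (here Amazon Kinesis via boto3). The stream name "lake-ingest" is
# a placeholder, as are the region and the event fields.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_event(event: dict) -> None:
    """Serialize one event and push it onto the ingestion stream."""
    kinesis.put_record(
        StreamName="lake-ingest",                      # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("device_id", "unknown")),
    )

send_event({"device_id": "sensor-7", "temperature_c": 21.4})
```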
In addition to data ingestion, you also have to figure out how you are going to store the data reliably. When selecting a cloud storage provider, you have to check the data availability, data protection and security levels, time to recovery, elasticity, integrations with third party tools, etc.
Cost is another factor to consider. Most providers charge for the storage you actually use, plus the bandwidth for data flowing in and out and the read/write operations you perform. It is also important to note how fast the storage’s read and write operations are, which can be particularly important for data preparation.
Before data lakes came into use, data was mostly relational in nature and was stored in data warehouses. The raw data was extracted from the source, transformed to fit the relational schema, and then loaded into a data warehouse. This process is known as Extract, Transform, and Load (ETL), and it is still used today in many modern applications where analytics need relational data. Something similar is happening with big data.
In the case of big data, the velocity of data is so high, there will not be enough time to transform the data before it is loaded, so the data is first dumped into a data lake. Designated processes then read it from the data lake, transform it, and reload it to the data lake. The order of things is different here: in this case it’s Extract, Load, and then Transform (ELT).
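To make the ELT ordering concrete, here is a rough sketch under assumed names: raw events are first dumped as JSON lines into a “raw” prefix in object storage, and a separate job later reads them, applies a simple clean-up, and reloads them as Parquet under a “curated” prefix. The bucket, prefixes, and the pandas/pyarrow-based transform are illustrative choices, not a prescribed toolchain.

```python
# Sketch of the ELT pattern described above: load raw data first,
# transform it later in a separate job. Bucket and key names are
# hypothetical; requires boto3, pandas, and pyarrow.
import io
import json
import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # placeholder bucket name

def load_raw(events: list, key: str) -> None:
    """Extract + Load: dump events as JSON lines, no transformation yet."""
    body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    s3.put_object(Bucket=BUCKET, Key=f"raw/{key}", Body=body)

def transform(key: str) -> None:
    """Transform: read the raw JSON lines, clean them, reload as Parquet."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"raw/{key}")
    df = pd.read_json(io.BytesIO(obj["Body"].read()), lines=True)
    df = df.dropna()  # example clean-up step
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    s3.put_object(Bucket=BUCKET, Key=f"curated/{key}.parquet", Body=buf.getvalue())
```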
Irrespective of whether you’re running ETL or ELT, you need a high-performance storage layer. This is because data is extracted, transformed, and loaded (or reloaded) in high volumes. And these jobs are running non-stop. The quality and speed of these operations are proportional to the storage performance. When the data read and write rates are high, there are chances of data loss or corruption if the storage is not able to keep up.
High performance storage solutions are pricey. And since you pay for what you use, the more data you have, the higher the bill. But it’s not always that linear. There are additional costs associated with reading and writing as well as movement of data. So make sure you check the pricing scenarios thoroughly.
Once you have the data ready, you need analytics to convert it into information. There are two main types of analytics to plan for: business analytics and machine learning.
Dashboards for sales, marketing, reporting, and other teams are all driven by analytics. Business intelligence (BI) tools also query data lakes regularly. When all of these teams are using the same data, there could be hundreds of queries running in parallel. The storage solution should be able to serve all these queries as quickly as possible. This is especially important for streaming analytics.
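One common way these BI-style queries reach files in a data lake is through a serverless SQL engine such as Amazon Athena. The sketch below starts a query, polls for completion, and fetches the rows; the database, table, and results bucket are hypothetical.

```python
# Sketch of a BI-style SQL query against data lake files via Amazon Athena.
# The database, table, and results bucket are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
    QueryExecutionContext={"Database": "sales_lake"},               # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```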
Every business is moving towards implementing some kind of machine learning today. Machine learning can be utilized in a number of different ways. For example, you can train a model to look at the number of orders over a period of time along with the various business decisions taken during that time and correlate the two. At the same time, machine learning can also be used to provide new features in your product.
Training a machine learning model takes hours, if not days. Data is continuously read from storage during training. If data for this training is stored in the cloud, you’ll end up paying a lot because of the sheer number of read operations and the amount of data being read.
One solution many developers prefer is to keep a local copy of the data, either on their development machines or on a LAN or on-prem server. This greatly reduces latency during training and keeps costs down. It does mean data will be downloaded regularly from the cloud to local servers, which adds some cost, but that cost is usually lower than reading the data repeatedly from the cloud during training.
That being said, there will be situations where you have to train models in the cloud, such as when you are short on time. For those situations, you need high-throughput, low-latency storage so your models train quickly, and such storage solutions are expensive. You need to find a balance between keeping a local copy of the data and training models in the cloud.
Retaining those local copies also carries its own storage cost, which should be factored in.
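A rough sketch of the local-copy approach is shown below: the training dataset is downloaded from object storage once, so the training loop can iterate over local disk instead of paying for repeated cloud reads. The bucket, prefix, and local path are placeholders.

```python
# Sketch of the "local copy" approach: download the training data once,
# then let the training job read it repeatedly from local disk instead
# of from cloud storage. Bucket, prefix, and local path are placeholders.
import pathlib
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"
PREFIX = "curated/training/"
LOCAL_DIR = pathlib.Path("/data/training")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):          # skip "directory" placeholder keys
            continue
        target = LOCAL_DIR / obj["Key"][len(PREFIX):]
        target.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], str(target))

# The training loop can now iterate over LOCAL_DIR as many times as needed
# without incurring additional cloud read costs.
```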
Collecting data to provide powerful insights for your business is only part of the picture. You also need to determine how the storage solution you select will affect those insights. Data ingestion, preparation, and processing need to be done with low latencies so that the insights are provided in time. No matter which data lake storage solution you choose, there will always be tradeoffs, so you need to do your research to figure them out and consider how they affect your business.
In all of the steps above there are challenges that Cloud Volumes ONTAP can help with. As a data management platform for storage on AWS, Azure, and Google Cloud, Cloud Volumes ONTAP extends rich data governance with high availability, data protection, and cost-efficiency features for enterprise-level storage requirements.
Two of the main ways that Cloud Volumes ONTAP can assist in building your data lake are by reducing your data storage costs and increasing your flexibility.
As mentioned above, data lake data that isn’t in use is costly to keep on performant block storage, such as Amazon EBS or Google Persistent Disks, and the cloud providers don’t offer built-in tiering from that block storage to lower-cost object storage. That is a challenge that Cloud Volumes ONTAP’s data tiering feature can help you solve. By automatically identifying infrequently used data in the data lake and tiering it seamlessly to object storage on Amazon S3, Azure Blob storage, or Google Cloud Storage, data tiering optimizes where data is stored and can significantly reduce costs.
Cloud Volumes ONTAP can also help through its FlexClone® data cloning technology, which creates instant, zero-capacity, writable data volume clones that are extremely cost efficient, with storage consumed only for the delta data.