Similar to the machines that powered the first Industrial Revolution in the 18th century, the digitization of information is now powering the digital revolution that many organizations are experiencing today. The underpinning catalyst for this digital revolution is the vast amount of data that organizations generate and use to make intelligent decisions.
As such, there is a well-established consensus that data is today the most valuable asset of any organization, across any sector imaginable. However, one of the biggest issues affecting the industry at the moment is understanding how to manage these vast amounts of data, from creation and throughout their existence. It is no mean feat: technologies such as big data and the Internet of Things add to this burden, generating enormous volumes of data every second that must be stored, analyzed, and maintained. This is precisely where data lifecycle management becomes an important focus for many organizations today.
One of the major keys to data lifecycle management is understanding your data access frequency. In this post we’ll take a closer look at how data access changes over time, how that affects lifecycle management, and what NetApp’s Cloud Tiering service can do to help.
Data lifecycle management refers to the process of understanding and managing the various stages that data goes through during its existence. Key phases of a typical data lifecycle include the following (a minimal sketch of these stages follows the list):
Stage 1: Data generation Creation of data through acquisition of existing data, manual entry of new data, and capture of data generated by various systems.
Stage 2: Data processing Processing of the newly created data to reduce noise and discard irrelevant data. During this stage, data is typically accessed frequently and needs to be stored locally, either at the edge or at the core, such as in an enterprise data center.
Stage 3: Data storage and consumption Active storage of processed data for an organization’s objectives and operations. As in the previous stage, data at this stage is typically stored at the core in high-performance storage (a data center or cloud-based repository).
Stage 4: Data archival Active use of the data has ended, and it is retained for long-term retention and storage efficiency reasons. Archived data is typically kept in low-cost storage tiers at the core, such as tape, or in cloud-based object storage tiers, in case it is required in the future.
Stage 5: Data purging / retirement In this phase, data that no longer needs to be retained is permanently deleted.
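To make these stages a little more concrete, here is a minimal sketch that models the lifecycle as an ordered enumeration with a typical storage placement per stage. The enum names and the stage-to-placement mapping are illustrative choices for this post, not a standard.

```python
from enum import Enum

class LifecycleStage(Enum):
    """Illustrative model of the five lifecycle stages described above."""
    GENERATION = 1   # data is created, acquired, or captured
    PROCESSING = 2   # noise is reduced and irrelevant data discarded
    CONSUMPTION = 3  # active storage and use for business operations
    ARCHIVAL = 4     # long-term retention for compliance or efficiency
    PURGING = 5      # permanent deletion

# Typical storage placement per stage (an illustrative mapping, not prescriptive)
TYPICAL_PLACEMENT = {
    LifecycleStage.GENERATION: "edge or core (enterprise data center)",
    LifecycleStage.PROCESSING: "edge or core, local high-performance storage",
    LifecycleStage.CONSUMPTION: "core or cloud, Tier 1 storage",
    LifecycleStage.ARCHIVAL: "tape or cloud object storage",
    LifecycleStage.PURGING: "none (data is deleted)",
}

for stage in LifecycleStage:
    print(f"Stage {stage.value}: {stage.name.title()} -> {TYPICAL_PLACEMENT[stage]}")
```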
Data is typically characterized by its relevance and value. New, fresh data is often more sought after and accessed more frequently than older, historical data because of its perceived relevance and value. This frequency of access typically determines whether data is classified as hot, warm, or cold.
New data is often accessed frequently for active processing and consumption by organizations to generate “new information” for monetization purposes. This data is typically perceived to be of higher value and relevance, and can therefore be considered “hot” or “Tier 1” data. Structured data, such as the contents of an OLTP database or an email server database, as well as unstructured data, such as a freshly created spreadsheet or a presentation document, are examples of hot data. This data will likely be accessed actively and frequently by various users and applications for a period of time, until the relevant information has been extracted from it for day-to-day business requirements.
Data that has already been processed and consumed and is no longer accessed regularly is considered “warm.” This data is not used as frequently as hot data, but may still require periodic access by various users and applications. Warm data is also commonly referred to as Tier 2 data. Examples of warm data include data mining platforms such as big data applications and most unstructured data (such as user home directories and file shares).
Data that isn’t being accessed anymore, but still needs to be stored long term for various purposes (such as compliance), is considered to be in a “cold” state. Examples of cold data include backup data, archives such as mail or file archives, and replicas of production datasets stored off site or in the cloud for disaster recovery purposes.
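As a rough illustration of how access frequency maps to these three states, the sketch below classifies a dataset by the time since it was last accessed. The 7-day and 90-day thresholds are hypothetical values chosen for this example; real classification policies will vary by organization and workload.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds for this example only: data touched within the last
# week is treated as hot, within the last 90 days as warm, anything older as cold.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)

def classify(last_accessed, now=None):
    """Return 'hot', 'warm', or 'cold' based on time since last access."""
    now = now or datetime.utcnow()
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"   # Tier 1: actively processed and consumed
    if age <= WARM_WINDOW:
        return "warm"  # Tier 2: periodic access
    return "cold"      # Tier 3: retained for compliance, backup, or DR

# Example: a spreadsheet edited yesterday vs. an archive untouched for a year
print(classify(datetime.utcnow() - timedelta(days=1)))    # -> hot
print(classify(datetime.utcnow() - timedelta(days=365)))  # -> cold
```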
In the data lifecycle, hot data typically belongs to Stages 1, 2, and 3 (i.e., data generation, data processing, and data storage/consumption) as described in the previous section. In these stages, the data is typically stored and maintained on a Tier 1 enterprise storage platform such as NetApp All Flash FAS (AFF) or Tier 1 enterprise cloud storage such as Amazon EBS or Azure disks. Warm data often belongs to Stage 3, data storage/consumption (as Tier 2), while cold data belongs to Stage 4, the archival stage in the data lifecycle, and can be securely stored away for long-term archival purposes on inexpensive object or cloud storage (Tier 3).
While intelligently categorizing data into hot, warm, and cold data tiers would benefit organizations by reducing the Total Cost of Ownership (TCO), it is intrinsically difficult for many organizations to accurately differentiate and meaningfully separate their data according to these three categories.
Why is that? Consider the simple example of a retail organization with various stores and outlets. Can the age of the sales data be used to differentiate hot data from warm data? Would you categorize the real-time sales data coming in from the point-of-sale terminals or the e-commerce platform (such as the online store) as Tier 1 hot data, while categorizing data from the recent past (i.e., 3-6 months) as warm and the rest as cold? What happens when the end of the business year approaches and you need to run various business intelligence (BI) reports to analyze year-over-year sales performance, so that historical data now needs to be promoted back to hot data for processing?
In addition to the difficulty of accurately identifying hot, warm, and cold data, another challenge many organizations face is understanding how to store and manage each type of dataset in the most efficient manner. Would you keep both hot and warm data on the same Tier 1 storage platform? Do you move all the cold data off to a cheaper cloud object storage platform? How is that data migration challenge addressed in a way that is seamless to the front-end applications that consume the data? Going back to the previous example, if all the sales data is stored in a single database that is accessed by a number of applications for reporting and analytics purposes, what would happen when historical data is archived off to a different storage platform? Do the applications now need to point to a different source to get historical sales stats? Or, if you keep all the data in the same database, how would you cope with the ever-increasing size of the production database and, in turn, the size of the underlying storage solution? Is it cost efficient to keep buying additional storage shelves for your on-premises storage array?
Due to these challenges, many organizations inevitably end up keeping most, if not all, of their hot and warm data on the same expensive storage solution in their enterprise data center. This results in an unnecessarily high TCO, including the cost of the underlying storage platform as well as the administrative cost of managing that data. The high TCO is then amplified by regulatory and compliance needs such as disaster recovery (DR), where the same dataset must be duplicated elsewhere, consuming additional resources.
A well-architected enterprise data storage platform should factor these concerns into the foundation of its infrastructure and application design. An effective, automated data fabric that allows an organization to worry less about managing its data can add real benefit to the bottom line through reduced TCO and a greater ability to respond to changing customer demands.
The NetApp Cloud Tiering service is a key part of the NetApp Data Fabric toolset. It allows customers to benefit from high-performance on-premises storage platforms such as NetApp AFF, while also automatically leveraging the low cost and high durability of public cloud object storage.
NetApp Cloud Tiering allows customers to store and run all of their hot, warm, and cold data on a single storage platform, providing a single, fixed access point for all applications. Behind the scenes, the Cloud Tiering service identifies infrequently used warm and cold data and automatically, and most importantly, seamlessly moves that data to cheaper object storage platforms in the cloud, such as Azure Blob, Amazon S3, and Google Cloud Storage. When that infrequently used data needs to be accessed again, it is seamlessly promoted back to the Tier 1 storage tier, so applications and users consuming the data continue to benefit from the high performance on offer.
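To illustrate the general idea behind this kind of automated, policy-driven tiering (a conceptual sketch only, not NetApp’s actual implementation or API), the code below demotes any dataset that has gone untouched for a configurable cooling period to an object store and transparently promotes it back on the next read. The `TieringEngine` class, the `object_store` interface (`get`/`put`), and the 31-day cooling period are all hypothetical.

```python
from datetime import datetime, timedelta

COOLING_PERIOD = timedelta(days=31)  # hypothetical threshold for this sketch

class TieringEngine:
    """Conceptual sketch of cooling-period-based tiering (not a product API)."""

    def __init__(self, object_store):
        self.object_store = object_store  # e.g., a client for S3 / Azure Blob / GCS
        self.local = {}                   # name -> (data, last_accessed) on the performance tier
        self.tiered = set()               # names currently held only in the object store

    def read(self, name):
        """Serve a read; transparently promote data that was tiered to the cloud."""
        if name in self.tiered:
            data = self.object_store.get(name)  # promote back to the performance tier
            self.tiered.discard(name)
        else:
            data, _ = self.local[name]
        self.local[name] = (data, datetime.utcnow())
        return data

    def tier_cold_data(self):
        """Background job: move anything untouched for the cooling period to object storage."""
        now = datetime.utcnow()
        for name, (data, last_accessed) in list(self.local.items()):
            if now - last_accessed > COOLING_PERIOD:
                self.object_store.put(name, data)
                del self.local[name]
                self.tiered.add(name)
```

The key design point mirrored here is that callers always go through the same read path, so demotion and promotion of data remain invisible to the applications consuming it.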
NetApp Cloud Tiering helps customers efficiently manage their data lifecycle without complex manual involvement, helping reduce overall TCO by up to 30%. By combining Tier 1 storage with lower-cost Tier 2 and Tier 3 object storage in the cloud, NetApp Cloud Tiering also helps customers achieve up to 50x more capacity on their existing AFF systems, significantly reducing the CAPEX investment needed to meet complex data storage requirements.
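As a back-of-the-envelope illustration of where a capacity multiplier of that order can come from (the figures below are hypothetical assumptions, not NetApp measurements or pricing): if only a small fraction of the overall dataset needs to stay on the performance tier, the logical capacity that tier can front scales roughly with the inverse of that fraction.

```python
# Hypothetical example: effective capacity when only hot data stays on the AFF tier.
aff_usable_tb = 100    # assumed usable capacity of the on-premises performance tier
hot_fraction = 0.02    # assume only 2% of the overall dataset is actively hot

# If warm and cold blocks are tiered off to object storage, the same flash
# footprint can front a much larger logical dataset:
effective_capacity_tb = aff_usable_tb / hot_fraction
print(f"{effective_capacity_tb:.0f} TB")  # -> 5000 TB, a 50x multiplier in this scenario
```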
Interested in trialing NetApp Cloud Tiering to assess the savings to be had in your environment? Sign up for a free trial to get started.