BlueXP Blog

Azure Data Catalog: Understanding Concepts and Use Cases

Written by Jeff Whitaker, Cloud Data Services | Sep 9, 2020 9:47:27 AM

Azure Data Catalog enables registrants to share data sources. Developers, data scientists, and analysts can use Data Catalog to discover, verify, and use community datasets. Data Catalog provides functionalities for data discovery, data understanding, and data consumption. Data Catalog users are classified as data producers, data consumers, or both.

In this post, we’ll explain key Azure Data Catalog concepts and use cases, and demonstrate how you can use Data Catalog data sources. We’ll also show how NetApp's solution for Azure File Storage can help you optimize storage resources on Azure.

This is part of a series of articles about Azure storage.

What Is Azure Data Catalog?

Azure Data Catalog is a service that serves as a central repository for metadata about data sources, including big data sources. It is designed to help users, including developers, data scientists, and analysts, discover, verify, and use datasets contributed by the community. Data Catalog is built from crowdsourced annotations and metadata and is meant to enable data users and collectors to share their efforts.

Once data sources are registered in Data Catalog, any user with access can add to the metadata to enrich the set. This includes adding tags, descriptions, processes for requesting access, or documentation. Any custom metadata that is added is used to supplement the structural metadata added from the data source.
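As a rough illustration of how custom metadata supplements structural metadata, the following Python sketch models a catalog entry in memory. The `CatalogEntry` class and its field names are hypothetical and are not part of the Azure Data Catalog API. They only show that user-added tags and descriptions sit alongside the metadata extracted from the source, without replacing it.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical in-memory model of one registered data source."""
    name: str
    # Structural metadata extracted from the source at registration time.
    structural: dict
    # Custom metadata contributed later by any user with access.
    tags: set = field(default_factory=set)
    description: str = ""
    access_request_process: str = ""

    def combined_metadata(self) -> dict:
        """Return structural metadata supplemented by custom metadata."""
        return {
            **self.structural,
            "tags": sorted(self.tags),
            "description": self.description,
            "access_request_process": self.access_request_process,
        }


entry = CatalogEntry(
    name="sales_orders",
    structural={"source_type": "SQL Server table", "columns": ["id", "amount"]},
)
entry.tags.add("finance")
entry.description = "One row per confirmed customer order."
print(entry.combined_metadata())
```

Keeping the two kinds of metadata in separate fields means a re-scan of the source could refresh `structural` without discarding what users have contributed.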

Azure Data Catalog Use Cases

Azure Data Catalog can serve many users for a variety of purposes, but the two most common uses are centralizing data information and supporting business intelligence.

Registration of central data sources
As organizations grow, the amount of data they collect can quickly become difficult to manage. When data is not inventoried, it becomes more difficult to organize and less useful, since many individuals in the organization may not even know it exists.

By registering data in Data Catalog, organizations can ensure that data is available to all relevant business units. It can also help ensure that organizations can benefit from the shared knowledge and efforts of all of their users and analysts. This includes benefits for data related to online transaction processing (OLTP) systems, analytics databases, data warehouses, and line-of-business applications.

Data sources for research and business intelligence
Developing business intelligence (BI) requires the combination of many sources of data, including those not created for BI or analysis. When data sources are distributed, organizations are less able to gather, standardize, or apply data to BI purposes.

Aggregating data with Azure Data Catalog can enable analysts to skip some or most of the manual work BI typically requires. Analysts can collaborate with both internal and external teams to identify sources and ensure that data is accurate and relevant. Then, once BI is developed, analysts can share their findings uniformly throughout the organization.

This sharing helps ensure that multiple analyses aren’t required and that business units are all working from the same insights. It also enables end users to contribute to and improve upon data, which can then be used to refine BI.

Azure Data Catalog Key Concepts

When using Data Catalog, there are a few key concepts to be aware of to ensure effective use. These are data discovery, understanding, consumption, and users.

Data discovery
Data discovery is the functionality that enables you to make data searchable and available to users. It ensures that all data registered in the catalog is discoverable.

Data understanding
Data understanding is the functionality that makes data in the catalog interpretable. This includes metadata, any descriptions of the dataset content or format, and any documents defining procedures for use.

Data consumption
Data consumption is the use of data by users. It can include different modes for data access and ingestion. It can also include the ability to allow or restrict subsets of users to access or modify data.

Data users
A data user is anyone who accesses, modifies, consumes, or contributes to data. In general, data users fall into two main groups (producers and consumers), although someone can be in both groups.

  • Producers—those responsible for creating, registering, and maintaining data.
  • Consumers—those who use data that is made available for reporting, analysis, or distribution purposes.
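One way to picture the two groups, and the fact that a single person can belong to both, is a small Python sketch using a flag enumeration. The `DataRole` type, the user names, and the `can_register` rule are all hypothetical and only illustrate the classification described above. They do not reflect how Azure Data Catalog itself assigns permissions.

```python
from enum import Flag, auto


class DataRole(Flag):
    PRODUCER = auto()   # creates, registers, and maintains data
    CONSUMER = auto()   # uses data for reporting, analysis, or distribution


# Hypothetical user-to-role assignments.
roles = {
    "dana": DataRole.PRODUCER,
    "lee": DataRole.CONSUMER,
    "sam": DataRole.PRODUCER | DataRole.CONSUMER,  # belongs to both groups
}


def can_register(user: str) -> bool:
    """Only producers register and maintain data sources in this sketch."""
    return DataRole.PRODUCER in roles[user]


print([user for user in roles if can_register(user)])
```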

How to Use Data Sources in Azure Data Catalog

When using Azure Data Catalog there are four common actions you and your team should be familiar with—registering data, discovering data, annotating data, and documenting data sources. The following sections briefly explain how to perform these tasks. You can find more detailed information on these and other actions in the official documentation.

Register data sources
Registration involves extracting metadata from your sources and transferring it to your catalog. The data itself is not moved, only the metadata used to identify it. This enables you to continue controlling data with your existing policies and tooling.

When you want to register a data source, you need to:

  1. Start your Data Catalog data source registration tool. This is found in the Data Catalog portal.
  2. Sign in using an account with the proper Azure Active Directory credentials (the same account you use for the portal).
  3. Choose the data source you want to add to the catalog and follow the registration steps.

Once your data source is registered, the service automatically tracks the data location and indexes the metadata. When registration is complete, the data is available for discovery and use.
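The idea that registration copies only metadata, never the data itself, can be sketched in Python. This is not the Data Catalog registration tool. It is a minimal illustration that assumes a SQLite database as a stand-in source, a hypothetical `extract_table_metadata` helper, and a plain dictionary acting as the catalog.

```python
import sqlite3


def extract_table_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Collect column names and declared types for one table; no rows are read."""
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    # The table name is interpolated directly, so it must come from a trusted caller.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {
        "name": table,
        "source_type": "SQLite table",
        "columns": [{"name": c[1], "type": c[2]} for c in columns],
    }


# A throwaway in-memory database stands in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO orders (amount) VALUES (19.99)")

catalog = {}  # hypothetical catalog: maps source name -> metadata
metadata = extract_table_metadata(conn, "orders")
catalog[metadata["name"]] = metadata

print(catalog["orders"]["columns"])
```

The row inserted into `orders` never reaches `catalog`, which mirrors why existing policies and tooling on the source continue to govern the data.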

Discover data sources
Discovering data in Data Catalog is done through filtering and searching. Filtering enables you to limit results by characteristics including source type, tags, object type, and expert users. Searching enables you to match data by any included property, such as data annotations.

The most efficient way to discover data is to use a combination of filtering and searching. This enables you to both find specific datasets and to identify data that you may not have known were available.
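To make the difference between filtering and searching concrete, here is a small Python sketch over an in-memory list of entries. The entry fields and the `filter_entries` and `search_entries` functions are hypothetical and stand in for the portal's behavior. They are not calls to the Data Catalog service.

```python
entries = [
    {"name": "orders", "source_type": "SQL Server table", "tags": {"finance"},
     "description": "Confirmed customer orders"},
    {"name": "web_logs", "source_type": "Azure Blob", "tags": {"marketing"},
     "description": "Raw clickstream logs for the storefront"},
    {"name": "refunds", "source_type": "SQL Server table", "tags": {"finance"},
     "description": "Refunds issued against customer orders"},
]


def filter_entries(items, source_type=None, tag=None):
    """Keep entries matching every filter that was supplied."""
    return [
        e for e in items
        if (source_type is None or e["source_type"] == source_type)
        and (tag is None or tag in e["tags"])
    ]


def search_entries(items, term):
    """Case-insensitive match of a term against the name or description."""
    term = term.lower()
    return [
        e for e in items
        if term in e["name"].lower() or term in e["description"].lower()
    ]


# Combine both: narrow by source type, then search within the result.
sql_tables = filter_entries(entries, source_type="SQL Server table")
matches = search_entries(sql_tables, "refund")
print([e["name"] for e in matches])
```

Filtering alone (`sql_tables`) surfaces every SQL Server table, including ones you may not have known about, while adding the search term narrows the result to a specific dataset.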

Annotate data sources
One of the most powerful features of Azure Data Catalog is the ability to annotate data. Annotation enables users across your organization to contribute knowledge and expertise to refining datasets and what those sets can be used for. For example, analysts can help clarify which reports the data contributes to, IT can add information about how the data can be accessed, and legal can clarify which regulations may apply to the data.

Additionally, because visibility in the catalog does not automatically equal access, annotations can be used to vet recommended changes to data. For example, if a user recognizes that data is unreliable or incomplete, they can make a note of it and other users can see their concerns. Then, action can be taken to verify the concern without fear that data has been changed, possibly in error.
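The point that annotations record concerns without touching the underlying data can be sketched as follows. The `Annotation` and `AnnotatedSource` classes, the example addresses, and the `location` string are hypothetical. The sketch only assumes that a catalog record holds a pointer to the data plus a growing list of notes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Annotation:
    author: str
    note: str
    created_at: datetime


@dataclass
class AnnotatedSource:
    """Hypothetical catalog record; `location` points at data that stays put."""
    name: str
    location: str
    annotations: list = field(default_factory=list)

    def annotate(self, author: str, note: str) -> Annotation:
        # Only the catalog record changes; nothing is written to `location`.
        annotation = Annotation(author, note, datetime.now(timezone.utc))
        self.annotations.append(annotation)
        return annotation


source = AnnotatedSource(name="orders", location="sqlserver://sales/orders")
source.annotate("analyst@example.com", "Rows before 2019 look incomplete.")
source.annotate("it@example.com", "Request access through the data team.")

for a in source.annotations:
    print(f"{a.author}: {a.note}")
```

Because each note carries its author and timestamp, other users can see who raised a concern and follow up before anyone changes the source data.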

Document data sources
With Data Catalog, you can document an inventory of your data assets. This includes any data you may have stored in other content repositories since you can create links to this data in your catalog.

The detail level of your documentation is customizable depending on your needs. You can record just characteristics, value, or purpose of data sources, or you can include fully detailed descriptions of data schemas. In general, when creating documentation, you have three options to choose from:

  • Document containers only—this defines where data is stored and basic information about the data. This often isn’t enough information for users to make informed decisions about whether or not data is useful to them.
  • Document tables only—this defines information that is specific to the data stored but does not define where data is stored or how it can be accessed. This can help users make decisions about data but can also make it difficult to use data.
  • Document containers and tables—this defines both data specifics and data use information. This is the most useful method of documentation but may require more maintenance if data is frequently moved or modified.
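The three documentation options can be pictured with a short Python sketch. The `ContainerDoc`, `TableDoc`, and `AssetDoc` classes are hypothetical names introduced for illustration and do not correspond to Data Catalog object types. The `level` method simply reports which of the options above a given record follows.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TableDoc:
    name: str
    description: str
    columns: dict = field(default_factory=dict)  # column name -> meaning


@dataclass
class ContainerDoc:
    name: str
    location: str
    access_notes: str


@dataclass
class AssetDoc:
    container: Optional[ContainerDoc] = None
    tables: list = field(default_factory=list)

    def level(self) -> str:
        """Name which of the three documentation options this record uses."""
        if self.container and self.tables:
            return "containers and tables"
        if self.container:
            return "containers only"
        if self.tables:
            return "tables only"
        return "undocumented"


doc = AssetDoc(
    container=ContainerDoc("SalesDB", "sqlserver://sales", "Read access via data team"),
    tables=[TableDoc("orders", "Confirmed orders", {"amount": "Order total in USD"})],
)
print(doc.level())
```

A check like this could be used to flag records that document only one side, since those are the cases where users either cannot judge the data or cannot reach it.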

Azure NetApp Files for Big Data Environments

Azure NetApp Files is a Microsoft Azure file storage service built on NetApp technology, giving you the file capabilities in Azure that even your core business applications require.

Bring enterprise-grade data management and storage to Azure so you can manage your workloads and applications with ease, and move all of your file-based applications to the cloud.

Azure NetApp Files solves availability and performance challenges for enterprises that want to move mission-critical applications to the cloud, including HPC, SAP, Linux, Oracle, and SQL Server workloads, Windows Virtual Desktop, and more.

In particular, moving big data into Azure NetApp Files can help meet your analytics and high-performance computing (HPC) requirements. A sample use case is genomics, where gene data resides in thousands of files, and the faster that data can be read, the faster the analysis can complete. Azure NetApp Files provides sub-millisecond access response time, which directly translates to improved processing speed.

Want to get started? See Azure NetApp Files for yourself with a free demo