The Microsoft Azure cloud provides a range of managed services that can help your organization ingest, process, and analyze big data using a variety of technologies and approaches, including machine learning, Hadoop and Apache Spark, stream processing, and business intelligence (BI).
Azure analytics services are offered in several deployment models, including Platform as a Service (PaaS) and Infrastructure as a Service (IaaS), and integrate seamlessly with other Microsoft services as well as third-party tools.
Related content: read our guide to Azure big data.
Big data architectures are complex and vary according to each organization's needs and design choices. However, certain logical components should be built into any big data architecture.
The following diagram shows how these logical components fit together in a big data architecture. Note that not every solution uses every component.
Azure Synapse combines enterprise data warehousing with big data analytics. This analytics service lets organizations query data on their own terms and at scale, with flexible options that include serverless on-demand and provisioned resources. It also provides a centralized interface for data ingestion, preparation, and management.
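For example, here is a minimal sketch of querying files in a data lake through a Synapse serverless SQL endpoint from Python with pyodbc; the workspace name, credentials, and storage path are hypothetical placeholders, and other authentication options exist.

```python
# Minimal sketch: querying Parquet files in place through a Synapse serverless
# SQL endpoint. Server, credentials, and the storage path are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"  # assumed serverless endpoint
    "Database=master;"
    "UID=<sql-user>;PWD=<sql-password>;"                 # SQL auth shown for brevity
)

# OPENROWSET lets the serverless pool read files directly, without loading them first.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

for row in conn.cursor().execute(query):
    print(row)
```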
Azure Databricks is an analytics platform based on Apache Spark and optimized for the Azure platform. Databricks provides one-click setup, streamlined workflows, and an interactive workspace that promotes collaboration among data scientists, data engineers, and business analysts.
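As an illustration, here is a minimal PySpark sketch of the kind of aggregation you might run in a Databricks notebook; the `spark` session and `display()` helper are provided by the notebook environment, and the table name is a placeholder.

```python
# Minimal sketch of a Databricks notebook cell: aggregate events by day and type.
from pyspark.sql import functions as F

events = spark.read.table("raw_events")          # placeholder table name

daily_counts = (
    events
    .withColumn("day", F.to_date("event_time"))
    .groupBy("day", "event_type")
    .count()
    .orderBy("day")
)

display(daily_counts)   # display() is supplied by the Databricks notebook environment
```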
The Hadoop framework enables complex, distributed analysis jobs on any volume of data. Azure HDInsight simplifies the process of creating Hadoop-based big data clusters, letting you quickly create and scale clusters to fit your needs.
HDInsight provides the major tools of the Hadoop ecosystem, including Apache Kafka, Apache Spark, Hive, Storm, and HBase. Additionally, the service provides enterprise-scale infrastructure for monitoring, compliance, security, and high availability.
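As a rough sketch, HDInsight Spark clusters expose a Livy REST endpoint for remote job submission; the cluster name, credentials, and script location below are hypothetical placeholders.

```python
# Minimal sketch: submit a Spark batch job to an HDInsight cluster via its
# Livy REST endpoint. All names and credentials are placeholders.
import requests

cluster = "https://mycluster.azurehdinsight.net"

response = requests.post(
    f"{cluster}/livy/batches",
    auth=("admin", "<cluster-login-password>"),
    headers={
        "Content-Type": "application/json",
        "X-Requested-By": "admin",   # Livy requires this header on POST requests
    },
    json={"file": "wasbs://jobs@mystorage.blob.core.windows.net/scripts/daily_report.py"},
)

response.raise_for_status()
print("Submitted batch:", response.json()["id"])
```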
Azure Data Factory was designed for Extract, Transform, Load (ETL) operations on structured data that require processing at massive scale. In an ETL process, data is first collected from structured databases, then cleaned, and finally converted into a format suitable for analysis.
Data Factory provides a codeless experience for building both ETL and Extract, Load, Transform (ELT) pipelines, with no infrastructure to manage, and comes with built-in connectors for more than 90 data sources.
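Although pipelines are typically authored visually, an existing pipeline can also be triggered programmatically. Here is a minimal sketch using the azure-mgmt-datafactory package, assuming the subscription, resource group, factory, and pipeline already exist; all names are placeholders.

```python
# Minimal sketch: trigger an existing Data Factory pipeline run.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

run = adf_client.pipelines.create_run(
    resource_group_name="analytics-rg",        # placeholder resource group
    factory_name="my-data-factory",            # placeholder factory name
    pipeline_name="copy-sales-to-lake",        # placeholder pipeline name
    parameters={"load_date": "2024-01-31"},
)
print("Started pipeline run:", run.run_id)
```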
Azure Machine Learning, commonly referred to as Azure ML, is a cloud service that provides pre-packaged and pre-trained machine learning algorithms. In addition to algorithms, Azure ML provides a UI for building machine learning pipelines that cover training, evaluation, and testing.
Azure ML also provides capabilities for interpretable AI, including visualizations and metrics for a wide range of purposes. These features can help you better understand model behavior, apply fairness metrics, and compare algorithms to discover which variant best fits your purposes.
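For example, here is a minimal sketch of submitting a training script as an experiment run with the Azure ML Python SDK (v1-style API); the workspace configuration, script, and compute target names are assumptions.

```python
# Minimal sketch: submit a training script to Azure ML as an experiment run.
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()                     # reads a local config.json for the workspace
experiment = Experiment(workspace=ws, name="churn-model")   # placeholder experiment name

run_config = ScriptRunConfig(
    source_directory="./src",                    # placeholder folder containing train.py
    script="train.py",
    compute_target="cpu-cluster",                # placeholder compute cluster name
)

run = experiment.submit(run_config)
run.wait_for_completion(show_output=True)        # stream logs until the run finishes
```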
This service includes real-time analytics and a complex event-processing engine. You can use Azure Stream Analytics to identify patterns and relationships in information extracted from various sources, including sensors, devices, clickstreams, applications, and social media feeds. You can then use these patterns to trigger actions such as creating alerts, storing data for future use, or sending data to reporting tools.
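Stream Analytics jobs themselves are defined with a SQL-like query in the Azure portal; the sketch below only shows a hypothetical device pushing events into an Event Hub that such a job could read as input, with an example query in the trailing comment. The connection string and hub name are placeholders.

```python
# Minimal sketch: send a telemetry event to an Event Hub that a Stream
# Analytics job could use as its input. Connection details are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hub-namespace-connection-string>",
    eventhub_name="sensor-readings",
)

batch = producer.create_batch()
batch.add(EventData(json.dumps({"device_id": "pump-42", "temperature": 81.5})))
producer.send_batch(batch)
producer.close()

# A matching Stream Analytics query (configured in the job, not in Python)
# might look like:
#   SELECT device_id, AVG(temperature) AS avg_temp
#   FROM [sensor-readings] TIMESTAMP BY EventEnqueuedUtcTime
#   GROUP BY device_id, TumblingWindow(minute, 5)
```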
You can use Azure Data Lake Analytics to build data transformation jobs in a range of languages, such as Python, R, .NET, and U-SQL. Data Lake Analytics is well suited to processing petabyte-scale data. However, the service does not pool data in a data lake during processing, as Azure Synapse Analytics does. Instead, Data Lake Analytics connects to Azure-based data sources, such as Azure Data Lake Storage, and performs analytics on demand based on the specifications in your code.
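To give a flavor of U-SQL, here is a small illustrative script held as a Python string; the input and output paths are placeholders, and the job would be submitted through the Azure portal, CLI, or SDK rather than run locally.

```python
# Illustrative U-SQL script (placeholder paths): extract a search log, count
# hits per query, and write the result back to the data lake.
usql_script = """
@searchlog =
    EXTRACT UserId int, Query string, ClickedUrl string
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

@popular =
    SELECT Query, COUNT(*) AS Hits
    FROM @searchlog
    GROUP BY Query;

OUTPUT @popular
    TO "/output/popular_queries.csv"
    USING Outputters.Csv();
"""
```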
Azure Analysis Services is a fully managed platform as a service (PaaS) offering for enterprise-grade, cloud-based data models. It offers features for advanced modeling and mashup, which enable you to combine data from various sources, define metrics, and secure all your data in one tabular semantic data model. This lets you perform ad hoc data analysis more easily and quickly with various tools, including Excel and Power BI.
Azure Data Explorer enables fast and scalable exploration of log and telemetry data. You can use this service to handle the massive data streams generated by various systems, with features for collecting, storing, and analyzing data. A major advantage of Azure Data Explorer is that it lets you run complex ad hoc queries in seconds.
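For example, here is a minimal sketch of running a Kusto Query Language (KQL) query from Python with the azure-kusto-data package; the cluster URL, database, and table names are hypothetical.

```python
# Minimal sketch: query an Azure Data Explorer cluster with KQL.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net"   # placeholder cluster URL
)
client = KustoClient(kcsb)

# Count error-level traces in five-minute bins over the last hour (placeholder table).
query = """
AppTraces
| where Timestamp > ago(1h)
| summarize errors = countif(SeverityLevel >= 3) by bin(Timestamp, 5m)
| order by Timestamp asc
"""

response = client.execute("telemetry-db", query)       # placeholder database name
for row in response.primary_results[0]:
    print(row["Timestamp"], row["errors"])
```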
Azure Data Share enables simple and secure data sharing with multiple collaborators, including external users such as customers and third-party partners. The service lets you provision a new data sharing account in a few clicks, add datasets, and invite users to the account. A major advantage of Azure Data Share is that it makes it easy to combine data from third-party sources.
Azure Time Series Insights Gen2 provides end-to-end Internet of Things (IoT) analytics capabilities that can be scaled according to changing needs and demands. The platform provides a user-friendly interface and APIs for integration with existing tooling.
NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure, and Google Cloud. Cloud Volumes ONTAP supports a capacity of up to 368TB and a range of use cases, such as file services, databases, DevOps, or any other enterprise workload, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.
Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.
In addition, the built-in storage efficiency features, including thin provisioning, data compression, deduplication, and data tiering, reduce storage footprint and costs by up to 70%.