
Google Cloud Dataflow: The Basics and 4 Critical Best Practices

What Is Google Cloud Dataflow?

Google Cloud Dataflow is a managed service used to execute data processing pipelines. It provides a unified model for defining parallel data processing pipelines that can run batch or streaming data.

In Cloud Dataflow, a pipeline is a sequence of steps that reads, transforms, and writes data. Each pipeline takes large amounts of data, potentially combining it with other data, and creates an enriched target dataset. The resulting dataset can be similar in size to the original, or a smaller summary dataset.

Dataflow is a fully managed pipeline runner that does not require initial setup of the underlying resources. Because it is fully integrated with the Google Cloud Platform (GCP), it can easily be combined with other Google Cloud big data services, such as Google BigQuery.

This is part of our series of articles about Google Cloud Databases.

Related content: Read our guide to Google Cloud big data


Cloud Dataflow Key Concepts

A data processing pipeline includes three core steps—reading data from a source, transforming it, and then writing the data back into a sink.

The first step involves reading data from a source into a PCollection (parallel collection), which can be distributed across multiple machines. Next, the transform step involves performing one or several operations, called transforms, on your PCollection. Since PCollections are immutable, the pipeline creates a new PCollection every time it runs a transform operation. Finally, after the pipeline completes executing all transforms, it writes a final PCollection to an external sink.
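For illustration, here is a minimal Apache Beam pipeline in Python that follows this read-transform-write pattern; the bucket paths and step names are placeholders, not part of any specific Dataflow setup.

```python
import apache_beam as beam

# Minimal read -> transform -> write sketch; the gs:// paths are placeholders.
with beam.Pipeline() as pipeline:
    lines = pipeline | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')  # source -> PCollection
    words = lines | 'Split' >> beam.FlatMap(lambda line: line.split())               # transform -> new PCollection
    counts = (
        words
        | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
        | 'CountPerWord' >> beam.CombinePerKey(sum)                                  # another transform, another PCollection
    )
    formatted = counts | 'Format' >> beam.Map(lambda kv: f'{kv[0]},{kv[1]}')
    formatted | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/word-counts')  # sink
```

Because PCollections are immutable, each transform in the sketch produces a new PCollection rather than modifying the previous one.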

Apache Beam SDK
Google Cloud Dataflow is a managed runner for Apache Beam pipelines. You can use the Apache Beam SDK in Java or Python to create a pipeline. Once you deploy and execute your pipeline, it is called a Dataflow job. You can use the Google Cloud console UI, the API, or the gcloud CLI to create Dataflow jobs.
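As a rough sketch, the snippet below shows how a Python pipeline might be submitted as a Dataflow job by selecting the Dataflow runner in the pipeline options; the project, region, and bucket values are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket values.
options = PipelineOptions(
    runner='DataflowRunner',              # execute on the Dataflow service instead of locally
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',  # temp/staging location required by Dataflow
    job_name='example-dataflow-job',
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
        | 'ToUpper' >> beam.Map(str.upper)
        | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/result')
    )
```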

Virtual machines
Understanding virtual machines (VMs) is important when using Dataflow: the service assigns worker VMs to execute data processing tasks and lets you customize their size and shape. Dataflow also provides auto-scaling, automatically increasing or decreasing the number of worker VMs required to run the job as traffic patterns spike or subside.
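For example, worker VM size and autoscaling behavior can be controlled through pipeline options; the values below are purely illustrative.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative worker settings; adjust to your workload.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    machine_type='n1-standard-4',             # size/shape of each worker VM
    num_workers=2,                            # initial number of workers
    max_num_workers=20,                       # upper bound for autoscaling
    autoscaling_algorithm='THROUGHPUT_BASED'  # let Dataflow scale workers with load
)
```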

Dataflow streaming engine
Dataflow uses a streaming engine to separate compute from storage and move parts of your pipeline execution out of your worker VMs and into the Dataflow service backend. This aims to improve auto-scaling and reduce data latency.
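If this fits your setup, the streaming engine can be enabled through a pipeline option on a streaming job, as sketched below.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Run as a streaming job and offload state and shuffle work to the Dataflow backend.
options = PipelineOptions(
    runner='DataflowRunner',
    streaming=True,                 # streaming mode
    enable_streaming_engine=True,   # use the Dataflow Streaming Engine service backend
)
```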

Dataflow jobs
Google Cloud Dataflow offers several job options. Dataflow templates are a collection of pre-built pipeline definitions that let you create ready-made jobs. Alternatively, you can use a template as a baseline for a custom job. You can also share templates with other collaborators. Inline monitoring allows you to track your job progress.
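As one way to launch a job from a template programmatically, the sketch below uses the Dataflow REST API through the Google API Python client to launch the public Word Count template; the project, bucket, and job names are placeholders.

```python
from googleapiclient.discovery import build

# Placeholder project, bucket, and job names.
project = 'my-gcp-project'
template_path = 'gs://dataflow-templates/latest/Word_Count'  # a Google-provided template

dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().templates().launch(
    projectId=project,
    gcsPath=template_path,
    body={
        'jobName': 'wordcount-from-template',
        'parameters': {
            'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
            'output': 'gs://my-bucket/wordcount/output',
        },
    },
)
response = request.execute()
print(response)  # contains the created Dataflow job's metadata
```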

Dataflow lets you access job metrics directly to troubleshoot pipelines at the step and worker levels. The Dataflow interface lets you use Vertex AI notebooks to build and deploy data pipelines based on the latest data science and machine learning (ML) frameworks.

Dataflow SQL
Dataflow SQL lets you use SQL to create streaming pipelines from the Google Cloud BigQuery web UI. This feature lets you join streaming data from Pub/Sub with files in Google Cloud Storage or with BigQuery tables. You can write results to BigQuery and build a dashboard for real-time visualization.

Use Cases of Google Cloud Dataflow

Stream Analytics

Google’s stream analytics service helps you organize your data so that it is useful and easily accessible from the moment it is generated. The service runs on top of Cloud Dataflow alongside Pub/Sub and BigQuery to provide a streaming solution. It provisions the computing resources required to ingest, process, and analyze fluctuating data volumes and produce real-time insights.

This abstracted provisioning approach helps simplify the complexities of streaming and allows data engineers and analysts to access stream analytics easily.
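A minimal sketch of this pattern, assuming a Pub/Sub topic of JSON events and a BigQuery table with a matching schema (all names below are placeholders):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # streaming mode; add Dataflow options to run on the service

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadEvents' >> beam.io.ReadFromPubSub(topic='projects/my-gcp-project/topics/events')
        | 'Decode' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            'my-gcp-project:analytics.events',
            schema='user_id:STRING,action:STRING,event_time:TIMESTAMP',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```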

Related content: Read our guide to Google Cloud analytics

Real-Time AI

Cloud Dataflow can feed streaming events into Vertex AI and TensorFlow Extended (TFX) in Google Cloud, helping enable real-time predictive analytics, fraud detection, and personalization. Common use cases for these real-time AI capabilities include pattern recognition, anomaly detection, and prediction workflows.
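As a simplified illustration of an anomaly detection step, the hypothetical DoFn below flags readings above a fixed threshold; a production pipeline would typically call a deployed model (for example, a Vertex AI endpoint) instead.

```python
import apache_beam as beam

class DetectAnomaly(beam.DoFn):
    """Hypothetical threshold-based anomaly check over sensor readings."""

    def __init__(self, threshold=100.0):
        self.threshold = threshold

    def process(self, reading):
        # reading is assumed to be a dict like {'sensor_id': ..., 'value': ...}
        if reading['value'] > self.threshold:
            yield {**reading, 'anomaly': True}

# Usage inside a pipeline:
#   anomalies = readings | 'Detect' >> beam.ParDo(DetectAnomaly(threshold=250.0))
```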

TFX uses Dataflow with Apache Beam as a distributed data processing engine, supporting various stages of the machine learning lifecycle. Kubeflow Pipelines supports these stages when you integrate CI/CD pipelines for machine learning.

Sensor and Log Data Processing

Google Cloud Dataflow allows you to unlock business insights from a global network of IoT devices by leveraging intelligent IoT capabilities. Its scalability and managed integration options help you connect, store, and analyze data in Google Cloud and on edge devices.
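For example, a common pattern for sensor data is windowed aggregation; the sketch below averages readings per device over fixed 60-second windows, using invented sample data.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'CreateReadings' >> beam.Create([
            ('device-1', 21.5), ('device-1', 22.0), ('device-2', 19.8),  # invented sample data
        ])
        | 'AddTimestamps' >> beam.Map(lambda kv: TimestampedValue(kv, 1700000000))
        | 'Window' >> beam.WindowInto(FixedWindows(60))    # 60-second fixed windows
        | 'MeanPerDevice' >> beam.combiners.Mean.PerKey()  # average value per device per window
        | 'Print' >> beam.Map(print)
    )
```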

Google Cloud Dataflow Best Practices

1. Build Manageable Data Pipelines  

Use composable transforms to express business logic and create easy-to-maintain pipelines. Keep the pipeline as simple and clear as possible to make troubleshooting easier in production.
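One way to package business logic behind a single named step is a composite transform; the sketch below is a hypothetical example that wraps parsing and filtering of order records.

```python
import apache_beam as beam

class ParseAndFilterOrders(beam.PTransform):
    """Hypothetical composite transform: parse CSV order lines and drop small orders."""

    def __init__(self, min_amount):
        super().__init__()
        self.min_amount = min_amount

    def expand(self, lines):
        return (
            lines
            | 'ParseCsv' >> beam.Map(lambda line: line.split(','))
            | 'ToDict' >> beam.Map(lambda fields: {'order_id': fields[0], 'amount': float(fields[1])})
            | 'FilterSmall' >> beam.Filter(lambda order: order['amount'] >= self.min_amount)
        )

# Usage: orders = lines | 'PrepareOrders' >> ParseAndFilterOrders(min_amount=10.0)
```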

2. Apply Lifecycle Tests

An end-to-end data pipeline should include lifecycle testing to analyze the update, drain, and cancel options. Lifecycle testing helps you understand your pipeline’s interactions with data sinks and unavoidable side effects. It is important to understand the interactions between the lifecycle events of Cloud Dataflow and your sinks and sources.

These tests also help you understand interactions during recovery and failover situations, such as the effect of watermarks on events. For instance, sending elements with historic timestamps into the pipeline may result in the system treating these elements as late data, which can create a problem for recovery.
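One way to exercise watermark behavior in a test, assuming the Apache Beam Python SDK, is TestStream, which lets you inject elements with historic timestamps behind an advanced watermark; the element values below are invented.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream
from apache_beam.transforms.window import TimestampedValue

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

stream = (
    TestStream()
    .add_elements([TimestampedValue('on-time', 10)])
    .advance_watermark_to(100)
    .add_elements([TimestampedValue('late', 5)])  # historic timestamp, now behind the watermark
    .advance_watermark_to_infinity()
)

with TestPipeline(options=options) as pipeline:
    # Inspect how downstream windowing and triggering treat the late element.
    pipeline | stream | beam.Map(print)
```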

3. Clone Streaming Environments

A streaming data source like Cloud Pub/Sub lets you attach subscriptions to topics. You can clone the streaming production environment for major updates by creating a new subscription for the production environment’s topic.

You can do this on a regular cadence, for instance once you’ve reached a threshold of minor updates to the pipeline. This approach also lets you perform A/B testing, ensuring lifecycle events go smoothly in the production environment if you use splittable streaming data.
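A minimal sketch of attaching a new subscription to the existing production topic with the google-cloud-pubsub client library; the project, topic, and subscription names are placeholders.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path('my-gcp-project', 'events-topic')             # existing production topic
clone_path = subscriber.subscription_path('my-gcp-project', 'events-sub-clone')  # subscription for the cloned pipeline

subscription = subscriber.create_subscription(
    request={'name': clone_path, 'topic': topic_path}
)
print(f'Created clone subscription: {subscription.name}')
```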

4. Monitor and Configure Alerts

Set up monitoring with custom metrics to reflect your service level objectives (SLOs) and configure alerts to notify you when the metrics approach the specified thresholds. You can use the Cloud Dataflow runner to integrate custom metrics with Stackdriver, and Stackdriver alerts can help ensure compliance with the specified SLOs.
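As a sketch of the Beam custom metrics API, the DoFn below increments a counter for parse failures; when the pipeline runs on the Dataflow runner, such metrics surface in monitoring, where alerting policies can be attached.

```python
import apache_beam as beam
from apache_beam.metrics import Metrics

class ParseWithMetrics(beam.DoFn):
    """Converts elements to int and counts parse failures as a custom metric."""

    def __init__(self):
        self.parse_failures = Metrics.counter(self.__class__, 'parse_failures')

    def process(self, element):
        try:
            yield int(element)
        except ValueError:
            self.parse_failures.inc()  # reported per job; can drive an alerting policy
```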

Google Cloud Data Optimization with Cloud Volumes ONTAP

NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure and Google Cloud. Cloud Volumes ONTAP capacity can scale into the petabytes, and it supports various use cases such as file services, databases, DevOps or any other enterprise workload, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.

Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.

In addition, Cloud Volumes ONTAP provides storage efficiency features, including thin provisioning, data compression, and deduplication, reducing the storage footprint and costs by up to 70%.

Yifat Perry, Technical Content Manager
