Getting Started with Cloud Dataflow

Hafeez Baig | 18 steps | 2 minutes

1. Sign in to the **Google Cloud Console**
2. Type "**Dataflow**" in the search bar and click on the **Dataflow** option
**What is Dataflow?**
**Dataflow** is a fully managed service in Google Cloud that enables the real-time or batch processing of large data sets. It is based on the Apache Beam programming model, allowing developers to create pipelines that process, transform, and analyze data efficiently. Dataflow automatically handles resource provisioning, scaling, and optimization, making it ideal for processing big data without needing to manage infrastructure. It supports use cases like ETL (Extract, Transform, Load), real-time analytics, and event-driven processing, helping businesses gain insights from their data streams or stored data.
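
To make the Apache Beam model concrete, here is a minimal sketch of a word-count pipeline using the Beam Python SDK. The project ID, bucket, and file paths are placeholders, not values from this guide; switch the runner to `DirectRunner` to test locally.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; use runner="DirectRunner" to run locally instead of on Dataflow.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project-id",              # placeholder project ID
    region="us-central1",                 # placeholder regional endpoint
    temp_location="gs://my-bucket/temp",  # placeholder staging bucket
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")    # placeholder input
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")  # placeholder output
    )
```
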
3. The **Jobs** wizard will open; click on the **CREATE JOB FROM TEMPLATE** button
4. The **Dataflow API** wizard will open; click on the **ENABLE** button
**What is the Dataflow API?**
The **Dataflow API** is a set of tools and endpoints provided by Google Cloud that allows developers to interact programmatically with the Dataflow service. Through the API, users can manage Dataflow jobs, such as creating, submitting, monitoring, and canceling pipelines. It enables automation and integration of Dataflow within other applications and services, making it easier to deploy and scale data processing pipelines. The API also provides detailed information about the status of running jobs, job history, and resource usage, helping manage large-scale data processing workflows efficiently.
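
As an illustration, here is a hedged sketch of one Dataflow API call: listing the jobs in a region with the `google-api-python-client` discovery client. The project ID and region are placeholders, and credentials are assumed to come from Application Default Credentials.

```python
from googleapiclient.discovery import build

# Build a client for the Dataflow REST API (version v1b3).
dataflow = build("dataflow", "v1b3")

# List jobs in one regional endpoint; project ID and region are placeholders.
response = (
    dataflow.projects()
    .locations()
    .jobs()
    .list(projectId="my-project-id", location="us-central1")
    .execute()
)

for job in response.get("jobs", []):
    print(job["name"], job["currentState"])
```
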
5. The **Create job from template** wizard will open; click on the **Dataflow templates** option
6. On the right side, enter the **Job name** as "**my-dataflow**"
7. Scroll to the **Regional endpoint** dropdown section and select the regional endpoint as per your requirement
**What is a Regional endpoint?**
A **Regional endpoint** in Google Cloud refers to a specific geographical location where a service or API is hosted and managed. It ensures that resources and services, like compute, storage, or APIs, are operated within a particular region to optimize performance, comply with data residency requirements, and reduce latency for users in that region. Using regional endpoints helps in distributing workloads, enhancing fault tolerance, and improving response times by keeping data and services closer to the end-users.
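
In code, the regional endpoint you pick in the console corresponds to the `region` pipeline option. A minimal sketch, assuming the Beam Python SDK and placeholder values:

```python
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

options = PipelineOptions()
gcp = options.view_as(GoogleCloudOptions)
gcp.project = "my-project-id"              # placeholder project ID
gcp.region = "asia-south1"                 # regional endpoint closest to your data and users
gcp.temp_location = "gs://my-bucket/temp"  # placeholder staging bucket
```
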
8. Scroll to the **Dataflow template** dropdown section and select the **Pub/Sub Proto to BigQuery** option. **Note:** You can select the option as per your requirement.
**What is the Dataflow template: Pub/Sub Proto to BigQuery?**
The **Dataflow template: Pub/Sub Proto to BigQuery** is a pre-built pipeline in Google Cloud's Dataflow service designed to read messages from a **Pub/Sub** topic, deserialize them from **Protocol Buffers (Proto)** format, and then write the processed data into a **BigQuery** table. This template simplifies the process of streaming or batch ingestion of structured data from Pub/Sub into BigQuery, enabling real-time data analytics and reporting. It is particularly useful when dealing with Proto-encoded messages, as it automates the conversion and loading process without the need for extensive custom code.
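
The same template can also be launched outside the console. The sketch below uses the Dataflow API's Flex Template launch method; the template's Cloud Storage path and the parameter names (`inputSubscription`, `outputTableSpec`, `protoSchemaPath`, `fullMessageName`) are assumptions to verify against the template documentation and the Required Parameters shown in the console.

```python
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")

body = {
    "launchParameter": {
        "jobName": "my-dataflow",
        # Assumed public template path; check the template documentation.
        "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/PubSub_Proto_to_BigQuery",
        # Parameter names are assumptions; the console's Required Parameters
        # section lists the authoritative names.
        "parameters": {
            "inputSubscription": "projects/my-project-id/subscriptions/my-subscription",
            "outputTableSpec": "my-project-id:my_dataset.my_table",
            "protoSchemaPath": "gs://my-bucket/schemas/messages.pb",
            "fullMessageName": "my.package.MyMessage",
        },
    }
}

response = (
    dataflow.projects()
    .locations()
    .flexTemplates()
    .launch(projectId="my-project-id", location="us-central1", body=body)
    .execute()
)
print(response)
```
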
9. Scroll to the **Pub/Sub input subscription** dropdown section and select a subscription
**What is a Pub/Sub input subscription?**
A **Pub/Sub input subscription** refers to a **subscription** in Google Cloud Pub/Sub that receives messages from a **topic**. When a Pub/Sub topic publishes messages, the input subscription serves as the point where a subscriber (such as a Dataflow pipeline or other service) listens and retrieves those messages. In a Dataflow pipeline, for example, the **Pub/Sub input subscription** would be the source from which the pipeline consumes incoming messages for further processing, transformation, or storage. It acts as the entry point for data streaming from Pub/Sub into other Google Cloud services like BigQuery or Cloud Storage.
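
For comparison, this is how a hand-written Beam pipeline would consume the same input subscription directly; the subscription path is a placeholder.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # Pub/Sub reads run in streaming mode

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project-id/subscriptions/my-subscription"  # placeholder
        )
        | "Decode" >> beam.Map(lambda message: message.decode("utf-8"))
        | "Log" >> beam.Map(print)
    )
```
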
10. Click on the **CREATE A SUBSCRIPTION** button if you want to create a new subscription
**What is a Subscription?**
A **Subscription** in Google Cloud Pub/Sub is a link between a **topic** and a subscriber, enabling the delivery of messages published to the topic. When a message is published, all active subscriptions associated with the topic receive a copy. Subscribers can retrieve these messages either by pulling them from the Pub/Sub service or by having them pushed automatically to a designated endpoint. Subscriptions ensure reliable message delivery, making them essential for real-time data processing, event-driven architectures, and other messaging-based workflows.
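
If you prefer to create the subscription programmatically rather than through the console button, here is a minimal sketch with the `google-cloud-pubsub` client; the project, topic, and subscription IDs are placeholders, and the topic must already exist.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project-id", "my-topic")                       # placeholder topic
subscription_path = subscriber.subscription_path("my-project-id", "my-subscription")  # placeholder name

# Attach a new subscription to the topic.
subscription = subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)
print("Created:", subscription.name)
```
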
11. Scroll to the **BigQuery output table** section, where you can browse and select the BigQuery output table
**What is a BigQuery output table?**
A **BigQuery output table** is a destination table in Google Cloud's BigQuery service where processed data is stored after being written from a data pipeline, such as one running in Dataflow. When a pipeline processes data—such as reading from Pub/Sub or another source—the results can be loaded into this BigQuery table for querying, analysis, and reporting. The output table acts as the final repository for structured, queryable data, supporting use cases like real-time analytics, batch processing, and data warehousing.
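
In a hand-written Beam pipeline, the equivalent of this step is a `WriteToBigQuery` transform. A minimal sketch with a placeholder table and schema (the template in this guide performs this step for you):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateRows" >> beam.Create([{"name": "example", "value": 1}])
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project-id:my_dataset.my_table",  # placeholder output table
            schema="name:STRING, value:INTEGER",        # placeholder schema
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```
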
12. Scroll to the **Required Parameters** section
**What are Required Parameters?**
**Required parameters** are specific inputs or configurations that must be provided for a process, function, or service to execute successfully. In the context of APIs, services, or templates like Google Cloud Dataflow, required parameters typically include essential information such as project IDs, input/output locations, or authentication details. Without these parameters, the system won't have the necessary data to run the operation, resulting in errors or incomplete processes. These parameters ensure that the system has all the key inputs needed to function correctly.
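
Conceptually, the required parameters reduce to a small set of key/value pairs that must be supplied before the job can launch. The names below are assumptions for the Pub/Sub Proto to BigQuery template; treat the console's Required Parameters section as the authoritative list.

```python
# Assumed parameter names for the Pub/Sub Proto to BigQuery template;
# verify them against the console before launching.
required = {
    "inputSubscription": "projects/my-project-id/subscriptions/my-subscription",
    "protoSchemaPath": "gs://my-bucket/schemas/messages.pb",
    "fullMessageName": "my.package.MyMessage",
    "outputTableSpec": "my-project-id:my_dataset.my_table",
}

# Fail fast if any required value is left empty.
missing = [name for name, value in required.items() if not value]
if missing:
    raise ValueError(f"Missing required parameters: {missing}")
```
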