Deploy Apache Airflow in Google Cloud Composer under a VPC

Last Update: November 13, 2024

Project Overview

Google Cloud Composer is a managed Apache Airflow service that lets you create, schedule, and monitor workflows. Deploying Cloud Composer inside a Virtual Private Cloud (VPC) offers enhanced network isolation and security. This article outlines the specific procedures for setting up Apache Airflow in Google Cloud Composer under a VPC.

Technology Used

Python, Apache Airflow, GCP Cloud Storage, IAM, VPC networking, Cloud NAT

Documentation

Objectives

  • Identify the benefits of deploying Apache Airflow in GCP Composer under a VPC:
    • Understand the benefits of using a VPC for network security and isolation.
    • Recognize the significance of managed services such as Google Cloud Composer for workflow orchestration.

  • Learn to Create and Configure a VPC Network in Google Cloud Platform:
    • Understand how to set up a new VPC network.
    • Configure subnets and firewall rules to secure your network.

  • Set Up Private Service Connect for Enhanced Security:
    • Create and configure Private Service Connect endpoints.
    • Ensure secure communication between services within your VPC.

  • Deploy a Google Kubernetes Engine (GKE) Cluster within the VPC:
    • Learn the steps to create and configure a GKE cluster.
    • Integrate the GKE cluster with your VPC network for secure containerized workloads.

  • Create and Configure a Cloud SQL Instance for Airflow Metadata:
    • Set up a Cloud SQL instance within your VPC.
    • Configure network settings to allow secure access to the SQL instance from your Airflow environment.

  • Deploy a Google Cloud Composer Environment in the Configured VPC:
    • Step-by-step instructions to create a Cloud Composer environment.
    • Understand how to link the Composer environment to your VPC network.

  • Adapt to Evolving Data Requirements:
    • As business needs and data sources change, retain the flexibility to adjust data quality expectations.
    • Continuously update data validation procedures to keep pace with changing data.

  • Configure Environment Variables and Airflow Connections:
    • Learn how to set up necessary environment variables for your Airflow environment.
    • Configure connections to various Google Cloud services and external systems.

  • Deploy and Manage DAGs in Your Airflow Environment:
    • Understand best practices for securing your Airflow deployment.
    • Ensure compliance with organizational and regulatory requirements by leveraging VPC features.

  • Utilize Google Cloud Documentation and Resources:
    • Learn to navigate and use Google Cloud’s documentation for troubleshooting and advanced configurations.
    • Access additional resources for ongoing learning and support.

Technology Stack

Apache Airflow

  • Data orchestrator
  • Visual representation of data pipelines

Composer

  • GCP-managed Airflow

Step 1: Create a VPC Network

1. Create a New VPC:

  • In the Google Cloud console, navigate to VPC network and click Create VPC network.
  • Name your VPC (e.g., airflow-vpc).
  • Choose automatic or custom subnet creation. For custom, define subnets in each region where your Composer environment will be deployed.
  • In the subnet section:
    • Name your subnet (e.g., airflow-vpc-subnet)
    • Region: the region where your Composer environment will run
    • IP stack type: IPv4 (single-stack)
    • IP range: 10.0.0.0/24
    • Private Google Access: On
    • Flow logs: Off
    • Hybrid subnets: Off
    • Click Done
  • Leave the remaining settings at their defaults.
  • Click Create. (An equivalent gcloud sketch follows.)
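If you prefer the CLI, here is a minimal gcloud sketch of the same step. The region us-central1 is a placeholder; substitute your own.

    # Create the custom-mode VPC
    gcloud compute networks create airflow-vpc \
        --subnet-mode=custom

    # Create the subnet with Private Google Access enabled
    gcloud compute networks subnets create airflow-vpc-subnet \
        --network=airflow-vpc \
        --region=us-central1 \
        --range=10.0.0.0/24 \
        --enable-private-ip-google-access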

Step 2: Set Up Firewall Rules (Optional)

1. Open the Firewall Rules Page:

    • In the VPC network menu, select Firewall rules.

2. Create Firewall Rules:

    • Click on Create firewall rule.
    • Name your firewall rule (e.g., allow-internal-traffic).
    • Set the targets to All instances in the network.
    • Source filters: choose IP ranges and enter 10.128.0.0/9 (default internal IP range).
    • Protocols and ports: select Allow all or specify only the necessary protocols (e.g., TCP: 22, 80, 443).
    • Click Create. (A gcloud sketch follows.)
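A minimal gcloud sketch of the same rule, restricted to the ports listed above:

    gcloud compute firewall-rules create allow-internal-traffic \
        --network=airflow-vpc \
        --direction=INGRESS \
        --source-ranges=10.128.0.0/9 \
        --allow=tcp:22,tcp:80,tcp:443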

Step 3: Create Cloud NAT

1. Create a Cloud NAT Gateway:

  • In the console, navigate to Network services → Cloud NAT and click Create Cloud NAT gateway.
  • Name your NAT gateway (e.g., airflow-nat).
  • NAT type: Public
  • Select your VPC network and region.
  • Cloud Router: create a new router (set a custom IP if you need one).
  • Network service tier: Standard
  • Click Create. (A gcloud sketch follows.)
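A minimal gcloud sketch of the same step. The router and gateway names and the region are placeholders.

    # Create a Cloud Router in the VPC
    gcloud compute routers create airflow-nat-router \
        --network=airflow-vpc \
        --region=us-central1

    # Create a public NAT gateway on that router
    gcloud compute routers nats create airflow-nat \
        --router=airflow-nat-router \
        --region=us-central1 \
        --auto-allocate-nat-external-ips \
        --nat-all-subnet-ip-ranges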

Step 4: IAM Configuration

1. Create a Custom Service Account:

    • Navigate to IAM → Service accounts.
    • Click Create service account and name it.

2. Grant Roles to the Custom Service Account:

    • Go to IAM, select the created service account, and update its roles:
      1. Cloud Composer v2 API Service Agent Extension
      2. Cloud Composer API Service Agent
      3. Eventarc Event Receiver
      4. Service Account Token Creator
      5. Editor

3. Configure the Composer Service Account:

    • Select the service-*@cloudcomposer-accounts.iam.gserviceaccount.com account.
    • Update its roles:
      1. Cloud Composer v2 API Service Agent Extension
      2. Cloud Composer API Service Agent
      3. Service Account Admin

(A gcloud sketch of these steps follows.)
Step 5: Bind the Service Accounts

  • Bind your custom service account to the Composer service account:
    • *@*.iam.gserviceaccount.com: your custom service account
    • service-*@cloudcomposer-accounts.iam.gserviceaccount.com: the default Composer service account

    gcloud iam service-accounts add-iam-policy-binding \
        *@*.iam.gserviceaccount.com \
        --member serviceAccount:service-*@cloudcomposer-accounts.iam.gserviceaccount.com \
        --role roles/composer.ServiceAgentV2Ext
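To confirm the binding took effect, you can inspect the account's IAM policy (a usage sketch; the wildcard again stands for your actual account email):

    gcloud iam service-accounts get-iam-policy \
        *@*.iam.gserviceaccount.com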

Step 6: Create the Composer Environment

1. Basic

    • Navigate to Composer from the Cloud console.
    • Click Create environment.
    • Configuration
      1. Name: your Composer environment name
      2. Region: your region
      3. Image version: select the Airflow version
      4. Service account: your custom service account
      5. Labels: optional, but good to have
      6. Resilience mode: Standard

2. Intermediate

    • Environment size: Small (we'll fine-tune resources later)
    • Network Configuration (you can't change this after creation):
      1. Select Networks in this project
      2. Select your network and subnetwork
      3. Secondary IP range for pods: Auto-created
      4. Secondary IP range for services: Auto-created
      5. Network type: Public
      6. Web server network access control: Allow access from all IP addresses

 

3. Final

    • Advanced Configuration
      1. Environment variables: we'll configure these later
      2. Airflow configuration overrides: we'll configure these later
      3. Environment bucket: select a custom bucket
      4. Data encryption: Google-managed encryption key
      5. Maintenance window: select your timezone
      6. Dataplex data lineage integration: disable integration with Dataplex data lineage
      7. Airflow database zone: Any zone
      8. Recovery configuration: set your snapshot schedule

 

4. Click Create and wait approximately 25 minutes for the environment to provision. (An equivalent gcloud sketch follows.)
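A minimal gcloud sketch of the environment creation. ENV_NAME, PROJECT_ID, IMAGE_VERSION, and the region are placeholders; pick an image version available in your project.

    gcloud composer environments create ENV_NAME \
        --location=us-central1 \
        --image-version=IMAGE_VERSION \
        --service-account=airflow-sa@PROJECT_ID.iam.gserviceaccount.com \
        --environment-size=small \
        --network=airflow-vpc \
        --subnetwork=airflow-vpc-subnet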

Step 7: Configure Composer

  • Select the Composer environment.
  • Navigate to Environment Configuration.
    • Go to the Resources section and update the workload configuration (scheduler, workers, web server) to match your pipeline workload. (A gcloud sketch follows.)
    • Don't update the core infrastructure directly (recommended).
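A sketch of the same workload update from the CLI, assuming Composer 2 workload flags; the values are examples, not recommendations:

    gcloud composer environments update ENV_NAME \
        --location=us-central1 \
        --scheduler-cpu=1 --scheduler-memory=2GB \
        --worker-cpu=1 --worker-memory=4GB \
        --min-workers=1 --max-workers=5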

 

  • Navigate to Airflow Configuration overrides

 

  • Core
    • dag_file_processor_timeout: 800
    • dagbag_import_timeout: 600
    • dags_are_paused_at_creation: True (important)
    • default_timezone: Asia/Kuala_Lumpur
    • parallelism: 50
    • max_active_tasks_per_dag: 25
    • max_active_runs_per_dag: 25
    • killed_task_cleanup_time: 604800 (seven days; prevents Airflow errors from premature cleanup of killed tasks)

 

  • Logging
    • logging_level: INFO (for detailed Airflow logs)

 

  • Scheduler
    • dag_dir_list_interval: 180

 

  • Secrets
    • use_cache: True

 

  • Webserver
    • default_ui_timezone: Asia/Kuala_Lumpur (important)
    • default_wrap: True (helpful when debugging logs)
    • navbar_color: #40c706 (optional)
  • Celery
    • worker_autoscale: 5,1 (max, min)
    • worker_concurrency: 8 (number of tasks each worker runs in parallel)
  • Navigate to PyPI packages

Add any pip packages you need (e.g., name: google-cloud-storage, version: ==2.17.0). A gcloud sketch for applying the overrides and packages above follows.
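A minimal sketch of the same updates from the CLI. gcloud applies one category of update per invocation, so the overrides and the package are set with separate commands:

    # Apply a few of the Airflow configuration overrides listed above
    gcloud composer environments update ENV_NAME \
        --location=us-central1 \
        --update-airflow-configs=core-dags_are_paused_at_creation=True,core-default_timezone=Asia/Kuala_Lumpur,core-parallelism=50

    # Install the PyPI package
    gcloud composer environments update ENV_NAME \
        --location=us-central1 \
        --update-pypi-package=google-cloud-storage==2.17.0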

Step 8: Deploy the Airflow DAGs

  • Navigate to the Cloud Storage bucket attached to your Composer environment.
  • Go to the dags folder and upload all your DAG .py files.
  • Done. (A gcloud sketch follows.)
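The same upload from the CLI (a sketch; the DAG path is a placeholder):

    gcloud composer environments storage dags import \
        --environment=ENV_NAME \
        --location=us-central1 \
        --source=dags/my_dag.py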

Conclusion

By following these steps, you have successfully deployed Apache Airflow in Google Cloud Composer under a VPC. This setup ensures enhanced security and network isolation for your workflows. Happy orchestrating!
