Project Overview
Google Cloud Composer is a managed Apache Airflow service for creating, scheduling, and monitoring workflows. Deploying Cloud Composer in a Virtual Private Cloud (VPC) provides enhanced network isolation and security. This article outlines the procedure for setting up Apache Airflow in Google Cloud Composer inside a VPC.
Technology Used
Python, Apache Airflow, GCP Cloud Storage, IAM, VPC network, Cloud NAT
Objectives
- Identify the benefits of deploying Apache Airflow in GCP Composer under a VPC:
- Understand the benefits of using VPC for network security and isolation.
- Recognize the significance of managed services such as Google Cloud Composer for workflow orchestration.
- Learn to Create and Configure a VPC Network in Google Cloud Platform:
- Understand how to set up a new VPC network.
- Configure subnets and firewall rules to secure your network.
- Set Up Private Service Connect for Enhanced Security:
- Create and configure Private Service Connect endpoints.
- Ensure secure communication between services within your VPC.
- Deploy a Google Kubernetes Engine (GKE) Cluster within the VPC:
- Learn the steps to create and configure a GKE cluster.
- Integrate the GKE cluster with your VPC network for secure containerized workloads.
- Create and Configure a Cloud SQL Instance for Airflow Metadata:
- Set up a Cloud SQL instance within your VPC.
- Configure network settings to allow secure access to the SQL instance from your Airflow environment.
- Deploy a Google Cloud Composer Environment in the Configured VPC:
- Step-by-step instructions to create a Cloud Composer environment.
- Understand how to link the Composer environment to your VPC network.
- Adapt to Evolving Data Requirements:
- As business needs and data sources change, retain the flexibility to adjust data quality expectations.
- Continuously update data validation procedures to keep pace with an evolving data landscape.
- Configure Environment Variables and Airflow Connections:
- Learn how to set up necessary environment variables for your Airflow environment.
- Configure connections to various Google Cloud services and external systems.
- Deploy and Manage DAGs in Your Airflow Environment:
- Understand best practices for deploying DAGs and securing your Airflow environment.
- Ensure compliance with organizational and regulatory requirements by leveraging VPC features.
- Utilize Google Cloud Documentation and Resources:
- Learn to navigate and use Google Cloud’s documentation for troubleshooting and advanced configurations.
- Access additional resources for ongoing learning and support.
Technology Stack
Apache Airflow
- Data Orchestrator
- Visual representation of data pipelines
Composer
- GCP-managed Airflow
Step 1: Create a VPC Network
1. Create a New VPC:
- Click on Create VPC network.
- Name your VPC (e.g., airflow-vpc).
- Choose automatic or custom subnet creation. For custom, define subnets in each region where your Composer environment will be deployed, and set the region on each subnet (VPC networks themselves are global).
- Subnet section:
- Name your subnet (e.g., airflow-vpc-subnet)
- IP stack type: IPv4 (single-stack)
- IP range: 10.0.0.0/24
- Private Google Access: On
- Flow logs: Off
- Hybrid subnets: Off
- Click Done
- Leave the remaining settings at their defaults
- Click Create
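If you prefer the gcloud CLI, the same network and subnet can be created as below. This is a minimal sketch; the us-central1 region is an assumption, and the names match the examples above.

# Create the VPC with custom subnet mode (assumed region: us-central1)
gcloud compute networks create airflow-vpc \
    --subnet-mode=custom

# Create the subnet with Private Google Access enabled
gcloud compute networks subnets create airflow-vpc-subnet \
    --network=airflow-vpc \
    --region=us-central1 \
    --range=10.0.0.0/24 \
    --enable-private-ip-google-access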
Step 2: Set Up Firewall Rules (Optional)
1. Navigate to Firewall Rules:
- In the VPC network menu, select Firewall rules.
2. Create Firewall Rules:
- Click on Create firewall rule.
- Name your firewall rule (e.g., allow-internal-traffic).
- Set the targets to All instances in the network.
- Source filters: Choose IP ranges and enter 10.128.0.0/9 (the default internal IP range), or your subnet's range (e.g., 10.0.0.0/24).
- Protocols and ports: Select Allow all or specify necessary protocols (e.g., TCP: 22, 80, 443).
- Create.
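The same rule can be created with gcloud; a sketch using the example name, range, and ports above:

# Allow internal TCP traffic on ports 22, 80, and 443 from the internal range
gcloud compute firewall-rules create allow-internal-traffic \
    --network=airflow-vpc \
    --source-ranges=10.128.0.0/9 \
    --allow=tcp:22,tcp:80,tcp:443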
Step 3: Create Cloud NAT
1. Create a NAT Gateway:
- Name your NAT gateway.
- NAT type: Public
- Select the region.
- Cloud Router: create a new router (or select an existing one from the dropdown).
- NAT IP addresses: set a custom IP if required.
- Network service tier: Standard
- Click Create.
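Cloud NAT attaches to a Cloud Router, so the CLI version takes two commands. A sketch assuming the airflow-vpc network and us-central1; the airflow-router and airflow-nat names are illustrative:

# Create the Cloud Router the NAT gateway will attach to
gcloud compute routers create airflow-router \
    --network=airflow-vpc \
    --region=us-central1

# Create the NAT gateway with automatically allocated external IPs
gcloud compute routers nats create airflow-nat \
    --router=airflow-router \
    --region=us-central1 \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges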
Step 4: IAM Configuration
1. Default Account Settings:
- Navigate to IAM in the Cloud console.
- Grant the following roles to the default Compute Engine service account (*-compute@developer.gserviceaccount.com):
- Editor
- Eventarc Event Receiver
- Cloud SQL Client
2. Create a Custom Service Account:
- Navigate to IAM → Service accounts and create a service account.
- Go back to IAM, select the created service account, and grant it these roles:
- Cloud Composer v2 API Service Agent Extension
- Cloud Composer API Service Agent
- Eventarc Event Receiver
- Service Account Token Creator
- Editor
3. Configure the Composer Service Account:
- In IAM, select service-*@cloudcomposer-accounts.iam.gserviceaccount.com.
- Grant it these roles:
- Cloud Composer v2 API Service Agent Extension
- Cloud Composer API Service Agent
- Service Account Admin
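Project-level roles can also be granted from the CLI. A sketch for a single role grant; my-project is a placeholder project ID, and the wildcard address stands in for your service account email (repeat per role in the lists above):

# Grant one role to the service account at the project level
gcloud projects add-iam-policy-binding my-project \
    --member=serviceAccount:*@*.iam.gserviceaccount.com \
    --role=roles/eventarc.eventReceiver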
Step 5: Bind the Service Accounts
- Bind your custom service account to the Composer service agent:
- *@*.iam.gserviceaccount.com: your custom service account
- service-*@cloudcomposer-accounts.iam.gserviceaccount.com: the default Composer service agent account
gcloud iam service-accounts add-iam-policy-binding \
*@*.iam.gserviceaccount.com \
--member serviceAccount:service-*@cloudcomposer-accounts.iam.gserviceaccount.com \
--role roles/composer.ServiceAgentV2Ext
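To verify the binding took effect, you can read back the service account's IAM policy (same placeholder address as above):

gcloud iam service-accounts get-iam-policy \
    *@*.iam.gserviceaccount.com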
Step 6: Create the Composer Environment
1. Basic
- Navigate to Composer in the Cloud console.
- Click Create environment.
- Configuration:
- Name: your Composer environment name
- Region: set the region name
- Image version: select the Airflow version
- Service account: your custom service account
- Labels: optional but good to have
- Resilience mode: Standard
2. Intermediate
- Environment resources: Small (we'll configure this later)
- Network configuration (you can't update this later):
- Select Networks in this project
- Select the network and subnetwork
- Secondary IP range for pods: Auto-created
- Secondary IP range for services: Auto-created
- Network type: Public
- Web server network access control: Allow access from all IP addresses
3. Final
- Advanced configuration:
- Environment variables: we'll configure these later
- Airflow configuration overrides: we'll configure these later
- Environment bucket: select a custom bucket
- Data encryption: Google-managed encryption key
- Maintenance window: select your timezone
- Dataplex data lineage integration: Disable integration with Dataplex data lineage
- Airflow database zone: Any zone
- Recovery configuration: set your snapshot schedule
4. Click Create, then wait approximately 25 minutes for the environment to provision.
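The console flow above can also be scripted. A hedged gcloud sketch, assuming Composer 2 in us-central1; the environment name and image version string are illustrative, so check the versions available to your project first:

# Create a Composer 2 environment inside the VPC (names/versions are examples)
gcloud composer environments create airflow-composer-env \
    --location=us-central1 \
    --image-version=composer-2.9.7-airflow-2.9.3 \
    --service-account=*@*.iam.gserviceaccount.com \
    --network=airflow-vpc \
    --subnetwork=airflow-vpc-subnet \
    --environment-size=small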
Step 7: Configure Composer
- Select the Composer environment.
- Navigate to Environment configuration.
- Go to the Resources section and update the configuration to match your pipeline workload.
- Don't update core infrastructure directly (recommended).
- Navigate to Airflow configuration overrides:
- Core
- dag_file_processor_timeout: 800
- dagbag_import_timeout: 600
- dags_are_paused_at_creation: True (important)
- default_timezone: Asia/Kuala_Lumpur
- parallelism: 50
- max_active_tasks_per_dag: 25
- max_active_runs_per_dag: 25
- killed_task_cleanup_time: 604800 (to prevent Airflow errors)
- Logging
- logging_level: INFO (for detailed Airflow logs)
- Scheduler
- dag_dir_list_interval: 180
- Secrets
- use_cache: True
- Webserver
- default_ui_timezone: Asia/Kuala_Lumpur (important)
- default_wrap: True (helpful for debugging logs)
- navbar_color: #40c706 (optional)
- Celery
- worker_autoscale: 5,1 (max, min)
- worker_concurrency: 8 (number of tasks executed in parallel per worker)
- Navigate to PyPI packages:
- Add pip packages if needed (e.g., Name: google-cloud-storage, Version: ==2.17.0)
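Overrides and packages can also be applied from the CLI. A sketch for a few of the settings above; note that gcloud accepts only one update type per invocation, and the environment name is the illustrative one from earlier:

# Apply selected Airflow configuration overrides (section-property=value)
gcloud composer environments update airflow-composer-env \
    --location=us-central1 \
    --update-airflow-configs=core-dags_are_paused_at_creation=True,core-parallelism=50,webserver-default_ui_timezone=Asia/Kuala_Lumpur

# Install an extra PyPI package in a separate update
gcloud composer environments update airflow-composer-env \
    --location=us-central1 \
    --update-pypi-package=google-cloud-storage==2.17.0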
Step 8: Deploy the Airflow DAGs
- Navigate to the Cloud Storage bucket attached to the Composer environment.
- Go to the dags folder and upload all of your .py DAG files.
- Done.
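The upload can be scripted as well; a sketch using the dags import command (the local dags/my_dag.py path is illustrative):

# Copy a local DAG file into the environment's dags/ folder
gcloud composer environments storage dags import \
    --environment=airflow-composer-env \
    --location=us-central1 \
    --source=dags/my_dag.py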
Conclusion
By following these steps, you have successfully deployed Apache Airflow in Google Cloud Composer under a VPC. This setup ensures enhanced security and network isolation for your workflows. Happy orchestrating!