Kickstart Your DataOps Journey with dbt

Last Update: December 3, 2024

Project Overview

In this project, we use the data build tool (dbt) to improve our DataOps practices. DataOps, an approach designed to increase the speed, quality, and reliability of data analytics, is essential for modern data-driven enterprises. By incorporating dbt into our data pipeline, we aim to achieve efficient data transformations, reliable testing, and smooth collaboration.

Technology Used

Python, dbt, SQL

Documentation

The dbt documentation feature lets you explore model dependencies and attach metadata to your datasets. For more details, visit the official dbt docs.
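In practice, these descriptions live in YAML files placed alongside your models, and dbt folds them into the generated docs site. Below is a minimal sketch, with illustrative model and column names:

        # models/schema.yml (illustrative names; descriptions appear in the generated docs site)
        version: 2

        models:
          - name: stg_customers
            description: "Staging model that cleans raw customer records."
            columns:
              - name: customer_id
                description: "Unique identifier for a customer."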

Objectives

  • Automate Data Transformations: Use dbt to automate and manage SQL-based data transformations (table, view, materialized view, incremental, full refresh/overwrite, etc.) so they stay correct and consistent (see the config sketch after this list).
  • Enhance Data Quality: Keep data quality high through thorough testing and validation.
  • Improve Collaboration: Version-controlled data models and thorough documentation help data engineers, analysts, and scientists work together more effectively.
  • Streamline Data Pipeline Management: Integrate dbt with orchestration tools such as Apache Airflow to automate and monitor the complete data pipeline.
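As a concrete example of the materializations mentioned in the first objective, a model can declare how it is built directly in its SQL file through a config() block. The sketch below is illustrative only: it assumes a source named raw.events has been declared in source.yml, and the column names are placeholders.

        -- models/example/incremental_events.sql (a minimal incremental-materialization sketch)
        {{ config(materialized='incremental', unique_key='event_id') }}

        SELECT
            event_id,
            event_type,
            created_at
        FROM {{ source('raw', 'events') }}

        {% if is_incremental() %}
          -- On incremental runs, only process rows newer than what is already in the target table
          WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
        {% endif %}

Switching materialized to 'table' or 'view' (or setting it once per folder in dbt_project.yml) changes how the same SELECT is persisted, without touching the business logic.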

Key Components

  • Data Sources: A range of unprocessed data sources, such as databases, CSV files, and APIs.
  • Staging Area: Initial loading of raw data into the data warehouse.
  • Transformations: Data cleaning, transformation, and aggregation using dbt models.
  • Testing and Documentation: Ensuring data quality and documenting every transformation clearly.
  • Orchestration and Scheduling: Automating and tracking the data process with Airflow.

Workflow

  • Data Ingestion: Raw data is ingested from various sources into the staging area of the data warehouse.

  • Data Modeling with dbt:
    • Staging Models: Initial cleaning and preparation of raw data.
    • Intermediate Models: Business logic and transformations applied to staged data.
    • Mart Models: Final aggregated, business-ready data.

  • Testing and Validation: dbt tests are executed to ensure data integrity and accuracy.

  • Documentation: dbt generates documentation for all models, providing a clear overview of the data pipeline.

  • Orchestration: Airflow schedules and manages dbt runs, ensuring timely updates and monitoring the pipeline status (a minimal DAG sketch follows this list).

  • Monitoring and Alerts: Custom scripts send notifications on pipeline success or failure, enabling proactive management.
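As referenced in the Orchestration step, a common pattern is an Airflow DAG that calls the dbt CLI through the BashOperator. The sketch below is just one way to wire this up; the DAG id, schedule, and project path are assumptions to adapt to your environment.

        # A minimal Airflow DAG sketch that runs dbt (paths and schedule are assumptions)
        from datetime import datetime

        from airflow import DAG
        from airflow.operators.bash import BashOperator

        DBT_PROJECT_DIR = "/opt/airflow/dbt_project"  # hypothetical location of the dbt project

        with DAG(
            dag_id="dbt_daily_run",
            start_date=datetime(2024, 1, 1),
            schedule_interval="@daily",
            catchup=False,
        ) as dag:
            # Build all models first...
            dbt_run = BashOperator(
                task_id="dbt_run",
                bash_command=f"cd {DBT_PROJECT_DIR} && dbt run",
            )
            # ...then validate them with dbt's tests
            dbt_test = BashOperator(
                task_id="dbt_test",
                bash_command=f"cd {DBT_PROJECT_DIR} && dbt test",
            )

            dbt_run >> dbt_test

Because dbt test runs right after dbt run, a failed test stops the pipeline before downstream consumers see bad data.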

Outcomes and Benefits

  • Automated and Reliable Data Pipelines: Automated testing and data transformations reduce manual involvement and raise reliability.
  • High Data Quality: Stringent testing and validation procedures guarantee data accuracy and integrity.
  • Enhanced Collaboration: Version-controlled models and thorough documentation improve teamwork.
  • Scalable and Efficient Data Operations: With dbt and Airflow, data workflows can be managed efficiently and data processing can scale.

Top-level Workflow

Fig. 1: dbt project structure

Project Setup

In this stage, we’ll walk through setting up your dbt project, including the initial environment configuration and connecting to data sources. This setup gives you a solid foundation on which to build and manage your data models efficiently.

Setting Up the Environment

a. Install dbt

  • Create a Python 3 virtual environment:

        python3 -m venv dbt_project

  • Activate the virtual environment:

        source dbt_project/bin/activate

  • Install dbt with the BigQuery adapter:

        pip install dbt-core dbt-bigquery

b. Initialize dbt project

				
        dbt init project_name

1. Select the database adapter from the terminal.

2. If using GCP, select the project name and dataset name. If a model doesn’t specify a dataset, BigQuery uses this default; if a model defines a custom schema, BigQuery creates a new dataset named like defaultDatasetName_otherName.

3. Configure the remaining prompts:
  • Threads: the number of models dbt can run concurrently, to save time
  • job_execution_timeout_seconds: defaults to 300 seconds (5 minutes)
  • Location: can be changed later in profiles.yml

4. Go to the home directory:

        cd ~

5. Navigate to the .dbt directory:

        cd .dbt

6. Open profiles.yml:

        nano profiles.yml

7. Update the GCP location and any other settings as needed; a sample profile is shown below.
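For reference, a BigQuery profile in ~/.dbt/profiles.yml typically looks like the sketch below. The project, dataset, and location values are placeholders, and the top-level key must match the profile name referenced in dbt_project.yml.

        dbt_project:
          target: dev
          outputs:
            dev:
              type: bigquery
              method: oauth                      # or service-account with a keyfile
              project: my-gcp-project            # placeholder GCP project id
              dataset: my_default_dataset        # default dataset when a model doesn't set one
              threads: 4
              job_execution_timeout_seconds: 300
              location: US                       # can be changed later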

c. Directory structure

				
        dbt_project
        ├── analyses: SQL saved for reporting purposes
        │   └── checks.yml
        ├── dbt_packages: all installed dbt packages (libraries)
        ├── logs
        │   └── dbt.log: logs written each time dbt runs models
        ├── macros
        │   └── date_operation.sql (example): functions reusable across the dbt project
        ├── models: core logic to build the data warehouse
        │   ├── source.yml: top-level configuration of the source databases to use
        │   └── schema.yml: top-level documentation
        ├── seeds: load flat files (e.g., CSV) into the warehouse as tables
        ├── snapshots: for SCD (slowly changing dimensions)
        ├── target: all compiled SQL files that dbt executes
        ├── tests: custom test SQL to ensure data quality
        └── dbt_project.yml: top-level configuration (materializations such as table/view/materialized view, global variables, path setup, schema definitions, and more)
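To give a feel for the top-level configuration that dbt_project.yml holds, here is a minimal sketch; the project name, folder layout, and variable are illustrative and must match your own project.

        name: dbt_project
        version: '1.0.0'
        profile: dbt_project            # must match a profile in ~/.dbt/profiles.yml

        model-paths: ["models"]
        seed-paths: ["seeds"]
        test-paths: ["tests"]
        macro-paths: ["macros"]

        vars:
          start_date: '2024-01-01'      # example global variable

        models:
          dbt_project:
            staging:
              +materialized: view       # staging models built as views
            marts:
              +materialized: table      # mart models built as tables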

				
			

d. Run dbt Project

  • Generic run:

        dbt run

  • Check connection & configuration:

        dbt debug

  • Run a specific model:

        dbt run --models modelName

  • Test the dbt project (see the schema test sketch after these commands):

        dbt test

  • Generate and serve the dbt docs visualization:

        dbt docs generate
        dbt docs serve --port 8001
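dbt test runs both custom SQL tests from the tests folder and the generic tests declared in YAML next to your models. A minimal sketch of the built-in generic tests (unique, not_null) attached to an illustrative model:

        # models/staging/schema.yml (illustrative model and column names)
        version: 2

        models:
          - name: stg_customers
            columns:
              - name: customer_id
                tests:
                  - unique
                  - not_null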

				
			

Create a Sample dbt Model

Data modeling in dbt means writing SQL scripts that specify how raw data should be processed and organized in your data warehouse. These transformations live in SQL files called dbt models. Within the data pipeline, models are arranged into layers, each with a distinct function.

Steps

  • Navigate to the models directory of your dbt project: dbt_project → models
  • Create a folder named src (for example)
  • Under the src folder, create a SQL file named src_dbt_test.sql and paste the following SQL:

        WITH cte AS (
            SELECT 1
        )
        SELECT *
        FROM cte

  • Switch to the terminal and execute the following command:

        dbt run

  • Go to your database and verify that the new table or view has been created
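To see how dbt chains models together, you can add a second model that selects from the first via ref(); dbt infers the build order from that reference. A minimal sketch, using a hypothetical file models/src/src_dbt_test_child.sql:

        -- Hypothetical downstream model; ref() makes dbt build src_dbt_test first
        SELECT *
        FROM {{ ref('src_dbt_test') }}

Running dbt run again builds src_dbt_test and then this child model, and the dependency shows up in the lineage graph generated by dbt docs.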

Conclusion

By leveraging dbt for data modeling, you can achieve:

  • Automated and Consistent Transformations: dbt enables you to automate data transformations and ensure they are applied consistently across your data pipeline.
  • Improved Data Quality: With built-in testing and validation, dbt helps maintain high data quality.
  • Clear Documentation: Comprehensive documentation generated by dbt makes it easier for team members to understand and collaborate on the data models.
  • Efficient Collaboration: Using version control (e.g., Git), dbt facilitates better collaboration among data engineers, analysts, and scientists.

By following these steps, you’ve built a robust, efficient, and scalable data pipeline using dbt, laying the foundation for a successful DataOps implementation.
