2024
This project transformed a monolithic Python workflow into a modular, reproducible data science pipeline. I redesigned the architecture into structured components for data preprocessing, model training, and model selection, with clear separation of responsibilities and improved scalability. I implemented data versioning with DVC to track datasets and artifacts across experiments, keeping workflows consistent and traceable. Using Dagger (Go), I orchestrated automated pipeline execution, managing each stage from preprocessing through model validation and artifact generation in a containerized environment. Integrating the pipeline with GitHub Actions for CI/CD made every run reproducible, documented, and validated by automated tests. The result was reliable experiment tracking, streamlined collaboration, and efficient deployment of production-ready models.
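The staged layout described above could be expressed as a DVC pipeline file. This is a minimal sketch, not the project's actual configuration: the script paths (`src/preprocess.py`, `src/train.py`, `src/select.py`), data directories, and stage names are all hypothetical placeholders.

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py data/raw data/processed
    deps:
      - src/preprocess.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python src/train.py data/processed models/model.pkl
    deps:
      - src/train.py
      - data/processed
    outs:
      - models/model.pkl
  select:
    cmd: python src/select.py models/model.pkl metrics/selection.json
    deps:
      - src/select.py
      - models/model.pkl
    metrics:
      - metrics/selection.json
```

With a file like this, `dvc repro` re-runs only the stages whose dependencies changed, and `dvc push` uploads the tracked datasets and model artifacts to remote storage, so each Git commit pins an exact, reproducible pipeline state.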