Large-Scale Data Analysis and Machine Learning Lifecycle

Large-Scale Data Analysis

Skills Gained

Scalable Data Processing (Apache Spark)
Machine Learning & Feature Engineering
Natural Language Processing
Model Evaluation & Experiment Tracking
Reproducible ML Workflows

2024
This project explored scalable data analysis and end-to-end machine learning workflows on large real-world datasets. I built distributed data processing pipelines with Apache Spark to analyze customer reviews, identify influencer patterns, and examine sentiment and language trends across regions and cuisines. Using feature engineering and natural language preprocessing, I trained machine learning models for star rating prediction and wind power forecasting. Models were evaluated with accuracy, F1 score, and regression metrics, covering both classification performance and predictive reliability. The project also covered the full ML lifecycle: handling missing values, resampling time-series data, applying polynomial feature transformations to capture non-linear relationships, tuning hyperparameters with runs tracked in MLflow, and ensuring reproducibility through version control and documented workflows.
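
The preprocessing steps named above (filling missing values, resampling a time series, and expanding a feature into polynomial terms) can be illustrated with a minimal standard-library sketch. This is not the project's actual code: the readings, the helper names `forward_fill`, `resample_hourly_mean`, and `polynomial_features`, and the choice of forward-fill and degree 3 are all assumptions made for illustration. In practice a library such as pandas or scikit-learn would typically handle these steps.

```python
from collections import defaultdict
from datetime import datetime


def forward_fill(values):
    """Replace None with the most recent non-missing value (leading gaps stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def resample_hourly_mean(timestamps, values):
    """Group readings by the hour they fall in and average each group."""
    buckets = defaultdict(list)
    for ts, v in zip(timestamps, values):
        if v is not None:
            buckets[ts.replace(minute=0, second=0, microsecond=0)].append(v)
    return {hour: sum(vs) / len(vs) for hour, vs in sorted(buckets.items())}


def polynomial_features(x, degree=2):
    """Expand a scalar x into [x, x**2, ..., x**degree]."""
    return [x ** d for d in range(1, degree + 1)]


# Hypothetical 30-minute wind-speed readings (m/s) with one missing value.
timestamps = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 0, 30),
    datetime(2024, 1, 1, 1, 0),
    datetime(2024, 1, 1, 1, 30),
]
wind_speed = [4.0, None, 6.0, 8.0]

filled = forward_fill(wind_speed)
hourly = resample_hourly_mean(timestamps, filled)
features = {hour: polynomial_features(v, degree=3) for hour, v in hourly.items()}

for hour, feats in features.items():
    print(hour.isoformat(), feats)
```

A cubic term is a natural candidate for wind data because the power available in wind scales with the cube of wind speed, though the degree actually used in the project is not stated here.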
