Large-Scale Data Analysis
Skills Gained
Large-Scale Data Analysis and Machine Learning Lifecycle
Scalable Data Processing (Apache Spark)
Machine Learning & Feature Engineering
Natural Language Processing
Model Evaluation & Experiment Tracking
Reproducible ML Workflows
2024
This project explored scalable data analysis and end-to-end machine learning workflows on large real-world datasets. I built distributed data processing pipelines with Apache Spark to analyze customer reviews, identify influencer patterns, and examine sentiment and language trends across regions and cuisines. Using feature engineering and natural language preprocessing, I trained machine learning models for two tasks: star rating prediction and wind power forecasting. Models were evaluated with accuracy, F1 score, and regression metrics, covering both classification performance and predictive reliability. The project also covered the full ML lifecycle: handling missing values, resampling time-series data, applying polynomial feature transformations to capture non-linear relationships, tuning hyperparameters with runs tracked in MLflow, and ensuring reproducibility through version control and documented workflows.
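To make the evaluation step concrete, here is a minimal sketch of the three kinds of metrics mentioned above, written in plain Python rather than with the Spark or MLflow tooling used in the project. The function names (`accuracy`, `macro_f1`, `rmse`) are hypothetical, and the choice of macro-averaged F1 and RMSE is an assumption, since the project summary does not state which F1 averaging or which regression metrics were used.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def rmse(y_true, y_pred):
    """Root mean squared error for a regression target such as power output."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Illustrative values only, not results from the project.
    stars_true = [5, 4, 5, 1]
    stars_pred = [5, 5, 5, 1]
    print(accuracy(stars_true, stars_pred))
    print(macro_f1(stars_true, stars_pred))
    print(rmse([1.0, 2.0], [1.0, 4.0]))
```

Reporting accuracy alongside F1 matters when star ratings are imbalanced: a model that rarely predicts an uncommon rating can still score well on accuracy, while macro F1 penalizes the missed class.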