Explore how machine learning is operationalized in big data workflows. Learn how to build end-to-end ML pipelines, from data ingestion to model deployment, using distributed systems and scalable frameworks. This 3-hour session, Machine Learning Pipelines in Big Data Environments, provides a hands-on introduction to building, managing, and deploying scalable machine learning workflows with Apache Spark and its MLlib library. As organizations work with increasingly large datasets, traditional single-machine approaches to machine learning no longer scale. This course addresses that gap by teaching participants how to design modular, reusable, and efficient ML pipelines tailored for distributed environments.
Participants will explore the core components of Spark MLlib pipelines, including Transformers, Estimators, and the Pipeline abstraction that chains them, and will learn how to compose these components to automate and streamline tasks such as feature engineering, model training, and evaluation.
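As a rough preview of that idea, the sketch below chains feature-engineering stages and a classifier into a single PySpark pipeline. It is only a minimal illustration under assumed inputs: the Parquet path and the column names ("category", "amount", "label") are hypothetical placeholders, not part of the course material.

```python
# Minimal sketch of chaining MLlib pipeline stages (hypothetical data and columns).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()
df = spark.read.parquet("events.parquet")          # hypothetical input path
train, test = df.randomSplit([0.8, 0.2], seed=42)

indexer = StringIndexer(inputCol="category", outputCol="category_idx")   # Estimator: fit() learns the category-to-index mapping
assembler = VectorAssembler(inputCols=["category_idx", "amount"],
                            outputCol="features")                        # pure Transformer
lr = LogisticRegression(featuresCol="features", labelCol="label")        # Estimator

# A Pipeline chains the stages; fit() runs them in order and returns a PipelineModel.
pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train)

# The fitted PipelineModel is itself a Transformer: transform() applies every stage.
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")
```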
The session also delves into applying regression and classification algorithms to large datasets, implementing hyperparameter tuning through cross-validation, and managing the end-to-end lifecycle of ML models. With a focus on real-world scalability, learners will be introduced to MLflow for model tracking and versioning, and will explore how to deploy models in both batch and streaming big data environments.
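To give a flavor of cross-validated hyperparameter tuning, the sketch below wraps the pipeline from the previous example in a CrossValidator. The grid values and fold count are arbitrary assumptions chosen for illustration.

```python
# Sketch of hyperparameter tuning with CrossValidator, reusing the pipeline,
# train/test split, and evaluator defined in the previous sketch.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1, 1.0])       # assumed grid values
              .addGrid(lr.elasticNetParam, [0.0, 0.5])
              .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3,
                    parallelism=4)   # evaluate candidate models in parallel

cv_model = cv.fit(train)             # selects the best PipelineModel by cross-validated AUC
best_model = cv_model.bestModel
print(cv_model.avgMetrics)           # mean metric for each parameter combination
```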
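The next sketch hints at how such a tuned model might be tracked with MLflow and then reused for batch and streaming scoring. It assumes an MLflow tracking setup is already configured; the run name, model paths, stream source, and checkpoint location are all hypothetical.

```python
# Sketch of MLflow tracking plus batch and streaming scoring (hypothetical paths).
import mlflow
import mlflow.spark
from pyspark.ml import PipelineModel

with mlflow.start_run(run_name="spark-pipeline"):
    mlflow.log_metric("best_cv_auc", max(cv_model.avgMetrics))
    mlflow.spark.log_model(best_model, artifact_path="model")   # version the fitted PipelineModel

# Batch scoring: persist the model, reload it, and transform a static DataFrame.
best_model.write().overwrite().save("/models/pipeline_model")   # hypothetical path
batch_model = PipelineModel.load("/models/pipeline_model")
batch_model.transform(spark.read.parquet("new_events.parquet")) \
    .write.mode("overwrite").parquet("scores.parquet")

# Streaming scoring: the same PipelineModel can transform a streaming DataFrame.
stream = spark.readStream.schema(df.schema).parquet("/data/incoming")   # hypothetical stream source
query = (batch_model.transform(stream)
         .writeStream.format("parquet")
         .option("path", "/data/scored")
         .option("checkpointLocation", "/chk/scoring")
         .start())
```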
By the end of the course, participants will be equipped with the practical skills to build and operationalize machine learning pipelines in production-scale systems.