Applied Machine Learning
“All models are wrong, but some are useful.” ― George Box
Every time we interact with an e-commerce site and see a product recommendation, or use a messenger app and talk to a chat bot, we are seeing machine learning in action. Strong mathematical theories underpin these machine learning applications, and the machine learning library ecosystem has matured to the point where it is straightforward to write a few lines of code and have the ML back-end ready for one’s application.
However, the challenge for many beginners is how to structure a business problem as an ML problem, and then go on to build, select and evaluate the right model. This workshop is designed to help participants learn how to apply machine learning to business problems. Real-life case studies are used to teach the various algorithms and techniques. The focus will be on applications, rather than on exposition of the various algorithms.
Key Concepts
- Theory: ML Formulation, Generalisation, Bias-Variance, Overfitting, Interpretation
- Paradigms: Supervised (Regression & Classification), Unsupervised (Dimensionality Reduction & Clustering)
- Data: Tabular & Relational, Time-Series
- Models: Linear, Tree-based, Matrix Factorisation, Distance-based
- Methods: Feature Engineering, Hyper-Parameter Tuning, Regularisation, Validation, Aggregation, Boosting
- Process: Feature Creation & Selection, Model Building, Validation & Selection, Model Deployment & Monitoring
- Tools: python-data-stack, scikit-learn
Approach
This is a three-day, instructor-led, hands-on workshop on building and implementing end-to-end machine learning models. The course is predominantly hands-on: about 70% programming and 30% theory. There will be twelve sessions of two hours each over the three days.
Session 1: Introduction & Concepts
- What is Machine Learning: Learning from Data
- ML Paradigms: Supervised, Unsupervised, Reinforcement
- Approach for building ML products: the process
- Intuition for Classification (Paper & Pen)
Session 2: Model Building: Tree-based
- Building a Decision Tree
- Encoding data, Training & Test Split
- Choosing Error Metrics: Standard vs. Custom
- Model Evaluation Approach
Session 3: Model Validation & Selection
- Overfitting & Underfitting, Generalisation, Learning Curves
- Regularisation in Trees
- Cross-Validation: Hold-out, K-fold
- Hyper-Parameter Tuning: Grid Search
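As an illustration of the K-fold idea listed above, here is a minimal sketch using only the Python standard library. The function name `k_fold_indices` and the toy sizes are assumptions made for this example, not part of the workshop material; in practice scikit-learn's `KFold` from `sklearn.model_selection` covers the same ground. The sketch splits row indices into contiguous folds without shuffling, so each row is used for validation exactly once.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Hypothetical toy setup: 10 rows, 3 folds.
folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

A model would be fitted on each `train_idx` and scored on the matching `val_idx`, and the scores averaged. Real data would usually be shuffled first unless it is time-ordered.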
Session 4: Ensemble Models
- Ensemble Models for Generalisation
- Resampling & Bootstrap Data
- Bagging Approach: Random Forest
- Boosting Approach: Gradient Boosting
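The resampling step behind bagging can be sketched in a few lines. This is a hedged, standard-library-only illustration: the helper `bootstrap_sample`, the seed and the toy dataset are all assumptions made for the example. It draws as many rows as the dataset has, with replacement, and reports the rows that were never drawn (the "out-of-bag" rows).

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement; also return the out-of-bag items."""
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]
    sample = [data[i] for i in picked]
    chosen = set(picked)
    out_of_bag = [data[i] for i in range(n) if i not in chosen]
    return sample, out_of_bag


rng = random.Random(42)  # fixed seed so the run is repeatable
data = list(range(100, 110))  # hypothetical toy dataset of 10 rows
sample, out_of_bag = bootstrap_sample(data, rng)
print("bootstrap sample:", sample)
print("out-of-bag rows: ", out_of_bag)
```

A bagging ensemble such as a random forest trains each base model on a different bootstrap sample and aggregates their predictions. The out-of-bag rows can serve as a built-in validation set for that base model.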
Session 5: Model Building: Linear
- Linear Regression
- Normalisation & Standardisation
- L1 / L2 Regularisation in Linear Models
- Logistic Regression (for Classification)
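To make the difference between standardisation and normalisation concrete, here is a minimal sketch using only the standard library. The function names and the `ages` column are assumptions made for this illustration; scikit-learn provides `StandardScaler` and `MinMaxScaler` in `sklearn.preprocessing` for the same purpose.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardise a constant feature")
    return [(v - mu) / sigma for v in values]


def min_max_normalise(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


ages = [20.0, 30.0, 40.0, 50.0, 60.0]  # hypothetical feature column
z_scores = standardise(ages)
scaled = min_max_normalise(ages)
print(z_scores)
print(scaled)
```

Putting features on a comparable scale matters for regularised linear models, because L1 and L2 penalties act on coefficient sizes and those depend on the units of each feature.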
Session 6: Feature Engineering
- Feature Creation Approaches
- Scale Transformations
- Feature Importance & Selection
- Domain Knowledge & Art of Feature Engineering
Session 7: Build & Deploy ML Service
- Concept of ML Service for Prediction
- Pipelines and Model Serialisation
- REST API and design
- Deploy your ML Service - localhost API
Session 8: Interpret ML Models
- Concept & Why ML Interpretation
- Types: Feature, Instance & Model level
- Local Surrogate Models (LIME), Shapley Values
- Model Visualisation
Session 9: Dimensionality Reduction
- The curse of dimensionality
- Matrix Factorisation approaches
- Usage for Unsupervised Learning: Similarity
- Feature creation for Supervised Learning
Session 10: Clustering
- Concept & Challenges of Clustering
- Distance-based approaches e.g. K-Means
- Measuring clustering performance
- Alternative approaches: Neighbour, Manifold (theory)
Session 11: ML Challenges & Automation
- Handling Time-dependent Data
- Unbalanced Class, Anomaly Detection
- Hyper-Parameter Tuning Approaches
- Automated Feature Engineering & Model Selection: AutoML
Session 12: Practice Session & Wrap-up
- Best practices in building ML service
- Monitoring Model drift & Tracking Performance
- Challenges in managing ML in production
- Where to go from here: Learning Path
Target Audience
- Anyone familiar with data analysis (using a scripting language like Python, R or SAS, or a programming language like Java, Scala or C++) who wants to pick up machine learning skills.
- A programmer looking to transition into building data-driven products or into a data scientist role.
- A beginner in data science with some machine learning experience who wants a deeper and more applied perspective on using machine learning.
Pre-requisites
- Programming knowledge is mandatory. Attendees should be able to write conditional statements, use loops, write functions comfortably, understand code snippets and come up with programming logic.
- Participants should have a basic familiarity with Python. Specifically, we expect participants to know the first three sections of the Python Practice Book: http://anandology.com/python-practice-book/
- Participants should have experience with using Pandas and Jupyter Notebook. At the bare minimum, you should be able to understand and run the code in The Art of Data Science repo. Refer to the Onion notebooks, especially the Acquire, Refine, Transform and Explore sections.
Software Requirements
We will be using the Python data stack for the workshop. Please install Anaconda for Python 3.5, which has everything we need. For the curious: we will be using Jupyter Notebook as our IDE, and primarily the scikit-learn library for most of the machine learning algorithms.
Facilitators’ Profile
Amit Kapoor teaches the craft of telling visual stories with data. He conducts workshops and trainings on Data Science in Python and R, as well as on Data Visualisation topics. His background is in strategy consulting, having worked with AT Kearney in India, then with Booz & Company in Europe and more recently for startups in Bangalore. He did his B.Tech in Mechanical Engineering at IIT, Delhi and his PGDM (MBA) at IIM, Ahmedabad. You can find out more about him at http://amitkaps.com/ and tweet him at @amitkaps.
Bargava Subramanian is a practicing Data Scientist. He has 14 years of experience delivering business analytics solutions to Investment Banks, Entertainment Studios and High-Tech companies. He has given talks and conducted workshops on Data Science, Machine Learning, Deep Learning and Optimization in Python and R. He has a Master's in Statistics from the University of Maryland, College Park, USA. He is an ardent NBA fan. You can tweet him at @bargava.