Applied Machine Learning

“All models are wrong, but some are useful.” ― George Box

Every time we interact with an e-commerce site and see a product recommendation, or use a messenger app and see a chatbot respond, we are seeing machine learning in action. Strong mathematical theories underpin these machine learning applications, and the machine learning library ecosystem has matured to the point where it is straightforward to write a few lines of code and have the ML back-end ready for one’s application.

However, the challenge for many beginners is how to structure a business problem as an ML problem, and then go on to build, select and evaluate the right model. This workshop is designed to help participants learn how to apply machine learning to business problems. Real-life case studies are used to teach the various algorithms and techniques. The focus will be on applications rather than on exposition of the algorithms themselves.

Key Concepts

Theory: ML Formulation, Generalisation, Bias-Variance, Overfitting, Interpretation
Paradigms: Supervised (Regression & Classification), Unsupervised (Dimensionality Reduction & Clustering)
Data: Tabular & Relational, Time-Series
Models: Linear, Tree-based, Matrix Factorisation, Distance-based
Methods: Feature Engineering, Hyper-Parameter Tuning, Regularisation, Validation, Aggregation, Boosting
Process: Feature Creation & Selection, Model Building, Validation & Selection, Model Deployment & Monitoring
Tools: python-data-stack, scikit-learn

Approach

This is a three-day, instructor-led, hands-on workshop on building and implementing end-to-end machine learning models. The course is predominantly practical: about 70% programming and 30% theory. There will be twelve sessions of two hours each over the three days.

Session 1: Introduction & Concepts

What is Machine Learning: Learning from Data
ML Paradigms: Supervised, Unsupervised, Reinforcement
Approach for building ML products: the process
Intuition for Classification (Paper & Pen)

Session 2: Model Building: Tree-based

Building a Decision Tree
Encoding data, Training & Test Split
Choosing Error Metrics: Standard vs. Custom
Model Evaluation Approach

Session 3: Model Validation & Selection

Overfitting & Underfitting, Generalisation, Learning Curves
Regularisation in Trees
Cross-Validation: Hold-out, K-fold
Hyper-Parameter Tuning: Grid Search
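
As a flavour of the K-fold idea listed above, here is a minimal sketch in plain Python, not the workshop's own material. The function name `k_fold_indices` is hypothetical; in the workshop itself, scikit-learn utilities such as `KFold` and `GridSearchCV` would normally be used for this.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    base_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base_size + (1 if fold < remainder else 0)
        test_indices = list(range(start, stop))
        train_indices = list(range(start)) + list(range(stop, n_samples))
        yield train_indices, test_indices
        start = stop


# Every row is used for testing exactly once and for training k - 1 times.
folds = list(k_fold_indices(n_samples=10, k=3))
for train_indices, test_indices in folds:
    print(len(train_indices), test_indices)
```

A model would be fitted on each training part and scored on the matching test part, and the k scores averaged. Real data is usually shuffled before splitting, which this sketch leaves out.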

Session 4: Ensemble Models

Ensemble Models for Generalisation
Resampling & Bootstrap Data
Bagging Approach: Random Forest
Boosting Approach: Gradient Boosting
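
The resampling step behind bagging can be sketched in a few lines of standard-library Python. This is an illustration under assumptions: the helper names `bootstrap_sample` and `majority_vote` are hypothetical, and the labels in the usage lines are made up.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement and report the out-of-bag rows."""
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(drawn))
    return drawn, out_of_bag


def majority_vote(predictions):
    """Combine class predictions from several models by picking the most common one."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
drawn, out_of_bag = bootstrap_sample(n_samples=8, rng=rng)
print("bootstrap rows:", drawn)
print("out-of-bag rows:", out_of_bag)
print("ensemble vote:", majority_vote(["spam", "ham", "spam"]))
```

In bagging, each model in the ensemble is trained on its own bootstrap sample, and the rows it never saw (the out-of-bag rows) can serve as a built-in validation set.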

Session 5: Model Building: Linear

Linear Regression
Normalisation & Standardisation
L1 / L2 Regularisation in Linear Models
Logistic Regression (for Classification)
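
To make the difference between normalisation and standardisation concrete, here is a small sketch in plain Python. It assumes the common conventions of min-max scaling to [0, 1] for normalisation and z-scores with the population standard deviation for standardisation; the function names and the `incomes` values are made up for illustration.

```python
from statistics import mean, pstdev


def min_max_normalise(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:  # a constant feature carries no scale information
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def standardise(values):
    """Centre values on zero and scale them to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [20.0, 40.0, 60.0]  # made-up feature values
normalised = min_max_normalise(incomes)
standardised = standardise(incomes)
print(normalised)
print(standardised)
```

Scaling matters for regularised linear models because the L1 and L2 penalties act on coefficient size, which depends on the units of each feature.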

Session 6: Feature Engineering

Feature Creation Approaches
Scale Transformations
Feature Importance & Selection
Domain Knowledge & Art of Feature Engineering

Session 7: Build & Deploy ML Service

Concept of ML Service for Prediction
Pipelines and Model Serialisation
REST API and design
Deploy your ML Service - localhost API

Session 8: Interpret ML Models

Concept & Why ML Interpretation
Types: Feature, Instance & Model level
Local Surrogate Models (LIME), Shapley Values
Model Visualisation

Session 9: Dimensionality Reduction

The curse of dimensionality
Matrix Factorisation approaches
Usage for Unsupervised Learning: Similarity
Feature creation for Supervised Learning

Session 10: Clustering

Concept & Challenges of Clustering
Distance-based approaches e.g. K-Means
Measuring clustering performance
Alternate approaches - Neighbour, Manifold (theory)
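
For readers who want a preview of the distance-based approach, the following is a minimal sketch of the standard K-Means loop (Lloyd's algorithm) in plain Python. It assumes the starting centroids are supplied by the caller; the function names and the toy points are hypothetical, and the workshop itself would more likely use scikit-learn's `KMeans`.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_labels(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def k_means(points, initial_centroids, n_iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(initial_centroids)
    for _ in range(n_iterations):
        labels = assign_labels(points, centroids)
        for index in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == index]
            if members:  # an empty cluster keeps its previous centroid
                centroids[index] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    # Final assignment against the updated centroids.
    return centroids, assign_labels(points, centroids)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]  # two obvious toy groups
centroids, labels = k_means(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids, labels)
```

The result depends on the starting centroids, which is one of the clustering challenges this session refers to.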

Session 11: ML Challenges & Automation

Handling Time-dependent Data
Imbalanced Classes, Anomaly Detection
Hyper-Parameter Tuning Approaches
Automated Feature Engineering & Model Selection: AutoML

Session 12: Practice Session & Wrap-up

Best practices in building ML service
Monitoring Model drift & Tracking Performance
Challenges in managing ML in production
Where to go from here: Learning Path

Target Audience

Anyone familiar with data analysis (using a scripting language like Python, R or SAS, or a programming language like Java, Scala or C++) who wants to pick up machine learning skills.
A programmer looking to transition into building data-driven products or into a data scientist role.
A beginner in data science with some machine learning experience who wants a deeper and more applied perspective on using machine learning.

Pre-requisites

Programming knowledge is mandatory. Attendees should be able to write conditional statements, use loops, write functions comfortably, understand code snippets and come up with programming logic.
Participants should have a basic familiarity with Python. Specifically, we expect participants to know the first three sections of the Python Practice Book: http://anandology.com/python-practice-book/
Participants should have experience with Pandas and Jupyter Notebook. At the bare minimum, you should be able to understand and run the code in the Art of Data Science repo. Refer to the Onion notebooks, especially the Acquire, Refine, Transform and Explore sections.

Software Requirements

We will be using the Python data stack for the workshop. Please install Anaconda for Python 3.5, which has everything we need. For the more curious attendees: we will be using Jupyter Notebook as our IDE, and primarily the scikit-learn library for most of the machine learning algorithms.

Facilitators’ Profile

Amit Kapoor teaches the craft of telling visual stories with data. He conducts workshops and trainings on Data Science in Python and R, as well as on Data Visualisation topics. His background is in strategy consulting, having worked with AT Kearney in India, then with Booz & Company in Europe and more recently for startups in Bangalore. He holds a B.Tech in Mechanical Engineering from IIT Delhi and a PGDM (MBA) from IIM Ahmedabad. You can find out more about him at http://amitkaps.com/ and tweet him at @amitkaps.

Bargava Subramanian is a practicing Data Scientist. He has 14 years of experience delivering business analytics solutions to Investment Banks, Entertainment Studios and High-Tech companies. He has given talks and conducted workshops on Data Science, Machine Learning, Deep Learning and Optimization in Python and R. He has a Master's in Statistics from the University of Maryland, College Park, USA. He is an ardent NBA fan. You can tweet him at @bargava.