- Imbibe the underlying principles of data analytics and learn how to use the data science pipeline
- Develop proficiency in the R or Python data stack and libraries such as ggplot2, dplyr and stats (for R) and pandas and scikit-learn (for Python)
- Learn how to employ statistical and machine learning algorithms to solve real-life problems

- Professionals interested in learning data science
- Programmers interested in building data driven products
- Journalists, scientists and researchers interested in telling data stories
- Business Intelligence analysts and consultants

- Taught by real-life practitioners
- Tested and practical curriculum with real data sets
- Interactive and live coding sessions

- What is data science?
- What type of questions can be answered?
- Frame/Acquire/Refine/Explore/Model/Insight framework

- How to frame a data science problem?
- Learn the hypothesis-driven approach
- How do you start: question-driven, dataset-driven or both?

- Sources of Data
- Download from an internal system
- Obtained from a client or other third party
- Extracted from a web-based API
- Scraped from a website or PDFs
- Gathered manually and recorded

- Acquire data from a CSV file or a database
- Acquire data from a third-party client (e.g. Twitter)
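
As a rough illustration of the first item above, the sketch below reads a small CSV and then queries the same data from a database. It is only a minimal example using Python's standard library; the `sales` table, its columns and the sample values are invented for illustration, and an in-memory SQLite database stands in for whatever system the data would really come from.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be read from a file on disk.
CSV_TEXT = """city,year,sales
Bangalore,2014,120
Delhi,2014,95
Bangalore,2015,140
"""

def read_csv_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "city": row["city"],
            "year": int(row["year"]),
            "sales": int(row["sales"]),
        })
    return rows

def load_into_db(rows):
    """Create an in-memory SQLite database and insert the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (city TEXT, year INTEGER, sales INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (:city, :year, :sales)", rows)
    conn.commit()
    return conn

def total_sales_by_city(conn):
    """Acquire aggregated data back from the database with a SQL query."""
    cur = conn.execute(
        "SELECT city, SUM(sales) FROM sales GROUP BY city ORDER BY city"
    )
    return cur.fetchall()

rows = read_csv_rows(CSV_TEXT)
conn = load_into_db(rows)
totals = total_sales_by_city(conn)
print(totals)
```

With a real file, `io.StringIO(text)` would be replaced by an open file handle, and `":memory:"` by the path or connection details of the actual database.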

- Why do visual exploration?
- Understand Data Structure & Types
- Grammar of Graphics and Basics of visualisation
- Explore single variable graphs - (Quantitative, Categorical)
- Explore dual variable graphs - (Q & Q, Q & C, C & C)
- Explore multi-dimensional variable graphs

- Concept of Tidy Data - Why is it important?
- Missing e.g. check for missing or incomplete data
- Quality e.g. check for duplicates, accuracy, unusual data
- Parse e.g. extract year from date
- Merge e.g. first name and surname into full name
- Convert e.g. free text to coded value
- Derive e.g. gender from title
- Calculate e.g. percentages, proportion
- Remove e.g. remove redundant data
- Aggregate e.g. rollup by year, cluster by area
- Filter e.g. exclude based on location
- Sample e.g. extract a representative sample
- Summary e.g. show summary stats like mean
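
A few of the refine steps listed above could look like this in pandas. This is a sketch under assumptions, not the workshop's actual exercise: the DataFrame, its column names and its values are made up for illustration, and dropping the incomplete row is only one of several ways to handle missing data.

```python
import pandas as pd

# Hypothetical raw records, invented for illustration only.
raw = pd.DataFrame({
    "first": ["Asha", "Ravi", "Ravi", "Meera"],
    "surname": ["Rao", "Kumar", "Kumar", "Shah"],
    "joined": ["2014-03-01", "2015-07-15", "2015-07-15", "2015-11-30"],
    "amount": [100.0, 250.0, 250.0, None],
})

# Missing: count missing values per column.
missing_counts = raw.isna().sum()

# Quality: drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Parse: extract the year from the date string.
clean["year"] = pd.to_datetime(clean["joined"]).dt.year

# Merge: combine first name and surname into a full name.
clean["full_name"] = clean["first"] + " " + clean["surname"]

# Filter: keep only rows that have an amount (imputing is another option).
clean = clean.dropna(subset=["amount"])

# Aggregate: roll up the amount by year.
by_year = clean.groupby("year")["amount"].sum()

# Summary: show summary statistics such as the mean.
print(clean["amount"].describe())
```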

- Basic statistics: variance, standard deviation, covariance, correlation
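
The four statistics named above can be computed in a few lines. The sketch below uses Python's standard `statistics` module for variance and standard deviation and spells out covariance and Pearson correlation by hand so the definitions are visible. The `hours` and `score` lists are invented sample data, and all quantities are the sample versions (dividing by n - 1).

```python
import statistics

def covariance(xs, ys):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical paired observations, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

var_hours = statistics.variance(hours)  # sample variance
std_hours = statistics.stdev(hours)     # square root of the sample variance
cov = covariance(hours, score)
corr = correlation(hours, score)

print(var_hours, std_hours, cov, corr)
```

Correlation is simply covariance rescaled to lie between -1 and 1, which makes it comparable across variables measured in different units.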

- Introduction to Machine Learning
- The power and limits of models
- Tradeoff between Prediction Accuracy and Model Interpretability
- Assessing Model Accuracy
- For regression problems: RMSE
- For classification problems: Precision, Recall, AUC/ROC, F-Score, Misclassification rate
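
To make the accuracy measures above concrete, here is a small from-scratch sketch of RMSE and of precision, recall, F-score (F1) and misclassification rate for a binary classifier. The labels and predictions are invented for illustration, and AUC/ROC is left out because it needs predicted scores rather than hard labels. In practice a library such as scikit-learn would normally be used instead of hand-written functions.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error for a regression problem."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def classification_metrics(actual, predicted):
    """Precision, recall, F1 and misclassification rate for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "misclassification_rate": (fp + fn) / len(pairs),
    }

# Hypothetical regression targets and predictions.
reg_error = rmse([3, 5, 7, 9], [1, 7, 5, 11])

# Hypothetical binary labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)

print(reg_error, metrics)
```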

- Bias-Variance tradeoff & Overfitting
- Linear Regression
- Logistic Regression
- L1- and L2-regularized Linear & Logistic Regression
- Regularization
- Classification model
- Decision Trees
- Visualizing decision trees
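
The effect of L1 and L2 regularization mentioned above is easiest to see in the simplest possible case: a linear model with one feature and no intercept, where both penalised least-squares problems have a closed-form solution. The sketch below assumes the objectives sum((y - w*x)^2) + lam * w^2 for L2 and sum((y - w*x)^2) + lam * |w| for L1. The function names and the data are hypothetical, and real models with many features would be fitted with a library rather than these formulas.

```python
import math

def ridge_weight(xs, ys, lam):
    """L2-penalised slope: minimises sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_weight(xs, ys, lam):
    """L1-penalised slope: minimises sum((y - w*x)^2) + lam * |w|.

    The solution soft-thresholds sum(x*y) by lam / 2 before dividing.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Hypothetical data lying exactly on the line y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]

for lam in (0, 14, 28, 56):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

With no penalty both functions recover the slope 2. As the penalty grows, the L2 weight shrinks towards zero without reaching it, while the L1 weight becomes exactly zero once the penalty is large enough, which is why L1 is often associated with feature selection.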

- Why do we need to communicate insight?
- Types of communication - Exploration vs. Explanation
- Explanation: Telling a story with data
- Exploration: Building an interface for people to find stories

**Participant Profile** - The workshop is ideal for anyone who uses the open source **R** or **Python** stack for statistical analysis and visualization. If you do not use R or Python for statistical analysis, familiarity with another statistical programming tool such as SPSS, SAS or MATLAB is needed. Prior familiarity with the R or Python libraries mentioned above is not required.

**Tools Used** - For the exercises during the workshop, we will use R with the RStudio IDE, or the Anaconda distribution for Python. Please install these on your machine before the workshop session. A detailed list of R or Python libraries to install will be shared ahead of the workshop session.

**Number of Participants** — The maximum number of participants for the workshop would be capped at 30. The small class size would enable a more participative environment with group interaction and presentations possible as well as opportunities to have one-to-one learning interactions.

**Duration** — The workshop would be conducted over 2 days from 0900 to 1700. There will be short breaks during the morning and afternoon session and a longer lunch break of around 45 minutes in the middle.

**Venue Logistics** - A training venue with a projector, sound system and whiteboard is needed for conducting the session.

The workshop is charged at Rs. 150,000 per day (for Indian locations) or USD 5,000 per day (for international locations). Service tax and other government charges, as applicable, are additional. For sessions conducted outside Bangalore, the facilitator's travel and accommodation costs are charged at actuals.

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. He is the founding partner at narrativeVIZ Consulting, where he teaches data science, data visualisation and data stories as tools for improving communication, persuasion and leadership, and conducts workshops on these topics for businesses, nonprofits and academic institutes. He also teaches visualisation as guest faculty in a design context at NID, Bangalore and in a management context at IIM Bangalore and IIM Ahmedabad.

His background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 15 years of management consulting experience, first with AT Kearney in India, then with Booz & Company in Europe and more recently with startups in Bangalore. He holds a B.Tech in Mechanical Engineering from IIT, Delhi and a PGDM (MBA) from IIM, Ahmedabad. You can find out more about him at amitkaps.com and tweet him at @amitkaps.