SQL Server Data Science Training

IEPDS: Immersion Event on Practical Data Science with Cortana Analytics: Azure Machine Learning, SQL Data Mining and R

(Retired – no longer offered as a public class)

Overview

This 5-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be introduced at level 200, without requiring you to have data science prerequisites, on the first day. However, as the class progresses, the level of the training will quickly increase to 300 and 400.

You will learn machine learning, data mining, some statistics, data preparation, and how to interpret the results. You will see how to formulate business questions in terms of data science hypotheses and experiments, and how to prepare inputs to answer those questions. We will cover common issues and mistakes, such as overtraining, and how to resolve them, as well as how to cope with rare events, such as fraud. At the end of this course you will be able to plan and run data science projects.

As a practicing data miner, Rafal will also share his decade of hands-on experience while teaching you about Azure Machine Learning (Azure ML), which is the foundation of Cortana Analytics Suite, and its highly visual, on-premises companion, the SQL Server Analysis Services Data Mining engine, supplemented with R software, both the free open-source edition and Cortana’s Revolution Analytics edition. We will use some Excel; however, most of our time will be spent in ML Studio, with some in R, RStudio, SSDT, SSMS, and the Azure Portal.

Prerequisites: No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all: prepare questions that you would like to answer using predictive analytics and machine learning.

Format: 60% lectures, 20% demos, plus 20% time allocated to help you follow the demos and tasks on your own equipment, if you bring a laptop. You will be challenged to find answers to 4 problems during the course, and you will have a chance to build your own models in SSAS, Azure ML and R. Doing that will help you learn, but it is not a requirement: you are welcome to observe the demos and ask questions instead. If you bring your own data, you are welcome to analyse it, too. You will get a list of free or evaluation-edition software to preinstall before attending. You will need your own Azure account: a free one is OK, but a paid one is better, and it can be inexpensive, or even free during a trial. You can copy course experiments and data into your ML workspace for learning and future reference.

Target audience: Analysts, power users, predictive and BI developers, database and other professionals who wish to embrace machine learning, budding data scientists, consultants.

Instructor: Rafal Lukawiecki (You can watch a 75-minute presentation on this material by Rafal at the 2015 Ignite conference here.)

Need Help Justifying Training? Here’s a letter to your boss explaining why SQLskills training is worthwhile and a list of community blog posts about our classes.

Quotes From Past Attendees

Listed below are some verbatim quotes from recent attendees of this class:

“Rarely do we meet domain expertise, in any discipline, that is as complete and deep as Rafal’s in Data Science/Data Mining/R, SSAS, AML. Add to the observed level of patience and ability to articulate and concern for the class’ absorption of the material and ‘lucky/fortunate’ is the best way to describe anyone who attends one of Rafal’s training events.”
“Would come back again – just loved it!”
“Rafal was just great!”
“‘Boots on ground’ – real-world applicable.”
“Ad hoc deep questions answered with real-life examples or academic explanations softened with real-world perspective.”

Curriculum

Please note: we reserve the right to amend the order and the day allocation of the modules to best suit the dynamic character of the class and to answer questions as they arise. Some subjects (marked with an asterisk *) are optional, and will only be covered if time allows.

Module 1: Overview of Practical Data Science for Business

We begin the course with a quick, high-level introduction to all of the key concepts, terminology, components, and tools. Topics covered include:

Introduction to data science and its components
Machine learning vs data mining
Statistics
Big data
Data wrangling
Team, process, and tools

Module 2: Tools and Getting Started

Configuring the key component of Cortana Analytics, cloud-based Azure ML, is effortless. You need to pay a little more attention to the on-premises R and SQL Server environment to make sure that you can easily access your modeling data. Topics covered include:

Getting started with and using Cortana Analytics: Azure ML, SSAS DM, and R
Structures, models, data flows
Configuration concerns and pricing
Azure requirements and dependencies
Other components of Cortana Analytics Suite
Using Rattle with R and RStudio
Getting a feel for the data: interpreting notched boxplots in R

Module 3: Data

Data science requires you to prepare your data into a rather unique, flat, and completely denormalised format. While inputs are always necessary, and you may need to engineer hundreds of them, we do not need predictive outputs in all cases. Topics covered include:

Inputs and outputs, features and labels
Data formats, discretization vs. continuous
Cases, observations, signatures
Feature engineering
Azure ML data preparation and manipulation modules
Preparing unstructured text for text analysis
Feature hashing
Moving data around and its storage
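
As a rough illustration of the feature hashing topic listed above, the following Python sketch maps an arbitrary number of text tokens into a fixed-length count vector. It is a minimal sketch only: the function name `hash_features`, the bucket count, and the use of MD5 are assumptions made for the example, and it does not reproduce how the Azure ML module works internally.

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    """Map a list of tokens to a fixed-length vector of bucket counts.

    Hypothetical helper for illustration; n_buckets is an arbitrary choice.
    """
    vector = [0] * n_buckets
    for token in tokens:
        # Use a stable hash so the same token always lands in the same bucket.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vector[bucket] += 1
    return vector

# Made-up input text, split into word tokens.
tokens = "the quick brown fox jumps over the lazy dog".split()
vector = hash_features(tokens)
print(vector)
```

Different tokens can collide in the same bucket, which is the trade-off for getting a flat, fixed-width input regardless of vocabulary size.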

Module 4: Process

The analytical process consists of problem formulation, data preparation, modelling, validation, and deployment—all in an iterative fashion. You will learn about the CRISP-DM industry-standard approach, as well as the application of the scientific method of reasoning to experimentation, when solving real-world business problems. Topics covered include:

Stating business questions in data science terms
CRISP-DM
Scientific method of reasoning
Hypothesis testing and experiments
Iterative hypothesis refinement

Module 5: Algorithm Overview

There are hundreds of machine learning algorithms, yet they belong to just a dozen groups, of which 4-5 are in very common use. We will introduce those algorithm classes, and we will discuss some of the most often used examples in each class, while explaining which technology tools (Azure ML, SQL, or R) provide their most convenient implementation. Topics covered include:

What does data mining do?
Algorithm classes in Azure ML, R, and SSAS
Supervised vs Unsupervised learning
Classifiers
Clustering
Regression
Similarity Matching
Recommenders

Module 6: Segmentation

Segmentation is the main application of unsupervised learning using clustering algorithms. While the action of the algorithm is usually quick and easy to configure, interpreting the results can take a lot of time and intuition. We will spend plenty of time practicing segmentation, interpreting the results and subsequently parameterising the algorithm to provide us with additional insight, and to help you apply it back to your own data. You will even learn how to apply this technique for anomaly (outlier) detection and text analytics! Topics covered include:

Introduction to segmentation
Clustering algorithms (k-means, EM, and others)
Interpreting clusters
Cluster characteristics
Discrimination
Tornado charts
Using clustering for text analysis
Anomaly detection with clustering, PCA and SVMs
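
To make the k-means idea concrete, here is a minimal sketch in plain Python. The data points, the starting centroids, and all function names are invented for the example; the tools used in class (Azure ML, SSAS, R) provide their own implementations with more careful initialisation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Made-up two-dimensional data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
final_centroids, final_clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
print(final_clusters)
```

Running the algorithm is the quick part; deciding what each resulting segment means for the business is where most of the effort goes.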

Module 7: Classification

Without doubt, classifiers are the most important and the most often used category of machine learning algorithms, and the foundation of algorithmic data science. We will focus on several variants of the most important classifier algorithm, the decision tree, while progressively interpreting the results and improving its performance. After introducing neural networks and logistic regression, we will also compare the performance of all of these classifiers on our test dataset. Topics covered include:

Introduction to classifiers
Two-class (binary) vs multi-class
Decision trees, forests, and boosting
Decision jungles *
Neural networks and logistic regression
Overfitting (overtraining) concerns
Using classifiers for text analysis
Associative decision trees *

Module 8: Basic Statistics

Basic concepts of statistics, notably means, medians, modes, and variance or standard deviation, are essential to validating data and model quality. Probability and the concept of p-values help you decide which of your inputs (features) are more important than others. R makes all of these powerful ideas accessible and visual, while Azure ML enables you to deploy them easily into production. Topics covered include:

Basic concepts of statistics: population vs sample, measure types, means and deviations, distributions, confidence intervals, p-values
Correlation
Descriptive statistics with R
Basic concepts of probability
Finding important features using p-values, linear regression, and ANOVA *
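
The class uses R for descriptive statistics. Purely as an illustration of the same measures, here is a small Python sketch using the standard `statistics` module; the sample values are made up and the `describe` helper is a hypothetical name.

```python
import statistics

# Made-up sample: order values for ten customers, with one large outlier.
order_values = [12.0, 15.0, 15.0, 18.0, 20.0, 22.0, 25.0, 30.0, 45.0, 98.0]

def describe(values):
    """Return basic descriptive statistics for a sample of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "stdev": statistics.stdev(values),        # sample standard deviation
    }

summary = describe(order_values)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this sample the single large value pulls the mean well above the median, which is the kind of signal these measures give you when validating data quality.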

Module 9: Model Validation

The most important aspect of any data science project is the iterative validation and improvement of the models. Without validation, your models cannot be used. There are several tests of model validity, and we will focus on accuracy and reliability, showing you different ways to measure them. Topics covered include:

Testing accuracy
Lift charts
Testing reliability
Testing usefulness

Module 10: Classifier Precision

Validation of classifiers is likely to be your main occupation as a data scientist, because classifiers are used so often, and because their precision is not always easy to balance with business requirements, such as restricted resources or required business performance. We will introduce the fundamentals of finding the balance between the acceptable number of false positives and false negatives by using classification (confusion) matrices, and plotting the options using ROC (Receiver Operating Characteristic) charts. Topics covered include:

Testing classifiers
False positives vs. false negatives
Classification (confusion) matrix
Precision
Recall
Balancing precision with recall vs. business goals and constraints
Charting precision-recall (sensitivity-specificity)
ROC curves
Other measures of accuracy
Cross-validation
Optimizing binary classifier thresholds for a known business goal for prediction quality
Refining models to improve accuracy and reliability
Using parameter sweeps to fine-tune algorithm performance
Class imbalance problem (fraud analytics and rare event prediction) *
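
The trade-off between false positives and false negatives can be sketched in a few lines of Python. The labels, scores, and function names below are invented for illustration; the point is only to show how moving the decision threshold changes the classification (confusion) matrix and, with it, precision and recall.

```python
def confusion_counts(labels, scores, threshold):
    """Count true/false positives and negatives at a given score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up data: 1 marks a positive case, scores are predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.35, 0.20, 0.10, 0.05]

for threshold in (0.25, 0.50, 0.75):
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    precision, recall = precision_recall(tp, fp, fn)
    print(f"threshold={threshold:.2f} TP={tp} FP={fp} TN={tn} FN={fn} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

A lower threshold catches more of the true positives at the cost of more false alarms, and a higher one does the opposite; which point is acceptable depends on the business constraints.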

Module 11: Regressions

Considered by some as the numerical equivalent of classifiers, regression is a large subject of its own. We will introduce its simple but very popular form, linear regression, and the more precise, but also prone-to-overfitting, decision tree variant. Topics covered include:

Introduction to simple regressions
Linear regression (classic)
Regression decision trees and other ensemble regression algorithms
Relationship to ANOVA *
Measuring linear regression quality (R-squared, predictor p-values, RMSE, MAE, RAE, RSE, and additional testing using R) *
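
As an illustration of classic linear regression and three of the quality measures listed above (R-squared, RMSE, and MAE), here is a minimal Python sketch of a one-input least-squares fit. The data and function names are made up, and p-values and the other listed measures are not covered here.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def regression_metrics(xs, ys, slope, intercept):
    """Compute R-squared, RMSE, and MAE of the fitted line on the given data."""
    n = len(ys)
    mean_y = sum(ys) / n
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    ss_res = sum(r * r for r in residuals)          # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in ys)     # total variation
    return {
        "r_squared": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }

# Made-up data that lies close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
metrics = regression_metrics(xs, ys, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(metrics)
```

These metrics are computed on the same data used for fitting, so they say nothing about overfitting; a held-out test set is needed for that.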

Module 12: Similarity Matching and Recommenders

From basic concepts of similarity matching, through model-based associative analysis, collaborative filtering, to hybrid systems, like the Matchbox algorithm, there are several techniques for building recommenders. You will get a good overview of this subject, as well as an understanding of how to use these techniques for advanced data exploration, such as Market Basket Analysis. Topics covered include:

Introduction to recommender concepts
Model-based, similarity-based, and hybrid recommenders
Association rules
Understanding itemsets and rules
Rule importance vs. rule probability
Data structures for association rules
Market Basket Analysis
Collaborative filtering
Matchbox recommenders
Validating recommenders
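
The itemset and rule concepts above can be illustrated with a small Market Basket sketch in Python. The transactions and function names are invented. Confidence corresponds to rule probability, while lift is used here only as a common stand-in for a rule-strength measure; it is an assumption of this example and not necessarily the importance score that SSAS reports.

```python
# Made-up shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base

def lift(antecedent, consequent, transactions):
    """Confidence relative to how common the consequent is overall."""
    base = support(consequent, transactions)
    if base == 0:
        return 0.0
    return confidence(antecedent, consequent, transactions) / base

rule = ({"bread"}, {"butter"})
print("support:", support(rule[0] | rule[1], transactions))
print("confidence:", confidence(*rule, transactions))
print("lift:", lift(*rule, transactions))
```

A lift above 1 suggests the items appear together more often than their individual frequencies would predict, which is what makes a rule interesting rather than merely probable.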

Module 13: Other Algorithms (Brief Overview)

As the course comes to its end, we will briefly survey some of the remaining interesting algorithms and techniques, such as text analytics, without going into too much detail, but giving you an understanding of the existing approaches. Topics covered include:

Text analysis
Sequence clustering and Markov chains
SVM (Support Vector Machines)
Time series
Image analytics

Module 14: Production and Model Maintenance

If you plan on using your models for prediction, rather than just for the exploration of data, you need to deploy your models to production and maintain them on an ongoing basis. You will learn about the easiest way to do so using Azure ML web services and their synchronous and asynchronous REST APIs, as well as how to deploy and invoke SSAS models by using DMX queries. Topics covered include:

Deploying models to production
SSAS models and DMX queries
Azure ML web services: preparation and publishing
Cortana Solutions Gallery
REST APIs: request/response vs. batch
On-going maintenance and model updates

Questions?

If you have any questions not answered by our F.A.Q., please contact us.