uCertify

Data Mining and Predictive Analysis

Learn the data mining and predictive analysis essentials with hands-on techniques that turn raw data into actionable insights.

(DM-PA.AE1) / ISBN: 978-1-64459-374-5

Lessons

Lab

TestPrep

AI Tutor (Add-on)

301 Reviews

This course includes:

Free pre-assessment and first 2 lessons

34+ Interactive Lessons | 58+ Exercises

Accessible on mobile and tablet too

Certificate of completion

Are you an instructor?

Access detailed information about the course content, learning objectives, activities, and assessments before adding it to your curriculum.

About This Course

This Data Mining and Predictive Analytics course cuts through the noise to teach you the practical skills you need to analyze data and make accurate predictions. You’ll learn how to apply real-world data mining techniques, work with machine learning (ML) models, and extract insights that drive smarter decisions. We break complex concepts down into straightforward lessons so you can uncover the most profitable nuggets of knowledge in your data while avoiding the pitfalls that can cost your company millions of dollars.

Skills You’ll Get

Understand how to gather, clean, and organize raw data for analysis
Capitalize on core methods like classification, clustering, and association rule mining
Build predictive models using ML algorithms
Represent data insights using visual elements for better interpretation and decision-making
Apply statistical methods to analyze and interpret data trends
Understand and implement ML algorithms for predictive tasks
Learn how to select and create features to improve model accuracy
Evaluate model performance and fine-tune for better accuracy

Interactive Lessons

34+ Interactive Lessons | 58+ Exercises | 120+ Quizzes | 164+ Flashcards | 164+ Glossary Terms

Gamified TestPrep

Hands-On Labs

63+ LiveLabs | 63+ Video Tutorials | 02:02+ Hours

Preface

What is Data Mining? What is Predictive Analytics?
Why is this Course Needed?
Who Will Benefit from this Course?
Danger! Data Mining is Easy to do Badly
“White-Box” Approach
Algorithm Walk-Throughs
Exciting New Topics
The R Zone
Appendix: Data Summarization and Visualization
The Case Study: Bringing it all Together
How the Course is Structured

An Introduction to Data Mining and Predictive Analytics

What is Data Mining? What Is Predictive Analytics?
Wanted: Data Miners
The Need For Human Direction of Data Mining
The Cross-Industry Standard Process for Data Mining: CRISP-DM
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
The R Zone
R References
Exercises

Data Preprocessing

Why do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Measures of Center and Spread
Data Transformation
Min–Max Normalization
Z-Score Standardization
Decimal Scaling
Transformations to Achieve Normality
Numerical Methods for Identifying Outliers
Flag Variables
Transforming Categorical Variables into Numerical Variables
Binning Numerical Variables
Reclassifying Categorical Variables
Adding an Index Field
Removing Variables that are not Useful
Variables that Should Probably not be Removed
Removal of Duplicate Records
A Word About ID Fields
The R Zone
R Reference
Exercises

Exploratory Data Analysis

Hypothesis Testing Versus Exploratory Data Analysis
Getting to Know The Data Set
Exploring Categorical Variables
Exploring Numeric Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further Investigation
Using EDA to Uncover Anomalous Fields
Binning Based on Predictive Value
Deriving New Variables: Flag Variables
Deriving New Variables: Numerical Variables
Using EDA to Investigate Correlated Predictor Variables
Summary of Our EDA
The R Zone
R References
Exercises

Dimension-Reduction Methods

Need for Dimension-Reduction in Data Mining
Principal Components Analysis
Applying PCA to the Houses Data Set
How Many Components Should We Extract?
Profiling the Principal Components
Communalities
Validation of the Principal Components
Factor Analysis
Applying Factor Analysis to the Adult Data Set
Factor Rotation
User-Defined Composites
An Example of a User-Defined Composite
The R Zone
R References
Exercises

Univariate Statistical Analysis

Data Mining Tasks in Discovering Knowledge in Data
Statistical Approaches to Estimation and Prediction
Statistical Inference
How Confident are We in Our Estimates?
Confidence Interval Estimation of the Mean
How to Reduce the Margin of Error
Confidence Interval Estimation of the Proportion
Hypothesis Testing for the Mean
Assessing The Strength of Evidence Against The Null Hypothesis
Using Confidence Intervals to Perform Hypothesis Tests
Hypothesis Testing for The Proportion
Reference
The R Zone
R Reference
Exercises

Multivariate Statistics

Two-Sample t-Test for Difference in Means
Two-Sample Z-Test for Difference in Proportions
Test for the Homogeneity of Proportions
Chi-Square Test for Goodness of Fit of Multinomial Data
Analysis of Variance
Reference
The R Zone
R Reference
Exercises

Preparing to Model the Data

Supervised Versus Unsupervised Methods
Statistical Methodology and Data Mining Methodology
Cross-Validation
Overfitting
Bias–Variance Trade-Off
Balancing The Training Data Set
Establishing Baseline Performance
The R Zone
R Reference
Exercises

Simple Linear Regression

An Example of Simple Linear Regression
Dangers of Extrapolation
How Useful is the Regression? The Coefficient of Determination, r²
Standard Error of the Estimate, s
Correlation Coefficient r
ANOVA Table for Simple Linear Regression
Outliers, High Leverage Points, and Influential Observations
Population Regression Equation
Verifying The Regression Assumptions
Inference in Regression
t-Test for the Relationship Between x and y
Confidence Interval for the Slope of the Regression Line
Confidence Interval for the Correlation Coefficient ρ
Confidence Interval for the Mean Value of y Given x
Prediction Interval for a Randomly Chosen Value of y Given x
Transformations to Achieve Linearity
Box–Cox Transformations
The R Zone
R References
Exercises

Multiple Regression and Model Building

An Example of Multiple Regression
The Population Multiple Regression Equation
Inference in Multiple Regression
Regression With Categorical Predictors, Using Indicator Variables
Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
Sequential Sums of Squares
Multicollinearity
Variable Selection Methods
Gas Mileage Data Set
An Application of Variable Selection Methods
Using the Principal Components as Predictors in Multiple Regression
The R Zone
R References
Exercises

k-Nearest Neighbor Algorithm

Classification Task
k-Nearest Neighbor Algorithm
Distance Function
Combination Function
Quantifying Attribute Relevance: Stretching the Axes
Database Considerations
k-Nearest Neighbor Algorithm for Estimation and Prediction
Choosing k
Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
The R Zone
R References
Exercises

Decision Trees

What is a Decision Tree?
Requirements for Using Decision Trees
Classification and Regression Trees
C4.5 Algorithm
Decision Rules
Comparison of the C5.0 and CART Algorithms Applied to Real Data
The R Zone
R References
Exercises

Neural Networks

Input and Output Encoding
Neural Networks for Estimation and Prediction
Simple Example of a Neural Network
Sigmoid Activation Function
Back-Propagation
Gradient-Descent Method
Back-Propagation Rules
Example of Back-Propagation
Termination Criteria
Learning Rate
Momentum Term
Sensitivity Analysis
Application of Neural Network Modeling
The R Zone
R References
Exercises

Logistic Regression

Simple Example of Logistic Regression
Maximum Likelihood Estimation
Interpreting Logistic Regression Output
Inference: Are the Predictors Significant?
Odds Ratio and Relative Risk
Interpreting Logistic Regression for a Dichotomous Predictor
Interpreting Logistic Regression for a Polychotomous Predictor
Interpreting Logistic Regression for a Continuous Predictor
Assumption of Linearity
Zero-Cell Problem
Multiple Logistic Regression
Introducing Higher Order Terms to Handle Nonlinearity
Validating the Logistic Regression Model
WEKA: Hands-On Analysis Using Logistic Regression
The R Zone
R References
Exercises

Naïve Bayes and Bayesian Networks

Bayesian Approach
Maximum A Posteriori (MAP) Classification
Posterior Odds Ratio
Balancing The Data
Naïve Bayes Classification
Interpreting The Log Posterior Odds Ratio
Zero-Cell Problem
Numeric Predictors for Naïve Bayes Classification
WEKA: Hands-on Analysis Using Naïve Bayes
Bayesian Belief Networks
Clothing Purchase Example
Using The Bayesian Network to Find Probabilities
The R Zone
R References
Exercises

Model Evaluation Techniques

Model Evaluation Techniques for the Description Task
Model Evaluation Techniques for the Estimation and Prediction Tasks
Model Evaluation Measures for the Classification Task
Accuracy and Overall Error Rate
Sensitivity and Specificity
False-Positive Rate and False-Negative Rate
Proportions of True Positives, True Negatives, False Positives, and False Negatives
Misclassification Cost Adjustment to Reflect Real-World Concerns
Decision Cost/Benefit Analysis
Lift Charts and Gains Charts
Interweaving Model Evaluation with Model Building
Confluence of Results: Applying a Suite of Models
The R Zone
R References
Exercises
Hands-On Analysis

Cost-Benefit Analysis Using Data-Driven Costs

Decision Invariance Under Row Adjustment
Positive Classification Criterion
Demonstration Of The Positive Classification Criterion
Constructing The Cost Matrix
Decision Invariance Under Scaling
Direct Costs and Opportunity Costs
Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
Rebalancing as a Surrogate for Misclassification Costs
The R Zone
R References
Exercises

Cost-Benefit Analysis for Trinary and k-Nary Classification Models

Classification Evaluation Measures for a Generic Trinary Target
Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
Comparing CART Models With and Without Data-Driven Misclassification Costs
Classification Evaluation Measures for a Generic k-Nary Target
Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
The R Zone
R References
Exercises

Graphical Evaluation of Classification Models

Review of Lift Charts and Gains Charts
Lift Charts and Gains Charts Using Misclassification Costs
Response Charts
Profits Charts
Return on Investment (ROI) Charts
The R Zone
R References
Exercises
Hands-On Exercises

Hierarchical and k-Means Clustering

The Clustering Task
Hierarchical Clustering Methods
Single-Linkage Clustering
Complete-Linkage Clustering
k-Means Clustering
Example of k-Means Clustering at Work
Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
Application of k-Means Clustering Using SAS Enterprise Miner
Using Cluster Membership to Predict Churn
The R Zone
R References
Exercises
Hands-On Analysis

Kohonen Networks

Self-Organizing Maps
Kohonen Networks
Example of a Kohonen Network Study
Cluster Validity
Application of Clustering Using Kohonen Networks
Interpreting The Clusters
Using Cluster Membership as Input to Downstream Data Mining Models
The R Zone
R References
Exercises

BIRCH Clustering

Rationale for BIRCH Clustering
Cluster Features
Cluster Feature Tree
Phase 1: Building The CF Tree
Phase 2: Clustering The Sub-Clusters
Example of BIRCH Clustering, Phase 1: Building The CF Tree
Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
Evaluating The Candidate Cluster Solutions
Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
The R Zone
R References
Exercises

Measuring Cluster Goodness

Rationale for Measuring Cluster Goodness
The Silhouette Method
Silhouette Example
Silhouette Analysis of the Iris Data Set
The Pseudo-F Statistic
Example of the Pseudo-F Statistic
Pseudo-F Statistic Applied to the Iris Data Set
Cluster Validation
Cluster Validation Applied to the Loans Data Set
The R Zone
R References
Exercises

Association Rules

Affinity Analysis and Market Basket Analysis
Support, Confidence, Frequent Itemsets, and the A Priori Property
How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
Extension From Flag Data to General Categorical Data
Information-Theoretic Approach: Generalized Rule Induction Method
Association Rules are Easy to do Badly
How Can We Measure the Usefulness of Association Rules?
Do Association Rules Represent Supervised or Unsupervised Learning?
Local Patterns Versus Global Models
The R Zone
R References
Exercises

Segmentation Models

The Segmentation Modeling Process
Segmentation Modeling Using EDA to Identify the Segments
Segmentation Modeling using Clustering to Identify the Segments
The R Zone
R References
Exercises

Ensemble Methods: Bagging and Boosting

Rationale for Using an Ensemble of Classification Models
Bias, Variance, and Noise
When to Apply, and Not to Apply, Bagging
Bagging
Boosting
Application of Bagging and Boosting Using IBM/SPSS Modeler
References
The R Zone
R Reference
Exercises

Model Voting and Propensity Averaging

Simple Model Voting
Alternative Voting Methods
Model Voting Process
An Application of Model Voting
What is Propensity Averaging?
Propensity Averaging Process
An Application of Propensity Averaging
The R Zone
R References
Exercises
Hands-On Analysis

Genetic Algorithms

Introduction To Genetic Algorithms
Basic Framework of a Genetic Algorithm
Simple Example of a Genetic Algorithm at Work
Modifications and Enhancements: Selection
Modifications and Enhancements: Crossover
Genetic Algorithms for Real-Valued Variables
Using Genetic Algorithms to Train a Neural Network
WEKA: Hands-On Analysis Using Genetic Algorithms
The R Zone
R References
Exercises

Imputation of Missing Data

Need for Imputation of Missing Data
Imputation of Missing Data: Continuous Variables
Standard Error of the Imputation
Imputation of Missing Data: Categorical Variables
Handling Patterns in Missingness
Reference
The R Zone
R References

Case Study, Part 1: Business Understanding, Data Preparation, and EDA

Cross-Industry Standard Process for Data Mining
Business Understanding Phase
Data Understanding Phase, Part 1: Getting a Feel for the Data Set
Data Preparation Phase
Data Understanding Phase, Part 2: Exploratory Data Analysis

Case Study, Part 2: Clustering and Principal Components Analysis

Partitioning the Data
Developing the Principal Components
Validating the Principal Components
Profiling the Principal Components
Choosing the Optimal Number of Clusters Using BIRCH Clustering
Choosing the Optimal Number of Clusters Using k-Means Clustering
Application of k-Means Clustering
Validating the Clusters
Profiling the Clusters

Case Study, Part 3: Modeling And Evaluation For Performance And Interpretability

Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
Modeling And Evaluation Overview
Cost-Benefit Analysis Using Data-Driven Costs
Variables to be Input To The Models
Establishing The Baseline Model Performance
Models That Use Misclassification Costs
Models That Need Rebalancing as a Surrogate for Misclassification Costs
Combining Models Using Voting and Propensity Averaging
Interpreting The Most Profitable Model

Case Study, Part 4: Modeling and Evaluation for High Performance Only

Variables to be Input to the Models
Models that use Misclassification Costs
Models that Need Rebalancing as a Surrogate for Misclassification Costs
Combining Models using Voting and Propensity Averaging
Lessons Learned
Conclusions

Appendix A

Data Summarization and Visualization
Part 1: Summarization 1: Building Blocks Of Data Analysis
Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
Part 3: Summarization 2: Measures Of Center, Variability, and Position
Part 4: Summarization And Visualization Of Bivariate Relationships

An Introduction to Data Mining and Predictive Analytics

Analyzing a Dataset

Data Preprocessing

Handling Missing Data
Creating a Histogram
Creating a Scatterplot
Creating a Normal Q-Q Plot
Creating Indicator Variables

Exploratory Data Analysis

Analyzing the churn Dataset
Exploring Categorical Variables
Exploring Numeric Variables
Exploring Multivariate Relationships
Investigating Correlation Values and p-values in Matrix Form

Dimension-Reduction Methods

Creating a Scree Plot
Profiling the Principal Components
Calculating Communalities
Validating the Principal Components
Applying Factor Analysis to a Dataset

Univariate Statistical Analysis

Estimating the Confidence Interval for the Mean
Estimating the Confidence Interval of the Population Proportion

Multivariate Statistics

Performing a t-test for Finding the Difference in Means
Performing a z-test for Finding the Difference in Proportions
Performing a Chi-Square Test for Homogeneity of Proportions
Performing a Chi-Square Test for Goodness of Fit of Multinomial Data
Performing an Analysis of Variance

Preparing to Model the Data

Balancing the Training and Testing Datasets

Simple Linear Regression

Plotting Data with a Regression Line
Measuring the Goodness of Fit of the Regression
Performing Regression with Other Hikers
Verifying the Regression Assumptions
Determining Prediction and Confidence Intervals
Assessing Normality in Scrabble
Applying Box-Cox Transformations

Multiple Regression and Model Building

Approximating the Relationship between the Variables in a Scatterplot
Identifying Confidence Intervals
Creating a Dot Plot
Determining the Sequential Sums of Squares
Analyzing Multicollinearity
Applying the Best Subsets Procedure in a Regression Model
Applying the Stepwise Selection Procedure in a Regression Model
Applying the Backward Elimination Procedure
Applying Forward Selection Procedure
Using the Principal Components as Predictors in Multiple Regression

k-Nearest Neighbor Algorithm

Running KNN
Calculating the Euclidean Distance

Decision Trees

Plotting a Classification Tree

Neural Networks

Running a Neural Network

Logistic Regression

Creating a Plot for Logistic Regression
Interpreting Logistic Regression and Odds Ratio for a Dichotomous Predictor

Naïve Bayes and Bayesian Networks

Calculating Posterior Odds Ratio
Calculating the Log Posterior Odds Ratio
Calculating the Numeric Predictors for Naïve Bayes Classification

Model Evaluation Techniques

Estimating Costs for Benefit Analysis

Cost-Benefit Analysis Using Data-Driven Costs

Analyzing Cost-Benefit Using Data-Driven Misclassification Costs

Cost-Benefit Analysis for Trinary and k-Nary Classification Models

Analyzing the Cost-Benefit for the Trinary Loan Classification Problem

Hierarchical and k-Means Clustering

Using Single-Linkage Clustering
Using Complete-Linkage Clustering
Finding Clusters in Data

Kohonen Networks

Using a 3x2 Kohonen Network
Interpreting Clusters

Measuring Cluster Goodness

Plotting Silhouette Values of a Dataset
Applying Cluster Validation to a Dataset

Association Rules

Viewing the Output Sorted by Support

Segmentation Models

Predicting Income Using Caps and No Caps Groups

Genetic Algorithms

Using Genetic Algorithms to Train a Neural Network

Any questions?
Check out our FAQs for more information on data mining and predictive analytics courses.

Data mining is the process of discovering patterns and relationships in large datasets using statistical and ML techniques. It helps in extracting useful information from raw data.

Predictive analytics uses historical data, statistical algorithms, and ML to predict future outcomes. It builds models to forecast trends and behaviors, helping in decision-making.

Data mining helps discover patterns, trends, and insights from large data sets, enabling businesses to make informed decisions and drive efficiency.

Yes, data analysts are generally well-paid. Entry-level data analysts can earn around $40,000 to $66,000 annually, while mid-level analysts can make approximately $74,000. Senior data analysts often earn six-figure salaries, especially with specialized skills.

This data mining and predictive analysis training course is designed for data analysts, business professionals, and anyone interested in leveraging data for predictive decision-making.

Because this is an intermediate-to-advanced course, a basic understanding of data analysis, statistics, or programming is helpful.

This data mining and analysis training course focuses on commonly used tools like Python, R, Excel, and popular ML libraries such as scikit-learn and TensorFlow.

After completing this course, you’ll have a skillset for roles like data analyst, data scientist, business analyst, and ML engineer, among others.
