We’ll begin with an introduction to predictive modelling. We’ll then discuss important statistical models, followed by a general approach to building predictive models and finally, we’ll cover the key steps in building predictive models. Please note that prerequisites for starting out in predictive modeling are an understanding of exploratory data analysis and statistical inference.
You can download the data files for this tutorial here.
What is Predictive Analytics?
Predictive analytics involves developing statistical models that predict an outcome or the probability of an outcome. For example, models can be developed to predict the income of customers or the probability of someone buying a particular product. Models are developed using historical data or purposely collected data. Predictive analytics is used in many business and economic areas, with a varying degree of maturity depending on the sector. For example, predictive analytics has been used in the financial services industry for many years, whereas areas such as sports and retail are still developing their use of it. However, almost all sectors have recognized the importance of predictive analytics in day-to-day decision making in both business and research.
Standard Predictive Models
There are many standard predictive models that a data scientist should know, for example linear regression, logistic regression, Poisson regression and time series analysis. As a data scientist, you should know which model to use in a given situation. The graphic below summarises the range of predictive models used in data science and analytics.
Predictive Modelling – A General Approach
Let’s now discuss the general approach for building a predictive model. We start by setting a business goal. Then it’s necessary to understand our data and carry out a lot of data pre-processing that is consistent with our business goal. Once the data is ready, we need to carry out exploratory data analysis before starting with predictive modeling. After exploratory data analysis, we can start building statistical models. Models are subsequently evaluated and validated, before we finally implement them in business and research decision making. This is a general approach and it can vary depending on the situation or the problem under study, but broadly we follow these steps in predictive modeling.
Data Understanding and Pre-Processing
We’ll begin with the first step in predictive modelling, namely data understanding and pre-processing. Data understanding means understanding the data dimensions, variable types, variable relationships and so on. You may have to convert raw data to usable data by cleaning it, handling missing values, removing inconsistencies or transforming variables, for example. Feature engineering is also important: by using specific domain knowledge, we can create new variables based on existing variables. Data pre-processing can also involve grouping or factoring our variables, since segmentation and data reduction can be useful when starting to build models. Data understanding and pre-processing is a critical step and it may require a lot of time before you actually start to build a predictive model.
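To make these pre-processing ideas concrete, here is a minimal sketch in Python using only the standard library. The records, the variable names (`age`, `income`, `spend`) and the choice of median imputation are assumptions made for illustration; they are not taken from the tutorial's data files.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 25, "income": 30000, "spend": 1200},
    {"age": 41, "income": None, "spend": 2500},
    {"age": 58, "income": 52000, "spend": 1300},
    {"age": 33, "income": 44000, "spend": 2200},
]

def preprocess(records):
    """Impute missing income, add a derived feature, and group age."""
    known = [r["income"] for r in records if r["income"] is not None]
    fill = median(known)  # median imputation is one choice among many
    cleaned = []
    for r in records:
        row = dict(r)  # copy so the raw data stays untouched
        if row["income"] is None:
            row["income"] = fill
        # Feature engineering: a new variable built from existing ones.
        row["spend_ratio"] = row["spend"] / row["income"]
        # Grouping: turn a continuous variable into a factor.
        row["age_group"] = "under_40" if row["age"] < 40 else "40_plus"
        cleaned.append(row)
    return cleaned

clean = preprocess(raw)
```

Working on a copy of each record keeps the raw data available, which is useful if we later decide on a different way of handling missing values.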
Exploratory Data Analysis
The next step is exploratory data analysis. Performing exploratory data analysis involves obtaining descriptive statistics, data visualization and correlation analysis. This can be extremely useful in helping us to understand our variables and it may even help us to exclude some variables before we start building a model. A lot of insights into data are generated in this step, which further enhances our understanding of predictive modeling.
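As a rough illustration of this step, the sketch below computes descriptive statistics for one variable and the Pearson correlation between two variables. The `advertising` and `sales` figures are invented for the example, and the helper names `describe` and `pearson` are our own.

```python
from math import sqrt
from statistics import mean, median, stdev

def describe(values):
    """Return basic descriptive statistics for a numeric variable."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: advertising spend and sales.
advertising = [10, 12, 15, 18, 20, 25]
sales = [110, 118, 131, 140, 152, 171]

summary = describe(advertising)
r = pearson(advertising, sales)
```

A correlation close to 1 or -1 suggests a strong linear relationship, while a value near 0 may be a reason to question whether a variable belongs in the model.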
Model Identification, Selection and Validation
Step three is model identification, selection and validation. Model identification is based on the objective of our study and the type of dependent variable. The dependent variable can be continuous, binary or a count variable and, depending on its type, we can decide on the standard predictive modelling method.
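The link between the type of dependent variable and the model family can be sketched as a small helper. The function `suggest_model` is hypothetical and deliberately crude: it looks only at the observed values, whereas in practice the study objective and domain knowledge also inform the choice.

```python
def suggest_model(y):
    """Suggest a standard model family from the dependent variable's values.

    A rough heuristic for illustration only; it inspects the values and
    ignores the study objective.
    """
    distinct = set(y)
    if distinct <= {0, 1}:
        return "logistic regression"   # binary outcome
    if all(float(v).is_integer() and v >= 0 for v in y):
        return "Poisson regression"    # non-negative counts
    return "linear regression"         # continuous outcome
```

For instance, a list of 0/1 purchase flags would be mapped to logistic regression, daily counts of complaints to Poisson regression, and customer incomes with decimal values to linear regression.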
Model selection is based on a range of statistics such as R-squared, p-values or AIC. Alternatively, automatic search procedures can also be used. Using these, we can select our final model.
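To show how two of these statistics can be used, the sketch below compares an intercept-only model with a one-predictor linear model on invented data. It assumes Gaussian errors and uses the AIC form n·ln(RSS/n) + 2k, which drops an additive constant; here k counts only the regression coefficients, and other conventions also count the error variance, which does not change the ranking of the models.

```python
from math import log
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b

def rss(y, fitted):
    """Residual sum of squares."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def r_squared(y, fitted):
    """Share of the variation in y explained by the fitted values."""
    my = mean(y)
    tss = sum((yi - my) ** 2 for yi in y)
    return 1 - rss(y, fitted) / tss

def aic(y, fitted, n_params):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    n = len(y)
    return n * log(rss(y, fitted) / n) + 2 * n_params

# Hypothetical data, roughly y = 2x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

a, b = fit_line(x, y)
line_fit = [a + b * xi for xi in x]
mean_fit = [mean(y)] * len(y)   # intercept-only model

r2_line = r_squared(y, line_fit)
aic_line = aic(y, line_fit, n_params=2)   # intercept + slope
aic_mean = aic(y, mean_fit, n_params=1)   # intercept only
```

The model with the lower AIC is preferred, so in this example the model with the predictor would be selected over the intercept-only model.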
After model selection, the model is validated using cross-validation methods. In the simplest form, we split the data into two sections: training data and test data. We develop the model using the training data and keep the test data separate. We then use the test data to check the predictive ability of our model, which should give us confidence in implementing the model in a real-world situation.
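The training and test split described above can be sketched as follows. The data are generated artificially, the 25% test fraction and the seed are arbitrary choices, and root mean squared error (RMSE) is just one possible measure of predictive ability.

```python
import random
from math import sqrt
from statistics import mean

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows reproducibly and split them into train and test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(pairs):
    """Least-squares intercept and slope for y = a + b * x."""
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum(
        (x - mx) ** 2 for x, _ in pairs
    )
    return my - b * mx, b

def rmse(pairs, a, b):
    """Root mean squared prediction error on a set of (x, y) pairs."""
    return sqrt(mean((y - (a + b * x)) ** 2 for x, y in pairs))

# Hypothetical data: y is roughly 3 + 2x plus a small deterministic wobble.
data = [(x, 3 + 2 * x + (0.5 if x % 2 else -0.5)) for x in range(1, 21)]

train, test = train_test_split(data)
a, b = fit_line(train)          # the model sees only the training data
test_error = rmse(test, a, b)   # predictive ability on unseen data
```

Because the test rows play no part in fitting, `test_error` gives a fairer picture of how the model might behave on new data than an error computed on the training rows.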
Predictive Model Implementation
Our final step is model implementation. Once we validate our model, we’re ready to implement it in real-life situations. We can build equations using the coefficients of significant variables or we can map the model to an existing system in an organization. We can deploy a model in different ways, such as including it in a spreadsheet, creating a separate web application with a user interface for business users, or integrating it into a firm’s current IT system, for example.
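As an example of building an equation from coefficients, the sketch below scores a customer with a logistic regression equation. The coefficient values and the variable names (`age`, `income_k` for income in thousands) are purely hypothetical and do not come from a fitted model.

```python
from math import exp

# Hypothetical coefficients from a fitted logistic regression;
# these values are illustrative and not taken from any real model.
COEFFICIENTS = {"intercept": -4.0, "age": 0.05, "income_k": 0.02}

def purchase_probability(age, income_k):
    """Score one customer with the model equation."""
    z = (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["age"] * age
        + COEFFICIENTS["income_k"] * income_k
    )
    return 1 / (1 + exp(-z))  # logistic function maps z to (0, 1)

p = purchase_probability(age=40, income_k=100)
```

A function like this is small enough to be reproduced as a spreadsheet formula or wrapped in a web application, which are two of the deployment routes mentioned above.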
Sample Size and Data Dimension
Predictive models are developed using historical data, and sample size and data dimension are important for building good ones. We cannot develop a model with a very small sample size because such a model may not give insights about relationships among the variables. If our data has a lot of variables but few observations, then we are trying to learn too much from a small sample, and results from such models can be very erratic. A rule of thumb is that we should use a sample where the number of observations is at least ten times the number of variables. For example, if we have eight variables, then we should have at least 80 observations.
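The rule of thumb can be written as a quick check. The function names are our own, and the ratio of ten is the guideline stated above rather than a strict requirement.

```python
def min_observations(n_variables, ratio=10):
    """Smallest sample size suggested by the ten-to-one rule of thumb."""
    return ratio * n_variables

def sample_is_adequate(n_observations, n_variables, ratio=10):
    """True if the sample meets the rule of thumb."""
    return n_observations >= min_observations(n_variables, ratio)
```

With eight variables, `min_observations(8)` gives 80, so a sample of 50 observations would fall short of the guideline.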
Let’s quickly recap. In this tutorial, we introduced the concept of predictive analytics with some examples of where it is applied, as well as the standard predictive models used in data science. We also learned that a model is developed by using a four-step general approach, where step one is data understanding and pre-processing, step two is exploratory data analysis, step three is model identification, selection and validation, and step four is model implementation.
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.