In this session, we’ll learn the concept of Statistical Inference. Statistical inference is a vast area which includes many statistical methods from analyzing data to drawing inferences or conclusions in research or business problems. It plays a vital role in the application of data science across industries.
We’ll begin with some basic terminology before going into more depth on the topic. Then we’ll discuss two broad areas of statistical inference – estimation, including point and interval estimation, and then hypothesis testing.
Statistical Inference Terminology
Let’s look at some basic terms used in statistical inference.
A variable is what we measure and want to study in our project. It could be employee salaries or the transaction value of customers, for example.
A population is a set of all units we want to draw conclusions about.
For example: All the employees in an organization.
A sample is a subset of the population, for example a specific group of employees in the organization.
A statistical distribution describes how the values of a variable are distributed in a population. The most common distribution is the “normal” distribution.
A factor defines subgroups in a study, such as the gender or location of employees.
Descriptive statistics typically include the mean, median, and standard deviation of a variable under study.
What is Statistical Inference?
So what is statistical inference? Statistical inference is the process of drawing conclusions about unknown population properties, using a sample drawn from the population. Unknown population properties can be, for example, the mean, a proportion or the variance. These are also called parameters.
Statistical inference is broadly divided into two parts: estimation and hypothesis testing. Estimation is further divided into point estimation and interval estimation.
In point estimation, we estimate an unknown parameter using a single number that is calculated from the sample data. For example, the average salary of junior data scientists based on a sample is 55,000 euros.
In interval estimation, we find a range of values within which we believe the true population parameter lies, with a stated level of confidence. Here, the average salary of junior data scientists is between 52,000 and 58,000 euros, with a 95% confidence level.
In hypothesis testing, we need to decide whether a statement regarding a population parameter is true or false, based on sample data. For example, a claim that the average salary of junior data scientists is greater than 50,000 euros annually can be tested using sample data.
Parameters, Estimators and Estimates
Let’s now look at the difference between parameters, estimators and estimates. A parameter is an unknown quantity, such as the population mean. It is estimated using a function of the sample values, such as the sample mean. So, the sample mean is an estimator.
The value of the sample mean using sample values is called an estimate.
We’ll now discuss this further. In this example, the parameter is the population mean of salaries earned by junior data scientists. The sample mean is the estimator, and the estimate based on the sample is 55,000 euros.
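Here the parameter is the proportion of data scientists using R. The sample proportion is the estimator, and the estimate is the 380 R users in the sample expressed as a fraction of the sample size (380 on its own is a count; a proportion always lies between 0 and 1).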
In both of the previous examples we estimated parameters using a single value, hence the name point estimation.
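The distinction between estimator and estimate can be sketched in a few lines of Python. This is only an illustration: the salary values and the R-usage flags below are made up and do not come from the tutorial's data, and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical sample of junior data scientist salaries (euros); illustrative only.
salaries = [52000, 57500, 54000, 56000, 55500]

# Hypothetical sample: True if the respondent uses R; illustrative only.
uses_r = [True, False, True, True, False, False, True, False, False, False]

def sample_mean(values):
    """Estimator of the population mean: a function of the sample values."""
    return mean(values)

def sample_proportion(flags):
    """Estimator of the population proportion: share of True values."""
    return sum(flags) / len(flags)

# Applying an estimator to one particular sample gives an estimate (a single number).
mean_estimate = sample_mean(salaries)
proportion_estimate = sample_proportion(uses_r)
print(mean_estimate, proportion_estimate)
```

The functions play the role of estimators, while the numbers they return for this particular sample are the point estimates.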
Interval estimation gives an inference such as: a 95% confidence interval for the average salary of junior data scientists is 52,000 to 58,000 euros. Generally, 95% or 90% confidence intervals are used.
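A 95% confidence interval is a range estimate produced by a procedure that, over repeated sampling, captures the true value of the parameter 95% of the time. Informally, we are 95% confident that the true value lies within the interval.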
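As a rough sketch of interval estimation, the Python snippet below computes a confidence interval for a mean using the normal approximation. It assumes a reasonably large sample; the salary data are randomly generated for illustration, and the function name mean_confidence_interval is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(values)
    centre = mean(values)
    standard_error = stdev(values) / sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error

# Hypothetical sample of 100 salaries (euros), generated for illustration only.
rng = random.Random(1)
salaries = [rng.gauss(55000, 9000) for _ in range(100)]

lower_95, upper_95 = mean_confidence_interval(salaries, 0.95)
lower_90, upper_90 = mean_confidence_interval(salaries, 0.90)
print(f"95% CI: {lower_95:.0f} to {upper_95:.0f}")
print(f"90% CI: {lower_90:.0f} to {upper_90:.0f}")
```

The 90% interval is narrower than the 95% interval: lower confidence buys a tighter range. For small samples, a t-based interval would usually be preferred over this normal approximation.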
An estimator, such as the sample mean, is a random variable because its value varies with the sample drawn.
The distribution of an estimator, such as the sample mean or sample proportion, is called a sampling distribution.
The standard deviation of this distribution is called the standard error. For example, if we draw 50 samples, each of size 250, the sample mean will vary from sample to sample. These different values give rise to the sampling distribution.
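The idea of a sampling distribution can be simulated directly. The sketch below draws 50 samples of size 250 from a made-up population of salaries (the population values are an assumption for illustration) and compares the spread of the sample means with the textbook standard error, sigma divided by the square root of n.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)

# Hypothetical population of 10,000 salaries (euros); illustrative only.
population = [rng.gauss(55000, 8000) for _ in range(10_000)]

# Draw 50 samples, each of size 250, and record the mean of each sample.
sample_means = [mean(rng.sample(population, 250)) for _ in range(50)]

# The standard deviation of the sample means is an empirical standard error.
empirical_se = stdev(sample_means)

# Theoretical standard error of the mean: sigma / sqrt(n).
theoretical_se = stdev(population) / sqrt(250)

print(f"empirical SE:   {empirical_se:.1f}")
print(f"theoretical SE: {theoretical_se:.1f}")
```

With only 50 samples the two values will not match exactly, but they should be of the same order.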
Now let’s look at hypothesis testing. A hypothesis is an assertion about the distribution of one or more random variables.
A null hypothesis, often referred to as H0, is an assertion which is generally believed to be true until a researcher rejects it with evidence. An alternative hypothesis, H1, is the researcher’s claim, which contradicts the null hypothesis. In other words, hypothesis testing decides whether a statement regarding a population parameter is true or false, based on a set of sample data.
A test statistic is a random variable that is calculated from sample data and used in a hypothesis test. We can use a test statistic to determine whether to reject the null hypothesis. It compares our data with what is expected under the null hypothesis.
A critical region, also known as the rejection region, is the set of values of the test statistic for which the null hypothesis is rejected.
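To make the test statistic and critical region concrete, here is a minimal sketch of a z test for a mean. It assumes the population standard deviation is known and the sample mean is approximately normal; the function names and the tail labels are hypothetical choices for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, hypothesized_mean, sigma, n):
    """Distance of the sample mean from the H0 value, in standard errors."""
    return (sample_mean - hypothesized_mean) / (sigma / sqrt(n))

def in_critical_region(z, alpha=0.05, tail="two-sided"):
    """Return True if z falls in the rejection region at level alpha."""
    std_normal = NormalDist()
    if tail == "left":    # H1: mean is less than the H0 value
        return z < std_normal.inv_cdf(alpha)
    if tail == "right":   # H1: mean is greater than the H0 value
        return z > std_normal.inv_cdf(1 - alpha)
    # H1: mean differs from the H0 value in either direction
    return abs(z) > std_normal.inv_cdf(1 - alpha / 2)

z = z_statistic(sample_mean=19.0, hypothesized_mean=20.0, sigma=3.0, n=36)
print(z, in_critical_region(z, tail="left"))
```

The same value of z can fall inside a one-sided critical region but outside the two-sided one, because the two-sided test splits the significance level between both tails.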
Hypothesis Testing Case Study
Let’s look at an example of a claim and how it is tested.
A paint manufacturer claims that the average drying time of their new paint is less than 20 minutes.
To test the claim, a sample of 36 boards was painted, each from a different can of paint, and the drying times were observed.
The sample mean is calculated from the 36 observed values.
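Note that the claim is stated as the alternative hypothesis, H1: the average drying time is less than 20 minutes. The null hypothesis, H0, is that the average drying time is 20 minutes.
The claim is considered supported by the data if the null hypothesis is rejected.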
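The paint example can be worked through numerically. The tutorial text does not give the observed drying times, so the sample mean and standard deviation below are hypothetical values chosen only to show the calculation; the normal approximation is also an assumption.

```python
from math import sqrt
from statistics import NormalDist

n = 36                 # boards painted (from the example)
h0_mean = 20.0         # H0: average drying time is 20 minutes
sample_mean = 19.2     # hypothetical observed mean, minutes
sample_sd = 2.4        # hypothetical observed standard deviation, minutes

# Test statistic: how many standard errors the sample mean lies below 20.
z = (sample_mean - h0_mean) / (sample_sd / sqrt(n))

# H1 is "less than 20", so the p-value is the area in the left tail.
p_value = NormalDist().cdf(z)

reject_h0 = p_value < 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With these made-up numbers the null hypothesis is rejected at the 5% level, which would support the manufacturer's claim. Because the standard deviation is estimated from the sample, a t test with 35 degrees of freedom would be the more exact choice.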
There are two possible types of error in hypothesis testing. Rejecting the null hypothesis when it is actually true is a Type I error. Not rejecting the null hypothesis when it is actually false is a Type II error.
An interesting analogy can be drawn from the decision-making process in the legal system.
In the legal system, H0 is that the person is not guilty and H1 is that the person is guilty. Convicting an innocent person corresponds to a Type I error, and acquitting a guilty person corresponds to a Type II error.
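The two error types can also be seen in a small simulation. The sketch below reuses the drying-time setting with assumed values (a known standard deviation of 3 minutes and a true mean of 19 minutes for the Type II case); these numbers are hypothetical and serve only to illustrate the rates.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

rng = random.Random(0)
N, SIGMA, MU_0, ALPHA = 36, 3.0, 20.0, 0.05   # assumed settings
CRITICAL_Z = NormalDist().inv_cdf(ALPHA)       # left-tail cut-off, about -1.645

def rejects_h0(true_mean):
    """Draw one sample and report whether the left-tailed z test rejects H0."""
    sample = [rng.gauss(true_mean, SIGMA) for _ in range(N)]
    z = (mean(sample) - MU_0) / (SIGMA / sqrt(N))
    return z < CRITICAL_Z

trials = 2000
# H0 is true (mean really is 20): every rejection is a Type I error.
type_1_rate = sum(rejects_h0(20.0) for _ in range(trials)) / trials
# H0 is false (mean is really 19): every non-rejection is a Type II error.
type_2_rate = sum(not rejects_h0(19.0) for _ in range(trials)) / trials

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

The Type I rate should land close to the chosen 5% level, while the Type II rate depends on how far the true mean is from the value stated in H0.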
The probability of a Type I error is generally called the level of significance. The concept of a P value is widely used to make a decision about a claim.
If the P value is less than a pre-defined level of significance, typically 5% (0.05), then the null hypothesis is rejected. The P value is computed from the sample data: it is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed, and it is often loosely read as the risk of rejecting the null hypothesis when it is actually true. In general, a claim made by a researcher is placed in the alternative hypothesis. When the data provide no evidence against it, the null hypothesis continues to be considered true.
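For example, the claim that “The vaccine is effective” will go into H1 and not H0. H1 can be one-sided (one-tailed) or two-sided (two-tailed): a claim that a mean is less than (or greater than) a given value is one-sided, while a claim that it differs from that value is two-sided.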
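The effect of choosing a one-sided or two-sided H1 on the P value, and on the final decision, can be sketched as follows. The example assumes a z test statistic has already been computed; the function names and the labels "less", "greater" and "two-sided" are hypothetical choices for this illustration.

```python
from statistics import NormalDist

def p_value(z, alternative="two-sided"):
    """P value of a z test statistic for the chosen form of H1."""
    std_normal = NormalDist()
    if alternative == "less":       # H1: parameter is less than the H0 value
        return std_normal.cdf(z)
    if alternative == "greater":    # H1: parameter is greater than the H0 value
        return 1 - std_normal.cdf(z)
    # H1: parameter differs from the H0 value (both tails count)
    return 2 * (1 - std_normal.cdf(abs(z)))

def decide(p, alpha=0.05):
    """Apply the decision rule: reject H0 when p is below alpha."""
    return "reject H0" if p < alpha else "do not reject H0"

z = 1.8  # hypothetical test statistic
for alternative in ("greater", "two-sided", "less"):
    p = p_value(z, alternative)
    print(f"{alternative:>9}: p = {p:.4f} -> {decide(p)}")
```

For the same test statistic, the one-sided test in the direction of the data rejects H0 at the 5% level while the two-sided test does not, which is why the form of H1 should be fixed before looking at the data.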
In this tutorial we covered the concept of statistical inference, the process of drawing conclusions about unknown population properties using samples drawn from a population.
We discussed how to use point estimation to summarize a sample by a single value as an estimate of the population parameter, and interval estimation to give a range of values within which we believe the true population parameter lies, with a stated level of confidence. We then covered hypothesis testing, which is used to decide whether a statement regarding a population parameter is true or false. Finally, we looked at Type I and Type II errors.
This tutorial is based on lessons from the Statistical Inference unit of the Postgraduate Diploma in Data Science.