T Distribution, Kolmogorov-Smirnov and Shapiro-Wilk Tests

In a previous tutorial we looked at key concepts in statistical inference. We’ll now look at the t distribution, the Kolmogorov-Smirnov and Shapiro-Wilk tests, and standard parametric tests. Parametric tests are tests that make assumptions about the parameters of the population distribution from which a sample is drawn. We’ll begin with normality assessment using the quantile-quantile plot (also called the Q-Q plot), the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Then we’ll cover the t distribution briefly. Finally, we’ll look in detail at the one sample t-test, which is a standard parametric test.

You can download the data files for this tutorial here.

Test for Normality of Data

An assessment of the normality of data is a pre-requisite for many statistical tests, as normal distribution is an underlying assumption in parametric testing.

Normality can be assessed using two approaches, graphically and numerically. The graphical approach includes the box-whisker plot and the quantile-quantile (Q-Q) plot.

The box-whisker plot is used to assess symmetry more than normality, so for now we’ll focus on the Q-Q plot. The numerical or statistical approach to testing normality includes the Shapiro-Wilk and the Kolmogorov-Smirnov tests. Generally, the Shapiro-Wilk test is used for small samples and the Kolmogorov-Smirnov test for large ones.

Assessing Normality of Data using R

Let’s consider an example to assess normality in R. The data set here has two variables recorded for guests staying in a large hotel. The variables are the Customer satisfaction index, abbreviated as ‘csi’ and the total bill amount in thousands of Euros, abbreviated as ‘billamt’. The objective is to check whether the two variables follow a normal distribution based on a sample size of 80.

Here’s a snapshot of the data. The first column represents ‘id’. The second column is ‘csi’, which is the Customer Satisfaction Index, and the third column is ‘billamt’, which is the Total Bill Amount.

[Image: snapshot of the example data set]

We’ll examine the Q-Q plot in detail and use it on our example dataset. The Q-Q plot is a very powerful graphical technique for assessing normality. Quantiles are calculated from the sample data and plotted against the quantiles expected under a normal distribution. If the normality assumption is valid, a high correlation is expected between the sample quantiles and the expected quantiles, which are the theoretical quantiles under a normal distribution. The Y axis plots the actual quantile values based on the sample while the X axis plots the theoretical values. If the data are sampled from a normal distribution, the points in the Q-Q plot will fall close to a straight line.
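The construction described above can be sketched in a few lines of R. This is only an illustration on simulated data (the vector ‘x’ and the seed are assumptions, not the hotel data used in this tutorial). It sorts the sample, computes the theoretical normal quantiles with ‘qnorm(ppoints(n))’, and checks how strongly the two sets of quantiles are correlated.

```r
# Illustrative only: simulated data stands in for a real variable
set.seed(123)
x <- rnorm(80, mean = 50, sd = 10)    # hypothetical sample of size 80

n <- length(x)
sample_q      <- sort(x)              # sample quantiles (Y axis)
theoretical_q <- qnorm(ppoints(n))    # expected normal quantiles (X axis)

# Correlation between the two sets of quantiles; values near 1 suggest normality
qq_cor <- cor(theoretical_q, sample_q)
print(qq_cor)

# qqnorm() computes the same coordinates; qqline() adds a reference line
# through the first and third quartiles
qq <- qqnorm(x, col = "blue")
qqline(x, col = "red")
```

The reference line drawn by ‘qqline’ is optional, but it can make departures from linearity easier to spot.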

To assess the normality of our example dataset in R, we’ll first import the data using the ‘read.csv’ function, as discussed previously. The function for drawing the Q-Q plot is ‘qqnorm’, and we specify the variable inside the qqnorm call.

The colour of the plot is optional. In our example, ‘data$csi’ is the variable for which normality needs to be checked.

# Import data

 data<-read.csv("Normality Testing Data.csv", header=TRUE) 

# Q-Q plot for the variable csi

 qqnorm(data$csi,col="blue") 

data$csi is the variable for which normality is to be checked.

The Q-Q plot shows that the distribution of csi can be considered normal, since there is a linear pattern observed in the plot.

# Output:

[Figure: normal Q-Q plot of ‘csi’]

Interpretation: The Q-Q plot is linear. The distribution of ‘csi’ can be assumed to be normal.

Similarly, we can check the normality of the variable ‘billamt’ from our example data by using the same qqnorm function.

# Q-Q plot for the variable billamt

  qqnorm(data$billamt,col="blue")  

data$billamt is the variable for which normality is to be checked.

Here, we can see that the distribution of billamt appears to be non-normal, as there is a lot of deviation from linearity.

To conclude, the Q-Q plots suggest that the variable csi follows a normal distribution, whereas the variable billamt appears to be non-normal.

# Output:

[Figure: normal Q-Q plot of ‘billamt’]

Interpretation: The Q-Q plot deviates from linearity. The distribution of ‘billamt’ appears to be non-normal.

After assessing normality graphically, we’ll assess it using numerical, statistical methods. The Shapiro-Wilk test is a widely used statistical test for assessing normality. Here, the null hypothesis is that the sample is drawn from a normal population, whereas the alternate hypothesis is that the sample is drawn from a non-normal population. The test statistic is complicated, so we’ll avoid the actual calculation in this example. Intuitively, we can think of the test statistic as a correlation between the ordered sample values and the expected normal scores. We’ll reject the null hypothesis if the P value is less than the pre-defined significance level of 5%. If the null hypothesis is rejected, we conclude that the data do not come from a normal distribution.
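The decision rule can be illustrated with a short sketch on simulated data. The two samples below are assumptions made up for this illustration (one drawn from a normal population, one from a right-skewed exponential population) and are not the tutorial’s data.

```r
# Illustrative only: simulated samples, not the tutorial's data
set.seed(42)
normal_sample <- rnorm(80, mean = 70, sd = 8)   # drawn from a normal population
skewed_sample <- rexp(80, rate = 1 / 20)        # drawn from a right-skewed population

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# W is the test statistic; p.value drives the decision
print(c(W = unname(sw_normal$statistic), p = sw_normal$p.value))
print(c(W = unname(sw_skewed$statistic), p = sw_skewed$p.value))

# Decision rule at the 5% significance level: TRUE means H0 is rejected
alpha <- 0.05
reject_normality <- c(normal = sw_normal$p.value < alpha,
                      skewed = sw_skewed$p.value < alpha)
print(reject_normality)
```

A sample that really is normal will usually, though not always, give a P value above 0.05, since the test rejects a true null hypothesis about 5% of the time at this significance level.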

To use the Shapiro-Wilk test in R, there is a simple function, ‘shapiro.test’, which we apply to the variable whose normality needs to be assessed. In our example dataset, the variable of interest is ‘csi’. In the output, W is the value of the test statistic. The P value, 0.9038, is greater than 0.05, so we fail to reject the null hypothesis and the normality assumption can be considered reasonable.

# Shapiro Wilk test for the variable csi

  shapiro.test(data$csi)  

# Output:

[Output: Shapiro-Wilk normality test for ‘csi’, P value = 0.9038]

Similarly, we can apply the shapiro.test function to our second variable, ‘billamt’. Here the P value is less than 0.05, so we reject the null hypothesis and conclude that the distribution of billamt is non-normal.

The inference obtained from the Shapiro-Wilk test is consistent with that of the Q-Q plot.

# Shapiro Wilk test for the variable billamt

  shapiro.test(data$billamt)

# Output:

[Output: Shapiro-Wilk normality test for ‘billamt’, P value less than 0.05]

In addition to the Shapiro-Wilk test, there is another test used by researchers and academics, known as the Kolmogorov-Smirnov test. The null and alternate hypotheses are the same as for the Shapiro-Wilk test: H0 is that the sample is drawn from a normal population and H1 is that the sample is drawn from a non-normal population. The test statistic, however, is different. The Kolmogorov-Smirnov test compares the empirical cumulative distribution function of the sample with the cumulative distribution function of the normal distribution. The test statistic is the maximum absolute difference between these two cumulative distribution functions. Again, H0 is rejected if the P value is less than the pre-defined significance level, which is 0.05.
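To make the test statistic concrete, here is a minimal sketch that computes the maximum difference by hand and compares it with base R’s ‘ks.test’. The data are simulated and the variable names are assumptions for illustration. The normal distribution used for comparison takes its mean and standard deviation from the sample.

```r
# Illustrative only: simulated data, not the tutorial's data
set.seed(7)
x <- rnorm(80, mean = 70, sd = 8)

n <- length(x)
x_sorted <- sort(x)

# Normal CDF evaluated at each ordered value, using the sample mean and sd
f_normal <- pnorm(x_sorted, mean = mean(x), sd = sd(x))

# The empirical CDF jumps from (i - 1)/n to i/n at the i-th ordered value,
# so the largest gap is checked on both sides of each jump
i <- seq_len(n)
d_manual <- max(i / n - f_normal, f_normal - (i - 1) / n)

# Base R's ks.test() returns the same D statistic
d_builtin <- unname(ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic)
print(c(manual = d_manual, builtin = d_builtin))
```

Note that the P value reported by ‘ks.test’ is not adjusted for the mean and standard deviation having been estimated from the same sample. The Lilliefors version of the test exists to correct for this, so the sketch above is meant to explain D rather than to replace a proper normality test.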

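To use the Kolmogorov-Smirnov test in R, we need to install an additional package called ‘nortest’. After installing the package and loading it with the library function, we use the ‘lillie.test’ function to assess the normality of the variable of interest. This function performs the Lilliefors version of the Kolmogorov-Smirnov test, which accounts for the mean and standard deviation being estimated from the sample. In our case, we apply lillie.test to the ‘csi’ variable.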

# Install and use package ‘nortest’

 install.packages("nortest")
 
 library(nortest) 

The nortest package provides the lillie.test function, the Lilliefors (Kolmogorov-Smirnov) test for normality.

# Kolmogorov Smirnov test for the variable csi

  lillie.test(data$csi) 

In the output, D is the value of the test statistic and the P value is 0.9764, which is greater than 0.05. This means that we fail to reject H0, and therefore the distribution of the Customer Satisfaction Index can be assumed to be normal.

# Output:

[Output: Lilliefors (Kolmogorov-Smirnov) normality test for ‘csi’, P value = 0.9764]

We apply the same test for the second variable ‘billamt’. The P value as seen from the output is less than 0.05, which indicates that the distribution of billamt is non-normal.

# Kolmogorov Smirnov test for the variable billamt

 lillie.test(data$billamt) 

# Output:

[Output: Lilliefors (Kolmogorov-Smirnov) normality test for ‘billamt’, P value less than 0.05]

The t distribution is important in parametric testing, as the test statistic follows a t distribution in many cases. Like the normal distribution, the t distribution is symmetric and bell shaped, but its shape is slightly different, being a bit lower at the centre and wider in the tails. Here, we have plotted the t distribution for different sample sizes along with the normal distribution curve. As can be seen, all the t distribution curves are lower and wider than the blue normal distribution curve. As the sample size, and hence the degrees of freedom, increases, the t distribution approaches the normal distribution.

[Figure: t distribution curves for different sample sizes alongside the normal distribution curve]
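A comparison of this kind can be reproduced with a short sketch. The degrees of freedom used below (2, 5 and 30) are arbitrary choices for illustration and are not necessarily those used in the original figure.

```r
# Illustrative sketch: the degrees of freedom below are arbitrary choices
dfs <- c(2, 5, 30)

# Standard normal curve in blue
curve(dnorm(x), from = -4, to = 4, col = "blue", lwd = 2,
      xlab = "x", ylab = "Density")

# Overlay t densities with increasing degrees of freedom
for (k in seq_along(dfs)) {
  curve(dt(x, df = dfs[k]), from = -4, to = 4, add = TRUE, lty = k + 1)
}
legend("topright", legend = c("Normal", paste("t, df =", dfs)),
       col = c("blue", rep("black", length(dfs))),
       lty = seq_len(length(dfs) + 1), lwd = c(2, rep(1, length(dfs))))

# Peak heights (at 0) and tail densities (at 3) for the normal and each t curve
peak_heights <- c(normal = dnorm(0), sapply(dfs, function(d) dt(0, df = d)))
tail_density <- c(normal = dnorm(3), sapply(dfs, function(d) dt(3, df = d)))
print(round(peak_heights, 4))
print(round(tail_density, 4))
```

The printed values show the same pattern as the plot: every t curve has a lower peak and a heavier tail than the normal curve, and the gap narrows as the degrees of freedom grow.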

Degrees of Freedom

Let’s now look at degrees of freedom. In probability and statistics, degrees of freedom, abbreviated as ‘df’, is defined as the number of independent terms. In other words, degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. The ‘sum of the squared deviations from the mean of n values’ has n-1 degrees of freedom. Once n-1 of the deviations are known, the last one is determined, since the sum of deviations from the mean is always zero. Sampling distributions like the t distribution, the F distribution and the chi-square distribution are based on degrees of freedom.

Consider a crude example. Suppose you must give 5 numbers whose sum is 20. You can choose 4 numbers freely, but the fifth must be such that the sum of all the numbers is 20. Therefore, the degrees of freedom, df, is 4.
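Both ideas can be checked numerically. The values below are arbitrary and chosen only for illustration.

```r
# Illustrative values, chosen arbitrarily
x <- c(12, 15, 9, 20, 19)
n <- length(x)

# Deviations from the mean always sum to zero (up to floating-point rounding)
deviations <- x - mean(x)
print(sum(deviations))

# So the last deviation is fixed by the first n - 1
last_from_others <- -sum(deviations[1:(n - 1)])
print(c(actual = deviations[n], implied = last_from_others))

# The "five numbers that sum to 20" example: four free choices, one forced
free_choices <- c(3, 7, 1, 4)
fifth <- 20 - sum(free_choices)
print(fifth)
```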

Let’s now begin with the first standard parametric test known as the ‘One Sample T Test’. This test is used to test hypotheses about a single population mean. We assume that the data comes from a normal population and we can apply this test in various scenarios.

Case Study

In this case study, a large company is concerned about the time taken by its employees to complete their weekly MIS report. The objective is to check whether the average time taken by the employees to complete the MIS report is more than 90 minutes. The sample size is 12 and the variable of interest is ‘time’.

Here’s a snapshot of the data. There is only one variable in our dataset, the time taken to complete the MIS report.

[Image: snapshot of the ‘time’ data]

There are three important assumptions for the one sample t test.

First, the sample is drawn at random from a defined population. Second, the population is normally distributed. Third, the variable under study is continuous.

Normality can be checked using any of the methods explained earlier. The validity of the test is not seriously affected by moderate deviations from the normality assumption.

Let’s consider our example of the company whose objective is to test whether the average time taken by employees to complete their weekly report exceeds 90 minutes. Here, we test the mean against a test value, so our null hypothesis is that the mean equals 90, and our alternate hypothesis is that the mean is greater than 90. The test statistic is defined as the difference between the sample mean and the hypothesised mean, divided by the standard error of the mean, which is the sample standard deviation divided by the square root of the sample size. We reject the null hypothesis if the P value is less than the pre-defined significance level of 0.05.

This table helps us to understand how the test statistic is calculated. Our calculated value of the test statistic comes out to be 1.9176.

[Table: calculation of the test statistic, t = 1.9176]
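The calculation can be sketched as follows. The statistic is the sample mean minus the hypothesised mean, divided by the sample standard deviation over the square root of n. The twelve completion times below are hypothetical values invented for this illustration, not the tutorial’s data set, so the resulting t value will differ from 1.9176.

```r
# Hypothetical completion times in minutes (NOT the tutorial's data set)
time <- c(92, 88, 101, 95, 90, 97, 85, 103, 94, 99, 91, 96)
mu0  <- 90                        # hypothesised mean under H0

n      <- length(time)
x_bar  <- mean(time)
s      <- sd(time)                # sample standard deviation
se     <- s / sqrt(n)             # standard error of the mean
t_stat <- (x_bar - mu0) / se

# One-sided P value for H1: mean > 90, with n - 1 degrees of freedom
p_val <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# t.test() should reproduce the manual calculation
res <- t.test(time, alternative = "greater", mu = mu0)
print(c(t_manual = t_stat, t_builtin = unname(res$statistic)))
print(c(p_manual = p_val, p_builtin = res$p.value))
```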

Let’s now run the one sample t test in R. We’ll first import our data using the read.csv function. To perform the test, we use the ‘t.test’ function, which requires the variable of interest, the alternate hypothesis and the hypothesised mean to be tested. In our example, ‘data$time’ is the variable of interest. Since the company wants to know whether employees take more than 90 minutes, we specify ‘alternative="greater"’ in our command, and the value to be tested is mu=90, that is, 90 minutes.

# Import data

data<-read.csv("ONE SAMPLE t TEST.csv",header=TRUE) 

# t-test for one sample

t.test(data$time, alternative="greater", mu=90) 

data$time is the variable under study. alternative="greater" is specified because, under H1, the mean is tested for being greater than 90. mu=90 is the value to be tested.

In the output, t is the test statistic, whose value (t = 1.9176) is the same as that obtained from the manual calculation. The P value is 0.04, which is less than our pre-defined significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the average time taken to complete the weekly MIS report is more than 90 minutes.
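As a cross-check, the reported P value can be recovered from the test statistic and the degrees of freedom alone. This sketch uses only the figures quoted in this tutorial (t = 1.9176 and n = 12), and the variable names are assumptions for illustration.

```r
# Values taken from the tutorial's example: t = 1.9176 with n = 12
t_stat <- 1.9176
dof    <- 12 - 1

# One-sided P value for H1: mean > 90
p_val <- pt(t_stat, df = dof, lower.tail = FALSE)

# Equivalent decision via the critical value at the 5% level
t_crit <- qt(0.95, df = dof)

print(c(p_value = p_val, critical_value = t_crit))
print(t_stat > t_crit)    # TRUE means H0 is rejected at the 5% level
```

Comparing the statistic with the critical value and comparing the P value with 0.05 are two views of the same decision, so they always agree.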

# Output:

[Output: one sample t test, t = 1.9176, P value = 0.04]

Here’s a quick recap. In this tutorial, we learned different techniques to check the normality of a variable. We discussed the t distribution and degrees of freedom. Finally, we learned about the one sample t test and applied it to a real world example. This tutorial is taken from lessons in the Statistical Inference unit of Digital Schools Postgraduate Diploma in Data Science
