In an earlier tutorial we focused on measures of central tendency and variation. Now we’ll look at some other measures that are equally important for preliminary data analysis.
You can download the data files for this tutorial here.
We briefly mentioned that the median is a better measure of central tendency than the mean when the distribution of a variable is skewed. Here we’ll look at the concept of skewness in detail and also cover the concept of kurtosis.
Skewness refers to a lack of symmetry in the distribution of a data set. A symmetric bell curve has zero skewness; if the bulk of the curve is shifted to the left or to the right, so that one tail is longer than the other, the distribution is said to be skewed.
The three probability distributions depicted in our example are positively-skewed (or right-skewed), symmetric and negatively-skewed (or left-skewed).
The data on the right side of the curve may taper differently from the data on the left.
These taperings are known as “tails.” A negative skew refers to a longer tail on the left side of the distribution, while a positive skew refers to a longer tail on the right.
The mean of positively skewed data will typically be greater than the median, whereas for negatively skewed data the mean will typically be less than the median.
The mode of positively skewed data will typically be less than both the median and the mean, whereas for negatively skewed data the mode will typically be greater than both.
For a symmetric distribution with a single peak, the mean, median and mode are all the same.
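As a quick check of these relationships, the R sketch below compares the three measures on a small made-up sample and on its mirror image. The sample values and the helper `stat_mode` are illustrative assumptions and are not part of the tutorial's data; base R has no built-in function for the statistical mode, so we write a simple one.

```r
# Hypothetical helper: returns the most frequent value in a vector.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up sample with a long right tail, and its mirror image.
right_skewed <- c(1, 2, 2, 2, 3, 4, 6, 9, 14, 22)
left_skewed  <- -right_skewed

# Positive skew: mode (2) < median (3.5) < mean (6.5)
c(mode   = stat_mode(right_skewed),
  median = median(right_skewed),
  mean   = mean(right_skewed))

# Negative skew: mean (-6.5) < median (-3.5) < mode (-2)
c(mean   = mean(left_skewed),
  median = median(left_skewed),
  mode   = stat_mode(left_skewed))
```

Real data will not always follow this ordering exactly, so treat it as a typical pattern rather than a strict rule.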
There are various formulas to estimate skewness from sample data. One is based on the mean and the mode, whereas another is based on the mean and the median. There is also a formula based on the three quartiles. If the data is symmetric, then mean = median = mode and the median is at an equal distance from the upper and lower quartiles. Therefore, all of these formulas will give a zero value in the case of a symmetric distribution.
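These three approaches can be sketched in R as follows. We assume here that the formulas meant are the ones commonly known as Pearson's first coefficient (mean and mode), Pearson's second coefficient (mean and median) and Bowley's quartile coefficient; the tutorial does not name them, so the function names and sample values below are our own illustrative choices.

```r
# (mean - mode) / sd; the mode is passed in because base R has no mode function.
pearson_mode_skew <- function(x, mode_value) {
  (mean(x) - mode_value) / sd(x)
}

# 3 * (mean - median) / sd
pearson_median_skew <- function(x) {
  3 * (mean(x) - median(x)) / sd(x)
}

# (Q3 + Q1 - 2 * Q2) / (Q3 - Q1), using R's default quantile definition.
bowley_skew <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  (q[3] + q[1] - 2 * q[2]) / (q[3] - q[1])
}

# Made-up samples: one symmetric, one with a long right tail.
symmetric    <- c(1, 2, 3, 3, 3, 4, 5)
right_skewed <- c(1, 2, 2, 2, 3, 4, 6, 9, 14, 22)

# All three are zero for the symmetric sample.
pearson_mode_skew(symmetric, mode_value = 3)
pearson_median_skew(symmetric)
bowley_skew(symmetric)

# All three are positive for the right-skewed sample (its mode is 2).
pearson_mode_skew(right_skewed, mode_value = 2)
pearson_median_skew(right_skewed)
bowley_skew(right_skewed)
```

The three coefficients are on different scales, so their values should not be compared with one another directly; only the sign and the distance from zero are informative.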
Skewness based on the third moment
The most widely used measure of skewness is based on the third moment.
Any threshold or rule of thumb is arbitrary, but here’s one: if skewness is greater than 1.0 (or less than -1.0), the skewness is substantial and the distribution is far from symmetrical. A zero value indicates a symmetric distribution.
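To make the third-moment measure concrete, here is a minimal base R sketch. It assumes the plain moment form, skewness = m3 / m2^(3/2), where m2 and m3 are the second and third central moments computed with a divisor of n. The function name and sample values are hypothetical, and other variants of the formula apply small-sample adjustments that are not shown here.

```r
# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 3/2.
moment_skewness <- function(x) {
  x  <- x[!is.na(x)]            # drop missing values
  d  <- x - mean(x)             # deviations from the mean
  m2 <- mean(d^2)               # second central moment
  m3 <- mean(d^3)               # third central moment
  m3 / m2^1.5
}

# Made-up samples for illustration.
symmetric    <- c(1, 2, 3, 3, 3, 4, 5)
right_skewed <- c(1, 2, 2, 2, 3, 4, 6, 9, 14, 22)

moment_skewness(symmetric)       # zero: the sample is symmetric
moment_skewness(right_skewed)    # positive and above 1: substantial right skew
moment_skewness(-right_skewed)   # same size, opposite sign: left skew
```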
Like skewness, kurtosis is a statistical measure that is used to describe distributions. Kurtosis is traditionally defined as a measure of ‘peakedness’, although it also reflects how heavy the tails are, and it is generally measured relative to the normal distribution.
There are three categories of kurtosis that can be displayed by a set of data.
The first is mesokurtic: a mesokurtic distribution has kurtosis similar to that of a normal distribution. The second is leptokurtic: a leptokurtic distribution has greater kurtosis than a mesokurtic distribution. The third is platykurtic: a platykurtic distribution has less kurtosis than a mesokurtic distribution.
Here we show the formula used to calculate kurtosis.
If the resultant value is zero, it indicates that the distribution is mesokurtic. A value less than 0 indicates a platykurtic distribution and a value greater than 0 indicates a leptokurtic distribution. Note that this formula gives excess kurtosis, that is, kurtosis minus 3, where 3 is the kurtosis of a normal distribution.
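The interpretation above can be tried out with a short base R sketch. It assumes the plain moment form of excess kurtosis, m4 / m2^2 - 3, where m2 and m4 are the second and fourth central moments computed with a divisor of n. The function name and the two samples are made up for illustration.

```r
# Excess kurtosis: fourth central moment over the squared
# second central moment, minus 3 (the value for a normal distribution).
excess_kurtosis <- function(x) {
  x  <- x[!is.na(x)]   # drop missing values
  d  <- x - mean(x)    # deviations from the mean
  m2 <- mean(d^2)      # second central moment
  m4 <- mean(d^4)      # fourth central moment
  m4 / m2^2 - 3
}

# Evenly spread values: flat shape with short tails.
flat <- 1:9

# Most values at the centre with a few extreme ones: sharp peak, heavy tails.
peaked <- c(-10, -1, 0, 0, 0, 0, 0, 1, 10)

excess_kurtosis(flat)     # negative: platykurtic
excess_kurtosis(peaked)   # positive: leptokurtic
```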
Moments are a set of statistical parameters used to measure a distribution. Moments are the constants that help us to know the characteristics of a population and the graphic shape of a set of data. The moments about zero are called raw moments and the moments about the mean are called central moments.
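The difference between raw and central moments is easy to see in a few lines of base R. The function names and the sample below are our own illustrative choices; both functions use a divisor of n.

```r
# k-th raw moment: average of x^k (moment about zero).
raw_moment <- function(x, k) mean(x^k)

# k-th central moment: average of (x - mean)^k (moment about the mean).
central_moment <- function(x, k) mean((x - mean(x))^k)

# Made-up sample.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

raw_moment(x, 1)       # 5: the first raw moment is the mean
central_moment(x, 1)   # 0: deviations from the mean always average to zero
central_moment(x, 2)   # 4: the variance with divisor n

# The second central moment can be rebuilt from raw moments.
raw_moment(x, 2) - raw_moment(x, 1)^2   # 29 - 25 = 4
```

The third and fourth central moments obtained this way are the building blocks of the skewness and kurtosis measures discussed in this tutorial.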
To summarise, we can say that if skewness is less than −1 or greater than +1, the distribution is highly skewed.
If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed, and if it is between −½ and +½, the distribution is approximately symmetric.
Similarly, a distribution with kurtosis approximately equal to 3 (that is, excess kurtosis approximately equal to 0) is called mesokurtic. A value of kurtosis less than 3 indicates a platykurtic distribution and a value greater than 3 indicates a leptokurtic distribution.
A normal distribution is a mesokurtic distribution.
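These rules of thumb can be written as two small R helpers. This is only a sketch: the function names are hypothetical, the handling of values that fall exactly on a boundary is our own choice, and the tolerance `tol` used to decide what counts as "approximately" mesokurtic is an arbitrary assumption that the tutorial does not specify.

```r
# Label a skewness value using the -1, -1/2, +1/2, +1 cut-offs.
describe_skewness <- function(s) {
  if (abs(s) > 1) {
    "highly skewed"
  } else if (abs(s) >= 0.5) {
    "moderately skewed"
  } else {
    "approximately symmetric"
  }
}

# Label an excess kurtosis value (kurtosis minus 3).
# tol is an assumed tolerance around zero, not a standard value.
describe_kurtosis <- function(excess, tol = 0.1) {
  if (abs(excess) <= tol) {
    "mesokurtic"
  } else if (excess < 0) {
    "platykurtic"
  } else {
    "leptokurtic"
  }
}

describe_skewness(1.59)    # "highly skewed"
describe_skewness(-0.7)    # "moderately skewed"
describe_skewness(0.2)     # "approximately symmetric"
describe_kurtosis(0.05)    # "mesokurtic"
describe_kurtosis(-1.2)    # "platykurtic"
describe_kurtosis(4)       # "leptokurtic"
```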
Let’s now look at a case study in order to strengthen our understanding.
Our objective is to describe the variables present in the data of 100 retailers in the platinum segment of FMCG companies.
The sample size is 100 and the dataset has the following variables: Retailer, Zone, Retailer_Age, Perindex, Growth and NPS_Category.
Here’s a snapshot of the data.
Each row of data is information about one retailer with a unique Retailer ID.
Zone, Retailer Age and NPS category are categorical variables whereas performance index and growth are numeric continuous variables. NPS stands for net promoter score and indicates loyalty to the company.
After importing the data using the read.csv function, we install and load the e1071 package to find skewness and kurtosis.
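We use the skewness function from e1071 with the argument type=2 to obtain skewness based on the moments formula, and the kurtosis function with the argument type=2 to obtain kurtosis based on the moments formula. With type=2, both functions apply a sample-size adjustment to the plain moment estimates (the same variant reported by SAS and SPSS), and the kurtosis function returns excess kurtosis.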
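A minimal sketch of these calls is shown below. Because the data file is not reproduced here, the code builds a simulated stand-in data frame with the same column names Perindex and Growth; the simulated values, the object name `retail` and the file name in the commented-out `read.csv` line are all assumptions, so the numbers it prints will not match the results discussed in this tutorial.

```r
# install.packages("e1071")   # run once if the package is not installed

# With the real file you would import the data first, for example:
# retail <- read.csv("retail_data.csv")   # hypothetical file name

# Simulated stand-in data (NOT the tutorial's data set).
set.seed(1)
retail <- data.frame(
  Perindex = rnorm(100, mean = 60, sd = 10),   # roughly symmetric
  Growth   = rexp(100, rate = 1 / 5)           # long right tail
)

# Run the e1071 functions only if the package is available.
if (requireNamespace("e1071", quietly = TRUE)) {
  growth_skew <- e1071::skewness(retail$Growth, type = 2)
  growth_kurt <- e1071::kurtosis(retail$Growth, type = 2)   # excess kurtosis
  perindex_skew <- e1071::skewness(retail$Perindex, type = 2)
  perindex_kurt <- e1071::kurtosis(retail$Perindex, type = 2)

  print(c(Growth_skewness = growth_skew, Growth_kurtosis = growth_kurt))
  print(c(Perindex_skewness = perindex_skew, Perindex_kurtosis = perindex_kurt))
}
```

Both functions also accept `na.rm = TRUE`, which is useful when a column contains missing values.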
Here we can see that the skewness for the Growth variable is 1.59, indicating a positively skewed distribution. The kurtosis value is also very high, indicating that the peakedness of the distribution curve is far from that of a normal distribution.
Interestingly, when skewness is computed separately for each zone, the value for the NORTH zone is very close to zero, indicating an approximately symmetric distribution.
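A by-zone comparison of this kind can be sketched with `tapply` in base R. Everything in the example is a stand-in: the data are simulated, the zone labels other than NORTH are assumed, and `moment_skewness` is our own helper using the plain moment formula, so the output will not reproduce the values reported above.

```r
# Plain moment-based skewness (hypothetical helper, divisor n).
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Simulated stand-in data: 25 retailers in each of four zones.
# Zone names other than NORTH are assumptions.
set.seed(42)
retail <- data.frame(
  Zone   = rep(c("NORTH", "SOUTH", "EAST", "WEST"), each = 25),
  Growth = rexp(100, rate = 1 / 5)
)

# Apply the skewness function to Growth within each zone.
skew_by_zone <- tapply(retail$Growth, retail$Zone, moment_skewness)
round(skew_by_zone, 2)
```

With e1071 installed, the same call works with `function(x) e1071::skewness(x, type = 2)` in place of the helper.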
When are skewness and kurtosis applied?
We use skewness and kurtosis to assess how far a distribution is from a normal distribution. Skewness is more commonly used and reported, because the choice of statistical measures often depends on the degree of skewness. Kurtosis is used less often, but is very helpful in assessing the distribution of the variables under study.
To recap, in this tutorial we studied what skewness and kurtosis are and the different formulas used to measure them.
We then used the e1071 package to obtain measures of skewness and kurtosis in R.