Before diving into descriptive statistics in R, we will first look at the different sources and types of data and then focus on data measurement scales. To summarise data, we will study various measures of central tendency and measures of variation.
You can download the data files for this tutorial here.
There are two main sources of data: primary and secondary.
Primary data sources include information collected and processed directly by the researcher, such as data collected through surveys and interviews. Secondary data sources include information retrieved through preexisting sources such as Census data being used to study the impact of education on income.
Types of Data
In general there are two types of data, structured data and unstructured data.
Structured data is stored in a standardized format for providing information. It is usually stored in well-defined schemas such as databases. It is generally tabular with columns and rows that clearly define its attributes.
Unstructured data is not organised in a pre-defined manner. Examples include emails, tweets and blogs.
Measurement scales are used to measure variables in statistics. There are four types of measurement scales – nominal, ordinal, interval and ratio scale.
Nominal scale: It is applied to qualitative data where the objects or items are classified into distinct groups or categories according to the characteristic under study. Even if the categories are coded numerically, the order of the values has no meaning. Examples are Location and Gender.
Ordinal scale: It is applied to data that are rank ordered. The categories of the characteristic have a logical, ordered relationship, but the ranks indicate only which category is higher or better, not by how much. An example is rating the features of a product on a scale of 1 to 5. Here the order of values is meaningful.
Interval scale: It provides a more powerful measure than an ordinal scale. It allows us not only to rank order the items that are measured, but also to measure the difference between them. An example is temperature measured in degrees Celsius.
Ratio scale: In addition to all the properties of an interval scale, a ratio scale has an identifiable true zero point. Examples of ratio scales are physical dimensions such as weight, height and distance.
Let’s look at an example to make the concept clearer.
We can observe that gender and region are measured on a nominal scale, age is measured on a ratio scale as age has a true zero point, and satisfaction level is measured on an ordinal scale.
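The four scales map naturally onto R data types. The following is a minimal sketch, assuming a small survey data frame whose values are invented for illustration: nominal variables are stored as unordered factors, the ordinal variable as an ordered factor, and the ratio variable as a plain numeric vector.

```r
# Hypothetical survey responses, invented for illustration only
survey <- data.frame(
  gender = factor(c("F", "M", "F", "M")),                  # nominal
  region = factor(c("North", "South", "East", "North")),   # nominal
  age    = c(34, 27, 45, 52),                              # ratio (true zero)
  satisfaction = factor(c("Low", "High", "Medium", "High"),
                        levels = c("Low", "Medium", "High"),
                        ordered = TRUE)                    # ordinal
)

# Order comparisons are meaningful only for the ordered factor
survey$satisfaction > "Low"
# [1] FALSE  TRUE  TRUE  TRUE

is.ordered(survey$satisfaction)   # TRUE
is.ordered(survey$gender)         # FALSE
```

Declaring the level order explicitly matters: without the levels argument, R would sort the labels alphabetically (High, Low, Medium), which is not the intended ranking.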
Measures of Central Tendency
Central tendency is a descriptive summary of a dataset through a single value that reflects the center of the data distribution. The three most widely used measures of central tendency are mean, median and mode.
The mean is defined as the sum of all values of the variable divided by the total number of values. The median is the middle value when the data are arranged in order: if N is odd, it is the single middle value, and if N is even, it is the average of the two middle values. The mode is the most frequently occurring observation in a data set.
Calculating Mean, Median and Mode
The formulae for calculating mean, median and mode are simple, but let’s quickly revise them.
In our example, the mean of the marks of 12 students is obtained by adding all the marks and dividing the total by 12. Here, the mean is 14.83.
To find the median, we first arrange the data in ascending order.
Since the number of observations is 12, an even number, the median is the average of the two middle values, 16 and 17. Therefore the median equals 16.5.
17 is the most frequently occurring value. Therefore the mode equals 17.
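The same three calculations can be sketched in R. The marks below are hypothetical (they are not the tutorial's data, so the results differ from the figures above); the point is only to show each definition next to the corresponding built-in function.

```r
# Hypothetical marks of 12 students, invented for illustration
marks <- c(12, 15, 17, 9, 17, 20, 14, 17, 11, 18, 16, 14)
n <- length(marks)

# Mean: sum of all values divided by the number of values
mean_by_hand <- sum(marks) / n
mean(marks)
# [1] 15

# Median: n is even, so average the two middle values of the sorted data
sorted <- sort(marks)
median_by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
median(marks)
# [1] 15.5

# Mode: the value with the highest frequency in the frequency table
freq <- table(marks)
mode_value <- as.numeric(names(freq)[as.vector(freq) == max(freq)])
mode_value
# [1] 17
```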
The trimmed mean
A trimmed mean is a method of averaging that removes a small specified percentage of the largest and smallest values before calculating the mean. Using a trimmed mean helps reduce the influence of outliers. Typically, 5% of the data points at each end are excluded. Note that the trimmed mean gives an accurate estimate of the center when the underlying distribution is symmetric.
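As a rough illustration of how trimming dampens an outlier, the sketch below uses a hypothetical vector of ten values with one extreme observation. It relies on the trim argument of base R's mean(), which drops floor(n * trim) observations from each end of the sorted data.

```r
# Hypothetical values with one extreme observation (95)
x <- c(10, 12, 13, 14, 15, 15, 16, 17, 18, 95)

mean(x)               # ordinary mean, pulled upwards by the outlier
# [1] 22.5

mean(x, trim = 0.10)  # drops 1 value from each end (10% of 10 values)
# [1] 15

median(x)             # for comparison
# [1] 15
```

With only ten observations, trim = 0.05 would remove nothing, because floor(10 * 0.05) is 0; the 5% convention is more useful on larger samples.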
The best measure of central tendency
When data is measured on a nominal scale, the mode is the only measure of central tendency that can be calculated. For data measured on an ordinal scale, the best measure of central tendency is the median. The mean is appropriate when the distribution is symmetric and the measurement scale is interval or ratio.
For a skewed (that is, not symmetric) distribution, the mean is generally not at the center and the median is a better measure of central tendency.
Measures of Variation
While measures of central tendency are used to estimate the central value of a dataset, measures of dispersion are important for describing the spread of data.
Two data sets can have an equal mean (that is, measure of central tendency) but vastly different variability. Take our example of two cricketers, where both batsmen have the same average score, but the spread around the mean is different.
The most commonly used measures of variation are the range, the interquartile range and the standard deviation.
Range, Interquartile Range and Standard Deviation
The range is defined as the difference between the highest and lowest values in a dataset. The disadvantage of defining range as a measure of dispersion is that it does not take into account all values for calculation.
The interquartile range is defined as the difference between the third quartile, denoted by Q3, and the first quartile, denoted by Q1. 75% of observations lie below the third quartile and 25% of observations lie below the first quartile.
Variance is defined as the sum of squares of deviations from the mean, divided by the total number of observations. The standard deviation is the positive square root of the variance. The standard deviation is preferred instead of variance as it has the same units as the original values.
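The definition above divides by the number of observations n. R's built-in var() and sd() divide by n - 1 instead (the sample versions), so the two can differ noticeably on small data sets. A minimal sketch on hypothetical values:

```r
# Hypothetical values, invented for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Variance as defined in the text: squared deviations divided by n
pop_var <- sum((x - mean(x))^2) / n
pop_sd  <- sqrt(pop_var)
pop_var
# [1] 4
pop_sd
# [1] 2

# Built-in functions divide by n - 1
var(x)
sd(x)

# Rescaling the built-in result recovers the divide-by-n version
var(x) * (n - 1) / n
# [1] 4
```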
Calculating range, interquartile range and Standard Deviation
In our example of the marks of 12 students in an examination, the range is the difference between the highest and lowest observations. The highest observation is 20 and the lowest is 8. Therefore the range is 12.
To obtain the interquartile range, we first arrange the data in ascending order. The third quartile is the 3(n/4)th value, that is, the 9th value, which is 18. The first quartile is the (n/4)th value, that is, the 3rd value, which is 11. Therefore the interquartile range is 18 - 11 = 7.
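The positional rule used here can be reproduced in R, with one caveat: quantile() and IQR() default to an interpolating method (type = 7), which can give different quartiles from the hand calculation. The sketch below uses hypothetical marks, not the tutorial's data, and assumes n is divisible by 4 so that n/4 is a whole position; in that case type = 1 matches the positional rule.

```r
# Hypothetical marks of 12 students, invented for illustration
marks  <- c(12, 15, 17, 9, 17, 20, 14, 17, 11, 18, 16, 14)
sorted <- sort(marks)
n      <- length(sorted)

# Positional rule from the text (valid here because n is divisible by 4)
q1_hand  <- sorted[n / 4]        # 3rd value
q3_hand  <- sorted[3 * n / 4]    # 9th value
iqr_hand <- q3_hand - q1_hand
iqr_hand
# [1] 5

# type = 1 picks actual data values and agrees with the hand calculation
quantile(marks, c(0.25, 0.75), type = 1)
IQR(marks, type = 1)
# [1] 5

# The default (type = 7) interpolates between neighbouring values
quantile(marks, c(0.25, 0.75))
IQR(marks)
# [1] 3.5
```

Neither answer is wrong; they are different conventions for estimating quartiles, and the gap shrinks as the sample grows.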
Variance is obtained by adding the squared deviations from the mean and dividing the sum by n. For our example, n is 12.
Therefore the variance equals 15.47 and the standard deviation is the positive square root of 15.47, which equals 3.93.
The Coefficient of Variation
If we want to compare the variation in two sets of data, then the coefficient of variation should be used and not variance or standard deviation.
The coefficient of variation is a relative measure of variation, whereas standard deviation is an absolute measure of variation.
The coefficient of variation is computed as standard deviation divided by the mean and then expressed as a percentage.
A higher value of coefficient of variation implies more variation in our data.
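A small helper makes the calculation concrete. This is a sketch only: cv_percent is a hypothetical name, the two score vectors are invented, and sd() uses the n - 1 denominator, so the result is the sample version of the coefficient of variation.

```r
# Hypothetical helper: coefficient of variation as a percentage
cv_percent <- function(x, na.rm = FALSE) {
  100 * sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Two invented data sets with the same mean but different spread
set_a <- c(60, 65, 70, 75, 80)
set_b <- c(20, 45, 70, 95, 120)

mean(set_a)   # 70
mean(set_b)   # 70

cv_percent(set_a)   # smaller value: less relative variation
cv_percent(set_b)   # larger value: more relative variation
```

Because the coefficient of variation is unit-free, the same helper can compare variables measured on different scales, provided the mean is positive and not close to zero.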
Case Study 1
Let’s consider a simple example in which runs scored by two batsmen are recorded. Our objective is to compare the performance of two batsmen using the measures of central tendency and the measure of variation.
We can observe that the mean for both batsmen is 70 but the coefficient of variation for batsman A is 13.97% and for batsman B it is 57.32%.
We can see that variability in performance of Batsman B is more than that of Batsman A. Hence, we can infer that Batsman A is a more consistent performer than Batsman B.
Case study 2
Our next objective is to describe the variables present in the data of 100 retailers in the platinum segment of an FMCG company.
The sample size is 100 and the dataset has the following variables: Retailer, Zone, Retailer_Age, Perindex, Growth and NPS_Category
This is a snapshot of the dataset. Each row of data is information about one retailer with a unique Retailer ID.
Zone, Retailer Age and NPS category are categorical variables, whereas performance index and growth are numeric continuous variables. The NPS stands for net promoter score and indicates loyalty to the company.
We use the read.csv function to import the dataset.
To obtain a summary of the variables we use the summary function. For a continuous or numeric variable, the summary function reports the minimum, 1st quartile, median, mean, 3rd quartile, maximum and the count of missing values (NAs), if any.
retail_data <- read.csv("Retail_Data.csv", header = TRUE)
# Checking the variable features using summary function
summary(retail_data)
Understanding Data Through Visualization
A boxplot shows the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
The boxplot function is used in R to draw a boxplot.
Here we can see that the Perindex variable is distributed symmetrically, whereas the Growth variable is positively skewed. We'll look at skewness in detail in the next session.
# Boxplots for Perindex & Growth Variables
boxplot(retail_data$Perindex, main = "BoxPlot (Perindex)", ylab = "Perindex", col = "darkorange")
boxplot(retail_data$Growth, main = "BoxPlot (Growth)", ylab = "Growth", col = "darkorange")
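The numbers behind a boxplot can also be inspected without drawing it. The sketch below uses base R's boxplot.stats() on a hypothetical right-skewed vector (not the retail data) to list the five plotted statistics and the points flagged as outliers.

```r
# Hypothetical right-skewed values with one extreme observation
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 30)

stats <- boxplot.stats(x)   # the statistics that boxplot() draws

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats
# [1] 2 3 4 5 6

# Points lying more than 1.5 times the box length beyond the hinges
stats$out
# [1] 30

# In a positively skewed sample the mean is pulled above the median
mean(x) > median(x)
# [1] TRUE
```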
Measures of central tendency in R
The mean function is used to obtain the mean of a variable and the median function is used to obtain the median of a variable.
If the data contains missing values, the argument na.rm = TRUE should be defined in the mean function.
Here we prefer the mean for the Perindex variable, which is symmetrically distributed, and the median for the Growth variable, which is positively skewed.
# Mean for Perindex & Growth Variables
mean(retail_data$Perindex)
# [1] NA
mean() returns the mean of a variable. The result here is NA because Perindex contains missing values.
mean(retail_data$Perindex, na.rm = TRUE)
# [1] 70.49697
Setting na.rm = TRUE excludes the missing values from the calculation.
mean(retail_data$Growth, na.rm = TRUE)
# [1] 5.1528
# Median for Perindex & Growth Variables
median(retail_data$Perindex, na.rm = TRUE)
# [1] 71.15
median() returns the median of a variable.
median(retail_data$Growth, na.rm = TRUE)
# [1] 4.495
The trim argument of the mean function is used to obtain the trimmed mean. It specifies the fraction of observations to be excluded on each side. For example, to exclude 10% of observations from each end, specify trim = 0.1.
# Trimmed Mean
trimmed_mean_PI <- mean(retail_data$Perindex, trim = 0.10, na.rm = TRUE)
trimmed_mean_PI
# [1] 70.5842
Setting trim = 0.10 in mean() excludes 10% of the observations from each end of the sorted data before averaging.
trimmed_mean_G <- mean(retail_data$Growth, trim = 0.10, na.rm = TRUE)
trimmed_mean_G
# [1] 4.825
There is no standard function in R to obtain the mode. You can use the table function to create a frequency table and then extract the value that has the highest frequency.
# Measure of Central Tendency for Categorical Variable
# Mode using Frequency Table
freq <- table(retail_data$Zone)
freq
#  East North South  West
#    15    25    32    28
table() returns the frequency count of each value of the variable. Here South, with 32 retailers, is the mode.
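The extraction step can be wrapped in a small helper. This is a sketch under stated assumptions: stat_mode is a hypothetical name (base R's own mode() does something unrelated, reporting an object's storage mode), and the zone vector is rebuilt from the counts in the frequency table above rather than read from the data file.

```r
# Hypothetical helper: return the most frequent value(s) of a vector
stat_mode <- function(x) {
  freq <- table(x)                       # NAs are dropped by default
  names(freq)[as.vector(freq) == max(freq)]
}

# Zone values rebuilt from the frequency table shown above
zone <- rep(c("East", "North", "South", "West"),
            times = c(15, 25, 32, 28))

stat_mode(zone)
# [1] "South"

# When several values tie for the highest frequency, all are returned
stat_mode(c("a", "b", "a", "b", "c"))
# [1] "a" "b"
```

The helper returns character labels because it reads them from the table names; wrap the result in as.numeric() when the input is numeric.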
Measures of Dispersion in R
The range function returns the lowest and highest values of a dataset. To obtain the difference between them, the diff function is applied to the result.
The IQR function is used to obtain the interquartile range in R.
# Range, Difference & Inter Quartile Range
r_PI <- range(retail_data$Perindex, na.rm = TRUE)
r_PI
# [1] 46.53 92.49
range() returns the minimum and maximum values of a variable.
r_G <- range(retail_data$Growth, na.rm = TRUE)
r_G
# [1]  1.47 17.50
diff(r_PI)
# [1] 45.96
diff() returns the differences between consecutive elements of a vector. Applied to the output of range(), it gives the maximum minus the minimum.
diff(r_G)
# [1] 16.03
IQR(retail_data$Perindex, na.rm = TRUE)
# [1] 12.095
IQR() returns the interquartile range of a variable.
IQR(retail_data$Growth, na.rm = TRUE)
# [1] 3.2825
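To obtain the standard deviation, the sd function is used. The var function is used to obtain the variance. Note that both functions divide by n - 1 rather than n (the sample versions), so their results differ slightly from the definition given earlier.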
There’s no standard function in R to obtain a coefficient of variation. We obtain the standard deviation and the mean using the sd and mean functions respectively, and then the coefficient of variation is obtained in R by dividing standard deviation by mean.
# Standard Deviation/ Variance
sd(retail_data$Perindex, na.rm = TRUE)
# [1] 9.569232
sd() returns the standard deviation of a variable.
sd(retail_data$Growth)
# [1] 2.620525
var(retail_data$Perindex, na.rm = TRUE)
# [1] 91.5702
var() returns the variance of a variable.
var(retail_data$Growth)
# [1] 6.867152
# Coefficient of Variation
cv_PI <- sd(retail_data$Perindex, na.rm = TRUE) / mean(retail_data$Perindex, na.rm = TRUE)
cv_PI
# [1] 0.1357396
There is no standard function for the coefficient of variation in R, so we calculate it from its definition. Multiply the result by 100 to express it as a percentage.
cv_G <- sd(retail_data$Growth) / mean(retail_data$Growth)
cv_G
# [1] 0.5085633
To recap, in this tutorial we studied various measures of central tendency and dispersion and how to describe our data using these measures in R. This tutorial is based on lessons from the Data Analytics in R unit of the Postgraduate Diploma in Data Science.