K-Nearest Neighbors – a simple example using R and Python
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.
KNN stores all available training cases and classifies a new case (or, for regression, predicts its value) from the stored cases most similar to it under a chosen similarity measure. Here we look at a simple example using both R and Python.
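To make the idea concrete, here is a minimal sketch of the classification rule in plain Python. It assumes Euclidean distance and a simple majority vote; the function names and the toy points are hypothetical and are not taken from the bank loan data.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two points of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    # Sort the stored cases by distance to the query and keep the k closest
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: euclidean(pair[0], query))[:k]
    # Count the labels of those neighbours and return the most frequent one.
    # Ties are resolved by first occurrence here, which is a simplification.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups labelled 0 and 1
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
pred_b = knn_predict(train_X, train_y, (4.0, 4.0), k=3)
print(pred_a, pred_b)
```

Nothing is fitted in advance: all the work happens at prediction time, which is why KNN is often described as a lazy learner.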
Data Description: A bank possesses demographic and transactional data of its loan customers. If the bank has a robust model to predict defaulters, it can undertake better resource allocation.
Objective: To predict whether a customer applying for a loan will be a defaulter.
KNN in R:
Importing data and removing unwanted variables
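# Read the data; the first row of the file holds the column names
bankloan<-read.csv("BANK LOAN KNN.csv",header=TRUE)
# Keep only the predictors: drop AGE, SN and the target DEFAULTER
bankloan2<-subset(bankloan,select=c(-AGE,-SN,-DEFAULTER))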
head(bankloan2)
## EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT
## 1 17 12 9.3 11.36 5.01
## 2 2 0 17.3 1.79 3.06
## 3 12 11 3.6 0.13 1.24
## 4 3 4 24.4 1.36 3.28
## 5 24 14 10.0 3.93 2.47
## 6 6 9 16.3 1.72 3.01
Scaling variables
bankloan3<-scale(bankloan2)
head(bankloan3)
## EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT
## 1 1.5656796 0.6216799 -0.2881684 3.8774339687 0.51519694
## 2 -0.8239988 -1.1852951 0.7889154 0.0289356115 -0.02571385
## 3 0.7691201 0.4710987 -1.0555906 -0.6386200074 -0.53056393
## 4 -0.6646869 -0.5829701 1.7448273 -0.1439854223 0.03531198
## 5 2.6808628 0.9228424 -0.1939235 0.8895193612 -0.18937404
## 6 -0.1867512 0.1699362 0.6542799 0.0007856758 -0.03958336
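Scaling matters for KNN because distances are otherwise dominated by whichever variable has the largest numeric range. As a rough illustration of what column-wise standardisation does, the sketch below subtracts each column's mean and divides by its sample standard deviation (n - 1 in the denominator), which is assumed here to match the default behaviour of scale(). The helper name and the toy rows are hypothetical.

```python
import statistics

def zscore_columns(rows):
    # Work column by column: compute each column's mean and sample sd
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # sample sd, n - 1 denominator
    # Replace every value by (value - column mean) / column sd
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in rows]

# Hypothetical toy rows: the second column is on a much larger scale
toy = [(2.0, 100.0), (4.0, 300.0), (6.0, 500.0)]
scaled = zscore_columns(toy)
for row in scaled:
    print(row)
```

After standardisation both columns have mean 0 and standard deviation 1, so each contributes comparably to a Euclidean distance. Note that scikit-learn's StandardScaler divides by the population standard deviation (n in the denominator), so its values can differ slightly from those of scale() on the same data.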
Creating training and testing data sets
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
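# Randomly pick about 70% of the rows for training; without set.seed() the selected rows differ between runs
index<-createDataPartition(bankloan$SN,p=0.7,list=FALSE)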
head(index)
## Resample1
## [1,] 3
## [2,] 4
## [3,] 5
## [4,] 7
## [5,] 8
## [6,] 10
traindata<-bankloan3[index,]
testdata<-bankloan3[-index,]
dim(traindata)
## [1] 273 5
dim(testdata)
## [1] 116 5
Creating class vectors
Ytrain<-bankloan$DEFAULTER[index]
Ytest<-bankloan$DEFAULTER[-index]
KNN classification (Continuous predictors)
knn() in package “class” performs k-nearest neighbour classification of a test set using a training set. Distance is Euclidean, and each test case is assigned the class chosen by majority vote among its k nearest training cases, with ties broken at random.
library(class)
# Classify each test row by its 20 nearest training rows; model holds the predicted classes
model<-knn(traindata,testdata,k=20,cl=Ytrain)
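With two classes and an even k such as 20, the vote can end in a tie, which is why the tie-breaking rule matters. The sketch below illustrates majority voting with random tie-breaking in plain Python. It is an illustration of the rule described above, not the internal code of the class package; the function name and the seed are arbitrary choices.

```python
import random
from collections import Counter

def majority_vote(labels, rng):
    # Count how many neighbours carry each label
    counts = Counter(labels)
    top = max(counts.values())
    # Collect every label that reaches the highest count
    tied = sorted(label for label, n in counts.items() if n == top)
    # One winner: return it; several winners: pick one at random
    return rng.choice(tied)

rng = random.Random(42)  # arbitrary seed so the sketch is repeatable

clear = majority_vote([1, 1, 0, 1], rng)  # three votes against one
tied_results = {majority_vote([0, 1, 0, 1], rng) for _ in range(200)}
print(clear, tied_results)
```

Over repeated tied votes both labels are returned, so predictions for borderline cases can change from run to run unless a seed is fixed.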
KNN in Python:
The same bank loan data is used here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score,roc_curve, roc_auc_score
Importing data and removing unwanted variables
bankloan = pd.read_csv("BANK LOAN KNN.csv")
bankloan1 = bankloan.drop(['SN','AGE'], axis = 1)
bankloan1.head()
## EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 0 17 12 9.3 11.36 5.01 1
## 1 2 0 17.3 1.79 3.06 1
## 2 12 11 3.6 0.13 1.24 0
## 3 3 4 24.4 1.36 3.28 1
## 4 24 14 10.0 3.93 2.47 0
Creating training and testing data sets
X = bankloan1.loc[:,bankloan1.columns != 'DEFAULTER']
y = bankloan1.loc[:, 'DEFAULTER']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,
                                                    random_state=999)
Preparing/Scaling variables
scaler = StandardScaler()
scaler.fit(X_train)
## StandardScaler(copy=True, with_mean=True, with_std=True)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Building the KNN Classifier (Continuous Predictors)
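KNeighborsClassifier() from sklearn.neighbors fits a k-nearest neighbour classifier on the training data, which can then be used to classify the test set.
# A common rule of thumb: set k to the square root of the number of observations (20 here)
KNNclassifier = KNeighborsClassifier(n_neighbors=int(np.sqrt(len(X)).round()))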
KNNclassifier.fit(X_train, y_train)
## KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
## metric_params=None, n_jobs=None, n_neighbors=20, p=2,
## weights='uniform')
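Once the classifier is fitted, the usual next step is to predict the test set and compare the predictions with the true labels using the metrics imported earlier. The sketch below shows one way to do this. Because the bank loan file is not available here, it uses a synthetic data set from make_classification as a stand-in (389 rows and 5 predictors, chosen only to resemble the data's shape); the variable names ending in _demo are hypothetical and the printed numbers say nothing about the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

# Synthetic stand-in for the bank loan data (assumption, not the real file)
X_demo, y_demo = make_classification(n_samples=389, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=999)

# Same 70/30 split and seed as in the tutorial code
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.30, random_state=999)

# Fit the scaler on the training rows only, then apply it to both sets
scaler_demo = StandardScaler().fit(X_tr)
X_tr = scaler_demo.transform(X_tr)
X_te = scaler_demo.transform(X_te)

# k from the square-root rule of thumb used above
k = int(np.sqrt(len(X_demo)).round())
clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)

# Predict the held-out rows and summarise the results
y_pred = clf.predict(X_te)
cm = confusion_matrix(y_te, y_pred)   # rows: true class, columns: predicted class
acc = accuracy_score(y_te, y_pred)
print(cm)
print("accuracy :", acc)
print("precision:", precision_score(y_te, y_pred, zero_division=0))
print("recall   :", recall_score(y_te, y_pred, zero_division=0))
```

On the real data the same calls would be KNNclassifier.predict(X_test) followed by the metric functions applied to y_test and the predictions. For a default model, recall on the defaulter class is often of more interest than overall accuracy, since missed defaulters tend to be the costlier error.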
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.