The k-nearest-neighbour classifier is one of the simplest classifiers to use and, because it needs no explicit training step, it is often applied to data sets that change frequently. The example below shows how easy it is to classify credit-worthy vs credit-risk customers:
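The chunk that loads the data is not shown here. A minimal sketch, assuming the German Credit data has been exported to a CSV file called german_credit.csv (the file name is an assumption) with the column names used below:

gc <- read.csv("german_credit.csv")   # hypothetical file name; one row per customer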
head(gc)
## Default checkingstatus1 duration history purpose amount savings employ
## 1 0 A11 6 A34 A43 1169 A65 A75
## 2 1 A12 48 A32 A43 5951 A61 A73
## 3 0 A14 12 A34 A46 2096 A61 A74
## 4 0 A11 42 A32 A42 7882 A61 A74
## 5 1 A11 24 A33 A40 4870 A61 A73
## 6 0 A14 36 A32 A46 9055 A65 A73
## installment status others residence property age otherplans housing
## 1 4 A93 A101 4 A121 67 A143 A152
## 2 2 A92 A101 2 A121 22 A143 A152
## 3 2 A93 A101 3 A121 49 A143 A152
## 4 2 A93 A103 4 A122 45 A143 A153
## 5 3 A93 A101 4 A124 53 A143 A153
## 6 2 A93 A101 4 A124 35 A143 A153
## cards job liable tele foreign
## 1 2 A173 1 A192 A201
## 2 1 A173 1 A191 A201
## 3 1 A172 2 A191 A201
## 4 1 A173 2 A191 A201
## 5 2 A173 2 A191 A201
## 6 1 A172 2 A192 A201
## Take a back-up of the input data, in case the original values are required later
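The chunk that creates the backup and the standardized predictors is not shown. A minimal sketch, assuming the three numeric predictors (duration, amount, installment) are z-standardized with scale() into a separate frame, here called gc.std (a hypothetical name; the column names in the summary below carry a .V1 suffix from however the original frame was built):

gc.bkup <- gc   # raw copy, used later for plotting the unscaled installment rate
# Hypothetical gc.std: standardized numeric predictors used for the distance computation.
gc.std <- as.data.frame(scale(gc[, c("duration", "amount", "installment")]))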
summary(gc.std)
## duration.V1 amount.V1 installment.V1
## Min. :-1.401713 Min. :-1.070329 Min. :-1.7636311
## 1st Qu.:-0.738298 1st Qu.:-0.675145 1st Qu.:-0.8697481
## Median :-0.240737 Median :-0.337176 Median : 0.0241348
## Mean : 0.000000 Mean : 0.000000 Mean : 0.0000000
## 3rd Qu.: 0.256825 3rd Qu.: 0.248338 3rd Qu.: 0.9180178
## Max. : 4.237315 Max. : 5.368103 Max. : 0.9180178
## Let's predict on a test set of 100 observations; the rest will be used as the training set.
set.seed(123)
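The chunk that draws the test indices and fits the models is also not shown. A minimal sketch, assuming knn() from the class package and a simple random sample of 100 row indices (the exact sampling call is an assumption, so the counts below may not reproduce exactly):

library(class)

test      <- sample(1:nrow(gc.std), 100)   # row indices of the 100 test observations
train.gc  <- gc.std[-test, ]               # standardized predictors, training rows
test.gc   <- gc.std[test, ]                # standardized predictors, test rows
train.def <- factor(gc$Default[-test])     # training labels: 0 = no default, 1 = default
test.def  <- factor(gc$Default[test])      # test labels

knn.1  <- knn(train.gc, test.gc, train.def, k = 1)
knn.5  <- knn(train.gc, test.gc, train.def, k = 5)
knn.20 <- knn(train.gc, test.gc, train.def, k = 20)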
100 * sum(test.def == knn.1)/100 # For knn = 1
## [1] 68
100 * sum(test.def == knn.5)/100 # For knn = 5
## [1] 74
100 * sum(test.def == knn.20)/100 # For knn = 20
## [1] 81
## Looking at the above proportions, K = 1 correctly classifies 68% of the outcomes, K = 5 correctly classifies 74%, and K = 20 correctly classifies 81%.
## We should also look at how the success rate behaves as K increases.
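Before turning to the individual confusion tables, a quick scan over K (a sketch, reusing the objects from the split above) shows how overall accuracy and the success rate among customers classified as good move with K:

ks  <- 1:25
res <- t(sapply(ks, function(k) {
  pred <- knn(train.gc, test.gc, train.def, k = k)
  c(accuracy = mean(pred == test.def),               # overall accuracy
    success  = mean(test.def[pred == "0"] == "0"))   # success rate among predicted non-defaulters
}))
matplot(ks, res, type = "b", pch = 1:2, col = 1:2, xlab = "K", ylab = "rate")
legend("bottomright", legend = colnames(res), pch = 1:2, col = 1:2, bty = "n")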
table(knn.1 ,test.def)
## test.def
## knn.1 0 1
## 0 54 11
## 1 21 14
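The success rates quoted below can be read off each table as row-wise proportions; a one-line sketch:

prop.table(table(knn.1, test.def), margin = 1)   # cell [1,1] is the success rate among customers classified as good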
## For K = 1, 65 customers are classified as non-defaulters (predicted 0), and 54 of them, about 83%, actually are non-defaulters: an 83% success rate. Let's look at K = 5 now.
table(knn.5 ,test.def)
## test.def
## knn.5 0 1
## 0 62 13
## 1 13 12
## For K = 5, 75 customers are classified as non-defaulters, and 62 of them, about 83%, actually are. Let's look at K = 20 now.
table(knn.20 ,test.def)
## test.def
## knn.20 0 1
## 0 69 13
## 1 6 12
## For K = 20, 82 customers are classified as non-defaulters, and 69 of them, about 84%, actually are.
## So increasing K classifies more customers as non-defaulters and raises the overall accuracy, while the success rate among the approved customers stays roughly flat; at the same time, a few more actual defaulters slip through as "good" (11 at K = 1 versus 13 at K = 5 and K = 20). It is worse to class a customer as good when it is bad than to class a customer as bad when it is good.
## Weighing that cost against the success rates above, K = 1 or K = 5 can be taken as the optimum K.
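To make that cost trade-off explicit, one can attach misclassification costs and compare the three fits directly. The costs below are purely illustrative assumptions (approving a defaulter taken as five times as costly as rejecting a good customer), and the K that minimises total cost shifts with the assumed ratio:

# Hypothetical costs: approving a defaulter (predicted 0, actual 1) costs 5 units,
# rejecting a good customer (predicted 1, actual 0) costs 1 unit.
total.cost <- function(pred, actual, c.bad.as.good = 5, c.good.as.bad = 1) {
  sum((pred == "0" & actual == "1") * c.bad.as.good +
      (pred == "1" & actual == "0") * c.good.as.bad)
}
sapply(list(k1 = knn.1, k5 = knn.5, k20 = knn.20), total.cost, actual = test.def)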
## We can plot the data with the training set drawn as hollow shapes and the test-set predictions filled in.
## The plot for K = 1 can be created as follows:
# Training points: hollow circle/triangle for actual default = 0/1,
# coloured by the raw installment rate taken from the backup copy.
plot(train.gc[, c("amount", "duration")],
     col = c(4, 3, 6, 2)[gc.bkup[-test, "installment"]],
     pch = c(1, 2)[as.numeric(train.def)],
     main = "Predicted Default, by 1 Nearest Neighbors", cex.main = .95)
# Test points: filled circle/triangle for predicted default = 0/1 at K = 1,
# fill colour again by the installment rate of the test rows, with a grey border.
points(test.gc[, c("amount", "duration")],
       bg = c(4, 3, 6, 2)[gc.bkup[test, "installment"]],
       pch = c(21, 24)[as.numeric(knn.1)], cex = 1.2, col = grey(.7))
legend("bottomright", pch = c(1, 16, 2, 17),
       legend = c("data 0", "pred 0", "data 1", "pred 1"),
       title = "default", bty = "n", cex = .8)
legend("topleft", fill = c(4, 3, 6, 2), legend = c(1, 2, 3, 4),
       title = "installment %", horiz = TRUE, bty = "n", cex = .8)