With current data showing inflation drastically outpacing wage growth, employees are feeling the impact on their household bottom line. Employees can no longer accept stagnant wages, since flat pay amounts to a wage reduction once inflation is taken into account.
As a business, you must reward your employees with financial incentives that satisfy their monetary needs while also addressing the other factors that matter to them.
The question is: do we know which factors cause an employee to stay in a role? Is it pay alone? Is there some other factor that can cause an otherwise happy employee to leave? Does this change over time, for example when inflation is growing at the rates seen in 2021 and 2022?
I’ve set out to provide you with insights using the data set you provided, covering 870 employees.
I was provided with three files containing employee data: two in comma-separated value (CSV) format and one in Excel workbook format, which I converted to CSV. The first file contains data on 870 employees with a number of data points about each. The second contains similar data points but no attrition data, and the third contains similar data points but no salary data.
Below, I will read in the data and begin to explore what columns and types of data exist in each.
trainData <- read.csv('CaseStudy2-data.csv')
attritionData <- read.csv('CaseStudy2CompSetNoAttrition.csv')
salaryData <- read.csv('CaseStudy2CompSetNoSalary.csv')
head(trainData)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## 6 6 27 No Travel_Frequently 294 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## 6 10 2 Life Sciences 1 733
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## 6 4 Male 32 3 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## 6 Manufacturing Director 1 Divorced 8793
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 9250 2 Y No 11
## 2 17544 1 Y No 14
## 3 19944 2 Y No 11
## 4 24032 1 Y No 19
## 5 17218 1 Y Yes 13
## 6 4809 1 Y No 21
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1 3 3 80 1
## 2 3 1 80 0
## 3 3 3 80 0
## 4 3 3 80 2
## 5 3 3 80 0
## 6 4 3 80 2
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 3 2 5
## 2 21 2 4 20
## 3 10 2 3 2
## 4 14 3 3 14
## 5 6 2 3 6
## 6 9 4 2 9
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 2 0 3
## 2 7 4 9
## 3 2 2 2
## 4 10 5 7
## 5 3 1 3
## 6 7 1 7
I’ll begin by checking whether any data is missing and, if so, deciding how to address it.
which(is.na(trainData))
## integer(0)
which(is.na(attritionData))
## integer(0)
which(is.na(salaryData))
## integer(0)
It looks like nothing is missing from the dataset, so I’ll proceed to determine what I’ve been given and how the data will need to be cleaned to make it more suitable for predictive insights.
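A per-column view can make any gaps easier to locate than a single whole-frame check; a minimal sketch, assuming the data frames above are loaded:

```r
# Count missing values per column; all zeros confirms a complete data set
colSums(is.na(trainData))

# Variant that also flags empty strings, which is.na() does not catch
sapply(trainData, function(col) sum(is.na(col) | col == ""))
```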
#Display the columns in each data set in a table format
as.data.frame(lapply(trainData, class)) %>% t() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| ID | integer |
| Age | integer |
| Attrition | character |
| BusinessTravel | character |
| DailyRate | integer |
| Department | character |
| DistanceFromHome | integer |
| Education | integer |
| EducationField | character |
| EmployeeCount | integer |
| EmployeeNumber | integer |
| EnvironmentSatisfaction | integer |
| Gender | character |
| HourlyRate | integer |
| JobInvolvement | integer |
| JobLevel | integer |
| JobRole | character |
| JobSatisfaction | integer |
| MaritalStatus | character |
| MonthlyIncome | integer |
| MonthlyRate | integer |
| NumCompaniesWorked | integer |
| Over18 | character |
| OverTime | character |
| PercentSalaryHike | integer |
| PerformanceRating | integer |
| RelationshipSatisfaction | integer |
| StandardHours | integer |
| StockOptionLevel | integer |
| TotalWorkingYears | integer |
| TrainingTimesLastYear | integer |
| WorkLifeBalance | integer |
| YearsAtCompany | integer |
| YearsInCurrentRole | integer |
| YearsSinceLastPromotion | integer |
| YearsWithCurrManager | integer |
#Display the columns in each data set in a table format
as.data.frame(lapply(attritionData, class)) %>% t() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| ID | integer |
| Age | integer |
| BusinessTravel | character |
| DailyRate | integer |
| Department | character |
| DistanceFromHome | integer |
| Education | integer |
| EducationField | character |
| EmployeeCount | integer |
| EmployeeNumber | integer |
| EnvironmentSatisfaction | integer |
| Gender | character |
| HourlyRate | integer |
| JobInvolvement | integer |
| JobLevel | integer |
| JobRole | character |
| JobSatisfaction | integer |
| MaritalStatus | character |
| MonthlyIncome | integer |
| MonthlyRate | integer |
| NumCompaniesWorked | integer |
| Over18 | character |
| OverTime | character |
| PercentSalaryHike | integer |
| PerformanceRating | integer |
| RelationshipSatisfaction | integer |
| StandardHours | integer |
| StockOptionLevel | integer |
| TotalWorkingYears | integer |
| TrainingTimesLastYear | integer |
| WorkLifeBalance | integer |
| YearsAtCompany | integer |
| YearsInCurrentRole | integer |
| YearsSinceLastPromotion | integer |
| YearsWithCurrManager | integer |
#Display the columns in each data set in a table format
as.data.frame(lapply(salaryData, class)) %>% t() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| ID | integer |
| Age | integer |
| Attrition | character |
| BusinessTravel | character |
| DailyRate | integer |
| Department | character |
| DistanceFromHome | integer |
| Education | integer |
| EducationField | character |
| EmployeeCount | integer |
| EmployeeNumber | integer |
| EnvironmentSatisfaction | integer |
| Gender | character |
| HourlyRate | integer |
| JobInvolvement | integer |
| JobLevel | integer |
| JobRole | character |
| JobSatisfaction | integer |
| MaritalStatus | character |
| MonthlyRate | integer |
| NumCompaniesWorked | integer |
| Over18 | character |
| OverTime | character |
| PercentSalaryHike | integer |
| PerformanceRating | integer |
| RelationshipSatisfaction | integer |
| StandardHours | integer |
| StockOptionLevel | integer |
| TotalWorkingYears | integer |
| TrainingTimesLastYear | integer |
| WorkLifeBalance | integer |
| YearsAtCompany | integer |
| YearsInCurrentRole | integer |
| YearsSinceLastPromotion | integer |
| YearsWithCurrManager | integer |
I’d like to better understand the levels and distribution of a few features in the data.
table(trainData$StandardHours) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| 80 | 870 |
table(trainData$EmployeeCount) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| 1 | 870 |
table(trainData$Over18) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| Y | 870 |
Some columns do not appear to contain data that will be useful going forward, either because they hold a single constant value (StandardHours, EmployeeCount, Over18) or because they are identifiers that cannot be grouped into meaningful levels (ID, EmployeeNumber). There are also several columns that appear to be pay rates but do not make sense or show consistency in relation to other columns holding, presumably, like data: HourlyRate, DailyRate, and MonthlyRate. These rate columns are also absent from the salary data set provided.
That work to remove them is done here.
#remove unneeded columns
removeVariables <- c("ID", "StandardHours", "EmployeeCount", "Over18", "EmployeeNumber", "HourlyRate", "DailyRate", "MonthlyRate")
trainData <- trainData[,!(names(trainData) %in% removeVariables)]
#Don't remove ID from the two provided data sets
removeVariables <- c("StandardHours", "EmployeeCount", "Over18", "EmployeeNumber", "HourlyRate", "DailyRate", "MonthlyRate")
attritionData <- attritionData[,!(names(attritionData) %in% removeVariables)]
salaryData <- salaryData[,!(names(salaryData) %in% removeVariables)]
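As a cross-check on the manual inspection, caret’s `nearZeroVar()` can confirm programmatically which columns carry no variance. A sketch, assuming caret is loaded and the check is run before the removals above:

```r
# Identify zero- and near-zero-variance columns before dropping anything by hand
nzv <- nearZeroVar(trainData, saveMetrics = TRUE)
nzv[nzv$zeroVar, ]  # the constant columns (e.g. StandardHours, Over18) should appear here
```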
In addition to the columns that we won’t need, some columns need transformation, notably Age and DistanceFromHome. These are currently continuous variables but will serve better as ordinal variables using groups. Additionally, I’ll partition MonthlyIncome into a categorical AnnualIncome variable with breaks at annual salaries of $40k, $70k, and $100k (monthly incomes of roughly $3,333.33, $5,833.33, and $8,333.33).
We define those groups below.
After that, I will convert categorical variables to factors for use in models later.
#Age
trainData$AgeGroup <- with(trainData, ifelse(Age < 21, "18-20", ifelse(Age < 31, "21-30", ifelse(Age < 41, "31-40", ifelse(Age < 51, "41-50", ifelse(Age<61, "51-60", "> 60"))))))
attritionData$AgeGroup <- with(attritionData, ifelse(Age < 21, "18-20", ifelse(Age < 31, "21-30", ifelse(Age < 41, "31-40", ifelse(Age < 51, "41-50", ifelse(Age<61, "51-60", "> 60"))))))
salaryData$AgeGroup <- with(salaryData, ifelse(Age < 21, "18-20", ifelse(Age < 31, "21-30", ifelse(Age < 41, "31-40", ifelse(Age < 51, "41-50", ifelse(Age<61, "51-60", "> 60"))))))
#DistanceFromHome
trainData$WorkDistance <- with(trainData, ifelse(DistanceFromHome < 5, "<5 Miles", ifelse(DistanceFromHome < 11, "5-10", ifelse(DistanceFromHome < 16, "11-15", ifelse(DistanceFromHome < 21, "16-20", ifelse(DistanceFromHome<26, "21-25", "> 25"))))))
attritionData$WorkDistance <- with(attritionData, ifelse(DistanceFromHome < 5, "<5 Miles", ifelse(DistanceFromHome < 11, "5-10", ifelse(DistanceFromHome < 16, "11-15", ifelse(DistanceFromHome < 21, "16-20", ifelse(DistanceFromHome<26, "21-25", "> 25"))))))
salaryData$WorkDistance <- with(salaryData, ifelse(DistanceFromHome < 5, "<5 Miles", ifelse(DistanceFromHome < 11, "5-10", ifelse(DistanceFromHome < 16, "11-15", ifelse(DistanceFromHome < 21, "16-20", ifelse(DistanceFromHome<26, "21-25", "> 25"))))))
#trainData$AnnualIncome <- with(trainData, ifelse(MonthlyIncome < 3333.33, "<40k", ifelse(MonthlyIncome < 5833.33, "40K-70K", ifelse(MonthlyIncome < 8333.33, "70k-100k", ">100k"))))
attritionData$AnnualIncome <- with(attritionData, ifelse(MonthlyIncome < 3333.33, "<40k", ifelse(MonthlyIncome < 5833.33, "40K-70K", ifelse(MonthlyIncome < 8333.33, "70k-100k", ">100k"))))
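The nested `ifelse()` chains above work, but base R’s `cut()` expresses the same binning more compactly. A sketch for the age groups only; note it yields an ordered factor rather than the character vector produced above:

```r
# Equivalent age binning with cut(); right = TRUE makes each interval (lo, hi]
ageBreaks <- c(17, 20, 30, 40, 50, 60, Inf)
ageLabels <- c("18-20", "21-30", "31-40", "41-50", "51-60", "> 60")
trainData$AgeGroup <- cut(trainData$Age, breaks = ageBreaks, labels = ageLabels,
                          ordered_result = TRUE)
```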
head(trainData)
## Age Attrition BusinessTravel Department DistanceFromHome
## 1 32 No Travel_Rarely Sales 13
## 2 40 No Travel_Rarely Research & Development 14
## 3 35 No Travel_Frequently Research & Development 18
## 4 32 No Travel_Rarely Sales 1
## 5 24 No Travel_Frequently Research & Development 2
## 6 27 No Travel_Frequently Research & Development 10
## Education EducationField EnvironmentSatisfaction Gender JobInvolvement
## 1 4 Life Sciences 2 Male 3
## 2 3 Medical 3 Male 2
## 3 2 Life Sciences 3 Male 3
## 4 4 Marketing 3 Female 3
## 5 1 Technical Degree 1 Female 3
## 6 2 Life Sciences 4 Male 3
## JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 2 Sales Executive 4 Divorced 4403
## 2 5 Research Director 3 Single 19626
## 3 3 Manufacturing Director 4 Single 9362
## 4 3 Sales Executive 4 Married 10422
## 5 1 Research Scientist 4 Single 3760
## 6 3 Manufacturing Director 1 Divorced 8793
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1 2 No 11 3
## 2 1 No 14 3
## 3 2 No 11 3
## 4 1 No 19 3
## 5 1 Yes 13 3
## 6 1 No 21 4
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 1 3 1 8
## 2 1 0 21
## 3 3 0 10
## 4 3 2 14
## 5 3 0 6
## 6 3 2 9
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 5 2
## 2 2 4 20 7
## 3 2 3 2 2
## 4 3 3 14 10
## 5 2 3 6 3
## 6 4 2 9 7
## YearsSinceLastPromotion YearsWithCurrManager AgeGroup WorkDistance
## 1 0 3 31-40 11-15
## 2 4 9 31-40 11-15
## 3 2 2 31-40 16-20
## 4 5 7 31-40 <5 Miles
## 5 1 3 21-30 <5 Miles
## 6 1 7 21-30 5-10
#trainData[, c(2,3,4,7,9,12,14,17,26,27,28)] <- lapply(trainData[, c(2,3,4,7,9,12,14,17,26,27,28)], factor)
trainData <- trainData %>% mutate(across(where(is.character), as.factor))
attritionData <- attritionData %>% mutate(across(where(is.character), as.factor))
salaryData <- salaryData %>% mutate(across(where(is.character), as.factor))
head(trainData)
## Age Attrition BusinessTravel Department DistanceFromHome
## 1 32 No Travel_Rarely Sales 13
## 2 40 No Travel_Rarely Research & Development 14
## 3 35 No Travel_Frequently Research & Development 18
## 4 32 No Travel_Rarely Sales 1
## 5 24 No Travel_Frequently Research & Development 2
## 6 27 No Travel_Frequently Research & Development 10
## Education EducationField EnvironmentSatisfaction Gender JobInvolvement
## 1 4 Life Sciences 2 Male 3
## 2 3 Medical 3 Male 2
## 3 2 Life Sciences 3 Male 3
## 4 4 Marketing 3 Female 3
## 5 1 Technical Degree 1 Female 3
## 6 2 Life Sciences 4 Male 3
## JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 2 Sales Executive 4 Divorced 4403
## 2 5 Research Director 3 Single 19626
## 3 3 Manufacturing Director 4 Single 9362
## 4 3 Sales Executive 4 Married 10422
## 5 1 Research Scientist 4 Single 3760
## 6 3 Manufacturing Director 1 Divorced 8793
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1 2 No 11 3
## 2 1 No 14 3
## 3 2 No 11 3
## 4 1 No 19 3
## 5 1 Yes 13 3
## 6 1 No 21 4
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 1 3 1 8
## 2 1 0 21
## 3 3 0 10
## 4 3 2 14
## 5 3 0 6
## 6 3 2 9
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 5 2
## 2 2 4 20 7
## 3 2 3 2 2
## 4 3 3 14 10
## 5 2 3 6 3
## 6 4 2 9 7
## YearsSinceLastPromotion YearsWithCurrManager AgeGroup WorkDistance
## 1 0 3 31-40 11-15
## 2 4 9 31-40 11-15
## 3 2 2 31-40 16-20
## 4 5 7 31-40 <5 Miles
## 5 1 3 21-30 <5 Miles
## 6 1 7 21-30 5-10
I’ll look now at a correlation matrix to determine where correlation exists between variables in the data set.
#build correlation data using all numeric variables
corrData <- cor(trainData[,sapply(trainData, is.numeric)])
corrplot(corrData, type="upper", order="hclust",
col=brewer.pal(n=8, name="RdYlBu"))
I’ve seen the correlation data, but what can I learn by looking at mean values grouped by whether or not an employee left Frito Lay?
aggregate(trainData,by = list(trainData$Attrition),FUN = mean, na.rm=TRUE) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% scroll_box(height = 100, width = 800)
| Group.1 | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | NumCompaniesWorked | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | AgeGroup | WorkDistance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No | 37.41233 | NA | NA | NA | 9.028767 | 2.923288 | NA | 2.738356 | NA | 2.780822 | 2.116438 | NA | 2.761644 | NA | 6702.000 | 2.660274 | NA | 15.17534 | 3.149315 | 2.726027 | 0.8397260 | 11.602740 | 2.867123 | 2.809589 | 7.301370 | 4.453425 | 2.175343 | 4.369863 | NA | NA |
| Yes | 33.78571 | NA | NA | NA | 10.957143 | 2.785714 | NA | 2.507143 | NA | 2.421429 | 1.635714 | NA | 2.435714 | NA | 4764.786 | 3.078571 | NA | 15.32857 | 3.164286 | 2.607143 | 0.4928571 | 8.185714 | 2.650000 | 2.635714 | 5.192857 | 2.907143 | 2.135714 | 2.942857 | NA | NA |
Is there anything remarkable about the income of those who left vs those who stayed?
ggplot(data=trainData, aes(y=MonthlyIncome, x=Attrition, fill=Attrition)) + geom_boxplot() + labs(title="Plot of Monthly Income Vs. Attrition",x="Attrition", y = "Monthly Income") + scale_fill_manual(values=c("#CC1525", "#FBBD16"))
I see that the average age of an employee who left is about 4 years lower than one who stayed during this period. A departing employee was typically paid less, lived farther from work, had slightly less education, and had less experience in their role, with their current manager, and in the company. That is all interesting, but I’ll need to test a few things to get a better idea of what is causing an employee to leave Frito Lay.
I’d like to know how many employees exist in each age group in the training data set. To do so, I’ll create a table showing the frequency of each age group.
table(trainData$AgeGroup) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| 18-20 | 17 |
| 21-30 | 205 |
| 31-40 | 373 |
| 41-50 | 200 |
| 51-60 | 75 |
The ages appear to be approximately normally distributed across the typical working ages.
To properly train and test based on the trainData, I need to split that data set into a training set and a testing set.
set.seed(1234)
splitPercentage <- createDataPartition(y= trainData$Attrition, p=.75, list = FALSE)
trainSet <- trainData[splitPercentage, ]
testSet <- trainData[-splitPercentage, ]
I start by trying to predict attrition with a Naive Bayes model.
#using example from https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package/42940553
draw_confusion_matrix <- function(cm) {
total <- sum(cm$table)
res <- as.numeric(cm$table)
# Generate color gradients. Palettes come from RColorBrewer.
yellowPalette <- c("#FBBD16","#FFB955","#FFBA85","#FFC1B2")
redPalette <- c("#CC1525","#FFBDE9","#FC85B2","#C14F7D")
getColor <- function (yellowOrRed = "yellow", amount = 0) {
if (amount == 0)
return("#FFFFFF")
palette <- yellowPalette
if (yellowOrRed == "red")
palette <- redPalette
colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
}
# set the basic layout
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
classes = colnames(cm$table)
rect(150, 430, 240, 370, col=getColor("yellow", res[1]))
text(195, 435, classes[1], cex=1.2)
rect(250, 430, 340, 370, col=getColor("red", res[3]))
text(295, 435, classes[2], cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col=getColor("red", res[2]))
rect(250, 305, 340, 365, col=getColor("yellow", res[4]))
text(140, 400, classes[1], cex=1.2, srt=90)
text(140, 335, classes[2], cex=1.2, srt=90)
# add in the cm results
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
#set.seed(1234)
nbModel <- naiveBayes(Attrition ~ ., data=trainSet)
#predict with test
testSet$attritionPrediction <- predict(nbModel, testSet)
nbMatrix <- table(data = testSet$attritionPrediction, reference = testSet$Attrition)
#nbMatrix
cm <- confusionMatrix(nbMatrix)
#cm
draw_confusion_matrix(cm)
You have asked for a model with at least 60% sensitivity and 60% specificity. The Naive Bayes model returned a sensitivity of 83% and a specificity of 45.7%, which does not meet the requested levels. This tells us the Naive Bayes model does a reasonable job of predicting when a person will stay with the employer but a poor job of predicting when a person will leave.
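For reference, these two metrics can be recomputed by hand from the 2x2 confusion matrix counts. A sketch, treating “No” as the positive class (caret’s default, the first factor level), with rows of `nbMatrix` as predictions and columns as reference values:

```r
# Manual sensitivity/specificity from the confusion matrix counts
tp <- nbMatrix["No", "No"]    # stays correctly predicted as stays
fn <- nbMatrix["Yes", "No"]   # stays incorrectly predicted as leaves
tn <- nbMatrix["Yes", "Yes"]  # leaves correctly predicted as leaves
fp <- nbMatrix["No", "Yes"]   # leaves incorrectly predicted as stays
sensitivity <- tp / (tp + fn)  # should match cm$byClass["Sensitivity"]
specificity <- tn / (tn + fp)  # should match cm$byClass["Specificity"]
```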
I will proceed to see if I can do better and meet the goals of a 60/60 model as requested.
I’ll try using a random forest model to improve on the ability to predict attrition of employees.
set.seed(1234)
rfModel <- train(Attrition~., data=trainSet, method="rf", metric="Accuracy", trControl=trainControl(method="cv", number=5), tuneGrid=expand.grid(.mtry = c(1: 10)),nodesize=14, ntree=300, importance=TRUE)
#predict with test
testSet$rfPrediction <- predict(rfModel, testSet)
rfMatrix <- table(reference = testSet$Attrition, data = testSet$rfPrediction)
#head(rfPrediction)
#rfMatrix
cm <- confusionMatrix(rfMatrix)
#cm
draw_confusion_matrix(cm)
rfImportance <- varImp(rfModel)
head(rfImportance$importance, n=10)
## No Yes
## Age 54.873032 54.873032
## BusinessTravelTravel_Frequently 7.348607 7.348607
## BusinessTravelTravel_Rarely 8.567325 8.567325
## DepartmentResearch & Development 14.289385 14.289385
## DepartmentSales 17.914888 17.914888
## DistanceFromHome 13.896830 13.896830
## Education 4.171803 4.171803
## EducationFieldLife Sciences 20.100092 20.100092
## EducationFieldMarketing 4.891344 4.891344
## EducationFieldMedical 17.843111 17.843111
Using the random forest model, I am able to achieve a sensitivity of 86.3% and a specificity of 100% on the test data set, above the 60%/60% threshold you specified. This improves upon the results of the Naive Bayes model from earlier, and it is the model I propose you adopt for predicting attrition among your employees.
In analyzing which features contributed the most to an employee’s decision to stay or leave, the top 10 were Age; Business Travel (both frequent and rare travelers); Department (R&D and Sales); the distance from home at which an employee lived; and the employee’s field of education (Life Sciences, Marketing, and Medical).
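The importance ranking above can also be visualized directly; caret’s `varImp` objects support plotting. A brief sketch:

```r
# Plot the top 10 most important predictors from the random forest model
plot(rfImportance, top = 10, main = "Random Forest Variable Importance")
```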
Turning to salary prediction, I’ll begin with a linear regression model and evaluate each feature’s variance inflation factor (VIF) to determine whether any features should be eliminated from the dataset.
lmModel <- lm(MonthlyIncome~., trainSet)
summary(lmModel)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3787.8 -649.1 -3.3 596.5 3890.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 267.354 994.697 0.269 0.78819
## Age -10.339 16.944 -0.610 0.54199
## AttritionYes 28.540 138.394 0.206 0.83669
## BusinessTravelTravel_Frequently 279.977 166.932 1.677 0.09402 .
## BusinessTravelTravel_Rarely 379.189 139.783 2.713 0.00686 **
## DepartmentResearch & Development -294.999 561.730 -0.525 0.59966
## DepartmentSales -846.391 585.304 -1.446 0.14867
## DistanceFromHome 10.609 31.640 0.335 0.73752
## Education -74.157 44.041 -1.684 0.09274 .
## EducationFieldLife Sciences 348.448 449.910 0.774 0.43895
## EducationFieldMarketing 370.789 474.668 0.781 0.43502
## EducationFieldMedical 218.565 448.198 0.488 0.62597
## EducationFieldOther 430.279 483.205 0.890 0.37357
## EducationFieldTechnical Degree 333.261 477.054 0.699 0.48508
## EnvironmentSatisfaction -36.769 39.727 -0.926 0.35506
## GenderMale 122.055 87.018 1.403 0.16124
## JobInvolvement 27.762 64.104 0.433 0.66511
## JobLevel 2942.932 91.064 32.317 < 2e-16 ***
## JobRoleHuman Resources -454.257 610.835 -0.744 0.45737
## JobRoleLaboratory Technician -577.718 204.597 -2.824 0.00490 **
## JobRoleManager 4293.533 331.734 12.943 < 2e-16 ***
## JobRoleManufacturing Director 233.512 202.249 1.155 0.24872
## JobRoleResearch Director 3943.718 263.774 14.951 < 2e-16 ***
## JobRoleResearch Scientist -257.422 199.909 -1.288 0.19834
## JobRoleSales Executive 332.829 432.003 0.770 0.44134
## JobRoleSales Representative 93.353 473.137 0.197 0.84365
## JobSatisfaction -26.180 38.747 -0.676 0.49951
## MaritalStatusMarried 21.430 118.671 0.181 0.85676
## MaritalStatusSingle 7.949 161.268 0.049 0.96071
## NumCompaniesWorked 34.487 18.820 1.832 0.06737 .
## OverTimeYes -2.741 98.893 -0.028 0.97790
## PerformanceRating -161.270 116.980 -1.379 0.16852
## RelationshipSatisfaction 1.046 38.233 0.027 0.97818
## StockOptionLevel -1.994 66.553 -0.030 0.97610
## TrainingTimesLastYear 38.837 34.771 1.117 0.26447
## WorkLifeBalance -43.981 60.946 -0.722 0.47080
## YearsAtCompany 29.629 14.822 1.999 0.04606 *
## YearsInCurrentRole 10.353 20.521 0.505 0.61407
## YearsWithCurrManager -19.114 20.998 -0.910 0.36304
## AgeGroup21-30 104.508 384.893 0.272 0.78608
## AgeGroup31-40 511.097 439.304 1.163 0.24512
## AgeGroup41-50 641.180 544.965 1.177 0.23984
## AgeGroup51-60 713.306 687.722 1.037 0.30006
## WorkDistance> 25 -631.993 838.387 -0.754 0.45125
## WorkDistance11-15 -522.643 381.272 -1.371 0.17095
## WorkDistance16-20 -144.670 522.982 -0.277 0.78216
## WorkDistance21-25 -448.062 682.121 -0.657 0.51152
## WorkDistance5-10 -113.691 208.547 -0.545 0.58585
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1062 on 605 degrees of freedom
## Multiple R-squared: 0.9485, Adjusted R-squared: 0.9445
## F-statistic: 237 on 47 and 605 DF, p-value: < 2.2e-16
VIF(lmModel)
## GVIF Df GVIF^(1/(2*Df))
## Age 12.726242 1 3.567386
## Attrition 1.496990 1 1.223516
## BusinessTravel 1.175158 2 1.041176
## Department 117.046168 2 3.289193
## DistanceFromHome 37.471381 1 6.121387
## Education 1.169272 1 1.081329
## EducationField 3.562265 5 1.135462
## EnvironmentSatisfaction 1.087986 1 1.043066
## Gender 1.058837 1 1.028998
## JobInvolvement 1.157110 1 1.075691
## JobLevel 5.597519 1 2.365908
## JobRole 634.239697 8 1.496721
## JobSatisfaction 1.100190 1 1.048899
## MaritalStatus 2.135711 2 1.208887
## NumCompaniesWorked 1.320266 1 1.149028
## OverTime 1.179392 1 1.085998
## PerformanceRating 1.061309 1 1.030199
## RelationshipSatisfaction 1.057036 1 1.028123
## StockOptionLevel 1.940808 1 1.393129
## TrainingTimesLastYear 1.113767 1 1.055352
## WorkLifeBalance 1.080366 1 1.039406
## YearsAtCompany 4.559403 1 2.135276
## YearsInCurrentRole 3.085311 1 1.756505
## YearsWithCurrManager 3.148772 1 1.774478
## AgeGroup 16.071199 4 1.414999
## WorkDistance 50.816794 5 1.481156
A number of features appear problematic, with a VIF over 5. Some far exceed that threshold, such as JobRole at 634.24, indicating that the linear model is having trouble estimating stable coefficients for those variables.
Others that may be problematic are Department, DistanceFromHome, AgeGroup, and WorkDistance. The last two were categorical variables created for the attrition prediction and will be removed.
In addition, it passes a logic test that distance from home should not influence an employee’s salary. Department could influence salary, though perhaps not heavily across departments; the variation is more likely within a department driven by another feature. I believe it is best to remove it from consideration here.
Age shows a VIF above 10; however, I do believe Age will be a valuable feature in predicting salary, so I will leave it in for now.
Another factor I am not comfortable leaving in the dataset is Attrition: whether an employee left could reflect their satisfaction with their salary, but it does not properly indicate their salary.
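The pruning here is manual; the same idea can be automated by repeatedly dropping the predictor with the largest generalized VIF until all fall below a chosen threshold. A hedged sketch, assuming `car::vif()` (which the `VIF()` call above appears to wrap) and a hypothetical threshold of 5:

```r
# Iteratively drop the predictor with the highest adjusted GVIF until all fall below the threshold
pruneByVIF <- function(data, response = "MonthlyIncome", threshold = 5) {
  repeat {
    model <- lm(reformulate(setdiff(names(data), response), response), data = data)
    v <- car::vif(model)
    # With factors present, vif() returns a matrix; square GVIF^(1/(2*Df)) to
    # put multi-level terms on a scale comparable to plain VIF values
    scores <- if (is.matrix(v)) v[, "GVIF^(1/(2*Df))"]^2 else v
    if (max(scores) < threshold) return(model)
    data <- data[, names(data) != names(which.max(scores))]
  }
}
```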
removeVariables <- c("JobRole", "Department", "DistanceFromHome", "AgeGroup", "WorkDistance", "Attrition")
trainSetLM2 <- trainSet[,!(names(trainSet) %in% removeVariables)]
testSetLM2 <- testSet[,!(names(testSet) %in% removeVariables)]
lmModel <- lm(MonthlyIncome~., trainSetLM2)
summary(lmModel)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = trainSetLM2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5408.4 -766.2 37.0 824.2 3817.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1821.526 819.708 -2.222 0.0266 *
## Age 11.614 7.425 1.564 0.1183
## BusinessTravelTravel_Frequently 200.292 210.823 0.950 0.3425
## BusinessTravelTravel_Rarely 429.177 177.546 2.417 0.0159 *
## Education 17.576 55.502 0.317 0.7516
## EducationFieldLife Sciences -452.297 430.157 -1.051 0.2934
## EducationFieldMarketing -997.708 450.092 -2.217 0.0270 *
## EducationFieldMedical -517.887 431.253 -1.201 0.2302
## EducationFieldOther -473.337 475.320 -0.996 0.3197
## EducationFieldTechnical Degree -552.959 467.468 -1.183 0.2373
## EnvironmentSatisfaction -76.407 50.056 -1.526 0.1274
## GenderMale 77.233 111.663 0.692 0.4894
## JobInvolvement 102.366 79.776 1.283 0.1999
## JobLevel 3886.462 66.005 58.882 <2e-16 ***
## JobSatisfaction -29.575 48.650 -0.608 0.5435
## MaritalStatusMarried 10.358 151.018 0.069 0.9453
## MaritalStatusSingle -52.699 202.941 -0.260 0.7952
## NumCompaniesWorked 24.980 23.475 1.064 0.2877
## OverTimeYes 26.363 120.214 0.219 0.8265
## PerformanceRating -116.321 148.902 -0.781 0.4350
## RelationshipSatisfaction 4.202 49.175 0.085 0.9319
## StockOptionLevel -66.382 84.320 -0.787 0.4314
## TrainingTimesLastYear 45.410 43.881 1.035 0.3011
## WorkLifeBalance 50.740 77.584 0.654 0.5134
## YearsAtCompany 41.343 18.570 2.226 0.0263 *
## YearsInCurrentRole -15.843 25.713 -0.616 0.5380
## YearsWithCurrManager -47.972 26.502 -1.810 0.0708 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1379 on 626 degrees of freedom
## Multiple R-squared: 0.9102, Adjusted R-squared: 0.9064
## F-statistic: 243.9 on 26 and 626 DF, p-value: < 2.2e-16
VIF(lmModel)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.449471 1 1.203940
## BusinessTravel 1.072243 2 1.017591
## Education 1.101541 1 1.049543
## EducationField 1.183717 5 1.017009
## EnvironmentSatisfaction 1.024550 1 1.012201
## Gender 1.034217 1 1.016965
## JobInvolvement 1.063000 1 1.031019
## JobLevel 1.744368 1 1.320745
## JobSatisfaction 1.028813 1 1.014304
## MaritalStatus 1.925913 2 1.178038
## NumCompaniesWorked 1.218428 1 1.103824
## OverTime 1.033769 1 1.016744
## PerformanceRating 1.020008 1 1.009954
## RelationshipSatisfaction 1.037217 1 1.018439
## StockOptionLevel 1.847930 1 1.359386
## TrainingTimesLastYear 1.052214 1 1.025775
## WorkLifeBalance 1.038512 1 1.019074
## YearsAtCompany 4.244904 1 2.060317
## YearsInCurrentRole 2.873484 1 1.695136
## YearsWithCurrManager 2.975074 1 1.724840
The linear regression resulted in a residual standard error of $1,379 and an adjusted R-squared of 0.9064.
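The $1,379 figure is the residual standard error on the training data; out-of-sample error on the held-out test set is the fairer benchmark for comparing models. A sketch:

```r
# Held-out RMSE for the reduced linear model
lmPred <- predict(lmModel, newdata = testSetLM2)
lmRMSE <- sqrt(mean((testSetLM2$MonthlyIncome - lmPred)^2))
lmRMSE
```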
Next, I will try a Cubist model using the full dataset, including all of the predictors that were eliminated during the linear regression modeling.
set.seed(1234)
cubistModel <- cubist(x= trainSet[, !(names(trainSet) %in% c("MonthlyIncome"))], y=trainSet$MonthlyIncome, committees = 5)
cubistModel
##
## Call:
## cubist.default(x = trainSet[, !(names(trainSet) %in% c("MonthlyIncome"))], y
## = trainSet$MonthlyIncome, committees = 5)
##
## Number of samples: 653
## Number of predictors: 26
##
## Number of committees: 5
## Number of rules per committee: 6, 3, 6, 3, 6
summary(cubistModel)
##
## Call:
## cubist.default(x = trainSet[, !(names(trainSet) %in% c("MonthlyIncome"))], y
## = trainSet$MonthlyIncome, committees = 5)
##
##
## Cubist [Release 2.07 GPL Edition] Sat Feb 12 18:13:46 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 653 cases (27 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 534.0]
##
## if
## JobLevel <= 1
## then
## outcome = 2411 + 52 YearsAtCompany
##
## Rule 1/2: [27 cases, mean 4488.9, range 2042 to 7403, est err 1005.9]
##
## if
## JobLevel > 1
## JobRole in {Laboratory Technician, Sales Representative}
## then
## outcome = 3691 + 170 YearsAtCompany - 213 YearsWithCurrManager
## + 261 JobLevel
##
## Rule 1/3: [41 cases, mean 5654.9, range 3660 to 9980, est err 749.2]
##
## if
## Age <= 29
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -1497 + 3202 JobLevel
##
## Rule 1/4: [201 cases, mean 7197.7, range 4028 to 13966, est err 920.2]
##
## if
## Age > 29
## DistanceFromHome <= 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -445 + 3605 JobLevel - 64 DistanceFromHome
## - 151 EnvironmentSatisfaction + 116 TrainingTimesLastYear
## - 142 Education
##
## Rule 1/5: [54 cases, mean 7350.8, range 4014 to 13973, est err 996.6]
##
## if
## Age > 29
## DistanceFromHome > 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -1147 + 4122 JobLevel - 503 TrainingTimesLastYear
## - 65 YearsAtCompany
##
## Rule 1/6: [71 cases, mean 16359.6, range 6077 to 19999, est err 779.2]
##
## if
## JobLevel > 1
## JobRole in {Human Resources, Manager, Research Director}
## then
## outcome = 2370 + 3491 JobLevel - 199 JobSatisfaction + 25 YearsAtCompany
##
## Model 2:
##
## Rule 2/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 544.4]
##
## if
## JobLevel <= 1
## then
## outcome = 2653
##
## Rule 2/2: [324 cases, mean 6798.8, range 2042 to 13973, est err 1023.6]
##
## if
## JobLevel > 1
## JobRole in {Healthcare Representative, Human Resources,
## Laboratory Technician, Manufacturing Director,
## Research Scientist, Sales Executive, Sales Representative}
## then
## outcome = -1713 + 3704 JobLevel - 23 DistanceFromHome
##
## Rule 2/3: [70 cases, mean 16506.5, range 11031 to 19999, est err 731.9]
##
## if
## JobRole in {Manager, Research Director}
## then
## outcome = 3561 + 3215 JobLevel
##
## Model 3:
##
## Rule 3/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 541.1]
##
## if
## JobLevel <= 1
## then
## outcome = 875 + 1119 JobLevel + 73 YearsAtCompany + 9 Age
##
## Rule 3/2: [27 cases, mean 4488.9, range 2042 to 7403, est err 1589.8]
##
## if
## JobLevel > 1
## JobRole in {Laboratory Technician, Sales Representative}
## then
## outcome = 3201
##
## Rule 3/3: [41 cases, mean 5654.9, range 3660 to 9980, est err 1001.1]
##
## if
## Age <= 29
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -1383 + 2769 JobLevel + 7 Age
##
## Rule 3/4: [296 cases, mean 7011.9, range 3660 to 13973, est err 1166.1]
##
## if
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = 959 + 3478 JobLevel - 108 DistanceFromHome
## - 315 EnvironmentSatisfaction + 242 TrainingTimesLastYear
## - 295 Education
##
## Rule 3/5: [64 cases, mean 7165.0, range 4014 to 13973, est err 1279.5]
##
## if
## DistanceFromHome > 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -130 + 4552 JobLevel - 997 TrainingTimesLastYear
## - 131 YearsAtCompany
##
## Rule 3/6: [71 cases, mean 16359.6, range 6077 to 19999, est err 833.0]
##
## if
## JobLevel > 1
## JobRole in {Human Resources, Manager, Research Director}
## then
## outcome = 409 + 3595 JobLevel - 314 JobSatisfaction + 456 JobInvolvement
## + 53 YearsAtCompany
##
## Model 4:
##
## Rule 4/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 544.4]
##
## if
## JobLevel <= 1
## then
## outcome = 2653
##
## Rule 4/2: [324 cases, mean 6798.8, range 2042 to 13973, est err 1027.2]
##
## if
## JobLevel > 1
## JobRole in {Healthcare Representative, Human Resources,
## Laboratory Technician, Manufacturing Director,
## Research Scientist, Sales Executive, Sales Representative}
## then
## outcome = -2267 + 3836 JobLevel
##
## Rule 4/3: [70 cases, mean 16506.5, range 11031 to 19999, est err 822.0]
##
## if
## JobRole in {Manager, Research Director}
## then
## outcome = 4151 + 3136 JobLevel
##
## Model 5:
##
## Rule 5/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 554.8]
##
## if
## JobLevel <= 1
## then
## outcome = 2169 + 104 YearsAtCompany
##
## Rule 5/2: [27 cases, mean 4488.9, range 2042 to 7403, est err 1618.6]
##
## if
## JobLevel > 1
## JobRole in {Laboratory Technician, Sales Representative}
## then
## outcome = 3163
##
## Rule 5/3: [41 cases, mean 5654.9, range 3660 to 9980, est err 986.1]
##
## if
## Age <= 29
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -821 + 2568 JobLevel - 7 DistanceFromHome + 6 Age
##
## Rule 5/4: [64 cases, mean 7165.0, range 4014 to 13973, est err 1244.7]
##
## if
## DistanceFromHome > 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -41 + 4417 JobLevel - 1008 TrainingTimesLastYear
## - 130 YearsAtCompany
##
## Rule 5/5: [201 cases, mean 7197.7, range 4028 to 13966, est err 1044.1]
##
## if
## Age > 29
## DistanceFromHome <= 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = 1513 + 3346 JobLevel - 131 DistanceFromHome
## - 315 EnvironmentSatisfaction + 242 TrainingTimesLastYear
## - 295 Education
##
## Rule 5/6: [71 cases, mean 16359.6, range 6077 to 19999, est err 891.0]
##
## if
## JobLevel > 1
## JobRole in {Human Resources, Manager, Research Director}
## then
## outcome = -123 + 3689 JobLevel - 289 JobSatisfaction
## + 423 JobInvolvement + 49 YearsAtCompany
##
##
## Evaluation on training data (653 cases):
##
## Average |error| 996.8
## Relative |error| 0.29
## Correlation coefficient 0.96
##
##
## Attribute usage:
## Conds Model
##
## 96% 68% JobLevel
## 62% JobRole
## 17% 31% DistanceFromHome
## 17% 10% Age
## 35% YearsAtCompany
## 26% TrainingTimesLastYear
## 21% Education
## 21% EnvironmentSatisfaction
## 6% JobSatisfaction
## 4% JobInvolvement
##
##
## Time: 0.1 secs
modelPrediction <- predict(cubistModel, testSet[, !(names(testSet) %in% c("MonthlyIncome"))])
#RMSE
cubistRMSE <- sqrt(mean((modelPrediction - testSet$MonthlyIncome)^2))
cubistRMSE
## [1] 1223.452
cubistR2 <- cor(modelPrediction, testSet$MonthlyIncome)^2
cubistR2
## [1] 0.9372058
With a test-set RMSE of $1,223 and an R^2 of 0.9372, the Cubist model improves on the linear regression. Next, I'll examine variable importance and tune the model to see if I can improve it further.
varImp(cubistModel)
## Overall
## JobLevel 82.0
## JobRole 31.0
## DistanceFromHome 24.0
## Age 13.5
## YearsAtCompany 17.5
## TrainingTimesLastYear 13.0
## Education 10.5
## EnvironmentSatisfaction 10.5
## JobSatisfaction 3.0
## JobInvolvement 2.0
## Attrition 0.0
## BusinessTravel 0.0
## Department 0.0
## EducationField 0.0
## Gender 0.0
## MaritalStatus 0.0
## NumCompaniesWorked 0.0
## OverTime 0.0
## PerformanceRating 0.0
## RelationshipSatisfaction 0.0
## StockOptionLevel 0.0
## WorkLifeBalance 0.0
## YearsInCurrentRole 0.0
## YearsWithCurrManager 0.0
## AgeGroup 0.0
## WorkDistance 0.0
Now I will tune the model with the caret package, using 10-fold cross-validation over a grid of the committees and neighbors parameters.
grid <- expand.grid(committees = c(10, 15, 20, 25, 30, 35, 40, 45, 50), neighbors = c(0, 1, 3, 5))
set.seed(1234)
caretGrid <- train(x= trainSet[, !(names(trainSet) %in% c("MonthlyIncome"))], y=trainSet$MonthlyIncome, method="cubist", tuneGrid=grid, trControl = trainControl(method="cv"))
caretGrid
## Cubist
##
## 653 samples
## 26 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 588, 588, 586, 587, 588, 586, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 1026.057 0.9473133 784.1183
## 10 1 1374.918 0.9038651 1046.0441
## 10 3 1141.334 0.9326640 888.9389
## 10 5 1083.234 0.9392715 833.9379
## 15 0 1024.255 0.9473345 785.2219
## 15 1 1375.591 0.9036741 1044.1604
## 15 3 1142.218 0.9324960 888.5238
## 15 5 1083.887 0.9392351 836.1047
## 20 0 1028.556 0.9471783 789.0399
## 20 1 1378.276 0.9031460 1047.9079
## 20 3 1144.758 0.9321824 889.7532
## 20 5 1086.898 0.9389324 837.7105
## 25 0 1030.729 0.9467781 790.7755
## 25 1 1379.968 0.9029354 1048.6742
## 25 3 1147.972 0.9318716 892.4039
## 25 5 1090.906 0.9385348 841.8759
## 30 0 1033.564 0.9467263 792.8499
## 30 1 1380.826 0.9027004 1050.4937
## 30 3 1148.837 0.9317342 892.9141
## 30 5 1091.859 0.9384397 842.9693
## 35 0 1037.004 0.9463975 795.9165
## 35 1 1382.923 0.9023901 1050.9750
## 35 3 1152.011 0.9314138 895.9661
## 35 5 1095.365 0.9381181 846.0507
## 40 0 1036.451 0.9465123 794.9063
## 40 1 1381.607 0.9026400 1051.4267
## 40 3 1151.006 0.9315342 894.8546
## 40 5 1094.760 0.9381866 845.4335
## 45 0 1037.521 0.9464162 795.8715
## 45 1 1382.756 0.9024795 1051.0357
## 45 3 1152.767 0.9313642 896.1033
## 45 5 1096.322 0.9380672 846.4105
## 50 0 1039.022 0.9463086 796.3599
## 50 1 1383.689 0.9023915 1051.4496
## 50 3 1153.527 0.9312673 896.8106
## 50 5 1097.045 0.9379773 846.2243
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 15 and neighbors = 0.
ggplot(caretGrid)
Tuning the model with the caret package over a grid of committee and neighbor settings, the best fit used 15 committees and no neighbors, achieving a cross-validated RMSE of 1024.26 and an R-squared of 0.9473.
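Rather than reading the winning settings off the printout, the best tuning parameters and their resampled performance can be pulled from the caret object directly. A small sketch, assuming the `caretGrid` object fit above:

```r
# The committees/neighbors combination selected by 10-fold CV
caretGrid$bestTune

# Resampled metrics for just the winning row of the tuning grid
subset(caretGrid$results,
       committees == caretGrid$bestTune$committees &
       neighbors == caretGrid$bestTune$neighbors)
```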
I’ll now run this best-fit model against the test data to confirm similar results.
modelPrediction <- predict(caretGrid, testSet[, !(names(testSet) %in% c("MonthlyIncome"))])
#RMSE
cubistRMSE <- sqrt(mean((modelPrediction - testSet$MonthlyIncome)^2))
cubistRMSE
## [1] 1197.722
cubistR2 <- cor(modelPrediction, testSet$MonthlyIncome)^2
cubistR2
## [1] 0.9396154
On the test set, the tuned model achieved an RMSE of 1197.72 and an R^2 of 0.9396. The gap from the cross-validated RMSE suggests the model is slightly overfit to the training data, but it still outperforms the linear regression model.
Finally, I’ll use my highest-performing models to generate predictions for the two data sets with missing values, so that you can validate my models and methodology.
attritionData$PredictedValue <- predict(rfModel, attritionData)
salaryData$PredictedValue <- predict(caretGrid, salaryData)
attritionExport <- attritionData %>% select(c("ID", "PredictedValue"))
write.csv(attritionExport, "Case2PredictionsMcDonald Attrition.csv", row.names = FALSE)
salaryExport <- salaryData %>% select(c("ID", "PredictedValue"))
write.csv(salaryExport, "Case2PredictionsMcDonald Salary.csv", row.names = FALSE)