With current data showing inflation drastically outpacing wage growth, employees are feeling the impact on their household bottom line. Employees can no longer accept stagnant wages, since flat pay amounts to a wage reduction once inflation is taken into account.
As a business, you must reward your employees with financial incentives that satisfy their monetary needs while also addressing the other factors that matter to them.
The question is: do we know which factors cause an employee to stay in a role? Is it pay alone? Is there some other factor that can cause an otherwise happy employee to leave? Does this change over time, for example when inflation is growing at the rates seen in 2021 and 2022?
I’ve set out to provide you with insights using the data set you provided, covering 870 employees.
I was provided with three files containing employee data: two in comma-separated value (CSV) format and one in Excel workbook format, which I converted to CSV. The first file contains data on 870 employees with a number of data points about each. The second contains similar data points but no attrition data, and the third contains similar data points but no salary data.
Below, I will read in the data and begin to explore what columns and types of data exist in each.
trainData <- read.csv('CaseStudy2-data.csv')
attritionData <- read.csv('CaseStudy2CompSetNoAttrition.csv')
salaryData <- read.csv('CaseStudy2CompSetNoSalary.csv')
head(trainData)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## 6 6 27 No Travel_Frequently 294 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## 6 10 2 Life Sciences 1 733
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## 6 4 Male 32 3 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## 6 Manufacturing Director 1 Divorced 8793
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 9250 2 Y No 11
## 2 17544 1 Y No 14
## 3 19944 2 Y No 11
## 4 24032 1 Y No 19
## 5 17218 1 Y Yes 13
## 6 4809 1 Y No 21
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1 3 3 80 1
## 2 3 1 80 0
## 3 3 3 80 0
## 4 3 3 80 2
## 5 3 3 80 0
## 6 4 3 80 2
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 3 2 5
## 2 21 2 4 20
## 3 10 2 3 2
## 4 14 3 3 14
## 5 6 2 3 6
## 6 9 4 2 9
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 2 0 3
## 2 7 4 9
## 3 2 2 2
## 4 10 5 7
## 5 3 1 3
## 6 7 1 7
I’ll begin by checking whether any data is missing and, if so, deciding how to address it.
which(is.na(trainData))
## integer(0)
which(is.na(attritionData))
## integer(0)
which(is.na(salaryData))
## integer(0)
It looks like nothing is missing from the dataset, so I’ll proceed to determine what I’ve been given and how the data will need to be cleaned to make it more suitable for predictive insights.
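A per-column view can make any gaps easier to locate than a single whole-frame check; a minimal sketch, assuming the data frames above are loaded:

```r
# Count missing values per column; all zeros confirms a complete data set
colSums(is.na(trainData))

# Variant that also flags empty strings, which is.na() does not catch
sapply(trainData, function(col) sum(is.na(col) | col == ""))
```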
#Display the columns in each data set in a table format
as.data.frame(lapply(trainData, class)) %>% t() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| ID | integer |
| Age | integer |
| Attrition | character |
| BusinessTravel | character |
| DailyRate | integer |
| Department | character |
| DistanceFromHome | integer |
| Education | integer |
| EducationField | character |
| EmployeeCount | integer |
| EmployeeNumber | integer |
| EnvironmentSatisfaction | integer |
| Gender | character |
| HourlyRate | integer |
| JobInvolvement | integer |
| JobLevel | integer |
| JobRole | character |
| JobSatisfaction | integer |
| MaritalStatus | character |
| MonthlyIncome | integer |
| MonthlyRate | integer |
| NumCompaniesWorked | integer |
| Over18 | character |
| OverTime | character |
| PercentSalaryHike | integer |
| PerformanceRating | integer |
| RelationshipSatisfaction | integer |
| StandardHours | integer |
| StockOptionLevel | integer |
| TotalWorkingYears | integer |
| TrainingTimesLastYear | integer |
| WorkLifeBalance | integer |
| YearsAtCompany | integer |
| YearsInCurrentRole | integer |
| YearsSinceLastPromotion | integer |
| YearsWithCurrManager | integer |
#Display the columns in each data set in a table format
as.data.frame(lapply(attritionData, class)) %>% t() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| ID | integer |
| Age | integer |
| BusinessTravel | character |
| DailyRate | integer |
| Department | character |
| DistanceFromHome | integer |
| Education | integer |
| EducationField | character |
| EmployeeCount | integer |
| EmployeeNumber | integer |
| EnvironmentSatisfaction | integer |
| Gender | character |
| HourlyRate | integer |
| JobInvolvement | integer |
| JobLevel | integer |
| JobRole | character |
| JobSatisfaction | integer |
| MaritalStatus | character |
| MonthlyIncome | integer |
| MonthlyRate | integer |
| NumCompaniesWorked | integer |
| Over18 | character |
| OverTime | character |
| PercentSalaryHike | integer |
| PerformanceRating | integer |
| RelationshipSatisfaction | integer |
| StandardHours | integer |
| StockOptionLevel | integer |
| TotalWorkingYears | integer |
| TrainingTimesLastYear | integer |
| WorkLifeBalance | integer |
| YearsAtCompany | integer |
| YearsInCurrentRole | integer |
| YearsSinceLastPromotion | integer |
| YearsWithCurrManager | integer |
#Display the columns in each data set in a table format
as.data.frame(lapply(salaryData, class)) %>% t() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| ID | integer |
| Age | integer |
| Attrition | character |
| BusinessTravel | character |
| DailyRate | integer |
| Department | character |
| DistanceFromHome | integer |
| Education | integer |
| EducationField | character |
| EmployeeCount | integer |
| EmployeeNumber | integer |
| EnvironmentSatisfaction | integer |
| Gender | character |
| HourlyRate | integer |
| JobInvolvement | integer |
| JobLevel | integer |
| JobRole | character |
| JobSatisfaction | integer |
| MaritalStatus | character |
| MonthlyRate | integer |
| NumCompaniesWorked | integer |
| Over18 | character |
| OverTime | character |
| PercentSalaryHike | integer |
| PerformanceRating | integer |
| RelationshipSatisfaction | integer |
| StandardHours | integer |
| StockOptionLevel | integer |
| TotalWorkingYears | integer |
| TrainingTimesLastYear | integer |
| WorkLifeBalance | integer |
| YearsAtCompany | integer |
| YearsInCurrentRole | integer |
| YearsSinceLastPromotion | integer |
| YearsWithCurrManager | integer |
I’d like to better understand the levels and distribution of a few features in the data.
table(trainData$StandardHours) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| 80 | 870 |
table(trainData$EmployeeCount) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| 1 | 870 |
table(trainData$Over18) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| Y | 870 |
Some columns do not appear to contain data that will be useful going forward, either because they hold a single constant value (StandardHours, EmployeeCount, Over18) or because they are identifiers that cannot be grouped into meaningful levels (ID, EmployeeNumber). There are also several columns that appear to be pay rates but do not make sense or show consistency in relation to other columns holding, presumably, like data: HourlyRate, DailyRate, and MonthlyRate. These rate columns are also absent from the salary data set provided.
That work to remove them is done here.
#remove unneeded columns
removeVariables <- c("ID", "StandardHours", "EmployeeCount", "Over18", "EmployeeNumber", "HourlyRate", "DailyRate", "MonthlyRate")
trainData <- trainData[,!(names(trainData) %in% removeVariables)]
#Don't remove ID from the two provided data sets
removeVariables <- c("StandardHours", "EmployeeCount", "Over18", "EmployeeNumber", "HourlyRate", "DailyRate", "MonthlyRate")
attritionData <- attritionData[,!(names(attritionData) %in% removeVariables)]
salaryData <- salaryData[,!(names(salaryData) %in% removeVariables)]
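As a cross-check on the manual inspection, caret’s `nearZeroVar()` can confirm programmatically which columns carry no variance. A sketch, assuming caret is loaded and the check is run before the removals above:

```r
# Identify zero- and near-zero-variance columns before dropping anything by hand
nzv <- nearZeroVar(trainData, saveMetrics = TRUE)
nzv[nzv$zeroVar, ]  # the constant columns (e.g. StandardHours, Over18) should appear here
```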
In addition to the columns that we won’t need, some columns need transformation, notably Age and DistanceFromHome. These are currently continuous variables but will serve better as ordinal variables using groups. Additionally, I’ll partition MonthlyIncome into a categorical AnnualIncome variable with breaks at annual salaries of $40k, $70k, and $100k (monthly incomes of roughly $3,333.33, $5,833.33, and $8,333.33).
We define those groups below.
After that, I will convert categorical variables to factors for use in models later.
#Age
trainData$AgeGroup <- with(trainData, ifelse(Age < 21, "18-20", ifelse(Age < 31, "21-30", ifelse(Age < 41, "31-40", ifelse(Age < 51, "41-50", ifelse(Age<61, "51-60", "> 60"))))))
attritionData$AgeGroup <- with(attritionData, ifelse(Age < 21, "18-20", ifelse(Age < 31, "21-30", ifelse(Age < 41, "31-40", ifelse(Age < 51, "41-50", ifelse(Age<61, "51-60", "> 60"))))))
salaryData$AgeGroup <- with(salaryData, ifelse(Age < 21, "18-20", ifelse(Age < 31, "21-30", ifelse(Age < 41, "31-40", ifelse(Age < 51, "41-50", ifelse(Age<61, "51-60", "> 60"))))))
#DistanceFromHome
trainData$WorkDistance <- with(trainData, ifelse(DistanceFromHome < 5, "<5 Miles", ifelse(DistanceFromHome < 11, "5-10", ifelse(DistanceFromHome < 16, "11-15", ifelse(DistanceFromHome < 21, "16-20", ifelse(DistanceFromHome<26, "21-25", "> 25"))))))
attritionData$WorkDistance <- with(attritionData, ifelse(DistanceFromHome < 5, "<5 Miles", ifelse(DistanceFromHome < 11, "5-10", ifelse(DistanceFromHome < 16, "11-15", ifelse(DistanceFromHome < 21, "16-20", ifelse(DistanceFromHome<26, "21-25", "> 25"))))))
salaryData$WorkDistance <- with(salaryData, ifelse(DistanceFromHome < 5, "<5 Miles", ifelse(DistanceFromHome < 11, "5-10", ifelse(DistanceFromHome < 16, "11-15", ifelse(DistanceFromHome < 21, "16-20", ifelse(DistanceFromHome<26, "21-25", "> 25"))))))
#trainData$AnnualIncome <- with(trainData, ifelse(MonthlyIncome < 3333.33, "<40k", ifelse(MonthlyIncome < 5833.33, "40K-70K", ifelse(MonthlyIncome < 8333.33, "70k-100k", ">100k"))))
attritionData$AnnualIncome <- with(attritionData, ifelse(MonthlyIncome < 3333.33, "<40k", ifelse(MonthlyIncome < 5833.33, "40K-70K", ifelse(MonthlyIncome < 8333.33, "70k-100k", ">100k"))))
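The nested `ifelse()` chains above work, but base R’s `cut()` expresses the same binning more compactly. A sketch for the age groups only; note it yields an ordered factor rather than the character vector produced above:

```r
# Equivalent age binning with cut(); right = TRUE makes each interval (lo, hi]
ageBreaks <- c(17, 20, 30, 40, 50, 60, Inf)
ageLabels <- c("18-20", "21-30", "31-40", "41-50", "51-60", "> 60")
trainData$AgeGroup <- cut(trainData$Age, breaks = ageBreaks, labels = ageLabels,
                          ordered_result = TRUE)
```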
head(trainData)
## Age Attrition BusinessTravel Department DistanceFromHome
## 1 32 No Travel_Rarely Sales 13
## 2 40 No Travel_Rarely Research & Development 14
## 3 35 No Travel_Frequently Research & Development 18
## 4 32 No Travel_Rarely Sales 1
## 5 24 No Travel_Frequently Research & Development 2
## 6 27 No Travel_Frequently Research & Development 10
## Education EducationField EnvironmentSatisfaction Gender JobInvolvement
## 1 4 Life Sciences 2 Male 3
## 2 3 Medical 3 Male 2
## 3 2 Life Sciences 3 Male 3
## 4 4 Marketing 3 Female 3
## 5 1 Technical Degree 1 Female 3
## 6 2 Life Sciences 4 Male 3
## JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 2 Sales Executive 4 Divorced 4403
## 2 5 Research Director 3 Single 19626
## 3 3 Manufacturing Director 4 Single 9362
## 4 3 Sales Executive 4 Married 10422
## 5 1 Research Scientist 4 Single 3760
## 6 3 Manufacturing Director 1 Divorced 8793
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1 2 No 11 3
## 2 1 No 14 3
## 3 2 No 11 3
## 4 1 No 19 3
## 5 1 Yes 13 3
## 6 1 No 21 4
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 1 3 1 8
## 2 1 0 21
## 3 3 0 10
## 4 3 2 14
## 5 3 0 6
## 6 3 2 9
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 5 2
## 2 2 4 20 7
## 3 2 3 2 2
## 4 3 3 14 10
## 5 2 3 6 3
## 6 4 2 9 7
## YearsSinceLastPromotion YearsWithCurrManager AgeGroup WorkDistance
## 1 0 3 31-40 11-15
## 2 4 9 31-40 11-15
## 3 2 2 31-40 16-20
## 4 5 7 31-40 <5 Miles
## 5 1 3 21-30 <5 Miles
## 6 1 7 21-30 5-10
#trainData[, c(2,3,4,7,9,12,14,17,26,27,28)] <- lapply(trainData[, c(2,3,4,7,9,12,14,17,26,27,28)], factor)
trainData <- trainData %>% mutate(across(where(is.character), as.factor))
attritionData <- attritionData %>% mutate(across(where(is.character), as.factor))
salaryData <- salaryData %>% mutate(across(where(is.character), as.factor))
head(trainData)
## Age Attrition BusinessTravel Department DistanceFromHome
## 1 32 No Travel_Rarely Sales 13
## 2 40 No Travel_Rarely Research & Development 14
## 3 35 No Travel_Frequently Research & Development 18
## 4 32 No Travel_Rarely Sales 1
## 5 24 No Travel_Frequently Research & Development 2
## 6 27 No Travel_Frequently Research & Development 10
## Education EducationField EnvironmentSatisfaction Gender JobInvolvement
## 1 4 Life Sciences 2 Male 3
## 2 3 Medical 3 Male 2
## 3 2 Life Sciences 3 Male 3
## 4 4 Marketing 3 Female 3
## 5 1 Technical Degree 1 Female 3
## 6 2 Life Sciences 4 Male 3
## JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 2 Sales Executive 4 Divorced 4403
## 2 5 Research Director 3 Single 19626
## 3 3 Manufacturing Director 4 Single 9362
## 4 3 Sales Executive 4 Married 10422
## 5 1 Research Scientist 4 Single 3760
## 6 3 Manufacturing Director 1 Divorced 8793
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1 2 No 11 3
## 2 1 No 14 3
## 3 2 No 11 3
## 4 1 No 19 3
## 5 1 Yes 13 3
## 6 1 No 21 4
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 1 3 1 8
## 2 1 0 21
## 3 3 0 10
## 4 3 2 14
## 5 3 0 6
## 6 3 2 9
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 5 2
## 2 2 4 20 7
## 3 2 3 2 2
## 4 3 3 14 10
## 5 2 3 6 3
## 6 4 2 9 7
## YearsSinceLastPromotion YearsWithCurrManager AgeGroup WorkDistance
## 1 0 3 31-40 11-15
## 2 4 9 31-40 11-15
## 3 2 2 31-40 16-20
## 4 5 7 31-40 <5 Miles
## 5 1 3 21-30 <5 Miles
## 6 1 7 21-30 5-10
I’ll look now at a correlation matrix to determine where correlation exists between variables in the data set.
#build correlation data using all numeric variables
corrData <- cor(trainData[,sapply(trainData, is.numeric)])
corrplot(corrData, type="upper", order="hclust",
col=brewer.pal(n=8, name="RdYlBu"))
I’ve seen the correlation data, but what can I learn by looking at mean values grouped by whether or not an employee left Frito Lay?
aggregate(trainData,by = list(trainData$Attrition),FUN = mean, na.rm=TRUE) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% scroll_box(height = 100, width = 800)
| Group.1 | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | NumCompaniesWorked | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | AgeGroup | WorkDistance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No | 37.41233 | NA | NA | NA | 9.028767 | 2.923288 | NA | 2.738356 | NA | 2.780822 | 2.116438 | NA | 2.761644 | NA | 6702.000 | 2.660274 | NA | 15.17534 | 3.149315 | 2.726027 | 0.8397260 | 11.602740 | 2.867123 | 2.809589 | 7.301370 | 4.453425 | 2.175343 | 4.369863 | NA | NA |
| Yes | 33.78571 | NA | NA | NA | 10.957143 | 2.785714 | NA | 2.507143 | NA | 2.421429 | 1.635714 | NA | 2.435714 | NA | 4764.786 | 3.078571 | NA | 15.32857 | 3.164286 | 2.607143 | 0.4928571 | 8.185714 | 2.650000 | 2.635714 | 5.192857 | 2.907143 | 2.135714 | 2.942857 | NA | NA |
Is there anything remarkable about the income of those who left vs those who stayed?
ggplot(data=trainData, aes(y=MonthlyIncome, x=Attrition, fill=Attrition)) + geom_boxplot() + labs(title="Plot of Monthly Income Vs. Attrition",x="Attrition", y = "Monthly Income") + scale_fill_manual(values=c("#CC1525", "#FBBD16"))
I see that the average age of an employee who left is about 4 years lower than one who stayed during this period. A departing employee was typically paid less, lived farther from work, had slightly less education, and had less experience in their role, with their current manager, and in the company. That is all interesting, but I’ll need to test a few things to get a better idea of what is causing an employee to leave Frito Lay.
I’d like to know how many employees exist in each age group in the training data set. To do so, I’ll create a table showing the frequency of each age group.
table(trainData$AgeGroup) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| 18-20 | 17 |
| 21-30 | 205 |
| 31-40 | 373 |
| 41-50 | 200 |
| 51-60 | 75 |
The ages appear to be approximately normally distributed across the typical working ages.
To properly train and test based on the trainData, I need to split that data set into a training set and a testing set.
set.seed(1234)
splitPercentage <- createDataPartition(y= trainData$Attrition, p=.75, list = FALSE)
trainSet <- trainData[splitPercentage, ]
testSet <- trainData[-splitPercentage, ]
I start by trying to predict attrition with a Naive Bayes model.
#using example from https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package/42940553
draw_confusion_matrix <- function(cm) {
total <- sum(cm$table)
res <- as.numeric(cm$table)
# Generate color gradients. Palettes come from RColorBrewer.
yellowPalette <- c("#FBBD16","#FFB955","#FFBA85","#FFC1B2")
redPalette <- c("#CC1525","#FFBDE9","#FC85B2","#C14F7D")
getColor <- function (yellowOrRed = "yellow", amount = 0) {
if (amount == 0)
return("#FFFFFF")
palette <- yellowPalette
if (yellowOrRed == "red")
palette <- redPalette
colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
}
# set the basic layout
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
classes = colnames(cm$table)
rect(150, 430, 240, 370, col=getColor("yellow", res[1]))
text(195, 435, classes[1], cex=1.2)
rect(250, 430, 340, 370, col=getColor("red", res[3]))
text(295, 435, classes[2], cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col=getColor("red", res[2]))
rect(250, 305, 340, 365, col=getColor("yellow", res[4]))
text(140, 400, classes[1], cex=1.2, srt=90)
text(140, 335, classes[2], cex=1.2, srt=90)
# add in the cm results
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
#set.seed(1234)
nbModel <- naiveBayes(Attrition ~ ., data=trainSet)
#predict with test
testSet$attritionPrediction <- predict(nbModel, testSet)
nbMatrix <- table(data = testSet$attritionPrediction, reference = testSet$Attrition)
#nbMatrix
cm <- confusionMatrix(nbMatrix)
#cm
draw_confusion_matrix(cm)
You have asked for a model with at least 60% sensitivity and 60% specificity. The Naive Bayes model returned a sensitivity of 83% and a specificity of 45.7%, which does not meet the requested levels. This tells us the Naive Bayes model does a reasonable job of predicting when a person will stay with the employer but a poor job of predicting when a person will leave.
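For reference, these two metrics can be recomputed by hand from the 2x2 confusion matrix counts. A sketch, treating “No” as the positive class (caret’s default, the first factor level), with rows of `nbMatrix` as predictions and columns as reference values:

```r
# Manual sensitivity/specificity from the confusion matrix counts
tp <- nbMatrix["No", "No"]    # stays correctly predicted as stays
fn <- nbMatrix["Yes", "No"]   # stays incorrectly predicted as leaves
tn <- nbMatrix["Yes", "Yes"]  # leaves correctly predicted as leaves
fp <- nbMatrix["No", "Yes"]   # leaves incorrectly predicted as stays
sensitivity <- tp / (tp + fn)  # should match cm$byClass["Sensitivity"]
specificity <- tn / (tn + fp)  # should match cm$byClass["Specificity"]
```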
I will proceed to see if I can do better and meet the goals of a 60/60 model as requested.
I’ll try using a random forest model to improve on the ability to predict attrition of employees.
set.seed(1234)
rfModel <- train(Attrition~., data=trainSet, method="rf", metric="Accuracy", trControl=trainControl(method="cv", number=5), tuneGrid=expand.grid(.mtry = c(1: 10)),nodesize=14, ntree=300, importance=TRUE)
#predict with test
testSet$rfPrediction <- predict(rfModel, testSet)
rfMatrix <- table(reference = testSet$Attrition, data = testSet$rfPrediction)
#head(rfPrediction)
#rfMatrix
cm <- confusionMatrix(rfMatrix)
#cm
draw_confusion_matrix(cm)
rfImportance <- varImp(rfModel)
head(rfImportance$importance, n=10)
## No Yes
## Age 54.873032 54.873032
## BusinessTravelTravel_Frequently 7.348607 7.348607
## BusinessTravelTravel_Rarely 8.567325 8.567325
## DepartmentResearch & Development 14.289385 14.289385
## DepartmentSales 17.914888 17.914888
## DistanceFromHome 13.896830 13.896830
## Education 4.171803 4.171803
## EducationFieldLife Sciences 20.100092 20.100092
## EducationFieldMarketing 4.891344 4.891344
## EducationFieldMedical 17.843111 17.843111
Using the random forest model, I am able to achieve a sensitivity of 86.3% and a specificity of 100% on the test data set, above the 60%/60% threshold you specified. This improves upon the results of the Naive Bayes model from earlier, and it is the model I propose you adopt for predicting attrition among your employees.
In analyzing which features contributed the most to an employee’s decision to stay or leave, the top 10 were Age; Business Travel (both frequent and rare travelers); Department (R&D and Sales); the distance from home at which an employee lived; and the employee’s field of education (Life Sciences, Marketing, and Medical).
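The importance ranking above can also be visualized directly; caret’s `varImp` objects support plotting. A brief sketch:

```r
# Plot the top 10 most important predictors from the random forest model
plot(rfImportance, top = 10, main = "Random Forest Variable Importance")
```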
Turning to salary prediction, I’ll begin with a linear regression model and evaluate each feature’s variance inflation factor (VIF) to determine whether any features should be eliminated from the dataset.
lmModel <- lm(MonthlyIncome~., trainSet)
summary(lmModel)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3787.8 -649.1 -3.3 596.5 3890.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 267.354 994.697 0.269 0.78819
## Age -10.339 16.944 -0.610 0.54199
## AttritionYes 28.540 138.394 0.206 0.83669
## BusinessTravelTravel_Frequently 279.977 166.932 1.677 0.09402 .
## BusinessTravelTravel_Rarely 379.189 139.783 2.713 0.00686 **
## DepartmentResearch & Development -294.999 561.730 -0.525 0.59966
## DepartmentSales -846.391 585.304 -1.446 0.14867
## DistanceFromHome 10.609 31.640 0.335 0.73752
## Education -74.157 44.041 -1.684 0.09274 .
## EducationFieldLife Sciences 348.448 449.910 0.774 0.43895
## EducationFieldMarketing 370.789 474.668 0.781 0.43502
## EducationFieldMedical 218.565 448.198 0.488 0.62597
## EducationFieldOther 430.279 483.205 0.890 0.37357
## EducationFieldTechnical Degree 333.261 477.054 0.699 0.48508
## EnvironmentSatisfaction -36.769 39.727 -0.926 0.35506
## GenderMale 122.055 87.018 1.403 0.16124
## JobInvolvement 27.762 64.104 0.433 0.66511
## JobLevel 2942.932 91.064 32.317 < 2e-16 ***
## JobRoleHuman Resources -454.257 610.835 -0.744 0.45737
## JobRoleLaboratory Technician -577.718 204.597 -2.824 0.00490 **
## JobRoleManager 4293.533 331.734 12.943 < 2e-16 ***
## JobRoleManufacturing Director 233.512 202.249 1.155 0.24872
## JobRoleResearch Director 3943.718 263.774 14.951 < 2e-16 ***
## JobRoleResearch Scientist -257.422 199.909 -1.288 0.19834
## JobRoleSales Executive 332.829 432.003 0.770 0.44134
## JobRoleSales Representative 93.353 473.137 0.197 0.84365
## JobSatisfaction -26.180 38.747 -0.676 0.49951
## MaritalStatusMarried 21.430 118.671 0.181 0.85676
## MaritalStatusSingle 7.949 161.268 0.049 0.96071
## NumCompaniesWorked 34.487 18.820 1.832 0.06737 .
## OverTimeYes -2.741 98.893 -0.028 0.97790
## PerformanceRating -161.270 116.980 -1.379 0.16852
## RelationshipSatisfaction 1.046 38.233 0.027 0.97818
## StockOptionLevel -1.994 66.553 -0.030 0.97610
## TrainingTimesLastYear 38.837 34.771 1.117 0.26447
## WorkLifeBalance -43.981 60.946 -0.722 0.47080
## YearsAtCompany 29.629 14.822 1.999 0.04606 *
## YearsInCurrentRole 10.353 20.521 0.505 0.61407
## YearsWithCurrManager -19.114 20.998 -0.910 0.36304
## AgeGroup21-30 104.508 384.893 0.272 0.78608
## AgeGroup31-40 511.097 439.304 1.163 0.24512
## AgeGroup41-50 641.180 544.965 1.177 0.23984
## AgeGroup51-60 713.306 687.722 1.037 0.30006
## WorkDistance> 25 -631.993 838.387 -0.754 0.45125
## WorkDistance11-15 -522.643 381.272 -1.371 0.17095
## WorkDistance16-20 -144.670 522.982 -0.277 0.78216
## WorkDistance21-25 -448.062 682.121 -0.657 0.51152
## WorkDistance5-10 -113.691 208.547 -0.545 0.58585
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1062 on 605 degrees of freedom
## Multiple R-squared: 0.9485, Adjusted R-squared: 0.9445
## F-statistic: 237 on 47 and 605 DF, p-value: < 2.2e-16
VIF(lmModel)
## GVIF Df GVIF^(1/(2*Df))
## Age 12.726242 1 3.567386
## Attrition 1.496990 1 1.223516
## BusinessTravel 1.175158 2 1.041176
## Department 117.046168 2 3.289193
## DistanceFromHome 37.471381 1 6.121387
## Education 1.169272 1 1.081329
## EducationField 3.562265 5 1.135462
## EnvironmentSatisfaction 1.087986 1 1.043066
## Gender 1.058837 1 1.028998
## JobInvolvement 1.157110 1 1.075691
## JobLevel 5.597519 1 2.365908
## JobRole 634.239697 8 1.496721
## JobSatisfaction 1.100190 1 1.048899
## MaritalStatus 2.135711 2 1.208887
## NumCompaniesWorked 1.320266 1 1.149028
## OverTime 1.179392 1 1.085998
## PerformanceRating 1.061309 1 1.030199
## RelationshipSatisfaction 1.057036 1 1.028123
## StockOptionLevel 1.940808 1 1.393129
## TrainingTimesLastYear 1.113767 1 1.055352
## WorkLifeBalance 1.080366 1 1.039406
## YearsAtCompany 4.559403 1 2.135276
## YearsInCurrentRole 3.085311 1 1.756505
## YearsWithCurrManager 3.148772 1 1.774478
## AgeGroup 16.071199 4 1.414999
## WorkDistance 50.816794 5 1.481156
A number of features appear problematic, with a VIF over 5. Some far exceed that threshold, such as JobRole at 634.24, indicating that the linear model is having trouble estimating stable coefficients for those variables.
Others that may be problematic are Department, DistanceFromHome, AgeGroup, and WorkDistance. The last two were categorical variables created for the attrition prediction and will be removed.
In addition, it passes a logic test that distance from home should not influence an employee’s salary. Department could influence salary, though perhaps not heavily across departments; the variation is more likely within a department driven by another feature. I believe it is best to remove it from consideration here.
Age shows a VIF above 10; however, I do believe Age will be a valuable feature in predicting salary, so I will leave it in for now.
Another factor I am not comfortable leaving in the dataset is Attrition: whether an employee left could reflect their satisfaction with their salary, but it does not properly indicate their salary.
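The pruning here is manual; the same idea can be automated by repeatedly dropping the predictor with the largest generalized VIF until all fall below a chosen threshold. A hedged sketch, assuming `car::vif()` (which the `VIF()` call above appears to wrap) and a hypothetical threshold of 5:

```r
# Iteratively drop the predictor with the highest adjusted GVIF until all fall below the threshold
pruneByVIF <- function(data, response = "MonthlyIncome", threshold = 5) {
  repeat {
    model <- lm(reformulate(setdiff(names(data), response), response), data = data)
    v <- car::vif(model)
    # With factors present, vif() returns a matrix; square GVIF^(1/(2*Df)) to
    # put multi-level terms on a scale comparable to plain VIF values
    scores <- if (is.matrix(v)) v[, "GVIF^(1/(2*Df))"]^2 else v
    if (max(scores) < threshold) return(model)
    data <- data[, names(data) != names(which.max(scores))]
  }
}
```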
removeVariables <- c("JobRole", "Department", "DistanceFromHome", "AgeGroup", "WorkDistance", "Attrition")
trainSetLM2 <- trainSet[,!(names(trainSet) %in% removeVariables)]
testSetLM2 <- testSet[,!(names(testSet) %in% removeVariables)]
lmModel <- lm(MonthlyIncome~., trainSetLM2)
summary(lmModel)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = trainSetLM2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5408.4 -766.2 37.0 824.2 3817.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1821.526 819.708 -2.222 0.0266 *
## Age 11.614 7.425 1.564 0.1183
## BusinessTravelTravel_Frequently 200.292 210.823 0.950 0.3425
## BusinessTravelTravel_Rarely 429.177 177.546 2.417 0.0159 *
## Education 17.576 55.502 0.317 0.7516
## EducationFieldLife Sciences -452.297 430.157 -1.051 0.2934
## EducationFieldMarketing -997.708 450.092 -2.217 0.0270 *
## EducationFieldMedical -517.887 431.253 -1.201 0.2302
## EducationFieldOther -473.337 475.320 -0.996 0.3197
## EducationFieldTechnical Degree -552.959 467.468 -1.183 0.2373
## EnvironmentSatisfaction -76.407 50.056 -1.526 0.1274
## GenderMale 77.233 111.663 0.692 0.4894
## JobInvolvement 102.366 79.776 1.283 0.1999
## JobLevel 3886.462 66.005 58.882 <2e-16 ***
## JobSatisfaction -29.575 48.650 -0.608 0.5435
## MaritalStatusMarried 10.358 151.018 0.069 0.9453
## MaritalStatusSingle -52.699 202.941 -0.260 0.7952
## NumCompaniesWorked 24.980 23.475 1.064 0.2877
## OverTimeYes 26.363 120.214 0.219 0.8265
## PerformanceRating -116.321 148.902 -0.781 0.4350
## RelationshipSatisfaction 4.202 49.175 0.085 0.9319
## StockOptionLevel -66.382 84.320 -0.787 0.4314
## TrainingTimesLastYear 45.410 43.881 1.035 0.3011
## WorkLifeBalance 50.740 77.584 0.654 0.5134
## YearsAtCompany 41.343 18.570 2.226 0.0263 *
## YearsInCurrentRole -15.843 25.713 -0.616 0.5380
## YearsWithCurrManager -47.972 26.502 -1.810 0.0708 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1379 on 626 degrees of freedom
## Multiple R-squared: 0.9102, Adjusted R-squared: 0.9064
## F-statistic: 243.9 on 26 and 626 DF, p-value: < 2.2e-16
VIF(lmModel)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.449471 1 1.203940
## BusinessTravel 1.072243 2 1.017591
## Education 1.101541 1 1.049543
## EducationField 1.183717 5 1.017009
## EnvironmentSatisfaction 1.024550 1 1.012201
## Gender 1.034217 1 1.016965
## JobInvolvement 1.063000 1 1.031019
## JobLevel 1.744368 1 1.320745
## JobSatisfaction 1.028813 1 1.014304
## MaritalStatus 1.925913 2 1.178038
## NumCompaniesWorked 1.218428 1 1.103824
## OverTime 1.033769 1 1.016744
## PerformanceRating 1.020008 1 1.009954
## RelationshipSatisfaction 1.037217 1 1.018439
## StockOptionLevel 1.847930 1 1.359386
## TrainingTimesLastYear 1.052214 1 1.025775
## WorkLifeBalance 1.038512 1 1.019074
## YearsAtCompany 4.244904 1 2.060317
## YearsInCurrentRole 2.873484 1 1.695136
## YearsWithCurrManager 2.975074 1 1.724840
The linear regression resulted in a residual standard error of $1,379 and an adjusted R-squared of 0.9064.
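The $1,379 figure is the residual standard error on the training data; out-of-sample error on the held-out test set is the fairer benchmark for comparing models. A sketch:

```r
# Held-out RMSE for the reduced linear model
lmPred <- predict(lmModel, newdata = testSetLM2)
lmRMSE <- sqrt(mean((testSetLM2$MonthlyIncome - lmPred)^2))
lmRMSE
```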
Next, I will try a Cubist model using the full dataset, including all of the predictors that were eliminated during the linear regression modeling.
set.seed(1234)
cubistModel <- cubist(x= trainSet[, !(names(trainSet) %in% c("MonthlyIncome"))], y=trainSet$MonthlyIncome, committees = 5)
cubistModel
##
## Call:
## cubist.default(x = trainSet[, !(names(trainSet) %in% c("MonthlyIncome"))], y
## = trainSet$MonthlyIncome, committees = 5)
##
## Number of samples: 653
## Number of predictors: 26
##
## Number of committees: 5
## Number of rules per committee: 6, 3, 6, 3, 6
summary(cubistModel)
##
## Call:
## cubist.default(x = trainSet[, !(names(trainSet) %in% c("MonthlyIncome"))], y
## = trainSet$MonthlyIncome, committees = 5)
##
##
## Cubist [Release 2.07 GPL Edition] Sat Feb 12 18:13:46 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 653 cases (27 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 534.0]
##
## if
## JobLevel <= 1
## then
## outcome = 2411 + 52 YearsAtCompany
##
## Rule 1/2: [27 cases, mean 4488.9, range 2042 to 7403, est err 1005.9]
##
## if
## JobLevel > 1
## JobRole in {Laboratory Technician, Sales Representative}
## then
## outcome = 3691 + 170 YearsAtCompany - 213 YearsWithCurrManager
## + 261 JobLevel
##
## Rule 1/3: [41 cases, mean 5654.9, range 3660 to 9980, est err 749.2]
##
## if
## Age <= 29
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -1497 + 3202 JobLevel
##
## Rule 1/4: [201 cases, mean 7197.7, range 4028 to 13966, est err 920.2]
##
## if
## Age > 29
## DistanceFromHome <= 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -445 + 3605 JobLevel - 64 DistanceFromHome
## - 151 EnvironmentSatisfaction + 116 TrainingTimesLastYear
## - 142 Education
##
## Rule 1/5: [54 cases, mean 7350.8, range 4014 to 13973, est err 996.6]
##
## if
## Age > 29
## DistanceFromHome > 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -1147 + 4122 JobLevel - 503 TrainingTimesLastYear
## - 65 YearsAtCompany
##
## Rule 1/6: [71 cases, mean 16359.6, range 6077 to 19999, est err 779.2]
##
## if
## JobLevel > 1
## JobRole in {Human Resources, Manager, Research Director}
## then
## outcome = 2370 + 3491 JobLevel - 199 JobSatisfaction + 25 YearsAtCompany
##
## Model 2:
##
## Rule 2/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 544.4]
##
## if
## JobLevel <= 1
## then
## outcome = 2653
##
## Rule 2/2: [324 cases, mean 6798.8, range 2042 to 13973, est err 1023.6]
##
## if
## JobLevel > 1
## JobRole in {Healthcare Representative, Human Resources,
## Laboratory Technician, Manufacturing Director,
## Research Scientist, Sales Executive, Sales Representative}
## then
## outcome = -1713 + 3704 JobLevel - 23 DistanceFromHome
##
## Rule 2/3: [70 cases, mean 16506.5, range 11031 to 19999, est err 731.9]
##
## if
## JobRole in {Manager, Research Director}
## then
## outcome = 3561 + 3215 JobLevel
##
## Model 3:
##
## Rule 3/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 541.1]
##
## if
## JobLevel <= 1
## then
## outcome = 875 + 1119 JobLevel + 73 YearsAtCompany + 9 Age
##
## Rule 3/2: [27 cases, mean 4488.9, range 2042 to 7403, est err 1589.8]
##
## if
## JobLevel > 1
## JobRole in {Laboratory Technician, Sales Representative}
## then
## outcome = 3201
##
## Rule 3/3: [41 cases, mean 5654.9, range 3660 to 9980, est err 1001.1]
##
## if
## Age <= 29
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -1383 + 2769 JobLevel + 7 Age
##
## Rule 3/4: [296 cases, mean 7011.9, range 3660 to 13973, est err 1166.1]
##
## if
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = 959 + 3478 JobLevel - 108 DistanceFromHome
## - 315 EnvironmentSatisfaction + 242 TrainingTimesLastYear
## - 295 Education
##
## Rule 3/5: [64 cases, mean 7165.0, range 4014 to 13973, est err 1279.5]
##
## if
## DistanceFromHome > 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -130 + 4552 JobLevel - 997 TrainingTimesLastYear
## - 131 YearsAtCompany
##
## Rule 3/6: [71 cases, mean 16359.6, range 6077 to 19999, est err 833.0]
##
## if
## JobLevel > 1
## JobRole in {Human Resources, Manager, Research Director}
## then
## outcome = 409 + 3595 JobLevel - 314 JobSatisfaction + 456 JobInvolvement
## + 53 YearsAtCompany
##
## Model 4:
##
## Rule 4/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 544.4]
##
## if
## JobLevel <= 1
## then
## outcome = 2653
##
## Rule 4/2: [324 cases, mean 6798.8, range 2042 to 13973, est err 1027.2]
##
## if
## JobLevel > 1
## JobRole in {Healthcare Representative, Human Resources,
## Laboratory Technician, Manufacturing Director,
## Research Scientist, Sales Executive, Sales Representative}
## then
## outcome = -2267 + 3836 JobLevel
##
## Rule 4/3: [70 cases, mean 16506.5, range 11031 to 19999, est err 822.0]
##
## if
## JobRole in {Manager, Research Director}
## then
## outcome = 4151 + 3136 JobLevel
##
## Model 5:
##
## Rule 5/1: [259 cases, mean 2753.8, range 1081 to 4936, est err 554.8]
##
## if
## JobLevel <= 1
## then
## outcome = 2169 + 104 YearsAtCompany
##
## Rule 5/2: [27 cases, mean 4488.9, range 2042 to 7403, est err 1618.6]
##
## if
## JobLevel > 1
## JobRole in {Laboratory Technician, Sales Representative}
## then
## outcome = 3163
##
## Rule 5/3: [41 cases, mean 5654.9, range 3660 to 9980, est err 986.1]
##
## if
## Age <= 29
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -821 + 2568 JobLevel - 7 DistanceFromHome + 6 Age
##
## Rule 5/4: [64 cases, mean 7165.0, range 4014 to 13973, est err 1244.7]
##
## if
## DistanceFromHome > 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = -41 + 4417 JobLevel - 1008 TrainingTimesLastYear
## - 130 YearsAtCompany
##
## Rule 5/5: [201 cases, mean 7197.7, range 4028 to 13966, est err 1044.1]
##
## if
## Age > 29
## DistanceFromHome <= 16
## JobLevel > 1
## JobRole in {Healthcare Representative, Manufacturing Director,
## Research Scientist, Sales Executive}
## then
## outcome = 1513 + 3346 JobLevel - 131 DistanceFromHome
## - 315 EnvironmentSatisfaction + 242 TrainingTimesLastYear
## - 295 Education
##
## Rule 5/6: [71 cases, mean 16359.6, range 6077 to 19999, est err 891.0]
##
## if
## JobLevel > 1
## JobRole in {Human Resources, Manager, Research Director}
## then
## outcome = -123 + 3689 JobLevel - 289 JobSatisfaction
## + 423 JobInvolvement + 49 YearsAtCompany
##
##
## Evaluation on training data (653 cases):
##
## Average |error| 996.8
## Relative |error| 0.29
## Correlation coefficient 0.96
##
##
## Attribute usage:
## Conds Model
##
## 96% 68% JobLevel
## 62% JobRole
## 17% 31% DistanceFromHome
## 17% 10% Age
## 35% YearsAtCompany
## 26% TrainingTimesLastYear
## 21% Education
## 21% EnvironmentSatisfaction
## 6% JobSatisfaction
## 4% JobInvolvement
##
##
## Time: 0.1 secs
modelPrediction <- predict(cubistModel, testSet[, !(names(testSet) %in% c("MonthlyIncome"))])
#RMSE
cubistRMSE <- sqrt(mean((modelPrediction - testSet$MonthlyIncome)^2))
cubistRMSE
## [1] 1223.452
cubistR2 <- cor(modelPrediction, testSet$MonthlyIncome)^2
cubistR2
## [1] 0.9372058
With a test-set RMSE of $1,223 and an R^2 of 0.9372, the Cubist model improves on the linear regression. Next, I'll examine variable importance and tune the model to see if I can improve it further.
varImp(cubistModel)
## Overall
## JobLevel 82.0
## JobRole 31.0
## DistanceFromHome 24.0
## Age 13.5
## YearsAtCompany 17.5
## TrainingTimesLastYear 13.0
## Education 10.5
## EnvironmentSatisfaction 10.5
## JobSatisfaction 3.0
## JobInvolvement 2.0
## Attrition 0.0
## BusinessTravel 0.0
## Department 0.0
## EducationField 0.0
## Gender 0.0
## MaritalStatus 0.0
## NumCompaniesWorked 0.0
## OverTime 0.0
## PerformanceRating 0.0
## RelationshipSatisfaction 0.0
## StockOptionLevel 0.0
## WorkLifeBalance 0.0
## YearsInCurrentRole 0.0
## YearsWithCurrManager 0.0
## AgeGroup 0.0
## WorkDistance 0.0
Now I will tune the model with the caret package, using 10-fold cross-validation over a grid of the committees and neighbors parameters.
grid <- expand.grid(committees = c(10, 15, 20, 25, 30, 35, 40, 45, 50), neighbors = c(0, 1, 3, 5))
set.seed(1234)
caretGrid <- train(x= trainSet[, !(names(trainSet) %in% c("MonthlyIncome"))], y=trainSet$MonthlyIncome, method="cubist", tuneGrid=grid, trControl = trainControl(method="cv"))
caretGrid
## Cubist
##
## 653 samples
## 26 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 588, 588, 586, 587, 588, 586, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 1026.057 0.9473133 784.1183
## 10 1 1374.918 0.9038651 1046.0441
## 10 3 1141.334 0.9326640 888.9389
## 10 5 1083.234 0.9392715 833.9379
## 15 0 1024.255 0.9473345 785.2219
## 15 1 1375.591 0.9036741 1044.1604
## 15 3 1142.218 0.9324960 888.5238
## 15 5 1083.887 0.9392351 836.1047
## 20 0 1028.556 0.9471783 789.0399
## 20 1 1378.276 0.9031460 1047.9079
## 20 3 1144.758 0.9321824 889.7532
## 20 5 1086.898 0.9389324 837.7105
## 25 0 1030.729 0.9467781 790.7755
## 25 1 1379.968 0.9029354 1048.6742
## 25 3 1147.972 0.9318716 892.4039
## 25 5 1090.906 0.9385348 841.8759
## 30 0 1033.564 0.9467263 792.8499
## 30 1 1380.826 0.9027004 1050.4937
## 30 3 1148.837 0.9317342 892.9141
## 30 5 1091.859 0.9384397 842.9693
## 35 0 1037.004 0.9463975 795.9165
## 35 1 1382.923 0.9023901 1050.9750
## 35 3 1152.011 0.9314138 895.9661
## 35 5 1095.365 0.9381181 846.0507
## 40 0 1036.451 0.9465123 794.9063
## 40 1 1381.607 0.9026400 1051.4267
## 40 3 1151.006 0.9315342 894.8546
## 40 5 1094.760 0.9381866 845.4335
## 45 0 1037.521 0.9464162 795.8715
## 45 1 1382.756 0.9024795 1051.0357
## 45 3 1152.767 0.9313642 896.1033
## 45 5 1096.322 0.9380672 846.4105
## 50 0 1039.022 0.9463086 796.3599
## 50 1 1383.689 0.9023915 1051.4496
## 50 3 1153.527 0.9312673 896.8106
## 50 5 1097.045 0.9379773 846.2243
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 15 and neighbors = 0.
ggplot(caretGrid)
Tuning the model with the caret package over a grid of committee and neighbor settings, the best fit used 15 committees and no neighbors, achieving a cross-validated RMSE of 1024.26 and an R-squared of 0.9473.
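Rather than reading the winning settings off the printout, the best tuning parameters and their resampled performance can be pulled from the caret object directly. A small sketch, assuming the `caretGrid` object fit above:

```r
# The committees/neighbors combination selected by 10-fold CV
caretGrid$bestTune

# Resampled metrics for just the winning row of the tuning grid
subset(caretGrid$results,
       committees == caretGrid$bestTune$committees &
       neighbors == caretGrid$bestTune$neighbors)
```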
I’ll now run this best-fit model against the test data to confirm similar results.
modelPrediction <- predict(caretGrid, testSet[, !(names(testSet) %in% c("MonthlyIncome"))])
#RMSE
cubistRMSE <- sqrt(mean((modelPrediction - testSet$MonthlyIncome)^2))
cubistRMSE
## [1] 1197.722
cubistR2 <- cor(modelPrediction, testSet$MonthlyIncome)^2
cubistR2
## [1] 0.9396154
On the test set, the tuned model achieved an RMSE of 1197.72 and an R^2 of 0.9396. The gap from the cross-validated RMSE suggests the model is slightly overfit to the training data, but it still outperforms the linear regression model.
Finally, I’ll use my highest-performing models to generate predictions for the two data sets with missing values, so that you can validate my models and methodology.
attritionData$PredictedValue <- predict(rfModel, attritionData)
salaryData$PredictedValue <- predict(caretGrid, salaryData)
attritionExport <- attritionData %>% select(c("ID", "PredictedValue"))
write.csv(attritionExport, "Case2PredictionsMcDonald Attrition.csv", row.names = FALSE)
salaryExport <- salaryData %>% select(c("ID", "PredictedValue"))
write.csv(salaryExport, "Case2PredictionsMcDonald Salary.csv", row.names = FALSE)