In this machine learning project we are going to learn why employees are leaving. What factors push employees to leave the company? Which department is most affected, and what are the reasons behind it?
Things I have done in this project:
1. Data Analysis
2. Which Department is most affected
3. Reasons Behind Attrition
4. Removing unnecessary data from our dataset
5. Converting our dataset for machine learning model training
The machine learning models used here:
1. Logistic Regression (accuracy and confusion matrix)
2. Random Forest Classifier (accuracy and confusion matrix)
The first thing we need to do is understand what kind of data we have. Then we will remove all the unnecessary columns, fill in null values if there are any, visualise the data, and finally train our machine learning models.
The key to success in any organization is attracting and retaining top talent. I’m an HR analyst at my company, and one of my tasks is to determine which factors keep employees at my company and which prompt others to leave. I need to know what factors I can change to prevent the loss of good people.
Content
I have data about past and current employees in a spreadsheet on my desktop. It has various data points on our employees, but I’m most interested in whether they’re still with my company or whether they’ve gone to work somewhere else. And I want to understand how this relates to workforce attrition.
d.info() # shows the size of the data, the data type of each column and the non-null counts
d.describe() # gives the mean, standard deviation and other statistical summaries of the numeric columns
# Categorical columns
d.select_dtypes(include='object').columns # object (text) columns can't be passed directly to a machine learning model; here we just list their names and see how many there are, and we will convert them later on
Step 3 : Removing All the Unnecessary Data
Looking through the data in Excel, we see that some columns add nothing to model accuracy, so there is no need to pass them to the model. Examples are EmployeeId and Employeecount (this value is the same for every row, so passing it to our machine learning model is useless). In the same way I have identified other useless columns, and we are going to drop them.
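Instead of spotting such columns by eye, the check can also be done in pandas. The following is a minimal sketch on a made-up toy table, not the real dataset; the column names and values are illustrative assumptions, and the identifier column is still listed by hand.

```python
import pandas as pd

# Toy stand-in for the employee data; names and values are made up for illustration.
toy = pd.DataFrame({
    "EmployeeId": [1, 2, 3, 4],          # unique per row, so it is only an identifier
    "EmployeeCount": [1, 1, 1, 1],       # same value in every row
    "Age": [29, 41, 35, 52],
    "Attrition": ["Yes", "No", "No", "Yes"],
})

# A column with a single distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Identifier columns are named by hand here (an assumption of this sketch).
id_cols = ["EmployeeId"]

toy_clean = toy.drop(columns=constant_cols + id_cols)
print(constant_cols)
print(toy_clean.columns.tolist())
```

On the toy table this flags only EmployeeCount as constant and leaves Age and Attrition after the drop.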
We handle object data because it can't be sent directly to a machine learning model. The models we are using only understand numbers, not words. So first we will see how many object columns we have, and then we will convert them into numeric 0/1 dummy columns (uint8 in older pandas versions, boolean in pandas 2.0 and later).
dataset.select_dtypes(include='object').columns # lists all the object columns; we ran this command earlier as well, so compare the two outputs by running it on your system or checking my GitHub
dataset = pd.get_dummies(data=dataset, drop_first=True) # convert the object columns into numeric dummy columns
dataset.info() # run this command and check the data types yourself
dataset.rename(columns={'Attrition_Yes':'Attrition'}, inplace=True) #changing the column name
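To see what the conversion and the rename do, here is a small sketch on a made-up three-row table (the values and the Department column are illustrative assumptions, not taken from the real dataset). With drop_first=True, the first category of each text column (in sorted order) is dropped, so a Yes/No column becomes a single Attrition_Yes column.

```python
import pandas as pd

# Made-up toy rows, only to illustrate the encoding step.
toy = pd.DataFrame({
    "Age": [29, 41, 35],
    "Department": ["Sales", "HR", "Sales"],
    "Attrition": ["Yes", "No", "No"],
})

# Text columns become dummy columns; numeric columns are left untouched.
encoded = pd.get_dummies(data=toy, drop_first=True)

# "Attrition_Yes" is 1/True for employees who left, so we rename it back.
encoded = encoded.rename(columns={"Attrition_Yes": "Attrition"})

print(encoded.columns.tolist())
print(encoded["Attrition"].astype(int).tolist())
```

After the rename, Attrition is 1 (or True) for employees who left and 0 (or False) for those who stayed.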
Step 6 : Data Splitting
x = dataset.drop(columns='Attrition') #include all except Attrition
y = dataset['Attrition'] #include only Attrition data column
from sklearn.model_selection import train_test_split
Split the data into training data and testing data:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
test_size = 0.2 means that 20% of our data is used for testing and the rest is used for training
random_state = 42 means that whenever we run this command we will get the same split
x_train = the input variables (every column except Attrition) used for training
x_test = the input variables held out for testing; the model makes its predictions on these
y_train = contains only the Attrition column for the training rows
y_test = contains only the Attrition column for the testing rows; it is used to verify how well our machine learning model predicts
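A quick way to convince yourself of what test_size and random_state do is to check the shapes on a small synthetic array. This is only a sketch: the data is random and the names x_toy, y_toy and the *_again variables are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 input columns, binary target (random values).
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42
)
print(x_tr.shape, x_te.shape)  # 80 training rows, 20 testing rows

# Running the split again with the same random_state returns the same rows.
x_tr_again, x_te_again, y_tr_again, y_te_again = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(x_te, x_te_again))
```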
Step 7 : Apply Machine Learning Model
We are first going to use a Logistic Regression model, because we only want to know whether an employee is going to leave or not (a yes/no outcome), and logistic regression is well suited to this kind of binary classification.
from sklearn.linear_model import LogisticRegression
lgr = LogisticRegression(random_state=0)
lgr.fit(x_train,y_train)
y_pred = lgr.predict(x_test)
from sklearn.metrics import accuracy_score, confusion_matrix
acc = accuracy_score(y_test, y_pred)
print(acc*100)
output = 86.39455782312925
confusion_matrix(y_test,y_pred)
array([[253, 2],
[ 38, 1]], dtype=int64)
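Accuracy alone can hide a lot, so it is worth reading the confusion matrix cell by cell. The sketch below only does arithmetic on the numbers printed above. It assumes scikit-learn's usual layout (rows are actual classes, columns are predicted classes, labels in sorted order) and that 1 means "left", which follows from renaming Attrition_Yes to Attrition.

```python
import numpy as np

# Confusion matrix printed above: rows = actual, columns = predicted,
# in the order [stayed (0), left (1)].
cm = np.array([[253, 2],
               [38, 1]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()      # share of all predictions that are correct
recall_left = tp / (tp + fn)         # share of actual leavers the model caught
precision_left = tp / (tp + fp)      # share of predicted leavers who really left
baseline = (tn + fp) / cm.sum()      # accuracy of always predicting "stayed"

print(f"accuracy       = {accuracy:.4f}")
print(f"recall (left)  = {recall_left:.4f}")
print(f"precision      = {precision_left:.4f}")
print(f"baseline       = {baseline:.4f}")
```

The accuracy matches the 86.39% reported above, but the model catches only 1 of the 39 employees who actually left (recall of about 2.6%). Always predicting "stayed" would score 255/294, about 86.7%, so for this problem accuracy should be read together with the confusion matrix.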
Random Forest Classifier Model
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=0)
rfc.fit(x_train,y_train)
y_rfc_pred = rfc.predict(x_test)
from sklearn.metrics import accuracy_score, confusion_matrix
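The evaluation for the random forest follows the same pattern as for logistic regression. Here is a self-contained sketch of that pattern on synthetic, imbalanced data generated with make_classification. It is not the attrition dataset, all the variable names are made up for the illustration, and the numbers it prints say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 85% "stayed" and 15% "left" (assumed ratio).
x_syn, y_syn = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    x_syn, y_syn, test_size=0.2, random_state=42
)

rfc_demo = RandomForestClassifier(random_state=0)
rfc_demo.fit(xs_train, ys_train)
ys_pred = rfc_demo.predict(xs_test)

# Same two metrics as for logistic regression.
rfc_acc = accuracy_score(ys_test, ys_pred)
rfc_cm = confusion_matrix(ys_test, ys_pred, labels=[0, 1])
print(rfc_acc * 100)
print(rfc_cm)
```

In the project itself the same two calls would be made with y_test and y_rfc_pred.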
If anything is unclear, just ask. I am glad to help.