Setting Up Google Colab and Installing Packages
Introduction
In this project, you will learn how to set up Google Colab, an online environment that allows you to write and execute Python code in your browser. You will also learn how to install necessary packages that will be used for analyzing HR datasets.
Steps to Set Up Google Colab
1. Access Google Colab
- Open your web browser.
- Navigate to Google Colab (colab.research.google.com).
- If you are not signed in, sign in with your Google account.
2. Create a New Notebook
- Once you are logged in, click on “File”.
- From the dropdown menu, select “New Notebook”. This will create a new Colab notebook.
3. Rename the Notebook
- Click on the notebook title (by default something like “Untitled0.ipynb”) at the top left corner of the page.
- Rename it to something descriptive like “HR_Dataset_Analysis”.
Installing Packages
To analyze HR datasets, you will need a few essential Python packages: pandas for data manipulation and matplotlib for data visualization. Both come preinstalled in Colab, so the install commands below mainly confirm that the packages are available.
1. Install the pandas Package
Below is the code to install the pandas package; run it in a code cell within your Colab notebook.
!pip install pandas
2. Install the matplotlib Package
Similarly, you can install the matplotlib package using the code below.
!pip install matplotlib
3. Import the Packages
After installing the packages, you need to import them to use in your project. Add the following lines to your Colab notebook:
import pandas as pd
import matplotlib.pyplot as plt
Complete Setup Code Block
Here’s a complete code block to set up your Google Colab environment and install required packages:
# Install necessary packages
!pip install pandas
!pip install matplotlib
# Import installed packages
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
# Test the imports by printing versions
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
Conclusion
You have now set up your Google Colab environment and installed the necessary packages to begin analyzing HR datasets. You are ready to import data and perform advanced analytics in the subsequent units of this project.
Make sure to save your notebook regularly by clicking “File” and then “Save”, or simply press Ctrl+S.
Data Importation and Initial Overview
1. Data Importation
Import Required Libraries
import pandas as pd
Load Dataset
Load a CSV file from Google Drive.
from google.colab import drive
drive.mount('/content/drive')
# Adjust the file path accordingly
file_path = '/content/drive/My Drive/dataset/hr_data.csv'
df = pd.read_csv(file_path)
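If the CSV is not stored on Google Drive, Colab also accepts a direct upload from your local machine. A minimal alternative sketch, assuming the uploaded file is named hr_data.csv:
from google.colab import files
# Opens a file picker in the browser; the file lands in Colab's working directory
uploaded = files.upload()
df = pd.read_csv('hr_data.csv')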
2. Initial Overview
Display the First Few Rows
print("First 5 Rows of the Dataset:")
print(df.head())
Check the Shape of the Dataset
print("Shape of the Dataset (rows, columns):")
print(df.shape)
Display Column Names
print("Column Names:")
print(df.columns.tolist())
Data Types of Each Column
print("Data Types of Columns:")
print(df.dtypes)
Summary Statistics
print("Summary Statistics:")
print(df.describe(include='all'))
Check for Missing Values
print("Missing Values in Each Column:")
print(df.isnull().sum())
Check for Duplicates
print("Number of Duplicate Rows:")
print(df.duplicated().sum())
Summary
The code above handles:
- Data importation from Google Drive
- Displaying initial exploratory data analysis (EDA) including:
- First few rows of the dataset
- Shape of the dataset
- Column names
- Data types of each column
- Summary statistics
- Missing values check
- Duplicate rows check
Execute each code block in sequence to comprehensively understand your HR dataset, paving the way for advanced analytics.
Data Cleaning and Preprocessing for HR Datasets
The goal is to clean and preprocess the HR dataset to make it ready for analysis. Let’s focus on the following steps:
- Handling Missing Values
- Converting Data Types
- Handling Duplicates
- Feature Engineering
- Data Normalization/Standardization
1. Handling Missing Values
Replace missing values appropriately based on the column type and business logic.
# Import pandas and load HR dataset (assuming data_frame is your DataFrame)
import pandas as pd
# Fill missing numerical values with the median
data_frame['salary'] = data_frame['salary'].fillna(data_frame['salary'].median())
# Fill missing categorical values with the mode
data_frame['department'] = data_frame['department'].fillna(data_frame['department'].mode()[0])
2. Converting Data Types
Ensure all columns have the correct data types for analysis.
# Convert hire_date to datetime
data_frame['hire_date'] = pd.to_datetime(data_frame['hire_date'])
# Convert salary to float
data_frame['salary'] = data_frame['salary'].astype(float)
3. Handling Duplicates
Remove duplicate records if any.
# Remove duplicate employee records, keeping the first occurrence of each employee_id
data_frame.drop_duplicates(subset=['employee_id'], inplace=True)
4. Feature Engineering
Create new features that may be useful for analysis.
# Create tenure feature (assuming today is the reference date)
data_frame['tenure'] = (pd.Timestamp.now() - data_frame['hire_date']).dt.days // 365
# Create a feature to indicate if salary is above median
data_frame['above_median_salary'] = data_frame['salary'] > data_frame['salary'].median()
5. Data Normalization/Standardization
Normalize or standardize numerical features if necessary for analysis.
from sklearn.preprocessing import StandardScaler
# Scale 'salary' and 'tenure'
scaler = StandardScaler()
data_frame[['salary', 'tenure']] = scaler.fit_transform(data_frame[['salary', 'tenure']])
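If min-max normalization to a 0–1 range suits your analysis better than standardization, an equivalent sketch (run instead of the StandardScaler step above) would be:
from sklearn.preprocessing import MinMaxScaler
# Rescale 'salary' and 'tenure' to the [0, 1] range instead of z-scores
min_max_scaler = MinMaxScaler()
data_frame[['salary', 'tenure']] = min_max_scaler.fit_transform(data_frame[['salary', 'tenure']])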
Final Preprocessed Data Overview
Check the final state of the dataset to ensure it’s ready for advanced analytics.
# Display final data structure and types
print(data_frame.info())
print(data_frame.head())
This implementation provides a hands-on guide for cleaning and preprocessing your HR dataset. You can proceed with advanced analytics in the subsequent parts of your project.
Exploratory Data Analysis (EDA)
Overview
EDA involves summarizing the main characteristics of a dataset, often with visual methods. This helps in understanding the structure, patterns, and relationships in the data. We’ll be using Python in Google Colab.
Steps for EDA
Summary Statistics
- Objective: Generate summary statistics for numerical and categorical features.
- Implementation:
# Assuming `df` is your DataFrame
numerical_summary = df.describe()
categorical_summary = df.describe(include=['object', 'category'])
print("Numerical Summary:\n", numerical_summary)
print("\nCategorical Summary:\n", categorical_summary)
Missing Values Analysis
- Objective: Identify and analyze missing values in the dataset.
- Implementation:
missing_values = df.isnull().sum()
missing_ratio = df.isnull().mean()
print("Missing Values:\n", missing_values)
print("\nMissing Ratio:\n", missing_ratio)
Data Distribution
- Objective: Visualize the distribution of numerical features.
- Implementation:
import matplotlib.pyplot as plt
import seaborn as sns
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
for feature in numerical_features:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[feature], kde=True)
    plt.title(f'Distribution of {feature}')
    plt.show()
Correlation Matrix
- Objective: Understand relationships between numerical variables.
- Implementation:
corr_matrix = df.corr(numeric_only=True)  # correlations over numeric columns only
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Categorical Data Analysis
- Objective: Analyze the distribution and relationship of categorical data.
- Implementation:
categorical_features = df.select_dtypes(include=['object', 'category']).columns
for feature in categorical_features:
    plt.figure(figsize=(10, 6))
    sns.countplot(y=df[feature], order=df[feature].value_counts().index)
    plt.title(f'Distribution of {feature}')
    plt.show()
Pair Plot
- Objective: Visualize relationships between numerical variables using pair plots.
- Implementation:
sns.pairplot(df[numerical_features])
plt.show()
Conclusion
This EDA will give you a comprehensive understanding of your HR dataset, which is crucial before moving on to advanced analytics.
Visualizing HR Metrics
Assuming that the data has already been imported, cleaned, and preprocessed, and exploratory data analysis has been conducted, let’s move on to visualizing HR metrics.
1. Import Necessary Libraries
Make sure you have the required libraries:
import matplotlib.pyplot as plt
import seaborn as sns
# pandas is only needed here if the DataFrame has not already been loaded
import pandas as pd
2. Example HR Dataset
Consider you have a DataFrame named hr_data with columns such as employee_id, age, department, salary, years_at_company, satisfaction_level, performance_score, etc.
# For demonstration purposes, here's an example of what the DataFrame could look like
# hr_data = pd.read_csv('path_to_file.csv')
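If you just want to try the plots below before wiring up a real file, you can construct a small illustrative DataFrame with the same column names (the values here are made up purely for demonstration):
import pandas as pd
# Tiny synthetic sample so the plots below have something to draw
hr_data = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6],
    'age': [25, 34, 45, 29, 41, 38],
    'department': ['Sales', 'HR', 'IT', 'Sales', 'IT', 'HR'],
    'salary': [50000, 62000, 75000, 54000, 80000, 60000],
    'years_at_company': [1, 5, 10, 2, 8, 6],
    'satisfaction_level': [0.60, 0.80, 0.70, 0.50, 0.90, 0.65],
    'performance_score': [3, 4, 5, 3, 5, 4]
})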
3. Visualization Examples
Employee Distribution by Department
plt.figure(figsize=(10, 6))
sns.countplot(data=hr_data, x='department', palette='viridis')
plt.title('Employee Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Number of Employees')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Employee Age Distribution
plt.figure(figsize=(10, 6))
sns.histplot(hr_data['age'], bins=20, kde=True, color='blue')
plt.title('Employee Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Satisfaction Level by Department
plt.figure(figsize=(12, 8))
sns.boxplot(data=hr_data, x='department', y='satisfaction_level', palette='coolwarm')
plt.title('Satisfaction Level by Department')
plt.xlabel('Department')
plt.ylabel('Satisfaction Level')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Salary Distribution by Department
plt.figure(figsize=(12, 8))
sns.boxplot(data=hr_data, x='department', y='salary', palette='magma')
plt.title('Salary Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Salary')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Performance Score vs. Satisfaction Level
plt.figure(figsize=(10, 6))
sns.scatterplot(data=hr_data, x='performance_score', y='satisfaction_level', hue='department', palette='tab10', s=100)
plt.title('Performance Score vs. Satisfaction Level')
plt.xlabel('Performance Score')
plt.ylabel('Satisfaction Level')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
4. Conclusion
These plots are essential for understanding various HR metrics. They help visualize the data, making it easier to identify trends and patterns. You can generate these visualizations and customize them further to address specific questions or insights needed from your HR data. By running these examples in Google Colab, you should be able to derive meaningful insights effortlessly.
Analyzing Employee Demographics
Here we will analyze employee demographic data. This includes calculating essential statistics, analyzing distributions, and providing insights on various demographic metrics.
Prerequisite: Importing Necessary Libraries
Ensure you have the following libraries imported if not already done in the earlier sections.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Optional: To display all columns of a DataFrame
pd.set_option('display.max_columns', None)
Step 1: Loading the Data
Assuming the dataset is named employee_data.csv and resides in your Google Drive.
from google.colab import drive
drive.mount('/content/drive')
# Load the data
file_path = '/content/drive/My Drive/employee_data.csv'
df = pd.read_csv(file_path)
df.head()
Step 2: Analyzing Basic Demographic Information
Let’s start by getting an overview of the demographic variables such as age, gender, department, and education.
# Basic statistics for age
age_stats = df['age'].describe()
print("Age Statistics:\n", age_stats)
# Gender distribution
gender_distribution = df['gender'].value_counts()
print("\nGender Distribution:\n", gender_distribution)
# Department distribution
department_distribution = df['department'].value_counts()
print("\nDepartment Distribution:\n", department_distribution)
# Education level distribution
education_distribution = df['education'].value_counts()
print("\nEducation Level Distribution:\n", education_distribution)
Step 3: Visualizing Demographic Data
3.1 Age Distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['age'], kde=True, bins=30, color='blue')
plt.title('Age Distribution of Employees')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
3.2 Gender Distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='gender', palette='Set2')
plt.title('Gender Distribution of Employees')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
3.3 Department Distribution
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='department', palette='Set3')
plt.title('Department Distribution')
plt.xlabel('Department')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
3.4 Education Level Distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='education', palette='Set1')
plt.title('Education Level Distribution')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
Step 4: Cross-Analyzing Demographic Data
4.1 Gender vs. Department
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='department', hue='gender', palette='Set2')
plt.title('Gender Distribution Across Departments')
plt.xlabel('Department')
plt.ylabel('Count')
plt.legend(title='Gender')
plt.xticks(rotation=45)
plt.show()
4.2 Education Level vs. Age
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='education', y='age', palette='Set1')
plt.title('Age Distribution Across Education Levels')
plt.xlabel('Education Level')
plt.ylabel('Age')
plt.xticks(rotation=45)
plt.show()
Step 5: Generating Summary Report
Collecting the insights into a summary report.
summary_report = {
"age_statistics": age_stats.to_dict(),
"gender_distribution": gender_distribution.to_dict(),
"department_distribution": department_distribution.to_dict(),
"education_distribution": education_distribution.to_dict()
}
# Convert summary report to DataFrame for better visualization
summary_df = pd.DataFrame(summary_report)
summary_df
We have now analyzed the employee demographics by calculating key statistics and visualizing the data to glean insights. This should allow us to understand the demographic makeup of the workforce comprehensively.
Attrition Analysis
In this section, we’ll use Python to perform attrition analysis to understand and predict why employees leave a company. You can leverage this to make data-driven decisions to improve employee retention.
1. Data Preparation
Assume you have already loaded and cleaned your dataset.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Assume `df` is your preprocessed DataFrame
# Split the data into features and target
X = df.drop("Attrition", axis=1)
y = df["Attrition"]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
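The split and scaling above assume every feature column is already numeric. If the raw HR data still contains text columns (for example a department column) or a Yes/No attrition label, one common approach is to encode them before the split; a minimal sketch, with the column names being assumptions:
# Run before the train/test split if text columns are still present
# One-hot encode categorical features (drop_first avoids redundant dummy columns)
X = pd.get_dummies(df.drop("Attrition", axis=1), drop_first=True)
# Map a Yes/No attrition label to 1/0
y = df["Attrition"].map({"Yes": 1, "No": 0})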
2. Model Training
We’ll use a RandomForestClassifier for the prediction.
# Initialize the RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
model.fit(X_train, y_train)
3. Model Evaluation
Evaluate the model using the test dataset.
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the predictions
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
4. Feature Importance
Identify important features that contribute to attrition.
# Get feature importances
importance = model.feature_importances_
features = X.columns
# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importance})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
print(feature_importance_df)
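A horizontal bar chart often makes the ranking easier to scan than the printed table; a quick sketch using matplotlib and the feature_importance_df built above:
import matplotlib.pyplot as plt
# Plot the ten most important features, largest at the top
top_features = feature_importance_df.head(10).iloc[::-1]
plt.figure(figsize=(10, 6))
plt.barh(top_features['Feature'], top_features['Importance'])
plt.title('Top 10 Features Driving Attrition Predictions')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()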
5. Results Interpretation
Interpret the results to make business decisions.
- The confusion matrix and classification report provide insights into the model's performance.
- Accuracy score gives an overall idea of how well the model is performing.
- The feature importance dataframe allows you to see which features have the most impact on employee attrition.
- Use this information to delve deeper into the most influential features, such as job satisfaction, number of projects, and average working hours, and take corrective actions.
6. Conclusion
By understanding which features affect employee attrition and how accurately you can predict it, your organization can implement more effective retention strategies.
This hands-on tutorial demonstrated how to conduct an attrition analysis using machine learning models, evaluate the model performance, and extract insights to improve HR policies.
Performance Evaluation and Metrics
In the context of analyzing HR datasets, performance evaluation often involves measuring the efficacy of predictive models or assessing the health of employee metrics. Below is a practical implementation focusing on evaluating a predictive model for employee attrition using Python in Google Colab.
1. Importing Required Libraries
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
2. Load the preprocessed dataset
Assuming you have a DataFrame named df that has already been cleaned and preprocessed.
# Example: Load preprocessed data
df = pd.read_csv('preprocessed_hr_dataset.csv')
3. Splitting Data
from sklearn.model_selection import train_test_split
# Features and target variable
X = df.drop('Attrition', axis=1) # Features
y = df['Attrition'] # Target variable
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
4. Train a Predictive Model
from sklearn.ensemble import RandomForestClassifier
# Instantiate the model
model = RandomForestClassifier(random_state=42)
# Train the model
model.fit(X_train, y_train)
5. Make Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
6. Performance Metrics Evaluation
Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Classification Report
report = classification_report(y_test, y_pred)
print('Classification Report:')
print(report)
ROC AUC Score and ROC Curve
roc_auc = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f'ROC Curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--') # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='best')
plt.show()
7. Implementing Advanced Metrics
For more in-depth analysis, include metrics like Precision-Recall Curve, F1 Score, etc.
F1 Score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
print(f'F1 Score: {f1:.2f}')
Precision-Recall Curve
from sklearn.metrics import precision_recall_curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
By utilizing these metrics, you can comprehensively evaluate the performance of your predictive models in your HR dataset analysis project.
Compensation and Benefits Analysis
In this section, we’ll analyze the compensation and benefits data to gain insights into trends, distributions, and identify any potential disparities. We’ll utilize DataFrames and visualization libraries available in Python within Google Colab.
Step 1: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load Data
Assume compensation_data.csv is the dataset containing the relevant information.
# Load dataset
comp_df = pd.read_csv('compensation_data.csv')
# Preview dataframe
comp_df.head()
Step 3: Descriptive Statistics
# Basic descriptive statistics
comp_df.describe()
# Distribution of Compensation Levels
plt.figure(figsize=(10, 6))
sns.histplot(comp_df['salary'], bins=30, kde=True)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()
Step 4: Compensation by Department and Job Title
# Average salary by department
avg_salary_dept = comp_df.groupby('department')['salary'].mean().reset_index()
# Plot average salary by department
plt.figure(figsize=(12, 8))
sns.barplot(data=avg_salary_dept, x='department', y='salary')
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary')
plt.xticks(rotation=45)
plt.show()
# Average salary by job title
avg_salary_title = comp_df.groupby('job_title')['salary'].mean().reset_index()
# Plot average salary by job title
plt.figure(figsize=(12, 10))
sns.barplot(data=avg_salary_title, x='salary', y='job_title')
plt.title('Average Salary by Job Title')
plt.xlabel('Average Salary')
plt.ylabel('Job Title')
plt.show()
Step 5: Gender Pay Gap Analysis
# Average salary by gender
avg_salary_gender = comp_df.groupby('gender')['salary'].mean().reset_index()
# Plot average salary by gender
plt.figure(figsize=(8, 6))
sns.barplot(data=avg_salary_gender, x='gender', y='salary')
plt.title('Average Salary by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Salary')
plt.show()
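Beyond the chart, it can help to express the gap as a single number. A simple sketch that compares mean salaries by gender; note this raw figure does not control for role, tenure, or level:
# Mean salary per gender
mean_by_gender = comp_df.groupby('gender')['salary'].mean()
print(mean_by_gender)
# Unadjusted pay gap as a percentage of the highest group mean
pay_gap_pct = (mean_by_gender.max() - mean_by_gender.min()) / mean_by_gender.max() * 100
print(f"Unadjusted pay gap: {pay_gap_pct:.1f}%")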
Step 6: Benefits Analysis
Assuming the dataset has columns related to benefits like ‘health_insurance’, ‘retirement_plan’, ‘paid_time_off’, etc.
# Count of benefits offered (assumes each benefit column is a 0/1 indicator)
benefit_columns = ['health_insurance', 'retirement_plan', 'paid_time_off']
benefit_counts = comp_df[benefit_columns].sum().reset_index()
benefit_counts.columns = ['Benefit', 'Count']
# Plot benefits distribution
plt.figure(figsize=(10, 6))
sns.barplot(data=benefit_counts, x='Benefit', y='Count')
plt.title('Distribution of Benefits Offered')
plt.xlabel('Benefit')
plt.ylabel('Count')
plt.show()
Step 7: Compensation vs. Performance
Assume performance_score
column exists.
# Scatter plot of salary vs performance score
plt.figure(figsize=(10, 6))
sns.scatterplot(data=comp_df, x='performance_score', y='salary')
plt.title('Salary vs Performance Score')
plt.xlabel('Performance Score')
plt.ylabel('Salary')
plt.show()
# Average salary by performance score
avg_salary_perf = comp_df.groupby('performance_score')['salary'].mean().reset_index()
# Plot average salary by performance score
plt.figure(figsize=(8, 6))
sns.barplot(data=avg_salary_perf, x='performance_score', y='salary')
plt.title('Average Salary by Performance Score')
plt.xlabel('Performance Score')
plt.ylabel('Average Salary')
plt.show()
Conclusion
This analysis helps identify trends and insights such as average compensation by department, job title, gender, benefits distribution, and correlation between compensation and performance. This structured approach enables a comprehensive understanding of compensation and benefits within the organization.
Predictive Modeling for Employee Attrition
You are now ready to create a predictive model based on the cleaned and preprocessed HR dataset. The goal is to predict whether an employee will leave the company (attrition).
Step 1: Import Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Prepare the Dataset
Assuming your data is in a DataFrame named df, and the target variable (attrition) is a column named Attrition.
# Separate target variable
X = df.drop('Attrition', axis=1)
y = df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 3: Build and Train the Model
Using Random Forest Classifier as an example:
# Initialize the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
rf_model.fit(X_train, y_train)
Step 4: Evaluate the Model
# Predict on test data
y_pred = rf_model.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
Step 5: Interpret Results
Use the classification_report
and confusion_matrix
to understand the performance of your model. The accuracy_score
gives a quick metric of how well your model is doing.
# Check feature importance
feature_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(feature_importances.sort_values(ascending=False))
Conclusion
This implementation covers the complete pipeline for predictive modeling of employee attrition using a Random Forest Classifier in Python. You can adjust the model or hyperparameters as necessary to improve performance.
Advanced Analytics with Machine Learning
Step 1: Preparing the Dataset for Advanced Analytics
Ensure your dataset is loaded and preprocessed correctly. Assuming the dataset from the previous sections is already cleaned and ready:
# Assuming 'hr_data' is your cleaned and preprocessed Pandas DataFrame
import pandas as pd
# Load preprocessed data
# hr_data = pd.read_csv('preprocessed_hr_data.csv')
Step 2: Split Dataset into Features and Target Variable
Here, we’ll consider ‘Attrition’ as the target variable for classification tasks:
# Separate features and target variable
X = hr_data.drop(columns=['Attrition'])
y = hr_data['Attrition']
Step 3: Train-Test Split
Split the data into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Feature Scaling
Standardize the feature variables:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Step 5: Model Selection and Training
Let’s use three different algorithms: Logistic Regression, Random Forest, and Gradient Boosting for this example.
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
log_reg_pred = log_reg.predict(X_test_scaled)
print("Logistic Regression Classification Report:\n", classification_report(y_test, log_reg_pred))
Random Forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()
random_forest.fit(X_train_scaled, y_train)
random_forest_pred = random_forest.predict(X_test_scaled)
print("Random Forest Classification Report:\n", classification_report(y_test, random_forest_pred))
Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
grad_boost = GradientBoostingClassifier()
grad_boost.fit(X_train_scaled, y_train)
grad_boost_pred = grad_boost.predict(X_test_scaled)
print("Gradient Boosting Classification Report:\n", classification_report(y_test, grad_boost_pred))
Step 6: Hyperparameter Tuning with GridSearchCV on the Best Model
Choose the best model based on initial results and perform hyperparameter tuning:
from sklearn.model_selection import GridSearchCV
# Example for hyperparameter tuning of Gradient Boosting
param_grid = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 4, 5]
}
grid_search = GridSearchCV(estimator=grad_boost, param_grid=param_grid, cv=3, verbose=2, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
best_grad_boost = grid_search.best_estimator_
best_grad_boost_pred = best_grad_boost.predict(X_test_scaled)
print("Best Gradient Boosting Classification Report:\n", classification_report(y_test, best_grad_boost_pred))
Step 7: Model Evaluation and Interpretation
Evaluate the best model’s performance using confusion matrix and AUC-ROC:
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, best_grad_boost_pred)
print("Confusion Matrix:\n", conf_matrix)
# AUC-ROC
y_pred_proba = best_grad_boost.predict_proba(X_test_scaled)[:,1]
auc_roc = roc_auc_score(y_test, y_pred_proba)
print("AUC-ROC Score:", auc_roc)
# Plotting ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'Gradient Boosting (area = {auc_roc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
Conclusion
By following these steps, you should have successfully completed an advanced machine learning analysis on your HR dataset in Google Colab. The implementation provided includes model training, hyperparameter tuning, and evaluation, which helps in deriving meaningful insights and making data-driven decisions.
Reporting and Dashboard Creation
1. Import Necessary Libraries
Ensure you have the following libraries loaded. They are necessary for creating reports and dashboards in Google Colab.
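Plotly usually comes preinstalled in Colab, but Dash may not; if the imports below fail, install it first in a code cell:
!pip install dash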
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from plotly import express as px
import plotly.graph_objects as go
from dash import Dash, html, dcc
2. Example Data Import
Assuming the HR dataset has been preprocessed, let’s load the cleaned dataset.
df = pd.read_csv('cleaned_hr_dataset.csv')
3. Summary Report Generation
Create summary statistical reports using pandas.
summary = df.describe()
summary.to_csv('summary_report.csv')
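If you want the report on your local machine rather than inside the Colab VM's filesystem, you can download it with google.colab.files (this helper works only inside Colab):
from google.colab import files
# Download the CSV report generated above
files.download('summary_report.csv')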
4. Example Dashboard Layout
Use Dash
for creating an interactive dashboard.
app = Dash(__name__)
app.layout = html.Div(children=[
html.H1(children='HR Data Dashboard'),
dcc.Graph(
id='example-graph',
figure=px.histogram(df, x='YearsAtCompany', title='Employees by Years At Company')
),
dcc.Graph(
id='attrition-rate',
figure=px.pie(df, names='Attrition', title='Attrition Rate')
),
dcc.Graph(
id='dept-distribution',
figure=px.bar(df, x='Department', y='EmployeeCount', title='Department Distribution')
)
])
if __name__ == '__main__':
    app.run(debug=True)  # use app.run_server(debug=True) on older Dash releases
5. Combining and Serving the Dashboard
Ensure the DataFrame manipulations and visualizations are cohesive and integrate them into a running dashboard.
# Additional Graphs and Components as Needed
app.layout = html.Div(children=[
html.H1(children='HR Data Dashboard'),
dcc.Tabs([
dcc.Tab(label='Overview', children=[
html.Div([
dcc.Graph(
id='overview-bar',
figure=px.bar(df, x='JobRole', y='MonthlyIncome', title='Monthly Income by Job Role')
),
dcc.Graph(
id='overview-pie',
figure=px.pie(df, names='Gender', title='Gender Distribution')
)
])
]),
dcc.Tab(label='Attrition Analysis', children=[
html.Div([
dcc.Graph(
id='attrition-histogram',
figure=px.histogram(df, x='Age', color='Attrition', barmode='group', title='Attrition by Age')
),
dcc.Graph(
id='attrition-dept',
figure=px.bar(df, x='Department', y='AttritionRate', title='Attrition Rate by Department')
)
])
]),
# Additional tabs can be defined here
])
])
if __name__ == '__main__':
    app.run(debug=True, port=8050)
Additional Sections
You can expand with more plots and computations as needed and add them to the dashboard layout.
With this implementation, you should be able to run an interactive HR report and visualization dashboard directly in Google Colab or any local environment supporting Plotly and Dash. The provided code snippets cover fundamental aspects of reporting and dashboard creation from loading data to rendering interactive graphs.