Introduction to Data Cleaning and Pre-processing
In this unit, we will cover fundamental concepts of data cleaning and pre-processing. Data cleaning is critical in ensuring the quality and integrity of your data before performing any analysis. This guide provides practical techniques for managing missing values and outliers in Python.
Practical Techniques for Data Cleaning
Setup Instructions
Before we start, you need to have the following Python libraries installed:
- NumPy
- Pandas
- Matplotlib (for visualization)
You can install these using pip:
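For example, a single pip command covers all three:

```bash
pip install numpy pandas matplotlib
```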
Managing Missing Values
Import Libraries and Load Data
Identifying Missing Values
Handling Missing Values
- Removing Missing Values
- Imputing Missing Values
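Putting the steps above together, a minimal sketch might look like the following; the file name data.csv and the columns age and city are placeholders for your own data:

```python
import numpy as np
import pandas as pd

# Load the data (hypothetical file name)
df = pd.read_csv('data.csv')

# Identify missing values: count of NaNs per column
print(df.isnull().sum())

# Handle missing values
df_dropped = df.dropna()                              # remove rows containing any NaN
df['age'] = df['age'].fillna(df['age'].mean())        # impute a numeric column with its mean
df['city'] = df['city'].fillna(df['city'].mode()[0])  # impute a categorical column with its mode
```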
Managing Outliers
Identifying Outliers
Using the Interquartile Range (IQR) method:
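For example, continuing with the df loaded above and assuming a numeric column named value:

```python
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

outliers = df[(df['value'] < lower) | (df['value'] > upper)]
print(outliers)
```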
Handling Outliers
- Removing Outliers
- Transforming Outliers
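Continuing the sketch, using the placeholder value column and the lower/upper bounds computed above:

```python
# Remove outliers: keep only rows within the IQR bounds
df_no_outliers = df[(df['value'] >= lower) & (df['value'] <= upper)]

# Transform outliers: a log transform compresses extreme values
# (log1p assumes the column is non-negative)
df['value_log'] = np.log1p(df['value'])
```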
This unit provides a robust introduction to managing missing values and outliers using Python. Apply these techniques to clean and preprocess your datasets effectively.
Understanding Missing Data
Overview
Missing data is a common issue in datasets, and handling it effectively is crucial for accurate analysis. In this section, we'll discuss the types of missing data, how to identify them, and how to handle them using Python.
Types of Missing Data
- Missing Completely at Random (MCAR): No systematic difference between the missing and the observed values.
- Missing at Random (MAR): The tendency for a data point to be missing is related to some observed data but not the missing data itself.
- Missing Not at Random (MNAR): The missing data is related to the missing values themselves.
Identifying Missing Data
Example Dataset
Consider the following pandas DataFrame as our dataset:
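The original example table is not reproduced here, so as a stand-in, assume a small DataFrame with a few deliberately missing entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Charlie', 'Dana', 'Evan'],
    'age':    [25, np.nan, 35, 29, np.nan],
    'salary': [50000, 54000, np.nan, 61000, 58000],
    'city':   ['Paris', 'London', None, 'Berlin', 'Madrid'],
})
print(df)
```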
Identifying Missing Data
- Checking for Missing Values:
- Percentage of Missing Data:
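Both checks rely on isnull(); a minimal sketch using the example DataFrame above:

```python
# Checking for missing values (count per column)
print(df.isnull().sum())

# Percentage of missing data per column
print(df.isnull().mean() * 100)
```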
Handling Missing Data
1. Removal of Missing Data
- Removing Rows with Missing Values:
- Removing Columns with Missing Values:
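A brief sketch of both options (note that dropping a column discards all of its values, not just the missing ones):

```python
df_rows = df.dropna()          # remove rows that contain any missing value
df_cols = df.dropna(axis=1)    # remove columns that contain any missing value
```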
2. Imputation of Missing Data
- Imputation with Mean/Median/Mode:
- Forward Fill:
- Backward Fill:
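A sketch of these strategies, using the placeholder columns age, salary, and city from the example above:

```python
# Mean / median / mode imputation
df['age'] = df['age'].fillna(df['age'].mean())
df['salary'] = df['salary'].fillna(df['salary'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Forward fill: propagate the last valid observation forward
df_ffill = df.ffill()

# Backward fill: propagate the next valid observation backward
df_bfill = df.bfill()
```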
3. Advanced Methods
- Interpolation:
- Linear Interpolation:
- Using Machine Learning Algorithms:
- Example: Using KNNImputer from scikit-learn
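A sketch of both advanced approaches; the column name is a placeholder, and KNNImputer works on numeric columns only:

```python
# Linear interpolation of a numeric column
df['salary'] = df['salary'].interpolate(method='linear')

# KNN-based imputation with scikit-learn
from sklearn.impute import KNNImputer

numeric_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=3)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```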
Conclusion
Handling missing data involves a strategic approach based on the nature of the data and the context of the analysis. The methods illustrated above should provide a solid foundation for effective management of missing data in Python.
Detecting Missing Values in Dataframes
When tackling missing values in dataframes, particularly using Python, the pandas library provides a robust set of functionalities. Here’s how to practically implement missing value detection:
Import Necessary Libraries
Load Your DataFrame
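A minimal sketch for these two steps; data.csv is again a placeholder file name, and seaborn is assumed to be available for the visualization step below:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
```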
Detect Missing Values
1. Identifying Missing Values
2. Summarizing Missing Values
3. Visualizing Missing Values
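The three steps above might look like this, continuing with the df loaded earlier:

```python
# 1. Identifying missing values: a Boolean mask the same shape as df
mask = df.isnull()

# 2. Summarizing missing values: counts and percentages per column
print(df.isnull().sum())
print(df.isnull().mean() * 100)

# 3. Visualizing missing values: a heatmap of the Boolean mask
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing values by column')
plt.show()
```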
Conclusion
This implementation allows for the detection and visualization of missing values within dataframes. By utilizing isnull() and heatmap(), you can efficiently understand the extent and pattern of missing values in your dataset. Apply these steps directly to your data for practical missing value identification.
Handling Missing Data
Handling missing data is a critical task in data cleaning to ensure the integrity and quality of datasets. There are several techniques for dealing with missing values. Below are practical implementations of these methods.
1. Removing Missing Values
a. Removing Rows with Missing Values
b. Removing Columns with Missing Values
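A compact sketch of both variants, assuming df is the DataFrame you are cleaning:

```python
df_no_missing_rows = df.dropna()         # a. drop rows with any missing value
df_no_missing_cols = df.dropna(axis=1)   # b. drop columns with any missing value
```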
2. Imputing Missing Values
a. Imputing with Mean/Median/Mode
b. Imputing with Forward/Backward Fill
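A sketch using a placeholder numeric column named price:

```python
# a. Mean / median / mode imputation
df['price'] = df['price'].fillna(df['price'].mean())   # or .median(), .mode()[0]

# b. Forward / backward fill
df_ffill = df.ffill()
df_bfill = df.bfill()
```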
3. Using Interpolation
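For ordered numeric data (a time series, for instance), interpolation estimates missing points from their neighbours; a one-line sketch:

```python
df['price'] = df['price'].interpolate(method='linear')
```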
4. Imputing with a Predictive Model
This involves using machine learning models to predict and fill missing values.
a. Using Regression Imputation
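One way to sketch regression imputation: fit a regressor on the rows where the target column is present, then predict it where it is missing. The column names are placeholders, and the predictor column is assumed to be fully observed:

```python
from sklearn.linear_model import LinearRegression

# Rows where the target column is present vs. missing
known = df[df['price'].notnull()]
missing = df[df['price'].isnull()]

# Fit a regression on the complete rows ('size' is a hypothetical predictor)
model = LinearRegression()
model.fit(known[['size']], known['price'])

# Predict the missing values and write them back
df.loc[df['price'].isnull(), 'price'] = model.predict(missing[['size']])
```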
5. Creating a Missing Indicator
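A missing indicator preserves the fact that a value was absent even after it has been imputed; a one-line sketch for a placeholder column:

```python
df['price_was_missing'] = df['price'].isnull().astype(int)
```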
6. Dealing with Categorical Missing Values
a. Imputing with the Most Frequent Category
b. Imputing with a New Category
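Both options for a placeholder categorical column named category:

```python
# a. Impute with the most frequent category
df['category'] = df['category'].fillna(df['category'].mode()[0])

# b. Impute with an explicit new category instead
df['category'] = df['category'].fillna('Missing')
```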
7. Multivariate Feature Imputation with KNN
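A sketch with scikit-learn's KNNImputer, which fills each missing value from the k most similar rows measured on the other numeric features:

```python
from sklearn.impute import KNNImputer

numeric = df.select_dtypes(include='number')
imputer = KNNImputer(n_neighbors=5, weights='distance')
df[numeric.columns] = imputer.fit_transform(numeric)
```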
By using these strategies, you will be able to handle missing data effectively in various scenarios. Each method fits different contexts and types of data, helping to maintain data integrity for further analysis.
Introduction to Outliers
Outliers are data points that significantly differ from other observations. They can potentially distort the summary statistics of your dataset and can affect models by introducing noise. Below are steps to identify and handle outliers using Python.
Identifying Outliers
Let’s use a Python script to detect outliers in a dataset using the Z-score and IQR methods:
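A sketch of such a script on synthetic data (200 well-behaved points plus two injected extremes); the threshold of 3 and the 1.5 x IQR rule are the conventional defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=50, scale=5, size=200), [120, -30]])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print('Z-score outliers:', data[np.abs(z) > 3])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print('IQR outliers:', data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])
```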
Handling Outliers
Once identified, we can handle outliers by:
- Removing Outliers
- Replacing Outliers
Removing Outliers
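Using the IQR bounds as the filter, a removal sketch for a DataFrame df with a placeholder numeric column value:

```python
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

df_clean = df[(df['value'] >= lower) & (df['value'] <= upper)]
```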
Replacing Outliers
Here we replace outliers with the mean or median.
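A sketch that overwrites out-of-range values with the column median, reusing the lower/upper bounds computed above (the median is usually preferred over the mean because it is itself robust to outliers):

```python
median = df['value'].median()
is_outlier = (df['value'] < lower) | (df['value'] > upper)
df.loc[is_outlier, 'value'] = median
```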
Conclusion
Understanding and managing outliers is a critical step in data cleaning. These methods will help you detect and handle outliers effectively to ensure the quality and reliability of your data analysis.
Methods to Identify Outliers
1. Z-Score Method
The Z-Score method identifies outliers by determining the number of standard deviations an element is from the mean.
Implementation Steps:
- Calculate the mean (\( \mu \)) and standard deviation (\( \sigma \)) of the dataset.
- For each data point, calculate its Z-Score: \( Z = \frac{X - \mu}{\sigma} \).
- Identify data points whose absolute Z-Score exceeds a threshold (commonly 3) as outliers.
Pseudocode:
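A Python sketch of these steps; the threshold is a tuning choice:

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return the values whose |Z-score| exceeds the threshold."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std()
    return data[np.abs(z) > threshold]
```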
2. Interquartile Range (IQR) Method
The IQR method uses the quartiles to identify outliers.
Implementation Steps:
- Calculate Q1 (first quartile) and Q3 (third quartile) of the dataset.
- Compute IQR: \( IQR = Q3 - Q1 \).
- Determine the lower bound: \( Q1 - 1.5 \times IQR \).
- Determine the upper bound: \( Q3 + 1.5 \times IQR \).
- Identify data points outside these bounds as outliers.
Pseudocode:
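The corresponding sketch for the IQR rule, with the 1.5 multiplier exposed as a parameter:

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return data[(data < q1 - k * iqr) | (data > q3 + k * iqr)]
```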
3. Modified Z-Score Method
This method involves calculating the median and Median Absolute Deviation (MAD) to identify outliers.
Implementation Steps:
- Calculate the median of the dataset.
- Compute MAD: \( MAD = \mathrm{median}(|X_i - \mathrm{median}(X)|) \).
- Calculate the Modified Z-Score for each data point: \( M_Z = 0.6745 \cdot \frac{X_i - \mathrm{median}(X)}{MAD} \).
- Identify points whose absolute Modified Z-Score exceeds the recommended threshold (usually 3.5) as outliers.
Pseudocode:
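A sketch of the modified Z-score; note that the MAD can be zero when many values repeat, in which case this simple version would divide by zero:

```python
import numpy as np

def modified_zscore_outliers(data, threshold=3.5):
    """Return values whose |modified Z-score| exceeds the threshold."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    m_z = 0.6745 * (data - median) / mad
    return data[np.abs(m_z) > threshold]
```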
4. DBSCAN for Outlier Detection
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters and points that do not fit into any cluster (outliers).
Implementation Steps:
- Apply DBSCAN algorithm on the dataset.
- Identify points that are classified as noise (outliers).
Pseudocode:
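A sketch using scikit-learn's DBSCAN; eps and min_samples depend strongly on the scale and density of the data, so the defaults below are placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_outliers(X, eps=0.5, min_samples=5):
    """Return the rows of X that DBSCAN labels as noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return np.asarray(X)[labels == -1]
```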
5. Box Plot Method
Use Box Plots to visually identify outliers.
Implementation Steps:
- Plot a box plot of the dataset.
- Identify points outside the whiskers as outliers.
Pseudocode:
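A matplotlib sketch; the whiskers default to 1.5 x IQR, so the points drawn beyond them match the IQR rule above:

```python
import matplotlib.pyplot as plt

def boxplot_outliers(data):
    """Draw a box plot; points beyond the whiskers are the outliers."""
    plt.boxplot(data)
    plt.title('Box plot for outlier inspection')
    plt.show()
```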
Each method has its strengths depending on the data’s distribution and context of the analysis. The choice of method should align with the dataset characteristics and the specific problem being addressed.
Statistical Techniques for Outlier Detection
In this section, we will explore how to detect outliers using statistical techniques such as Z-Score and IQR (Interquartile Range).
1. Z-Score Method
The Z-Score method identifies outliers by calculating the number of standard deviations a data point is from the mean. Data points whose absolute Z-Score exceeds a chosen threshold (commonly 3) are considered outliers.
Steps:
- Compute the mean and standard deviation of the dataset.
- Calculate the Z-Score for each data point.
- Identify outliers based on the Z-Score threshold.
Pseudocode:
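In outline form, assuming data is a one-dimensional NumPy array of observations:

```python
mean, std = data.mean(), data.std()
z_scores = (data - mean) / std
outliers = data[abs(z_scores) > 3]
```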
2. IQR (Interquartile Range) Method
The IQR method identifies outliers by measuring the spread of the middle 50% of the data. Outliers are data points that fall more than 1.5 times the IQR below the first quartile (Q1) or above the third quartile (Q3).
Steps:
- Calculate Q1 (25th percentile) and Q3 (75th percentile).
- Compute the IQR (Q3 - Q1).
- Determine the lower and upper bound (cut-off points) for defining outliers.
- Identify data points outside these bounds.
Pseudocode:
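And for the IQR rule, under the same assumption about data (NumPy imported as np):

```python
import numpy as np

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```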
Example in Python
Here’s how you can implement these techniques in Python:
Z-Score Method:
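A runnable sketch on a small synthetic Series; scipy's zscore is one convenient way to compute the scores, assuming SciPy is installed:

```python
import pandas as pd
from scipy import stats

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 95, 12, 11, 13, 12, 10, 11, 12])
z = stats.zscore(s)
print(s[abs(z) > 3])   # values more than 3 standard deviations from the mean
```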
IQR Method:
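The same Series filtered with the IQR rule, reusing s from the Z-score example:

```python
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
print(s[(s < lower) | (s > upper)])
```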
By employing these statistical techniques, you can effectively identify outliers in your dataset, enhancing the integrity and quality of your data analysis.
Handling Outliers in Datasets
Introduction
Handling outliers is a crucial step in data cleaning that ensures the accuracy and reliability of machine learning models. Outliers can skew results and lead to incorrect conclusions. Below, I will outline practical implementations to handle outliers in datasets.
Steps to Handle Outliers
1. Removing Outliers
This method removes data points that lie beyond a certain range, identified through statistical or visualization techniques.
Implementation:
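One way to sketch this, using IQR bounds on a placeholder numeric column income of a DataFrame df:

```python
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

df = df[(df['income'] >= lower) & (df['income'] <= upper)]
```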
2. Replacing Outliers
Two common techniques for replacing outliers are:
- Mean/Median Imputation: Replace outliers with the mean or median value of the column.
- Capping: Set outliers to the maximum or minimum non-outlying value.
Implementation:
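Sketches of both techniques; they are alternatives to removal, applied to the original DataFrame and reusing the lower/upper bounds computed above:

```python
# Mean/median imputation: overwrite outliers with the column median
is_outlier = (df['income'] < lower) | (df['income'] > upper)
df.loc[is_outlier, 'income'] = df['income'].median()

# Capping: clip values to the boundary instead of replacing them
df['income'] = df['income'].clip(lower=lower, upper=upper)
```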
3. Transformations
Apply transformations to minimize the impact of outliers, such as:
- Log Transformation
- Square Root Transformation
Implementation:
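A sketch of both transformations; both assume the column is non-negative:

```python
import numpy as np

df['income_log'] = np.log1p(df['income'])    # log(1 + x) strongly compresses large values
df['income_sqrt'] = np.sqrt(df['income'])    # milder compression than the log
```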
Conclusion
Handling outliers appropriately is critical for maintaining the integrity of data analysis. The methods illustrated above—removing outliers, replacing them, and applying transformations—are practical techniques that can be readily applied to real-life datasets to manage outliers effectively.
Pandas for Data Cleaning
Managing Missing Values and Outliers with Pandas
In this section, we will walk through practical implementations of data cleaning techniques using Pandas to manage missing values and outliers in a dataset.
1. Import Libraries
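The imports used throughout this walkthrough might look like:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```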
2. Load Dataset
Assume we have a dataset named data.csv.
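Loading it into a DataFrame (the file name is a placeholder):

```python
df = pd.read_csv('data.csv')
print(df.head())
```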
3. Managing Missing Values
3.1 Identifying Missing Values
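For example:

```python
print(df.isnull().sum())          # missing values per column
print(df.isnull().any(axis=1))    # rows that contain at least one missing value
```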
3.2 Drop Rows with Missing Values
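A one-line sketch:

```python
df_dropped = df.dropna()
```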
3.3 Fill Missing Values
3.3.1 Using Mean/Median/Mode
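Assuming placeholder columns age (numeric) and gender (categorical):

```python
df['age'] = df['age'].fillna(df['age'].mean())              # or df['age'].median()
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
```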
3.3.2 Using Forward Fill and Backward Fill
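For example:

```python
df_ffill = df.ffill()   # forward fill
df_bfill = df.bfill()   # backward fill
```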
3.4 Interpolation
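For an ordered numeric column:

```python
df['age'] = df['age'].interpolate(method='linear')
```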
4. Managing Outliers
4.1 Identifying Outliers using Z-Score
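A sketch for the placeholder column age:

```python
z = (df['age'] - df['age'].mean()) / df['age'].std()
outliers_z = df[z.abs() > 3]
```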
4.2 Identifying Outliers using IQR
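And with the IQR rule:

```python
Q1, Q3 = df['age'].quantile(0.25), df['age'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers_iqr = df[(df['age'] < lower) | (df['age'] > upper)]
```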
4.3 Cap Outliers
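Capping replaces values beyond the bounds with the bounds themselves, reusing lower and upper from the IQR step:

```python
df['age'] = df['age'].clip(lower=lower, upper=upper)
```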
4.4 Transforming Outliers
Log Transformation
Square Root Transformation
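Both transformations in one sketch (assuming the column is non-negative):

```python
df['age_log'] = np.log1p(df['age'])
df['age_sqrt'] = np.sqrt(df['age'])
```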
These implementations demonstrate practical examples of using Pandas for data cleaning, focusing on managing missing values and outliers.
Advanced Data Cleaning Techniques with Scikit-learn
In this section, we will cover advanced data cleaning techniques using Scikit-learn. Specifically, we will focus on managing missing values and outliers. We assume that you have already covered basic data cleaning methods and are familiar with using Pandas for data manipulation as noted in previous sections of your project.
Imputation Techniques with Scikit-learn
SimpleImputer for Missing Values
The SimpleImputer provided by Scikit-learn is a versatile tool for handling missing values. It allows you to replace missing values with a constant value, the mean, the median, or the most frequent value of the column.
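A minimal sketch on a small numeric array; the strategy parameter can be 'mean', 'median', 'most_frequent', or 'constant':

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))   # each NaN replaced by its column mean
```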
KNNImputer for Missing Values
The KNNImputer uses k-Nearest Neighbors to impute the missing values. It’s useful in scenarios where the data is expected to exhibit certain patterns or clusters.
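A sketch reusing the array X from the SimpleImputer example; n_neighbors controls how many of the most similar rows are averaged for each missing entry:

```python
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2)
print(knn_imputer.fit_transform(X))
```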
Outlier Detection and Removal using Scikit-learn
IsolationForest for Outlier Detection
The IsolationForest is an effective method for outlier detection. It identifies anomalies by isolating observations in the dataset.
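A sketch on synthetic two-dimensional data; contamination is the fraction of points you assume to be outliers, a tuning choice rather than something the model infers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # dense cluster of inliers
               rng.uniform(-8, 8, size=(10, 2))])   # scattered points, mostly anomalies

iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)   # 1 = inlier, -1 = outlier

print('Outliers detected:', (labels == -1).sum())
```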
OneClassSVM for Outlier Detection
The OneClassSVM is another robust method for outlier detection. It works well with high-dimensional data.
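A sketch reusing the synthetic X from the IsolationForest example; scaling the features first generally matters for SVMs, and nu is roughly an upper bound on the fraction of points treated as outliers:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X_scaled = StandardScaler().fit_transform(X)
labels = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale').fit_predict(X_scaled)

print('Outliers detected:', (labels == -1).sum())
```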
These advanced data cleaning techniques using Scikit-learn will help you to handle missing values and outliers more effectively, ensuring a cleaner, more reliable dataset for your analyses and models.