Transforming Data for Analysis: A Practical Guide for Coders

Setting Up the Environment

1. Install Python

sudo apt-get update
sudo apt-get install python3.9
sudo apt-get install python3-pip

2. Create a Virtual Environment

python3 -m venv myprojectenv
source myprojectenv/bin/activate

3. Install Required Libraries

pip install numpy pandas matplotlib scikit-learn

4. Set Up Project Structure

mkdir -p myproject/{data,src,notebooks,models,reports}
touch myproject/src/__init__.py

5. Configure Version Control

cd myproject
git init
echo "myprojectenv/" > .gitignore

6. Initialize Jupyter Notebook

pip install jupyter
jupyter notebook

7. Create Initial Notebook

  • In Jupyter Notebook, create a new notebook and name it DataExploration.ipynb

8. Verify the Setup with Sample Code

  • Open DataExploration.ipynb and add the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Load sample data
data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Display first few rows
print(df.head())

# Plot sample data
df.plot(kind='scatter', x='sepal length (cm)', y='sepal width (cm)')
plt.show()

End of Environment Setup

Cleaning and Preprocessing Data

Step 1: Load Data

import pandas as pd

# Read the dataset into a DataFrame (replace 'path_to_dataset' with your file path)
data = pd.read_csv('path_to_dataset')

Step 2: Handle Missing Values

# Drop rows with any missing values
data = data.dropna()

# Alternatively, fill missing values in a column with a specific value (such as the mean or median)
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())

Step 3: Remove Duplicates

data = data.drop_duplicates()

Step 4: Convert Data Types

# Convert a column to integer type
data['column_name'] = data['column_name'].astype(int)

Step 5: Standardize Column Names

# Lowercase, trim, and replace spaces with underscores
data.columns = [col.strip().lower().replace(' ', '_') for col in data.columns]

Step 6: Encode Categorical Variables

# One-hot encoding
data = pd.get_dummies(data, columns=['categorical_column'])

# Label encoding (use instead of one-hot encoding, not after it)
# data['categorical_column'] = data['categorical_column'].astype('category').cat.codes

Step 7: Normalize/Scale Features

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max scaling (rescale features to the [0, 1] range)
data[['feature_column']] = MinMaxScaler().fit_transform(data[['feature_column']])

# Standardization (Z-score normalization); use instead of min-max scaling
# data[['feature_column']] = StandardScaler().fit_transform(data[['feature_column']])

Step 8: Split Data into Training and Testing Sets

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

Usage Example

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('path_to_dataset')
data = data.dropna()
data = data.drop_duplicates()
data['column_name'] = data['column_name'].astype(int)
data.columns = [col.strip().lower().replace(' ', '_') for col in data.columns]
data = pd.get_dummies(data, columns=['categorical_column'])
data[['feature_column']] = MinMaxScaler().fit_transform(data[['feature_column']])
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

Handling Missing Values

import pandas as pd

# Sample DataFrame with missing values
data = {
    'name': ['Alice', 'Bob', None, 'David'],
    'age': [25, None, 22, 23],
    'city': ['New York', 'Los Angeles', 'Chicago', None]
}
df = pd.DataFrame(data)

# 1. Remove rows with any missing values
df_dropped_any = df.dropna()

# 2. Remove rows with all missing values
df_dropped_all = df.dropna(how='all')

# 3. Fill missing values with a specified value
df_filled = df.fillna({'name': 'Unknown', 'age': 0, 'city': 'Unknown'})

# 4. Forward fill (fill missing values with the previous value in column)
df_ffill = df.ffill()

# 5. Backward fill (fill missing values with the next value in column)
df_bfill = df.bfill()

# 6. Interpolate missing numeric values (linear interpolation on the numeric 'age' column)
df_interpolated = df.copy()
df_interpolated['age'] = df_interpolated['age'].interpolate()

# Output the modified DataFrames
print("Original DataFrame:n", df)
print("nDrop any missing (rows):n", df_dropped_any)
print("nDrop all missing (rows):n", df_dropped_all)
print("nFill missing with specified values:n", df_filled)
print("nForward fill:n", df_ffill)
print("nBackward fill:n", df_bfill)
print("nInterpolate:n", df_interpolated)

This code covers multiple ways to handle missing values in a DataFrame. Select the approach that best fits the needs of your analysis.
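Before choosing a strategy, it helps to see how much data is actually missing. The short sketch below (reusing the df from the example above) reports the count and percentage of missing values per column so you can judge whether dropping or filling is the better fit.

# Inspect missingness per column to guide the choice of strategy
missing_summary = pd.DataFrame({
    'missing_count': df.isna().sum(),
    'missing_percent': (df.isna().mean() * 100).round(1)
})
print(missing_summary)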

Normalizing and Scaling Data

Import required libraries

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

Load the data

data = pd.read_csv("data.csv")

Initialize Scalers

standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()

Select columns to scale

columns_to_scale = ['column1', 'column2', 'column3']

Apply Standard Scaling

data[columns_to_scale] = standard_scaler.fit_transform(data[columns_to_scale])

Apply Min-Max Scaling (optional alternative)

# data[columns_to_scale] = minmax_scaler.fit_transform(data[columns_to_scale])

Save the transformed data

data.to_csv("normalized_data.csv", index=False)

Data Transformation Techniques

1. Importing Necessary Libraries

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

2. Categorical Encoding (One-Hot Encoding)

# Example DataFrame
data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'banana'],
    'count': [10, 20, 15, 10]
})

# One-Hot Encoding
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_features = one_hot_encoder.fit_transform(data[['fruit']])
encoded_df = pd.DataFrame(encoded_features, columns=one_hot_encoder.get_feature_names_out(['fruit']))
data = pd.concat([data, encoded_df], axis=1).drop('fruit', axis=1)
print(data)
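For quick exploration, pandas' get_dummies offers a lighter-weight alternative to OneHotEncoder; a brief sketch on the same example DataFrame:

# Same example DataFrame
data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'banana'],
    'count': [10, 20, 15, 10]
})

# One-hot encode the 'fruit' column directly in pandas
data = pd.get_dummies(data, columns=['fruit'])
print(data)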

3. Categorical Encoding (Label Encoding)

# Example DataFrame
data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'banana'],
    'count': [10, 20, 15, 10]
})

# Label Encoding
label_encoder = LabelEncoder()
data['fruit_encoded'] = label_encoder.fit_transform(data['fruit'])
print(data)

4. Logarithmic Transformation

# Example DataFrame
data = pd.DataFrame({
    'count': [10, 20, 15, 10]
})

# Log Transform
data['log_count'] = np.log1p(data['count'])  # log(1 + x), avoids log(0)
print(data)

5. Binning (Discretization)

# Example DataFrame
data = pd.DataFrame({
    'age': [25, 45, 65, 70, 25, 55]
})

# Binning
data['age_bin'] = pd.cut(data['age'], bins=[0, 30, 50, 100], labels=['Youth', 'Middle-aged', 'Senior'])
print(data)
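When roughly equal-sized groups matter more than fixed boundaries, pd.qcut bins by quantiles instead of explicit edges; a short sketch on the same ages (the bin labels here are illustrative):

# Quantile-based binning: three bins with approximately equal row counts
data['age_qbin'] = pd.qcut(data['age'], q=3, labels=['low', 'mid', 'high'])
print(data)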

6. Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Example DataFrame
data = pd.DataFrame({
    'feature': [1, 2, 3, 4, 5]
})

# Polynomial Features
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(data[['feature']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['feature']))
print(poly_df)

7. Aggregation

# Example DataFrame
data = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 40]
})

# Aggregation
aggregated_data = data.groupby('category').agg({'value': 'sum'}).reset_index()
print(aggregated_data)

8. Date Transformation

# Example DataFrame
data = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-02-15', '2021-03-10'])
})

# Extracting Date Features
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day
data['weekday'] = data['date'].dt.weekday
print(data)

9. Text Transformation (TF-IDF)

from sklearn.feature_extraction.text import TfidfVectorizer

# Example DataFrame
data = pd.DataFrame({
    'text': ['apple orange banana', 'banana apple apple', 'orange orange banana']
})

# TF-IDF Transformation
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(data['text'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
print(tfidf_df)

10. Feature Interaction

# Example DataFrame
data = pd.DataFrame({
    'feature_1': [1, 2, 3, 4],
    'feature_2': [10, 20, 30, 40]
})

# Feature Interaction
data['interaction'] = data['feature_1'] * data['feature_2']
print(data)

Combining and Aggregating Data

1. Combining Data

# Example datasets: data1 and data2
data1 = [
    {"id": 1, "name": "Alice", "department": "Engineering"},
    {"id": 2, "name": "Bob", "department": "HR"}
]

data2 = [
    {"id": 1, "salary": 100000},
    {"id": 2, "salary": 80000}
]

# Merging data1 and data2 on 'id'
combined_data = merge(data1, data2, key="id")

# Result:
# combined_data = [
#     {"id": 1, "name": "Alice", "department": "Engineering", "salary": 100000},
#     {"id": 2, "name": "Bob", "department": "HR", "salary": 80000}
# ]

2. Aggregating Data

# Sample combined data for aggregation
combined_data = [
    {"id": 1, "department": "Engineering", "salary": 100000},
    {"id": 2, "department": "HR", "salary": 80000},
    {"id": 3, "department": "Engineering", "salary": 120000},
    {"id": 4, "department": "HR", "salary": 95000}
]

# Aggregating to find the average salary per department
aggregated_data = aggregate(combined_data, group_by="department", metric="average", field="salary")

# Result:
# aggregated_data = [
#     {"department": "Engineering", "average_salary": 110000},
#     {"department": "HR", "average_salary": 87500}
# ]

Helper Function Implementations

def merge(data1, data2, key):
    # Create a dictionary for quick lookup
    lookup = {entry[key]: entry for entry in data2}
    
    # Combine entries
    result = []
    for entry in data1:
        if entry[key] in lookup:
            combined_entry = entry.copy()
            combined_entry.update(lookup[entry[key]])
            result.append(combined_entry)
    
    return result

def aggregate(data, group_by, metric, field):
    # Initialize storage for aggregated results
    aggregation = {}
    
    # Sum data based on group
    for entry in data:
        group_value = entry[group_by]
        if group_value not in aggregation:
            aggregation[group_value] = []
        aggregation[group_value].append(entry[field])
    
    # Calculate metric
    result = []
    for group, values in aggregation.items():
        if metric == "average":
            avg_value = sum(values) / len(values)
            result.append({group_by: group, f"{metric}_{field}": avg_value})
    
    return result

Usage

# Combining data1 and data2
combined_data = merge(data1, data2, key="id")

# Aggregating combined_data to find average salary by department
aggregated_data = aggregate(combined_data, group_by="department", metric="average", field="salary")

Implementing these helper functions makes combining and aggregating datasets straightforward. Adjust the sample data to fit your own datasets.
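For comparison, the same combine-and-aggregate flow can be expressed with pandas (a sketch, not part of the helper functions above), using the data1 and data2 samples from earlier:

import pandas as pd

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merge on 'id', then average salary per department
combined_df = df1.merge(df2, on='id')
aggregated_df = (combined_df.groupby('department', as_index=False)['salary']
                 .mean()
                 .rename(columns={'salary': 'average_salary'}))
print(aggregated_df)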
