Introduction to Geospatial Data
Overview
Geospatial data is information that describes objects, events, or phenomena associated with a location on the Earth's surface. It can be represented in various forms, including points, lines, and polygons, and is often used for mapping and spatial analysis. In this section, we will introduce basic concepts and set up a practical environment for analyzing geospatial datasets.
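To make these vector types concrete, here is a minimal sketch using shapely (the geometry library underlying Geopandas); the coordinates are invented purely for illustration:
from shapely.geometry import Point, LineString, Polygon
# A point: a single location (x = longitude, y = latitude)
store = Point(-73.98, 40.75)
# A line: an ordered sequence of locations, e.g., a delivery route
route = LineString([(-73.98, 40.75), (-73.96, 40.78), (-73.95, 40.80)])
# A polygon: a closed area, e.g., a sales territory
territory = Polygon([(-74.0, 40.7), (-73.9, 40.7), (-73.9, 40.8), (-74.0, 40.8)])
print(store.within(territory))  # True: the point lies inside the polygon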
Setting Up the Environment in Google Colab
In this first unit, we’ll focus on setting up a Python environment to handle geospatial data. We’ll utilize Google Colab for its simplicity and ease of use. Here’s how to get started:
Step 1: Install Geospatial Libraries
Some geospatial packages are not preinstalled in every Colab runtime, so install them with pip before importing them.
# Install geopandas, folium, and other essential libraries
!pip install geopandas folium
Step 2: Import Essential Libraries
Next, import the libraries commonly used for geospatial data analysis: Pandas, Geopandas, Matplotlib, and Folium.
# Import necessary libraries
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import folium
Step 3: Loading Geospatial Data
We’ll use Geopandas to read geospatial data. Geopandas extends Pandas to process geospatial data efficiently.
# Load a sample geospatial dataset
# Note: gpd.datasets was deprecated and removed in geopandas 1.0;
# this call works on geopandas < 1.0
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Display the first few rows of the dataset
world.head()
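If you are on geopandas 1.0 or newer, the bundled sample datasets are gone. One alternative is the separate geodatasets package (an assumption: it must be pip-installed first), which ships a comparable Natural Earth sample:
# Alternative for geopandas >= 1.0 (assumes: !pip install geodatasets)
import geodatasets
# 'naturalearth.land' is a Natural Earth land-polygons sample from geodatasets
land = gpd.read_file(geodatasets.get_path('naturalearth.land'))
land.head()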
Step 4: Visualizing Geospatial Data
We can use Matplotlib and Folium to create static and interactive visualizations, respectively.
Using Matplotlib for a Static Visualization
# Plotting the data using Matplotlib
world.plot()
plt.show()
Using Folium for an Interactive Map
# Creating an interactive Folium map
m = folium.Map(location=[10, 0], zoom_start=2)
# Adding geospatial data to the map
folium.GeoJson(world).add_to(m)
# Display the map
m
Step 5: Basic Geospatial Data Operations
Geopandas allows us to perform typical geospatial operations such as buffering, spatial joins, and basic transformations.
# Simple geospatial operations example
# Selecting countries in Africa (copy to avoid a SettingWithCopyWarning)
africa = world[world['continent'] == 'Africa'].copy()
# Buffering - creating a 1-degree buffer around each geometry
# (the data is in EPSG:4326, so the buffer distance is in degrees)
africa['geometry'] = africa['geometry'].buffer(1)
# Plotting the buffered geometries
africa.plot()
plt.show()
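The basic transformations mentioned above also include reprojecting between coordinate reference systems. A minimal sketch, reusing the world GeoDataFrame loaded earlier:
# Reproject from geographic coordinates (EPSG:4326, degrees) to Web Mercator
# (EPSG:3857, meters) so distance-based operations use metric units
world_mercator = world.to_crs(epsg=3857)
print(world.crs, "->", world_mercator.crs)
# A 100 km buffer is now meaningful, because coordinates are in meters
buffered = world_mercator.geometry.buffer(100_000)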
Conclusion
This setup introduces essential tools and libraries for geospatial data analysis in Python using Google Colab. With these basics, you are ready to explore geospatial datasets and derive insights that can aid strategic decisions for your consumer goods company.
Setting Up Your Google Colab Environment
To set up your Google Colab environment for analyzing geospatial datasets using Python, follow these practical steps:
1. Mount Google Drive
To store and access datasets from your Google Drive, use the following code snippet:
from google.colab import drive
drive.mount('/content/drive')
2. Install Required Libraries
For geospatial data analysis, you’ll need some specific libraries. Install them using the following commands:
!pip install geopandas folium rasterio shapely pyproj fiona
3. Import Libraries
After installing the necessary libraries, import them into your notebook:
import geopandas as gpd
import folium
import rasterio
from shapely.geometry import Point, Polygon
import pyproj
import fiona
4. Load Geospatial Data
Load your geospatial data into a GeoDataFrame. Here’s an example of loading a shapefile:
data_path = '/content/drive/My Drive/your-folder/your-shapefile.shp'
gdf = gpd.read_file(data_path)
5. Preview the Data
Display the first few rows of your geospatial dataset:
gdf.head()
6. Plotting Data
Visualize your geospatial data:
gdf.plot()
7. Folium Map
Create a map using Folium:
# Define a base map centered on a specific location
# (replace latitude and longitude below with real coordinate values)
m = folium.Map(location=[latitude, longitude], zoom_start=12)
# Add a GeoDataFrame layer to the map
folium.GeoJson(gdf).add_to(m)
# Display the map
m
By following these steps, you will have successfully prepared your Google Colab environment for analyzing geospatial datasets using Python.
Importing and Exploring Datasets
Importing Necessary Libraries
To start, we need to import essential libraries for data manipulation and geospatial analysis.
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
Loading the Dataset
Assume we have two datasets:
- Sales Data: sales_data.csv
- Geospatial Data: regions.geojson
Importing Sales Data
# Load sales data using pandas
sales_data = pd.read_csv('sales_data.csv')
# Display the first few rows of sales data
print(sales_data.head())
Importing Geospatial Data
# Load geospatial data using geopandas
regions = gpd.read_file('regions.geojson')
# Display the first few rows of geospatial data
print(regions.head())
Understanding Dataset Structures
Sales Data Structure
Examine basic structure and statistics of the sales data:
# Display basic information about sales data
print(sales_data.info())
# Summary statistics of numerical columns
print(sales_data.describe())
# Checking for missing values
print(sales_data.isnull().sum())
Geospatial Data Structure
Examine basic structure and spatial information of the geospatial data:
# Display basic information about geospatial data
print(regions.info())
# Display coordinate reference system (CRS)
print(regions.crs)
# Check for missing geometry entries
print(regions.geometry.isna().sum(), "missing geometries")
# Check that the remaining geometries are valid
print(regions.is_valid.sum(), "valid geometries out of", len(regions))
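If the check above reports invalid geometries, a common (if blunt) repair is the zero-width buffer trick. This is a sketch under the assumption that the invalid shapes are simple self-intersections:
# Repair invalid geometries with a zero-width buffer and drop any that
# remain invalid afterwards
regions['geometry'] = regions.geometry.buffer(0)
regions = regions[regions.is_valid]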
Exploring Data through Visualization
Visualize Sales Data
Create basic plots to understand the distribution of sales:
# Histogram of sales
sales_data['Sales'].hist(bins=30, edgecolor='black')
plt.title('Distribution of Sales')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.show()
# Scatter plot of sales against another variable - e.g., Marketing Spend
plt.scatter(sales_data['Marketing_Spend'], sales_data['Sales'])
plt.title('Sales vs Marketing Spend')
plt.xlabel('Marketing Spend')
plt.ylabel('Sales')
plt.show()
Visualize Geospatial Data
Plot the geospatial data to understand the regions:
# Basic plot of the regions
regions.plot()
plt.title('Geospatial Regions')
plt.show()
Joining and Merging Datasets
Often, geospatial analysis requires joining datasets based on common keys, such as region identifiers.
# Ensure the key column types match in both datasets
sales_data['Region_ID'] = sales_data['Region_ID'].astype(str)
regions['Region_ID'] = regions['Region_ID'].astype(str)
# Merge datasets on 'Region_ID'
merged_data = regions.merge(sales_data, on='Region_ID')
# Display the first few rows of the merged dataset
print(merged_data.head())
Final Checks
Ensure the merged dataset is ready for further analysis:
# Display basic information about the merged dataset
print(merged_data.info())
# Check for any new missing values
print(merged_data.isnull().sum())
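Note that the inner merge above silently drops regions that have no matching sales rows. A quick check using pandas' merge indicator (reusing regions and sales_data from above) makes such losses visible:
# Left-merge with an indicator column to spot regions lacking sales data
check = regions.merge(sales_data, on='Region_ID', how='left', indicator=True)
unmatched = check[check['_merge'] == 'left_only']
print(len(unmatched), "regions have no matching sales records")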
In the next steps, you can proceed with further analysis and visualizations to draw insights and inform strategic decisions.
Data Cleaning and Preprocessing
In this section, we’ll clean and preprocess the geospatial dataset to ensure it’s ready for analysis. We’ll address missing values, remove duplicates, and ensure consistent data formatting.
# Import necessary libraries
import pandas as pd
import geopandas as gpd
# Load the dataset (assuming it has been imported in prior sections)
# df = your_dataframe
# Step 1: Handling Missing Values
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)
# Drop rows with missing essential geospatial information
df = df.dropna(subset=['latitude', 'longitude'])
# Optionally, fill missing values in other columns with appropriate values
df['sales'] = df['sales'].fillna(0) # Example: Fill missing sales with 0
# Step 2: Remove Duplicates
# Check for duplicate rows
duplicated_rows = df.duplicated().sum()
print("Number of duplicated rows:", duplicated_rows)
# Remove duplicated rows
df = df.drop_duplicates()
# Step 3: Ensure Consistent Data Formatting
# Convert columns to appropriate data types
df['date'] = pd.to_datetime(df['date'])
df['sales'] = df['sales'].astype(float)
df['category'] = df['category'].astype(str)
# Step 4: Geospatial Data Validation
# Check for valid latitude and longitude values
valid_geo_mask = (df['latitude'].between(-90, 90)) & (df['longitude'].between(-180, 180))
df = df[valid_geo_mask]
# Step 5: Creating Geospatial DataFrame
# Converting DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude))
# Set the coordinate reference system (CRS); the {'init': 'epsg:4326'} syntax
# is deprecated in favor of set_crs
gdf = gdf.set_crs(epsg=4326)
# Display the cleaned and preprocessed geospatial DataFrame
print(gdf.head())
# (Optional) Save the cleaned dataset for future use
gdf.to_file("cleaned_geospatial_dataset.geojson", driver='GeoJSON')
This code snippet provides a practical implementation of the data cleaning and preprocessing steps in Python:
- Handling Missing Values: Drop rows with missing geospatial data and fill other missing values as needed.
- Remove Duplicates: Identify and remove duplicate rows.
- Ensure Consistent Data Formatting: Convert columns to the correct data types.
- Geospatial Data Validation: Validate latitude and longitude values to ensure they fall within acceptable ranges.
- Creating Geospatial DataFrame: Convert the DataFrame to a GeoDataFrame and set the CRS.
By following these steps, you will have a cleaned and preprocessed geospatial dataset ready for analysis.
Basic Geospatial Data Visualization
In this section, we will implement basic geospatial data visualizations. We'll use geopandas and matplotlib to create visualizations that will help in making strategic decisions for the consumer goods company.
Import Required Libraries
Ensure you have the necessary libraries. You should have already imported pandas and other essential libraries in previous steps.
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
Load Geospatial Data
Load the geospatial data that you have already preprocessed in the earlier section.
gdf = gpd.read_file('path/to/your/cleaned_geospatial_data.geojson')
Plot Basic Maps
Plotting a Simple Map
First, we’ll plot a simple map to visualize the geospatial data.
gdf.plot()
plt.title('Basic Geospatial Data Plot')
plt.show()
Plotting with Additional Features
You can add more context to your map by plotting additional features like points of interest, regions, etc.
fig, ax = plt.subplots(figsize=(10, 10))
gdf.boundary.plot(ax=ax, linewidth=1)
gdf.plot(ax=ax, color='blue', alpha=0.5)
plt.title('Enhanced Geospatial Data Plot with Boundaries')
plt.show()
Plotting Specific Columns
If your geospatial data contains specific columns, you can plot them explicitly to analyze different attributes.
gdf.plot(column='your_attribute_column', legend=True)
plt.title('Geospatial Data by Specific Attribute')
plt.show()
Overlay Plot with Additional Data
For strategic decision making, overlay your geospatial data with additional datasets like population density, sales regions, etc.
# Assuming additional_geo_data is another GeoDataFrame loaded earlier
additional_geo_data = gpd.read_file('path/to/additional_data.shp')
fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, color='blue', alpha=0.5, edgecolor='k')
additional_geo_data.plot(ax=ax, color='red', alpha=0.5, edgecolor='k')
plt.title('Overlay Plot with Additional Data')
plt.show()
Save Your Plots
You can save your visualizations for reporting and sharing purposes.
fig.savefig('enhanced_geospatial_plot.png', dpi=300)
By following these steps, you will be able to create visualizations that facilitate strategic decision-making for your consumer goods company.
Part 6: Advanced Geospatial Data Visualization
6.1 Import Necessary Libraries
We need advanced libraries to create compelling and meaningful visualizations.
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from shapely.geometry import Point, Polygon
import contextily as ctx
import folium
6.2 Load and Prepare Data
Loading a sample GeoJSON file for the sake of illustration.
# Load a GeoJSON file
data = gpd.read_file('path_to_geospatial_data.geojson')
# Ensure the GeoDataFrame is in the correct CRS (Coordinate Reference System)
data = data.to_crs(epsg=3857)
6.3 Enhanced Choropleth Map
Create a choropleth visualization to show, for example, population density.
# Define a column to visualize, e.g., 'population_density'
fig, ax = plt.subplots(1, 1, figsize=(15, 10))
data.plot(column='population_density', ax=ax, legend=True,
          legend_kwds={'label': "Population Density",
                       'orientation': "horizontal"},
          cmap='OrRd', edgecolor='black')
# Add a basemap (Stamen tiles now require a Stadia Maps API key, so the freely
# available CartoDB Positron style is used here instead)
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)
plt.title("Advanced Choropleth Map of Population Density")
plt.show()
6.4 Interactive Map using Folium
Use Folium to create an interactive map.
# Folium expects WGS84 (latitude/longitude) coordinates, so compute the map
# center from centroids projected back from Web Mercator
center = data.geometry.centroid.to_crs(epsg=4326)
m = folium.Map(location=[center.y.mean(), center.x.mean()], zoom_start=10)
# Convert the layer itself back to WGS84 before handing it to Folium
data_wgs84 = data.to_crs(epsg=4326)
# Add choropleth layer
folium.Choropleth(
    geo_data=data_wgs84,
    name='choropleth',
    data=data_wgs84,
    columns=['geo_id', 'population_density'],
    key_on='feature.properties.geo_id',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Population Density'
).add_to(m)
# Tooltips must be attached to a GeoJson layer, not added to the map directly
folium.GeoJson(
    data_wgs84,
    tooltip=folium.GeoJsonTooltip(fields=['name', 'population_density'])
).add_to(m)
# Display map
m
6.5 Heat Map
Visualizing density using a heat map.
import folium.plugins as plugins
# Use WGS84 centroids so the heat map gets (latitude, longitude) pairs
pts = data.geometry.centroid.to_crs(epsg=4326)
coords = list(zip(pts.y, pts.x))
# Create HeatMap
m = folium.Map(location=[pts.y.mean(), pts.x.mean()], zoom_start=10)
heatmap = plugins.HeatMap(coords)
m.add_child(heatmap)
# Display map
m
6.6 Save Output
If you need to save the folium map to an HTML file:
m.save('advanced_geospatial_visualization.html')
This implementation provides advanced visualization techniques, enabling a deeper analysis of geospatial data for strategic decision-making.
Geospatial Data Analysis Techniques
Data Loading and Initialization
Import the necessary libraries and the geospatial dataset. Ensure the dataset includes latitude, longitude, and the variables pertinent to your analysis.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
# Load CSV dataset with geospatial information
data = pd.read_csv('path_to_your_dataset.csv')
# Convert DataFrame to GeoDataFrame
geometry = [Point(xy) for xy in zip(data['longitude'], data['latitude'])]
geo_data = gpd.GeoDataFrame(data, geometry=geometry)
# Set proper coordinate reference system (CRS)
geo_data = geo_data.set_crs("EPSG:4326")
Spatial Joins
Use spatial joins to combine data based on geographic relationships.
# Load additional geospatial data, e.g., regions or administrative boundaries
regions = gpd.read_file('path_to_regions_shapefile.shp')
# Perform spatial join ('op' was renamed to 'predicate' in geopandas 0.10)
geo_data_with_regions = gpd.sjoin(geo_data, regions, how="left", predicate='intersects')
Buffer Analysis
Create buffer zones around certain points and analyze data falling within these zones.
# Buffer distances are expressed in the CRS's units, so reproject from
# degrees (EPSG:4326) to meters (EPSG:3857) before buffering
geo_data_m = geo_data.to_crs(epsg=3857)
# Create 500 m buffers around the points as a separate GeoDataFrame
buffers = geo_data_m.copy()
buffers['geometry'] = buffers.geometry.buffer(500)
# Spatial join to identify entries within the buffers
buffer_analysis = gpd.sjoin(geo_data_m, buffers[['geometry']], how='inner', predicate='intersects')
Distance Calculation
Calculate distances between points and another feature.
from shapely.geometry import Point
# Define a central point (latitude, longitude)
central_point = Point(-73.935242, 40.730610)
# Calculate distance from the central point (in degrees, since the data is in
# EPSG:4326; see the geodesic sketch below for distances in meters)
geo_data['distance_to_center'] = geo_data.geometry.distance(central_point)
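For ground distances in meters without reprojecting, pyproj's geodesic calculator can be used instead. A minimal sketch, reusing geo_data and central_point from above:
from pyproj import Geod
# WGS84 ellipsoid for geodesic distance calculations
geod = Geod(ellps='WGS84')
# Geod.inv takes (lon1, lat1, lon2, lat2) and returns forward azimuth,
# back azimuth, and distance in meters; we keep only the distance
geo_data['distance_to_center_m'] = geo_data.geometry.apply(
    lambda p: geod.inv(central_point.x, central_point.y, p.x, p.y)[2]
)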
Cluster Analysis
Identify clusters in data spatially using tools like KMeans.
from sklearn.cluster import KMeans
# Extract coordinates
# (note: Euclidean distance on raw lat/lon is a rough approximation that is
# acceptable for small areas; project to meters for rigorous work)
coords = geo_data[['latitude', 'longitude']]
# Fit KMeans with desired number of clusters, e.g., 5
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10).fit(coords)
geo_data['cluster'] = kmeans.labels_
Area and Perimeter Calculation
Calculate area and perimeter if the geometries represent polygons.
# Ensure geometries are polygons (copy to avoid a SettingWithCopyWarning)
polygons = geo_data[geo_data.geometry.type == 'Polygon'].copy()
# Reproject to an equal-area CRS so results are in meters rather than degrees
polygons = polygons.to_crs(epsg=6933)
# Calculate area (square meters) and perimeter (meters)
polygons['area'] = polygons.geometry.area
polygons['perimeter'] = polygons.geometry.length
Heatmaps
Generate heatmaps to visualize density of points.
import folium
from folium.plugins import HeatMap
# Create a folium map centered around initial point
m = folium.Map(location=[40.730610, -73.935242], zoom_start=12)
# Add heatmap using (latitude, longitude) pairs from the point geometries
heat_data = list(zip(geo_data.geometry.y, geo_data.geometry.x))
HeatMap(heat_data).add_to(m)
# Display map
m
Summary Statistics
Calculate summary statistics to understand distribution patterns.
summary_stats = geo_data.describe()
print(summary_stats)
Now you can execute these sections in your Google Colab notebook to implement thorough geospatial data analysis for your project. Make sure you adjust paths and parameters according to your actual dataset and project requirements.
Clustering and Segmentation Analysis
This section focuses on performing clustering and segmentation analysis on geospatial datasets for your consumer goods company, using Python in a Google Colab notebook.
Load Required Libraries
import pandas as pd
import numpy as np
import geopandas as gpd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
Load and Prepare the Data
We’ll assume that the geospatial data has already been cleaned and preprocessed (as done in your previous steps).
# Load the geospatial datasets
data = gpd.read_file('path_to_your_cleaned_geospatial_data.geojson')
# Extract relevant features for clustering
features = data[['feature1', 'feature2', 'latitude', 'longitude']]
Standardize the Data
Standardize the features before performing clustering to ensure comparability.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
Apply K-Means Clustering
We'll use K-Means for clustering. Choose an appropriate number of clusters (n_clusters) based on your business requirements or by using the elbow method, sketched below.
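A minimal elbow-method sketch, assuming the scaled_features array from the previous step: fit K-Means for a range of cluster counts and look for the point where inertia stops dropping sharply.
# Compute K-Means inertia for k = 1..10; the "elbow" in the curve suggests
# a reasonable number of clusters
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(scaled_features)
    inertias.append(km.inertia_)
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()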
# Example: Using 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42)
data['cluster'] = kmeans.fit_predict(scaled_features)
Visualize Clusters on a Geospatial Plot
# Plot the geospatial data with clusters; categorical coloring avoids the
# deprecated plt.cm.get_cmap call
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
data.plot(column='cluster', categorical=True, cmap='viridis', legend=True, ax=ax)
plt.title('Geospatial Clusters')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Analyze Each Cluster
Provide descriptive statistics or other analyses for each cluster to inform strategic decisions.
# Displaying the size of each cluster
cluster_counts = data['cluster'].value_counts()
print(cluster_counts)
# Descriptive statistics for each cluster (numeric columns only; the
# geometry column cannot be averaged)
cluster_analysis = data.drop(columns='geometry').groupby('cluster').mean(numeric_only=True)
print(cluster_analysis)
Save the Output
Export the clustered geospatial data for further analysis or reporting.
data.to_file('path_to_save_clustered_data.geojson', driver='GeoJSON')
This code provides a practical implementation for clustering and segmentation analysis on geospatial datasets using K-Means. Adapt the number of clusters and features based on your specific dataset and requirements.
Time Series Analysis on Geospatial Data
In this section, we will focus on performing time series analysis on geospatial data to uncover trends and patterns over time. The analysis will include loading the dataset, processing the time series data, and visualizing the results.
Loading the Data
Assuming that we have a dataset containing geospatial data with timestamps, let's load it into our Google Colab environment.
import pandas as pd
import geopandas as gpd
# Load the dataset
df = pd.read_csv('your_geospatial_data.csv')
# Convert to GeoDataFrame if not already
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude))
# Ensure the time column is in datetime format
gdf['timestamp'] = pd.to_datetime(gdf['timestamp'])
Preprocessing the Time Series Data
We’ll group the data by a specific geospatial attribute (e.g., region or location) and then resample it to a particular time frequency (e.g., daily, monthly).
# Set timestamp as the index
gdf.set_index('timestamp', inplace=True)
# Group by a specific geospatial attribute (e.g., 'region')
grouped = gdf.groupby('region')
# Resample the value column to a monthly frequency, calculating the mean for
# each group (select the numeric column so the geometry column is not averaged)
resampled = grouped[['value_column']].resample('M').mean()
Visualizing Time Series Data
We’ll visualize the time series data to observe trends and patterns. Let’s plot the data for a specific region.
import matplotlib.pyplot as plt
# Choose a region to plot
region_to_plot = 'Region_A'
# Extract the time series data for the chosen region
region_data = resampled.loc[region_to_plot]
# Plot the time series data
plt.figure(figsize=(10, 6))
plt.plot(region_data.index, region_data['value_column'], marker='o')
plt.title(f'Time Series Analysis of {region_to_plot}')
plt.xlabel('Time')
plt.ylabel('Value')
plt.grid(True)
plt.show()
Decomposing the Time Series
We can decompose the time series data to identify the trend, seasonality, and residual components.
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform seasonal decomposition
decomposition = seasonal_decompose(region_data['value_column'], model='additive')
# Plot the decomposition results
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(15, 12))
decomposition.observed.plot(ax=ax1, title='Observed')
decomposition.trend.plot(ax=ax2, title='Trend')
decomposition.seasonal.plot(ax=ax3, title='Seasonal')
decomposition.resid.plot(ax=ax4, title='Residual')
plt.tight_layout()
plt.show()
Forecasting with ARIMA
Finally, we can use ARIMA to forecast future values.
from statsmodels.tsa.arima.model import ARIMA
# Fit an ARIMA model (the legacy statsmodels.tsa.arima_model API was removed;
# use statsmodels.tsa.arima.model instead)
model = ARIMA(region_data['value_column'], order=(5, 1, 0))  # Order parameters can be tuned
fit = model.fit()
# Forecast the next 12 periods, with confidence intervals
forecast_result = fit.get_forecast(steps=12)
forecast = forecast_result.predicted_mean
conf_int = forecast_result.conf_int()
# Plot the forecast
future_index = pd.date_range(region_data.index[-1], periods=12, freq='M')
plt.figure(figsize=(10, 6))
plt.plot(region_data.index, region_data['value_column'], label='Historical')
plt.plot(future_index, forecast, label='Forecast', color='red')
plt.fill_between(future_index, conf_int.iloc[:, 0], conf_int.iloc[:, 1], color='pink', alpha=0.3)
plt.legend()
plt.show()
With these steps, you should be able to perform a comprehensive time series analysis on your geospatial dataset to uncover temporal trends and patterns as well as forecast future values. This can provide valuable insights for strategic decision-making in your consumer goods company.
Predictive Modeling with Geospatial Data
Objective
To create a predictive model using geospatial data to inform strategic decisions for a consumer goods company.
Implementation
Step 1: Import Necessary Libraries
import pandas as pd
import geopandas as gpd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
Step 2: Load and Prepare the Data
Assuming the data has already been cleaned and preprocessed, loaded into a GeoDataFrame gdf
.
# Example GeoDataFrame
# gdf = gpd.read_file('path_to_geospatial_data.geojson')
# Extract features and target variable
features = gdf.drop(columns=['target_variable', 'geometry']) # Replace 'target_variable' with actual target column name
target = gdf['target_variable'] # Replace with actual target column name
Step 3: Split the Data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
Step 4: Train the Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Step 5: Evaluate the Model
# Predict on test data
y_pred = model.predict(X_test)
# Calculate Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)
Step 6: Visualize Predictions
# Add predictions back to a copy of the test rows (avoids mutating gdf)
gdf_test = gdf.loc[X_test.index].copy()
gdf_test['prediction'] = y_pred
# Plot actual vs predicted
fig, ax = plt.subplots(1, 2, figsize=(14, 7))
gdf_test.plot(column='target_variable', ax=ax[0], legend=True, cmap='viridis')
ax[0].set_title('Actual Values')
gdf_test.plot(column='prediction', ax=ax[1], legend=True, cmap='viridis')
ax[1].set_title('Predicted Values')
plt.show()
Step 7: Feature Importance
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = features.columns
# Visualize feature importance
plt.figure(figsize=(10, 8))
plt.title("Feature Importances")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train.shape[1]), feature_names[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.show()
Conclusion
This concludes the implementation of predictive modeling with geospatial data in Python using a Google Colab notebook. A RandomForestRegressor is used to predict the target variable, its performance is evaluated with Mean Absolute Error, and feature importances are visualized to understand each feature's impact on the prediction.
Consumer Behavior Analysis: Geospatial Impact
Step 1: Import Necessary Libraries
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point
from sklearn.cluster import KMeans
import folium
Step 2: Load Geospatial and Consumer Data
# Load geospatial data (example: shapefile of regions)
gdf_regions = gpd.read_file('path_to_shapefile.shp')
# Load consumer data (example: CSV file with longitude, latitude, and other features)
df_consumers = pd.read_csv('path_to_consumer_data.csv')
Step 3: Convert Consumer Data to GeoDataFrame
# Ensure the consumer data has 'longitude' and 'latitude' columns
geometry = [Point(xy) for xy in zip(df_consumers.longitude, df_consumers.latitude)]
gdf_consumers = gpd.GeoDataFrame(df_consumers, geometry=geometry)
# Set the Coordinate Reference System (CRS) if necessary
gdf_consumers.set_crs(epsg=4326, inplace=True)
Step 4: Plotting Consumers on the Map
# Plot regions and consumers
fig, ax = plt.subplots(figsize=(15, 15))
gdf_regions.plot(ax=ax, color='lightgray')
gdf_consumers.plot(ax=ax, color='red', markersize=5)
plt.title("Consumer Locations on Map")
plt.show()
Step 5: Clustering Consumers
# Extract coordinates for clustering
coords = df_consumers[['longitude', 'latitude']]
# Apply KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10) # Example: 5 clusters; fixed seed for reproducibility
df_consumers['cluster'] = kmeans.fit_predict(coords)
# Plot clusters on the map
gdf_consumers['cluster'] = df_consumers['cluster']
colors = ['red', 'blue', 'green', 'purple', 'orange']
fig, ax = plt.subplots(figsize=(15, 15))
gdf_regions.plot(ax=ax, color='lightgray')
for idx, color in enumerate(colors):
    gdf_consumers[gdf_consumers['cluster'] == idx].plot(ax=ax, color=color, markersize=5, label=f'Cluster {idx}')
plt.legend()
plt.title("Consumer Clusters on Map")
plt.show()
Step 6: Heatmap of Consumer Density
from folium.plugins import HeatMap
# Create a base map centered on the mean consumer location
m = folium.Map(location=[df_consumers.latitude.mean(), df_consumers.longitude.mean()], zoom_start=10)
# Add a heat layer built from the consumer coordinates
heat_data = df_consumers[['latitude', 'longitude']].values.tolist()
HeatMap(heat_data).add_to(m)
# Save the map
m.save('consumer_density_map.html')
Step 7: Accessing Cluster Insights
# Display the centers of the clusters
cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=['longitude', 'latitude'])
print("Cluster Centers:\n", cluster_centers)
# Display the size of each cluster
cluster_sizes = df_consumers['cluster'].value_counts().reset_index()
cluster_sizes.columns = ['cluster', 'size']
print("Cluster Sizes:\n", cluster_sizes)
Conclusion
You have successfully implemented a complete analysis of consumer behavior using geospatial data. The analysis includes plotting consumer locations, clustering them, and visualizing density using a heatmap. The insights gained from this implementation can inform strategic decision-making.
Summarizing Findings and Generating Reports
Step-by-Step Implementation
1. Import Required Libraries
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from jinja2 import Template
import pdfkit
2. Load and Summarize Data
Assuming we have a GeoDataFrame gdf already processed in previous steps:
# Assuming 'gdf' is your GeoDataFrame
summary_stats = gdf.describe()
# Save summary statistics as a DataFrame
summary_df = pd.DataFrame(summary_stats)
3. Generate Summary Visualizations
Create plots for the generated summary:
# Example: Distribution of a numeric column
plt.figure(figsize=(10, 6))
gdf['numeric_column'].hist(bins=30)
plt.title('Distribution of Numeric Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.savefig('/content/distribution_numeric_column.png')
plt.close()
4. Jinja2 Template for HTML Report
Create an HTML template for the report using Jinja2:
html_template = """
<!DOCTYPE html>
<html>
<head>
<title>Geospatial Data Analysis Report</title>
</head>
<body>
<h1>Geospatial Data Analysis Report</h1>
<h2>Summary Statistics</h2>
<table border="1">
<thead>
<tr>
<th>Statistic</th>
{% for col in summary_df.columns %}
<th>{{ col }}</th>
{% endfor %}
</tr>
</thead>
<tbody>
{% for row in summary_df.iterrows() %}
<tr>
<td>{{ row[0] }}</td>
{% for value in row[1] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
{% endfor %}
</tbody>
</table>
<br>
<h2>Distribution of Numeric Column</h2>
<img src="distribution_numeric_column.png" alt="Distribution of Numeric Column">
</body>
</html>
"""
# Render the template with summary data
template = Template(html_template)
rendered_html = template.render(summary_df=summary_df)
5. Generate PDF Report
Generate a PDF report from the rendered HTML:
# Save the rendered HTML to a file
with open('/content/report.html', 'w') as file:
    file.write(rendered_html)
# Convert the HTML file to a PDF (newer wkhtmltopdf builds require local file
# access to be enabled so the embedded image can be loaded)
pdfkit.from_file('/content/report.html', '/content/geospatial_data_analysis_report.pdf',
                 options={'enable-local-file-access': None})
6. Display and Download Report in Google Colab
Make the report available for download:
from google.colab import files
# Generate the reports as before
files.download('/content/geospatial_data_analysis_report.pdf')
Conclusion
Together, these steps provide a comprehensive, automated way to summarize geospatial data findings and generate a report. The implementation can be run directly in a Google Colab environment and produces a structured PDF document containing statistical summaries and visualizations of your geospatial dataset.