Python Correlation Guide In Creating Visuals - Enterprise DNA

In today’s blog, we will walk through the process of visualizing correlations in Python and importing these visuals into Power BI. You can watch the full video of this tutorial at the bottom of this blog.

Understanding Correlations

Here’s a nice image showing the different types of correlations. 

[Image: types of correlation, from perfect positive (1) through none (0) to perfect negative (-1)]

Starting from the left, we have a perfect positive correlation, which has a correlation value of 1. It is followed by progressively weaker positive correlations, with values decreasing toward 0.

The middle graph shows no correlation, which corresponds to a correlation value of 0.

Finally, the right-hand side shows negative correlations of increasing strength, with values moving from 0 toward -1. The rightmost graph is a perfect negative correlation, which has a correlation value of -1.

Packages for Python Correlation

We will be using four packages for this tutorial. The first is Pandas, which we will use for data manipulation and import under the alias pd.

For visualization, we will use Matplotlib (its pyplot module), imported as plt for easier access to its functions. Seaborn, our statistical visualization library, will be imported as sns. Lastly, Numpy, imported as np, will be used for linear algebra.

For the data, we will use one of the sample datasets that come with Seaborn. Using the sns alias, we load the diamonds dataset with sns.load_dataset('diamonds') and save it as dataset.

Attributes of the Data

We can view the attributes of our data using the dataset.info() function. Its output lists every column together with its data type, which appears in the last column.

Note that correlation only works on numerical variables, so we will be looking at the numerical variables most of the time. However, we will also learn how to use some of the categorical variables for visualization.

The Python Correlation Dataset

By calling the head function, written as dataset.head(), we can get the first five rows of our data.

We have carat in the first column, followed by the categorical variables cut, color, and clarity, and then numerical values for the rest of the data. 
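
As a quick sketch of these inspection calls, the snippet below builds a small synthetic stand-in for the diamonds data and then calls info() and head(). The stand-in is an assumption made so the snippet runs without a download; its values are invented and it has fewer columns than the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "cut": rng.choice(["Fair", "Good", "Ideal"], 500),
    "depth": rng.normal(61.7, 1.4, 500),
    "table": rng.normal(57.5, 2.2, 500),
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

dataset.info()               # column names, non-null counts, and dtypes
first_rows = dataset.head()  # first five rows by default
print(first_rows)

# Correlation needs numbers, so it helps to list the numeric columns.
numeric_columns = dataset.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

With the real data, the same calls would also list the color and clarity columns.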

Python Correlation: Creating A Scatter Plot

When visualizing correlations between two variables, we usually look at scatter plots.

Using the Seaborn library, we create our scatter plot with the scatterplot function, passing in the data we saved above as data=dataset. Then we identify the X and Y variables, carat and price respectively, as x='carat' and y='price'.
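
A minimal sketch of that call is shown here. It assumes a synthetic stand-in for the diamonds data (invented values) so that it runs offline; with the real dataset, only the line that builds dataset would change.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs as a plain script; not needed in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

# One point per row: carat on the X axis, price on the Y axis.
ax = sns.scatterplot(data=dataset, x="carat", y="price")
```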

Here’s our scatter plot made with the Seaborn library. 

You can see that this scatter plot is quite dense. That’s because we have about 54,000 rows of data and the points are not necessarily represented in the best way.

We can press the Shift + Tab keys to see the different ways to style the scatter plot. It will show us a list of different parameters that we can add to our scatter plot. 

Scrolling further down will give us information on what each one of the listed parameters does. 

Additional Scatter Plot Parameters

Let’s dive in a little bit. We can set linewidth=0 because the white outlines around the points in our first scatter plot somewhat obscure things.

We also want to adjust alpha so we can control the opacity. Let’s use alpha=0.2 for our example, but you could of course change that to 0.1 as well.

If we add these parameters and click Run, you can see that our scatter plot becomes more transparent and loses the white lines.

You can play around with the parameters to get the best visual you are looking for. 
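
The two styling parameters can be sketched as follows, again on a synthetic stand-in for the diamonds data (an assumption, with invented values) so the snippet is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs as a plain script; not needed in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

ax = sns.scatterplot(
    data=dataset, x="carat", y="price",
    linewidth=0,  # drop the white outline around each point
    alpha=0.2,    # mostly transparent, so dense regions look darker
)
```

Lower alpha values make overlapping points easier to read, which matters more as the number of rows grows.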

Using the Categorical Variables

We can also use some of our categorical variables to improve our visuals. For example, we know that our data records the cut of each diamond.

What we can do is pass in that cut category using the hue parameter, written as hue='cut'. This colors the points according to their cut.

Of course, we can add more parameters like the alpha, for example. We can add that again, set to 0.2, and see how that changes the visual. Let’s click Run and you can see a little bit of a difference. 

We can play around with the parameters to get the visual that we are looking for. We can also use different categories such as clarity, which gives us the clarity categories and also gives us a slightly different view of that scatter.
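
Here is a small sketch of coloring by a categorical column. The data is a synthetic stand-in (an assumption, with invented values and only three cut levels); with the real dataset you could swap 'cut' for 'clarify'-style experiments such as hue='clarity'.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs as a plain script; not needed in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "cut": rng.choice(["Fair", "Good", "Ideal"], 500),
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

# hue assigns one color per category and adds a legend for it.
ax = sns.scatterplot(data=dataset, x="carat", y="price", hue="cut", alpha=0.2)
```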

Correlation With Other Variables

You may also be interested in how variables other than price and carat are correlated. If we look at a scatter plot of table against depth, two of the numerical dimensions of the diamond, we can see there is no clear one-to-one linear relationship.

We can also look at two other variables such as depth and price. Based on the graph, we can see that the data centers around the middle area.

Python Correlation: Creating A Regression Plot

Let’s advance to what we call a regression plot that allows us to evaluate the linear relationship between two variables. 

So instead of the scatterplot function, we will use the regplot function this time. We will pass in the same structure: our data, then the X and Y variables.

The result shows a line that measures the linear relationship between the variables. It is also evident how our values cluster around that regression line.

This is not a very beautiful visual at the moment, but we can still optimize it to get a better one. For example, we can pass in a style using the Matplotlib alias. We can change the style to a dark background using the code plt.style.use('dark_background').

Take that same regression plot and pass in some keywords for our scatter and line. Let’s use the color red and a line width of 1 for our regression line. This is written as line_kws={'color': 'red', 'linewidth': 1}.

For our scatter keywords, let’s set the color to white, the edge color to grey, and the opacity to 0.4, written as scatter_kws={'color': 'white', 'edgecolor': 'grey', 'alpha': 0.4}.
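
Putting the style and the two keyword dictionaries together might look like the sketch below. The data is a synthetic stand-in for the diamonds dataset (an assumption, with invented values), so the fitted line will differ from the one in the tutorial.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs as a plain script; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

plt.style.use("dark_background")  # applies to figures created after this call

ax = sns.regplot(
    data=dataset, x="carat", y="price",
    line_kws={"color": "red", "linewidth": 1},  # regression line
    scatter_kws={"color": "white", "edgecolor": "grey", "alpha": 0.4},  # points
)
```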

These parameters give us a little bit of a different view shown below. 

Python Correlation: Creating A Correlation Matrix

So far, what we’ve been looking at are scatter plots with just two variables, but we may also want to look at all of our variable correlations.

This is done by calling a data frame function on our dataset, written as dataset.corr(). What we get is a matrix showing the correlation between each pair of variables. In recent pandas versions, you may need to write dataset.corr(numeric_only=True) so that the categorical columns are skipped.

The numbers in the resulting table represent the Pearson correlation, which focuses on the linear relationship between each pair of variables.

But if we are not sure that our variables are linearly related, we can use a different type of correlation that is based on the rank order of the values rather than on linearity. It is called the Spearman correlation.

And we can see information on all of these things by pressing Shift + Tab. If you scroll down, we can see the Spearman rank correlation, Pearson correlation coefficient, and quite a lot of different ways to measure our data. 

Looking back at our correlation matrix, we can see that price and carat are strongly correlated.

Their Pearson correlation is 0.92, which matches the fairly linear relationship we saw in our plots.

If we use the Spearman correlation instead, the rank-based value is a little higher at 0.96.

These different types of correlations allow us to pick up different attributes of correlation between those variables. 
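
To make the comparison concrete, here is a minimal sketch that computes both matrices with pandas. It assumes a synthetic stand-in for the diamonds data (invented values), so the numbers it prints will not match the 0.92 and 0.96 quoted above. Selecting the numeric columns first is one way to keep the call working across pandas versions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "cut": rng.choice(["Fair", "Good", "Ideal"], 500),
    "depth": rng.normal(61.7, 1.4, 500),
    "table": rng.normal(57.5, 2.2, 500),
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

numeric = dataset.select_dtypes(include="number")  # drop categorical columns

pearson = numeric.corr()                    # default method: linear relationship
spearman = numeric.corr(method="spearman")  # rank-based: monotonic relationship

print(pearson.round(2))
print(spearman.round(2))
```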

Multiple x Single Variable Correlation

Sometimes, we don’t want to see a matrix because we are more concerned about the correlation of all the variables with one variable alone (e.g., price).

What we can do then is isolate price by following dataset.corr() with ['price'], written as dataset.corr()['price'].

Now we can see how price is correlated with each of our numerical variables in this table. The reason we may want to do this is for visual plots.
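
As a short sketch of isolating one column of the matrix, assuming a synthetic stand-in for the diamonds data (invented values):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "depth": rng.normal(61.7, 1.4, 500),
    "table": rng.normal(57.5, 2.2, 500),
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

# Indexing the matrix with a column name returns that column as a Series.
price_corr = dataset.corr()["price"]
print(price_corr.sort_values(ascending=False))
```

Using double brackets, as in dataset.corr()[['price']], returns a one-column data frame instead, which is the shape a heat map expects.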

So let’s look at visualizing our correlation matrix with a heat map.

Python Correlation: Creating a Heat Map

We can pass the correlation matrix into a Seaborn heat map using the sns.heatmap function.

This will give us a heat map that looks like this.

Again, we can add parameters to our preference. We can pass in the parameter linewidths=1 and add annotations using annot=True.

And you can see that our heat map now looks quite different. Right now we have a pretty nice heat map. 

We can see the usefulness of adding the lines and the annotations. Again, if we press Shift + Tab, all the different parameters that can go into that will show up. 

Next, try adding method='spearman' to the corr call in our code, so you’ll know how to use a different type of correlation depending on your use case.
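
A minimal sketch of the annotated heat map with the Spearman method follows. It assumes a synthetic stand-in for the diamonds data (invented values), so the cell values are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs as a plain script; not needed in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "depth": rng.normal(61.7, 1.4, 500),
    "table": rng.normal(57.5, 2.2, 500),
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

corr = dataset.corr(method="spearman")

ax = sns.heatmap(
    corr,
    linewidths=1,  # thin gap between cells
    annot=True,    # write each correlation value inside its cell
)
```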

Heat Map With One Variable

Next, we isolate one variable and create a heat map with the correlation going from negative to positive.

This will give us this heat map below.

We can definitely change the styling as well. For example, we can use the cmap parameter as cmap='coolwarm'. This changes the colors to cool and warm tones and also eliminates our black background.

If we click Run, we will get this heat map below. For cool, we have the blue and then for warm, we have the red bars. 

We can also change the direction to align our map with the color bar. This is done by editing our sort_values call and adding ascending=False.

This will go from the most correlated (the red bar) to the least correlated (the blue bar). 
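
The one-variable heat map can be sketched like this. The data is a synthetic stand-in for the diamonds dataset (invented values), and fixing the color scale with vmin=-1 and vmax=1 is an assumption on my part about how the negative-to-positive range was set.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs as a plain script; not needed in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "depth": rng.normal(61.7, 1.4, 500),
    "table": rng.normal(57.5, 2.2, 500),
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

# Double brackets keep a one-column data frame; sort so the strongest
# positive correlation is at the top, matching the color bar.
price_corr = dataset.corr()[["price"]].sort_values(by="price", ascending=False)

ax = sns.heatmap(
    price_corr,
    vmin=-1, vmax=1,  # assumed: pin the color scale to the full correlation range
    annot=True,
    cmap="coolwarm",  # blue for negative, red for positive
)
```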

Python Correlation: Creating a Staircase Visual

One advanced way to visualize our correlations is to use a mask that blocks out the correlations we have already shown, since each pair of variables appears twice in the matrix.

We can do this with Numpy, using an array of True and False values to create a staircase visual for our correlations.
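
One common way to build such a mask is with np.triu, as in the sketch below. This is not necessarily the exact code from the tutorial, and the data is a synthetic stand-in for the diamonds dataset (invented values).

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs as a plain script; not needed in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diamonds data (invented values, illustration only).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 500)
dataset = pd.DataFrame({
    "carat": carat,
    "depth": rng.normal(61.7, 1.4, 500),
    "table": rng.normal(57.5, 2.2, 500),
    "price": 4000 * carat**1.5 + rng.normal(0, 500, 500),
})

corr = dataset.corr()

# True marks a cell to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Only the lower triangle is drawn, which gives the staircase shape.
ax = sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```

Hiding the diagonal as well removes the trivial correlation of each variable with itself.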

Here’s what the result should look like.

Let’s see how we can pipe this over into Power BI.

Staircase Visual in Power BI

First, open Power BI. I’ve brought in an avocado dataset so we can see a different visual. You can see this dataset under the Fields pane. Initialize the Python visual by clicking on the Python icon under the Visualizations pane. 

We need to create the dataset by adding in all the numerical variables that are indicated with the Σ. Add them by clicking the check boxes beside these variables. 

Now that we have a dataset, we can go over to our Jupyter notebook and copy the code we used earlier.

Then, we paste that code into the Python script editor in Power BI.

Next, we choose our visual, which will be the staircase visual. We go back to Jupyter and copy the code that we used for our staircase visual.

Paste the code into the Python script editor.

The last thing to do is make sure we call plt.show(), which is required in your Python script. Add plt.show() as the last line of the code and click the run icon in the upper right corner of the script editor.
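
The script that ends up in the editor might look roughly like the sketch below. In Power BI, the selected fields arrive as a pandas data frame named dataset; the first part here builds a hypothetical stand-in with invented values and assumed column names so the sketch can be tried outside Power BI, and it should be left out when pasting into the script editor.

```python
# --- Stand-in for testing outside Power BI (omit in the script editor) ---
import matplotlib
matplotlib.use("Agg")  # off-screen backend; only for running as a plain script
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
volume = rng.uniform(1e4, 1e6, 200)
dataset = pd.DataFrame({  # hypothetical column names and invented values
    "AveragePrice": 2.0 - volume / 1e6 + rng.normal(0, 0.1, 200),
    "Total Volume": volume,
    "Total Bags": 0.3 * volume + rng.normal(0, 1e4, 200),
    "type": rng.choice(["conventional", "organic"], 200),
})

# --- Script for the Python visual ---
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Keep only numeric fields; a categorical field such as type is skipped here.
corr = dataset.select_dtypes(include="number").corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm")
plt.show()  # the call that tells the Python visual to render the figure
```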

For a bigger visual, stretch the box a bit so we can see the script running in the corner. We have our visual for our heat map, which looks quite nice.

And in Power BI, we can definitely see how that visual may change according to the dataset. For example, we can click the Slicer icon in the Visualizations pane and go to Type in the Fields pane. 

It will give us the two types in our data set, the conventional and organic. If we click one type, say organic, you can see that the heat map changes. 

Changes will also apply when we click on the conventional type next.

Remember that we need to have a categorical variable in the dataset of our Python script to make these filters work. As we can see, the data set we created included type, enabling us to filter the visual in that manner. 

Conclusion

In this blog, you learned how to visualize correlations in Python and Power BI using different methods such as Pearson correlation and Spearman rank correlation. 

Now, you can create scatter plots, regression plots, correlation matrices, heat maps, and staircase visuals to get the best visual for your dataset. You can also use a variety of parameters to improve the styles and visuals.

All the best,

Gaelim Holland
