Pandas AI: Data Analysis With Artificial Intelligence

If you’re a Python programmer, chances are you’ve used the Pandas library for all your data manipulation and analysis needs. Well, guess what? It just got a turbo boost and is now diving headfirst into the world of AI! That’s right, hold on tight as we introduce you to the latest addition: Pandas AI.

PandasAI is an innovative Python library that integrates generative artificial intelligence capabilities with Pandas. This extension takes data analysis to the next level and provides a comprehensive solution for automating common tasks, generating synthetic datasets, and conducting unit tests. It allows you to use a natural language interface to scale key aspects of data analysis.

Data scientists can improve their workflow with Pandas AI and save endless hours thanks to its ability to reveal insights and patterns more quickly and efficiently. In this article, we’ll explore what Pandas AI is and how you can use it to supercharge your analytics.

Let’s get into it!

What is Pandas AI?

Pandas AI is a Python library that integrates generative AI capabilities, specifically OpenAI's technology, into your pandas dataframes.

It is designed to be used with the Pandas library and is not a replacement for it. The integration of AI within Pandas enhances the efficiency and effectiveness of data analysis tasks.

How to Get Started With Pandas AI

To get started with Pandas AI, install the package with the following command:

pip install pandasai

The command installs the Pandas AI package into your active Python environment.

After installing the library, you will need an API key to interact with a large language model on the backend.

We will use an OpenAI model in this demonstration. To get an API key from OpenAI, follow the steps below:

  1. Go to https://openai.com/api/ and sign up with your email address or connect your Google account.
  2. Go to “View API keys” on the left side of your personal account settings.
  3. Select “Create new secret key”.

After getting your API key, import the necessary libraries into your project notebook with the code below:

import pandas as pd
from pandasai import PandasAI

After importing the libraries, you must load a dataset into your notebook. The code below demonstrates this step:

dataframe = pd.read_csv("data.csv")

The next step is to instantiate an LLM with your API key.

from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="YOUR_API_TOKEN")
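Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is stored in an environment variable (the variable name `OPENAI_API_KEY` and the helper `load_api_token` are illustrative choices for this sketch, not part of PandasAI):

```python
import os


def load_api_token(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API token stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name your own environment defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var!r} is not set; "
            "export it before starting the notebook."
        )
    return token
```

The returned string can then be passed as `api_token` in place of the literal placeholder, for example `OpenAI(api_token=load_api_token())`.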

Next, you can ask questions about your dataset from your Python notebook.

pandas_ai = PandasAI(llm)
pandas_ai(dataframe, prompt='What is the average livingArea?')

This integration allows you to explore and analyze your dataset without writing any exploratory data analysis code.

Why Use PandasAI?

Pandas AI offers several benefits when working with your pandas dataframes:

  • Generative AI: It adds an extra layer of AI capabilities to your data analysis process, enabling you to generate new insights from existing data.
  • Conversational Interface: Pandas AI makes dataframes more conversational, allowing users to interact with data in a more intuitive and natural manner.
  • Documentation: In-depth documentation is provided for users who want to understand how to effectively utilize the library’s features within their projects.

Using Pandas AI can significantly improve your efficiency and productivity because it uses a large language model to make data easier to work with and interpret. This can lead to better-informed decision-making and faster results.

5 Examples and Use Cases of PandasAI

In this section, you’ll find some examples and use cases of using PandasAI in your projects. This will allow you to understand better when and how to use this tool.

Example 1: Querying Data

You can ask PandasAI to find all the rows in a dataframe where the value of a column is greater than a certain value.

For instance, you could find all properties with a livingArea greater than a certain size with the following prompt:

pandas_ai(dataframe, prompt='Which properties have a livingArea greater than 2000?')
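Because the answer comes from generated code, it can help to cross-check it against a plain pandas filter. A minimal sketch, using a small made-up dataframe in place of data.csv (the `address` column and all values are hypothetical; only `livingArea` comes from the prompt above):

```python
import pandas as pd

# Made-up sample standing in for the article's data.csv.
dataframe = pd.DataFrame(
    {
        "address": ["1 Oak St", "2 Elm St", "3 Pine St"],
        "livingArea": [1500, 2400, 3100],
    }
)

# Boolean mask: keep only rows whose livingArea exceeds 2000.
large_properties = dataframe[dataframe["livingArea"] > 2000]
print(large_properties)
```

If the rows returned by the prompt differ from this filter on your real data, the prompt was probably interpreted differently from what you intended.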

Example 2: Generating Charts

You can ask PandasAI to generate charts based on your data set.

For example, you could create a histogram showing the distribution of livingArea with the following prompt:

pandas_ai(dataframe, prompt='Plot the histogram of properties showing the distribution of livingArea')

When generating charts, try several phrasings of the prompt and compare the results, since different prompts may produce different charts. Then choose the one that best fits your needs.
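To sanity-check a generated histogram, the underlying bin counts can be computed directly. This is a sketch under stated assumptions: NumPy is available (it is installed alongside pandas), the `livingArea` values are made up, and the choice of three bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column name comes from the article.
dataframe = pd.DataFrame({"livingArea": [900, 1200, 1500, 1800, 2400, 3100]})

# Split the livingArea range into 3 equal-width bins and count rows per bin.
counts, bin_edges = np.histogram(dataframe["livingArea"], bins=3)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} properties")
```

A generated chart may use a different number of bins, so compare the overall shape rather than the exact bar heights.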

Example 3: Handling Multiple DataFrames

If you have data spread across multiple dataframes, you can use PandasAI as a manipulation tool by passing them all into PandasAI and asking questions that span across them.

Assuming you had another dataframe df2 with additional information about the properties:

pandas_ai([dataframe, df2], prompt='What is the average livingArea of waterfront properties?')
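For a question that spans two dataframes, the equivalent manual step is a join. The sketch below assumes both frames share a hypothetical `propertyId` key and that `df2` carries a boolean `waterfront` column; neither name is given in this article, and the data is made up.

```python
import pandas as pd

# Made-up data; propertyId and waterfront are assumed column names.
dataframe = pd.DataFrame(
    {"propertyId": [1, 2, 3, 4], "livingArea": [1500, 2400, 3100, 1800]}
)
df2 = pd.DataFrame(
    {"propertyId": [1, 2, 3, 4], "waterfront": [False, True, True, False]}
)

# Join on the shared key, then average livingArea over waterfront rows only.
merged = dataframe.merge(df2, on="propertyId", how="inner")
waterfront_mean = merged.loc[merged["waterfront"], "livingArea"].mean()
print(waterfront_mean)
```

Writing the join out once makes it clear which key the two frames must share for the natural-language question to be answerable.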

Example 4: Using Shortcuts

PandasAI provides a number of shortcuts to make common data processing tasks easier.

For example, you could impute missing values in your dataframe with the following shortcut method:

pandas_ai.impute_missing_values(dataframe)
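This article does not say which imputation strategy the shortcut applies. As a point of comparison, one common manual approach is to fill numeric gaps with the column mean, sketched here with made-up data (the `bedrooms` column is hypothetical):

```python
import pandas as pd

# Made-up data with one missing value per column.
dataframe = pd.DataFrame(
    {"livingArea": [1500.0, None, 3100.0], "bedrooms": [3.0, 4.0, None]}
)

# Per-column means skip missing values; use them as the fill values.
column_means = dataframe.mean(numeric_only=True)
imputed = dataframe.fillna(column_means)
print(imputed)
```

Mean imputation is only one option; whichever method is used, it is worth comparing row counts and summary statistics before and after.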

Example 5: Enforcing Privacy

If you want to enforce privacy, you can instantiate PandasAI with enforce_privacy=True so that it sends only the column names, not the head of the dataframe, to the LLM. This limits how much of your data leaves your machine, although the column names themselves are still shared.

You can enable it with the following code:

pandas_ai = PandasAI(llm, enforce_privacy=True)

When Not to Use PandasAI

PandasAI is an incredibly powerful tool that can simplify many data analysis tasks, but it’s not always the right tool for the job.

We’ve listed a few situations where you might not want to use PandasAI:

1. When Working With Sensitive Data

If you’re working with sensitive data, you may not want to use PandasAI, because it sends data to OpenAI’s servers.

Even though the library tries to anonymize the dataframe by randomizing it, and offers an option to enforce privacy by not sending the head of the dataframe to the servers, privacy concerns may remain.
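One extra precaution is to strip identifying columns before a dataframe is handed to any hosted model. A minimal sketch, where `owner_name`, `email`, and `phone` are hypothetical column names and `drop_sensitive_columns` is an illustrative helper, not a PandasAI function:

```python
from typing import List

import pandas as pd


def drop_sensitive_columns(df: pd.DataFrame, sensitive: List[str]) -> pd.DataFrame:
    """Return a copy of df without the listed columns.

    errors="ignore" skips names that are not present in df.
    """
    return df.drop(columns=sensitive, errors="ignore")


# Made-up data for illustration.
dataframe = pd.DataFrame(
    {
        "owner_name": ["A. Smith", "B. Jones"],
        "email": ["a@example.com", "b@example.com"],
        "livingArea": [1500, 2400],
    }
)

safe_frame = drop_sensitive_columns(dataframe, ["owner_name", "email", "phone"])
print(list(safe_frame.columns))
```

The original dataframe is left untouched, so the full data stays available for local analysis while only the reduced copy is passed on.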

2. When Working With Large Dataframes

PandasAI is not ideal for large dataframes. Because the tool sends a version of your dataframe to the cloud for processing, it could be slow and resource-intensive for large datasets.
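If a large dataframe has to be used anyway, one option is to narrow it to the relevant columns and a row sample first. This is a sketch only: `shrink_for_upload` and its parameters are hypothetical names, the `price` and `notes` columns are made up, and whether this helps depends on how much of the dataframe the tool actually transmits, which this article does not detail.

```python
from typing import List

import pandas as pd


def shrink_for_upload(
    df: pd.DataFrame, columns: List[str], max_rows: int = 1000, seed: int = 0
) -> pd.DataFrame:
    """Keep only the given columns and at most max_rows randomly sampled rows."""
    subset = df[columns]
    if len(subset) > max_rows:
        # A fixed seed makes the sample reproducible between runs.
        subset = subset.sample(n=max_rows, random_state=seed)
    return subset


# Made-up frame with 5000 rows and one column we do not need.
big_frame = pd.DataFrame(
    {
        "livingArea": range(5000),
        "price": range(5000),
        "notes": ["n/a"] * 5000,
    }
)

small_frame = shrink_for_upload(big_frame, ["livingArea", "price"])
print(small_frame.shape)
```

Answers computed on a sample are approximations, so exact totals or counts should still be calculated on the full dataframe.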

3. When Writing Simple Queries

For simple data manipulations and queries, using PandasAI might be overkill. Regular Pandas operations might be faster and more efficient.

For example, if you just want to calculate the mean of a column, using df['column'].mean() in Pandas is much more straightforward and faster than setting up a language model and making a request to an external server.
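For reference, the plain pandas version of the earlier "average livingArea" question is a single method call. The values below are made up for illustration:

```python
import pandas as pd

# Made-up values; only the column name comes from the article.
dataframe = pd.DataFrame({"livingArea": [1500, 2400, 3100]})

# One method call, no API key, no network round trip.
average_living_area = dataframe["livingArea"].mean()
print(average_living_area)
```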

4. When Learning Data Analysis

If you aim to learn data analysis and Python programming, relying on PandasAI might not be the best approach.

While it simplifies many tasks, it also abstracts away the underlying operations, which could impede your understanding of how things work under the hood.

5. When Cost Is a Concern

OpenAI’s API is not free, and using it extensively could lead to high costs. If you’re working on a project with a tight budget, you might want to stick to traditional data analysis methods.
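A rough budget check can be done before committing to heavy use. The sketch below is a back-of-the-envelope estimate only: `estimate_api_cost` is a hypothetical helper, the default of about four characters per token is a rule of thumb for English text, and the price per 1,000 tokens must be taken from your provider's current pricing page (the figure in the usage line is a placeholder, not a real price).

```python
def estimate_api_cost(
    num_requests: int,
    avg_prompt_chars: int,
    avg_response_chars: int,
    price_per_1k_tokens: float,
    chars_per_token: float = 4.0,
) -> float:
    """Rough cost estimate, in the same currency as price_per_1k_tokens."""
    # Approximate tokens from characters for one request (prompt plus response).
    tokens_per_request = (avg_prompt_chars + avg_response_chars) / chars_per_token
    total_tokens = tokens_per_request * num_requests
    return total_tokens / 1000 * price_per_1k_tokens


# Placeholder price; replace with the real figure for your model.
estimated = estimate_api_cost(
    num_requests=1000,
    avg_prompt_chars=2000,
    avg_response_chars=400,
    price_per_1k_tokens=0.002,
)
print(f"Estimated cost: {estimated:.2f}")
```

Real usage can differ noticeably from this estimate, since prompt size depends on what the library includes with each request.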

Final Thoughts

PandasAI stands as an important breakthrough in data analysis. It bridges the gap between natural language processing and traditional data science methodologies.

By integrating PandasAI into your workflow, you can simplify complex data tasks and embrace a more intuitive way of interacting with data. Furthermore, it significantly reduces the time spent on data analysis, allowing you to focus on deriving insights and making informed decisions.

However, remember that every tool has its place. PandasAI shines in many areas, but traditional data analysis methods still hold their own in specific use cases. The key is to understand when to utilize each tool for maximum efficiency.

Frequently Asked Questions

Q1: What is PandasAI?

PandasAI is a Python library that uses a large language model, such as those offered by OpenAI, to let you interact with your data using natural language. It simplifies complex data tasks, allowing you to ask questions, create plots, and manipulate dataframes using plain English commands.

Q2: Why should I use PandasAI instead of traditional Pandas functions?

PandasAI offers a more intuitive way of interacting with your data. Instead of writing lengthy code, you can simply ask questions or give commands in plain English. This can save significant time and effort, especially when working with more complex queries or multiple dataframes.

Q3: Is it safe to use PandasAI with sensitive data?

While PandasAI makes efforts to anonymize data, it does send a version of your dataframe to OpenAI’s servers. If you’re working with highly sensitive data, this might be a consideration. However, there’s an option to enforce privacy by not sending the dataframe’s head to the servers.

Q4: What are the limitations of PandasAI?

PandasAI might not be ideal for large dataframes, as it could be slow and resource-intensive. Also, for very simple queries or data manipulations, traditional Pandas operations might be more efficient.

Q5: Does using PandasAI incur costs?

Yes, OpenAI’s API, which PandasAI uses, is not free. Extensive use of the API could lead to costs, so it’s important to consider this when deciding whether to use PandasAI in your project.

Sam McKay, CFA
Sam is Enterprise DNA's CEO & Founder. He helps individuals and organizations develop data driven cultures and create enterprise value by delivering business intelligence training and education.
