Asking Questions About our Data
Thanks to Super Data Science we can look at a training data set with some sales data and gain some insights from it. Let’s take a little look at the data as it looks in Excel.
Upon initial inspection of the data, we can start thinking of some questions about it that we would want to answer.
What rep sold the most?
What item sold the most?
What region sold the most?
Loading the Data
Our data already has descriptive column names so that’ll be much easier for us since we won’t have to name the columns ourselves. But, we do need to understand what each of these columns mean to better understand our data.
To load the data into pandas, we must first import the packages that we’ll be using.
import pandas as pd
import numpy as np
We can use pandas to read in the CSV file with the read_csv method.
df = pd.read_csv("./OfficeSupplies.csv")If there are no errors when executing then the file loaded with no errors. But now what do we do with it? Pandas takes the data and creates a DataFrame data structure with it. The DataFrame allows us to do quite a bit of analysis on the data. We can look at the number of rows and columns to get a quick idea of how big our data is.
There are 43 rows and six columns in our data set. It’s not that big of a data set, but even small data sets can yield some good insights.
General Analysis of the Data
To get a quick idea of what the data looks like, we can call the head function on the data frame. By default, this returns the top five rows, but it can take in a parameter of how many rows to return.
This looks a lot like an Excel spreadsheet, doesn’t it? Under the hood, the data frame is a two-dimensional data structure and each column can have different types. To show that, we can call dtypes attribute on the data frame to see what each column types are.
Even though the first four columns are objects, we can see from the data that it’s text. Due to pandas using Numpy behind the scenes, it interprets strings as objects.
Next, we can look at some descriptive statistics of the data frame with the describe method.
This shows some descriptive statistics on the data set. Notice, it only shows the statistics on the numerical columns. From here you can see the following statistics:
Row count, which aligns to what the shape attribute showed us.
The mean, or average.
The standard deviation, or how spread out the data is.
The minimum and maximum value of each column
The number of items that fall within the first, second, and third percentiles.
We can tell a decent amount of our data from this, such as the average number of units sold is 49 and the average price of each unit is around $20.
Answering our Questions
Now we can do further analysis on our data to answer our questions. Before that, we should see if there are any missing values in our data set. In our current data set of 43 rows we can scan the data to tell there aren’t any missing values, but with much bigger data sets looking at the data by hand to find missing values will be too tedious. This is easy to do in pandas. To check if there are any missing values in the entire data set we use the isnull function, then see if there are any values.
We’re lucky we have such a nice data set and with no missing values. While we won’t focus on it in this post, a data scientist will spend their time cleaning (or wrangling ) the data. Since we don’t have any missing data, we can start doing further analysis on our data.
Who sold the most?
The first question we had was what rep had sold the most. Pandas has a lot of utility functions for querying the data frame to help us out. To answer this we can group by the “Rep” column and sum up the values in the columns. Notice in the result that pandas only does a sum on the numerical columns. Then we do a descending sort on the values based on the “Units” column.
Richard sold the most units, but did he sell the most in dollar amount? Does selling the most units mean the rep earned the most money? How can we determine that if there’s no column available to find that out? Pandas has you covered there, too. We can add columns to our data frame as we need (we can drop them, too, if they add too much noise to our data set). Pandas uses Numpy behind the scenes in the DataFrame object so it has the ability to do mathematical operations on columns, and it can do them quite fast.
df["Total Price"] = df["Units"] * df["Unit Price"]
Here we create a new column on our data frame called “Total Price” and multiply the “Units” and “Unit Price” columns to calculate it. We call the head function again to make sure it got added.
Now we can find who sold the most in total dollar amount.
df.groupby("Rep").sum().sort_values("Total Price", ascending=False).head()
We do the same grouping of the “Rep” column and sum up the values, but now we do a descending sort on the “Total Price” column values and print out the results.
Richard may have sold the most units, but Matthew sold the most in terms of total price.
What region sold the most?
We now know who sold the most, but we also have a “Region” column. With this, we can see who sold the most within each region.
group = df.groupby(["Region","Rep"]).sum()
total_price = group["Total Price"].groupby(level=0, group_keys=False)
Now we group by two columns, “Region” and “Rep”, and sum those values up. With that result, we save it off into a new data frame. Then take the “Total Price” column and group the values. In our groupby function, we specify what level to group by since we already did the first grouping, which creates a hierarchical index on the data. Then we take the largest five of those to print out. From the results, we can tell that Matthew sold the most in the Central region, Susan in the East, and James in the West.
Another thing to note from this result: the Central region has the same number of reps than East and West combined. That could be something to bring up to the sales team when presenting the data.