
Asking Questions About our Data
Thanks to Super Data Science, we can look at a training data set with some sales data and gain some insights from it. Let's take a quick look at the data as it appears in Excel.

Upon initial inspection of the data, we can start thinking of some questions about it that we would want to answer.

What rep sold the most?
What item sold the most?
What region sold the most?


Loading the Data
Our data already has descriptive column names, which makes things easier since we won't have to name the columns ourselves. But we do need to understand what each of these columns means to better understand our data.
To load the data into pandas, we must first import the packages that we’ll be using.
import pandas as pd
import numpy as np
We can use pandas to read in the CSV file with the read_csv method.

df = pd.read_csv("./OfficeSupplies.csv")
If no errors occur when executing, then the file loaded successfully. But now what do we do with it? Pandas takes the data and creates a DataFrame data structure with it. The DataFrame allows us to do quite a bit of analysis on the data. We can look at the number of rows and columns to get a quick idea of how big our data is.

df.shape

There are 43 rows and six columns in our data set. It’s not that big of a data set, but even small data sets can yield some good insights.

General Analysis of the Data
To get a quick idea of what the data looks like, we can call the head function on the data frame. By default, this returns the top five rows, but it can take in a parameter of how many rows to return.

df.head()
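head can also take a count of rows to return; for instance, a quick sketch showing the top ten rows instead:

df.head(10)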

This looks a lot like an Excel spreadsheet, doesn't it? Under the hood, the data frame is a two-dimensional data structure, and each column can have a different type. To show that, we can look at the dtypes attribute on the data frame to see what type each column is.

df.dtypes

Even though the first four columns are typed as object, we can see from the data that they hold text. Because pandas uses NumPy behind the scenes, it stores strings as generic objects.
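As a quick check, here is a small sketch that pulls out just the object-typed (text) columns:

df.select_dtypes(include="object").head()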

Next, we can look at some descriptive statistics of the data frame with the describe method.

df.describe()

This shows some descriptive statistics on the data set. Notice that it shows statistics only for the numerical columns. From here, you can see the following statistics:

Row count, which aligns with what the shape attribute showed us.
The mean, or average.
The standard deviation, or how spread out the data is.
The minimum and maximum value of each column.
The values at the 25th, 50th (median), and 75th percentiles.
We can tell a decent amount about our data from this, such as that the average number of units sold is 49 and the average price of each unit is around $20.
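To confirm a couple of these numbers ourselves, a small sketch that computes the mean and quartiles of the "Units" column directly:

df["Units"].mean()
df["Units"].quantile([0.25, 0.5, 0.75])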

Answering our Questions
Now we can do further analysis on our data to answer our questions. Before that, we should see if there are any missing values in our data set. In our current data set of 43 rows, we can scan the data by eye to tell there aren't any missing values, but with much bigger data sets, looking for missing values by hand would be too tedious. This is easy to do in pandas. To check if there are any missing values in the entire data set, we call the isnull method and then check whether any of the resulting values are True.

df.isnull().values.any()

We're lucky to have such a nice data set with no missing values. While we won't focus on it in this post, a data scientist will spend much of their time cleaning (or wrangling) the data. Since we don't have any missing data, we can start doing further analysis on our data.
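Had the check returned True, a per-column tally of missing values (a quick sketch) would show where the gaps are:

df.isnull().sum()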

Who sold the most?
The first question we had was which rep sold the most. Pandas has a lot of utility functions for querying the data frame to help us out. To answer this, we can group by the "Rep" column and sum up the values in the other columns. Notice in the result that pandas only sums the numerical columns (in recent pandas versions this must be requested explicitly with numeric_only=True). Then we do a descending sort on the values based on the "Units" column.

df.groupby(["Rep"]).sum(numeric_only=True).sort_values("Units", ascending=False)

Richard sold the most units, but did he sell the most in dollar amount? Does selling the most units mean the rep earned the most money? How can we determine that if there's no column for it? Pandas has you covered there, too. We can add columns to our data frame as we need them (and drop them, too, if they add too much noise to our data set). Pandas uses NumPy behind the scenes in the DataFrame object, so it can do mathematical operations on columns, and it can do them quite fast.

df["Total Price"] = df["Units"] * df["Unit Price"]
df.head()

Here we create a new column on our data frame called “Total Price” and multiply the “Units” and “Unit Price” columns to calculate it. We call the head function again to make sure it got added.
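If "Total Price" ever turned out to be noise, a hedged sketch of removing it again; note that drop returns a new data frame rather than modifying df in place:

df.drop(columns=["Total Price"])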

Now we can find who sold the most in total dollar amount.

df.groupby("Rep").sum().sort_values("Total Price", ascending=False).head()
We do the same grouping of the “Rep” column and sum up the values, but now we do a descending sort on the “Total Price” column values and print out the results.

Richard may have sold the most units, but Matthew sold the most in terms of total price.

What region sold the most?

We now know who sold the most, but we also have a “Region” column. With this, we can see who sold the most within each region.

group = df.groupby(["Region", "Rep"]).sum(numeric_only=True)
total_price = group["Total Price"].groupby(level=0, group_keys=False)
total_price.nlargest(5)

Now we group by two columns, "Region" and "Rep", and sum those values up, saving the result into a new data frame. Then we take the "Total Price" column and group its values again. In this groupby call, we specify the level to group by, since the first grouping created a hierarchical index on the data. Then we take the largest five of those to print out. From the results, we can tell that Matthew sold the most in the Central region, Susan in the East, and James in the West.
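If the hierarchical index is awkward to read, a small sketch that flattens the same top-five result back into ordinary columns:

total_price.nlargest(5).reset_index()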

Another thing to note from this result: the Central region has as many reps as East and West combined. That could be something to bring up to the sales team when presenting the data.
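We can verify that head count directly by counting the distinct reps per region (a quick sketch):

df.groupby("Region")["Rep"].nunique()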


Python, pandas: Categorize a range

df['PriceBin'] = pd.cut(df['PriceAvg'], bins=3)
df['PriceBin'].value_counts()

(54060.0, 2040000.0]      209
(2040000.0, 4020000.0]      4
(4020000.0, 6000000.0]      1
Name: PriceBin, dtype: int64


df['PriceBin'] = pd.qcut(df['PriceAvg'], q=3)
df['PriceBin'].value_counts().sort_index()

(59999.999, 210000.0]    77
(210000.0, 315000.0]     66
(315000.0, 6000000.0]    71
Name: PriceBin, dtype: int64

h = df.groupby('PriceBin')['SalesAvg'].median()

h = pd.DataFrame(h)
h.reset_index(inplace=True)
h


                PriceBin    SalesAvg
0  (59999.999, 210000.0]   42.000000
1   (210000.0, 315000.0]  145.166667
2  (315000.0, 6000000.0]  114.200000
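pd.cut can also take explicit bin edges and labels rather than an equal-width bin count. A sketch; the edges are borrowed from the qcut output above, and the 'PriceLabel' column name is illustrative:

df['PriceLabel'] = pd.cut(df['PriceAvg'],
                          bins=[0, 210000, 315000, 6000000],
                          labels=['low', 'mid', 'high'])
df['PriceLabel'].value_counts()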


Useful Python pandas snippets

Rename a data frame column
df.rename(columns={"contract_id": "deal_id"}, inplace=True)

Where statement
tips[tips['time'] == 'Dinner'].head(5)

VLOOKUP (left join)
mg = pd.merge(df, AgReg, on="deal_id", how="left")

Choose the first element of a split array, i.e. the first part of a string with a delimiter
df["cat"] = df["CategoryID"].str.split(',', n=1).str[0]
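A toy sketch of what that split does on a standalone value:

pd.Series(['12,34,56']).str.split(',', n=1).str[0]  # yields '12'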

Fill NA, NaN, or null values
df["CategoryID"].fillna("", inplace=True)

Convert to datetime
pd.to_datetime(df["start_date"], errors='ignore')

Combination of a WHERE filter and selecting some rows/columns
df[df['CategoryName_Art'] == 1].iloc[0:5, :]

Get data ready for a decision tree (one-hot encoding)
dmdf = pd.get_dummies(df[['ShamsiYear', 'ShamsiYearMonth', 'Brand', 'CategoryName', 'SubCategoryName', 'UnitPrice', 'SoldCoupon', 'Area2']])
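From there, a minimal sketch of actually fitting a tree with scikit-learn, assuming a hypothetical label column 'Sold' in df:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(dmdf, df['Sold'])  # 'Sold' is a hypothetical target column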


Pandas vs. SQL

If you already know SQL and want to migrate to Python, this article can help.

1) Simple SELECT


SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;


tips[['total_bill', 'tip', 'smoker', 'time']].head(5)


2) WHERE


SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;


tips[tips['time'] == 'Dinner'].head(5)


3) Multiple conditions

-- tips of more than $5.00 at Dinner meals

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;


tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]

SELECT *
FROM frame
WHERE col2 IS NULL;

frame[frame['col2'].isna()]


4) GROUP BY

SELECT sex, count(*)
FROM tips
GROUP BY sex;


tips.groupby('sex').size()

sex
Female     87
Male      157
dtype: int64

Notice that in the pandas code we used size() and not count(). This is because count() applies the function to each column, returning the number of not null records within each.

tips.groupby('sex').count()

        total_bill  tip  smoker  day  time  size
sex
Female          87   87      87   87    87    87
Male           157  157     157  157   157   157

Alternatively, we could have applied the count() method to an individual column:

tips.groupby('sex')['total_bill'].count()

sex
Female     87
Male      157
Name: total_bill, dtype: int64


SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;


tips.groupby('day').agg({'tip': 'mean', 'day': 'size'})


SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;


tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']})

INNER JOIN

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': np.random.randn(4)})

df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                    'value': np.random.randn(4)})

SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;

pd.merge(df1, df2, on='key')

merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s index.

indexed_df2 = df2.set_index('key')

pd.merge(df1, indexed_df2, left_on='key', right_index=True)

LEFT OUTER JOIN

SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;

pd.merge(df1, df2, on='key', how='left')

RIGHT JOIN

SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;

pd.merge(df1, df2, on='key', how='right')

FULL JOIN

SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;

pd.merge(df1, df2, on='key', how='outer')

UNION

SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;

pd.concat([df1, df2])

SQL's UNION is similar to UNION ALL; however, UNION will remove duplicate rows.

SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;

pd.concat([df1, df2]).drop_duplicates()

Top N rows with offset

SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;

tips.nlargest(10 + 5, columns='tip').tail(10)
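An equivalent sketch that mirrors OFFSET more literally: sort once, then slice rows 5 through 14:

tips.sort_values('tip', ascending=False).iloc[5:15]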

Top N rows per group

-- Oracle's ROW_NUMBER() analytic function

SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;

(tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
                    .groupby(['day'])
                    .cumcount() + 1)
     .query('rn < 3')
     .sort_values(['day', 'rn']))

UPDATE

UPDATE tips
SET tip = tip * 2
WHERE tip < 2;

tips.loc[tips['tip'] < 2, 'tip'] *= 2

DELETE

DELETE FROM tips
WHERE tip > 9;

tips = tips.loc[tips['tip'] <= 9]
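An alternative sketch using drop with the matching row labels (this also returns a new data frame):

tips.drop(tips[tips['tip'] > 9].index)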