Automatidata Project Lab

Jerad Acosta

Google Advanced Data Analytics Professional Certificate · lab · jupyter · python

Foundations of Data Science - This article is part of a series.
Part 5: This Article

Automatidata project #

Course 2 - Get Started with Python

Course 2 end-of-course project: Inspect and analyze data #

In this activity, you will examine data provided and prepare it for analysis.

The purpose of this project is to investigate and understand the data provided.

The goal is to construct a dataframe within Python, perform a cursory inspection of the provided dataset, and inform team members of your findings.

This activity has three parts:

Part 1: Understand the situation

  • How can you best prepare to understand and organize the provided taxi cab information?

Part 2: Understand the data

  • Create a pandas dataframe for data learning, future exploratory data analysis (EDA), and statistical activities.

  • View and interpret the dataset's data table.

  • Sort and interpret the data table for two variables of your choice.

Part 3: Understand the variables

  • What is the min, mean, and max of your chosen variables?

  • Visualize the variables


Follow the instructions and answer the following questions to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.

Identify data types and relevant variables using Python #

Exercise Instructions: #

Complete the following step-by-step instructions to inspect and analyze this NYC taxi dataset.

This activity will help ensure the information is:

  1. Ready to answer questions and yield insights

  2. Ready for visualizations

  3. Ready for future hypothesis testing and statistical methods

Follow the instructions and answer questions to complete this activity. Afterward,

  1. Write a short Executive Summary using your findings.

  2. Use the structured notebook provided to help you in this project. Please complete the questions inside and prepare a summary for the data team.

  3. Consider the questions presented in the Course 2 PACE strategy document.

  4. Compare your data insights with the provided exemplar to confirm your approach and results.

Throughout these project notebooks, you’ll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

PACE: Plan #

Step 1a. Understand the situation #

  1. How can you best prepare to understand and organize the provided taxi cab information?

Begin by exploring your dataset and consider reviewing the Data Dictionary.

# ==> EXEMPLAR CODE and OUTPUT
#
# Question 1: How can you best prepare?
#
# Q1 Answer:
# By reading the taxi cab data fields and ensuring I understand how
# each one impacts the dataset. Reviewing the fact sheet could also
# provide helpful background information.
#
# However, my primary goal is to get the data into Python, inspect it,
# and provide DeShawn with my initial observations.
# Afterwards, I can learn more deeply about the data and check
# for any anomalies.

Step 1b. Import dataframe #

Create a pandas dataframe for data learning, future exploratory data analysis (EDA), and statistical activities.

Code the following,

  • import pandas as pd #library for building dataframes

  • import numpy as np #numpy is imported with pandas

  • import matplotlib.pyplot as plt #visualization library

  • import seaborn as sns #visualization library

  • df = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')

Note: pair the data object name “df” with pandas functions to manipulate data, such as df.groupby().

# ==> EXEMPLAR CODE and OUTPUT
import pandas as pd               #library for building dataframes
import numpy as np                #numpy is imported with pandas
import matplotlib.pyplot as plt   #visualization library
import seaborn as sns             #visualization library

df = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')
print("done")
done
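As an aside, the df.groupby() pairing mentioned in the note above can be sketched with a small stand-in DataFrame (the column names mirror the taxi dataset; the values here are made up purely for illustration):

```python
import pandas as pd

# Stand-in rows; the real notebook loads 2017_Yellow_Taxi_Trip_Data.csv
df_demo = pd.DataFrame({
    'payment_type': [1, 1, 2, 2],
    'total_amount': [16.56, 20.80, 8.75, 9.80],
})

# Mean total fare per payment type
means = df_demo.groupby('payment_type')['total_amount'].mean()
print(means)
```

With the real df, the identical call summarizes all 22,699 trips by payment type.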

PACE: Analyze #

Step 2a. Understand the data - Interpret the data table #

View and interpret the dataset's data table by coding the following:

  1. df.head(10)
  2. df.info()

Consider the following two questions:

Question 1: When reviewing the df.head(10) output, are there any data points that surprise you or are not correct?

Question 2: When reviewing the df.info() output, what kind of data types are we working with?

# ==> EXEMPLAR CODE and OUTPUT
df.head(10)

| | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 |
| 1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 |
| 2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 |
| 3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 |
| 4 | 30841670 | 2 | 04/15/2017 11:32:20 PM | 04/15/2017 11:49:03 PM | 1 | 4.37 | 1 | N | 4 | 112 | 2 | 16.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 17.80 |
| 5 | 23345809 | 2 | 03/25/2017 8:34:11 PM | 03/25/2017 8:42:11 PM | 6 | 2.30 | 1 | N | 161 | 236 | 1 | 9.0 | 0.5 | 0.5 | 2.06 | 0.0 | 0.3 | 12.36 |
| 6 | 37660487 | 2 | 05/03/2017 7:04:09 PM | 05/03/2017 8:03:47 PM | 1 | 12.83 | 1 | N | 79 | 241 | 1 | 47.5 | 1.0 | 0.5 | 9.86 | 0.0 | 0.3 | 59.16 |
| 7 | 69059411 | 2 | 08/15/2017 5:41:06 PM | 08/15/2017 6:03:05 PM | 1 | 2.98 | 1 | N | 237 | 114 | 1 | 16.0 | 1.0 | 0.5 | 1.78 | 0.0 | 0.3 | 19.58 |
| 8 | 8433159 | 2 | 02/04/2017 4:17:07 PM | 02/04/2017 4:29:14 PM | 1 | 1.20 | 1 | N | 234 | 249 | 2 | 9.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 9.80 |
| 9 | 95294817 | 1 | 11/10/2017 3:20:29 PM | 11/10/2017 3:40:55 PM | 1 | 1.60 | 1 | N | 239 | 237 | 1 | 13.0 | 0.0 | 0.5 | 2.75 | 0.0 | 0.3 | 16.55 |
# ==> EXEMPLAR CODE and OUTPUT
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  tolls_amount           22699 non-null  float64
 16  improvement_surcharge  22699 non-null  float64
 17  total_amount           22699 non-null  float64
dtypes: float64(8), int64(7), object(3)
memory usage: 3.1+ MB
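One thing df.info() reveals: the pickup and dropoff timestamps are stored as object (strings), not as datetimes. The lab does not require converting them, but a hedged sketch of how that conversion might look with pd.to_datetime (the demo rows below are made up in the dataset's MM/DD/YYYY h:mm:ss AM/PM format):

```python
import pandas as pd

# Made-up rows in the same string format as the dataset's datetime columns
df_demo = pd.DataFrame({
    'tpep_pickup_datetime':  ['03/25/2017 8:55:43 AM'],
    'tpep_dropoff_datetime': ['03/25/2017 9:09:47 AM'],
})

# Convert both object columns to datetime64
for col in ['tpep_pickup_datetime', 'tpep_dropoff_datetime']:
    df_demo[col] = pd.to_datetime(df_demo[col], format='%m/%d/%Y %I:%M:%S %p')

# Trip duration in minutes, now computable directly
duration_min = (df_demo['tpep_dropoff_datetime']
                - df_demo['tpep_pickup_datetime']).dt.total_seconds() / 60
print(duration_min)
```

Converted columns make duration-based sanity checks (and later EDA) much easier.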

Step 2b. Understand the data - Sort by variables #

Sort and interpret the data table for two variables of your choice.

Answer the following three questions:

Question 1: Sort your first variable (trip_distance) from maximum to minimum value. Do the values seem normal?

Question 2: Sort by your second variable (total_amount). Are any values unusual?

Question 3: Are the resulting rows similar for both sorts? Why or why not?

# ==> EXEMPLAR CODE and OUTPUT

df_sort = df.sort_values(by=['trip_distance'],ascending=False) 
df_sort.head(10)

| | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9280 | 51810714 | 2 | 06/18/2017 11:33:25 PM | 06/19/2017 12:12:38 AM | 2 | 33.96 | 5 | N | 132 | 265 | 2 | 150.00 | 0.0 | 0.0 | 0.00 | 0.00 | 0.3 | 150.30 |
| 13861 | 40523668 | 2 | 05/19/2017 8:20:21 AM | 05/19/2017 9:20:30 AM | 1 | 33.92 | 5 | N | 229 | 265 | 1 | 200.01 | 0.0 | 0.5 | 51.64 | 5.76 | 0.3 | 258.21 |
| 6064 | 49894023 | 2 | 06/13/2017 12:30:22 PM | 06/13/2017 1:37:51 PM | 1 | 32.72 | 3 | N | 138 | 1 | 1 | 107.00 | 0.0 | 0.0 | 55.50 | 16.26 | 0.3 | 179.06 |
| 10291 | 76319330 | 2 | 09/11/2017 11:41:04 AM | 09/11/2017 12:18:58 PM | 1 | 31.95 | 4 | N | 138 | 265 | 2 | 131.00 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 131.80 |
| 29 | 94052446 | 2 | 11/06/2017 8:30:50 PM | 11/07/2017 12:00:00 AM | 1 | 30.83 | 1 | N | 132 | 23 | 1 | 80.00 | 0.5 | 0.5 | 18.56 | 11.52 | 0.3 | 111.38 |
| 18130 | 90375786 | 1 | 10/26/2017 2:45:01 PM | 10/26/2017 4:12:49 PM | 1 | 30.50 | 1 | N | 132 | 220 | 1 | 90.50 | 0.0 | 0.5 | 19.85 | 8.16 | 0.3 | 119.31 |
| 5792 | 68023798 | 2 | 08/11/2017 2:14:01 PM | 08/11/2017 3:17:31 PM | 1 | 30.33 | 2 | N | 132 | 158 | 1 | 52.00 | 0.0 | 0.5 | 14.64 | 5.76 | 0.3 | 73.20 |
| 15350 | 77309977 | 2 | 09/14/2017 1:44:44 PM | 09/14/2017 2:34:29 PM | 1 | 28.23 | 2 | N | 13 | 132 | 1 | 52.00 | 0.0 | 0.5 | 4.40 | 5.76 | 0.3 | 62.96 |
| 10302 | 43431843 | 1 | 05/15/2017 8:11:34 AM | 05/15/2017 9:03:16 AM | 1 | 28.20 | 2 | N | 90 | 132 | 1 | 52.00 | 0.0 | 0.5 | 11.71 | 5.76 | 0.3 | 70.27 |
| 2592 | 51094874 | 2 | 06/16/2017 6:51:20 PM | 06/16/2017 7:41:42 PM | 1 | 27.97 | 2 | N | 261 | 132 | 2 | 52.00 | 4.5 | 0.5 | 0.00 | 5.76 | 0.3 | 63.06 |
# ==> EXEMPLAR CODE and OUTPUT
df_sort = df.sort_values(by=['total_amount'],ascending=False)
df_sort.head(10)

| | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8476 | 11157412 | 1 | 02/06/2017 5:50:10 AM | 02/06/2017 5:51:08 AM | 1 | 2.60 | 5 | N | 226 | 226 | 1 | 999.99 | 0.0 | 0.0 | 200.00 | 0.00 | 0.3 | 1200.29 |
| 20312 | 107558404 | 2 | 12/19/2017 9:40:46 AM | 12/19/2017 9:40:55 AM | 2 | 0.00 | 5 | N | 265 | 265 | 2 | 450.00 | 0.0 | 0.0 | 0.00 | 0.00 | 0.3 | 450.30 |
| 13861 | 40523668 | 2 | 05/19/2017 8:20:21 AM | 05/19/2017 9:20:30 AM | 1 | 33.92 | 5 | N | 229 | 265 | 1 | 200.01 | 0.0 | 0.5 | 51.64 | 5.76 | 0.3 | 258.21 |
| 12511 | 107108848 | 2 | 12/17/2017 6:24:24 PM | 12/17/2017 6:24:42 PM | 1 | 0.00 | 5 | N | 265 | 265 | 1 | 175.00 | 0.0 | 0.0 | 46.69 | 11.75 | 0.3 | 233.74 |
| 15474 | 55538852 | 2 | 06/06/2017 8:55:01 PM | 06/06/2017 8:55:06 PM | 1 | 0.00 | 5 | N | 265 | 265 | 1 | 200.00 | 0.0 | 0.5 | 11.00 | 0.00 | 0.3 | 211.80 |
| 6064 | 49894023 | 2 | 06/13/2017 12:30:22 PM | 06/13/2017 1:37:51 PM | 1 | 32.72 | 3 | N | 138 | 1 | 1 | 107.00 | 0.0 | 0.0 | 55.50 | 16.26 | 0.3 | 179.06 |
| 16379 | 101198443 | 2 | 11/30/2017 10:41:11 AM | 11/30/2017 11:31:45 AM | 1 | 25.50 | 5 | N | 132 | 265 | 2 | 140.00 | 0.0 | 0.5 | 0.00 | 16.26 | 0.3 | 157.06 |
| 3582 | 111653084 | 1 | 01/01/2017 11:53:01 PM | 01/01/2017 11:53:42 PM | 1 | 7.30 | 5 | N | 1 | 1 | 1 | 152.00 | 0.0 | 0.0 | 0.00 | 0.00 | 0.3 | 152.30 |
| 11269 | 51920669 | 1 | 06/19/2017 12:51:17 AM | 06/19/2017 12:52:12 AM | 2 | 0.00 | 5 | N | 265 | 265 | 1 | 120.00 | 0.0 | 0.0 | 20.00 | 11.52 | 0.3 | 151.82 |
| 9280 | 51810714 | 2 | 06/18/2017 11:33:25 PM | 06/19/2017 12:12:38 AM | 2 | 33.96 | 5 | N | 132 | 265 | 2 | 150.00 | 0.0 | 0.0 | 0.00 | 0.00 | 0.3 | 150.30 |

Exemplar response: The two outputs show mostly different rows of data. A reason for this might be that some trips have a shorter distance yet a higher total amount (total ride fare).
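The overlap between the two top-10 lists can also be checked programmatically rather than by eye. A minimal sketch using a tiny made-up DataFrame (with the real df, replace head(2) with head(10)):

```python
import pandas as pd

# Tiny made-up sample standing in for the taxi data
df_demo = pd.DataFrame({
    'trip_distance': [33.96, 0.00, 2.60, 12.83],
    'total_amount': [150.30, 450.30, 1200.29, 59.16],
})

top_by_distance = set(df_demo.sort_values('trip_distance', ascending=False).head(2).index)
top_by_amount = set(df_demo.sort_values('total_amount', ascending=False).head(2).index)

# Row indices that appear in both "top" lists
overlap = top_by_distance & top_by_amount
print(overlap)
```

An empty or small overlap confirms that long trips and expensive trips are not the same rows.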

Step 2c. Understand the data - Use df.describe() #

Instructions: Using pandas df.describe(), what is the min, mean, and max of your chosen variables?

Code and perform the following:

  • df.describe()

  • Question 1: What is the min, mean, and max of your first variable?

  • Question 2: What is the min, mean, and max of your second variable?

  • Question 3: Are the values easily readable? What could be done to make them more easily readable?

# ==> EXEMPLAR CODE and OUTPUT

df.describe()

#Question 1: What is the min, mean, and max of your first variable?

#Answer:             Min        Mean    Max
#total_amount        -120.30    16.31   1200.29

#Question 2: What is the min, mean, and max of your second variable?

#Answer:               Min   Mean    Max
#trip_distance         0     2.91    33.96

| | Unnamed: 0 | VendorID | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.269900e+04 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 | 22699.000000 |
| mean | 5.675849e+07 | 1.556236 | 1.642319 | 2.913313 | 1.043394 | 162.412353 | 161.527997 | 1.336887 | 13.026629 | 0.333275 | 0.497445 | 1.835781 | 0.312542 | 0.299551 | 16.310502 |
| std | 3.274493e+07 | 0.496838 | 1.285231 | 3.653171 | 0.708391 | 66.633373 | 70.139691 | 0.496211 | 13.243791 | 0.463097 | 0.039465 | 2.800626 | 1.399212 | 0.015673 | 16.097295 |
| min | 1.212700e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | -120.000000 | -1.000000 | -0.500000 | 0.000000 | 0.000000 | -0.300000 | -120.300000 |
| 25% | 2.852056e+07 | 1.000000 | 1.000000 | 0.990000 | 1.000000 | 114.000000 | 112.000000 | 1.000000 | 6.500000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 0.300000 | 8.750000 |
| 50% | 5.673150e+07 | 2.000000 | 1.000000 | 1.610000 | 1.000000 | 162.000000 | 162.000000 | 1.000000 | 9.500000 | 0.000000 | 0.500000 | 1.350000 | 0.000000 | 0.300000 | 11.800000 |
| 75% | 8.537452e+07 | 2.000000 | 2.000000 | 3.060000 | 1.000000 | 233.000000 | 233.000000 | 2.000000 | 14.500000 | 0.500000 | 0.500000 | 2.450000 | 0.000000 | 0.300000 | 17.800000 |
| max | 1.134863e+08 | 2.000000 | 6.000000 | 33.960000 | 99.000000 | 265.000000 | 265.000000 | 4.000000 | 999.990000 | 4.500000 | 0.500000 | 200.000000 | 19.100000 | 0.300000 | 1200.290000 |
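For Question 3, one way to make the describe() output more readable is to round the summary table, or to set a pandas display option so every float prints with two decimals. A sketch using a small made-up column standing in for total_amount:

```python
import pandas as pd

# Small made-up column standing in for total_amount
df_demo = pd.DataFrame({'total_amount': [16.56, 20.80, 8.75]})

# Option 1: round the summary table itself
summary = df_demo.describe().round(2)
print(summary)

# Option 2: format every float pandas displays to two decimals
pd.set_option('display.float_format', '{:.2f}'.format)
print(df_demo.describe())
```

Rounding does not alter the underlying data, only what is shown, so downstream statistics are unaffected.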

PACE: Construct #

Step 3a. Visualize your variables #

Instructions: Create a histogram for each of the two variables. Act and reflect on the following steps:

  1. Histogram of your first variable (total_amount)

  2. Histogram of your second variable (trip_distance)

  3. Are your variables numerical (did the code work)?

# ==> EXEMPLAR CODE and OUTPUT
plt.figure(figsize=(8,6))
plt.xticks(fontsize=14); plt.yticks(fontsize=14)
df = df.sort_values(by='total_amount')
plt.hist(df['total_amount'],bins=[0,20,40,60,80,100])
plt.title('Histogram of Total Amount',fontsize=20)
plt.xticks(fontsize=15); plt.yticks(fontsize=15)
plt.xlabel('$ amount bin',fontsize=20)
plt.ylabel('Count', fontsize=20)
Text(0, 0.5, 'Count')

(figure: Histogram of Total Amount)

# ==> EXEMPLAR CODE and OUTPUT
plt.figure(figsize=(8,6))
#plt.xticks(fontsize=14); plt.yticks(fontsize=14)
df = df.sort_values(by='trip_distance')
plt.hist(df['trip_distance'],bins=[0,5,10,15,20,25])
plt.title('Histogram of Trip Distance',fontsize=20)
plt.xticks(fontsize=15); plt.yticks(fontsize=15)
plt.xlabel('Trip distance bin',fontsize=20)
plt.ylabel('Count', fontsize=20)
Text(0, 0.5, 'Count')

(figure: Histogram of Trip Distance)

PACE: Execute #

Given your efforts, what can you summarize for DeShawn? #

Note for Learners: Your answer should address Luana’s request for a summary that covers the following points:

  • A summary of the data type of each variable

  • Any relevant and irrelevant columns

  • The minimum, mean, and max for the two most relevant variables

  • A summary of the data visualization
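Much of the material for those bullet points can be gathered with a few pandas calls. A sketch against a small stand-in DataFrame (with the real df, the same calls apply unchanged):

```python
import pandas as pd

# Stand-in columns; the real df has 18 columns including these
df_demo = pd.DataFrame({
    'trip_distance': [1.00, 2.90, 33.96],
    'total_amount': [8.75, 16.31, 1200.29],
    'store_and_fwd_flag': ['N', 'N', 'Y'],
})

# Data type of each variable, for the summary
print(df_demo.dtypes)

# Min, mean, and max for the two most relevant variables
stats = df_demo[['trip_distance', 'total_amount']].agg(['min', 'mean', 'max'])
print(stats)
```

The dtypes listing feeds the first bullet directly, and the agg table gives the min/mean/max figures for the third.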


