
Exploratory Data Analysis Using Basic Data Functions with Python

Jerad Acosta
Google Advanced Data Analytics Professional Certificate lab jupyter python
Foundations of Data Science - This article is part of a series.
Part 10: This Article

Discovering #

Throughout the following exercises, you will practice data discovery skills on a dataset. Before starting on this programming exercise, we strongly recommend watching the video lecture and completing the IVQ for the associated topics.

All the information you need for solving this assignment is in this notebook, and all the code you will be implementing will take place within this notebook.

As we move forward, you can find instructions on how to install required libraries as they arise in this notebook. Before we begin with the exercises and analyzing the data, we need to import all libraries and extensions required for this programming exercise. Throughout the course, we will be using pandas, numpy, and datetime for operations, and matplotlib.pyplot and seaborn for plotting.

Objective #

First, we will use pandas to examine 2018 lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA). Then, we will calculate the total number of strikes for each month, and plot this information on a bar graph.

Begin with importing the following packages and libraries:

import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
# Read in the 2018 lightning strike dataset
df = pd.read_csv('eda_using_basic_data_functions_in_python_dataset1.csv')
# Inspect the first 10 rows
df.head(10)

         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)
5  2018-01-03                119     POINT(-78 28)
6  2018-01-03                 35   POINT(-79.3 28)
7  2018-01-03                 60   POINT(-79.1 28)
8  2018-01-03                 41   POINT(-78.7 28)
9  2018-01-03                119   POINT(-78.6 28)

Notice that the data is structured with one row per date and location: each row records the number of strikes observed at a given geometric point on a given day, so a single date can appear in many rows.

A quick way to see how many rows and columns of data we have in total is to use df.shape. The information is output as a tuple: (rows, columns).

df.shape
(3401012, 3)

The total number of rows is 3,401,012 and there are three columns.

# Get more information about the data, including data types of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   date               object
 1   number_of_strikes  int64 
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 77.8+ MB

df.info() gives us the total number of rows (3,401,012) and columns (3). It also tells us the name and data type of each column, as well as the size of the dataframe in memory.

In this case, notice that the date column is an ‘object’ type rather than a ‘date’ type. Here, the object dtype means the dates are stored as strings. When dates are encoded as strings, they cannot be manipulated as easily. Converting string dates to datetime will enable us to work with them much more easily.

Let’s convert to datetime using the pandas function to_datetime().

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])
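
As a quick optional check (not part of the original lab), you can confirm the conversion worked by inspecting the column’s dtype, which should now be a datetime type; since this dataset covers 2018, the year values should reflect that as well:

# Confirm the date column is now a datetime type (expected: datetime64[ns])
df['date'].dtype
# Sanity check: for this dataset, the years should all be 2018
df['date'].dt.year.min(), df['date'].dt.year.max()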

As part of discovering, we want to get an idea of the highest data points. For this dataset, we can calculate the top 10 days of 2018 with the greatest number of lightning strikes using the groupby(), sum(), and sort_values() functions from pandas.

When we use groupby() on the date column, the function combines all rows with the same date into a single row.

Then, using sum() will perform a sum calculation on all summable columns. In this case, we’re summing all the lightning strikes that happened on each day. Notice that the center_point_geom column is not included in the output because, as a string column, it is not summable. (Depending on your pandas version, you may need to pass numeric_only=True, or select the number_of_strikes column explicitly, to exclude it.)

Finally, sort_values() with ascending=False orders the results by total strikes in descending order, and head(10) keeps the top 10 days.

# Calculate days with most lightning strikes
df.groupby(['date']).sum().sort_values('number_of_strikes', ascending=False).head(10) 

            number_of_strikes
date
2018-08-29            1070457
2018-08-17             969774
2018-08-28             917199
2018-08-27             824589
2018-08-30             802170
2018-08-19             786225
2018-08-18             741180
2018-08-16             734475
2018-08-31             723624
2018-08-15             673455

As a word of caution, you’ll find that a common mistake among data professionals is using count() instead of sum(), and vice versa. In our case, count() would return the number of occurrences of each date in the dataset, which is not what we want.
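
To make the difference concrete on this dataset, compare the two on the same grouping (an illustrative sketch; outputs not shown here):

# count() tallies how many rows (location records) exist for each date
df.groupby(['date']).count().head()

# sum() adds up the strikes recorded across those rows, which is what we want
df.groupby(['date'])['number_of_strikes'].sum().head()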

Next, let’s extract the month data from the date column and add that extracted month data into a new column called month. This is why converting the date column to datetime is very useful. dt.month extracts just the month information (as a numeric value) from our date.

# Create a new `month` column
df['month'] = df['date'].dt.month
df.head()

        date  number_of_strikes center_point_geom  month
0 2018-01-03                194     POINT(-75 27)      1
1 2018-01-03                 41   POINT(-78.4 29)      1
2 2018-01-03                 33   POINT(-73.9 27)      1
3 2018-01-03                 38   POINT(-73.8 27)      1
4 2018-01-03                 92     POINT(-79 28)      1

Now we can sort our values by most strikes per month. We’ll use groupby(), sum() and sort_values() from pandas again.

# Calculate total number of strikes per month
df.groupby(['month']).sum().sort_values('number_of_strikes', ascending=False).head(12)

       number_of_strikes
month
8               15525255
7                8320400
6                6445083
5                4166726
9                3018336
2                2071315
4                1524339
10               1093962
1                 860045
3                 854168
11                409263
12                312097

To help us read the data more easily, we’ll convert the month number to its name using the datetime accessor dt.month_name() and add this as a new column in the dataframe. str.slice(stop=3) keeps only the first three letters of each month name.

# Create a new `month_txt` column
df['month_txt'] = df['date'].dt.month_name().str.slice(stop=3)
df.head()

        date  number_of_strikes center_point_geom  month month_txt
0 2018-01-03                194     POINT(-75 27)      1       Jan
1 2018-01-03                 41   POINT(-78.4 29)      1       Jan
2 2018-01-03                 33   POINT(-73.9 27)      1       Jan
3 2018-01-03                 38   POINT(-73.8 27)      1       Jan
4 2018-01-03                 92     POINT(-79 28)      1       Jan

Our objective is to plot the total number of strikes per month as a bar graph. To help us with the plotting, we’ll create a new dataframe called df_by_month. This will allow us to easily access the month, month text, and total number of strikes for each month.

# Create new helper dataframe for plotting
df_by_month = df.groupby(['month','month_txt']).sum().sort_values('month', ascending=True).head(12).reset_index()
df_by_month

    month month_txt  number_of_strikes
0       1       Jan             860045
1       2       Feb            2071315
2       3       Mar             854168
3       4       Apr            1524339
4       5       May            4166726
5       6       Jun            6445083
6       7       Jul            8320400
7       8       Aug           15525255
8       9       Sep            3018336
9      10       Oct            1093962
10     11       Nov             409263
11     12       Dec             312097

Now we’ll make our bar chart. Pyplot’s plt.bar() function takes positional arguments of x and height, representing the data used for the x- and y- axes, respectively. We want the x-axis to represent months, and the y-axis to represent strike count.

plt.bar(x=df_by_month['month_txt'], height=df_by_month['number_of_strikes'], label="Number of strikes")
plt.xlabel("Months (2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by month")
plt.legend()
plt.show()

[Figure: bar chart of the total number of lightning strikes per month in 2018]
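
The imports described at the start of this lab also mention seaborn for plotting. As an optional aside (not part of the original exercise, and assuming seaborn is installed), the same chart can be sketched with seaborn’s barplot:

import seaborn as sns

# Same monthly totals, drawn with seaborn instead of plt.bar
sns.barplot(data=df_by_month, x='month_txt', y='number_of_strikes', color='tab:blue')
plt.xlabel("Months (2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by month")
plt.show()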

If you have successfully completed the material above, congratulations! You have practiced some of the fundamental elements of data discovery that you can apply to your own datasets.


