
Data Structuring #

Throughout the following exercises, you will practice structuring data in Python. Before starting on this programming exercise, we strongly recommend watching the video lecture and completing the IVQ for the associated topics.

All the information you need for solving this assignment is in this notebook, and all the code you will be implementing will take place within this notebook.

As we move forward, you can find instructions on how to install required libraries as they arise in this notebook. Before we begin with the exercises and analyzing the data, we need to import all libraries and extensions required for this programming exercise. Throughout the course, we will be using pandas for operations, and matplotlib and seaborn for plotting.

Objective #

We will be examining lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) for 2018.

First, we will find the locations with the greatest number of strikes within a single day.
Then, we will examine the locations that had the greatest number of days with at least one lightning strike.
Next, we will determine whether certain days of the week had more lightning strikes than others.
Finally, we will add data from 2016 and 2017 and, for each month, calculate the percentage of total lightning strikes for that year that occurred in that month. We will then plot this data on a bar graph.

# Import statements
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
from matplotlib import pyplot as plt
# Read in the 2018 data
df = pd.read_csv('eda_structuring_with_python_dataset1.csv') 
df.head()

         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)

Just like the data you encountered previously, the dataset has three columns: date, number_of_strikes and center_point_geom. Start by converting the date column to datetime.

# Convert the `date` column to datetime
df['date'] = pd.to_datetime(df['date']) 

Now, let’s check the shape of the dataframe.

df.shape
(3401012, 3)

Let’s do a quick check for duplicates. If the shape of the data is different after running this code, we’ll know there were duplicate rows.

df.drop_duplicates().shape
(3401012, 3)

The shape of the dataset is unchanged after dropping duplicates, so we can assume there are none. In other words, each row represents a unique combination of date and location, along with its strike count.
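Had the shapes differed, we could have kept the deduplicated version. A minimal sketch (a no-op on this dataset):

# Remove duplicate rows, keeping the first occurrence of each
# (not needed here, since we just confirmed there are no duplicates)
df = df.drop_duplicates().reset_index(drop=True)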

Locations with most strikes in a single day #

To identify the locations with the most strikes in a single day, sort the number_of_strikes column in descending order, from most to least.

# Sort by number of strikes in descending order
df.sort_values(by='number_of_strikes', ascending=False).head(10)

              date  number_of_strikes  center_point_geom
302758  2018-08-20               2211  POINT(-92.5 35.5)
278383  2018-08-16               2142  POINT(-96.1 36.1)
280830  2018-08-17               2061  POINT(-90.2 36.1)
280453  2018-08-17               2031  POINT(-89.9 35.9)
278382  2018-08-16               1902  POINT(-96.2 36.1)
11517   2018-02-10               1899  POINT(-95.5 28.1)
277506  2018-08-16               1878  POINT(-89.7 31.5)
24906   2018-02-25               1833  POINT(-98.7 28.9)
284320  2018-08-17               1767    POINT(-90.1 36)
24825   2018-02-25               1741      POINT(-98 29)

Locations with most days with at least one lightning strike #

To find the number of days that a given geographic location had at least one lightning strike, we’ll use the value_counts() function on the center_point_geom column. The logic is that if each row represents a location-day, then counting the number of times each location occurs in the data will give us the number of days that location had lightning.

# Identify locations that appear most in the dataset
df.center_point_geom.value_counts()
POINT(-81.5 22.5)     108
POINT(-84.1 22.4)     108
POINT(-82.5 22.9)     107
POINT(-82.7 22.9)     107
POINT(-82.5 22.8)     106
                     ... 
POINT(-119.3 35.1)      1
POINT(-119.3 35)        1
POINT(-119.6 35.6)      1
POINT(-119.4 35.6)      1
POINT(-58.5 45.3)       1
Name: center_point_geom, Length: 170855, dtype: int64

We find that the locations with the most days with lightning strikes had at least one strike on 108 days—nearly one out of every three days of the year. They are all rather close to each other geographically, which makes sense. Notice also that the value_counts() function automatically sorts the results in descending order.
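If you’d like to verify that logic, the same counts can be reproduced with groupby(). A quick sketch:

# Cross-check value_counts(): count the rows (location-days) per location
day_counts = (df.groupby('center_point_geom')
                .size()
                .sort_values(ascending=False))
day_counts.head()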

Let’s examine whether the values are evenly distributed, or whether 106+ days with lightning is an unusually high number. We’ll use the value_counts() function again, but this time we’ll output the top 20 results. We’ll also rename the columns and apply a color gradient.

# Identify top 20 locations with most days of lightning
(df.center_point_geom.value_counts()[:20]
   .rename_axis('unique_values')
   .reset_index(name='counts')
   .style.background_gradient())
        unique_values  counts
0   POINT(-81.5 22.5)     108
1   POINT(-84.1 22.4)     108
2   POINT(-82.5 22.9)     107
3   POINT(-82.7 22.9)     107
4   POINT(-82.5 22.8)     106
5   POINT(-84.2 22.3)     106
6     POINT(-76 20.5)     105
7   POINT(-75.9 20.4)     105
8   POINT(-82.2 22.9)     104
9     POINT(-78 18.2)     104
10  POINT(-83.9 22.5)     103
11    POINT(-84 22.4)     102
12    POINT(-82 22.8)     102
13    POINT(-82 22.4)     102
14  POINT(-82.3 22.9)     102
15    POINT(-78 18.3)     102
16  POINT(-84.1 22.5)     101
17  POINT(-75.5 20.6)     101
18  POINT(-84.2 22.4)     101
19    POINT(-76 20.4)     101

Lightning strikes by day of week #

One useful grouping is categorizing lightning strikes by day of the week, which will tell us whether any particular day of the week had fewer or more lightning strikes than others. To calculate this, we’ll take advantage of the fact that the data in our date column is of the datetime class. Because these entries are datetime objects, we can extract date-related information from them and create new columns.

First, we’ll create a column called week using dt.isocalendar() on the date column. This function is designed to be used on a pandas series, and it returns a new dataframe with year, week, and day columns. The information is formatted numerically, so, for example, 3 January 1950 (a Tuesday) would be represented as:

Year  Week  Day
1950     1    2

Because we only want to extract the week number, we’ll add .week to the end. You can learn more about dt.isocalendar() in the dt.isocalendar() pandas documentation.
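As a quick illustration, here is that example on a throwaway series (illustrative only, not part of our dataset):

# Demonstrate dt.isocalendar() with the 3 January 1950 example
example = pd.Series(pd.to_datetime(['1950-01-03']))
example.dt.isocalendar()       # year=1950, week=1, day=2
example.dt.isocalendar().week  # just the week number: 1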

We’ll also add a weekday column using dt.day_name(). This is another pandas function designed to be used on a pandas series. It extracts the text name of the day for any given datetime date. You can learn more about this function in the dt.day_name() pandas documentation.

# Create two new columns
df['week'] = df.date.dt.isocalendar().week
df['weekday'] = df.date.dt.day_name()
df.head()

         date  number_of_strikes center_point_geom  week    weekday
0  2018-01-03                194     POINT(-75 27)     1  Wednesday
1  2018-01-03                 41   POINT(-78.4 29)     1  Wednesday
2  2018-01-03                 33   POINT(-73.9 27)     1  Wednesday
3  2018-01-03                 38   POINT(-73.8 27)     1  Wednesday
4  2018-01-03                 92     POINT(-79 28)     1  Wednesday

Now, we can calculate the mean number of lightning strikes for each weekday of the year. We’ll use the groupby() function to do this.

# Calculate mean count of lightning strikes for each weekday
df[['weekday','number_of_strikes']].groupby(['weekday']).mean()

           number_of_strikes
weekday
Friday             13.349972
Monday             13.152804
Saturday           12.732694
Sunday             12.324717
Thursday           13.240594
Tuesday            13.813599
Wednesday          13.224568

Interesting! It seems that Saturday and Sunday have fewer lightning strikes on average than the other five days. Let’s plot the distribution of strike counts for each day of the week, with each distribution represented as a boxplot.

Let’s begin by defining the order of the days. We’ll begin with Monday and end with Sunday. This is how the days will be ordered in the plot we create.

# Define order of days for the plot
weekday_order = ['Monday','Tuesday', 'Wednesday', 'Thursday','Friday','Saturday','Sunday']

Now, we’ll code the plot. Remember that showfliers is the parameter that controls whether outliers are displayed. If you input True, outliers are included; if you input False, they are left off of the box plot. Keep in mind that we aren’t deleting any outliers from the dataset when we create this chart; we’re only excluding them from the visualization.

# Create boxplots of strike counts for each day of week
g = sns.boxplot(data=df,
                x='weekday',
                y='number_of_strikes',
                order=weekday_order,
                showfliers=False);
g.set_title('Lightning distribution per weekday (2018)');

[Figure: boxplots of daily strike counts for each day of the week, titled “Lightning distribution per weekday (2018)”]

Notice that the medians are nearly the same across all days of the week, but the Saturday and Sunday distributions both sit lower than the rest of the week. We also know that the mean numbers of strikes on Saturday and Sunday were lower than on the other weekdays. Why might this be? Perhaps the aerosol particles emitted by factories and vehicles increase the likelihood of lightning strikes. In the U.S., Saturday and Sunday are days that many people don’t work, so there may be fewer factories operating and fewer cars on the road. This is only speculation, but it’s one possible path for further exploration.
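One quick way to quantify that weekend gap is to group on a weekend flag. A short sketch using the weekday column we already created:

# Compare mean daily strike counts on weekends vs. weekdays
is_weekend = df['weekday'].isin(['Saturday', 'Sunday']).rename('is_weekend')
df.groupby(is_weekend)['number_of_strikes'].mean()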

Monthly lightning strikes 2016–2018 #

Finally, we’ll examine monthly lightning strike data from 2016–2018. We’ll calculate the percentage of total lightning strikes for each year that occurred in a given month. We will then plot this data on a bar graph.

# Read in the 2016–2017 data
df_2 = pd.read_csv('eda_structuring_with_python_dataset2.csv')
df_2.head()

         date  number_of_strikes  center_point_geom
0  2016-01-04                 55  POINT(-83.2 21.1)
1  2016-01-04                 33  POINT(-83.1 21.1)
2  2016-01-05                 46  POINT(-77.5 22.1)
3  2016-01-05                 28  POINT(-76.8 22.3)
4  2016-01-05                 28    POINT(-77 22.1)

The data is in the same format as the 2018 data when we first imported it above. Let’s convert the date column to datetime.

# Convert `date` column to datetime
df_2['date'] = pd.to_datetime(df_2['date'])

Now we can combine the 2016–2017 dataframe with the 2018 dataframe. There are several functions that can do this; we’ll use concat(). Remember that the 2018 data has two added columns, week and weekday. To simplify the combined dataframe, we’ll drop these added columns during the concatenation. Note that the following code doesn’t permanently modify df; the columns are dropped only for this operation. You can learn more about the concat() function in the concat() pandas documentation.

# Create new dataframe combining 2016–2017 data with 2018 data
union_df = pd.concat([df.drop(['weekday','week'],axis=1), df_2], ignore_index=True)
union_df.head()

         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)
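A quick sanity check: the combined dataframe should have exactly as many rows as the two inputs together.

# Verify that no rows were lost or duplicated during concatenation
print(len(df), '+', len(df_2), '=', len(union_df))
assert len(union_df) == len(df) + len(df_2)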

To help us with naming the bars of the bar plot, we’ll create three new columns that isolate the year, month number, and month name.

# Add three new columns
union_df['year'] = union_df.date.dt.year
union_df['month'] = union_df.date.dt.month
union_df['month_txt'] = union_df.date.dt.month_name()
union_df.head()

         date  number_of_strikes center_point_geom  year  month month_txt
0  2018-01-03                194     POINT(-75 27)  2018      1   January
1  2018-01-03                 41   POINT(-78.4 29)  2018      1   January
2  2018-01-03                 33   POINT(-73.9 27)  2018      1   January
3  2018-01-03                 38   POINT(-73.8 27)  2018      1   January
4  2018-01-03                 92     POINT(-79 28)  2018      1   January

Let’s check the overall lightning strike count for each year.

# Calculate total number of strikes per year
union_df[['year','number_of_strikes']].groupby(['year']).sum()

      number_of_strikes
year
2016           41582229
2017           35095195
2018           44600989

Now we’ll calculate the percentage of total lightning strikes for each year that occurred in a given month and assign the results to a new dataframe called lightning_by_month.

# Calculate total lightning strikes for each month of each year
lightning_by_month = union_df.groupby(['month_txt', 'year']).agg(
    number_of_strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
    ).reset_index()

lightning_by_month.head()

  month_txt  year  number_of_strikes
0     April  2016            2636427
1     April  2017            3819075
2     April  2018            1524339
3    August  2016            7250442
4    August  2017            6021702

By the way, we can use the agg() function to calculate the same yearly totals we found before, with 2017 having fewer strikes than the other two years.

# Calculate total lightning strikes for each year
lightning_by_year = union_df.groupby(['year']).agg(
    year_strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
).reset_index()

lightning_by_year.head()

   year  year_strikes
0  2016      41582229
1  2017      35095195
2  2018      44600989

Back to our bar plot… We need to use the monthly totals to calculate percentages, so for each month we’ll need both its monthly strike count and the total strike count for its year. Let’s create another dataframe called percentage_lightning that adds a new column called year_strikes, representing the total number of strikes for each year. We can do this with the merge() function, merging the lightning_by_month dataframe with the lightning_by_year dataframe on the year column. Wherever the year columns of the two dataframes share a value, a row is created in the new dataframe containing all the other columns from both. To learn more about this function, refer to the merge() pandas documentation.

# Combine `lightning_by_month` and `lightning_by_year` dataframes into single dataframe
percentage_lightning = lightning_by_month.merge(lightning_by_year,on='year')
percentage_lightning.head()

  month_txt  year  number_of_strikes  year_strikes
0     April  2016            2636427      41582229
1    August  2016            7250442      41582229
2  December  2016             316450      41582229
3  February  2016             312676      41582229
4   January  2016             313595      41582229

Now we’ll create a new column in our new dataframe that represents the percentage of total lightning strikes for each year that occurred during each month. We do this by dividing the number_of_strikes column by the year_strikes column and multiplying the result by 100.

# Create new `percentage_lightning_per_month` column
percentage_lightning['percentage_lightning_per_month'] = (percentage_lightning.number_of_strikes/
                                                          percentage_lightning.year_strikes * 100.0)
percentage_lightning.head()

  month_txt  year  number_of_strikes  year_strikes  percentage_lightning_per_month
0     April  2016            2636427      41582229                        6.340273
1    August  2016            7250442      41582229                       17.436396
2  December  2016             316450      41582229                        0.761022
3  February  2016             312676      41582229                        0.751946
4   January  2016             313595      41582229                        0.754156
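As a sanity check, each year’s twelve monthly percentages should sum to roughly 100:

# Each year's monthly percentages should total ~100%
percentage_lightning.groupby('year')['percentage_lightning_per_month'].sum()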

Now we can plot the percentages by month in a bar graph!

plt.figure(figsize=(10,6));

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

sns.barplot(
    data = percentage_lightning,
    x = 'month_txt',
    y = 'percentage_lightning_per_month',
    hue = 'year',
    order = month_order );
plt.xlabel("Month");
plt.ylabel("% of lightning strikes");
plt.title("% of lightning strikes each Month (2016-2018)");

[Figure: grouped bar chart titled “% of lightning strikes each month (2016-2018)”, showing each month’s share of its year’s total strikes for 2016, 2017, and 2018]

For all three years, there is a clear seasonal pattern over the course of the year, and one month stands out: more than 1/3 of the lightning in 2018 happened in August.
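We can check that figure directly against the dataframe we just built:

# Look up August 2018's share of that year's strikes
percentage_lightning.loc[
    (percentage_lightning['month_txt'] == 'August') &
    (percentage_lightning['year'] == 2018),
    ['month_txt', 'year', 'percentage_lightning_per_month']
]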

If you have successfully completed the material above, congratulations! You now understand how to structure data in Python and should be able to start using it on your own datasets.