Read and format project data
# Load flights data from URL
= pd.read_json(
flights "https://raw.githubusercontent.com/byuidatascience/data4missing/master/data-raw/flights_missing/flights_missing.json")
Delayed flights are not something most people look forward to. In the best case scenario you may only wait a few extra minutes for the plane to be cleaned. However, those few minutes can stretch into hours if a mechanical issue is discovered or a storm develops. Arriving hours late may result in you missing a connecting flight, job interview, or your best friend’s wedding.
In 2003 the Bureau of Transportation Statistics (BTS) began collecting data on the causes of delayed flights. The categories they use are Air Carrier, National Aviation System, Weather, Late-Arriving Aircraft, and Security. You can visit the BTS website to read definitions of these categories.
# Load flights data from URL
= pd.read_json(
flights "https://raw.githubusercontent.com/byuidatascience/data4missing/master/data-raw/flights_missing/flights_missing.json")
Which airport has the worst delays? Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
# Group flights data by airport code and calculate metrics
= (flights.groupby('airport_code')
prop_total_delays =('num_of_flights_total', 'sum'),
.agg(total_flights=('num_of_delays_total', 'sum'),
total_delays=('minutes_delayed_total', 'sum'))
total_delay_mins=lambda x: x.total_delays / x.total_flights,
.assign(prop_delays=lambda x: x.total_delay_mins / (x.total_delays * 60))
avg_delay_hours=["prop_delays", "avg_delay_hours"], ascending=False))
.sort_values(by
# Display summary table as markdown
print(prop_total_delays.to_markdown(floatfmt='.2f'))
| airport_code | total_flights | total_delays | total_delay_mins | prop_delays | avg_delay_hours |
|:---------------|----------------:|---------------:|-------------------:|--------------:|------------------:|
| SFO | 1630945.00 | 425604.00 | 26550493.00 | 0.26 | 1.04 |
| ORD | 3597588.00 | 830825.00 | 56356129.00 | 0.23 | 1.13 |
| ATL | 4430047.00 | 902443.00 | 53983926.00 | 0.20 | 1.00 |
| IAD | 851571.00 | 168467.00 | 10283478.00 | 0.20 | 1.02 |
| SAN | 917862.00 | 175132.00 | 8276248.00 | 0.19 | 0.79 |
| DEN | 2513974.00 | 468519.00 | 25173381.00 | 0.19 | 0.90 |
| SLC | 1403384.00 | 205160.00 | 10123371.00 | 0.15 | 0.82 |
I chose the proportion of delayed flights and the average delay time in hours as the metrics to determine the “worst” airport because these factors give us a better understanding of how many flights are affected by delays and how long the delays are on average. A high number of delayed flights and a long average delay time would indicate that an airport experiences many delays, which can be frustrating for passengers.
What is the best month to fly if you want to avoid delays of any length? Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
# Remove rows with missing values in the "month" column
= (flights.drop(flights.index[flights["month"] == "n/a"]))
flight_data
# Group the data by month and calculate the total number of delays for each month.
= flight_data.groupby(
total_delays 'month')['num_of_delays_total'].sum().reset_index()
# A bar chart that shows the total number of delays for each month.
= alt.Chart(total_delays).mark_bar().encode(
chart1 =alt.X('month:N', sort=['January', 'February', 'March', 'April', 'May', 'June',
x'July', 'August', 'September', 'October', 'November', 'December'],
='Month'),
title=alt.Y('num_of_delays_total:Q', title='Total Delays'),
y=[alt.Tooltip('month:N', title='Month'), alt.Tooltip(
tooltip'num_of_delays_total:Q', title='Total Delays')]
='Total Delays by Month').configure_axis(grid=True)
).properties(title
# Display the chart
chart1
# Group the data by month and calculate the total number of minutes delayed for each month.
= flight_data.groupby(
total_delay_minutes 'month')['minutes_delayed_total'].sum().reset_index()
# Create a new data frame that shows the average time per delay for each month.
= pd.DataFrame({'month': total_delay_minutes['month'],
avg_delay_time 'avg_time_per_delay': total_delay_minutes['minutes_delayed_total'] / total_delays['num_of_delays_total']})
# A bar chart that shows the average time per delay for each month.
= alt.Chart(avg_delay_time).mark_bar().encode(
chart2 =alt.X('month:N', sort=['January', 'February', 'March', 'April', 'May', 'June',
x'July', 'August', 'September', 'October', 'November', 'December'],
='Month'),
title=alt.Y('avg_time_per_delay:Q', title='Average Time per Delay (minutes)'),
y=[alt.Tooltip('month:N', title='Month'), alt.Tooltip(
tooltip'avg_time_per_delay:Q', title='Average Time per Delay (minutes)')]
='Average Time per Delay by Month').configure_axis(grid=True)
).properties(title
# Display the chart
chart2
The metric used to calculate the answer is the average time per delay. This metric indicates how long a passenger might expect a delay during each flight in a particular month. The lower the average time per delay, the better the month would be for flying to avoid delays.
# Group the data by month and calculate the total number of flights for each month.
= flight_data.groupby(
total_flights "month")["num_of_flights_total"].sum().reset_index()
# Merge the data frames for total delays and total flights by month and
# calculate the percentage of delays for each month.
= total_delays.merge(total_flights, on="month")
percentage_delays_month "percentage_delays"] = percentage_delays_month["num_of_delays_total"] / \
percentage_delays_month["num_of_flights_total"] * 100
percentage_delays_month[
# A bar chart that shows the percentage of delays for each month.
= alt.Chart(percentage_delays_month).mark_bar().encode(
chart3 =alt.X('month:N', sort=['January', 'February', 'March', 'April', 'May', 'June',
x'July', 'August', 'September', 'October', 'November', 'December'], title='Month'),
=alt.Y('percentage_delays:Q', title='Percentage of Delays'),
y=[alt.Tooltip('month:N', title='Month'), alt.Tooltip(
tooltip'percentage_delays:Q', title='Percentage of Delays')]
='Percentage of Delays by Month').configure_axis(grid=True)
).properties(title
# Display the chart
chart3
The resulting chart shows the percentage of flights with any delay for each month, sorted in ascending order. According to the chart, September is the best month to fly to avoid delays, with roughly 16% of flights experiencing delays. This percentage is followed closely by October and November, with around 17–18% of flights experiencing delays. The worst months to fly in terms of delay frequency are June, July, and December, with about 23–26% of flights experiencing some delay.
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
b. 30% of all delayed flights in the Late-Arriving category are due to weather.
c. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.
# Calculate the average of the column num_of_delays_late_aircraft
= flights.num_of_delays_late_aircraft.replace(
late_aircraft_avg -999, np.nan).mean()
# Replace all the invalid values -999 with the average value only in the column "num_of_delays_late_aircraft"
"num_of_delays_late_aircraft"].replace(
flights[-999, late_aircraft_avg, inplace=True)
# Calculate the total number of flights delayed by weather using the given rules
= flights["num_of_delays_weather"]
severe_weather = 0.3 * flights["num_of_delays_late_aircraft"]
mild_weather_late_aircraft = np.where(flights["month"].isin(
nas_weather_multiplier "April", "May", "June", "July", "August"]), 0.4, 0.65)
[= nas_weather_multiplier * flights["num_of_delays_nas"]
mild_weather_nas = severe_weather + \
num_delays_weather + mild_weather_nas
mild_weather_late_aircraft
# Add the new columns to the flights dataframe
"severe_weather"] = severe_weather
flights["mild_weather_late_aircraft"] = mild_weather_late_aircraft
flights["mild_weather_nas"] = mild_weather_nas
flights["num_delays_weather"] = num_delays_weather
flights[
# Print the first 5 rows of the new dataframe with the required columns
print(flights[["airport_code", "severe_weather", "mild_weather_late_aircraft",
"mild_weather_nas", "num_delays_weather"]].head().to_markdown())
| | airport_code | severe_weather | mild_weather_late_aircraft | mild_weather_nas | num_delays_weather |
|---:|:---------------|-----------------:|-----------------------------:|-------------------:|---------------------:|
| 0 | ATL | 448 | 332.731 | 2988.7 | 3769.43 |
| 1 | DEN | 233 | 278.4 | 607.75 | 1119.15 |
| 2 | IAD | 61 | 317.4 | 581.75 | 960.15 |
| 3 | ORD | 306 | 676.5 | 3519.75 | 4502.25 |
| 4 | SAN | 56 | 204 | 414.7 | 674.7 |
The resulting table calculates the total number of flights delayed by weather (both severe and mild) using three rules and adds new columns to the flights data frame with the required data. It first calculates the average of the column “num_of_delays_late_aircraft,” replaces all the invalid values of -999 with the average value only in the column “num_of_delays_late_aircraft,” and then calculates the total number of flights delayed by weather using the given rules. It then adds new columns to the flights data frame with the required data and prints the first 5 rows of the new data frame with the required columns in a table.
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.
# # Group flights by airport code and calculate total number of flights and total number of weather delays for each airport
# delays_by_weather_by_airport = (flights.groupby('airport_code')
# .agg(total_flights=('num_of_flights_total', 'sum'),
# total_delays_weather=('num_delays_weather', 'sum'))
# # Add a new column with the proportion of delays caused by weather for each airport
# .assign(prop_delays_weather=lambda x: x.total_delays_weather / x.total_flights * 100)
# # Reset the index so that airport code becomes a column instead of an index
# .reset_index()
# )
# Group flights by airport code and calculate total number of flights and total number of weather delays for each airport
= (flights.groupby('airport_code', as_index=False)
delays_by_weather_by_airport =('num_of_flights_total', 'sum'),
.agg(total_flights=('num_delays_weather', 'sum'))
total_delays_weather# Add a new column with the proportion of delays caused by weather for each airport
=lambda x: x.total_delays_weather / x.total_flights * 100)
.assign(prop_delays_weather
)
# Create a bar chart showing the proportion of delays caused by weather for each airport
= alt.Chart(delays_by_weather_by_airport).mark_bar().encode(
chart # Set the x-axis to the airport code column
=alt.X('airport_code', title='Airport Code'),
x# Set the y-axis to the proportion of delays caused by weather column
=alt.Y('prop_delays_weather', title='Proportion of Delays by Weather(%)'),
y# Set the color of the bars to the airport code
=alt.Color('airport_code', legend=None),
color# Add tooltip showing the airport code and the percentage of delays
=[
tooltip'airport_code', title='Airport Code'),
alt.Tooltip('prop_delays_weather',
alt.Tooltip(='Proportion of Delays by Weather(%)', format='.2f')
title
]={'text': "Proportion of Flight Delays by Weather", 'subtitle': "A Comparison of Delayed Flights at Different Airports"})
).properties(title# Hide the legend
= chart.configure_legend(orient='none')
chart
# Display the chart
chart
The plot shows the proportion of all flights delayed by weather at each airport. The proportion of delayed flights caused by weather is relatively low for most airports, with less than 10% of all delayed flights caused by weather conditions. However, there are some airports with a higher proportion of delays caused by weather, such as “SFO” (San Francisco International Airport) and “ORD” (Chicago O’Hare International Airport).
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.
# find values equal to -999 or "1500+" or "" or "n/a"
= flights.isin([-999, "1500+", "", "n/a"])
missing_values
# Display the missing values in a markdown table
'Missing Values': missing_values.sum(),
display(pd.DataFrame({'Percentage of Total': (missing_values.sum()/len(flights)*100).round(0).astype(str) + '%'}))
Missing Values | Percentage of Total | |
---|---|---|
airport_code | 0 | 0.0% |
airport_name | 56 | 6.0% |
month | 27 | 3.0% |
year | 0 | 0.0% |
num_of_flights_total | 0 | 0.0% |
num_of_delays_carrier | 73 | 8.0% |
num_of_delays_late_aircraft | 0 | 0.0% |
num_of_delays_nas | 0 | 0.0% |
num_of_delays_security | 0 | 0.0% |
num_of_delays_weather | 0 | 0.0% |
num_of_delays_total | 0 | 0.0% |
minutes_delayed_carrier | 0 | 0.0% |
minutes_delayed_late_aircraft | 0 | 0.0% |
minutes_delayed_nas | 17 | 2.0% |
minutes_delayed_security | 0 | 0.0% |
minutes_delayed_weather | 0 | 0.0% |
minutes_delayed_total | 0 | 0.0% |
severe_weather | 0 | 0.0% |
mild_weather_late_aircraft | 0 | 0.0% |
mild_weather_nas | 0 | 0.0% |
num_delays_weather | 0 | 0.0% |
We are checking to see if there are any missing values in the data frame. If so, we will count them and compare the values to the overall amount in each column. Once we finish counting, we will replace the missing values with ‘NaN.’
# Replace all missing values with "NaN"
-999, "1500+", "", "n/a"], np.nan, inplace=True)
flights.replace([
# find values equal to -999 or "1500+" or "" or "n/a"
= flights.isin([-999, "1500+", "", "n/a"])
check_missing
# # count the number of values that match the mask
# print(np.count_nonzero(check_missing))
= pd.DataFrame({'Missing Values': check_missing.sum()})
flights_data_missing
# Display the missing values in a markdown table
= pd.DataFrame({'Filled Values': (len(flights) - check_missing.sum()),
flights_data_filled 'Percentage of Total': ((len(flights) - check_missing.sum())/len(flights)*100).round(0).astype(str) + '%'})
# Display both tables side by side
=1)) display(pd.concat([flights_data_missing, flights_data_filled], axis
Missing Values | Filled Values | Percentage of Total | |
---|---|---|---|
airport_code | 0 | 924 | 100.0% |
airport_name | 0 | 924 | 100.0% |
month | 0 | 924 | 100.0% |
year | 0 | 924 | 100.0% |
num_of_flights_total | 0 | 924 | 100.0% |
num_of_delays_carrier | 0 | 924 | 100.0% |
num_of_delays_late_aircraft | 0 | 924 | 100.0% |
num_of_delays_nas | 0 | 924 | 100.0% |
num_of_delays_security | 0 | 924 | 100.0% |
num_of_delays_weather | 0 | 924 | 100.0% |
num_of_delays_total | 0 | 924 | 100.0% |
minutes_delayed_carrier | 0 | 924 | 100.0% |
minutes_delayed_late_aircraft | 0 | 924 | 100.0% |
minutes_delayed_nas | 0 | 924 | 100.0% |
minutes_delayed_security | 0 | 924 | 100.0% |
minutes_delayed_weather | 0 | 924 | 100.0% |
minutes_delayed_total | 0 | 924 | 100.0% |
severe_weather | 0 | 924 | 100.0% |
mild_weather_late_aircraft | 0 | 924 | 100.0% |
mild_weather_nas | 0 | 924 | 100.0% |
num_delays_weather | 0 | 924 | 100.0% |
We need to double-check to ensure that there are not any missing values in the data set. All values considered missing should now show as “NaN.” The total percentage will be 100% if there are no missing values.
# Convert data to JSON
= flights.to_json()
flights_json
# Get one record from the JSON file
= pd.read_json(flights_json, orient='records').head(1)
one_record
# Display one record
display(one_record)
airport_code | airport_name | month | year | num_of_flights_total | num_of_delays_carrier | num_of_delays_late_aircraft | num_of_delays_nas | num_of_delays_security | num_of_delays_weather | ... | minutes_delayed_carrier | minutes_delayed_late_aircraft | minutes_delayed_nas | minutes_delayed_security | minutes_delayed_weather | minutes_delayed_total | severe_weather | mild_weather_late_aircraft | mild_weather_nas | num_delays_weather | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ATL | Atlanta, GA: Hartsfield-Jackson Atlanta Intern... | January | 2005.0 | 35048 | NaN | 1109.104072 | 4598 | 10 | 448 | ... | 116423.0 | 104415 | 207467.0 | 297 | 36931 | 465533 | 448 | 332.731222 | 2988.7 | 3769.431222 |
1 rows × 21 columns
The JSON file output shows that “NaN” replaces the missing value in the “minutes_delayed_carrier” column.
# Load flights data from URL
= pd.read_json(
flights "https://raw.githubusercontent.com/byuidatascience/data4missing/master/data-raw/flights_missing/flights_missing.json")
# Question #1
# Group flights data by airport code and calculate metrics
= (flights.groupby('airport_code')
prop_total_delays =('num_of_flights_total', 'sum'),
.agg(total_flights=('num_of_delays_total', 'sum'),
total_delays=('minutes_delayed_total', 'sum'))
total_delay_mins=lambda x: x.total_delays / x.total_flights,
.assign(prop_delays=lambda x: x.total_delay_mins / (x.total_delays * 60))
avg_delay_hours=["prop_delays", "avg_delay_hours"], ascending=False))
.sort_values(by
# Question #2
# Remove rows with missing values in the "month" column
= (flights.drop(flights.index[flights["month"] == "n/a"]))
flight_data
# Group the data by month and calculate the total number of delays for each month.
= flight_data.groupby(
total_delays 'month')['num_of_delays_total'].sum().reset_index()
# Group the data by month and calculate the total number of minutes delayed for each month.
= flight_data.groupby(
total_delay_minutes 'month')['minutes_delayed_total'].sum().reset_index()
# Create a new data frame that shows the average time per delay for each month.
= pd.DataFrame({'month': total_delay_minutes['month'],
avg_delay_time 'avg_time_per_delay': total_delay_minutes['minutes_delayed_total'] / total_delays['num_of_delays_total']})
# Group the data by month and calculate the total number of flights for each month.
= flight_data.groupby(
total_flights "month")["num_of_flights_total"].sum().reset_index()
# Merge the data frames for total delays and total flights by month and
# calculate the percentage of delays for each month.
= total_delays.merge(total_flights, on="month")
percentage_delays_month "percentage_delays"] = percentage_delays_month["num_of_delays_total"] / \
percentage_delays_month["num_of_flights_total"] * 100
percentage_delays_month[
# A bar chart that shows the percentage of delays for each month.
= alt.Chart(percentage_delays_month).mark_bar().encode(
chart3 =alt.X('month:N', sort=['January', 'February', 'March', 'April', 'May', 'June',
x'July', 'August', 'September', 'October', 'November', 'December'], title='Month'),
=alt.Y('percentage_delays:Q', title='Percentage of Delays'),
y=[alt.Tooltip('month:N', title='Month'), alt.Tooltip(
tooltip'percentage_delays:Q', title='Percentage of Delays')]
='Percentage of Delays by Month').configure_axis(grid=True)
).properties(title
# Display the chart
# chart3
# Sort the chart by percentage of delays in ascending order
= chart.transform_window(
best_month_chart ='rank(percentage_delays)',
rank=[alt.SortField('percentage_delays', order='ascending')]
sort
).transform_filter(== 1)
(alt.datum.rank ='Best Month to Fly to Avoid Delays')
).properties(title
# Display the chart
# best_month_chart
# Question #3
# Calculate the average of the column num_of_delays_late_aircraft
= flights.num_of_delays_late_aircraft.replace(
late_aircraft_avg -999, np.nan).mean()
# Replace all the invalid values -999 with the average value only in the column "num_of_delays_late_aircraft"
"num_of_delays_late_aircraft"].replace(
flights[-999, late_aircraft_avg, inplace=True)
# Calculate the total number of flights delayed by weather using the given rules
= flights["num_of_delays_weather"]
severe_weather = 0.3 * flights["num_of_delays_late_aircraft"]
mild_weather_late_aircraft = np.where(flights["month"].isin(
nas_weather_multiplier "April", "May", "June", "July", "August"]), 0.4, 0.65)
[= nas_weather_multiplier * flights["num_of_delays_nas"]
mild_weather_nas = severe_weather + \
num_delays_weather + mild_weather_nas
mild_weather_late_aircraft
# Add the new columns to the flights dataframe
"severe_weather"] = severe_weather
flights["mild_weather_late_aircraft"] = mild_weather_late_aircraft
flights["mild_weather_nas"] = mild_weather_nas
flights["num_delays_weather"] = num_delays_weather
flights[
# Question #4
# Group flights by airport code and calculate total number of flights and total number of weather delays for each airport
= (flights.groupby('airport_code')
delays_by_weather_by_airport =('num_of_flights_total', 'sum'),
.agg(total_flights=('num_delays_weather', 'sum'))
total_delays_weather# Add a new column with the proportion of delays caused by weather for each airport
=lambda x: x.total_delays_weather / x.total_flights * 100)
.assign(prop_delays_weather# Reset the index so that airport code becomes a column instead of an index
.reset_index()
)
# Question #5
# find values equal to -999 or "1500+" or "" or "n/a"
= flights.isin([-999, "1500+", "", "n/a"])
missing_values
# Replace all missing values with "NaN"
-999, "1500+", "", "n/a"], np.nan, inplace=True)
flights.replace([
# find values equal to -999 or "1500+" or "" or "n/a"
= flights.isin([-999, "1500+", "", "n/a"])
check_missing
# count the number of values that match the mask
# print(np.count_nonzero(check_missing))
= pd.DataFrame({'Missing Values': check_missing.sum()})
flights_data_missing
# Display the missing values in a markdown table
= pd.DataFrame({'Filled Values': (len(flights) - check_missing.sum()),
flights_data_filled 'Percentage of Total': ((len(flights) - check_missing.sum())/len(flights)*100)})
'flights_missing_cleaned.json', orient='records')
flights.to_json(
with open('flights_missing_cleaned.json') as f:
= json.load(f)
data
# print(data[2])
# Convert data to JSON
= flights.to_json()
flights_json
# Get one record from the JSON file
= pd.read_json(flights_json, orient='records').head(1) one_record