2011 Airport Delays

Analysis of Outgoing Houston Flight Data Using Excel & TableauDataDashboards
Overview & Objectives
The goal of this project was to get a better understanding of the differences in delays between departing flights from George Bush Intercontinental/Houston Airport (IAH) and William P. Hobby Airport (HOU). The data used for this project was posted in Kaggle by user Arun Jangir and consists of temporal data of flights, flight times, delays, and cancellations, amongst other variables, for flights coming out of IAH and HOU in 2011. The questions that inspired the project and that I wished to find answers to are as follows:
  1. Which airport has the higher delay times on average?
  2. Which airport has the greatest number of cancellations?
  3. Are there days in the week or months during the year in which one would be benefited by choosing to use one airport over another?
Method & Skills
The dataset has over 200,000 data points, with the original file being in the csv format. To analyze this data, Excel was used to perform an initial exploration, cleaning, and joining of data, whilst Tableau was used to design and create visuals that better explained the relationships within the dataset. The following skills were valuable in the data exploration and analysis & visualization phases, respectively:
  • Excel: Conditional Formatting, Filtering, VLOOKUP, Boolean Functions, Pivot Tables, Bar Charts, Pie Charts, and Line Charts.
  • Tableau: Maps, Tables, Bar Charts, Line Charts, Double Axes, and Dynamic Filtering.
Key Insights
It is important to note that the data covers only flights from 2011 as well as the fact that IAH flights out number HOU flights drastically. Regarding the questions that inspired this project, the answers are as follows:
  1. HOU airport tends to have statistically significant higher delay times on average over IAH airport (p << 0.05).
  2. Proportionally, HOU airport sees more outlier flights (very long delay times) and cancellations and for what some could say to be less justifiable reasons (carrier issues vs. weather) than IAH airport. However, IAH airport sees more deviations.
  3. As a general rule of thumb, if one is not limited by choice of carrier or location of travel (IAH might serve some airports that HOU might not serve and vice versa), IAH seems to have lower delay times for most hours of the day, days of the week, and months of the year than HOU.
Project Summary

We begin this project by downloading the data and creating a few new variables. The most important variable for this analysis will be what is named TotalDelay on the snippet below. It is the sum of the arrival delay and departure delay of any given flight. We see some calculations in the top right regarding the quartiles of this variable. Following general standard, we will define outlier delays as delays that surpass the third quartile by 1.5 times the interquartile range, or 62.5 minutes in this case. Seen as well is a table indicating time ranges and hours. This table was used alongside VLOOKUP to assign flight departures to hours of the day in order to analyze in the future. Also depicted are destinations and their latitude and longitude, as well as the origin airports (IAH and HOU) and their latitude and longitude. This information will be used in the construction of a map to see where most of the outgoing flights are headed.

Continuing with Excel, we perform a few more analyses to take a quick look at how the data looks like. Below we have a pivot table that was created to make a histogram of the total delay times of IAH and HOU flights that were under the outlier threshold. It is hard to see the difference between the shapes of the two histograms, so we also run a 2 sample t-test on the means of the grouped data to ensure that they are indeed statistically different.  

Moving on, we take a look at the carriers making up the totality of each of the airport's outgoing flights. From the bar charts below and the pivot tables created to produce them, we see that HOU is mainly using WN, whilst IAH mainly uses a combination of XE, CO, and OO. The airports don't necessarily share all the carriers. I believe diversifying carriers is useful for any airport, especially if delays or cancellations, as we will see in the future, are being caused predominantly carriers.

Let's move on to Tableau now, where we create our first dashboard visualizing the count and trajectories of flights from HOU and IAH. We immediately see that IAH offers flights to more locations than HOU, and also that there are around 3 times as many IAH flights than HOU flights in the dataset.

Since the dashboard is dynamic, one can choose to view only IAH flights, for example, to get a clearer view of the locations they reach.

Moving on, we take a look at some important information between IAH and HOU groups within the data. Some of this information was also captured in the Excel exploration. From the table located in the above portion of the dashboard, we see that for outlier flights (flights with more than 62.5 min delay), IAH actually averages a longer delay than HOU. IAH also has a higher percentage of diverted flights, although somewhat marginal. However, HOU has a higher average delay time than IAH for non-outlier flights whilst also having a higher percentage of outlier flights and cancelled flights. Looking at the bottom right bar graph, was also see that the cancellations that do occur in HOU are more likely to be because of the carriers, whilst the cancellations that occur in IAH are most likely to be caused by weather problems. Given the limits of a single year of data, this point favors IAH, as weather problems likely serve as better justification over carrier issues. Note that there are slight differences in the calculation of some of the values seen here vs. the Excel sheet. These might be due to some rows of data not being imported correctly from Excel to Tableau, but the differences are quite small and shouldn't impact the overall analysis.

Code

Now we turn to look at how flight delays change over the course of a single day. We exclude the hours of late night and early morning, as  the number of flights are quite low. Beginning with the left graph, we see the number of non-outlier flights on the bars and the average delay in min for non-outlier flights on the lines. In general, we see that there are less flights in the early and late hours of the day, but overall the same number of flights for each hour within normal operating hours. On the other hand, we see a definite trend in terms of delay times, with HOU averaging greater delay times quicker over IAH. Regarding the graph on the right, we see the percentage of outlier flights over each hour of the day for both HOU and IAH. Although there are some hours where IAH has significantly less outlier flights than HOU, both airports seem to exhibit the same exponential growth as the night approaches. Overall, we see that HOU and IAH have similar behaviors in terms of number of flights and percentage of outlier flights, yet IAH tends to have smaller average delay times.

Code

Zooming out the temporal scope slightly, we will take a look at average non-outlier delay times aggregated by day of the week, month, and date. There is much information on this dashboard, but the key takeaway is that for most days of the week and months, HOU seems to display larger average delay times than IAH. The bars indicate a 95% confidence interval for the mean. The time series does not seem to show any strong seasonality. We would require more years of data to see if any seasonality would present itself.

Code

One nice feature of this dashboard is that it is dynamic. For example, we can choose to only look at the month of February, in which case we could see how delay times aggregated over the days of the week in that month, and how they changed over each day of the month.

Code

We can also do the same for the day of the week. This type of aggregation is more strange, as we would be looking at the average amongst all Wednesdays within each month, and also all Wednesdays of the year. This functionality still provides for some interesting visuals that give a better idea of how IAH outperforms HOU during most of the year. Overall, the analysis as is stands has shown us that IAH has lower delays times than HOU at most points in the year and during most hours of the day. Unless you would like to fly somewhere that HOU can take you to and IAH cannot, I would recommend, from this dataset alone, to fly via IAH. Of course, before flying to conclusions, it would be beneficial to look at more years of data and compare pricing, as there is more to look at than just delays when deciding what flight to catch.

Code

Overall, answers to the posed questions were sufficiently provided, although more work could be done to deepen this analysis. This dataset only contains data for the year 2011, which introduces some bias into the analysis. It might be the case that HOU had an uncommonly bad year in terms of delays, or that IAH had an uncommonly good one. Also, the data is limited in just showing us airport and flight information. It would thus be beneficial to gather data from other years as well as more useful information like weather/climate data to add to the analysis. With information like this, it would even be possible to potentially predict delay times going into the future.