US National Flights (III)

David Magraner
Jan 12, 2021

Tools: Python, Tableau, SQL (you can find the other two entries linked here).

This is the closing blog entry for this project. In it, I will explain the procedures I performed in Python and the reasoning behind them, including the machine learning model I propose. Note that the whole project took place in the span of a week, so further work remains to be done on it: we were given a tight schedule to submit a minimum viable product, and this entry explains that product along with how it could be improved in future research.

The goal of this project was to explain some of the variables that affect the number of flights operating between each pair of cities in the US. Given that the aviation industry in the US is extremely competitive, it is hard to obtain enough publicly available information on individual flights to build a reliable model. Instead, I decided to look at the flight information that was available and think about how it could be exploited. As I explained in a previous entry, the data comes from the Bureau of Transportation Statistics, available here.

The columns I used were origin and destination (airport and state), month, carrier and distance. Each observation describes one individual flight operated: if a route (say, JFK to ATL) is operated 10 times a day, there will be 10 different observations for that day, each with information on the carrier operating the flight. As this project took place during the last month of 2020, and I was interested in having observations for a complete year, I downloaded all 12 monthly datasets for 2019. Each of them contained around 600,000 observations (yes, this is the number of flights operating monthly within the US!), so I ended up with a dataset of over 7.4 million observations.
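As a rough sketch (and only a sketch: the file names and column labels below are assumptions, since they depend on how the BTS export was configured), loading and stacking the twelve monthly files could look like this:

```python
import pandas as pd

# Hypothetical file names: one CSV export per month of 2019 from the BTS site.
# The real names and the exact column labels depend on how the download was set up.
monthly_files = [f"flights_2019_{month:02d}.csv" for month in range(1, 13)]

# Keep only the columns used in the project to save memory on ~7.4M rows.
use_cols = ["MONTH", "ORIGIN", "ORIGIN_STATE_ABR", "DEST",
            "DEST_STATE_ABR", "OP_UNIQUE_CARRIER", "DISTANCE"]

flights = pd.concat(
    (pd.read_csv(path, usecols=use_cols) for path in monthly_files),
    ignore_index=True,
)
print(flights.shape)  # roughly 7.4 million rows
```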

As this is a massive amount of data to work with, I started with the data wrangling straight away. The first thing I did was create a column ‘Route’, which is simply the concatenation of the origin airport code, a hyphen and the destination airport code; in the previous example it would yield ‘JFK-ATL’. Then I grouped all observations by route, month and carrier. This means that if a route is operated 300 times in a month by 2 carriers, I get just 2 observations: one for each carrier. This drastically reduced the total number of observations to just under 120,000, which is significantly faster to work with. As of now, our dataframe looks like this:

Dataframe containing the grouped observations by route, month and carrier.
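A minimal sketch of that wrangling step, reusing the assumed column names from the loading snippet above:

```python
# Build the 'Route' column and collapse the data to one row per route/month/carrier.
flights["Route"] = flights["ORIGIN"] + "-" + flights["DEST"]

group_keys = ["Route", "ORIGIN", "DEST", "ORIGIN_STATE_ABR",
              "DEST_STATE_ABR", "MONTH", "OP_UNIQUE_CARRIER"]
routes = (
    flights
    .groupby(group_keys, as_index=False)
    .agg(Flights=("Route", "size"),        # number of flights in that group
         Distance=("DISTANCE", "first"))   # distance is constant within a route
)
print(len(routes))  # just under 120,000 rows
```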

This was my main dataset, but the information it contains cannot explain much on its own: we can only use distance as a regressor, and that alone should not shed much light on the number of flights operated. I needed a few more regressors that could explain the total number of flights between two cities, and they had to be publicly available. Intuitively, it is quite safe to assume that some economic and demographic variables explain a good part of it. Do large cities/metropolitan areas have a larger number of flights operating to/from them? Do states with greater economic activity account for a larger proportion of flights? Absolutely, they do. So my next step was to get data on city populations, state GDP per capita and each state’s share of national GDP.

Moreover, there are other facts that can be taken into account, particularities of some states that will almost certainly affect the number of flights operating within them. For example, Hawaii’s and Alaska’s only connection to the other US states is by air, so we note that as a dummy variable, as it is a factor we want to control for. The motivation is simple: the most operated route in the US is between two islands in Hawaii (Kahului and Honolulu airports), and this is down to geography; economic and demographic variables will not be able to explain it, so we keep it as a special feature in a dummy variable (which I named ‘Away’).

There may also be a different number of flights in a winter versus a summer month (‘Summer’), or depending on whether one of the states is a coastal state (‘Coast’), as there may be a tourism effect we should account for. I also wanted to check whether the number of flights between two cities in the same state differs significantly from those between different states, so it is interesting to create a dummy for that effect (‘Within’). Finally, another interesting feature is whether an airport is considered a hub by any of the carriers (‘Hub’).
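A hedged sketch of how these dummies could be built on top of the grouped dataframe; the lists of coastal states and hub airports below are shortened placeholders, not the exact ones used in the project:

```python
# Assumed definitions; the real lists used in the project were longer.
SUMMER_MONTHS = {6, 7, 8}
AWAY_STATES = {"HI", "AK"}                                   # Hawaii and Alaska
COASTAL_STATES = {"WA", "OR", "CA", "FL", "NY", "MA"}        # truncated example
HUB_AIRPORTS = {"ATL", "DFW", "ORD", "DEN", "JFK", "EWR"}    # truncated example

routes["Summer"] = routes["MONTH"].isin(SUMMER_MONTHS).astype(int)
routes["Away"] = (routes["ORIGIN_STATE_ABR"].isin(AWAY_STATES)
                  | routes["DEST_STATE_ABR"].isin(AWAY_STATES)).astype(int)
routes["Coast"] = (routes["ORIGIN_STATE_ABR"].isin(COASTAL_STATES)
                   | routes["DEST_STATE_ABR"].isin(COASTAL_STATES)).astype(int)
routes["Within"] = (routes["ORIGIN_STATE_ABR"] == routes["DEST_STATE_ABR"]).astype(int)
routes["Hub"] = (routes["ORIGIN"].isin(HUB_AIRPORTS)
                 | routes["DEST"].isin(HUB_AIRPORTS)).astype(int)
```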

Before creating the machine learning model, I wanted to find some pairs of cities that were not yet connected. The model uses all currently existing routes to ‘train’ itself, and we then use it to predict how many flights a new route could support if it were opened, according to the available features.

In order to find pairs of unconnected [large] cities, we took the 40 busiest airports and checked, with a for-loop over a list of those airports, whether each one was connected to the other 39. The results were quite shocking: there were many pairs of busy airports that were not connected. This was only shocking until I realised that, for example, New York City has 3 different airports serving it: John F. Kennedy International Airport (JFK), Newark Airport (EWR) and LaGuardia Airport (LGA). So another airport, say Hartsfield-Jackson Atlanta (ATL), may be connected to LGA but not to JFK. This made me realise I would have to cluster or group airports for this specific task. In the previous case, for example, I renamed all three airports to NYC: I am interested in the flights arriving at or departing from New York City, regardless of which airport the flight uses. To build the clusters, I grouped all airports that are within a 2-hour drive of another large airport, and created clusters for Los Angeles, San Francisco, New York City, Dallas, Denver, Chicago, Washington and Miami, among other major metropolitan areas.
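The clustering and the connectivity check could be sketched roughly as below; the cluster dictionary is only a partial example, and the snippet uses itertools.combinations instead of the nested for-loop described above, which amounts to the same check:

```python
from itertools import combinations

# Partial example of the airport-to-metro mapping; the full mapping also covered
# Los Angeles, San Francisco, Dallas, Denver, Miami and other metro areas.
CLUSTERS = {
    "JFK": "NYC", "EWR": "NYC", "LGA": "NYC",
    "ORD": "CHI", "MDW": "CHI",
    "IAD": "WAS", "DCA": "WAS", "BWI": "WAS",
}
routes["OriginCity"] = routes["ORIGIN"].replace(CLUSTERS)
routes["DestCity"] = routes["DEST"].replace(CLUSTERS)

# City pairs that already have at least one flight in the data (either direction).
connected = set(zip(routes["OriginCity"], routes["DestCity"]))

# The 40 busiest cities by total departing flights.
busiest = routes.groupby("OriginCity")["Flights"].sum().nlargest(40).index

unconnected = [(a, b) for a, b in combinations(busiest, 2)
               if (a, b) not in connected and (b, a) not in connected]
print(unconnected)
```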

Once the grouping was done, I reran the for-loop, found some pairs of cities that were unconnected, and decided to take 3 pairs with different geographical features to test what the model would predict. The choice was:

  1. Miami, FL-Portland, OR, as the biggest unconnected cities from each coast.
  2. Boston, MA-San Antonio, TX, as two unconnected big cities, one of which is on the coast.
  3. Pittsburgh, PA-Kansas City, MO, as two fairly large cities with no coast that are relatively close to each other (compared with the other two pairs).

Finally, it was time to run the model. I estimated 5 different models, each with a different set of regressors: in some cases I added the populations of the origin and destination cities, in others I multiplied them (so smaller cities would penalise the observation); I also used different combinations of the GDP per capita of the state of each airport on a route, and created another variable counting the number of carriers operating a route (because, well, it seems intuitive that the more carriers operating between two cities, the more flights performed). I also included the squares of the states’ GDP shares, to account for a trend that increases or decreases with the share. All the models were estimated through OLS; this was the first approach I used, and the results were quite satisfactory, so I decided not to try different methods given the time limit we were facing.
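As an illustration of the estimation step (the regressor names below are assumptions based on the variables described so far, and the demographic and economic columns are assumed to have already been merged into the grouped dataframe), one specification could be fitted like this:

```python
import statsmodels.api as sm

# Assumed column names for one of the specifications; the population, GDP and
# share variables are assumed to have been merged into `routes` beforehand.
regressors = ["Carriers", "Distance", "Hub", "Away", "Coast", "Summer", "Within",
              "PopOrigin", "PopDest", "GDPpcOrigin", "GDPpcDest",
              "Share", "Share2", "SumCoa"]

X = sm.add_constant(routes[regressors])
y = routes["Flights"]

model = sm.OLS(y, X).fit()
print(model.summary())   # R², coefficients and p-values for every regressor
```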

In the end, and according to the R² coefficient, the best model was able to explain almost 73% of the variation in the dependent variable ‘Flights’ using the aforementioned regressors. The coefficients of the model are as follows:

Output of the regression using the statsmodels.api library.

We can see that all regressors are statistically significant at the 5% level except ‘SumCoa’, an interaction dummy between ‘Coast’ and ‘Summer’: this means there is no evidence of an additional effect on the number of flights operated in coastal states during summer months beyond the sum of the separate effects of ‘Coast’ and ‘Summer’.

As for the other coefficients, most of them are quite intuitive:

  • As the number of carriers on a route increases by 1, the number of flights increases by an average of 63.43 per month.
  • As the distance increases by 1,000 miles, the number of flights operated between a pair of cities decreases by 38.44 per month.
  • If any of the airports on a route is a hub, is located in Alaska or Hawaii, or is located in a coastal state, if the observation belongs to a summer month, or if both airports are located in the same state, each of these facts has a significant positive effect on the number of flights on the route.
  • Larger city populations and a larger state GDP per capita also impact the total number of flights positively.
  • The surprising coefficients are ‘Share’ and ‘Share2’, which is ‘Share’ squared. They state that the larger the share a state contributes to national GDP, the lower the number of flights. For example, California is the state that contributes the most to GDP, roughly 15% of the national total; it is a financial and industrial hub, and intuition says this should impact flights positively. One possible explanation is that these effects are already largely captured by population or GDP per capita, so if these variables are correlated, ‘Share’ may lose its explanatory power; a multicollinearity check like the one sketched below could confirm this.
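For instance, variance inflation factors could be computed on the assumed columns (a hedged sketch of a possible check, not something done in the original project):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# High VIF values would indicate that 'Share' overlaps heavily with the
# population and GDP-per-capita regressors (column names assumed as above).
suspects = ["Share", "Share2", "PopOrigin", "PopDest", "GDPpcOrigin", "GDPpcDest"]
exog = sm.add_constant(routes[suspects])

for i, name in enumerate(exog.columns):
    print(name, round(variance_inflation_factor(exog.values, i), 2))
```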

Finally, we used the model to predict the number of flights that could be operated on the unconnected pairs of cities chosen before (a sketch of how such a prediction could be set up follows the list below). The results have [apparently] some logic behind them, so the model could serve as a first approximation when considering opening a new route. For the predictions we basically used the demographic and economic variables, the dummy variables, and a value of 1 for ‘Carriers’.

  1. Miami, FL-Portland, OR: the model predicts 24.2 flights per month (28 in summer), roughly a frequency of 6–7 flights per week.
  2. Boston, MA-San Antonio, TX: the model predicts 52.8 flights per month (56.6 in summer), meaning the route would be operated almost twice a day.
  3. Pittsburgh, PA-Kansas City, MO: the model also predicts almost 2 daily flights, at 56.3 flights per month (63.1 in summer).
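The prediction sketch mentioned above could look like this; every feature value below is a placeholder for illustration, not a figure from the project:

```python
import pandas as pd

# Placeholder feature values for a hypothetical new route (e.g. Miami-Portland);
# the populations, GDP figures and shares here are illustrative only.
new_route = pd.DataFrame([{
    "const": 1.0, "Carriers": 1, "Distance": 2700,
    "Hub": 1, "Away": 0, "Coast": 1, "Summer": 0, "Within": 0,
    "PopOrigin": 6_100_000, "PopDest": 2_500_000,
    "GDPpcOrigin": 52_000, "GDPpcDest": 62_000,
    "Share": 0.05, "Share2": 0.0025, "SumCoa": 0,
}])

# Reuse the fitted OLS model from the estimation sketch, keeping column order.
predicted_flights = model.predict(new_route[X.columns])
print(predicted_flights.iloc[0])   # expected monthly flights on the new route
```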

Overall, I am glad I achieved a model with apparently moderate explanatory power, based on the R² value. The model is far from finished, however; it was the MVP I delivered for this project. There is plenty of room for improvement, mainly in checking some statistical assumptions and conditions that should hold for the model to be robust:

  • The variable ‘Carriers’ could depend on ‘Flights’, meaning there could be a two-way relationship: the higher the number of flights on a route, the larger the incentive for an extra carrier to start operating it. I would like to check whether this is the case and how this relationship affects the validity of the model.
  • Limited information. Instead of the number of flights on a route, it would surely be more interesting to know the number of passengers travelling it: we don’t know how full or empty those flights were, and that could make a big difference. There are other variables that will surely have an effect but on which we have no information, such as prices. (Note that it would also be hard to include airlines’ complex pricing systems in a model in a simple way.) This could mean there is an omitted relevant regressor, which would cause the assumption of independence between the regressors and the error terms to fail.
  • Different models could be trained. We opted for OLS, but with the information we have it would be possible to run different models, even non-parametric ones, to see whether a more accurate model better explains the relationship between each of the regressors and the dependent variable; one possible alternative is sketched below.
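For instance, one non-parametric alternative that could be tried (purely illustrative, not part of the submitted project) is a random forest on the same regressors:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Fit a random forest on the same features and compare its out-of-sample R²
# with the OLS results (column names assumed as in the earlier sketches).
X_train, X_test, y_train, y_test = train_test_split(
    routes[regressors], routes["Flights"], test_size=0.2, random_state=42)

forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
print(r2_score(y_test, forest.predict(X_test)))
```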

This is the end of the explanation of the project I submitted as part of the bootcamp at Ironhack. I would definitely love to continue working on it; more precisely, I would use statistical hypotheses and tests to check whether the model is valid, or under which assumptions it would perform well. I am really interested in statistics, so I would enjoy understanding the potential flaws of the model and how they can be solved; but, again, keep in mind the information available and the short period of time we had to complete the whole project. If you have ANY feedback or comments on how the model can be improved or what should be taken into account, feel free to reach me through email or LinkedIn; I will appreciate it enormously!

If you are interested in checking the Python code I wrote for the project, find it in my GitHub repository. Thanks for reading!
