Airport Delays & Clustering
Scenario
You’ve been hired by the FAA as a consultant to analyze the operations of major airports around the country. The FAA wants to cut down on delays nationwide, and the most important part of this task is understanding the characteristics and groupings of airports based on a dataset of departure and operational delays.
- A certain degree of delay is expected in airport operations, however the
FAA is noticing significant delays with certain airports
- When a flight takes off, it's departure delay is recorded in minutes,
as well as operational data relating to this delay
- At the end of the year, this data is averaged out for each airport.
Your datasets have these averaged for a 10 year range between 2004 and 2014
- Over this 10 year range, some delay times have not improved or have worsened.
Prompt:
There are three different datasets related to airport operations. These include a dataset detailing the arrival and departure delays/diversions by airport, a dataset that provides metrics related to arrivals and departures for each airport, and a dataset that details names and characteristics for each airport code.
You will help the FAA:
- Organize and store their data so that they can easily understand it after your consulting work is done
- Mine and refine the data to uncover its basic attributes and characteristics.
- Use your skills with PCA to uncover the core components of operations related to delays.
When you’ve finished your analysis, the FAA would like a report detailing your findings, with recommendations as to which airports and operational characteristics they should target to decrease delays.
Here are some questions to keep in mind:
- What operational factors are most directly correlated to delays?
- Take a look at airports groupings - are there any relationships by region? Size?
Risks & Assumptions:
Importing Data and Creating PostgreSQL Database
First, all three datasets are imported from CSV into the Jupyter Notebook in order to start cleaning, perform analysis, and use unsupervised learning techniques. These datasets will help understand the distribution, characteristics, and components od individual airport operations that lead to delays. After reading in the CSV of the datasets, sqlalchemy was imported to create a postgreSQL database. In the database, tables consisting of the airport flights, airport cancellations, and airport operations were created. Next a seperate table was created of all the tables in one by using SQL and joins.
Data Cleaning/ Munging
For analysis purposes using python and pandas, the full combined table from the PostgreSQL database was converted to a dataframe. Next the columns, alias, facility type, key, Airport, airport, year, Boundary data available, and county were dropped. Alias was dropped because it was just an reiteration of the AP Name, but had the full entire name and NaNs. Airport and airport were dropped for similar reasons as the names were repeated but as shortened acronyms. Like the airport columns, year was dropped because there were two of the same year columns when the tables were combined. The column key was dropped because it was just a reference/ index number. Since all the facilities are airports, facility type was dropped. County data was dropped because it was unimportant at the county level where the airport was located and because some county data was unavailable and were replaced with #NAME?. Finally, the boundary data available was removed because it is irrelevant to the clustering analysis.
Data Analysis / EDA
Before data mining and performing cluster analyses, to get familiar and understand the data better, some exploratory analysis is necessary. The number of entries for flights in total is 799. When looking at the highest number of cancellations from both a departure and arrival standpoint, Orlando can be seen to have the highest number of cancellations among all airports in all the years.
The relationship between departure and arrival cancellations have a positive correlation. This means generally that if an airport has a high amount of arrival cancellations, it is very likely they will also have high departure cancellations and vice versa. Again, Orlando seems to be an outlier as there were a lot more departure and arrival cancellations than any other airports during the years 2007 - 2008.
Arrival cancellations and arrival diversions don’t seem to have any noticeable relationship, except that most arrival diversions fall around the 0 - 500 range. There are definitely less diversions than cancellations because of the nature of flights and the likelyhood that a flight wouldn’t have to be sidetracked because of operational issues.
The same idea applies to departure cancellations and departure diversions, where there is significantly less departure diversions than cancellations. Compared to arrival diversions though, there is a lot less variance in departure diversions than arrival diversions. Instead, arrival diversions have a lot more outliers, where one or two airports would have a lot more arrival diversions than others.
As shown previously in the various scatterplots, both arrival and departure cancellations and diversions are highly correlated. Also, percent on time gate departurs and arrivals are also correlated for airports.This may indicate that airports with delay problems consistently have high delays while airports with on time departures consistently have on time departures.
This scatterplot matrix shows a more detailed view of the highly correlated areas (marked in red in the heatmap) of the cancellations and diversions.
Lastly, the map shows the location of all the airports and gives a visual representation of where the airports are located.
Data Mining
The first clustering method explored is the PCA. Before performing the analysis, the FAA regions were coded to numerical values using the label encoder so the PCA could be run. Then using the standard scaler to account for deviation between different measurements, the columns were transformed for PCA.
Dimensionality Reduction
Since PCA uses dimensionality reduction, the covariance matrix needs to be calculated where the features are