Background
One of my hobbies is following the commercial aviation scene. My job requires me to fly a lot, which only sparks my curiosity further about how the commercial aviation business works. What better way, then, than to use this hobby as motivation to test new features in SQL Server 2012? This hybrid interest, which I call Aviation BI, is the driver behind a series of blog posts in which I will take you through the typical lifecycle of a BI project:
- Defining requirements
- Solution design and technology selection
- Cleansing and importing data into the data warehouse
- Querying the data warehouse
- Using a multidimensional / tabular model to obtain slice / dice insights
- Visualizing those insights in eye-catching fashion
In the first blog post of this series, I will take you through what I (as an end user) see as possible requirements. We will also review the datasets available for consumption.
Requirements
The requirements fall into a few categories, depending on whose perspective we take. A typical retail consumer would want to check some statistics:
- Given a particular sector, which airlines serve that sector?
- How is the on-time performance of each airline on the sector?
- Which airline has been providing the cheapest tickets on the sector?
- What is the average arrival delay for a particular airport? Is there a pattern based on time of day?
- What is the average load factor (how full is the aircraft) on the sector?
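To make the on-time performance question concrete, here is a minimal sketch of the kind of query involved. It uses Python's built-in sqlite3 module purely as a stand-in for the warehouse; the `flights` table, its column names and the sample rows are all hypothetical and do not reflect the real BTS schema. It treats a flight as on time when it arrives less than 15 minutes late.

```python
import sqlite3

# Hypothetical, hand-made rows:
# (carrier, origin, dest, scheduled departure hour, arrival delay in minutes)
SAMPLE_FLIGHTS = [
    ("AA", "SEA", "ORD", 8, 5),
    ("AA", "SEA", "ORD", 17, 40),
    ("UA", "SEA", "ORD", 8, -3),
    ("UA", "SEA", "ORD", 17, 10),
    ("UA", "SEA", "ORD", 17, 20),
    ("DL", "SEA", "ATL", 9, 0),
]


def on_time_by_carrier(conn, origin, dest, threshold_minutes=15):
    """Return (carrier, flights, on_time_pct, avg_arr_delay) rows for one sector."""
    sql = """
        SELECT carrier,
               COUNT(*) AS flights,
               100.0 * SUM(CASE WHEN arr_delay < ? THEN 1 ELSE 0 END) / COUNT(*)
                   AS on_time_pct,
               AVG(arr_delay) AS avg_arr_delay
        FROM flights
        WHERE origin = ? AND dest = ?
        GROUP BY carrier
        ORDER BY carrier
    """
    return conn.execute(sql, (threshold_minutes, origin, dest)).fetchall()


# Build a throwaway in-memory table (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights "
    "(carrier TEXT, origin TEXT, dest TEXT, dep_hour INTEGER, arr_delay REAL)"
)
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?, ?)", SAMPLE_FLIGHTS)

for row in on_time_by_carrier(conn, "SEA", "ORD"):
    print(row)
```

The same shape of query, grouped by `dep_hour` instead of `carrier`, would be one way to look for a time-of-day pattern in arrival delays.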
As an airline executive, I might want to look at:
- Which are the routes with the highest traffic?
- Which aircraft am I routinely having service difficulties with?
As an airport administrator, I might want to look at:
- For my airport, is the taxi-out time (the time taken to taxi from gate to runway) very high compared to other airports?
- What are the most common locations from which international passengers arrive at my airport?
- Which is the busiest time for arrivals and departures for my airport?
As an aviation analyst I may have the following queries:
- Which type of aircraft is used to make trans-Atlantic journeys?
- Which operators are still using DC-10 aircraft? (Most major passenger airlines retired their DC-10s during the 2000s, while cargo operators kept flying them for longer.)
- Did the Antonov 225 make any appearances in the US recently?
- What are the longest non-stop flights to the USA?
And so on. These perspectives are imaginary, but I believe they are typical of what people in those roles would ask to see.
Data sets available
There are some amazing datasets available in the public domain. For our purposes, we will be using the following:
Data set | Data points | Row count |
BTS On-time Performance | Origin, destination airports; Departure delay; Arrival delay; Taxi times; Flight distance; Flight time; Data available: 1987 onwards; Granularity: flight | 146,090,545 |
Airline Origin and Destination Survey (DB1B) Market | Number of coupons for this market [1]; Origin, destination airports; Ticketing and operating carrier; Market fare; Number of passengers; Data available: 1993 onwards; Granularity: individual ‘market’ | 346,513,372 |
Airline Origin and Destination Survey (DB1B) Ticket | Number of coupons in itinerary; Origin airport; Round trip indicator; Miles flown; Ticket fare per person; Reporting carrier; Data available: 1993 onwards; Granularity: individual ticket | 190,783,532 |
Airline Origin and Destination Survey (DB1B) Coupon | Origin, destination airports; Number of passengers included; Fare class; Data available: 1993 onwards; Granularity: individual coupon | 584,874,480 |
T-100 Segment | Unique carrier code; Origin, destination airport; Aircraft type; Departures scheduled and performed; Load factor; Seats vs. passengers; Freight and mail; Distance; Ramp to ramp time; Air time; Data available: 1990 onwards; Granularity: aggregated flights (monthly) | 7,347,563 |
T-100 Market [2] | Unique carrier code; Origin, destination airport; Service class; Domestic / International indicator; Passengers; Freight and mail; Distance; Data available: 1990 onwards; Granularity: | 5,994,306 |
FAA Service Difficulty Reporting (SDR) | Data available: 1994 onwards; Granularity: each service difficulty report | |
[1] A ‘market’ in DB1B terminology is the part of an itinerary between two trip breaks, where a trip break is a stop made for a reason other than changing planes.
[2] The difference between ‘market’ and ‘segment’ in T-100 is explained at this link. The T-100 reporting guide is also useful for understanding this dataset.
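Since T-100 Segment reports both seats and passengers per month, the load factor question from the requirements can be sketched as a simple ratio. The snippet below is only an illustration: the tuple layout and the sample numbers are hypothetical, and it assumes load factor on a single sector can be approximated as total passengers divided by total seats.

```python
from collections import defaultdict

# Hypothetical monthly segment rows: (carrier, origin, dest, seats, passengers)
SAMPLE_SEGMENTS = [
    ("AA", "SEA", "ORD", 4000, 3200),
    ("AA", "SEA", "ORD", 1000, 500),
    ("UA", "SEA", "ORD", 5000, 4500),
    ("UA", "SEA", "DEN", 0, 0),  # e.g. a freight-only row with no seats offered
]


def load_factor_by_sector(rows):
    """Return {(carrier, origin, dest): passengers / seats} across all rows.

    Seats and passengers are summed first, so busy months weigh more than
    quiet ones. Sectors with zero seats are skipped to avoid dividing by zero.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [seats, passengers]
    for carrier, origin, dest, seats, passengers in rows:
        bucket = totals[(carrier, origin, dest)]
        bucket[0] += seats
        bucket[1] += passengers
    return {
        key: passengers / seats
        for key, (seats, passengers) in totals.items()
        if seats > 0
    }


for key, factor in sorted(load_factor_by_sector(SAMPLE_SEGMENTS).items()):
    print(key, round(factor, 2))
```

Summing before dividing is a deliberate choice here: averaging the two monthly AA ratios (0.8 and 0.5) would give 0.65, whereas the seat-weighted figure is 0.74.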
Next Steps
Over a series of blog posts, I will take you through how we:
- Cleanse and import this data into a data warehouse.
- Build Analysis Services databases on top of the relational data. We will explore both multidimensional and tabular mode databases for this step and see where each fits in.
- Consume this data through Reporting Services, Excel and Power View clients.
- Incrementally load ‘new’ data into the warehouse, incrementally process the Analysis Services databases, and confirm that the reports pick up the new data.
All of this, towards answering the ‘customer’ requirements which we laid out earlier!
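As a rough illustration of what incremental loading means, here is a minimal watermark-based sketch. It is not the approach the later posts will necessarily take. It uses Python's built-in sqlite3 module as a stand-in for the warehouse, and the `fact_flights` table, its columns and the sample batches are all hypothetical.

```python
import sqlite3


def incremental_load(conn, source_rows):
    """Insert only rows newer than the latest flight_date already loaded.

    Returns the number of rows inserted. Dates are ISO strings (YYYY-MM-DD),
    so plain string comparison orders them correctly.
    """
    (watermark,) = conn.execute(
        "SELECT MAX(flight_date) FROM fact_flights"
    ).fetchone()
    new_rows = [
        row for row in source_rows
        if watermark is None or row[0] > watermark
    ]
    conn.executemany("INSERT INTO fact_flights VALUES (?, ?, ?)", new_rows)
    conn.commit()
    return len(new_rows)


# Throwaway in-memory warehouse table (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_flights (flight_date TEXT, carrier TEXT, arr_delay REAL)"
)

# The second batch repeats the first and adds one genuinely new row.
first_batch = [("2012-01-01", "AA", 5.0), ("2012-01-02", "UA", 12.0)]
second_batch = first_batch + [("2012-01-03", "DL", -2.0)]

print(incremental_load(conn, first_batch))   # initial load
print(incremental_load(conn, second_batch))  # only the new row is added
```

One limitation of this simple watermark is that a late-arriving row dated on or before the latest loaded date would be skipped, so a real load would need a more careful change-detection rule.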
References
For those interested in reading further about airline statistics, here are some references:
- A similar effort using Cloud Numerics
- A thesis which uses DB1B to analyze fare structures
- Airline Industry Datasets - Jason Blevins
- Overview of O&D and T-100 2008 Data Sources (Domestic) ACI-NA Air
- Low Cost Carriers and the Evolution of the U.S. Airline Industry
- Patterns and Effects of Entry in U.S. Airline markets
So if you have some comments on the scope of this ‘project’, I would LOVE to hear from you! Do leave a comment below if you liked this idea and would like to see more.