Improving Transportation for Queens' Most Needy

The transportation system in New York City has been a large part of my life growing up. Even though I grew up in Queens, I commuted to high school in downtown Brooklyn and graduate school in Harlem. This commute was a large part of my day and more often than not, the commute would take 60 to 90 minutes, one way.

Despite this, I felt truly blessed and thankful to the Metropolitan Transportation Authority (MTA) for providing a service that allowed me to live in the greatest city in the world. Without it, getting around a city of this scale would seem almost impossible.

However, after using MTA services my entire life, I did notice some things.

From my experience, it felt like service was always worse in areas where poverty was higher. This observation got me thinking about the people who live in these communities. I knew my commute was long, but were theirs longer or more difficult? How did that affect their daily lives?

For my Data Incubator Capstone project, I set out to use data to try and make sense of these observations and questions. I wanted to identify areas where poverty was high and the commute times into the city were also high. My reasoning was that any special attention given to these areas would have a larger impact on the people living there.

My first step was to do some reading on the topic. Numerous studies indicate that access to a reliable transportation system is crucial to escaping poverty. Here are some links to the interesting articles I found.

Commute Times

I decided to focus my study on the borough of Queens to start. I needed an idea of what the commute times are within the borough. I decided to look at the problem on a zipcode basis.

Using Google Maps, I looked up the commute time to Pennsylvania Station in Manhattan from every zipcode in Queens. I used the travel time for a typical Monday morning during the 8 am rush hour, when the MTA system is said to be at its peak.

The choropleth below shows the commute times mapped out by zipcode for Queens. A more detailed interactive version of the map can be found here.

Penn Station was chosen as the reference point for the commute times because of the way the subway system in NYC is designed. All subways (except the G train) travel through Manhattan at some point in their routes. Penn Station represents an economic hub within the city where there are a lot of opportunities.

Darker shades of color indicate longer commute times. Click on the interactive map, here, for a more detailed version. At first glance the map makes sense: the further out from Manhattan you get, the longer your commute becomes.

Areas of High Need

The next step in my study was to identify the impoverished areas within Queens. There are many ways to do this, but I chose to keep it simple by looking at the numbers provided by the Internal Revenue Service. More specifically, I looked at the number of returns filed that were under the federal guidelines for poverty.

The choropleth below represents these results.

Just like in the commute times map, darker shades indicate zipcodes with a higher number of low-income households (income less than $25,000 per year for a family of 4). A more detailed version of this map can be found here.
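As a rough illustration of the counting step, the sketch below sums the returns in the lowest income bracket for each zipcode. The column names (`zipcode`, `agi_stub`, `N1`) and the convention that bracket 1 means income under $25,000 are assumptions about how a zipcode-level tax file might be laid out, and the rows are made-up sample values rather than real figures.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the real input would be a zipcode-level tax return file.
SAMPLE_CSV = """zipcode,agi_stub,N1
11434,1,100
11434,2,80
11419,1,60
11419,3,40
"""

LOW_INCOME_STUB = "1"  # assumed code for the under-$25,000 income bracket


def count_low_income_returns(csv_text):
    """Return {zipcode: number of returns filed in the lowest income bracket}."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["agi_stub"] == LOW_INCOME_STUB:
            totals[row["zipcode"]] += int(row["N1"])
    return dict(totals)


returns_by_zip = count_low_income_returns(SAMPLE_CSV)
print(returns_by_zip)
```

Keeping zipcodes as strings avoids accidentally dropping leading zeros if the same function is reused outside Queens.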

There are a number of zipcodes with high numbers of returns spread all over the borough. To identify the zipcodes that demonstrate high need, we have to combine the commute time data with the tax return data. To do this and develop a "Need Score", I normalized the two data sets and combined them into a single score. Because the number of returns varied over a much wider range than the commute times, I performed a feature scaling normalization on the tax return data set.
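One plausible way to build such a score is sketched below. It assumes min-max feature scaling on both inputs and a simple sum of the two scaled values, which is a guess at the combination step rather than the exact recipe used here. The zipcode labels and numbers are toy values, so the output will not match the scores reported in this post.

```python
def min_max_scale(values):
    """Rescale a {key: number} mapping linearly onto the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All values identical: nothing to spread out, so map everything to 0.
        return {key: 0.0 for key in values}
    return {key: (value - lo) / (hi - lo) for key, value in values.items()}


def need_scores(commute_minutes, poverty_returns):
    """Sum the two scaled features; assumes both mappings share the same keys."""
    commute = min_max_scale(commute_minutes)
    returns = min_max_scale(poverty_returns)
    return {zipcode: commute[zipcode] + returns[zipcode] for zipcode in commute}


# Toy inputs with hypothetical zipcode labels, not the study's data.
commute_minutes = {"A": 30, "B": 60, "C": 90}
poverty_returns = {"A": 1000, "B": 5000, "C": 3000}

scores = need_scores(commute_minutes, poverty_returns)
print(scores)
```

With this construction each feature contributes between 0 and 1, so a combined score can never exceed 2.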

The choropleth below represents the Need Scores for each zipcode in Queens.

As with the previous choropleths, the darker colors indicate a higher Need Score. A more detailed version of this map can be found here. The Need Scores ranged from 0.008 to about 1.743. The zipcodes with the highest Need Scores and their summaries are listed below.

Zipcode   Commute Time   Total Returns   Need Score
11434     78 minutes     13060           1.743
11419     61 minutes     12050           1.581
11354     60 minutes     17410           1.571
11355     59 minutes     33920           1.562
11432     57 minutes     13960           1.543

After doing this analysis and identifying which poor neighborhoods could use better transportation, I looked at how their commute times could be improved. This factor is key because several studies indicate that people with longer (>60 minutes) commutes to work or school often perform worse economically than people whose commutes are shorter.

Looking at Bus Times

Typically, travel within the city is two-pronged: most people take buses to the nearest subway station, where trains can transport them through the city more quickly. I hypothesized that if there were any inefficiencies in the system, they would be in the bus routes.

In this part of the study, I took a look at the massive (~5 GB) historical GTFS database that the engineers at the MTA have on bus times. A link to the database can be found here. Most of the information is contained in comma-delimited text files.

Since this is a preliminary study, I looked at a small subset of the data. Specifically, I looked at the buses that ran in the 11434 zipcode. I wanted to see if any preliminary insights could be drawn from the information.

The 11434 zipcode contains the neighborhoods of Springfield Gardens, South Jamaica, Rochdale, and St. Albans. The area has a population of about 60,000 people and is primarily served by four buses: the Q111, Q113, Q114, and Q6. According to the MTA ridership information, two of those buses are in the top 10 for highest average weekday ridership.

Based on the numbers, the 11434 zipcode has about a 22% poverty rate and the population is extremely dependent on public transportation.

The Buses

Since the Q111 route was the most used route in the neighborhood and ranked 9th in the entire city in usage, I tackled its data first.

The information provided in the dataset is absolutely expansive; it goes all the way back to 2014 and provides historical data on the bus routes in the city. The data I looked at were the arrival and departure times of buses at each stop along their routes.

After an arduous ETL process, I assembled a SQLite database of trips for the Q111 bus in 2016 and their respective stop times. I ended up reducing an overall database of over 4 million rows to about 100k rows of Q111 weekday data points only. That reduced database was further reduced to a table that contained only distinct trips and their trip times.
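The last reduction step might look something like the sketch below, which collapses stop-level rows into one row per trip using Python's built-in `sqlite3` module. The table and column names (`stop_times`, `route_id`, `trip_id`, `stop_sequence`, `arrival_secs`) are a hypothetical schema chosen for illustration, and the rows are invented; the real files would need their own parsing and weekday filtering first.

```python
import sqlite3

# Hypothetical schema: one row per bus arrival at a stop,
# with the arrival time stored as seconds since midnight.
rows = [
    # (route_id, trip_id, stop_sequence, arrival_secs)
    ("Q111", "trip_1", 1, 7 * 3600),
    ("Q111", "trip_1", 2, 7 * 3600 + 1200),
    ("Q111", "trip_1", 3, 7 * 3600 + 2700),
    ("Q111", "trip_2", 1, 8 * 3600),
    ("Q111", "trip_2", 2, 8 * 3600 + 3000),
    ("Q6", "trip_9", 1, 7 * 3600),
    ("Q6", "trip_9", 2, 7 * 3600 + 2000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stop_times ("
    "route_id TEXT, trip_id TEXT, stop_sequence INTEGER, arrival_secs INTEGER)"
)
conn.executemany("INSERT INTO stop_times VALUES (?, ?, ?, ?)", rows)

# Collapse the stop-level rows for one route into one row per distinct trip.
# Trip time is the gap between the first and last recorded arrival.
conn.execute(
    """
    CREATE TABLE q111_trips AS
    SELECT trip_id,
           MIN(arrival_secs) AS start_secs,
           (MAX(arrival_secs) - MIN(arrival_secs)) / 60.0 AS trip_minutes
    FROM stop_times
    WHERE route_id = 'Q111'
    GROUP BY trip_id
    """
)

trips = conn.execute(
    "SELECT trip_id, start_secs, trip_minutes FROM q111_trips ORDER BY trip_id"
).fetchall()
print(trips)
```

Doing the aggregation in SQL keeps the multi-million-row table out of Python memory, which matters more on the full dataset than on this toy input.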

I repeated the process for the other lines in the neighborhood and kept only the northbound trips. In this area of interest, all subway stations are north of the 11434 neighborhood. These routes were also designed so that they end close to subway stations or major transportation hubs.

During this process, I discovered some downsides to the data I was using. It didn't really contain data from ALL the buses that were running throughout the day, but only some select trips. I suspected that the data actually represented the QC/QA trips that the MTA might run to adjust and set bus schedules. Regardless, the data was usable for the purposes of this project.

I decided to take a look at each bus route's schedule to get the listed trip times. The table below shows the listed trip time for each bus route that travels through the 11434 zipcode.

Route   Trip Time
Q111    40 minutes
Q6      37 minutes
Q113    59 minutes
Q114    59 minutes

Using the actual trip times I constructed from the dataset, I calculated the "scores" of the trips. The score is the difference between the actual trip time and the scheduled trip time. The scores are displayed in the plot below. A score of 0 means the bus finished its route exactly on schedule, while a positive score indicates that the bus took longer than expected to finish the route.
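In code, the score calculation comes down to a subtraction followed by an average per scheduled departure time. The sketch below uses the scheduled trip times from the table above; the observed trips are made-up values, and the function names are only illustrative.

```python
from collections import defaultdict
from statistics import mean

# Scheduled trip times in minutes, taken from the table above.
SCHEDULED_MINUTES = {"Q111": 40, "Q6": 37, "Q113": 59, "Q114": 59}


def trip_score(route, actual_minutes):
    """Minutes over (positive) or under (negative) the scheduled trip time."""
    return actual_minutes - SCHEDULED_MINUTES[route]


def average_scores(trips):
    """Average score per (route, departure) from (route, departure, minutes) tuples."""
    grouped = defaultdict(list)
    for route, departure, actual_minutes in trips:
        grouped[(route, departure)].append(trip_score(route, actual_minutes))
    return {key: mean(scores) for key, scores in grouped.items()}


# Made-up observations for illustration only.
observed = [
    ("Q111", "07:00", 44),
    ("Q111", "07:00", 48),
    ("Q113", "07:00", 72),
    ("Q6", "22:00", 33),
]
averages = average_scores(observed)
print(averages)
```

A negative average, like the late-night Q6 entry in this toy input, would mean trips at that departure time tend to finish ahead of schedule.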

Insights

Before we list some observations from the plot, one thing that should be clarified is that the scores on the plot above represent the average scores for each scheduled departure time.

That is why the Q6 has a lot of noise in its graph; there are many distinct departure times, meaning that the bus didn't do a good job of leaving the terminal in a timely manner. Included in the noise are also the "limited stops" buses, which generally have shorter trip times. These buses are most active in the mornings; you can zoom in at around 7:00 am to see which buses they are.

So what can we learn?

Immediately, we can see that the buses do not really do a good job during the times at which they are desperately needed. All four lines run the slowest during the morning weekday rush and overall are very slow during the day. It is only after the nighttime rush that one can board a bus and expect to arrive quicker than expected.

The best bus that runs through this neighborhood is definitely the Q111. It provides the most consistent service, despite already being pushed to its limits in terms of ridership numbers.

This graph points to the Q113 and Q114 as the worst buses in the neighborhood. They take longer than expected to complete a trip throughout the daytime hours. A whole trip can end up needing an extra 13 minutes during the morning rush.

Riding the Q6 seems to be a crapshoot. There's a lot of variability in its travel times during the morning rush, stemming from whether or not you are able to board the limited bus.

What's Next?

I consider this part 1 of the project. Now that I have an understanding of how the travel times in this neighborhood are distributed across the day, the next step is to look at where along each route the pile-ups and slowdowns in traffic occur. The data in the MTA set can't really provide that type of insight. I would have to build a scraper that pulls real-time vehicle positions from the MTA API at regular intervals and build a database from that information.

Eventually I want to see if I can use this database to build a machine learning model that predicts bus trip times depending on the time of day. There are a number of papers that provide a framework for this process.

The point of this project was really to see if I could use the resources out there to gain some insight into a question I had about a real-life phenomenon. It became a really good chance to practice thinking critically about how to use data from a multitude of sources, and a good way to practice some programming and web design skills. Overall I used data from over a dozen different sources and spent over 150 hours scraping, cleaning, and analyzing the data in Python.

Below is a list of the Python modules and tools I used. The code I used to clean the data from the MTA is provided in the GitHub repository for this project. Feel free to look through it.

  • Python: pandas, numpy, datetime, geojson, simplejson
  • SQL: sqlite in Python
  • Data Visualization: leaflet, mapbox, geopandas, bokeh, matplotlib, JavaScript