MTA Turnstile Data Exploration

Project Overview

The main goal of this project was to become comfortable with cleaning data and to provide a light analysis with some business value. Busking is the act of performing in public, often for donations, which lets musicians make extra money and promote themselves at the same time. Busking also plays a large role in the atmosphere of many major cities across the world. The fictitious company “BSKR” is calling upon data scientists to maximize the exposure of its street performers by placing all clients across the top 50 subway stations (a cutoff chosen arbitrarily) so that each gets the highest possible exposure while performers don’t compete with each other.

The Data

The MTA’s turnstile data counts each rider who enters or exits a turnstile, reported in four-hour intervals. We used one week’s worth of data in our analysis (9/10/2016 to 9/16/2016). We summed the entries and exits to get an “impressions” value: the total number of people who would hypothetically see a performer and could therefore be a potential donor. Each turnstile has its own counter, so simply subtracting each cell from the previous one to get the net number of entries and exits produced an incorrect value at every row where a new turnstile began. To remove these values, and in the interest of time, we looked at the highest daily value for the busiest station and set the cutoff slightly above it. This removed all of the erroneous impression values.
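The differencing and cutoff step can be sketched roughly as follows. This is a minimal illustration rather than our original code: the function name `impressions_with_cutoff`, the row layout, and the sample counter readings are all made up for the example, and discarding negative differences is an extra assumption that the text above does not mention.

```python
def impressions_with_cutoff(rows, cutoff):
    """Difference consecutive counter readings and keep only plausible values.

    rows: list of (turnstile_id, entries_counter, exits_counter) tuples in
    file order (hypothetical layout). The difference is taken between
    neighbouring rows regardless of turnstile, as described in the text, so
    the row where a new turnstile begins yields a nonsense value that the
    cutoff is meant to catch.
    """
    impressions = []
    for prev, curr in zip(rows, rows[1:]):
        delta = (curr[1] - prev[1]) + (curr[2] - prev[2])
        # Assumption: negative differences are also treated as erroneous.
        if 0 <= delta <= cutoff:
            impressions.append(delta)
    return impressions


# Made-up readings for two turnstiles with unrelated counters.
rows = [
    ("A-01", 1000, 500),
    ("A-01", 1120, 560),
    ("A-01", 1300, 650),
    ("B-07", 9000000, 7000000),  # new turnstile: the difference here is meaningless
    ("B-07", 9000050, 7000030),
]
kept = impressions_with_cutoff(rows, cutoff=10000)
```

A cleaner alternative would be to difference within each turnstile separately, so that boundary rows never produce a value in the first place. The cutoff was simply the quicker route.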

Next we wanted to get a feel for the data visually, so we examined a few charts to check for glaring errors that we might not have caught otherwise. First we looked at the total number of impressions across the top 50 stations. As expected, the weekdays had a higher median impression count than the weekend.

We then looked at the impressions per day for the stations that appeared in the top 50 on all 7 days (there were 38 such stations). The interquartile range ran from ~80,000 to ~150,000, with a few outliers on the high end.

There were 26 stations that weren’t included every day, and they had a much lower impression count, with an IQR from ~25,000 to ~58,000. This supports the idea that the most popular stations are in the top 50 every day, while less visited stations fall in and out of the top 50 depending on the day.
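The split between stations that make the list every day and those that drift in and out comes down to a set intersection. Here is a small sketch of that idea. The function name `split_by_consistency`, the station names, and the numbers are invented for illustration, and the example uses a top 2 over three days instead of a top 50 over seven.

```python
def split_by_consistency(daily_totals, top_n):
    """Return (stations in the top N every day, stations in it only some days).

    daily_totals: {day: {station: impressions}} (hypothetical layout).
    """
    daily_top = []
    for totals in daily_totals.values():
        ranked = sorted(totals, key=totals.get, reverse=True)
        daily_top.append(set(ranked[:top_n]))
    every_day = set.intersection(*daily_top)
    some_days = set.union(*daily_top) - every_day
    return every_day, some_days


# Made-up daily impression totals for four stations.
daily_totals = {
    "day1": {"A": 100, "B": 90, "C": 50, "D": 10},
    "day2": {"A": 95, "B": 40, "C": 80, "D": 5},
    "day3": {"A": 99, "B": 85, "C": 60, "D": 70},
}
every_day, some_days = split_by_consistency(daily_totals, top_n=2)
```

Because the two groups together form the union of all daily lists, their sizes can add up to more than N, which is how 38 and 26 stations can both come out of a top 50.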



Performer Allocation:

We arbitrarily chose to allocate 150 performers, distributing them across stations as a function of traffic density and randomly assigning performers to spaces. We then simulated the impression count per artist as a function of the number of artists per station, and ran the simulation for a week, a month and a year. As the simulated period grew, the distribution became more peaked at the 0.95 to 1.0 bin, which suggests that our algorithm is working and that performers are getting a roughly equal number of impressions.
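One way to read the procedure above is sketched below. It is not our original simulation, only a simplified stand-in under several assumptions: spaces are handed out in proportion to traffic using largest-remainder rounding, a station's impressions are split evenly among its performers, traffic is identical every day, performers are reshuffled across spaces daily, and each performer's total is normalized by the best performer's total. The function names, station names, and traffic figures are hypothetical, and the example uses 10 performers rather than 150 to stay readable.

```python
import random


def allocate_slots(traffic, n_performers):
    """Give each station a number of spaces proportional to its traffic."""
    total = sum(traffic.values())
    slots, remainders = {}, {}
    for station, count in traffic.items():
        # Integer quota plus the remainder used to hand out leftover spaces.
        slots[station], remainders[station] = divmod(count * n_performers, total)
    leftover = n_performers - sum(slots.values())
    for station in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        slots[station] += 1
    return slots


def simulate(traffic, slots, n_days, seed=0):
    """Return each performer's total impressions relative to the best performer."""
    rng = random.Random(seed)
    # Assumption: a station's impressions are shared equally among its spaces.
    per_slot = [traffic[s] / k for s, k in slots.items() for _ in range(k)]
    totals = [0.0] * len(per_slot)
    for _ in range(n_days):
        rng.shuffle(per_slot)  # random reassignment of performers to spaces
        for performer, value in enumerate(per_slot):
            totals[performer] += value
    best = max(totals)
    return [t / best for t in totals]


# Made-up daily impressions for five stations.
traffic = {"A": 150000, "B": 110000, "C": 80000, "D": 58000, "E": 25000}
slots = allocate_slots(traffic, n_performers=10)
week = simulate(traffic, slots, n_days=7)
year = simulate(traffic, slots, n_days=365)
print(min(week), min(year))
```

Under these assumptions the lowest relative score should move toward 1.0 as the number of simulated days grows, since random reassignment averages out the differences between spaces. That is the same kind of peaking near the top bin described above.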




Future Possibilities:

There are several features we could have included to make the model more robust and to allocate performers in a way that earns them even more impressions and donations.

