Project 1 - Enter Through the Turnstile Data
THE BEGINNING
Week 1 at Metis began with a prompt, an email from a fictional non-profit organization.
“We are interested in harnessing the power of data and analytics to optimize the effectiveness of our street team work.”
“Where we’d like to solicit your engagement is to use MTA subway data […] to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend [our] gala and contribute to our cause.”
As a brand new resident of Chicago and longtime (now former) resident of New York, I felt as though I had lucked out. Perhaps I was new to data science, but I was certainly not new to the New York subway system.
THE TASK
Based on the email we were given, our task was to:
‘Optimize effectiveness of street teams’
at ‘entrances to subway stations’
to target individuals who are ‘passionate about technology’
to attend a gala ‘at the beginning of the summer’
THE TEAM
Myself, Andrew Way, former actor and recruiter
Ake Paramadilok, physical therapist
And Tony Ghabour, consultant
THE APPROACH
Until this project, I didn’t know that the MTA publishes its ridership data every week, with information about every turnstile at every entrance of all ~470 stations in the system. The dataset is rather cryptic at first glance, but looked at from the right direction it can illuminate trends about the subway and the people who use it.
We decided that since this fictional gala is taking place in June, we should look at the 4 weeks ending in May of 2019. Once we had imported our data and cleaned it with various techniques, we began to pick apart the intricacies of the data.
The turnstile data contains cumulative counts of people both entering and exiting the subway stations. Since we were simply looking for people, regardless of which direction they were headed, we combined entries and exits into a single total of target individuals.
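Because the counters are cumulative, the traffic in any interval is the difference between consecutive readings from the same turnstile. A minimal sketch of that step in pandas might look like the following. It is an illustration rather than the notebook's actual code: the sample numbers are made up, the combined `DATETIME` column is an assumption, and a turnstile is identified here by `STATION` and `SCP` only, which simplifies the real files.

```python
import pandas as pd

# Hypothetical sample shaped like the MTA turnstile files; values are made up.
readings = pd.DataFrame({
    "STATION": ["NEWKIRK PLAZA"] * 3 + ["FULTON ST"] * 3,
    "SCP": ["00-00-00"] * 6,
    "DATETIME": pd.to_datetime(
        ["2019-05-01 04:00", "2019-05-01 08:00", "2019-05-01 12:00"] * 2
    ),
    "ENTRIES": [1000, 1250, 1400, 5000, 5900, 6300],
    "EXITS": [800, 850, 1000, 7000, 7400, 8100],
})


def interval_traffic(df):
    """Turn cumulative counters into entries + exits per reporting interval."""
    df = df.sort_values(["STATION", "SCP", "DATETIME"]).copy()
    by_turnstile = df.groupby(["STATION", "SCP"])
    # Difference each counter against the previous reading of the same turnstile.
    df["ENTRY_DELTA"] = by_turnstile["ENTRIES"].diff()
    df["EXIT_DELTA"] = by_turnstile["EXITS"].diff()
    # The first reading of each turnstile has no predecessor, so drop it.
    df = df.dropna(subset=["ENTRY_DELTA", "EXIT_DELTA"])
    # Assumption: a negative delta means the counter was reset, so skip it.
    df = df[(df["ENTRY_DELTA"] >= 0) & (df["EXIT_DELTA"] >= 0)].copy()
    df["TRAFFIC"] = df["ENTRY_DELTA"] + df["EXIT_DELTA"]
    return df


traffic = interval_traffic(readings)
station_totals = traffic.groupby("STATION")["TRAFFIC"].sum()
print(station_totals)
```

Dropping negative deltas is only one way to handle counter resets; capping implausibly large jumps is another cleanup step that real data would likely need.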
From our initial visualizations, we were able to see distinct peaks of activity in the morning and evening, roughly corresponding to the daily commute. However, even when operating normally, the turnstiles only report their counts at 4-hour intervals. This is not nearly as fine-grained as any of us would have preferred, but we could only accept the limitation and work within it.
We opted to break the data roughly into two sections, morning and evening, partitioning at noon. We decided not to consider any data associated with late night or early morning (8:00pm to 4:00am).
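One way to express that split in code is a small labeling function. This is a sketch under a stated assumption: each timestamp marks the end of its 4-hour interval, so a noon reading counts as morning traffic and a 4:00am reading belongs to the excluded overnight window. The function name and constants are hypothetical.

```python
from datetime import datetime, time

NIGHT_END = time(4, 0)     # overnight window ends at 4:00am
NOON = time(12, 0)         # morning / evening partition
NIGHT_START = time(20, 0)  # overnight window starts at 8:00pm


def time_bucket(reading_end):
    """Label the interval ending at reading_end; None means excluded."""
    t = reading_end.time()
    if NIGHT_END < t <= NOON:
        return "morning"
    if NOON < t <= NIGHT_START:
        return "evening"
    return None  # 8:00pm to 4:00am is left out of the analysis


sample_times = [datetime(2019, 5, 1, hour) for hour in (0, 4, 8, 12, 16, 20)]
labels = [time_bucket(ts) for ts in sample_times]
print(labels)
```

If the timestamps were instead treated as the start of each interval, the comparisons would shift by one reading, so the choice is worth stating explicitly.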
We then began categorizing on the basis of weekday vs weekend to target the professional market as opposed to the tourist market. When we averaged target volume for each station by day type and ranked the results, 9 of the top 10 averages came from weekdays.
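The day-type comparison can be sketched in pandas roughly as follows. The daily totals here are invented for illustration, and the column names (`DATE`, `TRAFFIC`, `DAY_TYPE`) are assumptions rather than the names used in the project notebook.

```python
import pandas as pd

# Hypothetical daily totals per station; the numbers are made up.
# 2019-05-03 is a Friday, so the four dates run Fri, Sat, Sun, Mon.
daily = pd.DataFrame({
    "STATION": ["FULTON ST"] * 4 + ["NEWKIRK PLAZA"] * 4,
    "DATE": pd.to_datetime(
        ["2019-05-03", "2019-05-04", "2019-05-05", "2019-05-06"] * 2
    ),
    "TRAFFIC": [90000, 40000, 30000, 100000, 9000, 6000, 5000, 11000],
})

# dayofweek runs Monday=0 through Sunday=6, so 5 and 6 are the weekend.
is_weekend = daily["DATE"].dt.dayofweek >= 5
daily["DAY_TYPE"] = is_weekend.map({True: "weekend", False: "weekday"})

# Mean traffic for each (station, day type) pair, largest first.
mean_by_type = (
    daily.groupby(["STATION", "DAY_TYPE"])["TRAFFIC"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_by_type.head(10))

# One row per station with weekday and weekend columns, ready for a scatter plot.
wide = mean_by_type.unstack("DAY_TYPE")
print(wide)
```

The wide table has one point per station, which is the shape needed to put weekday averages on one axis and weekend averages on the other.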
With this observation, we decided to plot all of our stations with their weekday average on one axis, weekend average on the other, just to see the general shape of how these two variables interact.
We can observe a general trend of weekday use outpacing weekend use. Because the relationship was so linear, the weekday/weekend split did little to set individual stations apart, so we went back to the drawing board to find some feature that could guide us to the demographic of tech professionals without having to wade into the very deep (and usually not terribly current) pool of census data.
THE MAP
In my own qualitative research online, I stumbled across BuiltInNYC, an organization specifically for New York-based startups and tech companies. From their website, I manually collected data on ~115 of NYC’s largest tech employers. I also used one of their recent articles to gather information on the largest tech companies in New York based on footprint.
Using two very handy packages, I was able to turn these lists of names and addresses into a powerful visualization comparing our top stations’ locations to this array of companies. Geopy is a package that can accept addresses as strings and return extensive location data on them, including geographical coordinates. I then took those coordinates and ran them through Folium, a package for creating Leaflet maps. The resulting map plots tech companies (blue) against subway stations (red), indicating potential reach.
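A rough sketch of that geocode-then-map pipeline is below. It is not the code behind the map: the Nominatim geocoder, the helper names, the placeholder company, and the approximate Union Square coordinates are all assumptions made for illustration. Running `main()` needs network access plus the geopy and folium packages, so it is defined but not called.

```python
def geocode_all(addresses, geocode):
    """Return {name: (lat, lon)} for every address the geocoder resolves.

    `geocode` is any callable that takes an address string and returns an
    object with .latitude and .longitude, or None when nothing matches.
    """
    coords = {}
    for name, address in addresses.items():
        location = geocode(address)
        if location is not None:
            coords[name] = (location.latitude, location.longitude)
    return coords


def build_map(companies, stations, center=(40.7359, -73.9911)):
    """Plot companies in blue and stations in red on a Leaflet map."""
    import folium  # imported lazily so geocode_all works without folium

    fmap = folium.Map(location=list(center), zoom_start=12)
    for points, color in ((companies, "blue"), (stations, "red")):
        for name, (lat, lon) in points.items():
            folium.CircleMarker(
                location=[lat, lon], radius=6, color=color, fill=True, popup=name
            ).add_to(fmap)
    return fmap


def main():
    """Example run; call main() manually when network access is available."""
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="turnstile-blog-example")
    # Space out requests so the free geocoding service is not overloaded.
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

    # Placeholder company at a well-known address, not from the real list.
    companies = geocode_all(
        {"Example Tech Co (hypothetical)": "350 5th Ave, New York, NY 10118"},
        geocode,
    )
    # Approximate coordinates, entered by hand for illustration.
    stations = {"14 ST-UNION SQ": (40.7359, -73.9911)}
    build_map(companies, stations).save("tech_vs_stations.html")
```

Passing the geocoder in as an argument keeps the address handling separate from any one service, so a different geopy geocoder could be swapped in without touching the rest.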
THE CONCLUSION
We recommended the following stations based on our research:
Fulton Street - A/C, J/Z, 2/3, 4/5
connects to PATH train for access to New Jersey
tech presence with Spotify and Conde Nast
Columbus Circle - A/C, B/D, 1
closest junction station to New York Institute of Technology
northernmost of our top stations, dividing residential uptown from commercial downtown
Union Square - L, N/Q/R/W, 4/5/6
close proximity to Facebook, Oath, New York University
serves L train, the most utilized train in the MTA
THE AFTERMATH
After our presentation (and a few celebratory high fives), we got our feedback and got back to work refactoring our code, taking out all the missteps and dead ends and such. Once we’d made a notebook we were all pleased with, we gave it one last run and shipped it.
Given my new dexterity with Python, I decided to go back to my exercises from the beginning of the week and give them another shot. pandas had become significantly more familiar, so I was quite certain I would be able to speed through.
Instead of looking at any of the most utilized turnstiles, I turned my attention to my old stomping grounds. Unlike many New Yorkers I’ve met, I only ever lived in one place in NYC - Ditmas Park, a quiet area south of Prospect Park. It’s full of trees and lawns and old Victorian homes. I zoomed in on my own station, Newkirk Plaza.
THE REALIZATION
And then it dawned on me. As I compared the station to the date, it suddenly occurred to me that I was looking at myself through the data, my own motion through turnstiles, my own commute. And I felt both big and small, anonymous and recognized.
The point of data science is to make sense of the world around us. To bring order to the chaos. And in this moment, a field of study that felt so nebulous and hypothetical suddenly rooted itself into the ground.
Andrew Way, Jan 2020