Analytics, Data Visualization

The MIT Big Data Challenge: Visualizing Four Million Taxi Rides

The other day, I was showing a colleague how to use Python and Jupyter notebooks for some quick-and-dirty data visualization. It reminded me of some work I'd done while competing in the MIT Big Data Challenge. I meant to blog about it at the time, but never got around to it. However, it's never too late, so I'm starting with this post.

The Visualization

The main point of this post is the following animated visualization, which overlays a heatmap of the pickup locations of around 4.2 million taxi rides over a period of about five months in 2012 on top of a map of downtown Boston. The interesting thing about the heatmap of pickup spots is that it reveals the streets of Boston and highlights popular hotspots. Overlaying it on the actual satellite map of Boston shows this more clearly:

Those familiar with Boston will see that main streets like Massachusetts Ave, Boylston St and Broadway are starkly delineated. Hotspots like Fenway Park, Prudential Center, the Waterfront and Mass General Hospital show up quite clearly. The three most popular hotspots appear to be Logan Airport, South Station, and Back Bay Station, which is very much to be expected. I remember several occasions where I’ve taken a cab from these places after a night out or returning home after a flight or Amtrak, particularly on a freezing winter day!

The Challenge

Even though the challenge is a few years old (winter of 2013-2014), the context is still very much relevant today, perhaps even more so. The main goal of the challenge was to predict taxi demand in downtown Boston. Specifically, contestants were required to build a model that predicted the number of pickups within a certain radius of a location given (i) the latitude and longitude of the location, (ii) a day, and (iii) a time period (typically spanning a few hours) in that day. In addition, information about weather and events around the city was also available. Such a model has obvious uses. Drivers on services like Uber and Lyft could use it to tell where future demand is likely to be, and use that information to plan their driving and optimize their earnings. The services themselves could also use it to anticipate times and locations of high demand, and to dynamically meet that demand by incentivizing drivers well in advance through surge (err, I mean dynamic) pricing, so that there is enough time to get enough drivers to the location by the time demand starts to pick up.
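To make the prediction target concrete, here is a minimal Python sketch (not code from the competition) of how the quantity being predicted could be computed from a raw table of pickups. The column names (lat, lon, pickup_time) and the example coordinates are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (degrees in, kilometers out)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def count_pickups(pickups: pd.DataFrame, lat: float, lon: float,
                  start, end, radius_km: float = 0.5) -> int:
    """Number of pickups within radius_km of (lat, lon) between start and end."""
    window = pickups[(pickups["pickup_time"] >= start) & (pickups["pickup_time"] < end)]
    dist = haversine_km(window["lat"].values, window["lon"].values, lat, lon)
    return int((dist <= radius_km).sum())

# Hypothetical usage: pickups near South Station on an arbitrary evening
# count_pickups(pickups, 42.3519, -71.0552, "2012-05-04 18:00", "2012-05-04 21:00")
```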

I was thrilled when, after a lot of perseverance, I finally managed to get on the leaderboard. My self-congratulation was brief though, as I was soon blown off it. Which was not surprising in the least; this challenge was organized at MIT's Computer Science and Artificial Intelligence Lab, after all. Enough said.

Getting on the leaderboard, albeit briefly, was fun. But winning was never the goal. Rather, my goals for competing in the challenge were three-fold. First, I wanted to get a deeper understanding of data science workflows and processes through a relatively complex project. Second, I wanted to expand my machine learning skills. Finally, I wanted to try using Python (and in particular scikit-learn). So far I'd only used R for building predictive models (recognizing handwritten digits and predicting the number of "useful" votes a Yelp review will receive). But I had recently learned Python and had used it extensively during my summer internship at Google, building analytical models for Google Express and Hotel Ads. One of the main drivers was that I was about to start a product management internship at a stealth-mode visual analytics startup called DataPad, founded by Wes McKinney and Chang She, the creators of Pandas (the super popular Python library that I had used extensively in my work at Google), and I wanted to be prepared with some product ideas.

In the end the process was extremely educational and led to many insights in several areas of data science and machine learning. I even distilled some of the work into a Python for Data Science Bootcamp, which I conducted for the MIT Sloan Data Analytics Club along with my co-founder and co-president. I'll write about some of the learnings and insights in future posts, but for this one I'd like to talk briefly about the heatmap visualization from the start of this post.

The Making Of

Inevitably, the first thing one does when confronted with a new dataset, particularly for a predictive challenge such as this, is to understand the "shape" of the data. To help with this, it's fairly typical to create visualizations using one or more variables from the data. So the first thing I did was to fire up IPython Notebook (now known as Jupyter) and start looking at the pickups dataset provided.

After staring at it for a while, the proverbial light bulb went off in my head. Even though longitude and latitude are measured in degrees and indicate a point on the surface of the earth, which is a sphere, what if they could be used as Cartesian coordinates for a scatter plot on a plane, with longitude on the x-axis and latitude on the y-axis? I tried precisely that (for just one day's rides) and lo and behold, the map of downtown Boston was revealed:
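For reference, the scatter plot is only a few lines of matplotlib. This is a sketch rather than my original notebook code, and the file and column names (lat, lon) are hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd

pickups = pd.read_csv("pickups.csv")  # hypothetical file with lat/lon columns

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(pickups["lon"], pickups["lat"], s=0.5, alpha=0.3, color="black")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_aspect("equal")  # keep the "map" from looking stretched
plt.show()
```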

It became obvious that for a small area (relative to the size of the earth) like Boston, the curvature of the earth could be ignored. From here, creating a heatmap was fairly straightforward. I used matplotlib's hexagonal binning plot (essentially, a 2-D histogram) with a logarithmic scale for the color map. If you'd like to understand it from the ground up, I made a simple step-by-step introductory tutorial on data visualization in Jupyter that ends with the generation of this heatmap for the MIT Sloan Analytics Club "Python for Data Science" bootcamp.
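The hexbin heatmap is a similarly small amount of code; again, a sketch with hypothetical file and column names rather than the exact bootcamp notebook:

```python
import matplotlib.pyplot as plt
import pandas as pd

pickups = pd.read_csv("pickups.csv")  # hypothetical file with lat/lon columns

fig, ax = plt.subplots(figsize=(8, 8))
hb = ax.hexbin(pickups["lon"], pickups["lat"],
               gridsize=400,   # number of hexagons across the x-axis
               bins="log",     # logarithmic color scale so quieter streets still show
               mincnt=1,       # leave empty bins unpainted
               cmap="inferno")
fig.colorbar(hb, ax=ax, label="log10(pickups per bin)")
ax.set_axis_off()
plt.show()
```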

To create the visualization for this post, I redid the heatmap using all the pickup data available (around 4.2 million pickups over 5 months) and used a different color palette to end up with this:

The code can be found here, but the full dataset itself is not included because it’s over 300MB in size and GitHub has a limit of 100MB.

Then it was a matter of taking a satellite photo of Boston and messing about with GIMP to overlay the heatmap on top of it, create the animation by blending the two, and export it as an animated GIF. This was the first time I used GIMP in anger (I always thought it was for Linux and didn't realize there was a Mac app available), and I have to say it's a pretty awesome free alternative to Photoshop. It doesn't quite feel like a native Mac app (the behavior and look of the menus and navigation are a little funky), but it got the job done really well for what I needed to do.

Bonus Interactive Visualization

While trying to figure out the best way to present the heatmap overlayed on the Boston map (and eventually settling on the simplicity and versatility of an animated GIF), I came across the cool “onion skin” image comparison feature of GitHub. Click on “Onion Skin” in the image comparison that shows up for this commit.

[Screenshot: GitHub image comparison view for the commit]

You can use a slider to manually blend the two images and clearly see how the taxi-ride heatmap maps onto (pun intended!) the streets of Boston.

[Screenshot: the Onion Skin slider blending the heatmap and the satellite map]

Improvements

Even though I was relatively familiar with Boston, having lived there for two years, it was still not immediately obvious what some of the specific hotspots were. This could be addressed in a few ways:

Alternative “Static” Visualization

Create a similar animated GIF visualization, but using a street map with labels.

Dynamic Overlay on “Live” Interactive Map

A better approach would be to create an app that uses something like the Google Maps API to show a "live" interactive map view, so the user keeps all the features of Google Maps (zooming, switching between street and satellite views, etc.). The app would let the user toggle the visibility of the heatmap overlay on top of the map. The user could choose from a set of colormaps for the overlay (some being more suitable for street vs. satellite views), and also use a slider to play with the overlay's opacity (like with GitHub's onion skin tool).
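As a rough illustration of the overlay idea in Python (using the open-source folium library rather than the Google Maps API, and hypothetical file and column names):

```python
import folium
from folium.plugins import HeatMap
import pandas as pd

pickups = pd.read_csv("pickups.csv")  # hypothetical file with lat/lon columns

# Center the interactive map on downtown Boston
m = folium.Map(location=[42.3554, -71.0605], zoom_start=13, tiles="OpenStreetMap")

# Each data point is [lat, lon]; a third element could be used as a weight
HeatMap(pickups[["lat", "lon"]].values.tolist(),
        radius=8, blur=6, min_opacity=0.3).add_to(m)

m.save("boston_pickups.html")  # open in a browser; pan and zoom like any web map
```

folium's LayerControl gives you the overlay on/off toggle; the opacity slider and colormap switcher would need a bit of custom JavaScript on top, which is where something like the Google Maps API (or Leaflet directly) comes in.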

Dynamic Overlay on 3D Map

The next logical step would be to take the dynamic overlay concept and apply it to a live 3D map view. Here is a “concept” of that idea:

[Concept image: heatmap overlaid on a 3D map view]

Analytics, Data Visualization

Where Do Sloanies Go After They Graduate?

It’s been a year since I graduated from the full-time MBA program at MIT Sloan and moved to the San Francisco Bay Area to work as a Product Manager. I thought it would be interesting to see where Sloanies go after they graduate and, using data from a survey sent out shortly before graduation, came up with an interactive visualization: Sloanies Around the World.

Sloanies Around the World

Clicking on “USA” from the menu presents a clearer picture of where Sloanies ended up in the States:

Sloanies Around the World - USA

The top 4 cities where my classmates ended up are:

  1. Boston
  2. San Francisco Bay Area
  3. New York
  4. Seattle

It is interesting to see how post-MBA career choice determines location. I wanted to remain in software, and chose MIT because of its reputation in technology and entrepreneurship. In fact, technology is the second most popular career choice (after consulting) for Sloanies. Indeed, out of all the M7 business schools, Sloan had the highest proportion of graduates choosing technology (26%). (Source: The M7: The Super Elite Business Schools By The Numbers) For me, the Bay Area was the obvious choice even though it’s on the opposite coast. Many of my classmates echoed this sentiment, which explains San Francisco and Seattle being top post-MBA destinations.

While I'd expect this year's graduating class to have a similar map, what would be most interesting is to see visualizations from other schools. I would expect the maps for schools with strong finance reputations, like Harvard, Booth, Wharton and Columbia, to be much more heavily skewed towards financial centers like New York.

Gadgets, Trends & Strategy

The Experience of Listening to Music is Broken

The realization that music ownership as we knew it was soon to become redundant in favor of streaming services came to me when I signed up for Spotify in early 2009, after they opened up public registration to their free service. At the time it was only available in the UK, and I was fortunate to be living in London. Although I had my entire music collection on my iPod Classic at work, I found myself primarily listening to music on Spotify. Despite the conspicuous absence of some favorite bands like Metallica, AC/DC and Led Zeppelin, Spotify offered a vast selection of old and new music, all for free! It was a revelation.

Five years on, music streaming has matured and Spotify has many competitors. There is no lack of all-you-can-eat subscription services for around $10 a month with features like mobile apps and offline listening. Some of the more popular ones are Google Play Music, Rdio, Sony Music Unlimited, Xbox Music and the one that has been in the press a lot lately after its acquisition by Apple, Beats Music. Additionally, there are services like SoundCloud that allow emerging and unsigned artists to upload their music for free consumption.

The economics of streaming services from the perspective of the consumer are incredible. We not only have access to a virtually unlimited selection of music but at a fraction of our “willingness to pay”. Thanks to long tail economics at work, we have already become accustomed to having access to a large selection of music when purchasing over the internet, whether buying physical CDs from Amazon or downloading music from iTunes. But $10 only gets you 10 songs on iTunes or half a CD on Amazon. For the same price, a subscription gives you virtually all the latest music. For example, Spotify adds 20,000 tracks per day (though not all of them are new). As a comparison, the majority of the music that I own is a collection of around 200 CDs bought over a period of nine years. At $20 per CD that equates to an expenditure on music of approximately $40 per month, for a grand total of around 3,000 tracks, most of which are album fillers.

However, despite all the goodness, the experience of playing music suffers from some major problems.

Problem #1: Music silos
It is safe to say that the majority of users listen to music from multiple sources. For example, I have playlists on Spotify and SoundCloud as well as music that I own (lossless versions of which are sitting on a NAS drive, and compressed versions uploaded to Google Play Music); that’s four different places. But there is no way to seamlessly mix and match music from multiple sources in a single playlist. Ok, that’s not quite true; there is one service that does allow this, but in a limited manner. More on that later. The ability to buy individual tracks on iTunes meant that playlists replaced albums as the de facto collection of songs that get played in one go. Music was freed from albums. Now it needs to be freed from music services!

Problem #2: Lossy compression
With very few exceptions, all music available for download and streaming on the internet (the de facto mode of listening to music today) is compressed and significantly inferior to the original recording. This is a real disservice to the music. Lossy codecs like MP3 trade audio quality for storage and transmission efficiency, which is fine for listening on an iPod with crappy Apple earbuds, but doesn't cut it for proper music listening on decent equipment. Storage and bandwidth are so cheap today that we don't really need to make this compromise any more. In fact, we have recently seen the emergence of "high resolution audio", which offers significantly better quality than CDs.

So here then are my requirements for playing music:

  1. Have a similar experience whether I am playing music at home (audio system), at work (laptop) or on the go (smartphone)
  2. Search, play and create playlists across streaming services and music that I own using one interface
  3. CD quality or better

I haven’t been able to come up with a compelling solution that I am completely satisfied with (hence this blog post), but the situation isn’t completely hopeless either.

The solution that comes closest to satisfying these requirements is Sonos. The Sonos Connect (and Sonos speakers) can play music from most major music subscription services (most notably Spotify and Google Play Music) as well as lossless music from my NAS. Using the latest version of their Controller app, I can create playlists that span multiple sources. Kudos to Sonos for this innovation! I must say they are doing a tremendous job of creating a highly compelling experience for listening to music at home. The trouble is that even though Sonos has smartphone, tablet and desktop apps, these apps are only controllers for Sonos devices. I can't play my Sonos playlists on the go (on my smartphone) or at work (on my laptop). My appeal to Sonos is: Please enable playback in your Controller app and cement Sonos as the one-stop experience for music not only in the home but also outside it. Granted, I can't expect files on my NAS drive to be playable everywhere, but I would happily stream those files using Google Play Music (discussed next). It would be possible to build such a player if the likes of Spotify and Google created APIs like SoundCloud and Rdio have done.

Google Play Music lets you upload up to 20,000 of your own songs for free, which is a compelling reason to switch from Spotify, since I could then create playlists by mixing and matching tracks from their subscription service and my own uploaded music, and listen to them anywhere. However, lossless files are transcoded to MP3, so the downside is that I would have to live with lossy compression. My appeal to Google is: Please allow lossless uploading of music and charge for it if you must.

Qobuz is a CD-quality streaming service based in France. It is supported by Sonos. It’s only available in certain countries in Europe currently, but hopefully it will be available in the US soon. Looks like a European startup is once again leading the charge in music innovation.

Finally, there are several services that let you buy lossless music, most notably HDtracks and Murfie. Murfie also has an interesting service whereby you send them your CDs and/or vinyl, and they rip them into lossless files which you can stream or download. They don't yet offer a lossless locker service for music that you have already ripped yourself.

I still haven’t quite figured out what the best compromise is going to be but I am leaning towards moving from Spotify to Google. Spotify, if you’re listening, please offer a locker service for owned music with lossless streaming.

Gadgets

Home Baked NAS Using a Raspberry Pi and a Portable Hard Drive

I am in the early stages of putting together a hi-fi setup for my future living room. The primary source is going to be a Sonos Connect which will allow me to stream music from cloud services like Spotify and also play my CD collection (from when CDs used to be a thing) which I have ripped to lossless files. However, my lossless files have been sitting on my laptop and needed to be liberated onto a NAS. Rather than spend $300 or more for a decent NAS, I decided to build one from a Raspberry Pi that needed to be put to good use and an old portable hard drive I had lying around.

Here is the result:

The Raspberry Pi Model B is attached to a Transcend StoreJet 320GB using a Quirky Bandit. Powering the Pi and the hard drive (the drive comes with a Y-cable whose extra plug is solely for additional power) is a small, powered USB hub, which is attached to the Pi using another Bandit. The two wires external to the unit are the ethernet cable and the power cable for the USB hub.

Software-wise it's fairly simple: just Samba running on Raspbian. This is perfectly adequate for my current needs, since the Sonos Connect just needs a Samba share to play local music, but the great thing about using the Pi as a NAS controller is its extensibility; I could quite easily install MiniDLNA for uPnP support. The setup was fairly straightforward despite my less-than-elite Linux skills, although the Samba configuration took a little bit of fiddling to get right. I also found the Logitech K400 Wireless Keyboard with Built-In Touchpad super useful when working with the Pi initially, before running it headless.
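For reference, a minimal Samba share definition for this kind of setup might look something like the following. This is a sketch rather than my exact configuration; the share name and mount path are placeholders:

```
# /etc/samba/smb.conf (excerpt)
# Share name and path are placeholders; point "path" at wherever the drive is mounted.
[music]
   path = /mnt/storejet/music
   read only = yes
   guest ok = yes
   browseable = yes
```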

Analytics

Predicting the Number of “Useful” Votes a Yelp Review Will Receive

A few months ago I wrote about creating a submission for the Digit Recognizer tutorial “competition” on Kaggle that could correctly recognize 95% of handwritten digits.

Enthused by this, I decided to participate in a real competition on Kaggle, and picked the Yelp Recruiting Competition which challenges data scientists to predict the number of “useful” votes a review on Yelp will receive. Good reviews on Yelp accumulate lots of Useful, Funny and Cool votes over time. “What if we didn’t have to wait for the community to vote on the best reviews to know which ones are high quality?”, was the question posed by Yelp as the motivation for the challenge.

I am pleased to report that I ended the competition in the top one-third of the leaderboard (110 out of 352). Although the final result was decent, there were many stumbling blocks along the way.

Data

The training data consisted of ~230,000 reviews along with data on users, businesses and check-ins. The data was in JSON format, so the first step was to convert it to tab-delimited format using a simple Python script, so that it could be easily loaded into R.
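The conversion script was trivial. A minimal sketch of the idea (assuming one JSON object per line; the field names here are hypothetical rather than the exact Yelp schema):

```python
import csv
import json

FIELDS = ["review_id", "user_id", "business_id", "stars", "text", "votes_useful"]  # hypothetical

with open("reviews.json") as src, open("reviews.tsv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=FIELDS, delimiter="\t", extrasaction="ignore")
    writer.writeheader()
    for line in src:                                   # one JSON object per line
        record = json.loads(line)
        record["votes_useful"] = record.get("votes", {}).get("useful", 0)
        # Strip tabs/newlines from the review text so the TSV stays well-formed
        record["text"] = record.get("text", "").replace("\t", " ").replace("\n", " ")
        writer.writerow(record)
```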

Visualization

Next, I tried to understand the data by visualizing it. Here is a distribution of the number of useful votes:

[Chart: distribution of the number of useful votes]

Evaluation

Because Kaggle only allows two submissions per day, I created a function to evaluate the results of a prediction before submission, replicating the metric used by Kaggle, i.e. the Root Mean Squared Logarithmic Error ("RMSLE"):

\[
\epsilon = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}
\]

where:

  • ϵ is the RMSLE value (score)
  • n is the total number of reviews in the data set
  • p_i is the predicted number of useful votes for review i
  • a_i is the actual number of useful votes for review i
  • log(x) is the natural logarithm of x
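My evaluation function was written in R; for illustration, the same computation in Python/NumPy (a sketch, not the original code):

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error, as defined above."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((np.log(p + 1) - np.log(a + 1)) ** 2)))

# A perfect prediction scores 0: rmsle([3, 0, 12], [3, 0, 12]) -> 0.0
```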

Refining the Model

Next, I split the data into training and validation sets in a 70:30 ratio and created a linear regression model using just two independent variables: 'star rating' and 'length of review'. This model resulted in an error of ~0.67 on the test data, i.e. after submission.

Next, I hypothesized that good reviews were written by good reviewers, and for each review I calculated the average number of useful votes that the user writing the review received across all the other reviews he/she wrote. Including this variable reduced the error dramatically, to ~0.55.
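This "reviewer quality" feature amounts to a leave-one-out average per user. Here is a pandas sketch of the computation (the actual work was done in R, and the column names are hypothetical):

```python
import pandas as pd

def add_user_avg_useful(reviews: pd.DataFrame) -> pd.DataFrame:
    """For each review, the mean useful votes across the *other* reviews by the same user."""
    grouped = reviews.groupby("user_id")["votes_useful"]
    total = grouped.transform("sum")
    count = grouped.transform("count")
    # Leave-one-out mean; users with a single review fall back to 0 as a neutral default
    loo_mean = (total - reviews["votes_useful"]) / (count - 1).clip(lower=1)
    reviews["user_avg_useful"] = loo_mean.where(count > 1, 0.0)
    return reviews
```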

Next, I incorporated more user data, i.e. the number of reviews written by the user, the number of funny/useful/cool votes given, and the average star rating. None of these variables proved predictive of the number of useful votes with linear regression, so I tried random forests, but to no avail.

Next, I incorporated business data to see if the type of business, the star rating or number of reviews received would increase the predictive power of the model. But again, these failed to reduce the error.

Next, I incorporated checkin data to see if the number of checkins would improve the model. Again, this failed to reduce the error.

Having exhausted all the easy options, I turned to text mining to analyze the actual content of the reviews. I split the reviews into two categories: ham (good reviews with more than five useful votes) and spam (bad reviews with five useful votes or fewer). For each category, I created a "term document matrix", i.e. a matrix with terms as columns, documents (review text) as rows, and cells containing the frequency of the term in the document. I then created a list of the most frequent terms in each category that were distinct, i.e. that appeared in only one category or the other. To the model I added variables for the frequencies of each of these words, and in addition added the frequencies of the exclamation mark (!) and comma (,); a sketch of the feature construction follows the list below. The final list of words for which I created frequency variables was:

  • , (comma)
  • !
  • nice
  • little
  • time
  • chicken
  • good
  • people
  • pretty
  • you
  • service
  • wait
  • cheese
  • day
  • hot
  • night
  • salad
  • sauce
  • table

The frequency variables improved the predictive power of the model significantly and resulted in an error of ~0.52.

Visualization of Final Model

Here is a heatmap of predicted (x-axis) vs actual (y-axis) useful votes:

[Heatmap: predicted vs. actual useful votes]

For lower numbers of useful votes (i.e. up to ~8), there is a relatively straight diagonal line, indicating that by and large the predicted and actual values coincide. Beyond this, the model starts to falter and there is a fair amount of scattering.

Improvements

I couldn’t find time to improve the model even further, but I am fairly confident that additional text mining approaches such as stemming and natural language processing would do so.

Big Data

Big Data(bases): Making Sense of OldSQL vs NoSQL vs NewSQL

A few months ago, I had the great pleasure of meeting and discussing big data with Michael Stonebraker, a legendary computer scientist at MIT who specializes in database systems and is considered to be the forefather of big data. Stonebraker developed INGRES, which helped pioneer the use of relational databases, and has formed nine companies related to database technologies.

Until recently, the choice of a database architecture was largely a non-issue. Relational databases were the de facto standard, and the main choices were Oracle, SQL Server or an open source database like MySQL. But with the advent of big data, scalability and performance issues with relational databases became commonplace. For online processing, NoSQL databases have emerged as a solution to these problems. NoSQL is a catch-all for different kinds of database architectures: key-value stores, document databases, column family databases and graph databases. Each has its own relative advantages and disadvantages. However, in order to get scalability and performance, NoSQL databases give up "queryability" (i.e. the ability to use SQL) and ACID transactions.

More recently a new type of database has emerged that offers high performance and scalability without giving up SQL and ACID transactions. This class of database is called NewSQL, a term coined by Stonebraker. He provides an excellent overview of OldSQL vs NoSQL vs NewSQL in this video.

Some key points from the video:

  • SQL is good.
  • Traditional databases are slow not because SQL is slow, but because of their architecture and the fact that they are running code that is 30 years old.
  • NewSQL provides performance and scalability while preserving SQL and ACID transactions by using a new architecture that drastically reduces overhead.

In the video, Stonebraker talks about VoltDB, an open source NewSQL database that comes from a company of the same name founded by him. Some of the performance figures of VoltDB are pretty amazing:

  • 3 million transactions per second on a “couple hundred cores”
  • 45x the performance of "a SQL vendor whose name has more than three letters and less than nine"
  • 5-6 times faster than Cassandra and same speed as Memcached on key-value operations

VoltDB sounds like an extremely compelling alternative to NoSQL databases, and certainly warrants a look if you want to move from a traditional “OldSQL” database to one that is highly scalable and performant without losing SQL and ACID.

Analytics, MBA

Amateur Data Scientist?: How I Built a Handwritten Digit Recognizer with 95% Accuracy

Almost two years ago, I wrote a post entitled Stats are Sexy, in which I mentioned the emerging discipline of data science. Soon after, I discovered the amazing platform Kaggle, which lets companies host Netflix Prize-style competitions where data scientists from all over the world compete to come up with the best predictive models. Fascinated, I really wanted to learn machine learning by competing in the "easy" Digit Recognizer competition, which requires taking an image of a handwritten digit and determining what that digit is, but I struggled to gather the know-how in the limited free time that I had. Instead, I quenched my desire to do innovative work with data by building data visualization showcases as a Technical Evangelist for Infragistics, my employer at the time: Population Explosion and Flight Watcher.

Now, as an MBA student at the Massachusetts Institute of Technology, I am taking a class called The Analytics Edge, which I am convinced is one of the most important classes I will take during my time at business school. More on that later. Part of my excitement for taking the class was to learn R, the most widely-used tool (by a long margin) among data scientists competing on Kaggle.

After several lectures, I had some basic knowledge of how to identify and solve the three broad types of data mining problems – regression, classification and clustering. So, I decided to see if I could apply what I had learnt so far by revisiting the Digit Recognizer competition on Kaggle, and signed up as a competitor.

I recognized this as a classification problem: given input data (an image), the problem is to determine which class (a number from 0 to 9) it belongs to. I decided to try CART (Classification and Regression Trees). I used 70% of the data (which contains 42,000 handwritten digits along with labels that identify what numbers the digits actually are) to build a predictive model, and 30% to test its accuracy. The CART model was only about 62% accurate in recognizing digits, so I tried Random Forests, which to my surprise turned out to be ~90% accurate! I downloaded the "real" test set, which contains 28,000 handwritten digits, ran it through the model and created a file that predicted what each of the digits actually was.
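The actual work was done in R (linked below), but the same workflow in Python/scikit-learn would look roughly like this sketch, using the Kaggle train/test file layout:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")              # 42,000 labelled digits
X, y = train.drop(columns=["label"]), train["label"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_tr, y_tr)
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

test = pd.read_csv("test.csv")                # 28,000 unlabelled digits
submission = pd.DataFrame({"ImageId": range(1, len(test) + 1),
                           "Label": model.predict(test)})
submission.to_csv("submission.csv", index=False)
```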

I uploaded my prediction file and to my surprise the accuracy was 93%. I increased the number of trees in the random forest to see if I could do better, and it indeed worked, bumping up the accuracy to 95% and moving me up 43 positions on the leaderboard:

[Screenshot: Kaggle leaderboard showing the submission result]

Here is the code in its entirety: DigitRecognizer.r

I was amazed that I was able to build something like this in a couple of hours in under 30 lines of code (including sanity checks). (Of course, I didn’t have to clean and normalize the data, which can be painful and time-consuming.) Next up: a somewhat ambitious project to recognize gestures made by moving smartphones around in the air. Updates to follow in a future post.

What’s really exciting from a business standpoint is that predictive analytics can be applied in a large number of business scenarios to gain actionable insights or to create economic value. Have a look at the competitions on Kaggle to get an idea.

It has been clear for some time that companies can obtain a significant competitive advantage through data analytics, and this is not limited to specific industries. A few excerpts from the MIT Sloan Management Review 2013 Data & Analytics Global Executive Study and Research Project, published only a few days ago, hint at the scale of "The Analytics Revolution":

How organizations capture, create and use data is changing the way we work and live. This big idea, which is gaining currency among executives, academics and business analysts, reflects a growing belief that we are on the cusp of an analytics revolution that may well transform how organizations are managed, and also transform the economies and societies in which they operate.

Fully 67% of survey respondents report that their companies are gaining a competitive edge from their use of analytics. Among this group, we identified a set of companies that are relying on analytics both to gain a competitive advantage and to innovate. These Analytical Innovators constitute leaders of the analytics revolution. They exist across industries, vary in size and employ a variety of business models. 

If I was enthused about the power of analytics before, I am even more convinced now. Which is why I consider classes like the Analytics Edge that teach hard analytics skills to be extremely valuable for managers in the current global business environment.

Will you be an Analytics Innovator?

Data Visualization

Map-Based Visualization Revisited

In January, I created an animated visualization showing the movement of domestic flights of Indian carrier Jet Airways over the course of a day:

[Animation: Jet Airways domestic flights over the course of a day]

The time of day is depicted using a slider that moves horizontally as time passes:

[Image: the time-of-day slider]

The motivation behind this visualization, besides trying to build something cool, was to showcase the capabilities of my former employer’s Geographic Map API and encourage developers to build map-based visualizations.

I am thrilled to see that a similar approach is taken by Google’s visualization of flights to and from London, which was released earlier this month:

[Image: Google's visualization of flights to and from London]

The time of day in this case is depicted using a simple clock:

[Image: the clock showing time of day]

This visualization is part of the More Than a Map website showcasing the capabilities of the Google Maps API – do check it out.

I’ve always been a big fan of Google Maps. I remember looking at the newly released Street View feature in 2007 and shaking my head in awe. And MapsGL released earlier this year produced a similar reaction.

What’s great for developers is that they can leverage all of the goodness of Google Maps via the API for free! What will you build?
