Posts Tagged ‘nyc data mine’

Mapping NYC subway stations

I previously wrote about showing the transit layer with the Google Maps API, this is somewhat of a continuation, but narrower in scope; here I’ll talk about showing custom markers for New York City subway stations, making use of data from NYC Open Data (formerly the NYC Data Mine).

The reason for doing this at all, given that a Google Map already shows subway stations, is that the default indicators are fairly inflexible:

  • You can’t use a custom icon to change how they look
  • You can’t fire off custom event handlers when the user interacts with them

NYC subway stations near city hall

(this is true for all of the point of interest indicators: parks, schools, etc.)

Replacing the point of interest indications with markers turned out to be fairly simple and the data from NYC Open Data was (relatively) clean and readily usable, a pleasant surprise given my previous experience. For this little project, I exported the Subway Stations dataset; it can be exported in a number of formats, but JSON is probably the easiest to work client-side. With the data readily available as a JSON file, it can be loaded simply with an AJAX call to get the file and, once loaded, the subway stations can be plotted on the map by iterating through the list of stations in the JSON data.

It’s worth taking a look at the format of the JSON data, as the indexing of the nodes isn’t all that clear. There’s a meta element and data element at the root, the data element contains a zero-indexed array of subway stations, and each subway station contains a zero-indexed array of attributes, notably:

  • 10 = station name
  • 12 = dash-delimited list of train lines
  • 9 = object with latitude, longitude, and geometry field
    (Note that these values are referenced by the field name, not a numeric index).

NYC Open Data, Subway Station, JSON format

The code to get the JSON data (using jQuery’s .ajax), iterate through the array of subway stations, extract the relevant pieces of information about the stations, and create map markers for them is shown below. A label is attached to the marker by making use of Marc Ridey’s Label class.

function (ret)
for(var i=0; i<; i++)
// extract station name, latitude, longitude, and dash-delimited list of train lines at station
var stationName =[i][10];
var lat =[i][9]['latitude'];
var lon =[i][9]['longitude'];
var trainLines = ([i][12]).split('-');

// make comma-delimited list of train lines
var trainLinesLbl = '';
for(var k=0; k<trainLines.length; k++)
trainLinesLbl += trainLines[k];
if(k < trainLines.length-1)
trainLinesLbl +=

// create marker
marker = new google.maps.Marker({
"position": new google.maps.LatLng(lat, lon),
"map": map,
"title": stationName + " [" + trainLinesLbl + "]",
"icon": ""

// create label for marker
// uses Label created by Marc Ridey
var label = new Label({ map: map });
'position', marker, 'position');
'text', marker, 'title');                                                     

With some minor styling to the label and an icon from Map Icons Collection, here’s my result:

NYC subway stations with custom markers

NYC yum yum

NYC yum yum is my web app for the NYC Big Apps 2.0 competition.

NYC yum yum

It’s a very simple app for quickly finding restaurants by cusine + location. For each resturaunt, it shows the restaurant inspection grade, to give you an idea of how clean and safe the food at the establishment is, and also pulls the Yelp star rating to give you and idea of how good the food is.

NYC yum yum front page

NYC yum yum results

Search by cuisine and location (specifically, neighborhood) I felt was really important. Searching by name is, of course, much simpler, but when it comes to finding a new restaurant your likely not to know the name beforehand. Even if you have been to the restaurant before, unless it’s a regular spot for you, you’d likely still has issues remembering the name given the number of places to eat in New York City.

That said, this was far more difficult than I anticipated and I’m sad to say the current implementation is far from ideal. Here’s why:

  • The raw data set from the NYC Data Mine contains no geographic coordinates (latitude, longitude), so to get an accurate location, the address must be geocoded. There are a number of services to do this, but they’re all pretty limited. Google’s Geocoding API was my first choice, but I was wary of the terms of use,

    … the Geocoding API may only be used in conjunction with a Google map; geocoding results without displaying them on a map is prohibited. For complete details on allowed usage, consult the Maps API Terms of Service License Restrictions.

    My second choice was, but it had issues geocoding locations outside of Manhattan. So I put aside the idea of doing a geocode.
  • My next attempt was to do queries by neighborhood. Surely, it would be easy to translate between neighborhoods and zip codes, right?! Nope. All I could find were commercial services (e.g. maponics) that did it, and I wasn’t willing to go down that road.
  • My final attempt and what’s implemented, is grabbing the zip code from the the marker location on the google map. This sorta works. The problem is that the area covered by a zip code doesn’t necessarily match up exactly with a neighborhood’s boundary. So a restaurant that may not be shown in the search if it falls into an adjacent zip code.

My big disappointment is how restrictive and inaccessible all of this geographic data is, and the lack of such data severely compromises the effectiveness of the search.

Another, somewhat major issue with the raw data set is how restaurants are categorized. I’m noticing tons of establishments are simply identified with the cuisine type of “American.” For example, B Cup is a café in the East Village. If I was looking for it, I would search for “Café/Coffee/Tea” not “American.”

NYC yum yum

In fact, I think “American” is way to generic to describe any sort of cuisine.

In any case, NYC yum yum still works pretty well and does hit of the target of being able to find good, clean places to eat at, quick and easily. I’ll likely be working on improvements to it in the near future.

Edit: mixed up geocoding and reverse geocoding

NYC Data Mine, restaurant inspection data

I’ve just finished importing the current restaurant inspection data from the NYC Data Mine into a PostgreSQL database. It wasn’t the most difficult migration, but more difficult than it should be as the raw data from the data mine is messy and not well-formed; a typical problem with many of the data sets present in NYC Data Mine. I came across a great post by Steven Romalewski (director of the CUNY Mapping Service) about the poor data quality and poor metadata based on his experiences.

From looking at the restaurant inspection data and skimming a few other sets, I get the sense that structured and relational data simply isn’t understood or handled well. To be fair, there’s a very real lack of tools in the market, at least at the consumer/data-entry level, for handling such data, so it’s not surprising that everything gets jerryrigged into an Excel worksheet. This is very clear when looking at the restaurant inspection data, you notice right away that restaurant ids and names are repeated across multiple rows.

In any case, the restaurant inspection data is better than most of the sets, but there’s a few issues to take note of:

  • In multiple cases the same row, with the exact same data, is repeated.
  • There are 2 columns for the inspection date: INSPDATE and GRADEDATE; GRADEDATE = INSPDATE if there’s a letter grade for the restaurant, otherwise it’s blank/null.
  • Most glaring, there are invalid timestamps in the GRADEDATE column for 2 restaurants (but, of course, it’s across multiple rows as the restaurants has multiple entries), CAPRI RESTAURANT and MAMA LUCIA:

    timestamp problem

For my purposes, I only wanted the most recent inspection result (i.e. the row the latest INSPDATE timestamp). To do this, I added an additional column for a serial/auto_increment id number. Then, once imported, I deleted the unneeded rows with the following query:

/* table is restaurant
id = CAMIS
inspection_score_date = INSPDATE
internal_id = serial/auto_increment id number

DELETE FROM restaurant WHERE internal_id NOT IN
(SELECT MAX(restaurant.internal_id) AS max_iid FROM restaurant,
(SELECT id, dba, MAX(inspection_score_date) AS last_inspt FROM restaurant GROUP BY id, dba) AS sub
WHERE AND restaurant.inspection_score_date=sub.last_inspt GROUP BY

The innermost subquery pulls the rows with the most recent inspection date, the outer takes care of duplicate rows with the same inspection date by simple taking the row with the max internal id number. What results is a column of internal id numbers – each representing a row with a unique restaurant inspection for the most-recent inspection.

I’m not sure if this is the best or most efficient way to do this, but it works and took about 14s to delete the unneeded rows for 398,878 rows on a low-end VPS.