eliminate duplicates | semi/signal

Posts Tagged ‘eliminate duplicates’

NYC Data Mine, restaurant inspection data

Jan 11 2011 · Databases

I’ve just finished importing the current restaurant inspection data from the NYC Data Mine into a PostgreSQL database. It wasn’t the most difficult migration, but more difficult than it should be as the raw data from the data mine is messy and not well-formed; a typical problem with many of the data sets present in NYC Data Mine. I came across a great post by Steven Romalewski (director of the CUNY Mapping Service) about the poor data quality and poor metadata based on his experiences.

From looking at the restaurant inspection data and skimming a few other sets, I get the sense that structured and relational data simply isn’t understood or handled well. To be fair, there’s a very real lack of tools in the market, at least at the consumer/data-entry level, for handling such data, so it’s not surprising that everything gets jerryrigged into an Excel worksheet. This is very clear when looking at the restaurant inspection data, you notice right away that restaurant ids and names are repeated across multiple rows.

In any case, the restaurant inspection data is better than most of the sets, but there’s a few issues to take note of:

In multiple cases the same row, with the exact same data, is repeated.
There are 2 columns for the inspection date: INSPDATE and GRADEDATE; GRADEDATE = INSPDATE if there’s a letter grade for the restaurant, otherwise it’s blank/null.
Most glaring, there are invalid timestamps in the GRADEDATE column for 2 restaurants (but, of course, it’s across multiple rows as the restaurants has multiple entries), CAPRI RESTAURANT and MAMA LUCIA:

For my purposes, I only wanted the most recent inspection result (i.e. the row the latest INSPDATE timestamp). To do this, I added an additional column for a serial/auto_increment id number. Then, once imported, I deleted the unneeded rows with the following query:

/* table is restaurant
   id = CAMIS
   inspection_score_date = INSPDATE
   internal_id = serial/auto_increment id number
*/

DELETE FROM restaurant WHERE internal_id NOT IN
    (SELECT MAX(restaurant.internal_id) AS max_iid FROM restaurant,
            (SELECT id, dba, MAX(inspection_score_date) AS last_inspt FROM restaurant GROUP BY  id, dba) AS sub
        WHERE restaurant.id=sub.id AND  restaurant.inspection_score_date=sub.last_inspt GROUP BY restaurant.id)

The innermost subquery pulls the rows with the most recent inspection date, the outer takes care of duplicate rows with the same inspection date by simple taking the row with the max internal id number. What results is a column of internal id numbers – each representing a row with a unique restaurant inspection for the most-recent inspection.

I’m not sure if this is the best or most efficient way to do this, but it works and took about 14s to delete the unneeded rows for 398,878 rows on a low-end VPS.