The Economist recently wrote a bit about how speech recognition got so good:
… words do not appear in random order, so the computer does not have to guess from (say) a vocabulary of 20,000 words for each word you speak. Instead, the software assesses how likely you are to have said a given word based on the surrounding words, drawing on statistical models derived from vast repositories of digitised documents and the previous utterances of other users.
This reminded me of a talk by Peter Norvig: The Unreasonable Effectiveness of Data, where he discusses utilizing such large repositories of data in order to develop effective algorithms for a number of problems; there is a heavy focus on natural language processing problems but the concept can, of course, be applied in other areas.
(If the name Peter Norvig sounds familiar, he’s the co-author of Artificial Intelligence: A Modern Approach which you might have used if you ever took an AI class.)
As a programmer, this is exciting stuff and certainly changed my thinking in regards to how I would approach similar problems in the future. Whereas before I would look at sample data sets and try to derive an algorithm, I’d now attempt to mine as much data as I could, build a statistical model, and use that as the basis of the algorithm. Of course mining a massive data set is sometimes easier said than done; especially in regards to data, much of the web is still a walled garden.