Implementing a “did you mean…?” function
Avishkar Autar · Apr 7 2012 · Random
A while ago I became interested in how one would go about implementing something akin to Google Search’s “did you mean…” function. This answer from StackOverflow provides a good overview of what Google does, which involves looking at a user’s incorrect entry as well as a subsequent correction provided by the user. Data mining both the incorrect entry and correction, for millions of users and billions (trillions?) of entries, Google Search can thus make an intelligent guess as to what a user really meant when an incorrect entry is submitted. While most applications can’t do this at Google Search’s scale or generality (which is a span of terms from across the entire web), I can certainly see this model working for smaller applications, which only need to deal with a smaller subset of terms (this blog for example only needs to handle terms that are in my posts). Remove the data capture aspect, providing a fixed database of terms from which to search, and implementation becomes even simpler!
Digging deeper into implementation details, there’s the problem of figuring out how closely an input string (X) matches each string in a database of terms (Ti). The closeness or distance here is the edit distance between the 2 strings, or the minimum number of edits it takes to turn one string into another. There are a number of edit distance algorithms, but the Levenshtein distance seems to be a popular choice.
For a simple did-you-mean suggester, computing the edit distance is the crux of the method.
To test things out, I did a simple project in C++. Using a straightforward implementation of the Levenshtein distance and a vector of 112 chemical elements (up to copernicium), I wrote a program that would prompt the user for the name of an element, if the element was found it would output “ELEMENT FOUND”, if not it would suggest the name of an element based on the user’s input.
Includes
#include <iostream>
#include <string>
#include <vector>
Levenshtein distance implementation
int ld(const std::string& strA, const std::string& strB)
{
int lenA = strA.length() + 1;
int lenB = strB.length() + 1;
int** mat;
mat = new int*[lenA];
for(int i=0; i<lenA; i++)
{
mat[i] = new int[lenB];
}
for(int i=0; i<lenA; i++)
{
mat[i][0] = i;
}
for(int j=0; j<lenB; j++)
{
mat[0][j] = j;
}
for(int i=1; i<=lenA-1; i++)
{
for(int j=1; j<=lenB-1; j++)
{
if(strA[i-1] == strB[j-1])
{
mat[i][j] = mat[i-1][j-1];
}
else
{
mat[i][j] = std::min(mat[i-1][j-1]+1, std::min(mat[i-1][j] + 1, mat[i][j-1] + 1) );
}
}
}
int ret = mat[lenA-1][lenB-1];
// memory cleanup
for(int i=0; i<lenA; i++)
{
delete [] mat[i];
}
delete [] mat;
return ret;
}
Function to construct std::vector of chemical elements
std::vector<std::string> make_elements_vector()
{
std::vector<std::string> elements;
elements.push_back("hydrogen");
elements.push_back("helium");
elements.push_back("lithium");
elements.push_back("beryllium");
elements.push_back("boron");
elements.push_back("carbon");
elements.push_back("nitrogen");
elements.push_back("oxygen");
elements.push_back("fluorine");
elements.push_back("neon");
elements.push_back("sodium");
elements.push_back("magnesium");
elements.push_back("aluminium");
elements.push_back("silicon");
elements.push_back("phosphorus");
elements.push_back("sulfur");
elements.push_back("chlorine");
elements.push_back("argon");
elements.push_back("potassium");
elements.push_back("calcium");
elements.push_back("scandium");
elements.push_back("titanium");
elements.push_back("vanadium");
elements.push_back("chromium");
elements.push_back("manganese");
elements.push_back("iron");
elements.push_back("cobalt");
elements.push_back("nickel");
elements.push_back("copper");
elements.push_back("zinc");
elements.push_back("gallium");
elements.push_back("germanium");
elements.push_back("arsenic");
elements.push_back("selenium");
elements.push_back("bromine");
elements.push_back("krypton");
elements.push_back("rubidium");
elements.push_back("strontium");
elements.push_back("yttrium");
elements.push_back("zirconium");
elements.push_back("niobium");
elements.push_back("molybdenum");
elements.push_back("technetium");
elements.push_back("ruthenium");
elements.push_back("rhodium");
elements.push_back("palladium");
elements.push_back("silver");
elements.push_back("cadmium");
elements.push_back("indium");
elements.push_back("tin");
elements.push_back("antimony");
elements.push_back("tellurium");
elements.push_back("iodine");
elements.push_back("xenon");
elements.push_back("caesium");
elements.push_back("barium");
elements.push_back("lanthanum");
elements.push_back("cerium");
elements.push_back("praseodymium");
elements.push_back("neodymium");
elements.push_back("promethium");
elements.push_back("samarium");
elements.push_back("europium");
elements.push_back("gadolinium");
elements.push_back("terbium");
elements.push_back("dysprosium");
elements.push_back("holmium");
elements.push_back("erbium");
elements.push_back("thulium");
elements.push_back("ytterbium");
elements.push_back("lutetium");
elements.push_back("hafnium");
elements.push_back("tantalum");
elements.push_back("tungsten");
elements.push_back("rhenium");
elements.push_back("osmium");
elements.push_back("iridium");
elements.push_back("platinum");
elements.push_back("gold");
elements.push_back("mercury");
elements.push_back("thallium");
elements.push_back("lead");
elements.push_back("bismuth");
elements.push_back("polonium");
elements.push_back("astatine");
elements.push_back("radon");
elements.push_back("francium");
elements.push_back("radium");
elements.push_back("actinium");
elements.push_back("thorium");
elements.push_back("protactinium");
elements.push_back("uranium");
elements.push_back("neptunium");
elements.push_back("plutonium");
elements.push_back("americium");
elements.push_back("curium");
elements.push_back("berkelium");
elements.push_back("californium");
elements.push_back("einsteinium");
elements.push_back("fermium");
elements.push_back("mendelevium");
elements.push_back("nobelium");
elements.push_back("lawrencium");
elements.push_back("rutherfordium");
elements.push_back("dubnium");
elements.push_back("seaborgium");
elements.push_back("bohrium");
elements.push_back("hassium");
elements.push_back("meitnerium");
elements.push_back("darmstadtium");
elements.push_back("roentgenium");
elements.push_back("copernicium");
return elements;
}
Application Logic
int main(int argc, char* argv[])
{
std::vector<std::string> elements = make_elements_vector();
std::cout << "What element are you attempting to find? ";
std::string inputStr;
std::cin >> inputStr;
size_t minDistIndex = 0;
int minDist = INT_MAX;
for(size_t i=0; i<elements.size(); i++)
{
int dist = ld(elements[i], inputStr);
if(dist < minDist)
{
minDist = dist;
minDistIndex = i;
}
}
if(minDist == 0)
{
std::cout << "ELEMENT FOUND!" << std::endl;
}
else
{
std::string dym = "Did you mean " + elements[minDistIndex] + "?";
std::cout << dym << std::endl;
}
return 0;
}
Search for “hydrogen”
Search for “tillium”
This little demo works surprisingly well and suggestions are more-or-less inline with what you’d expect. There are obviously limitations as you think about applying this to other domains as language, context, etc. are not taken into consideration, but as a simple suggester it holds up pretty well and is perhaps a nice addition to a number of search methods in a variety of applications (the majority of which don’t seem to implement anything of the sort).
nice C implementation for fast levenshtein performance