Posts Tagged ‘search’

Full-text search with PostgreSQL

I spent some time experimenting with PostgresSQL’s full-text search functionality, which has been available to some degree since v8.3. If Postgres is already being used as a data store, this functionality is attractive as it provides a simple way to implement non-trivial search without the need to build out additional infrastructure and code (e.g. an Elasticsearch cluster + application code to load data into Elasticsearch and keep it up-to-date).

I experimented with basic querying and ranking using the Lexiio database, the definitions table in particular provides a good dataset to work with, containing 604,076 term definitions.

Querying

Below is a sample query where we search for the phrase “to break into small pieces”, rank each item in the result set, and order the results based on their rank.

SELECT id, definition
FROM definitions
WHERE to_tsvector('english', definition) @@ plainto_tsquery('english', 'to break into small pieces')

Understanding this query is mostly about understanding some vendor-specific SQL.

  • The tsvector type represents is a sorted list of lexemes. to_tsvector(…) is a function to convert raw text to a tsvector.
  • The tsquery type represents lexemes to be searched for and the operators combining them. to_tsquery(…) is a function to convert raw text to a tsquery.
  • @@ is the match operator, this is a binary operator which take a tsvector and a tsquery

PostgreSQL full-text search query

To get a better understanding of of these types, it can be helpful to run the conversion functions with a few phrases.

SELECT to_tsvector('english', 'to break into small pieces');

"'break':2 'piec':5 'small':4"
SELECT plainto_tsquery('english', 'to break into small pieces');

"'break' & 'small' & 'piec'"

Ranking

The ranking of a full-text search match can be computed with either ts_rank(…), which provides a standard ranking, or ts_rank_cd(…), which gives a coverage density ranking (where ranking is also based on term proximity and cooccurrence, as described in “Relevance ranking for one to three term queries”).

SELECT
id,
definition,
ts_rank(
to_tsvector('english', definition),
plainto_tsquery('english', 'to break into small pieces')
)
AS rank
FROM definitions
WHERE to_tsvector('english', definition) @@ plainto_tsquery('english', 'to break into small pieces')
ORDER BY rank DESC

Higher rank values correspond to more relevant search results.

Here’s the result set, with rankings, for the query above:

+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | id | definition | rank | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 568352 | # {{transitive}} To [[break]] small pieces from. | 0.26833 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 135231 | # Resistant to chipping (breaking into small pieces). | 0.266913 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 572891 | # {{transitive}} To break into small pieces or fragments. | 0.266913 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 568348 | # {{transitive}} To [[break]] into small pieces. | 0.266913 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 176962 | # To break into crumbs or small pieces with the fingers; to [[crumble]]. | 0.25948 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 50744 | # A small piece of [[detailing]] added to break up the [[surface]] of an [[object]] and add [[visual]] interest, particularly in [[movie]] [[special effect]]s. | 0.25134 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 568350 | # {{transitive}} To [[break]] open or [[crush]] to small pieces by impact or stress. | 0.25134 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 572890 | # {{transitive}} To break into [[fragment]]s or small pieces. | 0.25134 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ | 547405 | # {{surgery}} The [[operation]] of breaking a [[stone]] in the [[bladder]] into small pieces capable of being [[void]]ed. | 0.221355 | +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+

Indexing

Without an index the above query takes ~5.3 seconds on my local machine (i7-4790K @ 3.7GHz, Intel 730 Series SSD, DDR3 1600 RAM w/ more than enough available to PG).

A Generalized Inverted Index (GIN) is recommended for full-text search. A GIN index can be created directly on a tsvector column or, in this case where there’s an existing text column, an expression index can be created using the to_tsvector() function.

CREATE INDEX ix_def
ON definitions
USING GIN(to_tsvector('english', definition));

With this index, performance improves drastically, with query times ~13 milliseconds.

Is it worth trying?

Maybe. If you’re not using Postgres as the data store for what needs to be searched over (i.e. you’d have to continually ETL data into Postgres), you already have a sophisticated search solution in place, or you’re operating at a scale where you need a clustered solution, probably not. However, if you’re using Postgres, and looking to implement search in an application or move beyond simple substring search, what Postgres is offering is fairly powerful and worth trying out.

Implementing a “did you mean…?” function

A while ago I became interested in how one would go about implementing something akin to Google Search’s “did you mean…” function. This answer from StackOverflow provides a good overview of what Google does, which involves looking at a user’s incorrect entry as well as a subsequent correction provided by the user. Data mining both the incorrect entry and correction, for millions of users and billions (trillions?) of entries, Google Search can thus make an intelligent guess as to what a user really meant when an incorrect entry is submitted. While most applications can’t do this at Google Search’s scale or generality (which is a span of terms from across the entire web), I can certainly see this model working for smaller applications, which only need to deal with a smaller subset of terms (this blog for example only needs to handle terms that are in my posts). Remove the data capture aspect, providing a fixed database of terms from which to search, and implementation becomes even simpler!

did you mean google?

Digging deeper into implementation details, there’s the problem of figuring out how closely an input string (X) matches each string in a database of terms (Ti). The closeness or distance here is the edit distance between the 2 strings, or the minimum number of edits it takes to turn one string into another. There are a number of edit distance algorithms, but the Levenshtein distance seems to be a popular choice.

For a simple did-you-mean suggester, computing the edit distance is the crux of the method.

To test things out, I did a simple project in C++. Using a straightforward implementation of the Levenshtein distance and a vector of 112 chemical elements (up to copernicium), I wrote a program that would prompt the user for the name of an element, if the element was found it would output “ELEMENT FOUND”, if not it would suggest the name of an element based on the user’s input.

Includes

#include <iostream>
#include <string>
#include <vector>

Levenshtein distance implementation

int ld(const std::string& strA, const std::string& strB)
{
    
int lenA = strA.length() + 1;
    
int lenB = strB.length() + 1;

    
int** mat;
    mat =
new int*[lenA];
    
for(int i=0; i<lenA; i++)
    {
        mat[i] =
new int[lenB];
    }

    
for(int i=0; i<lenA; i++)
    {
        mat[i][0] = i;
    }
    
    
for(int j=0; j<lenB; j++)
    {
        mat[0][j] = j;
    }

    
for(int i=1; i<=lenA-1; i++)
    {
        
for(int j=1; j<=lenB-1; j++)
        {
            
if(strA[i-1] == strB[j-1])
            {
                mat[i][j] = mat[i-1][j-1];
            }
            
else
            
{
                mat[i][j] = std::min(mat[i-1][j-1]+1, std::min(mat[i-1][j] + 1, mat[i][j-1] + 1) );
            }
        }
    }

    
int ret = mat[lenA-1][lenB-1];

    
// memory cleanup
    
for(int i=0; i<lenA; i++)
    {
        
delete [] mat[i];
    }
    
delete [] mat;

    
return ret;
}

Function to construct std::vector of chemical elements

std::vector<std::string> make_elements_vector()
{
    std::vector<std::string> elements;

    elements.push_back(
"hydrogen");
    elements.push_back(
"helium");
    elements.push_back(
"lithium");
    elements.push_back(
"beryllium");
    elements.push_back(
"boron");
    elements.push_back(
"carbon");
    elements.push_back(
"nitrogen");
    elements.push_back(
"oxygen");
    elements.push_back(
"fluorine");
    elements.push_back(
"neon");
    elements.push_back(
"sodium");
    elements.push_back(
"magnesium");
    elements.push_back(
"aluminium");
    elements.push_back(
"silicon");
    elements.push_back(
"phosphorus");
    elements.push_back(
"sulfur");
    elements.push_back(
"chlorine");
    elements.push_back(
"argon");
    elements.push_back(
"potassium");
    elements.push_back(
"calcium");
    elements.push_back(
"scandium");
    elements.push_back(
"titanium");
    elements.push_back(
"vanadium");
    elements.push_back(
"chromium");
    elements.push_back(
"manganese");
    elements.push_back(
"iron");
    elements.push_back(
"cobalt");
    elements.push_back(
"nickel");
    elements.push_back(
"copper");
    elements.push_back(
"zinc");
    elements.push_back(
"gallium");
    elements.push_back(
"germanium");
    elements.push_back(
"arsenic");
    elements.push_back(
"selenium");
    elements.push_back(
"bromine");
    elements.push_back(
"krypton");
    elements.push_back(
"rubidium");
    elements.push_back(
"strontium");
    elements.push_back(
"yttrium");
    elements.push_back(
"zirconium");
    elements.push_back(
"niobium");
    elements.push_back(
"molybdenum");
    elements.push_back(
"technetium");
    elements.push_back(
"ruthenium");
    elements.push_back(
"rhodium");
    elements.push_back(
"palladium");
    elements.push_back(
"silver");
    elements.push_back(
"cadmium");
    elements.push_back(
"indium");
    elements.push_back(
"tin");
    elements.push_back(
"antimony");
    elements.push_back(
"tellurium");
    elements.push_back(
"iodine");
    elements.push_back(
"xenon");
    elements.push_back(
"caesium");
    elements.push_back(
"barium");
    elements.push_back(
"lanthanum");
    elements.push_back(
"cerium");
    elements.push_back(
"praseodymium");
    elements.push_back(
"neodymium");
    elements.push_back(
"promethium");
    elements.push_back(
"samarium");
    elements.push_back(
"europium");
    elements.push_back(
"gadolinium");
    elements.push_back(
"terbium");
    elements.push_back(
"dysprosium");
    elements.push_back(
"holmium");
    elements.push_back(
"erbium");
    elements.push_back(
"thulium");
    elements.push_back(
"ytterbium");
    elements.push_back(
"lutetium");
    elements.push_back(
"hafnium");
    elements.push_back(
"tantalum");
    elements.push_back(
"tungsten");
    elements.push_back(
"rhenium");
    elements.push_back(
"osmium");
    elements.push_back(
"iridium");
    elements.push_back(
"platinum");
    elements.push_back(
"gold");
    elements.push_back(
"mercury");
    elements.push_back(
"thallium");
    elements.push_back(
"lead");
    elements.push_back(
"bismuth");
    elements.push_back(
"polonium");
    elements.push_back(
"astatine");
    elements.push_back(
"radon");
    elements.push_back(
"francium");
    elements.push_back(
"radium");
    elements.push_back(
"actinium");
    elements.push_back(
"thorium");
    elements.push_back(
"protactinium");
    elements.push_back(
"uranium");
    elements.push_back(
"neptunium");
    elements.push_back(
"plutonium");
    elements.push_back(
"americium");
    elements.push_back(
"curium");
    elements.push_back(
"berkelium");
    elements.push_back(
"californium");
    elements.push_back(
"einsteinium");
    elements.push_back(
"fermium");
    elements.push_back(
"mendelevium");
    elements.push_back(
"nobelium");
    elements.push_back(
"lawrencium");
    elements.push_back(
"rutherfordium");
    elements.push_back(
"dubnium");
    elements.push_back(
"seaborgium");
    elements.push_back(
"bohrium");
    elements.push_back(
"hassium");
    elements.push_back(
"meitnerium");
    elements.push_back(
"darmstadtium");
    elements.push_back(
"roentgenium");
    elements.push_back(
"copernicium");

    
return elements;
}

Application Logic

int main(int argc, char* argv[])
{

    std::vector<std::string> elements = make_elements_vector();

    std::cout <<
"What element are you attempting to find? ";
    std::string inputStr;
    std::cin >> inputStr;

    size_t minDistIndex = 0;
    
int minDist = INT_MAX;

    
for(size_t i=0; i<elements.size(); i++)
    {
        
int dist = ld(elements[i], inputStr);

        
if(dist < minDist)
        {
            minDist = dist;
            minDistIndex = i;
        }
    }

    
if(minDist == 0)
    {
        std::cout <<
"ELEMENT FOUND!" << std::endl;
    }
    
else
    
{
        std::string dym =
"Did you mean " + elements[minDistIndex] + "?";
        std::cout << dym << std::endl;
    }

    
return 0;
}

Search for “hydrogen”

hydrogen, element found

Search for “tillium”

tillium, did you mean gallium?

This little demo works surprisingly well and suggestions are more-or-less inline with what you’d expect. There are obviously limitations as you think about applying this to other domains as language, context, etc. are not taken into consideration, but as a simple suggester it holds up pretty well and is perhaps a nice addition to a number of search methods in a variety of applications (the majority of which don’t seem to implement anything of the sort).

AltaVista

Everyone’s mourning the death (perhaps prematurely) of delicious but it’s worth nothing that the search engine AltaVista is going to be shut down as well; its time has come, but it’s still somewhat sad to see it go.

altavista logo

Steven Vaughan-Nichols @ ZDNet echoed my sentiments perfectly,

“AltaVista was Google before there was Google.”

If you remember web search in the 90s you, of course, remember Yahoo, but Yahoo along with all of its contemporaries snubbed search in favor of navigating and drilling-down into specific categories. If what you were looking for didn’t fit nicely into a category, it was like finding a needle in a haystack, assuming you were able to find the needle at all. Another not-so-fond memory of web search in the 90s is the continual intrusion of more and more invasive ads. Yahoo was cleaner than most, but sites like Excite peppered their sites with popups and banners; I distinctly remember by system freezing simply from going to excite.com.

AltaVista was revolutionary: sleek, minimal design focusing on text input (though, there was categories at a certain point), good results, and no annoying ads.

So, goodbye AltaVista, and thanks for all the fish.