Posts Tagged ‘postgresql’

Full-text search with PostgreSQL

I spent some time experimenting with PostgreSQL’s full-text search functionality, which has been available to some degree since v8.3. If Postgres is already being used as a data store, this functionality is attractive as it provides a simple way to implement non-trivial search without the need to build out additional infrastructure and code (e.g. an Elasticsearch cluster + application code to load data into Elasticsearch and keep it up-to-date).

I experimented with basic querying and ranking using the Lexiio database; the definitions table in particular provides a good dataset to work with, containing 604,076 term definitions.

Querying

Below is a sample query where we search the definitions for the phrase “to break into small pieces”; ranking each item in the result set and ordering the results by rank is covered in the Ranking section further down.

SELECT id, definition
FROM definitions
WHERE to_tsvector('english', definition) @@ plainto_tsquery('english', 'to break into small pieces')

Understanding this query is mostly about understanding some vendor-specific SQL.

  • The tsvector type represents a sorted list of lexemes. to_tsvector(…) is a function to convert raw text to a tsvector.
  • The tsquery type represents lexemes to be searched for and the operators combining them. to_tsquery(…) is a function to convert raw text to a tsquery; plainto_tsquery(…), used in the query above, does the same but accepts plain, unformatted text.
  • @@ is the match operator; it’s a binary operator that takes a tsvector and a tsquery and returns whether the tsvector matches the tsquery (demonstrated below).

[Image: PostgreSQL full-text search query]

To get a better understanding of these types, it can be helpful to run the conversion functions with a few phrases.

SELECT to_tsvector('english', 'to break into small pieces');

"'break':2 'piec':5 'small':4"
SELECT plainto_tsquery('english', 'to break into small pieces');

"'break' & 'small' & 'piec'"

Ranking

The ranking of a full-text search match can be computed with either ts_rank(…), which provides a standard ranking, or ts_rank_cd(…), which gives a cover density ranking (where ranking also takes term proximity and co-occurrence into account, as described in “Relevance ranking for one to three term queries”).

SELECT
    id,
    definition,
    ts_rank(
        to_tsvector('english', definition),
        plainto_tsquery('english', 'to break into small pieces')
    ) AS rank
FROM definitions
WHERE to_tsvector('english', definition) @@ plainto_tsquery('english', 'to break into small pieces')
ORDER BY rank DESC

Higher rank values correspond to more relevant search results.

Here’s the result set, with rankings, for the query above:

   id   | definition                                                                           | rank
--------+--------------------------------------------------------------------------------------+----------
 568352 | # {{transitive}} To [[break]] small pieces from.                                     | 0.26833
 135231 | # Resistant to chipping (breaking into small pieces).                                | 0.266913
 572891 | # {{transitive}} To break into small pieces or fragments.                            | 0.266913
 568348 | # {{transitive}} To [[break]] into small pieces.                                     | 0.266913
 176962 | # To break into crumbs or small pieces with the fingers; to [[crumble]].             | 0.25948
  50744 | # A small piece of [[detailing]] added to break up the [[surface]] of an [[object]] and add [[visual]] interest, particularly in [[movie]] [[special effect]]s. | 0.25134
 568350 | # {{transitive}} To [[break]] open or [[crush]] to small pieces by impact or stress. | 0.25134
 572890 | # {{transitive}} To break into [[fragment]]s or small pieces.                        | 0.25134
 547405 | # {{surgery}} The [[operation]] of breaking a [[stone]] in the [[bladder]] into small pieces capable of being [[void]]ed. | 0.221355
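
For comparison, getting the cover density ranking mentioned above is just a matter of swapping ts_rank(…) for ts_rank_cd(…); the rest of the query is unchanged (the rank values will generally differ from those shown above):

SELECT
    id,
    definition,
    ts_rank_cd(
        to_tsvector('english', definition),
        plainto_tsquery('english', 'to break into small pieces')
    ) AS rank
FROM definitions
WHERE to_tsvector('english', definition) @@ plainto_tsquery('english', 'to break into small pieces')
ORDER BY rank DESC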

Indexing

Without an index, the above query takes ~5.3 seconds on my local machine (i7-4790K @ 3.7GHz, Intel 730 Series SSD, DDR3 1600 RAM w/ more than enough available to PG).
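
To see what the planner is doing here (with nothing to index against, it should fall back to a sequential scan over definitions), EXPLAIN ANALYZE can be prepended to the query:

EXPLAIN ANALYZE
SELECT id, definition
FROM definitions
WHERE to_tsvector('english', definition) @@ plainto_tsquery('english', 'to break into small pieces');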

A Generalized Inverted Index (GIN) is recommended for full-text search. A GIN index can be created directly on a tsvector column or, in this case where there’s an existing text column, an expression index can be created using the to_tsvector() function.

CREATE INDEX ix_def
ON definitions
USING GIN(to_tsvector('english', definition));

With this index in place, performance improves drastically, with query times dropping to ~13 milliseconds.
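
For reference, the other option mentioned above, a dedicated tsvector column indexed directly, would look something like the sketch below. The column name definition_tsv is made up for illustration, and the built-in tsvector_update_trigger keeps the column in sync on writes; queries would then match against definition_tsv rather than the to_tsvector(…) expression.

-- Add and populate a tsvector column (definition_tsv is a hypothetical name)
ALTER TABLE definitions ADD COLUMN definition_tsv tsvector;
UPDATE definitions SET definition_tsv = to_tsvector('english', definition);

-- Keep the column up-to-date on INSERT/UPDATE
CREATE TRIGGER definitions_tsv_update
BEFORE INSERT OR UPDATE ON definitions
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(definition_tsv, 'pg_catalog.english', definition);

-- Index the column directly
CREATE INDEX ix_def_tsv ON definitions USING GIN(definition_tsv);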

Is it worth trying?

Maybe. If you’re not using Postgres as the data store for what needs to be searched over (i.e. you’d have to continually ETL data into Postgres), if you already have a sophisticated search solution in place, or if you’re operating at a scale where you need a clustered solution, then probably not. However, if you’re using Postgres and looking to implement search in an application, or to move beyond simple substring search, what Postgres offers is fairly powerful and worth trying out.

PostgreSQL database import with Ansible

I had a hard time pulling together all the steps needed to import a PostgreSQL database using Ansible. Here are the Ansible YAML blocks I used to import the seed database for Lexiio.

1. Install PostgreSQL

- name: Install Postgres
  apt: name={{ item }} update_cache=yes cache_valid_time=3600 state=present
  sudo: yes
  with_items:
    - postgresql
    - postgresql-contrib
    - libpq-dev
    - python-psycopg2
  tags: packages

2. Create the database (lexiiodb), using UTF-8 for encoding and collation

- name: Create lexiiodb database
  sudo_user: postgres
  postgresql_db: name=lexiiodb encoding='UTF-8' lc_collate='en_US.UTF-8' lc_ctype='en_US.UTF-8' state=present

3. Create a role that will be granted access to the database (password is a variable read from some secret source)

- name: Create lexiio role for database
  sudo_user: postgres
  postgresql_user: db=lexiiodb user=lexiio password="{{ password }}" priv=ALL state=present

4. Start the PostgreSQL service

- name: Start the Postgresql service
  sudo: yes
  service:
    name: postgresql
    state: started
    enabled: true

5. Import data into the database (using psql to pull in data from /home/lexiiodb.dump.sql)

- name: Importing lexiiodb data
  sudo_user: postgres
  shell: psql lexiiodb < /home/lexiiodb.dump.sql

6. For the role created, grant usage on the database’s dictionary schema

- name: Grant usage of schema to lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb state=present privs=USAGE type=schema roles=lexiio objs=dictionary

7. For the role created, grant permissions on all tables in the dictionary schema

- name: Grant table permissions for lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb schema=dictionary state=present privs=SELECT,INSERT,UPDATE type=table roles=lexiio grant_option=no objs=ALL_IN_SCHEMA

8. For the role created, grant permissions on all sequences in the dictionary schema

- name: Grant sequence permissions for lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb schema=dictionary state=present privs=USAGE type=sequence roles=lexiio grant_option=no objs=ALL_IN_SCHEMA

A more relational dictionary

As I started looking to add more functionality to Lexiio, I realized the Wiktionary definitions database dump I was using wasn’t going to cut it; specifically, I needed a normalized schema, or I’d have data duplication all over the place. I started normalizing in MySQL but, whether it was MySQL or MySQL Workbench, I kept running into character encoding issues. Using a simple INSERT-SELECT in MySQL 5.7 to transfer words from the existing table to a new table resulted in losing characters:

[Image: MySQL losing characters]
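
The statement involved was nothing exotic; it was roughly something like the following (the table and column names here are made up for illustration):

-- Hypothetical source/target tables; the real schema differs
INSERT INTO words (word)
SELECT word
FROM definitions_raw;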

I dumped the data into PostgreSQL, didn’t encounter the issue, and just kept working from there.

The normalized schema can be downloaded here: LexiioDB normalized
(released under the Creative Commons Attribution-ShareAlike License)

[Image: LexiioDB schema]

The unknown_words and unknown_to_similar_words tables are specific to Lexiio and serve as a place to store unknown words entered by the user, along with close/similar matches to known words (found via Levenshtein distance).
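
As a rough sketch of how that kind of lookup can work in PostgreSQL (this assumes the fuzzystrmatch extension and a hypothetical words table with a word column; it’s not necessarily the exact query Lexiio uses):

-- levenshtein() is provided by the fuzzystrmatch extension
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Closest known words to an unknown/misspelled term
SELECT word, levenshtein(word, 'recieve') AS distance
FROM words
ORDER BY distance
LIMIT 5;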

NYC Data Mine, restaurant inspection data

I’ve just finished importing the current restaurant inspection data from the NYC Data Mine into a PostgreSQL database. It wasn’t the most difficult migration, but it was more difficult than it should have been, as the raw data from the data mine is messy and not well-formed; a typical problem with many of the data sets in the NYC Data Mine. I came across a great post by Steven Romalewski (director of the CUNY Mapping Service) about the poor data quality and poor metadata, based on his experiences.

From looking at the restaurant inspection data and skimming a few other sets, I get the sense that structured and relational data simply isn’t understood or handled well. To be fair, there’s a very real lack of tools in the market, at least at the consumer/data-entry level, for handling such data, so it’s not surprising that everything gets jerry-rigged into an Excel worksheet. This is very clear when looking at the restaurant inspection data: you notice right away that restaurant ids and names are repeated across multiple rows.

In any case, the restaurant inspection data is better than most of the sets, but there are a few issues to take note of:

  • In multiple cases the same row, with the exact same data, is repeated.
  • There are 2 columns for the inspection date: INSPDATE and GRADEDATE; GRADEDATE = INSPDATE if there’s a letter grade for the restaurant, otherwise it’s blank/null.
  • Most glaring, there are invalid timestamps in the GRADEDATE column for 2 restaurants (and, of course, across multiple rows, as the restaurants have multiple entries), CAPRI RESTAURANT and MAMA LUCIA:

    [Image: timestamp problem]

For my purposes, I only wanted the most recent inspection result (i.e. the row with the latest INSPDATE timestamp). To do this, I added an additional column for a serial/auto_increment id number. Then, once imported, I deleted the unneeded rows with the following query:

/* table is restaurant
   id = CAMIS
   inspection_score_date = INSPDATE
   internal_id = serial/auto_increment id number
*/

DELETE FROM restaurant
WHERE internal_id NOT IN
(
    SELECT MAX(restaurant.internal_id) AS max_iid
    FROM restaurant,
        (SELECT id, dba, MAX(inspection_score_date) AS last_inspt
         FROM restaurant
         GROUP BY id, dba) AS sub
    WHERE restaurant.id = sub.id
        AND restaurant.inspection_score_date = sub.last_inspt
    GROUP BY restaurant.id
)

The innermost subquery pulls the rows with the most recent inspection date; the outer subquery takes care of duplicate rows with the same inspection date by simply taking the row with the max internal id number. What results is a column of internal id numbers, each representing a row with a unique restaurant inspection for the most recent inspection.

I’m not sure if this is the best or most efficient way to do this, but it works: it took about 14 seconds to delete the unneeded rows from the 398,878-row table on a low-end VPS.
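
For what it’s worth, PostgreSQL’s DISTINCT ON offers another way to express “latest inspection per restaurant”; the sketch below only identifies the rows to keep (it’s not a drop-in replacement for the DELETE above), but the same ordering logic applies:

-- One row per restaurant id: latest inspection date, ties broken by highest internal_id
SELECT DISTINCT ON (id) *
FROM restaurant
ORDER BY id, inspection_score_date DESC, internal_id DESC;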

PostgreSQL + PHP installation on Windows 2003 x64

Well, the PostgreSQL installation itself is easy enough; getting it to work with PHP is the challenging part. Here’s what I did: