Archive for the ‘Databases’ Category

Are spreadsheets databases?

A few weeks ago, news that using Excel resulted in the loss of ~16,000 coronavirus cases in England, due to a 65K row limit in XLS files, sparked a number of tweets about how Excel isn’t a database. There’s a lot to cringe at here, but saying that Excel, or more precisely spreadsheets, aren’t databases, and that using a “proper database” would have prevented the failure, is incredibly reductive.

excel is not a database

Definition

To start, it’s worth looking at what the definition of “database” actually is:

definition of database

At least from that definition, I think it’s fair to say spreadsheets are databases. They’re primitive, with little to nothing in the way of concurrency, security, constraints, etc., but they are databases.

In software engineering, the term “database” is typically shorthand for a relational database, but that feels more and more problematic, as we now see many more databases in use that aren’t relational (Cassandra, Mongo, DynamoDB, etc.). Even S3 now serves as a foundational basis for many databases.

System Considerations

Beyond definitions, what’s also interesting here is that tooling and database choice is never a simple equation for a non-trivial data system; there’s a host of considerations that come into play. Here are a few that come to mind:

  • Interfacing: Who’s accessing and/or manipulating data in the system? Sophisticated data systems and bespoke interfaces are powerful, but they require training and expertise, and there’s typically a higher maintenance burden. Leveraging common and ubiquitous tooling can be beneficial when it comes to interfacing needs for a larger audience. This comment on ArsTechnica points out that the “proper tools” are inordinately complex and not something a typical end-user can pick up and understand easily, and I think that’s a fair assessment of the product landscape.
  • System limits: Every data system has limits; some are explicit and obvious, some are not. It’s also not surprising to bump into limits due to how a database is set up or how a schema is designed. Hitting a 65K row limit is frustrating and problematic, but so is discovering a value was truncated because the field length was set too small or an incorrect type was used.
  • Cost: What is the cost of the technical infrastructure? What about the cost of the people needed to maintain the system? Unsurprisingly, more sophisticated and complex systems will cost more.
  • Failure modes: What are some common ways this system fails? What does it take to recover and get back to normal operations? With simple systems you tend to just hit hard limits, but more complex systems fail in a multitude of ways.
  • Time: When does this need to be shipped and what compromises need to be made? When you don’t have weeks or months to design and prototype, leveraging pre-existing and proven methods is typically the path of least resistance.

It’s perhaps easy to point to some of the issues that come into play with Excel and spreadsheets, but any data system will have its fair share of limits and risks, along with any potential benefits.

Installing ODBC Driver 17 for SQL Server on Debian

For a project at work, I went through a bit of a struggle getting the Microsoft ODBC Driver 17 for SQL Server installed on a barebones Debian system (this was for a Docker container: no SSH, no package manager). Here’s what I discovered and did to finally get it working.

Dependencies

These are the base dependencies to be able to invoke the driver:

  • libltdl7
  • libodbc1
  • odbcinst
  • odbcinst1debian2
  • locales-all

The locales-all package isn’t strictly necessary, but some locale configuration is, or you’ll run into the error locale::facet::_S_create_c_locale name not valid when trying to run the sqlcmd utility. This article provides a bit more detail.

ODBC Configuration

After installing the dependencies, configure the driver by appending the config details to /etc/odbcinst.ini:

[ODBC Driver 17 for SQL Server]
Description=Microsoft ODBC Driver 17 for SQL Server
Driver=/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1
UsageCount=1

OpenSSL Issue

The final issue I ran into was super cryptic; when trying to connect to an MSSQL instance, I got the following error:

SQLSTATE[08001]: [Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: Error code 0x2746

The issue is described in this GitHub issue. In short, this is a compatibility issue between the driver and OpenSSL on Debian 10, which has disabled SHA1 for signatures. The fix involves editing /etc/ssl/openssl.cnf and changing the last line from CipherString = DEFAULT@SECLEVEL=2 to CipherString = DEFAULT@SECLEVEL=1. Of course, consider whether doing this is a security risk in your environment and for your use case.

A connection!

With the above done, I was able to successfully connect to the MSSQL instance using sqlcmd. I was hoping at this point that things would “just work” with pyodbc, but that wasn’t the case (the connection would simply hang when attempting to connect: no timeouts, no errors). So, at this point, I’m considering whether it’s still worth trying to use this driver or whether to stick with the existing one (FreeTDS).
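
As an aside, a quick way to sanity-check the connection once sqlcmd is working is a trivial query (@@VERSION is a built-in T-SQL variable; GO is sqlcmd’s batch terminator):

-- Run inside sqlcmd to confirm the driver can connect and execute a query
SELECT @@VERSION;
GO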

Full-text search with PostgreSQL

I spent some time experimenting with PostgreSQL’s full-text search functionality, which has been available to some degree since v8.3. If Postgres is already being used as a data store, this functionality is attractive, as it provides a simple way to implement non-trivial search without the need to build out additional infrastructure and code (e.g. an Elasticsearch cluster + application code to load data into Elasticsearch and keep it up-to-date).

I experimented with basic querying and ranking using the Lexiio database; the definitions table in particular provides a good dataset to work with, containing 604,076 term definitions.

Querying

Below is a sample query where we search for the phrase “to break into small pieces” (ranking and ordering the results comes later):

SELECT id, definition
FROM definitions
WHERE to_tsvector('english', definition) @@ plainto_tsquery('english', 'to break into small pieces');

Understanding this query is mostly about understanding some vendor-specific SQL.

  • The tsvector type represents a sorted list of lexemes. to_tsvector(…) is a function to convert raw text to a tsvector.
  • The tsquery type represents lexemes to be searched for and the operators combining them. to_tsquery(…) and plainto_tsquery(…) are functions to convert raw text to a tsquery.
  • @@ is the match operator; it’s a binary operator which takes a tsvector and a tsquery, and returns true if the tsvector matches the tsquery.

PostgreSQL full-text search query

To get a better understanding of these types, it can be helpful to run the conversion functions on a few phrases.

SELECT to_tsvector('english', 'to break into small pieces');

"'break':2 'piec':5 'small':4"

SELECT plainto_tsquery('english', 'to break into small pieces');

"'break' & 'small' & 'piec'"

Ranking

The ranking of a full-text search match can be computed with either ts_rank(…), which provides a standard ranking, or ts_rank_cd(…), which gives a coverage density ranking (where ranking is also based on term proximity and cooccurrence, as described in “Relevance ranking for one to three term queries”).

SELECT
    id,
    definition,
    ts_rank(
        to_tsvector('english', definition),
        plainto_tsquery('english', 'to break into small pieces')
    ) AS rank
FROM definitions
WHERE to_tsvector('english', definition) @@ plainto_tsquery('english', 'to break into small pieces')
ORDER BY rank DESC;

Higher rank values correspond to more relevant search results.

Here’s the result set, with rankings, for the query above:

   id   | definition                                                                                                                                                      | rank
--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------
 568352 | # {{transitive}} To [[break]] small pieces from.                                                                                                                | 0.26833
 135231 | # Resistant to chipping (breaking into small pieces).                                                                                                           | 0.266913
 572891 | # {{transitive}} To break into small pieces or fragments.                                                                                                       | 0.266913
 568348 | # {{transitive}} To [[break]] into small pieces.                                                                                                                | 0.266913
 176962 | # To break into crumbs or small pieces with the fingers; to [[crumble]].                                                                                        | 0.25948
 50744  | # A small piece of [[detailing]] added to break up the [[surface]] of an [[object]] and add [[visual]] interest, particularly in [[movie]] [[special effect]]s. | 0.25134
 568350 | # {{transitive}} To [[break]] open or [[crush]] to small pieces by impact or stress.                                                                            | 0.25134
 572890 | # {{transitive}} To break into [[fragment]]s or small pieces.                                                                                                   | 0.25134
 547405 | # {{surgery}} The [[operation]] of breaking a [[stone]] in the [[bladder]] into small pieces capable of being [[void]]ed.                                       | 0.221355

Indexing

Without an index the above query takes ~5.3 seconds on my local machine (i7-4790K @ 3.7GHz, Intel 730 Series SSD, DDR3 1600 RAM w/ more than enough available to PG).

A Generalized Inverted Index (GIN) is recommended for full-text search. A GIN index can be created directly on a tsvector column or, in this case where there’s an existing text column, an expression index can be created using the to_tsvector() function.

CREATE INDEX ix_def
ON definitions
USING GIN(to_tsvector('english', definition));

With this index in place, performance improves drastically; the same query completes in ~13 milliseconds.
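
As noted above, the index can also be created directly on a tsvector column. A minimal sketch of that approach (assuming PostgreSQL 12+ for generated columns; the definition_tsv column and ix_def_tsv index names are my own):

-- Precompute the tsvector in a stored generated column
ALTER TABLE definitions
    ADD COLUMN definition_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', definition)) STORED;

-- Index the precomputed column directly
CREATE INDEX ix_def_tsv ON definitions USING GIN(definition_tsv);

-- Queries can then match against the stored column
SELECT id, definition
FROM definitions
WHERE definition_tsv @@ plainto_tsquery('english', 'to break into small pieces');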

Is it worth trying?

Maybe. If you’re not using Postgres as the data store for what needs to be searched over (i.e. you’d have to continually ETL data into Postgres), if you already have a sophisticated search solution in place, or if you’re operating at a scale where you need a clustered solution, probably not. However, if you’re using Postgres and looking to implement search in an application or move beyond simple substring search, what Postgres offers is fairly powerful and worth trying out.

PostgreSQL database import with Ansible

I had a hard time pulling together all the steps needed to import a PostgreSQL database using Ansible. Here are the Ansible YAML blocks I used to import the seed database for Lexiio.

1. Install PostgreSQL

- name: Install Postgres
  apt: name={{ item }} update_cache=yes cache_valid_time=3600 state=present
  sudo: yes
  with_items:
    - postgresql
    - postgresql-contrib
    - libpq-dev
    - python-psycopg2
  tags: packages

2. Create the database (lexiiodb), UTF-8 for encoding and collation

- name: Create lexiiodb database
  sudo_user: postgres
  postgresql_db: name=lexiiodb encoding='UTF-8' lc_collate='en_US.UTF-8' lc_ctype='en_US.UTF-8' state=present

3. Create a role that will be granted access to the database (password is a variable read from some secret source)

- name: Create lexiio role for database
  sudo_user: postgres
  postgresql_user: db=lexiiodb user=lexiio password="{{ password }}" priv=ALL state=present

4. Start the PostgreSQL service

- name: Start the Postgresql service
  sudo: yes
  service:
    name: postgresql
    state: started
    enabled: true

5. Import data into the database (using psql to pull in data from /home/lexiiodb.dump.sql)

- name: Importing lexiiodb data
  sudo_user: postgres
  shell: psql lexiiodb < /home/lexiiodb.dump.sql

6. For the role created, grant permissions on all schemas in the DB

- name: Grant usage of schema to lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb state=present privs=USAGE type=schema roles=lexiio objs=dictionary

7. For the role created, grant permissions on all tables in the DB

- name: Grant table permissions for lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb schema=dictionary state=present privs=SELECT,INSERT,UPDATE type=table roles=lexiio grant_option=no objs=ALL_IN_SCHEMA

8. For the role created, grant permissions on all sequences in the DB

- name: Grant sequence permissions for lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb schema=dictionary state=present privs=USAGE type=sequence roles=lexiio grant_option=no objs=ALL_IN_SCHEMA
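
For reference, the grants in steps 6 to 8 boil down to plain SQL along these lines (my sketch of what the postgresql_privs tasks effect; the dictionary schema and lexiio role come from the tasks above):

GRANT USAGE ON SCHEMA dictionary TO lexiio;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA dictionary TO lexiio;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA dictionary TO lexiio;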

Pivot tables

One of the more interesting things I’ve learned about recently, that’s proven itself incredibly useful, is the pivot table. A pivot table turns rows into columns. This may seem odd, but the utility of this transformation becomes apparent when you have data that can’t be modeled precisely by a set of attributes because there are attributes which apply to some pieces of data and not others. The typical solution to this problem is simply to have additional columns, allowing for NULL values or defining an appropriate default value (e.g. empty string). However, with a large number of attributes or with user-defined attributes, this solution becomes unattractive, and constructing a pivot table is preferable.

Here I’ll present an example of constructing a pivot table for a schema in which a number of optional attributes, stored as key-value pairs, are attributed to entities. Here’s an ER diagram of the simple schema used in this example:

Pivot Table example schema

  • entities holds a list of entities (companies, users, etc.), with name being the only attribute required for all entities.
  • keyvals holds a list of key-value pairs to associate with entities.
  • entity_attributes maps keyvals to entities.
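
For concreteness, here’s a minimal sketch of this schema in SQL (my own DDL, inferred from the diagram and the queries below; types and constraints are assumptions):

CREATE TABLE entities (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE keyvals (
    id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    `key`   VARCHAR(255) NOT NULL,  -- KEY is reserved in MySQL, hence the backticks
    `value` VARCHAR(255) NOT NULL
);

CREATE TABLE entity_attributes (
    entity_id INT UNSIGNED NOT NULL,
    keyval_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (entity_id, keyval_id),
    FOREIGN KEY (entity_id) REFERENCES entities(id),
    FOREIGN KEY (keyval_id) REFERENCES keyvals(id)
);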

An SQL query to grab all entities and all associated keyvals would look something like this:

SELECT *
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id=entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1;

… this would return every entity LEFT JOINed with any associated keyval:

No pivot

We can turn each entry in the key column (keyvals.key) into its own column, with rows having the corresponding entry from the value column (keyvals.value), using a simple conditional statement and alias as shown below:

SELECT entities.*,
    IF(kv.key='location', kv.value, '') AS location,
    IF(kv.key='fax_num', kv.value, '') AS fax_num
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id=entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1;

No pivot

The keys, location and fax_num, are now represented as columns. For entities with 2+ associated keyvals, we still have multiple rows for each entity, but the data is such that each row only holds a single value entry (keyvals.value) for a single key (keyvals.key), under the respective column. To get a single row per entity, we GROUP BY the entity and take the MAX of each key column.

SELECT entities.*,
    MAX(IF(kv.key='location', kv.value, '')) AS location,
    MAX(IF(kv.key='fax_num', kv.value, '')) AS fax_num
FROM entities
LEFT JOIN entity_attributes ea ON ea.entity_id=entities.id
LEFT JOIN keyvals kv ON kv.id = ea.keyval_id
WHERE 1
GROUP BY entities.id;

pivoted

The result is a pivot table.

NYC Data Mine, restaurant inspection data

I’ve just finished importing the current restaurant inspection data from the NYC Data Mine into a PostgreSQL database. It wasn’t the most difficult migration, but it was more difficult than it should have been, as the raw data from the data mine is messy and not well-formed; a typical problem with many of the data sets present in the NYC Data Mine. I came across a great post by Steven Romalewski (director of the CUNY Mapping Service) about the poor data quality and poor metadata, based on his experiences.

From looking at the restaurant inspection data and skimming a few other sets, I get the sense that structured and relational data simply isn’t understood or handled well. To be fair, there’s a very real lack of tools in the market, at least at the consumer/data-entry level, for handling such data, so it’s not surprising that everything gets jerry-rigged into an Excel worksheet. This is very clear when looking at the restaurant inspection data: you notice right away that restaurant ids and names are repeated across multiple rows.

In any case, the restaurant inspection data is better than most of the sets, but there are a few issues to take note of:

  • In multiple cases the same row, with the exact same data, is repeated.
  • There are 2 columns for the inspection date: INSPDATE and GRADEDATE; GRADEDATE = INSPDATE if there’s a letter grade for the restaurant, otherwise it’s blank/null.
  • Most glaring, there are invalid timestamps in the GRADEDATE column for 2 restaurants (across multiple rows, of course, as the restaurants have multiple entries), CAPRI RESTAURANT and MAMA LUCIA:

    timestamp problem

For my purposes, I only wanted the most recent inspection result (i.e. the row with the latest INSPDATE timestamp). To do this, I added an additional column for a serial/auto_increment id number. Then, once imported, I deleted the unneeded rows with the following query:

/* table is restaurant
id = CAMIS
inspection_score_date = INSPDATE
internal_id = serial/auto_increment id number
*/

DELETE FROM restaurant WHERE internal_id NOT IN
    (SELECT MAX(restaurant.internal_id) AS max_iid
     FROM restaurant,
          (SELECT id, dba, MAX(inspection_score_date) AS last_inspt
           FROM restaurant
           GROUP BY id, dba) AS sub
     WHERE restaurant.id=sub.id AND restaurant.inspection_score_date=sub.last_inspt
     GROUP BY restaurant.id)

The innermost subquery pulls the rows with the most recent inspection date; the outer one takes care of duplicate rows with the same inspection date by simply taking the row with the max internal id number. What results is a set of internal id numbers, each representing a row with a unique restaurant inspection for the most recent inspection.

I’m not sure if this is the best or most efficient way to do this, but it works, and it took about 14s to delete the unneeded rows from the 398,878-row table on a low-end VPS.
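
In retrospect, window functions (available since PostgreSQL 8.4) could express the same cleanup more directly. A sketch of an alternative, keeping the row with the newest inspection date per restaurant and breaking date ties with the highest internal_id:

DELETE FROM restaurant
WHERE internal_id NOT IN (
    SELECT internal_id
    FROM (
        SELECT internal_id,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY inspection_score_date DESC, internal_id DESC
               ) AS rn
        FROM restaurant
    ) AS ranked
    WHERE rn = 1
);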

PostgreSQL + PHP installation on Windows 2003 x64

Well, the PostgreSQL installation itself is easy enough; getting it to work with PHP is the challenging part. Here’s what I did:

Spott map

Something pretty cool in MSSQL Server Management Studio: for columns with the geography data type, Management Studio will plot the points on a map. Here’s a map of all locations tagged by all dotspott users.

spott map
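
To illustrate how such a column might be defined and populated (a hypothetical sketch, not dotspott’s actual schema; geography::Point takes latitude, longitude, and an SRID):

-- Hypothetical table with a geography column
CREATE TABLE tagged_locations (
    id       INT IDENTITY PRIMARY KEY,
    label    NVARCHAR(100),
    location GEOGRAPHY
);

-- 4326 is the SRID for WGS 84 (GPS coordinates)
INSERT INTO tagged_locations (label, location)
VALUES ('New York City', geography::Point(40.7128, -74.0060, 4326));

-- Selecting the geography column is what brings up the spatial/map view
SELECT label, location FROM tagged_locations;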

Specifying a blank string in MSSQL Server Management Studio

If you want to specify a blank string in the Column Properties table (instead of writing a query) for something like the Default Value or Binding property, enter ('')

MSSQL Server Management Studio

Just having '' will not work, as that will actually insert 2 apostrophes.
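
For reference, the query-based equivalent is a default constraint with an empty-string literal (the table, column, and constraint names here are made up for illustration):

-- T-SQL: empty-string default on an existing column
ALTER TABLE my_table
    ADD CONSTRAINT DF_my_table_notes DEFAULT '' FOR notes;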

h/t to Dhowe from this thread on bigresource.com

“Saving changes is not permitted…” in SQL Server 2008 Management Studio

The issue occurs when saving changes that would require dropping and recreating a table (e.g. messing around with columns). I would think a warning would be sufficient instead of a restriction that requires digging through the app options window (which is not even hinted at in the popup). I also can’t imagine how effective this would be; unless you design your schema perfectly the first time around you’d always encounter this issue and need to disable this “feature” as you made changes.

sql server 2008 management studio error

The fix is to go to Tools » Options » Designers and uncheck “Prevent saving changes that require table re-creation”

h/t to Deems’ Weblog for the solution.