Posts Tagged ‘database’

Are spreadsheets databases?

A few weeks ago, news that using Excel resulted in the loss of ~16,000 coronavirus cases in England, due to a ~65K row limit in XLS files, sparked a number of tweets about how Excel isn’t a database. There’s a lot to cringe at here, but saying that Excel or, more precisely, spreadsheets aren’t databases, and that using a “proper database” would have prevented the failure, is incredibly reductive.

excel is not a database

Definition

To start, it’s worth looking at what the definition of “database” actually is:

definition of database

At least from that definition, I think it’s fair to say spreadsheets are databases. They’re primitive, with little to nothing in the way of concurrency, security, constraints, etc., but they are databases.

In software engineering, the term “database” is typically shorthand for a relational database, but that feels more and more problematic, as we now see many databases in use that aren’t relational (Cassandra, Mongo, DynamoDB, etc.). Even S3 now serves as a foundation for many databases.

System Considerations

Beyond definitions, what’s also interesting here is that tooling and database choice is never a simple equation for a non-trivial data system; there’s a host of considerations that come into play. Here are a few that pop into my mind:

  • Interfacing: Who’s accessing and/or manipulating data in the system? Sophisticated data systems and bespoke interfaces are powerful, but they require training and expertise, and there’s typically a higher maintenance burden. Leveraging common and ubiquitous tooling can be beneficial when it comes to interfacing needs for a larger audience. This comment on Ars Technica points out that the “proper tools” are inordinately complex and not something a typical end-user can pick up and understand easily, and I think that’s a fair assessment of the product landscape.
  • System limits: Every data system has limits; some are explicit and obvious, some are not. It’s also not surprising to bump into limits due to how a database is set up or how a schema is designed. Hitting a 65K row limit is frustrating and problematic, but so is discovering a value was truncated because a field length was set too small or an incorrect type was used.
  • Cost: What is the cost of the technical infrastructure? What about the cost of the people needed to maintain the system? Unsurprisingly, more sophisticated and complex systems will cost more.
  • Failure modes: What are some common ways this system fails? What does it take to recover and get back to normal operations? With simple systems you tend to just hit hard limits, but more complex systems fail in a multitude of ways.
  • Time: When does this need to be shipped and what compromises need to be made? When you don’t have weeks or months to design and prototype, leveraging pre-existing and proven methods is typically the path of least resistance.

It’s perhaps easy to point to some of the issues that come into play with Excel and spreadsheets, but any data system will have its fair share of limits and risks, along with its potential benefits.

PostgreSQL database import with Ansible

I had a hard time pulling together all the steps needed to import a PostgreSQL database using Ansible. Here are the Ansible YAML blocks used to import the seed database for Lexiio.

1. Install PostgreSQL

- name: Install Postgres
  apt: name={{ item }} update_cache=yes cache_valid_time=3600 state=present
  sudo: yes
  with_items:
    - postgresql
    - postgresql-contrib
    - libpq-dev
    - python-psycopg2
  tags: packages

2. Create the database (lexiiodb), UTF-8 for encoding and collation

- name: Create lexiiodb database
  sudo_user: postgres
  postgresql_db: name=lexiiodb encoding='UTF-8' lc_collate='en_US.UTF-8' lc_ctype='en_US.UTF-8' state=present

3. Create a role that will be granted access to the database (password is a variable read from some secret source)

- name: Create lexiio role for database
  sudo_user: postgres
  postgresql_user: db=lexiiodb user=lexiio password="{{ password }}" priv=ALL state=present

4. Start the PostgreSQL service

- name: Start the Postgresql service
  sudo: yes
  service:
    name: postgresql
    state: started
    enabled: true

5. Import data into the database (using psql to pull in data from /home/lexiiodb.dump.sql)

- name: Importing lexiiodb data
  sudo_user: postgres
  shell: psql lexiiodb < /home/lexiiodb.dump.sql

6. For the role created, grant usage on the dictionary schema in the DB

- name: Grant usage of schema to lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb state=present privs=USAGE type=schema roles=lexiio objs=dictionary

7. For the role created, grant permissions on all tables in the dictionary schema

- name: Grant table permissions for lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb schema=dictionary state=present privs=SELECT,INSERT,UPDATE type=table roles=lexiio grant_option=no objs=ALL_IN_SCHEMA

8. For the role created, grant permissions on all sequences in the dictionary schema

- name: Grant sequence permissions for lexiio role
  sudo_user: postgres
  postgresql_privs: database=lexiiodb schema=dictionary state=present privs=USAGE type=sequence roles=lexiio grant_option=no objs=ALL_IN_SCHEMA

A more relational dictionary

As I started looking to add more functionality to Lexiio, I realized the Wiktionary definitions database dump I was using wasn’t going to cut it; specifically, I needed a normalized schema, or I’d have data duplication all over the place. I started normalizing in MySQL but, whether it was MySQL or MySQL Workbench, I kept running into character encoding issues. Using a simple INSERT-SELECT in MySQL 5.7 to transfer words from the existing table to a new table resulted in lost characters:

MySQL losing characters

I dumped the data into PostgreSQL, didn’t encounter the issue, and just kept working from there.

The normalized schema can be downloaded here: LexiioDB normalized
(released under the Creative Commons Attribution-ShareAlike License)

LexiioDB schema

The unknown_words and unknown_to_similar_words tables are specific to Lexiio and serve as a place to store unknown words entered by the user and close/similar matches to known words (via the Levenshtein distance).
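For the curious, here’s a rough sketch of how that kind of matching can be done with PHP’s built-in levenshtein() function. This isn’t the actual Lexiio code; the word list and the distance cutoff are just placeholders (in practice, the known words would come from the words table).

<?php

// Sketch: find known words within a small Levenshtein distance of an unknown word.
// The word list and the cutoff of 2 are placeholders.
function findSimilarWords($unknownWord, array $knownWords, $maxDistance = 2)
{
    $similar = [];
    foreach ($knownWords as $word) {
        if (levenshtein($unknownWord, $word) <= $maxDistance) {
            $similar[] = $word;
        }
    }
    return $similar;
}

print_r(findSimilarWords("helo", ["hello", "halo", "help", "world"]));
// -> hello, halo, help (all within a Levenshtein distance of 2 of "helo")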

GeoNames geographical database

I came across the GeoNames database recently and was impressed with the breadth of locations available. I downloaded allCountries.zip from http://download.geonames.org/export/dump/, which gives data (name, location, population, etc.) on places across all countries in one tab-delimited (TSV) text file. To work with the data more easily, I wrote a PHP script to put the entries into a MySQL database table (it’s actually just a simple modification of the script I used for the Wiktionary definitions import). The TSV file, MySQL database export, and PHP script are all presented below.

GeoNames allCountries.zip

GeoNames MySQL database export

<?php

require "Database.php";

$tsvInputFilePath = "allCountries.txt";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if ($fp === FALSE) {
    echo "Could not find file path: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line and parse tab-delimited fields
    $ln = fgets($fp);
    $parts = explode("\t", $ln);

    if (count($parts) < 19) {
        continue;
    }

    // Insert into database
    $db->query("INSERT INTO cities (`id`,
                                    `name`,
                                    `asciiname`,
                                    `alternatenames`,
                                    `latitude`,
                                    `longitude`,
                                    `feature_class`,
                                    `feature_code`,
                                    `country_code`,
                                    `cc2`,
                                    `admin1_code`,
                                    `admin2_code`,
                                    `admin3_code`,
                                    `admin4_code`,
                                    `population`,
                                    `elevation`,
                                    `dem`,
                                    `timezone`,
                                    `last_modified_at`)
                VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
        $parts[0],
        $parts[1],
        $parts[2],
        $parts[3],
        $parts[4],
        $parts[5],
        $parts[6],
        $parts[7],
        $parts[8],
        $parts[9],
        $parts[10],
        $parts[11],
        $parts[12],
        $parts[13],
        $parts[14],
        $parts[15],
        $parts[16],
        $parts[17],
        $parts[18]
    );

}

echo "done.\n";
exit;

The Database class is a wrapper for mysqli; you can find it, along with the script above, in the geonames-allcountries-import Bitbucket repo.

Note that this script will take a while to run (likely a few days) as there are 9,195,153 records that need to be inserted and we’re just doing simple INSERTs with no optimizations.
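If the import time is a concern, one possible optimization (not something the script above does; this is just a sketch, using mysqli directly, with placeholder connection details and only the first few columns shown for brevity) is to batch the inserts into transactions so rows aren’t committed one at a time:

<?php

// Sketch: batch the GeoNames inserts into transactions to cut down on per-row
// commit overhead. Connection parameters are placeholders and only the first
// few columns are shown.
$mysqli = new mysqli("localhost", "user", "password", "geonames");
$mysqli->set_charset("utf8mb4");

$stmt = $mysqli->prepare("INSERT INTO cities (`id`, `name`, `asciiname`) VALUES (?, ?, ?)");

$fp = fopen("allCountries.txt", "r");
$batchSize = 5000;
$count = 0;

$mysqli->begin_transaction();
while (($ln = fgets($fp)) !== false) {
    $parts = explode("\t", $ln);
    if (count($parts) < 19) {
        continue;
    }

    $stmt->bind_param("iss", $parts[0], $parts[1], $parts[2]);
    $stmt->execute();

    // Commit the current batch and start a new transaction
    if (++$count % $batchSize === 0) {
        $mysqli->commit();
        $mysqli->begin_transaction();
    }
}
$mysqli->commit(); // commit any remaining rows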

An overview of each of the fields in the database can be found in the GeoNames export readme.txt. Particularly important are the feature_class and feature_code fields, the range of values for which can be found on the GeoNames Feature Codes page. Also, as indicated in the readme, the data is licensed under the Creative Commons Attribution 3.0 License.
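As an example of how those fields come into play, feature class 'P' covers populated places (cities, villages, etc.), so a query for large cities might look something like the sketch below (again using mysqli directly, with placeholder connection details):

<?php

// Sketch: query for populated places (GeoNames feature class 'P') above a
// population threshold. Connection parameters are placeholders; get_result()
// requires the mysqlnd driver.
$mysqli = new mysqli("localhost", "user", "password", "geonames");

$stmt = $mysqli->prepare(
    "SELECT `name`, `country_code`, `population`
     FROM cities
     WHERE `feature_class` = 'P' AND `population` > ?
     ORDER BY `population` DESC"
);

$minPopulation = 1000000;
$stmt->bind_param("i", $minPopulation);
$stmt->execute();

$result = $stmt->get_result();
while ($row = $result->fetch_assoc()) {
    echo "{$row['name']} ({$row['country_code']}): {$row['population']}\n";
}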

Wiktionary definitions database

A dictionary can be incredibly useful in software development, and forms the basis for a wide range of natural language processing applications. However, finding an open-source dictionary, one that can be easily parsed and used within applications, is incredibly difficult, as there simply aren’t a lot of options available.

WordNet is one option I came across, but it requires significant work to parse the WordNet ASCII database files or Prolog database files.

Wiktionary was the other viable option, and the one I went with. The Wiktionary XML dumps are available but, coming from a wiki, these files are likely even more difficult to parse than the WordNet database files, as you’d have to deal with wiki markup. However, a while ago I was able to get a TSV file with words, parts of speech, and definitions from the Wikimedia Toolserver at http://toolserver.org/~enwikt/definitions. The Toolserver has since been discontinued and I haven’t found updated TSVs hosted anywhere else, but the file I downloaded, dated November 27, 2012, is still fairly up-to-date for a dictionary and useful in many applications.

I wrote a PHP script to parse the TSV and make INSERTs into a MySQL database. The TSV file, MySQL database, and PHP script are presented below.

Wiktionary TSV file

Wiktionary MySQL database export

PHP Script:

<?php

require "Database.php";

$tsvInputFilePath = "TEMP-E20121127.tsv";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if ($fp === FALSE) {
    echo "Could not find file path: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line and parse tab-delimited fields
    $ln = fgets($fp);
    $parts = explode("\t", $ln);
    if (count($parts) < 4) {
        continue;
    }

    $lang = $parts[0];
    $word = $parts[1];
    $partOfSpeech = $parts[2];
    $definitionRaw = $parts[3];

    // Insert into database
    $db->query("INSERT INTO words (language, word, part_of_speech, definition_raw)
                VALUES (?, ?, ?, ?)",
        $lang, $word, $partOfSpeech, $definitionRaw);

}

echo "done.\n";
exit;

The Database class is a wrapper for mysqli; you can find it, along with the script above, in the wiktionary-tsv-import Bitbucket repo.

Note that definitions need to be parsed further, as they contain wiki markup. The parsing doesn’t seem difficult and is something I hope to get done in the near future.
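As a very rough sketch of what that parsing might look like (this isn’t a full wikitext parser, and it’s not necessarily what I’ll end up using; it only handles a few common constructs and would miss nested templates and less common markup):

<?php

// Sketch: strip a few common wiki markup constructs from a raw definition.
// This is a first pass only; nested templates and other markup would need
// additional handling.
function stripWikiMarkup($raw)
{
    $text = $raw;
    $text = preg_replace('/\{\{[^{}]*\}\}/', '', $text);              // drop {{templates}}
    $text = preg_replace('/\[\[[^\]|]*\|([^\]]*)\]\]/', '$1', $text); // [[target|label]] -> label
    $text = preg_replace('/\[\[([^\]]*)\]\]/', '$1', $text);          // [[word]] -> word
    $text = str_replace(["'''", "''"], '', $text);                    // remove bold/italic quotes
    $text = preg_replace('/\s+/', ' ', $text);                        // collapse whitespace
    return trim($text);
}

echo stripWikiMarkup("# {{countable}} A [[dog|domesticated]] [[canine]].");
// -> "# A domesticated canine."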

Related resources:

There’s valuable stuff in each of the projects above but, like WordNet, they require significantly more time to evaluate and implement in an application compared to the simple TSV -> MySQL translation.

EDIT (12/13/2015): I’ve updated the MySQL database export. There were some holes in the data because I was using the utf8 column encoding for definitions; however, MySQL has a weird “UTF-8” implementation that only handles codepoints up to 3 bytes in size. The utf8mb4 encoding needs to be used for a proper UTF-8 encoding that supports codepoints up to 4 bytes.
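For anyone hitting the same issue, the fix amounts to converting the table to utf8mb4 and making sure the connection charset matches. A sketch, using the words table from the import script above and placeholder connection details:

<?php

// Sketch: convert the words table to utf8mb4 and use a matching connection
// charset so 4-byte codepoints survive the round trip. Connection parameters
// are placeholders.
$mysqli = new mysqli("localhost", "user", "password", "wiktionary");
$mysqli->set_charset("utf8mb4");

$mysqli->query(
    "ALTER TABLE words
     CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
);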

Broken comments

h/t to Dave, the guy who contacted me and mentioned that he was unable to post comments on this blog; I didn’t realize the comments system was not functioning.

Now, why the WordPress comment system would go down was a bit confusing and frustrating to diagnose. WP simply sent you to a blank page (wp-comments-post.php#comment-). After ruling out an issue with the theme and a problem with a blank line in wp-comments-post.php (the first 2 cases mentioned here), I tried repairing the wp_comments table in the DB, which didn’t work (the post linked to mentions going further and wiping out the data and recreating the table, but I didn’t want to go that far). I ran across a few posts (such as this one) on the WP forums about similar issues, but nothing that led to a solution.

I decided to dig into wp-comments-post.php and started playing around with a few echo statements to see where things were going wrong. It turns out wp_new_comment() was always returning a comment_id of 0. Not good. Assuming the WP code was correct (I had no reason to suspect otherwise), it had to be a problem with the database. The problem and solution turned out to be pretty trivial: the comment_ID field of the wp_comments table was missing the auto_increment attribute (which needs to be set; you can see the WP database schema here). So all new comments were being assigned the default value of 0.
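The fix itself boils down to a one-line ALTER TABLE; a sketch is below, shown through mysqli with placeholder connection details. The column type is assumed from the standard WordPress schema, so double-check it against your own wp_comments table first.

<?php

// Sketch: restore the auto_increment attribute on wp_comments.comment_ID.
// The column type is assumed from the standard WordPress schema; connection
// parameters are placeholders.
$mysqli = new mysqli("localhost", "wp_user", "password", "wordpress");
$mysqli->query(
    "ALTER TABLE wp_comments
     MODIFY comment_ID BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT"
);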

I’m not sure why this would happen; maybe an issue resulting from an upgrade my host did a while back or perhaps an issue with the WP auto-update process.