## Archive for July, 2014

### Round to midnight

A problem I’ve run into a few times is taking the current unix timestamp and rounding it to midnight, so that I can get the unix time for the start of the day. In PHP, I’ve commonly done the following:

```php
$timestamp = strtotime('today midnight');
```

It’s one of the solutions presented in this StackOverflow post.

The solution above works fine, but I began thinking about how to actually do the computation and bypass the string parsing done by strtotime(). The computation is actually pretty simple, as it’s in the same vein as snapping a point to a grid. The verbose code snippet below shows the step-by-step process in the computation.

```php
// Given the number of seconds in a day
$numSecondsInDay = 86400;

// ... and the current unix time
$currentTime = time();

// We can compute the number of days since the unix epoch (the
// decimal/fractional part is the portion of the current day that's elapsed)
$daysSinceEpoch = $currentTime / $numSecondsInDay;

// We can throw away the fractional part by rounding down with the floor() function
$wholeDaysSinceEpoch = floor($daysSinceEpoch);

// The number of whole days since the epoch x the number of seconds in a day
// will give the time for the current day at midnight
$midnightToday = $wholeDaysSinceEpoch * $numSecondsInDay;
```

One interesting thing to note: if you replace the floor() function with the ceil() function, rounding up the number of days since the epoch, you'll get the start of the next day, midnight tomorrow.
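To make the floor/ceil behavior concrete, here's a small sketch using a fixed, arbitrary timestamp instead of time(). One caveat worth noting: this arithmetic snaps to midnight UTC, whereas strtotime('today midnight') uses PHP's configured timezone, so the two can differ unless the timezone is UTC.

```php
<?php
// An arbitrary sample timestamp (partway through a day, UTC)
$currentTime = 1405000000;
$numSecondsInDay = 86400;

// floor() snaps down to midnight at the start of the current day (UTC)
$midnightToday = (int) floor($currentTime / $numSecondsInDay) * $numSecondsInDay;

// ceil() snaps up to midnight at the start of the next day (UTC)
$midnightTomorrow = (int) ceil($currentTime / $numSecondsInDay) * $numSecondsInDay;

echo $midnightToday . "\n";    // 1404950400
echo $midnightTomorrow . "\n"; // 1405036800
```

The two results differ by exactly 86400 seconds, as you'd expect for consecutive midnights.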

### Wiktionary definitions database

Having a dictionary can be incredibly useful in software development, and forms the basis for a wide range of natural language processing applications. However, finding an open-source dictionary, one that can be easily parsed and used within applications, is surprisingly difficult, as there simply aren't many options available.

WordNet is one option I came across, but it requires significant work to parse the WordNet ASCII database files or Prolog database files.

Wiktionary was the other viable option, and the one I went with. The Wiktionary XML dumps are available, but since Wiktionary is a wiki, these files are likely even more difficult to parse than the WordNet database files, as you'd have to deal with wiki markup. However, a while ago I was able to get a TSV file with words, parts of speech, and definitions from the Wikimedia Toolserver at http://toolserver.org/~enwikt/definitions. The Toolserver has since been discontinued and I haven't found updated TSVs hosted anywhere else, but the file I downloaded, dated November 27, 2012, is still fairly up-to-date for a dictionary and useful in many applications.

I wrote a PHP script to parse the TSV and make INSERTs into a MySQL database. The TSV file, MySQL database, and PHP script are presented below.

Wiktionary TSV file

Wiktionary MySQL database export

PHP Script:

```php
<?php

require "Database.php";

$tsvInputFilePath = "TEMP-E20121127.tsv";
echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if ($fp === FALSE) {
    echo "Could not find file path: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {
    // Get line and parse tab-delimited fields
    $ln = fgets($fp);
    $parts = explode("\t", $ln);
    if (count($parts) < 4) {
        continue;
    }

    $lang = $parts[0];
    $word = $parts[1];
    $partOfSpeech = $parts[2];
    $definitionRaw = $parts[3];

    // Insert into database
    $db->query(
        "INSERT INTO words (language, word, part_of_speech, definition_raw) VALUES (?, ?, ?, ?)",
        $lang, $word, $partOfSpeech, $definitionRaw
    );
}

echo "done.\n";
exit;
```

The Database class is a wrapper for mysqli; you can find it, along with the script above, in the wiktionary-tsv-import bitbucket repo.

Note that definitions need to be parsed further, as they contain wiki markup. The parsing doesn’t seem difficult and is something I hope to get done in the near future.
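As a starting point, a minimal sketch of what that parsing might look like, assuming a hypothetical stripWikiMarkup() helper that drops {{template}} calls, unwraps [[link|text]] syntax, and removes the leading # list marker (real Wiktionary markup has more cases than this covers):

```php
<?php
// Hypothetical sketch: strip a few common wiki markup constructs from a
// raw definition line. Not a full wikitext parser.
function stripWikiMarkup($raw) {
    $s = preg_replace('/\{\{[^{}]*\}\}/', '', $raw);                  // drop {{templates}}
    $s = preg_replace('/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/', '$1', $s);  // keep link display text
    $s = preg_replace('/^#+\s*/', '', $s);                            // drop leading # list marker
    return trim(preg_replace('/\s{2,}/', ' ', $s));                   // collapse leftover whitespace
}

// Example with made-up markup in the style of a Wiktionary definition line:
echo stripWikiMarkup('# {{context|lang=en}} [[fruit|Fruit]] of the [[apple]] tree');
// Fruit of the apple tree
```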

Related resources:

There's valuable stuff in each of the projects above, but, like WordNet, they require significantly more time to evaluate and implement in an application, compared to the simple TSV -> MySQL translation.

EDIT (12/13/2015): I've updated the MySQL database export. There were some holes in the data because I was using the utf8 column encoding for definitions; however, MySQL's "utf8" is a nonstandard implementation of UTF-8 that only handles code points up to 3 bytes in size. The utf8mb4 encoding needs to be used for proper UTF-8 support of code points up to 4 bytes.
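The 3-byte limit is easy to demonstrate: characters outside the Basic Multilingual Plane (emoji, rare CJK characters) encode to 4 bytes in UTF-8 and don't fit in MySQL's utf8 charset. A quick sketch (the ALTER TABLE shown in the comment assumes the words table from the import script):

```php
<?php
// U+2603 (snowman) encodes to 3 bytes in UTF-8: fits in MySQL `utf8`
$snowman = "\u{2603}";

// U+1F600 (emoji) encodes to 4 bytes: requires `utf8mb4`
$emoji = "\u{1F600}";

echo strlen($snowman) . "\n"; // byte length: 3
echo strlen($emoji) . "\n";   // byte length: 4

// Fixing an existing table (illustrative; adjust collation to taste):
//   ALTER TABLE words CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
// The connection charset must also be utf8mb4, e.g. $mysqli->set_charset('utf8mb4');
```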