dictionary | semi/signal

Posts Tagged ‘dictionary’

A more relational dictionary

Feb 20 2016 · Random

As I started looking to add more functionality to Lexiio, I realized the Wiktionary definitions database dump I was using wasn’t going to cut it; specifically, I needed a normalized schema, or I’d have data duplication all over the place. I started normalizing in MySQL, but whether it was MySQL or MySQL Workbench, I kept running into character encoding issues. Using a simple INSERT-SELECT, in MySQL 5.7, to transfer words from the existing table to a new table resulted losing characters:

MySQL losing characters

I dumped the data into PostgreSQL, didn’t encounter the issue, and just kept working from there.

The normalized schema can be downloaded here: LexiioDB normalized
(released under the Creative Commons Attribution-ShareAlike License)

LexiioDB schema

The unknown_words and unknown_to_similar_words tables is specific to Lexiio and serve as a place to store unknown words entered by the user and close/similar matches to known words (via the Levenshtein distance).

databasedatabase normalizationdictionarylexiiomysqlpostgresqlwiktionary

Lexiio

Dec 13 2015 · Random

Another little experiment of mine: Lexiio, a web-based CLI dictionary.

Lexiio

A few takeaways:

Part of the reason for building this was that I wanted to actually make use of the Wiktionary data set snapshot in a real project. The data set is pretty comprehensive, and easy to parse and work with.
This was also a learning exercise for Golang. There’s nothing complex here but, so far, working with Go has been enjoyable. I like that I’m building a native application, types are enforced, and the HTTP server included as part of the standard library is incredibly easy to setup and work with.
I wanted to experiment a bit with what a web-based CLI would look and feel like. For something like a dictionary, where user interaction revolves around textual input/output, a command-line interface seems to work really well.

CLIdictionarygolanglexiiowiktionary

Wiktionary definitions database

Jul 14 2014 · Random

Having a dictionary can be incredibly useful in software development, and forms the basis for a wide range of natural language processing applications. However, finding an open-source dictionary, one that can be easily parsed and used within applications, is incredibly difficult as there simply isn’t a lot of options available.

WordNet is one option I came across, but requires significant work parsing the WordNet ASCII database files or Prolog database files.

Wiktionary was the other viable option, and the one I went with. The Wiktionary XML dumps are available, but being a wiki, these files are likely even more difficult to parse than the WordNet database files as you’d have to deal with wiki markup. However, a while ago I was able to get a TSV file with words, parts of speech, and definitions from the Wikimedia Toolserver at http://toolserver.org/~enwikt/definitions. The Toolserver has since been discontinued and I haven’t found updated TSVs hosted anywhere else, but the file I downloaded, dated November 27, 2012, is still fairly up-to-date for a dictionary and useful in many applications.

I wrote a PHP script to parse the TSV and make INSERTs into a MySQL database. The TSV file, MySQL database, and PHP script are presented below.

Wiktionary TSV file

Wiktionary MySQL database export

PHP Script:

<?php 

require "Database.php";

$tsvInputFilePath = "TEMP-E20121127.tsv";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if($fp === FALSE) {
    echo "Could not find file path: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {
    
    // Get line and parse tab-delimited fields
    $ln = fgets($fp);
    $parts = explode("\t", $ln);
    if(count($parts) < 4) {
        continue;
    }
    
    $lang = $parts[0];
    $word = $parts[1];
    $partOfSpeech = $parts[2];    
    $definitionRaw = $parts[3];
    
    // Insert into database
    $db->query("INSERT INTO words (language, word, part_of_speech, definition_raw) 
                VALUES (?, ?, ?, ?)", 
                $lang, $word, $partOfSpeech, $definitionRaw);
       
}

echo "done.\n";
exit;

The Database class is wrapper for mysqli, you can find it, along with the script above, in the wiktionary-tsv-import bitbucket repo.

Note that definitions need to be parsed further, as they contain wiki markup. The parsing doesn’t seem difficult and is something I hope to get done in the near future.

Related resources:

Wikokit – parser to produce a machine-readable Wiktionary
DBpedia Wiktionary RDF extraction – RDF database and SPARQL querying interface of Wiktionary
perl-wiktionary-parser – PERL Wiktionary parser

There’s valuable stuff from each of the projects above, but like WordNet, requires significantly more time to evaluate and implement in an application, compared to the simple TSV -> MySQL translation.

EDIT (12/13/2015): I’ve updated the MySQL database export. There was some holes in the data because I was using utf8 column encoding for definitions, however, MySQL’s has a weird “UTF-8” implementation that only handles codepoint that up to 3 bytes in size. utf8mb4 encoding needs to be used for a proper UTF-8 encoding supporting up to 4 bytes.