Wiktionary definitions database

Having a dictionary can be incredibly useful in software development and forms the basis for a wide range of natural language processing applications. However, finding an open-source dictionary that can be easily parsed and used within applications is surprisingly difficult, as there simply aren’t many options available.

WordNet is one option I came across, but it requires significant work to parse the WordNet ASCII database files or Prolog database files.

Wiktionary was the other viable option, and the one I went with. The Wiktionary XML dumps are available but, being wiki content, they’re likely even more difficult to parse than the WordNet database files, as you’d have to deal with wiki markup. However, a while ago I was able to get a TSV file with words, parts of speech, and definitions from the Wikimedia Toolserver at http://toolserver.org/~enwikt/definitions. The Toolserver has since been discontinued and I haven’t found updated TSVs hosted anywhere else, but the file I downloaded, dated November 27, 2012, is still fairly up-to-date for a dictionary and useful in many applications.
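Each line of the TSV holds four tab-separated fields: language, word, part of speech, and the raw definition (which still carries wiki markup). A hypothetical example line, not taken from the actual file, would look something like:

English	dictionary	Noun	# A [[reference work]] with a list of [[word]]s and their meanings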

I wrote a PHP script to parse the TSV file and insert each entry into a MySQL database. The TSV file, MySQL database export, and PHP script are presented below.

Wiktionary TSV file

Wiktionary MySQL database export

PHP Script:

<?php

require "Database.php";

$tsvInputFilePath = "TEMP-E20121127.tsv";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if ($fp === FALSE) {
    echo "Could not open file: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line; fgets() returns FALSE at EOF or on a read error
    $ln = fgets($fp);
    if ($ln === FALSE) {
        break;
    }

    // Parse tab-delimited fields, dropping the trailing newline
    $parts = explode("\t", rtrim($ln, "\r\n"));
    if (count($parts) < 4) {
        continue; // skip malformed lines
    }

    $lang = $parts[0];
    $word = $parts[1];
    $partOfSpeech = $parts[2];
    $definitionRaw = $parts[3];

    // Insert into database
    $db->query(
        "INSERT INTO words (language, word, part_of_speech, definition_raw)
         VALUES (?, ?, ?, ?)",
        $lang, $word, $partOfSpeech, $definitionRaw
    );
}

fclose($fp);

echo "done.\n";
exit;

The Database class is a wrapper for mysqli; you can find it, along with the script above, in the wiktionary-tsv-import Bitbucket repo.

Note that definitions need to be parsed further, as they contain wiki markup. The parsing doesn’t seem difficult and is something I hope to get done in the near future.
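To give a rough idea of what that might look like, here’s a minimal sketch (my own, untested against the full dataset) that strips a few common wiki constructs from a raw definition with regular expressions; a real parser would have to deal with nested templates and the rest of the markup:

<?php

// Minimal sketch: strip common wiki markup from a raw definition.
// Handles only simple, non-nested cases.
function stripWikiMarkup($definitionRaw)
{
    $text = $definitionRaw;
    $text = preg_replace('/\{\{[^{}]*\}\}/', '', $text);                    // {{templates}}
    $text = preg_replace('/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/', '$1', $text);  // [[target|label]] -> label
    $text = preg_replace("/'{2,}/", '', $text);                             // bold/italic quote runs
    $text = preg_replace('/^#+\s*/', '', $text);                            // leading list markers
    return trim($text);
}

echo stripWikiMarkup("# A [[reference work|reference]] with a list of [[word]]s {{countable}}") . "\n";
// Prints: A reference with a list of words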

Related resources:

There’s valuable stuff in each of the projects above but, like WordNet, they require significantly more time to evaluate and implement in an application compared to the simple TSV -> MySQL translation.

EDIT (12/13/2015): I’ve updated the MySQL database export. There were some holes in the data because I was using the utf8 column encoding for definitions; however, MySQL’s “utf8” is a weird implementation of UTF-8 that only handles codepoints up to 3 bytes in size. The utf8mb4 encoding needs to be used for a proper UTF-8 encoding, supporting codepoints up to 4 bytes.
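For anyone fixing up their own import, the conversion is a single statement in MySQL; something along these lines (the utf8mb4_unicode_ci collation is just one reasonable choice) for the words table used by the script above:

ALTER TABLE words CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;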

Data driven

The Economist recently wrote a bit about how speech recognition got so good:

… words do not appear in random order, so the computer does not have to guess from (say) a vocabulary of 20,000 words for each word you speak. Instead, the software assesses how likely you are to have said a given word based on the surrounding words, drawing on statistical models derived from vast repositories of digitised documents and the previous utterances of other users.

This reminded me of a talk by Peter Norvig: The Unreasonable Effectiveness of Data, where he discusses using such large repositories of data to develop effective algorithms for a number of problems; there’s a heavy focus on natural language processing, but the concept can, of course, be applied in other areas.

(If the name Peter Norvig sounds familiar, he’s the co-author of Artificial Intelligence: A Modern Approach which you might have used if you ever took an AI class.)

As a programmer, I find this exciting stuff, and it has certainly changed how I would approach similar problems in the future. Whereas before I would look at sample data sets and try to derive an algorithm, I’d now attempt to mine as much data as I could, build a statistical model, and use that model as the basis of the algorithm. Of course, mining a massive data set is sometimes easier said than done; when it comes to data, much of the web is still a walled garden.
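To make the statistical-model idea concrete, here’s a toy sketch (my own illustration, nothing from the talk or the article): count word bigrams in a corpus, then pick whichever candidate word has most often followed the previous word:

<?php

// Toy bigram model: count adjacent word pairs in a corpus, then
// score candidate words by how often they follow a given word.
$corpus = "the cat sat on the mat the cat ate the fish";

$words = preg_split('/\s+/', strtolower($corpus), -1, PREG_SPLIT_NO_EMPTY);

$bigramCounts = array();
for ($i = 0; $i < count($words) - 1; $i++) {
    $prev = $words[$i];
    $next = $words[$i + 1];
    if (!isset($bigramCounts[$prev][$next])) {
        $bigramCounts[$prev][$next] = 0;
    }
    $bigramCounts[$prev][$next]++;
}

// Given the previous word, pick the candidate seen most often after it
function mostLikelyNext($bigramCounts, $prev, $candidates)
{
    $best = NULL;
    $bestCount = -1;
    foreach ($candidates as $candidate) {
        $count = isset($bigramCounts[$prev][$candidate])
            ? $bigramCounts[$prev][$candidate] : 0;
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $candidate;
        }
    }
    return $best;
}

echo mostLikelyNext($bigramCounts, "the", array("cat", "mat", "fish")) . "\n";
// Prints "cat", since "the cat" occurs twice in the corpus

A real system would smooth the counts and train on billions of words rather than a single sentence, but the principle is the same: the data, not hand-written rules, drives the prediction.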