Archive for the ‘Random’ Category

Reel

I wrote a little desktop application to capture short videos and turn them into GIFs. I call it Reel. It’s still rough around the edges but you can grab an early version of it below.

Reel 0.1 (Windows Install)

I’ll have a Linux/Ubuntu version soon. Maybe an OS X version… I have to jump through a few extra hoops here as Apple still refuses to allow OS X to be virtualized.

Reel - Drinking Bird

Aside from its utility, this was also an experiment in piecing together some technologies I’ve written about here before: XUL + XPCOM + SocketBridge, video capture using web tech, and, more generally, using web technologies for desktop applications.

Logging to stdout and a file

A simple way to log to both stdout and a file (using a pipe and tee):

./myapp 2>&1 | tee -a myapp.log

A more relational dictionary

As I started looking to add more functionality to Lexiio, I realized the Wiktionary definitions database dump I was using wasn’t going to cut it; specifically, I needed a normalized schema, or I’d have data duplication all over the place. I started normalizing in MySQL, but whether it was MySQL or MySQL Workbench, I kept running into character encoding issues. Using a simple INSERT-SELECT in MySQL 5.7 to transfer words from the existing table to a new table resulted in lost characters:

MySQL losing characters

I dumped the data into PostgreSQL, didn’t encounter the issue, and just kept working from there.

The normalized schema can be downloaded here: LexiioDB normalized
(released under the Creative Commons Attribution-ShareAlike License)

LexiioDB schema

The unknown_words and unknown_to_similar_words tables are specific to Lexiio and serve as a place to store unknown words entered by the user and close/similar matches to known words (via the Levenshtein distance).
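For illustration, the relationship looks roughly like this; the column names below are hypothetical, and the actual definitions are in the LexiioDB download above:

-- Hypothetical sketch of the relationship, not the actual LexiioDB definitions
CREATE TABLE unknown_words (
    id   SERIAL PRIMARY KEY,
    word TEXT NOT NULL                -- word entered by the user, not found in the dictionary
);

CREATE TABLE unknown_to_similar_words (
    unknown_word_id INTEGER NOT NULL REFERENCES unknown_words(id),
    similar_word_id INTEGER NOT NULL, -- close/similar known word (dictionary table name assumed)
    distance        INTEGER NOT NULL  -- Levenshtein distance between the two words
);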

Lexiio

Another little experiment of mine: Lexiio, a web-based CLI dictionary.

Lexiio

A few takeaways:

  • Part of the reason for building this was that I wanted to actually make use of the Wiktionary data set snapshot in a real project. The data set is pretty comprehensive, and easy to parse and work with.
  • This was also a learning exercise for Golang. There’s nothing complex here but, so far, working with Go has been enjoyable. I like that I’m building a native application, types are enforced, and the HTTP server included in the standard library is incredibly easy to set up and work with (see the sketch after this list).
  • I wanted to experiment a bit with what a web-based CLI would look and feel like. For something like a dictionary, where user interaction revolves around textual input/output, a command-line interface seems to work really well.
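
To give a taste of that point about the standard library, here’s a minimal sketch of a net/http server with a single JSON endpoint. It’s illustrative only; it is not Lexiio’s actual code, and the /define route and response shape are made up for the example.

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

func main() {
    // A single lookup endpoint: /define?word=...
    // (hypothetical route; the real handlers and data layer are more involved)
    http.HandleFunc("/define", func(w http.ResponseWriter, r *http.Request) {
        word := r.URL.Query().Get("word")
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(map[string]string{"word": word})
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}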

Function argument tricks

I came across this “trick” on coderwall for avoiding an if statement when you have a function argument that is allowed to be either an array or a scalar value (something that seems oddly common in loosely typed languages).

function example(ids) {
    [].concat(ids).forEach(function (id) {
        // ...
    });
}

The trick in question is just concatenating the argument with an empty array so that, within the function, you’re always dealing with an array and its elements.

Taking a step back, the bigger question is why do this?
In my experience, it’s almost always better to keep the code within the function straightforward and force the caller to adapt and give the function what it needs. In this case, that means simply forcing the caller to always pass an array.
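
For illustration, a minimal sketch of that alternative (hypothetical, not from the coderwall post): the function assumes an array, and callers with a scalar value wrap it themselves.

// The function always expects an array; no type-juggling inside
function example(ids) {
    ids.forEach(function (id) {
        // ...
    });
}

// Callers adapt: wrap a scalar value in an array at the call site
example([5]);
example([1, 2, 3]);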

The Future of Programming

This is a great talk given by Bret Victor that I came across a while ago:

Bret Victor – The Future of Programming from Bret Victor on Vimeo.

All four ideas presented resonate with me and my work, particularly direct manipulation of data, as I’m continually disappointed when I see markup languages and frameworks treated as the go-to solution in places where proper tools would yield better and faster results.

Rtf2Html 1.3

I recently made a small update to Rtf2Html (the converter I wrote for converting RTF text to HTML markup):

  • Support for conversion to SVG markup
  • Updated preview form to use GeckoFX 29 and XULRunner 29.0.1 (the major version numbers have to match)

Download it here.
Note that this version requires the .NET Framework v4.0 or later.

Rtf2Html SVG support

While you can put the generated SVG text on a page, that wasn’t really my motivation here; what I wanted was a way to import syntax-highlighted text (in my case, typically code) into a vector graphics application (Inkscape, Illustrator, etc.) to be placed as part of a diagram.

GeoNames geographical database

I came across the GeoNames database recently and was impressed with the breadth of locations available. I downloaded allCountries.zip from http://download.geonames.org/export/dump/, which gives data (name, location, population, etc.) on places across all countries in one tab-delimited (TSV) text file. To work with the data more easily, I wrote a PHP script to put the entries into a MySQL database table (it’s actually just a simple modification to the script I used for the Wiktionary definitions import). The TSV, MySQL database, and PHP script are all presented below.

GeoNames allCountries.zip

GeoNames MySQL database export

<?php

require "Database.php";

$tsvInputFilePath = "allCountries.txt";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if($fp === FALSE) {
    echo "Could not find file path: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line and parse tab-delimited fields
    $ln = fgets($fp);
    $parts = explode("\t", $ln);

    if(count($parts) < 19) {
        continue;
    }

    // Insert into database
    $db->query("INSERT INTO cities
                (`id`, `name`, `asciiname`, `alternatenames`, `latitude`, `longitude`,
                 `feature_class`, `feature_code`, `country_code`, `cc2`,
                 `admin1_code`, `admin2_code`, `admin3_code`, `admin4_code`,
                 `population`, `elevation`, `dem`, `timezone`, `last_modified_at`)
                VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",

        $parts[0],  $parts[1],  $parts[2],  $parts[3],  $parts[4],
        $parts[5],  $parts[6],  $parts[7],  $parts[8],  $parts[9],
        $parts[10], $parts[11], $parts[12], $parts[13], $parts[14],
        $parts[15], $parts[16], $parts[17], $parts[18]
    );
}

echo "done.\n";
exit;

The Database class is a wrapper around mysqli; you can find it, along with the script above, in the geonames-allcountries-import bitbucket repo.

Note that this script will take a while to run (likely a few days) as there are 9,195,153 records that need to be inserted and we’re just doing simple INSERTs with no optimizations.
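
If the run time is a problem, a bulk load should be dramatically faster than row-by-row INSERTs for a tab-delimited file like this. A sketch using MySQL’s LOAD DATA, assuming local_infile is enabled on the server and the cities table uses the column names from the script above:

LOAD DATA LOCAL INFILE 'allCountries.txt'
INTO TABLE cities
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(id, name, asciiname, alternatenames, latitude, longitude,
 feature_class, feature_code, country_code, cc2,
 admin1_code, admin2_code, admin3_code, admin4_code,
 population, elevation, dem, timezone, last_modified_at);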

An overview of each of the fields in the database can be found in the GeoNames export readme.txt. Particularly important are the feature_class and feature_code fields, the range of values for which can be found on the GeoNames Feature Codes page. Also, as indicated in the readme, the data is licensed under the Creative Commons Attribution 3.0 License.

Wiktionary definitions database

Having a dictionary can be incredibly useful in software development, and forms the basis for a wide range of natural language processing applications. However, finding an open-source dictionary, one that can be easily parsed and used within applications, is incredibly difficult, as there simply aren’t a lot of options available.

WordNet is one option I came across, but it requires significant work to parse the WordNet ASCII database files or Prolog database files.

Wiktionary was the other viable option, and the one I went with. The Wiktionary XML dumps are available but, being dumps of a wiki, these files are likely even more difficult to parse than the WordNet database files, as you’d have to deal with wiki markup. However, a while ago I was able to get a TSV file with words, parts of speech, and definitions from the Wikimedia Toolserver at http://toolserver.org/~enwikt/definitions. The Toolserver has since been discontinued and I haven’t found updated TSVs hosted anywhere else, but the file I downloaded, dated November 27, 2012, is still fairly up-to-date for a dictionary and useful in many applications.

I wrote a PHP script to parse the TSV and make INSERTs into a MySQL database. The TSV file, MySQL database, and PHP script are presented below.

Wiktionary TSV file

Wiktionary MySQL database export

PHP Script:

<?php

require "Database.php";

$tsvInputFilePath = "TEMP-E20121127.tsv";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if($fp === FALSE) {
    echo "Could not find file path: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line and parse tab-delimited fields
    $ln = fgets($fp);
    $parts = explode("\t", $ln);
    if(count($parts) < 4) {
        continue;
    }

    $lang = $parts[0];
    $word = $parts[1];
    $partOfSpeech = $parts[2];
    $definitionRaw = $parts[3];

    // Insert into database
    $db->query("INSERT INTO words (language, word, part_of_speech, definition_raw)
                VALUES (?, ?, ?, ?)",
        $lang, $word, $partOfSpeech, $definitionRaw);

}

echo "done.\n";
exit;

The Database class is a wrapper around mysqli; you can find it, along with the script above, in the wiktionary-tsv-import bitbucket repo.

Note that definitions need to be parsed further, as they contain wiki markup. The parsing doesn’t seem difficult and is something I hope to get done in the near future.

Related resources:

There’s valuable stuff in each of the projects above but, like WordNet, they require significantly more time to evaluate and implement in an application compared to the simple TSV -> MySQL translation.

EDIT (12/13/2015): I’ve updated the MySQL database export. There were some holes in the data because I was using the utf8 column encoding for definitions; however, MySQL’s utf8 is a weird “UTF-8” implementation that only handles codepoints up to 3 bytes in size. The utf8mb4 encoding needs to be used for proper UTF-8 support, covering codepoints up to 4 bytes.
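
For reference, converting an existing table is a one-liner; a sketch against the words table from the import script above (the collation choice is up to you):

ALTER TABLE words CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;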

Identifying the operating system with XPCOM

The following shows how to get a string identifying the current operating system from an instance of nsIXULRuntime:

var getOS = function() {
    var env = Components.classes["@mozilla.org/xre/app-info;1"]
                        .getService(Components.interfaces.nsIXULRuntime);
    return env.OS;
};

The nsIXULRuntime.OS string is one of the OS_TARGET values.

Ideally, I’d prefer XUL and XPCOM code to remain platform-agnostic, but I’ve used OS detection as a cheap way (versus jumping through 3 objects) to determine what path separator to use when referencing files and directories (backslash for “WINNT”, forward-slash for everything else). XPCOM is sensitive to the path separator; on Windows, it will not reference a file or directory if you use the forward slash. This is actually bizarre because Win32 API functions will accept paths with the forward slash as a separator. Even more bizarre is that we have a layer of abstraction that actually makes it harder to write platform-independent code.
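
A sketch of that usage (getPathSeparator is my own helper name, not an XPCOM API):

// Pick the path separator based on the OS string from nsIXULRuntime
var getPathSeparator = function() {
    return (getOS() === "WINNT") ? "\\" : "/";
};

var filePath = ["some", "dir", "file.txt"].join(getPathSeparator());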