Archive for the ‘Random’ Category

Rtf2Html 1.3

I recently made a small update to Rtf2Html (the converter I wrote for converting RTF text to HTML markup):

  • Support for conversion to SVG markup
  • Updated preview form to use GeckoFX 29 and XULRunner 29.0.1 (the major version numbers have to match)

Download it here.
Note that this version requires the .NET Framework v4.0 or later.

Rtf2Html SVG support

While you can put the generated SVG text on a page, that wasn’t really my motivation here; what I wanted was a way to import syntax-highlighted text (in my case, typically code) into a vector graphics application (Inkscape, Illustrator, etc.) to be placed as part of a diagram.

GeoNames geographical database

I came across the GeoNames database recently and was impressed with the breadth of locations available. I downloaded allCountries.zip from http://download.geonames.org/export/dump/, which gives data (name, location, population, etc.) on places across all countries in one tab-delimited (TSV) text file. To work with the data more easily, I wrote a PHP script to put the entries into a MySQL database table (it’s actually just a simple modification of the script I used for the Wiktionary definitions import). The TSV, MySQL database, and PHP script are all presented below.

GeoNames allCountries.zip

GeoNames MySQL database export

<?php

require "Database.php";

$tsvInputFilePath = "allCountries.txt";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if ($fp === FALSE) {
    echo "Could not open file: " . $tsvInputFilePath . "\n";
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line, trim the trailing newline, and parse tab-delimited fields
    $ln = fgets($fp);
    if ($ln === FALSE) {
        break;
    }

    $parts = explode("\t", rtrim($ln, "\r\n"));

    // Skip malformed lines (a valid record has 19 fields)
    if (count($parts) < 19) {
        continue;
    }

    // Insert into database
    $db->query("INSERT INTO cities (`id`, `name`, `asciiname`, `alternatenames`,
                `latitude`, `longitude`, `feature_class`, `feature_code`,
                `country_code`, `cc2`, `admin1_code`, `admin2_code`,
                `admin3_code`, `admin4_code`, `population`, `elevation`,
                `dem`, `timezone`, `last_modified_at`)
                VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
        $parts[0], $parts[1], $parts[2], $parts[3], $parts[4],
        $parts[5], $parts[6], $parts[7], $parts[8], $parts[9],
        $parts[10], $parts[11], $parts[12], $parts[13], $parts[14],
        $parts[15], $parts[16], $parts[17], $parts[18]);
}

fclose($fp);

echo "done.\n";
exit;

The Database class is a wrapper for mysqli; you can find it, along with the script above, in the geonames-allcountries-import bitbucket repo.

Note that this script will take a while to run (likely a few days) as there are 9,195,153 records that need to be inserted and we’re just doing simple INSERTs with no optimizations.
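
Wrapping batches of inserts in a transaction should cut that time down considerably, since MySQL won’t have to flush to disk after every row. Here’s a minimal sketch of that approach, assuming the Database wrapper’s query() method will pass plain SQL statements through as-is (adjust to however the real class behaves):

<?php

require "Database.php";

// Sketch: group INSERTs into transactions of 10,000 rows, so MySQL
// doesn't flush to disk after every single row
$fp = fopen("allCountries.txt", "r");
$db = new Database();

$db->query("SET autocommit = 0");
$rowCount = 0;

while (($ln = fgets($fp)) !== FALSE) {
    $parts = explode("\t", rtrim($ln, "\r\n"));
    if (count($parts) < 19) {
        continue;
    }

    // Same INSERT as in the script above
    $db->query("INSERT INTO cities (`id`, `name`, `asciiname`, `alternatenames`,
                `latitude`, `longitude`, `feature_class`, `feature_code`,
                `country_code`, `cc2`, `admin1_code`, `admin2_code`,
                `admin3_code`, `admin4_code`, `population`, `elevation`,
                `dem`, `timezone`, `last_modified_at`)
                VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
        $parts[0], $parts[1], $parts[2], $parts[3], $parts[4],
        $parts[5], $parts[6], $parts[7], $parts[8], $parts[9],
        $parts[10], $parts[11], $parts[12], $parts[13], $parts[14],
        $parts[15], $parts[16], $parts[17], $parts[18]);

    // Commit every 10,000 rows instead of after each one
    if (++$rowCount % 10000 === 0) {
        $db->query("COMMIT");
    }
}

$db->query("COMMIT");
fclose($fp);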

An overview of each of the fields in the database can be found in the GeoNames export readme.txt. Particularly important are the feature_class and feature_code fields, the range of values for which can be found on the GeoNames Feature Codes page. Also, as indicated in the readme, the data is licensed under the Creative Commons Attribution 3.0 License.
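
As a quick example of why those fields matter: the feature class lets you filter the raw dump down to just the kinds of places you care about. Here’s a sketch of pulling large populated places (feature class “P” per the Feature Codes page) out of the table, assuming the Database wrapper’s query() returns result rows as associative arrays (the real class may return results differently):

<?php

require "Database.php";

$db = new Database();

// Fetch populated places (feature_class 'P') with over a million people
$rows = $db->query("SELECT `name`, `country_code`, `population`
                    FROM cities
                    WHERE `feature_class` = ? AND `population` > ?
                    ORDER BY `population` DESC",
    "P", 1000000);

foreach ($rows as $row) {
    echo "{$row['name']} ({$row['country_code']}): {$row['population']}\n";
}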

Wiktionary definitions database

Having a dictionary can be incredibly useful in software development; dictionaries form the basis for a wide range of natural language processing applications. However, finding an open-source dictionary that can be easily parsed and used within applications is surprisingly difficult, as there simply aren’t many options available.

WordNet is one option I came across, but it requires significant work to parse the WordNet ASCII database files or Prolog database files.

Wiktionary was the other viable option, and the one I went with. The Wiktionary XML dumps are available but, being wiki content, these files are likely even more difficult to parse than the WordNet database files, as you’d have to deal with wiki markup. However, a while ago I was able to get a TSV file with words, parts of speech, and definitions from the Wikimedia Toolserver at http://toolserver.org/~enwikt/definitions. The Toolserver has since been discontinued and I haven’t found updated TSVs hosted anywhere else, but the file I downloaded, dated November 27, 2012, is still fairly up-to-date for a dictionary and useful in many applications.

I wrote a PHP script to parse the TSV and make INSERTs into a MySQL database. The TSV file, MySQL database, and PHP script are presented below.

Wiktionary TSV file

Wiktionary MySQL database export

PHP Script:

<?php

require "Database.php";

$tsvInputFilePath = "TEMP-E20121127.tsv";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if ($fp === FALSE) {
    echo "Could not open file: " . $tsvInputFilePath . "\n";
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line, trim the trailing newline, and parse tab-delimited fields
    $ln = fgets($fp);
    if ($ln === FALSE) {
        break;
    }

    $parts = explode("\t", rtrim($ln, "\r\n"));

    // Skip malformed lines (a valid record has 4 fields)
    if (count($parts) < 4) {
        continue;
    }

    $lang = $parts[0];
    $word = $parts[1];
    $partOfSpeech = $parts[2];
    $definitionRaw = $parts[3];

    // Insert into database
    $db->query("INSERT INTO words (language, word, part_of_speech, definition_raw)
                VALUES (?, ?, ?, ?)",
        $lang, $word, $partOfSpeech, $definitionRaw);
}

fclose($fp);

echo "done.\n";
exit;

The Database class is a wrapper for mysqli; you can find it, along with the script above, in the wiktionary-tsv-import bitbucket repo.

Note that definitions need to be parsed further, as they contain wiki markup. The parsing doesn’t seem difficult and is something I hope to get done in the near future.
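
To give an idea of what that parsing might look like, here’s a rough sketch that strips a few common wikitext constructs from a raw definition. It’s a starting point, not a complete parser; nested templates, references, and plenty of other cases aren’t handled:

<?php

// Rough cleanup of common wiki markup in a definition string
function stripWikiMarkup($text) {
    // [[target|label]] -> label, [[word]] -> word
    $text = preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/', '$1', $text);

    // Drop {{...}} templates (non-nested only)
    $text = preg_replace('/\{\{[^}]*\}\}/', '', $text);

    // Remove '''bold''' and ''italic'' quote markers
    $text = str_replace(array("'''", "''"), '', $text);

    // Strip the leading "#" list marker used on definition lines
    $text = preg_replace('/^#+\s*/', '', $text);

    // Collapse leftover runs of whitespace
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

echo stripWikiMarkup("# A [[word|term]] with a {{context|label}} template.");
// Prints: A term with a template.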

Related resources:

There’s valuable stuff in each of the projects above but, like WordNet, they require significantly more time to evaluate and implement in an application, compared to the simple TSV -> MySQL translation.

EDIT (12/13/2015): I’ve updated the MySQL database export. There were some holes in the data because I was using the utf8 column encoding for definitions; however, MySQL has a weird “UTF-8” implementation that only handles codepoints up to 3 bytes in size. The utf8mb4 encoding needs to be used for a proper UTF-8 encoding supporting codepoints up to 4 bytes.
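
For reference, converting an existing table is a one-liner; the statement below assumes the words table from the import script:

<?php

require "Database.php";

$db = new Database();

// Convert the table and its text columns from MySQL's 3-byte "utf8"
// to utf8mb4, which supports the full 4-byte range of UTF-8
$db->query("ALTER TABLE words CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");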

Identifying the operating system with XPCOM

The following shows how to get a string identifying the current operating system from an instance of nsIXULRuntime:

var getOS = function() {
    var env = Components.classes["@mozilla.org/xre/app-info;1"]
                        .getService(Components.interfaces.nsIXULRuntime);
    return env.OS;
}

The nsIXULRuntime.OS string is one of the OS_TARGET values.

Ideally, I’d prefer XUL and XPCOM code to remain platform-agnostic, but I’ve used OS detection as a cheap way (versus jumping through 3 objects) to determine what path separator to use when referencing files and directories (backslash for “WINNT”, forward-slash for everything else). XPCOM is sensitive to the path separator; on Windows, it will not reference a file or directory if you use the forward slash. This is actually bizarre because Win32 API functions will accept paths with the forward slash as a separator. Even more bizarre is that we have a layer of abstraction that actually makes it harder to write platform-independent code.

Data driven

The Economist recently wrote a bit about how speech recognition got so good:

… words do not appear in random order, so the computer does not have to guess from (say) a vocabulary of 20,000 words for each word you speak. Instead, the software assesses how likely you are to have said a given word based on the surrounding words, drawing on statistical models derived from vast repositories of digitised documents and the previous utterances of other users.

This reminded me of a talk by Peter Norvig: The Unreasonable Effectiveness of Data, where he discusses utilizing such large repositories of data in order to develop effective algorithms for a number of problems; there is a heavy focus on natural language processing problems but the concept can, of course, be applied in other areas.

(If the name Peter Norvig sounds familiar, he’s the co-author of Artificial Intelligence: A Modern Approach which you might have used if you ever took an AI class.)

As a programmer, this is exciting stuff, and it has certainly changed my thinking about how I would approach similar problems in the future. Whereas before I would look at sample data sets and try to derive an algorithm, I’d now attempt to mine as much data as I could, build a statistical model, and use that as the basis of the algorithm. Of course, mining a massive data set is sometimes easier said than done; when it comes to data, much of the web is still a walled garden.
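
To make the idea a bit more concrete, here’s a toy sketch of that statistical approach: count word bigrams in a corpus, then use the counts to predict the most likely next word. Real systems draw on vastly larger corpora (and use smoothing for unseen pairs), but the principle is the same:

<?php

// Toy bigram model: count word pairs in a corpus, then guess the most
// likely word to follow a given one
$corpus = "the cat sat on the mat the cat ate the fish";
$words = explode(" ", $corpus);

// Build bigram counts: $counts[prev][next] = frequency
$counts = array();
for ($i = 0; $i < count($words) - 1; $i++) {
    $prev = $words[$i];
    $next = $words[$i + 1];
    if (!isset($counts[$prev][$next])) {
        $counts[$prev][$next] = 0;
    }
    $counts[$prev][$next]++;
}

// Most likely word to follow "the", based on the counts
arsort($counts["the"]);
echo key($counts["the"]) . "\n"; // Prints: cat (seen twice after "the")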

Launching an application with XPCOM

Continuing to document my work with XULRunner, XUL, and XPCOM, here I’m presenting code showing how to launch an executable, using XPCOM’s nsILocalFile interface to reference the executable file and the nsIProcess interface to run it as a process.

// target = path to executable
// args = array of arguments for the executable
function exec(target, args) {

    try {
        // Reference the executable file
        var file = Components.classes["@mozilla.org/file/local;1"]
                             .createInstance(Components.interfaces.nsILocalFile);
        file.initWithPath(target);

        // Create a process from the file
        var process = Components.classes["@mozilla.org/process/util;1"]
                                .createInstance(Components.interfaces.nsIProcess);
        process.init(file);

        // Run asynchronously (false = don't block until the process exits)
        args = args || [];
        process.run(false, args, args.length);

        return process;
    }
    catch (err) {
        alert(err);
        return null;
    }
}

Fixed (non-resizable) windows with XULRunner

I’ve been working a bit with XULRunner lately and wanted to create a fixed, non-resizable application window. After some searching, I eventually stumbled upon some code which led to a solution – adding a function call to the application’s prefs.js file:

pref("toolkit.defaultChromeFeatures", "chrome,resizable=no,dialog=no");

resizable=no prevents the window from being resized; dialog=no makes it a non-dialog window, so that you can still minimize it.

Simple stuff, but this was difficult to find; discovered thanks to this post on glazman.org.

Note that after some testing with XULRunner on Ubuntu, it appears that this may be a Windows-only setting.

Habitat 67

An interesting housing complex in Montreal: Habitat 67, designed by Moshe Safdie:

Habitat 67
Photo by Taxiarchos228 at the German language Wikipedia

Musei Civici Veneziani

While cleaning out old cards from my wallet, I found my museum pass for the Civic Museums of Venice, from a trip to Italy a few years ago.

musei-civici-veneziani-pass

C for Dummies

I snapped a photo of my C for Dummies books on a visit to my parents’ place (my apartment is tiny, so a lot of my old books are there). C was the first programming language I took a serious interest in learning. I knew a bit about code prior, as I had toyed around with QBASIC and Visual Basic, but I was interested in video games and computer graphics, for which neither of those languages was well suited.

I spent maybe a month consuming these two books when I was about 16, going back and forth between the pages and MS Edit + DJGPP to write and compile code. When it all clicked (and there really was that moment of clarity, as everything went from looking cryptic and indiscernible to elegant and natural), it was incredibly exciting.

These books were a great introduction to the C language, and to programming in general, and Dan Gookin deserves a lot of credit; technical writing is hard, even more so when your audience is the absolute beginner.

C for Dummies