Posts Tagged ‘PHP’

GeoNames geographical database

I came across the GeoNames database recently and was impressed with the breadth of locations available. I downloaded the allCountries.zip from http://download.geonames.org/export/dump/ which gives data (name, location, population, etc.) on places across all countries in one, TSV delimited, text file. To work with the data more easily, I wrote a PHP script to put the entries into a MySQL database table (it’s actually just a simple modification to the script I used for the Wiktionary definitions import). The TSV, MySQL database, and PHP script are all presented below.

GeoNames allCountries.zip

GeoNames MySQL database export

<?php

require "Database.php";

$tsvInputFilePath = "allCountries.txt";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if($fp === FALSE) {
echo "Could not find file path: " . $tsvInputFilePath;
exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

// Get line and parse tab-delimited fields
$ln = fgets($fp);
$parts = explode("\t", $ln);

if(count($parts) < 19) {
continue;
}

// Insert into database
$db->query("INSERT INTO cities (`id`,
`name`,
`asciiname`,
`alternatenames`,
`latitude`,
`longitude`,
`feature_class`,
`feature_code`,
`country_code`,
`cc2`,
`admin1_code`,
`admin2_code`,
`admin3_code`,
`admin4_code`,
`population`,
`elevation`,
`dem`,
`timezone`,
`last_modified_at`)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
,

$parts[0],
$parts[1],
$parts[2],
$parts[3],
$parts[4],
$parts[5],
$parts[6],
$parts[7],
$parts[8],
$parts[9],
$parts[10],
$parts[11],
$parts[12],
$parts[13],
$parts[14],
$parts[15],
$parts[16],
$parts[17],
$parts[18]

);


}

echo "done.\n";
exit;

The Database class is wrapper for mysqli, you can find it, along with the script above, in the geonames-allcountries-import bitbucket repo.

Note that this script will take a while to run (likely a few days) as there are 9,195,153 records that need to be inserted and we’re just doing simple INSERTs with no optimizations.

An overview of each of the fields in the database can be found in the GeoNames export readme.txt. Particularly important is the feature_class and feature_code fields, the range of values for which can be found on the GeoNames Feature Codes page. Also, as indicated in the readme, the data is licensed under the Creative Commons Attribution 3.0 License.

Round to midnight

A problem I’ve run into a few times is taking the current unix timestamp and rounding it to midnight, so that I can get the unix time for the start of the day. In PHP, I’ve commonly done the following:

$timestamp = strtotime('today midnight');

It’s one of the solutions presented in this StackOverflow post.

The solution above works fine, but I began thinking about how to actually do the computation and bypass the string parsing done by strtotime(). The computation is actually pretty simple, as it’s in the same vein as snapping a point to a grid. The verbose code snippet below shows the step-by-step process in the computation.

// Given the number of seconds in a day
$numSecondsInDay = 86400;

// .. and the current unix time
$currentTime = time();

// We can compute the number of days since the unix epoch (the decimal/fractional part is the portion of the current day that's elapsed)
$daysSinceEpoch = $currentTime / $numSecondsInDay;

// We can throw away the fractional part by rounding down with the floor() function
$wholeDaysSinceEpoch = floor($daysSinceEpoch);

// The number of whole days since the epoch x the number of seconds in a day will give the time for the current day at midnight
$midnightToday = $wholeDaysSinceEpoch * $numSecondsInDay;

One interesting thing to notice: if you replace the floor() function with the ceil() function, rounding up the number of days since the epoch, you’ll get the start of the next day – midnight tomorrow.

Wiktionary definitions database

Having a dictionary can be incredibly useful in software development, and forms the basis for a wide range of natural language processing applications. However, finding an open-source dictionary, one that can be easily parsed and used within applications, is incredibly difficult as there simply isn’t a lot of options available.

WordNet is one option I came across, but requires significant work parsing the WordNet ASCII database files or Prolog database files.

Wiktionary was the other viable option, and the one I went with. The Wiktionary XML dumps are available, but being a wiki, these files are likely even more difficult to parse than the WordNet database files as you’d have to deal with wiki markup. However, a while ago I was able to get a TSV file with words, parts of speech, and definitions from the Wikimedia Toolserver at http://toolserver.org/~enwikt/definitions. The Toolserver has since been discontinued and I haven’t found updated TSVs hosted anywhere else, but the file I downloaded, dated November 27, 2012, is still fairly up-to-date for a dictionary and useful in many applications.

I wrote a PHP script to parse the TSV and make INSERTs into a MySQL database. The TSV file, MySQL database, and PHP script are presented below.

Wiktionary TSV file

Wiktionary MySQL database export

PHP Script:

<?php

require "Database.php";

$tsvInputFilePath = "TEMP-E20121127.tsv";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if($fp === FALSE) {
echo "Could not find file path: " . $tsvInputFilePath;
exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

// Get line and parse tab-delimited fields
$ln = fgets($fp);
$parts = explode("\t", $ln);
if(count($parts) < 4) {
continue;
}

$lang = $parts[0];
$word = $parts[1];
$partOfSpeech = $parts[2];
$definitionRaw = $parts[3];

// Insert into database
$db->query("INSERT INTO words (language, word, part_of_speech, definition_raw)
VALUES (?, ?, ?, ?)"
,
$lang, $word, $partOfSpeech, $definitionRaw);

}

echo "done.\n";
exit;

The Database class is wrapper for mysqli, you can find it, along with the script above, in the wiktionary-tsv-import bitbucket repo.

Note that definitions need to be parsed further, as they contain wiki markup. The parsing doesn’t seem difficult and is something I hope to get done in the near future.

Related resources:

There’s valuable stuff from each of the projects above, but like WordNet, requires significantly more time to evaluate and implement in an application, compared to the simple TSV -> MySQL translation.

EDIT (12/13/2015): I’ve updated the MySQL database export. There was some holes in the data because I was using utf8 column encoding for definitions, however, MySQL’s has a weird “UTF-8” implementation that only handles codepoint that up to 3 bytes in size. utf8mb4 encoding needs to be used for a proper UTF-8 encoding supporting up to 4 bytes.

PHP count() is O(1)

I was curious about the performance of PHP’s count() function a while back and whether it was worth it to store the result in a variable for repeated use. I discovered the following from this answer by FractalizeR on Stack Overflow:

PHP_FUNCTION(count) calls php_count_recursive(), which in turn calls zend_hash_num_elements() for non-recursive array, which is implemented this way:

ZEND_API int zend_hash_num_elements(const HashTable *ht)
{
IS_CONSISTENT(ht);

return ht->nNumOfElements;
}

So you can see, it’s O(1) for $mode = COUNT_NORMAL.

IMAP Pickup

An interesting little project I wanted to work on; I wanted to be able to pull attachments from emails in an IMAP mailbox and then download them. I wanted an IMAP solution instead of writing a script for the MTA as a script would be specific to the MTA software and not transferable to another server. In addition, there’s also the common case where you may simply not have access to the MTA.

The biggest help in putting this together and dealing with attachments was this blog post and this comment on the PHP docs. Information on doing this is a bit scattered and incomplete in many cases, likely because extracting attachments is somewhat difficult as email is a notoriously bad way to transfer files; the file data is base64 encoded and dumped in as part of the message body.

ImapPickup is the class that encapsulates all the necessary functionality,

class ImapPickup
{
protected $imapStream = null;

protected function findAttachments($part)
{
$partNum = -1;
$attachments = array();

$this->findAttachmentsRec($part, &$attachments, &$partNum, -1);

return $attachments;
}

protected function findAttachmentsRec($part, &$attachments, &$partNum, $partNumSub)
{
if (isset($part->parts))
{
foreach ($part->parts as $partOfPart)
{
$this->findAttachmentsRec($partOfPart, &$attachments, &$partNum, $partNumSub+1);
}
}
else
{

if (isset($part->disposition)){
if ($part->disposition == 'attachment') {
$attachments[] = array($part->dparameters[0]->value, $partNum, $partNumSub);
}
}
}

$partNum++;
}

public function getAttachmentContent($msgNum, $partNum)
{
$contents = imap_fetchbody($this->imapStream, $msgNum, $partNum, FT_UID);
return imap_base64($contents);
}

public function getAttachments($msgNum)
{
$struct = imap_fetchstructure($this->imapStream,$msgNum,FT_UID);
$attachments = $this->findAttachments($struct);

return $attachments;
}

public function getAttachmentsFromMessages($msgArray)
{
$msgIdToAttachmentsMap = array();

if ($msgArray)
{
foreach($msgArray as $msgId)
{
$attachments = $this->getAttachments($msgId);
if(!empty($attachments))
{
$msgIdToAttachmentsMap[$msgId] = $attachments;
}
}
}

return $msgIdToAttachmentsMap;
}

public function getMessages($searchQuery)
{
return imap_search($this->imapStream, $searchQuery, SE_UID);
}

public function connect($mailbox, $user, $password)
{
$this->imapStream = imap_open($mailbox, $user, $password);
}

public function disconnect()
{
imap_close(
$this->imapStream);
}

}

Here’s a little example of how it can be used. This will query all messages with “pickup::” in the subject line and print out the messageID of all messages with attachments, followed by the filenames of all attachments for that message.

$imapPickup = new ImapPickup();
$imapPickup->connect("{mail.hotspotdot.net:143}INBOX", "test@test.net", "pass123");

$messages = $imapPickup->getMessages("SUBJECT pickup::");
$attachments = $imapPickup->getAttachmentsFromMessages($messages);

foreach($attachments as $msgId => $attArr)
{
echo "<p>{$msgId} => ";

foreach($attArr as $attachment)
{
echo $attachment[0];
echo ",";
}

echo "</p>";
}

$imapPickup->disconnect();

The array for a single file attachment contains 3 entries:

  • [0] => filename
  • [1] => major part number
  • [2] => minor part number

getAttachments(), findAttachments(), and findAttachmentsRec() will return an array of such entries (or an empty array is there are no attachments). getAttachmentsFromMessages() will return a map from messageID => array of single attachments.

The part number (both major and minor) is needed to retrieve the contents of an attachment. For getAttachmentContent(), simply use the major number if the minor number is <= 0, or concatenate them with a period separating them (e.g. "2.3").

Unity

I did a little experiment with meshing XML + HTML + PHP.

First, a little background info. Most service-side web technologies (PHP, Ruby on Rails, etc.) were designed around the concept of applying server-side markup to HTML,

<title><?php echo $title;?></title>

… which is interpreted and processed when the page is pulled. There are various frameworks and templating systems to make this easier, cleaner, more structured, etc. but the concept is the same. AJAX comes into play here and there, but it’s seen mostly as a way to sprinkle bits of dynamic data across a page.

This model was fine a few years ago, but for modern web services and apps I think there needs to be a push away from this model, primarily for 2 reasons:

  • Large portions, if not most, of the content of many web apps is dynamic, AJAX-queried data.
  • Support for mobile and desktop platforms connecting to web services is becoming increasingly important; platforms/apps for which HTML/CSS (or at least HTML/CSS designed for a browser) may not serve to provide the best UI.

AJAX structured code fits the bill.

A 100% AJAX architecture elegantly separates application logic (output as XML) from presentation (output as HTML/CSS). Unfortunately, 100% AJAX is not that easy for a complex application. JavaScript interpretation can be slow, killing frontend performance, and backend performance can suffer from the repeated flood of asynchronous requests. In addition, the latency from a multitude of AJAX requests+responses can quickly become noticeable (especially with a burdened backend and during an initial page load). When such issues become apparent, the solution always seems to be to go back to the messy world of hacking server-side tags into HTML.

Unity is an attempt to maintain a 100% AJAX application framework, but alleviate the issues mentioned above for web apps. It’s very, very simple; just a bit of glue that takes the XML from one or more AJAX calls (on localhost; the application server itself), turns it into JSON, and embeds it within the <head> element of the HTML document being served. This should not be done for every call, but for AJAX calls where the data being presented is “mostly static” (e.g. username, messages, etc.). The data can then be accessed in the client-side JavaScript and used as necessary.

Looking at the typical three-tier architecture, Unity is a bit of glue that sits above the app logic (technically, it could be thought of as part of the app logic, but it’s separated here for illustrative purposes).

app stack with glue

However, since the glue makes 1+ calls to fetch the XML from the app logic, things look a bit more like this on the backend:

backend app stack, multiple, with glue

As I mentioned, this is all very simple stuff. Here’s the code that does the magic:

class Unity
{
protected $js = null;
protected $htmlFile = null;

function __construct($_htmlFile)
{
$this->js = array();
$this->js[] = "var Unity = {}";

$this->htmlFile = $_htmlFile;
}

public function PushXml($_xmlFile, $_namespace)
{
$xml = simplexml_load_file($_xmlFile);

// see http://www.php.net/manual/en/function.json-encode.php#97008 for why json_encode is done like this
$this->js[] = "Unity.{$_namespace} = " . json_encode(new SimpleXMLElement($xml->asXML(), LIBXML_NOCDATA));
}

public function PushVar($name, $var)
{
$snippet = "Unity.{$name} = {$var}";
$this->js[] = $snippet;
}

public function UnifyAndEcho()
{
$scriptBlock = implode(';', $this->js);

$dom = new DOMDocument();
$dom->loadHTMLFile($this->htmlFile);
$head = $dom->getElementsByTagName("head")->item(0);

$scriptElem = $dom->createElement('script', $scriptBlock);
$scriptElem->setAttribute("type", "text/javascript");

if($head->firstChild)
{
$head->insertBefore($scriptElem, $head->firstChild);
}
else
{
$head->appendChild($scriptElem);
}

echo $dom->saveHTML();
}
}

Note that you don’t have to parse the XML; json_encode() takes the SimpleXMLElement object and converts the entire tree to JSON. This is done in order to:

  • keep Unity (and the glue layer) as brain-dead simple as possible
  • avoid mucking around with application logic at this level (if the XML doesn’t fits well [e.g. too big, too much unnecessary data, etc.] it’s a problem at a lower level)

Now an example of how the Unity class would be utilized:

$uni = new Unity("test-page.html");
$uni->PushXml("http://localhost/test-xml.xml", "Account");
$uni->UnifyAndEcho();

Assume the XML output is:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<account_info>
<username>
pixel</username>
<stats
timespan="month">
<check_ins>
12432</check_ins>
<photos>
123</photos>
<followers>
32</followers>
</stats>
</account_info>
</root>

… in the client-side JavaScript code, we can access any of this data via Unity.<namespace>… The namespace was specified as “Account”, so the variables can be accessed as follows,

// username
Unity.Account.account_info.username;

// stats.check_ins
Unity.Account.account_info.stats.check_ins;

// ...

Sessions

A not-too-obvious challenge in doing something like this handling session data. As Unity is calling scripts from localhost (and not the client’s computer), a different session, one for localhost, is active. In addition, session data is locked, preventing two active scripts from accessing the same session data (which the xml generating script may very well need)

First, here’s how to get the session id:

session_start();
$sess_id = session_id();
session_write_close();

session_write_close() closes the session so that the subsequent script call will be able to access it. If necessary, session data can be started and access again after or in-between Unity::PushXml() calls; however, again, Unity is meant to be just a lightweight bit of glue, there shouldn’t really be a need to read or manipulate session data at this point in the code.

We can now pass the session id as a GET parameter:

$uni->PushXml($localbase . "controller/getlogin.php?sid={$sess_id}", "Account");

Now within the script being called, session_id() needs to be called with the passed session id, before session_start().

if(isset($_GET['sid']))
{
session_id(
$_GET['sid']);
}

session_start();

That’s it.

Finally it’s worth mentioning that none of this is bound to PHP nor XML. This could be done in any server-side language and the app logic could just as easily output JSON.

PostgreSQL + PHP installation on Windows 2003 x64

Well the PostgreSQL installation itself is easy enough, getting it to work with PHP is the challenging part. Here’s what I did:

Reflex Feedback widget

I worked on a small AJAX widget for user feedback built atop jQuery UI: Reflex Feedback. It’s inspired by the widgets you see from services like Get Satisfaction and UserVoice, but much simpler and it’s a frontend-only widget, how you handle the feedback info on the backend is up to you.

Here’s what it looks like.

reflex feedback widget dialog

And here’s what the tag that opens the dialog looks like:

reflex feedback widget tag

To use it, download or clone the ReflexFeedback repo from bitbucket

Place the .js file wherever you’d like but the /reflex.content folder should a subdirectory in the same folder as the page loading the .js file. Load reflex.js as you would any other javascript file:

<script type="text/javascript" src="js/reflex.js"></script>

Call Reflex.init() to add the widget to the page. The first argument is the DOM element to attach the additional HTML/CSS code to. The seconds argument is the server-side script to call when the user clicks Send Feedback.

Reflex.init($('body'), 'controller/post_feedback.php');

That’s it for the frontend. You should see the tag show up in the right-hand corner and when clicked the dialog open.

For the backend, the AJAX call to send the feedback info will send a POST request with 2 fields: feedback_type, feedback_txt.

Reflex expects an XML reply from the server:

<reflex>
<result>ok</result>
</reflex>

ok indicates a successful result, any other reply is considered an error.

A successful result will close the dialog and show another with a thank you message.

reflex feedback thank you dialog

For an error, a message is shown below the Send Feedback button, informing the user that an error has occurred and to try again.

reflex feedback send fail

As for what to actually do with the feedback, that’s up to you, but what I’m doing is sending myself an email with the feedback info. I’ve posted my PHP script below; feel free to use it, modify it, etc. If you do use this code, be sure to fill in your mail server credentials and a from address; you’ll also need PEAR’s Mail package installed.

<?php

require_once "Mail.php";
require_once "Mail/mime.php";

header('Content-type: application/xml; charset=utf-8');
echo "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n";

if(!isset($_POST['feedback_type']) || !isset($_POST['feedback_txt']))
{
echo "<reflex><result>error:missing-arguments</result></reflex>";
}
else
{
$from = "...";
$to = "...";
$subject = "Feedback from user...";

$feedback_type = $_POST['feedback_type'];
$feedback_txt = $_POST['feedback_txt'];

$bodyHtml = "<html><body>";
$bodyHtml .= "<p>Type: {$feedback_type}</p>";
$bodyHtml .= "<p>Feedback: {$feedback_txt}</p>";
$bodyHtml .= "</body></html>";
$body = $bodyHtml;

$host = "...";
$port = "...";
$username = "...";
$password = "...";

$headers = array('MIME-Version' => '1.0rn',
'Content-type' => 'text/html; charset=utf-8',
'From' => $from, 'To' => $to, 'Subject' => $subject);


$smtp = Mail::factory('smtp',
array ('host' => $host,
'port' => $port,
'auth' => true,
'username' => $username,
'password' => $password));

$mail = $smtp->send($to, $headers, $body);

if (PEAR::isError($mail))
{
$err_details = $mail->getMessage();
echo "<reflex><result>error:send-failure</result><details>{$err_details}</details></reflex>";
}
else
{
echo "<reflex><result>ok</result></reflex>";
}
}

?>

That’s all for now. I’ll work on more features and options for customization in the future. You can see the widget in action over at dotspott.com

hotspotdot

hotspotdot is my entry into Microsoft’s My App is Better Challenge. In a nutshell, hotspotdot allows you to locate and tag “hot spots” that are important to you, allowing you to create a personal database of hot spots.

It’s a very simple app, utilizing PHP, SQL Server 2008, jQuery, jQuery UI, and the Google Maps API. I wrapped thing up in under 2 weeks, but in order to make the August 25th deadline there were a few features I didn’t get a chance to put in, such as preloaders, icons on the map, and searching by tag. I’ll probably devote some time to these in the coming weeks.

hotspotdot login

The goal of the contest was to create something showing off the power of SQL Server and Microsoft’s new PHP for SQL Server 2.0 drivers. I used the new PDO driver which was pretty sweet – very simple, elegant API. Two things in particular that impressed me were transactions (very nice for doing multiple inserts or deletions) and a consistent and well designed exception model (makes error handling much easier and especially powerful when combined with transactions as failed queries don’t effect the database [code jumps to exception handler before commit() is called], so no chance of junk being inserted). Error messages, in general, were also much more descriptive than those I’ve encountered with MySQL.

Finding a server proved difficult. I got a shared hosting solution, but the host was unable (though I suspect unwilling) to install the PHP for SQL drivers. So I ended up getting a virtual private server, but this only came with SQL Server Express and I used some features (see below) that prevented a migration. In order to avoid purchasing an SQL Server 2008 license (way out of my budget), my final solution was use the shared hosting server for the DB and the VPS for everything else. Hence the reason for the site being located at the my. subdomain, which maps to the IP of the VPS.

SQL Server 2008 is a fine system (despite some annoyances with the management studio)… though it’s really just a solid database system, which isn’t bad, but there’s nothing really impressive or creative about that. It’s not really leaps and bounds above a cheap solution like MySQL. The contest was about SQL Server, so I tried to do something that utilized a fairly unique aspect of the system: the geography data type. I wasn’t too impressed. It simply holds a (longitude, latitude) pair in a certain format, nothing more. The one big advantage of having a vector type like this would be doing comparisons based on distance but, as far as I could tell, this isn’t supported (my queries failed). Worse yet, in SQL Server Management Studio the display of the geography type is in hexadecimal, making things very cryptic… I couldn’t help but wonder why this is any better than using 2 columns and storing the longitude and latitude values independently. That said, the idea of richer/more-complex data types within a relational database is a pretty cool idea and it would interesting to see it taken further and beyond its current, primitive state.

hotspotdot map

hotspotdot map

Leave a comment if you find a bug or have any questions, comments, etc.

PHP session_start() “Node no longer exists”

I stumbled upon this error earlier today as I attempted to store the value of a SimpleXMLElement as a session variable. I was able to narrow down the issue thanks to this post on bytes.com.

According to a user post on the PHP site, this occurs because SimpleXML returns a reference to an object containing the node value, and you can’t store a reference as a session variable. The value must be dereferenced and copied which can be done by casting.

// Bad! $storageBoxSize = $xml->data->storage_box_size;

// Good! $storageBoxSize = (int)$xml->data->storage_box_size;

I find myself hating loosely-typed languages more and more.