Posts Tagged ‘PHP’

Improving on strip_tags (part 2)

Whitespace and tags

Previously, I looked at improving the functionality of strip_tags such that words across tags are not mashed together. The method I derived works well enough but it’s limited in that all tags are treated the same way and all whitespace separators are the same. I wanted to see if I could improve the method a bit more to address these limitations; that is, introduce whitespace based on the type of tag encountered, instead of injecting the same whitespace after stripping away any tag.

For example, when dealing with inline tags, whitespace should be preserved:

This bit of HTML:

<span>the quick brown fox </span><span>jumped over the moon</span>

… should produce:

the quick brown fox jumped over the moon

Alternatively, when dealing with block-level tags, a newline should be injected:

This bit of HTML:

<div>the quick brown fox</div><div>jumped over the moon</div>

… should produce:

the quick brown fox
jumped over the moon

Note that we’re simply talking about common/expected browser behavior for what’s thought of as inline-level or block-level tags. In reality, this categorization isn’t really part of the HTML standard anymore; as MDN notes, layout behavior is now determined by CSS.

That said, when looking at arbitrary HTML content, I still think “block” vs. “inline” is a useful distinction, at least insofar as inferring default or common behavior.

The special case

The <br> tag presents a special case. While it’s classified as an inline element, <br> represents whitespace that is generally similar to that of a block-level element (e.g. a newline). In implementation this is simple to handle but does introduce a tiny bit of additional complexity.

Looking at the high-level transformations needed, we get the following:

  • Inline-level tags → strip away (no action needed, don’t alter any existing whitespace within tag contents)
  • Block-level tags → strip away, replace with newline
  • <br> tags → strip away, replace with newline

Code

Reworking the convert() method from the previous post, we get the following:

class HTMLToPlainText {
    const BLOCK_LEVEL_ELEMENTS = [
        "address", "article", "aside", "blockquote", "details", "dialog", "dd", "div", "dl", "dt",
        "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6",
        "header", "hgroup", "hr", "li", "main", "nav", "ol", "p", "pre", "section", "table", "ul"
    ];

    const INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE = [
        "br",
    ];

    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $blockContentSeparator = "\n"): string {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // the output string being built up
        $output = "";

        // buffer for current/last tag name read
        $currentTagName = "";
        $currentTagIsClosing = null;

        // buffer for content in the current tag being read
        $contentInCurrentTag = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

        $flushCurrentToOutput = function() use (&$output, &$contentInCurrentTag, &$currentTagName, &$currentTagIsClosing, &$blockContentSeparator) {
            // handle inline tags which produce a newline (e.g. <br>)
            // .. note that these can be empty (<br>) or self-closing (<br/>)
            if(in_array(strtolower($currentTagName), self::INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE)) {
                $output .= $contentInCurrentTag . $blockContentSeparator;
            } else {
                // append $blockContentSeparator if we're at the *opening or closing* of a block-level element
                // (for inline elements, leave content as-is)
                if (in_array(strtolower($currentTagName), self::BLOCK_LEVEL_ELEMENTS)) {
                    $output .= $contentInCurrentTag . $blockContentSeparator;
                } else {
                    $output .= $contentInCurrentTag;
                }
            }

            // reset
            $contentInCurrentTag = "";
            $currentTagIsClosing = null;
            $currentTagName = "";
        };

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $flushCurrentToOutput();
                $parserState = self::STATE_READING_TAG_NAME;
                continue;
            }

            if($ch === '>') {
                $flushCurrentToOutput();
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState == self::STATE_READING_TAG_NAME && $ch == '/') {
                $currentTagIsClosing = true;
                continue;
            }

            if($parserState == self::STATE_READING_TAG_NAME) {
                $currentTagName .= $ch;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $contentInCurrentTag .= $ch;
                continue;
            }
        }

        $flushCurrentToOutput();

        return trim($output, $blockContentSeparator);
    }
}
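As a quick sanity check, running the earlier examples through the method (outputs traced from the flush logic above):

// the inline case: content is passed through untouched
echo HTMLToPlainText::convert("<span>the quick brown fox </span><span>jumped over the moon</span>");
// => "the quick brown fox jumped over the moon"

// the block case: a separator is emitted at both the opening and closing of each
// block-level element, so adjacent <div>s end up with a blank line between them
echo HTMLToPlainText::convert("<div>the quick brown fox</div><div>jumped over the moon</div>");
// => "the quick brown fox\n\njumped over the moon"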

Testing

Throwing some arbitrary bits of HTML at this function seems to indicate that it works correctly, but a method like this really calls for some form of automated testing. I could derive test cases from the function logic, as is typically done when testing an arbitrary method, but that approach is both biased and limited here. Biased in that I’d be looking at the function and coming up with test cases based upon my experiences (what I’ve encountered and where I think there may be potential issues). Limited in that I’d likely only come up with a handful of test cases unless I invested a significant chunk of time into compiling a comprehensive set; HTML has relatively few building blocks but, given the number of ways those blocks can be combined and arranged, we end up with a fairly large number of permutations. What would really be effective here is testing with a large and varied corpus of test cases, mapping HTML snippets to plain text representations; i.e. data-driven testing. It’s usually hard to generate or find data for such testing, but the PHP repository has a number of test cases for strip_tags() that can be leveraged:

  • strip_tags_basic1.phpt has some good baseline tests (HTML tags, PHP tags, tags with attributes, HTML comments, etc.)
  • strip_tags_basic2.phpt has a good test case (different tags + mix of block and inline elements + PHP tags) but is really testing the allowed_tags_array argument to strip_tags(), which I forgot was a thing and didn’t consider in my method

Beyond the test cases in these 2 files, there are other good cases scattered in the repo, seemingly tied to specific bugs encountered (e.g. bug #53319, which involves handling of “<br />” tags), but they can be hard to locate given the organization, or lack thereof, of the test files. In any case, it’s great having this data to work with, and some issues did surface when I began subjecting my code to these tests (e.g. the content separator for block-level elements needing to be emitted at the point of both the opening and closing tags, not just the closing tag).

Implementation-wise, testing is mainly encoding the test cases in a map and asserting that the actual results match expectations:

$testCases = [
    "<html>hello</html>" => "hello",
    "<?php echo hello ?>" => "",
    "<? echo hello ?>" => "",
    "<% echo hello %>" => "",
    "<script language=\"PHP\"> echo hello </script>" => " echo hello ",
    "<html><b>hello</b><p>world</p></html>" => "hello\nworld",
    "<html><!-- COMMENT --></html>" => "",
    "<html><p>hello</p><b>world</b><a href=\"#fragment\">Other text</a></html><?php echo hello ?>" => "hello\nworldOther text",
    "<p>hello</p><p>world</p>" => "hello\n\nworld",
    '<br /><br />USD<input type="text"/><br/>CDN<br><input type="text" />' => "USD\nCDN",
];

foreach ($testCases as $html => $expectedPlainText) {
    $actualPlainText = HTMLToPlainText::convert($html);

    echo "TEST: " . $html . "\n";
    echo "EXPECTED: " . $expectedPlainText . "\n";
    echo "ACTUAL: " . $actualPlainText . "\n";
    echo "----\n";

    assert($actualPlainText === $expectedPlainText);
}

Testing is still limited here. I’d love to simply have a large batch of test cases to throw at the function, but something like that is not readily available.

Limitations / future work

The new convert() method is more robust, but there are still some key limitations when compared to the strip_tags() function:

  • PHP’s strip_tags() is actually a lot more robust when it comes to invalid/malformed HTML content, as the tests in strip_tags.phpt demonstrate
  • Preserving certain tags (as with the allowed_tags_array argument) wasn’t considered

Also, whitespace/separators produced from <br> elements at the beginning or end of the inputted HTML are stripped away. I don’t think this is correct, as browsers preserve whitespace from <br> elements and don’t collapse them as they do empty block-level elements.
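To illustrate with a small sketch, the trim() call at the end of convert() eats the leading separator that the <br> produced:

// the leading newline from <br> is trimmed away, though a browser would preserve it
echo HTMLToPlainText::convert("<br>hello");
// => "hello" rather than "\nhello"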

Improving on strip_tags

The Problem

PHP’s strip_tags() function will strip away tags but makes no attempt to introduce whitespace to separate the content of adjacent tags. This is an issue with arbitrary HTML, as adjacent block-level elements may not have any intermediate whitespace, and simply stripping away the tags will incorrectly concatenate the textual content of the two elements.

For example, running strip_tags() on the following:

<div>the quick brown fox</div><div>jumped over the moon</div>

… will return:

the quick brown foxjumped over the moon

This is technically correct (we’ve stripped away the <div> tags) but having no whitespace between “fox” and “jumped” means we’ve transformed the content such that we’ve lost semantic and presentational details.

The Solution

There are two ways I can see to fix this behavior:

  • Pre-process the HTML content to ensure or introduce whitespace between block-level elements
  • Don’t use strip_tags() and utilize a method that better understands the need for spacing between elements

I’ll focus on the latter because that’s the avenue I went down and I didn’t consider pre-processing at the time.

Pulling together a quick-and-dirty parser, I wrote the following. It’s worth noting that this still doesn’t really consider what the tags are (e.g. whether they’re inline or block), but it allows the caller to specify a string ($tagContentSeparator), typically some whitespace, that is inserted between the stripped-away tags:

<?php

class HTMLToPlainText {
    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $tagContentSeparator = " "): string {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // string within tags that we've found
        $foundContentStrings = [];

        // buffer for current content being read
        $currentContentString = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

        // method to add a non-empty string to $foundContentStrings and reset $currentContentString
        $commitCurrentContentString = function() use (&$currentContentString, &$foundContentStrings) {
            if(strlen($currentContentString) > 0) {
                $foundContentStrings[] = trim($currentContentString);
                $currentContentString = "";
            }
        };

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $parserState = self::STATE_READING_TAG_NAME;
                $commitCurrentContentString();
                continue;
            }

            if($ch === '>') {
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $currentContentString .= $ch;
                continue;
            }
        }

        $commitCurrentContentString();

        return implode($tagContentSeparator, $foundContentStrings);
    }
}

Note that the to/from UTF-8 ↔ UTF-32 conversion isn’t really necessary. I initially did the conversion because I was worried about splitting a multibyte character, but this isn’t possible given how the function reads the input string.
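For reference, a sketch of why a byte-wise read would have been safe: in UTF-8, every byte of a multibyte sequence has its high bit set, so it can never collide with single-byte ASCII tokens like < and >:

// hypothetical simplification: iterate over raw UTF-8 bytes instead of UTF-32 units
$input = "<div>héllo</div>";
for ($i = 0; $i < strlen($input); $i++) {
    $ch = $input[$i];
    // '<' (0x3C) and '>' (0x3E) are ASCII; lead and continuation bytes of a
    // multibyte UTF-8 character are always >= 0x80, so this comparison can
    // never match in the middle of a multibyte character
    if ($ch === '<' || $ch === '>') {
        // token handling would go here
    }
}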

Now if we take the following HTML snippet:

<div>the quick brown fox</div><div>jumped over the moon</div>

… rendered in a browser, we get:

the quick brown fox
jumped over the moon

… with strip_tags() we get:

the quick brown foxjumped over the moon

… and with HTMLToPlainText::convert() (passing in “\n” for $tagContentSeparator), we get:

the quick brown fox
jumped over the moon

The latter results in text that is semantically correct, as words in different blocks aren’t incorrectly joined. Presentationally we also get a more correct conversion, but the method isn’t really doing anything fancy here; this is due to the caller knowing a bit about the HTML snippet and how a browser would render it, and passing in “\n” for $tagContentSeparator.

Limitations / future work

The improvement here is that textual content is pretty well preserved when doing a conversion, i.e. we don’t have to worry about textual elements being incorrectly concatenated. However, what I wrote is still lacking in two key areas:

  • Generally, in terms of presentation, an arbitrary bit of HTML won’t map to what a user sees in a browser. To a certain degree this is an intractable problem, as presentation is based on browser defaults, CSS styles, etc. Also, there are things that simply don’t have a standard representation in plain text (e.g. bold text, list items, etc.). However, there are cases where sensible defaults might make sense, e.g. stripping away <span> tags but putting a newline between <p> tags.
  • Whitespace is trimmed from content within tags. This may or may not matter depending on application. In my case, I cared about the words and additional whitespace just added bloat even if it was more accurate to what was in the HTML.

EDIT: See part 2 on addressing these limitations and making the code more robust.

Pushing computation to the front: client-side compression

Client → Server Compression

Content from a web server being automatically gzipped (via apache, nginx, etc.) and transferred to the browser isn’t anything new, but there’s really nothing in the way of compression when going in the other direction (i.e. transferring content from the client to the server). This is not too surprising, as most client payloads are small bits of textual content and/or binary content that is already well compressed (e.g. JPEG images), where there’s little gain from compression and you’re likely to just waste CPU cycles doing it. That said, when your frontend client is a space for content creation, you’re potentially going to run into cases where you’re sending a lot of uncompressed data to the server.

Use-case: ScratchGraph Export

ScratchGraph has an export feature that essentially renders the page (minus UI components) as a string of HTML. This string is packaged along with some metadata and sent to the server, which passes it to a service running puppeteer; that service renders the HTML string to either an image or a PDF. The overall process looks something like this:

ScratchGraph Export Flow

The HTML string being sent to the server is relatively large, a couple of MBs, due to:

  • The CSS styles (particularly due to external resources being pulled in and inlined as base64 URLs)
  • The user simply having lots of content

To be fair, it’s usually the former rather than the latter, and optimizing to avoid the inlining of resources (the intent of which was to try and do exports entirely in the browser) would have a greater impact in reducing the amount of data being transferred to the server. However, for the purposes of this blog post (and also because it leads to a more complex discussion on how the application architecture can/should evolve and what this feature looks like in the future), we’re going to sidestep that discussion and focus on what benefits data compression may offer.

Compression with pako

I was more than ready to implement a compression algorithm, but was happy to discover pako, which does zlib compression. Compressing (i.e. deflating) with pako is very simple: below, I encode the HTML string to UTF-8 via TextEncoder.encode() (because I want UTF-8; this isn’t a requirement of pako), which returns a Uint8Array, then use that as the input for pako.deflate(), which also returns a Uint8Array.

const staticHtmlUtf8Arr = (new TextEncoder()).encode(html);
const compressedStaticHtmlUtf8Arr = pako.deflate(staticHtmlUtf8Arr);
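To see the reduction for a given input, the byte lengths of the two arrays can be compared; a round-trip through pako.inflate() also works as a quick sanity check (a sketch, not part of the production path):

console.log(`uncompressed: ${staticHtmlUtf8Arr.length} bytes`);
console.log(`compressed: ${compressedStaticHtmlUtf8Arr.length} bytes`);

// inflate the compressed data and confirm we get the original bytes back
const roundTripped = pako.inflate(compressedStaticHtmlUtf8Arr);
console.log(`round-trip ok: ${roundTripped.length === staticHtmlUtf8Arr.length}`);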

Here’s what that looks like in practice, exporting the diagram shown above:

ScratchGraph Export, with pako compression, results

That’s fairly significant, as the data size has been reduced by 1,237,266 bytes (42.77%)!

The final bit for the frontend is sending this to the server. I use a FormData object for the XHR call and, for the compressed data, I append it as a Blob:

formData.append(
    "compressedStaticHtml",
    new Blob([compressedStaticHtmlUtf8Arr], {type: 'application/zlib'}),
    "compressedStaticHtml"
);
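For completeness, a sketch of the request itself; I use XHR, but the same FormData can just as well be posted via fetch() (the /export endpoint here is a hypothetical stand-in for the real one):

// the browser sets the multipart/form-data Content-Type (with boundary)
// automatically for a FormData body
fetch("/export", {
    method: "POST",
    body: formData
}).then((response) => {
    // handle the server's response (e.g. a URL to the rendered image/PDF)
});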

Handling the compressed data server-side with PHP

PHP supports zlib compression/decompression via the zlib module. The only additional logic needed server-side is calling gzuncompress() to decompress the compressed data.

$staticHtml = gzuncompress(file_get_contents($compressedStaticHtmlFile->getFilePath()));

Note that $compressedStaticHtmlFile is an object representing a file pulled from the request (FormData appends a Blob in the same manner as a file, so server-side you’re dealing with the data as a file). The File.getFilePath() method here simply returns the path of the uploaded file.
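One defensive check worth adding (a sketch; not in the original flow): gzuncompress() returns false when the payload isn’t valid zlib data, so that case can be caught and rejected:

if ($staticHtml === false) {
    // the payload wasn't valid zlib data; reject the request
    http_response_code(400);
    exit;
}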

Limitations

Compressing and decompressing data will cost CPU cycles and, for zlib and most algorithms, this will scale with the size of the data. So considerations around what the client-side system looks like and the size of the data need to be taken into account. In addition, compression within a browser’s main thread can lead to UI events, reflow, and repaint being blocked (i.e. the page becomes unresponsive). If the compression time is significant, performing it within a web worker instead would be a better path.
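A minimal sketch of the web worker approach (file names and the message protocol here are hypothetical):

// worker.js: perform the deflate off the main thread
importScripts("pako.min.js");
self.onmessage = (e) => {
    const compressed = pako.deflate(e.data);
    // transfer the underlying buffer back instead of copying it
    self.postMessage(compressed, [compressed.buffer]);
};

// main thread
const worker = new Worker("worker.js");
worker.onmessage = (e) => {
    // e.data is the compressed Uint8Array; append to FormData as before
};
worker.postMessage(staticHtmlUtf8Arr);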

Performance visibility with HTTP Server-Timing

Visibility into the performance of backend components can be invaluable when it comes to spotting and understanding service degradation, debugging failures, and knowing if and where optimization is needed. There’s a host of collection agents, aggregators, and visualization tools to handle metrics, but just breaking down and looking at what happens during an HTTP request can offer a lot of insight into how components are performing. This is why I’m pretty excited about the HTTP Server-Timing header: it works well as a lightweight mechanism to surface performance metrics, especially now that it’s read and graphed by Chrome DevTools (and, perhaps sometime soon, by Firefox DevTools as well).

An HTTP response with the Server-Timing header

The following code snippet shows an Illuminate/Http/Response from a controller that PUTs an image into an Amazon S3 bucket.

return response()
    ->json(
        [],
        StatusCode::STATUS_OK,
        [
            'Server-Timing' => 's3-io;desc="Image upload to S3";dur=' . calculateTimeToPut(),
        ]
    );

Let’s assume the calculateTimeToPut() function returns 5500 (i.e. 5500 milliseconds to PUT the image onto S3), and the response header looks something like this:

Server-Timing: s3-io;desc="Image upload to S3";dur=5500

Each metric is a group composed of 3 pieces, with each piece delimited by a semicolon:

  • Metric Name (required)
  • Metric Description
  • Metric Value

Multiple metrics can be surfaced by separating each group with a comma.

return response()
    ->json(
        [],
        StatusCode::STATUS_OK,
        [
            'Server-Timing' => 
                's3-io;desc="Image upload to S3";dur=' . calculateTimeToPut() . 
                ',' . 
                'db-io;desc="DB update of entity";dur=' . calculateTimeToUpdate()
        ]
    );

(The above code is a bit simplistic; you’d likely want a better way to store and group metrics, then do a final transformation to construct the Server-Timing string when it’s time to send the HTTP response.)
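A sketch of what that might look like (the class and method names here are hypothetical):

// hypothetical helper that accumulates metrics and builds the header value
class ServerTimingCollector
{
    private $metrics = [];

    public function add(string $name, string $description, float $durationMs): void
    {
        $this->metrics[] = sprintf('%s;desc="%s";dur=%s', $name, $description, $durationMs);
    }

    public function toHeaderValue(): string
    {
        return implode(',', $this->metrics);
    }
}

$timing = new ServerTimingCollector();
$timing->add('s3-io', 'Image upload to S3', calculateTimeToPut());
$timing->add('db-io', 'DB update of entity', calculateTimeToUpdate());

// ['Server-Timing' => $timing->toHeaderValue()] is then passed with the response headers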

Surfacing in DevTools

Surfacing metrics in an HTTP response is not something terribly complex and I’m sure most could devise other ways to do it, but one reason Server-Timing is a bit more attractive vs a custom solution is the out-of-the-box support within Chrome DevTools.

HTTP Server-Timing in Chrome DevTools

Firefox Devtools will likely follow suit (hopefully?) in the near future.

The PerformanceServerTiming interface

Server-Timing metrics can also be surfaced via the PerformanceServerTiming interface, from MDN:

In addition to having Server-Timing header metrics appear in the developer tools of the browser, the PerformanceServerTiming interface enables tools to automatically collect and process metrics from JavaScript.

This opens up some interesting possibilities as it enables collecting metrics via a frontend script (as is already done for a lot of product metrics via services like Google Analytics), rather than a backend collector mechanism. While not ground-breaking, the standardization around PerformanceServerTiming may allow for greater adoption and acceptance of this collection pattern.
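As a sketch, collection from JavaScript can be done with a PerformanceObserver, reading the serverTiming array on navigation and resource entries (where the browser supports it):

const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
        for (const metric of (entry.serverTiming || [])) {
            // e.g. forward metric.name, metric.description, metric.duration
            // to an analytics/metrics endpoint
            console.log(metric.name, metric.description, metric.duration);
        }
    }
});
observer.observe({ entryTypes: ["navigation", "resource"] });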

GeoNames geographical database

I came across the GeoNames database recently and was impressed with the breadth of locations available. I downloaded allCountries.zip from http://download.geonames.org/export/dump/, which gives data (name, location, population, etc.) on places across all countries in one tab-delimited text file. To work with the data more easily, I wrote a PHP script to put the entries into a MySQL database table (it’s actually just a simple modification to the script I used for the Wiktionary definitions import). The TSV, MySQL database, and PHP script are all presented below.

GeoNames allCountries.zip

GeoNames MySQL database export

<?php

require "Database.php";

$tsvInputFilePath = "allCountries.txt";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if($fp === FALSE) {
    echo "Could not find file path: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line and parse tab-delimited fields
    $ln = fgets($fp);
    $parts = explode("\t", $ln);

    if(count($parts) < 19) {
        continue;
    }

    // Insert into database
    $db->query(
        "INSERT INTO cities
            (`id`, `name`, `asciiname`, `alternatenames`, `latitude`, `longitude`,
             `feature_class`, `feature_code`, `country_code`, `cc2`,
             `admin1_code`, `admin2_code`, `admin3_code`, `admin4_code`,
             `population`, `elevation`, `dem`, `timezone`, `last_modified_at`)
         VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
        $parts[0], $parts[1], $parts[2], $parts[3], $parts[4], $parts[5],
        $parts[6], $parts[7], $parts[8], $parts[9], $parts[10], $parts[11],
        $parts[12], $parts[13], $parts[14], $parts[15], $parts[16], $parts[17],
        $parts[18]
    );
}

echo "done.\n";
exit;
The Database class is a wrapper for mysqli; you can find it, along with the script above, in the geonames-allcountries-import bitbucket repo.

Note that this script will take a while to run (likely a few days), as there are 9,195,153 records that need to be inserted and we’re just doing simple INSERTs with no optimizations.
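If import time matters, one simple optimization would be committing rows in chunks rather than paying for one implicit transaction per INSERT. A hedged sketch (this assumes the Database wrapper’s query() can execute arbitrary statements, which is an assumption on my part):

// hypothetical optimization: commit in chunks of 10,000 rows
$rowsInCurrentChunk = 0;
$db->query("START TRANSACTION");

// ... inside the read loop, after each INSERT:
$rowsInCurrentChunk++;
if ($rowsInCurrentChunk >= 10000) {
    $db->query("COMMIT");
    $db->query("START TRANSACTION");
    $rowsInCurrentChunk = 0;
}

// ... after the loop:
$db->query("COMMIT");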

An overview of each of the fields in the database can be found in the GeoNames export readme.txt. Particularly important are the feature_class and feature_code fields, the range of values for which can be found on the GeoNames Feature Codes page. Also, as indicated in the readme, the data is licensed under the Creative Commons Attribution 3.0 License.

Round to midnight

A problem I’ve run into a few times is taking the current unix timestamp and rounding it to midnight, so that I can get the unix time for the start of the day. In PHP, I’ve commonly done the following:

$timestamp = strtotime('today midnight');

It’s one of the solutions presented in this StackOverflow post.

The solution above works fine, but I began thinking about how to actually do the computation and bypass the string parsing done by strtotime(). The computation is actually pretty simple, as it’s in the same vein as snapping a point to a grid. The verbose code snippet below shows the step-by-step process in the computation.

// Given the number of seconds in a day
$numSecondsInDay = 86400;

// .. and the current unix time
$currentTime = time();

// We can compute the number of days since the unix epoch (the decimal/fractional part is the portion of the current day that's elapsed)
$daysSinceEpoch = $currentTime / $numSecondsInDay;

// We can throw away the fractional part by rounding down with the floor() function
$wholeDaysSinceEpoch = floor($daysSinceEpoch);

// The number of whole days since the epoch x the number of seconds in a day will give the time for the current day at midnight
$midnightToday = $wholeDaysSinceEpoch * $numSecondsInDay;

One interesting thing to notice: if you replace the floor() function with the ceil() function, rounding up the number of days since the epoch, you’ll get the start of the next day – midnight tomorrow.
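Condensing the computation, both cases come down to one line (a quick sketch):

// floor() snaps to the start of today; ceil() snaps to the start of tomorrow
// (note that at exactly midnight the two agree)
$midnightToday = floor(time() / 86400) * 86400;
$midnightTomorrow = ceil(time() / 86400) * 86400;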

Wiktionary definitions database

Having a dictionary can be incredibly useful in software development, and dictionaries form the basis for a wide range of natural language processing applications. However, finding an open-source dictionary that can be easily parsed and used within applications is incredibly difficult, as there simply aren’t a lot of options available.

WordNet is one option I came across, but requires significant work parsing the WordNet ASCII database files or Prolog database files.

Wiktionary was the other viable option, and the one I went with. The Wiktionary XML dumps are available, but being a wiki, these files are likely even more difficult to parse than the WordNet database files as you’d have to deal with wiki markup. However, a while ago I was able to get a TSV file with words, parts of speech, and definitions from the Wikimedia Toolserver at http://toolserver.org/~enwikt/definitions. The Toolserver has since been discontinued and I haven’t found updated TSVs hosted anywhere else, but the file I downloaded, dated November 27, 2012, is still fairly up-to-date for a dictionary and useful in many applications.

I wrote a PHP script to parse the TSV and make INSERTs into a MySQL database. The TSV file, MySQL database, and PHP script are presented below.

Wiktionary TSV file

Wiktionary MySQL database export

PHP Script:

<?php

require "Database.php";

$tsvInputFilePath = "TEMP-E20121127.tsv";

echo "Importing {$tsvInputFilePath} ...\n";

// Open file
$fp = fopen($tsvInputFilePath, "r");
if($fp === FALSE) {
    echo "Could not find file path: " . $tsvInputFilePath;
    exit;
}

// Establish DB connection
$db = new Database();

while (!feof($fp)) {

    // Get line and parse tab-delimited fields
    $ln = fgets($fp);
    $parts = explode("\t", $ln);
    if(count($parts) < 4) {
        continue;
    }

    $lang = $parts[0];
    $word = $parts[1];
    $partOfSpeech = $parts[2];
    $definitionRaw = $parts[3];

    // Insert into database
    $db->query(
        "INSERT INTO words (language, word, part_of_speech, definition_raw)
         VALUES (?, ?, ?, ?)",
        $lang, $word, $partOfSpeech, $definitionRaw);
}

echo "done.\n";
exit;

The Database class is a wrapper for mysqli; you can find it, along with the script above, in the wiktionary-tsv-import bitbucket repo.

Note that definitions need to be parsed further, as they contain wiki markup. The parsing doesn’t seem difficult and is something I hope to get done in the near future.

Related resources:

There’s valuable stuff in each of the projects above but, as with WordNet, they require significantly more time to evaluate and implement in an application, compared to the simple TSV → MySQL translation.

EDIT (12/13/2015): I’ve updated the MySQL database export. There were some holes in the data because I was using the utf8 column encoding for definitions; however, MySQL has a weird “UTF-8” implementation that only handles codepoints up to 3 bytes in size. The utf8mb4 encoding needs to be used for a proper UTF-8 encoding supporting codepoints up to 4 bytes.

PHP count() is O(1)

I was curious about the performance of PHP’s count() function a while back and whether it was worth it to store the result in a variable for repeated use. I discovered the following from this answer by FractalizeR on Stack Overflow:

PHP_FUNCTION(count) calls php_count_recursive(), which in turn calls zend_hash_num_elements() for non-recursive array, which is implemented this way:

ZEND_API int zend_hash_num_elements(const HashTable *ht)
{
    IS_CONSISTENT(ht);

    return ht->nNumOfElements;
}

So you can see, it’s O(1) for $mode = COUNT_NORMAL.

IMAP Pickup

An interesting little project I wanted to work on: pulling attachments from emails in an IMAP mailbox and downloading them. I wanted an IMAP solution instead of writing a script for the MTA, as such a script would be specific to the MTA software and not transferable to another server. In addition, there’s also the common case where you may simply not have access to the MTA.

The biggest help in putting this together and dealing with attachments was this blog post and this comment on the PHP docs. Information on doing this is a bit scattered and incomplete in many cases, likely because extracting attachments is somewhat difficult as email is a notoriously bad way to transfer files; the file data is base64 encoded and dumped in as part of the message body.

ImapPickup is the class that encapsulates all the necessary functionality:

class ImapPickup
{
    protected $imapStream = null;

    protected function findAttachments($part)
    {
        $partNum = -1;
        $attachments = array();

        // the parameters are declared by-reference in findAttachmentsRec(),
        // so no call-time & is needed (call-time pass-by-reference is a fatal error in modern PHP)
        $this->findAttachmentsRec($part, $attachments, $partNum, -1);

        return $attachments;
    }

    protected function findAttachmentsRec($part, &$attachments, &$partNum, $partNumSub)
    {
        if (isset($part->parts))
        {
            foreach ($part->parts as $partOfPart)
            {
                $this->findAttachmentsRec($partOfPart, $attachments, $partNum, $partNumSub+1);
            }
        }
        else
        {
            if (isset($part->disposition)) {
                if ($part->disposition == 'attachment') {
                    $attachments[] = array($part->dparameters[0]->value, $partNum, $partNumSub);
                }
            }
        }

        $partNum++;
    }

    public function getAttachmentContent($msgNum, $partNum)
    {
        $contents = imap_fetchbody($this->imapStream, $msgNum, $partNum, FT_UID);
        return imap_base64($contents);
    }

    public function getAttachments($msgNum)
    {
        $struct = imap_fetchstructure($this->imapStream, $msgNum, FT_UID);
        $attachments = $this->findAttachments($struct);

        return $attachments;
    }

    public function getAttachmentsFromMessages($msgArray)
    {
        $msgIdToAttachmentsMap = array();

        if ($msgArray)
        {
            foreach($msgArray as $msgId)
            {
                $attachments = $this->getAttachments($msgId);
                if(!empty($attachments))
                {
                    $msgIdToAttachmentsMap[$msgId] = $attachments;
                }
            }
        }

        return $msgIdToAttachmentsMap;
    }

    public function getMessages($searchQuery)
    {
        return imap_search($this->imapStream, $searchQuery, SE_UID);
    }

    public function connect($mailbox, $user, $password)
    {
        $this->imapStream = imap_open($mailbox, $user, $password);
    }

    public function disconnect()
    {
        imap_close($this->imapStream);
    }
}

Here’s a little example of how it can be used. This will query all messages with “pickup::” in the subject line and print out the messageID of all messages with attachments, followed by the filenames of all attachments for that message.

$imapPickup = new ImapPickup();
$imapPickup->connect("{mail.hotspotdot.net:143}INBOX", "test@test.net", "pass123");

$messages = $imapPickup->getMessages("SUBJECT pickup::");
$attachments = $imapPickup->getAttachmentsFromMessages($messages);

foreach($attachments as $msgId => $attArr)
{
    echo "<p>{$msgId} => ";

    foreach($attArr as $attachment)
    {
        echo $attachment[0];
        echo ",";
    }

    echo "</p>";
}

$imapPickup->disconnect();

The array for a single file attachment contains 3 entries:

  • [0] => filename
  • [1] => major part number
  • [2] => minor part number

getAttachments(), findAttachments(), and findAttachmentsRec() will return an array of such entries (or an empty array if there are no attachments). getAttachmentsFromMessages() will return a map from messageID => array of single attachments.

The part number (both major and minor) is needed to retrieve the contents of an attachment. For getAttachmentContent(), simply use the major number if the minor number is <= 0, or concatenate them with a period separating them (e.g. "2.3").
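Putting that together, a sketch of fetching and saving each attachment’s contents (building on the example above; the bare file_put_contents() call is simplistic and just for illustration):

foreach($attachments as $msgId => $attArr)
{
    foreach($attArr as $attachment)
    {
        list($filename, $partNum, $partNumSub) = $attachment;

        // build the part specifier from the major/minor part numbers
        $partSpec = ($partNumSub <= 0) ? (string)$partNum : ($partNum . "." . $partNumSub);

        $content = $imapPickup->getAttachmentContent($msgId, $partSpec);
        file_put_contents($filename, $content);
    }
}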

PostgreSQL + PHP installation on Windows 2003 x64

Well, the PostgreSQL installation itself is easy enough; getting it to work with PHP is the challenging part. Here’s what I did: