August | 2022 | semi/signal

Improving on strip_tags

Aug 13 2022 · PHP

The Problem

PHP’s strip_tags() function strips tags but makes no attempt to insert whitespace between the contents of adjacent tags. This is a problem with arbitrary HTML, as adjacent block-level elements may have no whitespace between them, so simply stripping the tags incorrectly concatenates the text of the two elements.

For example, running strip_tags() on the following:

<div>the quick brown fox</div><div>jumped over the moon</div>

… will return:

the quick brown foxjumped over the moon

This is technically correct (we’ve stripped away the <div> tags), but with no whitespace between “fox” and “jumped” we’ve transformed the content in a way that loses semantic and presentational detail.
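To see this for yourself, the snippet below is a minimal check using only the built-in strip_tags(); the variable names are just for illustration. It also shows a case with an inline tag, where simple stripping gives the right result because the whitespace is already part of the text:

```php
<?php

// Two adjacent block-level elements with no whitespace between them.
$blockHtml = '<div>the quick brown fox</div><div>jumped over the moon</div>';
$blockText = strip_tags($blockHtml);
echo $blockText, PHP_EOL; // the quick brown foxjumped over the moon

// An inline element: the spaces around it live in the text itself,
// so stripping the tags leaves the words correctly separated.
$inlineHtml = 'the <b>quick</b> brown fox';
$inlineText = strip_tags($inlineHtml);
echo $inlineText, PHP_EOL; // the quick brown fox
```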

The Solution

There are two ways I can see to fix this behavior:

Pre-process the HTML content to ensure or introduce whitespace between block-level elements
Don’t use strip_tags() and utilize a method that better understands the need for spacing between elements

I’ll focus on the latter because that’s the avenue I went down and I didn’t consider pre-processing at the time.
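For completeness, here is a rough sketch of what the pre-processing route might look like. This is not something I built or tested in anger: the function name stripTagsSpaced and the list of block-level tag names are assumptions made up for this example, and the list is far from exhaustive. The idea is simply to add a space after each closing block-level tag (and after <br>) before handing the string to strip_tags():

```php
<?php

// Hypothetical helper: pre-process the HTML, then fall back on strip_tags().
function stripTagsSpaced(string $html): string
{
    // Assumed, non-exhaustive list of tag names treated as block-level.
    $blockTags = 'address|article|aside|blockquote|div|dd|dl|dt|footer|'
        . 'h[1-6]|header|li|ol|p|pre|section|table|td|th|tr|ul';

    // Append a space after every closing block-level tag and every <br>.
    $pattern = '#</(?:' . $blockTags . ')\s*>|<br\s*/?>#i';
    $spaced = preg_replace($pattern, '$0 ', $html);

    // Strip the tags, then collapse any runs of whitespace to a single space.
    return trim(preg_replace('/\s+/', ' ', strip_tags($spaced)));
}

echo stripTagsSpaced('<div>the quick brown fox</div><div>jumped over the moon</div>'), PHP_EOL;
// the quick brown fox jumped over the moon
```

Inline tags are left alone, so something like a<span>b</span>c still comes out as "abc". The obvious downside is that the result is only as good as the tag list.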

Pulling together a quick-and-dirty parser, I wrote the following. It’s worth noting that this still doesn’t really consider what the tags are (e.g. whether they’re inline or block), but it allows the caller to specify a string ($tagContentSeparator), typically some whitespace, that is inserted between the contents of the stripped tags:

<?php

class HTMLToPlainText
{
    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $tagContentSeparator = " "): string
    {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // string within tags that we've found
        $foundContentStrings = [];

        // buffer for current content being read
        $currentContentString = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

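        // method to add a non-empty string to $foundContentStrings and reset $currentContentString
        // .. the emptiness check is made after trimming, so whitespace-only content between tags
        // isn't added as an empty string (which would produce doubled separators in the output)
        $commitCurrentContentString = function() use (&$currentContentString, &$foundContentStrings) {
            $trimmedContentString = trim($currentContentString);
            if($trimmedContentString !== "") {
                $foundContentStrings[] = $trimmedContentString;
            }
            $currentContentString = "";
        };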

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $parserState = self::STATE_READING_TAG_NAME;
                $commitCurrentContentString();
                continue;
            }

            if($ch === '>') {
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $currentContentString .= $ch;
                continue;
            }
        }

        $commitCurrentContentString();

        return implode($tagContentSeparator, $foundContentStrings);
    }
}

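Note that the UTF-8 ↔ UTF-32 round trip isn’t really necessary. I initially did the conversion because I was worried about splitting a multibyte character, but this isn’t possible given how the function reads the input string: it only ever compares against “<” and “>”, and in UTF-8 every byte of a multibyte character has its high bit set, so neither of those ASCII tokens can appear inside one.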
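As a rough sketch of why the conversion can be dropped, the function below (convertBytewise is a made-up name, not part of the class above) runs the same two-state scan directly over the bytes of a UTF-8 string. It relies on “<” and “>” being single ASCII bytes that never occur inside a UTF-8 multibyte sequence, and it skips content that is empty after trimming:

```php
<?php

// Hypothetical byte-wise variant of the parser; assumes $input is valid UTF-8.
function convertBytewise(string $input, string $separator = " "): string
{
    $found = [];      // trimmed content strings found so far
    $current = "";    // buffer for the content currently being read
    $inTag = false;   // true while we're between '<' and '>'

    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $byte = $input[$i];

        if ($byte === '<') {
            // entering a tag: commit whatever content we've buffered
            $inTag = true;
            $trimmed = trim($current);
            if ($trimmed !== '') {
                $found[] = $trimmed;
            }
            $current = '';
            continue;
        }

        if ($byte === '>') {
            $inTag = false;
            continue;
        }

        if (!$inTag) {
            // bytes of a multibyte character are appended one after another,
            // in order, so the character is reassembled intact
            $current .= $byte;
        }
    }

    // commit any trailing content after the last tag
    $trimmed = trim($current);
    if ($trimmed !== '') {
        $found[] = $trimmed;
    }

    return implode($separator, $found);
}

echo convertBytewise('<div>naïve café</div><div>日本語</div>', "\n"), PHP_EOL;
```

trim() only removes ASCII whitespace bytes by default, so it can’t cut into a multibyte character either.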

Now if we take the following HTML snippet:

<div>the quick brown fox</div><div>jumped over the moon</div>

… rendered in a browser, we get:

the quick brown fox
jumped over the moon

… with strip_tags() we get:

the quick brown foxjumped over the moon

… and with HTMLToPlainText::convert() (passing in “\n” for $tagContentSeparator), we get:

the quick brown fox
jumped over the moon

The latter results in text that is semantically correct, as words in different blocks aren’t incorrectly joined. Presentationally we also get a more correct conversion, but the method isn’t really doing anything fancy here: this is down to the caller knowing a bit about the HTML snippet and how a browser would render it, and passing in “\n” for $tagContentSeparator.

Limitations / future work

The improvement here is that textual content is pretty well preserved when doing a conversion, i.e. we don’t have to worry about textual elements being incorrectly concatenated. However, what I wrote is still lacking in two key areas:

Generally, in terms of presentation, an arbitrary bit of HTML won’t map to what a user sees in a browser. To a certain degree this is an intractable problem, as presentation is based on browser defaults, CSS styles, etc. Also, there are things that simply don’t have a standard representation in plain text (e.g. bold text, list items, etc.). However, there are cases where sensible defaults might help, e.g. stripping away <span> tags but putting a newline between <p> tags.
Whitespace is trimmed from content within tags. This may or may not matter depending on the application. In my case, I cared about the words, and additional whitespace just added bloat, even if it was more accurate to what was in the HTML.
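To make the “sensible defaults” idea a bit more concrete, here is one possible sketch. It is not the code from part 2: the function name htmlToTextTagAware and the set of tags treated as block-level are assumptions for illustration only. It drops inline tags silently and emits a newline wherever a block-level tag opens or closes:

```php
<?php

// Hypothetical tag-aware conversion: newline around block-level tags,
// nothing around inline tags.
function htmlToTextTagAware(string $html): string
{
    // Assumed, non-exhaustive set of tags treated as block-level.
    $blockTags = ['div', 'p', 'li', 'ul', 'ol', 'br', 'tr', 'blockquote',
                  'h1', 'h2', 'h3', 'h4', 'h5', 'h6'];

    // Split into tags and text, keeping the tags in the result.
    $pieces = preg_split('/(<[^>]*>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $output = '';
    foreach ($pieces as $piece) {
        if ($piece[0] === '<' && substr($piece, -1) === '>') {
            // a tag (or comment): drop it, adding a newline if it's block-level
            if (preg_match('#^</?\s*([a-z][a-z0-9]*)#i', $piece, $m)
                && in_array(strtolower($m[1]), $blockTags, true)) {
                $output .= "\n";
            }
            continue;
        }
        $output .= $piece;
    }

    // Trim each line and drop the empty ones left by adjacent block tags.
    $lines = array_filter(array_map('trim', explode("\n", $output)), 'strlen');

    return implode("\n", $lines);
}

echo htmlToTextTagAware(
    '<div>the quick <span>brown</span> fox</div><div>jumped over the moon</div>'
), PHP_EOL;
```

This still ignores CSS entirely (a <span> styled as display: block would be treated as inline), so it only narrows the first limitation rather than removing it.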

EDIT: See part 2 on addressing these limitations and making the code more robust.