February | 2023 | semi/signal


Improving on strip_tags (part 2)

Feb 26 2023 · PHP

Whitespace and tags

Previously, I looked at improving the functionality of strip_tags so that words across tags are not mashed together. The method I derived works well enough, but it’s limited in that all tags are treated the same way and all whitespace separators are the same. I wanted to see if I could improve the method a bit more to address these limitations; that is, introduce whitespace based on the type of tag encountered instead of injecting the same whitespace after every stripped tag.

For example, when dealing with inline tags, whitespace should be preserved:

This bit of HTML:
<span>the quick brown fox </span><span>jumped over the moon</span>
… should produce:
the quick brown fox jumped over the moon

Alternatively, when dealing with block-level tags, a newline should be injected:

This bit of HTML:
<div>the quick brown fox</div><div>jumped over the moon</div>
… should produce:
the quick brown fox
jumped over the moon
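For contrast, the built-in strip_tags() only removes the tags and inserts nothing in their place, so the two block-level fragments run together. A quick check (the variable names are just for illustration):

```php
<?php
// Built-in strip_tags() removes tags without inserting any separator.
$html = '<div>the quick brown fox</div><div>jumped over the moon</div>';

$stripped = strip_tags($html);

// "fox" and "jumped" end up mashed together in the result.
echo $stripped . "\n";
```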

Note that we’re simply talking about common/expected browser behavior from what’s thought of as inline-level or block-level tags. In reality, this categorization isn’t really part of the HTML standard anymore and layout behavior is determined by CSS, a point MDN also makes.

That said, when looking at arbitrary HTML content, I still think “block” vs. “inline” is a useful distinction, at least insofar as inferring default or common behavior.

The special case

The <br> tag presents a special case. While it’s classified as an inline element, <br> represents whitespace that is generally similar to that of a block-level element (e.g. a newline). In implementation this is simple to handle but does introduce a tiny bit of additional complexity.

Looking at the high-level transformations needed, we get the following:

Inline-level tags → strip away (no action needed, don’t alter any existing whitespace within tag contents)
Block-level tags → strip away, replace with newline
<br> tags → strip away, replace with newline
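The three rules above amount to a lookup from tag name to the text that replaces the tag. A minimal sketch of that idea follows; the function name separatorForTag and the abbreviated block-level list are assumptions made for illustration, not the code presented in the next section.

```php
<?php
// Hypothetical helper: map a tag name to the text that should replace the tag.
// The block-level list is deliberately abbreviated for this sketch.
function separatorForTag(string $tagName, string $separator = "\n"): string
{
    $blockLevel = ['div', 'p', 'ul', 'ol', 'li', 'h1', 'h2', 'table'];
    $name = strtolower(trim($tagName));

    // <br> is classified as inline, but it is rendered as a line break
    if ($name === 'br') {
        return $separator;
    }

    // block-level tags are replaced with the separator
    if (in_array($name, $blockLevel, true)) {
        return $separator;
    }

    // inline-level tags are stripped without touching surrounding whitespace
    return '';
}

var_dump(separatorForTag('span')); // inline: empty string
var_dump(separatorForTag('DIV'));  // block-level: the separator
var_dump(separatorForTag('br'));   // special case: the separator
```

Keeping <br> as its own branch, rather than adding it to the block-level list, leaves room to treat it differently later (for example, when trimming leading or trailing separators).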

Code

Reworking the convert() method from the previous post, we get the following:

class HTMLToPlainText
{
    const BLOCK_LEVEL_ELEMENTS = [
        "address",
        "article",
        "aside",
        "blockquote",
        "details",
        "dialog",
        "dd",
        "div",
        "dl",
        "dt",
        "fieldset",
        "figcaption",
        "figure",
        "footer",
        "form",
        "h1",
        "h2",
        "h3",
        "h4",
        "h5",
        "h6",
        "header",
        "hgroup",
        "hr",
        "li",
        "main",
        "nav",
        "ol",
        "p",
        "pre",
        "section",
        "table",
        "ul"
    ];

    const INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE = [
        "br",
    ];

    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $blockContentSeparator = "\n"): string
    {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // string within tags that we've found
        $output = "";

        // buffer for current/last tag name read
        $currentTagName = "";
        $currentTagIsClosing = null;

        // buffer content in the current tag being read
        $contentInCurrentTag = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

        $flushCurrentToOutput = function() use (&$output, &$contentInCurrentTag, &$currentTagName, &$currentTagIsClosing, &$blockContentSeparator) {
            // the tag buffer may also hold attributes or trailing whitespace (e.g. "div class=..." or "br "),
            // so only the leading run of non-whitespace characters is compared against the element lists
            $tagName = strtolower(substr($currentTagName, 0, strcspn($currentTagName, " \t\r\n")));

            // handle inline tags, which produce a newline (e.g. <br>)
            // .. note that these can be empty (<br>) or self-closing (<br/>)
            if(in_array($tagName, self::INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE)) {
                $output .= $contentInCurrentTag . $blockContentSeparator;
            } else {
                // append $blockContentSeparator if we're at the *opening or closing* of a block-level element
                // (for inline element, leave content as-is)
                if (in_array($tagName, self::BLOCK_LEVEL_ELEMENTS)) {
                    $output .= $contentInCurrentTag . $blockContentSeparator;
                } else {
                    $output .= $contentInCurrentTag;
                }
            }

            // reset
            $contentInCurrentTag = "";
            $currentTagIsClosing = null;
            $currentTagName = "";
        };

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $flushCurrentToOutput();
                $parserState = self::STATE_READING_TAG_NAME;
                continue;
            }

            if($ch === '>') {
                $flushCurrentToOutput();
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState == self::STATE_READING_TAG_NAME && $ch == '/') {
                $currentTagIsClosing = true;
                continue;
            }

            if($parserState == self::STATE_READING_TAG_NAME) {
                $currentTagName .= $ch;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $contentInCurrentTag .= $ch;
                continue;
            }
        }

        $flushCurrentToOutput();

        return trim($output, $blockContentSeparator);
    }
}

Testing

Throwing some arbitrary bits of HTML at this function seems to indicate that the method works correctly, but a method like this really calls for some form of automated testing. I could derive test cases from the function logic, which is what’s typically done when testing an arbitrary method, but that approach is both biased and limited here. Biased in that I’d be looking at the function and coming up with test cases based on my own experience (what I’ve encountered and where I think there may be potential issues). Limited in that I’d likely only come up with a handful of test cases unless I invested a significant chunk of time into compiling a comprehensive set; HTML has relatively few building blocks but, given the number of ways those blocks can be combined and arranged, we end up with a fairly large number of permutations.

What would really be effective here is testing with a large and varied corpus of test cases, i.e. mappings of HTML snippets to plain text representations (data-driven testing). It’s usually hard to generate or find data for such testing, but the PHP repository has a number of test cases for strip_tags() that can be leveraged:

strip_tags_basic1.phpt has some good baseline tests (HTML tags, PHP tags, tags with attributes, HTML comments, etc.)
strip_tags_basic2.phpt has a good test case (different tags + mix of block and inline elements + PHP tags) but is really testing the allowed_tags_array argument to strip_tags(), which I forgot was a thing and didn’t consider in my method

Beyond the test cases in these 2 files, there are other good cases scattered around the repo, seemingly tied to specific bugs (e.g. bug #53319, which involves handling of “<br />” tags), but they can be hard to locate given the organization, or lack thereof, of the test files. In any case, it’s great having this data to work with, and some issues surfaced when I began subjecting my code to these tests (e.g. the content separator for block-level elements needs to be handled at both the opening and the closing tag, not just the closing tag).

Implementation-wise, testing mainly consists of encoding the test cases in a map and asserting that the actual result matches the expected one:

$testCases = [
    "<html>hello</html>" => "hello",
    "<?php echo hello ?>" => "",
    "<? echo hello ?>" => "",
    "<% echo hello %>" => "",
    "<script language=\"PHP\"> echo hello </script>" => " echo hello ",
    "<html><b>hello</b><p>world</p></html>" => "hello\nworld",
    "<html><!-- COMMENT --></html>" => "",
    "<html><p>hello</p><b>world</b><a href=\"#fragment\">Other text</a></html><?php echo hello ?>" => "hello\nworldOther text",
    "<p>hello</p><p>world</p>" => "hello\n\nworld",
    '<br /><br />USD<input type="text"/><br/>CDN<br><input type="text" />' => "USD\nCDN",
];

foreach ($testCases as $html => $expectedPlainText) {
    $actualPlainText = HTMLToPlainText::convert($html);

    echo "TEST: " . $html . "\n";
    echo "EXPECTED: " . $expectedPlainText . "\n";
    echo "ACTUAL: " . $actualPlainText . "\n";
    echo "----\n";

    assert($actualPlainText === $expectedPlainText);
}
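One caveat with assert(): whether it is evaluated at all depends on the zend.assertions setting, and it is commonly turned off in production configurations. A sketch of a runner that compares results explicitly and collects failures instead is shown here. The function name runTestCases is hypothetical, and the built-in strip_tags() stands in for the converter only to keep the example self-contained.

```php
<?php
// Hypothetical runner: takes any converter callable and a map of
// HTML input => expected plain text, and returns the failing cases.
function runTestCases(callable $convert, array $testCases): array
{
    $failures = [];

    foreach ($testCases as $html => $expected) {
        // array keys that look like integers are cast by PHP, so cast back to string
        $actual = $convert((string) $html);

        if ($actual !== $expected) {
            $failures[] = ['html' => $html, 'expected' => $expected, 'actual' => $actual];
        }
    }

    return $failures;
}

// strip_tags() is used as a stand-in converter here
$failures = runTestCases('strip_tags', [
    '<html>hello</html>' => 'hello',
    '<b>hello</b> <i>world</i>' => 'hello world',
    // this case fails for strip_tags(), which adds no separator between the paragraphs
    '<p>hello</p><p>world</p>' => "hello\n\nworld",
]);

foreach ($failures as $failure) {
    echo "FAILED: " . $failure['html'] . "\n";
}
```

Passing the converter as a callable means the same map of cases can be run against both strip_tags() and a custom converter, which makes differences between the two easy to spot.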

Testing is still limited here. I’d love to simply have a large batch of test cases to throw at the function, but something like that is not readily available.

Limitations / future work

The new convert() method is more robust, but there are still some key limitations when compared to the strip_tags() function:

PHP’s strip_tags() is actually a lot more robust when it comes to invalid/malformed HTML content, as the tests in strip_tags.phpt demonstrate
Preserving certain tags (as with the allowed_tags_array argument) wasn’t considered

Also, whitespace/separators produced from <br> elements at the beginning or end of the input HTML are stripped away (by the final trim() call). I don’t think this is correct, as browsers preserve whitespace from <br> elements and don’t collapse it as they do with empty block-level elements.
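As a reference point for the preserved-tags limitation, this is how the second argument of the built-in strip_tags() behaves: tags listed there are kept in the output and everything else is stripped. The array form requires PHP 7.4 or later; the variable names are illustrative.

```php
<?php
$html = '<p>hello <b>world</b></p>';

// string form: keep <b>, strip everything else
$keptBold = strip_tags($html, '<b>');
echo $keptBold . "\n";

// array form (PHP 7.4+): keep <p>, strip everything else
$keptParagraph = strip_tags($html, ['p']);
echo $keptParagraph . "\n";
```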