Posts Tagged ‘strip_tags’

Improving on strip_tags (part 2)

Whitespace and tags

Previously, I looked at improving the functionality of strip_tags such that words across tags are not mashed together. The method I derived works well enough but it’s limited in that all tags are treated the same way and all whitespace separators are the same. I wanted to see if I could improve the method a bit more to address these limitations; that is, introducing whitespace based on the type of tag encountered instead of injecting whitespace after stripping away a tag.

For example, when dealing with inline tags, whitespace should be preserved:

This bit of HTML:

<span>the quick brown fox </span><span>jumped over the moon</span>

… should produce:

the quick brown fox jumped over the moon

Alternatively, when dealing with block-level tags, a newline should be injected:

This bit of HTML:

<div>the quick brown fox</div><div>jumped over the moon</div>

… should produce:

the quick brown fox jumped over the moon

Note that we’re simply talking about common/expected browser behavior from what’s thought of as inline-level or block-level tags. In reality, this categorization isn’t really part of the HTML standard anymore and layout behavior is relegated determined by CSS. From MDN:

That said, when looking at arbitrary HTML content, I still think “block” vs. “inline” is a useful distinction, at least insofar as inferring default or common behavior.

The special case

The <br> tag presents a special case. While it’s classified as an inline element, <br> represents whitespace that is generally similar to that of a block-level element (e.g. a newline). In implementation this is simple to handle but does introduce a tiny bit of additional complexity.

Looking at the high-level transformations needed, we get the following:

  • Inline-level tags → strip away (no action needed, don’t alter any existing whitespace within tag contents)
  • Block-level tags → strip away, replace with newline
  • <br> tags → strip away, replace with newline

Code

Reworking the convert() method from the previous post, we get the following:

class HTMLToPlainText { const BLOCK_LEVEL_ELEMENTS = [ "address", "article", "aside", "blockquote", "details", "dialog", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "li", "main", "nav", "ol", "p", "pre", "section", "table", "ul" ]; const INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE = [ "br", ]; const STATE_READING_CONTENT = 1; const STATE_READING_TAG_NAME = 2; static public function convert(string $input, string $blockContentSeparator = "\n"): string { // the input string as UTF-32 $fixedWidthString = iconv('UTF-8', 'UTF-32', $input); // string within tags that we've found $output = ""; // buffer for current/last tag name read $currentTagName = ""; $currentTagIsClosing = null; // buffer content in the current tag being read $contentInCurrentTag = ""; // flag to indicate how we should interpret what we're reading from $fixedWidthString // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even // if we haven't encountered a tag (e.g. string that doesn't contain tags) $parserState = self::STATE_READING_CONTENT; $flushCurrentToOutput = function() use (&$output, &$contentInCurrentTag, &$currentTagName, &$currentTagIsClosing, &$blockContentSeparator) { // handle inline tags, which produce a newline (e.g. <br>) // .. not that these can be empty (<br>) or self-closing (<br/>) if(in_array(strtolower($currentTagName), self::INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE)) { $output .= $contentInCurrentTag . $blockContentSeparator; } else { // append $blockContentSeparator if we're at the *opening or closing* of a block-level element // (for inline element, leave content as-is) if (in_array(strtolower($currentTagName), self::BLOCK_LEVEL_ELEMENTS)) { $output .= $contentInCurrentTag . $blockContentSeparator; } else { $output .= $contentInCurrentTag; } } // reset $contentInCurrentTag = ""; $currentTagIsClosing = null; $currentTagName = ""; }; // iterate through characters in $fixedWidthString // checking for tokens indicating if we're within a tag or within content for($i=0; $i<strlen($fixedWidthString); $i+=4) { // convert back to UTF-8 to simplify character/token checking $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4)); if($ch === '<') { $flushCurrentToOutput(); $parserState = self::STATE_READING_TAG_NAME; continue; } if($ch === '>') { $flushCurrentToOutput(); $parserState = self::STATE_READING_CONTENT; continue; } if($parserState == self::STATE_READING_TAG_NAME && $ch == '/') { $currentTagIsClosing = true; continue; } if($parserState == self::STATE_READING_TAG_NAME) { $currentTagName .= $ch; continue; } if($parserState === self::STATE_READING_CONTENT) { $contentInCurrentTag .= $ch; continue; } } $flushCurrentToOutput(); return trim($output, $blockContentSeparator); } }

Testing

Throwing some arbitrary bits of HTML at this function seems to indicate that the method works correctly but, a method like this, really calls for some form of automated testing. I could derive test cases from the function logic, and this is what’s typically done when testing some arbitrary method, but this approach is biased and limited here. Biased in that I’d be looking at the function and coming up with test cases based upon my experiences (what I’ve encountered and where I think there may be potential issues). Limited in that I’d likely only come up with a handful of test cases unless I invested a significant chunk of time into compiling a comprehensive set of cases; HTML has relatively few building blocks but, given the number of different ways those blocks can be combined and arranged, we end up with a fairly large number of permutations. What would really be effective here is testing with a large and varied corpus of test cases, mappings of HTML snippets to plain text representations; i.e. data-driven testing. It’s usually hard to generate or find data for such testing but the PHP repository has a number of test cases for strip_tags() that can be leveraged:

  • strip_tags_basic1.phpt has some good baseline tests (HTML tags, PHP tags, tags with attributes, HTML comments, etc.)
  • strip_tags_basic2.phpt has a good test case (different tags + mix of block and inline elements + PHP tags) but is really testing the allowed_tags_array argument to strip_tags(), which I forgot was a thing and didn’t consider in my method

Beyond the test cases in these 2 files, there are other good cases scattered in the repo, seemingly tied to specific bugs encountered (e.g. bug #53319, which involves handling of “<br />” tags) but they can be hard to locate given the organization or lack thereof of the test files. In any case, it’s great having this data to work with and there were some issues that surfaced when I began subjecting my code to some of these test (e.g. the content separator for block-level elements needing to be attended at the point of both the opening and closing tags, not just the closing tag).

Implementation-wise, testing is mainly encoding the test case in a map and assert that the actual result matches expectations:

$testCases = [ "<html>hello</html>" => "hello", "<?php echo hello ?>" => "", "<? echo hello ?>" => "", "<% echo hello %>" => "", "<script language=\"PHP\"> echo hello </script>" => " echo hello ", "<html><b>hello</b><p>world</p></html>" => "hello\nworld", "<html><!-- COMMENT --></html>" => "", "<html><p>hello</p><b>world</b><a href=\"#fragment\">Other text</a></html><?php echo hello ?>" => "hello\nworldOther text", "<p>hello</p><p>world</p>" => "hello\n\nworld", '<br /><br />USD<input type="text"/><br/>CDN<br><input type="text" />' => "USD\nCDN", ]; foreach ($testCases as $html => $expectedPlainText) { $actualPlainText = HTMLToSearchableText::convert_ex($html); echo "TEST: " . $html . "\n"; echo "EXPECTED: " . $expectedPlainText . "\n"; echo "ACTUAL: " . $actualPlainText . "\n"; echo "----\n"; assert($actualPlainText === $expectedPlainText); }

Testing is still limited here. I’ve love to simply have a large batch of test cases to throw at the function but something like that is not readily available.

Limitations / future work

The new convert() method is more robust but there’s still some key limitations when compared to the strip_tags() function:

  • PHP’s strip_tags() is actually a lot more robust when it comes to invalid/malformed HTML content, as the tests in strip_tags.phpt demonstrate
  • Preserving certain tags (as with the allowed_tags_array argument) wasn’t considered

Also, whitespace/separators produced from <br> elements at the beginning or end of any inputted HTML is stripped away. I don’t think this is correct as browsers preserve whitespace from <br> elements and don’t collapse them as with empty block-level elements.

Improving on strip_tags

The Problem

PHP’s strip_tags() method will strip away tags but makes no attempt to introduce whitespace to separate content in adjacent tags. This is an issue with arbitrary HTML as adjacent block-level elements may not have any intermediate whitespace and simply stripping away the tags will incorrectly concatenate the textual content in the 2 elements.

For example, running strip_tags() on the following:

<div>the quick brown fox</div><div>jumped over the moon</div>

… will return:

the quick brown foxjumped over the moon

This is technically correct (we’re stripped away the <div> tags) but having no whitespace between “fox” and “jumped” means we’ve transformed the content such that we’ve lost semantic and presentational details.

The Solution

There’s 2 ways I can see to fix this behavior:

  • Pre-process the HTML content to ensure or introduce whitespace between block-level elements
  • Don’t use strip_tags() and utilize a method that better understands the need for spacing between elements

I’ll focus on the latter because that’s the avenue I went down and I didn’t consider pre-processing at the time.

Pulling together a quick-and-dirty parser, I wrote the following. It’s worth noting that still still doesn’t really consider what the tags are (e.g. whether they’re inline or block) but allows the caller to specify a string ($tagContentSeparator), typically some whitespace, that is inserted between the stripped away tags:

<?php class HTMLToPlainText { const STATE_READING_CONTENT = 1; const STATE_READING_TAG_NAME = 2; static public function convert(string $input, string $tagContentSeparator = " "): string { // the input string as UTF-32 $fixedWidthString = iconv('UTF-8', 'UTF-32', $input); // string within tags that we've found $foundContentStrings = []; // buffer for current content being read $currentContentString = ""; // flag to indicate how we should interpret what we're reading from $fixedWidthString // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even // if we haven't encountered a tag (e.g. string that doesn't contain tags) $parserState = self::STATE_READING_CONTENT; // method to add a non-empty string to $foundContentStrings and reset $currentContentString $commitCurrentContentString = function() use (&$currentContentString, &$foundContentStrings) { if(strlen($currentContentString) > 0) { $foundContentStrings[] = trim($currentContentString); $currentContentString = ""; } }; // iterate through characters in $fixedWidthString // checking for tokens indicating if we're within a tag or within content for($i=0; $i<strlen($fixedWidthString); $i+=4) { // convert back to UTF-8 to simplify character/token checking $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4)); if($ch === '<') { $parserState = self::STATE_READING_TAG_NAME; $commitCurrentContentString(); continue; } if($ch === '>') { $parserState = self::STATE_READING_CONTENT; continue; } if($parserState === self::STATE_READING_CONTENT) { $currentContentString .= $ch; continue; } } $commitCurrentContentString(); return implode($tagContentSeparator, $foundContentStrings); } }

Note that the to/from UTF-8 ↔ UTF-32 isn’t really necessary, I initially did the conversion as I was worried about splitting a multibyte character, but this isn’t possible given how the function reads the input string.

Now if we take the following HTML snippet:

<div>the quick brown fox</div><div>jumped over the moon</div>

… rendered in a browser, we get:

… with strip_tags() we get:

the quick brown foxjumped over the moon

… and with HTMLToPlainText::convert() (passing in “\n” for $tagContentSeparator), we get:

the quick brown fox jumped over the moon

The latter results in text that is semantically correct, as words in different blocks aren’t incorrectly joined. Presentationally we also get a more correct conversion but, the method isn’t really doing anything fancy here, this is due to the calling knowing a bit about the HTML snippet, how a browser would render it, and passing passing in “\n” for $tagContentSeparator.

Limitations / future work

The improvement here is that textual content is pretty preserved when doing a conversion, i.e. we don’t have to worry about textual elements being incorrectly concatenated. However, what I wrote is still lacking in 2 keys areas:

  • Generally, in terms of presentation, an arbitrary bit of HTML won’t map to what a user sees in a browser. To a certain degree this is an intractable problem, as presentation is based on browser defaults, CSS styles, etc. Also, there are things that simply don’t have a standard representation in plain-text (e.g. bold text, list items, etc.). However, there are cases where sensible defaults might make sense, e.g. stripping away <span> tags but putting newline between <p> tags.
  • Whitespace is trimmed from content within tags. This may or may not matter depending on application. In my case, I cared about the words and additional whitespace just added bloat even if it was more accurate to what was in the HTML.

EDIT: See part 2 on addressing these limitations and making the code more robust.