Improving on strip_tags (part 2)

Whitespace and tags

Previously, I looked at improving the functionality of strip_tags such that words across tags are not mashed together. The method I derived works well enough but it’s limited in that all tags are treated the same way and all whitespace separators are the same. I wanted to see if I could improve the method a bit more to address these limitations; that is, introducing whitespace based on the type of tag encountered, instead of injecting the same whitespace after stripping away every tag.

For example, when dealing with inline tags, whitespace should be preserved:

This bit of HTML:

<span>the quick brown fox </span><span>jumped over the moon</span>

… should produce:

the quick brown fox jumped over the moon

Alternatively, when dealing with block-level tags, a newline should be injected:

This bit of HTML:

<div>the quick brown fox</div><div>jumped over the moon</div>

… should produce:

the quick brown fox
jumped over the moon

Note that we’re simply talking about common/expected browser behavior from what’s thought of as inline-level or block-level tags. In reality, this categorization isn’t really part of the HTML standard anymore and layout behavior is determined by CSS, as MDN notes.

That said, when looking at arbitrary HTML content, I still think “block” vs. “inline” is a useful distinction, at least insofar as inferring default or common behavior.

The special case

The <br> tag presents a special case. While it’s classified as an inline element, <br> represents whitespace that is generally similar to that of a block-level element (e.g. a newline). In implementation this is simple to handle but does introduce a tiny bit of additional complexity.

Looking at the high-level transformations needed, we get the following:

  • Inline-level tags → strip away (no action needed, don’t alter any existing whitespace within tag contents)
  • Block-level tags → strip away, replace with newline
  • <br> tags → strip away, replace with newline

Code

Reworking the convert() method from the previous post, we get the following:

class HTMLToPlainText {

    const BLOCK_LEVEL_ELEMENTS = [
        "address", "article", "aside", "blockquote", "details", "dialog", "dd",
        "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form",
        "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "li",
        "main", "nav", "ol", "p", "pre", "section", "table", "ul"
    ];

    const INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE = [
        "br",
    ];

    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $blockContentSeparator = "\n"): string {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // output string built up from the content found within tags
        $output = "";

        // buffer for current/last tag name read
        $currentTagName = "";
        $currentTagIsClosing = null;

        // buffer for content in the current tag being read
        $contentInCurrentTag = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

        $flushCurrentToOutput = function() use (&$output, &$contentInCurrentTag, &$currentTagName, &$currentTagIsClosing, &$blockContentSeparator) {
            // handle inline tags which produce a newline (e.g. <br>)
            // .. note that these can be empty (<br>) or self-closing (<br/>)
            if(in_array(strtolower($currentTagName), self::INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE)) {
                $output .= $contentInCurrentTag . $blockContentSeparator;
            } else {
                // append $blockContentSeparator if we're at the *opening or closing* of a block-level element
                // (for inline elements, leave content as-is)
                if (in_array(strtolower($currentTagName), self::BLOCK_LEVEL_ELEMENTS)) {
                    $output .= $contentInCurrentTag . $blockContentSeparator;
                } else {
                    $output .= $contentInCurrentTag;
                }
            }

            // reset
            $contentInCurrentTag = "";
            $currentTagIsClosing = null;
            $currentTagName = "";
        };

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $flushCurrentToOutput();
                $parserState = self::STATE_READING_TAG_NAME;
                continue;
            }

            if($ch === '>') {
                $flushCurrentToOutput();
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState == self::STATE_READING_TAG_NAME && $ch == '/') {
                $currentTagIsClosing = true;
                continue;
            }

            if($parserState == self::STATE_READING_TAG_NAME) {
                $currentTagName .= $ch;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $contentInCurrentTag .= $ch;
                continue;
            }
        }

        $flushCurrentToOutput();

        return trim($output, $blockContentSeparator);
    }
}

Testing

Throwing some arbitrary bits of HTML at this function seems to indicate that the method works correctly but a method like this really calls for some form of automated testing. I could derive test cases from the function logic, and this is what’s typically done when testing some arbitrary method, but this approach is biased and limited here. Biased in that I’d be looking at the function and coming up with test cases based upon my experiences (what I’ve encountered and where I think there may be potential issues). Limited in that I’d likely only come up with a handful of test cases unless I invested a significant chunk of time into compiling a comprehensive set of cases; HTML has relatively few building blocks but, given the number of different ways those blocks can be combined and arranged, we end up with a fairly large number of permutations. What would really be effective here is testing with a large and varied corpus of test cases, mapping HTML snippets to plain text representations; i.e. data-driven testing. It’s usually hard to generate or find data for such testing but the PHP repository has a number of test cases for strip_tags() that can be leveraged:

  • strip_tags_basic1.phpt has some good baseline tests (HTML tags, PHP tags, tags with attributes, HTML comments, etc.)
  • strip_tags_basic2.phpt has a good test case (different tags + mix of block and inline elements + PHP tags) but is really testing the allowed_tags_array argument to strip_tags(), which I forgot was a thing and didn’t consider in my method

Beyond the test cases in these 2 files, there are other good cases scattered in the repo, seemingly tied to specific bugs encountered (e.g. bug #53319, which involves handling of “<br />” tags) but they can be hard to locate given the organization, or lack thereof, of the test files. In any case, it’s great having this data to work with and some issues did surface when I began subjecting my code to some of these tests (e.g. the content separator for block-level elements needing to be appended at the point of both the opening and closing tags, not just the closing tag).

Implementation-wise, testing is mainly encoding the test cases in a map and asserting that the actual results match expectations:

$testCases = [
    "<html>hello</html>" => "hello",
    "<?php echo hello ?>" => "",
    "<? echo hello ?>" => "",
    "<% echo hello %>" => "",
    "<script language=\"PHP\"> echo hello </script>" => " echo hello ",
    "<html><b>hello</b><p>world</p></html>" => "hello\nworld",
    "<html><!-- COMMENT --></html>" => "",
    "<html><p>hello</p><b>world</b><a href=\"#fragment\">Other text</a></html><?php echo hello ?>" => "hello\nworldOther text",
    "<p>hello</p><p>world</p>" => "hello\n\nworld",
    '<br /><br />USD<input type="text"/><br/>CDN<br><input type="text" />' => "USD\nCDN",
];

foreach ($testCases as $html => $expectedPlainText) {
    $actualPlainText = HTMLToPlainText::convert($html);

    echo "TEST: " . $html . "\n";
    echo "EXPECTED: " . $expectedPlainText . "\n";
    echo "ACTUAL: " . $actualPlainText . "\n";
    echo "----\n";

    assert($actualPlainText === $expectedPlainText);
}

Testing is still limited here. I’d love to simply have a large batch of test cases to throw at the function but something like that is not readily available.

Limitations / future work

The new convert() method is more robust but there are still some key limitations when compared to the strip_tags() function:

  • PHP’s strip_tags() is actually a lot more robust when it comes to invalid/malformed HTML content, as the tests in strip_tags.phpt demonstrate
  • Preserving certain tags (as with the allowed_tags_array argument) wasn’t considered

Also, whitespace/separators produced from <br> elements at the beginning or end of any inputted HTML are stripped away. I don’t think this is correct, as browsers preserve whitespace from <br> elements and don’t collapse them as with empty block-level elements.


A look at 2D vs WebGL canvas performance

I did some quick benchmarking with canvas-image-transformer, looking at the performance between directly manipulating pixels on a 2D canvas versus using a fragment shader on a WebGL canvas. For testing, I used a grayscale transformation as it can be done with a simple weighted sum (R*0.2126 + G*0.7152 + B*0.0722) and there’s a high degree of parity between the fragment shader code and the code for pixel operations on a 2D canvas.

Converting to grayscale

Pixel operations on the 2D canvas are as follows:

for(var i=0; i<pixels.data.length; i+=4) {
    var grayPixel = parseInt(((0.2126*(pixels.data[i]/255.0)) + (0.7152*(pixels.data[i+1]/255.0)) + (0.0722*(pixels.data[i+2]/255.0))) * 255.0);
    pixels.data[i] = grayPixel;
    pixels.data[i + 1] = grayPixel;
    pixels.data[i + 2] = grayPixel;
}

The corresponding fragment shader for the WebGL canvas is as follows:

precision mediump float;

uniform sampler2D uSampler;
varying vec2 vTextureCoord;

void main(void) {
    vec4 src = texture2D( uSampler, ( vTextureCoord ) );
    float grayPx = src.r*0.2126 + src.g*0.7152 + src.b*0.0722;
    gl_FragColor = vec4(grayPx, grayPx, grayPx, 1);
}

Performance comparisons in Chrome

Here’s the setup for comparing the performance of the 2 methods:

  • Input was a 3864×3864 image of the Crab Nebula, rendered onto a 2D canvas (note that time to render onto the 2D canvas is not considered in the data points below)
  • Output is the 2D canvas that the input image was rendered on
  • CPU was an AMD Ryzen 7 5700X
  • GPU was a RTX 2060
  • OS is Windows 10 Build 19044
  • Browser is Chrome 108.0.5359.125
  • Hard refresh on page load to bypass any browser-level caching
  • Transformation via WebGL approach for 25 iterations
  • Transformation via 2D canvas approach for 25 iterations (a rough sketch of the timing loop is shown after this list)

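As a rough sketch of how this kind of per-iteration timing can be captured, the loop below wraps the 2D canvas pass from above in performance.now() calls (this isn’t canvas-image-transformer’s actual test harness; srcCanvas is assumed to be the 2D canvas holding the input image, and the WebGL pass would be wrapped the same way):

const ctx = srcCanvas.getContext('2d');
const timings = [];

for (let iter = 0; iter < 25; iter++) {
    const start = performance.now();

    // same grayscale pass as above, applied directly to the 2D canvas pixels
    const pixels = ctx.getImageData(0, 0, srcCanvas.width, srcCanvas.height);
    for (let i = 0; i < pixels.data.length; i += 4) {
        const grayPixel = parseInt(((0.2126*(pixels.data[i]/255.0)) + (0.7152*(pixels.data[i+1]/255.0)) + (0.0722*(pixels.data[i+2]/255.0))) * 255.0);
        pixels.data[i] = grayPixel;
        pixels.data[i + 1] = grayPixel;
        pixels.data[i + 2] = grayPixel;
    }
    ctx.putImageData(pixels, 0, 0);

    timings.push(performance.now() - start);
}

console.table(timings);
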
Visually, this is what’s being done:

canvas-image-transformer grayscale conversion

I tried to eliminate as much background noise as possible from the results; that is, eliminating anything that may have an impact on CPU or GPU usage: closing other applications with significant usage, not having any other tabs open in the browser, and not having DevTools open when image processing was being done. That said, I was not rigorous about this and the numbers presented are meant to show overall/high-level behavior and performance; they’re not necessarily representative of what peak performance would be on the machine or browser.

It’s also worth noting that canvas-image-transformer doesn’t attempt to do any sort of caching between iterations (i.e. textures are re-created, shaders are re-compiled, etc. on each iteration), so we shouldn’t expect large variances in performance from one iteration to the next.

Graphing the data points for each approach, for each iteration, I got the following (note that what’s presented is just the data for 1 test run; I did test multiple times and consistently saw the same behavior but, for simplicity, I just graphed the values from 1 test run):

canvas-image-transformer performance data

So, the data points for the first iteration are interesting.

  • On the 2D canvas, the transformation initially takes 371.8ms
  • On the WebGL2 canvas, the transformation initially takes 506.5ms

That’s a massive gap in performance between the 2 methods, with the 2d canvas method being significantly faster. I would have expected the WebGL approach to be faster here as, generally, graphics-related things would be faster with a lower-level GPU interface, but that’s clearly not the case here.

For subsequent iterations, we can see that performance improves and normalizes for both approaches, with significantly better performance using the WebGL approach; however, why don’t we see this sort of performance during the first iteration? Profiling the code, I noticed I was consistently seeing the majority of execution time spent on texImage2D() during the first iteration:

gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA8, gl.RGBA, gl.UNSIGNED_BYTE, srcCanvas);

Looking at the execution time of texImage2D() across iterations, we get the following:

canvas-image-transformer texImage2D execution time

We see 486.9ms spent in texImage2D() during the first iteration but then execution time drops to only ~0.1ms in subsequent iterations. It’s clear that loading data into a texture is the most costly operation on the first iteration, however it looks like there’s some sort of caching mechanism, likely in Chrome’s GPU component, that essentially eliminates this cost on subsequent iterations.

I briefly looked into ways to optimize the texImage2D() call in the first iteration but didn’t find much. There’s no mipmap creation or any sort of format conversion being done here, so we’re just bound by how quickly we can get the pixels into VRAM.

Normal refresh (after previous page load)

There’s a bit more nuance here that’s worth touching on. Looking at just the first iteration in Chrome, after normal/soft refreshes, we see some interesting behavior:

canvas-image-transformer performance data, first iteration only
  • For the 2d canvas, the first iteration transformation times look the same as when doing a hard refresh
  • For the WebGL canvas, we’re getting the transformation times we saw after the first iteration when doing a hard refresh!

It looks like Chrome’s texture caching mechanism is in play and preserves cache entries across soft page refreshes.

What about Firefox and other browsers?

I would expect most Chromium-based browsers to have similar behavior to what’s in Chrome, and some quick testing in Edge confirms this.

Firefox is a different beast. Testing in Firefox 108.0.2, we see the following transformation times:

canvas-image-transformer performance data

Performance, overall, is much more consistent than in Chrome, but not always better.

  • For the 2d canvas method, performance is simply worse; on the first iteration we see transformations take 150+ milliseconds more than in Chrome, and on subsequent iterations the performance gap is even wider.
  • For the WebGL method, our first iteration performance is significantly better than Chrome, reduced by more than 175 milliseconds. However, on subsequent iterations we don’t see the drastic performance improvement we see in Chrome.

For the 2D canvas method, it’s hard to say why it performs so differently from Chrome. However, for the WebGL method, a bit of profiling led to some interesting insights. In Firefox, the execution time of texImage2D() is consistent across iterations, hovering around 40ms; this means it performs significantly better than Chrome’s worst case (first iteration) and significantly worse than Chrome’s best case (non-first iterations, where execution time is below 0.1ms), as shown below.

canvas-image-transformer performance data

The other significant aspect of Firefox’s performance is the canvas drawImage() call, when drawing from a WebGL canvas to a 2D canvas. At the tail end of the transformation process, canvas-image-transformer does the following:

const srcCtx = srcCanvas.getContext('2d');
srcCtx.drawImage(glCanvas, 0, 0, srcCanvas.width, srcCanvas.height);

Basically, it’s taking what’s on the WebGL canvas and writing it out to the input/source canvas, which is a 2D canvas. In Chrome this is a very fast operation, typically less than 2ms; in Firefox I see this typically going above 200ms.

canvas-image-transformer performance data

Firefox consistency

Finally, looking at transformation times across soft refreshes, we see Firefox performance is very consistent for both the 2D canvas and WebGL method:

canvas-image-transformer performance data

However, I did encounter a case where WebGL performance was more erratic; this was when testing with a lot of tabs open, and I suspect there was some contention for GPU resources.

Takeaways

There are perhaps a number of small insights here depending on use-case and audience, but there are 2 significant high-level takeaways for me:

  • GPUs are very fast at parallel processing but loading data to be processed and retrieving the processed data can be expensive operations
  • It’s worthwhile to measure things; I was fairly surprised by the different performance profiles between Firefox and Chrome


Versioning datasets

Contracts

An issue I’ve kept coming across when working on data systems that involve producing and consuming a number of different datasets is the lack of a contract between producers and consumers. Versioning provides a solution to this problem when dealing with software and, with a decent versioning scheme, provides a good solution for datasets as well, allowing for the creation of versioned snapshots.

Data concerns

It’s worth looking at what the problem is here and why this even matters. Imagine having some dataset, let’s say for drugs, which is periodically updated. We could reasonably say that we only care about the latest version of the drugs dataset, so every time we ingest new data, we simply overwrite the existing dataset.

For a rudimentary system, this is fine, but if we’re thinking in terms of a larger data system with this dataset being consumed by downstream processes, teams, and/or customers, there are a few concerns our system can’t elegantly deal with:

  • Corruption: the ingested data is corrupt or a bug in the ETL process results in a corrupted dataset
  • Consistent reads: not all parts (e.g. tables) of our dataset may be ready for reads by consumers at a given time (loading data to S3 is a good example here; for a non-trivial dataset spread across multiple objects and partitions, the dataset as a whole can’t be written/updated atomically)
  • Breaking changes: a breaking change to downstream systems (e.g. dropping a column) may need to be rolled out
  • Reproducibility: downstream/derived datasets may need to be re-created based upon what the dataset was at some point in the past (i.e. using the latest dataset will not give the same results)
  • Traceability: we may need to validate/understand how a derived data element was generated, requiring an accurate snapshot of all input data when the derived dataset was generated

Versioning isn’t the only solution to these concerns. You could argue that frequent backups, some sort of locking mechanism, coordination between teams, and/or very granular levels of observability can address each to varying degrees, but I think versioning (a) is simple and (b) requires the least effort.

Versioning scheme

Let’s look at a versioning scheme that would address the concerns raised above. For this, I’m going to borrow from both semantic versioning and calendar versioning. Combining the 2, and adding a bit of additional metadata, we can construct a scheme like the following: major.minor.patch.YYYY0M0D.rev (e.g. 2.2.11.20221203.1).

Breaking this down:

  • The semantic versioning components (major, minor, patch) can effectively tell us about the spatial nature of the dataset: the schema.
  • The calendar versioning components (YYYY0M0D) can effectively tell us about the temporal nature of the dataset (when it was ingested, generated, etc.). Note that calendar versioning is a lot more fuzzy as a standard, as there’s a lot of variance in how dates are represented; YYYY0M0D seems like a good choice as it’s easily parsable by consumers.
  • The final component (rev) is the revision number for the given date and is needed for datasets that can be generated/refreshed multiple times in a day. I think of this as an incrementing integer but a time component (hours, minutes, seconds) is another option; either can work, there are just tradeoffs in implementation and consumer expectations.

Finding a version

Going back to our example, our data flow now looks something like this:

Note that, before, our consumers knew exactly where to look for the dataset (s3://bucket/drugs-data/latest), or more specifically, the latest version of the dataset; however, this is no longer the case. Consumers will need to figure out what version of the dataset they want. This could be trivial (e.g. consumers just want to pin to a specific version) but the more interesting and perhaps more common case, especially with automated systems, is getting the latest version. Unpacking “latest” is important here: consumers want the latest data but not if it carries with it a breaking schema change (i.e. consumers want to pin to the major version component, with the others being flexible). Thinking in terms of npm-esque ranges with the caret operator, a consumer could specify a version like ^2.2.11.20221203.1, indicating their system is able to handle, and should pull in, any newer, non-breaking updates to either the schema or the data.

So consumers can indicate what they want, but how does a system actually go about finding a certain version? I think the elegant solution here is having some sort of metadata for the dataset that can tell consumers what versions of the dataset are available and where to find them. Creating or updating these metadata entries can simply be another artifact of the ETL process, and they can be stored alongside the dataset (in a manifest file, a table, etc.). Unfortunately, this does involve a small lift and a bit of additional complexity for consumers, as they’d have to read/parse the metadata record.
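
As a rough illustration of what this could look like (the manifest shape, bucket paths, and helper names below are made up for the example and aren’t tied to any particular storage layer or tool), resolving the latest version within a pinned major version might be as simple as:

// Hypothetical manifest describing the available versions of a dataset and where they live
const manifest = {
    dataset: "drugs-data",
    versions: [
        { version: "2.2.11.20221203.1", location: "s3://bucket/drugs-data/2.2.11.20221203.1/" },
        { version: "2.2.12.20221210.1", location: "s3://bucket/drugs-data/2.2.12.20221210.1/" },
        { version: "3.0.0.20221215.1", location: "s3://bucket/drugs-data/3.0.0.20221215.1/" },
    ],
};

// Split "major.minor.patch.YYYY0M0D.rev" into numeric components
const parseVersion = (v) => v.split(".").map(Number);

// Find the newest version whose major component matches the pinned major (^-style resolution)
function latestCompatible(manifest, pinnedMajor) {
    return manifest.versions
        .filter((entry) => parseVersion(entry.version)[0] === pinnedMajor)
        .sort((a, b) => {
            const pa = parseVersion(a.version);
            const pb = parseVersion(b.version);
            for (let i = 0; i < pa.length; i++) {
                if (pa[i] !== pb[i]) {
                    return pb[i] - pa[i]; // descending, newest first
                }
            }
            return 0;
        })[0] || null;
}

console.log(latestCompatible(manifest, 2)); // -> the 2.2.12.20221210.1 entry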

Dataset-level vs. Data-level versioning

In researching other ways in which versioning is done, change data capture (CDC) methods usually come up. While CDC methods are important and powerful, they are typically applied at the row level, not the dataset level, and it’s worth recognizing the distinction, especially from a data systems perspective, as CDC methods come with very different architectural and implementation concerns.

For example, in this blog post from lakeFS, approach #1 references full duplication, which is dataset versioning, but then approach #2 references valid_from and valid_to fields, which is a CDC method and carries with it the requirement to write queries that respect those fields.

Avoiding full duplication

The scheme I’ve laid out somewhat implies a duplication of records for every version of a dataset. I’ve seen a number of articles bring this up as a concern, which can very well be true in a number of cases, but I’m skeptical of this being a priority concern for most businesses, given the low cost of storage. In any case, I think storage-layer concerns may impact how you reference versions (more generally, how you read/write metadata), but shouldn’t necessarily dictate the versioning scheme.

From what I’ve read, most systems that try to optimize for storage do so via a git-style model. This is what’s done by cloud service providers like lakeFS and tools like git LFS, ArtiV, and DVC.

Alternatives

I haven’t come across much in terms of alternatives but, in addition to a semantic identifier, this DZone article also mentions data versions potentially containing information about the status of the data (e.g. “incomplete”) or information about what’s changed (e.g. “normalized”). These are interesting ideas but not really something I’ve seen a need for in the version identifier. That said, what I’ve presented is not intended to be some sort of silver bullet; I’m sure different engineers face different concerns and different versioning schemes would be more appropriate for them.

In the end, I would simply encourage engineers to consider some form of versioning around their datasets, especially in larger data systems. It’s a relatively simple tool that can elegantly address a number of important concerns.


A look at S.M.A.R.T.

What is S.M.A.R.T.?

Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is a monitoring system built into computer hard drives and solid-state drives. SMART primarily consists of a set of attributes, with the disk monitoring current and worst values for said attributes. The attributes, their associated ID numbers, and the range for normalized values (1 to 253) are standardized across disks. Unfortunately, there’s no standardization around which attributes are implemented, the range for raw values, what raw values actually represent, or what the threshold is for a normalized value. Despite the lack of standardization around the attribute values, SMART still provides some value and offers a significant degree of observability into the state and behavior of the disk.

SMART is something I’ve been aware of for a while, but it never seemed to really matter all that much for my consumer desktop needs. There are a bunch of Windows apps to check the SMART attributes and I vaguely recall having something installed to check for out-of-range values but regularly monitoring/checking wasn’t something I took seriously; just having a backup was typically good enough and I could deal with a drive failure if/when it happened. Recently, motivated by re-using an old motherboard and CPU for a NAS server, making use of a batch of old hard disks I had accumulated, and maximizing the storage capacity of the box, I decided to take another look at SMART and its efficacy in predicting drive failures. Ideally, I could have a system where a drive could be replaced before a failure results in data loss.

SMART attributes correlated to hard drive failure

Google

A Google study from 2007, “Failure Trends in a Large Disk Drive Population”, lists 4 SMART attributes that are highly correlated with failure; these are:

  • SMART 5: Reallocated Sectors Count
  • SMART 187: Reported Uncorrectable Errors
  • SMART 197: Current Pending Sector Count
  • SMART 198: Uncorrectable Sector Count

It’s also worth noting that any change in SMART 187 was seen to be highly predictive of failure:

… after their first scan error (i.e. when a positive value for 187 is observed for the first time), drives are 39 times more likely to fail within 60 days than drives with no such errors.

Other attributes were looked at but results were not always consistent across models and manufacturers. The study also found that using SMART parameters to predict failure was severely limited, as a large number of failed drives showed no SMART errors whatsoever:

Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives.

… even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables.

… failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.

This is incredibly important as correlating SMART attributes to failure means little if the correlation simply doesn’t matter for a significant percentage of drives. From skimming a few other papers, this is also something that I don’t always see being addressed/re-addressed, which is disappointing.

Backblaze

Backblaze conducted an analysis on their drives in 2016 which also showed some interesting results. In addition to the 4 SMART attributes identified in the Google study, Backblaze also found another attribute highly correlated to failure:

  • SMART 188: Command Timeout

Similar to the Google study, Backblaze also found a significant number of failed drives reporting no SMART errors for these 5 attributes but, interestingly, it was a smaller percentage than that in the Google study:

Failed drives with one or more of our five SMART stats greater than zero: 76.7%.

That means that 23.3% of failed drives showed no warning from the SMART stats we record.

Another study utilizing Backblaze’s data, “Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models”, was also interesting as it points to another attribute highly correlated with failure:

  • SMART 240: Head Flying Hours

We examine all SMART features for HDDs, and find out that head flying hours (HFH, SMART 240) is highly related to failures even if it is not correlated with other HDD features.

A more recent post from Backblaze looks at the paper “Interpretable predictive maintenance for hard drives”, which utilizes data published by Backblaze. I didn’t have a clear takeaway from the paper but it did highlight the limitations of previous studies:

The analyses from Backblaze and Google were univariate and only considered correlation between failures and a single metric at a time. As such, they would not be able to detect any nonlinear interactions between metrics that affected the chance of failure. Another limitation of this analysis is that it leaves humans to choose the cutoff values that will raise alerts if exceeded.

So, for hard drives, SMART is interesting. We can say that there are maybe 6 attributes we should definitely be looking at when observing for drive failure but, unfortunately, a drive may still fail without showing any anomalous values for these attributes.

SMART attributes correlated to solid-state drive failure

While their price-per-gigabyte is still much higher than that of a hard drive, solid-state drives are increasingly commonplace. SSDs do support SMART, but I couldn’t find as much research done on SSD failures. The study I mentioned above, “Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models”, didn’t use SMART attributes but instead daily performance logs in a proprietary format:

… daily performance logs for three MLC SSD models collected at a Google data center over a period of six years. All three models are manufactured by the same vendor and have a 480GB capacity … they utilize custom firmware and drivers, meaning that error reporting is done in a proprietary format rather than through standard SMART features.

Another study, “An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers”, didn’t find any correlation with SMART attributes:

Intra-node and intra-rack failures have limited correlations with the SMART attributes and have no significant differences of correlations with each SMART attribute. Thus, the SMART attributes are not good indicators for detecting the existence of intra-node and intra-rack failures in practice.

Finally, I looked at “SSD Failures in Datacenters: What? When? and Why?” which had a similar conclusion:

… even though tracking [SMART] symptoms is important, prognosis of whether a SSD will fail(-stop) or not, cannot be made entirely based on the symptoms. This motivates us to study other factors, beyond SMART symptoms, to better understand the characteristics of failed devices.

So it seems that for SSDs, SMART isn’t an effective tool when it comes to predicting failure.

Reading SMART attributes

On Windows, there are a number of tools to read SMART attributes; these posts on Super User list a bunch. On Linux, smartmontools seems to be available for most distros; it’s fairly easy to install and use (from the command line, something like sudo smartctl --all /dev/sda).
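
As a quick sketch of what basic monitoring on top of this could look like, the snippet below just shells out to smartctl and scrapes its tabular -A output for the failure-correlated attributes discussed above (this assumes smartmontools is installed, uses a made-up device path, and is not a substitute for talking to the device directly):

// Sketch: pull raw values for the failure-correlated SMART attributes by parsing smartctl output
const { execSync } = require('child_process');

const WATCHED_ATTRIBUTE_IDS = new Set([5, 187, 188, 197, 198, 240]);

const output = execSync('smartctl -A /dev/sda', { encoding: 'utf8' });

for (const line of output.split('\n')) {
    const cols = line.trim().split(/\s+/);
    const id = parseInt(cols[0], 10);

    // attribute rows start with the numeric attribute ID; the raw value starts at the 10th column
    if (!Number.isNaN(id) && WATCHED_ATTRIBUTE_IDS.has(id)) {
        console.log(`${cols[1]} (SMART ${id}): raw value = ${cols.slice(9).join(' ')}`);
    }
}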

As for reading the SMART data programmatically, information was more sparse. For Windows, this article provides some code and points to using DeviceIoControl() to communicate with the device driver to retrieve the attributes. For Linux, this post provides a lot of good information and code on how to read the attributes. I haven’t tried implementing either of these approaches myself, but it’s something I might play around with for a future project.


Preventing fake signups

The problem

One annoying problem I began encountering with ScratchGraph a while ago was fake signups. Every so often I would notice a new account created but there would be no interaction on the site beyond account creation. I’d also notice some other errors in the application log, as the spam bot would attempt to fill in and submit every form on the landing page, so I’d get login failure and reset account errors as well. At first, I figured I could just ignore this: metrics would be a bit off, I’d have a bounced welcome email from time to time, and I could just purge the few fake accounts at some point in the future. Unfortunately it got to the point where there were so many fake accounts being created that figuring out if an account was real took more effort, there was a lot of garbage in the database and logs and, perhaps most importantly, welcome emails being sent out would be bounced or flagged as spam, dragging down my email reputation and increasing the possibility of legitimate emails going to spam.

A solution

There are a bunch of blog posts from email marketing services detailing this issue, e.g. from Mailchimp and Folderly. There are usually a few common solutions mentioned:

  • ReCAPTCHA
  • Email confirmation for new accounts
  • Some sort of throttling
  • Honeypot fields

I opted for honeypot fields.

  • I wanted to minimize dependencies and additional integrations, so no ReCAPTCHA
  • Email confirmation seemed like too heavy of a lift and I disliked the idea of having a user jump from the application to their inbox
  • Throttling kinda makes sense (I could see if other forms were submitted around the same time, with the same email address, and flag the account); this is possible but not trivial for a PHP application where you don’t have service-level jobs running in the background

So, I added a honeypot field on the signup form. In practice, I wrapped a text input in a hidden div and gave the input the name fullname. A user’s full name isn’t actually requested/collected during sign up, and I figured the spambot may try to gauge what to do with the field based on its name, so giving it a realistic name made sense. On the backend, if fullname is non-empty, the request still succeeds (HTTP 200), but an account is not created. Instead, the email and IP address are logged in a table for spam signups (I figure I might be able to do something with this data at some point).
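
Here’s a minimal sketch of that check (standalone JavaScript with made-up names and a fake in-memory store; ScratchGraph’s actual implementation is PHP and writes to its own tables):

// honeypot check: "fullname" is the hidden field, so any non-empty value means a bot filled in the form
function handleSignup(formFields, requestIp, db) {
    if (formFields.fullname && formFields.fullname.length > 0) {
        // pretend success (HTTP 200) but don't create an account; just log the attempt
        db.spamSignups.push({ email: formFields.email, ip: requestIp });
        return { status: 200, accountCreated: false };
    }

    db.accounts.push({ email: formFields.email });
    return { status: 200, accountCreated: true };
}

// usage: a spambot fills in the hidden fullname field, a real user never does
const db = { accounts: [], spamSignups: [] };
console.log(handleSignup({ email: "bot@example.com", fullname: "John Smith" }, "203.0.113.5", db));
console.log(handleSignup({ email: "person@example.com", fullname: "" }, "198.51.100.7", db));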

Efficacy

I was somewhat skeptical as to how well a honeypot field on the signup form would work; I was imagining spambots being incredibly sophisticated, maybe noticing that the field was hidden or that there was no change on the frontend. Turns out this is not the case, or at least not the case for the spambots I was facing. From what I can tell, this relatively simple approach has detected and caught all fake signups from spambots.

Fake signup caught via honeypot field

I imagine there’s a shelf life to this solution (happy to be wrong) but, when it starts to fail, I can always integrate another approach such as throttling or ReCAPTCHA.


Improving on strip_tags

The Problem

PHP’s strip_tags() function will strip away tags but makes no attempt to introduce whitespace to separate content in adjacent tags. This is an issue with arbitrary HTML, as adjacent block-level elements may not have any intermediate whitespace and simply stripping away the tags will incorrectly concatenate the textual content of the 2 elements.

For example, running strip_tags() on the following:

<div>the quick brown fox</div><div>jumped over the moon</div>

… will return:

the quick brown foxjumped over the moon

This is technically correct (we’ve stripped away the <div> tags) but having no whitespace between “fox” and “jumped” means we’ve transformed the content such that we’ve lost semantic and presentational details.

The Solution

There are 2 ways I can see to fix this behavior:

  • Pre-process the HTML content to ensure or introduce whitespace between block-level elements
  • Don’t use strip_tags() and utilize a method that better understands the need for spacing between elements

I’ll focus on the latter because that’s the avenue I went down and I didn’t consider pre-processing at the time.

Pulling together a quick-and-dirty parser, I wrote the following. It’s worth noting that it still doesn’t really consider what the tags are (e.g. whether they’re inline or block) but it allows the caller to specify a string ($tagContentSeparator), typically some whitespace, that is inserted between the stripped away tags:

<?php

class HTMLToPlainText {

    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $tagContentSeparator = " "): string {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // string within tags that we've found
        $foundContentStrings = [];

        // buffer for current content being read
        $currentContentString = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

        // method to add a non-empty string to $foundContentStrings and reset $currentContentString
        $commitCurrentContentString = function() use (&$currentContentString, &$foundContentStrings) {
            if(strlen($currentContentString) > 0) {
                $foundContentStrings[] = trim($currentContentString);
                $currentContentString = "";
            }
        };

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $parserState = self::STATE_READING_TAG_NAME;
                $commitCurrentContentString();
                continue;
            }

            if($ch === '>') {
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $currentContentString .= $ch;
                continue;
            }
        }

        $commitCurrentContentString();

        return implode($tagContentSeparator, $foundContentStrings);
    }
}

Note that the UTF-8 ↔ UTF-32 conversion isn’t really necessary; I initially did the conversion as I was worried about splitting a multibyte character, but this isn’t possible given how the function reads the input string.

Now if we take the following HTML snippet:

<div>the quick brown fox</div><div>jumped over the moon</div>

… rendered in a browser, we get:

… with strip_tags() we get:

the quick brown foxjumped over the moon

… and with HTMLToPlainText::convert() (passing in “\n” for $tagContentSeparator), we get:

the quick brown fox
jumped over the moon

The latter results in text that is semantically correct, as words in different blocks aren’t incorrectly joined. Presentationally we also get a more correct conversion but the method isn’t really doing anything fancy here; this is due to the caller knowing a bit about the HTML snippet, how a browser would render it, and passing in “\n” for $tagContentSeparator.

Limitations / future work

The improvement here is that textual content is properly preserved when doing a conversion, i.e. we don’t have to worry about textual elements being incorrectly concatenated. However, what I wrote is still lacking in 2 key areas:

  • Generally, in terms of presentation, an arbitrary bit of HTML won’t map to what a user sees in a browser. To a certain degree this is an intractable problem, as presentation is based on browser defaults, CSS styles, etc. Also, there are things that simply don’t have a standard representation in plain-text (e.g. bold text, list items, etc.). However, there are cases where sensible defaults might make sense, e.g. stripping away <span> tags but putting newline between <p> tags.
  • Whitespace is trimmed from content within tags. This may or may not matter depending on application. In my case, I cared about the words and additional whitespace just added bloat even if it was more accurate to what was in the HTML.


Lessons from building microservices | part 1: the industry’s influence

I started learning about microservices and seeing momentum behind the architectural pattern around 2015. A number of developers I worked with at Grovo were enthusiastic about the idea as a solution to a number of scaling issues we were facing. The more I looked around, the more I saw developers across the industry promoting microservices as the new, modern architectural pattern. For some companies, this was a natural evolution of their systems architecture but, for many companies, especially smaller startups, the push towards microservices seemed driven by the idea that, despite the cost and difficulties, this was going to be the status quo for leading tech companies and that was the direction they should be headed in. I was on the “smaller startup” side of the industry and, from my vantage point, this perspective didn’t seem to be based on any sort of real engineering analysis but more-so on aligning with what was being talked about in the industry, adopting what larger companies (e.g. Netflix) were doing, and a belief that microservices were a silver bullet for scaling issues. Looking back, none of this is too surprising. Following in the footsteps of Big Tech, even when your business is operating at a different scale and your tech problems look very different, or jumping on the hype train for the latest technology (e.g. the “just-got-back-from-a-conference” effect), continues to be prevalent in the industry.

For VC-backed startups, it’s also worth looking at the state of the market around 2014-2015. Venture capital funding in tech shot up along with the size of the deals. Closing a VC round and being pushed to grow was also not out of the ordinary. I’ll hand-wave a bit here, but I think this also led to larger engineering teams, a corresponding overconfidence in what could be tackled, and an underestimation of the difficulties of working within and maintaining a topologically complex architecture.

The lesson in all this is to look at and understand industry dynamics relative to your company and your role. Software isn’t made in a vacuum and the zeitgeist of the industry is a key factor in how engineering decisions are made and what solutions manifest.

Finally, this may all sound cynical and, in terms of the pattern itself, it sort of is. Working on microservices led me to view them as an optimization and not a general pattern for application development. However, building with microservices was also a powerful forcing function to really look at and tackle inefficiencies in infrastructure, deployments, and code structure. In addition, the company-level initiative to push towards microservices highlighted the dynamics by which decisions are made and driven, from industry buzz to company leaders to individual engineers. What I learned from all of this improved my technical work in areas beyond microservices and that’s what I hope to really highlight in this and future posts.


Deployments with git tags + npm publish

Git tags and npm

Deployment workflows can vary a lot, but what I’ve tended to find ideal is to tag releases on GitHub (or whatever platform, as most have some mechanism to handle releases), the tag itself being the version number of whatever is being deployed, and having a deployment pipeline orchestrate and perform whatever steps are necessary to deploy the application, service, library, etc. This flow is well supported with git, well supported on platforms like GitHub, and is dead simple for developers to pick up and work with (in GitHub, this means filling out a form and hitting “Publish release”).

npm doesn’t play nicely with this workflow. With npm, version numbers aren’t tied to git tags, or any external mechanism, but instead to the value defined in the project’s package.json file. So, trying to publish a package via tagging requires some additional steps. The typical solutions seem to be:

  • Update the version in package.json first, then create the tag
  • Use some workflow that includes npm version patch to have npm handle the update to package.json and creating the git tag
  • Use an additional tool (e.g. standard-version), that tries to abstract away management of version numbers from both package.json and git tag

None of these options are great; versioning responsibility and authority is pulled away from git and, in the process, additional workflow complexity and, in the latter case, additional dependencies are introduced.

Version 0.0.0

In order to publish with npm, keep versioning authority with git, and maintain a simple workflow that doesn’t include additional steps or dependencies, the following has been working well in my projects:

  • In package.json, set the version number to “0.0.0”; this value is never changed within any git branch and, conceptually, it can be viewed as representing the “dev version” of the library. package.json only has a “non-dev version” for code published to our package repository.
  • In the deployment pipeline (triggered by tagging a release), update package.json with the version from the git tag.

    Most CI systems have some way of getting the tag being processed and working with it. For example, in CircleCI, working with tags formatted like vMAJOR.MINOR.PATCH, we can reference the tag, remove the “v” prefix, and set the version in package.json using npm version as follows:

    npm --no-git-tag-version version ${CIRCLE_TAG:1}

    Note that this update to package.json is only done within the checked-out copy of the code used in the pipeline. The change is never committed to the repo nor pushed upstream.
  • Finally, within the deployment pipeline, publish as usual via npm publish (the whole step boils down to something like the sketch below)
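
End to end, that publish step amounts to something like this small Node-style sketch (CIRCLE_TAG comes from CircleCI; the same idea applies to any CI system that exposes the tag being built):

// Sketch of the publish step: take the version from the git tag, stamp it into the
// checked-out package.json (without creating another git tag), and publish
const { execSync } = require('child_process');

const tag = process.env.CIRCLE_TAG;        // e.g. "v1.4.2"
const version = tag.replace(/^v/, '');     // strip the "v" prefix -> "1.4.2"

execSync(`npm --no-git-tag-version version ${version}`, { stdio: 'inherit' });
execSync('npm publish', { stdio: 'inherit' });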

Limitations

I haven’t run across any major limitations with this workflow. There is some loss of information captured in the git repository, as the version in package.json is fixed at 0.0.0, but I’ve yet to come across that being an issue. I could potentially see issues if you want to allow developers to do deployments locally via npm publish but, in general, I view local deployments as an anti-pattern when done for anything beyond toy projects.


Writing to the Bitfenix ICON display with Rust (part 2, writing text)

Bitmap Fonts

Picking up from being able to successfully write images to the Bitfenix ICON display, I started looking into how to render textual content. Despite their lack of versatility, using a bitmap font was a natural option: they’re easy to work with, they’re a good fit for fixed-resolution displays, and rendering is fast.

I pulled up the following bitmap font from an old project (I don’t remember the source used to create this):

A few points about this font:

  • Each character glyph is 16×16, with a total of 256 characters in the 256×256 bitmap
  • Characters are arranged/indexed left-to-right, top-to-bottom, and the index of a 16×16 block will correspond with an ASCII or UTF8 codepoint value (e.g. the character at index 33 = “!”, which also maps to UTF8 codepoint 33)
  • Despite space for 256 characters, there’s a very limited set of characters here, but there’s enough for simple US English strings
  • While each character is 16×16 pixels, this is not a monospaced font; there is accompanying width data for each character
  • The glyphs are simply black and white (i.e. there’s no antialiasing on the character glyphs, so we can simply ignore black pixels and not worry about blending into the background)

Rendering Strings

We need to render characters from the bitmap font onto something. We could create a new image but, building upon what was done in part 1, I decided to render atop this background image. The composite of the background image + characters will be the image written to the ICON display.

To start, we’ll load the background image just as we did in part 1, but we need to make it mutable, as we’ll be writing character pixels direct to it:

// Background image needs to be 240x320 (24bpp, no alpha channel)
let mut background_image = reduce_image_to_16bit_color(&load_png_image("assets/1.png"));

Next, we’ll load the font PNG in the same manner (it doesn’t need to be mutable, as we’re not modifying the character pixels):

let font_image = reduce_image_to_16bit_color(&load_png_image("assets/fonts/font1.png"));

To handle text rendering, a TextRenderer struct is declared with the rendering logic encapsulated in TextRenderer.render_string(), which loops through each character in the input string, looks up the location of the character in the bitmap font, and renders the 16×16 block of pixels for the character onto the background image. The x-location of where to render a character is incremented by the width of the previous character (found via a lookup into the TextRenderer::font_widths vector).

pub struct TextRenderer {
    font_widths: Vec<u8>,
}

impl TextRenderer {
    pub fn new() -> TextRenderer {
        TextRenderer { font_widths: build_font_width_vec() }
    }

    pub fn render_string(&self, txt: &str, x: u64, y: u64, fontimg: &[u8], outbuf: &mut [u8]) {
        // x position of where character should be rendered
        let mut cur_x = x;

        for ch in txt.chars() {
            // From codepoint, lookup x, y of character in font and width of character from self.font_widths
            let ch_idx = ch as u64;
            let ch_width = self.font_widths[(ch as u32 % 256) as usize];
            let ch_x = (ch_idx % 16) * 16;
            let ch_y = ((ch_idx as f32 / 16.0) as u64) * 16;

            // For each character, copy the 16x16 block of pixels into outbuf
            for fy in ch_y..ch_y+16 {
                for fx in ch_x..ch_x+16 {
                    let fidx: usize = ((fx + fy*256) * 2) as usize;
                    let fdx = fx - ch_x;
                    let fdy = fy - ch_y;
                    let outbuf_idx: usize = (((cur_x + fdx) + (y + fdy)*240) * 2) as usize;

                    // If the pixel from the font bitmap is not black, write it out to outbuf
                    if fontimg[fidx] != 0x00 && fontimg[fidx+1] != 0x00 {
                        outbuf[outbuf_idx] = fontimg[fidx];
                        outbuf[outbuf_idx + 1] = fontimg[fidx + 1];
                    }
                }
            }

            cur_x = cur_x + (ch_width as u64);
        }
    }
}

The TextRenderer::font_widths vector is built by the build_font_width_vec() function, which simply returns a vector of hardcoded values for the character widths:

fn build_font_width_vec() -> Vec<u8> { let result = vec![ 7, 7, 7, 7, 7, 7, 7, 7, 7, 30, 0, 7, 7, 0, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 3, 5, 9, 7, 18, 8, 3, 5, 5, 7, 9, 4, 8, 3, 5, 8, 4, 7, 7, 8, 7, 7, 7, 7, 7, 3, 4, 6, 9, 6, 7, 9, 8, 8, 8, 8, 8, 8, 8, 8, 5, 7, 8, 7, 9, 9, 9, 8, 9, 8, 8, 9, 8, 9, 10, 9, 9, 8, 4, 6, 4, 8, 9, 5, 7, 7, 6, 7, 7, 6, 7, 7, 3, 5, 7, 3, 9, 7, 7, 7, 7, 6, 6, 6, 7, 7, 10, 7, 7, 6, 5, 3, 5, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 5, 5, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 6, 7, 7, 13, 3, 11, 10, 13, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ]; result }

Writing to the ICON display

The final piece is simply creating a TextRenderer instance and calling render_string() with the necessary parameters. Here we’ll write out “Hello world!” to position 10, 20 on the display:

let tr = TextRenderer::new();
tr.render_string("Hello world!", 10, 20, &font_image, &mut background_image);
Bitfenix ICON image display with text

All of the code presented is up on the bitfenix-icon-sysstatus repo.

Next, I’m looking to play around with the systemstat library to print out something useful to the display.


Avoiding loops and routing failures with A*

The problem

With GraphPaper, when generating paths with the A* algorithm, I ran into some interesting cases where the algorithm would fail to generate a path between the start and end points:

Failed path

Note that the path should go around, not directly across, the big rectangle in the center.

There are a few things being masked here, as GraphPaper will always try to produce a result, even if it’s non-optimal. So we need to disable:

  • The mechanism that makes a path directly from start to end if there’s a failure in generating a path
  • Path optimization

With these disabled, and some annotations showing the order in which the path segments were generated, we can get a better idea of what’s happening:

Failed path, showing details

Here we see that the algorithm generates a path that loops around its adjacent object (1-12), moves away from the object (13-14), then moves right back towards the object (15-18); at that point we’re out of accessible routing points, so we have a failure and we end the path by going directly to the endpoint. It’s important to note that routing points are not grid points here; they’re points designated around objects (the rectangles with the dots in the center) and anchors (the rectangles where paths start and end).

Why is this happening?

This might seem like a bug in the A* implementation but, as far as I can tell, the algorithm is performing as it should and the issue here comes down to the path cost computation done in the main loop. A* doesn’t actually define how to compute cost, but GraphPaper uses run-of-the-mill Euclidean distance as a metric.

In the A* main loop, the cost of going from our current point to a given point, n, is given by:

f(n) = g(n) + h(n)

g(n) is the total length of the path from the starting point to n (sum of the length of all pieces of the path). In practice, we keep track of the current path length, so we only need to compute and add the straight-line length from current to n:

g(n) = currentPathLength + computeLength(current, n)

h(n) is the straight-line length from n to the goal:

h(n) = computeLength(n, endpt)

The issues we’re seeing surface as a result of f(n) being based solely on distance computations. Note that:

  • We may have a good point, n: it’s close to our goal but requires a large jump from our current location; the large increase in g(n) leads to another, closer point being prioritized
  • We may have a bad point, n: it’s further from our goal but only requires a small jump from our current location; h(n) increases a bit, but the relatively small increase in g(n) leads to this point being prioritized over a better one

The implication here is also that shorter jumps become preferable, even if we end up going in the opposite direction of the goal, as such jumps are safer (i.e. less likely to lead to a blocked/invalid path).

Smarter routing

To improve the routing, I wanted to adjust the cost computation such that moving to points that are in the direction of (and thus closer to) the goal is prioritized, so I introduced a new term, t(n), in the cost computation:

f(n) = g(n) + h(n) - t(n)

t(n) = (vecCurrentTo(endpt) ⋅ vecCurrentTo(n)) × computeLength(current, n)

To compute t(n), we first compute the dot product of 2 normalized vectors:

  • vecCurrentTo(endpt): from current to the endpoint, representing the ideal direction we should head in
  • vecCurrentTo(n): from current to n, representing the direction to n

The dot product ranges from -1 to 1, tending to -1 when n is in the opposite direction of the ideal and tending to 1 when n is in the same direction as the ideal.

We then scale the dot product result, as we can’t just bias our overall cost, f(n), by a range this small; we scale by the straight-line length from current to n. Using this length as a scaling factor also serves to bias towards longer paths in good/positive directions over shorter paths in bad directions.

In code, the g(n), h(n) and t(n) components are computed like this:

// g(n) = length/cost of _startPoint to _vp + _currentRouteLength
const currentToVisibleLength = (new Line(currentPoint, visiblePt)).getLength();
let gn = currentToVisibleLength + _currentRouteLength;

// h(n) = length/cost of _vp to _endPoint
let hn = (new Line(visiblePt, _endPoint)).getLength();

// t(n) =
// a. get the relationship between the 2 vectors (dot product)
// b. scale to give influence (scale by currentToVisibleLength)
// .. using currentToVisibleLength as scaling factor is influence towards longer paths in good directions vs shorter paths in bad directions
const vecToTargetIdeal = (new Vec2(_endPoint.getX() - currentPoint.getX(), _endPoint.getY() - currentPoint.getY())).normalize();
const vecVisibleToEndpt = (new Vec2(visiblePt.getX() - currentPoint.getX(), visiblePt.getY() - currentPoint.getY())).normalize();
const tn = vecToTargetIdeal.dot(vecVisibleToEndpt) * currentToVisibleLength;

With the updated path cost computation, we successfully get a path to the endpoint and, even with path optimization off, we get a much better path overall, with fewer unnecessary points in the path and no looping around objects:

Fixed path with smarter routing

It’s also worth noting how this performs as the position of the object changes (again, path optimization is disabled and the mechanism that makes a path directly from start to end on routing failure is also disabled):

Without t(n)

With t(n)

In pretty much every case, applying the t(n) factor to the path computation produces a better path with no failures in routing to the endpoint.