Versioning datasets

Contracts

An issue I've kept coming across when working on data systems that involve producing and consuming a number of different datasets is the lack of a contract between producers and consumers. Versioning provides a solution to this problem when dealing with software and, with a decent versioning scheme, it can serve datasets well too, allowing for the creation of versioned snapshots.

Data concerns

It’s worth looking at what the problem is here and why this even matters. Imagine having some dataset, let’s say for drugs, which is periodically updated. We could reasonably say that we only care about the latest version of the drugs dataset, so every time we ingest new data, we simply overwrite the existing dataset.

For a rudimentary system, this is fine, but if we’re thinking in terms of a larger data system with this dataset being consumed by downstream processes, teams, and/or customers, there are a few concerns our system can’t elegantly deal with:

  • Corruption: the ingested data is corrupt or a bug in the ETL process results in a corrupted dataset
  • Consistent reads: not all parts (e.g. tables) of our dataset may be ready for reads by consumers at a given time (loading data to S3 is a good example here; for a non-trivial dataset with data spread across multiple objects and partitions, the dataset as a whole can't be written/updated atomically)
  • Breaking changes: a breaking change to downstream systems (e.g. dropping a column) may need to be rolled out
  • Reproducibility: downstream/derived datasets may need to be re-created based upon what the dataset was at some point in the past (i.e. using the latest dataset will not give the same results)
  • Traceability: we may need to validate/understand how a derived data element was generated, requiring an accurate snapshot of all input data when the derived dataset was generated

Versioning isn't the only solution to these concerns. You could argue that frequent backups, some sort of locking mechanism, coordination between teams, and/or very granular levels of observability can address each to varying degrees, but I think versioning (a) is simple and (b) requires the least effort.

Versioning scheme

Let's look at a versioning scheme that would address the concerns I raised above. For this, I'm going to borrow from both semantic versioning and calendar versioning. Combining the 2, and adding a bit of additional metadata, we can construct a scheme like the following:

MAJOR.MINOR.PATCH.YYYY0M0D.REV (e.g. 2.2.11.20221203.1)

Breaking this down:

  • The semantic versioning components (major, minor, patch) can effectively tell us about the spatial nature of the dataset, i.e. the schema.
  • The calendar versioning components (YYYY0M0D) can effectively tell us about the temporal nature of the dataset (when it was ingested, generated, etc.). Note that calendar versioning is a lot fuzzier as a standard, as there's a lot of variance in how dates are represented; YYYY0M0D seems like a good choice as it's easily parsable by consumers.
  • The final component (rev) is the revision number for the given date and is needed for datasets that can be generated/refreshed multiple times in a day. I think of this as an incrementing integer, but a time component (hours, minutes, seconds) is another option; either can work, there are just tradeoffs in implementation and consumer expectations.
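
To make this concrete, here's a minimal TypeScript sketch (the type and function names are mine, not from any standard tooling) that parses a version string in the MAJOR.MINOR.PATCH.YYYY0M0D.REV layout described above and orders versions by schema first, then date, then revision:

interface DatasetVersion {
    major: number;
    minor: number;
    patch: number;
    date: string; // YYYY0M0D, e.g. "20221203"
    rev: number;
}

function parseDatasetVersion(version: string): DatasetVersion {
    // e.g. "2.2.11.20221203.1"
    const [major, minor, patch, date, rev] = version.split(".");
    return {
        major: Number(major),
        minor: Number(minor),
        patch: Number(patch),
        date,
        rev: Number(rev),
    };
}

function compareDatasetVersions(a: DatasetVersion, b: DatasetVersion): number {
    // schema components (major, minor, patch) first, then ingest date, then revision
    return (
        a.major - b.major ||
        a.minor - b.minor ||
        a.patch - b.patch ||
        a.date.localeCompare(b.date) ||
        a.rev - b.rev
    );
}

With an ordering like this, "latest" stays well-defined even when a dataset is refreshed multiple times in a day.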

Finding a version

Going back to our example, our data flow now looks something like this:

Note that before, our consumers knew exactly where to look for the latest version of the dataset (s3://bucket/drugs-data/latest); however, this is no longer the case. Consumers will need to figure out what version of the dataset they want. This could be trivial (e.g. consumers just want to pin to a specific version) but the more interesting and perhaps more common case, especially with automated systems, is getting the latest version. Unpacking "latest" is important here: consumers want the latest data, but not if it carries with it a breaking schema change (i.e. consumers want to pin to the major version component, with the others being flexible). Thinking in terms of npm-esque ranges with the caret operator, a consumer could specify a version like ^2.2.11.20221203.1, indicating their system is able to handle, and should pull in, any newer, non-breaking updates in either schema or data.

So consumers can indicate what they want, but how does a system actually go about finding a certain version? I think the elegant solution here is having some sort of metadata for the dataset that can tell consumers what versions of the dataset are available and where to find them. Creating or updating these metadata entries can simply be another artifact of the ETL process, and they can be stored alongside the dataset (in a manifest file, a table, etc.). Unfortunately, this does involve a small lift and a bit of additional complexity for consumers, as they'd have to read/parse the metadata record.
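
As an illustration of what that lookup could look like, here's a sketch that builds on the parsing functions from the previous section; the manifest entry format is hypothetical, the point is just that resolving "latest, pinned to a major version" becomes a filter and a sort over the metadata:

// Hypothetical manifest entry: one per published version of the dataset
interface ManifestEntry {
    version: string;  // e.g. "2.2.11.20221203.1"
    location: string; // e.g. "s3://bucket/drugs-data/2.2.11.20221203.1/"
}

// Resolve the latest version compatible with a pinned major version (the caret-style
// behavior described above); parseDatasetVersion() and compareDatasetVersions() are
// from the earlier sketch.
function resolveLatest(entries: ManifestEntry[], pinnedMajor: number): ManifestEntry | undefined {
    return entries
        .filter((e) => parseDatasetVersion(e.version).major === pinnedMajor)
        .sort((x, y) => compareDatasetVersions(parseDatasetVersion(x.version), parseDatasetVersion(y.version)))
        .pop();
}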

Dataset-level vs. Data-level versioning

In researching other ways in which versioning is done, change data capture (CDC) methods usually come up. While CDC methods are important and powerful, they typically operate at the row level, not the dataset level, and it's worth recognizing the distinction, especially from a data systems perspective, as CDC methods come with very different architectural and implementation concerns.

For example, in this blog post from lakeFS, approach #1 references full duplication, which is dataset versioning, but then approach #2 references valid_from and valid_to fields, which is a CDC method and carries with it the requirement to write queries that respect those fields.

Avoiding full duplication

The scheme I've laid out somewhat implies a duplication of records for every version of a dataset. I've seen a number of articles bring this up as a concern, which can very well be true in a number of cases, but I'm skeptical of this being a priority concern for most businesses, given the low cost of storage. In any case, I think storage-layer concerns may impact how you reference versions (more generally, how you read/write metadata), but shouldn't necessarily dictate the versioning scheme.

From what I’ve read, most systems that try to optimize for storage do so via a git-style model. This is what’s done by cloud service providers like lakeFS and tools like git LFS, ArtiV, and DVC.

Alternatives

I haven't come across much in terms of alternatives but, in addition to a semantic identifier, this DZone article also mentions data versions potentially containing information about the status of the data (e.g. "incomplete") or information about what's changed (e.g. "normalized"). These are interesting ideas but not really something I've seen a need for in the version identifier. That said, what I've presented is not intended to be some sort of silver bullet; I'm sure different engineers face different concerns, for which different versioning schemes would be more appropriate.

In the end, I would simply encourage engineers to consider some form of versioning around their datasets, especially in larger data systems. It’s a relatively simple tool that can elegantly address a number of important concerns.


A look at S.M.A.R.T.

What is S.M.A.R.T.?

Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is a monitoring system included in computer hard drives and solid-state drives. SMART primarily consists of a set of attributes, with the disk monitoring current and worst values for said attributes. The attributes, their associated ID numbers, and the range for normalized values (1 to 253) are standardized across disks. Unfortunately, there's no standardization around which attributes are implemented, the range for raw values, what raw values actually represent, or what the threshold is for a normalized value. Despite the lack of standardization around the attribute values, SMART still provides some value and offers a significant degree of observability into the state and behavior of the disk.

SMART is something I’ve been aware of for a while, but it never seemed to really matter all that much for my consumer desktop needs. There are a bunch of Windows apps to check the SMART attributes and I vaguely recall having something installed to check for out-of-range values but regularly monitoring/checking wasn’t something I took seriously; just having a backup was typically good enough and I could deal with a drive failure if/when it happened. Recently, motivated by re-using an old motherboard and CPU for a NAS server, making use of a batch of old hard disks I had accumulated, and maximizing the storage capacity of the box, I decided to take another look at SMART and its efficacy in predicting drive failures. Ideally, I could have a system where a drive could be replaced before a failure resulting in data loss.

SMART attributes correlated to hard drive failure

Google

A Google study from 2007, "Failure Trends in a Large Disk Drive Population", lists 4 SMART attributes that are highly correlated with failure; these are:

  • SMART 5: Reallocated Sectors Count
  • SMART 187: Reported Uncorrectable Errors
  • SMART 197: Current Pending Sector Count
  • SMART 198: Uncorrectable Sector Count

It’s also worth noting that any change in SMART 187 was seen to be highly predictive of failure:

… after their first scan error (i.e. when a positive value for 187 is observed for the first time), drives are 39 times more likely to fail within 60 days than drives with no such errors.

Other attributes were looked at, but results were not always consistent across models and manufacturers. The study also found that using SMART parameters to predict failure was severely limited, as a large number of failed drives showed no SMART errors whatsoever:

Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives.

… even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables.

… failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.

This is incredibly important as correlating SMART attributes to failure means little if the correlation simply doesn’t matter for a significant percentage of drives. From skimming a few other papers, this is also something that I don’t always see being addressed/re-addressed, which is disappointing.

Backblaze

Backblaze conducted an analysis on their drives in 2016 which also showed some interesting results. In addition to the 4 SMART attributes identified in the Google study, Backblaze also found another attribute highly correlated to failure:

  • SMART 188: Command Timeout

Similar to the Google study, Backblaze also found a significant number of failed drives reporting no SMART errors for these 5 attributes but, interestingly, it was a smaller percentage than that in the Google study:

Failed drives with one or more of our five SMART stats greater than zero: 76.7%.

That means that 23.3% of failed drives showed no warning from the SMART stats we record.

Another study utilizing Backblaze’s data, “Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models“, was also interesting as it points to another attribute highly correlated with failure:

  • SMART 240: Head Flying Hours

We examine all SMART features for HDDs, and find out that head flying hours (HFH, SMART 240) is highly related to failures even if it is not correlated with other HDD features.

A more recent post from Backblaze looks at the paper "Interpretable predictive maintenance for hard drives", which utilizes data published by Backblaze. I didn't have a clear takeaway from the paper, but it did highlight the limitations of previous studies:

The analyses from Backblaze and Google were univariate and only considered correlation between failures and a single metric at a time. As such, they would not be able to detect any nonlinear interactions between metrics that affected the chance of failure. Another limitation of this analysis is that it leaves humans to choose the cutoff values that will raise alerts if exceeded.

So, for hard drives, SMART is interesting. We can say that there are maybe 6 attributes we should definitely be looking at when observing for drive failure but, unfortunately, a drive may still fail without showing any anomalous values for these attributes.

SMART attributes correlated to solid-state drive failure

While their price-per-gigabyte is still much higher than that of a hard drive, solid-state drives are increasingly commonplace. SSDs do support SMART, but I couldn't find as much research done on them. The study I mentioned above, "Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models", didn't use SMART attributes but instead daily performance logs in a proprietary format:

… daily performance logs for three MLC SSD models collected at a Google data center over a period of six years. All three models are manufactured by the same vendor and have a 480GB capacity … they utilize custom firmware and drivers, meaning that error reporting is done in a proprietary format rather than through standard SMART features.

Another study, “An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers” didn’t find any correlation with SMART attributes:

Intra-node and intra-rack failures have limited correlations with the SMART attributes and have no significant differences of correlations with each SMART attribute. Thus, the SMART attributes are not good indicators for detecting the existence of intra-node and intra-rack failures in practice.

Finally, I looked at “SSD Failures in Datacenters: What? When? and Why?” which had a similar conclusion:

… even though tracking [SMART] symptoms is important, prognosis of whether a SSD will fail(-stop) or not, cannot be made entirely based on the symptoms. This motivates us to study other factors, beyond SMART symptoms, to better understand the characteristics of failed devices.

So it seems that for SSDs, SMART isn’t an effective tool when it comes to predicting failure.

Reading SMART attributes

On Windows, there are a number of tools to read SMART attributes; these posts on superuser list a bunch. On Linux, smartmontools seems to be available for most distros; it's fairly easy to install and use (from the command line, something like sudo smartctl --all /dev/sda).

As for reading the SMART data programmatically, information was more sparse. For Windows, this article provides some code and points to using DeviceIoControl() to communicate with the device driver to retrieve the attributes. For Linux, this post provides a lot of good information and code on how to read the attributes. I haven't tried implementing either of these approaches myself, but it's something I might play around with for a future project.
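
I haven't tried the approaches above, but a simpler (if less direct) route is to shell out to smartctl and parse its output. Below is a rough TypeScript/Node sketch along those lines; it assumes smartmontools is installed, pulls the attribute IDs discussed above, and naively treats the last column of each attribute row as the raw value (some drives report raw values in a richer format, which this doesn't handle; newer versions of smartmontools can also emit JSON output, which would be more robust to parse):

import { execFileSync } from "child_process";

// Attribute IDs highlighted in the Google/Backblaze findings above
const ATTRIBUTES_OF_INTEREST = new Set([5, 187, 188, 197, 198, 240]);

function readSmartAttributes(device: string): Map<number, number> {
    // -A prints the vendor-specific attribute table; typically needs root
    const output = execFileSync("smartctl", ["-A", device], { encoding: "utf8" });
    const rawValues = new Map<number, number>();

    for (const line of output.split("\n")) {
        // Attribute rows start with the attribute ID, e.g.
        // "  5 Reallocated_Sector_Ct  0x0033  100  100  010  Pre-fail  Always  -  0"
        const fields = line.trim().split(/\s+/);
        const id = Number(fields[0]);
        if (!Number.isInteger(id) || !ATTRIBUTES_OF_INTEREST.has(id)) {
            continue;
        }
        // Last column is the raw value
        rawValues.set(id, Number(fields[fields.length - 1]));
    }

    return rawValues;
}

console.log(readSmartAttributes("/dev/sda"));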


Preventing fake signups

The problem

One annoying problem I began encountering with ScratchGraph a while ago was fake signups. Every so often I would notice a new account created, but there would be no interaction on the site beyond account creation. I'd also notice some other errors in the application log, as the spam bot would attempt to fill in and submit every form on the landing page, so I'd get login failure and reset account errors as well. At first, I figured I could just ignore this; metrics would be a bit off, I'd have a bounced welcome email from time to time, and I could just purge the few fake accounts at some point in the future. Unfortunately, it got to the point where so many fake accounts were being created that figuring out whether an account was real took more effort, there was a lot of garbage in the database and logs, and, perhaps most importantly, welcome emails being sent out would bounce or be flagged as spam, dragging down my email reputation and increasing the possibility of legitimate emails going to spam.

A solution

There's a bunch of blog posts from email marketing services detailing this issue (e.g. Mailchimp, Folderly). A few common solutions are usually mentioned:

  • ReCAPTCHA
  • Email confirmation for new accounts
  • Some sort of throttling
  • Honeypot fields

I opted for honeypot fields.

  • I wanted to minimize dependencies and additional integrations, so no ReCAPTCHA
  • Email confirmation seemed like too heavy of a lift and I disliked the idea of having a user jump from the application to their inbox

  • Throttling kinda makes sense (I could see if other forms were submitted around the same time, with the same email address, and flag the account); this is possible but not trivial for a PHP application where you don't have service-level jobs running in the background

So, I added a honeypot field on the signup form. In practice, I wrapped a text input in a hidden div and gave the input the name fullname. A user's full name is not otherwise requested/collected during sign up, and I figured the spambot may try to gauge what to do with the field based on its name, so I gave it a realistic one. On the backend, if fullname is non-empty, the request still succeeds (HTTP 200), but an account is not created. Instead, the email and IP address are logged in a table for spam signups (I figure I might be able to do something with this data at some point).
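
ScratchGraph's backend is PHP, so the snippet below isn't the actual implementation; it's just a sketch of the same idea in TypeScript/Express terms (the route, helper functions, and port are made up for illustration) to show how small the check is:

import express from "express";

const app = express();
app.use(express.urlencoded({ extended: false }));

// Hypothetical helpers; the real application handles these steps in PHP
function logSpamSignup(email: string, ip: string): void {
    console.log(`spam signup attempt: ${email} from ${ip}`);
}
function createAccount(email: string): void {
    console.log(`created account for ${email}`);
}

app.post("/signup", (req, res) => {
    // "fullname" is the hidden honeypot field; real users never see or fill it in
    if (req.body.fullname) {
        // Spambot: respond with HTTP 200 as if everything worked, but don't create an account
        logSpamSignup(req.body.email, req.ip ?? "");
        res.sendStatus(200);
        return;
    }

    createAccount(req.body.email);
    res.sendStatus(200);
});

app.listen(3000);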

Efficacy

I was somewhat skeptical as to how well a honeypot field on the signup form would work; I was imagining spambots being incredibly sophisticated, maybe noticing that the field was hidden or that there was no change on the frontend. Turns out this is not the case, or at least not the case for the spambots I was facing. From what I can tell, this relatively simple approach has detected and caught all fake signups from spambots.

Fake signup caught via honeypot field

I imagine there's a shelf life to this solution (happy to be wrong) but, when it starts to fail, I can always integrate another approach such as throttling or ReCAPTCHA.


Improving on strip_tags

The Problem

PHP's strip_tags() function will strip away tags but makes no attempt to introduce whitespace to separate content in adjacent tags. This is an issue with arbitrary HTML, as adjacent block-level elements may not have any intermediate whitespace, and simply stripping away the tags will incorrectly concatenate the textual content of the 2 elements.

For example, running strip_tags() on the following:

<div>the quick brown fox</div><div>jumped over the moon</div>

… will return:

the quick brown foxjumped over the moon

This is technically correct (we've stripped away the <div> tags) but having no whitespace between "fox" and "jumped" means we've transformed the content such that we've lost semantic and presentational details.

The Solution

There’s 2 ways I can see to fix this behavior:

  • Pre-process the HTML content to ensure or introduce whitespace between block-level elements
  • Don’t use strip_tags() and utilize a method that better understands the need for spacing between elements

I’ll focus on the latter because that’s the avenue I went down and I didn’t consider pre-processing at the time.

Pulling together a quick-and-dirty parser, I wrote the following. It's worth noting that it still doesn't really consider what the tags are (e.g. whether they're inline or block), but it allows the caller to specify a string ($tagContentSeparator), typically some whitespace, that is inserted between the content of the stripped-away tags:

<?php

class HTMLToPlainText
{
    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $tagContentSeparator = " "): string
    {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // string within tags that we've found
        $foundContentStrings = [];

        // buffer for current content being read
        $currentContentString = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

        // method to add a non-empty string to $foundContentStrings and reset $currentContentString
        $commitCurrentContentString = function() use (&$currentContentString, &$foundContentStrings) {
            if(strlen($currentContentString) > 0) {
                $foundContentStrings[] = trim($currentContentString);
                $currentContentString = "";
            }
        };

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $parserState = self::STATE_READING_TAG_NAME;
                $commitCurrentContentString();
                continue;
            }

            if($ch === '>') {
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $currentContentString .= $ch;
                continue;
            }
        }

        $commitCurrentContentString();

        return implode($tagContentSeparator, $foundContentStrings);
    }
}

Note that the to/from UTF-8 ↔ UTF-32 conversion isn't really necessary; I initially did the conversion as I was worried about splitting a multibyte character, but this isn't possible given how the function reads the input string.

Now if we take the following HTML snippet:

<div>the quick brown fox</div><div>jumped over the moon</div>

… rendered in a browser, we get:

… with strip_tags() we get:

the quick brown foxjumped over the moon

… and with HTMLToPlainText::convert() (passing in “\n” for $tagContentSeparator), we get:

the quick brown fox
jumped over the moon

The latter results in text that is semantically correct, as words in different blocks aren't incorrectly joined. Presentationally, we also get a more correct conversion, but the method isn't really doing anything fancy here; this is due to the caller knowing a bit about the HTML snippet, how a browser would render it, and passing in "\n" for $tagContentSeparator.

Limitations / future work

The improvement here is that textual content is pretty well preserved when doing a conversion, i.e. we don't have to worry about textual elements being incorrectly concatenated. However, what I wrote is still lacking in 2 key areas:

  • Generally, in terms of presentation, an arbitrary bit of HTML won't map to what a user sees in a browser. To a certain degree this is an intractable problem, as presentation is based on browser defaults, CSS styles, etc. Also, there are things that simply don't have a standard representation in plain-text (e.g. bold text, list items, etc.). However, there are cases where sensible defaults would work well, e.g. stripping away <span> tags but putting a newline between <p> tags.
  • Whitespace is trimmed from content within tags. This may or may not matter depending on application. In my case, I cared about the words and additional whitespace just added bloat even if it was more accurate to what was in the HTML.


Lessons from building microservices | part 1: the industry’s influence

I started learning about microservices and seeing momentum behind the architectural pattern around 2015. A number of developers I worked with at Grovo were enthusiastic about the idea as a solution to a number of scaling issues we were facing. The more I looked around, the more I saw developers across the industry touting microservices as the new, modern architectural pattern. For some companies, this was a natural evolution in their systems architecture, but for many companies, especially smaller startups, the push towards microservices seemed driven by the idea that, despite the cost and difficulties, this was going to be the status quo for leading tech companies and that was the direction they should be headed in. I was on the "smaller startup" side of the industry and, from my vantage point, this perspective didn't seem to be based on any sort of real engineering analysis but more so on aligning with what was being talked about in the industry, adopting what larger companies (e.g. Netflix) were doing, and a belief that microservices were a silver bullet to scaling issues. Looking back, none of this is too surprising. Following in the footsteps of Big Tech, even when your business is operating at a different scale and your tech problems look very different, or jumping on the hype train for the latest technology (e.g. the "just-got-back-from-a-conference" effect), continues to be prevalent in the industry.

For VC-backed startups, it’s also worth looking at the state of the market around 2014-2015. Venture capital funding in tech shot up along with the size of the deals. Closing a VC round and being pushed to grow was also not out of the ordinary. I’ll hand-wave a bit here, but I think this also led to larger engineering teams, a corresponding overconfidence in what could be tackled, and an underestimation of the difficulties of working within and maintaining a topologically complex architecture.

The lesson in all this is to look at and understand industry dynamics relative to your company and your role. Software isn’t made in a vacuum and the zeitgeist of the industry is a key factor in how engineering decisions are made and what solutions manifest.

Finally, this may all sound cynical and, in terms of the pattern itself, it sort of is. Working on microservices led me to view them as an optimization and not a general pattern for application development. However, building with microservices was also a powerful forcing function to really look at and tackle inefficiencies in infrastructure, deployments, and code structure. In addition, the company-level initiative to push towards microservices highlighted the dynamics by which decisions are made and driven, from industry buzz to company leaders to individual engineers. What I learned from all of this improved my technical work in areas beyond microservices and that’s what I hope to really highlight in this and future posts.


Deployments with git tags + npm publish

Git tags and npm

Deployment workflows can vary a lot, but what I’ve tended to find ideal is to tag releases on GitHub (or whatever platform, as most have some mechanism to handle releases), the tag itself being the version number of whatever is being deployed, and having a deployment pipeline orchestrate and perform whatever steps are necessary to deploy the application, service, library, etc. This flow is well supported with git, well supported on platforms like GitHub, and is dead simple for developers to pick up and work with (in GitHub, this means filling out a form and hitting “Publish release”).

npm doesn't play nicely with this workflow. With npm, version numbers aren't tied to git tags, or any external mechanism, but instead to the value defined in the project's package.json file. So, trying to publish a package via tagging requires some additional steps. The typical solutions seem to be:

  • Update the version in package.json first, then create the tag
  • Use some workflow that includes npm version patch to have npm handle updating package.json and creating the git tag
  • Use an additional tool (e.g. standard-version), that tries to abstract away management of version numbers from both package.json and git tag

None of these options are great; versioning responsibility and authority is pulled away from git and, in the process, additional workflow complexity (and, in the latter case, additional dependencies) is introduced.

Version 0.0.0

In order to publish with npm, keep versioning authority with git, and maintain a simple workflow that doesn’t include additional steps or dependencies, the following has been working well in my projects:

  • In package.json, set the version number to “0.0.0”; this value is never changed within any git branch and, conceptually, it can be viewed as representing the “dev version” of the library. package.json only has a “non-dev version” for code published to our package repository.
  • In the deployment pipeline (triggered by tagging a release), update package.json with the version from the git tag.

    Most CI systems have some way of getting the tag being processed and working with it. For example, in CircleCI, working with tags formatted like vMAJOR.MINOR.PATCH, we can reference the tag, remove the “v” prefix, and set the version in package.json using npm version as follows:

    npm --no-git-tag-version version ${CIRCLE_TAG:1}

    Note that this update to package.json is only done within the checked-out copy of the code used in the pipeline. The change is never committed to the repo nor pushed upstream.
  • Finally, within the deployment pipeline, publish as usual via npm publish
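
As a small aside on the version-stamping step above: the npm one-liner is all that's needed, but for illustration here's a hypothetical standalone TypeScript/Node version of the same step (the tag-format check is an extra guard I've added, not something the one-liner does), assuming the tag is exposed via CIRCLE_TAG as in CircleCI:

import { execFileSync } from "child_process";

// Assumes the git tag is available via CIRCLE_TAG and formatted as vMAJOR.MINOR.PATCH
const tag = process.env.CIRCLE_TAG ?? "";
if (!/^v\d+\.\d+\.\d+$/.test(tag)) {
    throw new Error(`unexpected tag format: ${tag}`);
}

// Same effect as the one-liner above: strip the "v" and stamp the version
// into package.json without creating another git tag
execFileSync("npm", ["--no-git-tag-version", "version", tag.slice(1)], { stdio: "inherit" });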

Limitations

I haven’t run across any major limitations with this workflow. There is some loss of information captured in the git repository, as the version in package.json is fixed at 0.0.0, but I’ve yet to come across that being an issue. I could potentially see issues if you want to allow developers to do deployments locally via npm publish but, in general, I view local deployments as an anti-pattern when done for anything beyond toy projects.


Writing to the Bitfenix ICON display with Rust (part 2, writing text)

Bitmap Fonts

Picking up from being able to successfully write images to the Bitfenix ICON display, I started looking into how to render textual content. Despite their lack of versatility, bitmap fonts were a natural option: they're easy to work with, they're a good fit for fixed-resolution displays, and rendering is fast.

I pulled up the following bitmap font from an old project (I don’t remember the source used to create this):

A few points about this font:

  • Each character glyph is 16×16, with a total of 256 characters in the 256×256 bitmap
  • Characters are arranged/indexed left-to-right, top-to-bottom, and the index of a 16×16 block will correspond with an ASCII or UTF8 codepoint value (e.g. the character at index 33 = “!”, which also maps to UTF8 codepoint 33)
  • Despite space for 256 characters, there’s a very limited set of characters here, but there’s enough for simple US English strings
  • While each character is 16×16 pixels, this is not a monospaced font; there is accompanying width data for each character
  • The glyphs are simply black and white (i.e. there's no antialiasing on the character glyphs, so we can simply ignore black pixels and not worry about blending into the background)

Rendering Strings

We need to render characters from the bitmap font onto something. We could create a new image but, building upon what was done in part 1, I decided to render atop this background image. The composite of the background image + characters will be the image written to the ICON display.

To start, we'll load the background image just as we did in part 1, but we need to make it mutable, as we'll be writing character pixels directly to it:

// Background image needs to be 240x320 (24bpp, no alpha channel)
let mut background_image = reduce_image_to_16bit_color(&load_png_image("assets/1.png"));

Next, we’ll load the font PNG in the same manner (it doesn’t need to be mutable, as we’re not modifying the character pixels):

let font_image = reduce_image_to_16bit_color(&load_png_image("assets/fonts/font1.png"));

To handle text rendering a TextRenderer struct is declared with the rendering logic encapsulated in TextRenderer.render_string(), which loops through each character in the input string, looks up the location of the character in the bitmap font, and renders the 16×16 block of pixels for the character onto the background image. The x-location of where to render a character is incremented by the width of the previous character (found via lookup into TextRenderer::font_widths vector).

pub struct TextRenderer {
    font_widths: Vec<u8>,
}

impl TextRenderer {
    pub fn new() -> TextRenderer {
        TextRenderer { font_widths: build_font_width_vec() }
    }

    pub fn render_string(&self, txt: &str, x: u64, y: u64, fontimg: &[u8], outbuf: &mut [u8]) {
        // x position of where character should be rendered
        let mut cur_x = x;

        for ch in txt.chars() {
            // From codepoint, lookup x, y of character in font and width of character from self.font_widths
            let ch_idx = ch as u64;
            let ch_width = self.font_widths[(ch as u32 % 256) as usize];
            let ch_x = (ch_idx % 16) * 16;
            let ch_y = ((ch_idx as f32 / 16.0) as u64) * 16;

            // For each character, copy the 16x16 block of pixels into outbuf
            for fy in ch_y..ch_y+16 {
                for fx in ch_x..ch_x+16 {
                    let fidx: usize = ((fx + fy*256) * 2) as usize;
                    let fdx = fx - ch_x;
                    let fdy = fy - ch_y;
                    let outbuf_idx: usize = (((cur_x + fdx) + (y + fdy)*240) * 2) as usize;

                    // If the pixel from the font bitmap is not black, write it out to outbuf
                    if fontimg[fidx] != 0x00 && fontimg[fidx+1] != 0x00 {
                        outbuf[outbuf_idx] = fontimg[fidx];
                        outbuf[outbuf_idx + 1] = fontimg[fidx + 1];
                    }
                }
            }

            cur_x = cur_x + (ch_width as u64);
        }
    }
}

The TextRenderer::font_widths vector is built from the build_font_width_vec() method, which simply builds and returns a vector of hardcoded values for the character widths:

fn build_font_width_vec() -> Vec<u8> {
    let result = vec![
        7, 7, 7, 7, 7, 7, 7, 7, 7, 30, 0, 7, 7, 0, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 3, 5, 9, 7, 18, 8, 3, 5, 5, 7, 9, 4, 8, 3, 5, 8, 4, 7, 7, 8, 7, 7, 7, 7, 7, 3, 4, 6, 9, 6, 7, 9, 8, 8, 8, 8, 8, 8, 8, 8, 5, 7, 8, 7, 9, 9, 9, 8, 9, 8, 8, 9, 8, 9, 10, 9, 9, 8, 4, 6, 4, 8, 9, 5, 7, 7, 6, 7, 7, 6, 7, 7, 3, 5, 7, 3, 9, 7, 7, 7, 7, 6, 6, 6, 7, 7, 10, 7, 7, 6, 5, 3, 5, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 5, 5, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 6, 7, 7, 13, 3, 11, 10, 13, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    ];

    result
}

Writing to the ICON display

The final piece is simply creating a TextRenderer instance and calling render_string() with the necessary parameters. Here we’ll write out “Hello world!” to position 10, 20 on the display:

let tr = TextRenderer::new();
tr.render_string("Hello world!", 10, 20, &font_image, &mut background_image);
Bitfenix ICON image display with text

All of the code presented is up on the bitfenix-icon-sysstatus repo.

Next, I’m looking to play around with the systemstat library to print out something useful to the display.


Avoiding loops and routing failures with A*

The problem

With GraphPaper, generating paths with the A* algorithm, I ran into some interesting cases where the algorithm would fail to generate a path between the start and end points:

Failed path

Note that the path should go around, not directly across, the big rectangle in the center.

There’s a few things being masked here, as GraphPaper will always try to produce a result, even if it’s non-optimal. So we need to disable:

  • The mechanism that makes a path directly from start to end if there’s a failure in generating a path
  • Path optimization

With these disabled, and some annotations showing the order in which the path segments were generated, we can get a better idea of what’s happening:

Failed path, showing details

Here we see that the algorithm generates a path that loops around its adjacent object (1-12), the path moves away from the object (13-14), then moves right back towards the object (15-18); then we're out of accessible routing points, so we have a failure and we end the path by going directly to the endpoint. It's important to note that routing points are not grid points here; they're points designated around objects (the rectangles with the dots in the center) and anchors (the rectangles where paths start and end).

Why is this happening?

This might seem like a bug in the A* implementation but, as far as I can tell, the algorithm is performing as it should and the issue here comes down to the path cost computation done in the main loop. A* doesn’t actually define how to compute cost, but GraphPaper uses run-of-the-mill Euclidean distance as a metric.

In the A* main loop, the cost of going from our current point to a given point, n, is given by:

f(n) = g(n) + h(n)

g(n) is the total length of the path from the starting point to n (sum of the length of all pieces of the path). In practice, we keep track of the current path length, so we only need to compute and add the straight-line length from current to n:

g(n) = currentPathLength + computeLength(current, n)

h(n) is the straight-line length from n to the goal:

h(n) = computeLength(n, endpt)

The issues we're seeing surface as a result of f(n) being based solely on distance computations. Note that:

  • We may have a good point, n: it's close to our goal but requires a large jump from our current location; the large increase in g(n) leads to another, closer point being prioritized
  • We may have a bad point, n: it's further from our goal but only requires a small jump from our current location; h(n) increases a bit, but the relatively small increase in g(n) leads to this point being prioritized over a better one

The implication here is also that shorter jumps become preferable, even if we end up going in the opposite direction of the goal, as such jumps are safer (i.e. less likely to lead to a blocked/invalid path).

Smarter routing

To improve the routing, I wanted to adjust the cost computation such that moving to points that are in the direction of (and thus closer to) the goal is prioritized, so I introduced a new term, t(n), into the cost computation:

f(n) = g(n) + h(n) - t(n)

t(n) = (vecCurrentTo(endpt) · vecCurrentTo(n)) × computeLength(current, n)

To compute t(n), we first compute the dot product of 2 normalized vectors:

  • vecCurrentTo(endpt): from current to the endpoint, representing the ideal direction we should head in
  • vecCurrentTo(n): from current to n, representing the direction to n

The dot product ranges from [-1, 1]; tending to -1 when n is in the opposite direction of the ideal, and tending to 1 when n is in the same direction as the ideal.

We then scale the dot product result, as we can't just bias our overall cost, f(n), by a range this small; we scale by the straight-line length from current to n. Using this length as a scaling factor also serves to bias towards longer paths in good/positive directions.

In code, the g(n), h(n), and t(n) components are computed like this:

// g(n) = length/cost of _startPoint to _vp + _currentRouteLength
const currentToVisibleLength = (new Line(currentPoint, visiblePt)).getLength();
let gn = currentToVisibleLength + _currentRouteLength;

// h(n) = length/cost of _vp to _endPoint
let hn = (new Line(visiblePt, _endPoint)).getLength();

// t(n) =
// a. get the relationship between the 2 vectors (dot product)
// b. scale to give influence (scale by currentToVisibleLength)
// .. using currentToVisibleLength as scaling factor is influence towards longer paths in good directions vs shorter paths in bad directions
const vecToTargetIdeal = (new Vec2(_endPoint.getX() - currentPoint.getX(), _endPoint.getY() - currentPoint.getY())).normalize();
const vecVisibleToEndpt = (new Vec2(visiblePt.getX() - currentPoint.getX(), visiblePt.getY() - currentPoint.getY())).normalize();
const tn = vecToTargetIdeal.dot(vecVisibleToEndpt) * currentToVisibleLength;

With the updated path cost computation, we successfully get a path to the endpoint and, even with path optimization off, we get a much better path overall, with fewer unnecessary points in the path and no looping around objects:

Fixed path with smarter routing

It's also worth noting how this performs as the position of the object changes (again, path optimization is disabled and the mechanism that makes a path directly from start to end on routing failure is also disabled):

Without t(n)

With t(n)

In pretty much every case, applying the t(n) factor to the path computation produces a better path with no failures in routing to the endpoint.


A* path optimization

The A* algorithm, using a straight-line distance heuristic function, is great in terms of performance, but yields a number of cases where the paths produced are not optimal.

For example, here is a path created in ScratchGraph using GraphPaper (the underlying library used for creating the connectors):

A* without path optimization

I suspect the non-optimal result is even more pronounced here given that the path is determined based on specific routing points that exist around the objects (you can spot these as the inflection points in the path), instead of a uniform grid. That said, I don't want to use a uniform grid; while it's likely a non-issue with A*, there is also the cost of computing which routing points are accessible vs blocked, and that cost grows quickly as the possible number of routing points grows.

An approach that works well, and doesn't drastically increase the cost of path computation, is simplifying the generated path by checking if corresponding points in the path are visible to each other. If so, we can remove the intermediate points and simply connect those 2 points together. Here's a snippet from the GraphPaper codebase:

/**
 *
 * @param {Point[]} _pointsInRoute
 * @param {Function} _arePointsVisibleToEachOther
 */
optimize: function(_pointsInRoute, _arePointsVisibleToEachOther) {
    let start = 0;
    let end = _pointsInRoute.length - 1;

    while(true) {
        if((end-start) <= 1) {
            start++;
            end = _pointsInRoute.length - 1;

            if(start >= _pointsInRoute.length-2) {
                break;
            }
        }

        if(_arePointsVisibleToEachOther(_pointsInRoute[start], _pointsInRoute[end])) {
            _pointsInRoute.splice(start + 1, (end-start) - 1);
            end = _pointsInRoute.length - 1;
        } else {
            end--;
        }
    }
}

The function works as follows:

  • Begin with the points start (first point in the path) and end (last point in the path).
  • Relative to start, check if the other points in the path are visible to it (in the code above, we iterate backwards from the endpoint). If we find that start is visible to another point, we eliminate the intermediate points from the path.
  • Once we get to end being the point after start, update start to the next point in the path and reset end to whatever the last point in the path is.
  • Repeat the latter 2 steps until we’ve checked all corresponding points in the path (start is the point directly preceding end).

Optimizing the path shown above with this method yields an optimal path:

A* with path optimization

In terms of performance, the cost is based on the number of points in the path and the cost of whatever computation the _arePointsVisibleToEachOther() function does. For my use-cases, paths have relatively few points and _arePointsVisibleToEachOther() consists of a number of fairly fast line-line intersection checks, so the method is fairly cheap and there’s no significant decrease in performance when generating paths.


Pushing computation to the front: client-side compression

Client → Server Compression

Content from a web server being automatically gzipped (via apache, nginx, etc.) and transferred to the browser isn’t anything new, but there’s really nothing in the way of compression when going in the other direction (i.e. transferring content from the client to the server). This is not too surprising, as most client payloads are small bits of textual content and/or binary content that is already well compressed (e.g. JPEG images), where there’s little gain from compression and you’re likely to just waste CPU cycles doing it. That said, when your frontend client is a space for content creation, you’re potentially going to run into cases where you’re sending a lot of uncompressed data to the server.

Use-case: ScratchGraph Export

ScratchGraph has an export feature that essentially renders the page (minus UI components) as a string of HTML. This string is packaged along with some metadata and sent to the server, which sends it to a service running puppeteer that renders the HTML string to either an image or a PDF. The overall process looks something like this:

ScratchGraph Export Flow

The HTML string being sent to the server is relatively large, a couple of MBs, due to:

  • The CSS styles (particularly due to external resources being pulled in and inlined as base64 URLs)
  • The user simply having lots of content

To be fair, it’s usually the former rather than the latter, and optimizing to avoid the inlining of resources (the intent of which was to try and do exports entirely in the browser) would have a greater impact in reducing the amount of data being transferred to the server. However, for the purposes of this blog post (and also because it leads to a more complex discussion on how the application architecture can/should evolve and what this feature looks like in the future), we’re going to sidestep that discussion and focus on what benefits data compression may offer.

Compression with pako

I was more than ready to implement a compression algorithm, but was happy to discover pako, which does zlib compression. Compressing (i.e. deflating) with pako is very simple: below, I encode the HTML string to UTF-8 via TextEncoder.encode() (because I want UTF-8; this isn't a requirement of pako), which returns a Uint8Array, then use that as the input for pako.deflate(), which also returns a Uint8Array.

const staticHtmlUtf8Arr = (new TextEncoder()).encode(html);
const compressedStaticHtmlUtf8Arr = pako.deflate(staticHtmlUtf8Arr);

Here’s what that looks like in practice, exporting the diagram shown above:

ScratchGraph Export, with pako compression, results

That’s fairly significant, as the data size has been reduced by 1,237,266 bytes (42.77%)!

The final bit for the frontend is sending this to the server. I use a FormData object for the XHR call and, for the compressed data, I append it as a Blob:

formData.append(
    "compressedStaticHtml",
    new Blob([compressedStaticHtmlUtf8Arr], {type: 'application/zlib'}),
    "compressedStaticHtml"
);

Handling the compressed data server-side with PHP

PHP supports zlib compression/decompression via the zlib module. The only additional logic needed server-side is calling gzuncompress() to decompress the compressed data.

$staticHtml = gzuncompress(file_get_contents($compressedStaticHtmlFile->getFilePath()));

Note that $compressedStaticHtmlFile is an object representing a file pulled from the request (FormData will append a Blob in the same manner as a file, so server-side you're dealing with the data as a file). The File.getFilePath() method here simply returns the path for the uploaded file.

Limitations

Compressing and decompressing data will cost CPU cycles and, for zlib and most algorithms, this will scale with the size of the data. So considerations around what the client-side system looks like and the size of the data need to be taken into account. In addition, compression within a browser’s main thread can lead to UI events, reflow, and repaint being blocked (i.e. the page becomes unresponsive). If the compression time is significant, performing it within a web worker instead would be a better path.
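
As a sketch of what the web-worker route could look like for the export flow above (this isn't something that's currently done; the file name, message shape, and worker-loading incantation are illustrative and depend on your bundler), the deflate call moves into a worker and the main thread just hands over the HTML string and gets back the compressed bytes:

// compress-worker.ts (assumes a module worker compiled with the "webworker" TS lib)
import { deflate } from "pako";

onmessage = (e: MessageEvent<string>) => {
    // Encode to UTF-8 and zlib-deflate off the main thread
    const utf8 = new TextEncoder().encode(e.data);
    postMessage(deflate(utf8));
};

On the main thread, the result is appended to the FormData just as before:

// Main thread: offload compression, then append the result to the FormData
const worker = new Worker(new URL("./compress-worker.ts", import.meta.url), { type: "module" });

worker.onmessage = (e: MessageEvent<Uint8Array>) => {
    formData.append(
        "compressedStaticHtml",
        new Blob([e.data], { type: "application/zlib" }),
        "compressedStaticHtml"
    );
    // ... send formData via the XHR call as before
};

worker.postMessage(html);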