Archive for the ‘Software Engineering’ Category

Embedded in culture

Something I read a long time ago that comes to mind when I think about engineering team culture is this interview around design at Apple. Specifically, the myth around Apple having the best designers:

I think the biggest misconception is this belief that the reason Apple products turn out to be designed better, and have a better user experience, or are sexier, or whatever … is that they have the best design team in the world, or the best process in the world…
It’s actually the engineering culture, and the way the organization is structured to appreciate and support design. Everybody there is thinking about UX and design, not just the designers. And that’s what makes everything about the product so much better … much more than any individual designer or design team.

[Aside: looking back a bit, it’s worth noting that, more so in 2014 than now, Apple’s products were viewed as being far superior in design to competitors. Many people I worked with pointed to how something looked or functioned on a Mac or iPhone as the ideal, and there was a desire to replicate that aesthetic and experience. Now, in 2024, Apple is still known for good design, but I don’t think they have the same monopoly that they did a decade ago.]

What resonates here is how the concern (for a better user experience, a better product, etc.) needs to be embedded within the culture of the team; it’s not something that can be strictly delegated to a certain individual, role, or team (and then “thrown over the wall”).

Of course, this is maybe not too surprising; within software engineering itself there’s been a need/desire/push to diffuse concerns around operations, security, testing, etc. (what actually got me thinking about this was the interplay between application engineers and security engineers, where application engineers can’t simply “hand off” security concerns).

High-performing teams vs. not-invented-here syndrome

A few months ago, being particularly frustrated by yet-another-bug and yet-another-limitation of a library used in one of my team’s systems, I remembered a story about the Excel dev team and dug up In Defense of Not-Invented-Here Syndrome, which I read years ago. I didn’t think much of the essay when I first read it but now, having been in the industry for a while, I have a greater appreciation for it.

NIH syndrome is generally looked at in a negative light and for good reason; companies and teams that are too insular and reject ideas or technologies from the outside can find themselves behind the curve. However, there’s a spectrum here and, at the opposite end, heedless adoption of things from the outside can put companies and teams in an equally precarious position.

So, back to the story of the Excel development team:

“The Excel development team will never accept it,” he said. “You know their motto? ‘Find the dependencies — and eliminate them.’ They’ll never go for something with so many dependencies.”

Dealing with dependencies is a reality of software engineering, perhaps even more so now than in the past, and for good reason: there’s a world of functionality that can simply be plugged into a project, saving significant amounts of time and energy. However, there are a number of downsides as well:

  • Your team doesn’t control the evolution or lifecycle of that dependency
  • Your team doesn’t control the quality of that dependency
  • Your team doesn’t have knowledge of how that dependency does what it does

When something breaks or you hit a limitation, your team is suddenly spending a ton of time trying to debug an issue that originates from a codebase it’s not familiar with and, once there’s an understanding of the issue, coding some ugly hack to get the dependency to behave in a more reasonable way. So, when a team has the resources, it’s not unreasonable to target elimination of dependencies for:

  • A healthier codebase
  • A codebase that is more easily understood and can be reasoned about

These 2 points invariably lead to a higher-performing team. In the case of the Excel dev team:

The Excel team’s ruggedly independent mentality also meant that they always shipped on time, their code was of uniformly high quality, and they had a compiler which, back in the 1980s, generated pcode and could therefore run unmodified on Macintosh’s 68000 chip as well as Intel PCs.

Finally, Joel’s recommendation on what shouldn’t be a dependency and should instead be done in-house:

Pick your core business competencies and goals, and do those in house.

This makes sense and resonates with me. Though there is a subtle requirement here that I’ve seen overlooked: engineering departments and teams need to distill business competencies and goals (hopefully, these exist and are sensible) into technical competencies and goals. Without that distillation, engineering is rudderless; teams pull in dependencies for things that should be built internally, while others sink time into building things from scratch that will never get the business resources to be properly developed or maintained.

Versioning datasets

Contracts

An issue I’ve kept coming across when working on data systems that involve producing and consuming a number of different datasets is the lack of a contract between producers and consumers. Versioning provides a solution to this problem when dealing with software and, with a decent versioning scheme, works well for datasets too, allowing for the creation of versioned snapshots.

Data concerns

It’s worth looking at what the problem is here and why this even matters. Imagine having some dataset, let’s say for drugs, which is periodically updated. We could reasonably say that we only care about the latest version of the drugs dataset, so every time we ingest new data, we simply overwrite the existing dataset.

For a rudimentary system, this is fine, but if we’re thinking in terms of a larger data system with this dataset being consumed by downstream processes, teams, and/or customers, there are a few concerns our system can’t elegantly deal with:

  • Corruption: the ingested data is corrupt or a bug in the ETL process results in a corrupted dataset
  • Consistent reads: not all parts (e.g. tables) of our dataset may be ready for reads by consumers at a given time (loading data to S3 is a good example here; for a non-trivial dataset spread across multiple objects and partitions, the dataset as a whole can’t be written/updated atomically)
  • Breaking changes: a breaking change to downstream systems (e.g. dropping a column) may need to be rolled out
  • Reproducibility: downstream/derived datasets may need to be re-created based upon what the dataset was at some point in the past (i.e. using the latest dataset will not give the same results)
  • Traceability: we may need to validate/understand how a derived data element was generated, requiring an accurate snapshot of all input data when the derived dataset was generated

Versioning isn’t the only solution to these concerns. You could argue that frequent backups, some sort of locking mechanism, coordination between teams, and/or very granular levels of observability can address each to varying degrees, but I think versioning (a) is simple and (b) requires the least effort.

Versioning scheme

Let’s look at a versioning scheme that would address the 5 concerns I raised above. For this, I’m going to borrow from both semantic versioning and calendar versioning. Combining the 2, and adding a bit of additional metadata, we can construct a scheme like the following: major.minor.patch.YYYY0M0D.rev (e.g. 2.2.11.20221203.1).

Breaking this down:

  • The semantic versioning components (major, minor, patch) can effectively tell us about the spatial nature of the dataset: the schema.
  • The calendar versioning components (YYYY0M0D) can effectively tell us about the temporal nature of the dataset (when it was ingested, generated, etc.). Note that calendar versioning is a lot fuzzier as a standard, as there’s a lot of variance in how dates are represented; YYYY0M0D seems like a good choice as it’s easily parsable by consumers.
  • The final component (rev) is the revision number for the given date and is needed for datasets that can be generated/refreshed multiple times in a day. I think of this as an incrementing integer, but a time component (hours, minutes, seconds) is another option; either can work, there are just tradeoffs in implementation and consumer expectations.
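To make this concrete, here’s a minimal sketch in Python of how a version string under this scheme could be built, parsed, and compared (the DatasetVersion name and code are just mine for illustration, not part of the scheme itself):

from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class DatasetVersion:
    # Field order matters: order=True compares versions as the tuple
    # (major, minor, patch, date, rev), which matches the scheme's precedence.
    major: int
    minor: int
    patch: int
    date: str  # YYYY0M0D, e.g. "20221203"; zero-padded, so string comparison sorts correctly
    rev: int

    @classmethod
    def parse(cls, version: str) -> "DatasetVersion":
        major, minor, patch, date, rev = version.split(".")
        return cls(int(major), int(minor), int(patch), date, int(rev))

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}.{self.date}.{self.rev}"


# The example version used later in this post
v = DatasetVersion.parse("2.2.11.20221203.1")
assert str(v) == "2.2.11.20221203.1"
assert v > DatasetVersion.parse("2.2.11.20221201.2")  # a newer date outranks a higher rev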

Finding a version

Going back to our example, our data flow now looks something like this: instead of each ingest overwriting the dataset in place, every ETL run writes out a new, versioned snapshot of the dataset (e.g. under a versioned path rather than a single “latest” location), and consumers read from whichever snapshot they resolve to.

Note that before, our consumers knew exactly where to look for the dataset (s3://bucket/drugs-data/latest), or more specifically, the latest version of the dataset; this is no longer the case. Consumers will need to figure out what version of the dataset they want. This could be trivial (e.g. consumers just want to pin to a specific version) but the more interesting and perhaps more common case, especially with automated systems, is getting the latest version. Unpacking “latest” is important here: consumers want the latest data, but not if it carries with it a breaking schema change (i.e. consumers want to pin to the major version component, with the others being flexible). Thinking in terms of npm-esque ranges with the caret operator, a consumer could specify a version like ^2.2.11.20221203.1, indicating their system is able to handle, and should pull in, any newer, non-breaking updates in either schema or data.

So consumers can indicate what they want, but how does a system actually go about finding a certain version? I think the elegant solution here is having some sort of metadata for the dataset that can tell consumers what versions of the dataset are available and where to find them. Creating or updating these metadata entries can simply be another artifact of the ETL process, and they can be stored alongside the dataset (in a manifest file, a table, etc.). Unfortunately, this does involve a small lift and a bit of additional complexity for consumers, as they’d have to read/parse the metadata record.
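As a rough sketch of what that resolution could look like, reusing the DatasetVersion class from the sketch above (the manifest format and function below are hypothetical, not a prescribed implementation):

import json


def resolve_latest(manifest_path: str, pinned_major: int) -> dict:
    """Return the newest manifest entry whose major version matches the pin,
    i.e. a caret-style constraint like ^2.2.11.20221203.1."""
    # Hypothetical manifest: a JSON list of {"version": ..., "location": ...} entries,
    # written as another artifact of the ETL process.
    with open(manifest_path) as f:
        entries = json.load(f)

    candidates = [
        e for e in entries
        if DatasetVersion.parse(e["version"]).major == pinned_major
    ]
    if not candidates:
        raise ValueError(f"no available versions with major version {pinned_major}")

    # "Latest" = greatest (major, minor, patch, date, rev)
    return max(candidates, key=lambda e: DatasetVersion.parse(e["version"]))


# e.g. the latest, non-breaking version for a consumer pinned to major version 2
entry = resolve_latest("drugs-data-manifest.json", pinned_major=2)
print(entry["version"], entry["location"])

A consumer pinned to major version 2 would then read the dataset from whatever location the returned entry points to.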

Dataset-level vs. Data-level versioning

In researching other ways in which versioning is done, change data capture (CDC) methods usually come up. While CDC methods are important and powerful, they typically operate at the row level, not the dataset level, and it’s worth recognizing the distinction, especially from a data systems perspective, as CDC methods come with very different architectural and implementation concerns.

For example, in this blog post from lakeFS, approach #1 references full duplication, which is dataset versioning, but then approach #2 references valid_from and valid_to fields, which is a CDC method and carries with it the requirement to write queries that respect those fields.
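To make the distinction concrete, here’s a toy illustration (my own, not from either post) of the two approaches:

# Dataset-level versioning: each version is a complete, immutable snapshot.
snapshots = {
    "1.0.0.20221201.1": [{"drug": "aspirin", "dose_mg": 100}],
    "1.0.0.20221203.1": [{"drug": "aspirin", "dose_mg": 81}],
}

# Row-level (CDC-style) versioning: a single table with validity ranges per row.
# Every consumer query now has to filter on valid_from/valid_to to reconstruct
# what the dataset looked like at a point in time.
rows = [
    {"drug": "aspirin", "dose_mg": 100, "valid_from": "2022-12-01", "valid_to": "2022-12-03"},
    {"drug": "aspirin", "dose_mg": 81, "valid_from": "2022-12-03", "valid_to": None},
]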

Avoiding full duplication

The scheme I’ve laid out somewhat implies a duplication of records for every version of a dataset. I’ve seen a number of articles bring this up as a concern, which can very well be true in a number of cases, but I’m skeptical of this being a priority concern for most businesses, given the low cost of storage. In any case, I think storage-layer concerns may impact how you reference versions (more generally, how you read/write metadata), but shouldn’t necessarily dictate the versioning scheme.

From what I’ve read, most systems that try to optimize for storage do so via a git-style model. This is what’s done by cloud service providers like lakeFS and tools like git LFS, ArtiV, and DVC.

Alternatives

I haven’t come across much in terms of alternatives but, in addition to a semantic identifier, this DZone article also mentions data versions potentially containing information about the status of the data (e.g. “incomplete”) or information about what’s changed (e.g. “normalized”). These are interesting ideas, but not something I’ve seen a need for in the version identifier. That said, what I’ve presented is not intended to be some sort of silver bullet; I’m sure different engineers face different concerns, and different versioning schemes may be more appropriate.

In the end, I would simply encourage engineers to consider some form of versioning around their datasets, especially in larger data systems. It’s a relatively simple tool that can elegantly address a number of important concerns.

Publishing packages with npm and CircleCI

A common workflow

In recent years, I’ve pushed more and more for common, automated, deployment processes. In practice, this has usually meant:

  • Code is managed with Git, and tags are used for releases
  • Tags (and hence releases) are created via GitHub
  • Creating a tag executes everything in the CI pipeline + a few more tasks for the deployments

The result is that all deployments go through the same process (no deploy scripts run on personal machines), in the same environment (the CI container). It eliminates discrepancies in how things are deployed, avoids workflow differences and failures due to environment variance, and flattens the learning curve (developers only need to learn about Git tags).

Here I’ll present how I’ve been approaching this when it comes to publishing npm packages, with deployment tasks handled via CircleCI. The flow will look something like this: a release tag is created via GitHub, which triggers the CircleCI pipeline; the pipeline builds and tests the package, authenticates with the npm registry, sets the package version based on the tag, and publishes the package.

Setting up CircleCI

First things first, we need the CircleCI pipeline to trigger when a tag is created. At the bottom of your circle.yml file, add a deployment section with a tag filter:

version: 2
jobs:
  build:
    docker:
      - image: circleci/node:10.0.0
    working_directory: ~/repo
    steps:
      - checkout
      #
      # Other stuff (run npm install, execute tests, etc.)
      # ...
deployment:
  trigger_tag:
    tag: /.*/

Authenticating with the npm registry

Create an npm token and expose it as an environment variable in CircleCI (in this case, I’ve named it NPM_TOKEN). Then, add a step to authenticate with the npm registry in your circle.yml:

version: 2
jobs:
  build:
    docker:
      - image: circleci/node:10.0.0
    working_directory: ~/repo
    steps:
      - checkout
      #
      # Other stuff (run npm install, execute tests, etc.)
      # ...
      - run:
          name: Authenticate with registry
          command: echo "//registry.npmjs.org/:_authToken=$NPM_TOKEN" > ~/repo/.npmrc
deployment:
  trigger_tag:
    tag: /.*/

Versioning

Things get a little weird when it comes to versioning. npm expects a version declared in the project’s package.json file. However, this goes against managing releases (and thus versioning) with Git tags. I see two potential solutions here:

  • Manage versions with both Git and npm, with the npm package version mirroring the tag. This would mean updating the version in package.json first, then creating the Git tag.
  • Only update/set the version in package.json within the pipeline, and set it to the version indicated by the Git tag.

I like the latter solution, as forgetting to update the version number in package.json is an annoyance that pops up frequently for me. Also, dealing with version numbers in 2 places, across 2 systems, is an unnecessary bit of complexity and cognitive load. There is one oddity, however: you still need a version number in package.json when developing and using the npm tool, as npm requires it and will complain if it’s not there or is in an invalid format. I tend to set it to “0.0.0”, indicating a development version; e.g.

{ "name": "paper-plane", "version": "0.0.0", // ... }

In the pipeline, we’ll reference the CIRCLE_TAG environment variable to get the Git tag and use it to correctly set the version in package.json. Based on semantic versioning conventions, we expect the tag to have the format “vX.Y.Z”, so we’ll need to strip away the “v” and then use “X.Y.Z” for the version in package.json. We can use npm version to set the version number:

npm --no-git-tag-version version ${CIRCLE_TAG:1}

Note the --no-git-tag-version flag. This is necessary as the default behavior of npm version is to create a version commit and tag in the Git repo.

Publishing

Publishing is simply done via npm publish. Pulling together the CIRCLE_TAG check, applying the version, and publishing into a deploy step, we get something like this:

version: 2
jobs:
  build:
    docker:
      - image: circleci/node:10.0.0
    working_directory: ~/repo
    steps:
      - checkout
      #
      # Other stuff (run npm install, execute tests, etc.)
      # ...
      - run:
          name: Authenticate with registry
          command: echo "//registry.npmjs.org/:_authToken=$NPM_TOKEN" > ~/repo/.npmrc
      - deploy:
          name: Updating version num and publishing
          command: |
            if [[ "${CIRCLE_TAG}" =~ v[0-9]+(\.[0-9]+)* ]]; then
              npm --no-git-tag-version version ${CIRCLE_TAG:1}
              npm publish
            fi
deployment:
  trigger_tag:
    tag: /.*/

… and we’re done 🚀!

For further reference, this circle.yml uses the steps presented above.

Null

One of my favorite videos is Null Island from Minute Earth. I frequently link to it when I get into a discussion about whether null is an acceptable value for a certain use case.

What I really like is the focus on null having a concrete definition, that is: “we don’t know”. When used in this way, we have a clear understanding of what null is, and the context in which it’s used (whether some field in a relational table, a JSON object, a value object, etc.) inherits this definition (i.e. “it’s either this value or we don’t know”), yielding something that’s fairly easy to reason about.

When nulls are ill-defined or have multiple definitions, complexity and confusion grow. Null is not:

  • Zero
  • Empty set
  • Empty string
  • Invalid value
  • A flag value for an error

Equating null to any of the above means that if you come across a null, you need to dig deeper into your code or database to figure out what that null actually means.
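As a small, contrived example (mine, not from the video), keeping null pinned to “we don’t know” looks something like this:

from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    name: str
    # None means exactly one thing: we don't know the date of birth.
    # It is not "no date", not an invalid date, and not an error flag.
    date_of_birth: Optional[str] = None


p = Patient(name="Alice")
if p.date_of_birth is None:
    # The only question a null should raise: do we go find the value, or proceed without it?
    print(f"Date of birth for {p.name} is unknown")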

The flip side of this is avoiding nulls altogether, and there are really 2 cases here:

  • There is no need for null (i.e. we do know what the value is, in every use case)
  • Architect the system such that a null isn’t surfaced

In the first case, null doesn’t fit the use case, so there’s no need for it. When possible, this is ideal, and you avoid the necessity for null checks.

For the second case, architecting this way always seems to involve adding more complexity, to the point where it’s questionable if there’s a net benefit.

The road never built

On reliability, Robert Glass notes the following in Frequently Forgotten Fundamental Facts about Software Engineering:

Roughly 35 percent of software defects emerge from missing logic paths, and another 40 percent are from the execution of a unique combination of logic paths. They will not be caught by 100-percent coverage (100-percent coverage can, therefore, potentially detect only about 25 percent of the errors!).

John Cook dubs these “sins of omission”:

If these figures are correct, three out of four software bugs are sins of omission, errors due to things left undone. These are bugs due to contingencies the developers did not think to handle.

I can’t validate the statistics, but they do ring true to my experiences as well; simply forgetting to handle an edge case, or even a not-so-edge case, is something I’ve done more often than I’d like. I’ve introduced my fair share of “true bugs”, code paths that fail to do what’s intended, but with far less frequency.

The statistics also hint at the limitations of unit testing, as you can’t test a logic path that doesn’t exist. While you can push for (and maybe even attain) 100% code coverage, it’s a fuzzy metric and by no means guarantees error-free functionality.

Code Coverage