Posts Tagged ‘semantic-versioning’

Versioning datasets

Contracts

An issue I’ve kept coming across when working on data systems that involve producing and consuming a number of different datasets is the lack of a contract between producers and consumers. Versioning solves this problem for software and, with a decent versioning scheme, works well for datasets too, allowing for the creation of versioned snapshots.

Data concerns

It’s worth looking at what the problem is here and why this even matters. Imagine having some dataset, let’s say for drugs, which is periodically updated. We could reasonably say that we only care about the latest version of the drugs dataset, so every time we ingest new data, we simply overwrite the existing dataset.

For a rudimentary system, this is fine, but if we’re thinking in terms of a larger data system with this dataset being consumed by downstream processes, teams, and/or customers, there are a few concerns our system can’t elegantly deal with:

  • Corruption: the ingested data is corrupt or a bug in the ETL process results in a corrupted dataset
  • Consistent reads: not all parts (e.g. tables) of our dataset may be ready for reads by consumers at a given time (loading data to S3 is a good example here; for a non-trivial dataset spread across multiple objects and partitions, the dataset as a whole can’t be written/updated atomically)
  • Breaking changes: a breaking change to downstream systems (e.g. dropping a column) may need to be rolled out
  • Reproducibility: downstream/derived datasets may need to be re-created based upon what the dataset was at some point in the past (i.e. using the latest dataset will not give the same results)
  • Traceability: we may need to validate/understand how a derived data element was generated, requiring an accurate snapshot of all input data when the derived dataset was generated

Versioning isn’t the only solution to these concerns. You could argue that frequent backups, some sort of locking mechanism, coordination between teams, and/or very granular levels of observability can address each to varying degrees, but I think versioning (a) is simple and (b) requires the least effort.

Versioning scheme

Let’s look at a versioning scheme that would address the five concerns I raised above. For this, I’m going to borrow from both semantic versioning and calendar versioning. Combining the two, and adding a bit of additional metadata, we can construct a scheme like the following:

major.minor.patch.YYYY0M0D.rev (e.g. 2.2.11.20221203.1)

Breaking this down:

  • The semantic versioning components (major, minor, patch) can effectively tell us about the spatial nature of the dataset: the schema.
  • The calendar versioning components (YYYY0M0D) can effectively tell us about the temporal nature of the dataset (when it was ingested, generated, etc.). Note that calendar versioning is a lot fuzzier as a standard, as there’s a lot of variance in how dates are represented; YYYY0M0D seems like a good choice as it’s easily parsable by consumers.
  • The final component (rev) is the revision number for the given date and is needed for datasets that can be generated/refreshed multiple times in a day. I think of this as an incrementing integer, but a time component (hours, minutes, seconds) is another option; either can work, there are just tradeoffs in implementation and consumer expectations.
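To make this concrete, here’s a minimal sketch (in TypeScript) of parsing and formatting such an identifier. The names here (DatasetVersion, parseVersion, etc.) are my own for illustration, not part of any standard:

// Sketch of the scheme above: major.minor.patch.YYYY0M0D.rev
interface DatasetVersion {
  major: number;  // breaking schema changes
  minor: number;  // backwards-compatible schema changes
  patch: number;  // schema fixes
  date: string;   // YYYY0M0D, e.g. "20221203"
  rev: number;    // revision within the given date
}

function parseVersion(s: string): DatasetVersion {
  const parts = s.split(".");
  if (parts.length !== 5) throw new Error(`invalid dataset version: ${s}`);
  const [major, minor, patch, date, rev] = parts;
  return {
    major: Number(major),
    minor: Number(minor),
    patch: Number(patch),
    date,
    rev: Number(rev),
  };
}

function formatVersion(v: DatasetVersion): string {
  return [v.major, v.minor, v.patch, v.date, v.rev].join(".");
}

// parseVersion("2.2.11.20221203.1")
//   => { major: 2, minor: 2, patch: 11, date: "20221203", rev: 1 }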

Finding a version

Going back to our example, our data flow now looks something like this:

Note that before, our consumers knew exactly where to look for the latest version of the dataset (s3://bucket/drugs-data/latest); this is no longer the case. Consumers will need to figure out what version of the dataset they want. This could be trivial (e.g. consumers just want to pin to a specific version), but the more interesting and perhaps more common case, especially with automated systems, is getting the latest version. Unpacking “latest” is important here: consumers want the latest data, but not if it carries a breaking schema change (i.e. consumers want to pin to the major version component, with the others being flexible). Thinking in terms of npm-esque ranges with the caret operator, a consumer could specify a version like ^2.2.11.20221203.1, indicating their system is able to handle, and should pull in, any newer, non-breaking updates in either schema or data.

So consumers can indicate what they want, but how does a system actually go about finding a certain version? I think the elegant solution here is having some sort of metadata for the dataset that can tell consumers what versions of the dataset are available and where to find them. Creating or updating these metadata entries can simply be another artifact of the ETL process, and they can be stored alongside the dataset (in a manifest file, a table, etc.). Unfortunately, this does involve a small lift and a bit of additional complexity for consumers, as they’d have to read/parse the metadata record.
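As a sketch of what that could look like (the manifest shape here is hypothetical, just one way to model the metadata), a consumer pinned to a major version could resolve the latest compatible entry like this:

// Hypothetical manifest entry: a version identifier mapped to a location.
interface ManifestEntry {
  version: string;   // e.g. "2.2.11.20221203.1"
  location: string;  // e.g. "s3://bucket/drugs-data/2.2.11.20221203.1/"
}

// Compare dotted version strings component-wise as numbers; this works for
// major.minor.patch.YYYY0M0D.rev since every component is numeric.
function compareVersions(a: string, b: string): number {
  const as = a.split(".").map(Number);
  const bs = b.split(".").map(Number);
  for (let i = 0; i < as.length; i++) {
    if (as[i] !== bs[i]) return as[i] - bs[i];
  }
  return 0;
}

// Caret-style resolution: the newest version within the pinned major.
function resolveLatest(entries: ManifestEntry[], major: number): ManifestEntry | undefined {
  return entries
    .filter((e) => Number(e.version.split(".")[0]) === major)
    .sort((x, y) => compareVersions(y.version, x.version))[0];
}

// e.g. resolveLatest(manifest, 2) returns the newest entry within major version 2.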

Dataset-level vs. Data-level versioning

In researching other ways in which versioning is done, change data capture (CDC) methods usually come up. While CDC methods are important and powerful, they typically operate at the row level, not the dataset level, and it’s worth recognizing the distinction, especially from a data systems perspective, as CDC comes with very different architectural and implementation concerns.

For example, in this blog post from lakeFS, approach #1 references full duplication, which is dataset versioning, but then approach #2 references valid_from and valid_to fields, which is a CDC method and carries with it the requirement to write queries that respect those fields.
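To make the distinction concrete, here’s a hypothetical sketch (not lakeFS’s actual implementation) of what row-level validity fields impose on reads:

// CDC-style row with validity fields; every query has to filter on them.
interface DrugRow {
  id: string;
  name: string;
  validFrom: string;       // ISO date, e.g. "2022-01-01"
  validTo: string | null;  // null = still current
}

// Reading the dataset "as of" a date means respecting the validity window
// on every read; dataset-level versioning avoids this per-query burden.
function rowsAsOf(rows: DrugRow[], asOf: string): DrugRow[] {
  return rows.filter(
    (r) => r.validFrom <= asOf && (r.validTo === null || asOf < r.validTo)
  );
}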

Avoiding full duplication

The scheme I’ve laid out somewhat implies a duplication of records for every version of a dataset. I’ve seen a number of articles bring this up as a concern, which can very well be true in a number of cases, but I’m skeptical of this being a priority concern for most businesses, given the low cost of storage. In any case, I think storage-layer concerns may impact how you reference versions (more generally, how you read/write metadata), but shouldn’t necessarily dictate the versioning scheme.

From what I’ve read, most systems that try to optimize for storage do so via a git-style model. This is what’s done by cloud service providers like lakeFS and tools like git LFS, ArtiV, and DVC.

Alternatives

I haven’t come across much in terms of alternatives but, in addition to a semantic identifier, this DZone article also mentions data versions potentially containing information about the status of the data (e.g. “incomplete”) or information about what’s changed (e.g. “normalized”). These are interesting ideas but not really something I’ve seen a need for in the version identifier. That said, what I’ve presented is not intended to be some sort of silver bullet; I’m sure different engineers face different concerns, and different versioning schemes may be more appropriate.

In the end, I would simply encourage engineers to consider some form of versioning around their datasets, especially in larger data systems. It’s a relatively simple tool that can elegantly address a number of important concerns.

Deployments with git tags + npm publish

Git tags and npm

Deployment workflows can vary a lot, but what I’ve tended to find ideal is to tag releases on GitHub (or whatever platform, as most have some mechanism to handle releases), with the tag itself being the version number of whatever is being deployed, and having a deployment pipeline orchestrate and perform whatever steps are necessary to deploy the application, service, library, etc. This flow is well supported by git and platforms like GitHub, and is dead simple for developers to pick up and work with (in GitHub, this means filling out a form and hitting “Publish release”).

npm doesn’t play nicely with this workflow. With npm, version numbers aren’t tied to git tags, or any external mechanism, but instead to the value defined in the project’s package.json file. So, trying to publish a package via tagging requires some additional steps. The typical solutions seem to be:

  • Update the version in package.json first, then create the tag
  • Use some workflow that includes npm version patch to have npm handle updating package.json and creating the git tag
  • Use an additional tool (e.g. standard-version) that tries to abstract away management of version numbers from both package.json and git tags

None of these options are great: versioning responsibility and authority is pulled away from git and, in the process, additional workflow complexity is introduced (and, in the latter case, additional dependencies).

Version 0.0.0

In order to publish with npm, keep versioning authority with git, and maintain a simple workflow that doesn’t include additional steps or dependencies, the following has been working well in my projects:

  • In package.json, set the version number to “0.0.0”; this value is never changed within any git branch and, conceptually, it can be viewed as representing the “dev version” of the library. package.json only has a “non-dev version” for code published to our package repository.
  • In the deployment pipeline (triggered by tagging a release), update package.json with the version from the git tag.

    Most CI systems have some way of getting the tag being processed and working with it. For example, in CircleCI, working with tags formatted like vMAJOR.MINOR.PATCH, we can reference the tag, remove the “v” prefix, and set the version in package.json using npm version as follows:

    npm --no-git-tag-version version ${CIRCLE_TAG:1}

    Note that this update to package.json is only done within the checked-out copy of the code used in the pipeline. The change is never committed to the repo nor pushed upstream.
  • Finally, within the deployment pipeline, publish as usual via npm publish

Limitations

I haven’t run across any major limitations with this workflow. There is some loss of information captured in the git repository, as the version in package.json is fixed at 0.0.0, but I’ve yet to come across that being an issue. I could potentially see issues if you want to allow developers to do deployments locally via npm publish but, in general, I view local deployments as an anti-pattern when done for anything beyond toy projects.

Publishing packages with npm and CircleCI

A common workflow

In recent years, I’ve pushed more and more for common, automated, deployment processes. In practice, this has usually meant:

  • Code is managed with Git, and tags are used for releases
  • Tags (and hence releases) are created via GitHub
  • Creating a tag executes everything in the CI pipeline + a few more tasks for the deployments

The result is that all deployments go through the same process (no deploy scripts run on personal machines), in the same environment (the CI container). It eliminates discrepancies in how things are deployed, avoids workflow differences and failures due to environment variance, and flattens the learning curve (developers only need to learn about Git tags).

Here I’ll present how I’ve been approaching this when it comes to publishing npm packages, with deployment tasks handled via CircleCI. The flow will look something like this:

Setting up CircleCI

First things first, we need the CircleCI pipeline to trigger when a tag is created. At the bottom of your circle.yml file, add a “deployment” section with a tag filter:

version: 2
jobs:
  build:
    docker:
      - image: circleci/node:10.0.0
    working_directory: ~/repo
    steps:
      - checkout
      #
      # Other stuff (run npm install, execute tests, etc.)
      # ...
deployment:
  trigger_tag:
    tag: /.*/

Authenticating with the npm registry

Create an npm token and expose it as an environment variable in CircleCI (in this case, I’ve named it NPM_TOKEN). Then, add a step to authenticate with the npm registry in your circle.yml:

version: 2
jobs:
  build:
    docker:
      - image: circleci/node:10.0.0
    working_directory: ~/repo
    steps:
      - checkout
      #
      # Other stuff (run npm install, execute tests, etc.)
      # ...
      - run:
          name: Authenticate with registry
          command: echo "//registry.npmjs.org/:_authToken=$NPM_TOKEN" > ~/repo/.npmrc
deployment:
  trigger_tag:
    tag: /.*/

Versioning

Things get a little weird when it comes to versioning. npm expects a version declared in the project’s package.json file. However, this goes against managing releases (and thus versioning) with Git tags. I see two potential solutions here:

  • Manage versions with both Git and npm, with the npm package version mirroring the tag. This would mean updating the version in package.json first, then creating the Git tag.
  • Only update/set the version in package.json within the pipeline, and set it to the version indicated by the Git tag.

I like the latter solution, as forgetting to update the version number in package.json is an annoyance that pops up frequently for me. Also, dealing with version numbers in 2 places, across 2 systems, is an unnecessary bit of complexity and cognitive load. There is one oddity, however: you still need a version number in package.json when developing and using the npm tool, as npm requires it and will complain if it’s not there or is in an invalid format. I tend to set it to “0.0.0”, indicating a development version; e.g.

{ "name": "paper-plane", "version": "0.0.0", // ... }

In the pipeline, we’ll reference the CIRCLE_TAG environment variable to get the Git tag and use it to correctly set the version in package.json. Based on semantic versioning conventions, we expect the tag to have the format “vX.Y.Z”, so we’ll need to strip away the “v” and then use “X.Y.Z” for the version in package.json. We can use npm version to set the version number:

npm --no-git-tag-version version ${CIRCLE_TAG:1}

Note the --no-git-tag-version flag. This is necessary as the default behavior of npm version is to create a version commit and tag in the git repo.

Publishing

Publishing is simply done via npm publish. Pulling together the CIRCLE_TAG check, applying the version, and publishing into a deploy step, we get something like this:

version: 2
jobs:
  build:
    docker:
      - image: circleci/node:10.0.0
    working_directory: ~/repo
    steps:
      - checkout
      #
      # Other stuff (run npm install, execute tests, etc.)
      # ...
      - run:
          name: Authenticate with registry
          command: echo "//registry.npmjs.org/:_authToken=$NPM_TOKEN" > ~/repo/.npmrc
      - deploy:
          name: Updating version num and publishing
          command: |
            if [[ "${CIRCLE_TAG}" =~ v[0-9]+(\.[0-9]+)* ]]; then
              npm --no-git-tag-version version ${CIRCLE_TAG:1}
              npm publish
            fi
deployment:
  trigger_tag:
    tag: /.*/

… and we’re done 🚀!

For further reference, this circle.yml uses the steps presented above.