Archive for September, 2022

A look at S.M.A.R.T.

What is S.M.A.R.T.?

Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is a monitoring system built into computer hard drives and solid-state drives. SMART primarily consists of a set of attributes, with the disk tracking the current and worst values for each attribute. The attributes, their associated ID numbers, and the range for normalized values (1 to 253) are standardized across disks. Unfortunately, there’s no standardization around which attributes are implemented, the range for raw values, what raw values actually represent, or what the threshold is for a normalized value. Despite the lack of standardization around attribute values, SMART still provides some value and offers a significant degree of observability into the state and behavior of a disk.
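To make that a bit more concrete, here’s roughly the shape of a single attribute once it’s read off a disk (the values below are illustrative, not from any particular drive):

<?php
// Illustrative shape of one SMART attribute reading (values are made up;
// what the raw value actually represents varies by vendor and model).
$attribute = [
    'id'        => 5,                        // standardized attribute ID
    'name'      => 'Reallocated_Sector_Ct',  // human-readable name
    'value'     => 100,                      // current normalized value (1 to 253)
    'worst'     => 100,                      // worst normalized value observed
    'threshold' => 36,                       // vendor-defined failure threshold
    'raw'       => 0,                        // vendor-specific raw value
];

// The drive considers the attribute "failing" when the normalized value
// drops to or below the threshold.
$failing = $attribute['value'] <= $attribute['threshold'];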

SMART is something I’ve been aware of for a while, but it never seemed to matter all that much for my consumer desktop needs. There are a bunch of Windows apps to check the SMART attributes, and I vaguely recall having something installed to check for out-of-range values, but regularly monitoring or checking wasn’t something I took seriously; just having a backup was typically good enough, and I could deal with a drive failure if/when it happened. Recently, motivated by re-using an old motherboard and CPU for a NAS server, making use of a batch of old hard disks I had accumulated, and maximizing the storage capacity of the box, I decided to take another look at SMART and its efficacy in predicting drive failures. Ideally, I could have a system where a drive could be replaced before a failure results in data loss.

SMART attributes correlated to hard drive failure

Google

A Google study from 2007, “Failure Trends in a Large Disk Drive Population”, lists 4 SMART attributes that are highly correlated with failure:

  • SMART 5: Reallocated Sectors Count
  • SMART 187: Reported Uncorrectable Errors
  • SMART 197: Current Pending Sector Count
  • SMART 198: Uncorrectable Sector Count

It’s also worth noting that any change in SMART 187 was seen to be highly predictive of failure:

… after their first scan error (i.e. when a positive value for 187 is observed for the first time), drives are 39 times more likely to fail within 60 days than drives with no such errors.

Other attributes were examined, but results were not always consistent across models and manufacturers. The study also found that using SMART parameters to predict failure is severely limited, as a large number of failed drives showed no SMART errors whatsoever:

Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives.

… even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables.

… failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.

This is incredibly important: a correlation between SMART attributes and failure means little if it simply doesn’t hold for a significant percentage of drives. From skimming a few other papers, this limitation is also something I don’t always see being addressed or re-examined, which is disappointing.

Backblaze

Backblaze conducted an analysis on their drives in 2016 which also showed some interesting results. In addition to the 4 SMART attributes identified in the Google study, Backblaze also found another attribute highly correlated to failure:

  • SMART 188: Command Timeout

Similar to the Google study, Backblaze also found a significant number of failed drives reporting no SMART errors for these 5 attributes but, interestingly, it was a smaller percentage than in the Google study:

Failed drives with one or more of our five SMART stats greater than zero: 76.7%.

That means that 23.3% of failed drives showed no warning from the SMART stats we record.

Another study utilizing Backblaze’s data, “Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models”, was also interesting as it points to another attribute highly correlated with failure:

  • SMART 240: Head Flying Hours

We examine all SMART features for HDDs, and find out that head flying hours (HFH, SMART 240) is highly related to failures even if it is not correlated with other HDD features.

A more recent post from Backblaze looks at the paper “Interpretable predictive maintenance for hard drives”, which utilizes data published by Backblaze. I didn’t have a clear takeaway from the paper, but it did highlight a limitation of previous studies:

The analyses from Backblaze and Google were univariate and only considered correlation between failures and a single metric at a time. As such, they would not be able to detect any nonlinear interactions between metrics that affected the chance of failure. Another limitation of this analysis is that it leaves humans to choose the cutoff values that will raise alerts if exceeded.

So, for hard drives, SMART is interesting. We can say that there are maybe 6 attributes we should definitely be looking at when monitoring for drive failure but, unfortunately, a drive may still fail without showing any anomalous values for these attributes.
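To keep the IDs straight, here’s a minimal sketch of the kind of check I have in mind (the attribute list is the one called out above; the input format, a map of attribute ID to raw value, is just an assumption about however the values end up being read):

<?php
// The five attributes where any non-zero raw value is a red flag,
// per the Google and Backblaze findings above.
const RED_FLAG_ATTRIBUTES = [
    5   => 'Reallocated Sectors Count',
    187 => 'Reported Uncorrectable Errors',
    188 => 'Command Timeout',
    197 => 'Current Pending Sector Count',
    198 => 'Uncorrectable Sector Count',
];

// SMART 240 (Head Flying Hours) is a counter and will be non-zero on any
// used drive, so it's something to trend over time rather than compare to zero.

// $rawValues is assumed to be a map of attribute ID => raw value.
function flagSuspectAttributes(array $rawValues): array
{
    $flagged = [];
    foreach (RED_FLAG_ATTRIBUTES as $id => $name) {
        if (($rawValues[$id] ?? 0) > 0) {
            $flagged[$id] = $name;
        }
    }
    return $flagged;
}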

SMART attributes correlated to solid-state drive failure

While their price per gigabyte is still much higher than that of hard drives, solid-state drives are increasingly commonplace. SSDs do support SMART, but I couldn’t find as much research done on them. The study I mentioned above, “Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models”, didn’t use SMART attributes but instead daily performance logs in a proprietary format:

… daily performance logs for three MLC SSD models collected at a Google data center over a period of six years. All three models are manufactured by the same vendor and have a 480GB capacity … they utilize custom firmware and drivers, meaning that error reporting is done in a proprietary format rather than through standard SMART features.

Another study, “An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers” didn’t find any correlation with SMART attributes:

Intra-node and intra-rack failures have limited correlations with the SMART attributes and have no significant differences of correlations with each SMART attribute. Thus, the SMART attributes are not good indicators for detecting the existence of intra-node and intra-rack failures in practice.

Finally, I looked at “SSD Failures in Datacenters: What? When? and Why?” which had a similar conclusion:

… even though tracking [SMART] symptoms is important, prognosis of whether a SSD will fail(-stop) or not, cannot be made entirely based on the symptoms. This motivates us to study other factors, beyond SMART symptoms, to better understand the characteristics of failed devices.

So it seems that for SSDs, SMART isn’t an effective tool when it comes to predicting failure.

Reading SMART attributes

On Windows, there are a number of tools to read SMART attributes; these posts on Super User list a bunch. On Linux, smartmontools seems to be available for most distros; it’s fairly easy to install and use (from the command line, something like sudo smartctl --all /dev/sda).

As for reading the SMART data programmatically, information is sparser. For Windows, this article provides some code and points to using DeviceIoControl() to communicate with the device driver and retrieve the attributes. For Linux, this post provides a lot of good information and code on how to read the attributes. I haven’t tried implementing either of these approaches myself, but it’s something I might play around with in a future project.
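In the meantime, shelling out to smartctl and parsing its JSON output is probably the simplest programmatic option. A rough sketch (assuming smartmontools 7.0+, which added the -j flag, its ata_smart_attributes table layout, and root privileges):

<?php
// Rough sketch: read SMART attributes for a device by shelling out to smartctl.
function readSmartAttributes(string $device): array
{
    $json = shell_exec('smartctl -A -j ' . escapeshellarg($device));
    if (!$json) {
        throw new RuntimeException("smartctl produced no output for $device");
    }

    $data = json_decode($json, true);

    // Build a map of attribute ID => raw value, e.g. [5 => 0, 187 => 0, ...]
    $rawValues = [];
    foreach ($data['ata_smart_attributes']['table'] ?? [] as $attr) {
        $rawValues[$attr['id']] = $attr['raw']['value'];
    }
    return $rawValues;
}

// e.g. feed the result into the attribute check sketched earlier:
// $flagged = flagSuspectAttributes(readSmartAttributes('/dev/sda'));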

Preventing fake signups

The problem

One annoying problem I began encountering with ScratchGraph a while ago was fake signups. Every so often I would notice a new account created, but with no interaction on the site beyond account creation. I’d also notice some other errors in the application log, as the spam bot would attempt to fill in and submit every form on the landing page, so I’d get login failure and account reset errors as well. At first, I figured I could just ignore this; metrics would be a bit off, I’d have a bounced welcome email from time to time, and I could just purge the few fake accounts at some point in the future. Unfortunately, it got to the point where so many fake accounts were being created that figuring out whether an account was real took real effort, there was a lot of garbage in the database and logs and, perhaps most importantly, the welcome emails being sent out would bounce or be flagged as spam, dragging down my email reputation and increasing the possibility of legitimate emails going to spam.

A solution

There are a bunch of blog posts from email marketing services detailing this issue, e.g. Mailchimp, Folderly. A few common solutions are usually mentioned:

  • ReCAPTCHA
  • Email confirmation for new accounts
  • Some sort of throttling
  • Honeypot fields

I opted for honeypot fields.

  • I wanted to minimize dependencies and additional integrations, so no ReCAPTCHA
  • Email confirmation seemed like too heavy of a lift, and I disliked the idea of having a user jump from the application to their inbox
  • Throttling kinda makes sense (I could check whether other forms were submitted around the same time with the same email address and flag the account); it’s possible, but not trivial for a PHP application where you don’t have service-level jobs running in the background

So, I added a honeypot field to the signup form. In practice, I wrapped a text input in a hidden div and gave the input the name fullname. A user’s full name isn’t requested or collected during signup anyway, and I figured a spambot might try to gauge what to do with the field based on its name, so I gave it a realistic one. On the backend, if fullname is non-empty, the request still succeeds (HTTP 200), but an account is not created. Instead, the email and IP address are logged in a table for spam signups (I figure I might be able to do something with this data at some point).
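A rough sketch of what this looks like (simplified; in ScratchGraph the spam signups go into a database table, whereas below I’m just using error_log):

<?php
// Simplified signup handler sketch; the real thing logs to a spam-signups
// table rather than error_log.
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    if (!empty($_POST['fullname'])) {
        // Honeypot tripped: respond as if the signup succeeded (HTTP 200),
        // but don't create an account; just record the email and IP.
        error_log(sprintf(
            'spam signup: email=%s ip=%s',
            $_POST['email'] ?? '',
            $_SERVER['REMOTE_ADDR']
        ));
        http_response_code(200);
        exit;
    }
    // ... normal account creation goes here ...
}
?>
<!-- Signup form: the honeypot input sits inside a hidden div. A real user
     never sees or fills it; a naive spambot fills in everything. -->
<form method="post">
    <input type="email" name="email" required>
    <input type="password" name="password" required>
    <div style="display:none">
        <input type="text" name="fullname" tabindex="-1" autocomplete="off">
    </div>
    <button type="submit">Sign up</button>
</form>

The tabindex="-1" and autocomplete="off" bits aren’t part of what I described above; they’re just extra precautions so browser autofill and keyboard navigation don’t touch the field.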

Efficacy

I was somewhat skeptical as to how well a honeypot field on the signup form would work; I was imagining spambots being incredibly sophisticated, maybe noticing that the field was hidden or that nothing changed on the frontend after submission. Turns out this is not the case, or at least not for the spambots I was facing. From what I can tell, this relatively simple approach has detected and caught all fake signups from spambots.

Fake signup caught via honeypot field

I imagine there’s a shelf life to this solution (happy to be wrong) but, when it starts to fail, I can always integrate another approach such as throttling or ReCAPTCHA.