I'm curious what this data would look like collated by drive birth date rather than (or in addition to) age. I wouldn't use that as the "primary" way to look at things, but it could pop some interesting bits. Maybe one of the manufacturers had a shipload of subpar grease? Slightly shittier magnets? Poor-quality silicon? There's all kinds of things that could cause a few months of hard drive manufacture to be slightly less reliable…
(Also: "Accumulated power on time, hours:minutes 37451*:12, Manufactured in week 27 of year 2014" — I might want to replace these :D — * pretty sure that overflowed at 16 bit, they were powered on almost continuously & adding 65536 makes it 11.7 years.)
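The wraparound arithmetic checks out; a quick sketch (assuming the 16-bit counter wrapped exactly once):

```python
# SMART power-on-hours value as reported, after a suspected 16-bit wraparound.
reported_hours = 37451

# Assuming the counter wrapped exactly once, add back 2**16 hours.
corrected_hours = reported_hours + 2**16  # 37451 + 65536 = 102987

# Convert to years of near-continuous operation (8766 h ~ one year incl. leap days).
years = corrected_hours / 8766
print(f"{corrected_hours} h = ~{years:.1f} years")  # ~11.7 years
```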
Over the past couple of years, I've been side-hustling a project that requires buying ingredients from multiple vendors. The quantities never work out 1:1, so some ingredients from the first order get used with some from a new order from a different vendor. Each item has its own batch number, and the combination used in the final product yields a batch number on my end. I logged my batch number with the batch number of each ingredient in my product. As a solo person, it is a mountain of work, but nerdy me goes to that effort.
I'd assume that a drive manufacture does similar knowing which batch from which vendor the magnets, grease, or silicon all comes from. You hope you never need to use these records to do any kind of forensic research, but the one time you do need it makes a huge difference. So many people doing similar products that I do look at me with a tilted head while their eyes go wide and glaze over as if I'm speaking an alien language discussing lineage tracking.
Are you using a merkle tree for batch ids?

    crust = f({flour, butter})
    filling = f({fruit, sugar})
    pie = f({crust, filling})

…where f = hash for a merkle tree with fixed-size (but huge!) batch numbers, and f = repr for increasingly large but technically decipherable pie IDs.
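A minimal sketch of that scheme in Python, with SHA-256 standing in for f and made-up lot names:

```python
import hashlib

def f(parts):
    """Hash a set of batch IDs into a fixed-size parent batch ID (hex)."""
    # Sort so the ID doesn't depend on ingredient order.
    data = "|".join(sorted(parts)).encode()
    return hashlib.sha256(data).hexdigest()[:16]  # truncated for readability

crust = f({"flour-lot-A17", "butter-lot-B03"})
filling = f({"fruit-lot-F88", "sugar-lot-S42"})
pie = f({crust, filling})
print(pie)  # fixed-size batch ID for the finished product
```

Swapping `f` for a `repr`-style concatenation gives the huge-but-human-decipherable variant instead.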
No. That sounds like someone who's had some sort of theory training. I just started up an uneducated man's database in the form of a spreadsheet. My batch 1001 sheet has all of the data necessary. My batch 1002 sheet has all of its data. I'm just a simpleton when it comes to this stuff.
> it is a mountain of work, but nerdy me goes to that effort.
Presumably required for compliance, if you're selling your products.
Are there decent software tools for tracking this? Or do you use custom spreadsheets or something?
I just listed everything in spreadsheets. I have no insights into what real companies do. It's a major case of fake it till you make it. I know enough to know that tracking this info is a good idea, so I brute-forced it. I now at least have the data to use in a real system if that ever becomes necessary. Unlikely. I don't even have a company formed or do any kind of sales. It's all been a very involved hobby, with products given away. It's just been a fun way of being able to push back from the keyboard while not sitting on the couch in front of yet another screen.
What I'm aware of in the industry is that the large ones use SAP. Dunno if SAP provides that out of the box or if it's custom-built, but I'd take a guess that it comes with their standard inventory/materials tracking/invoice handling package.
Edit: looked it up, yep, part of SAP S/4HANA batch management (LO-BM) [1].
[1] https://help.sap.com/docs/SAP_S4HANA_ON-PREMISE/4eb099dbc8a6...
Every decent ERP has it, even Microsoft's, because anyone trading in any kind of food (batch tracking) or anything tech (serial nos) needs it.
I didn't even know Microsoft had an ERP solution. Wtf
Why wouldn't they? They bought a number of established and emerging solutions and integrated them into their business licensing and sales, upgrading them as their platforms evolved. Some date back 40 years, starting as a standalone DOS program, now a SaaS.
I think it's helpful to put on our statistics hats when looking at data like this... We have some observed values and a number of available covariates which, perhaps, help explain the observed variability. Some legitimate sources of variation (e.g., proximity to cooling in the NAS box, whether the hard drive was dropped as a child, stray cosmic rays) will remain obscured to us - we cannot fully explain all the variation. But when we average over more instances, those unexplainable sources of variation are captured as a residual to the explanations we can make, given the available covariates. The averaging acts as a kind of low-pass filter over the data, which helps reveal meaningful trends.
Meanwhile, if we slice the data up three ways to hell and back, /all/ we see is unexplainable variation - every point is unique.
This is where PCA is helpful - given our set of covariates, what combination of variables best explains the variation, and how much of the residual remains? If there's a lot of residual, we should look for other covariates. If it's a tiny residual, we don't care, and can work on optimizing the known major axes.
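A toy numpy illustration of that idea - synthetic data with made-up covariate names, PCA done via SVD on centered data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical covariates: drive age dominates, temperature contributes,
# plus pure noise standing in for the unexplainable stuff.
age = rng.uniform(0, 8, n)
temp = rng.normal(35, 5, n)
noise = rng.normal(0, 1, n)
X = np.column_stack([age, temp, 0.8 * age + 0.2 * temp + noise])
X = X - X.mean(axis=0)  # center before PCA

# Singular values squared give the variance along each principal axis.
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(explained)  # large leading components = explainable; the tail is residual
```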
Exactly. I used to pore over the Backblaze data but so much of it is in the form of “we got 1,200 drives four months ago and so far none have failed”. That is a relatively small number over a small amount of time.
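For scale: with zero failures out of n drives, the "rule of three" gives a quick ~95% upper bound of 3/n on the true per-drive failure rate over the observed window (using the 1,200-drive figure from above):

```python
# 95% upper confidence bound on failure probability when 0 of n units failed:
# the "rule of three" approximates it as 3 / n.
n = 1200
upper = 3 / n
print(f"0 failures in {n} drives still allows a true failure rate up to ~{upper:.2%}")
```

So "none have failed yet" is consistent with anything from a stellar drive to a mediocre one.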
On top of that it seems like by the time there is a clear winner for reliability, the manufacturer no longer makes that particular model and the newer models are just not a part of the dataset yet. Basically, you can’t just go “Hitachi good, Seagate bad”. You have to look at specific models and there are what? Hundreds? Thousands?
"Actually HGST was better on average than WD" is probably about the only kind of conclusion you can make. As you have noted, looking at specific models doesn't get you anything useful because by the time you have enough data the model is already replaced by a different one - but you can make out trends for manufacturers.
> On top of that it seems like by the time there is a clear winner for reliability, the manufacturer no longer makes that particular model and the newer models are just not a part of the dataset yet.
That's how things work in general. Even if it is the same model, parts have likely changed anyway. For data storage, you can expect all devices to fail, so redundancy and backup plans are key, and once you have that set, reliability is mostly just an input into your cost calculations. (Ideally you do something to mitigate correlated failures from bad manufacturing or bad firmware.)
> if we slice the data up three ways to hell and back, /all/ we see is unexplainable variation
It's certainly true that you can go too far, but this is a case where we can know a priori that the mfg date could be causing bias in the numbers they're showing, because the estimated failure rates at 5 years cannot contain data from any drives newer than 2020, whereas failure rates at 1 year can. At a minimum you might want to exclude newer drives from the analysis, e.g. exclude anything after 2020 if you want to draw conclusions about how the failure rate changes up to the 5-year mark.
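A sketch of that exclusion step (synthetic records, not Backblaze's actual schema):

```python
# When estimating failure rates out to 5 years, drop cohorts that haven't been
# observable for 5 years yet - they can only contribute to the early-life numbers.
drives = [
    {"mfg_year": 2018, "failed_within_5y": False},
    {"mfg_year": 2019, "failed_within_5y": True},
    {"mfg_year": 2022, "failed_within_5y": False},  # censored: <5y of observation
]

observation_year = 2025
eligible = [d for d in drives if observation_year - d["mfg_year"] >= 5]
rate = sum(d["failed_within_5y"] for d in eligible) / len(eligible)
print(f"5-year failure rate over {len(eligible)} eligible drives: {rate:.0%}")
```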
I find it more straightforward to just model the failure rate with the variables directly, and look at metrics like AUC on out-of-sample data.
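For what it's worth, AUC has a direct interpretation here: the probability that the model scores a drive that failed above one that didn't. A self-contained sketch with toy scores:

```python
def auc(scores, labels):
    """Probability a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy out-of-sample check: model scores vs. actual failures (1 = failed).
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
print(auc(scores, labels))  # 5/6 ~ 0.83, better than the 0.5 of random guessing
```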
I personally am looking forward to BackBlaze inventing error bars and statistical tests.
Well said, and made me want to go review my stats text.
(with a tinfoil hat on) I'm convinced that Backblaze is intentionally withholding and obfuscating data to avoid producing a too-easily-understood visualization showing that Seagate is consistently the worst of the 3 remaining drive manufacturers.
Their online notoriety only started after flooding in Thailand contaminated every manufacturing clean room for spindle motors in existence, causing a bunch of post-flood ST3000DM001s to fail quickly, which probably incentivized enough people for the Backblaze stat tracking to gain recognition and continue to this day.
But even if one puts aside such models affected by the same problem, Seagate drives have always exhibited shorter real-world MTBF. Since it's not in the interest of Backblaze or anyone else to smear the brand, they must be tweaking the data processing to leave out some of those obvious figures.
To Seagate's credit though, their warranty service is excellent. I've had the occasional Exos drive die (in very large ZFS RAIDs) and they do just ship you one overnight if you email them an unhappy SMART report. Also their nerd tooling, SeaChest, is freely downloadable and mostly open source. That's worth quite a lot to me...
(And if anyone is curious about their tools – https://github.com/Seagate/openSeaChest is the link. Lots of low level interesting toys!)
I don't think so, their posts still have all the details and the Seagates stick out like a very sore thumb in their tables:
https://backblazeprod.wpenginepowered.com/wp-content/uploads...
and graphs:
https://backblazeprod.wpenginepowered.com/wp-content/uploads...
> Since it's not in interest of Backblaze or anyone to smear their brand
It is if they want to negotiate pricing; and even in the past, Seagates were usually priced lower than HGST or WD drives. To me, it looks like they just aren't as consistent, as they have some very low failure rate models but also some very high ones; and naturally everyone will be concerned about the latter.
OTOH, Seagate never sold customers SMR drives mislabeled for NAS use.
Not 100% sure about the SMR situation, but granted, Seagate was never not a technological front runner, nor an untrustworthy or unfaithful company; their 5K4 drives were always more cost effective than anybody else's, and they're the first to ship HAMR drives right now as well. It's __JUST__ that the MTBF was always statistically shorter.
Which is a significant “just”, to be sure! But in my experience, if an Iron Wolf survives a RAID rebuild, it’s probably going to work for many more years. I’ve had 3 WD Reds claim to keep working, and still pass their SMART short and long tests, but tank in performance. I’d see the RAID usage graphs and all drives would be at like 5% IO utilization while the Red was pegged at 100% 24/7. The whole volume would be slow as it waited for the dying-but-lying Red to commit its writes.
In each case, I yanked the Red and saw volume wait times drop back down to the baseline, then swapped in an Iron Wolf. Fool me thrice, shame on all of us. I won’t be fooled a 4th time.
I’m not a Seagate fanboy. There’s an HGST drive in my home NAS that’s been rocking along for several years. There are a number of brands I’d use before settling for WD again. However, I’d make sure WD hadn’t bought them out first.
Ugh, source on that? In the market for a new NAS/Homeserver soonish (realized my drives are almost at 10 years of power on time) and would like to have spinning rust behind ssd for larger storage.
It was a whole thing a while back. This was maybe the original article, but once it landed this was the headline of all tech news for a couple of days. https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shin...
SMR drives aren’t inherently bad, but you must not use them in a NAS. The may work well, up until they don’t, and then they really don’t. WD snuck these into their Red line, the one marketed at NAS users. The end result after a huge reputational hit was to promise to keep the Red Pro line on HMR, but the plain Red line is still a coin flip, AFAIK.
I will not use WD drives in a NAS. It's all about trust, and they violated it to an astonishing degree.
Didn't the ST3000DM001 fail because of a design flaw on the read head landing ramp?
According to Wikipedia: https://en.wikipedia.org/wiki/ST3000DM001
Somewhat of a tangent: imagine my dismay after googling why two of the drives in my NAS failed within a couple of days of one another, and coming across a Wikipedia page dedicated to the drive's notoriety. I think this is one of the few drives so bad that it has its own dedicated Wikipedia page.
Same thing happened to me with this drive: Lost the data on a RAID because two drives failed at the same time.
Agreed, these type of analyses benefit from grouping by cohort years. Standard practice in analytics.
Reminds me of the capacitor blight of the late aughts. Knowing when the device was manufactured helped troubleshoot likely suspect components during repair.
That drive's probably earned a quiet retirement at this point
Right. Does the trouble at year 8 reflect bad manufacturing 8 years ago?
Honestly, at 8 years, I'd be leaning towards dirty power on the user's end. For a company like Backblaze, I'd assume a data center has conditioned power. Someone at home running a NAS with the same drive connected straight to mains may not see the same life span from a drive from the same batch. Undervoltage when the power dips is gnarly on equipment. It's amazing to me how the use of a UPS is not as ubiquitous at home.
> it’s amazing to me how the use of a UPS is not as ubiquitous at home
I live in the UK. I’ve had one power cut in the last… 14 years? Brown outs aren’t a thing. I’ve had more issues with my dog pulling out cables because she got stuck on them (once) than I have with any issues to my supply
Drives run off the regulated 12V supply, not the raw power line. "Dirty power" should not be a problem.
It would depend on how well done the regulation was in the power supply, wouldn't it?
why people continue to misunderstand this befuddles me. If you bought a budget PSU, then who knows what the voltages really are coming down the +3/+5V lines. You hope they are only +3/+5, but what happens when the power dips? Is the circuitry in the bargain-priced PSU going to keep the voltages within tolerance, or does it even have the necessary caps in place to handle the slightest change in mains? We've seen way too many teardowns showing that's not a reliable thing to bank your gear on.
> why people continue to misunderstand this befuddles me.
You might want to check whether your befuddlement is due to your own misunderstanding of the topic. How many switching regulators have you built? We aren't living in fixed AC transformer days anymore, even the shittiest PSU won't behave like you're making it out. The legally required PFC will already prevent it just by itself, before the main 400V DC/DC step-down even gets its hands on the power. And why are you even mentioning 3V/5V? Those rails only exist for compatibility, modern systems run almost entirely off the 12V rails; even SATA power connectors got their 3.3V (it's not 3V btw) pins spec'd away to reserved by now.
PSUs don't really rely on caps to maintain voltage, there are negative feedbacks on top of negative feedbacks.
Living in Sweden. I realized last year that I hadn't replaced my homeserver/NAS in a long time; I still haven't had time to replace it, and the 2 drives (WD Red) are now approaching 10 years of power-on time without any SMART problems so far.
I work there. Can't go into much detail, but we have absolutely had various adventures with power and cooling that were entirely out of our control. There was even an "unmooring" event that nearly gave us a collective heart attack, which I'll leave you to guess at :)
> It's amazing to me how the use of a UPS is not as ubiquitous at home.
Most users don't see enough failures that they can attribute to bad power to justify the cost in their mind. Furthermore, UPSes are extremely expensive per unit of energy storage, so the more obviously useful use case (not having your gaming session interrupted by a power outage) simply isn't there.
UPSes are a PITA. I have frequent enough outages that I use them on all of my desktops, and they need a new battery every couple years, and now I'm reaching the point where the whole thing needs replacement.
When they fail, they turn short dips, which a power supply might have been able to ride through, into an instant failure, and they make terrible beeping at the same time. At least the models I have run their self-test with the protected load attached, so if you test regularly, a worn battery announces itself as an unscheduled shutdown, which isn't great either. And there aren't many vendors, and my vendor is starting to push dumb cloud shit. Ugh.
Sounds like you have some APC model. I had those issues, and switched to CyberPower. The alarm can be muted and the battery lasts for many years.
A UPS is a must for me. When I lived in the midwest, a lightning strike near me fried all my equipment, including the phones. I now live in Florida, and summer outages and dips (brownouts) are frequent.
I've got Cyberpowers actually. The alarm can be muted, but it doesn't stay muted. Especially when the battery (or ups circuitry) is worn out so a power dip turns into infinite beeping. But also if the computer is turned off.
Many years ago I had the same thing happen - actually came in the phone line, fried my modem and everything connected to the motherboard. More recently I had lightning strike a security camera - took out everything connected to the same network switch, plus everything connected to the two network switches one hop away. Also lit up my office with a shower of sparks. Lightning is no joke.
Yes, this is fairly standard in manufacturing environments. Bills of material and lot-level (or down to serial-number-level) tracking are used in the production of complex goods.