Hard disk reliability study - 2005-2020

Discussion in 'hardware' started by Mrkvonic, Feb 19, 2020.

  1. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    9,767
    I think you will love this. I've published this ultra-long and insightful reliability study of mechanical hard disks conducted over 15 years (2005-2020) in a home environment, covering desktops, laptops and external devices, including environmental conditions, temperatures, usage patterns, Mean Time To Failure (MTTF), failure probability, other factors and findings, and more. Enjoy most profoundly.

    https://www.dedoimedo.com/computers/hard-disk-reliability-study-2005-2020.html


    Cheers,
    Mrk
     
  2. itman

    itman Registered Member

    Joined:
    Jun 22, 2010
    Posts:
    7,913
    Location:
    U.S.A.
    I have a WD Blue, i.e. a WD2500KS, circa 2006, that I reinstalled last September in my Win 10 build when my much newer Seagate Barracuda drive used for backups failed. The WD2500KS's SMART data only shows 22 ECC errors, a count that hasn't changed since installation. HD Tune states this could be a SATA cable problem, but I am not worried unless that count starts increasing.
     
  3. Bill_Bright

    Bill_Bright Registered Member

    Joined:
    Jun 29, 2007
    Posts:
    3,611
    Location:
    Nebraska, USA
    I commend you for taking the time and effort to consolidate and aggregate all this data from 15 years of use. I find the results interesting but, I am sorry to say, it's not very useful or valuable for us normal consumers. That is NOT a criticism of you, all the effort you put into it, or the report. Let me explain.

    I have never seen such a report, at least not one like this that centers on long term use in a home environment. And that is exactly what most consumers need. We might see RMA reports from the likes of Newegg or Amazon, but those reports tend to be highly skewed as they include products returned because they were damaged during shipping, were the wrong size or wrong color, missing parts out of the box, or they were returned for other reasons not associated with failure under normal use. They also include DOAs which really don't reflect the reliability of that particular model. And RMA reports never address longevity.

    So a report like yours that covers many years is a fantastic idea and I truly applaud your efforts to bring one to us.

    The problem is, there just aren't enough samples. A sample size of one is anecdotal at best. Even the finest products from the best manufacturer will occasionally have a sample that fails prematurely. That does not mean that is an unreliable model or brand. To be of real value, there need to be lots of samples of the same model number. The more samples, the greater the accuracy and value. That's just simple Statistics 101.

    What makes Backblaze informative and useful is, in most cases, they evaluate 100s or even 1000s of the same model number over time. But sadly, that is not done in a home environment.

    I appreciate that you acknowledged the exact model numbers are not listed. But unfortunately that information is critical to be of value to the consumer. For example, your report shows that 2 identical WD Blue 1000 drives were "OK" but another WD Blue 1000 failed. Were they the same model numbers from the same series? We don't know, so we don't know what to buy or what to avoid. There was a WD My Book 500 that failed, but another was OK. Were they the same model numbers? We don't know.

    You are 100% correct that revision numbers change too - but that is often to correct firmware bugs or to replace failing components on the controller board with reliable components from a different source - fixes that may make later model numbers very reliable - and thus desirable to us consumers.

    "Laptop disk" is listed 5 times. What does that even mean? Are they the same brand? Different brands? We don't know. I note the difference between a "laptop disk" and a "PC disk" is often based solely on the type of computer it is put in. While you don't normally see 3.5" disks in laptops, it is not uncommon to see the exact same model number of 2.5" disk used in notebooks also used in PCs (or in enclosures). Beyond that, nothing makes a laptop disk different from a PC disk. So without a model number, those entries, again, are of no value to us. :(

    I also appreciate you acknowledge your bias towards WD. But Seagate is the world's largest HD manufacturer with 40% of the market share to WD at 36%. One of the more common questions I get when people are seeking advice is, "Seagate or WD?" But your report is void of Seagate drives.

    So again, I find your report very interesting and I do appreciate the effort you put into it. But do I love it? Sorry, but no. It really serves no "value" for us.

    I understand pride of authorship but please don't take this critique personally. I meant nothing personal.
     
  4. itman

    itman Registered Member

    Joined:
    Jun 22, 2010
    Posts:
    7,913
    Location:
    U.S.A.
    @Mrkvonic here's one for you.

    For the same Win 10 build noted above, I recently replaced the boot drive and installed Win 10 1909 fresh. The drive is a Seagate Constellation 1TB purchased in 2013 but never used. I did run HDD diagnostics on the drive at the time of purchase and everything checked out. I have been getting one I/O error in the Windows event log every time I boot, and no more errors thereafter. I'm using the standard MS SATA driver in AHCI mode. Any ideas on this?

     
  5. zapjb

    zapjb Registered Member

    Joined:
    Nov 15, 2005
    Posts:
    4,876
    Location:
    USA still the best. But getting worse!
    I don't want to know. LOL. I have a hard enough time not being in the present moment.
     
  6. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    9,767
    Bill, statistically, 30 devices is enough to have a meaningful report. Of course, like you said - and I wrote - there are lots of other things to take into account. But it is indeed impossible to cover them by "one" household. I tried my best to account for what's feasible and reasonable. But indeed, I haven't tried too many different manufacturers. Even so I guess this is pretty indicative, even on the conservative side, of what you might get if you go for whatever mix of hard disks you need for your home.

    itman - firmware error, driver error, controller error, cable error, disk error, can't say more. My guess, if the disk works fine - probably a driver error.

    Thanks,
    Mrk
     
  7. Bill_Bright

    Bill_Bright Registered Member

    Joined:
    Jun 29, 2007
    Posts:
    3,611
    Location:
    Nebraska, USA
    It would be if they were 30 identical drives. Or even 5 samples each of 6 different model numbers. Even 3 samples each of 10 model numbers might be meaningful. But here, most are a sample size of one. And they don't specify the model number. So the report does not reveal any trends, good or bad. It does not reveal brands to avoid, models to avoid, series to avoid - or buy.

    A single sample that fails could simply be an exception. And exceptions don't make the rule. And the reverse is true too. A single sample that passes all tests could be an exception too.

    If I did a study of power supplies that included a single Corsair 500W supply and reported that one Corsair 500W failed, would that suggest all Corsair 500W supplies are of poor quality and unreliable? Not at all because (1) a sample of one does not show a trend and (2) we know Corsair has several series or tiers of PSUs and Corsair's top tier are quality, reliable power supplies. So would that be meaningful information about Corsair 500W supplies? If I did a study of power supplies and didn't include even one EVGA or SeaSonic, would it be meaningful?

    So, no. It really isn't "meaningful". Interesting? Absolutely! :)

    What would have been meaningful is if 15 years ago, you got all your friends and colleagues to keep track as you did. Then you consolidated all that data (but, again with model numbers and all the market leaders). That would have been meaningful and of value to us home consumers. So why didn't you do that? ;)
    And that is clearly obvious and appreciated.
     
  8. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    9,767
    The focus is not specifically around brands - it's about failures and data backups - that's the most important thing, I believe.
    One, you can't expect disks to warn you when they are going to fail. Two, my percentages/time are thus - hence the data redundancy plan.
    Getting people to do this ... well, I might as well run a penal colony :)

    Cheers,
    Mrk
     
  9. Bill_Bright

    Bill_Bright Registered Member

    Joined:
    Jun 29, 2007
    Posts:
    3,611
    Location:
    Nebraska, USA
    I have to admit, I am really confused now. Absolutely, backups are critical. And I absolutely mean backups - as in more than one copy - a point you also stress. :) And it is sad that everyone knows they should have a robust backup plan, and use it, but most don't. We always hear one excuse or another, and it is not until they lose the only copy of their data that they might do something about it. :(

    I guess where I'm confused is if model numbers don't matter and brands don't matter, then why break it down at all? Of 30 drives, you had 5 failures. That's a 17% failure rate. Not good! In fact, that's horrible!

    The true fact is this; all drives will fail - eventually (unless retired before that eventuality).

    IMO, since brand and model numbers don't matter, it also does not matter if those failures occurred in the 1st year, 5th year or 10th year. The fact that 17% of all drives failed while in use is proof enough to conclude users need to have a robust, multi-copy backup plan, and use it. Breaking it down by brand (in some cases), years in use, type of drive, etc. just obfuscates the true conclusion you are trying to make.

    JMVHO
     
  10. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    9,767
    That is 17% over 7 years average disk life, so that is about 2.5% annual in the worst case, hence two or three data copies are needed (at least).
    Mrk
     
  11. Bill_Bright

    Bill_Bright Registered Member

    Joined:
    Jun 29, 2007
    Posts:
    3,611
    Location:
    Nebraska, USA
    Two or three backup copies are needed. That is, 3 copies minimum are needed for a "robust" backup plan; the original and 2 backups, preferably with one of those backup copies being stored off-site.

    That said, if the focus is about the importance of data backups, it is important to note data loss is NOT just about drive "failure". Data "corruption" can be just as devastating and can occur due to unexpected power loss or malware. Data loss can occur purely through user error: accidentally deleting files, or wiping or formatting drives. Data loss can occur due to natural disasters (fire, flood, tornado or hurricane). Or a bad guy can break into the home and steal the computer (and the external backup drive too, which folks typically have sitting next to their monitors). Natural disasters and bad guys illustrate the need for an "off-site" backup copy.

    If the focus was to be about data backups ("the most important thing"), then all that information about those drives, including their size, time in service, etc. makes that focus a bit fuzzy. I mean the title of your article is "Hard disk reliability study - 2005-2020", not "Why you need to have a robust backup plan".
     
  12. reasonablePrivacy

    reasonablePrivacy Registered Member

    Joined:
    Oct 7, 2017
    Posts:
    1,084
    Location:
    Member state of European Union
    I have read your article "My backup strategy" and I think it makes one assumption that is not true in many cases.
    The assumption is that one would not buy a new drive immediately after the failure of a previous drive, and would continue to use only the remaining drives.
    If you buy a new drive immediately after a failure, then the whole math is a lot more complicated.
    What does "immediately" mean? There may be a scenario where the user doesn't know about the failure for a few days or weeks. And even if he knows, a newly bought drive may take a few days to ship to his home. So there is a time window. You would need to calculate the risk of all drives failing within that short time.

    As @Bill_Bright noted, backups don't only protect from drive failures. Backups protect from malware, file corruption due to power loss or software errors, a bad guy taking your data drives from you, or mistakes the user may make: deleting data, or modifying it in a way that makes it less useful.
     
  13. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    9,767
    Assumption is - data is present simultaneously on the X devices (copies).
    Mrk
     
  14. Linux Build

    Linux Build Registered Member

    Joined:
    Feb 21, 2020
    Posts:
    1
    Location:
    Seattle
  15. reasonablePrivacy

    reasonablePrivacy Registered Member

    Joined:
    Oct 7, 2017
    Posts:
    1,084
    Location:
    Member state of European Union
    What is X? Is X the starting number of drives? When the starting number of drives increases, the probability of at least one drive failing increases too. In that case the probability of not having all X drives in an operable state should increase, not decrease. The larger X is, the less chance you have of still having X drives after time Y (ceteris paribus), compared to the same time Y with a smaller drive count W < X.
     
  16. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    9,767
    If one drive has a 10% probability of failure and you have two such drives with the same data (identical copies), then the probability of losing the data due to both disks failing at the same time is 10% x 10%, so 1%.
    Mrk
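    The 10% x 10% arithmetic can be checked by enumerating every outcome. The snippet below is only a sketch: it assumes the disks fail independently with the same probability, and the function name copy_outcomes is made up for illustration.

```python
from itertools import product


def copy_outcomes(p_fail, copies):
    """Return {number of surviving copies: probability}.

    Assumes every disk fails independently with the same probability p_fail.
    """
    outcomes = {k: 0.0 for k in range(copies + 1)}
    # Walk through every combination of survived (True) / failed (False).
    for state in product((True, False), repeat=copies):
        prob = 1.0
        for survived in state:
            prob *= (1.0 - p_fail) if survived else p_fail
        outcomes[sum(state)] += prob  # sum(state) counts the survivors
    return outcomes


two_disks = copy_outcomes(0.10, 2)
for survivors, prob in two_disks.items():
    print(f"{survivors} surviving copies: {prob:.2%}")
```

    For two disks at 10% each this gives 1% for losing both copies, 18% for exactly one survivor and 81% for both surviving. Passing copies=3 shows how a third copy changes the picture under the same independence assumption.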
     
  17. reasonablePrivacy

    reasonablePrivacy Registered Member

    Joined:
    Oct 7, 2017
    Posts:
    1,084
    Location:
    Member state of European Union
    It is hard to have a discussion with you, because you often reply with just one sentence.

    10% probability of failure over 3 years, I assume.
    It is not a 1% probability of data loss if you discover the drive failure within minutes, buy a new drive (or use another spare drive in an operable state) and place a backup there within a few hours. It is a non-zero probability, but considerably less than 1%.
     
  18. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    9,767
    Well, I explained that in detail in both articles, I hope, hence the one sentence wonders.

    Yes, there is the window where you have downtime due to a failure and increased temporal risk in the time window where you need to introduce the new component. So you can then go with triple or quadruple redundancy, have spares ready, so the time to resume normal state goes down to maybe minutes.

    The time frame that defines the probability for failure affects cost, not your redundancy. The probability determines your acceptable redundancy level.

    Mrk
     
  19. 142395

    142395 Guest

    I haven't read all the arguments, so I'm replying only to the mathematical part. Sure, suppose a disk fails exactly 3 years after purchase with 10% probability (I'm not sure of my English: I mean that if you buy a disk on Jan. 1, 2010, it either fails exactly on Jan. 1, 2013, with 10% probability, or it doesn't; it's also okay to assume a disk fails within 3 years with that probability and the owner doesn't buy a new one). If you buy 2 disks and keep the same data on both, the risk that you lose both on that day is 1%, while the probability that both disks survive is 81% and the probability that exactly one survives is 18% (9% for each disk being the sole survivor). Of course this is not realistic; it would be better to model failure as a function of time (ignoring temperature etc. for simplicity). I don't know which function is good for this (with enough data, we could estimate it), but one candidate may be:
    Pr(failure) = exp(-(-ln(time))^γ)
    where γ is a positive parameter determining the shape of the function. In this case time must be within 0 to 1, so let's assume time = 1 means 30 years - a scenario in which all disks will have failed after 30 years with 100% certainty. If γ = 1 (linear), the probability that a disk will have failed after 3y is 10%, as in the previous discussion. But now you can calculate the probability for more realistic cases such as the one @reasonablePrivacy described. I think γ = 1 is not likely, though; probably γ > 1 in reality.
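    To get a feel for the candidate function, here is a minimal sketch that evaluates it for a few values of γ. The function name failure_probability, the horizon_years parameter and the sample γ values are illustrative assumptions, not something fitted to data.

```python
import math


def failure_probability(years, gamma, horizon_years=30.0):
    """Pr(failure) = exp(-(-ln(t))^gamma), with t = years / horizon_years.

    horizon_years is the age at which every disk is assumed to have failed
    (30 years in the post above).
    """
    t = years / horizon_years
    if t <= 0.0:
        return 0.0  # a brand-new disk has not failed yet
    if t >= 1.0:
        return 1.0  # past the horizon, failure is certain by assumption
    return math.exp(-((-math.log(t)) ** gamma))


# gamma = 1 reduces to the linear case: 3 years out of 30 gives 10%.
for gamma in (0.5, 1.0, 2.0):
    print(gamma, f"{failure_probability(3, gamma):.4f}")
```

    With γ > 1 the curve stays below the linear case early in the disk's life and rises above it later; the two cross where t = 1/e, which is roughly 11 years on a 30-year horizon.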

    For the original discussion, it seems two disks failed after 4y, one after 5y, one after 8y, and the last after 10y. Simply summarizing this as 16.7% per 10y and 1.67% per year is wrong because they failed at different times. Even assuming a disk fails with a fixed 1.67% probability each year INDEPENDENTLY (like throwing a die every year), after the first year the survival probability is 98.33%, after two years it's 96.69%, and after ten years it's still 84.5%, not 100 - 16.7 = 83.3%.
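    The compounding in the last sentence can be reproduced in a few lines. This is a sketch that assumes a fixed, independent failure probability each year; survival_after is a made-up helper name.

```python
def survival_after(years, annual_failure):
    """Probability a disk is still alive after `years`, assuming it faces the
    same independent failure probability every year."""
    return (1.0 - annual_failure) ** years


ANNUAL_FAILURE = 0.0167  # the 1.67% per year figure from the post

for n in (1, 2, 10):
    print(f"after {n:2d} years: {survival_after(n, ANNUAL_FAILURE):.2%}")
```

    The ten-year figure comes out at about 84.5% rather than 83.3% because each year's 1.67% applies only to the disks that are still alive.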
     
    Last edited by a moderator: Feb 23, 2020
  20. stapp

    stapp Global Moderator

    Joined:
    Jan 12, 2006
    Posts:
    13,808
    Location:
    UK
  21. reasonablePrivacy

    reasonablePrivacy Registered Member

    Joined:
    Oct 7, 2017
    Posts:
    1,084
    Location:
    Member state of European Union
    I think the bathtub curve is the usual candidate. The idea of the bathtub curve is that clearly defective machines fail very early. Machines that are not clearly defective survive the initial period of very high failure rate and later fail at a constant rate. It does not apply to all machines, but it is a classic example.
     
  22. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    9,767
    Yuki, I agree with you on the curve - it's several independent curves, but for the sake of simplicity, I went with the simplest one.
    If we had the exact equation, we could predict disk failures accurately, which is not the case in the industry. Far from it.

    I analyzed the cumulative failure by doing:

    Sum of failures divided by the number of years, so 1/30 disks after 4 years gives roughly 0.8%, but, say, 3/30 disks after 5 years gives 2%.
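    For reference, the calculation described above can be written out as follows. This is a sketch of one reading of it (cumulative failed fraction divided by elapsed years, no compounding), and annualized_rate is a made-up name.

```python
def annualized_rate(failures, disks, years):
    """Cumulative fraction of failed disks divided by the elapsed years.

    A simple average: it does not compound and ignores when each disk failed.
    """
    return failures / disks / years


print(f"{annualized_rate(1, 30, 4):.2%}")  # 0.83%
print(f"{annualized_rate(3, 30, 5):.2%}")  # 2.00%
```

    Computed this way, the first example works out to a little over 0.8% per year and the second to 2% per year.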

    Besides, because of the lack of exact math in this case, I try not to focus on the equation. I just look at the end-state (worst-case) scenario. And then, derive from that the expected failure rates, and then decide how important the data needs to be so that I won't lose it if disks go bust.

    Mrk
     
  23. reasonablePrivacy

    reasonablePrivacy Registered Member

    Joined:
    Oct 7, 2017
    Posts:
    1,084
    Location:
    Member state of European Union
    I also don't focus on exact math, but I don't agree with ignoring human action. Two cases:
    1. Somebody buys 2 HDDs. One of them fails after some time; the person discovers that, but does nothing (human inaction).
    2. Somebody buys 2 HDDs. One of them fails after some time; the person discovers that, buys another drive and places a backup there as soon as possible (human action in reaction to an event).
    Even without a close look and mathematical calculations, common sense tells us that the probability of data loss is different in these scenarios.
     
  24. 142395

    142395 Guest

    Well, seeing is believing. Let's assume the failure rate is linear and that all disks are expected to fail after 40 years (a Google search suggested 40 - 50y). For simplicity, I use discrete time with a unit of one month.

    Case 1: a user always uses at most 2 disks (personally I'd recommend at least 2 simultaneous backups, i.e. 3+ disks, but let's assume a riskier scenario). If one fails, he'll replace it within a month. It's possible that both of his disks fail before he replaces one.

    Case 2: another user buys 5 disks at the start and never buys another. All other conditions are the same.

    I simulated both cases with 50000 users each. Every month, each disk can fail with 0.20833% probability.

    Results:

    Case 1: 105/50000 users had both disks fail before 40y. The distribution of time to full failure for those 105 users is shown below (the vertical axis is the number of users, the horizontal axis is the time range in months). Users bought 1.986 new disks (in addition to the initial 2) on average across the 50000 users.
    https://i.imgur.com/XipVXQ8.png

    Case 2: 5001/50000 users had all disks fail before 40y. The distribution of time to full failure for those 5001 users is shown below.
    https://i.imgur.com/n1DRITu.png
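
    The original simulation code was not posted, so the following is only a sketch of one plausible reading of the setup: every month each disk fails independently with probability 1/480, in Case 1 a failed disk is back in place by the next month (so data is lost only if both disks fail in the same month), and in Case 2 nothing is ever replaced. All function names are made up, and the user count is kept small so it runs quickly.

```python
import random

MONTHS = 40 * 12        # 480 monthly steps over 40 years
P_FAIL = 1.0 / MONTHS   # 0.20833% per disk per month, as stated in the post


def simulate_replacing(rng, disks=2):
    """Case 1 (assumed reading): failed disks are replaced by the next month.

    Returns (lost_everything, replacement_disks_bought).
    """
    bought = 0
    for _ in range(MONTHS):
        failed = sum(rng.random() < P_FAIL for _ in range(disks))
        if failed == disks:   # every copy died in the same month
            return True, bought
        bought += failed      # replace whatever failed
    return False, bought


def simulate_no_replacement(rng, disks=5):
    """Case 2: buy all disks up front, never replace. True if all have failed."""
    alive = disks
    for _ in range(MONTHS):
        alive -= sum(rng.random() < P_FAIL for _ in range(alive))
        if alive == 0:
            return True
    return False


def run(users, seed=0):
    """Return (case 1 loss rate, average disks bought in case 1, case 2 loss rate)."""
    rng = random.Random(seed)
    case1 = [simulate_replacing(rng) for _ in range(users)]
    case2_losses = sum(simulate_no_replacement(rng) for _ in range(users))
    loss1 = sum(lost for lost, _ in case1) / users
    avg_bought = sum(bought for _, bought in case1) / users
    return loss1, avg_bought, case2_losses / users


def expected_rates():
    """Closed-form loss rates for the same model, as a cross-check."""
    case1 = 1.0 - (1.0 - P_FAIL ** 2) ** MONTHS
    case2 = (1.0 - (1.0 - P_FAIL) ** MONTHS) ** 5
    return case1, case2


print("closed form:", expected_rates())
print("simulated  :", run(users=1000))
```

    Under these assumptions the closed-form rates are roughly 0.2% for Case 1 and roughly 10% for Case 2, in line with the 105/50000 and 5001/50000 reported above. A run with only 1000 users is noisy, especially for Case 1; raising users towards 50000 tightens the estimate at the cost of run time.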
     
    Last edited by a moderator: Feb 23, 2020
  25. Bill_Bright

    Bill_Bright Registered Member

    Joined:
    Jun 29, 2007
    Posts:
    3,611
    Location:
    Nebraska, USA
    It is not the case in the industry because developing an exact equation to accurately predict when a specific disk will fail is simply impossible.

    That would be great if possible, however. Companies, governments, institutions and individual consumers could "strategically" plan and budget for those future expenditures because they would know exactly when those resources would be needed.

    IT managers could schedule downtime to periodically replace the drives before their "expiration dates". Now that would be really nice!

    And while there certainly are other factors dictating the need for data backups, unexpected drive failures would no longer be one of them.

    Maybe one day. But not today.
     