Failure Trends in a Large Disk Drive Population

Discussion in 'other software & services' started by Pinga, Feb 16, 2007.

Thread Status:
Not open for further replies.
  1. Pinga

    Pinga Registered Member

    Joined:
    Aug 31, 2006
    Posts:
    1,420
    Location:
    Europe
    Google paper:

    '...SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'

    http://labs.google.com/papers/disk_failures.pdf
     
  2. HandsOff

    HandsOff Registered Member

    Joined:
    Sep 16, 2003
    Posts:
    1,946
    Location:
    Bay Area, California


    Actually, all it means is that the parameters that were set by the manufacturers were either wrong, or outright lies, the latter being the more likely, IMHO. Push the temperature mark up high enough and it will be correct 100 % of the time, the same with activity.



    -HandsOff!
     
  3. Ice_Czar

    Ice_Czar Registered Member

    Joined:
    May 21, 2002
    Posts:
    696
    Location:
    Boulder Colorado
    a broader quote sheds a bit more light ;)

    factor in that SMART has always been of limited utility (it does what it does well, but is blind to catastrophic failure) and that distribution and integration are the leading cause of damage, and the paper isn't all that surprising.

    a more random pattern of failure is actually a sign that manufacturing and normal operating environment\wear are well addressed, but that factors largely beyond the manufacturers' control now account for the lion's share of failures


    and the actual study addresses where those surprising results are ;)
    this is surprising in that to date almost all predictive analysis is based on the Arrhenius Equation (for both mechanical as well as semiconductor reliability studies) and that in a controlled experiment temperature is likely the primary mechanism for failure, however in an uncontrolled real world environment, those "other factors" are displacing it throughout the low to middle temperature range.

    a "typical" reliability study\tutorial
    http://www.odyseus.nildram.co.uk/Systems_And_Devices_Files/Component Reliability Tutorial.pdf

    (the Arrhenius Equation is additionally modified for other variables, for instance current density in semiconductors in what is known as Black's Equation which is why we only get a basic rule of thumb for most temperature related life span predictions, namely the famous drop 10C (18F) increase life expectancy 100%, rise 10C decrease life span by half)
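
    A rough Python sketch of that rule of thumb next to the plain Arrhenius form it comes from. The function names and the 0.7 eV activation energy used in the example are assumptions picked for illustration, not values from the Google paper or the tutorial linked above.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def rule_of_thumb_life_factor(delta_c):
    """Life multiplier for a temperature change of delta_c degrees C.

    Encodes only the rule of thumb: +10 C halves life, -10 C doubles it.
    """
    return 2.0 ** (-delta_c / 10.0)


def arrhenius_acceleration(temp_base_c, temp_hot_c, activation_energy_ev):
    """Arrhenius acceleration factor between two temperatures in Celsius.

    A result of 2.0 means the failure mechanism runs twice as fast at
    temp_hot_c as at temp_base_c. The activation energy is an assumed input.
    """
    t_base = temp_base_c + 273.15  # convert to kelvin
    t_hot = temp_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_base - 1.0 / t_hot)
    return math.exp(exponent)


if __name__ == "__main__":
    print(rule_of_thumb_life_factor(10))        # rise of 10 C
    print(rule_of_thumb_life_factor(-10))       # drop of 10 C
    print(arrhenius_acceleration(40, 50, 0.7))  # assumed 0.7 eV
```

    With the assumed 0.7 eV, going from 40C to 50C gives an acceleration factor a little over 2, which is roughly where the "10C halves the life" shorthand comes from. A different activation energy gives a different factor, so treat it as a rule of thumb and nothing more.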

    nice find Pinga ;)
     
    Last edited: Feb 17, 2007
  4. Ice_Czar

    Ice_Czar Registered Member

    Joined:
    May 21, 2002
    Posts:
    696
    Location:
    Boulder Colorado
    what the hell :p
    a little "reliability" article I've been working on
    about why to overclock and how it can kill
    (a work in progress...for a long time, note the further from the beginning the less verified the info especially atomic level and their analogies)

    Magic Smoke
    Part One: Resistance Is Futile



    What exactly is Magic Smoke and more importantly how do you safeguard the magic smoke you have? Whether you're a n00b who has just heard the term or a veteran who is all too familiar with it (from first hand experience), we are going to take an in-depth look at exactly what magic smoke is.

    While the mythology of the origins of magic smoke is often entertaining, nothing is more dreaded than the pronouncement that one has lost their magic smoke. It is in fact one of the few things worse than the Ravenous Bugblatter Beast of Traal, getting pantsed in the cafeteria or a Blue Screen of Death. It is generally at this point that geeks will gather round to commiserate with the victim and suggest some depraved diversion to make them feel better. Unfortunately there are relatively few females admitting to being geeks, specifically because of this phenomenon. But one thing is invariably true: once the magic smoke is gone, all you're left with is a rather expensive key fob or wall decoration.




    The Golden Chip
    A common misconception is that if you assemble processor A with mobo B add RAM C and video card D you'll be able to benchmark or overclock to some widespread average you have seen. Worse some imagine not an average but some extraordinary but well publicized overclock will be obtainable. Why isn't this always so? Of course the simple answer is people enjoy deluding themselves.

    The technical answer is that when Integrated Circuits, or as we will now refer to them IC Chips, are made, they are tested. Some, because of a fortuitous alignment of variables, are better than others; most are average and a few are below average. It is a classic bell curve with the average commanding the lion's share.

    http://img.techpowerup.org/060403/granularity.gif

    Manufacturers do a limited amount of testing on any given IC Chip, what that means is that they do not explore the full potential of a given chip but roughly group them into classes, first for basic functionality with a burn-in and then roughly for speed, known as speed binning. Thus some processors or memory chips can gain a reputation as good overclockers largely because the speed they end up rated for is conservative in the first place. Others are speed binned to a finer granularity and their potential is more limited. Of course these tests are based on the "stock" voltage the chip class is destined for as well as an average temperature of operation.
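
    To make the granularity point concrete, here is a small Python sketch of speed binning. Everything in it is hypothetical: the `speed_bin` helper, the bin frequencies and the chip's "true" maximum stable clock are made-up numbers, not any manufacturer's actual test flow.

```python
def speed_bin(max_stable_mhz, bins_mhz, guard_band_mhz=0):
    """Return the highest rated bin a chip clears, or None if it clears none.

    max_stable_mhz : the chip's (hypothetical) true maximum stable clock
    bins_mhz       : the speed grades the manufacturer sells
    guard_band_mhz : safety margin subtracted before comparing to the bins
    """
    usable = max_stable_mhz - guard_band_mhz
    passing = [b for b in sorted(bins_mhz) if b <= usable]
    return passing[-1] if passing else None


def headroom_mhz(max_stable_mhz, rated_mhz):
    """Clock left on the table between the rating and the true maximum."""
    return max_stable_mhz - rated_mhz


if __name__ == "__main__":
    chip = 2750                                # hypothetical chip
    coarse = [2000, 2400, 2800]                # few, widely spaced grades
    fine = [2000, 2200, 2400, 2600, 2800]      # finer granularity

    for name, bins in (("coarse", coarse), ("fine", fine)):
        rated = speed_bin(chip, bins)
        print(name, rated, headroom_mhz(chip, rated))
```

    The same hypothetical chip ends up rated at 2400 with the coarse bins and 2600 with the fine ones, so the coarse-binned part looks like the better overclocker even though the silicon is identical.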

    If you have ever heard the term UTT in memory chips what that refers to is UnTesTed. They are not speed tested and there is no rating for them before they are dumped on the wholesale market. Memory companies have recently gambled on such chips and the acronym has come down to us. They subsequently perform their own sorting and save money by skipping the high-end chip testing the manufacturer normally does. How much testing these chips get varies with who is integrating them into modules.

    When a fab etches a chip, it lays down thin-film metallic traces to conduct electricity from one device to another on the chip; these traces are called interconnects. The state of these interconnects and the state of the semiconductor devices and passive components themselves determine the potential speed an IC chip is capable of. The devices are miniaturized transistors, diodes, resistors, inductors and capacitors, and there can be literally millions of them depending on the chip.

    Thus the state of these devices and the circuit is in fact Magic Smoke. While it may be less colorful than the captured perfume of a hundred and one dancing girls entranced in a WOW battle royale, distilled by pure unobtainium and injected into a chip by mad Electrical Engineering Druid Mystics deep in underground vaults protected by mutant cyborgs, it is rather more expensive to fabricate an IC Chip than to round up the Girls, Mutants and Mystics (EE Mystics are a dime a dozen).

    When you buy a processor, a video card with a GPU and assorted SDRAM chips on it, memory modules or a mobo with a whole host of IC chips on it, those chips are typically rated and tested to perform to their stated speed. But they haven't been fully explored, and more importantly the testing they have undergone may not have revealed latent defects that can cut short their projected lifespan.

    Overclocking is one way for the end-user to explore their full performance potential, but it is also how you accelerate the "wear" on the chip and may uncover a latent defect further degrading or killing the chip. You find out if you have a Golden Chip or a Lemon. If you are suddenly moved to make a run to your local Fish and Chips, you should especially pay heed to the warning at the beginning of this article.

    The Bus to West Hell
    If you have ever had a device die and it wasn't for instance a mechanical device that has suffered an egregious insult (like a swan dive off the workbench), it likely failed because of electromigration. Bent pins, flexed mobos and smacked HDDs are of course excepted. Electromigration is the mechanism that first degrades and then kills IC chips and actually any conductive trace even much larger ones given the wrong conditions.

    Static discharges and power surges from or through the power supply also fall into this group, as does overheating, so electromigration is the means of destruction separate from the root cause, which more often than not ultimately lies in between one's ears. But we are working on that. :p

    As the scale of IC chip interconnects and the devices on them has decreased, electromigration has become an ever greater problem. Simply put, electromigration is the displacement of material within the metallized interconnects, which first forms voids and eventually breaks the interconnect as a void spreads all the way across a trace. As the scale has decreased so has the amount of material that can be displaced across the width of an interconnect. There are simply fewer atoms that can be displaced, and those that are have a greater impact than they would at a larger scale.

    http://img.techpowerup.org/060330/elektromigration.jpg
    Failure of a copper conductive strip due to electromigration, viewed with a scanning electron microscope.
    Creative Commons ShareAlike 2.0 By Paddy/Bilder Benutzer


    Imagine a hose that is pumping electrons: first it is twisted, narrowing the diameter, then kinked, all but stopping the flow, and finally cut, and you have a pretty good idea of how electromigration would impact powering those devices. Of course as a derivative of Murphy's Law this will occur right in the middle of a waterfight just as you're about to soak twelve coeds in flimsy T-shirts.

    As interconnects degrade, a portion of the circuit or branches of a circuit can become intermittent, leading to random errors, lockups, data corruption and BSODs. They are also rumored to steal socks, borrow your car, and rack up long distance phone calls, wait that was my last roommate.

    So what specifically causes electromigration? The short answer is both heat and current density. If you think of those copper atoms as a bus full of little children, heat would be like doping them up on candy and Jolt then driving them past Disneyland. They are going to get excited and distracted, in fact they are going to be bouncing off the walls. Now we are going to ask them to pass oranges from the back of the bus to the front; these are electrons. The more oranges, the more are going to get dropped, used to pelt passing pedestrians and of course applied to Suzie's hair.

    Electromigration is inevitable; it is always occurring and cannot be completely stopped if the circuit is being employed. But the rate of electromigration is dependent on the current density and temperature. Put another way, stopping electromigration is futile because of resistance.
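
    The dependence on current density and temperature is usually written as Black's Equation, MTTF = A / J^n * exp(Ea / kT). Below is a hedged Python sketch that only compares two operating points, so the unknown constant A cancels out. The function name and the default exponent n = 2 and activation energy 0.8 eV are assumptions chosen for illustration; real values depend on the metal and the process.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf(current_density_ratio, temp_ref_c, temp_new_c,
                  n=2.0, activation_energy_ev=0.8):
    """Ratio MTTF_new / MTTF_ref under Black's Equation.

    current_density_ratio : J_new / J_ref (2.0 means twice the current density)
    temp_ref_c, temp_new_c: interconnect temperatures in Celsius
    n, activation_energy_ev: assumed model parameters, not measured values
    """
    t_ref = temp_ref_c + 273.15  # convert to kelvin
    t_new = temp_new_c + 273.15
    current_term = current_density_ratio ** (-n)
    thermal_term = math.exp(
        (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_new - 1.0 / t_ref)
    )
    return current_term * thermal_term


if __name__ == "__main__":
    # Same temperature, double the current density.
    print(relative_mttf(2.0, 60, 60))
    # Same current density, 20 C hotter.
    print(relative_mttf(1.0, 60, 80))
    # A hypothetical overclock: 20% more current density and 15 C hotter.
    print(relative_mttf(1.2, 60, 75))
```

    With n = 2, doubling the current density at constant temperature cuts the modelled lifetime to a quarter, and any temperature rise shrinks it further. The absolute lifetime still depends on the constant A, which this sketch deliberately leaves out.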

    By lowering the temperature of a substance it becomes more orderly, which is why holding cells are kept at a low ambient temperature at a hoosegow near you. The extreme example of this is a superconductor, where electrical resistance drops to effectively zero. All the little children are transformed into Stepford Kids, automatons incapable of distraction, dutifully passing bushels of oranges as fast as their little hands can, to keep from freezing to death. Almost no oranges go astray.

    Most analogies are flawed in one respect or another, and ours is no exception, but I figured it would be rather crude to have the little blighters eating and excreting the oranges. You might be familiar with the classic atomic model that identifies its component parts.

    http://img.techpowerup.org/060424/acdc_copper_atom.gif

    Copper, because of the single electron loosely attached in its outer shell, is able to gain and lose electrons easily, which is what makes it such a great conductor. But the shell model has itself been superseded by a more accurate but in many ways more mystic model provided by quantum mechanics. Rather than the term shell we employ the term energy level. The shell model implies a fixed position for electrons when in fact each circle illustrated above can have a variable number of electrons depending on the atom's total energy level.

    When a conductor loses or gains electrons that is known as ionization, which is an exchange of energy from one atom to another, and the individual atom's total energy state is altered. Atoms have, in addition to electrons, protrons. Protrons have a positive charge while electrons have a negative charge; in the elemental state these are in balance, with no net charge one way or the other. However, if an atom gives up electrons, so that it has fewer than its normal number, it is positively charged and is called a positive ion. The reverse is true if it has gained more electrons than it would normally have: it is considered negatively charged and is a negative ion. This seems backwards since we are adding and subtracting electrons, till you recall electrons act like negative numbers in addition. Thus you can have positive ions of copper or positive ions of other elements, since it's a description of an atom's electron state or electrical potential.

    http://img.techpowerup.org/060424/acdc_inside_wire.gif

    But electrical potential is just part of an atom's energy state, heat is another
    http://environmentalchemistry.com/yogi/periodic/atom_anatomy.html#EnergyLevels
    >insert diagram copper molecular lattice<
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    In a perfect molecular structure electrons moving through the copper lattice would not collide and there would be no electromigration or resistance either, but perfect structures and idealized flows are theoretical constructs. In real life you cannot avoid resistance, at least this side of superconductivity, which would require massive cooling capacity.

    Resistance occurs from a scattering of electrons from defects in the lattice. When a potential difference or voltage is applied you start moving electrons off one end, and that allows adjacent electrons to move one atom over and on down the line. But not all of them end up where you want them, and the longer the conductor the more miss. In plain English, the longer the wire the greater the resistance, because more electrons scatter. The result is that they collide with atoms that cannot accept the electrons because they lack the potential; they do, however, transfer kinetic energy. This is known as Joule heating or ohmic heating.

    The energy state of an atom is not solely determined by its electrical potential. Atoms store heat as kinetic energy, or more simply motion; a hunk of hot metal may not appear to be moving on any scale we can see, but each individual atom is vibrating. This is more accurately described as Internal Energy, a random molecular motion. And because it's random...
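
    The "longer wire, more resistance, more heat" chain can be put into numbers with the textbook formulas R = rho * L / A and P = I^2 * R. A minimal Python sketch follows; the helper names are made up for illustration and the copper resistivity is an approximate room-temperature figure.

```python
COPPER_RESISTIVITY_OHM_M = 1.68e-8  # approximate value near 20 C


def wire_resistance(length_m, area_m2, resistivity=COPPER_RESISTIVITY_OHM_M):
    """Resistance of a uniform conductor: R = rho * L / A."""
    return resistivity * length_m / area_m2


def joule_heating_watts(current_a, resistance_ohm):
    """Power dissipated as heat in a resistance: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


if __name__ == "__main__":
    area = 1e-6  # 1 square millimetre, in square metres
    for length in (1.0, 2.0, 4.0):
        r = wire_resistance(length, area)
        p = joule_heating_watts(5.0, r)  # 5 A is an arbitrary example current
        print(length, r, p)
```

    Doubling the length doubles the resistance, and at a fixed current it doubles the heat dumped into the wire as well. Shrinking the cross-section does the same thing, which is part of why ever thinner interconnects run into trouble.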

    probably more than enough for today :p
     
  5. HandsOff

    HandsOff Registered Member

    Joined:
    Sep 16, 2003
    Posts:
    1,946
    Location:
    Bay Area, California
    While I'm normally a fan of normal distribution, my love of cynicism has to take precedence here. I will indeed have to stand corrected if the failure data was consistent at the high temperature range of the spectrum, where I was guessing the manufacturers were possibly voiding warranties at unrealistically low temps.

    On the other hand, if they aren't, then isn't this the next most obvious conclusion? What would you expect to fail first, a mechanical device or a device with no moving parts? Will iPod Minis have the longevity of iPod Shuffles (with no mechanical drive)? I am glad they look at these things, but is this not to be expected?

    -HandsOff
     
  6. Ice_Czar

    Ice_Czar Registered Member

    Joined:
    May 21, 2002
    Posts:
    696
    Location:
    Boulder Colorado
    in a mobile device we suddenly have a whole new set of variables as well as technologies. impact damage to a HDD is largely a derivative of the flex in the armature and the platter; as the size of these decreases the stiffness proportionately increases and the g-force needed to cause a head slap increases. a four foot drop that would devastate a 3" internal HDD is survivable by a 1" microdrive. But a flash memory card left to bake in a car window......

    lots of possible variables ;)

    what I find so interesting in Pinga's discovery is that they are now looking for the variables that have displaced the midrange temperature life expectancy predictions and are basically unaccounted for. We can of course deduce, like the storagereview FAQ, that they are integration\distribution and environmental factors other than temperature, which are the root cause of the (largely random) failures.
     
  7. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,698
    Hello,

    Ice, I could add quite a few lines on the issue of QM and solid state in general. BTW, it's protons and not protrons ... a small typo ... :)

    As to the collision with lattices, that's not due to defects, that's due to the physical structure of the material. Even the purest materials will have a fixed max. conductivity, based on the mean free path of the electrons, which depends on temperature and / or other influences, like EM fields. Impurities in a material reduce conduction further, which is critical for the thin wafers of silicon used in the assembly of chips.

    Another effect is the Casimir effect, which occurs at nano distances between particles - and reflects the theory of the zero-point energy, the energy of vacuum. Although fancy in name, this effect must be taken into consideration with nano EM devices.

    Mrk out
     
  8. Ice_Czar

    Ice_Czar Registered Member

    Joined:
    May 21, 2002
    Posts:
    696
    Location:
    Boulder Colorado
    want a job as my editor?
    seriously I need someone to make sure my "interpretation" is correct and accurate, and that's only a part of the article.

    PS reading up on the Casimir effect right now ;)
    Thanx for both observations :thumb:

    maybe editor is too strong a word, just pointing out the flaws and letting me sort them but for the whole article.
    From the physical description of electromigration and its relationship to current density and temperature it swings towards the power delivery chain (utility, conditioning, SMPS, VRM\VRD, chips\Vcore\timings) and then covers ESD\precautions and possibly goes into EMI\Shielding in relation to the inverse square law and data reliability\signaling

    whoa ["Casimir effect" electromigration] turns up a hell of a lot to read :p
     
    Last edited: Feb 18, 2007
  9. Ice_Czar

    Ice_Czar Registered Member

    Joined:
    May 21, 2002
    Posts:
    696
    Location:
    Boulder Colorado
  10. lucas1985

    lucas1985 Retired Moderator

    Joined:
    Nov 9, 2006
    Posts:
    4,047
    Location:
    France, May 1968
    Very good, thanks Ice :)
     