== Weaknesses ==

=== Correlated failures ===
In practice, the drives are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates the assumptions of independent, identical rate of failure amongst drives; failures are in fact statistically correlated.<ref name="Patterson_1994" /> In practice, the chances for a second failure before the first has been recovered (causing data loss) are higher than the chances for random failures. In a study of about 100,000 drives, the probability of two drives in the same cluster failing within one hour was four times larger than predicted by the [[exponential distribution|exponential statistical distribution]]—which characterizes processes in which events occur continuously and independently at a constant average rate. The probability of two failures in the same 10-hour period was twice as large as predicted by an exponential distribution.<ref name="schroeder">[http://www.usenix.org/events/fast07/tech/schroeder.html Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?] [[Bianca Schroeder]] and [[Garth A. Gibson]]</ref>
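A rough sense of what the independence assumption predicts can be given by a short calculation. The following Python sketch estimates the chance of a second, independent drive failure during a recovery window under an exponential model; the drive count, annualized failure rate and recovery window are illustrative assumptions rather than figures from the cited studies, and correlated failures push the observed probability above this idealized estimate.

<syntaxhighlight lang="python">
# Probability of a second, independent drive failure while redundancy is being
# restored, under the exponential (independent-failure) model that the cited
# studies show real drive populations violate.
# All figures below are illustrative assumptions, not values from those studies.
import math

n_drives = 8            # drives in the array (assumed)
afr = 0.02              # annualized failure rate per drive, 2% (assumed)
recovery_hours = 24.0   # time until redundancy is restored (assumed)

failure_rate_per_hour = afr / (365 * 24)   # per-drive failure rate (lambda)
surviving = n_drives - 1                   # drives still at risk during recovery

# P(at least one surviving drive fails within the recovery window)
p_second_failure = 1 - math.exp(-surviving * failure_rate_per_hour * recovery_hours)
print(f"Independent-failure estimate: {p_second_failure:.4%}")
# Correlated failures (same batch, same enclosure, same workload) make the
# observed probability higher than this idealized figure.
</syntaxhighlight>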
=== <span class="anchor" id="URE"></span><span class="anchor" id="UBE"></span><span class="anchor" id="LSE"></span>Unrecoverable read errors during rebuild ===
Unrecoverable read errors (URE) present as sector read failures, also known as latent sector errors (LSE). The associated media assessment measure, unrecoverable bit error (UBE) rate, is typically guaranteed to be less than one bit in 10<sup>15</sup>{{Disputed inline|Talk|date=October 2020}} for enterprise-class drives ([[SCSI]], [[Fibre Channel|FC]], [[Serial Attached SCSI|SAS]] or SATA), and less than one bit in 10<sup>14</sup>{{Disputed inline|Talk|date=October 2020}} for desktop-class drives (IDE/ATA/PATA or SATA). Increasing drive capacities and large RAID 5 instances have led to the maximum error rates being insufficient to guarantee a successful recovery, due to the high likelihood of such an error occurring on one or more remaining drives during a RAID set rebuild.<ref name="Patterson_1994" />{{Obsolete source|reason=This source is 26 years old|date=October 2020}}<ref name="mojo2010">{{cite web|title=Does RAID 6 stop working in 2019?|url=http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/|first=Robin|last=Harris|publisher=TechnoQWAN|work=StorageMojo.com|date=2010-02-27|access-date=2013-12-17}}</ref> When rebuilding, parity-based schemes such as RAID 5 are particularly prone to the effects of UREs, as they affect not only the sector where they occur but also reconstructed blocks using that sector for parity computation.<ref>J.L. Hafner, V. Dheenadhayalan, K. Rao, and J.A. Tomlin. [https://www.usenix.org/legacy/event/fast05/tech/full_papers/hafner_matrix/hafner_matrix_html/matrix_hybrid_fast05.html "Matrix methods for lost data reconstruction in erasure codes"]. USENIX Conference on File and Storage Technologies, Dec. 13–16, 2005.</ref>

Double-protection parity-based schemes, such as RAID 6, attempt to address this issue by providing redundancy that allows double-drive failures; as a downside, such schemes suffer from an elevated write penalty—the number of times the storage medium must be accessed during a single write operation.<ref>{{Cite web|url=http://www.storagecraft.com/blog/raid-performance/|title=Understanding RAID Performance at Various Levels|last=Miller|first=Scott Alan|date=2016-01-05|website=Recovery Zone|publisher=StorageCraft|access-date=2016-07-22}}</ref> Schemes that duplicate (mirror) data in a drive-to-drive manner, such as RAID 1 and RAID 10, have a lower risk from UREs than those using parity computation or mirroring between striped sets.<ref name="UREs" /><ref>{{cite web |url=http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt |title=RAID 5 versus RAID 10 (or even RAID 3, or RAID 4) |date=March 2, 2011 |access-date=October 30, 2014 |first=Art S. |last=Kagel |website=miracleas.com |url-status=dead |archive-url=https://web.archive.org/web/20141103162704/http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt |archive-date=November 3, 2014 }}</ref> [[#SCRUBBING|Data scrubbing]], as a background process, can be used to detect and recover from UREs, effectively reducing the risk of them happening during RAID rebuilds and causing double-drive failures. The recovery of UREs involves remapping of affected underlying disk sectors, utilizing the drive's sector remapping pool; in the case of UREs detected during background scrubbing, data redundancy provided by a fully operational RAID set allows the missing data to be reconstructed and rewritten to a remapped sector.<ref>{{cite book |first1=M. |last1=Baker |first2=M. |last2=Shah |first3=D.S.H. |last3=Rosenthal |first4=M. |last4=Roussopoulos |first5=P. |last5=Maniatis |first6=T. |last6=Giuli |first7=P. |last7=Bungale |title=Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006 |chapter=A fresh look at the reliability of long-term digital storage |date=April 2006 |pages=221–234 |doi=10.1145/1217935.1217957 |isbn=1595933220 |s2cid=7655425 }}</ref><ref>{{Cite book |chapter-url=http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf |first1=L.N. |last1=Bairavasundaram |first2=G.R. |last2=Goodson |first3=S. |last3=Pasupathy |first4=J. |last4=Schindler |title=Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems |chapter=An analysis of latent sector errors in disk drives |date=June 12–16, 2007 |pages=289–300 |doi=10.1145/1254882.1254917 |isbn=9781595936394 |s2cid=14164251 }}</ref>
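The likelihood of hitting a URE while rebuilding, and how long a rebuild can take, can be estimated with a short calculation. The following Python sketch uses illustrative figures; the drive size, array width, UBE rate and sustained transfer rate are assumptions rather than vendor specifications.

<syntaxhighlight lang="python">
# Chance of encountering at least one unrecoverable read error (URE) while
# reading every surviving drive during a RAID 5 rebuild, plus a rough estimate
# of the rebuild duration. All figures are illustrative assumptions.
import math

drive_bytes = 8e12      # 8 TB drives (assumed)
n_drives = 6            # RAID 5 array width (assumed)
ube_rate = 1e-14        # worst-case desktop-class rate: one error per 1e14 bits read
throughput = 200e6      # sustained sequential transfer rate, 200 MB/s (assumed)

bits_read = (n_drives - 1) * drive_bytes * 8              # every surviving drive is read in full
p_no_ure = math.exp(bits_read * math.log1p(-ube_rate))    # per-bit independence assumed

print(f"Bits read during rebuild:       {bits_read:.2e}")
print(f"P(at least one URE in rebuild): {1 - p_no_ure:.1%}")
print(f"Minimum rebuild time:           {drive_bytes / throughput / 3600:.1f} hours")
</syntaxhighlight>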
=== Increasing rebuild time and failure probability ===
Drive capacity has grown at a much faster rate than transfer speed, and error rates have only fallen a little in comparison. Therefore, larger-capacity drives may take hours if not days to rebuild, during which time other drives may fail or as-yet-undetected read errors may surface. The rebuild speed is also limited if the entire array is still in operation at reduced capacity.<ref>Patterson, D., Hennessy, J. (2009). ''Computer Organization and Design''. New York: Morgan Kaufmann Publishers. pp. 604–605.</ref> Given an array with only one redundant drive (which applies to RAID levels 3, 4 and 5, and to "classic" two-drive RAID 1), a second drive failure would cause complete failure of the array.

Even though individual drives' [[mean time between failure]]s (MTBF) has increased over time, this increase has not kept pace with the increased storage capacity of the drives. The time to rebuild the array after a single drive failure, as well as the chance of a second failure during a rebuild, have increased over time.<ref name="StorageForum">{{cite web |url=http://www.enterprisestorageforum.com/technology/features/article.php/3839636 |title=RAID's Days May Be Numbered |last=Newman |first=Henry |date=2009-09-17 |access-date=2010-09-07 |work=EnterpriseStorageForum}}</ref> Some commentators have declared that RAID 6 is only a "band aid" in this respect, because it only kicks the problem a little further down the road.<ref name="StorageForum" /> However, according to the 2006 [[NetApp]] study of Berriman et al., the chance of failure decreases by a factor of about 3,800 (relative to RAID 5) for a proper implementation of RAID 6, even when using commodity drives.<ref name="ACMQ" />{{cnf}} Nevertheless, if the currently observed technology trends remain unchanged, in 2019 a RAID 6 array will have the same chance of failure as its RAID 5 counterpart had in 2010.<ref name="ACMQ" />{{Unreliable source?|date=October 2020}}

Mirroring schemes such as RAID 10 have a bounded recovery time, as they require only the copy of a single failed drive, compared with parity schemes such as RAID 6, which require the copy of all blocks of the drives in an array set. Triple parity schemes, or triple mirroring, have been suggested as one approach to improve resilience to an additional drive failure during this large rebuild time.<ref name="ACMQ">{{cite web |title=Triple-Parity RAID and Beyond. ACM Queue, Association for Computing Machinery |url=https://queue.acm.org/detail.cfm?id=1670144 |first=Adam |last=Leventhal |date=2009-12-01 |access-date=2012-11-30}}</ref>{{Unreliable source?|date=October 2020}}

=== Atomicity<span class="anchor" id="WRITE-HOLE"></span> ===
<!-- [[RAID 5 write hole]] redirects here. -->
A system crash or other interruption of a write operation can result in states where the parity is inconsistent with the data due to non-atomicity of the write process, such that the parity cannot be used for recovery in the case of a disk failure. This is commonly termed the ''write hole'', which is a known data corruption issue in older and low-end RAIDs, caused by interrupted destaging of writes to disk.<ref name="RRG">{{cite web|title="Write Hole" in RAID5, RAID6, RAID1, and Other Arrays|url=http://www.raid-recovery-guide.com/raid5-write-hole.aspx|publisher=ZAR team|access-date=15 February 2012}}</ref> The write hole can be addressed in a few ways (a minimal illustration follows the list below):

* [[Write-ahead logging]].
** Hardware RAID systems use an onboard nonvolatile cache for this purpose.<ref name=Danti>{{cite web |last=Danti |first=Gionatan |title=write hole: which RAID levels are affected? |url=https://serverfault.com/a/1002509 |website=Server Fault |language=en}}</ref>
** mdadm can use a dedicated journaling device (to avoid a performance penalty, [[SSD]]s and [[Non-volatile memory|NVM]]s are typically preferred) for this purpose.<ref>{{cite web|url=https://lwn.net/Articles/673953/|title=ANNOUNCE: mdadm 3.4 - A tool for managing md Soft RAID under Linux [LWN.net]|website=lwn.net }}</ref><ref>{{cite web|url=https://lwn.net/Articles/665299/|title=A journal for MD/RAID5 [LWN.net]|website=lwn.net }}</ref>
* Write [[intent log]]ging. [[mdadm]] uses a "write-intent bitmap". If it finds any locations marked as incompletely written at startup, it resynchronizes them. This closes the write hole but does not protect against loss of in-transit data, unlike a full WAL.<ref name=Danti/><ref>{{man|4|md|Linux}}</ref>
* Partial parity. [[mdadm]] can save a "partial parity" that, when combined with modified chunks, recovers the original parity. This closes the write hole, but again does not protect against loss of in-transit data.<ref>{{cite web |title=Partial Parity Log |url=https://www.kernel.org/doc/html/latest/driver-api/md/raid5-ppl.html |website=The Linux Kernel documentation}}</ref>
* Dynamic stripe size. [[RAID-Z]] ensures that each block is its own stripe, so every block is complete. Copy-on-write ([[Copy-on-write|COW]]) transactional semantics guard metadata associated with stripes.<ref name="RAID-Z">{{cite web |url=https://blogs.oracle.com/bonwick/en_US/entry/raid_z |title=RAID-Z |website=Jeff Bonwick's Blog |publisher=[[Oracle Corporation|Oracle]] Blogs |date=2005-11-17 |access-date=2015-02-01 |first=Jeff |last=Bonwick |url-status=dead |archive-url=https://web.archive.org/web/20141216015058/https://blogs.oracle.com/bonwick/en_US/entry/raid_z |archive-date=2014-12-16 }}</ref> The downside is I/O fragmentation.<ref name="b.PoO"/>
* Avoiding overwriting used stripes. [[bcachefs]], which uses a copying garbage collector, chooses this option. COW again protects references to striped data.<ref name="b.PoO">{{cite web |last1=Overstreet |first1=Kent |title=bcachefs: Principles of Operation |url=https://bcachefs.org/bcachefs-principles-of-operation.pdf |access-date=10 May 2023 |date=18 Dec 2021}}</ref>
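The failure mode that these approaches guard against can be illustrated with a toy single-stripe model. The following Python sketch is a simplified illustration, not the behavior of any particular RAID implementation: a data block update reaches the disk, a simulated crash prevents the matching parity update, and a later reconstruction that trusts the stale parity silently returns wrong data.

<syntaxhighlight lang="python">
# Toy single-stripe model of the RAID 5 write hole. Purely illustrative;
# it only shows how a non-atomic data+parity update leaves parity inconsistent.

def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks (how RAID 5 parity is computed)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)

# A healthy three-drive stripe: two data blocks plus their parity.
d0 = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks(d0, d1)

# Non-atomic update: the new contents of d0 are destaged to disk, but a crash
# happens before the matching parity write completes.
d0 = b"CCCC"                      # new data is on disk
# parity = xor_blocks(d0, d1)     # never executed -- this is the write hole

# Later, the drive holding d1 fails and d1 is reconstructed from d0 and parity.
reconstructed_d1 = xor_blocks(d0, parity)
print("expected:", d1, "reconstructed:", reconstructed_d1)   # silently wrong
assert reconstructed_d1 != d1
</syntaxhighlight>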
The write hole is a little-understood and rarely mentioned failure mode for redundant storage systems that do not utilize transactional features. Database researcher [[Jim Gray (computer scientist)|Jim Gray]] wrote "Update in Place is a Poison Apple" during the early days of relational database commercialization.<ref>{{cite web |last1=Gray |first1=Jim |title=The Transaction Concept: Virtues and Limitations (Invited Paper) |url=http://www.informatik.uni-trier.de/~ley/db/conf/vldb/Gray81.html |publisher=VLDB [Very Large Data Bases] 1981 |archive-url=https://web.archive.org/web/20080611230227/http://www.informatik.uni-trier.de/~ley/db/conf/vldb/Gray81.html |archive-date=2008-06-11 |pages=144–154 |date=2008-06-11 |url-status=dead}}</ref>

=== Write-cache reliability ===
There are concerns about write-cache reliability, specifically regarding devices equipped with a [[write-back cache]], which is a caching system that reports the data as written as soon as it is written to cache, as opposed to when it is written to the non-volatile medium. If the system experiences a power loss or other major failure, the data may be irrevocably lost from the cache before reaching the non-volatile storage. For this reason, good write-back cache implementations include mechanisms, such as redundant battery power, to preserve cache contents across system failures (including power failures) and to flush the cache at system restart time.<ref>{{Cite web|url=https://www.snia.org/education/online-dictionary/w|title=Definition of write-back cache at SNIA dictionary|website=www.snia.org}}</ref>
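As a toy illustration of this behavior (not a model of any particular controller), the following Python sketch acknowledges writes as soon as they reach a volatile cache; a simulated power loss before the cache is flushed discards them, which is the loss that battery backing and restart-time flushing are meant to prevent.

<syntaxhighlight lang="python">
# Toy model of a write-back cache acknowledging writes before they reach
# non-volatile storage. Purely illustrative; not any real controller's behavior.

class WriteBackCache:
    def __init__(self) -> None:
        self.cache: dict[int, bytes] = {}   # volatile cache contents
        self.disk: dict[int, bytes] = {}    # non-volatile medium

    def write(self, block: int, data: bytes) -> str:
        self.cache[block] = data
        return "acknowledged"               # reported as written immediately

    def flush(self) -> None:
        self.disk.update(self.cache)        # destage cached writes to the medium
        self.cache.clear()

    def power_loss(self) -> None:
        self.cache.clear()                  # volatile contents are lost


dev = WriteBackCache()
print(dev.write(0, b"important data"))      # caller believes the write is durable
dev.power_loss()                            # crash before flush() discards it
print("block 0 on disk:", dev.disk.get(0))  # None: the acknowledged write is gone
</syntaxhighlight>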