IT disaster recovery
{{short description|Maintaining or reestablishing vital information technology infrastructure}}
{{About|a sub-practice of [[business continuity planning]] (BCP)|societal disaster recovery|emergency management}}
'''IT disaster recovery''' (also, simply '''disaster recovery''' ('''DR''')) is the process of maintaining or reestablishing vital [[IT infrastructure|infrastructure]] and [[information systems|systems]] following a [[natural disaster|natural]] or [[man-made hazards|human-induced]] [[disaster]], such as a storm or battle. DR employs policies, tools, and procedures with a focus on [[IT systems]] supporting critical business functions.<ref>{{cite web |url=http://continuity.georgetown.edu/dr/ |archive-url=https://web.archive.org/web/20120226115053/http://continuity.georgetown.edu/dr/ |title=Systems and Operations Continuity: Disaster Recovery |publisher=Georgetown University - University Information Services |archive-date=26 Feb 2012 |access-date=20 July 2024}}</ref> Business continuity (BC) involves keeping all essential aspects of a business functioning despite significant disruptive events; DR can therefore be considered a subset of BC.<ref>{{cite web|title=Disaster Recovery and Business Continuity |url=http://www-304.ibm.com/partnerworld/gsd/solutiondetails.do?solution=44832 |archive-url=https://web.archive.org/web/20130111203921/http://www-304.ibm.com/partnerworld/gsd/solutiondetails.do?solution=44832&expand=true&lc=en |archive-date=January 11, 2013 |publisher=[[IBM]] |access-date=20 July 2024}}</ref><ref>{{cite web |url=https://drii.org/what-is-business-continuity-management |title=What is Business Continuity Management? |publisher=Disaster Recovery Institute International |access-date=20 July 2024}}</ref> DR assumes that the primary site is not immediately recoverable and restores data and services to a secondary site.

==IT service continuity==
{{see also|Business continuity and disaster recovery auditing}}
'''IT service continuity (ITSC)''' is a subset of BCP,<ref>{{cite web |website=ForbesMiddleEast.com |title=Defending The Data Strata |url=https://www.forbesmiddleeast.com/en/defending-the-data-strata |date=December 24, 2013 }}{{Dead link|date=August 2024 |bot=InternetArchiveBot |fix-attempted=yes }}</ref> which relies on the metrics (frequently used as [[key risk indicators]]) of recovery point/time objectives. It encompasses '''IT disaster recovery planning''' and the wider '''IT resilience planning'''. It also incorporates IT infrastructure and [[IT service|services]] related to [[telecommunications|communications]], such as [[telephony]] and [[data communications]].<ref>{{cite web |website=[[Association for Computing Machinery|ACM]].com (ACM Digital Library) |title=Information systems continuity process |author1=M. Niemimaa |author2=Steven Buchanan |date=March 2017 |url=https://dl.acm.org/citation.cfm?id=3062955}}</ref><ref>{{cite magazine |magazine=Disaster Recovery Journal |url=https://www.drj.com/images/journal/fall-2017-volume30-issue3/2017_ITServiceDir.pdf |title=2017 IT Service Continuity Directory |access-date=2018-11-30 |archive-date=2018-11-30 |archive-url=https://web.archive.org/web/20181130084451/https://www.drj.com/images/journal/fall-2017-volume30-issue3/2017_ITServiceDir.pdf |url-status=dead }}</ref>

===Principles of backup sites===
{{Main|Backup site}}
Planning includes arranging for backup sites, whether they are "hot" (operating prior to a disaster), "warm" (ready to begin operating), or "cold" (requires substantial work to begin operating), and standby sites with hardware as needed for continuity. In 2008, the [[British Standards Institution]] launched a specific standard supporting Business Continuity Standard [[BS 25999]], titled BS 25777, specifically to align computer continuity with business continuity.
This was withdrawn following the publication in March 2011 of ISO/IEC 27031, "Security techniques – Guidelines for information and communication technology readiness for business continuity."<ref>{{Cite web|website=Business Continuity Forum|date=2012-05-03|title=ISO 22301 to be published Mid May - BS 25999-2 to be withdrawn|url=https://www.continuityforum.org/content/news/165318/iso-business-continuity-standard-22301-replace-bs-25999-2|access-date=2021-11-20|language=en}}</ref> [[ITIL]] has defined some of these terms.<ref>{{Cite web|url=https://www.axelos.com/resource-hub/glossary/|title=Browse the Resource Hub for all the latest content | Axelos|website=www.axelos.com}}</ref>

===Recovery Time Objective===
The '''Recovery Time Objective (RTO)'''<ref name=Forb.En>{{cite magazine |magazine=[[Forbes]] |date=April 30, 2015 |url=https://www.forbes.com/sites/sungardas/2015/04/30/like-the-nfl-draft-is-the-clock-the-enemy-of-your-recovery-time-objective |title=Like The NFL Draft, Is The Clock The Enemy Of Your Recovery Time}}</ref><ref name=Forb.R3>{{cite magazine |magazine=[[Forbes]] |date=October 10, 2013 |url=https://www.forbes.com/sites/sungardas/2013/10/29/three-reasons-you-cant-meet-your-disaster-recovery-time-objectives |title=Three Reasons You Can't Meet Your Disaster Recovery Time}}</ref> is the targeted duration of time and a service level within which a [[business process]] must be restored after a disruption in order to avoid a break in business continuity.<ref name=druva>{{cite web |url=http://www.druva.com/blog/2008/03/22/understanding-rpo-and-rto |title=Understanding RPO and RTO |publisher=DRUVA |date=2008 |access-date=February 13, 2013}}</ref> According to business continuity planning methodology, the RTO is established during the [[business impact analysis]] (BIA) by the owner(s) of the process, including identifying time frames for alternate or manual workarounds.
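
As an illustration only, checking a measured recovery against an RTO can be sketched in Python as below. The function name <code>meets_rto</code>, the four-hour RTO, and the timestamps are hypothetical and are not taken from any cited source.

```python
from datetime import datetime, timedelta


def meets_rto(disruption_start: datetime,
              service_restored: datetime,
              rto: timedelta) -> bool:
    """Return True if the process was restored within the RTO.

    Hypothetical helper: the elapsed time between the disruption and
    the restoration is compared against the targeted duration.
    """
    if service_restored < disruption_start:
        raise ValueError("service_restored precedes disruption_start")
    return service_restored - disruption_start <= rto


# Assumed example values: a 4-hour RTO agreed during the BIA.
rto = timedelta(hours=4)
start = datetime(2024, 1, 1, 9, 0)

# Restored after 3.5 hours, which is within the objective.
print(meets_rto(start, datetime(2024, 1, 1, 12, 30), rto))
# Restored after 5 hours, which misses the objective.
print(meets_rto(start, datetime(2024, 1, 1, 14, 0), rto))
```

The sketch treats a recovery that takes exactly as long as the RTO as meeting the objective; an organization could equally choose a strict comparison.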

[[File:RPO RTO example converted.png|thumb|500px|<small>Example showing longer 'actual' times that do NOT meet either RPO or RTOs ('objectives'). Diagram provides schematic representation of the terms [[Recovery Point Objective|RPO]] and RTO.</small>]]
RTO is a complement of RPO. The limits of acceptable or "tolerable" [[IT service continuity|ITSC]] performance are measured by RTO and RPO in terms of time lost from normal business process functioning and data lost or not backed up during that period.<ref name="druva" /><ref name="TTdiff">{{Cite web |url=https://searchstorage.techtarget.com/feature/What-is-the-difference-between-RPO-and-RTO-from-a-backup-perspective |title=How to fit RPO and RTO into your backup and recovery plans |website=SearchStorage |access-date=2019-05-20}}</ref>

====Recovery Time Actual====
'''Recovery Time Actual (RTA)''' is the critical metric for business continuity and disaster recovery.<ref name=Forb.En/> The business continuity group conducts timed rehearsals (or actuals), during which RTA gets determined and refined as needed.<ref name=Forb.En/>

===Recovery Point Objective===
A '''Recovery Point Objective (RPO)''' is the maximum acceptable interval during which [[transactional data]] is lost from an IT service.<ref name=druva/> For example, if RPO is measured in minutes, then in practice, off-site mirrored backups must be [[Continuous data protection|continuously maintained]] as a daily off-site backup will not suffice.<ref>{{cite web |author=Richard May |title=Finding RPO and RTO |url=http://www.virtualdcs.co.uk/blog/business-continuity-planning-rpo-and-rto.html|url-status=dead |archive-url=https://web.archive.org/web/20160303224604/http://www.virtualdcs.co.uk/blog/business-continuity-planning-rpo-and-rto.html|archive-date=2016-03-03}}</ref>

====Relationship to RTO====
A recovery that is not instantaneous restores transactional data over some interval without incurring significant risks or losses.<ref name=druva/> RPO measures the
maximum time in which recent data might have been permanently lost; it is not a direct measure of the quantity of data lost. For instance, if the BC plan is to restore up to the last available backup, then the RPO is the interval between such backups. However, RPO is not determined by the existing backup regime; instead, the BIA determines the RPO for each service. When off-site data is required, the period during which data might be lost may start when backups are prepared, not when the backups are secured off-site.<ref name=TTdiff/>

===Mean times===
The recovery metrics can be converted to, or used alongside, [[failure]] metrics. Common measurements include [[mean time between failures]] (MTBF), [[mean time to first failure]] (MTFF), [[mean time to repair]] (MTTR), and [[mean down time]] (MDT).

===Data synchronization points===
A data synchronization point<ref>{{cite web |date=May 14, 2013 |title=Data transfer and synchronization between mobile systems |url=http://www.freepatentsonline.com/8442943.html}}</ref> is the point in time at which a backup is completed. Update processing is halted while a disk-to-disk copy is completed. The backup<ref>{{cite web |title=Amendment #5 to S-1 |website=SEC.gov |url=https://www.sec.gov/Archives/edgar/data/1519917/000119312512125661/d179347ds1a.htm |quote=real-time ... provide redundancy and back-up to ...}}</ref> copy reflects the state of the data at the time of that copy operation, not at the time the data is later copied to tape or transmitted elsewhere.

===System design===
RTO and RPO must be balanced, taking business risk into account, along with other system design criteria.<ref>{{cite book|chapter=Setting the Maximum Tolerable Downtime -- setting recovery objectives|pages=19–22 |title=IT Disaster Recovery Planning For Dummies |author=Peter H. Gregory |publisher=Wiley |isbn=978-1118050637 |chapter-url=https://books.google.com/books?id=YC49DXW-_60C&pg=PA20|date=2011-03-03 }}</ref> RPO is tied to the times backups are secured offsite.
Sending synchronous copies to an offsite mirror allows for most unforeseen events. The use of physical transportation for tapes (or other transportable media) is common. Recovery can be activated at a predetermined site. Shared offsite space and hardware complete the package.<ref>{{cite book|title=Information Security for Managers |page=177|url=https://books.google.com/books?isbn=1349101370 |isbn=1349101370|author1=William Caelli |author2=Denis Longley |year=1989|publisher=Springer }}</ref> For high volumes of high-value transaction data, hardware can be split across multiple sites.

==History==
Planning for disaster recovery and information technology (IT) developed in the mid to late 1970s as computer center managers began to recognize the dependence of their organizations on their computer systems.

At that time, most systems were batch-oriented [[mainframe]]s. An offsite mainframe could be loaded from backup tapes pending recovery of the primary site; [[downtime]] was relatively less critical.

The disaster recovery industry<ref>{{cite news |newspaper=[[The New York Times]] |title=Catastrophe? It Can't Possibly Happen Here |url=https://www.nytimes.com/1995/01/29/business/catastrophe-it-can-t-possibly-happen-here.html |quote=.. patient records |date=January 29, 1995}}</ref><ref>{{cite web |website=[[The New York Times]] |url=https://www.nytimes.com/1994/10/09/realestate/commercial-property-disaster-recovery-business-whose-clients-hope-never-use-it.html |title=Commercial Property/Disaster Recovery |quote=...the disaster-recovery industry has grown to |date=October 9, 1994}}</ref> developed to provide backup computer centers. Sungard Availability Services was one of the earliest such centers, located in Sri Lanka (1978).<ref>{{cite news |newspaper=The Irish Times |url=https://www.irishtimes.com/business/technology/us-tech-firm-sungard-announces-50-jobs-for-dublin-1.2267857 |title=US tech firm Sungard announces 50 jobs for Dublin |quote=Sungard ..
founded 1978 |author=Charlie Taylor |date=June 30, 2015}}</ref><ref>{{cite web |url=http://www.ft.lk/it-telecom-tech/sungard-to-be-a-vital-presence-in-the-banking-industry/50-7581 |title=SunGard to be a vital presence in the banking industry |publisher=Wijeya Newspapers Ltd. |quote=SunGard ... Sri Lanka's future. |date=November 12, 2010 |author=Cassandra Mascarenhas}}</ref>

During the 1980s and 90s, computing grew exponentially, including internal corporate timesharing, online data entry and [[Real-time computing|real-time processing]]. [[Availability]] of IT systems became more important.

Regulatory agencies became involved; availability objectives of 2, 3, 4 or 5 nines (99.999%) were often mandated, and [[high-availability]] solutions for [[hot-site]] facilities were sought.{{Citation needed|date=November 2018}}

IT service continuity became essential as part of Business Continuity Management (BCM) and Information Security Management (ISM) as specified in ISO 22301 and ISO/IEC 27001 respectively.

The rise of [[cloud computing]] since 2010 created new opportunities for system resiliency. Service providers absorbed the responsibility for maintaining high service levels, including availability and reliability. They offered highly resilient network designs. [[Recovery as a Service]] (RaaS) is widely available and promoted by the [[Cloud Security Alliance]].<ref>[https://cloudsecurityalliance.org/download/secaas-category-9-bcdr-implementation-guidance/ ''SecaaS Category 9 // BCDR Implementation Guidance''] CSA, retrieved 14 July 2014.</ref>

==Classification==
Disasters can be the result of three broad categories of threats and hazards.
* Natural hazards include acts of nature such as floods, hurricanes, tornadoes, earthquakes, and epidemics.
* Technological hazards include accidents or the failures of systems and structures such as pipeline explosions, transportation accidents, utility disruptions, dam failures, and accidental hazardous material releases.
* Human-caused threats include intentional acts such as active assailant attacks, chemical or biological attacks, cyber attacks against data or infrastructure, sabotage, and war.
Preparedness measures for all categories and types of disasters fall into the five mission areas of prevention, protection, mitigation, response, and recovery.<ref>{{Cite web|url=https://www.fema.gov/media-library-data/1527613746699-fa31d9ade55988da1293192f1b18f4e3/CPG201Final20180525_508c.pdf|title=Threat and Hazard Identification and Risk Assessment (THIRA) and Stakeholder Preparedness Review (SPR): Guide Comprehensive Preparedness Guide (CPG) 201, 3rd Edition|date=May 2018|publisher=US Department of Homeland Security}}</ref>

==Planning==
Research supports the idea that implementing a more holistic pre-disaster planning approach is more cost-effective. Every $1 spent on hazard mitigation (such as a [[disaster recovery plan]]) saves society $4 in response and recovery costs.<ref>{{cite web |title=Post-Disaster Recovery Planning Forum: How-To Guide, Prepared by Partnership for Disaster Resilience |publisher=University of Oregon's Community Service Center, (C) 2007, www.OregonShowcase.org |url=http://1.usa.gov/1IBkvv0 |access-date=October 29, 2018 }}{{Dead link|date=February 2024 |bot=InternetArchiveBot |fix-attempted=yes }}</ref>

2015 disaster recovery statistics suggest that downtime lasting for one hour can cost<ref>{{Cite web|url=http://www.techadvisory.org/2016/01/the-importance-of-disaster-recovery/|title=The Importance of Disaster Recovery|access-date=October 29, 2018}}</ref>{{Failed verification|date=January 2025}}
* small companies $8,000,
* mid-size organizations $74,000, and
* large enterprises $700,000 or more.
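
As a rough illustration, the hourly figures listed above can be scaled to an outage of a given length. The Python sketch below assumes that cost grows linearly with outage duration, which is a simplification; the names <code>HOURLY_DOWNTIME_COST</code> and <code>outage_cost</code> are hypothetical.

```python
# Hourly downtime costs in US dollars, copied from the list above.
# The size labels are hypothetical dictionary keys.
HOURLY_DOWNTIME_COST = {
    "small": 8_000,
    "mid-size": 74_000,
    "large": 700_000,
}


def outage_cost(company_size: str, outage_hours: float) -> float:
    """Estimate the cost of an outage, assuming cost scales linearly with time."""
    if outage_hours < 0:
        raise ValueError("outage_hours must be non-negative")
    return HOURLY_DOWNTIME_COST[company_size] * outage_hours


# A 2.5-hour outage at a mid-size organization under this linear assumption.
print(outage_cost("mid-size", 2.5))  # 185000.0
```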

As [[IT systems]] have become increasingly critical to the smooth operation of a company, and arguably the economy as a whole, the importance of ensuring the continued operation of those systems, and their rapid recovery, has increased.<ref>{{cite web|url=http://www.ready.gov/business/implementation/IT|title=IT Disaster Recovery Plan|date=25 October 2012|publisher=FEMA|access-date=11 May 2013}}</ref>

==Control measures==
Control measures are steps or mechanisms that can reduce or eliminate threats. The choice of mechanisms is reflected in a disaster recovery plan (DRP).

Control measures can be classified as controls aimed at preventing an event from occurring, controls aimed at detecting or discovering unwanted events, and controls aimed at correcting or restoring the system after a disaster or an event.

These controls are documented and exercised regularly using so-called "DR tests".

==Strategies==
The disaster recovery strategy derives from the business continuity plan.<ref name="DRI International 2021">{{cite web | title=Use of the Professional Practices framework to develop, implement, maintain a business continuity program can reduce the likelihood of significant gaps | website=DRI International | date=2021-08-16 | url=https://drii.org/resources/professionalpractices/EN | access-date=2021-09-02}}</ref> Metrics for business processes are then mapped to systems and infrastructure.<ref>Gregory, Peter. CISA Certified Information Systems Auditor All-in-One Exam Guide, 2009. {{ISBN|978-0-07-148755-9}}. Page 480.</ref> A [[cost-benefit analysis]] highlights which disaster recovery measures are appropriate. Different strategies make sense based on the cost of downtime compared to the cost of implementing a particular strategy.
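
The comparison of downtime cost against implementation cost can be sketched in Python as below. This is a simplified model under stated assumptions, not a prescribed method: each strategy is reduced to a yearly operating cost plus an expected number of downtime hours per year, and all strategy names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    annual_cost: float              # assumed yearly cost of running the strategy
    expected_downtime_hours: float  # assumed downtime per year with it in place


def total_annual_cost(strategy: Strategy, hourly_downtime_cost: float) -> float:
    """Implementation cost plus the expected cost of the remaining downtime."""
    return (strategy.annual_cost
            + strategy.expected_downtime_hours * hourly_downtime_cost)


def cheapest(strategies: list[Strategy], hourly_downtime_cost: float) -> Strategy:
    """Pick the strategy with the lowest combined yearly cost."""
    return min(strategies,
               key=lambda s: total_annual_cost(s, hourly_downtime_cost))


# Hypothetical candidates and figures, for illustration only.
candidates = [
    Strategy("tape backups sent off-site", 10_000, 48),
    Strategy("off-site replication", 60_000, 8),
    Strategy("high-availability replication", 250_000, 0.5),
]

for hourly_cost in (1_000, 10_000, 100_000):
    print(hourly_cost, cheapest(candidates, hourly_cost).name)
```

With these assumed figures, the cheaper strategy wins when an hour of downtime costs little, while the more expensive replicated options become preferable as the hourly cost of downtime rises.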

Common strategies include:
* backups made to tape and sent off-site
* backups to disk on-site (copied to off-site disk) or off-site
* replication of data off-site, so that only the systems need to be restored or synchronized, possibly via [[storage area network]] technology
* private cloud solutions that replicate metadata (VMs, templates and disks) into the private cloud. Metadata are configured as an [[XML]] representation called Open Virtualization Format, and can be easily restored
* hybrid cloud solutions that replicate both on-site and to off-site data centers. This provides instant fail-over to on-site hardware or to cloud data centers.
* high availability systems which keep both the data and system replicated off-site, enabling continuous access to systems and data, even after a disaster (often associated with [[cloud storage]]).<ref>{{cite magazine|url=http://www.inc.com/guides/201106/how-to-use-the-cloud-as-a-disaster-recovery-strategy.html|title=How to Use the Cloud as a Disaster Recovery Strategy|last=Brandon|first=John|date=23 June 2011|magazine=Inc. |access-date=11 May 2013}}</ref>

Precautionary strategies may include:
* local mirrors of systems and/or data and use of disk protection technology such as [[RAID]]
* surge protectors, to minimize the effect of power surges on delicate electronic equipment
* use of an [[uninterruptible power supply]] (UPS) and/or backup generator to keep systems going in the event of a power failure
* fire prevention/mitigation systems such as alarms and fire extinguishers
* anti-virus software and other security measures.

==Disaster recovery as a service==
{{main|Recovery as a service}}
[[File:Edge Night 02.jpg|thumb|A modular data center connected to the power grid at a utility substation]]
[[Disaster recovery as a service]] (DRaaS) is an arrangement with a third party vendor to perform some or all DR functions for scenarios such as power outages, equipment failures, cyber attacks, and natural disasters.<ref>{{Cite web|url=https://www.techtarget.com/searchdisasterrecovery/definition/disaster-recovery-as-a-service-DRaaS|title=What Is Disaster Recovery as a Service (DRaaS)? | Definition from TechTarget|website=Disaster Recovery}}</ref>

==Disaster recovery for cloud systems==
The following best practices can enhance a disaster recovery strategy for [[Cloud computing|cloud-hosted]] systems:<ref>{{Cite book |title=Engineering Resilient Systems on AWS |date=11 October 2024 |publisher=O'Reilly Media |isbn=9781098162399}}</ref><ref>{{Cite book |title=Cloud Application Architectures Building Applications and Infrastructure in the Cloud |date=April 2009 |publisher=O'Reilly Media |isbn=9780596555481}}</ref><ref>{{Cite book |title=Site Reliability Engineering How Google Runs Production Systems |date=23 March 2016 |publisher=O'Reilly Media |isbn=9781491951170}}</ref>
# '''Flexibility''': The disaster recovery strategy should be adaptable to support both partial failures (such as recovering specific files) and full environment failures.
# '''Regular testing''': Regular testing of the disaster recovery plan can verify its effectiveness and identify any weaknesses or gaps.
# '''Clear roles and permissions''': It should be clearly defined who is authorized to execute the disaster recovery plan, with separate access and permissions for these individuals. Implementing a clear [[Privilege separation|separation of permissions]] between those who can execute the recovery and those who have access to [[Backup|backup data]] helps minimize the risk of unauthorized actions.
# '''[[Software documentation|Documentation]]''': The plan should be well documented and easy to follow, so that operators can carry it out effectively during stressful situations.

==See also==
{{columns-list|colwidth=18em|
* [[Backup site]]
* [[BS 25999]]
* [[Business continuity planning]]
* [[Business continuity]]
* [[Continuous data protection]]
* [[Disaster recovery plan]]
* [[Disaster response]]
* [[Emergency management]]
* [[High availability]]
* [[Information System Contingency Plan]]
* [[Real-time recovery]]
* [[Recovery Consistency Objective]]
* [[Remote backup service]]
* [[Virtual tape library]]
}}

==References==
{{reflist}}

==Further reading==
*{{cite book |last=Barnes |first=James |title=A guide to business continuity planning |publisher=John Wiley |publication-place=Chichester, NY |year=2001 |isbn=9780470845431 |oclc=50321216 |ref=none}}
*{{cite book |last=Bell |first=Judy Kay |title=Disaster survival planning : a practical guide for businesses |publisher=Disaster Survival Planning |publication-place=Port Hueneme, CA, US |year=2000 |isbn=9780963058027 |oclc=45755917 |ref=none}}
*{{cite book |last=Fulmer |first=Kenneth |title=Business Continuity Planning : a Step-by-Step Guide With Planning Forms |publisher=Rothstein Associates, Inc |publication-place=Brookfield, CT |year=2015 |isbn=9781931332804 |id={{OCLC|712628907|905750518|1127407034}} |ref=none}}
*{{cite journal |last=DiMattia |first=Susan S |title=Planning for Continuity |journal=Library Journal |volume=126 |issue=19 |year=2001 |issn=0363-0277 |oclc=425551440 |pages=32–34 |ref=none}}
*{{cite web |last=Harney |first=John |title=Business Continuity and Disaster Recovery: Back Up Or Shut Down |website=AIIM E-DOC Magazine |date=July–August 2004 |url=http://www.edocmagazine.com/archives_articles.asp?ID=29114 |archive-url=https://web.archive.org/web/20080204225856/http://www.edocmagazine.com/archives_articles.asp?ID=29114 |archive-date=2008-02-04 |url-status=dead |ref=none |issn=1544-3647
|oclc=1058059544}}
*{{cite web |title=ISO 22301:2019(en), Security and resilience – Business continuity management systems – Requirements |url=https://www.iso.org/obp/ui/#iso:std:iso:22301:ed-2:v1:en |publisher=ISO}}
*{{cite web |title=ISO/IEC 27001:2013(en) Information technology – Security techniques – Information security management systems – Requirements |url=https://www.iso.org/obp/ui/#iso:std:iso-iec:27001:ed-2:v1:en |publisher=ISO}}
*{{cite web |title=ISO/IEC 27002:2013(en) Information technology – Security techniques – Code of practice for information security controls |url=https://www.iso.org/obp/ui/#iso:std:iso-iec:27002:ed-2:v1:en |publisher=ISO}}

==External links==
*{{cite web |title=Glossary of terms for Business Continuity, Disaster Recovery and related data mirroring & z/OS storage technology solutions |website=recoveryspecialties.com |url=https://recoveryspecialties.com/glossary.html |archive-url=https://web.archive.org/web/20201114001623/https://recoveryspecialties.com/glossary.html |archive-date=2020-11-14 |url-status=dead |access-date=2021-09-02}}
*{{cite web |title=IT Disaster Recovery Plan |website=Ready.gov |url=https://www.ready.gov/it-disaster-recovery-plan |ref=none |access-date=2021-09-02}}
*{{cite web |title=RPO (Recovery Point Objective) Explained |website=IBM |date=2019-08-08 |url=https://www.ibm.com/services/business-continuity/rpo |access-date=2021-09-02}}

{{Authority control}}

[[Category:Disaster recovery| ]]
[[Category:Backup]]
[[Category:Business continuity]]
[[Category:Data management]]
[[Category:IT risk management]]