Strategies for strengthening resilience against crises
Organizations need to use several mechanisms simultaneously to protect their data and keep it available.
In this article, we first describe the basic concepts of fault tolerance in information technology and then explain the steps for achieving it.
RPO and RTO factors
In the field of data fault tolerance, two factors are very important:
RPO (Recovery Point Objective): Measured in minutes, it indicates how far back in time the organization accepts returning to after a data crisis, that is, how many minutes of data it can afford to lose. Example: If data fails and the last 60 minutes of data are missing, is that acceptable? The lower the RPO, the more complex the data protection methods and the higher the cost to the organization. Example: A bank may not be able to accept losing even one minute of data.
RTO (Recovery Time Objective): Measured in minutes, it specifies how long a service outage you can tolerate after an incident. Example: If recovering your data from the backup device takes three hours, then any recovery from backup may mean three hours of service interruption.
To reduce RPO, you need to apply different data protection mechanisms. Mechanisms such as Replication, Backup, and Snapshot help reduce RPO. To reduce RTO, redundancy should be increased in the different layers of the data center. Example: Various mechanisms must be implemented in the physical infrastructure, the network, and the processing layer of the data center to reduce RTO.
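The relationship between a protection mechanism and the two objectives can be sketched in a few lines of Python. This is only an illustration: the ProtectionPlan class, its field names, and all figures are assumptions made for the example, and the worst-case data loss is simplified to the interval between two copies.

```python
from dataclasses import dataclass


@dataclass
class ProtectionPlan:
    """Hypothetical description of how one system is protected."""
    copy_interval_min: float  # minutes between backups, snapshots, or replication cycles
    restore_time_min: float   # measured minutes needed to bring the service back


def worst_case_data_loss_min(plan: ProtectionPlan) -> float:
    # If the failure hits just before the next copy is taken, everything
    # written since the previous copy is lost.
    return plan.copy_interval_min


def meets_objectives(plan: ProtectionPlan, rpo_min: float, rto_min: float) -> dict:
    """Compare a plan against the RPO and RTO targets (both in minutes)."""
    return {
        "rpo_ok": worst_case_data_loss_min(plan) <= rpo_min,
        "rto_ok": plan.restore_time_min <= rto_min,
    }


# Nightly backup with a three-hour restore, checked against a 60-minute RPO
# and a 240-minute RTO (illustrative targets).
nightly = ProtectionPlan(copy_interval_min=24 * 60, restore_time_min=180)
result = meets_objectives(nightly, rpo_min=60, rto_min=240)
print(result)  # {'rpo_ok': False, 'rto_ok': True}
```

In this sketch the nightly backup satisfies the RTO but not the RPO, which is the situation where a more frequent mechanism such as Snapshot or Replication would be considered.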
Uptime and SLA
Two related factors, Uptime and SLA (Service Level Agreement), are commonly discussed for data centers:
Uptime: The proportion of time that data center services are available over a period (usually one year). Example: A data center that claims 99.99% uptime allows less than 52:35 (52 minutes and 35 seconds) of disruption per year.
Table – Disruption of acceptable service in different SLA conditions
Service Level Agreement (SLA): The uptime guaranteed under a contract between two or more parties. Example: Under a 99.999% SLA, the service provider commits to less than 5:15 (5 minutes and 15 seconds) of disruption per year.
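The downtime figures quoted above follow directly from the uptime percentage. The short Python sketch below reproduces them; it assumes an average year of 365.25 days, and the function names are chosen only for this example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960; assumes an average year of 365.25 days


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Maximum yearly downtime, in minutes, for a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def format_duration(minutes: float) -> str:
    """Render minutes as 'Hh MMm SSs', truncating fractional seconds."""
    total_seconds = int(minutes * 60)
    hours, rem = divmod(total_seconds, 3600)
    mins, secs = divmod(rem, 60)
    return f"{hours}h {mins:02d}m {secs:02d}s"


for uptime in (99.0, 99.9, 99.99, 99.999):
    print(f"{uptime:>7}%  {format_duration(allowed_downtime_minutes(uptime))}")
```

For 99.99% this gives 52 minutes 35 seconds and for 99.999% it gives 5 minutes 15 seconds, matching the values in the text. A contract that defines the year as 365 days would yield slightly smaller figures.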
Each data center lists its area of responsibility, as well as the items inside and outside the agreement, in the SLA document. Example: A contract agrees on an SLA of 99.9% but places physical events such as floods and earthquakes outside the service provider’s obligations. In this case, the service provider is not required to deliver the service from two data centers that are physically more than 100 km apart.
Some data center SLA contracts cover only outages of the relevant service, while others also cover qualitative parameters such as Response Time or the communication bandwidth at different layers.
If you want to design a data center with a specific SLA or Uptime, you must specify the domains the SLA covers, because adding some items to the SLA obligations can increase the cost. Example: As soon as natural disasters are among the items to be covered, you may need two or more data centers, which increases costs.
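The effect of the SLA scope on the measured result can be illustrated with a small Python sketch. The outage log, the cause labels, and the exclusion list are all hypothetical; a real contract defines its own categories and measurement rules.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    cause: str  # free-text label; the categories used here are hypothetical


# Causes the contract places outside the provider's obligations (assumed list).
EXCLUDED_CAUSES = frozenset({"flood", "earthquake"})


def measured_uptime_percent(outages, period_minutes, excluded=EXCLUDED_CAUSES):
    """Uptime over the period, counting only the outages the SLA covers."""
    counted = sum(o.minutes for o in outages if o.cause not in excluded)
    return 100 * (1 - counted / period_minutes)


year = 365 * 24 * 60  # minutes in a 365-day year
log = [Outage(120, "power"), Outage(300, "flood"), Outage(200, "network")]

uptime = measured_uptime_percent(log, year)
print(f"{uptime:.3f}% measured, 99.9% SLA met: {uptime >= 99.9}")

# The same log with natural disasters inside the SLA scope.
strict = measured_uptime_percent(log, year, excluded=frozenset())
print(f"{strict:.3f}% measured, 99.9% SLA met: {strict >= 99.9}")
```

With floods excluded, the sample log stays within a 99.9% SLA; with floods included it does not, which is why widening the scope tends to require additional investment such as a second data center.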
Steps to create a crisis-resistant facility
Business Continuity and Disaster Recovery are closely related. Business Continuity is the set of programs, processes, policies, implementations, and monitoring used to sustain the business, whether or not there is a crisis. Disaster Recovery is the subset of Business Continuity that keeps the business healthy in the event of a crisis. The overall steps for implementing a crisis-resistant platform therefore belong to Business Continuity in the field of information technology.
Step 1: Recognize and prepare the Service Catalog
In the first step, the current situation (business platform and systems) must be identified. Systems that are directly related to the organization’s core business should be given a separate priority and level of importance. At the end of this step, you will have a list of the organization’s important electronic services with general information about each, such as the person in charge of the system and a set of general technical details. In ITIL, this collection is called the Service Catalog, and the specifications of the document are described there.
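As a rough illustration, a Service Catalog can start as a simple structured list. The Python sketch below uses invented field names, service names, and owners; it is not the ITIL schema, only a minimal stand-in for the information described above.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """One hypothetical catalog row; the fields are assumptions, not ITIL terms."""
    name: str
    owner: str           # person in charge of the system
    core_business: bool  # directly tied to the organization's core business?
    technical_notes: dict = field(default_factory=dict)


catalog = [
    ServiceEntry("email", "B. Owner", False),
    ServiceEntry("core-banking", "A. Owner", True, {"database": "Oracle DB"}),
    ServiceEntry("customer-portal", "C. Owner", True),
]

# Core-business systems first, then alphabetical within each group.
prioritized = sorted(catalog, key=lambda s: (not s.core_business, s.name))
for entry in prioritized:
    kind = "core" if entry.core_business else "supporting"
    print(entry.name, entry.owner, kind)
```

Keeping the catalog in a structured form like this makes it easier to reuse in the later analysis and design steps.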
Step 2: Analysis and preparation of the BIA (Business Impact Analysis)
The important scenarios that cause business losses (in the field of information technology), and the extent of their possible consequences, are identified and analyzed to determine how much investment the organization’s data and electronic services justify. Example: In a bank, losing even a few minutes of Core Banking system data inflicts heavy losses on the organization, so the investment in lowering RPO and RTO is greater than in a typical organization. Other systems in the same bank may have little effect on the organization. This analysis allows the data center services to be divided into priority levels according to the importance of protecting their data and keeping them in service.
One of the most important activities at this stage is to identify the inputs and dependencies of each service. A system may require several ancillary services in order to operate. Preparing a Service Dependency Model establishes a vertical connection between each system and the basic services of the data center (its different layers) and determines the horizontal connections and dependencies between different systems.
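A Service Dependency Model can be represented as a directed graph. The following Python sketch uses an invented dependency map to show the two questions such a model answers: what a service ultimately relies on, and which services are affected when one component fails.

```python
# Hypothetical dependency map: each service lists what it needs directly.
DEPENDS_ON = {
    "core-banking": ["database", "auth"],
    "customer-portal": ["core-banking", "auth"],
    "database": ["storage", "network"],
    "auth": ["network"],
    "storage": [],
    "network": [],
}


def all_dependencies(service, graph=DEPENDS_ON):
    """Return every direct and indirect dependency of a service."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # the seen set also guards against cycles
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def impacted_by(failed, graph=DEPENDS_ON):
    """Return the services that depend, directly or indirectly, on `failed`."""
    return {s for s in graph if failed in all_dependencies(s, graph)}


print(sorted(all_dependencies("customer-portal")))
# ['auth', 'core-banking', 'database', 'network', 'storage']
print(sorted(impacted_by("storage")))
# ['core-banking', 'customer-portal', 'database']
```

One practical consequence is that a service cannot realistically be given a tighter RTO than the components it depends on can deliver.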
ITIL includes recommendations for developing a Service Dependency Model. The following areas are important for developing an IT BIA:
Important business activities that use the IT platform. Example: In an insurance company, the activities that generate business turnover are usually carried out by a set of Core Insurance systems and enterprise portals. All relevant systems should therefore be analyzed in the BIA, and the damage to the organization in the event of an incident should be determined for each of them. The RPO and RTO values can then be derived from the amount of damage.
List of business systems with the effect and importance of a crisis for each: All systems related to the organization’s business, with the extent of their impact on the organization’s revenue and the losses caused by a system failure.
List of general (supporting) systems with the severity and importance of a crisis for each: All systems that provide ancillary services in the organization but have no direct impact on its revenue, along with the direct or indirect damage to the organization if any of them suffers an incident. Example: The organization’s e-mail platform usually has no direct effect on the business, but where some workflows run over e-mail, an outage may affect the quality of work of employees or representatives of the organization. Depending on the effect, different RPO/RTO values may be set for these tools and software.
Important organizational processes: Another approach is to examine the important organizational processes and the systems that support them.
Stakeholders of each system and scope of potential crisis: For each of the above, the range of service users must be clear. The parties affected by a disruption of the service, and the extent of its direct and indirect effects and possible losses, are also determined.
Categorizing electronic services in large and medium data centers is also a good way to optimize the balance between quality and price. All related systems and data can be divided into several levels (A, B, C, D), a different RPO and RTO can be set for each level, and a different technical solution can be designed for each. This creates a balance between quality and cost.
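The mapping from BIA results to levels A to D can be sketched as a simple lookup. In the Python example below, the loss thresholds, the RPO and RTO targets, and the per-system estimates are all illustrative assumptions; each organization sets its own values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    min_loss_per_hour: float  # lower bound of estimated loss, in any currency unit
    rpo_min: int
    rto_min: int


# Illustrative thresholds and targets, ordered from most to least critical.
TIERS = (
    Tier("A", 100_000, rpo_min=0, rto_min=15),
    Tier("B", 10_000, rpo_min=15, rto_min=60),
    Tier("C", 1_000, rpo_min=240, rto_min=480),
    Tier("D", 0, rpo_min=1440, rto_min=2880),
)


def assign_tier(loss_per_hour: float) -> Tier:
    """Pick the first (most critical) tier whose loss threshold is reached."""
    for tier in TIERS:
        if loss_per_hour >= tier.min_loss_per_hour:
            return tier
    return TIERS[-1]  # negative estimates fall to the lowest tier


# Hypothetical loss-per-hour estimates taken from a BIA.
bia_estimates = {"core-banking": 250_000, "customer-portal": 12_000, "email": 800}
for system, loss in bia_estimates.items():
    tier = assign_tier(loss)
    print(f"{system}: level {tier.name}, RPO {tier.rpo_min} min, RTO {tier.rto_min} min")
```

A single loss figure is a simplification; in practice the BIA may also weigh indirect effects such as reputation or regulatory exposure before a level is assigned.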
Step 3: Design
Based on the knowledge gained and the policies specified, the required technical design of the infrastructure is carried out. Various mechanisms and methods can be used in each layer of the data center to provide the RPO and RTO required at each level (A to D). These include the following:
Use of more than one data center: If the organization’s policy is to withstand natural disasters, or even events such as power outages, it should have more than one data center, so that if a crisis hits one data center, service can quickly continue from the second or third.
Use of higher-level standards in the design of the physical data center infrastructure: TIA-942 defines four levels for the quality of the physical data center conditions. Each of the four levels prescribes a set of features and strategies for increasing resistance to incidents. Level 4 uses more features and mechanisms, and more fault-tolerant ones, than Level 1. Higher levels of the standard should be used if high availability is required. Greater attention must also be paid to data center management processes and to maintenance of the physical infrastructure.
Use of redundant network connections in the user access network: Each data center must have a network connection to the outside world. Identify the different types of network communication both inside and outside the data center and, depending on need, provide redundancy in each.
Use of redundant equipment in different layers: In the newer classification, the active data center infrastructure is divided into three layers (storage, processing, and network). Equipment redundancy should be used in each layer.
Use the right software fault recovery technology: Additional equipment alone does not protect the system from incidents. Appropriate intelligent software mechanisms are needed to make optimal use of this platform and capacity. In each of the three data center layers named above, there are many technologies and software mechanisms for fault tolerance. Example: In the processing layer, a data center that uses VMware’s virtualization solution can use mechanisms such as vSphere clustering and SRM (Site Recovery Manager) as software solutions that complement hardware redundancy. Likewise, if an Oracle database is used, the High Availability mechanisms offered by Oracle can be used.
Use the right processes, policies, and staff: Many incidents can be anticipated before they occur. Precise and complete processes, comprehensive policies, and skilled, careful staff can prevent many incidents and, in times of crisis, help the situation return to normal quickly.
Up-to-date documentation: Medium and large data centers use a lot of equipment. Data center documentation makes it possible to act quickly in times of crisis, so keeping the documents up to date is important.
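The benefit of redundancy in each layer can be estimated with the standard availability formulas for serial and parallel components. The Python sketch below assumes that failures are independent and that one surviving unit per layer is enough; the 99% per-unit figure is illustrative, not vendor data.

```python
from math import prod


def parallel(availability: float, copies: int) -> float:
    """Availability of `copies` redundant units when any one of them is enough."""
    return 1 - (1 - availability) ** copies


def serial(*availabilities: float) -> float:
    """Availability of a chain in which every layer must be up."""
    return prod(availabilities)


# Three layers (storage, processing, network), each unit 99% available.
single = serial(0.99, 0.99, 0.99)  # one unit per layer
redundant = serial(*(parallel(0.99, 2) for _ in range(3)))  # two units per layer

print(f"single: {single:.4%}  redundant: {redundant:.4%}")
# single: 97.0299%  redundant: 99.9700%
```

Real systems rarely fail independently (shared power, shared software faults, human error), so figures like these are an upper bound rather than a prediction, which is why the software mechanisms, processes, and documentation listed above matter as much as the extra hardware.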
Step 4: Monitoring and testing
Use of smart monitoring: One of the most important issues in times of crisis is informing experts and managers about it, so alerting tools are effective. Optimal use of the monitoring platform can also predict and prevent many crises. Example: In data storage equipment, many disk failures are reported in advance as predicted failures, indicating that the useful life of the disk or disks is coming to an end. With a good monitoring platform and careful procedures for following up on such reports, the disk can be replaced before it fails.
Execution of crisis drills: In times of crisis, staff must be mentally prepared to respond calmly. Drills are important for avoiding human error and reducing the impact of the stress caused by the disruption.
There must be separate plans for Planned Downtime and Unplanned Downtime. During its lifetime, a data center sometimes requires Planned Downtime at different layers. Periodic maintenance operations may require some components of the data center to be shut down temporarily. Example: Most operating systems require a reboot after an update and are unavailable for a few minutes.
For important crisis scenarios, be sure to develop a written and tested plan for returning to normal. With a designed and tested solution available, the situation can be normalized in less time and the chance of human error during recovery is reduced.
Carrying out regular crisis drills: For each level of importance (A, B, C, and D), crisis drills should be performed according to the written plan, and the lessons learned should be documented.
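One simple way to document the outcome of crisis maneuvers (drills) is to record the measured recovery time and compare it with the RTO target for the level. The Python sketch below is only an illustration; the DrillResult structure, the scenarios, and the per-level targets are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    scenario: str
    level: str               # importance level A to D from the BIA
    recovery_minutes: float  # time measured during the exercise


# Assumed RTO targets per level, in minutes.
RTO_TARGET_MIN = {"A": 15, "B": 60, "C": 480, "D": 2880}


def drill_report(results):
    """Pair each drill with its target and record whether the target was met."""
    report = []
    for r in results:
        target = RTO_TARGET_MIN[r.level]
        report.append(
            (r.scenario, r.level, r.recovery_minutes, target, r.recovery_minutes <= target)
        )
    return report


drills = [
    DrillResult("primary storage loss", "A", 22),
    DrillResult("mail server failure", "C", 95),
]
report = drill_report(drills)
for scenario, level, took, target, passed in report:
    status = "within target" if passed else "missed target"
    print(f"[{level}] {scenario}: {took} min vs {target} min, {status}")
```

A drill that misses its target, as the level A scenario does here, points either to a recovery plan that needs revision or to a target that the current design cannot meet.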