This analysis focuses on mission critical IT systems and on building an infrastructure environment with higher system availability and recoverability. Mission critical applications, such as customer facing web applications, are vital to the smooth running of most businesses. System downtime translates into financial losses for the organization.
Customer facing web applications have improved business capability; however, they are also driving the need for shorter recovery time objectives and more stringent service levels from an availability perspective. We cannot treat all data center systems the same, because some systems are more critical to the operation of your company than others. Business requirements have changed over the past few years for applications that drive revenue: systems such as online retail are now expected to be available 24/7.
The proposed solution is to create an isolated infrastructure cell to host the mission critical systems that support your vital business functions. This environment is small in scale and designed specifically to host mission critical applications. Not only does this let you improve reliability at the infrastructure component level, it also enables process improvements from a service management perspective.
We will not cover a complete disaster recovery solution in this evaluation; this assessment centers strictly on IT system availability.
Critical Success Factors
The solution must meet the following requirements:
- Vital business functions need to be highly available and operate with minimum disruption.
- Highly resilient infrastructure design to withstand failures.
- Decrease the number of outages affecting mission critical systems.
- Operate within scheduled maintenance and deployment release dates.
Key Functional Requirements
IT system resiliency is determined by redundant system components including servers, networking, storage, and system recoverability. The components must be highly available to meet mission critical system needs and minimize downtime. A failure in any link in the infrastructure chain could result in the loss of IT system availability to the business. As a result, redundancy must be applied to all infrastructure components to ensure high availability.
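To make the "chain" argument concrete, here is a minimal sketch of how availability combines across layers. It assumes component failures are independent, and the per-component availability figures are hypothetical values chosen only for illustration, not measurements from any real environment.

```python
from math import prod


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a layer built from redundant copies of one component.

    The layer is up as long as at least one copy is up
    (assumes independent failures).
    """
    return 1.0 - (1.0 - component_availability) ** copies


def chain_availability(layers: list[float]) -> float:
    """Availability of layers in series: every layer must be up."""
    return prod(layers)


# Hypothetical availability of a single, non-redundant component per layer.
single = {"server": 0.99, "network": 0.999, "storage": 0.999}

no_redundancy = chain_availability(list(single.values()))
with_redundancy = chain_availability(
    [parallel_availability(a, copies=2) for a in single.values()]
)

print(f"No redundancy:   {no_redundancy:.6f}")
print(f"Two of each:     {with_redundancy:.6f}")
```

Under these assumptions the chain is always less available than its weakest layer, which is why leaving any single layer without redundancy limits the availability of the whole system.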
Recovery Point Objective (RPO)
The recovery point objective (RPO) is the maximum amount of data that can be lost as the result of a failure, and it is dictated by the data each system holds. Generally, mission critical systems cannot sustain any data loss and require a very low recovery point objective. Systems that are not mission critical can often sustain some data loss or lost transactions resulting from a system failure.
Recovery Time Objective (RTO)
Recovery time objectives (RTOs) spell out the maximum allowable time to restore IT services. RTOs are typically associated with recoverability, whereas Quality of Service (QoS) needs are associated with availability. Most organizations use RTOs to express disaster recovery requirements. For our evaluation, we are going to focus on availability solutions for protecting our IT systems from downtime caused by individual system outages, component outages, and maintenance activity.
High Availability Solution - Part 1
High availability solutions enable systems to recover quickly from failure because of redundant components and software technologies that improve business continuity. Should a server node in a high availability cluster be turned off for maintenance or fail as a result of an unplanned outage, the services are automatically moved to or restarted on another host.
Server Form Factor
VMware vSphere allows organizations to spread virtual machines (servers) across multiple physical hosts and to consolidate several workloads onto each host. This leaves two ways to size the hosts: a scale up design uses a small number of large, powerful servers, whereas a scale out design uses a larger number of smaller servers. Both aim to provide the computing power required to run our systems, but they scale differently and have a different impact on support.
Scale up advantages:
- Better resource management: Larger servers can take better advantage of the hypervisor’s resource optimization capabilities. Scaling out doesn’t make as efficient use of the resources because they are more limited on an individual node.
- Cost: Scaling up is generally cheaper, because fewer hosts are needed to deliver the same capacity.
- Fewer hypervisors: With fewer servers running the hypervisor, hypervisor upgrades, hypervisor patching, and BIOS and firmware upgrades are easier to manage, and the footprint for system monitoring is smaller.
- Larger VMs possible: Larger hosts have more resources per node, so they can accommodate VMs with large resource allocations.
- Power and cooling: In general, scaling up requires less power and cooling because there are fewer host nodes.
Scale out advantages:
- Less impact during a host failure: Having fewer VMs per server reduces the risk if a physical host failure should occur. By scaling out to small servers, fewer VMs are affected at once.
- Less expensive host redundancy: Because each host is smaller, it is significantly cheaper to maintain an N+2 host policy.
Although scaling up saves money on OPEX and infrastructure costs, the recommendation for mission critical applications is to scale out so that fewer VMs are affected by a host failure. vSphere High Availability (HA) addresses host failures by restarting the affected virtual machines. This means there is a period of downtime between the host failing and the VMs completing their reboot on other hosts.
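The trade-off can be illustrated with some simple arithmetic. The sketch below assumes VMs are spread evenly across hosts; the VM count and host counts are hypothetical numbers picked only to show the effect.

```python
import math


def vms_per_host(total_vms: int, hosts: int) -> int:
    """Worst-case number of VMs on one host when VMs are spread evenly."""
    return math.ceil(total_vms / hosts)


# Hypothetical environment: the same VM population on two cluster designs.
total_vms = 120
designs = {"scale up": 4, "scale out": 12}

for label, hosts in designs.items():
    affected = vms_per_host(total_vms, hosts)
    share = affected / total_vms
    print(f"{label}: {hosts} hosts, up to {affected} VMs "
          f"({share:.0%} of the total) restart if one host fails")
```

With the same workload, the design with more, smaller hosts exposes a smaller share of VMs to an HA restart when a single host fails.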
Host Resource Capacity
vSphere clusters provide admission control to ensure that capacity remains available for maintenance and host failure. Failover capacity is calculated by determining how many hosts can fail while still leaving enough capacity to satisfy the requirements of all powered-on virtual machines. In an N+2 design, N is the number of physical servers needed to run the VMs, and two additional physical servers are added as spare capacity. This allows for an unexpected host failure while another host is out of the cluster for maintenance, so the cluster can lose two hosts and still have enough capacity to run the mission critical systems.
This ensures that host resources are not over-committed, which could otherwise lead to poor VM performance during a multi-host failure.
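As a rough illustration of the N+2 sizing check, the sketch below tests whether a cluster can lose two hosts and still cover the VM demand. It is a simplified capacity model with identical hosts, not the algorithm vSphere admission control actually uses, and the `Cluster` type, function names, and all figures are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster of identical hosts."""
    hosts: int
    host_memory_gb: float
    host_cpu_ghz: float


def hosts_required(cluster: Cluster, demand_memory_gb: float,
                   demand_cpu_ghz: float) -> int:
    """N: the number of hosts needed to satisfy VM demand on both resources."""
    by_memory = math.ceil(demand_memory_gb / cluster.host_memory_gb)
    by_cpu = math.ceil(demand_cpu_ghz / cluster.host_cpu_ghz)
    return max(by_memory, by_cpu)


def tolerates_failures(cluster: Cluster, demand_memory_gb: float,
                       demand_cpu_ghz: float, failures: int = 2) -> bool:
    """True if the hosts left after `failures` losses still cover demand."""
    needed = hosts_required(cluster, demand_memory_gb, demand_cpu_ghz)
    return cluster.hosts - failures >= needed


# Hypothetical demand from all powered-on VMs.
demand_memory_gb, demand_cpu_ghz = 2800.0, 400.0

eight_hosts = Cluster(hosts=8, host_memory_gb=512.0, host_cpu_ghz=80.0)
seven_hosts = Cluster(hosts=7, host_memory_gb=512.0, host_cpu_ghz=80.0)

print(hosts_required(eight_hosts, demand_memory_gb, demand_cpu_ghz))
print(tolerates_failures(eight_hosts, demand_memory_gb, demand_cpu_ghz))
print(tolerates_failures(seven_hosts, demand_memory_gb, demand_cpu_ghz))
```

In this example memory is the constraining resource, so N is driven by memory demand and the cluster needs N plus two hosts to pass the check.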
Host High Availability
vSphere High Availability is a clustering solution to detect failed physical hosts and recover virtual machines. If vSphere HA discovers that a host node is down, it quickly restarts the host’s virtual machines on other servers in the cluster. This enables us to protect virtual machines and their workloads.
vSphere vMotion provides the ability to perform live migrations of a virtual machine from one ESXi host to another ESXi host without service interruption. This is a no-downtime operation; network connections are not dropped and applications continue running uninterrupted.
This makes vMotion an effective tool for load balancing VMs across host nodes within a cluster. Additionally, if a host node needs to be powered off for hardware maintenance, we use vMotion to migrate all the active virtual machines from the host going offline to another host to ensure there is no business disruption.
In my next post, we will cover storage, virtual machines, and infrastructure deployment and maintenance.