This analysis focuses on mission-critical IT systems and on building an infrastructure environment with higher system availability and faster recovery. Mission-critical applications, such as customer-facing web applications, are vital to the smooth running of most businesses. System downtime translates into financial losses for the organization.
Customer-facing web applications have improved business capability; however, they are driving the need for shorter recovery time objectives and more stringent service levels for availability. We cannot treat all data center systems the same, because some systems are more critical to the operation of your company than others. Business requirements have changed over the past few years for applications that drive revenue: systems such as online retail are now expected to be available 24/7.
The proposed solution is to create an isolated infrastructure cell to host the mission-critical systems that support your vital business functions. This environment will be small in scale and designed specifically for mission-critical applications. Not only does this let you improve reliability at the infrastructure component level, it also lets you make process improvements from a service management perspective.
We will not be covering a complete disaster recovery solution in this evaluation; this assessment centers strictly on IT system availability.
Critical Success Factors
The solution must be able to meet the following requirements:
- Vital business functions need to be highly available and operate with minimum disruption.
- The infrastructure design must be highly resilient and able to withstand failures.
- The number of outages affecting mission-critical systems must decrease.
- Operate within scheduled maintenance and deployment release dates.
Key Functional Requirements
Infrastructure Resiliency
IT system resiliency is determined by redundant system
components including servers, networking, storage, and system recoverability.
The components must be highly available to meet mission critical system needs
and minimize downtime. A failure in any link in the infrastructure chain could result
in the loss of IT system availability to the business. As a result, redundancy
must be applied to all infrastructure components to ensure high availability.
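The "chain" argument above can be made concrete with a little arithmetic. The sketch below assumes component failures are independent and uses hypothetical availability figures (0.99 per component) chosen purely for illustration; it compares a chain of single components with the same chain built from redundant pairs.

```python
from math import prod


def redundant(availability: float, copies: int) -> float:
    """Availability of identical components in parallel (any one copy is enough)."""
    return 1.0 - (1.0 - availability) ** copies


def chain(availabilities: list[float]) -> float:
    """Availability of components in series (every link must be up)."""
    return prod(availabilities)


# Hypothetical per-component availabilities, for illustration only.
components = {"server": 0.99, "network": 0.99, "storage": 0.99}

single = chain(list(components.values()))
paired = chain([redundant(a, 2) for a in components.values()])

print(f"single components:    {single:.6f}")
print(f"redundant components: {paired:.6f}")
```

Under these assumptions the chain of single components is less available than any one of its links, while pairing every link brings the whole chain above the availability of a single component.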
Recovery Point Objective (RPO)
The recovery point objective defines how much data can be lost as the result of a failure, and it is dictated by the data held on each system. Generally, mission-critical systems cannot sustain any data loss and require a very low recovery point objective. Systems that are not mission critical can often sustain some data loss or lost transactions resulting from a system failure.
Recovery Time Objective (RTO)
Recovery time objectives (RTOs) spell out the maximum
allowable time to restore IT services. RTOs are typically associated with
recoverability, whereas Quality of Service (QoS) needs are associated with
availability. Most organizations use RTOs to express disaster recovery
requirements. For our evaluation, we are going to focus on availability
solutions for protecting our IT systems from downtime caused by individual
system outages, component outages, and maintenance activity.
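To show how the two objectives are measured against an actual incident, here is a minimal sketch. The function name, timestamps, and objective values are hypothetical: the data-loss window is taken as the time between the last recovery point and the failure, and the outage as the time between the failure and service restoration.

```python
from datetime import datetime, timedelta


def meets_objectives(
    last_recovery_point: datetime,
    failure_time: datetime,
    service_restored: datetime,
    rpo: timedelta,
    rto: timedelta,
) -> tuple[bool, bool]:
    """Return (rpo_met, rto_met) for a single incident."""
    data_loss_window = failure_time - last_recovery_point
    outage = service_restored - failure_time
    return data_loss_window <= rpo, outage <= rto


# Hypothetical incident: last recovery point 4 minutes before the failure,
# service restored 16 minutes after it.
rpo_met, rto_met = meets_objectives(
    last_recovery_point=datetime(2024, 1, 1, 12, 0),
    failure_time=datetime(2024, 1, 1, 12, 4),
    service_restored=datetime(2024, 1, 1, 12, 20),
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=15),
)
print(f"RPO met: {rpo_met}, RTO met: {rto_met}")
```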
High Availability Solution - Part 1
High availability solutions enable systems to recover
quickly from failure because of redundant components and software technologies
that improve business continuity. Should a server node in a high availability
cluster be turned off for maintenance or fail as a result of an unplanned
outage, the services are automatically moved to or restarted on another host.
Host Systems
Server Form Factor
VMware vSphere allows organizations to spread virtual machines (servers) across multiple physical hosts and to consolidate workloads onto each host. A scale up design uses a small number of large, powerful servers, whereas a scale out design uses a larger number of smaller servers. Both aim to provide the computing power required to run our systems, but they scale differently and place different demands on support.
Scale up advantages:
- Better resource management: Larger servers can take better advantage of the hypervisor’s resource optimization capabilities. Scaling out doesn’t make as efficient use of the resources because they are more limited on an individual node.
- Cost: Scaling up generally saves money on operating expenses and infrastructure costs because there are fewer hosts to run.
- Fewer hypervisors: With fewer servers running the hypervisor, hypervisor upgrades, hypervisor patching, and BIOS and firmware upgrades are easier to maintain, and the footprint for system monitoring is smaller.
- Larger VMs possible: Scale up is more flexible with large VMs because each host has more resources to allocate to a single VM.
- Power and cooling: In general, scaling up requires less power and cooling because there are fewer host nodes.
Scale out advantages:
- Less impact during a host failure: Having fewer VMs per server reduces the risk if a physical host failure should occur. By scaling out to small servers, fewer VMs are affected at once.
- Less expensive host redundancy: It is significantly cheaper to maintain an N+2 host policy, because the two spare hosts are small servers rather than large ones.
Although scaling up hosts saves money on OPEX and infrastructure costs, the recommendation for mission-critical applications is to scale out so that fewer VMs are affected by any single host failure. vSphere High Availability (HA) addresses host failures by restarting the affected virtual machines. This means there is a period of downtime between the host failing and its VMs completing their reboot on other hosts.
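The trade-off can be put into rough numbers. The sketch below is illustrative only: the VM count and host counts are hypothetical, VMs are assumed to be spread evenly across the N active hosts, and both designs are assumed to deliver the same total capacity.

```python
from math import ceil


def vms_per_host(total_vms: int, active_hosts: int) -> int:
    """Worst-case number of VMs restarted when one host fails (even spread assumed)."""
    return ceil(total_vms / active_hosts)


def spare_share(active_hosts: int, spare_hosts: int = 2) -> float:
    """Fraction of the cluster's hosts that exist only as N+2 spare capacity."""
    return spare_hosts / (active_hosts + spare_hosts)


TOTAL_VMS = 60  # hypothetical workload
designs = {"scale up": 3, "scale out": 10}  # hypothetical values of N

for name, n in designs.items():
    print(
        f"{name}: up to {vms_per_host(TOTAL_VMS, n)} VMs restart per host failure, "
        f"{spare_share(n):.0%} of hosts are spare under N+2"
    )
```

With these figures the scale out design restarts fewer VMs per host failure and dedicates a smaller share of its hardware to spare capacity, which is the reasoning behind the recommendation.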
Host Resource Capacity
vSphere clustering has the capability of admission control to
ensure that capacity is available for maintenance and host failure. Failover
capacity is calculated by determining how many hosts can fail and still leave
enough capacity to satisfy the requirements of all powered-on virtual machines.
An N+2 solution, where N is the number of physical servers needed to host the VMs and two additional physical servers are added to the cluster, allows for an unexpected system failure while one host is out of the cluster for maintenance. This cluster design can sustain the loss of two hosts without disrupting mission-critical systems.
This ensures that host resources are not over-committed, which could otherwise lead to poor VM performance should there be a multi-host failure.
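The failover capacity idea can be sketched as a simple check. This is a simplified aggregate memory comparison under assumed figures, not vSphere's actual admission control calculation: it ignores CPU, per-host fragmentation, and reservations, and every number below is hypothetical.

```python
def survives_host_failures(
    host_capacity_gb: list[int], vm_demand_gb: list[int], failures: int
) -> bool:
    """True if total VM demand still fits after losing the largest `failures` hosts."""
    if failures >= len(host_capacity_gb):
        return False  # no hosts would remain
    # Worst case: the hosts with the most capacity are the ones that fail.
    remaining = sorted(host_capacity_gb)[: len(host_capacity_gb) - failures]
    return sum(vm_demand_gb) <= sum(remaining)


# Hypothetical cluster: N = 4 hosts needed for the workload, plus 2 spares.
hosts = [256] * 6        # GB of memory per host
vms = [30] * 30          # 30 VMs of 30 GB each, 900 GB in total

print(survives_host_failures(hosts, vms, failures=2))  # one in maintenance, one failed
print(survives_host_failures(hosts, vms, failures=3))  # beyond the N+2 design point
```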
Host High Availability
vSphere High Availability
is a clustering solution to detect failed physical hosts and recover
virtual machines. If vSphere HA discovers that a host node is down, it quickly
restarts the host’s virtual machines on other servers in the cluster. This
enables us to protect virtual machines and their workloads.
Live Migration
vSphere vMotion provides the ability to perform live
migrations of a virtual machine from one ESXi host to another ESXi host without
service interruption. This is a no-downtime operation; network connections are
not dropped and applications continue running uninterrupted.
This makes vMotion an effective tool for load balancing VMs
across host nodes within a cluster. Additionally, if a host node needs to be
powered off for hardware maintenance, we use vMotion to migrate all the active
virtual machines from the host going offline to another host to ensure there is
no business disruption.
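Before a host is taken offline, it helps to confirm that its VMs actually fit elsewhere. The sketch below is a hypothetical planning aid, not how vMotion or vSphere chooses destinations: it greedily assigns each VM (largest first) to the host with the most free memory, and the VM names, host names, and sizes are invented for illustration.

```python
def plan_evacuation(vms_on_host: dict[str, int], free_gb: dict[str, int]) -> dict[str, str]:
    """Map each VM on the host entering maintenance to a destination host."""
    free = dict(free_gb)  # copy so the caller's figures are not modified
    plan: dict[str, str] = {}
    # Place the largest VMs first, since they are the hardest to fit.
    for vm, size in sorted(vms_on_host.items(), key=lambda kv: kv[1], reverse=True):
        target = max(free, key=free.get)  # host with the most free memory
        if free[target] < size:
            raise ValueError(f"no host has {size} GB free for {vm}")
        free[target] -= size
        plan[vm] = target
    return plan


# Hypothetical VMs (memory in GB) and free memory on the remaining hosts.
vms_to_move = {"web01": 16, "web02": 16, "db01": 64}
remaining_hosts = {"esx02": 80, "esx03": 48}

plan = plan_evacuation(vms_to_move, remaining_hosts)
for vm, host in plan.items():
    print(f"{vm} -> {host}")
```

If any VM cannot be placed, the function raises an error, which is a signal that the maintenance window would leave the cluster short of capacity.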
In my next post, we will cover storage, virtual machines, and infrastructure deployment and maintenance.