
Setting up a high availability cluster (part 1)

When setting up robust production systems, minimizing downtime and service interruptions is often a high priority. Regardless of how reliable your systems and software are, problems can occur that can bring down your applications or your servers.

Implementing high availability for your infrastructure is a useful strategy to reduce the impact of these types of events. Highly available systems can recover from server or component failure automatically.

In this post and the following one, we are going to set up an active/standby high availability cluster from scratch with two CentOS servers. The topology used in this tutorial is shown below:

First of all, we must make sure that both hosts can see each other and are able to resolve each other's hostnames.

 

1. Configure SSH between 2 nodes

Both nodes need to be able to connect to each other through SSH using public key authentication (without a password). On each node we must create a new key pair and copy the public key to the authorized keys file of the other node.

Node1:
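A minimal sketch of this step on node1, assuming root access and that the hostname node2 resolves:

```shell
# Generate a key pair with no passphrase (default RSA key path)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Append the public key to node2's authorized_keys
ssh-copy-id root@node2
```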

Node2:
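And the mirror of the same step on node2:

```shell
# Generate a key pair with no passphrase (default RSA key path)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Append the public key to node1's authorized_keys
ssh-copy-id root@node1
```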

 

2. Install the cluster software “Pacemaker”

We must install Pacemaker on both nodes:
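On CentOS 7 this is a single yum transaction; a typical form, with package names matching the components described below:

```shell
# Install the cluster stack and the pcs command-line tool on both nodes
yum install -y pacemaker corosync pcs
```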

pacemaker: It’s responsible for all cluster-related activities, such as monitoring cluster membership, managing the services and resources, and fencing cluster members. The RPM contains 2 important components:

  • Cluster Information Base (CIB)
  • Cluster Resource Management Daemon (CRMd)

corosync: This is the framework used by Pacemaker for handling communication between the cluster nodes.

pcs: Provides a command-line interface to create, configure, and control every aspect of a Pacemaker/corosync cluster.

We must open TCP ports 2224, 3121, 21064 and UDP 5405. For that we can use the following commands in CentOS:
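With firewalld, the CentOS 7 default firewall, a sketch of this step could be:

```shell
# Open the Pacemaker/corosync ports listed above
firewall-cmd --permanent --add-port=2224/tcp --add-port=3121/tcp \
    --add-port=21064/tcp --add-port=5405/udp
firewall-cmd --reload
```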

Before the cluster can be configured, the pcs daemon must be started and enabled on both nodes. This daemon works with the pcs command-line interface to manage synchronizing the corosync configuration across all nodes in the cluster.
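On both nodes:

```shell
# Start pcsd now and make it start at boot
systemctl start pcsd.service
systemctl enable pcsd.service
```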

The installed packages will create a “hacluster” user with a disabled password. The account needs a login password in order to perform such tasks as syncing the corosync configuration, or starting and stopping the cluster on other nodes.

So now we will set a password for the hacluster user, using the same password on both nodes:
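For example, run interactively on both nodes, entering the same password each time:

```shell
# Set a login password for the hacluster account
passwd hacluster
```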

 

3. Create the cluster with “pcs” command line

On either node, use "pcs cluster auth" to authenticate pcs to the pcsd daemon on both nodes as the hacluster user:
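With the pcs version shipped in CentOS 7 this step looks roughly like the following (on newer pcs releases the equivalent command is "pcs host auth"):

```shell
# Authenticate to pcsd on both nodes; pcs prompts for the hacluster password
pcs cluster auth node1 node2 -u hacluster
```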

Next, use “pcs cluster setup” on the same node to generate and synchronize the corosync configuration. The pcs command will configure corosync to use UDP unicast messages between both nodes:
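A sketch of the setup command, where the cluster name "ha_cluster" is an arbitrary example (pcs 0.9 syntax, as shipped with CentOS 7):

```shell
# Generate and synchronize the corosync configuration across both nodes
pcs cluster setup --name ha_cluster node1 node2
```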

Authentication and encryption of the connection between cluster nodes and nodes running pacemaker is achieved using TLS-PSK encryption/authentication over TCP (port 3121 by default). This means that both the cluster node and the remote node must share the same private key. By default, this key is placed at "/etc/pacemaker/authkey" on each node.

The final corosync configuration is stored in "/etc/corosync/corosync.conf" on each node.

Now that corosync is configured, it is time to start the cluster. The command below will start the corosync and pacemaker services on both nodes of the cluster.
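For example:

```shell
# Start corosync and pacemaker on every node of the cluster
pcs cluster start --all
```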

It is now recommended to enable both services so that they start at boot. On each node:
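```shell
# Enable the cluster services at boot (run on each node)
systemctl enable corosync.service
systemctl enable pacemaker.service
```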

Finally verify the cluster status:
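```shell
# Show cluster, node and resource status
pcs status
```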

 

4. Add a resource to be used by the cluster

Before starting to modify the cluster configuration, we will use this command to verify the configuration XML from the CIB (Cluster Information Base):
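Two commands that cover this, assuming the defaults used so far:

```shell
# Dump the raw CIB XML
pcs cluster cib
# Validate the live configuration; -V prints the detected errors
crm_verify -L -V
```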

The errors state that STONITH is enabled but not configured. We will disable the STONITH feature for now and configure it later.
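```shell
# Disable fencing for now (it should be configured before production use)
pcs property set stonith-enabled=false
```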

Now, we will add our first resource that will be the virtual IP 192.168.125.10 shared by both nodes:
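A plausible form of the command; the netmask and monitor interval are example values:

```shell
# Floating IP managed by the cluster via the IPaddr2 resource agent
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
    ip=192.168.125.10 cidr_netmask=24 op monitor interval=30s
```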

We can check with “pcs status” that the resource “ClusterIP” is working and currently running on node1. So all requests made against IP 192.168.125.10 will be managed by node1.

The properties that you define for a resource tell the cluster which script to use for the resource, where to find that script and what standards it conforms to.

Standard: The standard the script conforms to. Allowed values: ocf, service, upstart, systemd, lsb, stonith
Provider: The OCF spec allows multiple vendors to supply the same resource agent. Most of the agents shipped by Red Hat use “heartbeat” as the provider.

 

5. Simulate a failover

Shut down Pacemaker and Corosync services on node1 to simulate a failover.
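```shell
# On node1: stop pacemaker and corosync in one step
pcs cluster stop node1
```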

Verify that pacemaker and corosync are no longer running on node1:
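```shell
# Both services should report as inactive/dead on node1
systemctl status pacemaker.service corosync.service
```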

Go to the other node, and check the cluster status.

So if we ping IP 192.168.125.10, it is still available because it is now running on node2.

Now, simulate node recovery by restarting the cluster stack on node1:
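Assuming node1, as in the topology of this post:

```shell
# Bring the cluster stack back up on the recovered node
pcs cluster start node1
```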

 

6. Configure an HTTP resource

Now that we have a basic but functional active/passive two-node cluster, we’re ready to add some real services. We’re going to start with Apache HTTP Server because it is a feature of many clusters and relatively simple to configure.

NOTE: Do not enable the httpd service. Services that are intended to be managed via the cluster software should never be managed by the OS.

We need to create a page for Apache to serve. On CentOS 7.1, the default Apache document root is /var/www/html, so we’ll create an index file there.  For the moment, we will simplify things by serving a static site and manually synchronizing the data between the two nodes, so run this command on both nodes:
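A minimal index page that identifies which node served it:

```shell
# Create a trivial page showing the serving node's hostname
cat <<END >/var/www/html/index.html
<html>
<body>My Test Site - $(hostname)</body>
</html>
END
```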

In order to monitor the health of your Apache instance, and recover it if it fails, the resource agent used by Pacemaker assumes the server-status URL is available.

On both nodes, we must enable the Apache status URL by adding the following lines to /etc/httpd/conf/httpd.conf:
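A common form of that snippet (CentOS 7 ships Apache 2.4, hence "Require local"):

```apache
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
```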

At this point, Apache is ready to go, and all that needs to be done is to add it to the cluster. Let’s call the resource WebSite. We need to use an OCF resource script called apache in the heartbeat namespace.
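A sketch of the resource creation; configfile and statusurl match the CentOS defaults used above, and the monitor interval is an example value:

```shell
pcs resource create WebSite ocf:heartbeat:apache \
    configfile=/etc/httpd/conf/httpd.conf \
    statusurl="http://localhost/server-status" \
    op monitor interval=1min
```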

Wait a moment, the WebSite resource isn’t running on the same host as our IP address!

To reduce the load on any one machine, Pacemaker will generally try to spread the configured resources across the cluster nodes. However, we can tell the cluster that two resources are related and need to run on the same host.

We need to instruct the cluster that WebSite can only run on the host that ClusterIP is active on. To achieve this, we use a colocation constraint that indicates it is mandatory for WebSite to run on the same node as ClusterIP. The “mandatory” part of the colocation constraint is indicated by using a score of INFINITY. The INFINITY score also means that if ClusterIP is not active anywhere, WebSite will not be permitted to run.
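```shell
# Mandatory colocation: WebSite runs only where ClusterIP is active
pcs constraint colocation add WebSite with ClusterIP INFINITY
```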

Colocation constraints are “directional”, in that they imply certain things about the order in which the two resources will have a location chosen. In this case, we’re saying that WebSite needs to be placed on the same machine as ClusterIP, which implies that the cluster must know the location of ClusterIP before choosing a location for WebSite.

Finally, we can use constraints to prefer one node over another.

In the location constraint below, we are saying that the WebSite resource prefers node1 with a score of 50. Here, the score indicates how badly we'd like the resource to run at this location.
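Using the node names from this post's topology:

```shell
# WebSite prefers node1 with a score of 50
pcs constraint location WebSite prefers node1=50
```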

To see the current placement scores, you can use a tool called crm_simulate.
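```shell
# -s shows allocation scores, -L uses the live cluster state
crm_simulate -sL
```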

There are times when an administrator needs to override the cluster and force resources to move to a specific location.  In this case, we could force the WebSite to move to node1 by updating our previous location constraint with a score of INFINITY.
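```shell
# Make the preference mandatory by raising the score to INFINITY
pcs constraint location WebSite prefers node1=INFINITY
```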

Once we’ve finished whatever activity required us to move the resources to node1 (in our case, nothing), we can allow the cluster to resume normal operation by removing the new constraint.

First, use the --full option to get the constraint's ID:
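```shell
# List all constraints together with their IDs
pcs constraint --full
```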

Then remove the desired constraint using its ID:
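The ID below is an example of the form pcs generates for the location constraint above:

```shell
pcs constraint remove location-WebSite-node1-INFINITY
```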

 

So, with the above steps we already have a functional active/passive two-node HTTP cluster. However, if the content of the website is modified on the active node (for example, a dynamic website), the passive node will not be updated, and when a failover occurs, the website served by node2 will be outdated.

In order to replicate files between both nodes of the cluster automatically, we will use DRBD. This will be explained in the next post.

4 Comments

  1. Hache, 19 March 2018

    very good

    when the next DRBD post?

  2. Hache, 19 March 2018

    Thanks
    I already saw the DRBD post; you could link to it at the end of this post.

  3. shuja, 28 April 2018

    Hi Sir,

    Thanks for the article. I configured it as described above, but when I gave a demo of this cluster configuration to my manager, he disconnected the network interface on node1 while testing. Running "pcs resource show" on node1 gave output like this:

    ClusterIP (ocf::heartbeat:IPaddr2): Started node1
    WebSite (ocf::heartbeat:apache): Started node1

    When I ran the same command on node2, the output was:

    ClusterIP (ocf::heartbeat:IPaddr2): Started node2
    WebSite (ocf::heartbeat:apache): Started node2

    Although the failover occurred successfully, the services were still showing as active on node1, so he rejected my configuration and told me to fix this issue.
    When I checked corosync.log I found the following:
    ____________________________________________
    [8621] node1 corosync notice [TOTEM ] A processor failed, forming new configuration.
    [8621] node1 corosync notice [TOTEM ] The network interface is down.
    [8621] node1 corosync notice [TOTEM ] adding new UDPU member {192.168.40.173}
    [8621] node1 corosync notice [TOTEM ] adding new UDPU member {192.168.40.172}
    [8621] node1 corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
    ________________________________________
    Kindly help me fix this issue. FYI, the HA service is enabled in the firewall on both nodes.
    My concern is why the resources are still showing as active on node1.

    • Carlos, 4 May 2018

      Hi Shuja,

      Sorry for the late reply… Since the topology I used in this post is only for demo purposes, you must take into account that if the HA interface is the one you turn off, both nodes of the cluster stop seeing each other and each one will try to keep the services running. So it is likely that node1 stops seeing node2 and keeps its own services active.

      Have you tried using a dedicated interface for cluster communications and a different interface for external access to the web servers?

      Regards,
      Carlos
