Clustering OpenSIPS for High Availability – Part 2

In part 2, we investigate the implications of using more than one IP address on an OpenSIPS server and how this affects far-end NAT traversal. We will also see how a virtual IP address can overcome these problems when clustering two OpenSIPS servers.

Part 1 reviewed why we might want to cluster two OpenSIPS servers and options for sharing common resources including the mediaproxy and MySQL database.

An overview of far-end NAT traversal

The term “far-end NAT traversal” refers to any mechanism that overcomes the problems of communicating with a remote device installed behind a NAT firewall/router. This is the usual network topology for devices connected to a domestic broadband circuit, except where the device is integrated with the router. For OpenSIPS, the challenge is to be able to send SIP requests to an IP phone that is behind a NAT firewall.

Diagram of far-end NAT traversal

Its ability to perform far-end NAT traversal is one of OpenSIPS’s key strengths when used as a registrar and location server. This involves a number of elements, including:

  1. Detecting that a remote device is behind NAT
  2. Modifying certain headers (such as Via and Contact) so they reflect the external IP and port of the device instead of the internal LAN address
  3. Flagging and remembering that a device or end-point was behind NAT
  4. Using SIP keep-alives (or “NAT Pings”) to keep the SIP signalling port open on the remote firewall
  5. Providing a media proxying service so RTP connections only need to be established out through the remote firewall, not in
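Items 1 and 2 in the list above can be illustrated with a short Python sketch. This is not OpenSIPS code and not how OpenSIPS implements the checks internally; the function names (`behind_nat`, `rewrite_contact`) are hypothetical and the Contact parsing is deliberately simplified (IPv4 literals only). The idea is simply to compare the address the phone advertises in its Contact header with the address the packet actually arrived from.

```python
import ipaddress
import re

# Private IPv4 ranges defined by RFC 1918.
RFC1918_NETS = [ipaddress.ip_network(n)
                for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

# Simplified matcher for a SIP URI: optional user part, host, optional port.
_SIP_URI = re.compile(r"sips?:(?:[^@>;]+@)?([^:;>\s]+)(?::(\d+))?")


def is_rfc1918(host: str) -> bool:
    """True if host is a literal IPv4 address inside an RFC 1918 range."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, not a literal IP address
    return any(ip in net for net in RFC1918_NETS)


def contact_host_port(contact: str):
    """Extract (host, port) from a Contact value; the port defaults to 5060."""
    m = _SIP_URI.search(contact)
    if m is None:
        raise ValueError("no SIP URI found in Contact")
    return m.group(1), int(m.group(2) or 5060)


def behind_nat(contact: str, src_ip: str, src_port: int) -> bool:
    """Item 1: the device looks NATed if it advertises a private address,
    or an address/port different from the one the packet came from."""
    host, port = contact_host_port(contact)
    return is_rfc1918(host) or (host, port) != (src_ip, src_port)


def rewrite_contact(contact: str, src_ip: str, src_port: int) -> str:
    """Item 2: replace the advertised host:port with the observed source."""
    m = _SIP_URI.search(contact)
    if m is None:
        raise ValueError("no SIP URI found in Contact")
    return contact[:m.start(1)] + f"{src_ip}:{src_port}" + contact[m.end():]


# A phone on a home LAN, seen by the server as 198.51.100.7:40312.
contact = "<sip:alice@192.168.1.20:5060>"
if behind_nat(contact, "198.51.100.7", 40312):
    print(rewrite_contact(contact, "198.51.100.7", 40312))
```

The rewritten Contact is what would be stored in the location table, so later requests are sent to the firewall's external address and port rather than to an unreachable LAN address.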

It would not be practical to go into further detail here (although it would make a great subject for a later article), but suffice to say that, from the above list, the item that causes most difficulty for a clustered OpenSIPS solution is the need to keep the SIP signalling port open on the remote firewall.

Firewall behaviour

Firewalls generally allow outbound connections – from the LAN to the Internet – but block inbound ones, unless explicitly allowed by special user-configurable rules.

Outbound connections allowed by firewall
Inbound connections blocked at the firewall

So, for example, an IP phone would have no problem sending a registration request to an OpenSIPS registrar server on the Internet, but the OpenSIPS server may find requests it sends back to the phone are blocked. However, firewalls with stateful packet inspection (which most have) will generally allow incoming messages to get through provided they look like responses to an earlier outbound message. How does the firewall decide if an incoming message is a response that should be allowed through? Typically, the response must comply with most, if not all, of the following rules:

  1. Arrive within a pre-defined time after the outbound message, typically 30 to 60 seconds
  2. Arrive at the same external port on the firewall that the outbound message was sent from
  3. The source IP address of the incoming message must match the one the outbound message was sent to
  4. The source port number used by the sending server may need to match the one the outbound message was sent to (not all firewalls require this)

If all the required criteria are matched, then the firewall allows the incoming message through. Furthermore, it forwards the message to the same device on the LAN that sent the outgoing message.
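The four rules above can be modelled in a few lines of Python. This is only a toy model of a stateful firewall, written to make the rules concrete; real firewalls differ in their details, and the class and parameter names (`Firewall`, `Pinhole`, `check_remote_port`) are my own inventions for illustration. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Pinhole:
    lan_addr: tuple       # (ip, port) of the device on the LAN
    remote: tuple         # (ip, port) the outbound message was sent to
    last_outbound: float  # time of the most recent outbound message (seconds)


class Firewall:
    def __init__(self, timeout=60.0, check_remote_port=True):
        self.timeout = timeout                      # rule 1 window
        self.check_remote_port = check_remote_port  # rule 4 is optional
        self.pinholes = {}                          # external port -> Pinhole

    def outbound(self, lan_addr, ext_port, remote, now):
        """Record (or refresh) the pinhole opened by an outbound message."""
        self.pinholes[ext_port] = Pinhole(lan_addr, remote, now)

    def inbound(self, src, ext_port, now):
        """Return the LAN address to forward to, or None if blocked."""
        hole = self.pinholes.get(ext_port)
        if hole is None:                              # rule 2: same external port
            return None
        if now - hole.last_outbound > self.timeout:   # rule 1: within the time limit
            return None
        if src[0] != hole.remote[0]:                  # rule 3: same source IP
            return None
        if self.check_remote_port and src[1] != hole.remote[1]:  # rule 4
            return None
        return hole.lan_addr


fw = Firewall(timeout=60)
phone = ("192.168.1.20", 5060)
server = ("198.51.100.10", 5060)
other_host = ("198.51.100.11", 5060)

fw.outbound(phone, 40312, server, now=0)
print(fw.inbound(server, 40312, now=30))      # ('192.168.1.20', 5060)
print(fw.inbound(other_host, 40312, now=30))  # None: wrong source IP (rule 3)
print(fw.inbound(server, 40312, now=90))      # None: pinhole timed out (rule 1)
```

Note that every outbound message resets `last_outbound`, which is why regular traffic from the phone keeps the pinhole alive.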

How does OpenSIPS get through the remote firewall?

The best solution is to configure a port forwarding rule on the user’s firewall. However, that will not always be possible because VoIP service providers cannot be sure about the conditions at the user’s end of the connection. The provider generally needs to support users irrespective of the make or model of their network equipment, their level of experience and even the restricted access some users may have to their firewall.

Instead, OpenSIPS attempts to keep the hole through the firewall open by sending a “neutral” SIP message to the phone every 30 seconds (or whatever period has been configured in the parameter settings in opensips.cfg). This keep-alive message is referred to as a “NAT Ping” in some sections of the OpenSIPS documentation. When the phone receives the message it sends back a response. Typically, the NAT Ping request is OPTIONS or NOTIFY and the response is almost immaterial as long as some response is sent. The fact that the phone sent something to OpenSIPS is enough to reset the timer on most firewalls thereby meeting the criteria for rule number 1 in the list above.
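To show what such a keep-alive looks like on the wire, here is a minimal Python sketch that builds a bare OPTIONS request. OpenSIPS generates its own NAT pings, so this is purely illustrative and not what OpenSIPS sends byte for byte; the function names and the From URI (`sip:pinger@example.invalid`) are assumptions of mine. The important detail is in `send_ping`: the socket must be bound to the same server address and port the phone registered with, otherwise the firewall will not treat the ping as part of the existing flow.

```python
import socket
import uuid


def build_options_ping(server_ip, server_port, target_ip, target_port,
                       from_uri="sip:pinger@example.invalid"):
    """Build a minimal SIP OPTIONS request addressed to the phone's
    external (post-NAT) address and port. Returns the raw bytes."""
    target_uri = f"sip:{target_ip}:{target_port}"
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 branch magic cookie
    lines = [
        f"OPTIONS {target_uri} SIP/2.0",
        f"Via: SIP/2.0/UDP {server_ip}:{server_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <{from_uri}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <{target_uri}>",
        f"Call-ID: {uuid.uuid4().hex}@{server_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",  # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_ping(server_addr, target_addr):
    """Send one ping over UDP from server_addr to target_addr.
    Binding to server_addr keeps the source IP and port the same as
    the ones the phone originally sent its REGISTER to."""
    packet = build_options_ping(*server_addr, *target_addr)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(server_addr)
        sock.sendto(packet, target_addr)


# Print the request that would be sent to a phone seen at 203.0.113.7:40312.
print(build_options_ping("198.51.100.10", 5060, "203.0.113.7", 40312).decode())
```

A real pinger would call something like `send_ping` for every NATed contact in the location table once per ping interval.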

The other three rules in the above list will almost certainly be met when the service provider has just one OpenSIPS server. However, this article is all about using two OpenSIPS servers in a cluster. As Shakespeare would say, “Ay, there’s the rub”.

Far-end NAT traversal on multi-homed or clustered servers

So what happens when there are two OpenSIPS servers or even one server with two different IP addresses? The user’s IP phone will have registered using just one server IP address, and the firewall will not accept inbound messages to the phone if they come from any other IP address because that would break rule number 3.

Only one OpenSIPS server in the cluster has 2-way communication

In this scenario, the second OpenSIPS server in the cluster cannot penetrate the remote NAT firewall, even though it knows the correct remote IP address and port number, which it read from the shared location table. If the backup server cannot initiate calls to UA devices, it largely defeats the purpose of clustering two OpenSIPS servers!

So what other options are available to us? I believe the list of options is as follows:

  1. Configure port forwarding on every remote firewall
  2. Configure every user device to register twice – once with SIP server 1 and again with SIP server 2
  3. Assuming every user device has DNS SRV capability, create an SRV record for each server
  4. Make SIP server 2 take the IP address of SIP server 1 as part of the failover process

Options 1 and 2 involve configuration of user devices or network equipment which may not be accessible. Furthermore, registering twice creates complications when you send a call to the phone. Option 3 depends on the capabilities of the user device and it can take up to an hour for a phone to re-register with the lower priority SRV address following failure of the server at the higher priority address. That is hardly likely to be satisfactory for most users. Which just leaves option 4 – read on.

Active/Standby failover and the use of a Virtual IP address

The network interface on a server can generally be configured with more than one network address – this is a trick that is available on most PCs and servers including those running Windows or Linux. It is quite common for high availability servers to exploit this trick as a way of passing one IP address from a primary to a standby unit, thereby providing an almost seamless transition for any client devices that need to communicate with the server. Each server must have its own unique IP address, but in addition there can be a floating address that may be assigned to any one server at any given time. This floating address is sometimes referred to as a Virtual IP address. Linux HA (High Availability) packages can be installed that provide all the functionality for an active/standby failover between multiple servers, ready to go. [2017 Update: Corosync, Pacemaker and the pcs cluster management suite provide a complete solution for Virtual IP switching. If you want to try this solution, I recommend this article – http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs]
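As a rough illustration of what "taking over" a Virtual IP involves on Linux, the Python sketch below builds the commands a standby server might run. It is not the failover script discussed in this series, and everything specific in it is an assumption: the address 192.0.2.10/24, the interface name eth0, the three-failure threshold and the function names. It also assumes the iproute2 `ip` command and the iputils version of `arping`, whose `-U` option sends unsolicited ARP so that neighbouring routers learn the new MAC address for the floating IP.

```python
import subprocess


def should_take_over(health_history, threshold=3):
    """True when the last `threshold` health checks of the primary all failed.
    health_history is a list of booleans, oldest first (True = check passed)."""
    recent = health_history[-threshold:]
    return len(recent) == threshold and not any(recent)


def takeover_commands(vip, prefix_len, iface):
    """Commands to add the Virtual IP locally and announce it to the LAN."""
    return [
        ["ip", "addr", "add", f"{vip}/{prefix_len}", "dev", iface],
        ["arping", "-U", "-c", "3", "-I", iface, vip],  # unsolicited (gratuitous) ARP
    ]


def release_commands(vip, prefix_len, iface):
    """Command to drop the Virtual IP, e.g. when the primary returns."""
    return [["ip", "addr", "del", f"{vip}/{prefix_len}", "dev", iface]]


def run(commands, dry_run=True):
    """Print the commands, or execute them (needs root) when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# The primary passed one check, then failed three in a row.
if should_take_over([True, False, False, False]):
    run(takeover_commands("192.0.2.10", 24, "eth0"))
```

Requiring several consecutive failures before acting is a common precaution against moving the address because of a single lost health check. The old primary must also release the address before it comes back, otherwise both servers would answer for the same IP.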

I investigated Linux HA some years ago [approx. 2008], but more recently was involved in a project to provide failover for Asterisk that obliged me to research how a failover solution can be written using bash scripts. The greater flexibility available through bash scripts is attractive and I was able to build on a previously documented solution posted on a web forum which unfortunately no longer exists. This meant it was not too difficult to refine and improve the published code and come up with something suitable for OpenSIPS. Ultimately, that is the solution we opted for to build our high availability OpenSIPS cluster and the results were generally very satisfactory.

In part 3 of this article, I will look at how a Virtual IP address can be used with two OpenSIPS servers.
