What’s the disaster we are trying to avoid?
The assumed scenario is this: Some kind of centralised VoIP service is being offered to a number of users; the service operates on servers located at a data centre or office and the users each have a SIP client device, such as an IP phone, that connects to the centralised service over the Internet or over the company network. That is the typical setup for an Internet Telephony Service Provider (ITSP) with thousands of users. However, the principles are also very similar for a relatively humble setup with an Asterisk PBX in a small office and a handful of IP phones used as PBX extensions.
The disaster we are trying to avoid is simply this – if a server or other component fails, we don’t want it to cause prolonged disruption to the service for a large number of users. Ideally, we want some kind of backup system to take over – preferably without the need for human intervention because the fault might happen in the middle of the night when the support team are all asleep!
Designing systems for resilience
Elimination of “single points of failure” is a tricky design problem and it is important to keep a balanced judgement about the best approach to it. An ITSP might decide to have multiple sites with a complete working infrastructure at each site, but this is beyond the scope of a smaller service provider or business.
If you consider each site (or data centre) in isolation then you may want to have redundancy and failover for all or any of the following:
- Telephony carriers – your feeds to and from the PSTN
- Internet connectivity – the ISP and the physical connection
- Network components – firewalls, routers, switches etc.
- Mains power – a UPS can help with this, but other approaches may also be relevant
- The primary servers – your SIP Proxies, Asterisk servers, IVRs, gateways etc.
- Supporting service servers – DNS, Email, Database, Application servers etc.
- Components within a server – dual redundant power supplies, RAID or mirrored disks etc.
When designing for resilience, first sketch out a diagram of your proposed solution and then ask yourself a series of “what if” questions – What if the ISP has a fault? What if the firewall goes up in smoke? What if the hard disk on the database server fails?
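The “what if” exercise can be partly mechanised. The following is a minimal sketch, assuming you keep a simple inventory of components grouped by role; the role names and the inventory itself are hypothetical. It only flags roles served by a single component, so it does not replace thinking through each failure scenario on the diagram.

```python
from collections import Counter


def single_points_of_failure(components):
    """Return the roles that are served by exactly one component.

    `components` is a list of (role, name) pairs, e.g. ("firewall", "fw1").
    """
    counts = Counter(role for role, _name in components)
    return sorted(role for role, n in counts.items() if n == 1)


# Hypothetical inventory for one site.
inventory = [
    ("isp", "provider-a"),
    ("isp", "provider-b"),
    ("firewall", "fw1"),
    ("sip-proxy", "proxy1"),
    ("sip-proxy", "proxy2"),
    ("database", "db1"),
]

print(single_points_of_failure(inventory))  # ['database', 'firewall']
```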
You will soon see that a totally robust solution would be complicated and expensive – not only do you need duplication of every key component but you also need some way of connecting them together that will gracefully switch across from the dead unit to the backup with minimal disruption. In addition, background tasks are required to synchronise files and data between primary and backup servers, even if those data are changing frequently as the service is used – stored voicemail messages, registrations, call records, credit and billing information are all examples. Perhaps it can be done, but what you really need to do is decide in advance what are the parameters for an acceptable solution – which failure scenarios are most likely, which risks do you have to protect yourself against and which risks are you prepared to live with? How rapidly must the failover operate, will it be fully automated or is manual intervention acceptable? Which data are at risk in the event of a failure and what are the business critical elements within that data? Remember also that it is not just the importance of the data, but also the complexity of the data structures and the time it would take to repair your data following an incident.
Mechanisms available for resilience and high availability
Client-side vs. Server-side mechanisms
Solutions generally fall into one of two categories – client-side or server-side. Client-side solutions allow the SIP client device to dynamically select the server it is going to register and communicate with. Not only must the client devices have the required capabilities, but you also have to make sure they are all configured correctly to use them – that may not be so easy if you have a large user base, your users are allowed a free choice in their equipment selection and they are installing and configuring the kit themselves.
Server-side solutions have the advantage of being more within the service provider’s control, but they are generally more complex and expensive. Another advantage is that the failover can operate very quickly – if you are relying on mechanisms in the client devices then it is likely that failover will be quite slow.
The slowness of client-side solutions is linked to the registration refresh interval. The typical refresh interval for client registrations is between 30 minutes and 1 hour. Even on IP phones that can be configured to send so-called “keep-alive” pings to the server, these are intended primarily for NAT traversal so failure to get a response will not necessarily trigger instant re-registration.
The primary mechanism for resilience within SIP clients is the use of DNS-SRV server location records. SRV records are a special type of DNS record, similar to MX records for email, that allow a single SIP domain name to be associated with multiple SIP servers. Basically, a DNS lookup using the SIP domain name can return multiple SRV records and each record can identify a different server by host name and port, with each host name resolving to its own IP address. Each record also has a priority and a weight to allow the provider to control distribution among the servers: clients try the lowest priority value first and use the weights to share load among records of equal priority. DNS-SRV is only viable as an automated failover option if the service provider operates multiple servers on different static IP addresses and those servers are all equally capable of handling requests from the SIP clients.
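To illustrate how priority and weight interact, here is a minimal sketch of the order in which a client might try a set of SRV records. It is a simplified version of the selection described in RFC 2782, not the code of any particular phone; the host names are hypothetical and no real DNS lookup is performed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int
    weight: int
    target: str
    port: int = 5060


def order_srv_records(records, rng=random):
    """Order records as a client would try them.

    Lowest priority value first; within one priority, a weighted random
    draw without replacement decides the order.
    """
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) == 0:
                # All weights zero: treat the remaining records as equal.
                pick = rng.choice(group)
            else:
                pick = rng.choices(group, weights=weights, k=1)[0]
            ordered.append(pick)
            group.remove(pick)
    return ordered


# Hypothetical records for _sip._udp.example.com
records = [
    SrvRecord(priority=10, weight=60, target="sip1.example.com"),
    SrvRecord(priority=10, weight=40, target="sip2.example.com"),
    SrvRecord(priority=20, weight=0, target="sip-backup.example.com"),
]

for record in order_srv_records(records):
    print(record.priority, record.weight, record.target, record.port)
```

With these records the two priority-10 servers share the load roughly 60/40, and the priority-20 server is only tried if both of them fail to respond.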
SIP clients should be able to support DNS-SRV for service location in addition to the vanilla options of specifying a host name or IP address as the location of the SIP Proxy. Snom IP phones will always use DNS-SRV if it is available whereas most other makes of IP phone provide it as an option that can be switched on or off.
Failover DNS falls within the client-side solutions category, but it does not require any special technology or settings in the SIP client devices. It is only offered by a few providers, but it has great potential as a low cost solution for SIP failover. To the client device, it appears to be just a plain vanilla DNS Host record, but behind the scenes, the IP address associated with the host name is allocated dynamically by a mechanism that pings your servers at frequent intervals to see if they are still alive. When the primary server stops responding to the pings, the DNS record is updated with the IP address of the backup server. The TTL setting on the DNS record is deliberately set to a low value so it should not take very long before the SIP client refreshes its address cache and starts to use the backup server.
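The decision made by a failover DNS monitor can be sketched as follows. This is an illustration only, not the implementation used by any provider: the `probe` callable stands in for whatever health check the monitor uses (an ICMP ping or a SIP OPTIONS request, for example), and the IP addresses are documentation addresses chosen for the example.

```python
def choose_published_address(servers, probe):
    """Return the address the DNS host record should point to.

    `servers` is listed in order of preference (primary first).
    `probe` is a callable that returns True if the address responds.
    If nothing responds, keep publishing the primary.
    """
    for address in servers:
        if probe(address):
            return address
    return servers[0]


# Simulated health check results: the primary is down, the backup is up.
status = {"198.51.100.10": False, "203.0.113.20": True}
servers = ["198.51.100.10", "203.0.113.20"]

print(choose_published_address(servers, status.get))  # 203.0.113.20
```

A real monitor would run this check at frequent intervals and update the DNS record only when the chosen address changes.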
If you are interested in using failover DNS, please contact Smartvox Limited.
Some makes of IP phone can be configured for server failover
Aastra IP phones have extra fields for each line where you can set an IP address or host name for “Backup Proxy Server” and “Backup Registrar Server”. Multi-line Snom IP phones, such as the 320, refer to each line as an “Identity”. Within the settings for one Identity there is a field called “Failover Identity” where you can select any one of the other lines. The twelve different Identities available on the Snom 300-series phones should be adequate for a combination of multi-line/multi-account functionality combined with failover on each different account. Yealink phones such as the T21P allow you to enter details for two different servers on each user account.
Now we are looking at server-side solutions. The primary purpose of load balancing systems is to allow high volumes of traffic to be handled. A typical example would be the use of an OpenSIPS Proxy Server to distribute incoming SIP calls to a group (or farm) of several IVRs or gateways. Load balancing devices may permit different distribution algorithms to be selected – “round robin” distribution is the simplest and is quite suitable for many situations.
Most load balancing solutions provide a degree of resilience because, if one of the IVRs fails or is taken out of service, then the others will continue to answer the incoming calls. Depending on how the load balancing device works and what distribution algorithm has been selected, you may find that a percentage of calls are not answered or experience a delay before being answered, or you may find that calls are redistributed to the working servers in a completely seamless manner.
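As an illustration of “round robin” distribution that skips servers known to be out of service, here is a minimal sketch. It is not how OpenSIPS implements load balancing; the class, its method names and the server names are all hypothetical, and how a server comes to be marked as down (probing, failed calls, manual action) is left out.

```python
class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once, starting from the rotation point.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["ivr1", "ivr2", "ivr3"])
print([balancer.pick() for _ in range(4)])  # ['ivr1', 'ivr2', 'ivr3', 'ivr1']

balancer.mark_down("ivr2")
print([balancer.pick() for _ in range(3)])  # ['ivr3', 'ivr1', 'ivr3']
```

Whether failover is seamless depends on how quickly a failed server gets marked as down: until that happens, its share of calls is still sent to it.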
Remember also that load balancing and high availability are not the same – your load balancer will help in situations where one of the gateways or IVRs has failed, but you must also consider the consequences if the load balancer itself were to fail. The solution to this problem is very likely to be found by ensuring you have more than one failover mechanism in place. For example, if you had a primary load balancing server and a backup on different static IP addresses, then you could use a client-side solution to provide failover to the backup load balancer. Another option would be to use Virtual IP addresses as discussed below.
Solutions using Virtual IP addresses
Linux servers are able to support dynamic IP address allocation on their network interface cards. This allows us to do some useful tricks for automatic failover – the actual mechanism can be based on home-produced bash scripts running in the background or you may prefer to use the Linux HA system which can be downloaded and installed on most, if not all, Linux distributions (see my article Pacemaker and OpenSIPS). What it means is that two servers can be running – one primary and one backup – but at any given moment only one of them is assigned the “Virtual IP address”. The assignment of the VIP is controlled by background tasks that are constantly monitoring the status of the servers. This mechanism essentially allows you to organise your SIP Proxy or Asterisk servers as a cluster.
If you are interested in implementing any of the server-side solutions mentioned above, please contact Smartvox Limited and we will be happy to discuss your requirements.
Failover solutions for OpenSIPS/OpenSER/Kamailio
Two servers with a shared Virtual IP address
One OpenSIPS server is able to handle very large numbers of SIP transactions and registrations. It can be configured in a load balancing role, passing SIP requests to other servers, including Asterisk servers, that act as IVRs or gateways. This is generally an excellent arrangement for an Internet Telephony Service Provider, except that the OpenSIPS server is absolutely critical and if it fails, then the entire service is broken.
The obvious solution in this case would be to have two identical OpenSIPS machines that share one Virtual IP address. One machine would be the primary and the other would be ready as a hot standby. If the primary fails, then the VIP is re-assigned to the backup machine. This could be controlled by Linux HA or it would be possible to write some simple bash scripts that run on both machines and are responsible for assigning and revoking the VIP.
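The decision logic that such a script (or Linux HA) has to implement on the standby machine can be sketched as below. This is a simplified illustration under stated assumptions: the threshold of three missed heartbeats, the automatic fail-back when the primary returns, the VIP `192.0.2.10/24` and the interface name `eth0` are all example choices, and the sketch only builds the `ip addr` command rather than running it.

```python
class StandbyMonitor:
    """Decide when the standby should claim or release the VIP."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0
        self.holds_vip = False

    def record_heartbeat(self, primary_alive):
        """Feed in one health check result.

        Returns "claim", "release" or None (no action needed).
        """
        if primary_alive:
            self.missed = 0
            if self.holds_vip:
                self.holds_vip = False
                return "release"
            return None
        self.missed += 1
        if self.missed >= self.max_missed and not self.holds_vip:
            self.holds_vip = True
            return "claim"
        return None


def vip_command(action, vip="192.0.2.10/24", device="eth0"):
    """Build the iproute2 command that adds or removes the VIP."""
    verb = {"claim": "add", "release": "del"}[action]
    return ["ip", "addr", verb, vip, "dev", device]


monitor = StandbyMonitor(max_missed=3)
for alive in [True, False, False, False, False, True]:
    action = monitor.record_heartbeat(alive)
    if action:
        print(action, " ".join(vip_command(action)))
```

A real deployment would also need to announce the move to the network (for example with gratuitous ARP) and guard against both machines holding the VIP at the same time, which is part of what Linux HA takes care of.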
Find out more about this type of solution by reading the new article on Linux HA (Pacemaker and Corosync):
There is also a 3-part article published in 2010:
The problem of synchronising Location Server registrations
The only difficulty in the dual OpenSIPS solution is the requirement for synchronisation of registrations. Registrations are more of a problem than most other stored data because (a) OpenSIPS works best when it stores registrations in memory and (b) registrations are constantly changing – they expire and are renewed, as well as new ones joining and old ones leaving.
To make the failover solution function seamlessly, it is necessary to synchronise the in-memory registrations between the primary and the backup server. This can be done by forking a copy of the registration request to the backup server, but there are some practical problems in doing this, especially when the OpenSIPS server is configured for far-end NAT traversal and some of the client devices are behind restricted-cone NAT firewalls. It may be possible to use replication functionality provided by the database application or simply to use a shared location table, but this is not compatible with caching of registrations in memory and requires the use of the slower db_mode 3 in the USRLOC module. Version 2 of OpenSIPS introduced a new feature, the Clusterer module, which provides a framework for clustering multiple OpenSIPS servers and particularly for replication of registration contact details. However, even this new mechanism has problems when used with the nathelper module in a clustered high-availability solution where some of the client devices are behind restricted-cone NAT, because both the active and the standby server will attempt to ping the client device to keep the NAT path open.
If you are looking for a high availability OpenSIPS solution, please contact Smartvox Limited. We have a lot of experience in this field and would be happy to discuss your requirements.
Resilience in the network
What can you do to ensure that your Asterisk based IP-PBX is always able to reach – and be reached from – the Internet?
Dual WAN – Provision of a backup Internet Connection
Internet Service Providers are able to invest in sophisticated solutions using BGP to ensure that the same range of IP addresses is re-allocated to the backup equipment in the event of an equipment failure or connectivity failure. This provides as near seamless failover as is possible, but it comes with a high price tag and so is beyond the reach of many medium-sized businesses and probably all but the most specialised small businesses.
Instead, the best that most system designers can hope for is to provide a so-called “Dual WAN” connection to the Internet. Dual WAN is now an option on many firewall/router devices aimed at the small and medium business sector. You will need to sign up with two different service providers and it is strongly advised that you make sure they are using different cables (perhaps even different technologies) to deliver Internet connectivity to your building. If they share the same underground cable in the street outside then the proverbial “JCB digging up your cable” is going to take both your connections out at the same time!
Typically, the firewall/router will use one WAN connection as the main one and keep the other in reserve in case the main one breaks. Some devices will allow load sharing between the two WAN connections, based on a simple algorithm. However, this will only apply to outbound traffic and it could introduce complications if it is not handled well by the router.
Dual WAN solutions and their interaction with an Asterisk PBX system
While dual WAN seems like an excellent idea to make sure your Asterisk or other IP-PBX can stay connected to the Internet, there are complications. The most significant issue is that the “public” (or static) IP addresses for access to your IP-PBX from the Internet will be different depending on which Internet connection is being used. Provider A will allocate you one range of IP addresses and provider B will allocate you a completely different range. Which one should be used by your remote IP phones? Which one will be the termination for a SIP trunk?
The only way to resolve these problems is to combine dual WAN with one of the other solutions mentioned earlier – either DNS-SRV or failover DNS. Unfortunately, most VoIP service providers will not be able to connect to your IP-PBX using a DNS host name so the solution will be very much in their hands. Some will be able to support the use of a failover route for inbound calls, but it may only be via a conventional PSTN landline connection rather than an alternative IP address. Some may be able to offer a backup trunk at extra cost.
Ha, you may be thinking, it will be alright because my Asterisk server registers with the VoIP service provider so the service provider will be informed of the new IP address as part of the new registration. Well, yes this might work, but if your Asterisk server is behind NAT then you have to tell it what the public IP address is using the “externip” parameter in sip.conf. One way you could automate this change of address would be to use the “externhost” parameter instead of “externip”, and set the host name to something that is using failover DNS. Another option that might be viable (in conjunction with the “externhost” parameter) is Dynamic DNS. This is normally used to assign a resolvable host name to an Internet connection that does not have a static IP address. However, support for Dynamic DNS is normally only provided in routers designed for the domestic, rather than the business, market, and those may not all have dual WAN ports.
Don’t be fooled – none of this failover DNS or Dynamic DNS is going to happen instantly. On top of the latency resulting from DNS caching, you must also remember that Asterisk (like most other SIP UACs) does not re-register with the host server very often – once every 30 minutes is quite common as the default setting. So even if you have dual WAN connections and you don’t have Asterisk behind NAT (or you are using “externhost” in combination with failover DNS), there will usually be a significant delay before Asterisk re-registers with your VoIP service provider and informs them of the new IP address on which your PBX can now be reached. Try reducing the expire time of your registration to keep the delays as short as possible.
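To get a feel for the numbers, the sketch below adds up the delays for two settings of the registration expiry. It is a rough upper bound that assumes the delays simply accumulate; the 60-second detection time and 300-second DNS TTL are example values, not recommendations.

```python
def worst_case_failover_delay(dns_ttl_s, register_expiry_s, detection_s=0):
    """Rough upper bound, in seconds, before the provider learns the new address.

    Assumes the failure must first be detected, then cached DNS answers
    must expire, then Asterisk must reach its next re-registration.
    """
    return detection_s + dns_ttl_s + register_expiry_s


default = worst_case_failover_delay(dns_ttl_s=300, register_expiry_s=1800, detection_s=60)
tuned = worst_case_failover_delay(dns_ttl_s=300, register_expiry_s=120, detection_s=60)

print(default / 60)  # 36.0 minutes with a 30-minute registration expiry
print(tuned / 60)    # 8.0 minutes with a 2-minute registration expiry
```

In this example the registration expiry dominates the total, which is why shortening it has the biggest effect.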
Contingency planning for Disaster Recovery – Dual Site
Large financial institutions, such as international banks or stock exchanges, have elaborate disaster recovery plans that will allow them to be back up and running within a few hours even if an entire office building has to be evacuated or if a data centre has to be shut down (e.g. because of a fire). They do this by having a duplicate, albeit scaled-down, setup in a different geographic location. If necessary, they can bus their staff to the backup location where pre-tested and configured IT and phone systems are in place waiting to be activated.
Obviously this level of contingency planning is too expensive and complex for most businesses to consider. However, it does demonstrate the principles that you need to consider when planning for the worst. If you are providing an essential service, perhaps one where lives would be put at risk if the telephones stop working, then complete duplication of infrastructure in a different geographic location may be part of the solution. Even if system failure would just lose you money and annoy your customers, the risk of a prolonged outage may be enough to justify the cost of a dual site solution. In the ITSP world, this usually means having two Data Centres. With the availability of relatively low cost virtual servers in the cloud, including some solutions where you can spin-up extra server capacity on demand, this approach is becoming increasingly viable for VoIP telephony solutions because they can work entirely over IP networks and the Internet, requiring no special hardware as was the case with legacy TDM-based services.