OpenStack Neutron - Router failover time & conntrack replication

While doing some work with virtual routers in Neutron I decided to dig into the failover process and the time it takes for a router to 'move' from one server to another.

I noticed that when I killed the router on one host's Neutron L3 agent, Neutron would pick up the failure pretty much immediately and keepalived would do its thing, moving all of the IP addresses to the newly nominated primary, and all would be good.. But no!!

The issue I noticed was that all active TCP connections got dropped during this process. If you had SSH open to a VM with a floating IP during the failover, your session would be dropped. Different applications responded differently to the failover, but all in all the customer experience was bad; this isn't a routine I can execute during the day without notifying the customer. How entirely non-'cloudy'.

I set about deep diving into the issue and it very quickly became apparent that the Linux conntrack tables aren't replicated from one host to another, so the tracking of all active sessions is lost when the router 'moves'.

This can be demonstrated by switching into the router's namespace with ip netns exec qrouter-800b9927-e5a5-4dd8-b7ac-16a23f0e0b71 bash, then running conntrack -L to list the tracked connections, or conntrack -C to display just the count.
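Putting those commands together (the router UUID here is from my lab; substitute your own):

```shell
# Enter the router's network namespace (UUID from my lab; use your own)
ip netns exec qrouter-800b9927-e5a5-4dd8-b7ac-16a23f0e0b71 bash

# List all tracked connections
conntrack -L

# Or just print the number of entries
conntrack -C
```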

Here you can see the backup server has very little in conntrack


But the host with the active router has a few entries (keep in mind this is my test lab and there isn't much traffic; in production you could have 200,000+ entries in conntrack)


I did need to run apt install conntrack conntrackd inside my Neutron L3 agent Docker containers to make all of this work
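For reference, this is roughly what that looks like from the host (the container name here is illustrative; adjust to whatever your deployment calls the L3 agent container):

```shell
# Install the conntrack tools inside the L3 agent container
# (assumes a Debian/Ubuntu-based image; container name is illustrative)
docker exec -it neutron_l3_agent apt update
docker exec -it neutron_l3_agent apt install -y conntrack conntrackd
```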

Using this basic conntrackd.conf file, I was able to get conntrackd to replicate the conntrack table between the hosts:

Sync {
    Mode FTFW {
        DisableExternalCache Off
        CommitTimeout 1800
        PurgeTimeout 5
    }
    UDP {
        IPv4_address
        IPv4_Destination_Address
        Interface ha-60d901db-7d
        Port 3780
        SndSocketBuffer 1249280
        RcvSocketBuffer 1249280
        Checksum on
    }
}
General {
    Nice -20
    HashSize 32768
    HashLimit 131072
    LogFile on
    Syslog on
    LockFile /var/lock/conntrack.lock
    UNIX {
        Path /var/run/conntrackd.ctl
        Backlog 20
    }
    NetlinkBufferSize 2097152
    NetlinkBufferSizeMaxGrowth 8388608
    Filter From Userspace {
        Protocol Accept {
            TCP
            UDP
            ICMP # This requires a Linux kernel >= 2.6.31
        }
    }
}

Start conntrackd by running conntrackd -C conntrackd.conf and confirm it's working by running conntrackd -C conntrackd.conf -s to show the status
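In full, on both hosts (run from inside the router's namespace; -d daemonises conntrackd rather than leaving it in the foreground):

```shell
# Start conntrackd as a daemon with our config
conntrackd -d -C conntrackd.conf

# Show daemon statistics: internal/external cache sizes, sync traffic, etc.
conntrackd -C conntrackd.conf -s
```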

Grab the script from here: Github

On your primary execute bash primary and on the secondary execute bash backup
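I haven't reproduced the script here, but the stock primary-backup.sh that ships with conntrack-tools does roughly the following, and the primary/backup arguments map onto the same conntrackd calls (config path is illustrative):

```shell
#!/bin/sh
# Sketch of the standard conntrack-tools primary-backup.sh logic
CONNTRACKD="conntrackd -C /etc/conntrackd/conntrackd.conf"

case "$1" in
primary)
    $CONNTRACKD -c   # commit the external cache into the kernel conntrack table
    $CONNTRACKD -f   # flush the internal and external caches
    $CONNTRACKD -R   # resync the internal cache with the kernel table
    $CONNTRACKD -B   # send a bulk update to the backups
    ;;
backup)
    $CONNTRACKD -s   # sanity check: is conntrackd answering?
    $CONNTRACKD -t   # shorten kernel conntrack timers to expire zombie entries
    $CONNTRACKD -n   # request a full resync from the current primary
    ;;
esac
```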

Confirm that the numbers look right by running conntrackd -C conntrackd.conf -s on each host. The backup server should look like this


And the primary server should look like this


To force-kill a router, I simply killed the relevant keepalived processes from within the Docker container like this


ps ax | grep 800b9927-e5a5-4dd8-b7ac-16a23f0e0b71 | grep keepalived

Then kill the keepalived processes with kill 726 727 (using the PIDs from the previous command)

Once keepalived is dead, on the NEW primary (the previous backup) you'll need to have conntrackd take all of the tracking entries from its cache and load them into conntrack. You can do this manually by running bash primary
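Under the hood this boils down to committing the replicated external cache into the kernel, then checking the table is populated again:

```shell
# On the new primary: inject the replicated entries into the kernel table
conntrackd -C conntrackd.conf -c

# Then verify the kernel conntrack table has been repopulated
conntrack -C
```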

If you do all of this quickly enough, there will be effectively no downtime. Of course this can be somewhat automated by adding the notification scripts into keepalived.conf, but given this is built by OpenStack I haven't yet worked out how to do that. Here is an example keepalived.conf file that you might find handy Github
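For reference, outside of OpenStack a hand-written keepalived.conf would hook conntrackd in with notify scripts along these lines (the instance name, priorities, and script path are all illustrative, not what Neutron generates):

```
vrrp_instance router_ha {
    state BACKUP
    interface ha-60d901db-7d
    virtual_router_id 51
    priority 50
    nopreempt

    # Hand control to conntrackd on every VRRP role change
    notify_master "/etc/conntrackd/primary-backup.sh primary"
    notify_backup "/etc/conntrackd/primary-backup.sh backup"
    notify_fault  "/etc/conntrackd/primary-backup.sh fault"
}
```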


Conntrack tools Github

Conntrack tools man page

Great article about HA firewalls