Setting up HSRP (or how not to...)

OK, I need a time out from reading about wireless networks; it's all a bit of a repeat to be honest, and I'm getting brain ache. I have updated my Anki flashcard pack with some of it, though.

But I thought a few words on HSRP would help clear my mind.

HSRP (Hot Standby Router Protocol) is a wonderful idea of Cisco's. Two or more routers on a subnet share one virtual IP address. You assign your client PCs this IP as their default gateway, and in the event of one of the routers failing, another takes over and keeps connectivity for your clients! And it is so simple to set up (both Cisco routers and switches with the advanced IP feature set have this).

On the primary router, enter config mode followed by the interface you wish to configure this on, and enter the following commands:

(config-if)#ip address 192.168.14.200 255.255.255.0
(config-if)#standby 1 preempt
(config-if)#standby 1 ip 192.168.14.254
(config-if)#standby 1 priority 100

Then on the secondary router, enter the following:

(config-if)#ip address 192.168.14.100 255.255.255.0
(config-if)#standby 1 preempt
(config-if)#standby 1 ip 192.168.14.254
(config-if)#standby 1 priority 95

And there we have it: the first router will now respond to any ARP requests for the 192.168.14.254 address, which can be used as the default gateway for your clients. What is even better is that the routers share the same virtual MAC address for this IP. So in the event of the primary router failing, the secondary router takes over once the hold timer expires (10 seconds with the default timers), and all currently active clients will be able to carry on where they left off.
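
To check who is in charge, something like the following from the privileged exec prompt on either router should do (I have left the output out rather than invent it). The State column should read Active on the primary and Standby on the secondary.

#show standby brief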

As always, that is far from all you can do with HSRP. One of the main limitations you may notice is that only one gateway is active at any one time, and although you can play with HSRP to achieve load balancing (see here), there is a much better way: GLBP (Gateway Load Balancing Protocol). You can also have HSRP track interfaces and IP SLA counters to raise and lower a router's priority, to ensure the router in the best position is running as the active. This Cisco document covers the settings in far more detail than there is space for here.
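
As a rough sketch of interface tracking, assume the primary router's uplink is Serial0/0 (a made-up interface name, substitute your own). If that uplink goes down, the primary's priority drops by 10, from 100 to 90, which is below the secondary's 95. Because the secondary has preempt configured, it then takes over as active.

! on the primary router; Serial0/0 is an assumed uplink name
(config-if)#standby 1 track Serial0/0 10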

Now for the how not to do it part 😉

By default HSRP sends a hello every 3 seconds, and the standby router becomes active if it fails to hear a hello from the active router for 10 seconds. A common tweak is a 1 second hello and a 3 second hold time, and the command to set that up looks like this:

(config-if)#standby 1 timers 1 3

But 3 seconds?! A three second network outage, I cried! Hitting the question mark after typing (config-if)#standby 1 timers ? ... what's this I see: msec. Yay, I cried, and after checking the documentation to see that this really did reduce the timers to the msec range, I proceeded to configure a hello timer of 50 msec and a hold timer of 150 msec (you can actually configure it lower still). A quick test and yes, almost instant failover, not even a packet dropped, and I went home a happy lad.
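
For reference, timers like those would be entered along these lines. This is reconstructed from the values in the text, so treat it as a sketch and not a recommendation.

! 50 msec hello, 150 msec hold (do not copy this)
(config-if)#standby 1 timers msec 50 msec 150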

However, I configured that in the evening with little traffic on the network. Next day, just before lunch... Oh, this is not so good: no one can get off site? Things start to move, then crash to a stop again. Well, better log on to the core switches I suppose and see what the logs are saying... Umm, nope, they won't let me on, just hanging. Finally, after switching off the secondary switch, the primary one magically let me log in again, and after checking its logs I could see what was happening. With such short hello timers, packets were getting dropped, the switches started flapping between active and standby, and in doing so just made the issue worse. They could not settle on who was in charge.

From this I learnt two important things. First, don't go below 200 msec hello timers and 700 msec hold timers (come on, that is still less than a second failover), and only do this if the routers/switches are directly attached. Secondly, add in a preempt delay statement:

(config-if)#standby 1 preempt delay 10

This will stop the flapping between active and standby. Once a device has changed state away from being the active router, with the configuration above it must wait at least 10 seconds before it can take over again. (Newer IOS releases write this as standby 1 preempt delay minimum 10.)
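
The timer half of that advice, using the minimum values mentioned above, would look something like this on both routers (a sketch only; pick values to suit your own network):

! 200 msec hello, 700 msec hold
(config-if)#standby 1 timers msec 200 msec 700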

And finally, just because you can do something does not mean you should or need to. The timeout in the TCP stack in XP (and most other systems) is at least 9 seconds. In the case of VoIP and video, a few seconds' delay may make a call hiccup, but it will normally stay up. And people will not mind or take much notice of a slight hiccup as long as it only happens once every 6 months.

There are cases when you need better failover times, in which case you need the correct equipment. HSRP is a great technology, but as I found a few years ago when I did this, you can push a good thing too far.

For those of you without Cisco devices, the industry standard version is VRRP (Virtual Router Redundancy Protocol), and some information on that can be found in this document.
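
As a rough idea of how similar it is, here is a sketch of the same primary router set up with VRRP, using Cisco IOS syntax (other vendors' commands will differ, and the priority value is just an example). VRRP preempts by default, so there is no preempt line.

! priority 110 is an example value; the VRRP default is 100
(config-if)#ip address 192.168.14.200 255.255.255.0
(config-if)#vrrp 1 ip 192.168.14.254
(config-if)#vrrp 1 priority 110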

Well, I hope some of you will learn from my mistake. Thankfully, because I had played around with HSRP a lot on a test network, I was in a good position to troubleshoot and had it back up and working quickly. But still, it is one of the few times I have had to hold up my hands to management. Thankfully these times are rare and so far non-critical and short lived...

Well, back to work tomorrow. Night all, have a good one.