The Peculiar Case of the Missing Bandwidth

Where I work we have a slightly strange network setup. As a government agency we run under what is known as the GSI (Government Secure Intranet). What this means in practice is that our main site and the 16 or so regional sites have their WAN routers managed by a central government IT centre, and all traffic to the outside world has to pass through their systems. This in itself causes no end of issues in terms of restrictions, such as no VPN access and no FTP allowed. But leaving that aside, it does mean we sit behind a very secure gateway. All you really need to understand is that we have a “10Mbps” full duplex fibre as the primary link at the main site, through which both internet and WAN traffic is routed. Oh, and of course we have no access to the WAN router to see what is going on.

Well, last Friday the network grinds to a screeching halt. What was a 20msec latency link to the regional sites has now become 4000msec (yep, that’s right, 4 seconds!!). As I say, I have no access to the WAN router, but from our 4506 that connects to it I can see that the link to it is looking fine. So nothing for it but to call the service provider. After a short chat they agree that traffic has dropped and latency has shot up, and they start looking into it for me.

A few hours pass (well, 3 days to be more correct, during which time we have moved over to the 4Mbps backup link) and they finally come back saying that the link seems to have dropped and the most data they can push through it is 1.6Mbps. They think it is a routing issue on our site’s subnet, as latency to the outside address of the router seems fine.

Now at this point my mind is saying 1.6Mbps??? Hmmm, why does that number sound familiar? Maybe if they measured it a bit more accurately they would find it was actually 1.544Mbps, which of course is a T1 link speed. That suggests to me that either someone added a bandwidth policy along the link or the route had changed to pass across a T1 link. But no, “definitely not!!” I am told with absolute certainty that no changes have been made to the configuration, and that someone will attend site to test it out.
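The hunch above is easy to put into numbers. Here is a rough Python sketch that compares a measured throughput against a handful of well-known nominal line rates and reports the nearest one. The table and the `closest_line_rate` helper are my own illustration, not anything the provider used, and a close match is only a hint, not proof.

```python
# Nominal line rates in Mbps for a few common circuit types.
NOMINAL_RATES_MBPS = {
    "T1": 1.544,
    "E1": 2.048,
    "E3": 34.368,
    "T3": 44.736,
    "10 Mbps Ethernet": 10.0,
}


def closest_line_rate(measured_mbps):
    """Return (name, nominal_mbps, relative_difference) for the nearest nominal rate."""
    name = min(
        NOMINAL_RATES_MBPS,
        key=lambda n: abs(NOMINAL_RATES_MBPS[n] - measured_mbps),
    )
    nominal = NOMINAL_RATES_MBPS[name]
    return name, nominal, abs(measured_mbps - nominal) / nominal


# The 1.6 Mbps figure quoted by the provider.
name, nominal, diff = closest_line_rate(1.6)
print(f"1.6 Mbps is closest to {name} ({nominal} Mbps), {diff:.1%} away")
```

With the 1.6Mbps figure the nearest entry is the T1 rate, a few percent away, which is why the number rang a bell.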

The following day the service provider has an engineer on site. After hours of testing the local loop section of the fibre he can’t find anything wrong: signal strength is perfect and the router on site has low latency to the next hop. After hours on the phone, and a few more suggestions from me that 1.6Mbps points to a T1 link somewhere along the line, I am told again that there have been no changes to the configuration or routes, but he says he will call head office and have them check the configs. He comes off the phone and says he will try one last test… And what do you know, the link is suddenly back working, latency has dropped back to the 20msec region, and it is pushing about 9Mbps of data across the link.

So what did they change? “Nothing”. All they did was set a 10Mbps bandwidth policy on one of the interfaces along the route… So why did it drop in the first place? “No idea, sometimes these things happen”. Hold on, so they are telling me they changed nothing, the link just stopped working on its own, and whereas it had worked fine for the last 4 years without the policy configured, it now just happens that adding it has solved the issue??

Forgive me for feeling that someone made a cock-up and had to fix it in a hurry, and that I have not been told the full story.

So great, after 4 days it is all back up and working. Or is it? For a long time now I have been suggesting that we don’t have the 10Mbps full duplex link we have been paying for. In tests I have never been able to get more than 9Mbps total throughput: as I push the outgoing traffic up, it pulls the incoming down. (Of course, as I said, I don’t have access to the routers, so all I can do is push traffic from our devices at either end.) But one of the engineers mentioned in passing that our link was 2 x 4.5Mbps??? Which is exactly 9Mbps, which is what my tests show… So not only did they muck up the link for 4 days, but for the last 4 years they have not been providing the service we pay for!!
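To spell out why that matters, here is a small Python sketch of the arithmetic. It assumes a simplified model of my own: on a true full duplex link each direction gets the full rate independently, while on a shared pool the two directions have to add up to the total. The `inbound_headroom` function and the outbound figures are hypothetical, not measurements from our link.

```python
def inbound_headroom(outbound_mbps, capacity_mbps, shared):
    """Inbound Mbps still available while sending outbound_mbps.

    shared=False models true full duplex: inbound is independent of outbound.
    shared=True models a single pool that both directions draw from.
    """
    if shared:
        return max(capacity_mbps - outbound_mbps, 0.0)
    return float(capacity_mbps)


# Hypothetical outbound loads, in Mbps.
for out in (0, 3, 6, 9):
    promised = inbound_headroom(out, 10, shared=False)
    pooled = inbound_headroom(out, 9, shared=True)
    print(f"out={out} Mbps: full duplex inbound={promised}, 9 Mbps pool inbound={pooled}")
```

On the promised link, inbound stays at 10Mbps no matter how much is going out. On a 9Mbps pool (2 x 4.5), every extra Mbps outbound comes straight off the inbound side, which is the behaviour my tests keep showing.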

I have not really been impressed with them over the last week (not that I have been overly impressed with them before, although I have to say a few members of their staff have been very helpful to me over the years), but maybe something good will come out of it and I will get the full 10Mbps full duplex link we were promised.

It is also quite nice in the sense that I informed management and the service provider of my concerns about the link speed about 2 years ago, when I first really had reason to look at it. All of them dismissed me and told me it was a 10Mbps full duplex link and that I was only seeing 9Mbps due to the type and volume of traffic. So I would be lying if I said I didn’t slip the “as I told you 3 years ago” into my report to management this time round. 🙂

I still can’t believe that no one can hold their hands up, though, and tell us what really happened last Friday. This is where network device management accounting comes in handy: you can’t even log on to my devices, let alone update a config, without it getting logged. It’s not just that I like to spy on people. If all changes are logged on the syslog server, then when someone makes a change and it all falls apart the next day while they are off, I can view the last 24 hours, 3 days, etc. of changes at a glance and see what has happened. No need for them to remember to document every change they make; that’s all done for them.
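The "last 24 hours at a glance" idea boils down to a time-window filter over the logged changes. Below is a minimal Python sketch, assuming the syslog entries have already been parsed into (timestamp, device, user, message) tuples. The device names, users, dates and messages are all made up for illustration, and `changes_since` is a hypothetical helper, not part of any real syslog tool.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed change records: (timestamp, device, user, message).
ENTRIES = [
    (datetime(2024, 1, 4, 16, 30), "core-sw1", "alice", "configuration changed"),
    (datetime(2024, 1, 2, 11, 15), "edge-rtr1", "bob", "configuration changed"),
    (datetime(2023, 12, 20, 9, 0), "core-sw1", "bob", "login"),
]


def changes_since(entries, now, hours):
    """Return the entries logged in the last `hours` hours, oldest first."""
    cutoff = now - timedelta(hours=hours)
    recent = [e for e in entries if cutoff <= e[0] <= now]
    return sorted(recent, key=lambda e: e[0])


now = datetime(2024, 1, 5, 9, 0)  # hypothetical "morning after"
for label, hours in (("24 hours", 24), ("3 days", 72)):
    print(f"Changes in the last {label}:")
    for ts, device, user, message in changes_since(ENTRIES, now, hours):
        print(f"  {ts:%Y-%m-%d %H:%M} {device} {user}: {message}")
```

Widening the window from 24 to 72 hours is just a different `hours` value, which is the whole appeal: nobody has to remember what they touched.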

Well, I wait to see what comes of this episode. But after this I am not sure I will ever trust a service provider again.

laters all