On (2012-11-09 01:22 +0200), Kasper Adel wrote:
> We've been hearing about ISSU for so many years and I haven't heard that any vendor was able to achieve it yet.
> What is the technical reason behind that?

I'd say code quality in routers is generally really, really bad, and I'm not sure why. I think one problem is that we start from the premise that code will be written correctly. Starting from that premise, we can do silly things like write run-to-completion operating systems such as IOS and JunOS (rpd), which means a single person making one bad judgement call makes the whole OS bad.

Of course run-to-completion is the most efficient way to execute code if your code is flawless, but that ship has sailed. Possibly when IOS started, CPU time was at a premium and it was cheaper to throw code-review money at the problem. But today it is clearly cheaper to add power to the control plane and have levels of abstraction in the control plane that save the system from bad code, i.e. design your control plane assuming the code you deliver isn't good. Take a page from the Erlang team on design principles.

I think Arista is walking the right path. They have a (hopefully) stable and simple state-storage process, from which separate processes can download their state after they crash, which can make a crash virtually transparent to the operator. However, I think Arista is still running a single BGPd etc. I think you should at least run iBGP and eBGP, or maybe even individual peer groups, in different daemons, so that when you get a bad UPDATE it crashes your eBGP sessions or one peer group instead of all neighbours. Or, of course, if you keep TCP state and the various BGP RIBs in a separate location, you won't need to tear down the TCP session just because you crash.

Someone might argue the overhead is too large, but is it though? MX routers ship with a 4-core RP, of which you're using 1 core. The overhead isn't that high.

Some people write positive things about ISSU in reply. The only box where I've seen it work reliably is the CAT4500 switch; I've not seen it working in routers. On MX960 my personal hit/miss ratio is something like 4/5 ISSUs working and 1/5 failing catastrophically, e.g. the PFE suddenly dropping packets as if a FW filter were applied when none is.
So we've stopped using ISSU. The point of ISSU is that you're not sending change-management notices to your customers, so it absolutely has to work, or you're in breach of contract.

--
  ++ytti
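
A minimal sketch of the crash-isolation idea described above: one small state store, one worker per peer group, and a supervisor that restarts a crashed worker, which then reloads its RIB from the store. Everything here is hypothetical and for illustration only (`StateStore`, `PeerGroupWorker`, `Supervisor`, the `malformed` flag, the dict-shaped updates); it is not how Arista, Juniper or any BGP daemon is implemented, and the "crash" is simulated with an exception inside one Python process rather than with real OS processes.

```python
class StateStore:
    """Deliberately tiny: the one component assumed to stay up."""

    def __init__(self):
        self._ribs = {}  # peer-group name -> {prefix: next_hop}

    def save_route(self, group, prefix, next_hop):
        self._ribs.setdefault(group, {})[prefix] = next_hop

    def load_rib(self, group):
        # Return a copy so a worker cannot corrupt the stored state.
        return dict(self._ribs.get(group, {}))


class WorkerCrash(Exception):
    """Stands in for a daemon dying on input it cannot handle."""


class PeerGroupWorker:
    """Hypothetical per-peer-group BGP worker."""

    def __init__(self, group, store):
        self.group = group
        self.store = store
        self.rib = store.load_rib(group)  # recover state on (re)start

    def handle_update(self, update):
        if update.get("malformed"):
            raise WorkerCrash(f"{self.group}: bad UPDATE")
        self.rib[update["prefix"]] = update["next_hop"]
        self.store.save_route(self.group, update["prefix"], update["next_hop"])


class Supervisor:
    """Restarts only the worker that crashed; other groups are untouched."""

    def __init__(self, groups):
        self.store = StateStore()
        self.workers = {g: PeerGroupWorker(g, self.store) for g in groups}
        self.restarts = {g: 0 for g in groups}

    def deliver(self, group, update):
        try:
            self.workers[group].handle_update(update)
        except WorkerCrash:
            self.restarts[group] += 1
            self.workers[group] = PeerGroupWorker(group, self.store)


sup = Supervisor(["ibgp", "ebgp-transit"])
sup.deliver("ibgp", {"prefix": "10.0.0.0/8", "next_hop": "192.0.2.1"})
sup.deliver("ebgp-transit", {"prefix": "198.51.100.0/24", "next_hop": "203.0.113.1"})
sup.deliver("ebgp-transit", {"malformed": True})  # crashes only this worker
```

After the bad UPDATE, only the `ebgp-transit` worker has been restarted, and its RIB is rebuilt from the store; the `ibgp` worker never notices. The design choice being illustrated is that the supervisor and store stay trivially simple, so the complex code is the only code allowed to crash.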