On Tue, 6 Jun 2006, Nick Burke wrote:
First, a little background.. My CTO made my stomach curdle today when he announced that he wanted to do away with all our cisco [routers] and instead use Linux/zebra boxen. We are a small company, so naturally penny pinching is the primary motivation. That, and the sheer joy of watching me squirm. He has informed me that he has found "many people" who do this for their "core devices". I'm not so certain about this whole situation, so I humbly ask:
How many of you have actually use(d) Zebra/Linux as a routing device (core and/or regional, I'd be interested in both) in a production (read: 99.999% required, hsrp, bgp, dot1q, other goodies) environment?
And, if you care to spend this much time, what pitfalls/benefits did you find out about after implementation? Having done exactly that previously, I wouldn't recommend it.
While it will work, most of the time, reaching 99.999% will be a challenge. Amount of engineering time you will spend in order to reach that point (and to maintain your setup) will dwarf the cost of leasing proper equipment. Issues encountered: *) Performance under ddos: Linux routing stack is route-cache-based. That means, performance is a function of flows per second, and even small random src/dst ddos will kill you. Even when this is fixed, performance will be limited by pps - and the "worst case" performance of PC router is not as impressive as "omg i can route 1gbit with p3/1ghz". In the end, "worst case" performance is what really matters, and it isn't all that awesome. *) Management: It takes certain amount of sysadmin time to manage each PC router (tools/etc). *) Integration: As it is not designed as a "complete system", you will have little wierdnesses, such as, quagga not seeing kernel-installed routes, or netlink not being able to keep up with route updates, etc. All of those are fairly small things, but there are more than enough of them. *) Troubleshooting/continuity of operations: It takes two orders of magnitude more clue to troubleshoot zebra network - there are simply *lots* more things that can possibly go wrong - you don't worry just about your links breaking, you have to worry about your software being buggy. While any CCIE will most likely be able to troubleshoot and run a cisco-based network, pool of engineers sufficiently clued in a myriad of things that relate to troubleshooting of a PC router (ie. both network engineer, system admin, protocol engineer, kernel hacker, and at times, zebra-source-code-hacker) is far smaller. *) Maturity: While it has been improving, things like Quagga have still have stability issues and "wierd issues that are resolved by killing ospfd". Because of a greater state of flux in such environment, you are likely to encounter things like "oh, this bug is fixed in latest release" - and then having to retest the new release which has completely different bugs. Yes, I know, you get that with proprietary vendors - but at least you get a benefit of *them* doing at least some amount of testing prior to release. *) Redundancy: Adding more redundancy to such a system is not likely to increase availability - in fact, it is likely to decrease availability because of added complexity and "more things to break". Your problems are not likely to be the PC losing power (complete failure). Your problem will be things like zebra's idea of routing table being different from kernel's idea, zebra being unhappy after a transit flaps sucking up CPU time, leading to other things timing out, etc. Redundancy will excarcerbate these issues, making troubleshooting *harder*. So, in conclusion, if you have a large number of clued linux hackers who have nothing better to do, it may be a good idea. Otherwise, you'll realize you are spending far more on sysadmin time than you are saving on equipment cost. -- Alex Pilosov | DSL, Colocation, Hosting Services President | alex@pilosoft.com 877-PILOSOFT x601 Pilosoft, Inc. | http://www.pilosoft.com