Lucent GBE (4 x VC4) clues needed
-----Original Message-----
From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Saku Ytti
Sent: Thursday, September 21, 2006 9:12 AM
To: nanog@merit.edu
Subject: Lucent GBE (4 x VC4) clues needed
(oops technical question in nanog, wearing my asbestos suit)
Consider this topology
GSR - 3750 --(GE over 4xVC4) - NSE100 - NSE100 --(GE over 4xVC4) -- 3550 - GSR
All other fibres are dark fibres, except marked.
When we ping across either NSE100 <-> GSR leg with no background traffic, there is no packet loss. If there is even a few Mbps, let's say 10 Mbps, of background traffic, we get 1-5% packet loss on 1500-byte packets, and a bit less packet loss on small packets. As background traffic increases, packet loss quickly increases.
We tried to replace (GSR-3750) with a 7600, but the same issue persisted.
We've tested both Lucent GBE legs by putting a loop at the far end and pushing tests from EXFO and Smartbits gear through the loop; no errors can be detected in the RFC tests.
There isn't very much that can be configured in the Lucent, and we've tried pretty much every setting. We've tried setting autonegotiation on and off in every piece of gear in the path, without any change to the observed behaviour. We've also tried using 1xVC4, without any change to the behaviour. All VC4s in a given leg use the same path. Even though we test the packet loss by pinging from router link to router link, the same packet loss is experienced for transit traffic too. We've tried turning PXF off in the NSE100. Packets between NSE100 <-> NSE100 over dark fibre are not lost.
We're pretty much utterly without clues. All I can think of is some obscure IFG issue, that is, the NSE100 would have less than perfect timing for the IFG, which would confuse the Lucent regarding what is part of which frame. Does stuff like this really happen?
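For readers of the archive, the framing arithmetic behind the IFG theory can be sketched as follows. This is a minimal illustration using the standard Ethernet constants (8 bytes of preamble and SFD, a minimum inter-frame gap of 12 byte times, 18 bytes of untagged header plus FCS); the function names are hypothetical, and nothing here models how the transport gear actually maps frames into VC4s.

```python
# Standard Ethernet framing constants, in bytes.
PREAMBLE_SFD = 8   # 7-byte preamble + 1-byte start-of-frame delimiter
MIN_IFG = 12       # minimum inter-frame gap (96 bit times)
L2_OVERHEAD = 18   # 14-byte untagged header + 4-byte FCS
MIN_FRAME = 64     # minimum frame size, header and FCS included


def wire_bytes(ip_len, ifg=MIN_IFG):
    """Bytes occupied on the wire by one frame carrying an ip_len-byte packet."""
    frame = max(ip_len + L2_OVERHEAD, MIN_FRAME)  # short frames are padded
    return PREAMBLE_SFD + frame + ifg


def max_fps(ip_len, line_rate_bps=1_000_000_000, ifg=MIN_IFG):
    """Back-to-back frames per second at the given line rate."""
    return line_rate_bps / (wire_bytes(ip_len, ifg) * 8)


def ifg_ns(ifg=MIN_IFG, line_rate_bps=1_000_000_000):
    """Duration of the inter-frame gap in nanoseconds."""
    return ifg * 8 / line_rate_bps * 1e9


if __name__ == "__main__":
    print(wire_bytes(1500))      # bytes on the wire for a 1500-byte IP packet
    print(round(max_fps(1500)))  # frames per second at GE line rate
    print(ifg_ns())              # minimum gap between frames, in ns
```

At gigabit speed the minimum gap lasts only about 96 ns, which is why a transmitter with sloppy gap timing is at least a plausible suspect. With 10 Mbps of background traffic, though, frames are rarely back to back, so the gap is seldom anywhere near the minimum.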
The NSE100 drops bad IP packets in PXF and there is only a shared counter, so I can't tell if I get CRC errors for IP; I just lose the packets. But IS-IS is not handled in PXF, and I get %CLNS-4-LSPCKSUM and %CLNS-3-BADPACKET messages over both Lucent legs, but not between the NSE100s. So I assume the packets are not dropped, but broken.
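The reasoning above relies on upper-layer checksums catching damage that the link layer let through. A minimal sketch of that idea, using the IPv4 header checksum (the ones' complement sum from RFC 1071) rather than the Fletcher checksum that IS-IS LSPs actually carry; the sample header bytes and function names are illustrative assumptions, not captured traffic.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones' complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def header_ok(header: bytes) -> bool:
    """A header with a valid checksum field sums to zero after complement."""
    return internet_checksum(header) == 0


# Illustrative 20-byte IPv4 header with the checksum field (offset 10) zeroed.
header = bytearray.fromhex("45000073000040004011" "0000" "c0a80001c0a800c7")
struct.pack_into("!H", header, 10, internet_checksum(bytes(header)))

# Simulate a single bit flipped somewhere on the path.
damaged = bytearray(header)
damaged[13] ^= 0x04

if __name__ == "__main__":
    print(header_ok(bytes(header)))   # intact header verifies
    print(header_ok(bytes(damaged)))  # flipped bit is caught
```

A receiver that silently drops on a failed check (as PXF does for IP) shows only loss, while a protocol that logs the failure (as IS-IS does) reveals that the frames arrived but were damaged.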
I swear next time I'll complain about some political issue, thanks, -- ++ytti
Silly question (considering that you stated that IS-IS, which is not handled by PXF, is borked also), but did you try disabling PXF? There's a reason why Cisco discontinued every product that "features" it. It's broken.
On (2006-09-21 06:32 -0700), David Temkin wrote:
traffic also. We've tried to turn PXF off in NSE100. Packets
Silly question (considering that you stated that IS-IS, which is not handled by PXF, is borked also), but did you try disabling PXF?
Not a silly question at all; it was just a longer mail than many people care to read (including me).
There's a reason why Cisco discontinued every product that "features" it. It's broken.
It's not broken, it's just Cisco's name for an NPU; two PXFs don't necessarily have anything in common apart from being NPUs. In essence, the CRS-1 uses NPUs afaik; of course Cisco doesn't call them PXF, due to the bad publicity. A cooler word for an NPU-style design is probably "cell processor", which makes me feel warm already about my NSE100s. Yes, you can design a broken NPU; the NSE-1 was a good example of that :).
Thanks,
-- ++ytti
Saku Ytti wrote:
(oops technical question in nanog, wearing my asbestos suit)
Consider this topology
GSR - 3750 --(GE over 4xVC4) - NSE100 - NSE100 --(GE over 4xVC4) -- 3550 - GSR
All other fibres are dark fibres, except marked.
When we ping across either NSE100 <-> GSR leg with no background traffic, there is no packet loss. If there is even a few Mbps, let's say 10 Mbps, of background traffic, we get 1-5% packet loss on 1500-byte packets, and a bit less packet loss on small packets. As background traffic increases, packet loss quickly increases.
[SNIP]
There isn't very much that can be configured in the Lucent, and we've tried pretty much every setting. We've tried setting autonegotiation on and off in every piece of gear in the path, without any change to the observed behaviour.
Did you try power cycling the Lucents after changing the auto-neg settings? I've seen some broken autoneg implementations in the past on managed media converters that didn't change settings immediately. It's worth a shot as you seem to be all out of other ideas ;)
Sam
On (2006-09-21 18:49 +0100), Sam Stickland wrote:
Did you try power cycling the Lucents after changing the auto-neg settings? I've seen some broken autoneg implementations in the past on managed media converters that didn't change settings immediately. It's worth a shot as you seem to be all out of other ideas ;)
I brought the adjacent ports in the IP gear down and up. We could verify from the Lucent's management interface that autonegotiation wasn't performed after the down/up, while before the down/up we could observe that autonegotiation was marked as done even though we had configured the Cisco interfaces as 'force-up'. So clearly it needed to see a link down/up.

We didn't power cycle the Lucent, as it would mean bringing down tens of 10G waves. But taking the GBE module out and in would have been an option (three countries are involved, so a bit inconvenient, but possible). Country A - Country B is one Lucent leg. Country B - Country C is the other Lucent leg.

Anyhow, thanks for the thoughts, any help I can get is much appreciated :). Of course we have full support agreements with both vendors, which we'll probably have to use sooner or later, but it'll be a long battle over whose problem it really is.
-- ++ytti
On (2007-09-21 16:12 +0300), Saku Ytti wrote:
(oops technical question in nanog, wearing my asbestos suit)
Consider this topology
GSR - 3750 --(GE over 4xVC4) - NSE100 - NSE100 --(GE over 4xVC4) -- 3550 - GSR
This should have been Nortel GBE, not Lucent, my bad.

Anyhow, I just wanted to report for the sake of the archive that it's the Nortel 4xVC4 that corrupts packets. It mostly seems to corrupt the source MAC, and always the same bits; that is, any L2 device will mostly learn the same MAC with a few different vendor codes. We can also see this in Wireshark on a fibre splitter. (It's not limited strictly to the source MAC, but it's not random by any means.)

It's not broken hardware (unless by design), as it can be seen on both of the production legs and we've recreated the same problem in the lab. Most likely a software issue in the Nortel.
-- ++ytti
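One way to pin down "always the same bits" from a capture is to XOR each learned MAC against the address the sender really uses and tally the differing bit positions. The sketch below assumes that approach; the MAC addresses and function names are made up for illustration and are not taken from the capture described above.

```python
from collections import Counter


def mac_to_int(mac: str) -> int:
    """Parse 'aa:bb:cc:dd:ee:ff' (or dash-separated) into a 48-bit integer."""
    return int(mac.replace(":", "").replace("-", ""), 16)


def flipped_bits(expected: str, observed: str) -> list:
    """Bit positions that differ; bit 0 is the LSB of the last octet."""
    diff = mac_to_int(expected) ^ mac_to_int(observed)
    return [bit for bit in range(48) if (diff >> bit) & 1]


def flip_histogram(expected: str, observations) -> Counter:
    """Count how often each bit position was seen flipped."""
    counts = Counter()
    for mac in observations:
        counts.update(flipped_bits(expected, mac))
    return counts


if __name__ == "__main__":
    real = "00:11:22:33:44:55"      # hypothetical true source MAC
    seen = ["00:51:22:33:44:55",    # hypothetical corrupted MACs learned by L2
            "00:51:22:33:44:55",
            "04:51:22:33:44:55"]
    for bit, n in sorted(flip_histogram(real, seen).items()):
        print(f"bit {bit}: flipped in {n} frame(s)")
```

A histogram dominated by one or two positions points to a deterministic fault rather than line noise, which would spread flips across the whole frame.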
Consider this topology
GSR - 3750 --(GE over 4xVC4) - NSE100 - NSE100 --(GE over 4xVC4) -- 3550 - GSR
This should have been Nortel GBE, not Lucent, my bad.
My first best guess was right, it was the Lucent system after all. We've now solved the issue: the problem is in the Lucent GBE card at hardware revision S1:7, which is broken by design. S1:3 and S1:6 work, and we should be able to test S1:8 soon, but we expect it to work also. The symptoms were that it flipped bits (not randomly, but we couldn't figure out why certain positions saw bit flips) and then calculated a new, correct CRC for the Ethernet frame after it had flipped the bit.
-- ++ytti
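Why this fault produced no CRC errors anywhere on the path can be shown with a small sketch. It assumes the Ethernet FCS is the same CRC-32 that Python's zlib.crc32 computes, appended least significant byte first; the frame contents and helper names are hypothetical.

```python
import struct
import zlib


def add_fcs(frame_wo_fcs: bytes) -> bytes:
    """Append a CRC-32 frame check sequence to a frame."""
    return frame_wo_fcs + struct.pack("<I", zlib.crc32(frame_wo_fcs))


def fcs_ok(frame: bytes) -> bool:
    """Check the trailing 4-byte FCS against the rest of the frame."""
    return zlib.crc32(frame[:-4]) == struct.unpack("<I", frame[-4:])[0]


def flip_bit(data: bytes, byte_index: int, mask: int) -> bytes:
    """Return a copy of data with the masked bits of one byte inverted."""
    out = bytearray(data)
    out[byte_index] ^= mask
    return bytes(out)


# Hypothetical frame: destination MAC, source MAC, EtherType, 46-byte payload.
body = (bytes.fromhex("ffffffffffff") + bytes.fromhex("001122334455")
        + struct.pack("!H", 0x0800) + bytes(46))
good = add_fcs(body)

# Case 1: a bit flips in transit and the original FCS is left untouched.
flipped_in_transit = flip_bit(good, 7, 0x40)

# Case 2: a device flips the bit and then regenerates the FCS, as described.
rewritten = add_fcs(flip_bit(body, 7, 0x40))

if __name__ == "__main__":
    print(fcs_ok(good))                # clean frame passes
    print(fcs_ok(flipped_in_transit))  # ordinary corruption is caught
    print(fcs_ok(rewritten))           # regenerated FCS hides the damage
```

Because the damaged frame carries a valid FCS, every Ethernet port downstream accepts it, and only end-to-end checks such as the IP header checksum or the IS-IS LSP checksum notice anything wrong. That matches the loss without interface CRC errors reported earlier in the thread.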
participants (3)
- David Temkin
- Saku Ytti
- Sam Stickland