On Sat, Dec 06, 2014 at 11:51:56AM +0200, Saku Ytti wrote:
a) One particular optic had slow I2C, and the vendor polled it more aggressively than it could respond. The vendor's polling code didn't handle errors reading from I2C, but instead crashed the whole linecard control plane. The vendor claimed it's not a bug, because it didn't happen on their optics. I tried to explain to them that they cannot guarantee that I2C reads won't fail on their own optics, and that it's a serious problem, but was unable to convince them to fix it. Now I am in possession of a good bunch of SFPs I can stick into your routers in colo and have them crash, and you won't have any clue why they crashed.
b) A particular vendor had a bug in their SFP microcontroller where, after 2**31 1/100ths of a second had passed (roughly 248.5 days of uptime), it started to write its uptime to a location where DDM temperature measurements are read. This was obvious from graphs, because the temperature went linearly from -127 to 127, then jumped back to -127. These optics caused no problems when seated in Vendor1, but when seated in Vendor2 they caused link flapping, even two boxes away! (A-B-C: with A having the problematic optic, B-C might flap.) Coincidentally, Vendor2 is the same vendor as in case a), and they didn't consider this a bug in their code. This was particularly funny: if you rebooted 100 boxes in a maintenance window, the bug would trigger on all of them at the same moment, 2**31 1/100ths of a second later, potentially causing a major outage.
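For a sense of scale in case b), here is a rough sketch of the arithmetic in Python. It only assumes what the post states (a counter ticking every 1/100 of a second, with the bug appearing at 2**31 ticks). The function names are made up for illustration, and treating the graphed temperature as a signed 8-bit value is an assumption used to show how a steadily incrementing byte would produce a sawtooth on a graph, not a description of the actual firmware.

```python
from datetime import timedelta

TICKS_AT_BUG = 2**31    # counter value at which the bug appeared, per the post
TICK_SECONDS = 0.01     # one tick is 1/100 of a second

def uptime_at_bug(ticks: int = TICKS_AT_BUG) -> timedelta:
    """Uptime needed before the counter reaches `ticks` (hypothetical helper)."""
    return timedelta(seconds=ticks * TICK_SECONDS)

def as_signed_byte(raw: int) -> int:
    """Interpret the low 8 bits of `raw` as a two's-complement value.

    Assumption: the monitoring side reads the value as a signed byte.
    """
    raw &= 0xFF
    return raw - 256 if raw >= 128 else raw

if __name__ == "__main__":
    # Roughly 248.5 days of uptime before the bug shows up.
    print(uptime_at_bug())
    # A byte that keeps incrementing climbs to 127, then wraps to negative.
    print([as_signed_byte(n) for n in range(125, 131)])
```

Since the delay is a fixed interval after power-up, boxes rebooted within the same maintenance window all reach the threshold within that same window's width of each other, which is why the mass reboot makes the failure simultaneous.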
Who is Vendor2?