[NANOG] Re: JunOS and MX trojan and malware

16 Mar 2025

Oh. And this is not getting better, this is getting worse.

In Juniper you can do admission control at the flow -> logical
interface -> physical interface -> NPU level. LPTS polices only at the
NPU level, so collateral damage is very expensive.

There was 'lpts punt excessive-flow-trap', which was retired, and we
couldn't get Cisco to understand why a replacement is needed.

E.g. the customer on interface1 has an L2 loop and offers us an
excessive amount of ARP. Other interfaces on the same NPU are dead too.
You used to be able to address this with excessive-flow-trap.
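
To illustrate the collateral damage, here is a toy sketch (not how LPTS
or Junos actually implement anything). It assumes a simple fixed
per-interval packet budget instead of a real token bucket, and the
names Policer, simulate, interface_limit and npu_limit are hypothetical.
It compares an NPU-only policer with a per-interface policer placed in
front of the NPU policer.

```python
from dataclasses import dataclass


@dataclass
class Policer:
    """Fixed packet budget for one interval (simplified stand-in for a policer)."""
    limit: int
    used: int = 0

    def admit(self) -> bool:
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def simulate(offered, npu_limit, interface_limit=None):
    """Return admitted packets per interface for one interval.

    offered: dict of interface name -> packets offered in the interval.
    Packets of each interface are assumed to be spread evenly over the
    interval, so a loud interface reaches the NPU policer far more often.
    """
    arrivals = sorted(
        ((k + 0.5) / count, name)
        for name, count in offered.items()
        for k in range(count)
    )
    npu = Policer(npu_limit)
    per_interface = {}
    if interface_limit is not None:
        per_interface = {name: Policer(interface_limit) for name in offered}
    admitted = {name: 0 for name in offered}
    for _, name in arrivals:
        policer = per_interface.get(name)
        # Interface-level check first: excess traffic is dropped here and
        # never consumes the budget shared by the whole NPU.
        if policer is not None and not policer.admit():
            continue
        if npu.admit():
            admitted[name] += 1
    return admitted


if __name__ == "__main__":
    offered = {"interface1": 10000, "interface2": 10}  # interface1 has the L2 loop
    print("NPU only:     ", simulate(offered, npu_limit=100))
    print("per-interface:", simulate(offered, npu_limit=100, interface_limit=50))
```

With the NPU-only policer, interface1 exhausts the shared budget before
interface2 gets a single packet through. With the per-interface stage,
interface1 is capped on its own and interface2 is unaffected.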

Further, it is impossible to expect customers to understand LPTS when
Cisco does not. We had PE-CE BGP flaps in 690279616, where TAC was
focused on fixing our MQC config, despite LPTS not being subject to
MQC at all. It took escalation to Xander, who initially thought an
ingress ACL could be used to discriminate here. Once I reminded him how
LPTS works, he luckily didn't try to gaslight us like TAC did, but
immediately agreed that LPTS is not subject to ingress ACL either
(apparently at some point it was, which is why Xander was confused for
a while).

So when LPTS does have gaps or collateral damage, you can't even add
an ACL or ingress MQC to tactically address the offending interface.

So a lot more complexity would be needed to make LPTS functional, but
the complexity is already higher than what the vendor can support. And
complexity is being reduced (removal of excessive-flow-trap) without
understanding why it was actually needed.

On Sun, 16 Mar 2025 at 10:11, Saku Ytti <saku@ytti.fi> wrote:
...
LPTS is not really competitive with the Juniper offering. But because
Juniper needs configuration and LPTS does not, in practice LPTS ends
up having a better outcome. Granted, the outcome is terrible and easy to
bypass, but it is still better than the typical Juniper outcome.
I could explain many gaps in it, both absolute gaps and gaps relative to
Juniper. But one particular thing is that the dimensioning is all wrong:
the device has no idea if it can handle what LPTS admits. For example,
we regularly had 1/8th of our BGP peers go down because some XIPC
worker was congested, because LPTS admitted too many packets to it,
and it ended up doing software drops. LPTS does a poor job of deciding
what should and should not be admitted, at what rate, and of ensuring
that the rate of session 1 does not overpower session 2.
The above problem is particularly hilarious, because the CPU was being
consumed by BGP, which meant XIPC had fewer CPU cycles to handle what
LPTS admitted. Because XIPC doesn't have priority over BGP, this of
course meant that XIPC couldn't hand the packets to BGP, causing more
pressure and CPU cycle demand on BGP. If XIPC had had priority over
BGP, then BGP processing would have been slowed down, but XIPC could
have offered it the work it was going to need to do anyway, reducing
overall CPU time.
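The scheduling argument above can be sketched with a toy single-CPU
model. Everything here is an assumption for illustration: one time
slice per tick, a bounded XIPC queue, BGP always having other work to
burn its slice on, and made-up numbers. It is not a model of actual
cXR internals, and it does not model the extra BGP work that drops
cause (retransmits, session flaps).

```python
def simulate(xipc_has_priority, ticks=100, burst_ticks=40,
             arrivals_per_tick=3, queue_capacity=20,
             xipc_batch=4, bgp_batch=4):
    """Count software drops and packets BGP managed to process."""
    queue = 0      # packets admitted by LPTS, waiting in XIPC
    inbox = 0      # packets XIPC has handed to BGP, not yet processed
    dropped = 0
    processed = 0
    for tick in range(ticks):
        # LPTS admits at a fixed rate, blind to how full XIPC is.
        if tick < burst_ticks:
            accepted = min(arrivals_per_tick, queue_capacity - queue)
            queue += accepted
            dropped += arrivals_per_tick - accepted
        # One CPU slice per tick goes to either XIPC or BGP.
        if xipc_has_priority:
            run_xipc = queue > 0          # XIPC preempts BGP when it has work
        else:
            run_xipc = tick % 2 == 0      # flat priority: plain alternation
        if run_xipc:
            moved = min(xipc_batch, queue)
            queue -= moved
            inbox += moved
        else:
            # BGP uses its slice; only inbox packets count as processed here.
            done = min(bgp_batch, inbox)
            inbox -= done
            processed += done
    return {"dropped": dropped, "processed": processed}


if __name__ == "__main__":
    print("flat priority:", simulate(xipc_has_priority=False))
    print("XIPC priority:", simulate(xipc_has_priority=True))
```

In this toy setup the flat-priority run overflows the XIPC queue and
drops packets, while the XIPC-priority run delays BGP during the burst
but drops nothing and BGP later processes everything that was admitted.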
eXR works better, but that's mostly out of luck, not by design.
Cisco marketed cXR as a real-time OS, and stressed that real time
was crucial for a mission-critical system. Yet Cisco ran
everything at flat priority. Cisco did try to introduce priorities in
cXR internally, but it just made things worse, due to an
incomplete understanding of what customers are doing and how. Regularly
losing 1/8th of BGP sessions was a known problem to Cisco, and
Cisco explicitly decided not to try to address it, other than 'maybe
it'll work better on eXR'.
On Sun, 16 Mar 2025 at 09:01, Jakob Heitz (jheitz) via NANOG
<nanog@lists.nanog.org> wrote:
...
Hi Saku,
Search the Internet for “IOS-XR LPTS” for one way to protect the control plane.
Regards,
Jakob.
...
most others don't even have a way to protect control-plane.

NANOG mailing list
https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/TFPR5TJH...
--
  ++ytti