We suffered a series of crashes that led to JTAC
recommending that we disable RPKI. We had a core dump that
matches PR1332626, which is confidential, so I have no idea
what it is about. Apparently the host running the RPKI
validation server rebooted and the service was not configured
to restart automatically. We also had no redundant validator
and did not monitor the service. So we had
no working RPKI validation server and that apparently caused
the MX204 to become unstable in various ways. It might run
for a day, but it would show packet loss and delays and
generally behave "strangely". The first crash
took down BGP, SSH and subscriber management but left
LDP, OSPF and SNMP up. The router became a black hole we
could not log in to, the worst possible kind of crash for a
router. We had to go onsite and pull the power.
The router appears to run fine after disabling RPKI. I
suppose starting the validation service might also fix the
issue, but I am not going to go there until I know what is
in that PR. I also feel the RPKI function needs to be
fail-safe before we can use it. I know we are at fault for
not deploying the validation service in a redundant setup
and for failing to monitor the service. But we did so
because we thought it was not that important: a
failed validation service should simply lead to no
validation, not a crashed router.
This is on JUNOS 20.1R1.11.
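
For anyone in the same position, the monitoring gap described above could
probably be closed with something as small as a TCP connect check against
the validator's RTR port. The sketch below is only an illustration, not
what we run: the hostname and port are hypothetical placeholders (use
whatever address and port your validator actually listens on), and the
exit codes assume a Nagios-style plugin convention.

```python
import socket
import sys


def rtr_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical values: replace with your validator's address and RTR port.
    host, port = "validator.example.net", 3323
    if not rtr_port_open(host, port):
        print(f"CRITICAL: RTR service on {host}:{port} is unreachable")
        sys.exit(2)
    print(f"OK: RTR service on {host}:{port} accepts connections")
    sys.exit(0)
```

Note that this only proves something is listening on the port. It does not
show that the validator is serving current data, so it is a floor for
monitoring rather than a complete check.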