Dear NANOG members,

We operate a campus network reaching more than 100 buildings on 5 campuses. We also operate a regional backbone and the interconnection to our NREN. The current architecture is made of an L2 backbone and a few routers. Most of the buildings are connected with a 1 Gb/s link over our own optical fiber (only a few buildings are connected at 10 Gb/s). In a smaller number of buildings (a few dozen), we also operate the internal network, made of Ethernet switches in a multi-vendor environment. In each building, we provide at least an edge switch, marking the boundary between us and the customer, where we deliver the different services on Ethernet ports.

The services we currently offer:
- L2 interconnections (400 VLANs are present in 2 buildings or more; only a few VLANs are present in more than 30 buildings),
- IPv4 and IPv6 routing (hundreds of subnets) and Internet access,
- specific interconnections (e.g. terminating a VPN to the customer, say a national private infrastructure delivered by the NREN through MPLS L2/L3VPN and stitched to the customer network using a specific VLAN),
- routing isolation using routing instances (~ VRF Lite): only 5 instances today, but we could have more,
- routing and filtering using open-source firewalls running on servers in our DCs (fewer than 15 platforms, as most customers operate their own firewall),
- user authentication,
- a shared VPN platform allowing direct access for an identified user into the customer network (based on a RADIUS attribute); this platform uses VLANs to interconnect to the rest of the network,
- wireless LAN, also allowing direct access for an identified user into the customer network; the platform is a centralized controller, and it uses VLANs to interconnect to the rest of the network.

(Those last two services could use just a VLAN, or a dedicated subnet delivered on a port of the edge switch which is then connected to the customer firewall.)

We are not satisfied with the current backbone design; we have had our share of problems in the past:
- high CPU load on the core switches due to multiple instances of spanning tree converging slowly on topology changes (mostly fixed by moving to a few MSTP instances),
- spanning tree interoperability problems and spurious port blocking (fixed by BPDU filtering),
- loops at the edge and broadcast/multicast storms (fixed with traffic limits and threshold-based port blocking),
- some small switches at the edge overloaded by large numbers of MAC addresses (fixed by reducing broadcast domain size and subnetting).

This architecture doesn't feel very solid. Even if service provisioning seems easy from an operational point of view (create a VLAN and it is immediately available at any point of the L2 backbone), we feel the configuration is not always consistent. We have to rely on scripts pushing configuration elements and on human discipline (and lots of duct tape, especially for QoS and VRFs).
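To give an idea of what that duct tape looks like, here is a rough sketch (not our actual tooling) of the kind of per-device push we rely on, written with the netmiko library; the device names and credentials are placeholders. Nothing in this approach checks that the edge switches end up consistent with each other, which is precisely our complaint:

    #!/usr/bin/env python3
    # Rough sketch of a per-device VLAN push (placeholder names/credentials).
    from netmiko import ConnectHandler

    EDGE_SWITCHES = ["edge-bldg-a.example.net", "edge-bldg-b.example.net"]

    def push_vlan(host, vlan_id, vlan_name):
        # SSH to the switch; device_type is vendor-specific, so our
        # multi-vendor reality means a lookup table of these in practice.
        conn = ConnectHandler(
            device_type="cisco_ios",
            host=host,
            username="netops",
            password="********",
        )
        # Create the VLAN on this one device; no cross-device
        # consistency check happens anywhere.
        conn.send_config_set([f"vlan {vlan_id}", f"name {vlan_name}"])
        conn.save_config()
        conn.disconnect()

    for sw in EDGE_SWITCHES:
        push_vlan(sw, 412, "PHYSICS-LAB")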
We are re-designing our network architecture. We have enough fiber to imagine many ways to link the core network devices. We find MPLS has merit as a platform to deliver all the network services we currently provide (L2VPN, L3VPN, VPLS, and soon EVPN). However, we also want to upgrade the infrastructure to allow for future traffic growth: some labs, especially in physics, could need more than 10 Gb/s in the coming years. Our evolution cycles are long (we keep a backbone technology for 8 years), and MPLS is definitely not cheap considering the price of a 10G or 100G router interface.

Compared to MPLS, an L2 solution with 100 Gb/s interfaces between core switches and a 10 Gb/s connection to each building looks so much cheaper (see the back-of-envelope arithmetic in the P.S.). But we worry about future trouble with TRILL, SPB, or similar technologies: not only the "open" ones, but especially the proprietary ones based on a central controller and lots of magic (some colleagues feel debugging nightmares are guaranteed).

If you had to make such a choice recently, did you choose an MPLS design, even at lower speed? How would you convince your management that MPLS is the best solution for your campus network? How would you justify the cost or speed difference?

Thanks for your insights!
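P.S. Since the cost question will surely come up: the comparison above rests on back-of-envelope arithmetic like the sketch below. Every price is a made-up placeholder (plug in your own quotes); the point is only the structure of the calculation, port counts times per-port price, with the delta amortized over our 8-year cycle.

    # Back-of-envelope port-cost comparison; every price below is a
    # placeholder, not a real quote.
    BUILDINGS = 100       # ~100 buildings across 5 campuses
    CORE_LINKS = 10       # hypothetical number of inter-campus core links
    YEARS = 8             # length of one backbone technology cycle

    # Hypothetical per-port prices (same currency units throughout).
    MPLS_10G_EDGE = 8000      # 10G port on an MPLS router
    MPLS_100G_CORE = 30000    # 100G port on an MPLS router
    L2_10G_EDGE = 1000        # 10G port on an Ethernet switch
    L2_100G_CORE = 8000       # 100G port on an Ethernet switch

    # One edge port per building, two core ports per core link.
    mpls_total = BUILDINGS * MPLS_10G_EDGE + 2 * CORE_LINKS * MPLS_100G_CORE
    l2_total = BUILDINGS * L2_10G_EDGE + 2 * CORE_LINKS * L2_100G_CORE

    print(f"MPLS design: {mpls_total}, L2 design: {l2_total}")
    print(f"Extra cost per year for MPLS: {(mpls_total - l2_total) / YEARS:.0f}")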