Back2Basic: Fail Closed, Fail Open. Layer 3 vs Layer 2 Failure Behavior
Some time ago I heard from a very good instructor that Layer 3 issues fail closed and Layer 2 issues fail open. This is an interesting way to characterize a network failure and relate it to a specific area. Today, while researching a certain protocol, I came across a very good explanation of exactly how to define Fail Closed and Fail Open situations, so I think it is worth sharing. Here we go:
In Layer 3 (Routing World), when there is a power outage, the lift just stops. In the Layer 2 (Spanning Tree World), the lift falls with a hissing sound and crashes in a cloud of smoke. Not a friendly world. Where does this difference come from?
Fail Closed World
In a heavily summarized form of "how a router works": when a router receives a packet, it looks in its forwarding table to determine where to forward it. If there is no matching prefix in the forwarding table, the router does not forward (i.e. it drops the packet). That way the failure can be isolated quickly, and getting back to normal is a matter of resolving the root cause of the missing path to the destination.
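The lookup-or-drop behavior described above can be sketched in a few lines of Python. This is only an illustration, not how a real router implements its forwarding plane: the function name `route_lookup`, the dictionary-based table, and the interface names are all assumptions made for the example.

```python
import ipaddress

def route_lookup(fib, dst):
    """Return the next hop for dst via longest-prefix match, or None (drop)."""
    addr = ipaddress.ip_address(dst)
    best = None  # (network, next_hop) of the most specific match so far
    for prefix, next_hop in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    # No matching prefix means no forwarding: the router fails closed.
    return best[1] if best else None

# Hypothetical forwarding table: prefix -> outgoing interface.
fib = {"10.0.0.0/8": "eth0", "10.1.0.0/16": "eth1"}

print(route_lookup(fib, "10.1.2.3"))   # most specific prefix wins
print(route_lookup(fib, "192.0.2.1"))  # no prefix, so the packet is dropped
```

The important part is the last line of the function: the absence of information results in a drop, never in a guess.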
In the Layer 3 world, link state protocols like IS-IS or OSPF commonly introduce transient loops during their convergence. It does not matter. There’s no flooding at Layer 3, so a looping packet would only have a local effect on the links that are part of the cycle, and the TTL in the data plane would get rid of it eventually. Multidestination traffic is, again, strictly constrained in the data plane by a powerful reverse path forwarding check (RPFC). Even if the CPU of a router were affected, adjacencies would drop, routes would be removed from forwarding tables and, in the end, packets would stop being forwarded. The system would fall back to a stable state… The lift stops: it FAILS CLOSED.
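To see why a transient routing loop is self-limiting, here is a small sketch of a packet circling a loop of routers while each one decrements the TTL. The function `forward_in_loop` and the router names are hypothetical, and the model ignores everything except the TTL check.

```python
def forward_in_loop(ttl, loop):
    """Send a packet around `loop` (a list of router names) until TTL expires.

    Returns (number of times the packet was forwarded, router that dropped it).
    """
    forwards = 0
    i = 0
    while True:
        router = loop[i]
        ttl -= 1             # each router decrements the TTL on receipt
        if ttl <= 0:
            return forwards, router  # TTL expired: the packet is discarded
        forwards += 1        # otherwise it is forwarded to the next router
        i = (i + 1) % len(loop)

forwards, dropped_by = forward_in_loop(64, ["R1", "R2", "R3"])
print(forwards, dropped_by)
```

However long the loop persists, the damage from a single packet is bounded by its initial TTL.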
Fail Open: Release The Kraken!!!
On the other hand, when a bridge receives a frame, it looks into its filtering database to determine where NOT to send this frame. If there is no entry for the destination in this filtering database, also known as the MAC address table, the frame is not filtered (i.e. it is flooded). Yes, your high-end switch behaves by default as a glorified hub, and this has nothing to do with STP.
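The same kind of sketch for a bridge shows the opposite default. The function `bridge_forward`, the dictionary used as the MAC address table, and the shortened MAC strings are assumptions for illustration only; real switches also age entries and handle VLANs, which are left out here.

```python
def bridge_forward(mac_table, ports, in_port, src_mac, dst_mac):
    """Return the list of ports a frame is sent out of (simplified bridge)."""
    mac_table[src_mac] = in_port           # learn where the source lives
    out_port = mac_table.get(dst_mac)
    if out_port is None:
        # Unknown destination: nothing to filter on, so flood everywhere
        # except the port the frame came in on. The bridge fails open.
        return [p for p in ports if p != in_port]
    if out_port == in_port:
        return []                          # destination is behind the same port
    return [out_port]                      # known destination: one port only

table = {}
ports = [1, 2, 3, 4]
print(bridge_forward(table, ports, 1, "aa", "bb"))  # "bb" unknown: flooded
print(bridge_forward(table, ports, 2, "bb", "aa"))  # "aa" was learned on port 1
```

Note that a broadcast destination is never learned as a source, so in this model it is always flooded.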
The consequence is well known. If for any reason a loop is introduced in the network, a single frame could be forwarded forever because there is no TTL at Layer 2. Worse, the powerful ASICs of your glorified hubs will flood the frame on all their ports, instantly saturating them. Then, just because you’re lucky, the frame is carrying a Layer 3 broadcast that directly hits, and immediately kills, the CPU of all the hosts. Hissing sound.
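A toy simulation makes the effect of a loop visible. It assumes no STP, no host ports, and one broadcast frame; `flood_broadcast` and the topologies are hypothetical. Each link is a pair of switch names, parallel links are allowed, and at every step each switch re-floods a received copy on all of its other inter-switch links.

```python
def flood_broadcast(links, start_switch, steps):
    """Count copies of one broadcast in flight on inter-switch links per step."""
    def other_end(link, sw):
        a, b = link
        return b if a == sw else a

    # Initial flood: one copy on every link attached to the starting switch.
    in_flight = [(i, other_end(link, start_switch))
                 for i, link in enumerate(links) if start_switch in link]
    counts = [len(in_flight)]
    for _ in range(steps):
        nxt = []
        for in_link, sw in in_flight:
            for j, link in enumerate(links):
                # Flood on every link of this switch except the ingress link.
                if j != in_link and sw in link:
                    nxt.append((j, other_end(link, sw)))
        in_flight = nxt
        counts.append(len(in_flight))
    return counts

print(flood_broadcast([("A", "B"), ("B", "C")], "A", 3))              # tree
print(flood_broadcast([("A", "B"), ("B", "C"), ("C", "A")], "A", 3))  # ring
print(flood_broadcast([("A", "B")] * 3, "A", 3))                      # 3 parallel links
```

In the loop-free tree the frame dies out on its own; in the ring the copies circulate forever; with more than two looped links per switch the number of copies grows at every step. Nothing in the frame itself ever stops it.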
Wait, it’s not over! You might have some low-end switches at the access layer, with poor control plane protection. Their CPU might also be impacted by this traffic, and they might not be able to run their STP process any more. Interestingly enough, most people don’t realize that those edge switches, the ones furthest from the root, are responsible for blocking ports in a bridged network. If they can’t run STP properly, they’re going to open even more loops… Cloud of smoke. Not only are the servers affected, but the failure condition can be maintained indefinitely. A local issue can have global, permanent impact. In short, it FAILS OPEN!!!!
Final Thoughts
This situation illustrates one of the reasons why, in the critical areas of the network, Layer 3 (routed) links are preferred over Layer 2 (bridged, trunked, etc.) links. Some networks require deterministic behavior, and routing protocols offer advantages over STP in controlling what happens in case of failure. Always prefer a scenario that fails closed!!
Anyway, the final word on what to choose will be determined by the goal the network has to achieve.
Keep in mind that any question about network design and best practices will very often be answered with the number one option in the Networking Best Practices Book: IT DEPENDS!!!
Freddy Bello M.