A couple of people have been pondering how to design IP fabrics. Ethan Banks looks at the two-tier fabric built by Mellanox, and Brad Hedlund asks whether 10G or 40G links should be used to build the fabric. Both are interesting reads. Some key things to note:
- The higher the fan-out at the leaf/spine, the larger the fabric that still has uniform characteristics of latency, performance etc. (Unlike a general network, a fabric is expected to have this property, hence the design constraint.)
- With a couple of tiers, the fabric can scale out much further: rather than using the TOR switches as the leaf switches, we can have TOR switches connecting to leaf switches. Intra-rack communication is always more performant than inter-rack communication. However, with two tiers, non-uniform communication patterns emerge.
- The basic advantage of 10G over 40G is more fan-out from leaf to spine, along with longer allowed cable lengths, which gives flexibility in placing the leaf and spine switches. Caveats include the large ECMP path count, which needs hardware support. By contrast, 40G QSFP cable lengths are limited, so we cannot stretch the network too far.
- The basic advantage of 40G over 10G is the ability to absorb the skew caused by L3 ECMP hashing. If more flows hash onto one path, a 40G link can handle the load better than a 10G link. Along this line of argument, a 100G link is better than 40G, but I wonder what the fan-out would be in that case.
- Brad’s article gives some formulas: numberOfLeafNodes = numberOfPortsInSpine, numberOfSpineNodes = numberOfUplinkPortsInLeaf, and numberOfServerFacingPorts = numberOfLeafNodes * numberOfServerPortsPerLeafNode.
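
To make the formulas above concrete, here is a small Python sketch. It assumes each leaf has exactly one link to every spine. The function name and the port counts are hypothetical, chosen only for illustration: a 32-port 40G spine with leaves that have 4 x 40G uplinks, compared with the same boxes where every 40G port is assumed to be broken out into 4 x 10G.

```python
def leaf_spine_size(spine_ports, leaf_uplinks, server_ports_per_leaf):
    """Size a leaf/spine fabric, assuming one link from each leaf to each spine."""
    leaves = spine_ports    # every spine port terminates exactly one leaf
    spines = leaf_uplinks   # every leaf uplink goes to a different spine
    return {
        "leaves": leaves,
        "spines": spines,
        "server_ports": leaves * server_ports_per_leaf,
    }


# Hypothetical hardware: 32 x 40G spine ports, 4 x 40G uplinks per leaf,
# 48 server-facing ports per leaf.
forty_gig = leaf_spine_size(spine_ports=32, leaf_uplinks=4, server_ports_per_leaf=48)

# Same boxes, assuming each 40G port is split into 4 x 10G links.
ten_gig = leaf_spine_size(spine_ports=128, leaf_uplinks=16, server_ports_per_leaf=48)

print(forty_gig)  # {'leaves': 32, 'spines': 4, 'server_ports': 1536}
print(ten_gig)    # {'leaves': 128, 'spines': 16, 'server_ports': 6144}
```

With these made-up numbers the 10G design fans out to four times as many leaves and server ports, at the price of 16-way instead of 4-way ECMP on every leaf.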
Ivan Pepelnjak highlights a fundamental difference between full mesh and CLOS architectures: unlike a CLOS network, a full mesh does not give the maximum possible ECMP bandwidth between all the edge nodes. Period.
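
A rough way to see this in numbers, as a sketch under simplifying assumptions rather than Ivan's own calculation: routing uses only shortest paths, a full-mesh node splits its fabric ports evenly across its peers, and in the CLOS case every uplink reaches a spine that is connected to every other leaf. The function names and port counts are hypothetical.

```python
def full_mesh_pair_bandwidth(nodes, fabric_ports_per_node, link_gbps):
    """Shortest-path bandwidth between two edge nodes in a full mesh.

    Assumes ports are split evenly over the (nodes - 1) peers, so only the
    direct links to a given peer are on the shortest path to it.
    """
    peers = nodes - 1
    if fabric_ports_per_node < peers:
        raise ValueError("not enough ports to build a full mesh")
    links_per_peer = fabric_ports_per_node // peers
    return links_per_peer * link_gbps


def clos_pair_bandwidth(uplinks_per_leaf, link_gbps):
    """ECMP bandwidth between two leaves in a leaf/spine (CLOS) fabric.

    Assumes every uplink leads to a spine that reaches the other leaf at
    equal cost, so all uplinks can carry traffic for that pair.
    """
    return uplinks_per_leaf * link_gbps


# Hypothetical example: 5 edge nodes, 4 x 10G fabric-facing ports each.
print(full_mesh_pair_bandwidth(nodes=5, fabric_ports_per_node=4, link_gbps=10))  # 10
print(clos_pair_bandwidth(uplinks_per_leaf=4, link_gbps=10))                     # 40
```

With the same four fabric ports per edge node, the mesh offers a single 10G shortest path between any given pair, while the leaf/spine design can spread that pair's traffic over all four uplinks.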