Posted in software


Some nice points in this Introduction to BGP:

  • [BGP] routing information is usually exchanged between competing business entities, Internet Service Providers (ISPs), in an open, hostile environment (public Internet). BGP is thus very security-focused (for example, all adjacent routers have to be configured manually).
  • All other routing protocols are concerned solely with finding the optimal path toward all known destinations. BGP cannot take this simplistic approach because the peering agreements between ISPs almost always result in complex routing policies.
  • Local preference: the “internal cost” of a destination, used to ensure AS-wide consistency.
  • Multi-exit discriminator: this attribute gives adjacent ISPs the ability to prefer one peering point over another.
  • Communities: a set of generic tags that can be used to signal various administrative policies between BGP routers. (My guess is that communities are no longer used only for this purpose.)
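
To make the role of local preference and the multi-exit discriminator (MED) a little more concrete, here is a minimal sketch of how a router might rank candidate routes using those attributes. It is heavily simplified and not the full BGP decision process: the `Route` class, the field names, and the sample values are all hypothetical, and it compares MED across all routes, whereas real implementations normally compare MED only between routes learned from the same neighboring AS.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical, stripped-down view of a BGP route to one prefix."""
    next_hop: str
    local_pref: int           # higher is preferred
    as_path: Tuple[int, ...]  # shorter is preferred
    med: int                  # lower is preferred


def preference_key(route: Route) -> tuple:
    # Simplification: real BGP has more tie-breakers, and MED is usually
    # only compared between routes from the same neighboring AS.
    return (-route.local_pref, len(route.as_path), route.med)


def best_route(routes: Sequence[Route]) -> Route:
    """Return the most preferred route under the simplified ordering."""
    return min(routes, key=preference_key)


# Illustrative values only (documentation address and AS ranges).
candidates = [
    Route("192.0.2.1", local_pref=100, as_path=(64500, 64510), med=10),
    Route("192.0.2.2", local_pref=200, as_path=(64501, 64502, 64510), med=50),
    Route("192.0.2.3", local_pref=200, as_path=(64501, 64502, 64510), med=20),
]
print(best_route(candidates).next_hop)
```

Local preference is checked first, which is why the shorter AS path through 192.0.2.1 loses here: policy beats path length.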

Unequal Cost Load Balancing

Why not do unequal-cost multipath (UCMP)? Why always equal-cost multipath (ECMP)? There may be two reasons:

  • With UCMP, traffic is spread over paths of different cost, which can introduce variance in packet delay at the end host.
  • Using UCMP may cause routing loops, as this article points out.

Apparently, only EIGRP and BGP are capable of determining whether the alternate unequal-cost paths are loop-free.
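
As a rough illustration of how EIGRP makes that determination, the sketch below applies its feasibility condition: an alternate next hop is considered loop-free only if the distance it reports is lower than the best total metric (the feasible distance). The `Path` type, the function name, and the metric values are hypothetical, and treating the variance comparison as a strict "less than" is an assumption of this sketch.

```python
from typing import List, NamedTuple, Sequence


class Path(NamedTuple):
    """Hypothetical view of one candidate path to a destination."""
    next_hop: str
    reported_distance: int  # neighbor's own metric to the destination
    total_metric: int       # metric through that neighbor


def ucmp_next_hops(paths: Sequence[Path], variance: int = 1) -> List[str]:
    """Return next hops usable for (unequal-cost) load balancing."""
    if not paths:
        return []
    feasible_distance = min(p.total_metric for p in paths)
    usable = []
    for p in paths:
        is_best = p.total_metric == feasible_distance
        # Feasibility condition: the neighbor must be strictly closer to
        # the destination than we are, so it cannot route back through us.
        loop_free = p.reported_distance < feasible_distance
        # Assumption: strict comparison at the variance boundary.
        within_variance = p.total_metric < variance * feasible_distance
        if is_best or (loop_free and within_variance):
            usable.append(p.next_hop)
    return usable


# Illustrative metrics only.
paths = [Path("C", 10, 20), Path("B", 10, 30), Path("D", 25, 45)]
print(ucmp_next_hops(paths, variance=2))
```

With these numbers, D is rejected even at a variance of 3: its metric would fit, but its reported distance (25) is not below the feasible distance (20), so it cannot be shown to be loop-free.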


On Design of IP Fabrics

A couple of people have been pondering how to design IP fabrics. Ethan Banks looks at the two-tier fabric built by Mellanox, and Brad Hedlund asks whether 10G or 40G links should be used to build the fabric. Both are interesting reads. Some key things to note:

  • The greater the fan-out at the leaf/spine, the larger the fabric that keeps uniform latency and performance characteristics. (Unlike a general network, a fabric is expected to have this uniformity, hence the design constraint.)
  • With a couple of tiers, the fabric can scale out much further: rather than using ToR switches as the leaf switches, the ToR switches connect to leaf switches. Intra-rack communication always performs better than inter-rack communication. However, with two tiers, non-uniform communication patterns emerge.
  • The basic advantage of 10G over 40G is greater fan-out from leaf to spine, plus longer allowed cable lengths for flexible placement of leaf and spine switches. Caveats include the large ECMP path count, which needs hardware support. Also, 40G QSFP cable lengths are limited, so we cannot stretch the network too far.
  • The basic advantage of 40G over 10G is the ability to absorb the skew caused by L3 ECMP hashing. If more flows hash onto one path, a 40G link can handle the load better than a 10G link. By the same argument, a 100G link is better than 40G, but I wonder what the fan-out would be in that case.
  • Brad’s article yields some formulas: numberOfLeafNodes = numberOfPortsInSpine, numberOfSpineNodes = numberOfUplinkPortsInLeaf, and numberOfServerFacingPorts = numberOfLeafNodes * numberOfServerPortsPerLeafNode.
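
The sizing formulas above are easy to play with in a few lines of code. This is only a sketch: the function and field names are mine, the port counts in the example are made up, and it assumes every leaf has exactly one uplink to every spine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSize:
    leaf_nodes: int
    spine_nodes: int
    server_facing_ports: int


def size_leaf_spine(ports_per_spine: int,
                    uplinks_per_leaf: int,
                    server_ports_per_leaf: int) -> FabricSize:
    """Apply the leaf/spine sizing formulas to one switch choice."""
    # Each spine port connects one leaf, so spine port count caps the leaves.
    leaf_nodes = ports_per_spine
    # Each leaf uplink goes to a different spine.
    spine_nodes = uplinks_per_leaf
    return FabricSize(
        leaf_nodes=leaf_nodes,
        spine_nodes=spine_nodes,
        server_facing_ports=leaf_nodes * server_ports_per_leaf,
    )


# Hypothetical hardware: 32-port spines, leaves with 4 uplinks + 48 server ports.
print(size_leaf_spine(ports_per_spine=32,
                      uplinks_per_leaf=4,
                      server_ports_per_leaf=48))
```

Trying a leaf with more, slower uplinks shows the trade-off from the 10G versus 40G discussion: more uplinks means more spines and therefore more ECMP paths for the hardware to support.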

Ivan Pepelnjak highlights a fundamental difference between full mesh and Clos architectures: a full mesh, unlike a Clos network, does not give the maximum possible ECMP bandwidth between all the edge nodes. Period.
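
One way to see this is to count the equal-cost shortest paths between two edge nodes in each topology. The sketch below does that with a breadth-first search. It assumes all links have the same cost and that routing only uses shortest paths; the helper names and the tiny topologies are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Sequence


def count_shortest_paths(adjacency: Dict[str, List[str]],
                         source: str, target: str) -> int:
    """Count equal-cost (minimum hop count) paths from source to target."""
    distance = {source: 0}
    path_count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                # First time we reach this node: record its hop count.
                distance[neighbor] = distance[node] + 1
                path_count[neighbor] = path_count[node]
                queue.append(neighbor)
            elif distance[neighbor] == distance[node] + 1:
                # Another shortest way to reach an already known node.
                path_count[neighbor] += path_count[node]
    return path_count.get(target, 0)


def full_mesh(nodes: Sequence[str]) -> Dict[str, List[str]]:
    return {n: [m for m in nodes if m != n] for n in nodes}


def leaf_spine(leaves: Sequence[str],
               spines: Sequence[str]) -> Dict[str, List[str]]:
    adjacency = {leaf: list(spines) for leaf in leaves}
    adjacency.update({spine: list(leaves) for spine in spines})
    return adjacency


# Each edge node has three links in both topologies.
mesh = full_mesh(["e1", "e2", "e3", "e4"])
clos = leaf_spine(["leaf1", "leaf2", "leaf3", "leaf4"], ["s1", "s2", "s3"])
print(count_shortest_paths(mesh, "e1", "e2"))
print(count_shortest_paths(clos, "leaf1", "leaf2"))
```

In the full mesh the direct link is the only shortest path, so the other links of the node cannot carry ECMP traffic for that pair. In the leaf/spine case every uplink is on a shortest path.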