If you had to decide whether to adopt VXLAN as the overlay in a data center deployment, we could dare to say:
The VXLAN segment is ‘a VLAN with an X in the middle’
…and, if that’s not enough:
Multicast is the quieter brother of the noisier broadcast!
OK, jokes aside: it doesn’t matter whether you use ACI, the CLI or DCNM to implement VXLAN. In every case you have to associate a multicast group with each VXLAN segment (VNI ID) in order to support BUM traffic forwarding… (unless you decide to go for head-end replication).
On ACI (and with DCNM too) you only have to declare the address pool for the BD multicast addresses (GiPo); however, it is still necessary to define “one multicast group per L2VNI ID”… mmmhh, are you sure?
The range reserved for multicast groups in general is 224.0.0.0 – 239.255.255.255, which would mean roughly 268 million IP addresses available; for some reason, however (HW/SW factors limit the number of multicast addresses to a few hundred in most real deployments), a one-to-one mapping of L2VNIs to multicast group addresses may not be possible. On ACI, for example, you can select the pool for the BD multicast addresses in the range 225.0.0.0/15 – 231.254.0.0/15, i.e. 128K GiPo addresses, and on DCNM the accepted mask is in the range /16 – /30.
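To put numbers on these pools, here is a quick sketch with Python’s standard `ipaddress` module. The 224.0.0.0/4 block is the whole IPv4 multicast space; the 225.0.0.0/15 pool is just an illustrative pick for a /15 GiPo pool, and `pool_size` is a helper made up for this example.

```python
import ipaddress

# Whole IPv4 multicast space: 224.0.0.0 - 239.255.255.255
multicast_space = ipaddress.ip_network("224.0.0.0/4")

# Illustrative GiPo pool with a /15 mask (assumed value, not a recommendation)
gipo_pool = ipaddress.ip_network("225.0.0.0/15")


def pool_size(prefix_len: int) -> int:
    """Number of group addresses in an IPv4 pool with the given mask length."""
    return 2 ** (32 - prefix_len)


print(multicast_space.num_addresses)  # 268435456
print(gipo_pool.num_addresses)        # 131072, i.e. 128K
print(pool_size(16), pool_size(30))   # 65536 4
```

So even the smallest mask in the /16 – /30 range leaves a handful of groups to share among the L2VNIs, while a /15 pool gives 128K of them.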
Not satisfied with these numbers? You should know that on ACI rel. 5.0(1) you can configure, for example, at most 21K BDs on a pure L2 fabric (no pervasive GW addresses configured on the BDs and no IP routing enabled on them at all) and 15K BDs on an L3 fabric (with an upper limit of 1,980 BDs per leaf), so 128K would actually be more than enough.
Considering the previous numbers, you could complain that VXLAN with its 16 million combinations was, at this point, just a mantra dictated by the marketing campaign (to be honest, we also have to count the L3VNIs in the whole budget)… and how could I blame you!
Still on ACI, with per-port local VLAN significance you could re-use the same VLAN ID on every single port, so theoretically, for a 48-port leaf, that would mean 48 * 4096 = 196,608 different broadcast domains, i.e. 196,608 different VXLAN IDs per leaf, so 196,608 GiPos per leaf; the maximum of 1,980 BDs per leaf, however, imposes an upper limit on the GiPos that you can configure on each leaf.
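The gap between the theoretical figure and the per-leaf limit is easy to check with a few lines of Python. The constants simply restate the numbers quoted above; the variable names are made up for this sketch.

```python
PORTS_PER_LEAF = 48        # example leaf from the text
VLAN_IDS_PER_PORT = 4096   # 12-bit VLAN ID space, reusable per port
MAX_BDS_PER_LEAF = 1980    # per-leaf BD limit quoted above

# Theoretical broadcast domains with per-port VLAN significance
theoretical = PORTS_PER_LEAF * VLAN_IDS_PER_PORT

# What the leaf can actually host, hence the GiPos it can really need
usable = min(theoretical, MAX_BDS_PER_LEAF)

print(theoretical)                    # 196608
print(usable)                         # 1980
print(f"{usable / theoretical:.2%}")  # 1.01%
```

In other words, the per-leaf limit leaves about 1% of the theoretical broadcast domains actually usable.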
ACI doesn’t use PIM internally to manage BUM traffic, because forwarding is based on the root role of the spines, shared among them according to the fTag assigned to each multicast group. On a classic VXLAN implementation, instead, PIM-ASM or PIM-BiDir can be used, with the spines acting as Rendezvous Point (phantom or anycast solution); but again, we are still talking about multicast group addresses assigned to L2VNIs.
We could be wondering at this point: what happens if multicast groups are not uniquely assigned to each L2VNI or, even worse, if we use just one multicast group for all the L2VNIs? Don’t panic!
Let’s set aside for a moment the ACI implementation which, as we have already seen, has a reduced range of multicast groups available. On a classic VXLAN deployment (no SDN!), depending on whether PIM-ASM or PIM-BiDir is adopted, two kinds of mroute entries will be built, (*,G) and (S,G), in the first scenario, and just the (*,G) entry for the shared tree that has the path to the RP in the second (with PIM-BiDir, leaves never build the Shortest Path Tree towards the source).
PIM-ASM always uses the Shortest Path Tree between sources and receivers after the SPT transition, while PIM-BiDir always uses the shared tree crossing the RP, hence a sub-optimal path; however, on a spine-leaf architecture, we have to be honest, there is not much to choose among the available paths: in the end leaf – spine – leaf is the way …except when the source and the receiver are behind the same leaf.
PIM-BiDir is the best choice for many-to-many application traffic patterns, where senders and receivers can send and receive multicast traffic at the same time; if we also consider the story of the (S,G) versus (*,G) entries that have to be maintained, with PIM-BiDir we could save some resources on leaves and spines.
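A back-of-the-envelope sketch of that saving, in Python. It assumes a worst case in which a router sits on every tree and every VTEP actively sources BUM traffic for every group, so PIM-ASM keeps one (*,G) plus one (S,G) per source and group, while PIM-BiDir keeps only the (*,G). The function name and the fabric sizes are hypothetical; real platforms count state differently.

```python
def mroute_entries(groups: int, source_vteps_per_group: int, bidir: bool) -> int:
    """Worst-case mroute entries on a router that sits on every tree."""
    star_g = groups  # one (*,G) per group in both modes
    # PIM-ASM adds one (S,G) per active source per group; PIM-BiDir adds none
    s_g = 0 if bidir else groups * source_vteps_per_group
    return star_g + s_g


# Hypothetical fabric: 20 leaf VTEPs, with the L2VNIs mapped onto
# 500 groups, 50 groups, or a single shared group
for groups in (500, 50, 1):
    asm = mroute_entries(groups, 20, bidir=False)
    bidir = mroute_entries(groups, 20, bidir=True)
    print(f"{groups:>3} groups: ASM={asm:>5}  BiDir={bidir:>3}")
```

Under these assumptions, both levers reduce the state: switching to PIM-BiDir removes the (S,G) term, and sharing groups among L2VNIs shrinks the number of G entries.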
All that to say that having fewer G entries (by adopting PIM-BiDir and sharing multicast addresses among different L2VNIs) could be a good choice on a Charles Clos architecture, but what about the spreading of BUM traffic across the fabric?
What would change if we adopted only one multicast group shared by all the L2VNIs (the drastic scenario)? Breathe, slowly, take your time…
It would mean that BUM traffic would be spread by the spines towards every leaf where at least one BD (whatever BD!) and its L2VNI are associated with the same multicast address, since the VTEP there has expressed, with a join message, its interest in joining the multicast tree for that group. A simple ARP request leaving a BD associated with a specific L2VNI would actually involve all the leaves, even those with different BDs and different L2VNIs… or, in other words, on the fabric an ARP request would no longer be restricted to one broadcast domain.
The BUM traffic, however, once landed on the destination leaf, would be dropped in 99% of cases, because the L2VNI ID carried in the VXLAN packet doesn’t match the one associated with the BD configured on that leaf (so the endpoints behind it would be safe, because they would not have to process that ARP request). Do you think it would be safe to have that kind of behaviour in your DC? BUM traffic democratically flooding everywhere on the fabric uplinks?
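Here is a toy model of that behaviour, as a minimal Python sketch. The leaf names, the VNIs and the 239.1.1.x group addresses are all made up; the only logic modelled is “a leaf joins a group if one of its local L2VNIs maps to it” and “a leaf drops the packet if the L2VNI is not configured locally”.

```python
# Hypothetical fabric: which L2VNIs are configured on which leaf
leaf_vnis = {
    "leaf1": {10001, 10002},
    "leaf2": {10002, 10003},
    "leaf3": {10003},
    "leaf4": {10004},
}


def bum_delivery(src_leaf, vni, vni_to_group):
    """Return (leaves receiving the BUM packet, leaves that then drop it)."""
    group = vni_to_group[vni]
    # A leaf joins a group if at least one of its local VNIs maps to it
    receivers = {
        leaf for leaf, vnis in leaf_vnis.items()
        if leaf != src_leaf and any(vni_to_group[v] == group for v in vnis)
    }
    # The packet is only useful where the VNI itself is configured
    dropped = {leaf for leaf in receivers if vni not in leaf_vnis[leaf]}
    return receivers, dropped


all_vnis = sorted(set().union(*leaf_vnis.values()))
one_group_per_vni = {v: f"239.1.1.{i + 1}" for i, v in enumerate(all_vnis)}
single_shared_group = {v: "239.1.1.1" for v in all_vnis}

for name, mapping in [("one group per VNI", one_group_per_vni),
                      ("single shared group", single_shared_group)]:
    receivers, dropped = bum_delivery("leaf1", 10002, mapping)
    print(name, "-> received by", sorted(receivers), "dropped by", sorted(dropped))
```

With one group per L2VNI, the ARP request from leaf1 on VNI 10002 reaches only leaf2. With the single shared group it crosses the uplinks to leaf2, leaf3 and leaf4, and the last two simply throw it away.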
Mmmhhh… in my opinion NO!
It could be acceptable, perhaps, only if “all the VLANs” have to be stretched across all the leaves (it could happen, who knows?)…
In that case, who cares? We would have a chatty fabric anyway, as happens in small villages where everybody talks to everybody, no matter whether they are family, relatives, friends or perfect strangers… 🙂