Tales from the Trenches: VMware Cloud Foundation (VCF) 4.0 Deployment Gotcha

July 28, 2020


By Steve Kaplan, Kovarus, SDDC & Cloud Management

I recently carved out some time to rebuild our VMware Cloud Foundation (VCF) infrastructure stack on the latest and greatest release, 4.0. VCF 3.9.1, the release we initially deployed this particular stack with, was the first release to use Application Virtual Networks (AVNs): logical switches provisioned from VMware NSX that provide the vRealize Suite components with VVD-consistent network services, both in the local site (Region A) and for cross-site availability (Region X).

After making a number of tweaks prior to bring-up to account for the changes in the platform, including some new DNS entries and adjustments for the networking changes introduced as the underlying platform shifts from NSX-V to NSX-T as the provider of Software-Defined Networking (SDN), the VCF 4.0 bring-up was a success. BGP was peering, connectivity to my AVN networks seemed fine, and I was ready to take the plunge and start provisioning the vRealize Suite components by standing up vRealize Suite Lifecycle Manager (vRSLCM) via SDDC Manager.

The Problem

Everything seemed to be moving along until I hit an error relating to SDDC Manager trying to update the SSL certificate on the appliance with a certificate minted by the VCF platform:

In hopes of having search engines pick this up, here’s a text translation of the error message:

Description: Request and Configure vRealize Suite Lifecycle Manager SSL Certificate
Progress Messages: Replacing vRSLCM certificates failed
Error:
Message: Replacing vRSLCM certificates failed
Remediation Message:
Reference Token: 8UCLGQ
Cause:

Type: com.vmware.evo.sddc.common.certificateutil.GenericCertException
Message: Error while uploading certificate to remote path /opt/vmware/vlcm/cert/server.crt
Type: com.jcraft.jsch.SftpException
Message: java.io.IOException: inputstream is closed

The Early Adopter Dilemma

Searching all of the usual places, both the VMTN forums and various search engines, for this specific error yielded little in the way of help. Such is life when you choose to adopt technology early, as we did with VCF 4. I was at a loss, even after discussing internally what could be causing this sort of problem. The behavior seemingly didn't make sense: I was able to remotely connect to the LCM appliance and transfer content to it from my own laptop over the VPN, so it wasn't a basic connectivity problem.

I was resigned to opening a support ticket with VMware and settling in for extended troubleshooting sessions. Then a lightbulb went off …

The Fix Is In!

Fortunately, I’m a member of VMware’s vExpert program, and have access to the Slack team where the community congregates. I noticed I wasn’t in the VCF-focused channel, so I hopped in and saw that another individual was having the same issue. After some back and forth, some of the VMware folks who monitor the channel suggested checking MTU settings. Ah-ha!

In the course of revamping the configurations on the top-of-rack switches supporting this infrastructure, I had validated that the MTU values on all of the switch ports were set to 9000, but the gateway interfaces on the switches did not have their MTUs set to 9000. After I raised the MTU on the interfaces on both switches for the VLANs supporting the ESXi and NSX edge tunnel endpoint (TEP) interfaces, retrying the existing task succeeded and the rest of the deployment completed as I would have expected.
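For anyone running a similar top-of-rack pair, the change boiled down to raising the MTU on the gateway (SVI) interfaces for the TEP VLANs, not just on the physical switch ports. Here is a rough sketch of what that looks like in NX-OS-style syntax; the VLAN IDs and descriptions are made up for illustration, and your switch vendor's exact commands may differ:

interface Vlan1614
  description Gateway for ESXi host TEP VLAN (illustrative VLAN ID)
  mtu 9000
interface Vlan1615
  description Gateway for NSX edge TEP VLAN (illustrative VLAN ID)
  mtu 9000

Apply the same change on both switches so the gateway MTU matches the 9000-byte MTU already set on the switch ports.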

What’s The Deal?

Let’s delve briefly into how we got here! VCF 3.x still utilizes NSX-V for the management workload domain, whereas 4.0 and onward use only NSX-T for all workload domains. While NSX-T and NSX-V both provide overlay networking capabilities, the role the edge device plays in each platform is very different, and that difference is a big part of why this happened. Let me stipulate that I am oversimplifying things; the purpose here is to share a helpful tip for a tactical problem, not an in-depth look at the architectural differences between how NSX-T and NSX-V support overlay networks.

In an NSX-V based deployment, the Edge Services Gateway (ESG, or edge) is a virtual appliance (or HA pair) that gets provisioned and serves as the perimeter device for all North-South (out of and into the data center) communication from the AVN networks. In an NSX-T deployment, however, the edge node acts more as an underlay for the logical T0 and T1 routers. By decoupling the routing functions from the devices that provide the performance and throughput for N-S traffic, NSX-T can use either VMs or physical hosts to provide edge services. This change in behavior means that hosts exchange N-S traffic with the TEP interfaces of the edges, unlike in NSX-V.

Because SDDC Manager sits on the VLAN-backed “Management Network” while the LCM appliance sits on an overlay network, what we were seeing, whether we realized it or not, was packet fragmentation caused by the undersized MTU on the gateway interfaces that provide the path for traffic to route between the host and edge TEP interfaces.
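One way to sanity-check this sort of thing from an ESXi host is to look at the TEP vmkernel interfaces and then send an oversized, don’t-fragment ping across the overlay transport network. The commands below are a hedged sketch: the vmkernel interface name, the payload size, and the target address are assumptions for illustration (the payload needs to leave room for Geneve encapsulation overhead underneath your underlay MTU), and the target would be the TEP IP of an edge node in your environment:

# List the vmkernel interfaces and their MTUs (the TEP vmks live on the overlay netstack)
esxcfg-vmknic -l

# Send a don't-fragment ping with a large payload from the host TEP toward an edge TEP
# (vmk10, the 8800-byte payload, and 192.168.50.10 are illustrative values, not from this environment)
vmkping ++netstack=vxlan -I vmk10 -d -s 8800 192.168.50.10

If the large don’t-fragment ping fails while a default-size ping succeeds, there is an MTU mismatch somewhere along the path, which in my case turned out to be the switch gateway interfaces.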

In Closing

While the fix was easy to implement and wouldn’t have been user-impacting (if I had users to worry about), it certainly wasn’t obvious, nor was it presented in a way that I, a person who doesn’t come from a network engineering background, would have identified. Because this is all relatively new and there isn’t an upgrade path yet, I don’t imagine too many people out there will hit it, but I still felt it was worth writing up for the early adopters who may not be network-focused and don’t have an army of CCIEs to pester with questions, as I’m fortunate to have.


Looking to learn more about modernizing and automating IT? We created the Kovarus Proven Solutions Center (KPSC) to let you see what’s possible and learn how we can help you succeed. To learn more about the KPSC go to the KPSC page.

Also, follow Kovarus on LinkedIn for technology updates from our experts along with updates on Kovarus news and events.