VMM bare metal build fails due to no matching logical network

If you deploy a new VMM bare metal build environment you may face an issue where the deployment fails with Error (21219).

The error description doesn't appear to make a lot of sense either, stating:

IP or subnet doesn’t match to specified logical network.


Recommended Action
Specify matching logical network and IP address or subnet.

If you face this, then you've likely checked and double-checked all of your VMM networking and everything looks fine.

What’s happening?

This issue is caused by the build process looking up the IP address that you have allocated to the new server and comparing it to the logical network. When it does this it finds that the IP address doesn't match the subnet configured on the logical network.

But you’ve checked this already and it DOES. You checked it again now just in case and it still does, so this can’t be your problem right!?!?!

Well no.

You see, VMM doesn't appear to just look at the logical network for a match, but more specifically at the first network site that was created on that logical network. In most cases you created the management network site first, so no big deal. But if you didn't create it first, or you deleted it for some reason and recreated it, then it will no longer be the first-created site.
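
There's no creation-order column in the console, but you can at least see how the sites and their subnets are laid out from the VMM PowerShell console. A quick sketch, assuming a logical network called Management-LN (the name is just a placeholder):


$ln = Get-SCLogicalNetwork -Name "Management-LN"
Get-SCLogicalNetworkDefinition -LogicalNetwork $ln | Format-Table Name, SubnetVLans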

You’ve got to be kidding me. But it’s easy to fix?

You would think so, but notice that there are no re-order buttons on the subnets?

This means that the only way to “reorder” them is to delete all of the network sites and recreate them. And if you have created any VM networks or bound them to any other configuration object then you’ll be even happier to know you will need to undo all of this configuration too.
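
If you do end up having to do this, the delete-and-recreate can at least be scripted so the management site goes back in first. This is only a rough sketch with placeholder names and values (Management-LN, the 10.10.10.0/24 subnet and VLAN 0 are all assumptions), and it assumes you have already unbound any VM networks from the sites:


$ln = Get-SCLogicalNetwork -Name "Management-LN"

# Remove the existing network sites from the logical network
Get-SCLogicalNetworkDefinition -LogicalNetwork $ln | Remove-SCLogicalNetworkDefinition

# Recreate the host management site FIRST so it becomes the oldest site again
$mgmtSubnet = New-SCSubnetVLan -Subnet "10.10.10.0/24" -VLanID 0
New-SCLogicalNetworkDefinition -Name "Management" -LogicalNetwork $ln `
    -VMHostGroup (Get-SCVMHostGroup -Name "All Hosts") -SubnetVLan $mgmtSubnet

# ...then recreate the remaining sites in any order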

Hopefully you’re deploying a new cluster and not deciding to deploy bare metal build to an existing one.

In case you think you must have missed this somewhere, it isn’t stated in the documentation. So is it a bug or a feature?

Either way just remember to create the host management subnet first in future.

Unable to delete Hyper-V Host in VMM due to SQL statement failure

I had an odd failure when removing a Hyper-V server from VMM 2016. The job failed with a very generic Error 20413.
So the next step was to check the log file, which gave me an unexpected error.


----------------------------------------------------
------------------- Error Report -------------------
----------------------------------------------------
Error report created 4/26/2018 7:29:26 AM
CLR is not terminating

----------------------------------------------------
--------------- Bucketing Parameters ---------------
----------------------------------------------------
EventType=VMM20
P1(appName)=vmmservice.exe
P2(appVersion)=4.0.2244.0
P3(assemblyName)=Utils.dll
P4(assemblyVer)=4.0.2244.0
P5(methodName)=Microsoft.VirtualManager.DB.SqlRetryCommand.ExecuteNonQuery
P6(exceptionType)=Microsoft.VirtualManager.DB.CarmineSqlException
P7(callstackHash)=9e56

SCVMM Version=4.0.2244.0
SCVMM flavor=C-buddy-RTL-AMD64
Default Assembly Version=4.0.2244.0
Executable Name=vmmservice.exe
Executable Version=4.0.2244.0
Base Exception Target Site=140717336435616
Base Exception Assembly name=System.Data.dll
Base Exception Method Name=System.Data.SqlClient.SqlConnection.OnError
Exception Message=Unable to connect to the VMM database because of a general database failure.
Ensure that the SQL Server is running and configured correctly, then try the operation again.
EIP=0x00007ffb8a573c58
Build bit-size=64


Great!! The service can't talk to SQL, I thought. But this message was a little deceiving, and the next section was actually more important.


----------------------------------------------------
------------ exceptionObject.ToString() ------------
----------------------------------------------------
Microsoft.VirtualManager.DB.CarmineSqlException: Unable to connect to the VMM database because of a general database failure.
Ensure that the SQL Server is running and configured correctly, then try the operation again. ---> System.Data.SqlClient.SqlException: The DELETE statement conflicted with the SAME TABLE REFERENCE constraint "FK_tbl_WLC_VHD_VHD". The conflict occurred
The statement has been terminated.


Again they bury the lead. The first part goes on about not being able to talk to SQL, but then they give you the actual issue: "The DELETE statement conflicted with the SAME TABLE REFERENCE constraint "FK_tbl_WLC_VHD_VHD". The conflict occurred
The statement has been terminated."

When VMM tries to delete the server it hits the "FK_tbl_WLC_VHD_VHD" foreign key constraint because of rows that still reference the host. This is blocking the deletion of the server object.

I found some mentions that this may be due to the server belonging to a cluster, which it did, and that VMM may take some time to clean up the references. Well, this server had been removed from the cluster almost 12 hours earlier, so I doubted that just waiting longer would do it and decided to clean up the table myself.

This appeared to be caused by some orphaned objects that the database still listed as being present on the host even though they were long gone. These existed in the tbl_WLC_PhysicalObject table.

VMM uses GUIDs to refer to objects in the database, so I first needed to get the GUID for the server, which could then be used to target these entries. This was simple with PowerShell.


(Get-SCVMHost -ComputerName "Hyper-V-Server-Name").ID


We then pick up the GUID and insert it into the following SQL query after a quick DB backup.


DELETE FROM [tbl_WLC_PhysicalObject] WHERE [HostId] = 'VM-Host-GUID'
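

If you want to see what that statement is going to touch before you run it, a quick row count does the job. This is just a sketch: it assumes the SqlServer PowerShell module (for Invoke-Sqlcmd) is available and that your VMM database uses the default name of VirtualManagerDB.


Invoke-Sqlcmd -ServerInstance "SQL-Server-Name" -Database "VirtualManagerDB" `
    -Query "SELECT COUNT(*) AS OrphanedRows FROM tbl_WLC_PhysicalObject WHERE HostId = 'VM-Host-GUID'"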


Finally, back to VMM PowerShell to delete the Hyper-V server again. My Hyper-V server was already off the network, so I used -Force to just remove the database references.


Get-SCVMHost -ComputerName "Hyper-V-Server-Name" | Remove-SCVMHost -Force


This time the job succeeded.

VMM Hates SAN Groups Or How To Kill Your Cluster

A really nice feature of VMM is that you can integrate it with any SAN that has an SMIS interface and then perform storage tasks, such as adding disks or even deploying VMs based on SAN snapshots. In fact, if you set up an SMIS SAN, many standard tasks are updated to include SAN activities. This is where things start to go off the rails.

You see, most SANs use groups to manage access to LUNs. That way, as you add a LUN you only have to add it to a single group and then all of the servers in that group can see it.

Well, VMM doesn't work this way; it thinks in terms of servers. You'll see this if you add a new LUN from VMM: it maps each server to the LUN rather than using any obvious group. That's fine, you might think, but things get nasty when you try to remove a server's access.

You see, VMM may not add servers to groups, but it absolutely knows enough about them to do some serious damage. If you remove a server from a cluster, part of the job is to remove that server's access to the cluster disks. This not only removes any access published directly to the server but also removes any SAN groups that the server is a member of. That has the side effect of removing disk access for every other server in the same SAN group, effectively removing all SAN disks from all cluster nodes.

I first saw this with a SAN that I had never used before and just thought that it might be a bug in this vendor's SMIS implementation, but I have recently seen the same behaviour with a totally different vendor.

So in short, groups make a heap of sense from the SAN point of view, but if you are going to use SMIS with VMM then ONLY assign servers to the LUNs.

VMM Bare Metal Builds and why you should use a Native vLAN

VMM Bare Metal Builds are an amazing way to ensure that your Hyper-V servers start out consistent. It’s a bit magical but part of that process just works better when you use a native VLAN. But why is that the case?

First let’s look at the VMM Bare Metal Build process.

  1. The VMM server connects to the hardware management interface and instructs the server to reset. This is immediate, and if you specified the wrong hardware management address, well, congratulations, you just rebooted a server you didn't intend to.
  2. The new server being rebuilt goes through its boot process. Hopefully you have it configured to PXE boot. It will get a DHCP address and then ask for a PXE server to respond.
  3. The WDS server receives the PXE boot request and checks with the VMM server to see whether the request is authorised. If it is, it responds to the request and sends the WinPE image.
  4. The new server loads the WinPE operating system and connects to the network. This is a brand new network connection and is in no way connected to the PXE boot. You've just booted into an OS, after all.
  5. The new server runs the VMM scripts to discover the hardware inventory and then sends this to the VMM server.
  6. Once the admin inputs the required information (new server name and possibly network information), the new server begins the build process by cleaning the specified disk and downloading the VHDX image.
  7. The new server then reboots. This time the server is not authorised to PXE boot, so it proceeds to boot from the new VHDX image.
  8. The new server then customises the sysprepped operating system, including any static IP address you provided, and performs any additional customisation required by the VMM build process (i.e. adding the Hyper-V and MPIO roles and installing the VMM agent).
  9. You should now be left with a server on the network using the configured network settings.

There are a few things to note here. Each time the server either PXE boots or boots into WinPE, it relies on finding a DHCP server. If you're using port-channel network connections, and very few people are not these days, then how is this request going to work? It needs to know which vLAN to tag the request with.

Now, you can configure most servers in the BIOS to PXE boot with vLAN tagging, and that's great. Now you have your WinPE image. How does WinPE know about the vLAN? That depends on the NIC driver for your server. Is it even possible to modify it so that, when the driver is loaded, it automatically tags traffic with the correct vLAN ID? It might be, but it's something else that needs to be managed, and if VMM updates the WinPE image then you need to reconfigure it all over again.
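
If your NIC driver does expose vLAN tagging as an advanced property then you can at least test the idea by hand from a WinPE PowerShell prompt. The adapter name, the property name and the vLAN ID below are all examples and vary by vendor, so treat this as a sketch rather than something that will work everywhere:


# List the advanced properties the driver exposes (property names differ per vendor)
Get-NetAdapterAdvancedProperty -Name "Ethernet"

# If the driver exposes a "VLAN ID" property, tag the adapter with vLAN 100
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "VLAN ID" -DisplayValue "100"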

Next, when you boot off the VHDX, this also needs to be configured with the correct vLAN ID. I have to admit I have never got to this stage, since the NIC driver in WinPE has always been the blocker for me, but is VMM able to set the correct vLAN ID? You absolutely need to tell VMM which network switch and which logical network to use, but does that mean it will set the vLAN ID correctly? If it doesn't, then this is yet another blocker.

So, as you can see, it may be possible to use vLAN tagging throughout the VMM Bare Metal Build process, but you need to ask whether it's worth the additional overhead: managing the server BIOS, the WinPE drivers and configuration, and the OS customisation. There's a lot going on in this process and everything needs to work perfectly to result in a fully built server. Is it worth all of that just to avoid setting a network as the native vLAN?

Update the VMM Bare Metal WinPE Image

The VMM Bare Metal build process is one of those processes that just seems magical when you first see it but there’s a lot going on to make this work. One of the common issues is that the server will boot using PXE but then will either not be able to continue to talk to the VMM server or will not see any local disks. These are all related to the drivers contained in the WinPE boot image.

This image is managed by VMM, but you will find the current version on the WDS server in the RemoteInstall\DCMgr\Boot\Windows\Images directory, where it is called boot.wim.

If you want to manually update this with new drivers then you can use the script below. You need to run this from the VMM server and it requires that the boot.wim file be located in c:\temp with all drivers extracted into a folder called c:\temp\Drivers. You also need a c:\temp\mount directory for the WinPE image to be mounted to.


$mount = "c:\temp\mount"
$winpeimage = "c:\temp\boot.wim"
$winpetemp = $winpeimage + ".tmp"
$drivers = "C:\temp\Drivers"

copy $winpeimage $winpetemp

dism /mount-wim /wimfile:$winpetemp /index:1 /mountdir:$mount
dism /image:$mount /add-driver /driver:$drivers /recurse
dism /unmount-wim /mountdir:$mount /commit

publish-scwindowspe -path $winpetemp
del $winpetemp


Once the WinPE image has been updated with the new drivers, VMM will distribute the new image to all WDS servers in the environment.

It is also possible to have VMM inject all of the drivers located in the VMM Library, but I try to stay away from this to minimise the size of the WinPE image. Let VMM install any non-critical drivers as part of its own process.

VMM Duplicate VMs

VMM may sometimes discover a VM that already exists in the environment and add it as a new VM. You end up with two different VMs listed with the same name.

To confirm that this is the case, run the following command in VMM PowerShell.


Get-SCVirtualMachine -Name "Duplicate-VM-Name" | Format-List Name, ID, BiosGuid, Location


If you have duplicate machines then everything except the ID, which is assigned by VMM, will match, as shown below.


Name : Duplicate-VM-Name
ID : 42635679-94fb-4149-ad26-66041a8c96eb
BiosGuid : 5cf412b5-3398-4c5a-951f-3e22c7f97d1a
Location : C:\ClusterStorage\volume1

Name : Duplicate-VM-Name
ID : 8c675f1e-6626-4805-b365-f9b6be3d6c7f
BiosGuid : 5cf412b5-3398-4c5a-951f-3e22c7f97d1a
Location : C:\ClusterStorage\volume1


Both of these records refer to the same real VM, so if you delete one of them from the console the underlying VM is deleted and the remaining record goes into a missing state.

If you instead use PowerShell with the -Force parameter the behaviour changes: this removes the record from the VMM database but does not touch the VM itself. You can use the following command to do this.


Get-SCVirtualMachine -Name "Duplicate-VM-Name" | Where-Object ID -eq "8c675f1e-6626-4805-b365-f9b6be3d6c7f" | Remove-SCVirtualMachine -Force


You will now just have a single VM again.

 

VMM 2016 Cluster Upgrades and Resource Groups

In order to upgrade VMM from 2012 R2 to 2016 you need to deploy new management servers and basically do a lift-and-shift upgrade. This is because VMM 2012 R2 supports up to Windows Server 2012 R2, while VMM 2016 ONLY supports Windows Server 2016.

If you installed VMM as a failover cluster then you also need to think about how you are going to handle the cluster as part of this upgrade. With Windows 2016 you can add new nodes to an existing Windows 2012 R2 cluster but there may be reasons to create a brand new cluster. You need to think carefully about the process that you are going to follow either way.

If you are going to configure a new cluster then you need to decide whether you will use the same VMM Cluster Service name or a new name. If you use a new name then you will need to reassociate all agents once you have completed the installation. Think about any parts of the environment which may also rely on the old VMM server name.
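
If you do go with a new name, the agent reassociation itself can be scripted from the new VMM server once it is up. A minimal sketch, assuming an account with admin rights on the managed hosts:


$cred = Get-Credential
Get-SCVMMManagedComputer | Register-SCVMMManagedComputer -Credential $cred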

If, on the other hand, you plan to reuse the old name, then there are a couple of things to watch out for. Ironically, the first and most important is actually the removal of the old VMM nodes. Even if you have stopped the old VMM cluster service, uninstalling the last old node still appears to remove the cluster service computer name from the database. This will result in the new VMM service crashing and being unable to restart. Looking at the VMM log located in c:\programdata\VMMLogs\SCVMM.{GUID} you will see the following error:


Base Exception Method Name=Microsoft.VirtualManager.DB.SqlRetryCommand.ValidateReturnValue

Exception Message=Computer cluster-service-name is not associated with this VMM management server.

Check the computer name, and then try the operation again.


If you face this issue, the quickest way to fix it is to uninstall the VMM service, delete the VMM cluster role, and reinstall using the same database and user settings. There may be a way to fix it in the back-end database, but it's most likely not worth the effort at this point.

To avoid this, you need to uninstall VMM from the old cluster nodes before doing the upgrade. Just make sure that you always select the option to retain the database. You should have a backup of the database already though, right?
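
Taking that backup from VMM itself is a one-liner. A sketch, assuming a share that the VMM service account can write to:


Backup-SCVMMServer -Path "\\FileServer\VMMBackups"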

The other issue you will need to deal with is cluster permissions. Remember that the VMM cluster service is a virtual computer object, just like the cluster name itself, and the cluster's computer account needs permission to manage the VMM cluster service's computer account.

When you run the first node installation it may fail after quite some time with the following error:


"Creation of the VMM resource group VMM failed.Ensure that the group name is valid, and cluster resource or group with the same name does not exist, and the group name is not used in the network."


This is due to the cluster computer account not having access to modify the AD account of the VMM cluster service virtual server. Grant the new cluster computer account full control of the existing cluster service computer account and re-run setup.
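
Granting that access can be done with dsacls from an elevated PowerShell prompt. The distinguished name and accounts below are placeholders: VMM-Cluster-Service is the existing VMM cluster service computer account and NEWCLUSTER$ is the new Windows cluster's computer account:


dsacls 'CN=VMM-Cluster-Service,CN=Computers,DC=contoso,DC=com' /G 'CONTOSO\NEWCLUSTER$:GA'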

While you’re at it make sure that you also grant access to the DNS entries in case these also need to change.