This week in the cloud – 18th June 2018

There’s so much happening in the cloud space at the moment that I thought it would be useful, for my own reference as much as anyone else’s, to produce a summary of some of the big changes that have happened this week. It has been a particularly busy week with the Microsoft Build conference.

The compute decision tree

The first resource I found isn’t new this week but is still quite useful. There are so many different types of compute service in Azure that choosing the wrong one can be catastrophic when migrating on-premises resources to the cloud, and the compute decision tree walks you through that choice.

This and additional information is located at https://docs.microsoft.com/en-us/azure/architecture/guide/technology-choices/compute-decision-tree.

The new DEV lab

Next up is a look at the Azure DevTest Labs service. If you haven’t heard of it, it’s a great way to spin up a new environment to do some testing without keeping that old hardware around or having to build all the boring stuff yourself.

https://azure.microsoft.com/en-us/services/devtest-lab/

With this you can deploy templates with multiple machines, each of which can include different components. This allows you to do things like deploy an SCCM environment, even though that involves multiple servers and services: deploy multiple VMs with domain controllers (including standing up a new forest), SQL services and the SCCM services, all using an automated process.

Then you can minimise costs by automating the shutdown of the environment so that an idle dev machine isn’t costing money while nobody is using it.
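
As a rough sketch of what driving that auto-shutdown from PowerShell might look like (using the AzureRM module that was current at the time; the lab name and resource group below are made up, and the schedule resource name and property values are my reading of the DevTest Labs ARM schema, so check them against the docs before relying on this):

# Assumed names: "MyLabRG" resource group, "MyDevTestLab" lab
$props = @{
    status          = "Enabled"
    taskType        = "LabVmsShutdownTask"           # lab-wide VM shutdown task
    timeZoneId      = "New Zealand Standard Time"
    dailyRecurrence = @{ time = "1900" }              # shut everything down at 7pm
}

Set-AzureRmResource -ResourceGroupName "MyLabRG" `
    -ResourceType "Microsoft.DevTestLab/labs/schedules" `
    -ResourceName "MyDevTestLab/LabVmsShutdown" `
    -Properties $props -ApiVersion "2016-05-15" -Force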

A great resource on building such a lab can be found here: https://execmgr.net/2018/04/13/building-a-configmgr-lab-in-azure/

PSTN Services in Teams getting closer

Here in NZ we will be holding our breath for a while longer before PSTN services are available in Office 365, but things are looking a little easier with the introduction, in preview, of Direct Routing. This is only available in Teams *sigh* but will allow an on-premises telephony gateway to integrate directly with Teams. No more Skype for Business on-premises environment or multi-VM cloud connector. Just install a supported physical (or even virtual) telephony gateway and away you go.

https://techcommunity.microsoft.com/t5/Microsoft-Teams-Blog/Direct-Routing-NOW-in-Public-Preview/ba-p/193915

Let’s just hope that Teams can improve to the point that we will all accept them taking Skype for Business off us in the future.

Linux Everywhere even in your Azure AD

Microsoft now loves Linux. Really loves it. Loves it so much that they have now released a Linux distro in the form of Azure Sphere. This is a new IoT operating system which Microsoft will support for 10 years. While it has built-in integration with Azure, there appears to be nothing to stop it from connecting to another cloud service or even an on-premises environment.

Next up is a boring old Linux VM running in Azure. Not very exciting on its own, but you can now integrate it with Azure AD as the identity provider. This lets you log on to a Linux machine in Azure using your Azure AD credentials, including accounts synced from on-premises.

https://docs.microsoft.com/en-us/azure/virtual-machines/linux/login-using-aad
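
The doc above boils down to adding a VM extension and then assigning a login role. A minimal sketch with the AzureRM module of the day, assuming a VM called linuxvm01 in resource group MyRG (both names invented here) and the extension publisher/type names as I remember them from that article:

# Add the Azure AD login extension to an existing Linux VM
Set-AzureRmVMExtension -ResourceGroupName "MyRG" `
    -VMName "linuxvm01" `
    -Name "AADLoginForLinux" `
    -Publisher "Microsoft.Azure.ActiveDirectory.LinuxSSH" `
    -ExtensionType "AADLoginForLinux" `
    -TypeHandlerVersion "1.0" `
    -Location "australiaeast"

After that you grant the user the Virtual Machine Administrator Login or Virtual Machine User Login role on the VM and they can SSH in with their Azure AD UPN.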

Another Azure AD Service

Just in case there weren’t enough ways to use the words Azure, Active and Directory in a product name, there is now also Azure Active Directory Domain Services. This isn’t really a new service, but I have to admit I totally missed it and must have assumed it was just one of the other Azure Active Directory services.

This time, though, it’s a full-on Active Directory service without the VMs. It uses your Azure AD directory to stand up a managed Active Directory domain in Azure, complete with the features that Azure AD doesn’t include, such as Group Policy, Organisational Units and NTLM/Kerberos authentication.

To be clear this is still NOT your on-premises domain but another domain with the same users, passwords and groups.

Details can be found here: https://azure.microsoft.com/en-us/services/active-directory-ds/.

This is just a taster of some of the changes that have been introduced recently. Microsoft announced at Build that they had introduced 170 different new functions in Azure in the last year. Keeping up with these changes is going to get very difficult, and that’s before you even include AWS.

LastPass on Firefox – The missing copy password function

Ever since Firefox removed the legacy plugin functionality I’ve been annoyed that the new LastPass plugin didn’t have the copy username and copy password options.

LastPass-No-Password-Copy

It’s fine if you are just using the browser to log in but what about when you need to log in using actual applications?

This required you to find the entry, open it for editing, unhide the password and manually copy it to the clipboard. 😩

I finally went looking for a solution and it appears that the Firefox add-on now needs a native messaging component installed alongside the add-on itself. I guess this wasn’t available when the add-on was first released, and LastPass doesn’t tell you to go and get it.

So how do you know if you need it? Well, other than the copy username and copy password options not being available, you can go to the LastPass add-on, select More Options and then About LastPass.

About-Lastpass

If you have the native messaging components then you should see this

LastPass with Native Components

If you don’t have them installed then there will be a button to go to the LastPass web site to install them. This just seems to me to be re-running the LastPass installer but I could be wrong about that.

Once done, all the functions will be available again. Yay!!!

LastPass-With-Password-Copy

Office 365 Hybrid Send-As Functionality – Not quite there yet.

Recently Microsoft announced that mailbox delegation would be available between cloud and on-premises accounts. This would allow a cloud mailbox user to send as an on-premises mailbox.

Looking at the documentation (https://technet.microsoft.com/en-us/library/jj200581(v=exchg.150).aspx) it appears that this should have been working by early May 2018. In particular:

Note:
As of February 2018 the feature to support Full Access, Send on Behalf and folder rights cross forest is being rolled out and expected to be complete by April 2018.

This feature requires the latest Exchange 2010 RU, Exchange 2013 CU10, Exchange 2016 or above but otherwise should just work.

Unfortunately when a user tried to use this feature it didn’t work.


This message could not be sent. Try sending the message again later, or contact your network administrator. You do not have the permission to send the message on behalf of the specified user. Error is [0x80070005-0x0004dc-0x000524].


Notice that this mentions the send on behalf rights. Well in this case the user didn’t have those but instead had the more powerful Send-As rights.

Well it looks like Microsoft are running a bit late on the rollout with this other article (https://technet.microsoft.com/en-us/library/jj906433%28v=exchg.150%29.aspx?f=255&MSPPError=-2147217396) now shifting the rollout completion to Q2 2018.


Note:
As of February 2018 the feature to support Full Access and Send on Behalf Of is being rolled out and expected to be complete by the second quarter of 2018.

Either way it’s not much longer, but in the interim you may need to keep assigning Send on Behalf rights prior to migrating mailboxes. This will save you having to use PowerShell to do it post-migration, since the on-premises ECP interface doesn’t support granting these rights to cloud mailboxes.
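
For reference, granting those rights on-premises before the move is a one-liner per mailbox in the Exchange Management Shell (mailbox and user names below are obviously made up):

# Send on Behalf is stored on the mailbox object, so grant it before migrating
Set-Mailbox "Shared Mailbox" -GrantSendOnBehalfTo @{Add="jane.doe"}

# For comparison, Send As is an AD permission and is granted like this on-premises
Add-ADPermission "Shared Mailbox" -User "DOMAIN\jane.doe" -ExtendedRights "Send As"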

 

Cisco UCS – Cisco Server Computing takes virtualisation a step further

I recently implemented a new Cisco UCS environment running Windows Server 2016 Hyper-V, managed by a System Center VMM 2016 and SCCM Current Branch management environment. This was my first introduction to the Cisco UCS platform.

The Hardware

On first look it appeared to be just another blade enclosure environment.

ucs-5108-blade-server-chassis

In addition to the standard blade chassis, the Cisco UCS environment also requires external management controllers called Fabric Interconnects. This is where all the intelligence for the environment sits, and a pair of them can manage multiple chassis.

UCS-6248up-48-port-fabric-interconnect

While a fabric interconnect can be installed as a single unit, I can’t see why anyone would ever want to do that; instead you cluster a pair of units. They are also not just the management controllers for the environment but the conduit for all external communications.

These are active/passive management clusters, so just be aware that a brief management outage occurs when the active role changes. Blade traffic will continue to route as it uses both the active and passive nodes at all times. If a fabric interconnect goes offline it just means that some of the paths are no longer available. As long as you have paths for all services via all fabric interconnects and the servers are configured correctly, you won’t experience any issues.
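
If you want to see which fabric interconnect currently holds the primary management role, Cisco’s UCS PowerTool PowerShell module can tell you. A rough sketch, assuming PowerTool is installed and that Get-UcsMgmtEntity behaves the way I remember:

# Connect to the UCS Manager cluster address (replace with your own) and show which FI is primary
Connect-Ucs -Name "ucs-manager.mydomain.local" -Credential (Get-Credential)
Get-UcsMgmtEntity | Select-Object Id, Leadership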

There are a few caveats there, and unfortunately it is possible to install these units badly. This is not a unit to plug in quickly without any planning.

Connectivity

Cisco have produced validated designs which give step by step documentation to install an environment with specified hardware. The Windows 2016 Hyper-V with VMM validated design uses the following hardware:

  • UCS Blade Chassis
  • UCS Standalone Servers
  • UCS Fabric Interconnects
  • Nexus Switches
  • MDS Fibre Channel Switches
  • NetApp SAN

Put together this gives the following physical design

Cisco UCS Networking Design

It is absolutely possible to drop the MDS switches from this design and use the Nexus switches to provide the Fibre Channel connectivity. Also worth noting is that in this design the NetApps are used for both FC and iSCSI/SMB storage, hence the connection to the Nexus switches.

Each blade chassis is connected via multiple connections to both fabric interconnects. This will provide all external connectivity including network and storage access as well as the management, which we will go into later.

Each port on the fabric interconnects will then be configured as either a server, network or FC port. Server ports are used to discover chassis and standalone UCS servers. Network ports are configured using network templates for external connectivity.

FC ports cannot be assigned arbitrarily; they are limited to a block of ports whose location differs depending on the fabric interconnect model you are using. The UCS 6248s that I used required the FC ports to be located at the top end of the port range on each fabric interconnect. If you wanted two FC ports per fabric interconnect, these would be assigned to ports 31 and 32 on each unit.

The Virtualisation magic

This is reasonably standard so far so why did I say that it takes virtualisation a step further?

Well, each server does not get configured directly. In fact, Cisco would rather you forget that you even have servers and just think about resources.

Before you do anything you need to set up the external network and FC configuration, as well as discover your servers.

Then everything is based on templates and service profiles. While it is possible to create a server from scratch without any templates this is not encouraged and would likely result in a giant mess. Instead you need to go through and create templates for everything.

You need to start with the addresses you will be using. This includes MAC addresses, FC addresses and UUIDs. Next you need to create policies for the boot order, BIOS settings, power settings and network configuration.

Then you need to configure all vLANs and vSANs, which can then be assigned to vNICs and vHBAs, each of which also has its own adapter configuration.

Then you need to create pools of servers which will be used to assign the configuration.

Next you create the service templates, which take all of the above information and turn it into a configuration template. You then assign this template to a server pool.

Finally you can configure your servers by deploying the service templates to your server pools. This will give the server a base name as well as a starting number which it will increment.

You would think that this would result in blade 1 in chassis 1 being assigned the first template, but Cisco really don’t want you to think that much about it. It will assign each service profile wherever it sees fit. If you really need to know where the server is physically located then you can look it up, but it’s definitely not front and centre.
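
If you do need to look it up, the same UCS PowerTool module will map service profiles back to physical blades. Another rough sketch; as far as I recall the PnDn property holds the distinguished name of the blade the profile is associated with:

# List each service profile instance and the physical blade it landed on
Get-UcsServiceProfile |
    Where-Object { $_.Type -eq "instance" -and $_.AssocState -eq "associated" } |
    Select-Object Name, AssocState, PnDn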

Each blade will end up with what appear to be physical NICs, which are in fact the vNICs defined in the template, as well as FCoE adapters to match the vHBA configuration.

Sounds like a lot of effort. Why bother?

It is a lot of effort up front, but once you’ve got your service templates expanding the environment is quite amazing. This is particularly the case if you also use SAN boot rather than local disk. Have a hardware failure? Just reassign the service profile to another blade in the environment. The server will reboot and be operational with ALL hardware configuration being identical.

Most other blade environments will allow you to switch out a blade, with the new blade having the same FC and MAC address, but this goes so much further. It also saves a trip to the data centre as you can move the configuration to a new slot rather than having to replace the server in the same slot.

Need to install a new chassis? Connect 4 cables and power, discover the chassis, potentially upgrade firmware and then add the new servers to the existing pools. Deploy 8 new servers with the existing service templates. Total time: stuff all.

Throw in the IPMI integration with VMM and you can deploy a new bare metal Hyper-V environment in no time at all.

Need to install a new network card? Sure that’s virtual. Change the service template and trigger a service profile update and all associated servers will now have the new vNIC.

What are the limitations?

As so many Facebook relationship statuses say: it’s complicated. Particularly when setting it up for the first time, you are almost guaranteed to be left scratching your head asking why a template just refuses to deploy. Unfortunately the error messages can be a little vague too, with “not enough compute” and “not enough vNIC/vHBA” errors plaguing me during my deployment.

This is definitely not a unit that you quickly install and have operational in a morning; the physical installation is just the start of the deployment process.

The Cisco environment really wants you to let go of where servers are physically located, which can feel quite counter-intuitive. If you are a bit too obsessive-compulsive for this chaos then you can manually deploy each server to a service template and manually assign a name, but you just know that someone at Cisco is shedding a tear.

You also have to understand just how much control you are handing over to the Cisco management environment. If you are deploying Hyper-V then you should be looking for how to configure jumbo frames on the physical network adapters. The problem is you just won’t find it on the physical adapters. This is because it’s configured on the vNIC template in the UCS management interface.

There are also still some rough edges in the environment. While vSphere 6.5 supports UEFI boot with Secure Boot, this just wouldn’t work for me and ultimately had to be disabled. This was documented as a bug in the current release at the time.

Is it worth the effort?

As always, it depends. If you just want a quick build for a static environment then this may not be for you. It’s fancy, but if the steep learning curve delays the deployment and the flexibility is never used again, it’s a bit of a waste.

I actually really like this hardware for environments that are experiencing change or growth. Everything can be standardised while still allowing for huge expansion. No longer will you have 5 different chassis configurations depending on the engineer assigned to the build.

Upgrading Bamboo results in HTTPS configuration disappearing

I recently upgraded Bamboo within the 5.x release line. The actual upgrade went well, but when Bamboo was restarted the site was only available on the default HTTP port. In this case it was a simple fix, and it was a good thing that I had copied both the application files and the Bamboo home directory.

Even though the Bamboo installer says that it’s going to “upgrade”, what it’s really saying is that it will dump the new files in the old directory. This includes overwriting existing files such as server.xml.

Unfortunately this removed the HTTPS section of this file. Luckily this was a simple copy and paste from the old configuration file.

Yes this is in the upgrade guide but this guide also doesn’t say that an “upgrade” is possible so I figured this was a new function. Oh well.

Now that the server was available on HTTPS I still had another problem. The site wasn’t available via our F5 load balancer. This was a little harder to spot, but was again a simple solution.

When you connect to the root site (https://bamboo.mydomain.com) it sends a 302 redirect to the web service address. The F5 load balancer is looking for an HTTP 200 response to say that the site is healthy, so you can’t point it at the root. Instead we used the location that the redirect sends you to, which in the old version was /userlogin!default.action?os_destination=%2Fstart.action.

Of course, as part of the upgrade this path changed just a little, to /userlogin!doDefault.action?os_destination=%2Fstart.action. Request the old path and there’s no HTTP 200 for you, so the site was marked as down.

Once the health monitor was updated to the new URL the site was available again.
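
If you want a quick way to find the new path after a future upgrade, you can ask the server where the root URL redirects to before touching the monitor. A small sketch in Windows PowerShell:

# Request the root without following the redirect and read the Location header
try {
    Invoke-WebRequest -Uri "https://bamboo.mydomain.com/" -MaximumRedirection 0 -UseBasicParsing -ErrorAction Stop
}
catch {
    # Windows PowerShell throws once the redirect limit is hit; the target path is in the Location header
    $_.Exception.Response.Headers["Location"]
}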

Upgrading BitBucket using HTTPS on Windows from 4.x to 5.x

Just for a change I had to upgrade a Bitbucket 4.x server to 5.9. This is a major upgrade, and Atlassian provided a clear warning that the configuration file needed to be changed manually as part of the process.

The upgrade went well and I decided to start the service to see what state it would be in prior to the configuration file changes. This didn’t go so well, with the service stopping almost immediately.

Checking the Bitbucket web server log I found the following error.


Caused by: java.lang.IllegalStateException: Failed to load property source from location 'file:/D:/Atlassian/ApplicationData/Bitbucket/shared/bitbucket.properties'


This seemed strange as the service had been running fine, but after checking the file permissions, sure enough the Bitbucket service account had no access to the file. A simple fix, so I started the service again.

This time the service started but was using plain HTTP. That was fine; it showed that the service was healthy and talking to the database, so on to the configuration changes.

The configuration that needed to be replicated was below.


<Connector port="443"
keystoreFile="D:\Atlassian\keystore\bitbucket.jks"
maxHttpHeaderSize="8192"
SSLEnabled="true"
maxThreads="150"
minSpareThreads="25"
maxSpareThreads="75"
enableLookups="false"
disableUploadTimeout="true"
useBodyEncodingForURI="true"
acceptCount="100"
scheme="https"
secure="true"
clientAuth="false"
sslProtocol="TLS" />


Looking at the new bitbucket.properties file the format was a little different and not what was expected based on the upgrade documentation. It seemed on first look to use semi-colons as separators rather than being line separated.


sql-server:1433;databaseName=bitbucket;
#>
#>*******************************************************
jdbc.driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
jdbc.url=jdbc:sqlserver://sqlserver:1433;databaseName=bitbucket;
jdbc.user=bitbucket_usr
jdbc.password=***************


To replicate this, the new configuration was set up using semi-colons. When the service was started, things weren’t too good.


Faulting application name: bserv64.exe, version: 1.0.15.0, time stamp: 0x51543b9d
Faulting module name: jvm.dll, version: 25.71.0.1, time stamp: 0x56ac212a
Exception code: 0xc0000005
Fault offset: 0x0000000000214f38
Faulting process id: 0xa70
Faulting application start time: 0x01d3dd132c763723
Faulting application path: C:\Atlassian\Bitbucket\5.9.1\bin\bserv64.exe
Faulting module path: c:\atlassian\bitbucket\4.6.0\jre\bin\server\jvm.dll
Report Id: 552e5d9f-4907-11e8-80db-001dd8b71d05
Faulting package full name:
Faulting package-relative application ID:


Well, that didn’t work. After looking again I saw my mistake, removed the semi-colons and used line breaks instead. This didn’t work too well either, with the Bitbucket log again recording an error.


Caused by: org.springframework.boot.context.embedded.tomcat.ConnectorStartFailedException: Connector configured to listen on port 443 failed to start


After more investigation I found that you need to specify the key-store and key password in this version even if the default password has been used.

Still the service wouldn’t start.

The other configuration items looked pretty clear so I looked at the key-store location parameter. In the old version this was keystoreFile="D:\Atlassian\keystore\bitbucket.jks".

All of the examples provided were for Linux, which used the typical /dir/file format. This surely wouldn’t work for Windows, but I couldn’t find any examples of what to do.

Ultimately I removed the speech marks and converted the back slashes to forward slashes.

So the final working configuration for 5.x is below.


server.port=443
server.ssl.enabled=true
server.ssl.key-store=D:/Atlassian/keystore/bitbucket.jks
server.ssl.key-store-password=changeit
server.ssl.key-password=changeit


The service now started using HTTPS and we were back in business.
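
One quick sanity check before handing it back to the load balancer is just to confirm that something is listening on 443 (hostname made up here):

# Confirm the HTTPS connector is actually listening
Test-NetConnection -ComputerName "bitbucket.mydomain.com" -Port 443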

Unable to delete Hyper-V Host in VMM due to SQL statement failure

I had an odd failure when removing a Hyper-V server from VMM 2016. The job failed with a very generic Error 20413.

VMM-Host-Deletion-Failure-20413

So the next step was to check the log file, which gave me an unexpected error.


—————————————————-
——————- Error Report ——————-
—————————————————-
Error report created 4/26/2018 7:29:26 AM
CLR is not terminating

—————————————————-
————— Bucketing Parameters —————
—————————————————-
EventType=VMM20
P1(appName)=vmmservice.exe
P2(appVersion)=4.0.2244.0
P3(assemblyName)=Utils.dll
P4(assemblyVer)=4.0.2244.0
P5(methodName)=Microsoft.VirtualManager.DB.SqlRetryCommand.ExecuteNonQuery
P6(exceptionType)=Microsoft.VirtualManager.DB.CarmineSqlException
P7(callstackHash)=9e56

SCVMM Version=4.0.2244.0
SCVMM flavor=C-buddy-RTL-AMD64
Default Assembly Version=4.0.2244.0
Executable Name=vmmservice.exe
Executable Version=4.0.2244.0
Base Exception Target Site=140717336435616
Base Exception Assembly name=System.Data.dll
Base Exception Method Name=System.Data.SqlClient.SqlConnection.OnError
Exception Message=Unable to connect to the VMM database because of a general database failure.
Ensure that the SQL Server is running and configured correctly, then try the operation again.
EIP=0x00007ffb8a573c58
Build bit-size=64


“Great!! The service can’t talk to SQL,” I thought, but this message was a little deceiving and the next section was actually more important.


—————————————————-
———— exceptionObject.ToString() ————
—————————————————-
Microsoft.VirtualManager.DB.CarmineSqlException: Unable to connect to the VMM database because of a general database failure.
Ensure that the SQL Server is running and configured correctly, then try the operation again. —> System.Data.SqlClient.SqlException: The DELETE statement conflicted with the SAME TABLE REFERENCE constraint “FK_tbl_WLC_VHD_VHD”. The conflict occurred
The statement has been terminated.


Again they bury the lead. The first part goes on about not being able to talk to SQL, but then they give you the actual issue: “The DELETE statement conflicted with the SAME TABLE REFERENCE constraint “FK_tbl_WLC_VHD_VHD”. The conflict occurred … The statement has been terminated.”

When VMM tries to delete the server it hits references protected by the “FK_tbl_WLC_VHD_VHD” foreign key constraint, and this blocks the deletion of the server object.

I found some mentions that this may be due to the server belonging to a cluster, which it did, and that VMM may take some time to clean up the references. Well, this server had been removed from the cluster almost 12 hours earlier, so I doubted that just waiting longer would do it and decided to clean up the table myself.

This appeared to be caused by some orphaned objects that were still recorded in the database as being present on the host even though they were long gone. These existed in the tbl_WLC_PhysicalObject table.

VMM uses GUIDs to refer to objects in the database, so I first needed to get the GUID for the server, which could then be used to target these entries. This was simple with PowerShell.


(Get-SCVMHost Hyper-V-Server-Name).ID


We then pick up the GUID and insert it into the following SQL query after a quick DB backup.


DELETE FROM [tbl_WLC_PhysicalObject] WHERE [HostId]=’VM-Host-GUID’


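If you prefer to review and run this from PowerShell, something like the sketch below should work. It assumes the SqlServer module’s Invoke-Sqlcmd and the default VMM database name of VirtualManagerDB, so adjust both for your environment:

# Hypothetical values: replace with your SQL instance and the GUID gathered above
$sqlInstance = "SQLSERVER01"
$hostId      = (Get-SCVMHost Hyper-V-Server-Name).ID

# Review the orphaned rows before deleting anything
Invoke-Sqlcmd -ServerInstance $sqlInstance -Database "VirtualManagerDB" `
    -Query "SELECT * FROM tbl_WLC_PhysicalObject WHERE HostId = '$hostId'"

# Only after a database backup: remove the orphaned references
Invoke-Sqlcmd -ServerInstance $sqlInstance -Database "VirtualManagerDB" `
    -Query "DELETE FROM tbl_WLC_PhysicalObject WHERE HostId = '$hostId'"
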
Finally, back to the VMM PowerShell console to delete the Hyper-V server again. My Hyper-V server was already off the network, so I used -Force to just remove the database references.


remove-vmhost Hyper-V-Server-Name -Force


This time the job succeeded.

VMM Hates SAN Groups Or How To Kill Your Cluster

A really nice feature of VMM is that you can integrate it with any SAN that has an SMI-S interface and then perform storage tasks, such as adding disks or even deploying VMs based on SAN snapshots. In fact, if you set up an SMI-S SAN, many standard tasks will be updated to include SAN activities. This is where things start to go off the rails.

You see most SANs will use groups to manage access to LUNs. This way as you add a LUN you only have to add it to a single group and then all servers can see it.

Well, VMM doesn’t work this way. It thinks in terms of servers. You’ll see this if you add a new LUN from VMM: it will map each server to the LUN rather than adding any obvious group. That’s fine, you might think, but things get nasty when you try to remove a server’s access.

You see, VMM may not add servers to groups, but it absolutely knows enough to do some serious damage. If you remove a server from a cluster then part of the job is to remove the cluster disk access. This will not only remove any access published directly to the server but also remove any groups that the server is a member of. This has the side effect of removing disk access for every other server in the same SAN group, effectively removing all SAN disks from all cluster nodes.

I first saw this with a SAN that I had never used before and just thought that it might be a bug in this vendor’s SMI-S implementation, but I have recently seen the same behaviour with a totally different vendor.

So in short, groups make a heap of sense from the SAN point of view, but if you are going to use SMI-S with VMM then ONLY assign individual servers to the LUNs.

VMM Bare Metal Builds and why you should use a Native vLAN

VMM Bare Metal Builds are an amazing way to ensure that your Hyper-V servers start out consistent. It’s a bit magical but part of that process just works better when you use a native VLAN. But why is that the case?

First let’s look at the VMM Bare Metal Build process.

  1. The VMM server connects to the hardware management interface and instructs the server to reset. This is immediate, and if you specified the wrong hardware management address, well, congratulations, you just rebooted a server.
  2. The new server being rebuilt goes through its boot process. Hopefully you have it configured to PXE boot. This will get a DHCP address and then request a PXE server to respond.
  3. The WDS server receives the PXE boot request and checks with the VMM server to see whether the request is authorised. If it is, it responds to the request and sends the WinPE image.
  4. The new server loads the WinPE operating system and connects to the network. This is a brand new network connection and is in no way tied to the PXE boot; you’ve just booted into an OS, after all.
  5. The new server runs the VMM scripts to discover the hardware inventory and then sends this to the VMM server.
  6. Once the admin inputs the required information (new server name and possibly network information) the new server begins the build process by cleaning the specified disk and downloading the VHDX image.
  7. The new server then reboots. This time the server is not authorised to PXE boot, so it proceeds to boot off the new VHDX image.
  8. The new server then customises the sysprepped operating system, including any static IP address you provided, and performs any additional customisation required by the VMM build process (i.e. adding the Hyper-V and MPIO roles and installing the VMM agent).
  9. You should now be left with a server on the network using the configured network settings.

There are a few things to note here. Each time the server either PXE boots or boots into WinPE it relies on finding a DHCP server. If you’re using port-channel network connections with tagged vLANs, and very few people are not these days, then how is this request going to work? It needs to know which vLAN to tag the request with.

Now, you can configure most servers in the BIOS to PXE boot with vLAN tagging, and that’s great. Now you have your WinPE image. How does WinPE know about the port-channel and vLAN tagging? This will depend on the NIC driver for your server. Is it even possible to modify it so that, when the driver is loaded, it automatically uses vLAN tagging with the correct vLAN ID? It’s possible, but it’s something else that needs to be managed, and if VMM updates the WinPE image then you need to reconfigure it all over again.

Next, when you boot off the VHDX this also needs to be configured with the correct vLAN ID. Now, I have to admit I have never got to this stage, since the NIC driver in WinPE has always been a blocker for me, but is VMM able to set the correct vLAN ID? You absolutely need to tell VMM which network switch and which logical network to use, but does that mean it will set the vLAN ID correctly? If it doesn’t, then this is yet another blocker.

So as you can see, it may be possible to use vLAN tagging throughout the VMM Bare Metal Build process, but you need to ask whether it’s worth it: managing the server BIOS, the WinPE drivers and configuration, and the OS customisation. There’s a lot going on in this process and everything needs to work perfectly to end up with a fully built server. Is all of that additional overhead really worth it just to avoid setting one network as the native vLAN?

Skype for Business Admin and Powershell Unresponsive

I had an interesting issue where a Skype for Business admin site would sit at the spinning wheel at 100%. This environment had two Enterprise pools so I checked the other site to find the same thing. At this stage I was fairly convinced that it was bigger than just a bad server.

I then opened up PowerShell, which connected fine. Great!!

Next I ran a command after much thought, or more to the point after typing Get-Cs<a couple of tabs><enter>, which happened to land on Get-CsAdDomain.

So this returned LC_DOMAINSETTINGS_STATE_FAILED. Urgh!

That looks pretty average for what, at this point, is an operational environment.

So next I ran Get-CsUser, and we waited. Yes, there are a few users in the environment so some delay is to be expected, but after a couple of minutes I knew that this wasn’t going to finish.

I checked the event log and found the following error in the Lync Server log:


Source: LS Remote PowerShell

Level: Error

Event ID: 35009

Remote PowerShell cannot create InitialSessionState.

Remote PowerShell cannot create InitialSessionState for user: S-1-5-21-XXXXXXXXX-XXXXXXXXX-XXXXXXXXX-XXXXX. Cause of failure: Thread was being aborted.. Stacktrace: System.Threading.ThreadAbortException: Thread was being aborted.

at System.Threading.WaitHandle.WaitOneNative(SafeHandle waitableSafeHandle, UInt32 millisecondsTimeout, Boolean hasThreadAffinity, Boolean exitContext)

at System.Threading.WaitHandle.InternalWaitOne(SafeHandle waitableSafeHandle, Int64 millisecondsTimeout, Boolean hasThreadAffinity, Boolean exitContext)

at System.Threading.WaitHandle.WaitOne(Int32 millisecondsTimeout, Boolean exitContext)

at Microsoft.Rtc.Management.Store.Sql.ClientDBAccess.OnBeforeSprocExecution(SprocContext sprocContext)

at Microsoft.Rtc.Common.Data.DBCore.ExecuteSprocContext(SprocContext sprocContext)

at Microsoft.Rtc.Management.Store.Sql.XdsSqlConnection.ReadDocItems(ICollection`1 key)

at Microsoft.Rtc.Management.ScopeFramework.AnchoredXmlReader.Read(ICollection`1 key)

at Microsoft.Rtc.Management.ServiceConsumer.CachedAnchoredXmlReader.Read(ICollection`1 key)

at Microsoft.Rtc.Management.ServiceConsumer.TypedXmlReader.Read(SchemaId schemaId, IList`1 scopeContextList, Boolean useDefaultIfNoneExists)

at Microsoft.Rtc.Management.ServiceConsumer.ServiceConsumer.ReadT

at Microsoft.Rtc.Management.RBAC.ServiceConsumerRoleStoreAccessor.GetRolesFromStore()

at Microsoft.Rtc.Management.Authorization.OcsRunspaceConfiguration.ConstructCmdletsAndScopesMap(List`1 tokenSIDs)

at Microsoft.Rtc.Management.Authorization.OcsRunspaceConfiguration..ctor(IIdentity logonIdentity, IRoleStoreAccessor roleAccessor, List`1 tokenGroups)

at Microsoft.Rtc.Management.Authorization.OcsAuthorizationPlugin.CreateInitialSessionState(IIdentity identity, Boolean insertFormats, Boolean insertTypes, Boolean addServiceCmdlets)

Cause: Remote PowerShell can fail to create InitialSessionState for varied number of reasons. Please look for other events that can give some specific information.

Resolution:

Follow the resolution on the corresponding failure events.


Well that doesn’t look so good. Reading this it looked like it might be a database issue. This would make sense since the CMS database is in a single location with all servers accessing it. Even if an object is in AD, Skype for Business will get information about it from a single place, the CMS.

If you have multiple pools including fail-over pools then there is still just one CMS service.
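
If you ever need to confirm where the CMS actually lives and whether the other servers are keeping up with it, a couple of standard cmdlets from the Skype for Business Management Shell will tell you:

# Show which SQL server/instance hosts the Central Management Store
Get-CsManagementConnection

# List any servers whose local replica of the CMS is not up to date
Get-CsManagementStoreReplicationStatus | Where-Object { -not $_.UpToDate }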

The database server was busier than expected, but nothing stood out as really bad (60% average CPU for the SQL process and a few deadlocked processes reported in the SQL log) and it did seem responsive.

It was at this point that other services using the same SQL server were also reported as being down, and the SQL admin made the call to restart the SQL service.

Once restarted everything became responsive again.

Unfortunately I never got to the bottom of what was wrong with the SQL server, but I think it’s still good to remember just how heavily Skype for Business relies on that back-end database service. Yes, there is a SQL instance on each Skype for Business server, but it isn’t used for all processes.