Introduction:
A couple of years back, we deployed an ADFS farm with two nodes and two WAP servers for Workday Single Sign-On, with load balancers in front of both the ADFS farm and the WAP servers for high availability. It worked fine until last month. Last week, the token-signing certificate expired. We renewed it using this method. After the renewal, we noticed that whenever a request hit the secondary node, users were getting the error below:
“Service Unavailable HTTP Error 503. The service is unavailable”
We started digging into the issue and found a few events. We tried to research them, but every article we found talks about granting the service account permission on the certificate. We had not changed the SSL certificate, so that did not apply to our situation.
Event IDs:
Log Name: AD FS/Admin
Source: AD FS
Date: 2/18/2017 1:38:33 AM
Event ID: 102
Task Category: None
Level: Error
Keywords: AD FS
User: <service account>
Computer: <ADFS FQDN>
Description:
There was an error in enabling endpoints of Federation Service. Fix configuration errors using PowerShell cmdlets and restart the Federation Service.
Additional Data
Exception details:
System.ArgumentNullException: Value cannot be null.
Parameter name: certificate
at System.IdentityModel.Tokens.X509SecurityToken..ctor(X509Certificate2 certificate, String id, Boolean clone, Boolean disposable)
at System.IdentityModel.Tokens.X509SecurityToken..ctor(X509Certificate2 certificate, String id, Boolean clone)
at Microsoft.IdentityServer.Service.Configuration.MSISSecurityTokenServiceConfiguration.Create(Boolean forSaml, Boolean forPassive)
at Microsoft.IdentityServer.Service.Policy.PolicyServer.Service.ProxyPolicyServiceHost.ConfigureWIF()
at Microsoft.IdentityServer.Service.SecurityTokenService.MSISConfigurableServiceHost.Configure()
at Microsoft.IdentityServer.Service.Policy.PolicyServer.Service.ProxyPolicyServiceHost.Create()
at Microsoft.IdentityServer.ServiceHost.STSService.StartProxyPolicyStoreService(ServiceHostManager serviceHostManager)
at Microsoft.IdentityServer.ServiceHost.STSService.OnStartInternal(Boolean requestAdditionalTime)
Log Name: AD FS/Admin
Source: AD FS
Date: 2/18/2017 1:38:33 AM
Event ID: 381
Task Category: None
Level: Error
Keywords: AD FS
User: <service account>
Computer: <ADFS FQDN>
Description:
An error occurred during an attempt to build the certificate chain for configuration certificate identified by thumbprint ‘EC00E3DFB10000270C0FB7F5AC58423402AD7F00’. Possible causes are that the certificate has been revoked or certificate is not within its validity period.
The following errors occurred while building the certificate chain:
MSIS2013: A required certificate is not within its validity period when verifying against the current system clock.
User Action:
Ensure that the certificate is valid and has not been revoked or expired.
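Event 381 identifies a certificate by thumbprint, so one way to follow it up is to list the certificates held in the node's AD FS configuration together with their validity dates. This is a minimal sketch, assuming the AD FS PowerShell module is available on the node (Windows Server 2012 R2 or later) and the session is elevated. The variable name `$eventThumbprint` and the calculated column names are illustrative only.

```powershell
# Run on the AD FS node that logged event 381.
# Thumbprint copied from the event description (illustrative variable name).
$eventThumbprint = 'EC00E3DFB10000270C0FB7F5AC58423402AD7F00'
$now = Get-Date

# List every certificate in this node's AD FS configuration with its validity window.
Get-AdfsCertificate |
    Select-Object CertificateType, IsPrimary, Thumbprint,
        @{ Name = 'NotBefore'; Expression = { $_.Certificate.NotBefore } },
        @{ Name = 'NotAfter';  Expression = { $_.Certificate.NotAfter } },
        @{ Name = 'ValidNow';  Expression = { ($_.Certificate.NotBefore -le $now) -and ($_.Certificate.NotAfter -ge $now) } },
        @{ Name = 'InEvent';   Expression = { $_.Thumbprint -eq $eventThumbprint } } |
    Format-Table -AutoSize
```

Running the same lines on both nodes and comparing the output should show whether one node is still configured with a certificate that is outside its validity period.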
Cause:
To isolate the issue, we browsed to each server's URL directly instead of the NLB URL and found that only the second server returned the error. We also noticed that synchronization between the WIDs on the primary and secondary nodes had stopped in the third week of January (last month). The last sync date can be found by running Get-AdfsSyncProperties on the secondary node. If the secondary had not synced since January, it presumably never received the renewed token-signing certificate, which would be consistent with event 381 above.
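The sync check mentioned above can be sketched as follows. It assumes an elevated PowerShell session on the secondary node. The one-day staleness threshold is an arbitrary value chosen for illustration, not something taken from our environment.

```powershell
# Run on the secondary AD FS node in an elevated PowerShell session.
$sync = Get-AdfsSyncProperties

# On a secondary node the output includes the primary's name and the last sync time.
$sync | Format-List Role, PrimaryComputerName, LastSyncTime, LastSyncStatus, PollDuration

# Assumption: a last sync older than one day counts as stale (illustrative threshold).
$cutoff = (Get-Date).AddDays(-1)
if ($sync.Role -eq 'SecondaryComputer' -and $sync.LastSyncTime -lt $cutoff) {
    Write-Warning "Last sync from $($sync.PrimaryComputerName) was at $($sync.LastSyncTime)."
}
```

On the primary node the same cmdlet reports only the role, so the warning branch is skipped there.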
Resolution:
None of our troubleshooting fixed the issue. We thought about moving ADFS from the primary to the secondary node, but were afraid that would make things worse than they already were, since the WID on the secondary might be corrupted. For immediate relief, we pointed the DNS record of the SSO URL at the primary node, so that all traffic went to the primary node only.

After that, we logged a ticket with Microsoft. They analyzed the WID and confirmed that there was no corruption in it. Both the primary and the secondary are in the same VLAN. They finally suggested reinstalling the ADFS component on the secondary node. We hesitated a little, since after a week of troubleshooting they had not found the cause and simply asked us to reinstall, but we reinstalled the ADFS component and overwrote the WID on the node. This fixed the issue!!
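For anyone in the same position, rejoining a secondary node to a WID farm is typically done with the Add-AdfsFarmNode cmdlet. The lines below are only a sketch of what such a rejoin might look like on Windows Server 2012 R2 or later, not a record of the exact commands run in this case. The host name and thumbprint are placeholders and must be replaced before running anything.

```powershell
# Run on the secondary node after the AD FS role has been removed and re-added.
# All values below are placeholders, not values from this environment.
$primary    = 'adfs01.example.com'            # hypothetical FQDN of the primary node
$thumbprint = '<SSL certificate thumbprint>'  # certificate already present in LocalMachine\My
$svcAccount = Get-Credential -Message 'AD FS service account'

# -OverwriteConfiguration replaces any existing AD FS configuration database on this node.
Add-AdfsFarmNode -PrimaryComputerName $primary `
                 -CertificateThumbprint $thumbprint `
                 -ServiceAccountCredential $svcAccount `
                 -OverwriteConfiguration

# Check that the node now reports a recent sync from the primary.
Get-AdfsSyncProperties
```

Because the overwrite discards the node's local configuration, it seems sensible to keep traffic pointed at the primary node until Get-AdfsSyncProperties shows a fresh sync.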
Let us know in the comments if you have any other idea for fixing this issue without reinstalling the component 🙂 Happy learning!!