Always On Availability Groups More Resilient to Transient Network / Lease Timeout Issues


Introduction

SQL Server Always On Availability Groups provides resilience for high availability and disaster recovery solution in a multi-node architecture. In the article, Session timeouts in SQL Server Always On Availability Groups, we explained the session timeout configurations to safeguard AG from soft errors.

Suppose you have an availability group configured for automatic failover. In this synchronized availability group, we can have the following types of failover.

  • Automatic
  • Manual
  • You can do forced failover with possible data loss

In the case of automatic failover, the availability group fails over to a synchronized secondary replica without any data loss. However, an automatic failover requires the satisfied conditions defined in the flexible failover policy.

The flexible failover policy depends on the Health-check timeout thresholdFailure-Condition Level and Cluster timeouts.

It is an important aspect to determine the auto-failover causes in a production database environment. The logs are useful factors for monitoring and investigating the failures.

  • SQL Server error logs
  • Windows cluster logs
  • Cluster event logs
  • SQL Server failover diagnostics (sp_server_diagnostics) logs
  • AlwaysOn_Health extended events output
  • System_health extended events
  • Application and system logs

 “Common” unexpected failure events

Most of the unexpected failover events which we had recently investigated could have been avoided by small changes to the cluster/AG role properties.

In the below example, the downtime was just two seconds, but it closed all application connections and led the running SQL agent jobs to fail (extract from the SQL errorlog file):

  • 2023-05-28 05:48:53.91 Server      The state of the local availability replica in availability group 'AG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.
  • 2023-05-28 05:48:53.91 spid53      The state of the local availability replica in availability group 'AG' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.
  • 2023-05-28 05:48:53.91 Server      The state of the local availability replica in availability group 'AG' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.

Here, we see the classic lease timeout event (extract from the errorlog file):

2023-05-28 05:48:53.91 Server      Error: 19419, Severity: 16, State: 1. Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group 'AG' within the lease timeout period.

2023-05-28 05:48:53.91 Server      Error: 19407, Severity: 16, State: 1. The lease between availability group 'AG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

And in the failover cluster logs, you may see event sequences such as those below (you can extract the cluster logs using Get-ClusterLog powershell command):

DBS   [NETFTEVM] FTI NetFT event handler got event: LocalEndpoint 10.0.1.10:~3343~ has missed two-fifth consecutive heartbeats from 10.0.1.11:~3343~

DBS   [NETFTEVM] FTI NetFT event handler got event: LocalEndpoint 192.168.1.10:~3343~ has missed two-fifth consecutive heartbeats from 192.168.1.11:~3343~

or

ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost

ERR [RES] SQL Server Availability Group <AG>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel WARN  [RHS] Resource AG IsAlive has indicated failure.

INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (AG, Online --> Pending)

 Always On health detection relies on a few mechanisms:

  • Resource DLL (RHS), which determines the IsAlive value at the cluster heartbeat interval, and is controlled by CrossSubnetDelay and SameSubnetDelay cluster properties
  • sp_server_diagnostics, which reports the component health on an interval controlled by the HealthCheckTimeout property
  • Lease mechanism, which is used as a Looks-Alive between the cluster resource host and the SQL processes
  • Session Timeout, which detected the “soft” errors / small timeouts or insufficient resources 
    • This also affects the automatic seeding timeout

The mechanisms above are controlled by properties that can be adjusted at the cluster level and/or the AG level: 


 





Getting cluster properties values

To view the current cluster values, open an elevated Powershell terminal and run the following:

#display current cluster properties

Get-Cluster | fl CrossSubnetDelay, CrossSubnetThreshold, SameSubnetDelay , SameSubnetThreshold

 #display current AG role properties

Get-ClusterResource <yourAG>| Get-ClusterParameter HealthCheckTimeout, LeaseTimeout

 



 


To view the current AG role properties, execute the below SQL statement in SSMS or any SQL IDE:

select ag.name, arcn.replica_server_name, arcn.node_name, ars.role, ars.role_desc, ars.connected_state_desc, ars.synchronization_health_desc, ar.availability_mode_desc, ag.failure_condition_level,ar.failover_mode_desc, ar.session_timeout

from sys.availability_replicas ar with (nolock)

inner join sys.dm_hadr_availability_replica_states ars with (nolock) on ars.replica_id=ar.replica_id and ars.group_id=ar.group_id

inner join sys.availability_groups ag with (nolock) on ag.group_id=ar.group_id

inner join sys.dm_hadr_availability_replica_cluster_nodes arcn with (nolock) on arcn.group_name=ag.name and arcn.replica_server_name=ar.replica_server_name 



 


Making the cluster / AG more resilient

The default/maximum network downtime that AG can absorb without being impacted is 10 seconds (1/2 of LeaseTimeout and 1/3  of HealthCheckTimeout). 

In order to make the cluster/AG roles more resilient and able to absorb more network downtime without a failover, you can increase the cluster and AG properties, taking into account the relationship between the properties’ values: 

For example, to increase the supported network downtime from 10 to 20 seconds:

  • LeaseTimeout – change from 20000 to 40000 (40 seconds)
  • HealthCheckTimeout  – change from 30000 to 60000 (60 seconds)
  • SameSubnetThreshold  – change from 10 to 20
  • Session Timeout – change from 10 to 20 seconds

 You may use the below Powershell script to change the cluster level properties:

$crossSubnetDelayOptimal = 1000;

$crossSubnetThresholdOptimal = 20

$sameSubnetDelayOptimal = 1000

$sameSubnetThresholdOptimal = 20

 

$healthCheckTimeoutOptimal = 60000

$leaseTimeoutOptimal = 40000

 

Write-Host "Check cluster heartbeat timeouts"

$cluster = get-cluster

 

$crossSubnetDelay = $cluster.CrossSubnetDelay

$crossSubnetThreshold = $cluster.CrossSubnetThreshold

$sameSubnetDelay = $cluster.SameSubnetDelay

$sameSubnetThreshold = $cluster.SameSubnetThreshold

 

if($crossSubnetDelay -ne $crossSubnetDelayOptimal)

{

    Write-Host "Cluster option CrossSubnetDelay changed from $crossSubnetDelay to $crossSubnetDelayOptimal"

    $cluster.CrossSubnetDelay = $crossSubnetDelayOptimal

}

 

if($crossSubnetThreshold -ne $crossSubnetThresholdOptimal)

{

    Write-Host "Cluster option CrossSubnetThreshold changed from $crossSubnetThreshold to $crossSubnetThresholdOptimal"

    $cluster.CrossSubnetThreshold = $crossSubnetThresholdOptimal

}

 

if($sameSubnetDelay -ne $sameSubnetDelayOptimal)

{

    Write-Host "Cluster option SameSubnetDelay changed from $sameSubnetDelay to $sameSubnetDelayOptimal"

    $cluster.SameSubnetDelay = $sameSubnetDelayOptimal

}

 

if($sameSubnetThreshold -ne $sameSubnetThresholdOptimal)

{

    Write-Host "Cluster option SameSubnetThreshold changed from $sameSubnetThreshold to $sameSubnetThresholdOptimal"

    $cluster.SameSubnetThreshold = $sameSubnetThresholdOptimal

}

 

Write-Host "Check cluster resource properties"

 

$resources = Get-ClusterResource

ForEach($resource in $resources)

{

    if($resource.ResourceType -eq "SQL Server Availability Group")

    {

        $name = $resource.Name

        $healthCheckTimeout = (Get-ClusterParameter -Name HealthCheckTimeout -InputObject $resource).Value

        $leaseTimeout = (Get-ClusterParameter -Name LeaseTimeout -InputObject $resource).Value

 

        if($healthCheckTimeout -ne $healthCheckTimeoutOptimal)

        {

            Write-Host "$name - HealthCheckTimeout - Changed from $healthCheckTimeout to $healthCheckTimeoutOptimal"

            Set-ClusterParameter -Name HealthCheckTimeout -InputObject $resource -Value $healthCheckTimeoutOptimal

        }

 

        if($leaseTimeout -ne $leaseTimeoutOptimal)

        {

            Write-Host "$name - LeaseTimeout - Changed from $leaseTimeout to $leaseTimeoutOptimal"      

            Set-ClusterParameter -Name LeaseTimeout -InputObject $resource -Value $leaseTimeoutOptimal

        }

    }

}

 

 To change the Session Timeout value, connect to the SQL instance and execute the below statement on all the AG replicas:

ALTER AVAILABILITY GROUP <AG name> MODIFY REPLICA ON '<replica name>' WITH (SESSION_TIMEOUT = 20);

 

Microsoft guidelines

Microsoft compiled guidelines for timeout values, their root causes and outcomes. 



 

 

 

 

 

 

 




Special note on Resource Monitor (RHS)



 


 



















If lease timeout occurs ( >20 seconds), the rhs.exe reports an error in the Windows cluster, and it starts taking preventive action.

You can get the following error messages for a least timeout issues:

select message_id,text from sys.messages

where message_id in(19407,19419,19421,19422) and

language_id=1033 


 

 


 

SQL Server Always On Availability Groups Health Check timeout

 



 








SQL Server Always On Availability Group performs a health check of the primary replica using the sp_server_diagnostics stored procedure. The sp_server_diagnostics executes every 10 seconds. The health check timeout is 30 seconds.

Therefore, the stored procedure returns the result on 1/3 * health checks current threshold. If the SP does not return any results, AG refers to the previous state for determining the instance health until the health-check timeout threshold. The next Is-Alive determines that the primary replica is unresponsive, and it initiates the automatic failover.

Note:

Recap of LeaseTimeout, Session-Timeout and Health check timeouts

 Lease timeout:

  • Default value: 20000 milliseconds, i.e. 20 seconds
  • It is required to prevent a split-brain scenario in the Windows failover cluster
  • It can trigger an AG failover or offline-online
  • It is used in both Is-Alive and the Looks-alive mechanism

Session Timeout:

  • It is used to safeguard against soft errors between AG replicas
  • default value: 10000 milliseconds or 10 seconds
  • It is not part of the Is-Alive or the Looks-alive mechanism
  • Secondary replica becomes DISCONNECTED status due to session timeout
  • You can configure the session timeouts in the availability group properties from the primary replica instance

Health Check timeout

  • Default value: 30000 milliseconds, i.e. 30 seconds
  • It is the timeout if the sp_server_diagnostics does not return any data or reports errors in the health check
  • It is used in both Is-Alive and the Looks-alive mechanism
  • It also depends on the failover condition levels defined from 1-5, as explained earlier. The default configuration is level 3

Server Level Health Detection:

 







SELECT wait_time, *

FROM sys.dm_exec_requests WHERE last_wait_type = 'SP_SEREVR_DIAGNOSTICS_SLEEP'

and command = 'execute'

GO

DBCC INPUTBUFFER()

EXEC sp_server_diagnostics 10

Create the following [AGSessionTimeOut] table on primary replica in availability group database [MyNewDB]. Next, we use a while loop to insert the records into the [AGSessionTimeOut] table and wait for 1 second (using the WAITFOR DELAY command) before inserting the next record.

Create table MyNewDB.dbo.AGSessionTimeOut

(

    ID int IDENTITY(1,1),

    RecordTimestamp datetime

)

WHILE 1=1

BEGIN

    INSERT INTO MyNewDB.dbo.AGSessionTimeOut(RecordTimestamp)

        VALUES (GETDATE())

    WAITFOR DELAY '00:00:01'

END

We have a synchronous AG secondary replica, so it waits for transaction acknowledgement and commits records on the primary replica. After some time, stop the secondary replica SQL instance. It causes connection timeouts for the secondary replica. Now, stop insert the statement on the primary replica after 20-30 seconds.

To troubleshoot the connection timeout, check the SQL Server error log on the primary replica. In the below image, we see that it recorded a connection timeout error at 2020=11-24 11:34:13 AM.

exec xp_readerrorlog 0,1,N'Connection timeout'

 



 

 


Conclusion

In this article, we figured SQL Server Always On Availability Groups flexible failover policy using the lease, health check timeout and Failover Condition Levels. You should review the logs to determine the scenarios where the automatic failover occurred or failed.


Special Thanks to Dan-Andrei Stefan who made very informatic blog on this....

No comments:

Post a Comment

Azure SQL Elastic Pools: a way for saving costs

  Azure SQL Elastic Pools: a way for saving costs  Elastic pools enable you to purchase resources for a pool shared by multiple databases to...