Always On Availability Groups More Resilient to Transient Network / Lease Timeout Issues


Introduction

SQL Server Always On Availability Groups provides resilience for high availability and disaster recovery solution in a multi-node architecture. In the article, Session timeouts in SQL Server Always On Availability Groups, we explained the session timeout configurations to safeguard AG from soft errors.

Suppose you have an availability group configured for automatic failover. In this synchronized availability group, we can have the following types of failover.

  • Automatic
  • Manual
  • You can do forced failover with possible data loss

In the case of automatic failover, the availability group fails over to a synchronized secondary replica without any data loss. However, an automatic failover requires the satisfied conditions defined in the flexible failover policy.

The flexible failover policy depends on the Health-check timeout thresholdFailure-Condition Level and Cluster timeouts.

It is an important aspect to determine the auto-failover causes in a production database environment. The logs are useful factors for monitoring and investigating the failures.

  • SQL Server error logs
  • Windows cluster logs
  • Cluster event logs
  • SQL Server failover diagnostics (sp_server_diagnostics) logs
  • AlwaysOn_Health extended events output
  • System_health extended events
  • Application and system logs

 “Common” unexpected failure events

Most of the unexpected failover events which we had recently investigated could have been avoided by small changes to the cluster/AG role properties.

In the below example, the downtime was just two seconds, but it closed all application connections and led the running SQL agent jobs to fail (extract from the SQL errorlog file):

  • 2023-05-28 05:48:53.91 Server      The state of the local availability replica in availability group 'AG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.
  • 2023-05-28 05:48:53.91 spid53      The state of the local availability replica in availability group 'AG' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.
  • 2023-05-28 05:48:53.91 Server      The state of the local availability replica in availability group 'AG' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.

Here, we see the classic lease timeout event (extract from the errorlog file):

2023-05-28 05:48:53.91 Server      Error: 19419, Severity: 16, State: 1. Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group 'AG' within the lease timeout period.

2023-05-28 05:48:53.91 Server      Error: 19407, Severity: 16, State: 1. The lease between availability group 'AG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

And in the failover cluster logs, you may see event sequences such as those below (you can extract the cluster logs using Get-ClusterLog powershell command):

DBS   [NETFTEVM] FTI NetFT event handler got event: LocalEndpoint 10.0.1.10:~3343~ has missed two-fifth consecutive heartbeats from 10.0.1.11:~3343~

DBS   [NETFTEVM] FTI NetFT event handler got event: LocalEndpoint 192.168.1.10:~3343~ has missed two-fifth consecutive heartbeats from 192.168.1.11:~3343~

or

ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost

ERR [RES] SQL Server Availability Group <AG>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel WARN  [RHS] Resource AG IsAlive has indicated failure.

INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (AG, Online --> Pending)

 Always On health detection relies on a few mechanisms:

  • Resource DLL (RHS), which determines the IsAlive value at the cluster heartbeat interval, and is controlled by CrossSubnetDelay and SameSubnetDelay cluster properties
  • sp_server_diagnostics, which reports the component health on an interval controlled by the HealthCheckTimeout property
  • Lease mechanism, which is used as a Looks-Alive between the cluster resource host and the SQL processes
  • Session Timeout, which detected the “soft” errors / small timeouts or insufficient resources 
    • This also affects the automatic seeding timeout

The mechanisms above are controlled by properties that can be adjusted at the cluster level and/or the AG level: 


 





Getting cluster properties values

To view the current cluster values, open an elevated Powershell terminal and run the following:

#display current cluster properties

Get-Cluster | fl CrossSubnetDelay, CrossSubnetThreshold, SameSubnetDelay , SameSubnetThreshold

 #display current AG role properties

Get-ClusterResource <yourAG>| Get-ClusterParameter HealthCheckTimeout, LeaseTimeout

 



 


To view the current AG role properties, execute the below SQL statement in SSMS or any SQL IDE:

select ag.name, arcn.replica_server_name, arcn.node_name, ars.role, ars.role_desc, ars.connected_state_desc, ars.synchronization_health_desc, ar.availability_mode_desc, ag.failure_condition_level,ar.failover_mode_desc, ar.session_timeout

from sys.availability_replicas ar with (nolock)

inner join sys.dm_hadr_availability_replica_states ars with (nolock) on ars.replica_id=ar.replica_id and ars.group_id=ar.group_id

inner join sys.availability_groups ag with (nolock) on ag.group_id=ar.group_id

inner join sys.dm_hadr_availability_replica_cluster_nodes arcn with (nolock) on arcn.group_name=ag.name and arcn.replica_server_name=ar.replica_server_name 



 


Making the cluster / AG more resilient

The default/maximum network downtime that AG can absorb without being impacted is 10 seconds (1/2 of LeaseTimeout and 1/3  of HealthCheckTimeout). 

In order to make the cluster/AG roles more resilient and able to absorb more network downtime without a failover, you can increase the cluster and AG properties, taking into account the relationship between the properties’ values: 

For example, to increase the supported network downtime from 10 to 20 seconds:

  • LeaseTimeout – change from 20000 to 40000 (40 seconds)
  • HealthCheckTimeout  – change from 30000 to 60000 (60 seconds)
  • SameSubnetThreshold  – change from 10 to 20
  • Session Timeout – change from 10 to 20 seconds

 You may use the below Powershell script to change the cluster level properties:

$crossSubnetDelayOptimal = 1000;

$crossSubnetThresholdOptimal = 20

$sameSubnetDelayOptimal = 1000

$sameSubnetThresholdOptimal = 20

 

$healthCheckTimeoutOptimal = 60000

$leaseTimeoutOptimal = 40000

 

Write-Host "Check cluster heartbeat timeouts"

$cluster = get-cluster

 

$crossSubnetDelay = $cluster.CrossSubnetDelay

$crossSubnetThreshold = $cluster.CrossSubnetThreshold

$sameSubnetDelay = $cluster.SameSubnetDelay

$sameSubnetThreshold = $cluster.SameSubnetThreshold

 

if($crossSubnetDelay -ne $crossSubnetDelayOptimal)

{

    Write-Host "Cluster option CrossSubnetDelay changed from $crossSubnetDelay to $crossSubnetDelayOptimal"

    $cluster.CrossSubnetDelay = $crossSubnetDelayOptimal

}

 

if($crossSubnetThreshold -ne $crossSubnetThresholdOptimal)

{

    Write-Host "Cluster option CrossSubnetThreshold changed from $crossSubnetThreshold to $crossSubnetThresholdOptimal"

    $cluster.CrossSubnetThreshold = $crossSubnetThresholdOptimal

}

 

if($sameSubnetDelay -ne $sameSubnetDelayOptimal)

{

    Write-Host "Cluster option SameSubnetDelay changed from $sameSubnetDelay to $sameSubnetDelayOptimal"

    $cluster.SameSubnetDelay = $sameSubnetDelayOptimal

}

 

if($sameSubnetThreshold -ne $sameSubnetThresholdOptimal)

{

    Write-Host "Cluster option SameSubnetThreshold changed from $sameSubnetThreshold to $sameSubnetThresholdOptimal"

    $cluster.SameSubnetThreshold = $sameSubnetThresholdOptimal

}

 

Write-Host "Check cluster resource properties"

 

$resources = Get-ClusterResource

ForEach($resource in $resources)

{

    if($resource.ResourceType -eq "SQL Server Availability Group")

    {

        $name = $resource.Name

        $healthCheckTimeout = (Get-ClusterParameter -Name HealthCheckTimeout -InputObject $resource).Value

        $leaseTimeout = (Get-ClusterParameter -Name LeaseTimeout -InputObject $resource).Value

 

        if($healthCheckTimeout -ne $healthCheckTimeoutOptimal)

        {

            Write-Host "$name - HealthCheckTimeout - Changed from $healthCheckTimeout to $healthCheckTimeoutOptimal"

            Set-ClusterParameter -Name HealthCheckTimeout -InputObject $resource -Value $healthCheckTimeoutOptimal

        }

 

        if($leaseTimeout -ne $leaseTimeoutOptimal)

        {

            Write-Host "$name - LeaseTimeout - Changed from $leaseTimeout to $leaseTimeoutOptimal"      

            Set-ClusterParameter -Name LeaseTimeout -InputObject $resource -Value $leaseTimeoutOptimal

        }

    }

}

 

 To change the Session Timeout value, connect to the SQL instance and execute the below statement on all the AG replicas:

ALTER AVAILABILITY GROUP <AG name> MODIFY REPLICA ON '<replica name>' WITH (SESSION_TIMEOUT = 20);

 

Microsoft guidelines

Microsoft compiled guidelines for timeout values, their root causes and outcomes. 



 

 

 

 

 

 

 




Special note on Resource Monitor (RHS)



 


 



















If lease timeout occurs ( >20 seconds), the rhs.exe reports an error in the Windows cluster, and it starts taking preventive action.

You can get the following error messages for a least timeout issues:

select message_id,text from sys.messages

where message_id in(19407,19419,19421,19422) and

language_id=1033 


 

 


 

SQL Server Always On Availability Groups Health Check timeout

 



 








SQL Server Always On Availability Group performs a health check of the primary replica using the sp_server_diagnostics stored procedure. The sp_server_diagnostics executes every 10 seconds. The health check timeout is 30 seconds.

Therefore, the stored procedure returns the result on 1/3 * health checks current threshold. If the SP does not return any results, AG refers to the previous state for determining the instance health until the health-check timeout threshold. The next Is-Alive determines that the primary replica is unresponsive, and it initiates the automatic failover.

Note:

Recap of LeaseTimeout, Session-Timeout and Health check timeouts

 Lease timeout:

  • Default value: 20000 milliseconds, i.e. 20 seconds
  • It is required to prevent a split-brain scenario in the Windows failover cluster
  • It can trigger an AG failover or offline-online
  • It is used in both Is-Alive and the Looks-alive mechanism

Session Timeout:

  • It is used to safeguard against soft errors between AG replicas
  • default value: 10000 milliseconds or 10 seconds
  • It is not part of the Is-Alive or the Looks-alive mechanism
  • Secondary replica becomes DISCONNECTED status due to session timeout
  • You can configure the session timeouts in the availability group properties from the primary replica instance

Health Check timeout

  • Default value: 30000 milliseconds, i.e. 30 seconds
  • It is the timeout if the sp_server_diagnostics does not return any data or reports errors in the health check
  • It is used in both Is-Alive and the Looks-alive mechanism
  • It also depends on the failover condition levels defined from 1-5, as explained earlier. The default configuration is level 3

Server Level Health Detection:

 







SELECT wait_time, *

FROM sys.dm_exec_requests WHERE last_wait_type = 'SP_SEREVR_DIAGNOSTICS_SLEEP'

and command = 'execute'

GO

DBCC INPUTBUFFER()

EXEC sp_server_diagnostics 10

Create the following [AGSessionTimeOut] table on primary replica in availability group database [MyNewDB]. Next, we use a while loop to insert the records into the [AGSessionTimeOut] table and wait for 1 second (using the WAITFOR DELAY command) before inserting the next record.

Create table MyNewDB.dbo.AGSessionTimeOut

(

    ID int IDENTITY(1,1),

    RecordTimestamp datetime

)

WHILE 1=1

BEGIN

    INSERT INTO MyNewDB.dbo.AGSessionTimeOut(RecordTimestamp)

        VALUES (GETDATE())

    WAITFOR DELAY '00:00:01'

END

We have a synchronous AG secondary replica, so it waits for transaction acknowledgement and commits records on the primary replica. After some time, stop the secondary replica SQL instance. It causes connection timeouts for the secondary replica. Now, stop insert the statement on the primary replica after 20-30 seconds.

To troubleshoot the connection timeout, check the SQL Server error log on the primary replica. In the below image, we see that it recorded a connection timeout error at 2020=11-24 11:34:13 AM.

exec xp_readerrorlog 0,1,N'Connection timeout'

 



 

 


Conclusion

In this article, we figured SQL Server Always On Availability Groups flexible failover policy using the lease, health check timeout and Failover Condition Levels. You should review the logs to determine the scenarios where the automatic failover occurred or failed.


Special Thanks to Dan-Andrei Stefan who made very informatic blog on this....

Flow Control in Availability Groups and SQL Server 2022

 

Flow Control in Availability Groups and SQL Server 2022

What is Flow Control?

Flow control in Always On Availability Groups refers to mechanisms that control the rate at which data changes are sent from the primary replica to the secondary replicas. The goal is to ensure that the secondary replicas are kept up-to-date with the primary replica while avoiding overwhelming the network or the secondary replicas themselves.

Data synchronization process

To estimate the time to full synchronization and to identify the bottleneck, you need to understand the synchronization process. Performance bottleneck can be anywhere in the process, and locating the bottleneck can help you dig deeper into the underlying issues. The following figure and table illustrate the data synchronization process:

Availability group data synchronization






Two Levels of Flow Control

In SQL Server Always On Availability Groups, the transaction log messages from the primary replica to the secondary replicas are managed through a mechanism that involves two levels of flow control:

  1. Replica-Level Flow Control:

    • Primary-to-Replica Flow Control: Each secondary replica has a threshold for the number of log messages that can be sent from the primary replica before flow control is engaged. If the number of unacknowledged log messages exceeds this threshold, the primary replica will pause sending additional log messages to that particular secondary replica until acknowledgments are received for the previously sent messages. This prevents the secondary replica from being overwhelmed with too many log messages at once.
  2. Database-Level Flow Control:

    • Database-Specific Thresholds: In addition to controlling the flow of log messages at the replica level, SQL Server also implements flow control at the database level. Each database within an Availability Group can have its own thresholds for log message delivery. This means that if the threshold for a specific database is reached, log messages for that database will be paused until acknowledgments are received, while other databases in the same Availability Group can continue to receive log messages as long as their thresholds are not exceeded.

How It Works

  1. Message Thresholds: SQL Server tracks the number of log messages sent to each secondary replica. When the number of unsent log messages reaches a predefined threshold, flow control is triggered. This prevents the secondary replica from being flooded with log data, which can be crucial for maintaining system stability and performance.

  2. Acknowledgment Messages: Once the secondary replica processes the log messages and sends an acknowledgment back to the primary replica, the primary replica resumes sending log messages to that replica. The primary replica adjusts the flow based on the current acknowledgment status to ensure that the number of outstanding messages remains within acceptable limits.

  3. Dynamic Adjustment: SQL Server dynamically adjusts the flow of log messages based on the acknowledgments received and the current load on the secondary replicas. This helps maintain a balance between the rate of log message delivery and the capacity of the secondary replicas to process and apply those logs.

Importance

This flow control mechanism is essential for several reasons:

  • Performance and Stability: It prevents secondary replicas from being overwhelmed by excessive log data, which could degrade performance or lead to errors.
  • Efficient Data Replication: Ensures that log data is replicated efficiently and consistently across all secondary replicas, maintaining data integrity and synchronization.
  • Network Management: Helps manage network bandwidth and reduces the risk of network congestion by controlling the rate at which log messages are sent.

Once the message threshold of either gate is reached, log messages are no longer sent to a specific replica or for a specific database














Two useful performance counters, SQL Server:Availability Replica > Flow control/sec and SQL Server:Availability Replica > Flow Control Time (ms/sec), show you, within the last second, how many times flow control was activated and how much time was spent waiting on flow control. Higher wait time on the flow control translate to higher RPO.

For more details please ref:
https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/monitor-performance-for-always-on-availability-groups?view=sql-server-ver16&tabs=new-limits









Azure SQL Elastic Pools: a way for saving costs

  Azure SQL Elastic Pools: a way for saving costs  Elastic pools enable you to purchase resources for a pool shared by multiple databases to...