Availability Groups – Making SQL Agent Jobs only run on Primary Replica

September 3, 2018

I recently encountered some strange behaviour with SQLAgent Jobs in an Always On Availability Group.

There’s a major difference between SQL Server Failover Cluster Instances (FCI) and SQL Server Availability Groups that makes it a little bit more complicated with SQLAgent Jobs.

Put simply, with a Failover Cluster, there is only one “SQL Server” involved. It just moves around Servers within the Cluster. The SQL Server Services are installed on each Node in the Failover Cluster. At any one time, only one of the Servers will be running the SQL Server and SQL Server Agent Services, using a shared set of Data.
A failure of that Node will cause another Node to start up, become the Primary SQL Server and take control of the shared set of Data. This involves a short break in Service (which comes as a surprise to many non-DBAs) which needs to be handled by the Application, and following the failover, data is available through the Cluster SQL Server name pointing to the Node that’s just become the Primary.

Always On Availability Groups are different. While they utilise Windows Clustering, SQL Server is constantly running on each of the Replicas and data is kept up-to-date on each of the Replicas either Syncronously or Asynchronously. Once again, a “Server Agnostic” Connection address is used to connect to the Always On Availability Group – this time called the “Availability Group Listener”. In the event of one of the Replicas failing, one of the other Replicas will take over (again after a short break in service).
One big difference, however, is that as the Availability Group is made up of two or more independent SQL Servers, all with their own Server Settings, Security and SQLAgent Jobs.

Cue my problem – I need to ensure that SQLAgent Jobs continue to run when my Availability Group has failed over to the secondary Replica. If I schedule them to run on all Replicas, that could spell trouble. I considered using a Startup Job to enable them on a Replica once it became Primary – trouble is, it’s always running.

I thought I’d check to see how Microsoft do it. There’s an automated job that should only run on the Primary Replica called “SSIS Failover Monitor Job”… only it doesn’t work.

SQL Server cannot parse the script, therefore it fails each time it tries to run.

I now use the following SQL Script as the first Step in any Scheduled Task that needs to run on the Primary Replica.
The script uses the DMV [dm_hadr_availability_replica_states] to check the Role that the Replica is performing. If it is the Primary replica then it will proceed to the next step in the SQLAgent Job – i.e. Load data, run Report etc. If not, a failure is “thrown”. By setting the SQL Agent Job Step to upon failure, the Job will quietly stop running on the Secondary Replica. The Script references the DatabaseName – replace [DatabaseName] with the name of a Database in your Availability Group.

DECLARE @role VARCHAR(50)

SELECT @role = [role_desc] FROM [sys].[dm_hadr_availability_replica_states] hars 
INNER JOIN [sys].[availability_databases_cluster] adc ON hars.[group_id] = adc.[group_id] WHERE hars.[is_local] = 1 AND adc.[database_name] =[DatabaseName]

IF @role = 'PRIMARY'
BEGIN
  PRINT 'Primary - Job can proceed'
END
ELSE 
BEGIN
  -- Deliberately cause a Failure
 SELECT 1/0
 END

Hope this helps!

Leave a Reply Cancel reply