
SQL Server 2016 Standard Edition

We have a cleanup job that runs some pretty simple SQL:

use productionDB
go

delete from transactionaltable where createdat < DATEADD(day, -21, GETDATE())
go

This runs every weeknight, and has been great... until last night, when an (apparent) lock escalation or a conflicting job caused it to hang, locking the table and causing all kinds of mess.

I thought to myself: "Hey, there must be some magic I can put on the job or on the step, a property like 'kill if running for more than ten minutes'." But I can find no such thing.

Is the only technique to:

  • Fire the real job at 0100
  • Fire a kill job at 0110 // a safety that rarely gets a hit

The kill job to be built with this technique:

USE msdb ;
GO
EXEC dbo.sp_stop_job N'Your Job Name' ;
GO

Or is there a better/cleaner way to do this?

SOLUTION

Stop using that crude WHERE statement and start deleting rows in small batches, using a method like that described here.
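A minimal sketch of that batched approach, using the table and predicate from the question (the batch size of 4000 is an illustrative choice, kept under the ~5,000-lock threshold at which SQL Server considers escalating to a table lock):

```sql
-- Delete in small batches so locks stay at row/page level
-- and each batch commits (and can be rolled back) quickly.
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (4000)
    FROM transactionaltable
    WHERE createdat < DATEADD(day, -21, GETDATE());

    SET @rows = @@ROWCOUNT;
END
```

Each iteration holds few enough locks that escalation is avoided, and if the job is stopped mid-run, only the current batch rolls back.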

Jonesome Reinstate Monica

  • Was it because it qualified a large number of records? Or do you know exactly how many rows will be qualified each night? I suggest you delete in batches to keep the number of locks low so it does not escalate to a table lock. – SqlWorldWide Aug 29 '17 at 22:22
  • Deleting in batches can also be aborted. If you try to abort a delete of a large number of rows it will take a long time to roll back, and you'll be back where you started. – David Browne - Microsoft Aug 30 '17 at 01:59

4 Answers


I have been using the following stored proc on a SQL agent job:

    -- =============================================
    -- Author:        Devin Knight and Jorge Segarra
    -- Create date: 7/6/2012
    -- Description:    Monitors currently running SQL Agent jobs and 
    -- alerts admins if runtime passes set threshold
    --          
    -- =============================================

/*
Change log:
 =============================================
7/11/2012 (v 1.01)  
        Changed Method for capturing currently running jobs to use master.dbo.xp_sqlagent_enum_jobs 1, ''

7/12/2012 (v 1.03)
        Updated code to deal with “phantom” jobs that weren’t really running. 
        Improved logic to handle this. Beware, uses undocumented stored procedure xp_sqlagent_enum_jobs


7/24/2012 (v. 1.16)
        Removed need to specify mail profile
        Fix for error sending notify email
        Added commented line for testing purposes (avg+1 minute for short tests)
 =============================================
 exec usp_LongRunningJobs
*/
ALTER PROCEDURE [dbo].[usp_LongRunningJobs] 
                @jobname nvarchar(100)
AS 

--Set limit in minutes (applies to all jobs)
--NOTE: Percentage limit is applied to all jobs where average runtime greater than 5 minutes
--else the time limit is simply average + 10 minutes
    DECLARE @JobLimitPercentage FLOAT

    SET @JobLimitPercentage = 150 --Use whole percentages greater than 100

    -- Create intermediate work tables for currently running jobs
    DECLARE @currently_running_jobs TABLE
        (
          job_id UNIQUEIDENTIFIER NOT NULL ,
          last_run_date INT NOT NULL ,
          last_run_time INT NOT NULL ,
          next_run_date INT NOT NULL ,
          next_run_time INT NOT NULL ,
          next_run_schedule_id INT NOT NULL ,
          requested_to_run INT NOT NULL ,-- BOOL
          request_source INT NOT NULL ,
          request_source_id SYSNAME COLLATE database_default
                                    NULL ,
          running INT NOT NULL ,-- BOOL
          current_step INT NOT NULL ,
          current_retry_attempt INT NOT NULL ,
          job_state INT NOT NULL
        ) -- 0 = Not idle or suspended, 1 = Executing, 2 = Waiting For Thread, 3 = Between Retries, 4 = Idle, 5 = Suspended, [6 = WaitingForStepToFinish], 7 = PerformingCompletionActions

--Capture Jobs currently working
    INSERT  INTO @currently_running_jobs
            EXECUTE master.dbo.xp_sqlagent_enum_jobs 1, ''

--Temp table exists check
    IF OBJECT_ID('tempdb..##LRJobsStage') IS NOT NULL
        DROP TABLE ##LRJobsStage;

    CREATE TABLE ##LRJobsStage
        (
          [JobID] [UNIQUEIDENTIFIER] NOT NULL ,
          [JobName] [sysname] NOT NULL ,
          [StartExecutionDate] [DATETIME] NOT NULL ,
          [AvgDurationMin] [INT] NULL ,
          [DurationLimit] [INT] NULL ,
          [CurrentDuration] [INT] NULL
        )

    INSERT  INTO ##LRJobsStage
            ( JobID ,
              JobName ,
              StartExecutionDate ,
              AvgDurationMin ,
              DurationLimit ,
              CurrentDuration
            )
            SELECT  jobs.Job_ID AS JobID ,
                    jobs.NAME AS JobName ,
                    act.start_execution_date AS StartExecutionDate ,
                    AVG(FLOOR(run_duration / 100)) AS AvgDurationMin ,
                    CASE 
        --If job average less than 5 minutes then limit is avg+10 minutes
                         WHEN AVG(FLOOR(run_duration / 100)) <= 5
                         THEN ( AVG(FLOOR(run_duration / 100)) ) + 10
        --If job average greater than 5 minutes then limit is avg*limit percentage
                         ELSE ( AVG(FLOOR(run_duration / 100))
                                * ( @JobLimitPercentage / 100 ) )
                    END AS DurationLimit ,
                    DATEDIFF(MI, act.start_execution_date, GETDATE()) AS [CurrentDuration]
            FROM    @currently_running_jobs crj
                    INNER JOIN msdb..sysjobs AS jobs ON crj.job_id = jobs.job_id
                    INNER JOIN msdb..sysjobactivity AS act ON act.job_id = crj.job_id
                                                              AND act.stop_execution_date IS NULL
                                                              AND act.start_execution_date IS NOT NULL
                    INNER JOIN msdb..sysjobhistory AS hist ON hist.job_id = crj.job_id
                                                              AND hist.step_id = 0
            WHERE   crj.job_state = 1
            and jobs.NAME=@jobname
            GROUP BY jobs.job_ID ,
                    jobs.NAME ,
                    act.start_execution_date ,
                    DATEDIFF(MI, act.start_execution_date, GETDATE())
            HAVING  CASE WHEN AVG(FLOOR(run_duration / 100)) <= 5
                              THEN (AVG(FLOOR(run_duration / 100))) + 10
                            --THEN ( AVG(FLOOR(run_duration / 100)) ) + 1  --Uncomment/Use for testing purposes only
                         ELSE ( AVG(FLOOR(run_duration / 100))
                                * ( @JobLimitPercentage / 100 ) )
                    END < DATEDIFF(MI, act.start_execution_date, GETDATE())


--Checks to see if a long running job has already been identified so you are not alerted multiple times
    IF EXISTS ( SELECT  RJ.*
                FROM    ##LRJobsStage RJ
                WHERE   CHECKSUM(RJ.JobID, RJ.StartExecutionDate) NOT IN (
                        SELECT  CHECKSUM(JobID, StartExecutionDate)
                        FROM    dbo.LongRunningJobs ) )
     BEGIN
--Send email with results of long-running jobs

    --Set Email Recipients
        DECLARE @MailRecipients VARCHAR(200)

    SET @MailRecipients = 'developer@adventureworks.com' --Uncomment/Use for testing purposes only

    EXEC msdb.dbo.sp_send_dbmail --@profile_name = @MailProfile
        @recipients = @MailRecipients,
        @query = 'USE DB_Maintenance; Select RJ.*
                From ##LRJobsStage RJ
                WHERE CHECKSUM(RJ.JobID,RJ.StartExecutionDate) 
                NOT IN (Select CHECKSUM(JobID,StartExecutionDate) From dbo.LongRunningJobs) ',
        @body = 'View attachment to view long running jobs',
        @subject = 'Long Running SQL Agent Job Alert',
        @attach_query_result_as_file = 1;

--Populate LongRunningJobs table with jobs exceeding established limits
    INSERT  INTO DB_Maintenance.[dbo].[LongRunningJobs]
            ( [JobID] ,
              [JobName] ,
              [StartExecutionDate] ,
              [AvgDurationMin] ,
              [DurationLimit] ,
              [CurrentDuration]
            )
            ( SELECT    RJ.*
              FROM      ##LRJobsStage RJ
              WHERE     CHECKSUM(RJ.JobID, RJ.StartExecutionDate) NOT IN (
                        SELECT  CHECKSUM(JobID, StartExecutionDate)
                        FROM    dbo.LongRunningJobs )
            )
    END

Edited to include table creation:

CREATE TABLE [dbo].[LongRunningJobs](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [JobName] [sysname] NOT NULL,
    [JobID] [uniqueidentifier] NOT NULL,
    [StartExecutionDate] [datetime] NULL,
    [AvgDurationMin] [int] NULL,
    [DurationLimit] [int] NULL,
    [CurrentDuration] [int] NULL,
    [RowInsertDate] [datetime] NOT NULL
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[LongRunningJobs] ADD  CONSTRAINT [DF_LongRunningJobs_Date]  DEFAULT (getdate()) FOR [RowInsertDate]
GO
jstexasdba
  • @user1201572 - Can you provide the table definition for DB_Maintenance.dbo.LongRunningJobs? This is a pretty nifty piece of code. Thanks! – SQL_Hacker Aug 30 '17 at 13:32
  • @jstexasdba, love this solution. I added the following just below the Temp table exists check, as the check itself seemed to be missing. IF OBJECT_ID('tempdb..##LRJobsStage') IS NOT NULL DROP TABLE ##LRJobsStage; – CatPhish Mar 01 '21 at 14:27
  • @CatPhish it's a Temp Table, so it should only exist for the duration of the SP execution. The only thing that will keep the Temp Table around is if you've executed it manually in SSMS, which maintains the connection until you close the window. – Optimaximal Nov 09 '21 at 09:57

A bit of a hack, but you could run your script through sqlcmd in a CmdExec step and set the query timeout on the sqlcmd call. (Untested, but I would think it would work if no other option is presented.)
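For example, the CmdExec step's command line might look something like this (server and database names are placeholders; `-t` is sqlcmd's query timeout in seconds, and `-b` makes sqlcmd exit with an error code on failure so the job step reports it):

```shell
sqlcmd -S MyServer -d productionDB -b -t 600 -Q "DELETE FROM transactionaltable WHERE createdat < DATEADD(day, -21, GETDATE())"
```

When the timeout fires, sqlcmd cancels the query and the delete rolls back, so be aware that a large rollback can itself take a while.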


A couple of things.

  • You are using the term "deadlock", but this is most likely just blocking since deadlocks are auto resolved in SQL Server.

  • The built-in method to put a time cap on waiting for locks is the "SET LOCK_TIMEOUT" statement. See https://learn.microsoft.com/en-us/sql/t-sql/statements/set-lock-timeout-transact-sql?view=sql-server-ver15

    For example, let's say you want to wait no more than one minute:

    SET LOCK_TIMEOUT  60000
    

    DELETE FROM transactionaltable WHERE createdat < DATEADD(day, -21, GETDATE())

  • Seems like you are trying to patch a problem that maybe should be addressed differently. Have you considered using partition switching and truncation to delete these records? (Table partitioning became available in SQL Server Standard Edition in v2016-SP1.) That should avoid the original issue completely.
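A rough sketch of what that could look like, assuming the table has been partitioned on `createdat` and that `transactionaltable_switch` is a staging table with an identical schema on the same filegroup (all names here are illustrative, not from the question):

```sql
-- Switch the oldest partition out to an empty staging table.
-- This is a metadata-only operation, so it is near-instant and
-- avoids the lock escalation that a large DELETE can trigger.
ALTER TABLE transactionaltable
    SWITCH PARTITION 1 TO transactionaltable_switch;

TRUNCATE TABLE transactionaltable_switch;

-- On SQL Server 2016+, you can instead truncate the partition directly:
TRUNCATE TABLE transactionaltable WITH (PARTITIONS (1));
```

The trade-off is the upfront work of designing the partition function/scheme and maintaining a sliding window of partitions.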

Jonesome Reinstate Monica
  • "partition switching and truncation" Isn't that restricted to Enterprise Edition? (If so, it is not a relevant answer... and you should also make that clear in your response...) – Jonesome Reinstate Monica Mar 01 '21 at 18:06
  • Thanks for the observation. They are supported in 2016 SP1. From the Documentation: "Prior to SQL Server 2016 (13.x) SP1, partitioned tables and indexes were not available in every edition of SQL Server." For a list of features that are supported by the editions of SQL Server, see https://learn.microsoft.com/en-us/sql/sql-server/editions-and-components-of-sql-server-2016?view=sql-server-ver15 – Fernando Sibaja Mar 01 '21 at 18:27
  • Fair call. I enhanced your response to make this clear. Note that table partitioning is non-trivial to implement. So while your proposal is workable in principle, migrating to partitioned tables is non-trivial, and that makes it not really a solution but rather an architectural approach. Deleting records in small batches is an actual solution, as it prevents lock escalation, and can be dropped into current architectures and workflows... – Jonesome Reinstate Monica Mar 01 '21 at 20:12

Yes, let's try it out.

This is a query that will run for 6 minutes:

DECLARE @i INT = 1;
WHILE (@i <= 360)                       
BEGIN
    WAITFOR DELAY '00:00:01'            
    print FORMAT(GETDATE(),'hh:mm:ss')
    SET  @i = @i + 1;
END 

Now create an Agent Job that will run every 10 seconds and that will intercept the previous query; this is the step:

-- Put here a part of the code you are targeting or even the whole query
DECLARE @Search_for_query NVARCHAR(300)
SET @Search_for_query = '%FORMAT(GETDATE(),''hh:mm:ss'')%'

-- Define the maximum time you want the query to run
DECLARE @Time_to_run_in_minutes INT
SET @Time_to_run_in_minutes = 1

DECLARE @SPID_older_than SMALLINT
SET @SPID_older_than =
    ( SELECT session_id
      FROM sys.dm_exec_requests
      CROSS APPLY sys.dm_exec_sql_text(sql_handle)
      WHERE text LIKE @Search_for_query
        AND text NOT LIKE '%sys.dm_exec_sql_text(sql_handle)%' -- This will avoid the killing job killing itself
        AND start_time < DATEADD(MINUTE, -@Time_to_run_in_minutes, GETDATE())
    )

-- SELECT @SPID_older_than -- Use this for testing

DECLARE @SQL NVARCHAR(1000)
SET @SQL = 'KILL ' + CAST(@SPID_older_than AS VARCHAR(20))
EXEC (@SQL)

Make sure the job is run by sa or some valid alternative.

Now you can adapt it to your code by changing:

  • @Search_for_query = put here a part of the query you are looking for
  • @Time_to_run_in_minutes = the max number of minutes you want the job to run
Francesco Mantovani
  • Neat! There is a uniqueness problem, though... e.g. making sure that the query string only gets a hit on the desired SQL. I don't think I can search for a comment or unused variable (that I would set to a unique value)... – Jonesome Reinstate Monica Mar 31 '21 at 18:48
  • @JonesomeReinstateMonica, if you put the whole query you are looking for in @Search_for_query, it will probably catch only that. But I see the point: what if we have 2 identical queries? I will think about it and update my reply if I find a solution. – Francesco Mantovani Mar 31 '21 at 20:08