The Curious Case of… the un-killable thread

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)

While I was teaching IEPTO2 last week, I was discussing why sometimes a thread cannot be terminated using the KILL command, and thought it would make a great topic for a post.

Some of you have likely seen a phenomenon called a non-yielding scheduler. This is where a thread is using the processor and doesn’t voluntarily yield after using more than the thread quantum (4 milliseconds, unchangeable). There’s a background task called the scheduler monitor that checks that progress is being made on the various schedulers inside SQL Server and issues a warning if it finds a problem. For a non-yielding scheduler, you’ll see error 17883, as below:

Process 56:0:0 (0xdee) Worker 0x041611F6 appears to be non-yielding on Scheduler 2.
Thread creation time: 13884536031127. Approx Thread CPU Used: kernel 18 ms, user 263 ms.
Process Utilization 0%. System Idle 98%. Interval: 331220558 ms.

There are a variety of reasons these can occur, including I/O subsystem issues, slow calls out to a Windows API, or SQL Server bugs.

In the SQL Server bug case, it’s because the thread entered some code where it’s able to loop without checking whether the 4ms quantum has expired (by calling the SQLOS function YieldAndCheckForAbort, or one of a few variants). If the thread doesn’t check whether the quantum has expired, it won’t know, and so won’t yield, and then you’ve got a non-yielding scheduler. The bug is that there’s a missing call to YieldAndCheckForAbort.

Your first impulse will likely be to KILL the offending SPID, but you can’t. Well, you can run the command, but it likely won’t do anything. This is because there’s no way to *force* a thread to terminate from within SQL Server. The thread has to check whether it’s been asked to terminate, and then it terminates itself (and any sub-threads it may have created if it’s doing a parallel operation). And guess where that check is done? In the YieldAndCheckForAbort function! So if the thread doesn’t ever check, it won’t know, and so won’t terminate.

And that’s why sometimes if the non-yielding scheduler doesn’t resolve itself, you end up having to restart SQL Server.

13 thoughts on “The Curious Case of… the un-killable thread”

Eitan Blumin says:

April 20, 2021 at 8:12 pm

Hi Paul,
Doesn’t the scheduler monitor does something to resolve such issues automatically?

Reply
1. Paul Randal says:
  
  April 20, 2021 at 8:17 pm
  
  No, it can’t, for exactly the reasons I described in the post. It can’t interrupt what the non-yielding thread is doing. Otherwise, there would be no need to reboot in some of these cases, would there?
  
  Reply
Ruben Garrigós says:

April 21, 2021 at 12:41 am

Hi Paul,

Are there other options besides restarting the instance? I know that killing the non-yielding OS thread is not a good practice…. but if we have a large 24×7 instance with dozens of databases, different apps depending on it, and a strict SLA… is it worth the risk? Is there any way to evaluate the impact of doing that depending on the current thread stack?

Reply
1. Paul Randal says:
  
  April 21, 2021 at 8:49 am
  
  No – you should never kill a SQL Server thread from the OS – that’s likely to end up crashing SQL Server, depending on what the thread is doing. And no, because you can’t tell what resources the offending thread is holding – e.g. a lock, which would then never be released.
  
  Reply
Pingback: Unkillable Threads – Curated SQL
Srinivas M A says:

April 27, 2021 at 8:28 pm

Under what circumstances thread doesn’t yield . I have seen preemptive cases where call to O/S is made . Any Proof of concept scenario to create non yielding thread

Reply
1. Paul Randal says:
  
  April 27, 2021 at 8:42 pm
  
  I explained some circumstances in the post…
  
  Reply
Getulio Torres says:

May 28, 2021 at 1:51 pm

Paul, how are you?

In this case, when we use Linked Server and stop the execution on Origin or do a Kill on the server that gets the data, the sessions sometimes stay for some time.

In the case of Kill, is “YieldAndCheckForAbort” triggered until the next quantum is evaluated?

Reply
1. Paul Randal says:
  
  June 5, 2021 at 6:52 pm
  
  No – when there’s a linked server involved, the thread goes preemptive, which means it will wait until it gets a response from the linked server. The local thread cannot be killed because that would leave an orphaned request on the remote server.
  
  Reply
Nasy says:

May 31, 2021 at 12:50 pm

Hi Paul,
Can this non-yielding scheduler cause CPU spike to 100% considering I/O subsystem is not a suspect?

Reply
1. Paul Randal says:
  
  June 5, 2021 at 6:54 pm
  
  It entirely depends on what the cause of the non-yielding scheduler is, and what other threads may be impacted by it, depending on what they’re doing. If you see 100% CPU but don’t see the non-yielding thread message in the error log, that’s not the cause of the 100% CPU. Plenty of much more likely things would cause that.
  
  Reply
Bhanu Desh says:

January 8, 2025 at 7:21 pm

Hi Paul,

We have the exact same scenario. We are on CU30 but still we have the same issue causing multiple dumps and resulting downtime and multiple reboots. Do you have any suggestions to fix this? TIA

Reply
1. Paul Randal says:
  
  March 8, 2025 at 2:49 pm
  
  Nope – it’s a bug – contact Product Support for help.
  
  Reply

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.

The Curious Case of… the un-killable thread

13 thoughts on “The Curious Case of… the un-killable thread”

Leave a Reply Cancel reply

Other articles

2024: the year in books

2023 (not a typo!): the year in books

Black Friday Super Sale!

Summer School Savings SALE is live!

The Curious Case of… missing log information

The Curious Case of… missing 832 errors during memory corruption

Imagine feeling confident enough to handle whatever your database throws at you.