The Curious Case of… the un-killable thread

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)

While I was teaching IEPTO2 last week, I was discussing why sometimes a thread cannot be terminated using the KILL command, and thought it would make a great topic for a post.

Some of you have likely seen a phenomenon called a non-yielding scheduler. This is where a thread is using the processor and doesn’t voluntarily yield after using more than the thread quantum (4 milliseconds, unchangeable). There’s a background task called the scheduler monitor that checks that progress is being made on the various schedulers inside SQL Server and issues a warning if it finds a problem. For a non-yielding scheduler, you’ll see error 17883, as below:

Process 56:0:0 (0xdee) Worker 0x041611F6 appears to be non-yielding on Scheduler 2.
Thread creation time: 13884536031127. Approx Thread CPU Used: kernel 18 ms, user 263 ms.
Process Utilization 0%. System Idle 98%. Interval: 331220558 ms.

There are a variety of reasons these can occur, including I/O subsystem issues, slow calls out to a Windows API, or SQL Server bugs.

In the SQL Server bug case, it’s because the thread entered some code where it’s able to loop without checking whether the 4ms quantum has expired (by calling the SQLOS function YieldAndCheckForAbort, or one of a few variants). If the thread doesn’t check whether the quantum has expired, it won’t know, and so won’t yield, and then you’ve got a non-yielding scheduler. The bug is that there’s a missing call to YieldAndCheckForAbort.

Your first impulse will likely be to KILL the offending SPID, but you can’t. Well, you can run the command, but it likely won’t do anything. This is because there’s no way to *force* a thread to terminate from within SQL Server. The thread has to check whether it’s been asked to terminate, and then it terminates itself (and any sub-threads it may have created if it’s doing a parallel operation). And guess where that check is done? In the YieldAndCheckForAbort function! So if the thread doesn’t ever check, it won’t know, and so won’t terminate.

And that’s why sometimes if the non-yielding scheduler doesn’t resolve itself, you end up having to restart SQL Server.

11 thoughts on “The Curious Case of… the un-killable thread

    1. No, it can’t, for exactly the reasons I described in the post. It can’t interrupt what the non-yielding thread is doing. Otherwise, there would be no need to reboot in some of these cases, would there?

  1. Hi Paul,

    Are there other options besides restarting the instance? I know that killing the non-yielding OS thread is not a good practice…. but if we have a large 24×7 instance with dozens of databases, different apps depending on it, and a strict SLA… is it worth the risk? Is there any way to evaluate the impact of doing that depending on the current thread stack?

    1. No – you should never kill a SQL Server thread from the OS – that’s likely to end up crashing SQL Server, depending on what the thread is doing. And no, because you can’t tell what resources the offending thread is holding – e.g. a lock, which would then never be released.

  2. Under what circumstances thread doesn’t yield . I have seen preemptive cases where call to O/S is made . Any Proof of concept scenario to create non yielding thread

  3. Paul, how are you?

    In this case, when we use Linked Server and stop the execution on Origin or do a Kill on the server that gets the data, the sessions sometimes stay for some time.

    In the case of Kill, is “YieldAndCheckForAbort” triggered until the next quantum is evaluated?

    1. No – when there’s a linked server involved, the thread goes preemptive, which means it will wait until it gets a response from the linked server. The local thread cannot be killed because that would leave an orphaned request on the remote server.

  4. Hi Paul,
    Can this non-yielding scheduler cause CPU spike to 100% considering I/O subsystem is not a suspect?

    1. It entirely depends on what the cause of the non-yielding scheduler is, and what other threads may be impacted by it, depending on what they’re doing. If you see 100% CPU but don’t see the non-yielding thread message in the error log, that’s not the cause of the 100% CPU. Plenty of much more likely things would cause that.

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Some thoughts on courage

(This is also the Ponderings – editorial – in today’s SQLskills newsletter.) I want to start out this post by sincerely thanking everyone who emailed

Explore

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.