I've blogged about ghost records and the ghost cleanup task a couple of times before (the only place it is really explained AFAIK), but one of my fellow MVPs was asking me some questions about it today for a customer of theirs and couldn't find the trace flag to turn it off.
My previous blog posts about this are:
These explain what ghost records are and how the ghost cleanup process works.
On big systems it's possible for the ghost cleanup process to fall behind the rest of the system, with no hope of it catching up. It's a single-threaded task – so imagine a 16-way box with lots of deletes occurring, and a single CPU spending a few seconds every 5 seconds trying to remove all the ghosted records resulting from the deletes of all the other CPUs. It's pretty obvious that the ghost cleanup process is going to lag behind.
The problem with this scenario is that the ghost cleanup process will still pop up every 5 seconds (every 10 on 2008) and start removing ghost records, potentially causing performance issues by keeping pages in the buffer pool, generating log records, and causing physical IOs. The ghost cleanup task is also one of the background processes that can cause IOs to occur on what looks like a totally quiescent system.
There is a way to turn off the ghost cleanup task, using trace flag 661, as documented in KB 920093. Be careful though!!! If you disable the ghost cleanup task, the space taken up by deleted records will *NOT* be released for reuse by SQL Server until you do something else to remove it, like rebuilding an index.
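To make the mechanics concrete, here's a minimal sketch of enabling and disabling the trace flag. These DBCC commands are the standard, documented way to manage trace flags; the `-1` makes the flag global rather than session-scoped.

```sql
-- Enable trace flag 661 globally, disabling the ghost cleanup task
-- (takes effect immediately, but is lost on instance restart)
DBCC TRACEON (661, -1);

-- Verify the flag is set
DBCC TRACESTATUS (661);

-- Re-enable ghost cleanup by turning the flag off again
DBCC TRACEOFF (661, -1);
```

To persist the setting across restarts, add `-T661` as a startup parameter via SQL Server Configuration Manager instead.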
One method people sometimes consider is to force ghost cleanup to clean everything by performing a table or index scan (thus queuing all the deleted records up for the ghost cleanup task). Although this is an alternative, it still uses the ghost cleanup task to do the work, and on a very busy system with a very large number of deletes (warning: generalization! :-) it can be much more efficient to remove the deleted-but-not-yet-reclaimed records using index reorganize or index rebuild.
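With ghost cleanup disabled, index maintenance becomes the mechanism that reclaims the deleted-record space. A sketch, using hypothetical table and index names:

```sql
-- Reorganize: runs online, compacts pages and removes ghosted records
-- (table and index names here are placeholders for illustration)
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders REORGANIZE;

-- Rebuild: rewrites the index entirely, reclaiming all deleted-record space
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders REBUILD;

-- Or rebuild every index on the table in one statement
ALTER INDEX ALL ON dbo.Orders REBUILD;
```

Reorganize is the lighter-weight option; rebuild does more work but guarantees a clean result.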
Turning on this trace flag can be useful as a performance gain on systems with very heavy delete workloads, but only as long as you're careful about it. It's not something that's generally recommended but it may be useful to you.
30 thoughts on “Turning off the ghost cleanup task for a performance gain”
Thanks for the info, Paul. This would have been REALLY useful for a project I did two years ago. Oh well, better late than never!
Instead of disabling the ghost cleanup with a trace flag, one should be able to force the ghost cleanup with an index hint and be done with it, instead of it waking up automatically every few seconds – assuming generating the log records is OK. Would you or would you NOT recommend this method? Is generating too much log the only reason to turn off this feature?
What was the scenario that would have benefitted from turning off the Ghost cleanup if you can recall? I am trying to convince myself that there is a valid use case for turning off this feature.
You could do that, but it still forces the ghost cleanup task to get rid of the deleted records. For a very busy system where such a method is necessary, it can be way more efficient to simply turn off ghost cleanup and allow index reorg or index rebuild to remove the deleted-but-unreclaimed space. Thanks
Good idea, thanks for the tip. Is there an easy way to tell what % of resources the ghost rec cleanup is taking up on the system?
Hey Scott – not that I know of.
The easiest way is, of course, to disable the ghost rec cleanup while measuring the resource usage of the system before and after. Will using dbcc traceon(661, -1) disable it on the live system such that you could see the difference immediately, or does it require a reboot (and the flag to be set in the startup parameters)?
Would using dbcc traceon (661, -1) disable the ghost cleanup process without requiring a reboot? That way we could have some idea if it was going to help before we permanently add the traceflag to the startup parameters.
Yes, you can turn it on dynamically. Just be careful :-)
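One way to run the before/after measurement the commenters describe is to toggle the flag dynamically and watch the ghost record count in the DMV. A sketch, with a hypothetical database name; note that `ghost_record_count` is only populated in SAMPLED or DETAILED mode, and DETAILED scans every page, so this is expensive on large databases:

```sql
-- Disable ghost cleanup on the live instance (no restart needed)
DBCC TRACEON (661, -1);

-- ... run the delete-heavy workload and measure system resource usage ...

-- Watch ghost records accumulate while the task is disabled
SELECT SUM(ghost_record_count) AS total_ghost_records
FROM sys.dm_db_index_physical_stats(
    DB_ID(N'MyDatabase'), NULL, NULL, NULL, N'DETAILED');

-- Turn ghost cleanup back on when the test is done
DBCC TRACEOFF (661, -1);
```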
I’m seeing a lot of Error: 1222, Severity: 16, State: 18 when doing tracing in our system. We currently have ~40 customer databases on this physical server, each with the same schema but different (their own) data. I see 1222 pop up randomly, not always after the same stored procedure call, so I’m not even sure that what I’m seeing is the ghost cleanup process running, or if it is normal in the system.

How do I determine whether it is indeed the ghost cleanup or not? Is there anything else it could be with the 1222, 16, 18 (e.g. I’ve read maybe very long running queries are blocking/locking)? And finally, if it is the ghost cleanup, what impact, if any, is there to the users in the app?

I was able to produce my own 1222 exceptions while testing in our app and could see the errors in the trace file go flying by, but I did not get an exception in the app layer or notice any lag or delay in the UI. Yet I have customers claiming they do get these freezes/lags. I’m just trying to isolate the issue and reproduce so we can then fix/tune accordingly. Any insight?
Why do you think the lock timeouts are from the ghost cleanup task? One way to prove yes/no is to enable trace flag 661 using dbcc traceon (661, -1) to disable ghost cleanup. If you still see the 1222s, nothing to do with ghost cleanup.
As I said, I’m not even sure it is Ghost Cleanup but I’ll have to work with our DBA to see if he is willing to enable trace flag 661 to disable ghost cleanup. Assuming I still see 1222s, in your experience, what else typically causes them and why don’t they seem to generate (from some brief testing) any impact to the UI if they are timeout related? I guess we might have to start trying to catch them specifically in the UI code to help learn more? Any suggestions as to help determine what would be causing them?
Unless you’ve changed the default lock timeout (infinite) what you’re seeing must be a background task.
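One way to check whether a background task is involved (an illustration I'm adding, not from the original exchange) is to look at the non-user sessions currently executing. The ghost cleanup task, when active, shows up in `sys.dm_exec_requests` with `command = 'GHOST CLEANUP'`:

```sql
-- List requests from background (system) sessions;
-- user sessions have is_user_process = 1
SELECT r.session_id, r.command, r.status, r.wait_type
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
WHERE s.is_user_process = 0;
```

If the SPID from the trace matches one of these sessions while it reports `GHOST CLEANUP`, that's strong evidence for the ghost cleanup task being the lock holder.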
I am seeing timeouts randomly, and a trace leads me to a SPID that maps to the TASK MANAGER process.
Is ghost cleanup the only task that task manager handles? If the trace flag is turned on, how will it affect performance if index rebuilds are done once a week?
Nope – there are a bunch of background tasks. Turning on the trace flag will not directly affect index rebuilds.
Paul, may I know why heap tables don’t support ghost records?
Take a look at my Ghost Cleanup In Depth post – I explain in the comments there. Thanks
You did not mention the version of SQL you were running. Is it SQL2012? We too are seeing many 1222 messages in a profiler trace on SQL2012 SP1 systems. I have an open incident with Microsoft and they figured it was the ghost cleanup along with lock escalation. These messages do not show on our SQL2008R2 servers, so is the process more aggressive in SQL2012 than previous versions?
Microsoft have given me new information that could explain why we are seeing this in SQL2012. They have made some optimizations to the ghost cleanup, making it more aggressive and therefore producing more messages in profiler.
Your comment is tantalizing and leaves me aching for more detail. I have a SQL 2012 Dev edition instance which is running a test/evaluation setup of my normal production (2008R2).
This consists of a dozen dbs but only 3 are significant. There are two in the 200+GB range and one larger than 1.5TB. The largest DB and one of the others have a very high transaction rate, amounting to several million rows per day inserted and deleted.
On SQL 2012 I am experiencing frequent (multiple times per day) and significant I/O slow downs including many of the symptoms mentioned above.
The index physical stats dmv is not particularly helpful because it requires so long to run (1.7B rows in Clustered Index). I thought I found the answer with the index operational stats (column leaf_ghost_count) but after further review I don’t see how the numbers in that column bear any resemblance to the actual Ghost record count.
I am planning to install SP1 and CU9 based on other discussions I have followed.
Seems that for something that has been an issue for so long (since SQL 2000), MS would at least have a better or more publicized method for analyzing the situation.
I came across an article on MSDN mentioning that a service restart also reclaims all ghost records. But how? Is it forcing this task, or does some other mechanism exist?
No it doesn’t – a service restart will re-enable the ghost cleanup task, unless you have the trace flag set as a startup trace flag. There is nothing in SQL Server that will remove ghost records in the background except the ghost cleanup task (my dev team used to own that code).
I’m implementing automation for our CI/CD process. A nightly restore of production > delete/truncate major tables > shrink > backup (it’s to allow developers to have a minimal DB on their laptops).
The dbcc shrinkfile, run immediately after the deletes and truncates, doesn’t hit its target size (for either the log or data files).
If I repeat the shrinkfile statements in the script, with no other work or delay between them, it hits its target perfectly.
Could this be ghost cleanup related? Is there a reliable way of determining when ghost cleanup has exhausted its work on a given DB? I’ve tried probing sys.dm_db_index_physical_stats, but that is never quite reaching zero.
SELECT SUM(ghost_record_count) FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED')
Yes – it’s ghost cleanup from the deletes, not the truncates. It’s running separately from the shrink, triggered because the shrink saw the pages with ghost records on them. Shrink doesn’t remove ghost records itself (it can’t). The DMV is the simplest way to tell. The behavior you’re seeing is by design, so I’d stick with your current method of doing it – let shrink complete, delay a minute or two, shrink again.
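The "shrink, wait, shrink again" workflow could be made deterministic by polling the DMV until the ghost count drains. A sketch, with a hypothetical file name and a small threshold rather than zero, since the commenter above observed the count never quite reaching zero:

```sql
-- Wait until ghost cleanup has (mostly) drained the database, then shrink.
-- DETAILED scans every page, so only do this on the small post-delete
-- CI database, not production.
DECLARE @ghosts bigint = 1;
WHILE @ghosts > 100  -- small threshold; count may never hit exactly zero
BEGIN
    SELECT @ghosts = SUM(ghost_record_count)
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, N'DETAILED');

    IF @ghosts > 100
        WAITFOR DELAY '00:00:10';  -- ghost cleanup wakes every 10s on 2008+
END;

DBCC SHRINKFILE (N'MyDB_Data', 1024);  -- hypothetical file name / target MB
```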
Thanks Paul – good to get confirmation of some of my assumptions. Since posting, I’ve also discovered dbcc forceghostcleanup, which seems to be great so far, for our needs. Completion time is fast (<10s). And not running on a production host, no other workload or connections, so I'm seeing it as low risk, despite its undocumented status.
I wanted to add a few cents here, since the issue of ghost records involves more than taking up space. We have a situation where we use some tables as queues. Several tables have triggers that write queue entries to these tables after inserts and updates; additional processing is then done asynchronously. These queues are being checked and processed down constantly. What I have noticed is that they can fall behind and begin to build up, sometimes into the millions of rows. This sends signal waits up, as huge amounts of CPU are being used. The ghost rows are read just like “real” records and we start seeing huge numbers of logical reads, even when the queue is “empty” or just has a few records. This spins up lots of CPU. The ghost records build up because the ghost cleanup task cannot keep up – especially since there is lots of update traffic between the insert from triggers and the async processing and deleting of the records, keeping transactions open and causing the cleanup to skip the table.
There are a number of things that can be done about this, including physically deleting the records using “with pagelock”… but I just wanted to point this out as another angle on the issue.
Indeed. Another thing you can try is schedule an Agent job every 10 seconds (say) that does a select count (*) and forces the clustered index to be used. That will force all the ghost records to be inserted into the task’s queue for cleanup next time it runs, rather than being cleaned up over time. I’ve done that before with a client suffering from similar issues on an ASP.Net session-state tracking table.
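The scheduled-scan trick above can be sketched as a single statement run from an Agent job (the table name here is a placeholder; `INDEX(1)` forces the clustered index, as the reply describes):

```sql
-- Scan the clustered index so every ghosted record is queued for the
-- ghost cleanup task's next pass. Hypothetical table name for illustration.
SELECT COUNT(*)
FROM dbo.ASPStateTempSessions WITH (INDEX(1));
```

Scheduling this every 10 seconds or so keeps the cleanup task's queue primed, so ghost records are removed shortly after deletion instead of accumulating.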
Does truncated (not deleted) data get cleaned up by the ghost cleanup thread? Or is it just deleted data that gets cleaned up by the ghost cleanup thread?
Only records that are specifically marked as deleted through a DELETE operation.
You are great!!!
It seems to me that dbcc traceon(661, -1) is not working on SQL 2019.
Even if I change the compatibility level.
Is it possible?
It’s working on my 2019 instance. Compatibility level is irrelevant. How are you testing it?