Inside the Storage Engine: Ghost cleanup in depth

During the years I was on the Storage Engine team I saw a lot of concern on the various forums about the ghost cleanup task. There have been a few bugs with it in previous versions (see KB articles 932115 and 815594) and there's very little information available about it. For some reason I didn't get around to posting about it on my old blog, but today I want to go into some depth on it.

So what is ghost cleanup? It's a background process that cleans up ghost records – usually referred to as the ghost cleanup task. What's a ghost record? As I described briefly in the Anatomy of a record post last week, a ghost record is one that's just been deleted in an index on a table (well, actually it gets more complicated if some form of snapshot isolation is enabled, but for now, a record in an index is a good start). Such a delete operation never physically removes records from pages – it only marks them as having been deleted, or ghosted. This is a performance optimization that allows delete operations to complete more quickly. It also allows delete operations to roll back more quickly, because all that needs to happen is to unmark the records as deleted/ghosted, instead of having to reinsert the deleted records. The deleted record will be physically removed (well, its slot will be removed – the record data isn't actually overwritten) later by the background ghost cleanup task. The ghost cleanup task will leave a single record on the page to avoid having to deallocate empty data or index pages.

The ghost cleanup task can't physically delete the ghost records until after the delete transaction commits because the deleted records are locked and the locks aren't released until the transaction commits. As an aside, when ghost records exist on a page, even a NOLOCK or READ UNCOMMITTED scan won't return them because they are marked as ghost records.

When a record is deleted, apart from it being marked as a ghost record, the page that the record is on is also marked as having ghost records in one of the allocation maps – the PFS page (post coming soon!) – and in its page header. Marking a page as having ghost records in a PFS page also changes the database state to indicate that there are some ghost records to clean up – somewhere. Nothing tells the ghost cleanup task to clean the specific page that the delete happened on – yet. That only happens when the next scan operation reads the page and notices that the page has ghost records.
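You can see this PFS marking for yourself. The first PFS page is always page 1 of a data file (PFS pages then repeat at intervals through the file), and a DBCC PAGE dump of it in display mode 3 lists the allocation status byte for each page it covers, including a notation for pages that contain ghost records. A quick sketch (DBCC PAGE is undocumented and unsupported, so only do this on a test database – the database name here is an assumption, and the exact output wording may vary by version):

```sql
-- Send DBCC output to the console instead of the error log.
DBCC TRACEON (3604);
GO
-- Dump the first PFS page: file 1, page 1, display mode 3.
DBCC PAGE ('ghostrecordtest', 1, 1, 3);
GO
-- In the per-page section of the output, look for a 'Has Ghost'
-- notation against the page the delete touched, something like:
-- (1:143) - = ALLOCATED 100_PCT_FULL Has Ghost
```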

The ghost cleanup task doesn't just start up when it's told to – it starts up in the background every 5 seconds and looks for ghost records to clean up. Remember that it won't be told to go clean up a specific page by a delete operation – it's a subsequent scan that does that, if a scan happens. When the ghost cleanup task starts up it checks to see if it's been told to clean up a page – if so, it goes and does it. If not, it picks the next database that is marked as having some ghost records and looks through the PFS allocation map pages to see if there are any ghost records to clean up. It will check through or clean up a limited number of pages each time it wakes up – I remember the limit is 10 pages – to ensure it doesn't swamp the system. So – the ghost records will eventually be removed, either by the ghost cleanup task processing a database for ghost records or by it specifically being told to remove them from a page. If it processes a database and doesn't find any ghost records, it marks the database as not having any ghost records so it will be skipped next time.
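If you just want to see how many ghost records are waiting to be cleaned up in a particular index, the sys.dm_db_index_physical_stats DMV in SQL Server 2005 exposes ghost counts directly. A hedged sketch (the database and table names are assumptions – the table matches the demo table used later in this post – and note that the ghost count columns are NULL in LIMITED mode, so ask for DETAILED):

```sql
USE ghostrecordtest;  -- assumed test database
GO
-- Per-index ghost record counts. The two ghost columns are NULL
-- when the mode parameter is LIMITED, so use DETAILED here.
SELECT
    index_id,
    index_level,
    ghost_record_count,           -- ghosts awaiting the cleanup task
    version_ghost_record_count    -- ghosts retained for snapshot isolation
FROM sys.dm_db_index_physical_stats (
    DB_ID (),          -- this database
    OBJECT_ID ('t1'),  -- the demo table
    NULL,              -- all indexes
    NULL,              -- all partitions
    'DETAILED');
GO
```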

How can you tell it's running? On SQL Server 2005, you can use the following code to see the ghost cleanup task in sys.dm_exec_requests:

SELECT * INTO myexecrequests FROM sys.dm_exec_requests WHERE 1 = 0;
GO

SET NOCOUNT ON;
GO

DECLARE @a INT;
SELECT @a = 0;
WHILE (@a < 1)
BEGIN
    INSERT INTO myexecrequests SELECT * FROM sys.dm_exec_requests WHERE command LIKE '%ghost%';
    SELECT @a = COUNT (*) FROM myexecrequests;
END;
GO

SELECT * FROM myexecrequests;
GO

And on SQL Server 2000 you need to use sysprocesses (well, this works on SQL Server 2005 as well, but there it's a fake view derived from the DMVs):

SELECT * INTO mysysprocesses FROM master.dbo.sysprocesses WHERE 1 = 0;
GO

SET NOCOUNT ON;
GO

DECLARE @a INT;
SELECT @a = 0;
WHILE (@a < 1)
BEGIN
    INSERT INTO mysysprocesses SELECT * FROM master.dbo.sysprocesses WHERE cmd LIKE '%ghost%';
    SELECT @a = COUNT (*) FROM mysysprocesses;
END;
GO

SELECT * FROM mysysprocesses;
GO

The output from sys.dm_exec_requests is (with most unused and uninteresting columns stripped off):

session_id request_id  start_time              status       command
———- ———– ———————– ———— —————-
15         0           2007-10-05 16:34:49.653 background   GHOST CLEANUP

So how can you tell if a record is ghosted? Let's engineer some and look at them with DBCC PAGE – I've stripped out the uninteresting bits of the output and highlighted the interesting ghost parts:

CREATE TABLE t1 (c1 CHAR (10));
CREATE CLUSTERED INDEX t1c1 ON t1 (c1);
GO

BEGIN TRAN PaulsTran;

INSERT INTO t1 VALUES ('PAUL');
INSERT INTO t1 VALUES ('KIMBERLY');
DELETE FROM t1 WHERE c1 = 'KIMBERLY';
GO

DBCC IND ('ghostrecordtest', 't1', 1);
GO

DBCC TRACEON (3604);
GO

DBCC PAGE ('ghostrecordtest', 1, 143, 3);
GO

<snip>

m_freeData = 130    m_reservedCnt = 0   m_lsn = (20:88:20)
m_xactReserved = 0  m_xdesId = (0:518)  m_ghostRecCnt = 1
m_tornBits = 0

<snip>

Slot 0 Offset 0x71 Length 17

Record Type = GHOST_DATA_RECORD      Record Attributes =  NULL_BITMAP
Memory Dump @0x6256C071

00000000:   1c000e00 4b494d42 45524c59 20200200 †….KIMBERLY  ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 10

c1 = KIMBERLY

Slot 1 Offset 0x60 Length 17

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6256C060

00000000:   10000e00 5041554c 20202020 20200200 †….PAUL      ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 10

c1 = PAUL

Let's see what goes into the transaction log during this process (remember fn_dblog is undocumented and unsupported – do this on a test database) – I've stripped off a bunch of the columns in the output:

DECLARE @a CHAR (20);
SELECT @a = [Transaction ID] FROM fn_dblog (NULL, NULL) WHERE [Transaction Name] = 'PaulsTran';
SELECT * FROM fn_dblog (NULL, NULL) WHERE [Transaction ID] = @a;
GO

Current LSN              Operation         Context             Transaction ID
———————— —————– ——————- ————–
00000014:00000054:0011   LOP_BEGIN_XACT    LCX_NULL            0000:00000206
00000014:0000005a:0012   LOP_INSERT_ROWS   LCX_CLUSTERED       0000:00000206
00000014:0000005a:0013   LOP_INSERT_ROWS   LCX_CLUSTERED       0000:00000206
00000014:0000005a:0014   LOP_DELETE_ROWS   LCX_MARK_AS_GHOST   0000:00000206
00000014:0000005a:0016   LOP_DELETE_ROWS   LCX_MARK_AS_GHOST   0000:00000206

So there are the two inserts followed by the two deletes – with the rows being marked as ghost records. But where's the update to the PFS page? Well, changing the ghost bit in a PFS page is not done as part of a transaction. We'll need to look for it another way (apart from just dumping everything in the transaction log and searching manually):

SELECT Description, * FROM fn_dblog (null, null) WHERE Context like '%PFS%' AND AllocUnitName like '%t1%';

GO

Description               Current LSN              Operation        Context   Transaction ID
————————- ———————— —————- ——— —————-
Allocated 0001:0000008f   00000014:00000054:0014   LOP_MODIFY_ROW   LCX_PFS   0000:00000208
                          00000014:0000005a:0015   LOP_SET_BITS     LCX_PFS   0000:00000000

The first one is just allocating a page, but the second one is the one we're looking for – it's changed the bit for the page to say that it has ghost records on it. Let's commit the transaction and see what happens, filtering out all the previous transaction log:

SELECT MAX ([Current LSN]) FROM fn_dblog (NULL, NULL);
GO

-- 00000014:0000005e:0001

COMMIT TRAN;
GO

SELECT [Page ID], * FROM fn_dblog (NULL, NULL) WHERE [Current LSN] > '00000014:0000005e:0001';
GO

Page ID         Current LSN              Operation          Context         Transaction ID
————— ———————— —————— ————— ————–
NULL            00000014:0000005f:0001   LOP_COMMIT_XACT    LCX_NULL        0000:00000206
0001:0000008f   00000014:00000060:0001   LOP_EXPUNGE_ROWS   LCX_CLUSTERED   0000:00000000

We see that almost as soon as the transaction has committed, the ghost cleanup task goes in and processes the page. Let's check a page dump to make sure the record is gone, and show that the contents of the record are still on the page (again, with non-relevant bits snipped out):

DBCC PAGE ('ghostrecordtest', 1, 143, 3);

GO

<snip>

m_freeData = 130         m_reservedCnt = 0        m_lsn = (20:94:1)
m_xactReserved = 0       m_xdesId = (0:518)       m_ghostRecCnt = 0
m_tornBits = 0

<snip>

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6212C060

00000000:   10000e00 5041554c 20202020 20200200 †….PAUL      ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 10

c1 = PAUL

DBCC PAGE ('ghostrecordtest', 1, 143, 2);

GO

<snip>

6212C040:   01000000 00000000 00000000 00000000 †…………….
6212C050:   00000000 00000000 00000000 00000000 †…………….
6212C060:   10000e00 5041554c 20202020 20200200 †….PAUL      ..
6212C070:   fc1c000e 004b494d 4245524c 59202002 †…..KIMBERLY  .
6212C080:   00fc0000 00000000 00000000 01000000 †…………….
6212C090:   00000000 13000000 01000000 00000000 †…………….

<snip>

So even though the record no longer exists, all that happened was that the slot was removed from the slot array at the end of the page – the record contents will remain on the page until the space is reused.

In the next post I'll go into details of the PFS and other allocation maps. Btw – please let me know if this stuff is interesting – I'd like to know where to spend blogging time. Thanks!

48 thoughts on “Inside the Storage Engine: Ghost cleanup in depth”

  1. Hi all,
    The above article is excellent – it cleared up all my doubts about the ghost cleanup process without my needing to ask any questions. It looks at the topic from the different angles needed to understand and answer the various questions raised about it.

    Thanks a lot
    Vaibhav Bapat

  2. "The ghost cleanup task will leave a single record on the page to avoid having to deallocate empty data or index pages." – this is not true in SQL Server 2005 or later versions. Unless it is the first page of a system table, a page can be deallocated by ghost cleanup.

    1. It doesn’t need them as there’s no way that a duplicate record being inserted before the deleting transaction rolls back could cause the rollback to fail a uniqueness test, or cause a page to split.

      1. Thank you Paul, and sorry for my late reply.
        But I can't follow your explanation – could you help by explaining it in more detail or giving an example?

  3. We are in a situation where we have a very busy OLTP system which is doing a lot of deletes and then inserts;
    the ghost cleanup task cannot keep up – it's always lagging behind and we are facing a lot of performance problems.

    Is there a way to solve the performance problem? We don't want to rebuild or reindex the table.

    Also, if the ghost cleanup task is executed every 10 seconds,
    what is the possibility that the previous ghost cleanup task is still running when a new one starts, overlapping each other?

    Any help is appreciated.

    Thanks.

    1. No possibility that the other one is still running as it only runs on one thread.

      No way to solve your perf issue except to manually force ghost record removal by doing something like DBCC FORCEGHOSTCLEANUP or changing your schema so the new inserts are not going into the same place in the index as the deleted records that haven’t been ‘removed’ yet by ghost cleanup.

  4. Hi Paul,

    First of all, thanks for this wonderful post that helps us do our jobs better.

    I have a question which is not exactly on this topic but is very much related to it. If I reduce a table (call it table A) from 1 million rows to, say, 200 rows, will new records inserted into it reuse the freed space, and how can other tables reuse the space released from table A?

    1. When the ghosted records are removed, if a page becomes empty it is deallocated from the table. Any deallocated pages in extents allocated to the table can then be reused by that table. If a deallocated page is the only one allocated in its extent, the whole extent is deallocated and can then be allocated to a different table to be used.

  5. Hi Paul

    Great article as always. I have a question. We're seeing an issue with ghost cleanup on our 2014 Enterprise server. It seems to be on one table where a lot of deletes are occurring. Using fn_dblog I can see millions of entries for ghost cleanup against the two indexes on this table, and the transaction log keeps growing to hundreds of gigabytes in a short period. The main issue seems to be on the clustered index (where we're seeing most of the ghost deletes), which is a wide clustered index with 9 columns and data types of bigint, bigint, nvarchar(3), datetime, nvarchar(20), dec(32,16), dec(32,16), nvarchar(20), datetime.

    There is a separate nonclustered primary key index on a bigint field.

    I’ve been working with SQL for 20 years now and this is the first time I’ve come across this issue. Could this simply be down to the clustered index being so wide?

    This is a third party table written by an outside company used for data extracts from our AX system.

    Cheers
    Dave

    1. Is it the deletes themselves or the ghost cleanup that’s causing the log growth? If the latter, it could be that the ghost cleanup is being triggered in one big batch (various reasons for this) in which case you might want to disable it with trace flag 661 (https://support.microsoft.com/en-us/help/920093/tuning-options-for-sql-server-when-running-in-high-performance-workloa) and rely on regular index maintenance to reclaim the space from the deleted records. If the former, the deletes could be done in small batches to allow the log to not grow so much. Also, make sure that nothing is preventing the log from being cleared and that log backups are being done at a frequency that prevents the log growing so large.

  6. Hi Paul:
    Thank you very much for your article.i am bothered by this problem.
    In 10 minutes the sqlserver log flush becomed higher, and write/sec was close to 30000 and write bytes/sec was 100M+/sec ,the pcie disk io is full.
    I’m sure that the server query was normal and there was no abnormal data file growthf at that moment,but the log file was doubled than usual.so i used dump_log() functions to do statistics on logs,i found lots of ‘ShrinkD’ 、’ghostcleanuptask’、 ‘BTree Split/Shrink’ in that moment.

    the results:
    select distinct [Transaction Name], count(*) from [t1].[dbo].split0426 where [Begin Time] > '2018/04/26 09:36:00' and [Begin Time] < '2018/04/26 09:55:00' group by [Transaction Name] order by count(*) desc
    ShrinkD 957691
    INSERT 161702
    BTree Split/Shrink 110077
    UPDATE 107921
    SplitPage 18663
    implicit_transaction 16592
    GhostCleanupTask 14072
    user_transaction 245
    MERGE 219
    AllocHeapPageSysXactDML 143
    DELETE 43
    AllocPages 22

    select convert(varchar(16),[Begin Time]) as min, count(*) as counts from [t1].[dbo].split0426 where [Transaction Name]='GhostCleanupTask'
    group by convert(varchar(16),[Begin Time]) order by convert(varchar(16),[Begin Time])
    2018/04/26 09:31 4
    2018/04/26 09:32 4
    2018/04/26 09:33 4
    2018/04/26 09:34 4
    2018/04/26 09:35 4
    2018/04/26 09:36 946
    2018/04/26 09:37 525
    2018/04/26 09:38 576
    2018/04/26 09:39 534
    2018/04/26 09:40 654
    2018/04/26 09:41 717
    2018/04/26 09:42 1028
    2018/04/26 09:43 539
    2018/04/26 09:44 1317
    2018/04/26 09:45 1229
    2018/04/26 09:46 1088
    2018/04/26 09:47 1837
    2018/04/26 09:48 1212
    2018/04/26 09:49 329
    2018/04/26 09:50 454
    2018/04/26 09:51 592
    2018/04/26 09:52 480
    2018/04/26 09:53 7
    2018/04/26 09:54 8
    2018/04/26 09:55 7
    select convert(varchar(16),[Begin Time]) as min, count(*) as counts from [t1].[dbo].split0426 where [Transaction Name]='ShrinkD'
    group by convert(varchar(16),[Begin Time]) order by convert(varchar(16),[Begin Time])
    2018/04/26 09:36 75247
    2018/04/26 09:37 22963
    2018/04/26 09:38 20069
    2018/04/26 09:39 17273
    2018/04/26 09:40 37594
    2018/04/26 09:41 29922
    2018/04/26 09:42 33271
    2018/04/26 09:43 17801
    2018/04/26 09:44 34105
    2018/04/26 09:45 13176
    2018/04/26 09:46 173689
    2018/04/26 09:47 113353
    2018/04/26 09:48 241987
    2018/04/26 09:49 5350
    2018/04/26 09:50 5058
    2018/04/26 09:51 55132
    2018/04/26 09:52 61694
    2018/04/26 09:53 3
    2018/04/26 09:54 4
    2018/04/26 09:55 3
    2018/04/26 09:56 5

    some configurations:
    Windows Server 2008 SP2
    SQL Server 2014
    database target_recovery_time = 60 seconds
    database auto shrink and index auto shrink: false
    log backup every hour
    physical disk: PCIe SSD 3.0 TB, 70% used

    other operations:
    We recently cleaned up unused indexes and reorganized index fragmentation,
    and increased the automatic data archiving operation – about 5 million rows backed up and deleted daily.

    my questions:
    Is GhostCleanupTask the cause of the problem?
    What is the ShrinkD in the logs – shrink database?
    How was it triggered?
    Why did the database engine use such high I/O without any throttling?

    Thanks very much, all the best.

    1. Looks like a) you’ve got auto-shrink enabled for the database, or there’s an Agent job doing shrink b) you’ve had a lot of deletes, and the shrink ‘touched’ a lot of the pages with deleted records on so they got queued up for the ghost cleanup task to take care of. You need to fix a).

      1. Hello, Paul.
        I checked the database. The database and index AUTO_SHRINK settings are OFF, and there was no shrink job running at that time.
        Why there would suddenly be so many ghost cleanup tasks is what I do not understand. I will continue to observe.
        Thank you very much.

          1. Thank you for your article and reply, which taught us a lot that the official documentation does not cover. It's a great thing. I'll check it again. Thank you again.

          2. Hello, Paul.
            I seem to have found the answer. When it happened again, I ran DBCC TRACEON (661, -1) on the server and it recovered immediately.
            As for why there would suddenly be so many ghost cleanup tasks without any shrink, it seems that if the number of hot ghost pages is above a specified (hard-coded) limit, the task runs non-stop until the count comes down below the threshold value. https://zh.scribd.com/document/323838917/Ghost-Records-in-SQL-Server
            Now I have re-enabled automatic ghost cleanup, and I will add a DBCC FORCEGHOSTCLEANUP job after the delete job and keep watching it.
            Thanks very much.

  7. Very informative blog, thank you.
    Quick question: does every deleted page have to go through the ghost cleanup process before it's re-allocated?

    1. The following have to go through the ghost cleanup process when they’re deleted: all index records, all data records in a clustered index, all data records in a heap when snapshot isolation is enabled.

  8. What happens when the ghost cleanup can't complete (for example, if the records are locked on a readable secondary and are being queried)? Does it exit immediately and try again 5 seconds later when it's spun up again, or does it sit there waiting for the locks to be released?

    1. That scenario isn’t possible. Read activity on a readable-secondary doesn’t cause any locking/blocking on the primary (or have any effect at all on the primary). But if the ghost cleanup task can’t get the page lock it needs, it’ll skip cleaning that page.

  9. Tong – something is reading the pages that have had a large number of deletes on them, and that is triggering ghost cleanup. There isn't a threshold; there's a queue of pages with known ghost records, which is added to when a page with ghost records is encountered. If there are a lot of such pages, the ghost cleanup task will clean up everything on its queue next time it wakes up, rather than just searching for ghost records to clean up.

    1. If so, when the ghost cleanup task cleans up everything on its queue, it seems that the limit of 10 pages no longer applies?

  10. Paul,

    I've got a table with one row that is updated frequently by a vendor application; the table is called QRTZ_FIRED_TRIGGERS and is used by the Java Quartz module as part of a vendor-supported application. I am seeing a high version_ghost_record_count. If I don't rebuild the index on this table often, our transactions per second and CPU eventually climb and the SQL instance's performance becomes poor. Read committed snapshot is set to ON. Have you written about ghost records and snapshot isolation since this article? I see ghost record cleanup running in this database and have also tried to force a full scan, but have had no luck getting this table to play nice.

    Thanks,
    Sam

    1. Hi Sam,

      Several options here:
      1) disable snapshot isolation
      2) rebuild frequently
      3) identify the long-running versioning query that’s preventing version store cleanup and ghost removal

      My guess is you’re going to be stuck with #2.

      Thanks

      1. Paul,

        Thanks for your advice. I'm unable to identify any long-running query. My tempdb version store is also massive (at 155 GB). I need to figure out how to get that issue under control as well. Hoping an app restart will let the cleanup happen.

        One thing I don't understand is why having those version ghost records causes transactions (the TPS perfmon counter) to ramp up over time. Is the engine having to scan all the pages and check whether they are marked as deleted? If I stop the rebuilds, the TPS gets up to 30-40k/s and we start to see system-wide impact.

        Thanks Again,

        Sam

      2. I've also encountered some version_ghost_record_count issues before, but they were all caused by very long-running transactions (for hours and even days sometimes), and everything would come back in order as soon as those transactions were gone. But today I faced something new and I'm at a loss: sys.dm_tran_active_snapshot_database_transactions is empty, and there are no long-lived (more than a minute or so) transactions active on the server at all, but I've got millions of version ghost records in a frequently rewritten table (a single user process routinely inserts about 1000 short rows, processes them, and then deletes them all), and I could deduce that this behavior has lasted for hours already. What else, besides long-running queries, could be the reason for this?

        1. It could be that nothing else has read those pages (so they’re not entered into the to-do list) and the ghost cleanup task is working through another database or portion of this one.

  11. Does a checkpoint need to occur before ghost cleanup runs? Specifically, If I delete rows for a clustered index, then run DBCC CLEANPAGE against pages in the index, is it *required* to run a checkpoint before the DBCC CLEANPAGE in order for it to be effective?

  12. Hello Paul,

    nice explanation, I really appreciate it, thank you very much.

    Is it possible the ghost cleanup process could be triggered by the change tracking auto cleanup?

    At this moment, we have an issue that seems corresponding to all issues described above.

    Thanks a lot.

    Bertrand.
