
I have a production "Microsoft SQL Server 2012 (SP1) - 11.0.3128.0 (X64)" that is showing weird buffer and page life expectancy (PLE) symptoms.

I am running this every minute on my server (to track this issue):

-- @runCountString, @ple and @usedBufferPages come from the surrounding loop/job that
-- runs this once a minute; they are declared here only so the fragment parses on its own.
DECLARE @runCountString VARCHAR(20), @ple VARCHAR(20), @usedBufferPages VARCHAR(20)

SELECT @ple = CAST([cntr_value] AS VARCHAR(20))
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Manager%'
AND [counter_name] = 'Page life expectancy'

SELECT @usedBufferPages = CAST(COUNT(*) / 128 AS VARCHAR(20))  -- 128 pages of 8 KB = 1 MB
FROM sys.dm_os_buffer_descriptors

DECLARE @StartDate VARCHAR(8) = CONVERT(VARCHAR(8), GETDATE(), 14)  -- truncated to hh:mm:ss
RAISERROR ('%s. PLE at %s and Used Buffers at %s at %s ', 0, 1,
           @runCountString, @ple, @usedBufferPages, @StartDate) WITH NOWAIT

This is some example output:

16. PLE at 858 and Used Buffers at 7290 at 09:51:42 
17. PLE at 918 and Used Buffers at 7342 at 09:52:42 
18. PLE at 978 and Used Buffers at 7408 at 09:53:43 
19. PLE at 1039 and Used Buffers at 7547 at 09:54:43 
20. PLE at 1100 and Used Buffers at 7697 at 09:55:44 
21. PLE at 1160 and Used Buffers at 7901 at 09:56:45 
22. PLE at 1221 and Used Buffers at 7961 at 09:57:46 
23. PLE at 1282 and Used Buffers at 8012 at 09:58:46 
24. PLE at 11 and Used Buffers at 313 at 09:59:46 
25. PLE at 31 and Used Buffers at 966 at 10:00:46 
26. PLE at 90 and Used Buffers at 1580 at 10:01:47 
27. PLE at 151 and Used Buffers at 3072 at 10:02:47 
28. PLE at 211 and Used Buffers at 3152 at 10:03:47 
29. PLE at 271 and Used Buffers at 3729 at 10:04:47  

At item #24 SQL Server reports the PLE going from 1,282 to 11. SQL Server also reports that the used buffers go from 8,012 to 313.

First I looked for poorly running queries, and I found and fixed a few (which had no effect on the issue). But I am not finding any problem queries that correlate with the times when I have PLE/buffer issues. Also, if it were a poorly running query, I would expect the buffers to be full of that query's data, not empty/missing/errored.
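
One way to narrow that down further would be to break the buffer pool contents down per database at the moment of a drop; if one database's pages vanish while the other's stay put, that points somewhere specific. A minimal sketch (not part of the original monitoring script):

    -- Sketch: buffer pool usage per database, in MB (128 pages of 8 KB = 1 MB)
    SELECT DB_NAME(database_id) AS database_name,
           COUNT(*) / 128       AS buffer_mb
    FROM sys.dm_os_buffer_descriptors
    GROUP BY database_id
    ORDER BY buffer_mb DESC;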

Next I thought that the Virtual Machine was getting its memory restricted when this happened. But I have asked my System Admin and he assures me that the memory is not dynamic or shared in any way. (What it is assigned, it gets, all the time.) I also run this query every 10 minutes, and whenever the PLE reports less than 50:

  SELECT * FROM sys.dm_os_sys_memory

And it reports the same/similar values when the PLE/Buffers are high and when they are low. For completeness, here is an example of the values before and after #24 above:

total_physical_memory_kb    available_physical_memory_kb    total_page_file_kb  available_page_file_kb  system_cache_kb kernel_paged_pool_kb    kernel_nonpaged_pool_kb   system_high_memory_signal_state   system_low_memory_signal_state   system_memory_state_desc
20970996                    4758672                         24378868            7929404                 4844160         686076                  182752                    1                                 0                                Available physical memory is high
20970996                    4743468                         24378868            7892632                 4845000         686580                  182688                    1                                 0                                Available physical memory is high
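
Another DMV worth sampling alongside the OS-level numbers above is sys.dm_os_process_memory, which reports whether SQL Server itself thinks it is under memory pressure; a minimal sketch:

    -- Sketch: SQL Server's own memory-pressure flags
    SELECT physical_memory_in_use_kb,
           memory_utilization_percentage,
           process_physical_memory_low,   -- 1 = low physical memory signalled to the process
           process_virtual_memory_low
    FROM sys.dm_os_process_memory;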

I have checked the System Health session and it shows nothing related. (All it has are impersonation failures, and their times do not correlate with the times the PLE/buffers show issues.)

I have tracked how often this occurs, but I cannot see a pattern or connect it to any jobs or scheduled activities.

Here is a graph that shows PLE and Buffers over 21 hours:

[Graph: PLE and Buffers over 21 hours]

So I am stumped. I think the core of the issue is the buffers, not the PLE. (I think the PLE only reads as low because all the buffers are somehow gone.)

But I can't think of any way that this could happen. Or what to do next.

I would love advice on additional things to check or suggestions of what this issue might be.

Updates from questions in the comments:

So, how much memory is the server given? The VM has 20 GB of memory.
What is max server memory?

name                    value   value_in_use  description
max server memory (MB)  13000   13000         Maximum size of server memory (MB)
min server memory (MB)  0       16            Minimum size of server memory (MB)

NOTE: I have done a bit of reading on this just now, and it seems these settings are wrong for my server.
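
If these do need changing, the adjustment itself is simple; the values below are purely illustrative placeholders for a 20 GB VM, not a recommendation from anyone in this thread:

    -- Sketch only: adjust memory settings (illustrative values, pick your own)
    EXEC sys.sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sys.sp_configure 'max server memory (MB)', 16000;  -- hypothetical target
    EXEC sys.sp_configure 'min server memory (MB)', 4000;   -- hypothetical target
    RECONFIGURE;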

How large is the database? There are two transactional databases running on this server (I am in the process of getting servers to isolate them.) Their sizes are 383 GB and 378 GB.

What other applications and services are running on that server? This server hosts the data for my application. There are no other things hitting it. (I have a replicated Operational Data Store for reports and such.)

What is the VM technology? VMware.
Is this VM running on a host that only hosts VMs with similar resource allocation? We have many VMs at our company, all of varying sizes. This is one of the largest, though.

Can you confirm what your System Admin is telling you about memory allocation without just having to believe him? I cannot. I don't have access to those tools.

(In my experience, System Admins will say a lot of things to pass the buck and blame the app or anyone else if it means they don't have to do anything.) I can fully understand that sentiment.

That pattern certainly seems like severe memory pressure. I agree. I was hoping to find something to prove that SQL Server is feeling memory pressure, so I can send it back to the System Admins for more research.
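
One concrete signal worth capturing for that (a suggestion, not something already shown in the post) is queries stuck waiting for workspace memory grants:

    -- Sketch: requests currently waiting for a memory grant, plus the matching perf counter.
    -- A sustained non-zero 'Memory Grants Pending' value indicates memory pressure.
    SELECT session_id, requested_memory_kb, granted_memory_kb, wait_time_ms
    FROM sys.dm_exec_query_memory_grants
    WHERE grant_time IS NULL;

    SELECT cntr_value AS memory_grants_pending
    FROM sys.dm_os_performance_counters
    WHERE [object_name] LIKE '%Memory Manager%'
      AND [counter_name] = 'Memory Grants Pending';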

Wait Time Statistics

WaitType               Wait_S      Resource_S  Signal_S  WaitCount  Percentage   AvgWait_S  AvgRes_S  AvgSig_S 
---------------------- ----------- ----------- --------- ---------- ------------ ---------- --------- ---------
PAGEIOLATCH_SH         16250.10    16219.14    30.96     2171649    29.59        0.0075     0.0075    0.0000   
CXPACKET               14214.03    13238.56    975.47    1187935    25.88        0.0120     0.0111    0.0008   
PAGEIOLATCH_EX         6814.59     6806.21     8.38      638725     12.41        0.0107     0.0107    0.0000   
WRITELOG               5157.42     4873.44     283.98    3588476    9.39         0.0014     0.0014    0.0001   
BACKUPIO               2569.51     2538.12     31.39     1704119    4.68         0.0015     0.0015    0.0000   
LCK_M_IX               2477.15     2477.10     0.05      113        4.51         21.9217    21.9213   0.0004   
ASYNC_IO_COMPLETION    2079.99     2079.66     0.33      836        3.79         2.4880     2.4876    0.0004   
BACKUPBUFFER           1807.75     1759.11     48.64     380189     3.29         0.0048     0.0046    0.0001   
IO_COMPLETION          986.23      985.84      0.39      116112     1.80         0.0085     0.0085    0.0000   
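
(For reference: the exact script used for the table above isn't included in the post, but a reduced sketch of the kind of sys.dm_os_wait_stats query that produces this output is below.)

    -- Sketch: top waits with resource/signal split, filtering a few benign wait types
    SELECT wait_type,
           wait_time_ms / 1000.0                             AS Wait_S,
           (wait_time_ms - signal_wait_time_ms) / 1000.0     AS Resource_S,
           signal_wait_time_ms / 1000.0                      AS Signal_S,
           waiting_tasks_count                               AS WaitCount,
           100.0 * wait_time_ms / SUM(wait_time_ms) OVER ()  AS Percentage
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'CHECKPOINT_QUEUE',
                            N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT',
                            N'LOGMGR_QUEUE', N'DIRTY_PAGE_POLL', N'BROKER_TO_FLUSH')
    ORDER BY wait_time_ms DESC;
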
Vaccano
  • What is the total memory on the server instance (RAM) and what is the Max Memory you have configured? What are you trying to solve/address by looking at the PLE and buffer values? Also, read up on Page Life Expectancy isn't what you think… – Kin Shah Sep 21 '15 at 17:31
  • Are you just trying to fix the numbers, or is there an actual performance problem you can correlate to these points in time? PLE can jump all over the place, and just because it dips from time to time does not necessarily mean there is a problem. It can also be caused by things that aren't necessarily obvious problems, for example a query has a huge memory grant, but runs quickly... – Aaron Bertrand Sep 21 '15 at 17:31
  • @AaronBertrand - My main application that runs on this server running very slow when the PLE/Buffers dip. (And it dips often.) – Vaccano Sep 21 '15 at 17:34
  • @Kin - I am trying to solve a performance issue with my application. It is running slow when these issue happen. – Vaccano Sep 21 '15 at 17:35
  • So, how much memory is the server given? What is max server memory? How large is the database? What other applications and services are running on that server? What is the VM technology, and is this VM running on a host that only hosts VMs with similar resource allocation? Can you confirm what your System Admin is telling you about memory allocation without just having to believe him? (In my experience, System Admins will say a lot of things to pass the buck and blame the app or anyone else if it means they don't have to do anything.) That pattern certainly seems like severe memory pressure. – Aaron Bertrand Sep 21 '15 at 17:42
  • Are you seeing queries waiting on memory grants? I feel as though you may be barking up the wrong tree. What kind of activity are you seeing on your sql server when the application is running slowly? – Zane Sep 21 '15 at 17:59
  • @AaronBertrand - I updated my question with the answers to the questions you asked. – Vaccano Sep 21 '15 at 18:56
  • @Zane - I am not seeing queries waiting on memory grants. During normal operation nor when the values dive. – Vaccano Sep 21 '15 at 18:59
  • @Vaccano This is your problem: There are two transactional databases running on this server. Their sizes are 383 GB and 378 GB. ==> Your databases are very large and your buffer pool is ONLY 13 GB, which is far too little. Since the databases are transactional, with the limited buffer pool size you have, pages need to be washed out of the buffer pool more frequently. Add more memory, and/or separate the databases onto different instances with a reasonable amount of memory that can fit your database. – Kin Shah Sep 21 '15 at 18:59
  • So you have 20 GB of memory and your databases total 800 GB? I am surprised that your PLE is ever positive. – Aaron Bertrand Sep 21 '15 at 19:00
  • I think a key question is why are used buffers suddenly so low? There is something using memory that is not buffers. Do worktables for sorting and hashing count as buffers? I think not. Or, you are running DROPCLEANBUFFERS... Attach profiler to rule that out. – usr Sep 21 '15 at 22:49
  • I'm not quite convinced the database size is problematic with that little memory. Maybe most of it is cold. I have a 350 GB database in production with 6 GB of buffer pool. Almost all data is dead cold. Perf and PLE are good. – usr Sep 21 '15 at 22:52
  • It should be noted that the build of SQL that you're running has a known bug that can amplify memory pressure effects. See: https://support.microsoft.com/en-us/kb/2845380 – willaien Sep 24 '15 at 19:06

4 Answers


Your buffer pool is only 13 GB, and your databases are 383 GB and 378 GB, which you have classified as OLTP - lots of small transactions running very frequently.

If I had to picture the above situation, it would look something like this:

[image omitted] (source: Google Photos)

You have to understand how SQL Server stores information:

SQL Server stores information in memory in a structure called a memory cache. The information in the cache can be data, index entries, compiled procedure plans, and a variety of other types of SQL Server information. To avoid re-creating the information, it is retained in the memory cache as long as possible and is ordinarily removed from the cache when it is too old to be useful, or when the memory space is needed for new information. The process that removes old information is called a memory sweep. The memory sweep is a frequent activity, but is not continuous.

You are almost certainly experiencing memory starvation due to the sheer size of your databases and your inadequate buffer pool. Refer to - How to determine ideal memory for instance?

Collect wait stats and check for performance issues that arise from wasted buffer pool memory.

Recommendation:

Add more memory to the server instance and separate the two databases onto different VMs, each with adequate memory.
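
One way to quantify the mismatch this recommendation is based on (a sketch, not from the original answer) is to put the total data-file size per database next to the 13 GB max server memory cap:

    -- Sketch: data-file size per database in MB (size is reported in 8 KB pages)
    SELECT DB_NAME(database_id) AS database_name,
           SUM(size) / 128      AS data_file_mb
    FROM sys.master_files
    WHERE type_desc = 'ROWS'
    GROUP BY database_id
    ORDER BY data_file_mb DESC;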

Kin Shah

There is very little to debug here - you need to add memory, logically split your database across multiple VMs, or understand that the shuffling you have to do with limited memory will lead to performance issues and volatile PLE. Trying to fit 800 GB of data into 13 GB of memory is like trying to stow away in a backpack.

Aaron Bertrand
  • Are the drops in buffer usage explained by low memory as well? – Vaccano Sep 21 '15 at 19:11
  • @Vaccano Certainly, buffer space needs to be cleared out to make room for other data, before the other data can be moved into buffer - and this is what PLE actually measures (how long a page is likely to stay in the buffer pool). With only 13 GB available, this probably happens for just about every query. – Aaron Bertrand Sep 21 '15 at 19:13
  • I guess I don't understand it then. I thought that it was more like a circular memory thing, where old buffers were overwritten with new ones. Not a "clear out, then slowly build back up" kind of thing... (like I am seeing) – Vaccano Sep 21 '15 at 19:14
  • As a side note, I did a quick grab of the sizes from the properties window. But one of my databases has 200 GB of dead tables from a migration that occurred a short while ago. They are never selected from and will be deleted shortly. Then I will be down to 600 GB total (still a lot). – Vaccano Sep 21 '15 at 19:16
  • If you say "SELECT * FROM 40 GB table" where is it supposed to put that 40 GB of data if all 13 GB of memory are already in use? I'd be interested to inspect the source code to see exactly how it deals with that, but I doubt it's replace the contents on this page, replace the contents on this page, replace the contents on this page... – Aaron Bertrand Sep 21 '15 at 19:18
  • What I am seeing is buffer of 8000+ then it drops down to 300ish. It then slowly builds back up over a half hour. We don't have any queries that are taking longer than a second to run. So it would not be a single query doing this (or its data would fill the buffer after it is done). The only way I can see for the buffer to clear out is if SQL Server decides to drop things from the buffer when it is done. I am not doing multi GB result sets, but we do move a lot of data. Would SQL Server decide that some data is not useful and remove it from the buffer? – Vaccano Sep 21 '15 at 19:26
  • No, it tends to assume that if you are pulling data into the buffer pool, you will use it more than once. But we can't really answer your theoretical question - there could be dozens of things causing the buffer pool to clear out. Again, the example I gave above, a query doesn't have to take a long time to run to use a lot of memory, so just saying "there are no long-running queries" does not allow you to say "there are no queries that use a lot of memory." – Aaron Bertrand Sep 21 '15 at 19:28
  • But wouldn't that query's stuff still be in the buffer? That is my confusion point. If a high-memory query runs (even very fast), wouldn't it leave behind what it pulled into the buffer? So it could be seen in sys.dm_os_buffer_descriptors. (Instead of an empty buffer pool like I am seeing.) – Vaccano Sep 21 '15 at 19:30
  • @Vaccano so you only ever have one query running on your system? Surely you have different users running different queries. And even if you don't, it could be one query that needs to pull more data into memory than the memory you have. How is SQL Server supposed to do that? Magic? – Aaron Bertrand Sep 21 '15 at 19:34
  • I am not getting this. I am so sorry I am not able to understand. The logic I am following is: queries get run -> when they run, they put data into the buffer -> this pushes old data out of the buffers. So if one event is going to cause my buffers to go from 8000+ to 300, it should have the data from that event in the buffer before that event finishes. I have no events that are taking longer than 1 or 2 seconds to finish. So when it goes down to 300, the large memory query is emptying out the buffer to make space. But when it finishes in 2 seconds, its data should be in the buffer, right? – Vaccano Sep 21 '15 at 19:38
  • I said it could be one event. It could also be 14,000 events, and none of them have to take long enough to be found in DMVs when you get around to looking at them. – Aaron Bertrand Sep 21 '15 at 19:38
  • Ah, so it is a storm of events that happen to get in sync every now and then. Still, shouldn't the buffer refill nearly immediately (not over the course of a half hour)? (I am not trying to be difficult; I am going to need all the ammo I can get to convince my System Admins to give me more memory. I am already over double the max they like to give out.) – Vaccano Sep 21 '15 at 19:40
  • I think we have gone the rounds on this one enough. I thank you very much for your time. – Vaccano Sep 21 '15 at 19:42

As discussed in this SE thread and confirmed by the OP.

The issue is due to a bug in SQL Server 2012. This bug was fixed in SQL Server 2012 SP1 CU4. Or, to be on the safer side, I would recommend you apply SQL Server 2012 SP2 instead of going for CU4.

As per the Microsoft bug fix details:

You may experience slow performance in SQL Server 2012. When you check SQL Server Performance Monitor tools, you see the following:

• A rapid decline in the SQLServer:Buffer Manager\Page life expectancy performance counter values. When this issue occurs, the counter is near 0.
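
Before and after patching, the installed build can be confirmed directly and compared against the builds listed in the KB article (a small addition, not part of the original answer):

    -- Sketch: check the current build and service-pack level
    SELECT SERVERPROPERTY('ProductVersion') AS product_version,
           SERVERPROPERTY('ProductLevel')   AS product_level;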

Shanky
  • After installing SP2 my PLE is over 2000 95% of the time. – Vaccano Sep 29 '15 at 13:29
  • @Vaccano I don't suspect that applying the service pack will be a permanent fix. You have 800 GB of data and not enough memory to hold it. Your performance will become volatile again regardless of what some PLE counter says. – Aaron Bertrand Sep 29 '15 at 15:34
  • @AaronBertrand - If that happens I will adjust at that time. As 'usr' said in the question comments, most of my data is cold. The amount of data that is heavily transacted is just the most recent stuff. I think that is why I am able to see high performance with as little memory as I have. Also, a large portion of that data is partitioned so that the cold stuff is separate from the recent stuff. Either way, for now, performance is much better and PLE averages around 3500. While I believe that this "bug" was the cause of my problems, I thank you for the time you spent on my issue! – Vaccano Sep 29 '15 at 15:47

Look more closely at the queries being executed. Memory usage alone is normally too coarse a metric to improve things. Even if you cannot affect the queries (a black-box application), it is still worth understanding what is driving the memory usage. For instance, a batch process might use all the buffer space in a single hit by querying all the data in a massive table.

In particular, look for any missing indexes that cause full table scans, as those scans can effectively flush the cache on the server.
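
The missing-index DMVs give a quick first pass at that (a sketch, not from the original answer; treat the output as hints rather than indexes to create blindly):

    -- Sketch: missing-index suggestions recorded by the optimizer since the last restart
    SELECT TOP (20)
           migs.avg_user_impact,
           migs.user_seeks + migs.user_scans AS uses,
           mid.statement                     AS table_name,
           mid.equality_columns,
           mid.inequality_columns,
           mid.included_columns
    FROM sys.dm_db_missing_index_group_stats AS migs
    JOIN sys.dm_db_missing_index_groups      AS mig ON mig.index_group_handle = migs.group_handle
    JOIN sys.dm_db_missing_index_details     AS mid ON mid.index_handle = mig.index_handle
    ORDER BY migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC;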

SQL Server has an excellent set of analyser tools that can monitor it in real time, and I suspect you'll see something stick out like a sore thumb once you delve into it.

Not that I'm suggesting changing the database schema, but one thing to look out for is overly large varchar fields - they can really suck up cache space on a large database.

Alexp