Slow checkpoint and 15 second I/O warnings on flash storage

Question

Last couple of weeks we've been working on getting to the root cause of what could likely be the cause of the occurrence of these I/O issues and slowdown of the checkpoints.

At first glance it looks to be clearly an I/O subsystem error and the SAN admin was to be blamed for it. But recently we changed the SAN to utilize Full Flash but as of today the error still pops up and I have no clue as to why since every metric, whether wait stats or any other metric, that I run to check if SQL server is a possible culprit seems out to return normal.

It doesn't really add up. It could also be very likely that something else is chewing the disk and SQL Server is becoming victimized here...but I am not able to find out what?

Dbs are in Availability Groups and as and when these events occur we do see role changes and flip overs occurring along with timeouts.

Any help in figuring this out would be highly appreciated. Let me know if any further details is needed.

Error msgs. below

SQL Server has encountered 14212 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\DATA\ABC.mdf] in database [ABC] (7). The OS file handle is 0x0000000000000D64. The offset of the latest long I/O is: 0x0000641262c000

SQL Server has encountered 5347 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\DATA\XYZ.mdf] in database [XYZ] (7). The OS file handle is 0x0000000000000D64. The offset of the latest long I/O is: 0x0000506c060000

FlushCache: cleaned up 111476 bufs with 62224 writes in 925084 ms (avoided 19 new dirty bufs) for db 7:0 average throughput: 0.94 MB/sec, I/O saturation: 55144, context switches 98407 last target outstanding: 10240, avgWriteLatency 14171 FlushCache: cleaned up 5616 bufs with 3126 writes in 248687 ms (avoided 3626 new dirty bufs) for db 6:0 average throughput: 0.18 MB/sec, I/O saturation: 10080, context switches 20913 last target outstanding: 2, avgWriteLatency 3

Here's the virtual file stats info over a 30 minute span:

And wait stats as well:

Here is the note from the system architect:

We separate workloads for high I/O intense workloads (such as DB) so that we only have one per host. The specs for the current host is Dell R730 with 16 cores of Xeon E5-2620 (2 sockets), 512GB, and 2x10G interconnects for storage. No other VM on the cluster nor host are experiencing these issues. Storage for VMs and workloads is on Pure FA-x20.

General System Information:

SQL Server 2012 sp3-cu9 (Enterprise Edition)
Total RAM: 128 GB
Total DB size: Close to 1 TB

What are your block sizes? Can you post output of fsutil fsinfo ntfsinfo D: replacing D: with the drive your DB files are on? Are these occurring randomly, or when tasks such as DBCC CHECKDB or backups are running? — HandyD, Apr 03 '19 at 02:30
Hi Feivel, I'm encountering the same problems. High latency and 15 seconds warning on a ALL-Flash SAN. Where you able to troubleshoot this issue any further? — Frederik Vanderhaegen, Dec 24 '20 at 12:10

score 8 · Answer 1 · answered Apr 02 '19 at 21:30

Last couple of weeks we've been working on getting to the root cause of what could likely be the cause of the occurrence of these I/O issues and slowdown of the checkpoints.

Sounds good. Have you collected and cut up the minifilter and storport tracing, yet? If so, what did it show?

At first glance it looks to be clearly an I/O subsystem error and the SAN admin was to be blamed for it. But recently we changed the SAN to utilize Full Flash but as of today the error still pops up and I have no clue as to why since every metric, whether wait stats or any other metric, that I run to check if SQL server is a possible culprit seems out to return normal.

I want go over two different areas here.

The first is that SQL Server itself doesn't actually do anything with I/O, it posts it to Windows using the typical Windows APIs. Whether it is ReadFile, WriteFile, or the vectored I/O of those it's all up to Windows. SQL Server keeps a list of pending I/O and checks that I/O at various times to get the status if it isn't completed. This is done using, again, the typical Windows asynchronous I/O model. The message is printed when the I/O has been pending and not completed, according to Windows for over 15 seconds as we're using the GetOverlappedResult Windows API to check status. This means, SQL Server doesn't really have a say in the matter, it's what is being returned via Windows.

The second item is that just because it's all flash and 10 Gb fiber doesn't mean something isn't setup or configured incorrectly, that a driver, filter, or other bug or item isn't hit, or that something isn't physically wrong. Just to get an idea:

Windows Config
Windows Drivers such as mutli-pathing being setup and the latest version
Filter Drivers (you know, disk devices, antivirus, backup, etc.)
Hypervisors (if any)
HBA drivers
HBA firmware
HBA configuration
Physical Cabling
Fiber switching
I/O Group connections/SAN/Device
Configuration of SAN/Device

That's all under SQL Server, it's just that SQL Server is the one telling you about it.

Dbs are in Availability Groups and as and when these events occur we do see role changes and flip overs occurring along with timeouts.

That's really good information to know, although it doesn't necessarily mean it's exactly related. Now, if it only happens when there is a failover, then that would hone in the issue much more and that would sound to me more like the drivers et al. doesn't like throwing a whole lot of mixed I/O at it as a failover typically results in the redo/undo and resync's happening which could be a spike in outstanding I/O.

Any help in figuring this out would be highly appreciated.

Unless it's a query or set of queries that are pushing high IOPs, which it doesn't sound like as the snapshot for 30 minutes you have was only 737,465 I/O operations which averages to 410 IOPs (not that high, especially if it's flash) looking inside of SQL Server isn't going to help with this issue since SQL Server is the messenger.

You'd want to collect if not already:

Minifilter time spent. This can be done through WPR (XPerf) if you don't have anything else. This can help if the I/O is getting stalled in a filter driver.
Storport trace. This will be the last stop on the way our and the first stop on the way back. Any time between these two readings is time spent outside of Windows... It'll also show you the targets and where the slowness might be on the other end (but isn't always conclusive).

If none of those are helpful in diagnosis or narrowing in the scope of the issue, it may be time to open a ticket with Windows Storage support and have all the data already collected so that you all can start on the same page.

Josh Darnell · Answer 2 · 2019-04-02T21:26:49.120

You mentioned you're checking wait stats and "every other metric." I assume you are seeing high PAGELATCH and WRITELOG waits? Just to double check, have you reviewed sys.dm_io_virtual_file_stats? That's where I would start when getting these 15 second I/O messages.

Use Erin Stellato's excellent article "What Virtual Filestats Do, and Do Not, Tell You About I/O Latency" as a guide in terms of what queries to use. Log snapshots of that DMV to a table every 5 or 15 minutes. Look for spikes in average stalls / latency.

Look to see if the number reads / writes, or the average bytes per read / write, has gone up during these spikes. It could be that you have maintenance or user queries that are flooding the I/O subsystem with more traffic than it can handle. These queries will need to be tuned, or the maintenance tasks need to be broken up or moved to a different time of day.

Work with your SAN admin to see if there are any "noisy neighbors" or errors in the SAN that correlate with these times. Compare the SAN setup with other SQL Server boxes - it's possible you have a throughput issue at the physical connection level, or you have caching settings that need to be tweaked, or updates that need to be installed, etc.

I realize these are somewhat general steps, but hopefully it gives you some direction of where to go next.

Regarding this:

We separate workloads for high I/O intense workloads (such as DB) so that we only have one per host...No other VM on the cluster nor host are experiencing these issues

I think it makes sense that SQL Server would be the only one seeing these problems, if it's the only one with a high I/O workload on the host - the other servers / applications might not even notice or have any way of reporting if they are experiencing disk latency.

The E drive looks particularly problematic in your screenshot of the virtual file stats. Is there anything different about that drive?

...2x10G interconnects for storage

You could have a cabling issue. Consider reseating them / making sure they have a solid connection. Possibly swap with different, known-good cables. As mentioned above, have the SAN team review caching settings and other configuration to see if there are any differences with this volume / host vs other SQL Server VMs.

Slow checkpoint and 15 second I/O warnings on flash storage

2 Answers2

Linked