Last couple of weeks we've been working on getting to the root cause of what could likely be the cause of the occurrence of these I/O issues and slowdown of the checkpoints.
At first glance it looks to be clearly an I/O subsystem error and the SAN admin was to be blamed for it. But recently we changed the SAN to utilize Full Flash but as of today the error still pops up and I have no clue as to why since every metric, whether wait stats or any other metric, that I run to check if SQL server is a possible culprit seems out to return normal.
It doesn't really add up. It could also be very likely that something else is chewing the disk and SQL Server is becoming victimized here...but I am not able to find out what?
Dbs are in Availability Groups and as and when these events occur we do see role changes and flip overs occurring along with timeouts.
Any help in figuring this out would be highly appreciated. Let me know if any further details is needed.
Error msgs. below
SQL Server has encountered 14212 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\DATA\ABC.mdf] in database [ABC] (7). The OS file handle is 0x0000000000000D64. The offset of the latest long I/O is: 0x0000641262c000
SQL Server has encountered 5347 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\DATA\XYZ.mdf] in database [XYZ] (7). The OS file handle is 0x0000000000000D64. The offset of the latest long I/O is: 0x0000506c060000
FlushCache: cleaned up 111476 bufs with 62224 writes in 925084 ms (avoided 19 new dirty bufs) for db 7:0 average throughput: 0.94 MB/sec, I/O saturation: 55144, context switches 98407 last target outstanding: 10240, avgWriteLatency 14171 FlushCache: cleaned up 5616 bufs with 3126 writes in 248687 ms (avoided 3626 new dirty bufs) for db 6:0 average throughput: 0.18 MB/sec, I/O saturation: 10080, context switches 20913 last target outstanding: 2, avgWriteLatency 3
Here's the virtual file stats info over a 30 minute span:
And wait stats as well:
Here is the note from the system architect:
We separate workloads for high I/O intense workloads (such as DB) so that we only have one per host. The specs for the current host is Dell R730 with 16 cores of Xeon E5-2620 (2 sockets), 512GB, and 2x10G interconnects for storage. No other VM on the cluster nor host are experiencing these issues. Storage for VMs and workloads is on Pure FA-x20.
General System Information:
- SQL Server 2012 sp3-cu9 (Enterprise Edition)
- Total RAM: 128 GB
- Total DB size: Close to 1 TB

