
Recently upgraded from SQL Server 2008 R2 to 2016 SP2. I have a slow process that I can't change. The application contacts the server back and forth constantly and runs tiny queries, including performing an update 30K+ times serially.

I've tried all the suggested indexes, etc., to try to help, since this is a sticking point for completing the upgrade to production. The new server is loaded with twice as many processors, double the memory, and so on.

The interesting thing about the process is that in testing it, using the same process and parameters, every time the server is restarted it runs fast, then each subsequent run is about 2 minutes slower. It steadies out around the 5th or 6th run, and it's this same pattern every time. I would expect it to get better as the server caches and reuses the plans, but it's consistently the same behavior. I know that the underlying process needs to be corrected, but I don't have any control over that at the moment, and I find the pattern very odd, since nothing else is running on the server.

Testing has consisted of running the process from the app several times after each change. I started in 2008 compatibility mode, reindexed and updated stats, applied missing indexes based on plan cache info, updated to use the current 2016 Query Optimizer and Cardinality Estimator, enabled Optimize for ad hoc workloads, and moved to full 2016 compatibility mode. With each change, the same pattern ensues.
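For reference, the changes above amount to roughly the following (the database and table names are placeholders, not the actual objects involved):

```sql
-- Rough sketch of the tuning steps described above; names are
-- placeholders, not the actual objects involved.
ALTER DATABASE [MyAppDb] SET COMPATIBILITY_LEVEL = 130;  -- full SQL Server 2016

-- Enable "optimize for ad hoc workloads" at the instance level
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;

-- Rebuild indexes and refresh statistics on a given table
ALTER INDEX ALL ON dbo.SomeTable REBUILD;
UPDATE STATISTICS dbo.SomeTable WITH FULLSCAN;
```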

The baseline runs against a 2008 database and is consistently somewhere in the middle of the times I get for 2016. It is run from the same application and workstation.

VMs for the app server and DB server. No AGs yet. The queries are all trivial in time. There are tons of them, but the queries themselves aren't slowing down. Max memory is around 64 GB (and nothing else is hitting this server right now).

*Update: I'm sorting through some Extended Events logs now to see if I can pare down specifics as to what runs longer. Even after filtering out the query hash = 0 entries, each run comprises 13,602 executions (ad hoc SQL statements + rpc_completed), so it's a lengthy process. I was hoping for some insight as to why a SQL Server restart would make a 10-minute difference in the duration of the process, since nothing else is running on this server. I would think a restart would make it slower for the first few runs, but it's exactly the opposite. Even things like sp_setapprole are taking longer on subsequent runs.

Any ideas why I would have query hashes all equal to "0" on queries running over 20 seconds? Is this telling me these queries hit compilation timeouts?
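For anyone reproducing this, a query along these lines will list the cached statements with a zero query hash (the TOP value and snippet length are arbitrary choices, not anything significant):

```sql
-- List cached statements whose query_hash is zero, ordered by
-- total elapsed time; TOP value and snippet length are arbitrary.
SELECT TOP (50)
    qs.query_hash,
    qs.execution_count,
    qs.total_elapsed_time / 1000 AS total_elapsed_ms,
    SUBSTRING(st.text, 1, 200) AS query_snippet
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE qs.query_hash = 0x0
ORDER BY qs.total_elapsed_time DESC;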

  • Could you provide further information about this "process"? I.e., execution plans of the fast and slow runs, DML/DDL code samples, etc.? – George.Palacios Jun 05 '18 at 07:36
  • As @George.Palacios says, you should go for the low-hanging fruit first. However, rather than worrying about something that's executing 30K+ times, worry about what's actually taking the time: 30K × 1.0 µs isn't worth worrying about, but something that runs once and takes 30 s is! Oracle has the OWI (Oracle Wait Interface); SQL Server has DMVs (I know nothing about them)! Is your disk thrashing? Is your CPU page faulting? Is your RAM bottlenecked? It may well be your 30K thing, but check the others! – Vérace Jun 05 '18 at 08:44
  • Before you start tuning this, have you done any benchmarking of the servers themselves? We got caught with a new VM that performed badly; it turned out to be host-load related. – Sir Swears-a-lot Jun 05 '18 at 09:13
  • Go for the low-hanging fruit first. Don't worry about the thing that's executing 30K+ times; worry about what's actually taking the time: 30K × 1.0 µs isn't worth worrying about, but something that runs once and takes 30 s is! Oracle has the OWI (Oracle Wait Interface); SQL Server has DMVs (I know nothing about them)! Is your disk thrashing? Is your CPU page faulting? Is your RAM bottlenecked? It may well be your 30K thing, but check the others! @Peter makes an excellent point about the importance of ongoing benchmarking! – Vérace Jun 05 '18 at 09:30
  • Try getting wait stats, execution plans, etc., and baseline them against your 2008 instance. Without more specifics, this is going to be very hard to investigate through a Q&A site. – Tom V Jun 05 '18 at 10:25
  • The VMs have been benchmarked. The new server is outperforming the baseline server in all aspects except this. CPU, RAM, disk - nothing is thrashed during the process. I know our first grab is going for the execution plans, but this process takes 7-15 minutes of back and forth between the app and SQL Server. Analyzing each plan (I looked for any obvious outliers; found none) would take until next year. I was hoping the odd but consistent behavior would hit home with someone. If the duration were inconsistent, it would make more sense to me, as many of the queries hit compilation timeouts, but it's the same every time. – Dbot Jun 05 '18 at 19:35
  • Have you enabled Query Store? It may help you quickly narrow down where the time is being spent. – Sir Swears-a-lot Jun 06 '18 at 08:49
  • Are the DB file growth settings the same? After you restart, are you starting with a baseline DB, or are you just repeating the process on the same DB? Does your process insert a lot of data? – Sir Swears-a-lot Jun 06 '18 at 08:59
  • The DB hasn't grown. The process is being repeated on the same DB. It inserts and updates; a script to roll everything back is then run. It does create some fragmentation, but rebuilding the indexes has no effect. I have Query Store on. I should dig deeper to see if any queries are taking longer; initial looks didn't seem to indicate it, but I should analyze more. – Dbot Jun 06 '18 at 15:00
  • I have traces of both a short and a long run. The difference is 260K entries for the short run vs. 385K entries for the long run, filtered on procs and SQL. If I could filter this more meaningfully, I could maybe discern why there are so many more entries on the longer runs. Removing the "SELECT @@SPID" already knocked out 180K entries from each to get to these numbers. – Dbot Jun 06 '18 at 15:11
  • @Dbot I see this is getting some votes to close as too broad. I suggest summarizing all the data you've added in the comments in an edit to your question to avoid getting it closed, as I believe you're narrowing it down quite dutifully. – LowlyDBA - John M Jun 07 '18 at 22:55

1 Answer


Solved: I'm not sure it will help anyone else, but just in case, what I found was that the serial nature of the process (one command per connection) was causing it to send 143,000+ sp_setapprole calls per run. This was bloating the TokenAndPermUserStore cache by 1 GB each time it ran. This is why the process ran quickly after a restart of the service, and subsequent runs got slower as it took longer to search through the security cache. The long-term fix is, of course, to have the process rewritten correctly with set-based inserts and updates, but as a temporary solution, I have a job monitoring the size of the cache and flushing it with DBCC FREESYSTEMCACHE ('TokenAndPermUserStore'). This will add some I/O to other calls, but sometimes we're forced to do what we can with the code we're given.
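The monitoring job looks roughly like this (the 500 MB threshold is an illustrative value, not necessarily the right cutoff for your workload):

```sql
-- Report the current size of the TokenAndPermUserStore cache
SELECT SUM(pages_kb) / 1024.0 AS cache_mb
FROM sys.dm_os_memory_clerks
WHERE name = 'TokenAndPermUserStore';

-- Flush the cache once it exceeds a threshold (500 MB here is
-- an illustrative value, not necessarily the right cutoff)
IF (SELECT SUM(pages_kb)
    FROM sys.dm_os_memory_clerks
    WHERE name = 'TokenAndPermUserStore') > 500 * 1024
    DBCC FREESYSTEMCACHE ('TokenAndPermUserStore');
```

Scheduled from a SQL Agent job, this keeps the cache from growing unbounded between runs without flushing it unnecessarily when it's still small.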
