You’re never too experienced to learn a new lesson. Hopefully the following story will save you from a similar predicament.
I provide DBA/Architect services for a very large corporation with a very busy mobile application. After a major application upgrade, including a migration to new VMs, we experienced intermittent outages. We’re talking total application meltdown, with many application servers needing reboots to resolve the issue.
Before the data gets from SQL Server 2016 Enterprise to the customer’s phone, it passes through 7 or 8 layers of different technologies, including a message queue. After each outage it looked like SQL Server was perfectly healthy and happy, and the problem appeared to lie with the message queue servers spiraling to an impressive death.
Fortunately, I capture performance metrics hourly. Plus, we turned on the Query Store, which provided another treasure trove of information. We also had at least 2 other monitoring tools in place.
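When you have the Query Store enabled, you can mine it for exactly this kind of intermittent problem. Here is a hedged sketch of the sort of query I mean, using the standard Query Store catalog views; it looks for statements whose worst-case I/O dwarfs their average, which is a hint that the same plan behaves very differently for some inputs:

```sql
-- Sketch: queries with a large gap between average and worst-case logical reads.
-- Uses the built-in Query Store catalog views (SQL Server 2016+).
SELECT TOP (10)
    qt.query_sql_text,
    rs.count_executions,
    rs.avg_logical_io_reads,
    rs.max_logical_io_reads
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS p
    ON p.plan_id = rs.plan_id
JOIN sys.query_store_query AS q
    ON q.query_id = p.query_id
JOIN sys.query_store_query_text AS qt
    ON qt.query_text_id = q.query_text_id
ORDER BY rs.max_logical_io_reads DESC;
```

A big spread between `avg_logical_io_reads` and `max_logical_io_reads` on a single plan is exactly the signature this story turned out to have.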
Almost immediately, I warned the application team about a “Select *” beauty that was performing badly. The trouble was that despite the “*”, it had a pretty darn good execution plan. That is, every time we caught a plan or tested the query, it was a very fast index seek. It was a very simple, non-nullable parameterized query. Still, this query kept showing up with insane I/O numbers every once in a blue moon. Every week it would claim 2 of the top 10 spots among the hardest-working queries.
Despite all of the numbers, I couldn’t understand how it was performing so badly at times. The app team and other operational database support people were at a loss as well. I kept thinking, “Maybe it reaches a threshold at which it just gets called too many times concurrently, and the ‘*’ just overwhelms the system with all of those columns.” I also thought about parameter sniffing, but I was fairly confident a bad plan wasn’t sneaking in there.
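One quick way to back up that confidence is to check how many distinct plans the statement actually has in cache. This is a hedged sketch against the plan-cache DMVs; the LIKE pattern is a placeholder for the real statement text:

```sql
-- Sketch: count distinct cached plan variants for the suspect statement.
-- If plan_variants = 1, classic parameter sniffing (a bad plan replacing
-- a good one) is unlikely to be the story.
SELECT qs.query_hash,
       COUNT(DISTINCT qs.query_plan_hash) AS plan_variants,
       SUM(qs.execution_count)            AS total_execs
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%SELECT *%'   -- placeholder pattern for the real query
GROUP BY qs.query_hash;
```

In this case the plan never changed; the same “good” seek plan simply did wildly different amounts of work for different parameter values.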
Finally, an operational support DBA said, “The heck with it. I’m running a Profiler trace all day, and hopefully I get lucky.” At least I imagine that’s what he said to himself. I thought it was a bit risky and wished he would’ve used an Extended Events session instead; but I’m glad he did it, because he did get lucky. Yes, the system was a little slower for a few hours, but he found the culprit. He filtered the trace to just that same “Select *” query I had identified, and amidst thousands of calls he captured an instance in which the parameter was an empty string. My first thought was, “Surely this non-nullable column wouldn’t be a huge problem with an empty string. Who would store an empty string in this fairly important column on the main customer table?”
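For the record, the lighter-weight alternative I wished he’d used would look something along these lines. This is a hedged sketch; the session name, file name, and LIKE pattern are all placeholders for the real statement:

```sql
-- Sketch: an Extended Events session that captures only completed statements
-- matching the suspect query -- far cheaper than an all-day Profiler trace.
CREATE EVENT SESSION [capture_suspect_query] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.client_app_name)
    WHERE sqlserver.sql_text LIKE N'%SELECT *%'   -- placeholder pattern
)
ADD TARGET package0.event_file
(
    SET filename = N'capture_suspect_query.xel'
);

ALTER EVENT SESSION [capture_suspect_query] ON SERVER STATE = START;
```

Filtering on the event itself means the server only pays to keep the calls you care about, parameter values included, instead of logging everything and sifting afterward.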
It turns out that of the 23 million customer rows, over 6 million of them had an empty string for the offending column. So 99.9% of the time, the query returned 1 row. A few times a week, however, it flooded the system with 26% of the customer data.
While I wish the application team had investigated the query immediately, they were fairly swamped with troubleshooting other parts of the stack. Additionally, I could not prove that this query was the root cause.
Now I know. Trust the numbers. Even if it doesn’t seem possible or likely that the problem lies within your domain, take a step back and consider more possibilities. Encryption makes it impossible for me to investigate the data, but I wish I would’ve thought of the possibility of an empty string. Deep in the recesses of my dusty brain, I knew the “*” didn’t explain the huge amounts of I/O and CPU consumption. I shouldn’t have waited for the app team to discover the problem. A simple Group By/Count on that column might’ve turned the light bulb on.
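That sanity check would have been trivial to run. A hedged sketch, with placeholder table and column names standing in for the real customer table:

```sql
-- Sketch: value distribution for the column used as the seek predicate.
-- A wildly skewed top value (like 6 million empty strings) jumps right out.
SELECT SomeImportantColumn,
       COUNT(*) AS RowCnt
FROM dbo.Customer          -- placeholder table name
GROUP BY SomeImportantColumn
ORDER BY RowCnt DESC;
```

Five seconds of output would have shown 6 million rows sharing one value, and the mystery would have been over months earlier.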