SQL Server, Troubleshooting

Trust Your SQL Performance Research


You’re never too experienced to learn a new lesson.  Hopefully the following story will save you from a similar predicament.

 

Background


I provide DBA/architect services for a very large corporation with a very busy mobile application.  After a major application upgrade, which included a migration to new VMs, we began experiencing intermittent outages.  We're talking total application meltdown, requiring reboots of many application servers to resolve.

Before the data gets from SQL Server 2016 Enterprise to the customer's phone, it passes through 7 or 8 layers of different technologies, including a message queue.  After each outage, SQL Server looked perfectly healthy and happy, and the problem appeared to lie with the message queue servers, which were spiraling to an impressive death.

 

Troubleshooting


Fortunately, I capture performance metrics hourly.  We had also turned on the Query Store, which provided another treasure trove of information, and we had at least two other monitoring tools in place.

Almost immediately, I warned the application team about a "Select *" beauty that was performing badly.  The trouble was that despite the "*", it had a pretty darn good execution plan.  That is, every time we caught a plan or tested the query, it was a very fast index seek.  It was a very simple parameterized query against a non-nullable column.  Still, every once in a blue moon this query showed up with insane I/O numbers, and every week it appeared twice among the top 10 hardest-working queries.
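The post doesn't show the actual query, but a hypothetical sketch of its shape (table, column, and parameter names invented for illustration) would be:

```sql
-- Hypothetical shape of the offending query: a parameterized "SELECT *"
-- filtered on a non-nullable column of the main customer table.
-- With a selective parameter value this is a fast index seek returning
-- one row; as it turned out, an empty string matched millions of rows.
SELECT *
FROM dbo.Customer
WHERE ImportantColumn = @ImportantValue;
```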

Despite all of the numbers, I couldn't understand how it was performing so badly at times.  The app team and other operational database support people were at a loss as well.  I kept thinking, "Maybe it reaches a threshold where it gets called too many times concurrently, and the '*' overwhelms the system with all of those columns."  I also thought about parameter sniffing, but I was fairly confident a bad plan wasn't sneaking in there.
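One way to back up that confidence (a sketch; the LIKE filter is a hypothetical placeholder for the real query text) is to ask the Query Store whether it has ever captured more than one plan for the suspect query:

```sql
-- If the Query Store has recorded multiple plans for the same query_id,
-- plan instability (e.g. parameter sniffing) becomes a plausible suspect.
-- Exactly one plan per query_id points the investigation elsewhere.
SELECT q.query_id,
       COUNT(p.plan_id)          AS plan_count,
       MAX(qt.query_sql_text)    AS sample_text
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id
JOIN sys.query_store_query_text AS qt
    ON qt.query_text_id = q.query_text_id
WHERE qt.query_sql_text LIKE N'%SELECT *%Customer%'  -- hypothetical filter
GROUP BY q.query_id
HAVING COUNT(p.plan_id) > 1;
```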

 

The Solution


Finally, an operational support DBA said, "The heck with it.  I'm running a Profiler trace all day, and hopefully I get lucky."  At least I imagine that's what he said to himself.  I thought it was a bit risky and wished he would have used an Extended Events session instead; but I'm glad he did it, because he did get lucky.  Yes, the system was a little slower for a few hours, but he found the culprit.  He traced only that same "Select *" query I had identified, and amidst thousands of calls he captured an instance in which the parameter was an empty string.  My first thought was, "Surely this non-nullable column wouldn't be a huge problem with an empty string.  Who would store an empty string in this fairly important column on the main customer table?"
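The lighter-weight Extended Events alternative might have looked something like this (a sketch; the session name, file name, and text filter are invented, and the filter would target the real query's text):

```sql
-- Hypothetical XE session: capture only completed RPC calls whose
-- statement matches the suspect query, writing results to a file target.
-- Far cheaper than a server-side Profiler trace running all day.
CREATE EVENT SESSION [CatchEmptyParam] ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.sql_text, sqlserver.client_app_name)
    WHERE sqlserver.like_i_sql_unicode_string(statement, N'%FROM dbo.Customer%')
)
ADD TARGET package0.event_file (SET filename = N'CatchEmptyParam.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

ALTER EVENT SESSION [CatchEmptyParam] ON SERVER STATE = START;
```

The captured statements, including the literal parameter values, can then be read back with `sys.fn_xe_file_target_read_file`.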

It turns out that of the 23 million customer rows, over 6 million had an empty string for the offending column.  So 99.9% of the time, the query returned 1 row.  A few times a week, however, it flooded the system with 26% of the customer data.

 

Closing Thoughts


While I wish the application team had investigated the query immediately, they were fairly swamped with troubleshooting other parts of the stack. Additionally, I could not prove that this query was the root cause.

Now I know.  Trust the numbers.  Even if it doesn't seem possible or likely that the problem lies within your domain, take a step back and consider more possibilities.  Encryption makes it impossible for me to investigate the data, but I wish I had thought of the possibility of an empty string.  Deep in the recesses of my dusty brain, I knew the "*" didn't explain the huge amounts of I/O and CPU consumption.  I shouldn't have waited for the app team to discover the problem.  A simple Group By/Count on that column might have turned the light bulb on.
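That Group By/Count sanity check is one line of SQL (table and column names invented for illustration):

```sql
-- Hypothetical data-skew check: how many rows share each value of the
-- filter column?  A single value covering 6 of 23 million rows would
-- have jumped straight to the top of this list.
SELECT ImportantColumn, COUNT(*) AS RowsPerValue
FROM dbo.Customer
GROUP BY ImportantColumn
ORDER BY COUNT(*) DESC;
```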


2 thoughts on “Trust Your SQL Performance Research”

  1. Great catch… This is another proof of my notion about unit tests: "Either do it right, or don't bother with it at all."
    This should raise a big concern for the customer about code quality and unit tests.
    An addition operation will always yield 4 for 2+2.  Your unit tests should test what happens when you try edge values, like int.MaxValue… or, as in this case, an empty string 🙂

