Toyota and Database nothing in common, until NOW!

Turn the key, hear the engine catch go to put it in drive and... shit Darn it golly gee a stall. This is NOT just for cars anymore. One of the biggest performance issue indicators I have ever used in practice to check for DB issues has been an IO Stall test metric. In days of old, you know when 16 bit gaming consoles made you the man if you had one, and Mortal Kombat was THE GAME to play.. (still is now by the way if anyone is on PSN and wants to take a beating ProphetDE come find me) I had no monitoring software. So I used the script below and dumped the results to a text file. Now that tools have come so far (As have video game consoles) I am using Red Gate's SQL Monitor. The new version gives you the ability to add custom metrics. Below I will explain IO Stalls, what they mean, and how to track them.

What causes an IO stall?
In our case there are 3 main causes of IO stalls.

1. A poorly performing disk - Now a days this is almost ALWAYS caused by a misconfiguration in the SAN. I now understand why SAN Admins get paid as much as they do.

2. Poorly defined / written queries. I don't just mean the obvious Oh Johnny used a cursor to update 22.7 million rows one at a time because he thinks set based operations are the antichrist. I mean all issues some silly some not so much.

3. Overloaded disks. You may or may not hear the term "data bursts" often. I will not go into the explanation. What I will say is there is a wonderful all inclusive article on disks, how they work, how they don't work and data bursts, which can be found here: http://www.symantec.com/connect/articles/getting-hang-iops

Needless to say just like you do not want your Toyota to stall, you do not want your database to stall either. Troubleshooting database stalls, and disk contention issues can be for lack of a better term a real pain in the proverbial arse. The script will essentially check the average time of reads and writes in MS / number of total reads and writes. Ideally you never want to exceed 100MS. This is just a general number with larger systems you may and that could be fine. The 100MS baseline is simply the guideline i use.

[code]
SELECT CAST(SUM(io_stall_read_ms + io_stall_write_ms) / SUM(1.0 + num_of_reads + num_of_writes) AS NUMERIC(10,1)) AS [avg_stall]
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)
WHERE FILE_ID = 2;
[/code]

Now the all important question: "John that metric just returned 12,948,102MS what the hell do I do"
Well grasshopper its not that simple... Let me rephrase it COULD BE that simple, but may not be. If you are using a virtual environment, the first thing I would recommend trying would be to allocate a few more processors and memory to the cause. A good start would be to double what you currently have. This may also be practical for your physical environment and if it is bravo!!

Secondly (or if adding hardware is a no go) I would recommend getting very familiar with SQL profiler. Go through and run those overly complex stored procedures you inherited from Bob who is now working at Kinko's because he decided it would be a good idea to write them. "What do you mean writing dynamic SQL using a cursor to update 22.7M rows one at a time for only one field is a bad idea?!" Try taking just one of those and re-writing er... re-factoring it. Then run again, you will be surprised how often I hear "Wow all i did was re-write one sproc and I am getting a 22% performance boost.

Finally I would recommend a SAN Admin. If you do NOT have one, and your company does not want to staff a proper one fine. I would then say get one to re-configure the SAN and document the new subsystem for you. If your infrastructure is to large and contracting someone, or hiring someone perm is a no go, I would make a strong case to management why they need to fork over the 4 - 6K and get someone trained in the area.

Trust me I didn't think it was a necessary evil... it IS. Use that script and keep the above in mind, and I can assure you your Toyota will be purring like the day you got her off the show room floor and took her for that first test drive.

encodeThis - a quest in SQL or sometimes SQ-Hell

Pages

Thursday, May 9, 2013

Toyota and Database nothing in common, until NOW!

No comments:

Post a Comment