A week ago, I received my first SQL Server error in my email. Or rather, I received five of them back to back within twenty-one seconds. What to do?
Don’t panic and breathe normally. The first bit is so that one remains level headed. Someone who is panicking will not make rational decisions and that’s one of the worst things that can happen. The second is so one doesn’t pass out, because unconsciousness isn’t helpful either.
First, I made sure that the application was still accessible and functioning. It was and everything seemed fine superficially. Diving into SSMS, I opened the logs to see what happened. It wasn’t difficult to identify where and when the issue took place, but the why was elusive. I received two Message 824 CHECKSUM errors. After that, I had two 3314s, which allude to undoing transactions, and a single 9001, which indicates a missing log file. This is all information that I gleaned from Books Online and it all sounded rather serious.
Before moving forward, I sent a nice email to the users stating that an error occurred on the server, and that while they can use the system, I might have to restore a backup and their work would be overwritten. I also assured them that I’d work as quickly as I could and stated that I’d email a second time when it was finished.
I had begun to hypothesize what had happened. After the first two 824 errors, SQL Server looked as though it had shut down itself. I thought that perhaps the other three errors arose because SQL Server was shutting down during the middle of a scheduled job. Once SQL Server restarted, it rolled back the bad transactions. During an email exchange with Allen Kinsel (Blog | @sqlinsaneo) after the fact, it turned out that my original hypothesis wasn’t too far from the mark.
BOL stated that Message 824 could be indicative of disk problems. Not as bad as an 823, but the disk could be on its way to failure. I ran a Check Disk on the drive. The Check Disk worked to about 90% completion before returning, “Windows was unable to complete the disk check.” To add complexity to the situation, the drive is also virtual. I thought that perhaps Disk Check wouldn’t work in the same way on a virtual drive, so I tested all the others on the machine and they passed. Not satisfied, I logged into the host and tested the physical drives as well. They all passed.
A bad sector on a virtual drive? Virtual corruption? I tossed my findings over to the Senior Engineer who told me that he’d look into it. Back to troubleshooting. Unfortunately, I had already exhausted my initial list of never-having-done-this-before options, and so what’s a guy to do? Google. And thankfully, Microsoft already had a troubleshooting list readily available.
- I find the two pages referenced in the 824 messages inside the suspect_pages table in MSDB.
- I run a DBCC CHECKDB on the database and it returns, “CHECKDB found 0 allocation errors and 0 consistency errors in database.” This is good, but confusing since the suspect_pages table had not been updated to show that the inconsistencies had been corrected.
- I look into the database options and find that PAGE_VERIFY had already been set to CHECKSUM.
- Checking the logs again, there is nothing listed that relates to the disks, storage, or controllers.
- The next step is to test the IO capacity using SQLIOSim. This is somewhat exciting as I have never had the opportunity to use the program before. Knowing that this could be a job ending experience if I really screw it up, I try it on the test server first. I heed the warnings to not burn my data and log files, and everything runs successfully. Moving over to production, I repeat the process, and despite finding myself a little disgusted with the results, it returned no fatal errors.
- The final troubleshooting points revolve around the physical hardware. It’s a bit problematic. I don’t know the server components, and the issue seemed to be virtual. I stop using the list and take a step back to re-evaluate.
I decide that I need to know more about Message 824. Downloading IO Basics Chapter 2, I begin to read the relevant portions. It says that the CHECKSUM error I received would require a restore to correct. While I know how to restore a database, I’m not entirely sure how to check for CHECKSUM errors. I mean, if my backup is corrupt, how am I to know?
I came across Paul Randal’s blog, and if you’ve been in the community for any amount of time, you’ll know that Paul Randal is a man that’s coo-coo for CHECKDB.
I learn about BACKUP WITH CHECKSUM. When enabled, a backup will verify that the CHECKSUMs are correct for all the pages within the backup, or it will fail automatically. This does rely on the fact that PAGE_VERIFY has been set to CHECKSUM, as I tested earlier. I learned that WITH CHECKSUM also applies to RESTORE VERIFYONLY as well. I’m golden. I’ve found the way to tell if my backup is good or as good as toast.
If I’ve tricked you into a false sense of security, then I’ve done my job as a writer correctly. The only way to VERIFYONLY WITH CHECKSUM is by taking a BACKUP WITH CHECKSUM. If you haven’t, then it will simply fail. Guess who hadn’t been taking backups in that way? Still, I take a new BACKUP WITH CHECKSUM. It completes without error. I VERIFYONLY WITH CHECKSUM. The backup is good.
I continue to run DBCC CHECKDB WITH NO_INFOMSGS, ALL_ERRORMSGS throughout the day. It comes back clean each time. But I refuse to settle the matter. Things seem fine but I can’t let this thing beat me: if I don’t figure out what happened, then I’ve lost this battle. I realize that I haven’t heard back from the Senior Engineer and that it’s been several hours. I go in for a drive by of his cube.
“So, did you ever find anything?”
“Nope. I ran some tests, but nothing came up.”
“Ah. I ran a Check Disk on the R drive and it wouldn’t complete.”
He quickly does the same. It doesn’t complete.
“Hm. Let me try some other tests.”
I walk back to my cube feeling a little smug. No sooner do I get back to my computer, I have a message waiting for me on Skype, “I am shutting down the server to take care of the disk issue. Is this fine?”
Here is another moment where I failed miserably. I should have said no. I should have said, “Sure, but let me come over and watch you do what you do for my own education.” What did I say? “Sure.” I was too fried from seven hours of troubleshooting. If this fixed the bad sector of disk, then I’d be fine with it.
Ten minutes later, the server had been booted back up, the application was running, and the Check Disk comes in clean. Another CHECKDB also comes in clean. While the disk had been fixed and there was no corruption identified in the database, I didn’t feel victorious. I still hadn’t figured out why it had happened in the first place. It bothered me. But users can’t wait forever. I sent out an email saying that the problem had been fixed, but if they notice anything strange to let me know about it.
I turned to the last place I could think to find out the answers to my questions: #sqlhelp. A lot of great advice was offered, and Allen Kinsel worked with me the next day to read through the error logs. It was at that time that I learned that SQL Server will crash itself. I learned that there is no such thing as a magical CHECKDB that runs after a crash. (I mean, come on, it’s not like it’s an iPad…) And I learned that after a Message 824, once you’ve verified that there is no corruption or fixed any that has occurred, immediately kick it to your Engineering Team to fix the damn disks. Don’t even stop to think about it because I can tell you that seven hours of troubleshooting doesn’t impress anyone.
Lessons Learned
- While emergencies and error messages can cause a great deal of stress, they are also prime learning opportunities. Do the research and don’t immediately run to #sqlhelp if you can help it. Write down what you do as you troubleshoot, write down what you’re thinking and your reasoning. It will help you later when you’ve come down from the adrenaline high and you need to re-analyze what happened.
- Take every opportunity to learn regardless of how burnt you’re feeling. I didn’t and now I don’t know what was done to correct the disk issue. I can take an educated guess, but an educated guess isn’t a solid answer.
- After receiving a Message 824, run DBCC CHECKDB WITH NO_INFOMSGS, ALL_ERRORMSGS until it comes back clean. Try to fix any corruption without data loss. Once you’re good, farm the disk stuff to your Engineering Team.
- I can’t trust maintenance plans to get the job done right, not even ones that I’ve set up. I need to write my own custom jobs to ensure that the backups are as good and valid as I can make them.
Pingback: October’s T-SQL Tuesday: The Magical CHECKDB | Matt Velic