I’m sitting here writing my next article for TechNet Magazine – this time on Tracking Changes Using SQL Server 2008 – and was fooling around on the web when I came across a page I hadn’t seen for a long time – http://www.tardis.ed.ac.uk/. It’s a student-run computer system at the University of Edinburgh where I studied. I resurrected the system onto a few old Sun 3/60 workstations (before they switched over to SPARC processors) during my final year in 93-94 and the amount of time I spent building, maintaining, and managing that system (along with writing 3-d graphics programs – my favorite was a spinning cube with a GIF projected onto each side) contributed to me not getting the First Class honours degree I was hoping for – never was much one for studying :-)

Anyway, sitting here steeped in nostalgia, I realized what day it is – August 2nd. Fourteen years ago today was my first day of work after graduating – working for Digital Equipment Corp. looking after the VMS (well, OpenVMS by that time) kernel file system (F11BXQP) and the CHKDSK equivalent, ANALYZE/DISK. I also spent 6 months of 1995 seconded to the AltaVista team, helping build what was to become the first web search engine – www.altavista.digital.com – not hyperlinked because it no longer exists. I was one of the very few ex-DECcies in the SQL team who hadn’t worked on RDB before joining Microsoft.

So, apart from indulging myself, I have a war story to tell – my favorite (almost as good as the one where we were debugging a VMS crash dump over the phone with a sysadmin at a US government facility and every so often we’d ask “what’s contained in these memory locations?” and get the answer “I can’t tell you that”…). We were getting reports of the VMS filesystem crashing (doing a ‘bugcheck’, basically the equivalent of a blue-screen-of-death). Analyzing the kernel dumps, I found something weird. F11BXQP was written to be horribly paranoid – sometimes it would calculate a value and then check it in the very next statement. Looking at the Alpha machine code that was being generated (it’s tedious to work out what’s going on with multi-issue instruction pipelines), I saw that there was no way for the register value (I remember it was R19 on the Alpha) to change between being set and then checked again. What the hell?

The same thing was happening at 4 or 5 big customers around the world, so I was skeptical about it being a hardware issue (although our group did once find an issue with an early Alpha chip where the clock signals weren’t propagating correctly through the chip). The only thing I could think of was that an interrupt completion routine for an interrupt with a higher IPL (Interrupt Priority Level) than the one F11BXQP ran at was pre-empting it, using the register, and not restoring it afterwards, even though F11BXQP had declared that it was using that register. So how to catch it? I spent a week writing a routine in Alpha assembly language (not fun) that would force F11BXQP to use a different register, poison the R19 register, and then periodically check it to see if it had changed. If it had, the routine would capture the current stack and then bugcheck. I wish I had a printout of that code still!
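I don’t have the original assembly any more, but the idea is easy to sketch. What follows is only a toy simulation in Python, not the real patch: the register file is a dictionary, the "interrupt handlers" are plain functions, and every name in it (POISON, ClobberDetected, check_poison, buggy_handler and so on) is made up for illustration. It also checks after every handler, whereas the real routine could only check periodically.

```python
import traceback

# Hypothetical sentinel: any value that real code is unlikely to produce.
POISON = 0xDEADBEEFDEADBEEF


class ClobberDetected(Exception):
    """Stands in for the bugcheck: records what changed and the stack at detection."""

    def __init__(self, reg, value, stack):
        super().__init__(f"{reg} changed to {value:#x}")
        self.reg = reg
        self.value = value
        self.stack = stack


def poison(regs, reg):
    """Park the sentinel in a register that nobody should be touching."""
    regs[reg] = POISON


def check_poison(regs, reg):
    """If the sentinel is gone, capture the current stack and 'bugcheck'."""
    if regs[reg] != POISON:
        raise ClobberDetected(reg, regs[reg], traceback.format_stack())


def well_behaved_handler(regs):
    """Uses R19 as scratch space but saves and restores it."""
    saved = regs["R19"]
    regs["R19"] = 0x1234
    regs["R19"] = saved


def buggy_handler(regs):
    """Uses R19 as scratch space and never restores it."""
    regs["R19"] = 0x1234


def run(handlers):
    """Simulate pre-emption: run each handler, then check the poisoned register."""
    regs = {f"R{i}": 0 for i in range(32)}
    poison(regs, "R19")
    for handler in handlers:
        handler(regs)
        check_poison(regs, "R19")
    return regs
```

Calling run([well_behaved_handler]) returns normally with the sentinel still sitting in R19, while run([well_behaved_handler, buggy_handler]) raises ClobberDetected as soon as the check after the buggy handler runs. Because the check happens after the fact, the captured stack shows where the damage was noticed rather than where it was done, which is why it took several dumps to point at the culprit.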

Only one customer agreed to run the patched version of the file system in production, but within 24 hours it had bugchecked and dumped several times. The problem was a UCX (network) driver somewhere else in VMS that wasn’t preserving the R19 register. Bingo!

Ahh – those were the days…