Tuesday, June 8, 2021

It works on my machine

One truth I had to come to terms with earlier in my career was the "it works on my machine" anti-pattern. It's a close cousin to the name of my blog - sometimes words DO mean something.

This was back in the days before Windows (yes there was such a thing) and I was writing a custom TSR (terminate and stay resident) module that we used to communicate with the custom database engine we wrote on NetWare (starting to get a sense of the decade yet?). Long story short there was a problem with the code that prevented it from working on one machine - just one. And it wasn't like it always didn't work, it only didn't work sporadically.

In my arrogance at the time, I was slow to take a look at this issue. My arrogant thinking went like (1) this is the only machine to have the issue (2) I spent alot of time on this code and especially testing it (3) this code has been running on hundreds of other machines without an issue (4) so it must be something particular to the machine or user. Fast forward a couple of weeks and surprise a couple more people report having the issue. It was then that a colleague and I were talking and I woke up from the spell - "this is something I need to look into" I finally told myself.

Turns out the issue was an interrupt conflict with a badly behaved driver that was slowly be rolled out in a pilot and very soon would released across the entire org and would impact my, now thousands, of customers. Once I figured this out (which was actually pretty hard to do) - the fix was simple; I just switched to a different interrupt. There were other times when I fell under the spell of my own perfume, each time learning something and making me more humble.

Computers and software have grown much more complex since then. developers have much less control over all the code they are dependent upon. In the example above, I was dependent on the CPU microcode, the BIOS and MSDOS - all of which I had the source code for. I could usually discount the microcode (I only recycled the manual last year) which meant that any issue was either with my code or the few dependencies I had. The lack of dependencies and illusion of control was part of where my arrogance came from. But today there are dependencies on dependencies; it's such a complex ecosystem. And yet developers still have quite a bit of arrogance. 

Why am I writing about this now? This time I was the one user for which the application did not work - last year (mid-2020) it started  and I would constantly send logs and traces to the development team and nothing would happen. I told them that my fear is that my experience was the "canary in the coal mine" and that either (1) other people are also having the issue and not reporting it OR (2) that at some time in the not too distant future it would become more urgent. 

The later was this week.

It now works on my machine.