Wednesday, August 7, 2024

Surprisingly good project estimation

At Amazon, we went through several cycles/iterations of planning. Regardless of how Agile an organization operated, we needed to map capacity to the roadmap to understand how much we could deliver in a given year. This does not mean we actually delivered exactly what we initially planned since requirements and priorities changed over the course of the year. But it did allow us to have a starting place that we used to have a shared vision across engineering and stakeholders.

For each iteration, I would strive for better accuracy by asking technical leaders to revise their estimates. Effort was the input we refined the most, asking teams to go from swags to more educated estimates. How much were they relying on existing services? How much had never been done before? I had a list of prompts to help them assess the complexity/unknowns and, therefore, the risk. I would take all that and put it into my model (i.e., Excel), which would auto-magically create a calendar plan.

Invariably our leaders always requested when we would be "done," typically in the context of something like Prime Day or re:Invent. To answer this question, I would use a scaled-down version of the organization planning spreadsheet. Typically, the project was underway, and varying levels of design were complete. Most of the time this meant that the effort estimates were better, and the number of resources available was known. For each estimate, I would ask managers to provide me with the factor they used to compensate for PTO, meetings, etc. - I called it non-keyboard time.

Anecdotally, the static model usually yielded a surprisingly accurate date. I would use it to drive discussions around resource and/or feature tradeoffs depending on which knob we were using to tune the outputs (typically the launch date).

Such a simple model gave many people heartburn—"How could something so simple model something so complicated?" I tried "enhancing" the model to account for other factors, for instance, parallelization. However, it was too complex to model and required too much from engineering managers or technical leads. Parallelization is needed when building a plan but doesn't work when doing estimation—it simply didn't improve the model. 

The problem was that many leaders would simply not believe the model. I found this very frustrating since they would usually choose a somewhat arbitrary date as to when they wanted to be done and tell me to use that. I would challenge that approach and eventually agree since it was obvious that they were going to get their way (an anti-pattern of Amazon's Have Backbone, Disagree, and Commit). Side note, one time, I escalated a case of this, and while the VP I escalated to was appreciative, the Director threw a tantrum in the meeting. So much so that the VP later contacted me to apologize for the Director's behavior.

In project retrospectives, we discussed how we course-corrected throughout the project. It turns out that the early model outputs were right—scary right?

I have been around long enough to know that when bringing in the date, leaders are often pressuring the teams to keep their foot on the gas. Something to watch out for is that this practice can lead to the outcomes I describe in my "Secure Apps Take A Village" post. Some of this behavior comes down to a couple of challenges leaders (including me) have to grapple with - trust and giving up control. Trust (and verify) what my teams are telling me, and accept that I don't have the same level of control I used to. Both are probably topics in and of themselves.

Things that I ran into...

  1. Inflated efforts. Managers and leads pad their effort estimates to ensure they meet commitments. First, how do you know if it's inflated? Who better to know the complexity of a system than the people closest to it? If managers are really doing this, then do they understand what is being asked of them? How will it be used? If managers are padding, they are responding to a fear typically rooted in the organization's culture. This is something to go deeper into to figure out how to make it work for you.
  2. Low efficacy factors. Given that keyboard time is an input into the model, this will directly impact the date/resource calculation. Is a low number accounting for a less experienced team? Or is the manager using the same number every time? Why—to pad dates? I used a generously low number, which was painful to write down but proved to be more accurate. For instance, this was especially true for teams that had high operational overhead.
  3. What is DONE? Of course, done is something that not everyone defines the same way. As the owner of a launch, done to me meant the system was launched and open to customers. Invariably some folks would define done as some level of completeness for their code. Ensure that everyone is aligned on what done means. I would typically use code complete, including all automated tests and the binaries, in a shared test environment. I used this as done since the cost of integration testing and other launch activities was typically shared across teams, and I could estimate those differently.


No comments:

Post a Comment

A Brief History of Time - AI Style