Monday, November 29, 2010

It’s Always Cold in a Room Full of Ice

(This story is true. Well, mostly...)

[Image: Ice hockey, McGill University, 1901]

One of the social benefits of being a hockey parent is being able to commiserate with other hockey parents about the shared experiences of early mornings, long drives, and ice rinks that are always too cold. At a recent game - at 7:00 AM on a Saturday when the temperature inside the rink was a balmy 34F - I happened to ask a fellow hockey father about his software company. His situation, and my suggested actions for him, form the basis of this post. I'm hoping that this post will help him and others.

(I'm guessing that anyone involved in software testing will have encountered one or more of the types of problems that he described - but hopefully not all at the same time! Note that all the names have been changed from the humans involved to the names of household pets.)

The Horror

When I asked Walter how his company was doing, his response sounded like a cross between a hi-tech project plan, James Joyce, and Conrad's "Heart of Darkness."

He started by saying, "The horror, man, the horror of this situation is freaking insane."

He then described how, through an acquisition by his company, he had inherited a QE team. Walter had several years of project management experience, and while he had worked with QE teams and QE managers, he had never before had direct control over a QE team. To call the QE team "troubled" was something of an understatement. In only his first few days of being responsible for the team, he learned the following:
  • The team was geographically dispersed. Walter was based in Boston and the rest of the team was spread out between Baltimore, Boise, Belarus, and Bangalore. (The team had been nicknamed the "killer B's.")
  • Most of the team members had never met each other. The QE manager (her name was Celia) was based in Boise and had "issues" with either getting up early or staying up late to accommodate other team members' time zones. As a result, the team seldom had group meetings, and the manager rarely spoke by telephone with individual team members in one-on-one conversations.
  • There was no mapping of tests and test coverage to product features.
  • A central/shared bug tracking system was used, but no organized bug triage reviews were ever held. Hundreds of old unresolved bugs or bugs that were likely no longer relevant were cluttering the bug database.
  • Many tests were automated, but there was no central/shared repository for all the tests. Individual team members often kept the automated tests in their personal home directories.
  • As the automated tests were written in multiple different languages, there was also no central/shared automated test framework. Some team members used JUnit or similar frameworks, while others built their own "home grown" test frameworks.
  • Many of the automated tests were suspect due to a lack of maintenance. Individual engineers maintained personal lists of tests that were "known to fail."
  • The level of confidence in the QE team on the part of the other project teams was low. One product manager had recently asked Walter if testing was “strictly necessary, after all we do code reviews.”
  • Test plans that defined a test strategy or that provided traceability of product features and operational requirements to actual test coverage were non-existent. At the start of a project release, the team manager would send emails to team members instructing them as to which tests to develop. In the absence of these emails, individual team members often created whatever tests they felt were necessary. 
  • There was no central/shared repository of archived test results. In short, Walter knew that he and the team were in bad shape, but because of the general lack of "institutional memory" (e.g., test plans, test results), he didn't know in any detail just how bad things were.
Yes, It Can Always Get Worse

Years ago, before they broke the "Curse of the Bambino" (http://en.wikipedia.org/wiki/Curse_of_the_Bambino), when I commented that the prospects for the Boston Red Sox could not get any worse, an older friend of mine replied that "things can always get worse." This can be true of software too.

A few days after Walter first described his situation to me, he said, "Would you believe it? My QE manager just walked out!"

At this point, I thought that it was worthwhile to try to give him some advice. Besides, I had been looking for a good topic for my software testing blog for weeks.  ;-)

The Wine is Bad, Throw it Out!

I suggested that he look at the hasty exit of his QE manager as a positive development. The team had gotten itself into such a deep hole that only some drastic changes could improve the situation.

There’s a really great line from a really bad old movie, “The Agony and the Ecstasy.” In the film, Michelangelo (played by Charlton Heston. And if that seems like strange casting, how about Rex Harrison as the Pope?) is disgusted by his first attempt to paint the ceiling of the Sistine Chapel. When he witnesses his local bartender throwing away wine that has gone bad, he is inspired to throw away his work and start again.

Walter’s initial reaction to his team’s problems was to try to patch things together. When I asked him if the team had ever been effective or even functional, he replied, “No way man. All I keep hearing is that the team has always been a disaster.” I suggested that instead of patching things together to get back to where they were before the manager left, he would be better off trying to address and resolve some of the team's problems. Whatever they had been doing was simply not working. He was pressed for time, as his product release schedule was tight. But I suggested that investing only a few days in restructuring things could yield some immediate results.

While he thought he was dealing with an infinite number of problems, I suggested that he classify the problems, and the solutions, into categories:
  • People
  • Tools and Assets
  • Processes
The People

My diagnosis for his team’s primary problem was: fragmentation. In fact, the team was suffering from two distinct forms of fragmentation. 

The first form was the fragmentation of their work effort. 
  • Regional Fragmentation - The team’s tasks were divided by geographic region. The stress testing was performed by the Boise team, while the UI testing was performed in Bangalore, and the integration testing was performed in Baltimore. As a result, it was common for team members, when asked about specific tests, to just say, “oh, the other guys do that work - those tests are theirs.” This made collecting test status information difficult if the team in one location had gone home for the day. I suggested to Walter that the work be divided such that each location would have some people able to run any type of test that the team built. In other words, spread the work tasks across the timezones and divide the team by tasks, not by geography.
  • What’s Your Role in this Organization? What’s My Role? - In listening to Walter describe the manner in which the recently departed manager led the team, I imagined that the general lack of information, planning, and leadership would have caused many team members to be uncertain of their role within the team. I suggested that Walter not define their roles in terms of an org-chart, but rather define everyone’s role in terms of their dependencies and their deliverables. The functioning of a large, dispersed team can be thought of as being analogous to integrating software modules together. What you need is not just an org-chart, but also a dependency diagram. The inter-dependencies between the team members are in effect a social contract that binds the team together.
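
A dependency diagram like this can be kept as simple data and checked mechanically. The following is only a minimal sketch in Python, not anything Walter's team actually used: the role names, the deliverables, and the unmet_dependencies helper are all hypothetical, invented for illustration. It lists every input that some role needs but that no role has agreed to deliver.

```python
# Hypothetical roles: what each one delivers and what it needs from others.
# None of these names come from Walter's team; they only illustrate the idea.
ROLES = {
    "product_management": {"delivers": {"feature_list"}, "needs": set()},
    "development": {"delivers": {"nightly_build"}, "needs": {"feature_list"}},
    "test_lead": {"delivers": {"test_plan"}, "needs": {"feature_list"}},
    "stress_testers": {"delivers": {"stress_results"},
                       "needs": {"test_plan", "nightly_build"}},
    "ui_testers": {"delivers": {"ui_results"},
                   "needs": {"test_plan", "nightly_build"}},
    "release_manager": {"delivers": {"release_decision"},
                        "needs": {"stress_results", "ui_results",
                                  "integration_results"}},
}


def unmet_dependencies(roles):
    """Return {role: [needed items that no role delivers]} for roles with gaps."""
    produced = set().union(*(role["delivers"] for role in roles.values()))
    gaps = {}
    for name, role in roles.items():
        missing = sorted(role["needs"] - produced)
        if missing:
            gaps[name] = missing
    return gaps


if __name__ == "__main__":
    for role, missing in unmet_dependencies(ROLES).items():
        print(f"{role} is waiting on: {', '.join(missing)}")
```

With this sample data, the only gap reported is that nobody has signed up to deliver the integration test results. That is the kind of hole in the "social contract" that an org-chart alone will not show.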
The second form was the fragmentation of their work “community.” I think it was Robert Kennedy who lamented that divisions in society meant that “we share a city, but not a community.” In the case of Walter’s team, their geographic divisions made it difficult for the team to function as a team. Walter could not move everyone to the same physical location, but I did have some suggestions to get the team members working together better:
  • They Aren’t Remote, You Are - It’s only a matter of words, but I suggested that Walter stop referring to the team members who were not in his physical location as “remote.” The fact was that, purely in terms of numbers, the larger concentrations of team members were in locations other than Walter’s. If anyone was “remote,” it was Walter. The goal of this change in terminology would be to help team members feel that they were full partners in the team, and not second-class team members.
  • Is it Time for the Meeting? What Time is it Exactly? - The practice of having the same team members be inconvenienced by the time selected for team meetings had to stop. Walter had to set up a system where the times for the meetings would rotate so everyone would share the burden of getting up early, or staying up late. And, speaking of time, I suggested that Walter always refer to the time of day for meetings or other events in UTC time, and not in his local timezone. By using UTC time, everyone on the team could share a common “virtual time zone.”
  • Daily Contact - Question: How do you communicate with people? Answer: You talk to them. Having regular weekly team meetings would help the team communicate, but I suggested that it was important for Walter to establish, and then work hard to maintain, daily contacts, whether on the telephone, or via on-line chat. Email, I suggested, would not be a substitute for some informal, and at least daily contact.
  • What’s a “Day?” - Finally, I suggested to Walter that he re-think what a “day” is. The reality of his geographically dispersed team was that, at almost any given time of day, some members of his team would be working. There really was no “end of the day.” I suggested to Walter that he not look at this as a handicap, but as an advantage. The analogy that I used was that of a ship at sea. While the ship has to keep moving 24 hours a day, the sailors don’t stay awake for 24 hours. They work in shifts. His team could do the same thing if, as we discussed a few minutes ago, their work was partitioned so that people in different locations worked together on the same tasks. 
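
The rotating meeting time and the shared "virtual time zone" are easy to sketch in code. This is a minimal Python illustration under stated assumptions, not a scheduling tool: the three UTC start hours are hypothetical, and the site offsets are fixed, approximate standard-time values that ignore daylight saving time (a real schedule would use a proper time zone database). The function names are also invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical UTC start hours to rotate through, one per week.
ROTATING_UTC_HOURS = [13, 21, 5]

# Assumed fixed UTC offsets in hours (approximate, daylight saving ignored).
SITE_UTC_OFFSETS = {
    "Boston": -5,
    "Baltimore": -5,
    "Boise": -7,
    "Belarus": 3,
    "Bangalore": 5.5,
}


def meeting_time_utc(first_meeting_day, week_index):
    """Return the UTC start of the meeting for a given week, rotating the hour."""
    hour = ROTATING_UTC_HOURS[week_index % len(ROTATING_UTC_HOURS)]
    day = first_meeting_day + timedelta(weeks=week_index)
    return day.replace(hour=hour, minute=0, second=0, microsecond=0,
                       tzinfo=timezone.utc)


def local_times(meeting_utc):
    """Translate one UTC meeting time into each site's local wall-clock time."""
    return {
        site: meeting_utc.astimezone(
            timezone(timedelta(hours=offset))).strftime("%a %H:%M")
        for site, offset in SITE_UTC_OFFSETS.items()
    }


if __name__ == "__main__":
    start = datetime(2010, 12, 1)  # a Wednesday, chosen arbitrarily
    for week in range(3):
        when = meeting_time_utc(start, week)
        print(when.strftime("%Y-%m-%d %H:%M UTC"), local_times(when))
```

The meeting is always announced in UTC, and each site converts it locally. With these sample hours, the early-morning or late-night slot moves from site to site over the three-week cycle instead of always landing on the same people.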
The Assets

When I asked Walter about the tests that the team created, he described them by saying, “Man, they have thousands of tests. But lots of them fail, others are disabled, and nothing is described in any test plan documents. I have no idea if we have enough tests, too many, or nowhere near enough tests. And, what’s worse, no one can explain to me just what the tests actually do.”

Walter then described how he wanted to review every test, and build up documentation on the operation of each test, so that he could get at least a general idea of the feature coverage provided by the tests. He also mentioned that the project did not have any detailed design specifications, or requirement definition documents. I suggested a slightly different approach:
  • Define the Product’s Test Needs First - I suggested to him that before he started wading through all the tests, he first needed to create a functional definition of the product under test. Before he started to measure test coverage, he needed to create the “yardstick” against which he could measure the coverage. This “functional decomposition” of the product would define the functions performed by the product and the user requirements that the product was intended to fulfill. AND, I suggested strongly to him that in order to make this product functional decomposition useful, it would have to include the relative priorities of each product function, and the perceived risk of bugs being found in testing each feature. My reasoning was that since it is never possible to test the infinite number of possible functions, configurations, sequences of user actions, etc., it’s important to concentrate your always finite test resources on those product functions that have the highest priority and are most at risk.
In Walter’s case, since he was starting with a blank slate for a product test coverage definition, the place he had to start was with these high priority functions, configurations, integrations, use cases, etc. Once he had this definition in hand, he could then begin the review of the tests, and leverage whatever partial information individual team members had and map that information into a test coverage matrix. And then, he could expand on that definition to include features of lesser priorities until his test coverage matrix was complete.

This matrix would be the beginning of building his team’s “institutional memory.”  In a geographically dispersed team, where it would frequently be difficult to contact people in real-time, it was vital that every team member have access to persistent information on the tests’ goals and coverage and the test strategy, including the defined priorities for tests. The documents would have to be able to stand on their own.
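
To show what such a matrix might look like in practice, here is a minimal Python sketch. It is an illustration only: the feature names, the 1-to-5 scales, and the rule of ranking by priority times risk are all my assumptions for the example, not a method Walter's team used. Each feature records its priority, its perceived risk, and the tests mapped to it, and the helper lists the features that have no tests yet, most urgent first.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    name: str
    priority: int  # 1 = nice to have ... 5 = critical (assumed scale)
    risk: int      # 1 = unlikely to break ... 5 = very likely (assumed scale)
    tests: List[str] = field(default_factory=list)

    @property
    def score(self) -> int:
        # Assumed ranking rule: priority multiplied by risk.
        return self.priority * self.risk


def coverage_gaps(features: List[Feature]) -> List[Feature]:
    """Return features with no mapped tests, highest score first."""
    gaps = [f for f in features if not f.tests]
    return sorted(gaps, key=lambda f: f.score, reverse=True)


# Hypothetical matrix entries, invented for illustration.
MATRIX = [
    Feature("install and upgrade", priority=5, risk=4,
            tests=["test_fresh_install"]),
    Feature("user login", priority=5, risk=3),
    Feature("report export", priority=2, risk=2),
    Feature("data import", priority=4, risk=5),
]

if __name__ == "__main__":
    for feature in coverage_gaps(MATRIX):
        print(f"{feature.score:>2}  {feature.name}")
```

Starting with the highest-scoring gaps matches the advice above: fill in the high priority, high risk rows first, then extend the matrix to features of lesser priority.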
  • Plan is Also a Verb - The next step would be the creation of test plans. When I asked Walter about the team’s test plans, he told me, “Man, they are terrified of writing formal plans. All they do is tell me that they don’t have time to create huge documents.”
My reply was, “Then, start by making the plans small. The goal of writing a test plan is not just a document. The act of writing the document forces people to think and review their analysis of the product’s risks.” I then suggested that Walter try a light-weight template for test plans. And, luckily, I had one handy - http://swqetesting.blogspot.com/2007/12/when-less-may-be-more-lighter-weight.html

I suggested to Walter that he introduce the team to the value of test plans by explaining to them that the plan’s real goal would be to get the team to be able to answer a series of questions. And that it was easier to ask themselves these questions before other people did! The answer to each question would take the form of a section of the plan. Some of these questions would be:
  • Introduction - what are we doing?
  • Test Strategy - how are we doing it?
  • Test Priorities - what's most important?
  • Scope - what's being tested?
  • Scope - what's beyond the scope of testing?
  • Test Pass/Fail Criteria - how do we know that it's good or bad?
  • Test Deliverables - what are the docs and tests that we'll build?
  • Test Cases - what does each test do?
  • Responsibilities - who's doing what?
  • Schedule/Milestones - when are we doing it?
  • Risks and Contingencies - what might go wrong and how will we handle it?
  • Approvals - do we agree?
  • References - where are the background docs?
  • Revision history - why did the plan change and how?
The answers to these questions would be important to other teams on the project, to inform them of the test coverage planned, and to be used as a vehicle for them to provide input and suggestions and criticisms to the team. 

The Processes

Finally, we talked about the tests themselves and how they were run. I had two suggestions for Walter:
  • Open Things Up - One of the problems that Walter’s team had was a lack of information sharing. Sometimes by design and sometimes by accident, team members kept important information from other team members. I suggested that Walter start to run the team as if it were an open source software project. In his essay “The Cathedral and the Bazaar” (http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/), Eric Raymond defined Linus’ Law (named for Linus Torvalds) as "given enough eyeballs, all bugs are shallow." This is one of the great strengths of open source software; as the bugs are not hidden, they can be identified and resolved. For Walter’s team, the test plans, test results (logs, reports, etc.), and especially the tests themselves should be kept in a shared repository, where all the team members, and the people working on the Development, Management, and Documentation teams, could access and review them. The reviews of the tests could be the beginning of the process of getting the tests running on a single test framework. And, if more people on the team were aware of the automated tests’ design, then maybe someone other than the original author would be able to maintain them - AND - maybe start to clean up that backlog of old bugs too!
  • You’re Not Finished - You’re Just Starting - Finally, I suggested to Walter that even if he were able to improve the effectiveness of his team with any of my suggestions, his work was only just beginning. What’s the most important process for a software test team to adopt? To me, it’s continuous improvement. (http://swqetesting.blogspot.com/2009/12/choosing-kaizen-over-cheez-whiz.html) Software will always include bugs, and some bugs will be missed in testing. You have to analyze your mistakes in an honest manner, and constantly refine your processes to incorporate corrections to past mistakes AND adapt your processes to meet new situations. But, in order to be able to do this, you first have to document your plans and your results so that you can review them at a later date.
Closing Thoughts

Well Walter, I hope these ideas can help you dig your team out of that hole. In thinking about your situation, the word that kept coming to mind was “fragmentation.” Your team was fragmented, the tools and data stores they built and relied on were fragmented, and their working environment was fragmented. Whatever you can do to replace this fragmentation with a coordinated effort, where everyone understands their role within the team, should improve things quickly. How should you begin? Talk to your team, both as a group in a team meeting, and as individuals. Define their roles in terms of their dependencies and deliverables. Map the product features, and then map the tests to cover those features. Make all communications and institutional memory shared and open. And, don’t fall behind by standing still. If you think you see some improvement, then press for more.

Oh, and remember to think in UTC time. Just not for the hockey games!  ;-)

4 comments:

Phil said...

Awesome post - love reading these real-life experience reports

Shey said...

A very well written and informative post. I think all teams, whether they consider themselves coordinated or not would do well to read this and measure themselves against it.

I look forward to reading future articles.

triceo said...

So, Len, with Walter in need for a crisis QE Manager... When is your last day with us going to be? :-)

All joking aside - a great article!

Aleksandar Kostadinov said...

Hey Len, you forgot to say you are online 24/7 to avoid time difference problems ;)