Making A Difference - The Impact of Inspections

Paul Sawyer
Digital Equipment Corporation
LKG 2-2/Z7
550 King Street
Littleton, MA 01460-1289
(508) 486-7119
sawyer@tnpubs.enet.dec.com
Alicia Flanders
Digital Equipment Corporation
ZKO 2-3/R77
110 Spit Brook Rd,
Nashua, NH 03062-2642
(603) 881-1337
flanders@casdoc.enet.dec.com
Dennis Wixon
Digital Equipment Corporation
ZKO 2-3/R44
110 Spit Brook Rd,
Nashua, NH 03062-2642
(603) 881-2276
wixon@usable.enet.dec.com

ABSTRACT

In this methodology paper we define a metric we call impact ratio. We use this ratio to measure the effectiveness of inspections and other evaluative techniques in getting usability improvements into products. We inspected ten commercial software products and achieved an average impact ratio of 78%. We discuss factors affecting this ratio and its value in helping us to appraise usability engineering's impact on products.

Keywords

Formal inspections, heuristic evaluations, usability metrics, user testing, walkthroughs, impact ratio, usability problems.

INTRODUCTION

Usability as a discipline, and usability organizations, face the challenge of measuring the effectiveness of the methods they use. The possibility of making such measurements varies with the type of work. For long-term consulting work, where usability consultants work as members of a development team from its inception, the relative contributions of usability professionals and other contributors are difficult to distinguish. However, for more tightly focused, shorter-term methods, effectiveness is substantially easier to measure because these projects offer the possibility of a before-intervention versus after-intervention comparison. One can assess an interface and collect such metrics as the number of problems uncovered [6] or measure improvements in user performance [1, 16]. In addition, one can measure the relative effectiveness of one method over another [7, 8]. While these kinds of measures help usability professionals decide on the optimal method for their situation, they do not measure any given method's ability to leverage product improvement.

Recent papers have begun discussing the need to measure the number of changes made to products as a result of a usability method [1, 5, 16]. We suggest that including the number of problems fixed is a more complete way of measuring the impact of usability activities.

The Usability Expertise Center offers both short-term and long-term consulting on usability issues within Digital Equipment Corporation. Our long-term work is often strategic in nature. This paper focuses on our activity as short-term consultants, called on by a development group to measure and improve the usability of a product by applying focused, well-defined tactical methods to achieve a quick result. In a tactical setting, the client needs specific suggestions that will solve specific problems. In other words, they need impact. Our own management also contributes to our need to measure effectiveness: they must be shown that we have a positive, measurable impact on the usability of the company's products. We must, therefore, measure the impact we have had on products to show that we accomplished something and to understand where we can be more effective [11].

In this paper we list a number of projects we have done, showing how many problems we found and how many of those problems were fixed by the development groups. We look at a few of these projects in detail by way of example to understand our relative impact and the factors contributing to it. Finally, we describe the conclusions about impact that we have drawn from all the projects listed. Although the projects we discuss here involved usability inspections, we believe the measurement of impact applies to any usability evaluation method.

FINDING THE PROBLEMS

One of our primary responsibilities as usability professionals is to find problems in user interfaces. A good deal of attention has been focused on efficient and effective ways to find all of the problems that exist in a given interface. Nielsen has nicely summarized these methods [12].

In the Usability Expertise Center at Digital, we make frequent use of user testing and usability inspections. These methods often find different problems, so the best coverage of a product results from using both [3, 8]. In practice, development-group budget and preference usually constrain us to one or the other. To clients who are short on time (and budget) and who value "expert opinion", we suggest an inspection, because inspections take less time. To clients who have a little more time and budget, or who do not value the opinions of experts, we suggest user testing. (However, we find that we can often insert a little expert opinion into user testing.)

In the past we had been asked to do reviews of designs. We did these reviews, but there was no follow-up, so we had no idea how effective the reviews were. Since usability is a scarce resource, we decided to institute a structured process to assess our effectiveness and to winnow out client groups who were asking for input that they did not plan to use. Specifically, we decided to adopt a review format similar to that used in software code inspections [18].

Here is our standard procedure for doing an inspection.

  1. After talking with the client, we give them a written proposal that states which parts of the product we will inspect and which aspects of usability we will cover. Our coverage can include general usability principles, style-guide compliance, visual design principles, and online information (help and tutorials) principles.
  2. We provide the client with a document describing the principles on which we base our inspection.
  3. Our proposal states that we will provide a written report enumerating problems found and including for each problem the severity (with a specific scale), the specific principles violated, and a recommendation for ways to fix the problem.
  4. The proposal further states that the client must make a written response that includes, for each problem: a statement of whether they will fix the problem immediately, in a later release, or not at all; if not at all, why not; an estimate of the work required to fix the problem; and a statement of what fix they will use (our recommendation or their own alternative). (A sketch of such a problem record appears after this list.)
  5. We have pairs of inspectors work as a team on each aspect of usability to be evaluated. (Although Nielsen discusses the advantages of multiple inspectors [13, 14, 15], he has evaluators work independently to ensure that one does not bias the other. This produces greater variability in the kinds of errors found.) Our experience is that having inspectors look at the interface together results in finding more problems and suggesting better solutions than if the inspectors work alone. However, having more inspectors is not necessarily better. We also find that two is the optimal number for interacting with each other and recording the findings efficiently. With more than two, the person taking notes becomes a bottleneck and the other inspectors become bored.
  6. One person compiles a single report from all of the inspection comments from each pair of inspectors.
  7. All inspectors review the final report and, if necessary, meet to resolve any gaps, contradictions, or disagreements.
  8. We give this written report to the client team to read and act on, and if we are suggesting major design changes, we may walk through the suggested new design with the client team.
  9. We use the feedback from the development group to evaluate our impact and to look for ways to improve our process.
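
To make the report and response formats in steps 3 and 4 concrete, here is a minimal sketch in Python of the information we track for each problem. The structure and field names are ours for illustration only; they are not a required or actual format from the inspection process.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ProblemRecord:
        """One inspection finding plus the client's written response.

        Field names are illustrative; they mirror the report contents (step 3)
        and the required client response (step 4) described above.
        """
        description: str                        # what the problem is
        severity: str                           # "high", "medium", or "low"
        principles_violated: list = field(default_factory=list)  # heuristics or style-guide rules
        recommendation: str = ""                # our suggested fix

        # Filled in from the client's written response:
        disposition: Optional[str] = None       # "fix now", "fix in later release", or "won't fix"
        reason_if_not_fixed: Optional[str] = None
        estimated_effort: Optional[str] = None  # client's work estimate for the fix
        fix_chosen: Optional[str] = None        # our recommendation or their alternative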

HOW WE MEASURED IMPACT

As described above, we require the client to provide a written response. From that report we measure the number of problems that the client commits to fixing. We define a metric called impact ratio as the number of problems that the client commits to fix divided by the total number of problems that we found, expressed as a percentage:


               problems committed to fix
impact ratio = -------------------------  * 100
                 total problems found
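
As a minimal sketch of the calculation (the function name and the use of Product D's figures from Table 1 below are ours, chosen only for illustration):

    def impact_ratio(problems_committed_to_fix, total_problems_found):
        """Return the committed impact ratio as a percentage."""
        if total_problems_found == 0:
            raise ValueError("ratio is undefined when no problems were found")
        return 100.0 * problems_committed_to_fix / total_problems_found

    # Example: Product D in Table 1 -- 84 of 111 problems committed to be fixed.
    print(round(impact_ratio(84, 111)))   # prints 76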

FINDINGS

Table 1 summarizes our activity on ten projects. These inspections took place from August of 1994 to September of 1995 and represent all of the inspections that we delivered to clients during this time that used this framework. The table shows the result of these usability inspections ranked in descending order based on their impact ratios. The average number of usability problems found is 66 with a range of 26 to 111. The average number of committed fixes is 52 with a range of 24 to 84. The average Committed Impact Ratio is 78% with a range of 58% to 96%.

Table 1 - Commitment Ratios from 10 Inspections

    Product    Issues Found    Committed Fixes    Committed Impact Ratio
    A                26               24                  92%
    B                85               77                  91%
    C                78               63                  81%
    D               111               84                  76%
    E                33               24                  73%
    F                46               33                  72%
    G                89               63                  71%
    H                59               39                  66%
    I                52               30                  58%
    J                85               82                  96%
    Total           664              519
    Average          66               52                  78%
    Range        26-111            24-84            58% - 96%
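
The summary rows of Table 1 follow directly from the per-product figures. The following sketch (data transcribed from the table) recomputes the totals, the averages, and the overall committed impact ratio:

    # (product, issues found, committed fixes), transcribed from Table 1
    table_1 = [
        ("A", 26, 24), ("B", 85, 77), ("C", 78, 63), ("D", 111, 84),
        ("E", 33, 24), ("F", 46, 33), ("G", 89, 63), ("H", 59, 39),
        ("I", 52, 30), ("J", 85, 82),
    ]

    total_found = sum(found for _, found, _ in table_1)
    total_fixed = sum(fixed for _, _, fixed in table_1)
    print(total_found, total_fixed)                 # 664 519
    print(round(total_found / len(table_1)))        # 66  (average issues found)
    print(round(total_fixed / len(table_1)))        # 52  (average committed fixes)
    print(round(100 * total_fixed / total_found))   # 78  (overall committed impact ratio, %)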

A salient feature of the committed impact ratio is that it measures a commitment that we get from product development. One might wonder if this commitment is predictive of the actual number of fixes. For six usability inspections, we have had access to the version of the product produced after our inspection. This allowed us to see which usability issues development actually addressed.

When evaluating the number of usability issues fixed, it is important to keep in mind that product development is an ongoing process. Issues that are not addressed in one version of a product are sometimes deferred to another version. Therefore, a seemingly large divergence between the committed impact ratio and the completed-to-date impact ratio does not invalidate the committed impact ratio. The completed-to-date impact ratio is a measure of how far the development team has come towards addressing all of the problems identified.

For five out of the six inspections, the committed impact ratios are the same as the completed-to-date ratios. The reasons for this reside in the pressures of the software development environment. In some cases, because of compressed schedules and the need to ship, developers implemented our recommendations before reporting back to us, so their reports reflected what they had actually done. This is an example of development's propensity not to commit to anything until they are sure they can do it.

For product G the response from development showed a committed impact ratio of 71%; however, when we were able to follow up on the product, the actual number of recommendations implemented resulted in a completed-to-date impact ratio of only 37%. This ratio was partly due to the nature of the recommendations. For example, one recommendation was to implement mnemonics for every button in the interface. Developing a set of mnemonics for an interface requires some thought and planning, in addition to the actual coding. Other recommendations affected the entire interface. These included changing the action of scroll bars, providing progress indicators, making buttons, labels, and placement consistent, allowing for maximizing and minimizing windows, and providing a task-oriented help system. Recommendations from our inspection represented only one set of pressures on the developers. Other pressures included the need to implement new features, fix bugs, and meet their deadlines for each release. This development team balanced the demands by staging their implementation of the usability recommendations over more than one version and by placing a higher priority on software bugs than on usability issues.

Gunn [5] reported that 14 inspections produced an average fix rate of 74% with a range of 50% to 95%. In addition, Muller [10] reported for one heuristic evaluation that 89 recommendations were made, 87% were committed to and 72% were actually implemented.

Our data as well as that of Gunn [5] demonstrate that formal inspections are a relatively effective method of getting usability problems fixed. These findings qualify Klemmer's [9] assertion that expert opinion is not as effective as data from usability studies in facilitating product improvement and suggest that, when embedded in an appropriate process, expert opinion can be quite effective.

WHAT AFFECTS IMPACT

There are a number of factors that affect the impact of a usability method. The following is a discussion of some of these factors, based on our experiences delivering inspections and our reflection on these experiences.

Written Report

Providing a written report has an obvious positive effect on our impact: The problem descriptions and specific solutions are available for all group members to read and reference, and the written word removes some (but not all) ambiguity about what we intend to say. The written report also has a less-obvious effect that has been reported earlier [18]: We avoid being used as an approval agency, wherein our recommendations may be ignored while "usability approval" is cited. In such situations, our impact would actually be negative: no changes would be made, and the incentive to make changes would be lessened.

Multiple Methods

Since inspections and user tests are known often to discover different problems [6], a way to improve impact is to do both kinds (or other kinds) of evaluations. It should be noted that inspections usually provide more systematic coverage of an interface than user testing.

Written Response

Requiring a client to respond in writing to our report changes the client's response. We ask our clients to include a work-time estimate for each problem fix as part of their response. Having invested the time to consider each problem, its fix, and the time required to implement it, development groups tend to actually do what they tell us they will do, even though we have no authority to ask for commitments. This is the request form of speech acts [17], which provides for a clear specification of the action, a responsible person, and a completion date.

In addition, requesting engineering to agree to respond to our report in writing often helps us to eliminate clients who are unlikely to implement our recommendations. That is, if they cannot agree to respond to our report, it follows that they probably will not have time to implement our recommendations either. Therefore the context in which we measure our impact actually helps us to identify optimal ways to apply our resources.

We find that requiring a written response from a development team is good practice for any usability technique. It is also one of the operational aspects of any kind of engineering [16].

Easier Response Process

Client responses are important, but we know that they impose additional work on engineering. Therefore, we try to make the process easier. One way we do this is to set up a meeting where we go over the report with the client. We answer any questions they might have about the recommendations and they give us their response to each recommendation, which we record on the spot. Sometimes we provide our report document to clients so they can add their information to the report without having to reference each recommendation in a separate file. We are usually willing to adapt our process to achieve our desired result.

Specific Recommendations

Providing specific recommendations to fix specific problems has a tremendous positive effect: The development group need not spend time thinking of a solution, plus we gain a psychological advantage in offering constructive suggestions rather than just criticism. To the extent that we have the expertise, we include enough technical details in our recommendations so that the solution becomes "a mere matter of coding." Modern GUI toolkits are complicated, and many development groups do not have a GUI guru available, so we can often provide enough details to change a programmer's estimate from "Can't do that" or "That would take a month" to "I can have that done tomorrow".

Along that line, we sometimes offer two or more recommendations to solve a given problem, suggesting a better but more complex solution if there is time and talent available, and a less desirable but still acceptable solution that can be implemented more quickly. Some impact is definitely better than no impact.

Severity Level

We rate each problem found with a severity level of low, medium, or high according to that problem's effect on usability. This allows the development group to focus on the fixes that will have the most usability impact. We have found, however, that the severity of the issues that we report does not appear to have any effect on whether or not development will commit to addressing an issue. For the inspections shown in Table 1, the impact ratio for high, medium, and low usability issues is 72%, 71% and 72% respectively. This finding suggests the need to further examine the reporting of severity levels to learn if other factors (such as time, effort, and risk involved in addressing an issue) influence whether or not development will commit to addressing a usability issue.
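
Computing the impact ratio separately for each severity level is a simple grouping over the problem records. The sketch below assumes the illustrative ProblemRecord structure shown earlier and treats any commitment, immediate or deferred, as a committed fix:

    from collections import Counter

    def severity_impact_ratios(records):
        """Committed impact ratio per severity level, as percentages."""
        found = Counter(r.severity for r in records)
        committed = Counter(r.severity for r in records
                            if r.disposition in ("fix now", "fix in later release"))
        return {sev: round(100 * committed[sev] / found[sev]) for sev in found}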

We find that reporting severity makes our report more like engineering quality assurance reports and as a result adds more credibility to our process and findings. Therefore we will continue reporting severity levels.

We also encourage developers to go for some "cheap wins," problems that are easy to fix irrespective of severity. When a number of "cheap wins" are aggregated, the result can be a significant improvement in the perceived quality of the interface.

Client Participation

Having a member of the client development team sit in on an inspection or user test sometimes has a big impact. Karat has observed [7] that this is due to the "seeing is believing" phenomenon and the developer's involvement in and commitment to the usability evaluation process. We affirm this observation, noting that some developers and managers do not value expert opinion at all, so direct participation gives them assurance that the data we collect is credible [4, 18].

Early Involvement

An obvious factor for impact is being able to work with a development group as early in the development cycle as possible, since it is well known that solving quality problems (whether software bugs or usability bugs) earlier is cheaper. We have less control over this factor, but we have found that we get a fair bit of repeat business from satisfied clients, and that our involvement with the next project for a client often happens earlier in the development cycle. Earlier involvement is, of course, advantageous to the client: inspections and user testing operate in a scrap and rework mode, so from a larger perspective they are less efficient than an approach which focuses on getting the design right earlier in the process. One might think of inspections as both a short-term aid to usability and a long-term investment in changing development processes.

FACTORS BEYOND OUR CONTROL

Some factors that affect impact are beyond our control. Although we believe that usability problems are real bugs and can often convince development groups of this, usability bugs most often are given lower priority in comparison to software bugs, for a good reason: if the product doesn't work, usability doesn't matter much. For products that have a high number of software bugs, impact from usability evaluations is likely to be low.

The following are reasons for not implementing a recommendation that we feel are beyond the control of usability practitioners:

FACTORS AFFECTING SPECIFIC INSPECTIONS

A number of factors contributed to the high impact ratios of 92% and 91% for products A and B. Both of the products are part of a tightly integrated product set. The usability engineers who participated in the inspection enjoyed a long-term relationship with the development group and had earned the respect of the developers. The usability engineers worked on an ongoing basis with the developers, performing many kinds of usability processes and evaluations at various points in the development cycle. However, the inspections noted here were the first formal inspections performed in that product space. Previously a single usability engineer might review an interface and provide feedback to the developers. Formal inspections were a departure from the existing process because usability engineers worked as a team to review the interface and requested a written response from engineering. The novelty of the formal inspection process generated a lot of excitement and interest. Working as a team, the usability engineers were able to produce a comprehensive report in a relatively short time. The report categorized usability issues by screen and severity ratings. The development teams were pleased to be able to get usability feedback in time for current releases and the severity ratings helped them to focus on the most pressing issues.

Another factor contributing to the high success rate for the inspection of Product B (91% impact ratio) was that the development team was located in Valbonne, France and was appreciative of having the spotlight of attention from the "home office" focused on their product. Being in the spotlight might have also encouraged them to higher levels of cooperation.

On the other hand, a completely different set of factors contributed to the lower impact ratio of Product H (66%). The developer responsible for the first version was reassigned to another product. The new developer who was finally assigned lacked the product knowledge and expertise needed to address the usability issues. In addition, the future of the product was uncertain because high-level strategy decisions were threatening to render it obsolete. Therefore, development was unwilling to invest a lot of resources in the product unless its future was more certain. For future projects we will ask questions during our initial contact with clients to determine how stable a product is before we commit to performing an inspection on it. This will help us to avoid performing inspections on products to which development is not fully committed.

Several factors contributed to the 58% impact ratio of Product I, which is intended for internal use. The initial client request was for specific consultation on Motif style-guide issues, but in looking at the interface with the client to understand the Motif issues, our consultant observed several usability problems that were not related to style-guide compliance. We suggested an inspection. The development manager acknowledged that the product had not had any involvement of a design professional, and that there were likely many problems that would not be observed until the product was actually deployed, so the manager was enthusiastic about an inspection. The inspection was done in our standard way (described earlier), and the client requested a consultation meeting to clarify some of the recommendations made.

When the client response came back, we discovered that our impact had been lower than usual. Analysis of the specific responses revealed several factors. Development time and resources were very limited, so most work was deferred to a later release. The developers had some strongly held beliefs about how Motif applications are supposed to work, based at least partly on the default behavior of several Motif widgets. But the largest single factor was a strongly held belief that users of the product must be domain experts (in an aspect of the company's business) and must undergo training before using the product. Although we demonstrated alternative designs that were not especially difficult to implement and that would remove the training requirement, the development team opted to assume the need for training (adding to the cost of use for every product user) in order to reduce their development cost (since they had already implemented the product). Only strong user criticism has a chance of changing their minds on this issue.

WAYS THAT WE CAN IMPROVE OUR IMPACT

Reflection on our processes has shown us that many factors affect our impact. This section describes some process improvements that resulted from analyzing the written responses from product development.

The following are reasons that clients gave us for not implementing our recommendations that we feel are within the control of usability practitioners:

The impact ratio is a quantitative measure of how many of our recommendations the development team will implement. It provides us feedback on how we are doing. However, it is only a metric and not an end in itself. For example, we frequently make recommendations even when we are unsure whether a problem exists. We feel obligated to bring these kinds of ambiguous issues to the attention of the developers, even if there is a chance that they will not accept the recommendation and thus reduce our impact ratio. We do not limit our reports to items that we are sure development will address.

CONCLUSION

It is important for usability professionals to find usability problems quickly and effectively, but it is equally important to have an impact on the usability of products examined and to reflect on the effectiveness of usability practice. We have described here a simple way to measure the probable impact on product usability by (1) producing a written report, (2) requiring a response from the client, and (3) calculating the ratio of problems the client says will be fixed to the total number of problems found.

We have also described factors that we have found affect our impact: (1) providing severity levels for problems; (2) making recommendations that are both detailed and technically specific; (3) requiring a client response; (4) getting involved as early as possible; (5) becoming familiar with the product and its design goals prior to the inspection; and (6) inspecting the product within the context of its product family.

Using these methods, we have satisfied our clients, improved the usability of our company's products, and measurably justified our existence to our management. In addition, we now have a database of results to which we can compare future evaluations to spot deviant cases and investigate them.

ACKNOWLEDGMENTS

We gratefully acknowledge the contributions to the usability methods described here of the other members of the Usability Expertise Center: Minette Beabes, George Casaday, Elizabeth Comstock, Nancy Clark, Debby Falck, Rick Frankosky, Tom Graefe, Trudi Leone, Mike Paciello, Anne Parshall, Sharon Smith, Linda Sue Trapasso.

REFERENCES

  1. Bias, R.G. and Mayhew, D. J. Cost Justifying Usability, Academic Press, New York, 1994.
  2. Brooks, P., Adding Value to Usability Testing. In Usability Inspection Methods, Nielsen, J and Mack, R.L., Eds. John Wiley & Sons, Inc., New York 1994, pp 255-271.
  3. Desurvire, H.W. Faster, Cheaper!! Are usability inspection methods as effective as empirical testing? In Usability Inspection Methods, Nielsen, J. and Mack, R.L., Eds. John Wiley and Sons, New York, 1994, pp 173-202.
  4. Dumas, J., and Redish, J.C., A Practical Guide to Usability Testing. Ablex, Norwood, NJ, 1993.
  5. Gunn, C., An Example of Formal Usability Inspections in Practice at Hewlett-Packard Company. In Human Factors in Computing Systems CHI'95 Conference Companion (May 7-11, Denver, Colorado) ACM/SIGCHI, New York, 1995, pp 103-104.
  6. Jeffries, R., Miller, J.R., Wharton, C., and Uyeda, K.M. User Interface Evaluation in the Real World: A Comparison of Four Techniques. In Proceedings of CHI'91 Human Factors in Computing Systems (April 27-May 2, 1991, New Orleans, Louisiana). ACM/SIGCHI, New York, 1991, pp 119-124.
  7. Karat, C-M, A Comparison of User Interface Evaluation Methods. In Usability Inspection Methods, Nielsen, J and R.L. Mack, Eds. John Wiley & Sons, Inc., New York, 1994, pp 203-233.
  8. Karat, C-M, Campbell, R., and Fiegel, T. Comparison of Empirical Testing and Walkthrough Methods in User Interface Evaluation. In CHI'92 Conference Proceedings (May 3-7, 1992, Monterey, California). ACM/SIGCHI, New York, 1992, pp 397-404.
  9. Klemmer, E.T., Impact in Ergonomics: Harness the Power of Human Factors in your Business. Ablex, Norwood, NJ: 1989, pp 45-87.
  10. Muller, M. McClard, A., Bell, B., Dooley, S., Meiskey, L., Meskill, A., Sparks, R., and Tellam, D., Validating an Extension to Participatory Heuristic Evaluation: Quality of Work and Quality of Work Life. In Human Factors In Computing Systems CHI'95 Conference Companion (May 7-11, Denver, Colorado). ACM/SIGCHI, New York, 1995, pp 115-116.
  11. Muller, M., Wildman, D., and White, D. Taxonomy of PD Practices: A Brief Practitioner's Guide. Communications of the ACM, Vol. 36, No. 4 (Oct. 1993), 26-28.
  12. Nielsen, J., Usability Inspection Methods. In Human Factors in Computing Systems CHI'95 Conference Companion (May 7-11, Denver, Colorado). ACM/SIGCHI, New York, 1995, pp 377-378.
  13. Nielsen, J., and Landauer, T. A Mathematical Model of the Finding of Usability Problems. In Human Factors in Computing Systems INTERCHI'93 Conference Proceedings (April 24-29, 1993, Amsterdam, The Netherlands). ACM/SIGCHI, New York, 1993, pp 206-213.
  14. Nielsen, J., and Molich, R. Heuristic Evaluation of User Interfaces. In Human Factors in Computing Systems CHI'90 Conference Proceedings (April 1-5, 1990, Seattle, Washington). ACM/SIGCHI, New York, 1990, pp 249-256.
  15. Nielsen, J. and Phillips, V. Estimating the Relative Usability of Two Interfaces: Heuristic, Formal, and Empirical Methods Compared. In INTERCHI'93 Conference Proceedings (24-29 April, Amsterdam, The Netherlands). ACM/SIGCHI, New York, 1993, pp 214-221.
  16. Whiteside, J., Bennett, J., and Holtzblatt, K. Usability Engineering: Our Experience and Evolution. In Handbook of Human-Computer Interaction, North Holland, New York, 1988.
  17. Winograd, T., and Flores, F. Understanding Computers and Cognition. Ablex Publishing Corporation, Norwood, NJ, 1986.
  18. Wixon, D., Jones, S., Tse, L., and Casaday, G. Inspections and Design Reviews: Framework, History, and Reflection. In Usability Inspection Methods, J. Nielsen and R.L. Mack, Eds. John Wiley & Sons, Inc. New York, 1994, pp 77-103.