J. M. Christian Bastien*, Dominique L. Scapin**, Corinne Leulier***
For the Ergonomic Criteria (hereafter EC) [2], such questions were partly addressed: the empirically based design of this set has ensured their validity [5]; their comprehensibility, reliability, and effectiveness in evaluation tasks have been documented [1, 3]. The goal of the present study was, first, to further test the EC as an evaluation aid for non-experts, and second, to compare the relative effectiveness of the EC and the ISO/DIS 9241-Part 10 Dialogue Principles (hereafter DP) [4] in an evaluation task.
The ISO dimensions differ from the EC both in number and in precision. These differences seem to stem from different design strategies. The design of the DP was based on psychological theories, from which recommendations were extracted and organized into high-level dimensions, and on standardization efforts (which, through many iterations, led to many changes before reaching an "experts' consensus"). The design of the EC was based on available experimental data and a large set of individual guidelines that were iteratively grouped into sets characterized by specific dimensions (criteria).
One week before the experimental session, the participants in the Criteria and ISO groups received the EC and the DP, respectively, and were asked to read the documents carefully before the experimental session. The EC document contained the definition, rationale, examples of guidelines, and comments for each criterion; the DP document contained the descriptions, typical applications, and examples of the principles.
The experimental session proceeded in four phases: the demonstration phase, the free-exploration phase, the reading phase, and the evaluation phase. In the demonstration phase, participants were given a 10-minute demonstration of a musical database application (designed with HyperCard®). The interface of the application purposely included a number of design flaws leading to usability problems. In this experiment, only a part of the application was made available; this part contained a total of 246 usability problems. In the free-exploration phase, the participants were allowed to explore the application freely for 10 minutes. In the reading phase, participants in the Criteria and ISO groups were asked to read their respective documents once more. The evaluation phase then followed and ended when the participants felt they had completed their evaluation.
The duration of the evaluation differed among groups (F (2, 14) = 9.355, p = .0026). While the participants in the Control group spent an average of 50 min (SD = 12.2) evaluating the interface, participants in the ISO and Criteria groups took at least twice that time: they spent 100.0 min (SD = 19.3) and 137.5 min (SD = 54.9) in the ISO and Criteria groups, respectively. Fisher's Protected LSDs indicated that the Control group differed significantly from both the ISO (p = .0340) and the Criteria (p = .0007) groups, and that these latter two groups did not differ significantly from one another (p = .0998).
The number of usability problems uncovered during the evaluation phase differed among groups (F (2, 14) = 5.636, p = .016). As indicated by Fisher's Protected LSDs, participants in the Criteria group uncovered significantly more problems than participants in the Control (p = .0101) and ISO (p = .0144) groups. These two latter groups did not differ from one another (p = .9665). On average, participants in the Criteria group uncovered 86.2 usability problems (SD = 12.7), while 61.8 (SD = 15.8) and 62.2 (SD = 13.8) problems were uncovered in the Control and ISO groups, respectively.
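The statistical procedure reported above (an omnibus one-way ANOVA followed by Fisher's protected LSD pairwise comparisons) can be sketched as follows. The per-participant counts used here are hypothetical, purely to illustrate the method; the study's raw data are not reproduced.

```python
# Sketch of one-way ANOVA followed by Fisher's protected LSD.
# Per-participant problem counts below are HYPOTHETICAL (illustrative only).
import itertools
import math

from scipy import stats

# Hypothetical number of problems uncovered by each participant.
groups = {
    "Control":  [45, 60, 75, 58, 65, 68],
    "ISO":      [50, 62, 80, 55, 63, 70],
    "Criteria": [75, 90, 100, 82, 85, 88],
}

# Step 1: omnibus one-way ANOVA.
f_stat, p_value = stats.f_oneway(*groups.values())

# Step 2 (the "protected" part of Fisher's LSD): pairwise t-tests using
# the pooled within-group error term, run only if the omnibus test is
# significant.
if p_value < 0.05:
    n_total = sum(len(g) for g in groups.values())
    df_error = n_total - len(groups)
    mse = sum((x - sum(g) / len(g)) ** 2
              for g in groups.values() for x in g) / df_error
    for (na, a), (nb, b) in itertools.combinations(groups.items(), 2):
        se = math.sqrt(mse * (1 / len(a) + 1 / len(b)))
        t = (sum(a) / len(a) - sum(b) / len(b)) / se
        p = 2 * stats.t.sf(abs(t), df_error)
        print(f"{na} vs {nb}: t = {t:.3f}, p = {p:.4f}")
```

With data patterned like the study's results, the pairwise tests would single out the Criteria group while leaving Control and ISO statistically indistinguishable.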

Figure 1 illustrates the proportion of problems uncovered as a function of the size of aggregates, and the proportion of problems common to the evaluations of the aggregates ("com" curves) for the three groups. The aggregates are formed by pooling the problems uncovered by different numbers of participants. On average, participants found 25.1% (min = 19.1; max = 33.7) and 25.3% (min = 19.9; max = 34.1) of the total number (246) of usability problems in the Control and ISO groups, respectively. In the Criteria group, the percentage reached 35.0% (min = 27.2; max = 40.7). The percentage of problems uncovered as a function of the size of the aggregate is similar for the Control and ISO groups. In the Criteria group, however, the percentage of problems uncovered is higher than in the two other groups. For example, to uncover about 60% of the usability problems contained in the application, one needs 5 evaluators in the ISO and Control groups but only 3 evaluators in the Criteria group. The aggregation of 5 evaluations uncovers 60.0% and 60.6% of the problems in the Control and ISO groups, and 76.9% in the Criteria group. The proportion of common problems was quite similar for the Control and ISO groups as a function of the number of evaluations in the aggregate. Although the tendency was the same for the Criteria group, the proportions tended to be slightly higher.
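The aggregation analysis described above can be sketched as follows: for each aggregate size k, pool the problem sets of every k-evaluator combination and average the fraction of the 246 seeded problems that the pooled evaluations cover. The per-participant problem sets below are hypothetical stand-ins for the study's data.

```python
# Sketch of the evaluator-aggregation analysis. For aggregates of size k,
# average (over all k-sized combinations of evaluators) the fraction of
# the total problem set uncovered by the union of their findings.
import itertools

TOTAL_PROBLEMS = 246  # usability problems seeded in the application

def mean_coverage(problem_sets, k):
    """Average fraction of all problems found by aggregates of size k."""
    combos = list(itertools.combinations(problem_sets, k))
    pooled = [len(set().union(*c)) for c in combos]
    return sum(pooled) / len(pooled) / TOTAL_PROBLEMS

# HYPOTHETICAL evaluations: each set holds the ids of problems one
# participant uncovered (partially overlapping, as in real evaluations).
evaluations = [
    set(range(0, 62)),     # participant 1
    set(range(30, 95)),    # participant 2
    set(range(60, 120)),   # participant 3
    set(range(100, 160)),  # participant 4
    set(range(140, 200)),  # participant 5
]

for k in range(1, len(evaluations) + 1):
    print(f"aggregate size {k}: {mean_coverage(evaluations, k):.1%}")
```

Because evaluators partly overlap in the problems they find, coverage grows with aggregate size but with diminishing returns, which is the shape of the curves in Figure 1.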