J. M. Christian Bastien*, Dominique L. Scapin**, Corinne Leulier***
For the Ergonomic Criteria (hereafter EC) [2], such questions were partly addressed: the empirically based design of this set has ensured their validity [5]; their comprehensibility, reliability, and effectiveness in evaluation tasks have been documented [1, 3]. The goal of the present study was, first, to further test the EC as an evaluation aid for non-experts, and second, to compare the relative effectiveness of the EC and the ISO/DIS 9241-Part 10 Dialogue Principles (hereafter DP) [4] in an evaluation task.
The ISO dimensions differ from the EC both in number and in precision. These differences seem to stem from different design strategies. The design of the DP was based on psychological theories, from which recommendations were extracted and organized into high-level dimensions, and on standardization efforts (which, through many iterations, led to many changes before reaching an "experts' consensus"). The design of the EC was based on available experimental data and a large set of individual guidelines that were iteratively grouped into sets characterized by specific dimensions (criteria).
One week before the experimental session, the participants in the Criteria and ISO groups received the EC and the DP, respectively, and were asked to read the documents carefully before the experimental session. The EC document contained the definition, rationale, examples of guidelines, and comments for each criterion; the DP document contained the descriptions, typical applications, and examples of the principles.
The experimental session proceeded in four phases: the demonstration phase, the free-exploration phase, the reading phase, and the evaluation phase. In the demonstration phase, participants were given a 10-minute demonstration of a musical database application (designed with HyperCard®). The interface of the application purposely included a number of design flaws leading to usability problems. In this experiment, only a part of the application was made available; this part contained a total of 246 usability problems. In the free-exploration phase, the participants were allowed to explore the application freely for 10 minutes. In the reading phase, participants in the Criteria and ISO groups were asked to read their respective documents once more. The evaluation phase then followed and ended when the participants felt they had completed their evaluation.
The duration of the evaluation differed among groups (F (2, 14) = 9.355, p = .0026). While the participants in the Control group spent an average of 50 min (SD = 12.2) evaluating the interface, participants in the ISO and Criteria groups took at least twice that time: they spent 100.0 min (SD = 19.3) and 137.5 min (SD = 54.9) in the ISO and Criteria groups, respectively. Fisher's Protected LSDs indicated that the Control group differed significantly from both the ISO (p = .0340) and the Criteria (p = .0007) groups, and that these latter two groups did not differ significantly from one another (p = .0998).
The number of usability problems uncovered during the evaluation phase differed among groups (F (2, 14) = 5.636, p = .016). As indicated by Fisher's Protected LSDs, participants in the Criteria group uncovered significantly more problems than participants in the Control (p = .0101) and ISO (p = .0144) groups. These two latter groups did not differ from one another (p = .9665). On average, participants in the Criteria group uncovered 86.2 usability problems (SD = 12.7), while 61.8 (SD = 15.8) and 62.2 (SD = 13.8) problems were uncovered in the Control and ISO groups, respectively.
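The statistical procedure reported above (an omnibus one-way ANOVA followed by Fisher's protected LSD pairwise comparisons) can be sketched as follows. The per-participant counts used here are hypothetical, purely to illustrate the method; the study's raw data are not reproduced.

```python
# Sketch of one-way ANOVA followed by Fisher's protected LSD.
# Per-participant problem counts below are HYPOTHETICAL (illustrative only).
import itertools
import math

from scipy import stats

# Hypothetical number of problems uncovered by each participant.
groups = {
    "Control":  [45, 60, 75, 58, 65, 68],
    "ISO":      [50, 62, 80, 55, 63, 70],
    "Criteria": [75, 90, 100, 82, 85, 88],
}

# Step 1: omnibus one-way ANOVA.
f_stat, p_value = stats.f_oneway(*groups.values())

# Step 2 (the "protected" part of Fisher's LSD): pairwise t-tests using
# the pooled within-group error term, run only if the omnibus test is
# significant.
if p_value < 0.05:
    n_total = sum(len(g) for g in groups.values())
    df_error = n_total - len(groups)
    mse = sum((x - sum(g) / len(g)) ** 2
              for g in groups.values() for x in g) / df_error
    for (na, a), (nb, b) in itertools.combinations(groups.items(), 2):
        se = math.sqrt(mse * (1 / len(a) + 1 / len(b)))
        t = (sum(a) / len(a) - sum(b) / len(b)) / se
        p = 2 * stats.t.sf(abs(t), df_error)
        print(f"{na} vs {nb}: t = {t:.3f}, p = {p:.4f}")
```

With data patterned like the study's results, the pairwise tests would single out the Criteria group while leaving Control and ISO statistically indistinguishable.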

Figure 1 illustrates the proportion of problems uncovered as a function of the size of aggregates, and the proportion of problems common to the evaluations of the aggregates ("com" curves) for the three groups. The aggregates are formed by pooling the problems uncovered by different numbers of participants. On average, participants found 25.1% (min = 19.1; max = 33.7) and 25.3% (min = 19.9; max = 34.1) of the total number (246) of usability problems in the Control and ISO groups, respectively. In the Criteria group, the percentage reached 35.0% (min = 27.2; max = 40.7). The percentage of problems uncovered as a function of the size of the aggregate is similar for the Control and ISO groups. In the Criteria group, however, the percentage of problems uncovered is higher than in the two other groups. For example, to uncover about 60% of the usability problems contained in the application, one needs 5 evaluators in the ISO and Control groups but only 3 evaluators in the Criteria group. The aggregation of 5 evaluations uncovers 60.0% and 60.6% of the problems in the Control and ISO groups, and 76.9% in the Criteria group. The proportion of common problems was quite similar for the Control and ISO groups as a function of the number of evaluations in the aggregate. Although the tendency was the same for the Criteria group, the proportions tended to be slightly higher.
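The aggregation analysis described above can be sketched as follows: for each aggregate size k, pool the problem sets of every k-evaluator combination and average the fraction of the 246 seeded problems that the pooled evaluations cover. The per-participant problem sets below are hypothetical stand-ins for the study's data.

```python
# Sketch of the evaluator-aggregation analysis. For aggregates of size k,
# average (over all k-sized combinations of evaluators) the fraction of
# the total problem set uncovered by the union of their findings.
import itertools

TOTAL_PROBLEMS = 246  # usability problems seeded in the application

def mean_coverage(problem_sets, k):
    """Average fraction of all problems found by aggregates of size k."""
    combos = list(itertools.combinations(problem_sets, k))
    pooled = [len(set().union(*c)) for c in combos]
    return sum(pooled) / len(pooled) / TOTAL_PROBLEMS

# HYPOTHETICAL evaluations: each set holds the ids of problems one
# participant uncovered (partially overlapping, as in real evaluations).
evaluations = [
    set(range(0, 62)),     # participant 1
    set(range(30, 95)),    # participant 2
    set(range(60, 120)),   # participant 3
    set(range(100, 160)),  # participant 4
    set(range(140, 200)),  # participant 5
]

for k in range(1, len(evaluations) + 1):
    print(f"aggregate size {k}: {mean_coverage(evaluations, k):.1%}")
```

Because evaluators partly overlap in the problems they find, coverage grows with aggregate size but with diminishing returns, which is the shape of the curves in Figure 1.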