

<!-- Published by Quadralay WebWorks HTML Lite 1.5.1 -->

<html>

<head>

<title>Systematic Design of Spoken Prompts</title>

</head>


<body>

<a name="105676">

<center><h1> Systematic Design of Spoken Prompts</h1></center>

</a>

<hr><p><a name="111815">

<h1> Brian Hansen, David G. Novick, Stephen Sutton</h1>

</a>

<a name="111816">

<h3> Center for Spoken Language Understanding<br>Oregon Graduate
Institute of Science &amp; Technology<br>20000 N.W. Walker
Road<br>Beaverton, OR, 97006</h3>

</a>

<a name="105688">

<h3> brianh@jib.cse.ogi.edu, novick@cse.ogi.edu,
sutton@cse.ogi.edu</h3>

</a>

<a name="111723">

<h3> (503) 690-1121</h3>

</a>

<a name="105698">

<h1> ABSTRACT</h1>

</a>

<a name="106357">

Designers of system prompts for interactive spoken-language systems
typically seek 1) to constrain users so that they say things that the
system can understand accurately and 2) to produce ``natural''
interaction that maximizes users' satisfaction. Unfortunately, these
goals are often at odds. <p>

</a>

<a name="112772">

We present a set of heuristics for choosing appropriate prompt styles
and show that a set of dimensions can be formulated from these
heuristics. A point (or region) in the space formed by these
dimensions is a ``style'' for prompts. We develop and apply metrics
for empirically testing different prompt styles. Finally, we describe
a toolkit that automatically generates prompts in a variety of styles
for spoken-language dialogues.<p>

</a>

<a name="111492">

<h1> Keywords</h1>

</a>

<a name="105704">

Interaction design, auditory I/O, dialog analysis, design techniques,
evaluation, toolkits<p>

</a>

<a name="111497">

<h1> INTRODUCTION</h1>

</a>

<a name="106374">

Between the attainable practicality of command-based speech
recognition and the elusive attraction of ``natural'' spoken-language
interaction lies the growing use of spoken dialogue systems (SDSs).
This middle ground includes applications such as the AT&amp;T
long-distance billing system, the OGI automated spoken questionnaire
for the U.S. Census [6], and systems performing the ATIS travel
information task [15]. These systems engage in relatively simple
task-based dialogues, often expecting users' utterances to consist of
a single word or a short phrase; they are analogous in complexity to
graphical user interfaces (GUIs). Like GUIs, SDSs generally do not
generate their output at run time; they instead use pre-specified
phrases or templates as prompts. Development of these prompts is
usually taken to be more art than science; to create the system's
prompts, designers most often rely on expert intuition and tacit
experience. But beyond intuition and experience we propose systematic
methods for characterizing and generating spoken-dialogue prompts. In
this paper, we present these methods and show their usefulness for
developing effective SDSs. Although we concentrate on SDS development,
these methods have a natural extension to the speech component of
multimedia systems.<p>

</a>

<a name="106375">

For effective interaction, SDSs must rely on directive prompts [7] to
increase user compliance with system requirements. The best
intuitively-designed prompts serve two functions, though: they not only
constrain the user's response so that speech recognizers--all too
fallible--stand a better chance of success, but also provide a feeling
of ``natural'' dialogue for the user so that the overall interaction
is not aversive. These two functions provide a basis for a more
systematic approach to generation of prompts. From our own experiences
in designing SDSs, we have developed a set of heuristics for designing
system output. From these heuristics, we have further developed a
framework for analyzing and designing prompts in terms of dimensions,
features and styles. <p>

</a>

<a name="112470">

Our interest in identifying dimensions of dialogue variation grew out
of our need to translate a written questionnaire for the US Census
(the 1990 Census Short Form) into a configuration usable with a SDS.
The first problem we encountered was a proliferation of possible ways
of structuring and phrasing census questions. For each written
question, there seemed a nearly infinite variety of ways of expressing
its underlying meaning to users. We needed a way to choose only a few
representatives from the profusion of conceivable treatments; we could
not test the effectiveness of every variation and needed to converge
quickly on solutions better suited to SDS technology. Unfortunately we
did not know, a priori, which solutions would be best. To ensure
adequate performance of the speech recognition component of a SDS,
while avoiding confusing or alienating users, we needed to find a way
to characterize the ``space'' of ways to formulate questions, and to
predict the results obtained by using prompts in different points of
that space.<p>

</a>

<a name="108740">

We derive a space of possible system prompts by first producing and
analyzing heuristics offering different treatments, or styles, for
handling various aspects of spoken interaction. These styles suggest
several ways to present automated human-computer dialogues, with an
emphasis on the ways of phrasing system prompts. By abstracting across
the heuristics, we have identified a set of features associated with
each style, and a set of dimensions covering and containing the
feature set. A point in the space described by the dimensions is a
``style'' for prompts. <p>

</a>

<a name="112476">

In this paper, we present a framework of heuristics, dimensions and
styles for spoken dialogue prompts. We define the notion of a SDS's
overall stylistic consistency. We outline metrics for empirically
testing different prompt styles in terms of their effects on
constraint of user speech and on user satisfaction, and briefly
present results of our use of this approach. Finally, we describe a
SDS toolkit we are developing that incorporates some of this framework
in the automated generation of system prompts.<p>

</a>

<a name="106391">

<h1> THE PROBLEM</h1>

</a>

<a name="106393">

Current speech recognition algorithms match the features of a speech
signal with models of the features of known phonemes via a statistical
process. One effect of this statistical matching is that recognition
is probabilistic. In a GUI, only the set of relevant user actions is
defined at any given moment. This is equivalent to imposing a
vocabulary of legal actions. It is generally not possible to enforce
the same rigid constraints in a spoken interface. Unfortunately, the
need for a high degree of recognition accuracy in speaker-independent
speech recognition imposes the requirement that the words to be
recognized come from a relatively small set of candidates. The overall
effectiveness [8] of a SDS, then, is dependent upon the ability of
dialogue designers to produce prompts that constrain users' possible
responses. One of the subtleties of dialogue design lies in giving
users a feeling of naturalness and freedom of response although
underlying constraints exist. <p>

</a>

<a name="111146">

In an ideal SDS, the naturalness and accuracy criteria would both be
satisfied. Users would be free from artificial constraints on their
use of vocabulary, grammar, or the interactional fluidity that
characterize routine task-based human-human telephone speech.
Unfortunately, this ideal is unrealistic given current technology
[5,12,13]. Indeed, given the limitations inherent in current speech
recognition technology and the need for near-perfect recognition
accuracy, dialogue designers must often make compromises between
accuracy and naturalness. While these criteria may sometimes agree on
the best way to present system prompts, most often a tension exists
between them.<p>

</a>

<a name="112760">

We have used different versions of dialogues to refine our
understanding of this tension. Consider, for example, a dialogue in
which the system sought to elicit certain information from the user by
asking only yes/no questions. Such an approach would likely be very
accurate from a speech recognition standpoint, yet fairly unnatural
and inappropriate for many situations. Determining a user's age by
means of yes/no questions, for example, would most certainly be a
laborious and unacceptable approach, unpleasant for the user and
time-consuming for both user and system. <p>

</a>

<a name="112761">

Finally, where limits of speech recognition technology force dialogue
designers to adopt solutions that are not maximally pleasing or
natural, it is essential that they have a clear understanding of the
implications of the compromises they make. Only then are they in a
position to take advantage of improvements in the accuracy and
robustness of that technology as they become available.<p>

</a>

<a name="106398">

<h1> HEURISTICS, DIMENSIONS AND STYLES</h1>

</a>

<a name="107519">

To advance the production of voice-response questionnaires from an ad
hoc, mostly intuitive ``craft'' into more of an engineering
discipline, we have developed a method using a set of heuristics for
transforming [16] a written version of a questionnaire into a script
(or protocol) for use with speech-recognition systems. From these
heuristics, we then developed a systematic approach to the design of
spoken prompts; this approach is based on defining a space of possible
system prompts that can be described by a set of task-independent
descriptive dimensions. We identify a set of fourteen dimensions of
system prompts, and define a point in the space they form as a
``style'' for prompts. <p>

</a>

<a name="106455">

<h2> Heuristics for Designing Dialogues</h2>

</a>

<a name="106457">

In designing prompts for the census task, we quickly saw that for each
question there was a myriad of ways of expressing its underlying
intent. To converge on a practical number of dialogue designs to test,
we needed a principled way of deciding which, out of thousands of
wording variations, we should use. One limiting factor for this
particular project derived from the nature of the census task itself:
we needed the spoken questionnaire to be as true to the original
written form as possible in order to avoid distorting census results.
This concern led us to examine ways of transforming the original
written questionnaire into a form suitable for use with a SDS. One
result of our investigation was a set of heuristics for translating
from written to spoken media.<p>

</a>

<a name="106458">

Associated with each heuristic are 1) a pattern, or a set of
pre-conditions, specifying where the heuristic may be used; 2) a set
of styles into which a question (or other aspect of the interaction)
can be transformed; and 3) a discussion of the trade-offs between the
styles. The discussions of the trade-offs constitute informal
hypotheses about the effects of the different styles on the accuracy
of speech recognition, the naturalness of the interaction, and the
interaction's length. As an example, consider a heuristic applicable
to multiple-choice questions (described below). This heuristic applies
when transforming, from written to spoken form, questions involving a
choice among three to six options. We have informally specified five
different styles of structuring and phrasing such questions, and have
analyzed their implications on speech recognition accuracy as well as
their expected effect on the naturalness and length of the
interaction. <p>

</a>
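The three-part structure of a heuristic described above lends itself to a direct representation in code. The following Python sketch is purely illustrative: the class and field names are our own invention, not part of any released toolkit.

```python
from dataclasses import dataclass, field

@dataclass
class Style:
    """One candidate treatment for a prompt, with informal hypotheses
    about its effects on recognition accuracy, naturalness, and length."""
    name: str          # e.g. "4.1"
    template: str      # prompt wording or structural recipe
    accuracy: str      # expected effect on speech recognition accuracy
    naturalness: str   # expected effect on perceived naturalness
    length: str        # expected effect on interaction length

@dataclass
class Heuristic:
    """A transformation rule: 1) a pattern saying where it applies,
    2) a set of candidate styles, 3) per-style trade-off notes."""
    pattern: str
    styles: list = field(default_factory=list)

    def applies_to(self, question_kind):
        # A real matcher would inspect the question's structure;
        # here we simply compare against the declared pattern.
        return question_kind == self.pattern

multiple_choice = Heuristic(
    pattern="choice among 3-6 options",
    styles=[
        Style("4.1", "<ask question, give options>",
              accuracy="higher", naturalness="lower", length="medium"),
        Style("4.2", "<ask question without giving options>",
              accuracy="lower", naturalness="higher", length="short"),
    ],
)
```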

<a name="106459">

The styles associated with each heuristic are representative samples,
depicting not every way of phrasing or presenting prompts but rather a
reasonable breadth of approaches. Although many of the heuristics are
associated with the forms that questions may take, some are applicable
to the interaction as a whole or to a particular aspect of
interaction, such as ways of accomplishing turn-taking. Multiple
heuristics may apply to a single prompt and, in the case of heuristics
describing the re-structuring of complex questions, may be cascaded;
one heuristic may partially break down the question while another may
finish the translation. <p>

</a>

<a name="106460">

We developed these heuristics in order to provide a principled
starting point for an iterative dialogue design prototyping effort,
but their value is not limited to this particular usage. In general,
the prompt heuristics have several important uses, including: <p>

</a>

<ul>

<ul>

<a name="112504">

<li>developing designs for a set of initial dialogues,

</a>

<a name="112505">

<li>categorizing existing dialogues,

</a>

<a name="112506">

<li>testing hypotheses about the effectiveness and naturalness of
different styles, and

</a>

<a name="112539">

<li>implementation via rapid prototyping toolkits.

</a>

</ul>

</ul>

<a name="112540">

<p>

</a>

<a name="112541">

In our efforts, these heuristics have been useful for reducing the
expected user vocabulary, reducing the effects of user intonation,
mitigating the system's reduced level of understanding and interactive
abilities, and compensating for the loss of visual access to the
written form (including the ability to scan ahead) [3]. Perhaps the
greatest benefit of this framework is that it encourages an empirical
approach to dialogue design. By making and testing predictions about
the effects of various styles, we can reject inappropriate dialogue
styles and reduce the dialogue designer's reliance on intuition and
hand-crafting.<p>

</a>

<a name="106466">

The following sections describe individual heuristics we devised in
the course of the census project. In many cases the styles associated
with a heuristic are presented as hypothetical interactions between
the system (``S'') and a user (``U'').<p>

</a>

<a name="106487">

<h3> Questions Containing Presuppositions</h3>

</a>

<a name="106503">

Figure 1 depicts two different styles for dealing with questions
containing presuppositions. Style 1.1 ignores the possibility of a
question eliciting an unexpected response due to an invalid
presupposition on the system's part.<p>

<P><HR>

<pre>
Style 1.1:	S: What is your home phone number (including area code)?
		U: I don't have a phone.

Style 1.2:	S: Do you have a home phone number?
		U: Yes.
		S: What is your home phone number (including area code)?
		U: &lt;telephone number&gt;
</pre>
<h3> Figure 1: Questions containing presuppositions</h3>

<p><hr><P>

</a>

<a name="110343">

Alternatively, style 1.2 employs a ``guard question'' that reduces the
difficulty of interpreting a response where the precondition does not
hold. It also increases the length of the interaction, though it may
be relevant for only a fraction of the cases encountered. In those
cases, guard questions may reduce the chance of communication
breakdown. In the context of a large number of yes/no questions,
however, style 1.2 could become tedious for users. <p>

</a>
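The difference between styles 1.1 and 1.2 reduces to a simple generation rule: a guard question is a yes/no prompt prepended to the main question. The sketch below (function and parameter names are hypothetical, ours alone) produces either variant:

```python
def prompt_sequence(question, presupposition=None, use_guard=False):
    """Return the system prompts for one questionnaire item.

    Style 1.1: ask the question directly, ignoring the presupposition.
    Style 1.2: first ask a yes/no "guard question" verifying the
    presupposition, lengthening the dialogue but reducing the chance
    of an uninterpretable answer such as "I don't have a phone."
    """
    if use_guard and presupposition:
        return ["Do you have %s?" % presupposition, question]
    return [question]

q = "What is your home phone number (including area code)?"
style_1_1 = prompt_sequence(q)
style_1_2 = prompt_sequence(q, "a home phone number", use_guard=True)
```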

<a name="108260">

<h3> Questions Eliciting Compound Answers</h3>

</a>

<a name="108287">

Figure 2 shows different styles of structuring questions to elicit
compound, or multi-part, answers. Style 2.1 offers the fewest
constraints as to how the user may answer and, if successful, is the
quickest and most efficient. The lack of expressed constraints on the
form of the expected reply may not be too damaging where a standard
way of specifying the information is already well-established in the
minds of most users. Style 2.2 breaks the question down into an
explanatory sentence and several prompts, a pattern that style 2.3
takes one step further. The extreme of this approach would be to ask
for each digit of the number separately, clearly an arduous task
especially considering the costs of turn-taking. It is, however,
likely to be the style having the highest recognition accuracy. <p>

<P><HR>

<pre>
Style 2.1:	S: What is your home phone number?
		U: 503...um... 690...

Style 2.2:	S: We need to know your home phone number.
		S: What is the area code?
		U: 503
		S: and the number?
		U: &lt;tel number&gt;

Style 2.3:	S: We need to know your home phone number.
		S: What is the area code?
		U: 503
		S: and the exchange?
		U: 226
		S: and...

Style 2.4:	S: We need to know your home phone number.
		S: Please state the area code &lt;pause&gt; 3-digit exchange, and 4-digit...
		U: 503    226   2...

Style 2.5:	S: We need to know your home phone number.
		Please state your number, area code first.
		U: 503
		S: Mmm-hmm
		U: 690
		S: Yes.
		U: 1121
		S: 1121. Ok.
</pre>
<h3> Figure 2: Questions eliciting compound answers</h3>
</a>

<p><hr><P>


<a name="108802">

Style 2.4, like style 2.3, specifies each component the user is
expected to provide, sharing with style 2.3 the danger of confusing
people unfamiliar with the notion of a telephone ``exchange'' or, more
generally, the names of the individual components. Style 2.4
encourages the user, however, to supply all components within a single
turn at speech. The need to forestall extended repair sub-dialogues
may require that the system offer acceptances [4] of users' utterances
after the components of multi-part answers are received. Style 2.5
depicts such a case in which the system provides feedback in the form
of acknowledgments and echoing [2,10].<p>

</a>
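Styles 2.2 and 2.3 share one recipe: an explanatory sentence followed by one sub-prompt per component, with the granularity of the component list trading interaction length against recognition accuracy. A minimal sketch of that recipe (the function is our own illustration):

```python
def decompose(explanation, parts):
    """Render a compound question in the style-2.2/2.3 pattern:
    one explanatory sentence, then one sub-prompt per component.
    A finer-grained part list (area code / exchange / line number)
    lengthens the dialogue but should improve recognition accuracy."""
    prompts = [explanation]
    for i, part in enumerate(parts):
        # First component gets a full question; the rest continue it.
        prompts.append("What is the %s?" % part if i == 0
                       else "and the %s?" % part)
    return prompts

# Style 2.2: two components; style 2.3 would pass three.
style_2_2 = decompose("We need to know your home phone number.",
                      ["area code", "number"])
```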

<a name="106505">

<h3> Questions Involving Choice Between Two Options</h3>

</a>

<a name="106526">

The heuristic depicted in Figure 3 describes the different ways of
asking questions where only two responses are expected (for example
``Are you male or female?''). Style 3.1 invites uncooperative users to
answer ``Yes'' or ``No'', especially if minimal or non-intuitive
intonation is used in presenting the question. This may require a
clarifying repair sub-dialogue perhaps employing a style-3.2 type
interaction. <p>

<P><HR>
<pre>
Style 3.1:	S: A or B?
		U: B

Style 3.2:	S: A?
		U: No.
		S: B?
		U: Yes.

Style 3.3:	S: A?
		U: No.
		S: Then B, correct?
		U: Yes/No.
</pre>
<h3> Figure 3: Questions involving choice between two options</h3>
</a>

<p><hr><P>



<a name="108223">

Style 3.2 increases the number of interactions required in the average
case, subsequently increasing the overall survey time. Further, if the
two options are truly mutually exclusive, users may recognize the
overall intent of the series of questions and volunteer the answer to
the underlying question (e.g., U: ``No, I'm a B.''), or worse (U: ``If
I said I wasn't female, then what else could I be but male?''). In
both cases the variety and complexity of expressions that must be
recognized are greatly increased.<p>

</a>

<a name="106531">

<h3> Questions Involving Choice Among Three to Six Options</h3>

</a>

<a name="110224">

Figure 4 depicts a heuristic for multiple choice questions having more
than two but still only a few alternatives. We judge that among the
different treatments, style 4.1 is somewhat less natural than styles
4.2 and 4.3. This is especially true for questions having
stereotypical answers (e.g., ``What's your marital status?''
``Single''). It is less natural than style 4.2 because a
human operator can compensate for the user's not mentioning an option
name directly and can either interpret a response as indicating a
category, or can move toward a Style 4.3 interaction if necessary.
While style 4.1 may be expected to elicit more constrained responses,
it may suggest that the user cannot be trusted to recognize the
choices, an indication that may appear to be insulting or
condescending if obvious choices are spelled out.<p>

<P><HR><P>
<pre>
Style 4.1:	S: &lt;ask question, give options&gt;
		U: &lt;option-name&gt;

Style 4.2:	S: &lt;ask question without giving options&gt;
		U: &lt;option-name&gt;

Style 4.3:	&lt;transform question into series of sub-questions
		(a decision tree) having yes/no answers&gt;

Style 4.4: 	&lt;for each option, ask if it is the case&gt;

Style 4.5: 	Similar to style 4.3, except when the number of options
		is reduced to 2-3, ask for the option-name
</pre>

<h3> Figure 4: Questions involving choice among three to six options</h3>
<HR><P>

</a>

<a name="110246">

Style 4.2 constrains the response the least (e.g., S: ``What is your
current marital status?'') and would therefore be presumed to elicit
answers having greater variability (U: ``I've been living with X
for...''), again tending to reduce recognition accuracy.<p>

</a>

<a name="106556">

Style 4.4 takes the longest to complete but employs only yes/no
questions, as does style 4.3. Both invite the user to anticipate the
line of reasoning implied by the sequence and to volunteer the
response that the sequence of questions suggests, increasing
variability and reducing recognition accuracy.<p>

</a>

<a name="106557">

Style 4.5 may be a good compromise between styles 4.1 and 4.3,
allowing recognition of only a few keywords at any one time, without
the rigidity of a strict binary tree style.<p>

</a>

<a name="111266">

<h3> Questions Involving Choice Among More Than Six Options</h3>

</a>

<a name="106561">

The analysis here is similar to that for styles presented in the
previous section except that with more choices the problems become
more severe. Use of style 4.1 for more than six options may put a
severe strain on the user's short-term memory, while style 4.2 may
leave the user even more adrift as to what exactly constitutes a
proper answer. The decision tree of style 4.3 becomes deeper, though
not so quickly as the option-checking sequence of style 4.4, which
becomes clearly unnatural as the number of options increases.<p>

</a>

<a name="106572">

Figure 5 shows two more styles that may be useful in cases where there
are a large number of options. Style 5.1 is quite similar to style
4.5, with 5.1's ``other'' serving to help control the flow of the
dialogue. In styles 5.1 and 5.2, recognition of the word ``other'' must
already be in place. With style 5.1, the overall gains in automation
may be reduced by the need for human interpretation of the stored
``other'' responses.<p>

<P><HR>
<pre>
Style 5.1:	&lt;reduce problem to fewer options and include
		``other'', then use more choice-constrained
		heuristics, in the case of ``other'', either store
		what the user says for later interpretation, or ask
		the same question with the next group of options&gt;

Style 5.2:	S: &lt;ask question, give explanation of n-at-a-time
		style, loop through the options n at a time&gt;

		U:  &lt;option name, or special phrases for user
		initiated repair&gt;
</pre>

<h3> Figure 5: Questions involving choice among more than six options</h3>
<p><hr>
</a>

<a name="111362">

<h3> Encouraging Brief Answers</h3>

</a>

<a name="111363">

Figure 6 shows three different styles for eliciting brief, concise
answers. Of these, style 6.1 is quick and formal, though not
particularly ``friendly,'' and is likely to evoke a reasonably
focused response. Style 6.2 takes longer but is likely to elicit
fewer open-ended responses. It is also likely to be frustrating for
expert users. Style 6.3 is most natural in presentation but does
little to constrain the response. Style 6.3 might require increasing
the coverage of the grammar to accommodate more verbose or non-standard
responses, thereby decreasing recognition accuracy.<p>

<P><HR>
<pre>
Style 6.1: 	Give ``telegraphic'' questions. For example,

		S: Date of birth?

Style 6.2:	Explicitly state what information is wanted, and what
		form it should take as a parenthetical to the
		question. For example,

		S: ``We now ask about your date of birth. Please say
		the month, the day and then the year of your birth.''

Style 6.3:	Phrase question ``naturally'' and hope user provides a
		short, appropriate response. For example,

		S: ``What is your date of birth?''
</pre>

<h3> Figure 6: Encouraging brief answers</h3>

<p><hr>

</a>

<a name="111918">

<h3> Other Heuristics</h3>

</a>

<a name="106644">

In this section we briefly describe some additional heuristics that
serve to illustrate the breadth and utility of this approach. In
particular, we sketch the expected trade-offs of using: <p>

</a>

<ul>

<ul>

<a name="107408">

<li>different techniques for turn-taking

</a>

<a name="112011">

<li>more or less explanatory text in prompts,

</a>

<a name="107407">

<li>human versus computer voice,

</a>

<a name="107406">

<li>different personas (such as a spokesperson or, in the case of the
census, a particular census taker),

</a>

<a name="107405">

<li>faster or slower rate of speech, and

</a>

<a name="107402">

<li>stronger or weaker confirmation requirements after giving user
information.

</a>

</ul>

</ul>

<a name="106874">

<p>

</a>

<a name="106645">

It is difficult, using current speech recognition methods, to
accurately gauge when a user has finished his or her turn at speech.
Moreover, it is difficult to provide timely feedback to the user as to
whose turn it is. We have identified at least three possible
implementations of turn-taking. If the system employs ``natural''
intonation patterns to signal end-of-turn, it may encourage users to
encode information in intonation, possibly causing misunderstanding.
If it relies only on illocutionary expectations, the dialogue may be
vulnerable to communication breakdowns following turn confusion. If it
uses beeps or other tone patterns to indicate turn completion, it may
require some explanation to the user, increasing the number of
utterances made by the system.<p>

</a>
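The three turn-taking implementations can be summarized as alternative signaling strategies. The enumeration below is a hypothetical sketch, with each strategy's trade-off noted as described above:

```python
from enum import Enum, auto

class TurnSignal(Enum):
    """Three ways an SDS can mark the end of its own turn."""
    INTONATION = auto()   # natural, but may encourage users to encode
                          # information in their own intonation
    EXPECTATION = auto()  # relies on illocutionary expectations alone;
                          # vulnerable to breakdown after turn confusion
    TONE = auto()         # beep is unambiguous, but must be explained,
                          # increasing the number of system utterances

def needs_instruction(signal):
    # Only the explicit tone requires an extra explanatory utterance.
    return signal is TurnSignal.TONE
```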

<a name="112010">

For questions that require prior explanations, there are two general
styles: 1) provide as short an explanation as possible, or 2) provide
longer explanations. Longer or more frequent explanatory text
describing the intent of the question or the form of the expected
answer tends to increase the output time and the output vocabulary.
Increasing the output vocabulary may serve to entrain users into
believing the system is able to recognize a large vocabulary, leading
them to use out-of-vocabulary keywords or complex grammatical
constructs.<p>

</a>

<a name="106646">

In the case of the system's voice (either recorded human or computer
synthesized), we expect to find that users react negatively to the use
of synthesized speech. Not only is such technology not ``natural,''
but it is often difficult for human hearers to understand. We expect,
however, that users will provide more concise answers when prompted by
a synthesized voice. In the course of developing the census system, this
heuristic was tested [11] with mixed results.<p>

</a>

<a name="106647">

Related to system voice is the choice of the persona within which the
system interacts [9]. Although we make no clear prediction as to the
effects of varying the persona on speech recognition accuracy, the
choice of persona may affect users' acceptance of the system.
Different personas in our case included the government, a census
taker, or a spokesperson. In the census project, the system persona
was an anonymous census enumerator.<p>

</a>

<a name="106648">

An area not explicitly tested in the census project was to vary the
rate of speech of the system voice. On one hand, we predict that
faster speech may be more compelling but entrains users to use faster
speech in response, possibly degrading speech recognition accuracy. A
slower rate of speech, on the other hand, may increase user
frustration and lead to users interrupting (or ``barging in on'') the
system voice, again degrading recognition accuracy. <p>

</a>

<a name="106649">

Finally, as the census project was concerned primarily with asking
questions, we did not develop extensive heuristics addressing how best
to convey information to, or answer questions of, the users. Where the
objective is primarily to convey information to the user, the quickest
style for presenting information would be simply to present it and go
on to the next stage of the dialogue. If it were critical that the
information be understood, the system might ask for confirmation and
go on if confirmed. Alternately, if the system detected silence or
sounds indicating that the user was uncertain or did not understand,
it could present the information again or inquire as to possible
sources of misunderstandings.<p>

</a>

<a name="106679">

<h2> Dimensions of Prompts</h2>

</a>

<a name="106681">

Although the heuristics described above were useful in the initial
stages of our project, they still did not capture what we term the
dimensions of spoken prompts. By examining the ways in which styles
varied within a single heuristic, we were able to identify different
``features'' that characterized the various styles. By examining
features used in different heuristics, we distinguished which features
were in opposition. These mutually exclusive features formed points
within a single dimension. <p>

</a>

<a name="106682">

The dimensions may be thought of as naming a way of varying a system
prompt. The dimension PreExplanation, for example, denotes the degree
to which the intent behind the prompt is described to the user before
the question is actually given. Although in this case, as in many of
the other dimensions, a whole continuum could be imagined, we often
limited our analysis to polar opposites (e.g. +PreExplanation and
-PreExplanation). In other cases, such as Decomposition, ordering the
points within the dimension was less clear.<p>

</a>

<a name="106683">

By revisiting the various styles for each of the heuristics, we
identified a set of dimensions characterizing the phrasing of system
prompts. These dimensions include the following ten:<p>

</a>

<ul>

<ul>

<a name="106684">

<li>PreExplanation. Should preparatory text be included before posing
the question(s)?

</a>

<a name="106685">

<li>Terse. Should the question be posed as tersely as possible?

</a>

<a name="106686">

<li>ListOptions. Should we list the set of words from which we expect
an answer?

</a>

<a name="106687">

<li>CompoundQuestion. Should we break the question down into its
component parts or leave it as one question?

</a>

<a name="106688">

<li>Polite. Should the question be phrased politely [9]?

</a>

<a name="106689">

<li>Decomposition.  Should we break down the selection from a list of
options into a decision tree, a partial decision tree, a decision
list, or not at all?

</a>

<a name="106690">

<li>AllowOther. Should we formulate the question so as to allow the
user to specify ``other'' as an option?

</a>

<a name="106691">

<li>Indirection. Should we ask the question indirectly (for example,
``Could you spell that?''), or should we require that questions be posed
directly, perhaps as commands (for example, ``Please spell that.'')?

</a>

<a name="106692">

<li>GiveOptionName. Should we mention the ``name'', or topic, of the
information desired? (For example, ``Are you married or single?'' does
not mention ``marital status''.)

</a>

<a name="110711">

<li>GuardQuestion. Should we ask initial questions to rule out
incorrect presuppositions?

</a>

</ul>

</ul>

<a name="110700">

<p>

</a>

<a name="110719">

In addition, we identified a number of dimensions characterizing
the interaction as a whole, including:<p>

</a>

<ul>

<ul>

<a name="111528">

<li>Voice (human or synthesized),

</a>

<a name="111529">

<li>Intonation (minimal or natural),

</a>

<a name="111530">

<li>Persona, and

</a>

<a name="111531">

<li>Turn-taking cues.

</a>

</ul>

</ul>

<a name="111532">

<p>

</a>

<a name="111533">

In total, these dimensions define a fourteen-dimensional space of
system prompts.<p>

</a>

<a name="106696">

<h2> Dialogue Styles</h2>

</a>

<a name="106698">

Having defined the dimensions of spoken prompts in terms of the
features of styles, we can now define more formally the concept of a
style as being a collection of points within a number of dimensions.
Since each style within a heuristic uses only a few of the identified
dimensions, a style can be described as a region in the space of
possible ways of expressing prompts. Figure 7 shows different styles
(described in terms of features) for eliciting the user's marital
status. Again, not all dimensions are explored equally. Instead, we
examine those regions in the space of system prompts that best suit
the needs of our dialogue evaluation effort. <p>

<P><HR><P>
<PRE>
Style 1:	(+Terse, -PreExplanation, -ListOptions)
		S: Marital status?

Style 2: 	(+Terse, -PreExplanation, +ListOptions)
		S: Marital status? Now married, widowed, divorced, separated, 
		or never married?

Style 3: 	(-Terse, -PreExplanation, +ListOptions)
		S: What is your marital status, now married, widowed,
		divorced, separated, or never married?

Style 4: 	(+PartialDecisionTree, -Terse, +ListOptions, -PreExplanation)
		S: Are you now married (yes or no)?
		if no, then 
		S: Have you ever been married (yes or no)?
		if yes, then 
		S: Were you widowed, divorced or separated (please say one)?

Style 5: 	(+PreExplanation, +ListOptions, -Terse, +GiveOptionName)
		S: The next question will determine your marital
		status. The categories are: now married, widowed,
		divorced, separated, and never married.  What is your
		marital status?
</pre>
<h3> Figure 7: Examples of styles for marital status question</h3>

<p><hr><P>
</a>

<a name="106923">

One of the advantages of using styles defined in terms of features is
that it allows us to characterize the overall style of the interaction
rather than limiting our analysis to identifying the style of a single
prompt. We thus define the overall stylistic consistency of an SDS as the
property of a dialogue in which the styles associated with each prompt
do not conflict.<p>

</a>
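The consistency property just defined lends itself to a direct
computational check. The sketch below is hypothetical (the paper
describes no such implementation); the signed-feature notation follows
Figure 7, but the functions are ours.<p>

```python
# Hypothetical sketch: a style is a set of signed features (a region in
# the space of prompt dimensions); two styles conflict when one contains
# the opposite of a feature in the other. Function names are ours.

def opposite(feature):
    """'+Terse' <-> '-Terse'."""
    sign, name = feature[0], feature[1:]
    return ("-" if sign == "+" else "+") + name

def conflicts(style_a, style_b):
    """True when some feature of style_a is negated in style_b."""
    return any(opposite(f) in style_b for f in style_a)

def dialogue_is_consistent(styles):
    """An SDS is stylistically consistent if no two prompt styles conflict."""
    return not any(conflicts(a, b) for a in styles for b in styles)

# Styles 1 and 3 from Figure 7 disagree on Terse and ListOptions:
style_1 = {"+Terse", "-PreExplanation", "-ListOptions"}
style_3 = {"-Terse", "-PreExplanation", "+ListOptions"}
```

On these two styles, `conflicts(style_1, style_3)` holds, so a dialogue
mixing them would not be stylistically consistent under this definition.<p>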

<a name="106174">

<h1> EVALUATING DIALOGUE STYLES</h1>

</a>

<a name="107644">

In our development of the census dialogue model, we went through
several iterations of dialogue design, testing up to four competing
designs to determine which worked best. In order to converge quickly
on reasonable solutions, we started by identifying criteria that the
various dialogue models should meet. In particular, we considered
potential dialogues that were (a) closest to original form, (b) most
constrained, (c) most ``natural'', (d) clearest to the hearer, (e)
tersest, (f) most polite, (g) most open-ended, and (h) most
recognizable. By identifying the features that best met each
criterion, we were able to characterize the region in the space of
dialogue prompts best suiting our needs. <p>

</a>

<a name="111313">

The iterative approach required us to produce a method for assessing
the merit of each design as a basis for further refinement. We
addressed this problem from two perspectives: accuracy of recognition
and naturalness of interaction. To evaluate our dialogue designs we
used an objective measure of the conciseness of users' responses in
combination with a subjective measure of naturalness as reflected in
users' feedback to evaluation questions. Together these metrics
supplied grounds for making a wide range of dialogue design decisions,
including evaluating candidate styles. In addition, these evaluation
metrics provided a means to test the predictions made by our
heuristics. These predictions effectively narrowed the search space of
subsequent prompt refinements.<p>

</a>

<a name="112288">

We now briefly present a behavioral coding scheme, a subjective
evaluation metric, and some results from using our approach to
dialogue development for the census system.<p>

</a>

<a name="106180">

<h2> Behavioral Coding Scheme</h2>

</a>

<a name="106181">

The need to refine our system prompts so as to elicit only the most
concise and recognizable user responses led us to develop a behavioral
coding scheme (BCS) as an evaluation metric [11]. The BCS assigns a
user's utterance to one of eleven classes. Each class
has an associated code which is used to label users' responses during
transcription. Table 1 provides a summary of the behavioral coding
scheme showing the eleven response classes, a brief description of
each, and an example system prompt and user response.<p>

</a>

<P><HR><P>
<a name="996750">
<pre><strong>Response Class             Description                                      System prompt                      User response</strong>

Adequate Answer 1          Answer is concise and responsive.                Have you ever been married?        Yes
Adequate Answer 2          Answer is usable but not concise.                Have you ever been married?        No I haven't
Adequate Answer 3          Answer is responsive but not usable.             Have you ever been married?        Unfortunately
Inadequate Answer 1        Answer does not appear to be responsive.         What is your sex, female or male?  Neither
Inadequate Answer 2        User says nothing at all.                        What is your sex, female or male?  &lt;silence&gt;
Qualified Answer           User expresses uncertainty.                      What year were you born?           Nineteen fifty five I think
Request for Clarification  User requests clarification of the meaning      Are you black, white or other?     What do you mean?
                           of a question.
Interruption               User interrupts the speaking of the question.    What year were you born?           *teen fifty five
Don't Know                 User responds ``I don't know'' or equivalent.    Are you black, white or other?     I'm not sure
Refusal                    User refuses to answer.                          What year were you born?           I'm not telling you
Other                      User behavior not captured by the above codes.   What year were you born?           Thirt... &lt;noise&gt;
</pre>
</a>
<a name="996747">
<h3> Table 1: Summary of behavioral coding scheme</h3>
</a>

<p><hr><P>

<a name="106183">

The BCS can characterize a set of utterances; the distribution of BCS
codes associated with responses to a given question in different
treatments, or regions of the space of dialogue prompts, can be used
as a basis for evaluation. For example, suppose we have three
candidate prompt styles and wish to select the one that is the most
constraining. First, we collect data for the three prompt styles, then
label these data according to the BCS. Comparing the frequency of
class ``Adequate Answer 1'' for the three styles shows which style
elicited the most constrained responses.<p>

</a>
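The selection procedure described above amounts to comparing BCS code
frequencies across treatments. A minimal sketch follows; the response
data are invented for illustration, and the code abbreviations (``AA1''
for Adequate Answer 1, and so on) are our own shorthand.<p>

```python
from collections import Counter

# Compare candidate prompt styles by the frequency of concise, responsive
# answers (abbreviated "AA1" for Adequate Answer 1). The BCS-labelled
# responses below are invented for illustration.

labelled = {
    "style_A": ["AA1", "AA1", "AA2", "RC", "AA1"],
    "style_B": ["AA1", "AA2", "AA2", "IA1", "AA1"],
    "style_C": ["AA1", "AA1", "AA1", "AA1", "AA2"],
}

def aa1_rate(codes):
    """Fraction of responses labelled AA1."""
    return Counter(codes)["AA1"] / len(codes)

# The style eliciting the most constrained responses:
most_constraining = max(labelled, key=lambda s: aa1_rate(labelled[s]))
```

With these data, `style_C` would be selected, with an AA1 rate of 0.8
against 0.6 and 0.4 for the others.<p>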

<a name="106184">

<h2> Subjective Evaluation Questions </h2>

</a>

<a name="106185">

Given the potential trade-off between recognition accuracy and
naturalness of interaction, reliance on the BCS as our sole criterion
when designing prompts might lead us to dialogues that were very
effective from the standpoint of eliciting highly recognizable
responses but rather awkward or frustrating for users. We therefore
balanced the behavioral coding evaluation of prompt styles with
feedback from users. We solicited this feedback through evaluation
questions, presented at the end of the questionnaire, that gave users the
opportunity to express their likes and dislikes regarding any aspect
of the dialogue, including question topics, the wording of prompts,
and the manner in which the prompts were presented. <p>

</a>

<a name="112059">

<h2> Results of Evaluations</h2>

</a>

<a name="112139">

We used the behavioral coding scheme (and its predecessor versions),
task completion rates, and responses to evaluation questions in three
formal rounds of dialogue development in the Census project. The first
round (based on roughly 100 callers) involved comparisons of the
strongest differences among three overall styles. Our evaluation
enabled us to pursue only those designs that elicited constrained
answers and were generally acceptable to users.<p>

</a>

<a name="112152">

Subsequent rounds focused on increasingly smaller differences among
dialogue styles and thus required greater numbers of respondents. By
round three (involving nearly 4000 callers) the differences between
proposed dialogue designs were quite small, and concise (AA1) responses
were elicited over 90 percent of the time for all census questions.<p>

</a>

<a name="106187">

<h1> DEPLOYMENT OF STYLES IN CSLURP TOOLKIT</h1>

</a>

<a name="106188">

The styles developed in the census project are proving useful in a
broad range of applications. As part of ongoing research in SDSs, we
have incorporated the notion of dialogue style into a toolkit [1] for
creating spoken-language applications. This toolkit provides
state-of-the-art speaker- and vocabulary-independent spoken-language
recognition technology allowing developers to design, test and deploy
spoken language interfaces rapidly for useful (real world)
applications. The toolkit greatly simplifies the process of specifying
an SDS by use of the Center for Spoken Language Understanding's rapid
prototyper (CSLUrp), a graphically-based SDS authoring environment.
CSLUrp currently provides a small set of style templates which a
developer may use to generate a prompt. A corresponding template is
displayed and slots in the template are filled with current vocabulary
items. <p>

</a>

<a name="110563">

If, for example, a dialogue designer were given the task of developing
an automated pizza-ordering system and needed to generate a prompt to
elicit the size of pizza the user wanted, he or she would first specify
the vocabulary to be recognized (small, medium, or large) and would
then specify the style to be used in expressing the question (Polite1,
Polite2, or Terse). CSLUrp would then generate a prompt incorporating
both the vocabulary words and the style specification. Table 2 shows
the prompts generated in this case. After the prompt has been
generated, the designer is free to modify the text to better serve the
situation. Although its implementation of our framework is incomplete,
the repertoire of styles provided by CSLUrp has been used in the
development of a variety of SDSs, including e-mail browsers, ordering
and other form-filling systems, and even games.<p>

</a>

<P><HR><P>
<a name="996752">
<pre><strong>Style		Generated prompt</strong>

Polite1		Please choose one of the following options: small, medium or large.
Polite2		Please say: small, medium or large.
Terse		Small, medium or large?
</pre>
</a>
<a name="996768">
<h3> Table 2: CSLUrp generated prompts</h3>
</a>

<p><hr><P>
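The template-filling step that produces the prompts in Table 2 can be
sketched as follows. This is our own illustrative reconstruction, not
CSLUrp's actual implementation; only the style names and the generated
text come from Table 2.<p>

```python
# Hypothetical reconstruction of style-template slot filling: each style
# is a template whose slot is filled from the current vocabulary items.
# Style names and output text follow Table 2; the code itself is ours.

STYLE_TEMPLATES = {
    "Polite1": "Please choose one of the following options: {options}.",
    "Polite2": "Please say: {options}.",
    "Terse":   "{options_cap}?",
}

def generate_prompt(style, vocabulary):
    """Fill the chosen style template; assumes at least two vocabulary items."""
    options = ", ".join(vocabulary[:-1]) + " or " + vocabulary[-1]
    return STYLE_TEMPLATES[style].format(options=options,
                                         options_cap=options.capitalize())

print(generate_prompt("Terse", ["small", "medium", "large"]))
# -> Small, medium or large?
```

As in CSLUrp, the generated text is only a starting point; the designer
remains free to edit it by hand to better serve the situation.<p>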



<a name="107897">

<h1> CONCLUSION</h1>

</a>

<a name="110038">

In this paper we have presented a set of heuristics describing
different styles of transforming a written questionnaire into a form
usable with a SDS. We have identified a set of features that
characterize these styles and a set of dimensions that cover and
contain the feature set. Taken together, the fourteen dimensions we
have presented define a space of system prompts. We refine the notion
of ``style'' as a set of features characterizing a prompt;
alternatively, a prompt style can be defined as a region in the space
of prompts.<p>

</a>

<a name="111457">

In order to evaluate the degree to which different prompt styles
constrain users' responses while maintaining a sense of natural
interaction, we have devised a behavioral coding scheme. Additionally,
we have sketched our initial efforts at incorporating prompt styles in
a rapid-prototyping toolkit. <p>

</a>

<a name="112662">

Further integration of dimensions into the dialogue design
process, especially during prompt specification and dialogue
critiquing, offers a promising area for future research. It is our hope
that others will find this framework helpful, and we invite dialogue
designers to develop their own heuristics, features, dimensions, and
styles.<p>

</a>

<a name="106235">

<h1> ACKNOWLEDGMENTS</h1>

</a>

<a name="112580">

This research was funded by the U.S. Bureau of the Census, U S WEST,
the Office of Naval Research, the National Science Foundation, ARPA
and the OGI CSLU.<p>

</a>

<a name="112581">

<h1> REFERENCES</h1>

</a>

<a name="112205">

Colton, D., Cole, R., Novick, D., &amp; Sutton, S. A laboratory course
for designing and testing spoken dialogue systems, Proceedings of
ICASSP-96, Atlanta, GA, May, 1996 (in press).<p>

</a>

<a name="106242">

Clark, H. &amp; Brennan, S.<em> </em>Grounding in communication,
<em>Shared Cognition: Thinking as Social Practice</em>, APA Books
(1991).<p>

</a>

<a name="106243">

Clark, H. &amp; Marshall, C. Definite reference and mutual knowledge,
<em>Elements of Discourse Understanding</em>, Cambridge University
Press, 1981, 10-63.<p>

</a>

<a name="111057">

Clark, H. &amp; Schaefer, E. Contributing to discourse, <em>Cognitive
Science</em>, 13 (1989), 259-294.<p>

</a>

<a name="110452">

Cole, R., Hirschman, L., et al. Workshop on Spoken Language
Understanding, Technical Report CSE92-014, Department of Computer
Science and Engineering, Oregon Graduate Institute, 1992.<p>

</a>

<a name="110458">

Cole, R., Novick, D.G., Burnett, D., Hansen, B., Sutton, S. &amp;
Fanty, M. Towards automatic collection of the U.S. Census,
<em>Proceedings of the 1994 International Conference on Acoustics,
Speech and Signal Processing (</em>1994), I:93-96.<p>

</a>

<a name="112213">

Kamm, C. User interfaces for voice applications, Voice Communications
between Humans and Machines, National Academy Press, 1994, 422-442.
<p>

</a>

<a name="110539">

Marshall, C., &amp; Novick, D. G. Conversational effectiveness in
multimedia communications, Information, Technology &amp; People, 8 (1)
(1995), 54-79.<p>

</a>

<a name="112222">

Nass, C., Steuer, J. &amp; Tauber, E.R. Computers are social actors,
Proceedings of Computer Human Interaction (1994), 72-78.<p>

</a>

<a name="110943">

Novick, D. &amp; Hansen, B. Mutuality strategies for reference in
task-oriented dialogue, <em>Twente Workshop on Language Technology,
Corpus-Based Approaches to Dialogue Modeling </em>(TWLT 9), Enschede,
The Netherlands (June, 1995), 83-93.<p>

</a>

<a name="110462">

Novick, D.G., Sutton, S., Vermeulen, P. &amp; Fanty, M. <em>Rapid
design and deployment of spoken dialogue systems</em>, Technical
Report CSLU95-008, Department of Computer Science and Engineering,
Oregon Graduate Institute, 1995.<p>

</a>

<a name="112209">

Rudnicky, A., Hauptmann, A. &amp; Lee, K. Survey of current speech
technology, Communications of the ACM (March 1994), 37(3), 52-57.<p>

</a>

<a name="112217">

Schmandt, C. Voice communication with computers, Van Nostrand
Reinhold, 1994.<p>

</a>

<a name="110466">

Sutton, S., Hansen, B., Lander, T., Novick, D.G. &amp; Cole, R.
<em>Evaluating the effectiveness of dialogue for an automated spoken
questionnaire</em>, Technical Report CSE95-12, Department of Computer
Science and Engineering, Oregon Graduate Institute, 1995.<p>

</a>

<a name="110530">

Ward, W. The CMU Air Travel Information Service: Understanding
spontaneous speech, Proceedings of the DARPA Speech and Natural
Language Workshop (1990), 127-129.<p>

</a>

<a name="112070">

Yankelovich, N., Levow, G. &amp; Marx, M. Designing SpeechActs: Issues in
speech user interfaces, Proceedings of Computer Human Interaction
(1995), 369-376.<p>

</a>



<p><hr>



<h5>Last Modified: 02:03pm PST, January 04, 1996</h5>

</body>

</html>



