MailCall: Message Presentation and Navigation in a Nonvisual Environment

**Matthew Marx^* and Chris Schmandt**

groucho@media.mit.edu, geek@media.mit.edu

MIT Media Laboratory

E15-252, 20 Ames St., Cambridge, MA 02139

^* Currently affiliated with Applied Language Technologies: Cambridge Massachusetts, 617.225.0012.

ABSTRACT

MailCall is a telephone-based messaging system using speech recognition and synthesis. Its nonvisual interaction approaches the usability of visual systems through a combination of intelligent message categorization, efficient presentation, and random-access navigation. MailCall offers improved feedback, error-correction, and online help by considering the conversational context of the current session. Studies suggest that its nonvisual approach to handling messages is especially effective when the user has a large number of messages.

KEYWORDS: Auditory I/O, interaction design, mobile computing, speech recognition, speech interface design.

INTRODUCTION

The telephone is an ideal remote access tool for those who need to keep abreast of incoming messages. Unlike laptop computers and PDAs, which can be unwieldy or difficult to connect to the network, the telephone is ubiquitous and necessarily networked. Smaller size and better networking capabilities will make PDAs increasingly attractive, especially because of their graphical user interfaces, but they will never be ideal in situations where the user's hands or eyes are busy, such as driving. Speech-driven messaging systems, however, will continue to be indispensable in such situations, which compose much of the mobile user's activity. Thus nonvisual systems are important both because of their current practicality over the telephone and because of the fundamental suitability of audio-only interaction to hands/eyes-busy situations.

The nonvisual modality, however, introduces obstacles for the user interface designer. The slow, serial, and transient nature of speech makes finding important messages not only frustrating but also expensive if calling long-distance or from a cellular phone. Furthermore, a system that gives repetitive or superfluous feedback can feel mechanical and unfriendly.

An effective nonvisual messaging system must support rapid access by giving an overview of the information space and supporting random access to individual messages. Feedback must be accordingly efficient and appropriate. Since, as will be discussed, these interactional demands exceed the capabilities of touch-tone input, speech recognition must be incorporated. Thus the system must wrestle with additional issues regarding helping users learn the command set as well as handling recognition errors.

This paper describes the design of MailCall, a telephone-based messaging system that supports in a nonvisual environment the flexible message access available in visual messaging systems. MailCall categorizes incoming messages by importance, summarizes categories, and allows random access. It also strives to meet user expectations of conversational interaction by customizing feedback according to conversational context, tracking recognition errors, and offering online help.

RELATED WORK

Phoneshell [6] offers telephone-based access via touch-tones to incoming messages as well as rolodex, calendar, news, weather, and traffic. Since many users receive dozens of messages a day, Phoneshell supports rule-based filtering to group messages into categories such as "important" and "mass mailings." The user is principally restricted to sequential navigation, however--either reading the next message or the previous one, and many find it tedious to process a long list of messages.

Chatter [4] used speech recognition to allow the user to retrieve email messages, send voice messages, look up information in a personal rolodex, place outgoing calls, and ask the location of other Chatter users. Messages were presented in order of relevance based on the user's past usage. Chatter used a sophisticated discourse model to track the conversation, although it did not handle recognition errors.

SpeechActs [9] combines the conversational style of Chatter with the broad functionality of Phoneshell, offering a speech interface to mail, calendar, stock quotes, and weather forecasts. SpeechActs improves upon sequential navigation by allowing the user to access messages by number (e.g., "read message 17"). Recognition errors are addressed by SpeechActs, which explicitly verifies requests that are irreversible (e.g., "delete message") and offers progressively more detailed assistance when the system fails to understand the user.

The Wildfire electronic assistant is a commercial system that screens and routes incoming calls, schedules reminders, and retrieves voice mail (but not email). Wildfire allows the user to sort messages, but navigation is chiefly sequential.

Phoneshell, Chatter, SpeechActs, and Wildfire all provide remote access to messages, but none of them offers interaction comparable to what users enjoy with a GUI mail reader. Phoneshell and Chatter prioritize messages but neither summarize them nor allow random access. SpeechActs scans message headers and allows the user to pick out a message by number, but remembering the number of a message adds to the user's cognitive load. Wildfire allows the user to ask if there are messages from people in the rolodex but does not summarize incoming messages or allow users to pick from them at random. To improve upon previous work, MailCall must support methods of finding important messages quickly without imposing excessive cognitive demands on the user.

MAILCALL

MailCall is a telephone-based messaging system that employs speech recognition for input and speech synthesis for output. It was developed on a Sun Sparcstation 20 under both SunOS 4.1.3 and Solaris, using the DAGGER speech recognizer from Texas Instruments and DECtalk for text-to-speech synthesis. Call control is facilitated by XTL, ISDN software from Sun Microsystems. The following paragraphs give a high-level summary of MailCall's behavior and capabilities; see [5] for a complete description of MailCall's functionality, recognition vocabulary, and software architecture.

Unified voice/text message retrieval. MailCall retrieves and categorizes incoming email and voice mail messages. The user can ask the sender, subject, arrival time, or recipients of any message. Audio attachments are processed and played as sound files, and email notification sent by the voice mail system acts as a pointer to the original voice message. Messaging is "unified" in that there is no differentiation by media; email and voice mail from the same person are grouped together.

Sending messages. The user can send a voice message in reply to any message, or to anyone in the rolodex. If the recipient is a local voice mail subscriber, it will be placed in the appropriate mailbox; if not, then it is encoded--available formats include MIME, NextMail, Sun, and uuencode--and sent as electronic mail. (Speech recognition is not yet accurate enough to transcribe without editing, so dictating replies to be sent as text is not supported.)

Voice Dialing. Instead of sending a voice message, the user may elect to place a call. If the person's phone number is available in the rolodex, MailCall uses it--and if there is both a home and work number, MailCall prompts the user to choose one or the other. If the phone number cannot be found, the user is prompted to enter it.

Usage Scenarios

MailCall is especially useful when the user has a long list of incoming messages or when time is short. In the first example, Pat is in a hurry trying to meet a deadline, waiting for Lynn to send a critical piece of information. MailCall enables Pat to check for it quickly.

MailCall: You have seven personal messages.
Pat: Is there anything from Lynn Smith?
MailCall: Sorry, no messages from Lynn Smith.
Pat: How about Renee Foster?
MailCall: Nothing from Renee Foster, either.
Pat: Good-bye.

In a second example, Pat calls up after a marathon afternoon meeting to find over 100 new messages. MailCall enables Pat to hear the most important ones first.

MailCall: Welcome to MailCall. Who's there?
Pat: It's Pat Peterson again.
MailCall: Pat Peterson, correct?
Pat: Correct.
MailCall: Hello, Pat. Please say your six-digit password.
Pat: 5-6-5-5-1-9.
MailCall: Hold on while I get your messages...Just a second...Okay! You have 16 timely messages, 33 personal ones, and 150 others.
Pat: Is there anything from Lynn Smith?
MailCall: Personal messages from Lynn Smith. Message 25 from Lynn Smith about "here's the information you requested."
Pat: When did that arrive?
MailCall: It arrived about three hours ago.
Pat: Send a reply.
MailCall: Record your message for Lynn Smith. Pause for several seconds when finished.
Pat: Lynn, thanks for finally sending the info, but it was a bit too late. Let's try to coordinate a little better next time.
MailCall: Got it. Review the message, send it, or cancel?
Pat: Send it.
MailCall: Did you say "send it"?
Pat: Yes.
MailCall: Sending your message. Please wait... What now?
Pat: Go to my timely messages.
MailCall: Timely messages. I can read the summary, or you can start with the first one.
Pat: Read me the summary.
MailCall: You have a bunch from Chris Barry, a few from Mark Matthews, one each from Deb Cohen and Dominique Farnsworth, and many others.
Pat: Read me the ones from Mark Matthews.
MailCall: Messages from Mark Matthews. Message 5 from Mark Matthews about "can we meet?"
Pat: Let me hear it.
MailCall: (playing voice message) Hey Pat, I've got a great new pyramid scheme I'd like to get you involved in. We'll make millions!
Pat: Delete it.
MailCall: Did you say to delete the message?
Pat: That's correct.
MailCall: Deleting message 5.
Pat: Go to my other messages.
MailCall: Other messages. Read the summary, or read the first header?
Pat: Read me the summary.
MailCall: You have an incredible number from Mail Delivery Subsystem, a few from Fran Tess, several from Joyce Ball, one from listserv@mit.edu, and several others.
Pat: Read the one from listserv at MIT dot e d u.
MailCall: Message 121 from listserv@mit.edu about "your request for privacy watch." You have been added to the privacywatch mailing list.
Pat: That's all for now.
MailCall: Did you say to hang up?
Pat: Yep.
MailCall: Pat, thanks for calling. Good-bye!

As the above examples demonstrate, MailCall enables the user to find important messages quickly. Its organization of incoming messages into categories makes it feasible to provide efficient presentation, which in turn affords effective navigation of the information space. Further, its attention to conversational context helps establish its credibility as a cooperative conversant. The next two sections describe MailCall's approach to nonvisual information management and its strategies for crafting conversation.

DESIGN CHALLENGE: NONVISUAL NAVIGATION

Retrieving messages over the phone is more cumbersome than with a GUI-based mail reader. With a visual interface, the user can immediately see what messages are available and access the desired one directly via point and click. In a nonvisual environment, however, a system must list the messages serially, and since speech is slow, care must be taken not to overburden the user with long lists of choices. Indeed, a recent empirical study of speech vs. GUI-based message retrieval [8] revealed that a major problem with the otherwise well-received speech interface was finding important messages. The following three sections describe how MailCall supports effective nonvisual messaging through a combination of intelligent filtering, efficient presentation, and random access navigation.

Information organization

A first step towards effective message management in a nonvisual environment is prioritizing and categorizing messages. Like other mail readers, MailCall filters incoming messages based on a user profile that contains a set of rules for assigning messages to various categories or deleting them entirely. Although rule-based filtering is powerful, writing rules to keep up with dynamic user interests can require significant user effort. We considered memory-based reasoning (as adopted by Chatter) to lighten the burden on the user, but decided against it since the associated learning curve can be too slow to capture quickly changing interests.

Our approach to finding important messages exploits the wealth of information that can be present in a desktop environment. The user's calendar, for instance, keeps track of timely appointments, and a record of email the user has sent suggests people from whom the user might be awaiting replies. MailCall exploits these information sources through a background process, CLUES, that scans various databases and automatically generates filtering rules. (For details on the mechanics of CLUES, see [5].) The rules generated by CLUES are available to other applications as well; both Phoneshell [6] and an HTML-based visual mail reader described in [5] utilize CLUES' filtering rules to categorize messages.

The rules generated by CLUES are integrated into an existing filtering system; thus the user can benefit from both static rules in the filtering profile and dynamic ones generated automatically by CLUES. Below are described several information sources and how CLUES uses them to prioritize messages. The elements that match entries in the information sources are printed in bold.

Calendar
- Monday
  - 10am Motorola
  - 5pm leave for airport
- Tuesday
  - visiting Sun all day
  - fax: 415/555-5555

Rolodex card 32 of 89
- Name: Kim Silverstone
- Address: 25 Harshwood Way, Palo Alto CA 94306
- Phone: (415) 555-5555
- Email: ksilvers@eng.sun.com

Outgoing email
- To: geek@media.mit.edu Subject: latest draft of paper
- To: chi96@sigchi.acm.org Subject: deadline extension?

Outgoing phone calls
- dialed 215-555-5555 (Charles Hemphill)
- dialed 8-5956 (unknown)

Figure 1: Various information sources available in the user's computing environment, including calendar, rolodex, and records of outgoing email and phone calls.

Calendar. Assuming an "interest window" of approximately two weeks ahead of the current date and a few days back, CLUES extracts salient items from calendar entries. Thus the following message is marked even though it was not addressed directly to the user.

From lad@media.mit.edu
To: demo-staff@media.mit.edu
Subject: visitors from Motorola delayed until 3:30
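The calendar rule can be illustrated with a minimal sketch. The paper does not say how CLUES decides which calendar words are salient, so the stop-word list, the exact window sizes, and the function names below are assumptions made for illustration only.

```python
from datetime import date, timedelta

# Hypothetical stop words; the paper does not describe how salient items are chosen.
STOP_WORDS = {"am", "pm", "for", "all", "day", "leave", "visiting", "the"}


def calendar_keywords(entries, today, days_ahead=14, days_back=3):
    """Collect candidate keywords from calendar entries inside the interest window.

    entries is a list of (date, text) pairs. The defaults approximate the
    "two weeks ahead, a few days back" window described in the text.
    """
    start = today - timedelta(days=days_back)
    end = today + timedelta(days=days_ahead)
    keywords = set()
    for when, text in entries:
        if not (start <= when <= end):
            continue  # outside the interest window
        for word in text.split():
            token = word.strip(".,:;!?\"'").lower()
            # Skip times such as "10am" and common filler words.
            if token.isalpha() and token not in STOP_WORDS:
                keywords.add(token)
    return keywords


def matches_calendar(subject, keywords):
    """Return True if any word of the subject line is a calendar keyword."""
    words = {w.strip(".,:;!?\"'").lower() for w in subject.split()}
    return bool(words & keywords)
```

With the Monday entries from Figure 1, a subject line mentioning "Motorola" would match, while an unrelated subject would not.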

Email replies. CLUES scans the user's record of outgoing messages to see who might be sending a reply and what they might be replying about. For instance, CLUES would automatically mark the following message.

From chi96@sigchi.acm.org
Subject: Re: deadline extension?
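As a rough sketch of this rule, an incoming message could be compared against the record of outgoing mail by sender and by subject. Treating either match as sufficient is an assumption, as are the dictionary field names; the paper only states that CLUES looks at who might reply and what they might be replying about.

```python
def normalize_subject(subject):
    """Strip leading "Re:" prefixes and case so replies compare equal to originals."""
    text = subject.strip()
    while text.lower().startswith("re:"):
        text = text[3:].strip()
    return text.lower()


def is_expected_reply(message, outgoing):
    """Return True if the message looks like a reply to something the user sent.

    message has "from" and "subject" keys; each outgoing record has "to" and
    "subject" keys (hypothetical field names).
    """
    sender = message["from"].lower()
    subject = normalize_subject(message["subject"])
    for sent in outgoing:
        if sender == sent["to"].lower():
            return True  # someone the user wrote to is writing back
        if subject == normalize_subject(sent["subject"]):
            return True  # same topic as a message the user sent
    return False
```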

Returned phone calls. Similarly, CLUES can detect when someone returns a call by correlating the user's record of outgoing phone calls--created when the user dials using MailCall, Phoneshell [6], or a desktop dialing utility--with the Caller ID number of voice mail. Our voice mail system sends the user email with the Caller ID of the incoming message. Thus, CLUES would mark the following:

From Operator <root@media.mit.edu>
Subject: Voice message from 8-5956

Medium-independent filtering. Often, someone does not reply using the same medium; one may call in response to an email message or vice versa. After finding the email address of the person the user has called by cross-referencing the phone number in the rolodex, CLUES can mark an email reply to an outgoing phone call. In this example, CLUES marks the email message from Charles Hemphill since the user recently tried to call him.

From hemphill@csc.ti.com (Charles Hemphill)
Subject: I just got your voice mail; here's what I think
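The two call-related rules above can be sketched together: one matches the Caller ID in a voice mail notification against dialed numbers, the other maps dialed numbers to email addresses through the rolodex. The notification subject format follows the example in the text; the data structures, field names, and the rolodex card for Charles Hemphill are hypothetical.

```python
import re


def digits(number):
    """Keep only digits so "215-555-5555" and "(215) 555-5555" compare equal."""
    return re.sub(r"\D", "", number)


def caller_id_from_subject(subject):
    """Extract the Caller ID from a "Voice message from <number>" notification."""
    match = re.match(r"Voice message from (.+)", subject)
    return digits(match.group(1)) if match else None


def is_returned_call(message, dialed_numbers, rolodex):
    """Mark voice mail from a dialed number, or email from that number's owner."""
    dialed = {digits(n) for n in dialed_numbers}
    caller = caller_id_from_subject(message["subject"])
    if caller is not None and caller in dialed:
        return True
    # Medium-independent case: find email addresses of people the user called.
    called_emails = {
        card["email"].lower()
        for card in rolodex
        if digits(card["phone"]) in dialed
    }
    return message["from"].lower() in called_emails
```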

Geographic filtering. Business travelers often leave behind a phone number where they can be reached. CLUES correlates a phone number found in the calendar with the user's rolodex to produce a list of people who live in that area. (CLUES also keeps a small table of co-located area codes such as 408 and 415.) A message from any of those people is then marked under the assumption that the user might be trying to coordinate with them while in town. Hence, messages from those the user is visiting suddenly become important, as when the calendar has a fax number in an area code near where Kim Silverstone lives:

From ksilvers@eng.sun.com
Subject: Let's do lunch while you're in town!
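A minimal sketch of the area-code correlation follows. Only the 408/415 pairing comes from the text; the table layout, the assumption of ten-digit North American numbers, and the function names are illustrative.

```python
import re

# Hypothetical table of co-located area codes; the text mentions 408 and 415.
CO_LOCATED = [{"408", "415"}]


def area_code(number):
    """Return the area code of a ten-digit number, or None for short extensions."""
    found = re.sub(r"\D", "", number)
    return found[:3] if len(found) == 10 else None


def nearby_codes(code):
    """Expand an area code to include any codes listed as co-located with it."""
    codes = {code}
    for group in CO_LOCATED:
        if code in group:
            codes |= group
    return codes


def people_near(calendar_number, rolodex):
    """Return email addresses of rolodex contacts near a number found in the calendar."""
    code = area_code(calendar_number)
    if code is None:
        return set()
    nearby = nearby_codes(code)
    return {
        card["email"].lower()
        for card in rolodex
        if area_code(card["phone"]) in nearby
    }
```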

Domain-based filtering. CLUES also extracts the domain names from the email addresses of the people to whom the user has recently sent messages, presuming that messages from other people at the same site may also be relevant. Since CLUES identified Kim Silverstone as someone whom the user might want to visit on Tuesday's trip, messages from anyone at the same domain are marked.

From smith@eng.sun.com
Subject: Kim Silverstone told me you're in town...

Isolating the important part of a domain name is nontrivial. Some people may receive email at one sub-domain (groucho@media-lab.media.mit.edu) but send from another (groucho@timpanogos.media.mit.edu). Some corporations are geographically distributed; eng.sun.com is in California, but east.sun.com is in Massachusetts. Others like aol.com are non-location specific, so they are useless for geographic filtering. CLUES prunes domain names as appropriate.
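The paper does not spell out CLUES' pruning rules, so the following is only one plausible heuristic: keep at most the last three labels of the domain, and treat known non-location-specific providers as unusable. The cutoff of three labels and the `NON_LOCAL` table are assumptions chosen to reproduce the examples in the text.

```python
# Hypothetical list of providers whose domain says nothing about location.
NON_LOCAL = {"aol.com"}


def prune_domain(address, keep=3):
    """Reduce an email address to a site-level domain, or None if it is not useful.

    Keeping the last three labels maps both media-lab.media.mit.edu and
    timpanogos.media.mit.edu to media.mit.edu while leaving eng.sun.com
    and east.sun.com distinct.
    """
    domain = address.rsplit("@", 1)[-1].lower()
    pruned = ".".join(domain.split(".")[-keep:])
    return None if pruned in NON_LOCAL else pruned


def same_site(address_a, address_b):
    """Return True if both addresses prune to the same usable domain."""
    site_a = prune_domain(address_a)
    site_b = prune_domain(address_b)
    return site_a is not None and site_a == site_b
```

A fixed label count is crude; a production system would more likely need per-organization exceptions.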

MailCall's categorization breaks up a long list of messages into several smaller lists, one of those being the messages marked by CLUES. Once the messages have been sorted into various categories, the user can jump from category to category in nonlinear fashion, saying "go to my personal messages" or "go back to my important messages."

Nonvisual presentation

Categorization of messages helps to segment the information space, but when there are many messages within a single category, the user again is faced with the challenge of finding important messages in a long list. Creating more and more categories merely shifts the burden from navigating among messages to navigating among categories, so the user must have a method of navigating within a large category--or, more generally, of finding one's way through a long list of messages. Efficiently summarizing the information space is the second step toward effective nonvisual messaging.

With a GUI-based mail reader, the user is treated to a visual summary of messages and may point and click on items of interest. This works because a list of the message headers quickly summarizes the set and affords rapid selection of individual messages, but it is difficult to achieve aurally due to the slow, non-persistent nature of speech. Whereas the eyes can scan a list of several dozen messages in a matter of seconds, the ear may take several minutes to listen to the same list read aloud. Further, one must rely on short-term memory to recall the spoken items, whereas the screen serves as a persistent reminder of one's choices.

Since speech is slow, summaries must be streamlined, avoiding extraneous information or repetition. The approach adopted by SpeechActs is to read the headers one by one, which was necessary to list the number of each message:

SpeechActs: Your next six messages are from MIT.
User: Scan the headers.
SpeechActs: Message 1 from Gina-Anne Levow about "final draft changes." Message 2 from Chris Schmandt about "visit Friday." Message 3 from Chris Schmandt about "visit Friday--correction." Message 4 from Matt Marx about "videotape." ...(etc.)... Message 10 from Matt Marx about "final draft changes." Message 7 from Matt Marx about "hello."

MailCall's summary is more concise, though at the expense of some detail.

MailCall: Personal messages. Read the summary, or read the first header?
User: Read me the summary.
MailCall: You have several from Matt Marx, a couple from Chris Schmandt, and one from Gina-Anne Levow.

Although the latter summary does not list the subject of each message, it is more quickly conveyed and easier to remember. By grouping messages from the same sender, it avoids mentioning each message individually, instead providing a summary of what is available.

In addition, MailCall attempts not to overburden the user with information. When reading the list, for instance, it does not say the exact number of messages but rather a fuzzy quantification of the number: e.g., "several from Matt Marx" instead of "six from Matt Marx." And if there are messages from many different people in the same category, MailCall will mention only the first four or five and add "...among others."
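The grouping and fuzzy quantification described above might be sketched as follows. The quantifier thresholds and the cutoff of four senders are guesses; the paper gives example phrases ("a couple," "several," "among others") but not the numeric boundaries.

```python
from collections import Counter


def fuzzy(count):
    """Map an exact count to a vague quantifier (thresholds are illustrative)."""
    if count == 1:
        return "one"
    if count == 2:
        return "a couple"
    if count <= 4:
        return "a few"
    if count <= 7:
        return "several"
    return "a bunch"


def summarize(senders, max_groups=4):
    """Build a spoken summary that groups messages by sender, most frequent first."""
    counts = Counter(senders).most_common()
    if not counts:
        return "You have no messages."
    shown, rest = counts[:max_groups], counts[max_groups:]
    parts = [f"{fuzzy(n)} from {name}" for name, n in shown]
    if rest:
        # Too many senders to list: mention the first few only.
        body = ", ".join(parts) + ", among others"
    elif len(parts) <= 2:
        body = " and ".join(parts)
    else:
        body = ", ".join(parts[:-1]) + ", and " + parts[-1]
    return "You have " + body + "."
```

Given six messages from Matt Marx, two from Chris Schmandt, and one from Gina-Anne Levow, this sketch produces the same sentence as the MailCall summary quoted earlier.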

Nonvisual navigation

The fact that MailCall summarizes the contents of a category implies that the user is able to pick from them at random. Random access refers to nonlinear information access--i.e., moving to something other than the neighboring items in a list. The chart below delineates four general types of random access.

          
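|          | location-based                | content-based                        |
|----------|-------------------------------|--------------------------------------|
| relative | "skip ahead five messages"    | "read the next one about 'meeting'"  |
| absolute | "read me message thirty-five" | "read the message from John Linn"    |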

Figure 2: Four types of random access.

By location-based random access we mean that the user is picking out a certain item by virtue of its position or placement in a list. Location-based random access may either be absolute (e.g., "Read message 10."), when the user has a specific message in mind, or relative, when one moves by a certain offset (e.g., "skip ahead five messages."). (It may be noted that sequential navigation is a form of relative location-based navigation where the increment is one.) SpeechActs supports both absolute and relative location-based random access, and suggests their use by reading the number of each message while scanning the headers in a message category. Location-based random access imposes an additional cognitive burden on the user, who must remember the numbering of a certain message in order to access it. Indeed, participants in the SpeechActs usability study were often observed jotting down the numbering of messages, though doing so would be most difficult in situations where speech is maximally beneficial, such as while driving.

With content-based random access, the user may reference an item by one of its inherent attributes: sender, subject, date, etc. For instance, the user may say "read me the message from John Linn" without having to recall the numbering scheme. As with location-based navigation, both relative and absolute modes exist. Relative content-based access is associated with following "threads," multiple messages on the same subject. Phoneshell, for instance, allows the user to drop into "scan mode," which reads the next message on the current topic or from the current sender; this is feasible with touch-tones since the user need not specify the desired topic explicitly but need only signify that the current topic is to be followed. Absolute content-based navigation is the contribution of MailCall, allowing the user to pick the interesting message(s) from a summary.

MailCall: You have several messages from Lisa Stifelman, a few from Mike Phillips, and one each from Jill Kliger and Steve Lee.
User: Read me the ones from Mike Phillips.

Recent advances in speech recognition make it practical to support absolute content-based navigation. Traditionally, a speech recognizer loads a precompiled vocabulary that cannot be modified at runtime. This makes it impractical for the speech recognizer to know about new messages that arrive constantly. (One can imagine recompiling the grammar whenever a new message arrives, but this is both computationally intensive and susceptible to race conditions, as when the user calls while the grammar is being recompiled to reflect new messages.) The DAGGER speech recognizer [3], however, now offers a dynamic vocabulary updating feature that supports runtime vocabulary modification. When the user enters a category, MailCall adds the names of the senders in that category to the recognizer's vocabulary. Thus the user may ask for a message from those senders listed in a summary. One may also ask if there are messages from anyone in the rolodex, or from someone to whom one has recently sent a message or placed a call (as determined by CLUES).
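The idea of updating the vocabulary on entering a category can be illustrated with a toy stand-in for the recognizer. This is not the DAGGER interface, which the paper does not show; the class, its methods, and the simple suffix matching used in place of real speech recognition are all hypothetical.

```python
class SenderVocabulary:
    """Toy stand-in for a recognizer whose vocabulary can change at runtime."""

    def __init__(self):
        # Maps a lowercased spoken name to the canonical sender string.
        self.names = {}

    def enter_category(self, messages):
        """Replace the dynamic vocabulary with the senders in this category."""
        self.names = {m["sender"].lower(): m["sender"] for m in messages}

    def recognize_sender(self, utterance):
        """Return the sender named in the utterance, or None if out of vocabulary."""
        text = utterance.lower().rstrip(".?!")
        for spoken, sender in self.names.items():
            if text.endswith("from " + spoken):
                return sender
        return None


def messages_from(messages, sender):
    """Select the messages in a category that were sent by the given sender."""
    return [m for m in messages if m["sender"] == sender]
```

Because the vocabulary is rebuilt per category, a name that is absent from the current category is simply not recognized, which keeps the active vocabulary small.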

Absolute content-based random access combined with intelligent categorization and category summarization brings MailCall closer in line with the experience one expects from a graphical mail reader, although a GUI mail reader on a large display has the potential to communicate more information more quickly about the available messages. It satisfies some criteria for an effective nonvisual navigational scheme [8]: users do not have to learn many additional commands, short-term memory is not overburdened, and the structure of the mailbox is not complicated excessively.

DESIGN CHALLENGE: CRAFTING CONVERSATION

MailCall uses speech recognition because it is not practical to support absolute content-based navigation with touch tones. Expressing the command "read the message from Mark Matthews," for instance, would require a command for "read the message from..." and then a mechanism for spelling the desired name, which is especially difficult when dealing with email addresses. (MailCall does support touch-tone equivalents for sequential navigation and other simple functions; see [5].) The use of speech for both input and output, however, raises user expectations of the interaction since it implicitly resembles a human conversation. These heightened expectations can be damning, since speech recognizers are far less adept than humans. Knowing what to say is a stumbling block for beginners, yet taking the time to explain one's options can be tiresome for "power users." Further, speech recognition errors slow down the interaction, requiring constant "grounding" [1] between the system and user to ensure that they share a common perception of what is transpiring. This section describes strategies for avoiding conversational breakdown.

Communicating capabilities with help

People have very different expectations for "conversational" systems: some speak as if chatting with a good friend, expecting the machine to perform human-level language understanding. Experience [9], however, demonstrates that many people do not ascribe much competence to a spoken language application and instead expect to be led step-by-step as with a menu-driven IVR system. In both cases, it is essential to guide users to speak requests that the system understands. Given the range of expectations, a major responsibility of the speech interface designer is to communicate system capabilities to the user.

Like SpeechActs, MailCall avoids explicitly listing options in the form of a menu, instead opting for a conversational style, which is both more familiar and also faster for experts who have learned the system's vocabulary. Still, novice users need to know what their options are. The SpeechActs strategy involved giving each user a printed card with a list of sample commands. In the usability study, this proved successful, as most users spoke only the commands listed on the card. For MailCall, our goal is an "out-of-box" experience, meaning that we want users to be able to learn the system without reading manuals or carrying instruction pamphlets. MailCall users are informed at the beginning of a session that they can ask "what can I say now?" or press the `0' telephone key for assistance.

The first step in providing help is reestablishing context--reminding users where they are and how they got there, usually by revisiting the last action taken by the system and explaining what the system thought the user said. Next, a list of currently available options is given with an explanation of each command. Finally, the user is reminded of the global reset command ("start over from the beginning") so that the user can begin the session anew instead of hanging up in disgust.

Handling recognition errors

Like other speech systems [1, 7, 9], MailCall invests significant effort in detecting and correcting speech recognition errors. It handles rejection errors by apologizing and giving progressive assistance [9] and verifies potential substitution errors, allowing the user to correct them quickly (e.g., "No, I said read the message.").

MailCall's contribution to error-handling is ErrorLock, a general "interface algorithm" for handling recognition errors. (See [5] for a complete description and a diagram.) Instead of dealing with errors on a case-by-case basis, each incoming utterance is passed through ErrorLock for evaluation and confirmation. Aside from ensuring uniform error-handling among varied input, it also keeps track of past recognition errors and deals more intelligently with consecutive errors. If, for instance, the recognizer consistently picks a hypothesis but its confidence is just under the cutoff threshold, ErrorLock temporarily relaxes the threshold and asks if that is what the user meant. Error-tracking also helps to keep MailCall from sounding oblivious to context. In the following example, MailCall makes the same mistake twice but reflects that knowledge in its feedback.

MailCall: Welcome to MailCall. Who's this?
User: Matt Marx.
MailCall: Nat Parker, correct?
User: No, Matt Marx.
MailCall: I thought I heard "Nat Parker" again, but you just said that. Is that right?

If the speech recognizer were able to return more than one hypothesis (an N-best list), ErrorLock would be able to discard the hypothesis that was already rejected and try the next-most-likely one. Besides facilitating better interaction, ErrorLock is a convenience for the developer. Centralizing error-handling removes the need to copy error-handling code all over the application, and since ErrorLock is domain-independent, it can be reused for other applications, changing prompts as necessary.
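The error-tracking behavior described in this section can be approximated with a small state machine. This is not ErrorLock itself, whose design is documented in [5]; the class name, confidence scale, threshold, margin, and action labels below are assumptions, and the rejection prompt is invented.

```python
class ErrorTracker:
    """Rough sketch of centralized error handling with memory of past errors."""

    def __init__(self, threshold=0.6, margin=0.1):
        self.threshold = threshold      # hypothetical confidence cutoff
        self.margin = margin            # how far below the cutoff counts as "just under"
        self.last_near_miss = None      # hypothesis that narrowly failed last turn
        self.rejected = set()           # hypotheses the user has already said "no" to

    def evaluate(self, hypothesis, confidence):
        """Return an (action, prompt) pair for one recognized utterance."""
        if hypothesis in self.rejected:
            # Same mistake again: acknowledge it instead of sounding oblivious.
            return ("verify_again",
                    f'I thought I heard "{hypothesis}" again, '
                    "but you just said that. Is that right?")
        if confidence >= self.threshold:
            self.last_near_miss = None
            return ("verify", f"{hypothesis}, correct?")
        near = confidence >= self.threshold - self.margin
        if near and hypothesis == self.last_near_miss:
            # Same hypothesis twice just under the cutoff: relax it for this turn.
            self.last_near_miss = None
            return ("verify", f"Did you say {hypothesis}?")
        self.last_near_miss = hypothesis if near else None
        return ("reject", "Sorry, I didn't understand.")

    def user_said_no(self, hypothesis):
        """Record that the user rejected a verified hypothesis."""
        self.rejected.add(hypothesis)
```

Routing every utterance through a single `evaluate` call is what keeps the error handling uniform; the application only supplies prompts and reacts to the returned action.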

ErrorLock is not, however, a complete discourse model like that of Chatter [4]. And although its verification of potential misrecognitions is invaluable when the speech recognizer is having difficulty, its constant verification can become annoying when the speech recognizer is performing well. An adaptive model of feedback such as described in [1] may help MailCall appear less rote in deciding when to ask for feedback.

Contextually customized feedback

Grice's cooperative principle of conversation [2] as applied to spoken language systems might read as follows: "give as little feedback as possible, but no less." Giving too much is a vice, but failing to give enough is the greater sin if the user becomes confused for lack of explanation. Both the user's competence and the conversational context of the current session are factors in deciding how to give feedback.

Varying the length of feedback. Like other speech systems [6, 7], MailCall offers various levels of feedback, with longer prompts for novices than experts. MailCall goes a step further, however, adapting its feedback based on the conversational context through the use of ellipsis. Ellipsis refers to the omission of words or phrases that have already been established earlier in the conversation, with the assumption that the intended meaning can be reconstructed from a shorter prompt and the existing context. In the following example, the user is asked twice whether to review, send, or cancel a message. Items omitted by ellipsis appear in strikeout text.

User: Send a message to Jordan Slott.
MailCall: Record your message for Jordan Slott. Pause for several seconds when finished.
User: (records a voice message)
MailCall: Got it. Review the message, send it, or cancel?
User: Send.
MailCall: Sending your message. Please wait.
User: Send a message to Steve Lee.
MailCall: Record your message for Steve Lee. Pause for several seconds when finished.
User: records a voice message
MailCall: Got it. Review the message, send it, or cancel?
User: Send.
MailCall: Sending your message. Please wait.

As the above example demonstrates, MailCall uses ellipsis on individual words, phrases, and even sentences. Ellipsis helps to streamline the interaction and may help users to perceive the interface as non-repetitive.
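Sentence-level ellipsis can be sketched as follows. This is an illustration under stated assumptions rather than MailCall's code: the class `EllipsisPrompter`, the `required` parameter, and the choice of which sentence is dropped are hypothetical, and word- and phrase-level ellipsis are not modeled.

```python
from typing import List, Set


class EllipsisPrompter:
    """Drops prompt sentences that were already spoken earlier in the session."""

    def __init__(self) -> None:
        self._established: Set[str] = set()

    def render(self, segments: List[str], required: Set[int]) -> str:
        """Join segments, omitting established ones unless marked required."""
        spoken = []
        for index, segment in enumerate(segments):
            if index in required or segment not in self._established:
                spoken.append(segment)
            self._established.add(segment)  # remember it for later prompts
        return " ".join(spoken)


prompter = EllipsisPrompter()
first = prompter.render(
    ["Record your message for Jordan Slott.",
     "Pause for several seconds when finished."],
    required={0},
)
second = prompter.render(
    ["Record your message for Steve Lee.",
     "Pause for several seconds when finished."],
    required={0},
)
```

In this sketch the first prompt is spoken in full, while the second keeps only the sentence naming the new recipient. Marking at least one segment as required guards against a prompt that would otherwise be empty.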

Varying the speed of feedback. Like other speech systems [6, 7], MailCall allows the user to set a default speech output rate or change it during a session. It also recognizes that certain items are more familiar than others, having been established earlier in the conversation. Prompts, for instance, become familiar with use. Items that are new to the conversation, however, may be harder to understand. Thus MailCall temporarily slows down its speech when presenting the sender or subject of a message. It does so in the following example, with the words spoken more slowly rendered in expanded text.

User: Start with the first message.
MailCall: Message 1 is from Stuart Adams about a response to "next week."

Temporarily slowing down for new information can allow the user to set a higher default speaking rate, speeding the presentation of prompts. Ellipsis and automatic speed adaptation help to reduce the repetitiveness of repeated prompts while helping to ensure that new, unfamiliar information can be comprehended. This is especially important when using synthetic speech, which is necessary when delivering dynamic data such as email messages.
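The speed adaptation described above can be sketched as a small planning step ahead of the synthesizer. The function `plan_speech`, the 0.75 slowdown factor, the rate of 200 words per minute, and the exact segment boundaries are assumptions made for illustration. The text establishes only that the sender and subject are spoken more slowly.

```python
from typing import List, Tuple


def plan_speech(
    segments: List[Tuple[str, bool]],
    default_rate: float,
    slowdown: float = 0.75,  # assumed factor, not taken from the paper
) -> List[Tuple[str, float]]:
    """Pair each segment with a speaking rate; new information is slowed."""
    plan = []
    for text, is_new in segments:
        rate = default_rate * slowdown if is_new else default_rate
        plan.append((text, rate))
    return plan


# Familiar prompt words keep the default rate; sender and subject are new.
plan = plan_speech(
    [
        ("Message 1 is from", False),
        ("Stuart Adams", True),
        ("about", False),
        ('a response to "next week."', True),
    ],
    default_rate=200.0,  # words per minute, a hypothetical user setting
)
```

Each pair in `plan` could then be handed to a synthesizer that accepts a per-utterance rate, so that raising the default rate speeds up prompts without making names and subjects harder to follow.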

USER STUDY

To evaluate the effectiveness of MailCall, a user study was conducted. The goal was not only to determine how usable the system was for a novice, but also how useful it would prove long-term as a tool for mobile messaging.

Method

Since our goal was to evaluate not only ease of learning but also the likelihood of continued use, we conducted a five-week study involving four Media Lab affiliates with varying experience using speech recognition. In order to gauge the learning curve, minimal instruction was given except upon request. Sessions were neither recorded nor monitored due to privacy concerns surrounding personal messages, so the results described below are based chiefly on user reports. The experiences of the two system designers using MailCall over a period of three months were also considered.

Results

Feedback from beginning users centered mainly on the process of learning the system, though as users became more familiar with the system, they also commented on the utility of MailCall's nonvisual presentation. Seasoned users offered more comments on navigation as well as the limits of MailCall in various acoustic contexts.

Bootstrapping. As described above, our approach was to provide a conversational interface supported by a help system. All users experienced difficulty with recognition errors, but those who used the help facility found they could sustain a conversation in many cases. A participant familiar with other speech systems found the combination of error-handling and help especially useful:

I have never heard such a robust system before. I like all the help it gives. I said something and it didn't understand, so it gave suggestions on what to say.

Other participants were less enthusiastic, though nearly all reported that their MailCall sessions became more successful with experience.

Recognition robustness. MailCall has a very large vocabulary, pushing the limits of the speech recognizer. Two factors were identified in the success of the system: vocabulary size and acoustic context. Since the user can ask if there are messages from anyone in the rolodex, send them messages, or call them at any time, MailCall's vocabulary size varies directly with the number of people in the user's rolodex. An experienced user with well over 100 names in his rolodex found recognition to be unreliable when talking over a noisy cellular connection.

Specifying Names. A difficulty common to all users was getting MailCall to understand names, both in asking for messages from certain people and in simply identifying oneself. Unlike Chatter, which used first names only, MailCall requires the user to say full names. Everyone responded to the prompt "Welcome to MailCall. Who's this?" with "It's Matt!" or something similar. Furthermore, users were disappointed when they had to refer to someone the same way that MailCall did; instead of saying "read the one from groucho at M I T dot E D U," they wanted to say "read the one from groucho" which, given the conversational context, is all that is required to uniquely specify that person.

Navigation. Users cited absolute content-based navigation as a highlight of MailCall. One beginning user said "I like being able to check if there are messages from people in my rolodex." And one of the system designers, a diehard Phoneshell user, admits that he uses MailCall instead of Phoneshell when facing an unusually large number of messages because he can ask for a summary of a category and then pick the ones he wants to hear first.

For sequential navigation, however, speech was in fact a disadvantage. The time necessary to say "next" and then wait for the recognizer to respond can be greater than just pushing a touch-tone, especially when the recognizer may misunderstand. Indeed, several participants used touch-tone equivalents for "next" and "previous." And since some participants in the study received few messages, they were content to step through them one by one. MailCall's touch-tone equivalents address this preference.

These results suggest that MailCall is most useful to people with high message traffic, whereas those with a low volume of messages may be content to simply step through the list with touch-tones, avoiding recognition errors.

Implications for redesign

The results of our user study suggested several areas where MailCall could improve, particularly for novice users. Some changes have already been made, though others will require more significant redesign of the system.

More explanation for beginners. Supporting conversational prompts with help appears to be a useful method of communicating system capabilities to novices. Our experience with four novice users, however, suggests that our prompts and help were not explicit enough. As an iterative design step, we lengthened several prompts, especially those at the beginning of the session, and lengthened the help messages. A fifth user who joined late in the study after these changes had been made was able to log on, navigate, and send messages on his very first try without major difficulties. This suggests that prompts for beginners should err on the side of lengthy exposition.

More flexible specification of names. Specifying names continues to be an elusive problem. MailCall should allow the user to refer to someone using as few items as necessary to uniquely specify them. Doing so would involve two additions to MailCall: first, a "nickname generator," which creates a list of acceptable alternatives for a given name; second, an interface algorithm for disambiguating names with multiple referents as in [9].
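The two proposed additions can be sketched together. This is a speculative illustration of a feature the system does not yet have: the functions `nicknames` and `resolve` and the rule for generating alternatives (full name, each name part, e-mail address, login name) are assumptions. True nicknames such as "Matt" for "Matthew" would need a lookup table that is not modeled here.

```python
from typing import Dict, List, Set


def nicknames(full_name: str, email: str) -> Set[str]:
    """Generate acceptable spoken alternatives for one rolodex entry."""
    name = full_name.lower()
    address = email.lower()
    alternatives = {name, address}
    alternatives.update(name.split())        # first name, last name
    alternatives.add(address.split("@")[0])  # login name, e.g. "groucho"
    return alternatives


def resolve(spoken: str, candidates: Dict[str, str]) -> List[str]:
    """Return the full names of all candidates that match what was said."""
    spoken = spoken.lower()
    return sorted(
        name for name, email in candidates.items()
        if spoken in nicknames(name, email)
    )


# Candidates drawn from the conversational context (full name -> address).
people = {
    "Matthew Marx": "groucho@media.mit.edu",
    "Chris Schmandt": "geek@media.mit.edu",
}
matches = resolve("groucho", people)
```

A single match identifies the person outright. More than one match would call for a disambiguation exchange along the lines of [9], and no match would fall back to ordinary error handling.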

Moded vs. Modeless interaction. If MailCall is to be usable in weak acoustic contexts (like the cellular phone) for people with a large rolodex, its interaction may need to become more modal. We intentionally designed MailCall to be modeless so that users would not have to switch back and forth among applications, but as the number of people in the rolodex grows, it may become necessary to define a separate "rolodex" segment of the application to constrain the recognition task. Wildfire has segmented its recognition of names (necessitated by other constraints), though the resulting interaction is less conversational.
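The effect of a separate rolodex segment on the recognition task can be sketched by comparing the words that are active in each mode. Everything here is hypothetical: the function `active_vocabulary`, the mode names, the mode-switching words "rolodex" and "main menu", and the toy command list.

```python
from typing import Set


def active_vocabulary(mode: str, commands: Set[str], names: Set[str]) -> Set[str]:
    """Return the words the recognizer must listen for in the given mode."""
    if mode == "modeless":
        return commands | names       # everything is active all the time
    if mode == "rolodex":
        return names | {"main menu"}  # names plus a way back out
    return commands | {"rolodex"}     # main mode: commands plus a way in


commands = {"next", "previous", "read", "send", "cancel"}
names = {f"person {i}" for i in range(100)}  # stand-in for a 100-entry rolodex

modeless_size = len(active_vocabulary("modeless", commands, names))
rolodex_size = len(active_vocabulary("rolodex", commands, names))
main_size = len(active_vocabulary("main", commands, names))
```

In the modal variant, the main mode stays small no matter how large the rolodex grows, at the cost of an extra step whenever the user wants to name someone.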

CONCLUSIONS

Nonvisual messaging systems can approach their visual counterparts in usability and usefulness if users can quickly access the messages they want. Through a combination of intelligent organization, efficient presentation, and random-access navigation, MailCall offers interaction similar to that of a visual messaging system. Consideration of context helps to meet user expectations of error-handling and feedback, though beginning users may require more assistance than was anticipated. Results suggest, however, that a large-vocabulary conversational system like MailCall can be both usable and useful for mobile messaging.

ACKNOWLEDGMENTS

Jordan Slott implemented the ISDN software, and several research associates of the MIT Media Lab Speech Group have contributed to infrastructure. Raja Rajasekharan and Charles Hemphill made it possible for us to use DAGGER. Nicole Yankelovich reviewed a draft of this paper. This work was supported by Sun Microsystems and Motorola.

REFERENCES

[1] S. Brennan and E. Hulteen. "Interaction and feedback in a spoken language system: a theoretical framework." Knowledge-Based Systems, April/June 1995, pp. 143-151.

[2] H. Grice. "Logic and Conversation." Syntax and Semantics: Speech Acts, Cole & Morgan, editors, Volume 3, Academic Press, 1975.

[3] C. Hemphill and P. Thrift. "Surfing the Web by Voice." In Proceedings of ACM Multimedia '95, San Francisco, CA, Nov. 5-9, 1995.

[4] E. Ly. "Chatter: A Conversational Telephone Agent." MIT Master's Thesis, Program in Media Arts and Sciences, 1993.

[5] M. Marx. "Toward Effective Conversational Messaging." MIT Master's Thesis, Program in Media Arts and Sciences, 1995.

[6] C. Schmandt. "Phoneshell: the Telephone as Computer Terminal." Proceedings of ACM Multimedia Conference, August 1993.

[7] L. Stifelman, B. Arons, C. Schmandt, and E. Hulteen. "VoiceNotes: A Speech Interface for a Hand-Held Voice Notetaker." ACM INTERCHI '93 Conference Proceedings, Amsterdam, The Netherlands, April 24-29, 1993.

[8] C. Wolf, L. Koved, and E. Kunzinger. "Ubiquitous Mail: Speech and Graphical User Interfaces to an Integrated Voice/E-Mail Mailbox." Interact '95, 1995, pp. 247-252.

[9] N. Yankelovich, G. Levow, and M. Marx. "Designing SpeechActs: Issues in Speech Interfaces." Proceedings of CHI '95, Denver, CO, May 8-11, 1995.
