Under our approach, the user can insert keywords into their speech in order to categorise and structure the recording. For example, if while driving on the highway you are overtaken by a truck laden with useful-looking contact information, you could quickly record: "Phone book entry ... Business number is 408 927 6353 ... Name is Hamilton & Co Removals ... Comment is the truck says something about extra heavy items a speciality, could get that piano shifted at last".
Given the spoken keywords (underlined), the recorded note will be added as a new entry in an electronic phone book, and the speech segmented into the different fields of the contact form. This interaction technique could also be applied to interpersonal voice messaging.

Although previous work in the literature has promoted semi-structured text messages, there has been little research on the extraction of form structure from a voice record. VoiceNotes [4] allowed speech items to be sorted into named lists, but did not provide for structuring within voice notes. A number of projects have shown how manual and keyword indices can be assigned to a recording [1], but they do not derive further structure from the set of index points. Perhaps the closest analogue to our method is in telephone answering services, which generate voice prompts to structure a caller's voice message [3]. In our approach, the user volunteers these prompts as part of a single voice-recording operation ("capture now, and help to organise later"), rather than entering into a dialogue with the system.
Our goal is to translate recorded voice items into this format. This consists of three tasks: filing the record into an application, segmenting the speech into form fields, and, for some fields, recognising the field contents.
Our design assumes that the user explicitly signals these filing and segmentation operations; however, it is up to the user to choose how much indexing information to use in a voice note.
New voice recordings are by default added to a 'general' list, which displays simple header information about the items. If the voice record begins with the name of a recognised application, such as "Phone book entry", then the record is filed into that area; if not, it remains in the general list. If an application name is recognised, the rest of the voice record is searched for any keyword labels corresponding to form fields. Keyword (or phrase) labels are assumed to be preceded by a pause. The speech content following a label, up to the next recognised field name, is then associated with that field in the form.
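The filing and segmentation steps above can be sketched as follows. This is a minimal illustration operating on an already-recognised sequence of segments rather than raw speech; the application names, field labels, and the `segment_note` helper are hypothetical, not the system's actual implementation.

```python
# Hypothetical sketch of the filing and segmentation step. Input is a
# list of recognised segments: spotted keyword labels, plus the opaque
# speech content between them. All names below are illustrative only.

KNOWN_APPS = {"phone book entry": "PhoneBook"}                  # assumed
FIELD_LABELS = ("business number is", "name is", "comment is")  # assumed

def segment_note(segments):
    """File a voice note and split it into labelled form fields.

    Returns (application, {field_label: content}). A note whose first
    segment is not a recognised application name stays in the general
    list, unsegmented.
    """
    if not segments or segments[0].lower() not in KNOWN_APPS:
        return "General", {"note": " ... ".join(segments)}
    app = KNOWN_APPS[segments[0].lower()]
    fields, current = {}, None
    for seg in segments[1:]:
        if seg.lower() in FIELD_LABELS:
            current = seg.lower()          # a field label opens a new field
            fields[current] = []
        elif current is not None:
            fields[current].append(seg)    # content up to the next label
    return app, {k: " ".join(v) for k, v in fields.items()}
```

Applied to the phone-book example above, `segment_note(["Phone book entry", "Business number is", "408 927 6353", "Name is", "Hamilton & Co Removals"])` files the note under the phone book application with the number and name fields filled, in whatever order they were spoken.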
Figure 1: Voice clips assigned to form fields
Figure 1 shows a form-based phone book entry derived from the example speech input described above. An audio icon indicates that a field contains speech. Speech can be played back in clips according to field location, or as the original whole record. Notice that the order of field entries in the original voice input need not correspond to their ordering in the form; furthermore, the user chooses which fields to fill in.
In some fields, for example where known items are expected, the field contents are also recognised.
In general, we do not aim to recognise the speech immediately following a keyword label, and so characterise it with a "garbage" model consisting of any sequence of phonemes. The garbage model also absorbs the non-keyword speech which may follow natural juncture pauses within the recording.
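A word-level analogue of this spotting-with-garbage arrangement can be sketched as below: recognised keyword sequences are emitted as labels, while everything between them is swept into garbage spans and preserved as unrecognised speech. The keyword list is hypothetical, and a real implementation would operate on phoneme sequences rather than words.

```python
# Toy keyword spotter with a garbage fallback, over words rather than
# phonemes. KEYWORDS is an assumed label set for illustration.
KEYWORDS = (("name", "is"), ("comment", "is"))

def spot(words):
    """Return (kind, text) pairs: 'label' for spotted keywords,
    'garbage' for intervening speech, which is preserved verbatim."""
    out, i = [], 0
    while i < len(words):
        for kw in KEYWORDS:
            if tuple(words[i:i + len(kw)]) == kw:
                out.append(("label", " ".join(kw)))
                i += len(kw)
                break
        else:  # no keyword matched here: extend the current garbage span
            if out and out[-1][0] == "garbage":
                out[-1] = ("garbage", out[-1][1] + " " + words[i])
            else:
                out.append(("garbage", words[i]))
            i += 1
    return out
```

For instance, `spot("name is hamilton and co".split())` yields one label span ("name is") followed by one garbage span holding the rest of the utterance.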
We plan to explore user benefits in the context of a broader multi-modal design which allows interaction trade-offs between speech commands and button presses. For instance, when recording with the palmtop lid closed, speech is used to identify the target application. If the application has already been accessed, only field labels and content are needed. Finally, if the user manually selects the field focus, only speech content for that field is required.
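These trade-offs amount to a simple policy: the more context the user establishes with button presses, the fewer spoken elements are required. A hypothetical sketch of that policy (function and parameter names are ours, not the system's):

```python
def required_speech(app_open, field_selected):
    """Speech elements still needed to complete a recording,
    given how much context the user has supplied manually."""
    if field_selected:        # field focus chosen by button press
        return ["content"]
    if app_open:              # application already accessed
        return ["field label", "content"]
    # e.g. recording with the palmtop lid closed: speech does everything
    return ["application name", "field label", "content"]
```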
An advantage of form-structured input is that it focuses the requirements of speech recognition: detecting unknown names is a problem for speech-to-text systems, but for our phone book's Name field we expect an unknown word, and can design a recognition strategy accordingly. A more general goal of our research is to understand how structured, multi-modal input may lead to a better overall transcription of speech than we would get from a straightforward application of speech-to-text (cf. [2]).