Under our approach, the user can insert keywords into their speech in order to categorise and structure the recording. For example, if while driving on the highway you are overtaken by a truck laden with useful-looking contact information, you could quickly record: "Phone book entry ... Business number is 408 927 6353 ... Name is Hamilton & Co Removals ... Comment is the truck says something about extra heavy items a speciality, could get that piano shifted at last".
Given the spoken keywords (underlined), the recorded note will be added as a new entry in an electronic phone book, and the speech segmented into the different fields of the contact form. This interaction technique could also be applied to interpersonal voice messaging.

Although previous work in the literature has promoted semi-structured text messages, there has been little research on the extraction of form structure from a voice record. VoiceNotes [4] allowed speech items to be sorted into named lists, but did not provide for structuring within voice notes. A number of projects have shown how manual and keyword indices can be assigned to a recording [1], but they do not derive further structure from the set of index points. Perhaps the closest analogue to our method is in telephone answering services, which generate voice prompts to structure a caller's voice message [3]. In our approach, the user volunteers these prompts as part of a single voice-recording operation ("capture now, and help to organise later"), rather than entering into a dialogue with the system.
Our goal is to translate recorded voice items into this format. This consists of three tasks: filing the record into an application, segmenting the speech into form fields, and, for some fields, recognising the field contents.
Our design assumes that the user explicitly signals these filing and segmentation operations; however, it is up to the user to choose how much indexing information to use in a voice note.
New voice recordings are by default added to a 'general' list, which displays simple header information about the items. If the voice record begins with the name of a recognised application, such as "Phone book entry", then the record is filed into that area; if not, it remains in the general list. If an application name is recognised, the rest of the voice record is searched for any keyword labels corresponding to form fields. Keyword (or phrase) labels are assumed to be preceded by a pause. The speech content following a label, up to the next recognised field name, is then associated with that field in the form.
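The filing and segmentation steps above can be sketched as follows. This is a minimal illustration operating on an already-recognised sequence of segments rather than raw speech; the application names, field labels, and the `segment_note` helper are hypothetical, not the system's actual implementation.

```python
# Hypothetical sketch of the filing and segmentation step. Input is a
# list of recognised segments: spotted keyword labels, plus the opaque
# speech content between them. All names below are illustrative only.

KNOWN_APPS = {"phone book entry": "PhoneBook"}                  # assumed
FIELD_LABELS = ("business number is", "name is", "comment is")  # assumed

def segment_note(segments):
    """File a voice note and split it into labelled form fields.

    Returns (application, {field_label: content}). A note whose first
    segment is not a recognised application name stays in the general
    list, unsegmented.
    """
    if not segments or segments[0].lower() not in KNOWN_APPS:
        return "General", {"note": " ... ".join(segments)}
    app = KNOWN_APPS[segments[0].lower()]
    fields, current = {}, None
    for seg in segments[1:]:
        if seg.lower() in FIELD_LABELS:
            current = seg.lower()          # a field label opens a new field
            fields[current] = []
        elif current is not None:
            fields[current].append(seg)    # content up to the next label
    return app, {k: " ".join(v) for k, v in fields.items()}
```

Applied to the phone-book example above, `segment_note(["Phone book entry", "Business number is", "408 927 6353", "Name is", "Hamilton & Co Removals"])` files the note under the phone book application with the number and name fields filled, in whatever order they were spoken.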
Figure 1: Voice clips assigned to form fields
Figure 1 shows a form-based phone book entry derived from the example speech input described above. An audio icon indicates that a field contains speech. Speech can be played back in clips according to field location, or as the original whole record. Notice that the order of field entries in the original voice input need not correspond to their ordering in the form; furthermore, the user chooses which fields to fill in.
In some fields, for example where known items are expected, the field contents are also recognised.
In general, we do not aim to recognise the speech immediately following a keyword label, and so characterise it with a "garbage" model consisting of any sequence of phonemes. The garbage model also absorbs the non-keyword speech which may follow natural juncture pauses within the recording.
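A word-level analogue of this spotting-with-garbage arrangement can be sketched as below: recognised keyword sequences are emitted as labels, while everything between them is swept into garbage spans and preserved as unrecognised speech. The keyword list is hypothetical, and a real implementation would operate on phoneme sequences rather than words.

```python
# Toy keyword spotter with a garbage fallback, over words rather than
# phonemes. KEYWORDS is an assumed label set for illustration.
KEYWORDS = (("name", "is"), ("comment", "is"))

def spot(words):
    """Return (kind, text) pairs: 'label' for spotted keywords,
    'garbage' for intervening speech, which is preserved verbatim."""
    out, i = [], 0
    while i < len(words):
        for kw in KEYWORDS:
            if tuple(words[i:i + len(kw)]) == kw:
                out.append(("label", " ".join(kw)))
                i += len(kw)
                break
        else:  # no keyword matched here: extend the current garbage span
            if out and out[-1][0] == "garbage":
                out[-1] = ("garbage", out[-1][1] + " " + words[i])
            else:
                out.append(("garbage", words[i]))
            i += 1
    return out
```

For instance, `spot("name is hamilton and co".split())` yields one label span ("name is") followed by one garbage span holding the rest of the utterance.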
We plan to explore user benefits in the context of a broader multi-modal design which allows interaction trade-offs between speech commands and button presses. For instance, when recording with the palmtop lid closed, speech is used to identify the target application. If the application has already been accessed, only field labels and content are needed. Finally, if the user manually selects the field focus, only speech content for that field is required.
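These trade-offs amount to a simple policy: the more context the user establishes with button presses, the fewer spoken elements are required. A hypothetical sketch of that policy (function and parameter names are ours, not the system's):

```python
def required_speech(app_open, field_selected):
    """Speech elements still needed to complete a recording,
    given how much context the user has supplied manually."""
    if field_selected:        # field focus chosen by button press
        return ["content"]
    if app_open:              # application already accessed
        return ["field label", "content"]
    # e.g. recording with the palmtop lid closed: speech does everything
    return ["application name", "field label", "content"]
```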
An advantage of form-structured input is that it focuses the requirements of speech recognition: detecting unknown names is a problem for speech-to-text systems, but for our phone book's Name field we expect an unknown word, and can design a recognition strategy accordingly. A more general goal of our research is to understand how structured, multi-modal input may lead to a better overall transcription of speech than we would get from a straightforward application of speech-to-text (cf. [2]).