MANAGING ERRORS IN MULTIMODAL HUMAN-COMPUTER INTERACTION

Impressions from the assistants
(not published, not submitted, no way)

Erik C. B. Olsen & Robert Van Gent
Computer Dialogue Laboratory & Artificial Intelligence Center
SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025

ABSTRACT

This research examined how people resolve errors, specifically spiraling errors known as the recognition degradation spiral [1]. In correcting errors (self-made or system-generated), people used spoken and written input contrastively to designate a shift from one mode to another. This is known as contrastive functionality [2], and in this case it was used to mark a shift in content from speech to writing or from writing to speech. Often this shift occurred after receipt of an error, and the repeated input was similar in content, if not identical, to the input entered in the first mode before the shift. Suggestions for design, as well as informal thoughts about subjects' behavior in response to errors, are included just for kicks.

INTRODUCTION

We hoped to answer some of the following questions: How do people alter spoken and written language when attempting to correct errors, and when shifting modes? What do people believe will resolve errors? What actions do they take to correct errors? Why do people alternate inputs? Do excessive errors lead to abbreviations? How likely is recognition degradation on the follow-up input in a spiral error?

Subject disfluencies seemed to have been minimized, in that subjects progressed carefully, even tentatively; they may have been "on the lookout" for errors, since varying numbers of errors were received, and because they were forewarned that "the system needs some extra patience to use and you may need to enter your input repeatedly."

Another factor in the low disfluency rate is believed to be this study's use of a form-based format. As in previous studies [3], spoken disfluencies were effectively minimized by the use of a structured format; [3] also found that the unconstrained format required the speaker to self-structure and plan to a greater degree. In the present study, the constrained presentation format, in the form of input "boxes," seemed to facilitate neater writing and clearer, slower speech, both with seemingly more planning involved. Thus, an increased rate of system-generated errors in the constrained format seems to have caused subjects to consider their input carefully and be less disfluent.

Mixing modes was not facilitated in this study, and very little combined speech and writing, i.e., simultaneous input, was observed. Again, this may be because, when errors were delivered, subjects often used the alternate input mode contrastively to distinguish a new input from the previous, seemingly error-producing one.
With real recognition systems used over time, a high number of errors would force mode switching; users usually believe that if an error is made in one mode, the other mode is more likely to be successful.

When subjects received errors, as in a spiral error, they usually entered input using a particular mode 2-3 times before changing to the other mode. With repeated use of the first mode, subjects seemed to believe the system was not accepting this input and that changing modes would help the system understand it. As was often the case, the system would then accept the input on the first or second attempt in the contrastively used mode.

Another interesting finding is that errors seem to actually facilitate clarity. With errors abounding, people seem to be more careful. Writing is neater and more likely to be printed. Speech is clearly articulated and more carefully spoken. Subjects also seem to spend more time planning their input, and less disfluent, shorter phrases were observed.

Although not a problem for writing, numerous errors could cause problems in speech as well; people may hyper-articulate, slow down substantially, and increase their volume. These are not particularly problematic for simulation studies, but they would cause problems for a combined pen/voice recognition system trained on language input, since louder, hyper-articulated, slow speech may vary too far from the system's training model to be accepted. It is understandable that people use such strategies when errors occur; slowing down, hyper-articulating, and speaking louder are the same strategies used in human-human interaction when a speaker is not being understood or heard correctly.

Even subjects who have a strong preference for the speech input mode will revert to the writing mode to correct errors after the third or fourth unsuccessful attempt at correcting a deep spiral error. One subject even formed a strategy for correction: 2 inputs in the first mode, then 2 inputs in the alternate mode, and so forth until the error was resolved.
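That subject's alternation can be sketched as a simple generator. The block size of 2 and the mode names come from the observation above, but the function itself is only an illustration, not anything the subject or system actually ran:

```python
from itertools import cycle, islice


def alternating_strategy(first_mode="speech", block=2):
    """Yield the mode to use on each successive correction attempt:
    `block` tries in the first mode, then `block` in the other,
    repeating until the spiral error resolves."""
    other = "writing" if first_mode == "speech" else "speech"
    for mode in cycle([first_mode, other]):
        for _ in range(block):
            yield mode


# First eight attempts of the observed 2-and-2 strategy:
print(list(islice(alternating_strategy(), 8)))
# ['speech', 'speech', 'writing', 'writing', 'speech', 'speech', 'writing', 'writing']
```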

There seems to be a sense of relief when switching modes to correct errors. After 3-4 attempts, people may become suspicious about why the errors are not resolving, especially if they feel they are entering their input clearly and consistently.

Future research may look at how people correct their own performance errors. It is believed that a multimodal system would be beneficial with these types of errors as well.

PRACTICAL SUGGESTIONS FOR DESIGN

In this study, subjects were informed that the system was still being tested. The experimenter allowed subjects to practice and familiarize themselves with the two input modes. During this practice session, system-generated spiral errors were produced, ranging from 1 to 6 in depth. As a preview of the actual experiment, the experimenter encouraged the subject to re-enter the input when error messages, i.e., a series of "?????" with a beep, were received.

This method prepared subjects to make repeated input attempts when errors occurred during the six tasks, when the experimenter would not be assisting them. This sort of "preview" could be used by designers producing new, fully functional pen/voice systems. By informing the user that some input may be difficult to recognize, thus necessitating repeated inputs, the user will be better prepared to use the system and perhaps more understanding and patient when problems arise. Ideally, of course, new technology will be quite robust, but this sort of preview may be a proactive answer until the limitations of the technology are better addressed. Such a preview would "keep people going" when errors occur.

Hardware is constantly improving, and robust systems are available today. It is believed that more development of the actual interface is needed. Interfaces must be designed to best facilitate the functionality needs of the user and the system. For instance, an interface used for real estate selection may differ dramatically from one where airline reservations are made. This research hopes to discover some of the functions that potential users prefer and desire. Using such information will allow better, easier-to-use multimodal systems.

Advantages of multimodal systems: alternating inputs are good for error correction and for dealing with uncommon or foreign words.

Robert's comments re. error study

People usually try one modality 2 or 3 times before switching. This means that at spiral depths of 6 or so, people are starting to switch back to the original modality.

At least at the start of the task, people respond to errors by trying to alter their input. For example, they switch from "seven sixteen ninety four" to "July sixteen nineteen ninety four." If they have been using abbreviations, they write or speak out the whole phrase; if they have been using the whole phrase, they often try to use abbreviations. In speech, subjects tend to rephrase what they are saying, speak more slowly, and hyper-articulate. In writing, they tend to print, separate their letters more, write more slowly, and switch between lowercase and uppercase. Toward the end of the task, subjects tend to switch modes and methods of phrasing less often, because they have started to realize it doesn't make a difference.

Subjects tend to get less frustrated with high spiral error rates in fields where they have a lot of different input options. At the beginning of the task, they don't tend to input something the same way more than twice -- they always try to vary something, be it modality, phrasing, speed, cursive vs. print, uppercase vs. lowercase, abbreviated vs. non-abbreviated, etc. Subjects tend to get frustrated more easily on fields where there aren't a lot of possible variations -- for example, getting 6 errors in a row on an input of "1" was more frustrating than getting 6 in a row on the credit card field, even though the credit card field took much longer to input each time. This is probably because the errors are simply more believable on longer input fields.

Small changes in system response time seem like they can make a big difference in how fast subjects perform the task. It at least seems like if I confirm something immediately after they input it, they will just move on to the next field right away, whereas if I delay half a second or so, they start looking around the form, checking out the receipt, etc., and take a lot longer to move on to the next field even after the confirmation arrives.

I am guessing that more people from this study will complain about the lack of feedback from the address/phone number fields than in the previous study. As people become less certain about the system's reliability, they want to check up on it and make sure it got everything right, and they don't like to depend on its own judgment.

It might be interesting to give subjects some made-up information about how the system works and see how that affects their attempts to resolve errors. For example, in the current study we are pretty much discouraging them from trying things out during errors by telling them just to try it over and over, and eventually the system will get it. What would happen if we mentioned that if the system doesn't like something the first time, it might be worth repeating once, but if it doesn't work the second time, then it probably isn't going to work and they'll have to use a different input? Within a certain field, this is probably more realistic in terms of simulating the response of a real recognition system; however, in our simulation we would run into the problem that the system's response would not be consistent across fields. In other words, the system would respond well to a certain phrasing in one task and then object to it in the next one.

The pattern of errors we are generating now is pretty unrealistic for several reasons. First of all, the random errors should be distributed more heavily to long input fields. For example, the phone number field should be at least 10 times more likely to receive an error than the number-of-tickets field, since it has ten times as many digits. It might be interesting to have an overall error rate (say 10%) and generate errors on the fly instead of pre-assigning them, with a possible error for each character that needs to be inputted. For example, "1" would have a 10% error rate, whereas "12" would have a 19% (1 - .9^2) error rate. The phone number, with 10 digits, would have a 65% error rate. This method would automatically lead to spiraling, perhaps in a more realistic fashion than the current method. However, it's not clear whether this system would work for generating errors for speech -- it might be better to use an error rate per syllable uttered, for example, instead of per character (you would have to use an average number of syllables for some fields, especially numbers, which could be stated in many different ways).
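The per-character model above can be sketched in a few lines. The 10% base rate comes from the example in the text; the function names and the choice of "unit" (character vs. syllable) are illustrative only:

```python
import random


def field_error_prob(n_units, per_unit_rate=0.10):
    """Probability that a field of n_units characters (or syllables)
    receives at least one recognition error, assuming independent
    per-unit errors at per_unit_rate."""
    return 1.0 - (1.0 - per_unit_rate) ** n_units


def simulate_input(n_units, per_unit_rate=0.10, rng=random):
    """Return True if a single input attempt succeeds, drawing an
    independent error chance for each unit entered -- generating
    errors on the fly rather than pre-assigning them."""
    return all(rng.random() >= per_unit_rate for _ in range(n_units))


# Sanity checks against the rates quoted in the text:
print(round(field_error_prob(1), 2))   # 0.1  -- the "1" field
print(round(field_error_prob(2), 2))   # 0.19 -- "12"
print(round(field_error_prob(10), 2))  # 0.65 -- a 10-digit phone number
```

Because each attempt draws fresh errors, long fields fail repeatedly on their own, which is what produces the spiraling behavior automatically.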

Another way the current system is unrealistic is that it never gets anything wrong. For real recognition systems, I suspect that the kind of error where the system thinks it recognized what you said, but actually got it wrong, is at least as common as the kind we are simulating, where the system realizes that it has no idea what the input was. It would be interesting to simulate this kind of error as well. Furthermore, if the system started getting things wrong regularly, I suspect you would start getting a LOT more complaints about not getting feedback in the address/phone number slots.
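A minimal sketch of mixing the two error kinds -- the rejection errors we simulate now ("?????" plus a beep) and the substitution errors suggested above, where the system confidently returns the wrong string. The rates and the garble() helper are hypothetical stand-ins, not part of the study:

```python
import random


def respond(true_input, reject_rate=0.05, substitute_rate=0.05, rng=random):
    """Simulated system response to one input attempt."""
    r = rng.random()
    if r < reject_rate:
        return "?????"                  # rejection: system knows it failed
    if r < reject_rate + substitute_rate:
        return garble(true_input, rng)  # substitution: confident but wrong
    return true_input                   # correct recognition


def garble(text, rng):
    """Alter one character -- a crude stand-in for a plausible
    misrecognition."""
    if not text:
        return text
    i = rng.randrange(len(text))
    ch = text[i]
    if ch.isdigit():
        ch = str((int(ch) + 1) % 10)
    elif ch.isalpha():
        ch = chr((ord(ch.lower()) - 97 + 1) % 26 + 97)
    return text[:i] + ch + text[i + 1:]
```

Substitution errors are the ones that would make the missing feedback in the address/phone number slots really sting, since the subject has no way to notice them without confirmation.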

It would also be interesting to see what happens when the two modalities have different base error rates, and whether it affects which modality the subjects prefer. It would also be interesting, if the current method of determining which input fields will generate random errors beforehand is continued, to generate the errors separately for each modality, so that switching modalities is encouraged.
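If the pre-assignment method is kept, the per-modality variant could look like the sketch below. The field names and the 30% rate are hypothetical, not the study's actual configuration:

```python
import random

# Hypothetical field list; the study's actual form fields may differ.
FIELDS = ["name", "address", "phone", "date", "tickets", "credit_card"]


def assign_errors(fields, rate, rng):
    """Pre-assign which fields will produce a recognition error,
    drawn independently for each field."""
    return {f: rng.random() < rate for f in fields}


rng = random.Random(7)
# Independent draws per modality, so a field that errors in speech
# may well succeed in writing -- which encourages modality switching.
error_plan = {mode: assign_errors(FIELDS, 0.3, rng)
              for mode in ("speech", "writing")}
```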

Another thing which might be interesting to simulate is having yet another modality available, such as a keyboard. At what error rate would subjects completely abandon the pen/voice input in favor of the keyboard? How about speech-only input, or writing-only input? It would be interesting to see what kinds of error rates are possible while still leaving the modality preferable to a keyboard.

1. S. L. Oviatt. Pen/Voice: Complementary Multimodal Communication. In Proceedings of Speech Tech '92, New York, New York, February, 1992.

2. S. L. Oviatt and E. Olsen. Integration themes in multimodal human-computer interaction. In Proceedings of the International Conference on Spoken Language Processing, Acoustical Society of Japan, v. 2, 551-554, 1994.

3. S. L. Oviatt. Predicting and managing spoken disfluencies during human-computer interaction. In ARPA, Human Language Technology Proceedings, 1994.

