Modern technologies in teaching FLT
Historically, basic speech recognition research has focused
almost exclusively on optimizing large vocabulary speaker-independent
recognition of continuous dictation. A major impetus for this research has come
from US government sponsored competitions held annually by the Defense Advanced
Research Projects Agency (DARPA). The main emphasis of these competitions has
been on improving the "raw" recognition accuracy--calculated in terms
of average omissions, insertions, and substitutions--of large-vocabulary
continuous speech recognizers (LVCSRs) in the task of recognizing read sentence
material from a number of standard sources (e.g., The Wall Street Journal
or The New York Times). The best laboratory systems that participated in
the WSJ large-vocabulary continuous dictation task have achieved word error rates
as low as 5%, that is, on average, one recognition error in every twenty words
(Pallett, 1994).
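The error measure behind these figures can be made concrete with a short sketch. It assumes the usual definition of word error rate: the minimum number of substitutions, omissions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The function name and the whitespace tokenization are illustrative simplifications and are not taken from any DARPA scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + omissions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = fewest edits turning the first i reference words
    # into the first j hypothesis words (word-level Levenshtein distance).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i omissions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # omission
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)
```

Under this definition, one wrong word in a twenty-word sentence gives a rate of 0.05, which corresponds to the 5% figure cited above.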
CURRENT TRENDS IN VOICE-INTERACTIVE CALL
In recent years, an increasing number of speech laboratories
have begun deploying speech technology in CALL applications. Results include
voice-interactive prototype systems for teaching pronunciation, reading, and
limited conversational skills in semi-constrained contexts. Our review of these
applications is far from exhaustive. It covers a select number of mostly
experimental systems that explore paths we found promising and worth pursuing.
We will discuss the range of voice-interactions these systems offer for
practicing certain language skills, explain their technical implementation, and
comment on the pedagogical value of these implementations. Apart from giving a
brief system overview, we report experimental results if available and provide
an assessment of how far away the technology is from being deployed in
commercial and educational environments.
Pronunciation Training
A useful and remarkably successful application of speech
recognition and processing technology has been demonstrated by a number of
research and commercial laboratories in the area of pronunciation training.
Voice-interactive pronunciation tutors prompt students to repeat spoken words
and phrases or to read aloud sentences in the target language for the purpose
of practicing both the sounds and the intonation of the language. The key to
teaching pronunciation successfully is corrective feedback, more specifically,
a type of feedback that does not rely on the student's own perception. A number
of experimental systems have implemented automatic pronunciation scoring as a
means to evaluate spoken learner productions in terms of fluency, segmental
quality (phonemes) and supra-segmental features (intonation). The automatically
generated proficiency score can then be used as a basis for providing other
modes of corrective feedback. We discuss segmental and supra-segmental feedback
in more detail below.
Segmental Feedback.
Technically, designing a voice-interactive pronunciation tutor goes beyond the
state of the art required by commercial dictation systems. While the grammar
and vocabulary of a pronunciation tutor are comparatively simple, the underlying
speech processing technology tends to be complex since it must be customized to
recognize and evaluate the disfluent speech of language learners. A
conventional speech recognizer is designed to generate the most charitable
reading of a speaker's utterance. Acoustic models are generalized so as to
accept and recognize correctly a wide range of different accents and
pronunciations. A pronunciation tutor, by contrast, must be trained to both
recognize and correct subtle deviations from standard native pronunciations.
A number of techniques have been suggested for automatic
recognition and scoring of non-native speech (Bernstein, 1997; Franco, Neumeyer,
Kim, & Ronen, 1997; Kim, Franco, & Neumeyer, 1997; Witt & Young,
1997). In general terms, the procedure consists of building native
pronunciation models and then measuring the non-native responses against the
native models. This requires models trained on both native and non-native
speech data in the target language, and supplemented by a set of algorithms for
measuring acoustic variables that have proven useful in distinguishing native
from non-native speech. These variables include response latency, segment
duration, inter-word pauses (in phrases), spectral likelihood, and fundamental
frequency (F0). Machine scores are calculated from statistics derived from
comparing non-native values for these variables to the native models.
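The general idea of measuring a learner response against native models can be illustrated with a deliberately simple sketch. It assumes that each native recording has already been reduced to a few numeric features (for example, segment duration or pause length), models each feature by its native mean and standard deviation, and scores a learner by the average absolute z-score. The function names, the feature names, and the z-score statistic are assumptions chosen for illustration; the cited systems use richer acoustic models.

```python
from statistics import mean, stdev


def fit_native_model(native_samples):
    """Summarize native speech as feature -> (mean, standard deviation).

    native_samples is a list of dicts mapping a feature name to a measured value.
    """
    model = {}
    for feature in native_samples[0]:
        values = [sample[feature] for sample in native_samples]
        model[feature] = (mean(values), stdev(values))
    return model


def deviation_score(model, response):
    """Mean absolute z-score of a learner response; 0.0 means 'at the native mean'."""
    z_scores = []
    for feature, (mu, sigma) in model.items():
        if sigma == 0:
            continue  # a constant feature cannot separate native from non-native speech
        z_scores.append(abs(response[feature] - mu) / sigma)
    return mean(z_scores) if z_scores else 0.0
```

A larger score indicates a response that lies further from typical native values. Turning such a distance into a proficiency grade would still require calibration against human judgments.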
In a final step, machine generated pronunciation scores are
validated by correlating these scores with the judgment of human expert
listeners. As one would expect, the accuracy of scores increases with the
duration of the utterance to be evaluated. Stanford Research Institute (SRI)
has demonstrated a 0.44 correlation between machine scores and human scores at
the phone level. At the sentence level, the machine-human correlation was 0.58,
and at the speaker level it was 0.72 for a total of 50 utterances per speaker
(Franco et al., 1997; Kim et al., 1997). These results compare with 0.55, 0.65,
and 0.80 for phone, utterance, and speaker level correlation between human
graders. A study conducted at Entropic shows that based on about 20 to 30
utterances per speaker and on a linear combination of the above techniques, it
is possible to obtain machine-human grader correlation levels as high as 0.85
(Bernstein, 1997).
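The correlation figures reported here are correlation coefficients between two lists of scores for the same utterances or speakers. A minimal sketch, assuming the Pearson product-moment coefficient is the statistic intended (the text does not name it explicitly) and using a hypothetical function name:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(var_x * var_y)
```

With made-up machine scores `[1, 2, 3, 4, 5]` and human scores `[2, 1, 4, 3, 5]`, the function returns 0.8. Averaging scores over more utterances per speaker reduces noise in both lists, which is consistent with the higher correlations reported at the speaker level.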
Others have used expert knowledge about systematic
pronunciation errors made by L2 adult learners in order to diagnose and correct
such errors. One such system is the European Community project SPELL for
automated assessment and improvement of foreign language pronunciation (Hiller,
Rooney, Vaughan, Eckert, Laver, & Jack, 1994). This system uses advanced
speech processing and recognition technologies to assess pronunciation errors
by L2 learners of English (French or Italian speakers) and provide immediate
corrective feedback. One technique for detecting consonant errors induced by
inter-language transfer was to include students' L1 pronunciations in the
grammar network. In addition to the English /th/ sound, for example, the
grammar network also includes /t/ or /s/, that is, errors typical of non-native
Italian speakers of English. This system, although quite simple in the use of
ASR technology, can be very effective in diagnosing and correcting known
problems of L1 interference. However, it is less effective in detecting rare
and more idiosyncratic pronunciation errors. Furthermore, it assumes that the
phonetic system of the target language (e.g., English) can be accurately mapped
to the learners' native language (e.g., Italian). While this assumption may
work well for an Italian learner of English, it certainly does not for a
Chinese learner; that is, there are sounds in Chinese that do not resemble any
sounds in English.
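The idea of adding known L1 substitutions to the grammar network can be sketched as follows. The sketch assumes the recognizer has already returned a phone string and only shows the bookkeeping: expanding the target pronunciation into its anticipated error variants and reporting which substitution was matched. The table contains just the /th/ example from the text; the function names and phone labels are hypothetical and do not reproduce the SPELL implementation.

```python
from itertools import product

# Anticipated L1-induced substitutions, keyed by the target phone.
L1_SUBSTITUTIONS = {"th": ["t", "s"]}


def expand_variants(target_phones, substitutions):
    """Return every pronunciation the network accepts: the target plus known errors."""
    options = [[phone] + substitutions.get(phone, []) for phone in target_phones]
    return set(product(*options))


def diagnose(target_phones, recognized_phones, substitutions):
    """List (position, expected, produced) errors, or None if the input is not in the network."""
    if tuple(recognized_phones) not in expand_variants(target_phones, substitutions):
        return None  # an unanticipated error: this design cannot diagnose it
    return [
        (index, expected, produced)
        for index, (expected, produced) in enumerate(zip(target_phones, recognized_phones))
        if expected != produced
    ]
```

For the target `["th", "i", "n"]`, the input `["t", "i", "n"]` yields `[(0, "th", "t")]`, whereas `["f", "i", "n"]` yields `None`. The second case illustrates the limitation noted above: errors that are not listed in advance go undiagnosed.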
A system for teaching the pronunciation of Japanese long
vowels, the mora nasal, and mora obstruents was recently built at the
University of Tokyo. This system enables students to practice phonemic
differences in Japanese that are known to present special challenges to L2
learners. It prompts students to pronounce minimal pairs (e.g., long and short
vowels) and returns immediate feedback on segment duration. Based on the
limited data, the system seems quite effective at this particular task.
Learners quickly mastered the relevant duration cues, and the time spent on
learning these pronunciation skills was well within the constraints of Japanese
L2 curricula (Kawai & Hirose, 1997). However, the study provides no data on
long-term effects of using the system.
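Duration feedback on a minimal pair can be sketched in a few lines. The sketch assumes the vowel durations have already been measured (for instance, from a forced alignment) and compares the long vowel with the short one. The function name, the feedback wording, and in particular the threshold ratio of 1.5 are illustrative placeholders and are not values reported by Kawai and Hirose.

```python
def duration_feedback(short_ms, long_ms, min_ratio=1.5):
    """Compare a learner's vowel durations in a short/long minimal pair.

    Returns (ratio, message), where ratio = long vowel duration / short vowel duration.
    """
    if short_ms <= 0 or long_ms <= 0:
        raise ValueError("durations must be positive")
    ratio = long_ms / short_ms
    if ratio >= min_ratio:
        return ratio, "contrast is clear"
    return ratio, "lengthen the long vowel"
```

A learner who produces 100 ms and 200 ms vowels gets a ratio of 2.0 and positive feedback; 100 ms and 110 ms gives a ratio of 1.1 and a prompt to lengthen the long vowel.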
Supra-segmental Feedback.
Correct usage of supra-segmental features such as intonation and stress has
been shown to improve the syntactic and semantic intelligibility of spoken
language (Crystal, 1981). In spoken conversation, intonation and stress
information not only helps listeners to locate phrase boundaries and word
emphasis, but also to identify the pragmatic thrust of the utterance (e.g.,
interrogative vs. declarative). One of the main acoustical correlates of stress
and intonation is fundamental frequency (F0); other acoustical characteristics
include loudness, duration, and tempo. Most commercial signal processing
software has tools for tracking and visually displaying F0 contours (see Figure 2).
Such displays can and have been used to provide valuable pronunciation feedback
to students. Experiments have shown that a visual F0 display of supra-segmental
features combined with audio feedback is more effective than audio feedback
alone (de Bot, 1983; James, 1976), especially if the student's F0 contour is
displayed along with a native model. The feasibility of this type of visual
feedback has been demonstrated by a number of simple prototypes (Abberton &
Fourcin, 1975; Anderson-Hsieh, 1994; Hiller et al., 1994; Spaai & Hermes,
1993; Stibbard, 1996). We believe that this technology has good potential for
being incorporated into commercial CALL systems.
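To show what F0 tracking involves, the following sketch estimates the fundamental frequency of a single voiced frame from the peak of its autocorrelation. Autocorrelation is one common textbook approach and is used here as an assumption; the text does not say which method the commercial tools use. The function name and the 75 to 400 Hz search range are likewise illustrative.

```python
import math


def estimate_f0(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate F0 in Hz for one frame; return None if the frame shows no periodicity."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)  # longest period considered
    if min_lag < 1 or max_lag <= min_lag:
        raise ValueError("frame too short for the requested F0 range")

    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation: large when the frame repeats after `lag` samples.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None


# Usage: a synthetic 200 Hz tone sampled at 8 kHz (50 ms frame).
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
f0 = estimate_f0(tone, 8000)
```

Applying the estimate to successive frames of an utterance yields the contour that a display would plot for the learner next to a native model. Real speech additionally requires a voicing decision and smoothing, which this sketch omits.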
Other types of visual pronunciation feedback include the
graphical display of a native speaker's face, the vocal tract, spectrum
information, and speech waveforms (see Figure 2).
Experiments have shown that a visual display of the talker improves not only
word identification accuracy (Bernstein & Christian, 1996), but also speech
rhythm and timing (Markham & Nagano-Madsen, 1997). A large number of
commercial pronunciation tutors on the market today offer this kind of
feedback. Yet others have experimented with using a real-time spectrogram or
waveform display of speech to provide pronunciation feedback. Molholt (1990)
and Manuel (1990) report anecdotal success in using such displays along with
guidance on how to interpret the displays to improve the pronunciation of
supra-segmental features in L2 learners of English. However, the authors do not
provide experimental evidence for the effectiveness of this type of visual
feedback. Our own experience with real-time spectrum and waveform displays
suggests their potential use as pronunciation feedback provided they are
presented along with other types of feedback, as well as with instructions on
how to interpret the displays.
Teaching Linguistic Structures and Limited Conversation
Apart from supporting systems for teaching basic
pronunciation and literacy skills, ASR technology is being deployed in
automated language tutors that offer practice in a variety of higher-level
linguistic skills ranging from highly constrained grammar and vocabulary drills
to limited conversational skills in simulated real-life situations. Prior to
implementing any such system, a choice needs to be made between two
fundamentally different system design types: closed response vs. open
response design. In both designs, students are prompted for speech input by
a combination of written, spoken, or graphical stimuli. However, the designs
differ significantly with reference to the type of verbal computer-student
interaction they support. In closed response systems, students must choose one
response from a limited number of possible responses presented on the screen.
Students know exactly what they are allowed to say in response to any given
prompt. By contrast, in systems with open response design, the grammar network remains
hidden and the student is challenged to generate the appropriate response
without any cues from the system.
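The closed response case can be sketched as a selection among the responses shown on the screen. In a real system the recognizer's grammar would itself be restricted to those responses; the sketch below instead assumes that a recognizer has returned a free-text hypothesis and picks the closest displayed response by word-level similarity. The function name and the 0.6 acceptance threshold are hypothetical.

```python
from difflib import SequenceMatcher


def pick_response(hypothesis, choices, min_similarity=0.6):
    """Return the displayed choice closest to the recognizer hypothesis, or None."""

    def similarity(a, b):
        # Compare word sequences rather than characters, ignoring case.
        return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

    best = max(choices, key=lambda choice: similarity(hypothesis, choice))
    if similarity(hypothesis, best) < min_similarity:
        return None  # too far from every allowed response: ask the student to repeat
    return best
```

Because the student can only say one of a handful of known sentences, even a partly misrecognized utterance usually remains closer to the intended choice than to any other.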
Closed Response Designs. One of
the first implementations of a closed response design was the Voice Interactive
Language Instruction System (VILIS) developed at SRI (Bernstein & Rtischev,
1991). This system elicits spoken student responses by presenting queries about
graphical displays of maps and charts. Students infer the right answers to a
set of multiple-choice questions and produce spoken responses.
A more recent prototype currently under development at SRI
is the Voice Interactive Language Training System (VILTS), a system designed to
foster speaking and listening skills for beginning through advanced L2 learners
of French (Egan, 1996; Neumeyer et al., 1996; Rypa, 1996). The system
incorporates authentic, unscripted conversational materials collected from
French speakers into an engaging, flexible, and user-centered lesson
architecture. The system deploys speech recognition to guide students through
the lessons and automatic pronunciation scoring to provide feedback on the
fluency of student responses. As far as we know, only the pronunciation scoring
aspect of the system has been validated in experimental trials (Neumeyer et al.,
1996).
In pedagogically more sophisticated systems, the
query-response mode is highly contextualized and presented as part of a
simulated conversation with a virtual interlocutor. To stimulate student
interest, closed response queries are often presented in the form of games or
goal-driven tasks. One commercial system that exploits the full potential of
this design is TraciTalk (Courseware Publishing International, Inc., Cupertino,
CA), a voice-driven multimedia CALL system aimed at more advanced ESL learners.
In a series of loosely connected scenarios, the system engages students in
solving a mystery. Prior to each scenario, students are given a task (e.g.,
eliciting a certain type of information), and they accomplish this task by
verbally interacting with characters on the screen. Each voice interaction
offers several possible responses, and each spoken response moves the
conversation in a slightly different direction. There are many paths through
each scenario, and not every path yields the desired information. This
motivates students to return to the beginning of the scene and try out a
different interrogation strategy. Moreover, TraciTalk features an agent that
students can ask for assistance and that accepts spoken commands for navigating the
system. Apart from being more fun and interesting, games and task-oriented
programs implicitly provide positive feedback by giving students the feeling of
having solved a problem solely by communicating in the target language.
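A branching scenario of this kind can be represented as a small graph in which each recognized response selects the next scene. The sketch below is a generic illustration: the scene names and sentences are invented for the example and are not taken from TraciTalk, and the function name is hypothetical.

```python
# Each scene maps the responses offered on screen to the scene that follows.
SCENARIO = {
    "start": {
        "Where were you last night?": "alibi",
        "Did you know the victim?": "dead_end",
    },
    "alibi": {
        "Who can confirm that?": "goal",
    },
    "dead_end": {},  # no further responses: the student has to start over
    "goal": {},      # the desired information has been obtained
}


def run_path(scenario, spoken_responses, start="start"):
    """Follow a sequence of recognized responses and return the scene reached."""
    scene = start
    for utterance in spoken_responses:
        options = scenario[scene]
        if utterance not in options:
            break  # not an allowed response in this scene: stay where we are
        scene = options[utterance]
    return scene
```

Only one of the two opening questions leads to the `goal` scene, which mirrors the observation that not every path through a scenario yields the desired information.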
The speech recognition technology underlying closed response
query implementations is very simple, even in the more sophisticated systems.
For any given interaction, the task perplexity is low and the vocabulary size
is comparatively small. As a result, these systems tend to be very robust. Recognition
accuracy rates in the low to upper 90% range can be expected depending on task
definition, vocabulary size, and the degree of non-native disfluency.
FUTURE TRENDS IN VOICE-INTERACTIVE CALL
In the previous sections, we reviewed the current state of speech
technology, discussed some of the factors affecting recognition performance,
and introduced a number of research prototypes that illustrate the range of
speech-enabled CALL applications that are currently technically and
pedagogically feasible. With the exception of a few exploratory open response
dialog systems, most of these systems are designed to teach and evaluate
linguistic form (pronunciation, fluency, vocabulary study, or grammatical
structure). This is no coincidence. Formal features can be clearly identified
and integrated into a focused task design. This means that robust performance
can be expected. Furthermore, mastering linguistic form remains an important
component of L2 instruction, despite the emphasis on communication (Holland,
1995). Prolonged, focused practice of a large number of items is still
considered an effective means of expanding and reinforcing linguistic
competence (Waters, 1994). However, such practice is time consuming. CALL can
automate these aspects of language training, thereby freeing up valuable class
time that would otherwise be spent on drills.
While such systems are an important step in the right
direction, other more complex and ambitious applications are conceivable and no
doubt desirable. Imagine a student being able to access the Internet, find the
language of his or her choice, and tap into a comprehensive voice-interactive
multimedia language program that would provide the equivalent of an entire
first year of college instruction. The computer would evaluate the student's
proficiency level and design a course of study tailored to his or her needs. Or
think of using the same Internet resources and a set of high-level authoring
tools to put together a series of virtual encounters surrounding the task of
finding an apartment in Berlin. As a minimum, one would hope that natural
speech input capacity becomes a routine feature of any CALL application.
To many educators, these may still seem like distant goals,
and yet we believe that they are not beyond reach. In what follows, we identify
four of the most persistent issues in building speech-enabled language learning
applications and suggest how they might be resolved to enable a more widespread
commercial implementation of speech technology in CALL.
1. More research is necessary on modeling and
predicting multi-turn dialogs.
An intelligent open response language tutor must not only
correctly recognize a given speech input, but in addition understand
what has been said and evaluate the meaning of the utterance for pragmatic
appropriateness. Automatic speech understanding requires Natural Language
Processing (NLP) capabilities, a technology for extracting grammatical,
semantic, and pragmatic information from written or spoken discourse. NLP has
been successfully deployed in expert systems and information retrieval. One of
the first voice-interactive dialog systems using NLP was the DARPA-sponsored
Air Travel Information System (Pallett, 1995), which enables the user to obtain
flight information and make ticket reservations over the telephone. Similar
commercial systems have been implemented for automatic retrieval of weather and
restaurant information, virtual environments, and telephone auto-attendants.
Many of the lessons learned in developing such systems can be valuable for
designing CALL applications for practicing conversational skills.