Modern technologies in teaching FLT
PLAN:
INTRODUCTION
PRINCIPLES OF ASR TECHNOLOGY
PERFORMANCE AND DESIGN ISSUES IN SPEECH APPLICATIONS
CURRENT TRENDS IN VOICE-INTERACTIVE CALL
FUTURE TRENDS IN VOICE-INTERACTIVE CALL
DEFINING AND ACQUIRING LITERACY IN THE AGE OF INFORMATION
CONTENT-BASED INSTRUCTION AND LITERACY DEVELOPMENT
THEORY INTO PRACTICE
CONCLUSION
REFERENCES
INTRODUCTION
During the past two decades, the exercise of spoken language
skills has received increasing attention among educators. Foreign language
curricula focus on productive skills with special emphasis on communicative
competence. Students' ability to engage in meaningful conversational
interaction in the target language is considered an important, if not the most
important, goal of second language education. This shift of emphasis has
generated a growing need for instructional materials that provide an opportunity
for controlled interactive speaking practice outside the classroom.
With recent advances in multimedia technology,
computer-aided language learning (CALL) has emerged as a tempting alternative
to traditional modes of supplementing or replacing direct student-teacher
interaction, such as the language laboratory or audio-tape-based self-study.
The integration of sound, voice interaction, text, video, and animation has
made it possible to create self-paced interactive learning environments that
promise to enhance the classroom model of language learning significantly. A
growing number of textbook publishers now offer educational software of some
sort, and educators can choose among a large variety of different products.
Yet, the practical impact of CALL in the field of foreign language education
has been rather modest. Many educators are reluctant to embrace a technology
that still seeks acceptance by the language teaching community as a whole
(Kenning & Kenning, 1990).
A number of reasons have been cited for the limited
practical impact of computer-based language instruction. Among them are the
lack of a unified theoretical framework for designing and evaluating CALL
systems (Chapelle, 1997; Hubbard, 1988; Ng & Olivier, 1987); the absence of
conclusive empirical evidence for the pedagogical benefits of computers in
language learning (Chapelle, 1997; Dunkel, 1991; Salaberry, 1996); and finally,
the current limitations of the technology itself (Holland, 1995; Warschauer,
1996). The rapid technological advances of the 1980s have raised both the
expectations and the demands placed on the computer as a potential learning
tool. Educators and second language acquisition (SLA) researchers alike are now
demanding intelligent, user-adaptive CALL systems that offer not only
sophisticated diagnostic tools, but also effective feedback mechanisms capable
of focusing the learner on areas that need remedial practice. As Warschauer
puts it, a computerized language teacher should be able to "understand a user's
spoken input and evaluate it not just for correctness but also for
appropriateness. It should be able to diagnose a student's problems with
pronunciation, syntax, or usage, and then intelligently decide among a range of
options (e.g., repeating, paraphrasing, slowing down, correcting, or directing
the student to background explanations)" (Warschauer, 1996, p. 6).
Salaberry (1996) demands nothing short of a system capable
of simulating the complex socio-communicative competence of a live tutor--in
other words, the linguistic intelligence of a human--only to conclude that the
attempt to create an "intelligent language tutoring system is a
fallacy" (p. 11). Because speech technology isn't perfect, it is of no use
at all. If it "cannot account for the full complexity of human language,"
why even bother modeling more constrained aspects of language use (Higgins,
1988, p. vii)? This sort of all-or-nothing reasoning seems symptomatic of much
of the latest pedagogical literature on CALL. The quest for a theoretical
grounding of CALL system design and evaluation (Chapelle, 1997) tends to lead
to exaggerated expectations as to what the technology ought to accomplish. When
combined with little or no knowledge of the underlying technology, the
inevitable result is disappointment.
PRINCIPLES OF ASR TECHNOLOGY
Consider the following four scenarios:
1. A court reporter listens to the opening arguments of the defense and types
the words into a steno-machine attached to a word-processor.
2. A medical doctor activates a dictation device and speaks his or her
patient's name, date of birth, symptoms, and diagnosis into the computer. He
or she then pushes "end input" and "print" to produce a written record of the
patient's diagnosis.
3. A mother tells her three-year-old, "Hey Jimmy, get me my slippers, will
you?" The toddler smiles, goes to the bedroom, and returns with papa's hiking
boots.
4. A first-grader reads aloud a sentence displayed by an automated Reading
Tutor. When he or she stumbles over a difficult word, the system highlights
the word, and a voice reads the word aloud. The student repeats the
sentence--this time correctly--and the system responds by displaying the next
sentence.
At some level, all four scenarios involve speech
recognition. An incoming speech signal elicits a response from a
"listener." In the first two instances, the response consists of a
written transcript of the spoken input, whereas in the latter two cases, an
action is performed in response to a spoken command. In all four cases, the
"success" of the voice interaction is relative to a given task as
embodied in a set of expectations that accompany the input. The interaction
succeeds when the response--by a machine or human "listener"--matches
these expectations.
Recognizing and understanding human speech requires a considerable
amount of linguistic knowledge: a command of the phonological, lexical,
semantic, grammatical, and pragmatic conventions that constitute a language.
The listener's command of the language must be "up" to the
recognition task or else the interaction fails. Jimmy returns with the wrong
items, because he cannot yet verbally discriminate between different kinds of
shoes. Likewise, the reading tutor would fail miserably at performing the
court reporter's job or transcribing medical patient information, just as the
medical dictation device would be a poor choice for diagnosing a student's
reading errors. On the other hand, the human court reporter--assuming he or she
is an adult native speaker--would have no problem performing any of the tasks
mentioned under (1) through (4). The linguistic competence of an adult native
speaker covers a broad range of recognition tasks and communicative activities.
Computers, on the other hand, perform best when designed to operate in clearly
circumscribed linguistic sub-domains.
Humans and machines process speech in fundamentally
different ways (Bernstein & Franco, 1996). Complex cognitive processes
account for the human ability to associate acoustic signals with meanings and
intentions. For a computer, on the other hand, speech is essentially a series
of digital values. However, despite these differences, the core problem of
speech recognition is the same for both humans and machines: namely, finding
the best match between a given speech sound and its corresponding word string.
Automatic speech recognition technology attempts to simulate and optimize this
process computationally.
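In the ASR literature, this best match is standardly formulated as a search
for the word string W that maximizes the probability of the words given the
observed acoustic signal O. By Bayes' rule, this decomposes into an acoustic
term and a language term (a textbook formulation, not tied to any particular
system):

```latex
\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\, P(W)
```

P(O|W) is supplied by the acoustic models (the HMMs described below) and P(W)
by the language model; the decoder carries out the argmax search.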
Since the early 1970s, a number of different approaches to
ASR have been proposed and implemented, including Dynamic Time Warping,
template matching, knowledge-based expert systems, neural nets, and Hidden
Markov Modeling (HMM) (Levinson & Liberman, 1981; Weinstein, McCandless,
Mondshein, & Zue, 1975; for a review, see Bernstein & Franco, 1996).
HMM-based modeling applies sophisticated statistical and probabilistic
computations to the problem of pattern matching at the sub-word level. The
generalized HMM-based approach to speech recognition has proven an effective,
if not the most effective, method for creating high-performance
speaker-independent recognition engines that can cope with large vocabularies;
the vast majority of today's commercial systems deploy this technique. We
therefore focus our technical discussion on HMM-based recognition.
An HMM-based speech recognizer consists of five basic
components: (a) an acoustic signal analyzer which computes a spectral
representation of the incoming speech; (b) a set of phone models (HMMs) trained
on large amounts of actual speech data; (c) a lexicon for converting sub-word
phone sequences into words; (d) a statistical language model or grammar network
that defines the recognition task in terms of legitimate word combinations at
the sentence level; (e) a decoder, which is a search algorithm for computing
the best match between a spoken utterance and its corresponding word string. Figure 1
shows a schematic representation of the components of a speech recognizer and
their functional interaction.
Figure 1. Components of a speech recognition device
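To make the division of labor concrete, the following sketch arranges the
five components into a single recognition loop. This is a structural
illustration only--every name is hypothetical rather than a real ASR library
API, and the naive loop over language model hypotheses stands in for the
decoder discussed in section E below.

```python
# Structural sketch only; all names are hypothetical, not a real ASR API.

class SpeechRecognizer:
    def __init__(self, analyzer, phone_models, lexicon, language_model):
        self.analyzer = analyzer              # (a) acoustic signal analysis
        self.phone_models = phone_models      # (b) trained HMMs
        self.lexicon = lexicon                # (c) phone-to-word mapping
        self.language_model = language_model  # (d) legal word sequences

    def recognize(self, audio_samples):
        """(e) Decoder: find the word string that best matches the input."""
        frames = self.analyzer(audio_samples)  # spectral feature frames
        best_words, best_score = None, float("-inf")
        # Naive stand-in for the decoder: score every word sequence the
        # language model allows; real decoders search far more efficiently.
        for words in self.language_model.hypotheses():
            phones = [p for w in words for p in self.lexicon[w]]
            score = self.phone_models.score(frames, phones)
            if score > best_score:
                best_words, best_score = words, score
        return best_words
```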
A. Signal Analysis
The first step in automatic speech recognition consists of
analyzing the incoming speech signal. When a person speaks into an ASR
device--usually through a high quality noise-canceling microphone--the computer
samples the analog input into a series of 8- or 16-bit values at a particular
sampling frequency (ranging from 8 to 22 kHz). These values are grouped together
in predetermined overlapping temporal intervals called "frames." These
numbers provide a precise description of the speech signal's amplitude. In a
second step, a number of acoustically relevant parameters, such as energy,
spectral features, and pitch information, are extracted from the speech signal
(for a visual representation of some of these parameters, see Figure 2).
During training, this information is used to model that particular
portion of the speech signal. During recognition, this information is matched
against the pre-existing model of the signal.
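As a concrete illustration of the framing step, the Python sketch below
groups raw samples into overlapping frames and computes one simple acoustic
parameter, the per-frame log energy. The 25 ms frame length and 10 ms shift
are common defaults assumed for the example, not fixed standards.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Group raw samples into overlapping frames for feature extraction."""
    frame_len = int(sample_rate * frame_ms / 1000)  # e.g., 400 samples
    shift = int(sample_rate * shift_ms / 1000)      # e.g., 160 samples
    n_frames = 1 + max(0, (len(samples) - frame_len) // shift)
    frames = np.stack([samples[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    # Log energy per frame: one of the acoustic parameters mentioned above
    energy = np.log(np.sum(frames.astype(float) ** 2, axis=1) + 1e-10)
    return frames, energy

# One second of audio at 16 kHz yields 98 overlapping 25 ms frames
frames, energy = frame_signal(np.zeros(16000, dtype=np.int16))
print(frames.shape)  # (98, 400)
```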
B. Phone Models
Training a machine to recognize spoken language amounts to
modeling the basic sounds of speech (phones). Automatic speech recognition
strings together these models to form words. Recognizing an incoming speech
signal involves matching the observed acoustic sequence with a set of HMMs.
An HMM can model phones or other sub-word units, or it can model words or
even whole sentences. Phones are modeled either as individual sounds--so-called
monophones--or as phone combinations that capture several phones and the
transitions between them (biphones or triphones). After comparing the incoming
acoustic signal with the HMMs representing the sounds of language, the system
computes a hypothesis based on the sequence of models that most closely
resembles the incoming signal. The HMM model for each linguistic unit (phone or
word) contains a probabilistic representation of all the possible
pronunciations for that unit--just as the model of the handwritten cursive b
would have many different representations. Building HMMs--a process called
training--requires a large amount of speech data of the type the system is
expected to recognize. Large-vocabulary speaker-independent continuous dictation
systems are typically trained on tens of thousands of read utterances by a
cross-section of the population, including members of different dialect regions
and age-groups. As a general rule, an automatic speech recognizer cannot
correctly process speech that differs in kind from the speech it has been
trained on. This is why most commercial dictation systems, when trained on
standard American English, perform poorly when encountering accented speech,
whether by non-native speakers or by speakers of different dialects. We will
return to this point in our discussion of voice-interactive CALL applications.
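The toy example below shows, with invented numbers, how a trained three-state
phone HMM assigns a probability to an observation sequence via the forward
algorithm. Real recognizers model continuous spectral features rather than
the two discrete symbols assumed here, but the computation has the same shape.

```python
import numpy as np

# A 3-state left-to-right topology, the shape commonly used for phone models
trans = np.array([[0.6, 0.4, 0.0],   # state transition probabilities
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
emit = np.array([[0.8, 0.2],         # P(observation symbol | state)
                 [0.3, 0.7],
                 [0.5, 0.5]])
start = np.array([1.0, 0.0, 0.0])    # always begin in the first state

def forward_prob(obs):
    """Total probability that this phone model produced the sequence."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

# The model scores competing acoustic sequences; recognition picks the
# phone model whose score for the observed signal is highest.
print(forward_prob([0, 0, 1, 1]))
print(forward_prob([1, 1, 0, 0]))
```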
C. Lexicon
The lexicon, or dictionary, contains the phonetic spelling
for all the words that are expected to be observed by the recognizer. It serves
as a reference for converting the phone sequence determined by the search
algorithm into a word. It must be carefully designed to cover the entire
lexical domain in which the system is expected to perform. If the recognizer
encounters a word it does not "know" (i.e., a word not defined in the
lexicon), it will either choose the closest match or return an
out-of-vocabulary recognition error. Whether a recognition error is registered
as a misrecognition or an out-of-vocabulary error depends in part on the vocabulary
size. If, for example, the vocabulary is too small for an unrestricted
dictation task--say, fewer than 3,000 words--the out-of-vocabulary error rate
is likely to be very high. If the vocabulary is too large, the chance of
misrecognition errors increases, because a larger stock of similar-sounding
words makes confusions more likely. The vocabulary size of most commercial
dictation systems varies between 5,000 and 60,000 words.
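A pronunciation lexicon can be pictured as a mapping from words to phone
sequences, as in the miniature Python sketch below. The ARPAbet-style symbols
and entries are invented for illustration; the sketch also shows the
out-of-vocabulary case described above.

```python
# Each word maps to one or more pronunciations (lists of phone symbols)
LEXICON = {
    "bear": [["B", "EH", "R"]],
    "bare": [["B", "EH", "R"]],               # homophones share phones
    "attacked": [["AH", "T", "AE", "K", "T"]],
    "him": [["HH", "IH", "M"], ["IH", "M"]],  # full and reduced variants
}

def phones_to_words(phone_seq):
    """Return every word whose listed pronunciations include phone_seq."""
    matches = [w for w, prons in LEXICON.items() if phone_seq in prons]
    # A sequence matching no entry is the out-of-vocabulary case
    return matches or ["<out-of-vocabulary>"]

print(phones_to_words(["B", "EH", "R"]))  # ['bear', 'bare']: ambiguous;
                                          # the language model must decide
```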
D. The Language Model
The language model predicts the most likely continuation of an utterance on
the basis of statistical information about the frequency with which word
sequences occur in the language to be recognized. For
example, the word sequence A bare attacked him will have a very low
probability in any language model based on standard English usage, whereas the
sequence A bear attacked him will have a higher probability of
occurring. Thus the language model helps constrain the recognition hypothesis
produced on the basis of the acoustic decoding just as the context helps
decipher an unintelligible word in a handwritten note. Like the HMMs, an
efficient language model must be trained on large amounts of data, in this case
texts collected from the target domain.
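The bear/bare example can be made concrete with a miniature bigram model.
The counts below are invented for illustration; in practice they would be
estimated from a large corpus, with smoothing assigning a small non-zero
probability to word pairs never seen in training.

```python
# Invented counts standing in for statistics from a large text corpus
BIGRAM_COUNTS = {("a", "bear"): 120, ("a", "bare"): 2,
                 ("bear", "attacked"): 45, ("bare", "attacked"): 0}
UNIGRAM_COUNTS = {"a": 5000, "bear": 150, "bare": 90}

def bigram_prob(w1, w2, alpha=1.0, vocab_size=10000):
    """P(w2 | w1) with add-alpha smoothing for unseen pairs."""
    num = BIGRAM_COUNTS.get((w1, w2), 0) + alpha
    den = UNIGRAM_COUNTS.get(w1, 0) + alpha * vocab_size
    return num / den

def sentence_score(words):
    score = 1.0
    for w1, w2 in zip(words, words[1:]):
        score *= bigram_prob(w1, w2)
    return score

# "a bear attacked him" scores orders of magnitude higher than
# "a bare attacked him", steering the decoder to the plausible reading
print(sentence_score(["a", "bear", "attacked", "him"]))
print(sentence_score(["a", "bare", "attacked", "him"]))
```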
In ASR applications with constrained lexical domain and/or
simple task definition, the language model consists of a grammatical network
that defines the possible word sequences to be accepted by the system without
providing any statistical information. This type of design is suitable for CALL
applications in which the possible word combinations and phrases are known in
advance and can be easily anticipated (e.g., based on user data collected with
a system pre-prototype). Because of the a priori constraining function
of a grammar network, applications with clearly defined task grammars tend to
perform at much higher accuracy rates than the quality of the acoustic
recognition would suggest.
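In the constrained case, the "language model" may simply enumerate the legal
word sequences. The sketch below builds such a grammar network for a
hypothetical lesson; the phrases are invented examples.

```python
import itertools

# Phrase classes for a hypothetical ordering dialogue
GRAMMAR = {
    "greeting": ["hello", "good morning"],
    "request": ["i would like", "may i have"],
    "item": ["a coffee", "a ticket", "the menu"],
}

# Legal utterances follow the pattern: [greeting] request item
LEGAL_UTTERANCES = {
    " ".join(parts)
    for parts in itertools.chain(
        itertools.product(GRAMMAR["request"], GRAMMAR["item"]),
        itertools.product(GRAMMAR["greeting"], GRAMMAR["request"],
                          GRAMMAR["item"]),
    )
}

def accepts(utterance):
    """The recognizer only ever has to choose among these sequences."""
    return utterance in LEGAL_UTTERANCES

print(accepts("may i have a coffee"))  # True
print(accepts("a coffee may i have"))  # False: outside the network
```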
E. Decoder
Simply put, the decoder tries to find the word string that is most likely to
correspond to a given sequence of speech sounds. This is a search problem, and especially
in large vocabulary systems careful consideration must be given to questions of
efficiency and optimization, for example to whether the decoder should pursue
only the most likely hypothesis or a number of them in parallel (Young, 1996).
An exhaustive search of all possible completions of an utterance might
ultimately be more accurate but of questionable value if one has to wait two
days to get a result. Trade-offs are therefore necessary to maximize search
accuracy while minimizing CPU load and recognition time.
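One standard compromise is beam pruning: at each step the decoder keeps only
the few best-scoring partial hypotheses instead of expanding all of them. The
sketch below illustrates the idea on an invented word graph with made-up
log-probability scores; a real decoder would combine acoustic and language
model scores frame by frame.

```python
import heapq

# Invented successor graph and log-probability scores for illustration
NEXT_WORDS = {"<s>": ["a"], "a": ["bear", "bare"], "bear": ["attacked"],
              "bare": ["attacked"], "attacked": ["him"], "him": []}
SCORES = {("<s>", "a"): -1.0, ("a", "bear"): -2.0, ("a", "bare"): -6.0,
          ("bear", "attacked"): -2.5, ("bare", "attacked"): -9.0,
          ("attacked", "him"): -1.5}

def expand(hyp):
    """Extend a partial hypothesis (word list, log score) by one word."""
    words, logp = hyp
    return [(words + [w], logp + SCORES[(words[-1], w)])
            for w in NEXT_WORDS[words[-1]]]

def beam_search(steps, beam_width=2):
    beam = [(["<s>"], 0.0)]
    for _ in range(steps):
        candidates = [h for hyp in beam for h in expand(hyp)]
        if not candidates:
            break
        # Pruning step: keep only the beam_width best partial hypotheses
        beam = heapq.nlargest(beam_width, candidates, key=lambda h: h[1])
    return max(beam, key=lambda h: h[1])

print(beam_search(steps=4))
# (['<s>', 'a', 'bear', 'attacked', 'him'], -7.0)
```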
PERFORMANCE AND DESIGN ISSUES IN SPEECH APPLICATIONS
For educators and developers interested in deploying ASR in
CALL applications, perhaps the most important consideration is recognition
performance: How good is the technology? Is it ready to be deployed in language
learning? These questions cannot be answered except with reference to
particular applications of the technology, and therefore touch on a key issue
in ASR development: the issue of human-machine interface design.
As we recall, speech recognition performance is always
domain specific--a machine can only do what it is programmed to do, and a
recognizer with models trained to recognize business news dictation under
laboratory conditions will be unable to handle spontaneous conversational
speech transmitted over noisy telephone channels. The question that needs to be
answered is therefore not simply "How good is ASR technology?" but
rather, "What do we want to use it for?" and "How do we get it
to perform the task?"
In the following section, we will address the issue of
system performance as it relates to a number of successful commercial speech
applications. By emphasizing the distinction between recognizer performance on
the one hand--understood in terms of "raw" recognition accuracy--and
system performance on the other, we suggest how the latter can be optimized within
an overall design that takes into account not only the factors that affect
recognizer performance as such, but also, and perhaps even more importantly,
considerations of human-machine interface design.