EU-US WORKING GROUP ON SPOKEN-WORD AUDIO
COLLECTIONS
0.0 EXECUTIVE SUMMARY
Our diverse cultures rely
increasingly on audio and video resources. We need to chart a steady course to
assure the utility of this record. Such a course calls for a plan to preserve
these resources and to determine the most effective ways to access their rich
content. For example, though our nations possess enormous collections of
spoken-word materials, many of these collections will remain inaccessible to
the public, whether for lack of adequate search technologies or through decay,
unless we act to chart an access and preservation path. Our aim is to forge agreement on
these vital topics so that as technology changes, we will be able to rely on
our collections to understand and preserve these essential components of our
cultural heritage. We also need to focus research support on areas of access
and preservation that we believe will yield the greatest benefits across many
intersecting disciplines. This document
presents an agenda for collaborative research in this field.
Spoken-word collections cover
many different domains. These include radio and television broadcasts,
governmental proceedings, lectures, oral narratives, meetings and telephone
conversations. Needs vary in collecting,
accessing and preserving such data:
>> Political and
economic: providing access for citizens
to governmental proceedings, corporate shareholder meetings, public meetings of
political parties and NGOs
>> Cultural: building and
providing access to large, multilingual archives of broadcast material, public
performance, oral narrative
>> Educational: acquisition, preservation, search and access
of lectures; use of digital audio and video resources as primary sources for
inquiry and explication.
We have now reached the point
where various enabling technologies have matured sufficiently for the research
community to address these needs.
0.1 RESEARCH AGENDA
We have structured the
research agenda for Spoken-Word Archiving into three main areas: technologies,
privacy and copyright, and archiving and access. The main priorities are to advance the state-of-the-art within
each area and to foster integration among them. It is clear that each area informs the others.
0.1.2. Technologies
Audio/signal processing: Many spoken-word
collections of interest, particularly historical collections, have
deteriorating audio, due to media degradation or imperfect analog recording
technology. Other audio signal processing challenges arise from multiple
overlapping speakers (e.g., meetings), low signal quality due to far-field
microphones (e.g., in courtrooms), and effects of other sound sources and room
acoustics.
Speech and speaker recognition: Any spoken
audio collection raises two immediate questions: (a) What was said? (b) Who
said it? Speech and speaker recognition
technologies now work to minimally acceptable levels in controlled domains such
as broadcast news. However, to achieve
substantial improvements will require new tools to address less controlled
collections of spoken audio. Without such tools, the costs in labor to access
spoken-word collections will be prohibitive. The creation of these tools also
enables the hearing-impaired public to access and use these materials.
Language identification: In a
multilingual context, automatic language identification is essential. In particular many collections (e.g.,
meetings at a European level, some oral narratives) feature speakers switching
between different languages. We can
construct adequate baseline systems based on current knowledge, but issues such
as within-utterance language change pose interesting and challenging research
problems.
Information extraction: The use of a
spoken-word collection can be enhanced by the automatic generation of content
annotations. Currently it is possible to automatically identify names and
numbers and to provide punctuation.
However, it would be advantageous to annotate many other
elements--particularly prosodic events--above the word level such as emotion,
decision points in meetings, and interaction patterns in a conversation.
Collection level browsing and search: The current
state-of-the-art for collection-level browsing and search is based on the
application of text retrieval approaches to speech recognizer output. While this has been relatively successful in
some domains (e.g., broadcast news), such approaches have clear limitations,
and the development of new search-and-browse approaches, beyond simple text
retrieval, is required.
Presentation: The final technological research area that
we have identified is presentation.
Currently this involves little more than playing an audio clip and
displaying its transcription. There is
an enormous need for research in this area. Several examples come to mind: the
construction of audio scenes, presentation of higher-level structure,
summarization, and presentation of non-lexical information in speech.
0.1.3. Privacy and copyright
policy
A number of policy issues
arise when discussing spoken-word collections, and it is impossible to treat
the technologies in isolation from these issues.
Privacy: Privacy is a major problem, particularly for
some spoken-word collections when individuals do not have an expectation that
their statements will be archived, although they have spoken in a public forum
such as a company board meeting or a political rally. It may not be possible to offer a comprehensive solution to the
privacy problem, particularly for materials where contact with the original
collector or subject has long since been lost, but research in this area can
accomplish some practical goals. Future
collectors must be armed with reasonable policies to obtain clearances and
document applicable rights.
Copyright: The impact of copyright varies by
collection, and by national jurisdiction.
Because the legal terrain here is difficult to understand and is
undergoing rapid change, a practical approach for cultural institutions to take
may be to implement "acceptable risk" policies. These policies set forth overarching
principles of respect for subjects and for the creators' intellectual property
rights, but balance them against a need to provide access to important cultural
heritage materials. Issues to research include: copyright exemptions (e.g., for
educational purposes), classes of works that do not qualify for copyright
protection, digitization for preservation and mediated access, and questions
collection custodians should pose to determine copyright status and likely
consequences of wide availability of digital surrogates.
0.1.4. Archiving and access
Preservation: Open research issues include standards for
preservation and development of sustainable digital repositories. Issues that
need to be addressed include: funding; automating digitization and metadata
capture; migrating and refreshing/augmenting collections. Computerized automated capture and
preservation of collections clearly underlies the development of this entire
area.
Content structure: This area spans metadata, item
structure, annotation, discovery and delivery issues, such as network
bandwidth. Metadata vocabularies have
been developed, but this area still needs further research, particularly when
the archived items have a complex structure.
Additionally, metadata needs to be aggregated and services offered on
the aggregated collection. Models and
tools for annotation are a rapidly evolving research area, particularly in the
area of distributed and collaborative annotation.
Media storage:
Even with the rapidly declining costs of spinning disks, most preservation-quality
audio collections will continue to require supplemental digital storage media
for the raw audio files at least into the foreseeable future. Research is
needed on various media (CD, DVD) and best practices for storing, checking, and
refreshing.
0.2 CONCLUSION
Though we represent diverse
disciplines, we see convergence in the domain of spoken-word collections to
address new and challenging issues. In advancing an ambitious research agenda,
we envision ancillary benefits across many communities of interest: speech and
language technology; software architecture; information science and digital
libraries; and a set of diverse user communities. Progress requires integration across these areas at the international
level. In our judgment, the impact will
be substantial. To do any less will risk significant loss to an essential
element of our collective heritage.
EU:
Steve Renals, CS, University
of Sheffield, UK
Franciska de Jong, CS,
University of Twente, Netherlands
Marcello Federico, ITC-IRST,
Trento, Italy
Lori Lamel, LIMSI-CNRS,
France
Fabrizio Sebastiani, IEI-CNR,
Pisa, Italy
Richard Wright, BBC
Information+Archives, UK
US:
Jerry Goldman, Political
Science, Northwestern University
Steven Bird, CIS/LDC, University of Pennsylvania
Claire Stewart, Library,
Northwestern University
Carl Fleischhauer, Library of
Congress
Mark Kornbluh, MATRIX and
History, Michigan State
Douglas W. Oard, CLIS/UMIACS,
University of Maryland
1.0. INTRODUCTION
This report emerges from a
joint working group, supported in the US by the NSF Digital Library Initiative
and in the EU by the Network of Excellence for Digital Libraries (DELOS). The aim of the group is to define a common
agenda for research in the area of spoken word audio collections, and to define
areas for collaborative research between European and US researchers. The scope of this report is spoken word
audio, with no explicit reference to related areas, such as non-speech audio or
video.
The following sections set
out the issues, approaches and policies that converge in the area of spoken-word
collections. Section 2 outlines the major issues in the field. Section 3
surveys the current technological state-of-the-art. Section 4 examines the controversial and rapidly changing policy
issues pertaining to privacy and copyright.
Section 5 covers the issues of collecting, archiving and preserving
spoken-word content. An appendix addresses content preservation from a digital
library perspective.
2.0 DESCRIPTION OF THE ISSUES
2.1 Types of spoken word
collections
As recording technology
improved dramatically over the course of the twentieth century, the size and
diversity of spoken word audio collections expanded geometrically. The cost of
recording and preserving sound has moved steadily downward as progressive
technological change has facilitated successive generations of recording
devices, each with increased storage capacity. While some recordings have been
migrated forward to newer technologies, archives today contain recorded
collections from every stage of technological development, captured at diverse
standards, and on assorted storage media.
Existing spoken word
collections cover an enormous range, from the earliest recordings of public
speeches and broadcasts on wax cylinders and 78 rpm records to oral histories
on cassette and reel-to-reel tapes and on through continuous digital recordings
of contemporary broadcast news.
With declining costs and
increasing recording and storage capacity, the breadth of audio collections
continues to grow. As a result, the domains covered by spoken word collections
are both varied and vast. Nonetheless, it is possible to identify the major
types of spoken word collections. These are:
1. Broadcast news (both radio and TV)
2. Governmental proceedings (parliamentary debates, court recordings, commissions and committees)
3. Presentations in the form of speeches, sermons, and lectures (political, religious, educational)
4. Oral narratives (usually retrospective interviews)
5. Interactive meetings (business (e.g., shareholders), political (e.g., political party conventions))
6. Recorded telephone conversations
2.2 Magnitude of content
National archives and public
broadcast archives in Europe and the United States have millions of hours of
holdings, much of which features spoken language. The bulk of this content -- estimated to be 80 percent -- is in
analog form. It will perish within a
few decades unless we take steps to preserve it. An order-of-magnitude estimate for all significant world holdings
of audio material in analog formats is 100 million hours. In addition, millions
of hours of spoken language materials come into existence in digital form every
year. As digital systems replace analog systems, and as recording and storage
costs decline, we envision accelerated growth in the creation of spoken word
documents and increased demand for efficient archiving and retrieval
strategies.
2.3 User communities
Not surprisingly, the user
communities for spoken word collections are as vast and varied as the
collections themselves. Indeed, many collections serve diverse needs for very
different user communities. For example, a recorded speech might be used for
political purposes, for educational and research purposes, or for linguistic
analysis.
Spoken word collections have
implications across all areas of daily life. In politics, recordings of
speeches, debates, broadcasts, etc. are an essential source of information on
governmental proceedings, political candidates and positions, and citizen activities. In
commerce, recordings of board meetings, political debates, broadcast news, etc.
can provide access to vital information for economic planning and action. In
law, recordings can serve as essential evidence in civil and criminal cases.
And in culture and education, spoken word collections are vital to preserve,
understand and teach about all aspects of social life.
2.4 Archives
Existing spoken word archives
are equally varied. Government bodies, archives, libraries, museums,
universities, churches, political parties, corporations, broadcasters, recording
companies, community organizations, and individuals hold spoken word
collections. While some major media
organizations such as broadcasters and recording companies have prioritized
preservation and developed systems for access, particularly with native digital
collections, these are the exceptions.
For most organizations, their spoken word collections comprise a small
part of a larger effort to collect written material. Small and regional media organizations, which often produce large
amounts of new spoken word materials daily, do not have the resources to
archive their growing collections adequately, if at all. Thus spoken word collections are often the
stepchild of an archive—minimally managed, poorly preserved, and hardly
accessible.
Most spoken word archives
have analog holdings on various media, typically different types of tape. Tape
recordings have limited life spans, are hard to maintain and lose quality when
copied on to new analog media. Analog recordings can provide only linear access;
they must be listened to in consecutive order. Preservation needs and access
are in perpetual conflict with analog materials since the media deteriorate
with use. Working or shelf copies protect the source material, but increase the
archiving costs through additional physical space and collection management.
The future of spoken word
archives clearly lies with digital technology. Most new recordings, including
broadcast materials, are now 'born digital.' Equally important, conversion of
older analog content to digital media is essential both for long-term
preservation and for increased access to spoken word resources. With
digital collections, access and preservation are not in conflict. Digital
content can be endlessly replicated with no loss of quality. Most important,
digitization breaks the linear tyranny of analog recording. Digital sound can
be searched in ways unimaginable with analog recordings. Access can be nearly
instantaneous.
2.5 Access
For analog recordings, access
is only possible through replication of the physical recording. To listen to
most spoken word recordings in archival collections today, one must either
travel to the archive and listen to a second- or third-generation copy or pay
for a copy of the second- or third-generation recording. The cost in time and
resources constrains access. Digital recordings, however, can be transmitted
without loss over the Internet. The World Wide Web is rapidly becoming a
doorway to digital audio collections. Streamed news and
cultural programming can be accessed from RAI Radio (Italy)
<http://www.radio.rai.it/>, BBCi (United Kingdom) <www.bbc.co.uk>,
and National Public Radio (United States) <www.npr.org>. Other specialized spoken word collections,
such as The OYEZ Project <www.oyez.org>, which delivers the archived
recordings of US Supreme Court arguments, vividly
illustrate the increased access to spoken word collections made possible by
advances in information technology. The Web, however, is only the first doorway
to digital spoken word collections. Already we are seeing the development of
alternative multimedia delivery channels, from collection sharing via
peer-to-peer networking to downloading of spoken-word materials in popular
formats like MP3 or OGG. Client storage devices hold ever-increasing amounts of
data at lower and lower price points. PDAs and cell phones coupled with the
development of new services will soon permit multi-channel delivery including
spoken word materials such as talking books.
2.6 Convergence of
Technology, Needs, and Possibilities
We have now reached the point
where various enabling technologies have matured sufficiently for large-scale
conversion of spoken word resources from analog to digital. Computer equipment
is now available enabling low cost and low loss transfer to digital media. The
archival community has set standards for such conversion to ensure the
integrity of the original collections. Ubiquitous and inexpensive computers
offer ready access to digitized sound.
These are only the first
steps, however, to ensuring long term preservation and enhanced access to
spoken word collections. We are at a watershed moment where digital sound
archives are possible, their advantages over old analog collections are
self-evident, and the technology to work with digital audio has matured to
facilitate a new research agenda for spoken word archives. Furthermore, many of these analog
collections are at risk, and we have a narrow window of opportunity to preserve
these analog media digitally before they become unusable. We can now envision with confidence and
clarity the research agenda in speech recognition, speech enhancement, speech
analysis and retrieval, archival design, metadata development, delivery
interface, educational integration, and other areas to fully realize the
potential of spoken word archives.
The digital revolution has
the potential to do for aural resources what the printing press did for written
resources. For the first time in human history, the spoken word can now be
preserved for the long term and made accessible to those far beyond hearing
range and in ways that open up entirely new possibilities for human
culture. Researchers have the tools and
capacities to transform access to the spoken word and vastly enrich capacities
across all aspects of our societies.
3.0 CURRENT TECHNOLOGIES AND
NEW LANDSCAPES
In the last decade, speech
recognition technology has made impressive advances and has proven to be
effective for indexing audiovisual archives. Research projects, such as
Informedia at Carnegie Mellon University
<http://www.informedia.cs.cmu.edu/>, and recent commercial products, such
as Virage <http://www.virage.com/>, have successfully deployed
state-of-the-art speech recognition into digital libraries of broadcast news.
Automatic indexing and content-based access of audiovisual archives is today
feasible thanks to outstanding results from research in speech recognition,
language processing, and information retrieval. An important driving force in speech
recognition has been the US Defense Advanced Research Projects Agency
(DARPA) <www.darpa.mil>. Working through the National Institute of
Standards and Technology (NIST) <www.nist.org/speech> and the Linguistic
Data Consortium at the University of Pennsylvania (LDC)
<www.ldc.upenn.edu>, DARPA coordinates and focuses efforts on relevant
research topics, collects language resources, and organizes systematic
evaluations. We survey technologies for indexing and accessing spoken documents
developed under the DARPA umbrella.
3.1. Background Information.
Work in Large Vocabulary
Continuous Speech Recognition (LVCSR) started well over a decade ago with tasks
mostly oriented toward automatic dictation. In the United States, DARPA set up
a common framework by providing both large amounts of training data and
evaluation data. The reference task--dictation of Wall Street Journal
articles--was used to refine the technology for dealing with a vocabulary of
several tens of thousands of words, a challenging task in itself at the time.
In the years that followed, research interest moved toward making automatic
speech recognition (ASR) systems capable of handling a range of acoustic
conditions and speaking styles much wider than what could be found in dictation
tasks. The new reference task became therefore the transcription of broadcast
news (BN) programs, for which increasing amounts of training data were
collected.
During the same years,
similar evaluations were organized by DARPA for other research problems related
to spoken information processing: speaker recognition, information
extraction, spoken document retrieval, and topic detection and tracking. Tasks
of increasing complexity have been defined over time, for which increasing
amounts of training data have been made available to participants. As a
consequence, scaling-up of the technology was enforced, together with
improvements in performance and robustness of the systems. Concerning the
content of data, up to now most of the evaluations have been carried out on
American English BN. The news domain is indeed very general and makes data
collection relatively easy. Research aimed at porting these techniques to other
domains and languages has started in several labs.
3.2 Audio Indexing
Audio indexing involves several
discrete topics: audio partitioning, speech recognition, speaker
identification, information extraction, and automatic summarization. We examine
each topic below.
3.2.1. Audio Partitioning
Audio partitioning is
concerned with segmenting an audio stream into acoustically homogeneous chunks
and classifying them according to a set of broad acoustic classes. For
instance, for the purpose of speech recognition, the audio is usually
partitioned by identifying segments containing speech versus other types of
content, such as music. In many systems, the classification of speech segments
is refined by recognizing, for instance, the signal bandwidth, the gender of
the speaker, the speaker's identity, the level of noise, etc. The difficulty of
this task increases with the level of detail required by the
segmentation/classification task. For instance, while detecting speech segments
in conversational speech is relatively easy, detecting speaker turns can be
very difficult when overlapping speech occurs (that is, when two people speak
simultaneously). Moreover, segment classification, like any other pattern
classification task, becomes difficult when the actual conditions do not match
those observed in the training data. Acoustic segmentation and classification are
crucial for indexing audio recordings, which may contain more than pure speech,
e.g., music scores, jingles, etc. Moreover, an accurate segmentation can be
exploited to run speech recognizers specifically trained on a given acoustic
condition, e.g. bandwidth, gender, speaker.
In recent years, several
algorithms have been presented which use a statistical decision criterion to
detect spectral changes (SCs) within the feature space of the signal. Assuming
that a Gaussian process generates data, SCs are detected within a sliding
window through a model selection method. The most likely SC is tested by
comparing two hypotheses: (i) the data in the window are generated by the same
distribution; (ii) the data in the left and right halves of the window are
drawn from two different distributions. The test is performed with a likelihood
ratio that also takes into account the different "sizes" of the
compared models. Usually, the Bayesian Information Criterion (BIC) is applied
to select the best fitting model.
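As a rough illustration of the test described above, the following Python sketch scores candidate change points in one window with a BIC-penalized likelihood ratio under the stated Gaussian assumption. The function names, the full-covariance model, the `penalty_weight` and `margin` parameters, and the use of NumPy are assumptions made for this example; they are not taken from any particular system surveyed here.

```python
import numpy as np


def logdet_cov(frames):
    """Log-determinant of the maximum-likelihood covariance of the frames."""
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    # Small ridge (an assumption of this sketch) keeps the matrix well conditioned.
    cov = cov + 1e-6 * np.eye(frames.shape[1])
    return np.linalg.slogdet(cov)[1]


def delta_bic(window, split, penalty_weight=1.0):
    """Score hypothesis (ii), two Gaussians, against hypothesis (i), one Gaussian.

    A positive value favours a spectral change at `split`.
    """
    n, d = window.shape
    left, right = window[:split], window[split:]
    # Log-likelihood ratio of the two-model fit over the one-model fit.
    gain = 0.5 * (n * logdet_cov(window)
                  - len(left) * logdet_cov(left)
                  - len(right) * logdet_cov(right))
    # The two-model hypothesis has one extra mean and one extra covariance.
    extra_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * extra_params * np.log(n)
    return gain - penalty


def detect_change(window, margin=20, penalty_weight=1.0):
    """Return the most likely change index in the window, or None if no split wins."""
    best_split, best_score = None, 0.0
    for split in range(margin, window.shape[0] - margin):
        score = delta_bic(window, split, penalty_weight)
        if score > best_score:
            best_split, best_score = split, score
    return best_split
```

In a sliding-window detector built on this sketch, a returned index would mark a segment boundary and the window would restart there; a result of `None` would cause the window to grow or advance. The `margin` guard simply avoids estimating a covariance from too few frames.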
In order to classify
segments, researchers use Gaussian mixture models, which typically have been
trained on supervised data. Finally, clustering of speech segments is carried
out by a bottom-up scheme that groups segments that are acoustically close
with respect to the BIC or some other defined metric. Audio partitioning has been
applied successfully and extensively, mainly on broadcast news transcription.
The application to other audio collections poses problems of portability and
robustness of the methods, which at the moment are surmounted by tuning the
system on some development data. Future work should be devoted to developing
robust methods which can cope with greater variability of acoustic
conditions.
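The bottom-up clustering step mentioned above can be sketched as follows. This is a minimal example, assuming each segment is an array of feature frames modelled by a single full-covariance Gaussian and that two clusters are merged whenever the BIC prefers one joint model. The function names and the `penalty_weight` parameter are hypothetical, and the Gaussian mixture classification stage is not covered.

```python
import numpy as np


def logdet_cov(frames):
    """Log-determinant of the maximum-likelihood covariance of the frames."""
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    cov = cov + 1e-6 * np.eye(frames.shape[1])  # ridge: an assumption of this sketch
    return np.linalg.slogdet(cov)[1]


def merge_cost(a, b, penalty_weight=1.0):
    """BIC difference between separate and joint models; negative favours merging."""
    both = np.vstack([a, b])
    n, d = both.shape
    gain = 0.5 * (n * logdet_cov(both)
                  - len(a) * logdet_cov(a)
                  - len(b) * logdet_cov(b))
    penalty = 0.5 * penalty_weight * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty


def cluster_segments(segments, penalty_weight=1.0):
    """Greedily merge the closest pair of clusters until the BIC rejects every merge.

    Returns a list of clusters, each a sorted list of input segment indices.
    """
    clusters = [(seg, [i]) for i, seg in enumerate(segments)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i][0], clusters[j][0], penalty_weight)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        if cost >= 0:
            break  # no remaining pair is better explained by a single model
        merged = (np.vstack([clusters[i][0], clusters[j][0]]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(ids) for _, ids in clusters]
```

Because the stopping rule is the same penalized test used for change detection, the number of clusters does not have to be fixed in advance; tuning `penalty_weight` on development data is one way the portability issue noted above shows up in practice.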
3.2.3. Speech Enhancement
Speech is often recorded
under sub-optimal conditions, but pre-processing techniques can be used to
enhance the suitability of the signal for subsequent processing. For access to
spoken content, speech enhancement typically seeks to achieve one or more of
the following goals: (1) improved accuracy from subsequent automatic processing
(e.g., automatic speech recognition), (2) improved intelligibility for a human
listener, or (3) a qualitative improvement in the listening experience for a
human listener. Human perception is far more robust than present automated
approaches to speech recognition, so signal processing that precedes speech
recognition is presently the focus of a substantial research effort. The
initial focus of that work has been accommodation of environmental factors
(e.g., background sounds such as vehicle noise or unrelated transient signals,
and the results of microphone placement and room acoustics such as echo or
reverberation) and the effect of transmission channels (e.g., speech
compression algorithms for cellular telephones). Work with recorded materials
has generally focused on improving intelligibility and/or the listening
experience, topics often referred to as "audio restoration." Much
recorded speech is stored on analog media, including cassette tape, open-reel
magnetic tape, phonograph records, and (less commonly) Dictabelt loops, wire
recordings and wax cylinders. In addition to environmental factors, analog
recordings might be degraded when they are first created (e.g., by the
frequency response of the microphone), during duplication (e.g., reduction in
the signal-to-noise ratio), during storage (e.g., warping of a phonograph
record), as a result of prior use (e.g., splicing to repair a tape break), and
during replay (e.g., due to variations in motor speed). Audio restoration
techniques leverage an understanding of the characteristics of undesirable
signal components (e.g., clicks and pops from damaged phonograph records, or
"thumps" from Dictabelt loops that were folded for storage) and
human perceptual characteristics (e.g., critical bands and auditory masking) to
produce a more satisfactory reproduction of the original content.
3.2.4. Speech recognition
Speech recognition is
concerned with converting the speech waveform (an acoustic signal) into a
sequence of words. In the context of audio archives, the audio signal often
contains more than just speech. These other sounds may be intentionally
recorded, such as background music or noise added to set the mood, or samples
of sounds such as animal vocalizations.
Speech is generally produced
with the purpose of being understood by a native speaker of the same language,
who usually shares some set of common values or experience with the speaker.
The choice of lexical items and speaking style depend on the given talker and
the intended audience. There are significant differences of an acoustic nature
due to anatomical differences across speakers, as well as social and
dialectal conventions. These factors complicate speech understanding for
humans and machines. Transcribing and annotating audio data are necessary to
provide access to its content, and large vocabulary continuous speech
recognition is a key technology for automatic processing. Such data are
challenging to process because they form a continuous audio stream
composed of segments with various acoustic and linguistic characteristics.
Processing such inhomogeneous data thus requires appropriate modeling at the
acoustic and linguistic levels. (Since much of the linguistic information is
encoded in the audio channel of video data, once transcribed it can be accessed
using text-based tools. Transcripts will allow users to access data based on
linguistic content.)
Today's most effective speech
recognition approaches are based on a statistical model of the speech signal.
Speech is assumed to be generated by a language model which provides estimates
of the probability of all word strings independently of the observed signal,
and an acoustic model encoding the message in the audio signal. The goal of speech
recognition is to find the most likely word sequence given the observed
acoustic signal.
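In the usual notation for this statistical formulation (the symbols are introduced here for illustration and do not appear elsewhere in this report), let X denote the observed acoustic signal and W a candidate word sequence. The language model supplies P(W), the acoustic model supplies P(X | W), and the recognizer searches for:

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        \;=\; \arg\max_{W} P(X \mid W)\, P(W)
```

The denominator P(X) can be dropped because it does not depend on W, which is why the two models can be estimated separately, one from audio corpora and one from text.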
Transcription system
development requires large annotated training corpora for all languages and
audio data types of interest. Transcription performance is highly dependent
upon the availability of sufficient training materials, the preparation of
which requires substantial human effort. Speaker independence is obtained by
estimating the parameters of the acoustic models on large speech corpora
containing data from a large speaker population. It is common practice to use
gender-dependent acoustic models to account for anatomical differences (on
average, females have a shorter vocal tract) and social ones (the female voice is
often "breathier" because of incomplete closure of the vocal folds).
Other groupings of speakers according to different characteristics such as
dialect or speaking rate have also been investigated to improve performance.
State-of-the-art systems are typically trained on several tens to hundreds of
hours of audio data and several hundred million words of text materials. The
significant advances in speech recognition over the last decade can be
partially attributed to advances in robust feature extraction, acoustic
modeling with effective parameter sharing, unsupervised adaptation to speaker
and environmental condition, efficient dynamic network decoding, and audio
stream partitioning algorithms, as well as to the availability of large audio
and text corpora for model estimation, combined with increased computational
power and storage capacity.
While the same basic
transcription technology has been successfully applied to different languages
and types of speech, specific adaptations are required to optimize performance.
Mismatches in training and test conditions typically result in high error
rates.
Despite significant advances
in speech recognition, at least two fundamental problems remain: speed and
robustness. There is a large gap between machine and human performance (a
factor of 5 to 10, depending upon the transcription task and test conditions). It
is well acknowledged that there are large performance differences for the best
systems (attributed to a variety of factors such as speaking style, speaking
rate and accent). Improvements are needed in the modeling techniques at all
levels: acoustic, lexical and pronunciation, and linguistic (syntactic and
semantic).
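Comparisons of this kind are commonly expressed as word error rate, although the text above does not name the metric, so treating it as the measure here is an assumption. The sketch below counts the substitutions, deletions and insertions needed to turn a recognizer's hypothesis into a reference transcript and divides by the reference length. The function name and whitespace tokenization are illustrative choices only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this measure, a system with a 20 percent error rate on a task where human transcribers disagree on 2 to 4 percent of words would sit in the factor of 5 to 10 range described above; those figures are illustrative, not measurements from this report.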
Ongoing research [4] is
addressing issues such as reducing the cost of system development [18], and
improving the genericity, portability [19, 2] and adaptability of the models.
Some techniques of interest are, for example, light and unsupervised training,
faster adaptation techniques, learnable pronunciation lexicons, language model
adaptation, topic detection and labeling, and metadata annotation. Accurate
metadata annotation (topic, speaker, acoustic conditions) can also be used to
adapt generic models to the particular audio data type to be transcribed.
3.2.5. Speaker identification and tracking
Speaker recognition has been
an active research area for many years [32, 7]. Several types of recognition
problems can be distinguished: speaker identification, speaker detection and
tracking, and speaker verification (also called speaker authentication). In speaker
identification the absolute identity of the talker is determined. In contrast,
for speaker verification the question is to determine if the talker is the
person s/he claims to be. Speaker tracking refers to finding audio segments
from the same speaker, even if the identity of the speaker is unknown.
Accurately identifying a
speaker [23, 14] is an unsolved research problem, despite several decades of
research. The problem is quite close to that of speech recognition in that the
speech signal encodes both linguistic information (i.e. the word sequence which
is of interest for speech recognition) and non-linguistic information (the
speaker identity, as well as less well-quantified values such as mood, emotion,
attention level, etc.). The characteristics of a given individual's voice
change over time (short and long periods) and depend on the talker's emotional
and physical state. The identification problem is also highly influenced by the
environmental, recording, and channel conditions. For example, it is very
difficult to determine if a voice is the same in different background
conditions, such as in the presence of background music or noise.
Automatically identifying
speakers and tracking them throughout individual recordings and in recording
collections can reduce the manual effort required to annotate this type of
metadata. Automatic speaker identification will allow digital library users to
access spoken word documents based on who is talking. Some of the recent
speaker tracking research can potentially allow talkers to be located in large
audio corpora using a sample of speech, even if the absolute identity of the
talker is unknown.
Most of today's working
speaker recognition systems [33, 30] make use of the same statistical
approaches as are used in speech recognition, i.e., hidden Markov models or
Gaussian mixture models. Speaker-specific models estimated on speaker-specific
audio data are used to assess whether unknown speech samples are from one of
the known speakers (speaker identification). Much of the research in speaker
recognition has been for security purposes, either controlling access to a
physical location or to restricted information, or in intelligence monitoring.
Recent promising research at the Johns Hopkins Summer Workshop 2002 (SuperSID:
Exploiting High Level Information for High-performance Speaker Recognition)
<http://www.clsp.jhu.edu/ws2002> has addressed using multiple types of
acoustic, supra-linguistic and phonetic attributes to improve speaker
recognition performance.
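The statistical approach to speaker identification described above can be illustrated with a minimal sketch. It models each known speaker with a single diagonal-covariance Gaussian (a one-component stand-in for a Gaussian mixture model) and assigns an unknown sample to the speaker whose model yields the highest log-likelihood. The feature values, speaker names, and function names are hypothetical and are not taken from any of the systems cited here.

```python
import math

def train_speaker_model(frames):
    """Estimate a diagonal-covariance Gaussian from a list of feature vectors."""
    dim = len(frames[0])
    n = len(frames)
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # A small variance floor keeps the log-likelihood finite.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-3)
           for d in range(dim)]
    return mean, var

def log_likelihood(frames, model):
    """Sum of per-frame log densities under the speaker model."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total

def identify_speaker(frames, models):
    """Return the known speaker whose model best explains the frames."""
    return max(models, key=lambda name: log_likelihood(frames, models[name]))

# Toy two-dimensional "acoustic features" (hypothetical values).
models = {
    "speaker_a": train_speaker_model([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]),
    "speaker_b": train_speaker_model([[4.0, 0.5], [4.2, 0.4], [3.9, 0.7]]),
}
unknown = [[1.1, 1.9], [1.0, 2.2]]
best = identify_speaker(unknown, models)
```

Speaker verification could reuse the same scoring by comparing the claimed speaker's log-likelihood against a threshold instead of taking the maximum over all known speakers.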
While today's speaker
recognition technology is not perfect, performance levels are probably adequate
for use in automatic annotation of audio collections and for speaker-based
access in digital libraries.
3.2.6. Information extraction
Information extraction (IE)
is the task of extracting meaningful information from information sources. The
search objective can range from named entities -- such as persons,
organizations, and locations -- to attributes, facts, or events. The difficulty
of information extraction is related to the natural language processing
required to recognize complex concepts, the intrinsic ambiguity of named
entities (e.g., the name "Barcelona" could denote a city or a
football team, depending on the context), and the steady evolution of language
(e.g., new names gradually appear in the media).
Given the aim of accessing
spoken documents, information extraction automatically selects pieces of
content that may prove interesting or useful. Moreover, by maintaining links
between the extracted information and the original documents, it is possible to
provide context for each retrieved concept. Most recent research on information
extraction from spoken documents has been carried out under the IE Entity
Recognition and Automatic Content Extraction (ACE) programs under DARPA and
NIST. Considered tasks are the detection of named entities (names of locations,
organizations, and people), temporal expressions, currency amounts, and
percentages, within BN shows. State-of-the-art performance has been achieved by
both rule-based systems and statistical language modeling approaches. Research
under the ACE program currently focuses on more complex tasks, such as
detecting and tracking entities over time, recognizing mentioned events, and
relations among entities.
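As a rough illustration of the rule-based approach, the sketch below extracts percentages and currency amounts from a transcript with two hand-written patterns and keeps the character offsets of each match, so that every extracted item can be linked back to its place in the original document. The patterns and the sample sentence are hypothetical and far simpler than the rule sets used in evaluated systems.

```python
import re

# Hypothetical hand-written rules; real systems use far richer grammars.
RULES = {
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?\s?(?:%|percent\b)"),
    "MONEY": re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?"),
}

def extract(text):
    """Return (label, matched text, start offset, end offset) in document order."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])

text = "Shares fell 12 percent after the company reported a loss of $3.5 million."
entities = extract(text)
```

Because the offsets are retained, a browsing tool can display each extracted item together with the surrounding transcript text.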
3.2.7. Automatic summarization
The term speech summarization is
commonly applied to techniques that reduce the size of automatically generated
transcripts in a way that resembles summarization technology for text
documents. Its goal can thus be described in much the same way as that of text
summarization: to take a partial or unstructured source text, extract
information content from it, and present the most important content in a
condensed form in a manner sensitive to the needs of the user and task.
Depending on the nature of the content and the user information need, both
summarization of single fragments as well as multi-document summarization can
be helpful browsing tools.
The fact that speech
transcripts may be linguistically incorrect requires techniques for enhancement
of the content. To generate coherent, syntactically well-formed descriptions
that preserve the original meaning, semantically complex operations have to be
developed, e.g., for anaphora resolution. Two types of summarization tasks are
distinguished here: (1) condensation of content to reduce the size of a
transcription according to a target compression ratio (e.g., to produce closed
captions or meeting minutes), which involves both intra-sentential and
text-level processing, and (2) a presentation tool for spoken document
retrieval. Several
obstacles impede transparent presentation of speech retrieval results.
Automatically generated audio transcripts are not easily read, because of
recognition errors and the lack of punctuation, but also because of
disfluencies, repairs, repetitions, etc. Extraction of the relatively important
information can help users to browse more easily through search results.
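Condensation to a target compression ratio can be sketched with a simple extractive method. The example below assumes that the transcript has already been divided into segments, scores each segment by the average frequency of its content words, and keeps the best-scoring segments until roughly the requested fraction of the words remains. The stopword list, scoring function, and sample segments are illustrative assumptions rather than a description of any particular summarization system.

```python
from collections import Counter

# Hypothetical stopword list that also covers common filled pauses.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was",
             "on", "for", "uh", "um"}

def summarize(segments, ratio):
    """Keep the highest-scoring segments within `ratio` of the original word count."""
    tokens = [[w for w in s.lower().split() if w not in STOPWORDS]
              for s in segments]
    freq = Counter(w for seg in tokens for w in seg)
    # Score = average collection frequency of the segment's content words.
    scores = [sum(freq[w] for w in seg) / len(seg) if seg else 0.0
              for seg in tokens]
    budget = ratio * sum(len(s.split()) for s in segments)
    chosen, used = set(), 0
    for i in sorted(range(len(segments)), key=lambda i: -scores[i]):
        length = len(segments[i].split())
        if used + length <= budget:
            chosen.add(i)
            used += length
    # Present the selected segments in their original order.
    return [segments[i] for i in sorted(chosen)]

segments = [
    "the council approved the budget for the new library",
    "uh well um it was a long meeting",
    "the library budget includes funds for audio preservation",
    "thank you all for coming",
]
summary = summarize(segments, ratio=0.6)
```

This sketch selects whole segments only; the intra-sentential processing mentioned above (for example, removing disfluencies inside a segment) would be a separate step.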
Purely audio summaries of
speech can be envisaged, and prototype speech skimming systems have been
developed. An important issue in this
case is the development of accelerated audio playback, which is an interesting
signal-processing task if intelligibility and the speech characteristics (such
as intonation) are to be maintained as much as possible. This area is rather closely related to
speech synthesis.
3.2.8. Prosody
A spoken message contains
more than simply what was said (i.e., the text transcription) and who said it
(i.e., the identity of the speaker). The prosody (timing, intonation and
stress) of the speech signal offers a great deal more information, such as the
emotional state of the speaker, boundaries and "punctuation" in the speech, and
disambiguation of the intended message (e.g., questions have a rising
intonation). The research challenge in this area is to
develop prosodic models that are sensitive to these supra-segmental features.
3.3 Collection Level Browsing and Searching
Collection level browsing and
searching involves several complex tasks including: spoken document retrieval,
interactive speech retrieval, topic detection and tracking, cross-language
information retrieval, and speech enhancement. We take up these topics below.
3.3.1. Spoken document retrieval (SDR)
By using speech recognition
to convert speech into text, detailed textual representations can be generated
for spoken content. These representations are not exact renderings of the
spoken content, but they do allow searching for specific words and phrases and
in general are suited for a variety of audio browsing support tools. Since
speech recognition systems can label recognized words with exact time stamps,
their output can be viewed as metadata by which it becomes possible to lead
users directly to relevant audio fragments (perhaps with links to related
content, e.g., video). By default, SDR recognition technology is speaker-independent
and geared toward continuous speech and large vocabularies. Building an
acoustic and language model requires substantial effort and data. For
general-purpose audio mining tools, acceptable retrieval performance calls for
a maximum word error rate of 0.50 (50%).
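The use of time-stamped recognizer output as metadata can be sketched as a small inverted index that maps each recognized word to the recordings and time offsets at which it occurs. The recording identifiers, words, and times below are hypothetical; the sketch assumes only that the recognizer supplies a start time in seconds for every word.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each recognized word to a list of (recording id, start time in seconds)."""
    index = defaultdict(list)
    for recording_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((recording_id, start))
    return index

def lookup(index, word):
    """Return the playback positions at which the word was recognized."""
    return index.get(word.lower(), [])

# Hypothetical recognizer output: (word, start time) pairs per recording.
transcripts = {
    "news_0412": [("the", 0.0), ("minister", 0.4), ("resigned", 1.1)],
    "lecture_07": [("a", 12.0), ("minister", 12.3), ("spoke", 12.9)],
}
index = build_index(transcripts)
hits = lookup(index, "Minister")
```

Each hit gives a position from which playback can start, which is how a search interface can lead a user directly to the relevant audio fragment.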
Tuning the lexica to specific
domains, collections or periods requires additional effort and workflow
procedures from user organizations. Recognition of unknown words (which are
relatively numerous in compounding languages such as German and Dutch) and of
proper names is problematic. Many audio collections are difficult to search
because the recording conditions (e.g., multiple speakers, bandwidth, background
noise) do not meet minimum requirements. Transparent presentation of retrieval
results is hindered in several ways. Audio transcripts are not easy to read
because of recognition errors and the lack of punctuation. Simply put, retrieval
requires listening. But semantically sound fragment boundaries are not easy to
detect, which complicates retrieval by listening. Fragment and cluster
classification is therefore crucial to SDR.
SDR allows the disclosure of
speech at the fragment level in a way that resembles the most common text
search engines. Other search support techniques, e.g., automatic classification
and clustering, are applicable to automatic transcripts. We distinguish two
approaches: (1) word-spotting and (2) automatic transcript generation in
combination with (advanced) full text retrieval tools. For word-spotting,
acoustic models are built for a small set of words that are matched during
retrieval on query-term models. However, word-spotting is only suited for a
small set of search terms. Its chief advantage is that it requires no off-line
content processing. Automatic transcription requires acoustic models,
(statistical) language models (co-occurrence frequencies) and a recognition
lexicon (for some systems limited to 65k words). Its principal limitations are
the requirement of off-line content processing and the availability of large
corpora. Much remains uncertain about the retrieval performance for speech
content. Commercial audio mining tools are available for English only. Systems
have been compared only within the general news domain. Within the DARPA
context, speech retrieval is considered a solved problem [13]. However, this
holds only under a very academic interpretation of the concept of SDR.
Many open problems remain and
we envision substantial issues calling for additional research. There is little
experience with SDR outside research labs. Content segmentation at a semantic
level is crucial, but poorly developed. Current technologies for recognition
require huge textual training collections and labour-intensive annotation of
audio training corpora. These investments are not straightforward for smaller
languages. Techniques for training that circumvent the annotation task are
under investigation. Evaluation measures specific for SDR recognition
technology are not generally available. (Word error rate is not always the best
predictor of retrieval quality.)
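For reference, word error rate is conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following sketch computes it by dynamic programming; the sample sentences are hypothetical, and the function assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)  # assumes the reference is not empty

wer = word_error_rate("the council approved the library budget",
                      "the council approve the budget")
```

Note that this measure weights every word equally, so an error on a function word counts as much as an error on a name that a searcher is likely to use as a query term.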
3.3.2. Interactive Speech Retrieval
Any interactive search
process involves five stages, as represented in Figure 1. In query formulation,
users interact with the system to craft an expression of the information
need--a query--that the system can use to produce a useful search result.
Queries are typically expressed as either an undifferentiated set of search
terms or as a Boolean expression. In the sorting stage, the system reorders the
documents, seeking to put the most promising recordings ahead of others. In
Boolean systems, this typically equates to placing documents into one of two
sets (relevant, or not). Increasingly common "ranked retrieval"
systems take a different approach, allowing searchers to pose queries with
little or no structure and then peruse a ranked list of potentially interesting
recordings.
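The difference between the two sorting strategies can be sketched as follows. The Boolean function places recordings into one of two sets, while the ranked function orders them by a score, here simply the number of query-term occurrences in the transcript. The recording identifiers, transcripts, and scoring rule are illustrative assumptions; operational systems use more refined term weighting.

```python
from collections import Counter

def boolean_and(query_terms, documents):
    """Return the unordered set of recordings that contain every query term."""
    return {doc_id for doc_id, text in documents.items()
            if all(t in text.lower().split() for t in query_terms)}

def ranked(query_terms, documents):
    """Order recordings by total count of query terms, most promising first."""
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Ties are broken by identifier so the ordering is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical transcripts keyed by recording identifier.
documents = {
    "r1": "budget debate on the library budget",
    "r2": "library opening hours",
    "r3": "weather report",
}
query = ["library", "budget"]
matched = boolean_and(query, documents)
ordering = ranked(query, documents)
```

In this toy collection the Boolean query returns only the recording that contains both terms, whereas the ranked list also offers the partial match in second position.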
Efficient indexing enables
quick searches of large collections, but the effectiveness of interactive
searching ultimately depends on synergy with a sophisticated user. Humans bring
sophisticated pattern recognition, abstraction and inferential skills to the
search process, but the number of documents to which those skills can usefully
be applied is limited. The goal of the selection stage is to allow the user to
rapidly discover the most promising documents from a system-ranked list through
examination of indicative summaries, i.e., summaries designed to support
selection. These summaries are generally quite terse since several must be
displayed simultaneously in the available screen space. Because summaries may
not provide enough information to support a final selection decision, modern
systems also typically provide users with the ability to play segments of
individual recordings. Direct use of a recording may also result from replay within
the retrieval system, or a separate delivery stage may be required (e.g., the
audio might be stored on a compact disk for later replay with high fidelity).
Recorded speech poses both
challenges and opportunities for the interactive retrieval process. The key
challenges are deceptively simple: automatic transcription is imperfect and
listening to recordings can be time consuming. Some important opportunities
include potential use of speaker identification, speaker turn detection, dialog
structure, channel characteristics (e.g., telephone vs. recording studio) and
associated audio (e.g., background sounds) to enhance either the sorting or the
browsing process. Multimedia integration (e.g., with video) also offers some
important opportunities for synergy. For example, query formulation based on
spoken words might be coupled with selection based on key frames extracted from
video.
3.3.3. Topic Detection and Tracking
The Text Retrieval
Conference's (TREC) Spoken Document Retrieval (SDR) track emerged in 1996 from
a tradition of ranked retrieval evaluations, and the design of the track
reflects that heritage. In 1997, a second venue for comparative evaluation of
speech retrieval research was introduced in the United States; it is known as
Topic Detection and Tracking (TDT). The still ongoing TDT evaluations reflect a
broadening of speech processing research to include a strong application focus.
Four test collections (known as TDT-1 through TDT-4) have been developed, with
the most recent having the following distinguishing characteristics: (1)
multi-modal, including both broadcast news audio and newswire text; (2)
multilingual, including English, Chinese, and Arabic; (3) multi-source,
typically including news from more than one source in each combination of modality
and language; (4) event-oriented, with relevance assessment based on whether a
story reports on a relatively narrowly defined event (e.g., a specific airplane
crash).
The most recent TDT
evaluations include comparative evaluations on five tasks: (1) topic
segmentation, in which systems seek to discern the times at which the story
being reported changes; (2) topic detection, an unsupervised learning task in
which systems seek to cluster stories together if they report on the same
event; (3) topic tracking, a semi-supervised learning task in which systems
seek to identify subsequent news stories that report on the same event as one
or more example stories; (4) new event detection, in which systems seek to
identify the first story to report on each event; and (5) story link detection,
in which systems seek to determine whether pairs of stories report on the same
event. The story segmentation task is performed only on broadcast news sources.
All other tasks are multi-modal.
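Story link detection, the last of the tasks listed above, can be sketched as a pairwise similarity test. The example below represents each story as a bag of words, computes the cosine similarity of the two word-count vectors, and declares the stories linked when the similarity reaches a threshold. The threshold of 0.5 and the sample stories are hypothetical and are not drawn from the TDT evaluations.

```python
import math
from collections import Counter

def cosine(story_a, story_b):
    """Cosine similarity between the word-count vectors of two stories."""
    va = Counter(story_a.lower().split())
    vb = Counter(story_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def same_event(story_a, story_b, threshold=0.5):
    """Decide whether two stories report on the same event (assumed threshold)."""
    return cosine(story_a, story_b) >= threshold

s1 = "plane crash near the coast kills twelve"
s2 = "twelve dead in plane crash near coast"
s3 = "parliament votes on new budget"
linked = same_event(s1, s2)
unlinked = same_event(s1, s3)
```

The same pairwise score could serve as a building block for the detection and tracking tasks, for example by clustering stories whose similarity exceeds the threshold.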
3.3.4. Cross-Language Information Retrieval
When searchers lack the
language skills needed to pose their query using terms from the same language
as the spoken content that they seek, some form of support for translation must
be embedded within the search system. There are three cases in which such a
capability might be useful: (1) use by searchers capable of understanding the
spoken language who are not sufficiently fluent to formulate effective queries
in that language (e.g., searchers with a limited active vocabulary in that language);
(2) use by searchers lacking the ability to understand the spoken language, if
their principal goal is to find easily recognized objects associated with the
spoken content (e.g., searching photographs based on spoken captions); and (3)
use by any searcher, if suitable speech-to-speech (or speech-to-text)
translation technology can be provided. At present, speech-to-speech
translation has been demonstrated only in limited domains (e.g., travel
planning), but development of more advanced capabilities is the focus of a
substantial research investment.
Cross-language information
retrieval relies on three commonly used strategies: (1) query translation, (2)
document translation, and (3) interlingual techniques. Query-translation
architectures are well suited to situations where many query languages must be
supported. In interactive applications, query translation also offers the
possibility of exploiting interaction designs that might help the searcher better
understand the system's capabilities and/or help the system better translate
the searcher's intended meaning. "Document translation" is actually
somewhat of a misnomer, since it is the internal representation of the spoken
content that is translated. Document translation architectures are well suited
to cases in which query-time efficiency is an important concern. Document
translation also typically offers a greater range of possibilities for
exploiting linguistic knowledge because spoken content typically contains many
more words than a query, and because queries are often not grammatically well
formed. With interlingual techniques, both the query and the document
representations are transformed into some third representation to facilitate
comparisons. Interlingual techniques may be preferred in cases where many query
languages and many document languages must be accommodated simultaneously, or
in cases where the conforming space is automatically constructed based on
statistical analysis of texts in each language.
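The query-translation strategy can be sketched with a bilingual term list. Each query term is replaced by all of its known translations, untranslatable terms (such as proper names) are passed through unchanged, and the translated query is then matched against transcripts in the document language. The Dutch-to-English lexicon, the transcripts, and the overlap-based scoring are hypothetical simplifications.

```python
# Hypothetical Dutch-to-English lexicon; a real system would use a large resource.
LEXICON = {
    "bibliotheek": ["library"],
    "begroting": ["budget", "estimate"],
}

def translate_query(terms, lexicon):
    """Replace each query term by all of its known translations."""
    translated = []
    for term in terms:
        # Terms missing from the lexicon are kept as they are.
        translated.extend(lexicon.get(term, [term]))
    return translated

def search(terms, documents):
    """Rank recordings by the number of distinct query terms in the transcript."""
    wanted = set(terms)
    scores = {doc_id: len(wanted & set(text.lower().split()))
              for doc_id, text in documents.items()}
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: (-scores[d], d))

documents = {
    "r1": "the library budget was approved",
    "r2": "weather report for amsterdam",
}
english_query = translate_query(["bibliotheek", "begroting", "amsterdam"], LEXICON)
results = search(english_query, documents)
```

Keeping every translation of an ambiguous term, as done here for "begroting", trades some precision for recall; an interactive design could instead ask the searcher to choose among the candidates.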
4.0 PRIVACY AND COPYRIGHT
Collectors of spoken word
audio materials must address a number of complex privacy and copyright issues
relating to the collection, retention and distribution of works. These policy issues cannot be ignored, but
the legal frameworks that define them offer incomplete and sometimes
conflicting guidance. Privacy and copyright are two of the most rapidly
changing aspects of United States and European law. We provide a brief analysis of some key issues.
4.1 Privacy
Privacy is not a precisely
defined concept. The issues of data and
communications privacy have been very widely debated, across both the U.S. and
the E.U. Less commonly discussed
aspects of privacy may be equally relevant to a spoken-word archive. For example, what are the legal implications
of recording the proceedings of a public meeting?
4.1.1. An expectation of privacy
Some issues surrounding audio
and video capture in public are not dissimilar to those debated when
face-recognition technology began to be used to scan for potential criminals in
crowds at airports and other public places.[25] Here, the expectation of
privacy is one of anonymity, but this expectation is not always codified in
law. Several U.S. state courts have
resisted attempts to curtail video and audio recording in public, finding that no
reasonable expectation of privacy can exist in a public place.[34] Use of recording technologies for public
surveillance in the United Kingdom has been common for some years, though the
government in 2000 signaled its intention to regulate such surveillance in
accordance with its 1998 Data Protection Act, passed to harmonize U.K. laws
with the 1995 European Union Data Protection Directive.[16] Other E.U. nations,
including Greece and Sweden, also interpret the E.U. Directive (revised in 1998
and 2000) to specifically pertain to public video surveillance and closely
regulate its use. [25]
Use of wiretapping and other
communications surveillance technology is, in general, well regulated,
requiring that law enforcement obtain court or judicial orders to make use of
such know-how. In reality, permission
to wiretap is easily obtained. In the
United States no state or federal law enforcement agency requests for wiretaps
were denied in 2001, and a total of 1,491 were authorized.[12] The French
government approved 4,175 wiretaps in 2000, and the German government 12,651 in
the same year.[25: pages 178, 185, 388]
Open monitoring and recording of telephone transactions and monitoring of
employees' electronic communications for business purposes are also widespread.[9] The right of employees to opt out of such
data-gathering has been weak or non-existent.
The E.U. is leading the push to expand data privacy regulations to
include employee-monitoring activities, which may have the effect of
discouraging such monitoring beyond the E.U.[15] Most European Union nations
have appointed a central data protection agency, charged with oversight of all
personal data collection and processing, and grant individual citizens a
mechanism for review, change or removal of their own information.
The ability of governments to
compel disclosure of recordings, data, and personal information has increased
since 2001, particularly for electronic communications, and particularly in the
United States. Under the October 2001
USA PATRIOT Act, federal law enforcement agencies are still required to obtain
permission to access records; but the agencies now have the additional
instrument of a Foreign Intelligence Surveillance Act (FISA) order along with
warrants and subpoenas. FISA, passed in
1978, created a secret court that acts on terrorism investigations in national
security situations.[36]
Given the need for oversight
and the ease of access to such information once stored in digital form, some
difficult choices face the custodian. What balance should be struck between
protection of the individual and benefits of large spoken word collections for
worthy public purposes (e.g., scholarly inquiry, political discourse, law
enforcement, artistic expression)? A
good place to turn for examples and guidance may be the regulations governing
research on human subjects.[9] These regulations clearly advocate
informed consent and limited gathering and use of personal data.[42]
Collecting agencies should
determine whether individuals have granted permission for a recording to be
made, implicitly or explicitly. A
signed consent or permission form is the best safeguard, but is unlikely to be
available, particularly for older recordings.
Presenters and announcers, interviewers and interviewees, audience members
and call-in guests, parties in a conversation: all such participants must be
considered when determining whether privacy rights are an issue. A public
figure, such as a politician or a known lecturer, is unlikely to substantiate
an invasion of privacy claim were his speech to be recorded. The more public the citizen, the less likely
he or she is to be able to make a claim.
4.2 Copyright
There are three main issues
concerning spoken word materials:
1. whether these materials
are protected by copyright,
2. whether rights auxiliary
to the copyright must be taken into consideration when planning an archiving
or digitization initiative, and
3. all rights
notwithstanding, whether an argument can be made to proceed with digitization
and delivery.
Copyright legislation has
changed dramatically over the past decade, both in the United States and in
Europe. The rise and demise of Napster
and other online file-swapping services have focused the attention of the technology,
content, legal and consumer advocacy communities on the issue of digital audio
distribution. Despite this attention
and debate, clear rules have failed to emerge, and are unlikely to surface in
the near term, particularly for non-commercial use by libraries and archives.
4.2.1. Extent of copyright protections for spoken word materials
As signatories to the Berne
convention [43], the United States and the European Union member nations have
reciprocity in copyright protection so that materials created or published in
one nation will, for the most part, enjoy the same protections in other
nations. Copyright statutes generally
reserve for the copyright holder the exclusive right to reproduce, display,
distribute copies of, and perform or broadcast the work. The European Union issued a copyright
directive [11] in 2001 that matches many of the provisions
in the United States Digital Millennium Copyright Act (DMCA) of 1998. Both extend encryption protections with
harsh anti-circumvention language.
Principles in the EU Copyright Directive will be implemented through the
laws of member nations. The results of this implementation do not yet offer
clarity or guidance.
In general, sound recordings
have historically been accorded fewer protections than other types of works,
though some recent initiatives have the effect of increasing
their protection.[17] In the United States, sound recordings were not protected
by federal copyright law until 1972, and recordings made before that date are
still not federally protected (though they may be under state copyright laws).
Works fixed after 1977 receive at least 70 years of protection. (In the United
States, in order for works to qualify for protection, they must be fixed in
some physical medium. This requirement
has been clarified to encompass digital publication, as well.) In the United
Kingdom, copyright for sound recordings was established in the 1911 law [24]
and lasts for 50 years, 20 fewer years than granted to creators of print
works. The 1979 revision of the Berne
convention likewise established a 50-year duration of copyright, a term also
endorsed by the European Union in 1993.[5]
Most of the signatory nations
require either some form of fixity (United States) or availability to the
public (Germany and the United Kingdom) in order to claim copyright
protection. However, France's copyright
law is much more generous toward authors, stating: "A work shall be deemed
to have been created, irrespective of any public disclosure, by the mere fact
of realization of the author’s concept, even if incomplete."[28]
There may be layers of
authorship embedded in a single sound recording, and each act of authorship may
be subject to separate protection. For
a musical work, the composition and arrangement might both be protected even if
the physical recording itself is not. A
more relevant example of layered rights may be seen in observing several
separate acts of creation that might be said to be encompassed within a sound
recording of a news broadcast: a typescript, background music, and interviews
with news subjects. It is unclear how
stringently these protections will be pursued and enforced.
The Berne convention singles
out certain types of works and suggests that signatory states may wish to
exempt them from copyright protection.
The article reads in part: