EU-US WORKING GROUP ON SPOKEN-WORD AUDIO COLLECTIONS

 

0.0 EXECUTIVE SUMMARY

Our diverse cultures rely increasingly on audio and video resources. We need to chart a steady course to assure the utility of this record. Such a course calls for a plan to preserve these resources and to determine the most effective ways to access their rich content. For example, though our nations possess enormous collections of spoken-word materials, many of these collections will remain inaccessible to the public, whether through decay or for lack of adequate search technologies, unless we act to chart an access and preservation path. Our aim is to forge agreement on these vital topics so that as technology changes, we will be able to rely on our collections to understand and preserve these essential components of our cultural heritage. We also need to focus research support on areas of access and preservation that we believe will yield the greatest benefits across many intersecting disciplines.  This document presents an agenda for collaborative research in this field.

 

Spoken-word collections cover many different domains. These include radio and television broadcasts, governmental proceedings, lectures, oral narratives, meetings and telephone conversations.  Needs vary in collecting, accessing and preserving such data:  

>> Political and economic:  providing access for citizens to governmental proceedings, corporate shareholder meetings, public meetings of political parties and NGOs
>> Cultural:  building and providing access to large, multilingual archives of broadcast material, public performance, oral narrative

>> Educational:  acquisition, preservation, search and access of lectures; use of digital audio and video resources as primary sources for inquiry and explication.


 

We have now reached the point where various enabling technologies have matured sufficiently for the research community to address these needs.

 

0.1 RESEARCH AGENDA

We have structured the research agenda for Spoken-Word Archiving into three main areas: technologies, privacy and copyright, and archiving and access.   The main priorities are to advance the state-of-the-art within each area and to foster integration among them.  It is clear that each area informs the others.

 

0.1.2. Technologies

Audio/signal processing:  Many spoken-word collections of interest, particularly historical collections, have deteriorating audio, due to media degradation or imperfect analog recording technology. Other audio signal processing challenges arise from multiple overlapping speakers (e.g., meetings), low signal quality due to far-field microphones (e.g., in courtrooms), and effects of other sound sources and room acoustics. 

 

Speech and speaker recognition:  Any spoken audio collection raises two immediate questions: (a) What was said? (b) Who said it?  Speech and speaker recognition technologies now work to minimally acceptable levels in controlled domains such as broadcast news.  However, achieving substantial improvements will require new tools to address less controlled collections of spoken audio. Without such tools, the costs in labor to access spoken-word collections will be prohibitive. The creation of these tools will also enable the hearing-impaired public to access and use these materials.

Language identification:  In a multilingual context, automatic language identification is essential.  In particular many collections (e.g., meetings at a European level, some oral narratives) feature speakers switching between different languages.  We can construct adequate baseline systems based on current knowledge, but issues such as within-utterance language change pose interesting and challenging research problems.

 

Information extraction:  The use of a spoken-word collection can be enhanced by the automatic generation of content annotations. Currently it is possible to automatically identify names and numbers and to provide punctuation.  However, it would be advantageous to annotate many other elements--particularly prosodic events--above the word level such as emotion, decision points in meetings, and interaction patterns in a conversation.

 

Collection level browsing and search:  The current state-of-the-art for collection-level browsing and search is based on the application of text retrieval approaches to speech recognizer output.  While this has been relatively successful in some domains (e.g., broadcast news), such approaches have clear limitations, and the development of new search-and-browse approaches, beyond simple text retrieval, is required.

 

Presentation:  The final technological research area that we have identified is presentation.  Currently this involves little more than playing an audio clip and displaying its transcription.  There is an enormous need for research in this area. Several examples come to mind: the construction of audio scenes, presentation of higher-level structure, summarization, and presentation of non-lexical information in speech.

 

0.1.3. Privacy and copyright policy

A number of policy issues arise when discussing spoken-word collections, and it is impossible to treat the technologies in isolation from these issues.

 

Privacy:  Privacy is a major problem, particularly for some spoken-word collections when individuals do not have an expectation that their statements will be archived, although they have spoken in a public forum such as a company board meeting or a political rally.  It may not be possible to offer a comprehensive solution to the privacy problem, particularly for materials where contact with the original collector or subject has long since been lost, but research in this area can accomplish some practical goals.  Future collectors must be armed with reasonable policies to obtain clearances and document applicable rights.

 

Copyright:  The impact of copyright varies by collection, and by national jurisdiction.   Because the legal terrain here is difficult to understand and is undergoing rapid change, a practical approach for cultural institutions to take may be to implement "acceptable risk" policies.   These policies set forth overarching principles of respect for subjects and for the creators' intellectual property rights, but balance them against a need to provide access to important cultural heritage materials. Issues to research include: copyright exemptions (e.g., for educational purposes), classes of works that do not qualify for copyright protection, digitization for preservation and mediated access, and questions collection custodians should pose to determine copyright status and likely consequences of wide availability of digital surrogates.

 

 

 

0.1.4. Archiving and access

Preservation:  Open research issues include standards for preservation and development of sustainable digital repositories. Issues that need to be addressed include: funding; automating digitization and metadata capture; migrating and refreshing/augmenting collections.  Automated capture and preservation of collections clearly underlie the development of this entire area.

 

Content structure:  This area spans metadata, item structure, annotation, discovery and delivery issues, such as network bandwidth.  Metadata vocabularies have been developed, but this area still needs further research, particularly when the archived items have a complex structure.  Additionally, metadata needs to be aggregated and services offered on the aggregated collection.  Models and tools for annotation are a rapidly evolving research area, particularly in the area of distributed and collaborative annotation. 

 

Media storage: Even with the rapidly declining costs of spinning disks, most preservation-quality audio collections will continue to require supplemental digital storage media for the raw audio files, at least for the foreseeable future. Research is needed on various media (CD, DVD) and on best practices for storing, checking, and refreshing them.

 

0.2 CONCLUSION

Though we represent diverse disciplines, we see convergence in the domain of spoken-word collections to address new and challenging issues. In advancing an ambitious research agenda, we envision ancillary benefits across many communities of interest: speech and language technology; software architecture; information science and digital libraries; and a set of diverse user communities.  Progress requires integration across these areas at the international level.  In our judgment, the impact will be substantial. To do any less will risk significant loss to an essential element of our collective heritage.

 

EU:

Steve Renals, CS, University of Sheffield, UK

Franciska de Jong, CS, University of Twente, Netherlands

Marcello Federico, ITC-IRST, Trento, Italy

Lori Lamel, LIMSI-CNRS, France

Fabrizio Sebastiani, IEI-CNR, Pisa, Italy

Richard Wright, BBC Information+Archives, UK

 

US:

Jerry Goldman, Political Science, Northwestern University

Steven Bird, CIS/LDC, University of Pennsylvania

Claire Stewart, Library, Northwestern University

Carl Fleischhauer, Library of Congress

Mark Kornbluh, MATRIX and History, Michigan State

Douglas W. Oard, CLIS/UMIACS, University of Maryland




1.0. INTRODUCTION

 

This report emerges from a joint working group, supported in the US by the NSF Digital Library Initiative and in the EU by the Network of Excellence for Digital Libraries (DELOS).  The aim of the group is to define a common agenda for research in the area of spoken word audio collections, and to define areas for collaborative research between European and US researchers.  The scope of this report is spoken word audio, with no explicit reference to related areas, such as non-speech audio or video.

 

The following sections set out the issues, approaches and policies that converge in the area of spoken-word collections. Section 2 outlines the major issues in the field. Section 3 surveys the current technological state-of-the-art.  Section 4 examines the controversial and rapidly changing policy issues pertaining to privacy and copyright.  Section 5 covers the issues of collecting, archiving and preserving spoken-word content. An appendix addresses content preservation from a digital library perspective.

 

 

2.0 DESCRIPTION OF THE ISSUES

 

2.1 Types of spoken word collections

As recording technology improved dramatically over the course of the twentieth century, the size and diversity of spoken word audio collections expanded geometrically. The cost of recording and preserving sound has moved steadily downward as progressive technological change has facilitated successive generations of recording devices, each with increased storage capacity. While some recordings have been migrated forward to newer technologies, archives today contain recorded collections from every stage of technological development, captured at diverse standards, and on assorted storage media.

 

Existing spoken word collections cover an enormous range, from the earliest recordings of public speeches and broadcasts on wax cylinders and 78 rpm records to oral histories on cassette and reel-to-reel tapes and on through continuous digital recordings of contemporary broadcast news.

 

With declining costs and increasing recording and storage capacity, the breadth of audio collections continues to grow. As a result, the domains covered by spoken word collections are both varied and vast. Nonetheless, it is possible to identify the major types of spoken word collections. These are:

1.      Broadcast news (both radio and TV)

2.      Governmental proceedings (parliamentary debates, court recordings, commissions and committees)

3.      Presentations in the form of speeches, sermons, and lectures (political, religious, educational)

4.      Oral narratives (usually retrospective interviews)

5.      Interactive meetings (business, e.g., shareholder meetings; political, e.g., party conventions)

6.      Recorded telephone conversations

 

2.2 Magnitude of content

National archives and public broadcast archives in Europe and the United States have millions of hours of holdings, much of which features spoken language.   The bulk of this content -- estimated to be 80 percent -- is in analog form.  It will perish within a few decades unless we take steps to preserve it.  An order-of-magnitude estimate for all significant world holdings of audio material in analog formats is 100 million hours. In addition, millions of hours of spoken language materials come into existence in digital form every year. As digital systems replace analog systems, and as recording and storage costs decline, we envision accelerated growth in the creation of spoken word documents and increased demand for efficient archiving and retrieval strategies.

 

2.3 User communities

Not surprisingly, the user communities for spoken word collections are as vast and varied as the collections themselves. Indeed, many collections serve diverse needs for very different user communities. For example, a recorded speech might be used for political purposes, for educational and research purposes, or for linguistic analysis.

 

Spoken word collections have implications across all areas of daily life. In politics, recordings of speeches, debates, broadcasts, etc. are an essential record of governmental proceedings, political candidates and their positions, and citizen activities. In commerce, recordings of board meetings, political debates, broadcast news, etc. can provide access to vital information for economic planning and action. In law, recordings can serve as essential evidence in civil and criminal cases. And in culture and education, spoken word collections are vital for preserving, understanding and teaching about all aspects of social life.

 

2.4 Archives

Existing spoken word archives are equally varied. Government bodies, archives, libraries, museums, universities, churches, political parties, corporations, broadcasters, recording companies, community organizations, and individuals hold spoken word collections.  While some major media organizations such as broadcasters and recording companies have prioritized preservation and developed systems for access, particularly with native digital collections, these are the exceptions.  For most organizations, their spoken word collections comprise a small part of a larger effort to collect written material.  Small and regional media organizations, which often produce large amounts of new spoken word materials daily, do not have the resources to archive their growing collections adequately, if at all.  Thus spoken word collections are often the stepchild of an archive—minimally managed, poorly preserved, and hardly accessible.

 

Most spoken word archives have analog holdings on various media, typically different types of tape. Tape recordings have limited life spans, are hard to maintain and lose quality when copied on to new analog media. Analog recordings can provide only linear access; they must be listened to in consecutive order. Preservation needs and access are in perpetual conflict with analog materials since the media deteriorate with use. Working or shelf copies protect the source material, but increase the archiving costs through additional physical space and collection management.

 

The future of spoken word archives clearly lies with digital technology. Most new recordings, including broadcast materials, are now 'born digital.' Equally important, conversion of older analog content to digital media is essential both for long-term preservation and for increased access to spoken word resources. With digital collections, access and preservation are not in conflict. Digital content can be endlessly replicated with no loss of quality. Most important, digitization breaks the linear tyranny of analog recording. Digital sound can be searched in ways unimaginable with analog recordings. Access can be nearly instantaneous.

 

2.5 Access

For analog recordings, access is only possible through replication of the physical recording. To listen to most spoken word recordings in archival collections today, one must either travel to the archive and listen to a second- or third-generation copy or pay for a copy of the second- or third-generation recording. The cost in time and resources constrains access. Digital recordings, however, can be transmitted without loss over the Internet. The World Wide Web is rapidly becoming a doorway to digital audio collections. Streamed news and cultural programming is available from RAI Radio (Italy) <http://www.radio.rai.it/>, BBCi (United Kingdom) <www.bbc.co.uk>, and National Public Radio (United States) <www.npr.org>.  Other specialized spoken word collections, such as The OYEZ Project <www.oyez.org>, which delivers the archived recordings of US Supreme Court arguments, vividly illustrate the increased access to spoken word collections made possible by advances in information technology. The Web, however, is only the first doorway to digital spoken word collections. Already we are seeing the development of alternative delivery mechanisms, from collection sharing via peer-to-peer networking to downloading of spoken-word materials in popular formats like MP3 or OGG. Client storage devices hold ever-increasing amounts of data at lower and lower price points. PDAs and cell phones, coupled with the development of new services, will soon permit multi-channel delivery of spoken word materials such as talking books.

 

     

2.6 Convergence of Technology, Needs, and Possibilities

We have now reached the point where various enabling technologies have matured sufficiently for large-scale conversion of spoken word resources from analog to digital. Computer equipment is now available enabling low cost and low loss transfer to digital media. The archival community has set standards for such conversion to ensure the integrity of the original collections. Ubiquitous and inexpensive computers offer ready access to digitized sound.

 

These are only the first steps, however, to ensuring long term preservation and enhanced access to spoken word collections. We are at a watershed moment where digital sound archives are possible, their advantages over old analog collections are self-evident, and the technology to work with digital audio has matured to facilitate a new research agenda for spoken word archives.  Furthermore, many of these analog collections are at risk, and we have a narrow window of opportunity to preserve these analog media digitally before they become unusable.  We can now envision with confidence and clarity the research agenda in speech recognition, speech enhancement, speech analysis and retrieval, archival design, metadata development, delivery interface, educational integration, and other areas to fully realize the potential of spoken word archives.

 

The digital revolution has the potential to do for aural resources what the printing press did for written resources. For the first time in human history, the spoken word can now be preserved for the long term and made accessible to those far beyond hearing range and in ways that open up entirely new possibilities for human culture.  Researchers have the tools and capacities to transform access to the spoken word and vastly enrich capacities across all aspects of our societies.

 

3.0 CURRENT TECHNOLOGIES AND NEW LANDSCAPES

In the last decade, speech recognition technology has made impressive advances and has proven to be effective for indexing audiovisual archives. Research projects, such as Informedia at Carnegie Mellon University <http://www.informedia.cs.cmu.edu/>, and recent commercial products, such as Virage <http://www.virage.com/>, have successfully deployed state-of-the-art speech recognition into digital libraries of broadcast news. Automatic indexing of, and content-based access to, audiovisual archives are feasible today thanks to outstanding results from research in speech recognition, language processing, and information retrieval. An important driving force in speech recognition has been the US Defense Advanced Research Projects Agency (DARPA) <www.darpa.mil>. Working through the National Institute of Standards and Technology (NIST) <www.nist.org/speech> and the Linguistic Data Consortium at the University of Pennsylvania (LDC) <www.ldc.upenn.edu>, DARPA coordinates and focuses efforts on relevant research topics, collects language resources, and organizes systematic evaluations. In this section we survey technologies for indexing and accessing spoken documents developed under the DARPA umbrella.

 

3.1. Background Information.

Work in Large Vocabulary Continuous Speech Recognition (LVCSR) started well over a decade ago with tasks mostly oriented toward automatic dictation. In the United States, DARPA set up a common framework by providing both large amounts of training data and evaluation data. The reference task--dictation of Wall Street Journal articles--was used to refine the technology for dealing with a vocabulary of several tens of thousands of words, a challenging task in itself at the time. In the years that followed, research interest moved toward making automatic speech recognition (ASR) systems capable of handling a range of acoustic conditions and speaking styles much wider than what could be found in dictation tasks. The new reference task therefore became the transcription of broadcast news (BN) programs, for which increasing amounts of training data were collected.

 

During the same years, similar evaluations were organized by DARPA for other research problems related to spoken information processing: speaker recognition, information extraction, spoken document retrieval, and topic detection and tracking. Tasks of increasing complexity have been defined over time, for which increasing amounts of training data have been made available to participants. As a consequence, the technology was forced to scale up, and the performance and robustness of the systems improved. Concerning the content of the data, most of the evaluations up to now have been carried out on American English BN. The news domain is indeed very general and makes data collection relatively easy. Research aimed at porting these techniques to other domains and languages has started in several labs.

 

3.2  Audio Indexing

Audio indexing involves several discrete topics: audio partitioning, speech recognition, speaker identification, information extraction, and automatic summarization. We examine each topic below.

 

3.2.1. Audio Partitioning

Audio partitioning is concerned with segmenting an audio stream into acoustically homogeneous chunks and classifying them according to a set of broad acoustic classes. For instance, for the purpose of speech recognition, the audio is usually partitioned by identifying segments containing speech versus other types of content, such as music. In many systems, the classification of speech segments is refined by recognizing, for instance, the signal bandwidth, the gender of the speaker, the identity of the speaker, the level of noise, etc. The difficulty of this task increases with the level of detail required by the segmentation/classification task. For instance, while detecting speech segments in conversational speech is relatively easy, detecting speaker turns can be very difficult when overlapping speech occurs (that is, when two people speak simultaneously). Moreover, segment classification, like any other pattern classification task, becomes difficult when the actual conditions do not match those observed in the training data. Acoustic segmentation and classification are crucial for indexing audio recordings which may contain more than pure speech, e.g. music scores, jingles, etc. Moreover, an accurate segmentation can be exploited to run speech recognizers specifically trained on a given acoustic condition, e.g. bandwidth, gender, speaker.

 

In recent years, several algorithms have been presented which use a statistical decision criterion to detect spectral changes (SCs) within the feature space of the signal. Assuming that a Gaussian process generates the data, SCs are detected within a sliding window through a model selection method. The most likely SC is tested by comparing two hypotheses: (i) the data in the window are generated by the same distribution; (ii) the data in the left and right halves of the window are drawn from two different distributions. The test is performed with a likelihood ratio that also takes into account the different "sizes" of the compared models. Usually, the Bayesian Information Criterion (BIC) is applied to select the best fitting model.
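
The test described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular system: it assumes diagonal-covariance Gaussians for simplicity (published systems often use full covariances), and the names log_variance_sum, delta_bic, most_likely_change, margin and penalty_weight are hypothetical. A positive score means the two-distribution hypothesis is preferred even after the BIC penalty for its extra parameters.

```python
import math

def log_variance_sum(frames):
    """Sum over feature dimensions of the log of the maximum-likelihood variance."""
    n = len(frames)
    total = 0.0
    for j in range(len(frames[0])):
        mean = sum(f[j] for f in frames) / n
        var = sum((f[j] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-8))  # floor avoids log(0) on constant data
    return total

def delta_bic(window, t, penalty_weight=1.0):
    """BIC gain of modelling window[:t] and window[t:] with two Gaussians instead of one."""
    n, dim = len(window), len(window[0])
    left, right = window[:t], window[t:]
    # Log-likelihood ratio of hypothesis (ii) against hypothesis (i).
    gain = 0.5 * (n * log_variance_sum(window)
                  - len(left) * log_variance_sum(left)
                  - len(right) * log_variance_sum(right))
    # The two-Gaussian model has 2 * dim extra parameters (one more mean and
    # one more variance per dimension); BIC charges 0.5 * log(n) for each.
    penalty = penalty_weight * 0.5 * (2 * dim) * math.log(n)
    return gain - penalty

def most_likely_change(window, margin=10, penalty_weight=1.0):
    """Return (index, score) of the best change point, or (None, 0.0) if none is supported."""
    best_t, best_score = None, 0.0
    for t in range(margin, len(window) - margin + 1):
        score = delta_bic(window, t, penalty_weight)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

In a sliding-window detector, most_likely_change would be called on successive windows of feature vectors (each a list of floats), and a returned index would mark a candidate segment boundary. The margin keeps both halves large enough for their variances to be estimated.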

 

In order to classify segments, researchers use Gaussian mixture models, which typically have been trained on supervised data. Finally, clustering of speech segments is carried out by a bottom-up scheme that groups segments that are acoustically close with respect to the BIC or some other defined metric. Audio partitioning has been applied successfully and extensively, mainly on broadcast news transcription. The application to other audio collections poses problems of portability and robustness of the methods, which at the moment are surmounted by tuning the system on some development data. Future work should be devoted to developing robust methods which can cope with greater variability of acoustic conditions.
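
The bottom-up clustering step can be sketched as follows. This is an illustrative sketch only, under the same simplifying assumption of one diagonal-covariance Gaussian per cluster; the function names and the penalty_weight parameter are hypothetical. At each pass the two acoustically closest clusters are merged, and merging stops as soon as the BIC favours keeping every remaining pair separate.

```python
import math

def log_variance_sum(frames):
    """Sum over feature dimensions of the log of the maximum-likelihood variance."""
    n = len(frames)
    total = 0.0
    for j in range(len(frames[0])):
        mean = sum(f[j] for f in frames) / n
        var = sum((f[j] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-8))  # floor avoids log(0)
    return total

def separation_score(a, b, penalty_weight=1.0):
    """Delta-BIC of modelling a and b separately; a negative value favours merging."""
    both = a + b
    n, dim = len(both), len(both[0])
    gain = 0.5 * (n * log_variance_sum(both)
                  - len(a) * log_variance_sum(a)
                  - len(b) * log_variance_sum(b))
    return gain - penalty_weight * 0.5 * (2 * dim) * math.log(n)

def cluster_segments(segments, penalty_weight=1.0):
    """Greedy bottom-up clustering; returns sorted groups of segment indices."""
    # Each cluster is (pooled frames, indices of the segments it contains).
    clusters = [(list(seg), [idx]) for idx, seg in enumerate(segments)]
    while len(clusters) > 1:
        pairs = [(separation_score(clusters[i][0], clusters[j][0], penalty_weight), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        score, i, j = min(pairs)  # the closest pair under the BIC-based distance
        if score >= 0:
            break  # every remaining pair is better modelled separately
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(ids) for _, ids in clusters)
```

Each returned group lists the segments attributed to one acoustic source, for example one speaker. The number of clusters is not fixed in advance; it follows from the stopping rule, which is one reason the penalty weight usually needs tuning on development data.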

 

3.2.3. Speech Enhancement

Speech is often recorded under sub-optimal conditions, but pre-processing techniques can be used to enhance the suitability of the signal for subsequent processing. For access to spoken content, speech enhancement typically seeks to achieve one or more of the following goals: (1) improved accuracy from subsequent automatic processing (e.g., automatic speech recognition), (2) improved intelligibility for a human listener, or (3) a qualitative improvement in the listening experience for a human listener. Human perception is far more robust than present automated approaches to speech recognition, so signal processing that precedes speech recognition is presently the focus of a substantial research effort. The initial focus of that work has been accommodation of environmental factors (e.g., background sounds such as vehicle noise or unrelated transient signals, and the results of microphone placement and room acoustics such as echo or reverberation) and the effect of transmission channels (e.g., speech compression algorithms for cellular telephones). Work with recorded materials has generally focused on improving intelligibility and/or the listening experience, topics often referred to as "audio restoration." Much recorded speech is stored on analog media, including cassette tape, open-reel magnetic tape, phonograph records, and (less commonly) Dictabelt loops, wire recordings and wax cylinders. In addition to environmental factors, analog recordings might be degraded when they are first created (e.g., by the frequency response of the microphone), during duplication (e.g., reduction in the signal-to-noise ratio), during storage (e.g., warping of a phonograph record), as a result of prior use (e.g., splicing to repair a tape break), and during replay (e.g., due to variations in motor speed). 
Audio restoration techniques leverage an understanding of the characteristics of undesirable signal components (e.g., clicks and pops from damaged phonograph records, or "thumps" from Dictabelt loops that were folded for storage) and human perceptual characteristics (e.g., critical bands and auditory masking) to produce a more satisfactory reproduction of the original content.
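
As a deliberately simple illustration of exploiting the characteristics of an undesirable component, the sketch below suppresses isolated clicks by comparing each sample with the median of its neighbours. This is a stand-in chosen for brevity, not one of the restoration methods surveyed here, and it ignores perceptual modelling entirely. The function name remove_clicks and the half_window and threshold values are hypothetical and would need tuning for real recordings.

```python
import statistics

def remove_clicks(samples, half_window=2, threshold=0.5):
    """Replace isolated impulsive samples with the median of their neighbourhood."""
    cleaned = list(samples)
    for i in range(len(samples)):
        lo = max(0, i - half_window)
        hi = min(len(samples), i + half_window + 1)
        local_median = statistics.median(samples[lo:hi])
        # A click is much shorter than the window, so it barely moves the
        # median; a large deviation from the median therefore flags it.
        if abs(samples[i] - local_median) > threshold:
            cleaned[i] = local_median
    return cleaned
```

Only samples flagged as clicks are altered; all other samples pass through unchanged, which limits the damage to the underlying speech.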

 

3.2.4. Speech recognition

Speech recognition is concerned with converting the speech waveform (an acoustic signal) into a sequence of words. In the context of audio archives, the audio signal often contains more than just speech. These other sounds may be intentionally recorded, such as background music or noise added to set the mood, or samples of sounds such as animal vocalizations.

 

Speech is generally produced with the purpose of being understood by a native speaker of the same language, who usually shares some set of common values or experience with the speaker. The choice of lexical items and speaking style depends on the given talker and the intended audience. There are significant differences of an acoustic nature due to anatomical differences across speakers, as well as social and dialectal conventions. These factors complicate speech understanding for humans and machines. Transcribing and annotating audio data are necessary to provide access to its content, and large vocabulary continuous speech recognition is a key technology for automatic processing. Such audio data is challenging to process as it consists of a continuous stream made up of segments with various acoustic and linguistic characteristics. Processing such inhomogeneous data thus requires appropriate modeling at the acoustic and linguistic levels. (Since much of the linguistic information is encoded in the audio channel of video data, once transcribed it can be accessed using text-based tools. Transcripts will allow users to access data based on linguistic content.)

 

Today's most effective speech recognition approaches are based on a statistical model of the speech signal. Speech is assumed to be generated by a language model which provides estimates of the probability of all word strings independently of the observed signal, and an acoustic model encoding the message in the audio signal. The goal of speech recognition is to find the most likely word sequence given the observed acoustic signal.
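
The goal stated above is commonly written as a Bayes decision rule. In the sketch below the symbols are our own notation rather than the report's: X denotes the observed acoustic signal, W a candidate word sequence, P(W) the language model probability, and p(X | W) the acoustic model likelihood.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        \;=\; \arg\max_{W} \; p(X \mid W)\, P(W)
```

The denominator p(X) can be dropped because it does not depend on W, which is why the language model and the acoustic model can be estimated separately and combined only at decoding time.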

 

Transcription system development requires large annotated training corpora for all languages and audio data types of interest. Transcription performance is highly dependent upon the availability of sufficient training materials, the preparation of which requires substantial human effort. Speaker independence is obtained by estimating the parameters of the acoustic models on large speech corpora containing data from a large speaker population. It is common practice to use gender-dependent acoustic models to account for anatomical differences (on average, females have a shorter vocal tract) and social ones (the female voice is often "breathier" because of incomplete closure of the vocal folds). Other groupings of speakers according to different characteristics such as dialect or speaking rate have also been investigated to improve performance. State-of-the-art systems are typically trained on several tens to hundreds of hours of audio data and several hundred million words of text materials. The significant advances in speech recognition over the last decade can be partially attributed to advances in robust feature extraction, acoustic modeling with effective parameter sharing, unsupervised adaptation to speaker and environmental condition, efficient dynamic network decoding, and audio stream partitioning algorithms, as well as to the availability of large audio and text corpora for model estimation, combined with increased computational power and storage capacity.

 

While the same basic transcription technology has been successfully applied to different languages and types of speech, specific adaptations are required to optimize performance. Mismatches in training and test conditions typically result in high error rates.

 

Despite significant advances in speech recognition, at least two fundamental problems remain: speed and robustness. There is a large gap between machine and human performance (a factor of 5 to 10, depending upon the transcription task and test conditions). It is also well acknowledged that even the best systems show large differences in performance, attributed to a variety of factors such as speaking style, speaking rate and accent. Improvements are needed in the modeling techniques at all levels: acoustic, lexical and pronunciation, and linguistic (syntactic and semantic).

 

Ongoing research [4] is addressing issues such as reducing the cost of system development [18], and improving the genericity, portability [19, 2] and adaptability of the models. Some techniques of interest are, for example, light and unsupervised training, faster adaptation techniques, learnable pronunciation lexicons, language model adaptation, topic detection and labeling, and metadata annotation. Accurate metadata annotation (topic, speaker, acoustic conditions) can also be used to adapt generic models to the particular audio data type to be transcribed.

 

 

3.2.5. Speaker identification and tracking

Speaker recognition has been an active research area for many years [32, 7]. Several types of recognition problems can be distinguished: speaker identification, speaker detection and tracking, and speaker verification (also called speaker authentication). In speaker identification the absolute identity of the talker is determined. In contrast, for speaker verification the question is to determine if the talker is the person s/he claims to be. Speaker tracking refers to finding audio segments from the same speaker, even if the identity of the speaker is unknown.

 

Accurately identifying a speaker [23, 14] is an unsolved research problem, despite several decades of research. The problem is quite close to that of speech recognition in that the speech signal encodes both linguistic information (i.e. the word sequence which is of interest for speech recognition) and non-linguistic information (the speaker identity, as well as less well-quantified values such as mood, emotion, attention level, etc.). The characteristics of a given individual's voice change over time (short and long periods) and depend on the talker's emotional and physical state. The identification problem is also highly influenced by the environmental, recording, and channel conditions. For example, it is very difficult to determine if a voice is the same in different background conditions, such as in the presence of background music or noise.

 

Automatically identifying speakers and tracking them throughout individual recordings and in recording collections can reduce the manual effort required to annotate this type of metadata. Automatic speaker identification will allow digital library users to access spoken word documents based on who is talking. Some of the recent speaker tracking research can potentially allow talkers to be located in large audio corpora using a sample of speech, even if the absolute identity of the talker is unknown.

 

Most of today's working speaker recognition systems [33, 30] make use of the same statistical approaches as are used in speech recognition, i.e., hidden Markov models or Gaussian mixture models. Speaker specific models estimated on speaker-specific audio data are used to assess whether unknown speech samples are from one of the known speakers (speaker identification). Much of the research in speaker recognition has been for security purposes, either controlling access to a physical location or to restricted information, or in intelligence monitoring. Recent promising research at the Johns Hopkins Summer Workshop 2002 (SuperSID: Exploiting High Level Information for High-performance Speaker Recognition) <http://www.clsp.jhu.edi/ws2002> has addressed using multiple types of acoustic, supra-linguistic and phonetic attributes to improve speaker recognition performance.
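The identification step described above can be sketched as follows. This is a minimal illustration only: it fits a single diagonal-covariance Gaussian per known speaker (a one-component stand-in for the Gaussian mixture models mentioned in the text) and assigns an unknown sample to the speaker whose model gives the highest average log-likelihood. The feature vectors, speaker labels and function names are hypothetical and are not taken from any of the cited systems.

```python
import math

def fit_gaussian(frames, var_floor=1e-3):
    """Estimate per-dimension mean and variance from one speaker's feature frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, var_floor)
        for d in range(dims)
    ]
    return means, variances

def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood under a diagonal Gaussian model."""
    means, variances = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def identify_speaker(frames, models):
    """Closed-set identification: pick the best-scoring known speaker."""
    return max(models, key=lambda name: avg_log_likelihood(frames, models[name]))

# Toy two-dimensional "acoustic features" (assumed, for illustration only).
training = {
    "speaker_a": [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]],
    "speaker_b": [[4.0, 0.5], [4.2, 0.4], [3.9, 0.6]],
}
models = {name: fit_gaussian(frames) for name, frames in training.items()}
unknown = [[1.1, 1.9], [1.0, 2.0]]
best = identify_speaker(unknown, models)  # "speaker_a"
```

A working system would use many mixture components, real spectral features, and a rejection threshold for talkers outside the known set; none of that is attempted here.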

 

While today's speaker recognition technology is not perfect, performance levels are probably adequate for use in automatic annotation of audio collections and for speaker-based access in digital libraries.

 

3.2.6. Information extraction

Information extraction (IE) is the task of extracting meaningful information from information sources. The search objective can range from named entities -- such as persons, organizations, and locations -- to attributes, facts, or events. The difficulty of information extraction is related to the natural language processing required to recognize complex concepts, the intrinsic ambiguity of named entities (e.g., the name "Barcelona" could denote a city or a football team, depending on the context), and the steady evolution of language (e.g., new names gradually appear in the media).

 

Given the aim of accessing spoken documents, information extraction automatically selects pieces of content that may prove interesting or useful. Moreover, by maintaining links between the extracted information and the original documents, it is possible to provide context for each retrieved concept. Most recent research on information extraction from spoken documents has been carried out under the IE Entity Recognition and Automatic Content Extraction (ACE) programs under DARPA and NIST. Considered tasks are the detection of named entities (names of locations, organizations, and people), temporal expressions, currency amounts, and percentages, within BN shows. State-of-the-art performance has been achieved both by rule-based systems and by statistical language modeling approaches. Research under the ACE program currently focuses on more complex tasks, such as detecting and tracking entities over time, recognizing mentioned events, and relations among entities.
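As a rough sketch of the rule-based approach, the fragment below detects percentages and currency amounts in a transcript and keeps, for every mention, a link back to the source document and character offset so that context can be shown later. The patterns, the document identifier and the sample sentence are assumptions made for illustration; they do not reproduce the rules of any evaluated system.

```python
import re

# Hypothetical patterns for two of the entity types named in the text.
PATTERNS = {
    "percentage": re.compile(r"\b\d+(?:\.\d+)?\s?(?:%|percent\b)"),
    "currency": re.compile(r"\b\d[\d,]*(?:\.\d+)?\s(?:dollars|euros|pounds)\b"),
}

def extract_mentions(doc_id, text):
    """Return (label, matched text, document id, character offset) tuples."""
    mentions = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append((label, match.group(0), doc_id, match.start()))
    # Sort by position so mentions appear in document order.
    return sorted(mentions, key=lambda m: m[3])

transcript = "the bank raised rates by 2.5 percent and paid 300 euros in fees"
mentions = extract_mentions("bn_show_01", transcript)
```

Because each mention carries its offset, a browsing tool can jump from the extracted item back to the surrounding passage of the original transcript.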

 

3.2.7. Automatic summarization

Speech summarization is commonly applied to techniques that reduce the size of automatically generated transcripts in a way that resembles summarization technology for text documents. Its goal can thus be described in a similar way as text summarization: to take a partial or unstructured source text, extract information content from it, and present the most important content in a condensed form in a manner sensitive to the needs of the user and task. Depending on the nature of the content and the user information need, both summarization of single fragments as well as multi-document summarization can be helpful browsing tools.

 

The fact that speech transcripts may be linguistically incorrect requires techniques for enhancement of the content. To generate coherent, syntactically well-formed descriptions that preserve the original meaning, semantically complex operations have to be developed, e.g., for anaphora resolution. Two types of summarization tasks are distinguished here: (1) condensation of content to reduce the size of a transcription according to a target compression ratio, e.g., to produce closed captions, meeting minutes, etc., which involves intra-sentential as well as text processing, and (2) a presentation tool for spoken document retrieval. Several obstacles impede transparent presentation of speech retrieval results. Automatically generated audio transcripts are not easily read, because of recognition errors and the lack of punctuation, but also because of disfluencies, repairs, repetitions, etc. Extraction of the relatively important information can help users to browse more easily through search results.
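Condensation to a target compression ratio can be illustrated with a very simple extractive heuristic: score each transcript segment by the average collection frequency of its content words, then keep the best segments that fit within the word budget, in their original order. The stop list, the scoring rule and the sample segments are assumptions chosen for brevity; the sketch performs no anaphora resolution or disfluency repair.

```python
from collections import Counter

# Assumed stop list mixing function words and common disfluencies.
FILLERS = {"uh", "um", "the", "a", "of", "and", "to", "in", "is", "we"}

def summarize(segments, ratio):
    """Keep the highest-scoring segments within ratio * total words."""
    tokenized = [seg.lower().split() for seg in segments]
    freq = Counter(w for words in tokenized for w in words if w not in FILLERS)
    budget = ratio * sum(len(words) for words in tokenized)

    def score(words):
        content = [w for w in words if w not in FILLERS]
        return sum(freq[w] for w in content) / len(content) if content else 0.0

    ranked = sorted(range(len(segments)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        if used + len(tokenized[i]) <= budget:
            chosen.append(i)
            used += len(tokenized[i])
    # Present the selected segments in their original order.
    return [segments[i] for i in sorted(chosen)]

segments = [
    "uh the budget vote is delayed",
    "um we will uh see",
    "the budget vote moves to friday",
]
summary = summarize(segments, ratio=0.4)
```

With a ratio of 0.4 the 17-word input leaves room for a single segment, and the heuristic keeps the first one.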

 

Purely audio summaries of speech can be envisaged, and prototype speech skimming systems have been developed.  An important issue in this case is the development of accelerated audio playback, which is an interesting signal-processing task if intelligibility and the speech characteristics (such as intonation) are to be maintained as much as possible.  This area is rather closely related to speech synthesis.

 

3.2.8. Prosody

A spoken message contains more than simply what was said (i.e., the text transcription) and who said it (i.e., the identity of the speaker).  The prosody (timing, intonation and stress) of the speech signal offers a great deal more information such as the emotional state of the speaker, boundaries and "punctuation" in the speech and disambiguation of the intended message (e.g., questions have a rising intonation).  The research challenge in this area is to develop prosodic models that are sensitive to these supra-segmental features.

 

3.3 Collection Level Browsing and Searching

Collection level browsing and searching involves several complex tasks including: spoken document retrieval, interactive speech retrieval, topic detection and tracking, cross-language information retrieval, and speech enhancement. We take up these topics below.

 

3.3.1. Spoken document retrieval (SDR)

By using speech recognition to convert speech into text, detailed textual representations can be generated for spoken content. These representations are not exact renderings of the spoken content, but they do allow searching for specific words and phrases and in general are suited for a variety of audio browsing support tools. Since speech recognition systems can label recognized words with exact time stamps, their output can be viewed as metadata by which it becomes possible to lead users directly to relevant audio fragments (perhaps with links to related content, e.g., video). By default, SDR recognition technology is speaker-independent and geared toward continuous speech and large vocabularies. Building an acoustic and language model requires substantial effort and data. For general-purpose audio mining tools, acceptable retrieval performance calls for a word error rate no higher than .50.
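The use of time-stamped recognizer output as search metadata can be sketched as follows. The fragment assumes that each recording is available as a list of (word, start time in seconds) pairs; that data layout, the recording identifiers and the function names are hypothetical.

```python
from collections import defaultdict

def build_index(recordings):
    """Map each recognized word to (recording id, start time) pairs."""
    index = defaultdict(list)
    for rec_id, words in recordings.items():
        for word, start in words:
            index[word.lower()].append((rec_id, start))
    return index

def find_phrase(recordings, phrase):
    """Return (recording id, start time) for each occurrence of the phrase."""
    terms = phrase.lower().split()
    hits = []
    for rec_id, words in recordings.items():
        tokens = [w.lower() for w, _ in words]
        for i in range(len(tokens) - len(terms) + 1):
            if tokens[i:i + len(terms)] == terms:
                hits.append((rec_id, words[i][1]))
    return hits

# Assumed recognizer output: (word, start time in seconds).
recordings = {
    "tape_01": [("the", 0.0), ("water", 0.4), ("board", 0.9), ("met", 1.3)],
    "tape_02": [("board", 12.5), ("meeting", 13.0)],
}
index = build_index(recordings)
board_hits = index["board"]
phrase_hits = find_phrase(recordings, "water board")
```

A player can then start replay at the returned offset, which is what allows the user to be led directly to the relevant audio fragment.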

 

Tuning the lexica to specific domains, collections or periods requires additional effort and work flow procedures from user organizations. Recognition of unknown words (whose numbers are relatively high for compounding languages like German and Dutch) and proper names is problematic. Many audio collections are difficult to search because the recording conditions (e.g., multiple speakers, bandwidth, background noise) do not meet minimum requirements. Transparent presentation of retrieval results is hindered in several ways. It is not easy to read audio transcripts because of recognition errors and the lack of punctuation. Simply put, retrieval requires listening. But semantically sound fragment boundaries are not easy to detect, complicating listening retrieval. Therefore fragment and cluster classification is crucial to SDR.

 

SDR allows the disclosure of speech at the fragment level in a way that resembles the most common text search engines. Other search support techniques, e.g., automatic classification and clustering, are applicable on automatic transcripts. We distinguish two approaches: (1) word-spotting and (2) automatic transcript generation in combination with (advanced) full text retrieval tools. For word-spotting, acoustic models are built for a small set of words that are matched during retrieval on query-term models. However, word-spotting is only suited for a small set of search terms. Its chief advantage is that it requires no off-line content processing. Automatic transcription requires acoustic models, (statistical) language models (co-occurrence frequencies) and a recognition lexicon (for some systems limited to 65k words). Its principal limitations are the requirement of off-line content processing and the availability of large corpora. A lot is uncertain about the retrieval performance for speech content. Commercial audio mining tools are available for English only. Systems have been compared only within the general news domain. Within DARPA context, speech retrieval is considered a solved problem [13]. However this is only valid in a very academic interpretation of the concept of SDR.

 

Many open problems remain and we envision substantial issues calling for additional research. There is little experience with SDR outside research labs. Content segmentation at a semantic level is crucial, but poorly developed. Current technologies for recognition require huge textual training collections and labour-intensive annotation of audio training corpora. These investments are not straightforward for smaller languages. Techniques for training that circumvent the annotation task are under investigation. Evaluation measures specific for SDR recognition technology are not generally available. (Word error rate is not always the best predictor of retrieval quality.)
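The remark that word error rate does not always predict retrieval quality can be made concrete with a small sketch. It computes word error rate as the word-level edit distance divided by the reference length, alongside a simple recall measure over query terms. The recall measure, the sample sentences and the query terms are illustrative assumptions, not an established SDR metric.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def term_recall(reference, hypothesis, query_terms):
    """Fraction of query terms spoken in the reference that survive recognition."""
    spoken = set(reference.split()) & set(query_terms)
    found = spoken & set(hypothesis.split())
    return len(found) / len(spoken) if spoken else 1.0

reference = "the council approved the harbour budget"
hypothesis = "a council approve a harbour budget"
wer = word_error_rate(reference, hypothesis)                         # 0.5
recall = term_recall(reference, hypothesis, ["harbour", "budget"])   # 1.0
```

Here half of the reference words are misrecognized, yet both query terms are still found, because the errors fall on words a searcher is unlikely to use.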

 

3.3.2. Interactive Speech Retrieval

Any interactive search process involves five stages, as represented in Figure 1. In query formulation, users interact with the system to craft an expression of the information need--a query--that the system can use to produce a useful search result. Queries are typically expressed as either an undifferentiated set of search terms or as a Boolean expression. In the sorting stage, the system reorders the documents, seeking to put the most promising recordings ahead of others. In Boolean systems, this typically equates to placing documents into one of two sets (relevant, or not). Increasingly common "ranked retrieval" systems take a different approach, allowing searchers to pose queries with little or no structure and then peruse a ranked list of potentially interesting recordings.
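The contrast between Boolean and ranked retrieval in the sorting stage can be sketched as follows. The Boolean function places each recording in or out of the result set; the ranked function orders recordings by a simple term-frequency and inverse-document-frequency score. The weighting formula, the recording identifiers and the sample transcripts are assumptions made for illustration.

```python
import math
from collections import Counter

def boolean_and(docs, terms):
    """Boolean retrieval: a recording matches only if it contains every term."""
    return {doc_id for doc_id, text in docs.items()
            if all(t in text.split() for t in terms)}

def ranked(docs, terms):
    """Ranked retrieval: order recordings by a simple TF-IDF style score."""
    n = len(docs)
    counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    scores = {}
    for doc_id, tf in counts.items():
        score = 0.0
        for t in terms:
            df = sum(1 for c in counts.values() if t in c)
            if df:
                score += tf[t] * math.log(1 + n / df)
        scores[doc_id] = score
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: scores[d], reverse=True)

docs = {
    "r1": "flood warning issued for the river valley",
    "r2": "river flood damage and river levels rising",
    "r3": "election results announced",
}
query = ["river", "flood"]
matching_set = boolean_and(docs, query)   # {"r1", "r2"}, unordered
ranking = ranked(docs, query)             # ["r2", "r1"]
```

The Boolean result gives no guidance on which recording to examine first, whereas the ranked list puts the recording with more query-term occurrences at the top.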

 

Efficient indexing enables quick searches of large collections, but the effectiveness of interactive searching ultimately depends on synergy with a sophisticated user. Humans bring sophisticated pattern recognition, abstraction and inferential skills to the search process, but the number of documents to which those skills can usefully be applied is limited. The goal of the selection stage is to allow the user to rapidly discover the most promising documents from a system-ranked list through examination of indicative summaries, i.e., summaries designed to support selection. These summaries are generally quite terse since several must be displayed simultaneously in the available screen space. Because summaries may not provide enough information to support a final selection decision, modern systems also typically provide users with the ability to play segments of individual recordings. Direct use of a recording may also result from replay within the retrieval system, or a separate delivery stage may be required (e.g., the audio might be stored on a compact disk for later replay with high fidelity).

 

Recorded speech poses both challenges and opportunities for the interactive retrieval process. The key challenges are deceptively simple: automatic transcription is imperfect and listening to recordings can be time consuming. Some important opportunities include potential use of speaker identification, speaker turn detection, dialog structure, channel characteristics (e.g., telephone vs. recording studio) and associated audio (e.g., background sounds) to enhance either the sorting or the browsing process. Multimedia integration (e.g., with video) also offers some important opportunities for synergy. For example, query formulation based on spoken words might be coupled with selection based on key frames extracted from video.

 

3.3.3. Topic Detection and Tracking

The Text Retrieval Conference's (TREC) Spoken Document Retrieval (SDR) track emerged in 1996 from a tradition of ranked retrieval evaluations, and the design of the track reflects that heritage. In 1997, a second venue for comparative evaluation of speech retrieval research was introduced in the United States; it is known as Topic Detection and Tracking (TDT). The still ongoing TDT evaluations reflect a broadening of speech processing research to include a strong application focus. Four test collections (known as TDT-1 through TDT-4) have been developed, with the most recent having the following distinguishing characteristics: (1) multi-modal, including both broadcast news audio and newswire text; (2) multilingual, including English, Chinese, and Arabic; (3) multi-source, typically including news from more than one source in each combination of modality and language; (4) event-oriented, with relevance assessment based on whether a story reports on a relatively narrowly defined event (e.g., a specific airplane crash).

 

The most recent TDT evaluations include comparative evaluations on five tasks: (1) story segmentation, in which systems seek to discern the times at which the story being reported changes; (2) topic detection, an unsupervised learning task in which systems seek to cluster stories together if they report on the same event; (3) topic tracking, a semi-supervised learning task in which systems seek to identify subsequent news stories that report on the same event as one or more example stories; (4) new event detection, in which systems seek to identify the first story to report on each event; and (5) story link detection, in which systems seek to determine whether pairs of stories report on the same event. The story segmentation task is performed only on broadcast news sources. All other tasks are multi-modal.
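Two of the tasks listed above, story link detection and new event detection, can be sketched with a bag-of-words cosine similarity. The similarity measure, the threshold of 0.5 and the sample stories are assumptions for illustration and do not describe any system submitted to the TDT evaluations.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two stories."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def same_event(a, b, threshold=0.5):
    """Story link detection: do two stories report on the same event?"""
    return cosine(a, b) >= threshold

def first_stories(stories, threshold=0.5):
    """New event detection: indices of stories unlike every earlier story."""
    seen, firsts = [], []
    for i, story in enumerate(stories):
        if not any(same_event(story, earlier, threshold) for earlier in seen):
            firsts.append(i)
        seen.append(story)
    return firsts

stories = [
    "plane crash near coastal airport kills twelve",
    "twelve dead in plane crash near airport",
    "parliament passes new broadcasting law",
]
linked = same_event(stories[0], stories[1])   # True
new_events = first_stories(stories)           # [0, 2]
```

The same pairwise decision could also drive topic tracking or clustering; only the way the decisions are combined changes between tasks.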

 

3.3.4. Cross-Language Information Retrieval

When searchers lack the language skills needed to pose their query using terms from the same language as the spoken content that they seek, some form of support for translation must be embedded within the search system. There are three cases in which such a capability might be useful: (1) use by searchers capable of understanding the spoken language who are not sufficiently fluent to formulate effective queries in that language (e.g., searchers with a limited active vocabulary in that language); (2) use by searchers lacking the ability to understand the spoken language, if their principal goal is to find easily recognized objects associated with the spoken content (e.g., searching photographs based on spoken captions); and (3) use by any searcher, if suitable speech-to-speech (or speech-to-text) translation technology can be provided. At present, speech-to-speech translation has been demonstrated only in limited domains (e.g., travel planning), but development of more advanced capabilities is the focus of a substantial research investment.

 

Cross-language information retrieval relies on three commonly used strategies: (1) query translation, (2) document translation, and (3) interlingual techniques. Query-translation architectures are well suited to situations where many query languages must be supported. In interactive applications, query translation also offers the possibility of exploiting interaction designs that might help the searcher better understand the system's capabilities and/or help the system better translate the searcher's intended meaning. "Document translation" is actually somewhat of a misnomer, since it is the internal representation of the spoken content that is translated. Document translation architectures are well suited to cases in which query-time efficiency is an important concern. Document translation also typically offers a greater range of possibilities for exploiting linguistic knowledge because spoken content typically contains many more words than a query, and because queries are often not grammatically well formed. With interlingual techniques, both the query and the document representations are transformed into some third representation to facilitate comparisons. Interlingual techniques may be preferred in cases where many query languages and many document languages must be accommodated simultaneously, or in cases where the conforming space is automatically constructed based on statistical analysis of texts in each language.
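The query translation strategy can be sketched with a small bilingual term list: every query term is replaced by all of its known translations, and a document scores one point for each query term that has at least one translation in its transcript. The English-to-Dutch lexicon, the transcripts and the scoring rule are hypothetical and serve only to show the flow of data.

```python
def translate_query(query, lexicon):
    """Replace each query term by its list of translations.

    Terms missing from the lexicon pass through unchanged, which lets
    proper names still match.
    """
    return [lexicon.get(term, [term]) for term in query.lower().split()]

def search(translated_query, docs):
    """Rank documents by the number of query terms with a translation present."""
    scores = {}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        score = sum(1 for alternatives in translated_query
                    if words & set(alternatives))
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Assumed English-to-Dutch term list and Dutch transcripts.
lexicon = {"flood": ["overstroming", "vloed"], "damage": ["schade"]}
docs = {
    "nl_01": "grote overstroming veroorzaakt schade in zeeland",
    "nl_02": "verkiezingen in den haag",
}
translated = translate_query("flood damage", lexicon)
results = search(translated, docs)   # ["nl_01"]
```

Keeping all candidate translations avoids an early commitment to one sense; a document translation architecture would instead translate the transcript representations once, ahead of query time.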

 

 

4.0 PRIVACY AND COPYRIGHT

Collectors of spoken word audio materials must address a number of complex privacy and copyright issues relating to the collection, retention and distribution of works.  These policy issues cannot be ignored, but the legal frameworks that define them offer incomplete and sometimes conflicting guidance. Privacy and copyright are two of the most rapidly changing aspects of United States and European law.  We provide a brief analysis of some key issues.

 

4.1 Privacy

Privacy is not a precisely defined concept.  The issues of data and communications privacy have been very widely debated, across both the U.S. and the E.U.  Less commonly discussed aspects of privacy may be equally relevant to a spoken-word archive.  For example, what are the legal implications of recording the proceedings of a public meeting? 

 

4.1.1. An expectation of privacy

Some issues surrounding audio and video capture in public are not dissimilar to those debated when face-recognition technology began to be used to scan for potential criminals in crowds at airports and other public places.[25] Here, the expectation of privacy is one of anonymity, but this expectation is not always codified in law.  Several U.S. state courts have resisted attempts to curtail video and audio recording in public, finding that no reasonable expectation of privacy can exist in a public place.[34]  Use of recording technologies for public surveillance in the United Kingdom has been common for some years, though the government in 2000 signaled its intention to regulate such surveillance in accordance with its 1998 Data Protection Act, passed to harmonize U.K. laws with the 1995 European Union Data Protection Directive.[16] Other E.U. nations, including Greece and Sweden, also interpret the E.U. Directive (revised in 1998 and 2000) to specifically pertain to public video surveillance and closely regulate its use. [25]

 

Use of wiretapping and other communications surveillance technology is, in general, well regulated, requiring that law enforcement obtain court or judicial orders to make use of such know-how.  In reality, permission to wiretap is easily obtained.  In the United States no state or federal law enforcement agency requests for wiretaps were denied in 2001, and a total of 1,491 were authorized.[12] The French government approved 4,175 wiretaps in 2000, and the German government 12,651 in the same year.[25:  pages 178, 185, 388] Open monitoring and recording of telephone transactions and monitoring of employees' electronic communications for business purposes is also widespread.[9]  The right of employees to opt out of such data-gathering has been weak or non-existent.  The E.U. is leading the push to expand data privacy regulations to include employee-monitoring activities, which may have the effect of discouraging such monitoring beyond the E.U.[15] Most European Union nations have appointed a central data protection agency, charged with oversight of all personal data collection and processing, and grant individual citizens a mechanism for review, change or removal of their own information.

 

The ability of governments to compel disclosure of recordings, data, and personal information has increased since 2001, particularly for electronic communications, and particularly in the United States.  Under the October 2001 USA PATRIOT Act, federal law enforcement agencies are still required to obtain permission to access records; but the agencies now have the additional instrument of a Foreign Intelligence Surveillance Act (FISA) order along with warrants and subpoenas.  FISA, passed in 1978, created a secret court that acts on terrorism investigations in national security situations.[36]

 

Given the need for oversight and the ease of access to such information once stored in digital form, some difficult choices face the custodian. What balance should be struck between protection of the individual and benefits of large spoken word collections for worthy public purposes (e.g., scholarly inquiry, political discourse, law enforcement, artistic expression)?  A good place to turn for examples and guidance may be the regulations governing research on human subjects.[9]  These regulations clearly advocate informed consent and limited gathering and use of personal data.[42]

 

Collecting agencies should determine whether individuals have granted permission for a recording to be made, implicitly or explicitly.  A signed consent or permission form is the best safeguard, but is unlikely to be available, particularly for older recordings.  Presenters and announcers, interviewers and interviewees, audience members and call-in guests, parties in a conversation: all such participants must be considered when determining whether privacy rights are an issue. A public figure, such as a politician or a known lecturer, is unlikely to substantiate an invasion of privacy claim were his speech to be recorded.  The more public the citizen, the less likely he or she is to be able to make a claim.

 

4.2 Copyright

There are three main issues concerning spoken word materials:

 

1. whether these materials are protected by copyright,

2. whether rights auxiliary to the copyright must be taken into consideration when considering an archiving digitization initiative, and

3. all rights notwithstanding, whether an argument can be made to proceed with digitization and delivery.

 

Copyright legislation has changed dramatically over the past decade, both in the United States and in Europe.  The rise and demise of Napster and other online file-swapping services have focused the attention of the technology, content, legal and consumer advocacy communities on the issue of digital audio distribution.  Despite this attention and debate, clear rules have failed to emerge, and are unlikely to surface in the near term, particularly for non-commercial use by libraries and archives.

 

4.2.1. Extent of copyright protections for spoken word materials        

As signatories to the Berne convention [43], the United States and the European Union member nations have reciprocity in copyright protection so that materials created or published in one nation will, for the most part, enjoy the same protections in other nations.  Copyright statutes generally reserve for the copyright holder the exclusive right to reproduce, display, distribute copies of, and perform or broadcast the work.  The European Union issued a copyright directive [11]  in 2001 that matches many of the provisions in the United States Digital Millennium Copyright Act (DMCA) of 1998.  Both extend encryption protections with harsh anti-circumvention language.  Principles in the EU Copyright Directive will be implemented through the laws of member nations. The results of this implementation do not yet offer clarity or guidance.

 

In general, sound recordings have historically been accorded fewer protections than other types of works, though some recent initiatives have the effect of increasing their protection.[17] In the United States, sound recordings were not protected by federal copyright law until 1972, and recordings made before that date are still not federally protected (though they may be under state copyright laws). Works fixed after 1977 receive at least 70 years of protection. (In the United States, in order for works to qualify for protection, they must be fixed in some physical medium.  This requirement has been clarified to encompass digital publication, as well.) In the United Kingdom, copyright for sound recordings was established in the 1911 law [24] and lasts for 50 years, 20 fewer years than granted to creators of print works.  The 1979 revision of the Berne convention likewise established a 50-year duration of copyright, a term also endorsed by the European Union in 1993.[5]

 

Most of the signatory nations require either some form of fixity (United States) or availability to the public (Germany and the United Kingdom) in order to claim copyright protection.  However, France's copyright law is much more generous toward authors, stating: "A work shall be deemed to have been created, irrespective of any public disclosure, by the mere fact of realization of the author’s concept, even if incomplete."[28]

 

There may be layers of authorship embedded in a single sound recording, and each act of authorship may be subject to separate protection.  For a musical work, the composition and arrangement might both be protected even if the physical recording itself is not.  A more relevant example of layered rights may be seen in observing several separate acts of creation that might be said to be encompassed within a sound recording of a news broadcast: a typescript, background music, and interviews with news subjects.  It is unclear how stringently these protections will be pursued and enforced. 

 

The Berne convention singles out certain types of works and suggests that signatory states may wish to exempt them from copyright protection.  The article reads in part: