Video, Audio & Speech Analysis
Understanding Audio

In order to analyze an audio file or the audio track of a video, technology must be able to understand the meaning of the interaction, whether live or recorded. This process is affected by a variety of factors: the speaker's language, dialect or accent, as well as background noise or interference. Because of this variability in speech and language, legacy approaches like phoneme matching and word spotting alone are not enough to determine what is truly being said. Autonomy Virage's SoftSound delivers sophisticated audio recognition and analysis technology that processes spoken interactions based on their conceptual content, not just the way they sound.

History of Speech Technology

Speech technology has gone through several phases of innovation, each one building upon the shortcomings of the previous generation. Many remember using speech technology over the phone with an interactive voice response system. These systems were able to recognize a limited number of keywords such as "Yes" and "No" or the number "5." If more than one word was present, these systems needed a pause of silence in between to differentiate the words. Unfortunately, conversational speech does not naturally contain these pauses.

The next evolution in speech technology was phonetic indexing. This was a fast way of finding matches, as it looked only for base phonemes and specific groupings of phonemes. Unfortunately, this approach produced many false positives, since words can appear inside other words, such as "cat" in "catastrophe." It was also sensitive to background noise, the bandwidth of the call and, particularly, accents, as the phonemes can differ considerably between speakers.

[Diagram: the evolution of search, from phonetic, keyword, Boolean and parametric matching to conceptual pattern recognition, applied across structured and unstructured sources including data, voice, video, e-mail, chat and databases.]
Autonomy's Approach to Audio Search

A language model was developed to address the issues presented by phonetic indexing. This technique used a dictionary and a pre-defined language model, delivering a highly accurate recognition rate and the ability to find phrases. However, the language model could not distinguish between homophones (e.g. 'eye' and 'I') or heteronyms (e.g. 'The bass player ate bass'), and it was confined to the pre-set dictionary.
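The dictionary constraint is easy to illustrate. Below is a minimal sketch of dictionary-constrained decoding; the phoneme strings, words and tiny dictionary are invented for the example and are not Autonomy's actual model.

```python
# Toy pronunciation dictionary: phoneme string -> word (illustrative only).
DICTIONARY = {
    "K AE T": "cat",
    "D AO G": "dog",
}

def decode(phoneme_string: str) -> str:
    """A decoder confined to a pre-set dictionary: any phoneme string
    outside the dictionary simply cannot be recognized."""
    return DICTIONARY.get(phoneme_string, "<unknown>")

print(decode("K AE T"))           # 'cat'
print(decode("P AA D K AE S T"))  # '<unknown>' -- 'podcast' is not in the dictionary

# Homophones collide on a single phoneme string, so sound alone cannot choose:
HOMOPHONES = {"AY": ["eye", "I"]}
print(HOMOPHONES["AY"])           # ['eye', 'I'] -- context, not acoustics, must decide
```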

To make the language model more flexible, a self-learning language model was developed that could learn new words. This improved model was highly accurate and could be set up without a dictionary, but it demanded massive computational resources.

Today, the latest generation of speech technology delivers conceptual search. This approach uses advanced mathematics to derive meaning from speech. Conceptual search addresses the shortcomings of previous speech technology models and provides the most accurate way of recognizing and finding speech because it understands what is being said. It can distinguish between homophones and heteronyms, and it can find and group content by concept. It can also find related information based on meaning, and it has lower computational requirements than some earlier generations of speech recognition technology.
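Autonomy does not publish the internals of its conceptual engine, but the core idea of matching on meaning rather than surface form can be shown with a toy vector-space sketch. The bag-of-words vectors, cosine measure and two-document corpus below are illustrative assumptions, not IDOL's actual method.

```python
from collections import Counter
import math

def vector(text: str) -> Counter:
    """Bag-of-words term vector for a text (a toy stand-in for a conceptual model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "fishing": "caught a large bass fishing on the lake",
    "music":   "the bass player tuned his guitar before the concert",
}

# The same surface word 'bass' resolves differently depending on context:
for query in ("bass fishing rod lake", "bass guitar amplifier concert"):
    best = max(docs, key=lambda d: cosine(vector(query), vector(docs[d])))
    print(f"{query!r} -> {best}")
```

The surrounding terms, not the ambiguous word itself, carry the concept, which is why this style of matching can separate the two senses of 'bass' where purely phonetic approaches cannot.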

Phoneme Processing

Phonemes are the smallest discrete sound-parts of language and form the basic components of any word. Phoneme matching attempts to break words down into their constituent phonemes and then match search terms to combinations of phonemes as they occur in the audio stream. While this approach does not require a dictionary, it is limited in its accuracy and unable to make conceptual matches.

Phoneme processing is a commonly used approach to audio recognition, but it is frequently inaccurate and often returns high levels of false positives. Because words are treated simply as combinations of sounds, with no awareness of their meaning in context, the system cannot differentiate between the required data and homophones or phrases that share the same phonemes but bear no conceptual relation to the search terms. For example, the sentence "The computer can recognize speech" contains many of the same basic phoneme components as "The oil spill will wreck a nice beach," while the meaning is entirely different. Phoneme processing also cannot account for multiple expressions of the same concept, so any information that is related to the search term but does not contain the same phonemes will not be returned.
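A minimal sketch of naive phoneme matching makes both failure modes concrete. The phoneme transcriptions below are hand-approximated for illustration; real pronunciations vary with accent, which is precisely why this approach is fragile.

```python
def contains(query: list[str], stream: list[str]) -> bool:
    """True if the query's phoneme sequence occurs contiguously in the stream."""
    n = len(query)
    return any(stream[i:i + n] == query for i in range(len(stream) - n + 1))

# 'cat' fires inside 'catastrophe' (approximate transcriptions):
cat = "K AE T".split()
catastrophe = "K AE T AH S T R AH F IY".split()
print(contains(cat, catastrophe))  # True -- a false positive

# Phonetically close, conceptually unrelated phrases overlap heavily:
a = set("R EH K AH G N AY Z S P IY CH".split())  # 'recognize speech'
b = set("R EH K AH N AY S B IY CH".split())      # 'wreck a nice beach'
print(f"{len(a & b)} of {len(b)} distinct phonemes shared")  # 9 of 10
```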

Word Spotting

As with phoneme matching, word spotting techniques search for words out of context, so they are unable to differentiate between homophones and homonyms. Because the system relies on exact sound matches, it is also unable to account for changes in pronunciation that affect the sound but not the concept behind the spoken words, such as plurals. As with other purely phonetic approaches, word spotting cannot make conceptual associations and will frequently miss related information that is not included in the search terms.
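These blind spots follow directly from exact matching. The sketch below, with an invented transcript and search terms, shows how a word spotter misses both inflected forms and conceptually related vocabulary.

```python
def word_spot(term: str, transcript: str) -> list[int]:
    """Exact-match word spotting: return the word offsets where the term
    occurs. No stemming and no concept matching, by design."""
    words = transcript.lower().split()
    return [i for i, w in enumerate(words) if w == term.lower()]

call = "the customers cancelled their subscriptions after the price increase"

print(word_spot("subscriptions", call))  # [4] -- the exact form is found
print(word_spot("subscription", call))   # []  -- the singular form is missed
print(word_spot("cancellation", call))   # []  -- a related concept, different word
```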

Autonomy offers all of these methods, having originally developed them for use in the UK and US intelligence communities.

Natural Language Processing

Natural Language Processing (NLP) is a form of human-to-computer interaction in which the elements of human language, whether spoken or written, are formalized so that a computer can perform value-adding tasks based on that interaction. Autonomy's approach differs from standard NLP in that it is still able to harness the power of IDOL's conceptual analysis. Autonomy's NLP technology functions independently of linguistic constraints, giving Autonomy's software universal application possibilities anywhere in the world.

"Autonomy excels in natural language processing techniques for understanding queries in multiple languages."
Forrester Research

Understanding Video

Traditionally, multimedia content has been considered an unwieldy resource, requiring considerable man-hours to extract tangible benefits and returns. As a consequence, most rich media assets remain dormant in disparate repositories or lost on local user machines, creating duplication and inefficient storage management. The majority of organizations have failed to utilize the valuable intelligence contained within resources such as recorded meetings and training videos, and frequently fail to reuse video content such as broadcast or marketing promotional material. Using advanced image and audio analysis engines that watch, listen to and read a video signal in real time, Autonomy Virage delves into the video file itself to extract the meaning of the information it contains.

Deep Video Indexing, Analysis and Extraction

Autonomy Virage delves deep within the video file itself rather than relying on human-defined metatags, which are subjective and limited in scope. It then provides a highly detailed, time-encoded and comprehensive range of data which users, or IDOL itself, can search through to find relevant content with pinpoint accuracy. The full range of IDOL's functions can be performed on video content, and video can be automatically cross-referenced with any other form of information.

Deep Video Indexing (DVI) uniquely tackles the challenge of asset identification and management through its use of a cascade application method that layers technology analytics to create a unique fingerprint for all video assets. This fingerprint is a flexible representation of the asset whose characteristics can be extracted or used to determine similarities between assets. Due to this unique, flexible mode of asset representation within the IDOL conceptual index, many of the challenges that traditionally plague the video medium are overcome. Issues such as spherical distortion, artifacts from low resolution transcoding, screen inclusion (i.e. assets within assets), audio distortions and audio mismatching (e.g. translations, misalignment) do not affect the integrity of the representation.

The DVI functionality combines numerous proprietary approaches in an openly configurable lattice, which can be weighted within the DVI fingerprint as separate entities or as a whole unit; a sketch of this weighting follows the list below. These technological approaches include:

Texture trajectory analysis
Advanced Optical Character Recognition (OCR)
Spectrum trajectory analysis
Advanced scene analysis
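As a rough illustration of the lattice weighting described above, the sketch below combines per-analytic feature vectors into a single weighted fingerprint and compares two assets with cosine similarity. The vector dimensions, weights and values are assumptions for the example; Autonomy's actual fingerprint format is proprietary.

```python
import math

def fingerprint(features: dict[str, list[float]], weights: dict[str, float]) -> list[float]:
    """Concatenate each analytic's feature vector, scaled by its lattice weight."""
    out: list[float] = []
    for name, vec in sorted(features.items()):
        w = weights.get(name, 1.0)
        out.extend(w * x for x in vec)
    return out

def similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two fingerprints (the basis for 'find similar')."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative weights for the four analytics named in the list above:
weights = {"texture": 1.0, "ocr": 0.5, "spectrum": 0.8, "scene": 1.2}

original = fingerprint(
    {"texture": [0.9, 0.1], "ocr": [0.3], "spectrum": [0.7, 0.2], "scene": [0.4]}, weights)
transcoded = fingerprint(
    {"texture": [0.85, 0.15], "ocr": [0.28], "spectrum": [0.72, 0.18], "scene": [0.41]}, weights)

# A lossy transcode perturbs the features but leaves the fingerprint close:
print(f"similarity: {similarity(original, transcoded):.3f}")
```

Because the comparison is made on the weighted fingerprint as a whole, moderate distortions in any single analytic (low-resolution transcoding artifacts, audio mismatches, and so on) shift the similarity score only slightly rather than breaking the match outright.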

Autonomy Virage can extract a comprehensive range of data from multimedia resources, including full transcripts of audio streams, on-screen character recognition, keyframes, facial recognition and speaker information, all of which are linked to the original video file, allowing users to locate content with pinpoint accuracy. Utilizing the power of IDOL, Autonomy Virage provides users with an unrivalled range of video analysis tools, such as scene detection and 'find similar' functions, as well as conceptual operations such as automatic hyperlinking of related content, categorization and clustering.

In addition, video content can be automatically cross-referenced with any other form of information such as PowerPoint presentations, Word documents or web pages. In this way, Autonomy Virage ensures that rich media is immediately fully searchable and accessible by any user.
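The phrase "time-encoded" is the key to this searchability. A minimal sketch of what such an index entry might look like is shown below; the schema and field names are invented for illustration and mirror the kinds of data named above rather than Autonomy's actual format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One time-encoded index entry for a video asset (illustrative schema)."""
    start: float          # seconds into the video
    end: float
    transcript: str       # speech-to-text for this span
    on_screen_text: str   # OCR of captions and slides
    speaker: str          # speaker identification
    keyframe: str         # path to the representative frame

index = [
    Segment(0.0, 12.5, "welcome to the quarterly review", "Q3 Results", "CEO", "kf_000.jpg"),
    Segment(12.5, 40.0, "revenue grew in the healthcare division", "Healthcare", "CFO", "kf_001.jpg"),
]

def find(term: str) -> list[tuple[float, float]]:
    """Return the time spans whose transcript or on-screen text mentions the
    term, so playback can jump straight to the relevant moment."""
    t = term.lower()
    return [(s.start, s.end) for s in index
            if t in s.transcript.lower() or t in s.on_screen_text.lower()]

print(find("healthcare"))  # [(12.5, 40.0)]
```

Because each entry carries timestamps alongside the extracted text, a match in a transcript, a slide or a caption resolves to an exact position in the video rather than to the file as a whole.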

