Spoken language is one of the most effective and common forms of communication. We are constantly exposed to voice apps and audio content, and we often need to access such content quickly; with a written text you can simply scan through it, but with audio that is not possible. It would therefore be extremely useful to automatically summarize audio content, be it dialogues, lessons, or speeches. The last is a particularly interesting application, because speeches are often quite long, contain a lot of information, and contain little slang or other “noise” that could make the task even harder. Nonetheless, automatic speech summarization IS a difficult task, because it involves not only automatic summarization but also automatic speech recognition.
So, first of all, let's see how to summarize a speech, and then we'll review some techniques used to perform this task automatically.
Summarizing a speech: key points
A summary should contain the main points of a text in a concise yet fully informative way. At the same time, it should not contain personal opinions or digressions about the text. This of course also applies to a speech: a summary should state the main ideas and key findings, while cutting away all the minor details that usually enrich the speech but would be useless in a summary. A speech, even more than a written text, may contain personal anecdotes, jokes, and rhetorical devices that should all be eliminated in a summary. While this may be easy for a human, it can be extremely difficult for a machine, since a machine doesn't really “understand” language and therefore doesn't know which fragments of a speech are inessential. Just as with plain text, summarizing a speech automatically runs into many obstacles.
The machine steps in: how to summarize a speech – automatically
There are three main tasks involved in automatically summarizing speeches:
- Recognizing and transcribing speech
- Normalizing the speech transcription by eliminating all the “noise”
- Performing automatic summarization on the normalized text
Let’s examine them in some detail.
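Before doing that, here is a minimal end-to-end sketch of how the three stages could fit together. Everything in it is a toy stand-in (the transcript is hard-coded, normalization is a pair of regexes, and the “summary” simply keeps the first sentences); the real techniques behind each stub are covered in the sections below.

```python
import re

def transcribe(audio_path: str) -> str:
    # Toy stand-in for the ASR step (a real API call is sketched below).
    return "um so today I I want to talk uh talk about speech summarization ..."

def normalize(transcript: str) -> list[str]:
    # Toy normalization: drop common fillers, collapse immediate word
    # repetitions, and split on sentence-ending punctuation.
    cleaned = re.sub(r"\b(um|uh|erm)\b\s*", "", transcript)
    cleaned = re.sub(r"\b(\w+)( \1\b)+", r"\1", cleaned)
    return [s.strip() for s in re.split(r"[.!?]+", cleaned) if s.strip()]

def summarize(sentences: list[str], k: int = 3) -> str:
    # Toy extraction: keep the first k sentences (replaced later by
    # LSA- or TextRank-style scoring).
    return " ".join(sentences[:k])

def summarize_speech(audio_path: str) -> str:
    return summarize(normalize(transcribe(audio_path)))

print(summarize_speech("speech.wav"))
```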
Recognizing and transcribing speech
Spoken document retrieval is one of the main applications of automatic speech recognition (ASR) technology. There is a huge difference between recognizing speech read from a text and recognizing spontaneous speech: transcriptions of the latter often contain errors or irrelevant material such as disfluencies, fillers, repetitions, repairs, and word fragments. In a speech, all the more so if not read from a text, sentences lack the clear boundaries that in a written text help the system recognize the parts of a sentence and weigh their relevance. Furthermore, speech also carries non-verbal, prosodic and/or emotional information that a text transcription cannot capture. This problem cannot be solved unless, instead of summarizing via a text transcription, one produces an audio summary (audio-to-audio summarization instead of speech-to-text). On the other hand, prosodic and emotional information can also be an opportunity, because it gives the system a clue about the most important parts of a speech, something that is vital for a good summary.
Many speech summarization systems now rely on Google Cloud's Speech-to-Text API, which adopts the latest artificial intelligence and deep learning technologies and delivers good-quality results. Before extracting text with the Google Cloud Speech API, though, the original audio must be fragmented, because the API cannot handle overly large audio chunks (synchronous requests, for instance, accept only about one minute of audio).
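Here is a sketch of a real replacement for the `transcribe` stub above, assuming the `google-cloud-speech` and `pydub` packages and valid Google Cloud credentials. The fixed-length chunking is deliberately naive (it can cut words in half; splitting on silences, e.g. with `pydub.silence.split_on_silence`, works better in practice):

```python
from google.cloud import speech   # pip install google-cloud-speech
from pydub import AudioSegment    # pip install pydub (requires ffmpeg)

CHUNK_MS = 55 * 1000  # stay safely under the ~1 minute synchronous limit

def transcribe(audio_path: str) -> str:
    # Normalize to 16 kHz mono 16-bit PCM so the raw bytes match LINEAR16.
    audio = (AudioSegment.from_file(audio_path)
             .set_channels(1)
             .set_frame_rate(16000)
             .set_sample_width(2))

    client = speech.SpeechClient()  # reads GOOGLE_APPLICATION_CREDENTIALS
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    parts = []
    # Naive fixed-length fragmentation; each fragment is sent separately.
    for start in range(0, len(audio), CHUNK_MS):
        chunk = audio[start:start + CHUNK_MS]
        response = client.recognize(
            config=config,
            audio=speech.RecognitionAudio(content=chunk.raw_data),
        )
        parts.extend(r.alternatives[0].transcript for r in response.results)
    return " ".join(parts)
```

For long recordings, the API also offers an asynchronous long-running recognition method that reads the audio from Cloud Storage, which avoids manual chunking altogether.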
Noise elimination, sentence identification and weighting
With the advances in speech recognition technology, the text resulting from the first step is of far better quality than in early attempts, but it still needs further processing. The crucial tasks are sentence segmentation, sentence extraction (with most research relying on Latent Semantic Analysis techniques for this), and sentence compaction.
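A sketch of the first two tasks follows: a couple of regexes for filler removal, naive punctuation-based segmentation, and a simple LSA-style sentence scorer built on scikit-learn. The scoring rule (summing each sentence's absolute weight across the leading latent topics) is just one of several variants used in the literature:

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer  # pip install scikit-learn
from sklearn.decomposition import TruncatedSVD

FILLERS = re.compile(r"\b(um|uh|erm|you know|i mean)\b\s*", re.IGNORECASE)

def segment(transcript: str) -> list[str]:
    # Strip common fillers, then split naively on sentence-ending punctuation.
    # Punctuation-less ASR output would need a trained boundary detector instead.
    text = FILLERS.sub("", transcript)
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def lsa_extract(sentences: list[str], k: int = 3, topics: int = 2) -> list[str]:
    # LSA: build a TF-IDF term-sentence matrix and reduce it with truncated SVD;
    # each sentence is scored by the weight it carries in the latent topics.
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    svd = TruncatedSVD(n_components=min(topics, X.shape[1] - 1))
    scores = np.abs(svd.fit_transform(X)).sum(axis=1)
    top = sorted(np.argsort(scores)[::-1][:k])  # keep the original order
    return [sentences[i] for i in top]
```

A full system would replace the regex segmentation with a trained sentence-boundary model and add the compaction step, shortening the selected sentences rather than keeping them verbatim.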
Automatic speech summarization – text version
To the sentences extracted in the preceding step it is now possible to apply the usual methods for automatic extractive text summarization: researchers are experimenting with several algorithms, including BERT, Luhn, TextRank, LexRank and KLSum. The sentences and words are tokenized and weighted against a model and then used to compose the summary. Of course, to obtain acceptable results you have to train the system: one interesting approach is to score the automatic results against manually written summaries of the same speeches. The results are usually evaluated using the ROUGE set of metrics.
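As an illustration, here is a small TextRank-style extractor (a sentence similarity graph ranked with PageRank, assuming `networkx` and scikit-learn) together with a ROUGE evaluation against a hand-written reference using the `rouge-score` package; the sample sentences and reference are, of course, made up:

```python
import numpy as np
import networkx as nx                    # pip install networkx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer     # pip install rouge-score

def textrank_extract(sentences: list[str], k: int = 2) -> str:
    # Sentence graph weighted by TF-IDF cosine similarity, ranked with PageRank.
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, 0.0)           # no self-loops
    ranks = nx.pagerank(nx.from_numpy_array(sim))
    top = sorted(sorted(ranks, key=ranks.get, reverse=True)[:k])
    return " ".join(sentences[i] for i in top)

sentences = [
    "The report presents new findings on urban air quality.",
    "Fine particulate levels fell by twelve percent over five years.",
    "The speaker thanked the organizers and joked about the weather.",
    "Stricter traffic rules are credited with most of the improvement.",
]
summary = textrank_extract(sentences)

# Score the automatic summary against a human-written reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score("Air quality improved thanks to stricter traffic rules.", summary))
```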
Automatic speech summarization – audio version
When you think of how to summarize a speech you immediately think of a text summary, but as mentioned before, one possibility is to never resort to text at all and stick to audio instead. In this case the system extracts from the original audio the chunks judged most relevant and concatenates them, just as it would do with written sentences. The difficulty is that with text you can sometimes fill in the logical gaps between sentences with additional words (connectors, etc.), while with audio this is much harder; in extreme cases it may be necessary to use speech synthesizers for the purpose. The advantage of this approach is that all the important prosodic and emotional information is preserved, avoiding an information loss.
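A sketch with `pydub`, assuming the relevant segments have already been selected and their timestamps are known (word-level time offsets can be obtained from the ASR step; Google's API, for example, can return them when word time offsets are enabled in the request):

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

def audio_summary(audio_path: str,
                  segments_s: list[tuple[float, float]],
                  out_path: str = "summary.wav") -> None:
    # Concatenate the most relevant chunks in their original order, with a
    # short pause standing in for the textual connectors a writer would add.
    audio = AudioSegment.from_file(audio_path)
    pause = AudioSegment.silent(duration=300)  # 300 ms of silence
    summary = AudioSegment.empty()
    for start, end in sorted(segments_s):
        summary += audio[int(start * 1000):int(end * 1000)] + pause
    summary.export(out_path, format="wav")

# Hypothetical timestamps (in seconds) of the segments selected as relevant.
audio_summary("speech.wav", [(12.4, 19.8), (63.0, 71.5), (140.2, 151.0)])
```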