Which algorithms are used to summarize text?
Text summarization is the process of producing a shorter version of one or more texts without losing their meaning or relevant information. In an era of information overload, where texts of all kinds (reports, papers, speech transcriptions, etc.) multiply at an impressive rate, having access to a condensed version of a text is essential. Automatic text summarization (ATS) relies on increasingly sophisticated algorithms, among which the BART algorithm by Facebook AI stands out. But BART is not the only one: let’s have a look at what text summarization means and how it is done.
Automatic text summarization: the main approaches
Text summarization is not an easy task, even for humans, let alone machines. It requires a good understanding of the text to be summarized, as well as the ability to rephrase it so that the same meaning and the essential information are conveyed in a text that is readable, short and representative. This is perhaps the crucial point: identifying the “essential” information and pruning everything that is unnecessary.
Even though automatic text summarization (ATS) is not a new research field (the first works in this area date back to 1958), it has made significant leaps forward only with the substantial progress in natural language processing (NLP), latent semantic analysis (LSA), machine learning (ML) and deep learning (DL), fields to which it is closely linked.
Because the field is complex, there are many different approaches to automatic text summarization.
Basically, we can classify an automatic text summarization method by its input document (size, specificity and form), by the purpose of the summary (its intended audience, its usage and its expansiveness), or by the kind of output it produces (whether an extractive or an abstractive approach is used, and whether the resulting summary is neutral or evaluative). For an overview of possible ATS approaches, see figure 1 (from Aries, A., Zegour, D. E., & Hidouci, W. (2019). Automatic text summarization: What has been done and what has to be done).
We’ll now focus on general (not domain-specific) and generic (not query-oriented) summaries produced with ML or DL methods. This approach is now receiving a lot of attention from the AI departments of big tech companies (Facebook, Google, Microsoft…) because of its flexibility and its impressive results.
Automatic text summarization: extractive vs abstractive approach
The first distinction to make is whether to use an extractive or an abstractive approach.
Put simply, with an extractive approach you try to find the most relevant and informative sentences in the document, then you “extract” them and recombine them into a new, shorter version of the original text, aiming to eliminate redundancies while avoiding loss of information. In this approach you don’t “create” any new lines of text: you only select sentences from the original and recombine them. It is the simplest and most straightforward approach, and also the most widely used.
To decide which sentences to extract, you first construct an intermediate representation of the input text. The two most widely used kinds are topic representation, which transforms the text in order to identify its main topics, and indicator representation, where a set of “indicators” (e.g. sentence length, or the presence of certain “indicator” words) is used to represent the relative importance of chunks of text. The sentences are then scored and ranked according to the representation used, and the highest-scoring sentences are used to build the summary.
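The represent, score and select steps of an extractive summarizer can be sketched in a few lines of Python. This is only a minimal illustration, not a production method: it assumes plain word frequencies as the topic representation, a naive regex-based sentence splitter and a tiny hand-made stop-word list. The names `STOP_WORDS`, `split_sentences`, `tokenize` and `summarize` are hypothetical helpers introduced for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger ones).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
              "that", "for", "on", "are", "as", "with", "this"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)

    # Intermediate (topic) representation: word frequencies over the document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    # Score each sentence by the average document frequency of its words.
    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best ones, restore text order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Averaging the word frequencies, rather than summing them, keeps long sentences from winning on length alone. An indicator representation would replace the `score` function with features such as sentence length or position.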
With an abstractive approach, you try to “understand” the text using advanced natural language processing techniques and to produce a summary that conveys its meaning and most relevant information by “writing” a new text: not a mere collection of sentences extracted from the original, but a rephrasing or paraphrase, just as a human would produce. This approach is, for obvious reasons, more complex: it implies semantic comprehension of the text and of the relationships between concepts, context and topics.
Then there are mixed approaches, in which an abstractive generator takes as input the text produced by an extractive summarizer. This makes the abstraction/generation step more efficient, because it works on a text that has already been purged of most redundant and irrelevant information. Several recent systems are based on this approach, with models such as BART serving as the abstractive generator.
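After you have produced a summary with a text summarization system, you measure its quality with standardized metrics such as ROUGE, which scores the overlap between the generated summary and one or more reference summaries. The underlying language models are also commonly compared on benchmarks such as GLUE, RACE or SQuAD, which test language understanding, reading comprehension and question answering rather than summarization itself.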
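To give an idea of how a metric like ROUGE compares the expected and the obtained summary, here is a simplified sketch of ROUGE-N. It assumes whitespace tokenization, lowercasing and a single reference summary; official implementations add stemming, multiple references and further variants such as ROUGE-L. The function name `rouge_n` is introduced here for illustration.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: n-gram overlap between candidate and reference."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)

    # Clipped overlap: each n-gram counts at most as often as it appears
    # in both texts (Counter intersection keeps the minimum count).
    overlap = sum((cand & ref).values())

    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall answers "how much of the reference did the summary cover?", while precision answers "how much of the summary is supported by the reference?". Reported ROUGE scores usually include both, or their F1 combination.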
Automatic text summarization: the main algorithms
Nowadays, since the field delivers really useful results, many different algorithms are in use and are continuously being improved. The most recent ones all rely on pre-training: you first pre-train a model to create a sort of “black box” that understands natural language, then you train it further (fine-tuning) to adapt it to more specific tasks. All these models also work by understanding the context of a word. They build a language model, which in its classic form predicts the next word in a sentence given all the previous words. The first models read text only from left to right or from right to left, while the most advanced ones (like BART) are bidirectional, i.e. they take the entire sentence into consideration at once. The other big difference between models is the size of the corpus on which they are trained: XLNet, for example, is trained on roughly 30 billion words!
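The idea of predicting the next word from what came before can be illustrated with a toy left-to-right language model. The sketch below is a deliberately crude simplification: it is a bigram model that looks only at the single previous word and is built from raw counts, whereas the models discussed here are large neural networks that condition on the whole context. The names `train_bigram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Usage: "cat" follows "the" twice and "dog" only once in this tiny corpus.
model = train_bigram_model(["the cat sat", "the cat ran", "the dog sat"])
print(predict_next(model, "the"))
```

A bidirectional model differs in that it would also use the words to the right of the position being predicted.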
Let’s have a look at the main ones:
- GPT (Generative Pre-trained Transformer), now at version 3 (GPT-3, introduced in May 2020 and in a testing phase as of July 2020), is not only a summarizer but a general-purpose language generator, so powerful that it is often virtually impossible to distinguish the texts it produces from texts written by humans. GPT is a deep neural network based on the Transformer architecture.
- BERT (Bidirectional Encoder Representations from Transformers) uses a fully bidirectional, unsupervised approach and is pre-trained only on plain text (BooksCorpus and English Wikipedia). BERT’s weights are learned through two tasks: “masked language modeling”, in which some words in a sentence are masked and BERT has to predict the missing words, and next sentence prediction, in which BERT has to predict whether one sentence follows another. BERT uses an attention mechanism that is capable of learning contextual relationships between the words in a text. Models derived from BERT include SpanBERT, RoBERTa, ALBERT, VideoBERT and many others. BART, which at the time of writing is among the state-of-the-art models in this field, also builds on BERT.
- BART (Bidirectional and Auto-Regressive Transformers). BART generalizes both the GPT and the BERT approach, taking the best of the two models. It is trained by corrupting text with a noising function (which adds “noise” to the text rather than just masking words) and then learning a model to reconstruct the original text. It is based on a Transformer-based neural machine translation architecture, with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT). BART maps a corrupted document back to the original document and can be applied to any type of document corruption (token masking, token deletion, text infilling, sentence permutation, document rotation, etc.). At the time of its release, BART reached new state-of-the-art results on abstractive dialogue, text generation, question answering and summarization tasks.
- XLNet pursues the same goal as BERT, but it has been trained on a larger corpus and for longer (2000 GPU days as opposed to 450 GPU days for BERT), and it can take into account the mutual dependencies between the predicted words, whereas BERT assumes that the masked tokens are predicted independently of one another. It outperforms BERT on many tasks.
- UniLM (Unified Language Model Pre-training for Natural Language Understanding and Generation). Unlike BERT, which is used mainly for natural language understanding tasks, UniLM can be configured to aggregate context for different types of language models, and can therefore be used for both natural language understanding and natural language generation tasks.
- PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization). During pre-training, important sentences are removed from a text and the model has to generate them: the target output consists of the missing sentences concatenated together. According to its authors, PEGASUS summaries are good enough that human raters often cannot distinguish them from human-written ones.
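The document corruptions listed for BART are easy to picture on a list of tokens. The sketch below is a simplified illustration and not the actual BART pre-processing code: the function names, the `<mask>` string and the explicit `start`/`length` arguments of `text_infilling` are assumptions made for this example (in BART the span lengths are sampled at random). During pre-training, the model receives the corrupted version and is trained to output the original.

```python
import random

MASK = "<mask>"  # placeholder symbol, chosen here for illustration


def token_masking(tokens, p, rng):
    """Replace each token with MASK with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, p, rng):
    """Drop each token with probability p; positions of the gaps are not marked."""
    return [t for t in tokens if rng.random() >= p]


def text_infilling(tokens, start, length):
    """Replace a span of `length` tokens beginning at `start` with one MASK."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, rng):
    """Return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, rng):
    """Rotate the document so that it starts at a randomly chosen token."""
    if not tokens:
        return []
    k = rng.randrange(len(tokens))
    return tokens[k:] + tokens[:k]


# Usage: corrupt a short document; a seeded generator makes runs repeatable.
rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
print(token_masking(tokens, 0.3, rng))
print(text_infilling(tokens, 1, 3))
print(document_rotation(tokens, rng))
```

Text infilling is the interesting case: because a whole span collapses into a single mask, the model must also work out how many tokens are missing. The PEGASUS objective follows a similar logic at sentence level, removing whole sentences rather than tokens.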
Now, just imagine how such amazing technologies could be applied to your business! Think of all your company’s written documents, quickly summarized and thus made easily available. At PaperLit we use state-of-the-art solutions (like the robust and highly efficient TextRank algorithm, inspired by Google’s renowned PageRank) to help you save time and resources while producing perfectly readable and enjoyable texts. Get in touch for a free audit and see what we can do for you.