The internet is overloaded with videos, and while you can quickly scan a text to see whether it contains anything interesting, with a video this is obviously not feasible. Having access to summarized versions of videos would therefore be a life-saving asset. The applications are endless: just think of security footage, where the same content repeats for hours and hours and you need to quickly access the few moments in which something happens. The good news is that thanks to advances in artificial intelligence, machine learning, natural language processing and image processing, automatic video summarizing tools (AVST) are starting to become available, even though still in an immature form. To name a couple of examples, there is Zapier's MakeMySummary, which also works with texts, and SMRZr, which is still in beta. The drawback is that at present many such tools work with English only, but of course progress is made every day. But how do these tools work? How is it possible to automatically summarize a video?
Summarizing a video – from the technical point of view
To summarize a video, as when you automatically summarize a text, you have to split it into smaller components and decide which ones are the most relevant and should appear in the summary (a new, more compact video), making sure that no essential information is lost and that the resulting video is clearly intelligible.
In general, what video summarization systems do is extract image features from video frames and then – usually using neural network technologies – select the most representative frames by analyzing the variations among those visual features. There are two main ways of doing this:
- by extracting a set of static keyframes from the original video (keyframing), or
- by extracting a set of shots, complete with audio and motion (video skimming)
Of course the result of a video skimming technique is more interesting to watch, as it resembles a real video, but a summary obtained by keyframing is easier to produce because it involves fewer complexities (temporal concatenation, maintaining consistent audio, etc.).
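The keyframing idea can be sketched in a few lines. In this toy example, frames are already represented by precomputed feature vectors (a real system would extract color histograms or neural-network embeddings first), and a frame is kept as a keyframe whenever its features differ enough from the last keyframe retained. The feature values and the threshold are illustrative assumptions, not a specific tool's algorithm.

```python
# Minimal keyframing sketch: a frame becomes a keyframe when its
# visual features differ enough from the last keyframe kept.
import math

def euclidean(a, b):
    """Distance between two frame feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_keyframes(frames, threshold=0.5):
    """Greedily keep frames whose features differ from the previously
    kept keyframe by more than `threshold`; return their indices."""
    if not frames:
        return []
    keyframes = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if euclidean(frames[i], frames[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes

# Toy 1-D "features" for a video whose content changes twice.
features = [[0.0], [0.05], [0.1], [0.9], [0.95], [0.2], [0.15]]
print(select_keyframes(features, threshold=0.5))  # -> [0, 3, 5]
```

Only three frames survive: the opening frame and the two points where the visual content changes abruptly, which is exactly the redundancy reduction a keyframe summary aims for.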
In both cases, chunks of video (be they frames or shots) are examined and grouped into sets by similarity (for frames, for example, the presence of the same elements or colors in the image) and the redundant ones are eliminated. The process can also be human-supervised: in this case the training sets are labeled by humans, and the result is more accurate even though it requires more time and effort. Another approach being explored is identifying the main points of a video from its subtitles (which overlaps with automatic text summarization).
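The grouping-by-similarity step can be illustrated with a simple greedy scheme: each chunk is compared against the representatives of the groups found so far, joins a group if it is similar enough, and otherwise starts a new one. Keeping one representative per group eliminates the redundant chunks. The 3-bin color histograms and the distance threshold below are assumptions made purely for illustration.

```python
# Redundancy elimination sketch: group chunks (frame histograms here)
# by similarity and keep one representative per group.

def histogram_distance(h1, h2):
    """L1 distance between two normalized color histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def deduplicate(chunks, max_distance=0.3):
    """Greedy grouping: return the index of one representative
    chunk per similarity group."""
    representatives = []  # indices into `chunks`
    for i, chunk in enumerate(chunks):
        if all(histogram_distance(chunk, chunks[r]) > max_distance
               for r in representatives):
            representatives.append(i)  # new group found
    return representatives

# Toy 3-bin histograms: frames 0-1 and 3-4 are near-duplicates.
hists = [[0.8, 0.1, 0.1], [0.75, 0.15, 0.1],
         [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.12, 0.08, 0.8]]
print(deduplicate(hists))  # -> [0, 2, 3]
```

A production system would use richer features and a proper clustering algorithm, but the principle – measure similarity, collapse near-duplicates – is the same.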
How to summarize a video – who delivers this service
As we have seen, summarizing a video is not an easy task, and from the technical point of view it is really cumbersome: it includes all the pains of automatic speech summarization with the addition of the moving image. To effectively summarize a video, the system must not only produce a sensible collection of video shots but also "understand" what is being said in the video (if there are spoken words in it) and avoid cutting out any essential speech fragments.
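One way to combine the two constraints is to treat shots that overlap essential speech as mandatory and then spend the remaining budget on the visually highest-scoring shots. The sketch below assumes shots already come with a visual importance score and that a speech analysis step has marked some transcript time spans as essential; both are hypothetical inputs for illustration.

```python
# Sketch: combine visual shot scores with speech constraints.
# A shot overlapping an "essential" transcript span must be kept;
# the remaining budget goes to the highest-scoring shots.

def overlaps(shot, span):
    """True if shot (start, end) intersects span (start, end), in seconds."""
    return shot[0] < span[1] and span[0] < shot[1]

def summarize(shots, essential_spans, budget):
    """shots: list of (start, end, score). Return kept indices in order."""
    must_keep = {i for i, s in enumerate(shots)
                 if any(overlaps((s[0], s[1]), sp) for sp in essential_spans)}
    # Fill the rest of the budget with the best remaining shots.
    optional = sorted((i for i in range(len(shots)) if i not in must_keep),
                      key=lambda i: shots[i][2], reverse=True)
    kept = set(must_keep)
    for i in optional:
        if len(kept) >= budget:
            break
        kept.add(i)
    return sorted(kept)

shots = [(0, 5, 0.2), (5, 10, 0.9), (10, 15, 0.1), (15, 20, 0.7)]
spans = [(11, 13)]  # speech here must not be cut out
print(summarize(shots, spans, budget=2))  # -> [1, 2]
```

Note that shot 2 is kept despite its low visual score, precisely because it carries essential speech – the constraint the paragraph above describes.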
Apart from the automatic summarization apps and programs that also perform video summarization – often developed by startups – the big tech companies, as one might expect, are also developing automatic video summarization systems.
Microsoft has developed the “Azure Video Thumbnails Media Processor”, which lets the user create a series of clips or “highlights” from a video, and later the “Azure Media Services: Video Indexer”, which includes video optical character recognition, a face detector and a content moderator. “Amazon Rekognition” uses a different approach: through deep learning it recognizes key elements inside videos and indexes them so that they are searchable, extractable and quickly retrievable. Similar services are offered by the “Google Video Intelligence API”: it recognizes objects, faces and scenes and extracts metadata that can be used to index, organize and search video content, as well as to control and filter it for what’s most relevant.
Of course this is not true video summarization; it is more akin to video indexing (which, from a certain point of view, answers the same need, i.e. quickly accessing the information contained in videos).
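The indexing idea those services share boils down to an inverted index from detected labels to the time spans where they appear. In this sketch the per-segment labels are hard-coded; in practice they would come from a recognition service like the ones above.

```python
# Video indexing sketch: map each detected label to the segments where
# it appears, so content can be retrieved without watching the video.
from collections import defaultdict

def build_index(segments):
    """segments: list of ((start, end), [labels]).
    Returns a mapping label -> list of (start, end) spans."""
    index = defaultdict(list)
    for span, labels in segments:
        for label in labels:
            index[label].append(span)
    return index

# Hypothetical label-detection output for a 45-second clip.
segments = [((0, 12), ["car", "street"]),
            ((12, 30), ["person", "street"]),
            ((30, 45), ["car"])]
index = build_index(segments)
print(index["car"])     # -> [(0, 12), (30, 45)]
print(index["street"])  # -> [(0, 12), (12, 30)]
```

Querying the index jumps straight to the relevant seconds of footage – the "quick access" benefit that, for many use cases, makes indexing a practical stand-in for full summarization.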
True and effective automatic video summarization is still an unsolved problem under intense research, but recent developments are promising and fast-paced, and currently available applications already offer interesting capabilities.