The media industry relies on audience ratings to understand viewership behaviour, and these ratings are defined through measurement periods known as ‘slots’ or ‘time bands’. In the linear programming framework, viewership maps directly to the airtime of different shows. But as the video consumption landscape becomes increasingly non-linear, this framework is proving less reliable.

With the advent of OTT and the proliferation of smart devices, media consumption habits have outpaced the traditional analytical frameworks and metrics used to quantify viewer behaviour. Consumers today are mobile: they watch the content they want on their preferred devices, at their convenience.

Time-shifted viewing is becoming increasingly popular. VOD already accounts for a quarter of total media consumption, and this share has only grown during the COVID period.

Interestingly, this convenience has increased the total time available for content consumption, and competition is no longer at the slot level. These changes in viewership behaviour call for a shift of focus onto content entities, which can give insights into the most significant competitive differentiator today: content itself.

That perspective shift may happen earlier than anticipated, courtesy of Artificial Intelligence.

What are content-entities?

Content entities refer to the people, objects, locations, dialogues, keywords or any other useful concepts that meaningfully describe what constitutes a piece of content.

This applies to any kind of content we consume, whether it be Entertainment, Sports or News.

With developments in AI and cloud technology, we can now process content at scale and generate highly granular, time-coded streams of content entities (metadata) that describe the content structure in the required detail.

But the objective here is not to generate volumes of metadata; it is to use this metadata to create a meaningful taxonomy of content entities.

This collection of entities is easier to understand when grouped into three broad categories:

  • Logical scenes – the temporal boundaries
  • Characters and people – the foreground elements
  • Content production & quality – the background elements

1. Logical scenes – the temporal boundaries 

This involves defining a scene as a temporal entity, which determines the point in time at which other entities are detected. Scene identification means grouping contiguous camera cuts that share a similar narrative.

Camera cuts are changes in the shot angle. AI identifies them by detecting sharp changes in the colour histogram profiles of two adjacent frames. A camera cut doesn’t always mean a change of scene: for example, when the camera switches between two people during a conversation, the scene remains the same.
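As a simplified illustration of this histogram approach, the sketch below flags a camera cut wherever the distance between the normalised colour histograms of adjacent frames spikes. The `threshold` value and the toy solid-colour frames are hypothetical; a production system would tune the threshold on decoded video.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Flatten an RGB frame into a normalised per-channel colour histogram."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()

def detect_camera_cuts(frames, threshold=0.5):
    """Flag a cut wherever the L1 distance between the histograms of
    two adjacent frames exceeds `threshold` (a hypothetical tuning value)."""
    hists = [color_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]

# Toy usage: two "shots" of solid colour, with a cut at frame index 3.
dark = np.zeros((4, 4, 3), dtype=np.uint8)
bright = np.full((4, 4, 3), 250, dtype=np.uint8)
frames = [dark, dark, dark, bright, bright]
print(detect_camera_cuts(frames))  # [3]
```

Real implementations typically compare histograms with a smoothed metric (e.g. correlation or chi-square) to be robust to gradual lighting changes.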

Logical scenes – the temporal boundaries (Image: Sherlock, BBC)

Scene boundaries are detected using a combined CNN-LSTM architecture. Embedding vectors generated by CNNs at the camera-cut level capture information about the colours rendered and the composition of shapes; these embeddings are then passed through bi-directional LSTMs.

How are the scenes identified?

The similarity of two adjoining camera cuts in terms of narrative (audio and visual) provides the context to identify homogeneous groups of camera cuts, and these groups define the logical boundaries of the narrative: the scene boundaries.
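The grouping step can be sketched without the neural machinery: the toy below starts a new scene wherever the cosine similarity between the embeddings of adjacent camera cuts drops below a threshold. The embedding vectors and the `min_similarity` value are hypothetical stand-ins for what a trained CNN-LSTM model would produce.

```python
import numpy as np

def scene_boundaries(cut_embeddings, min_similarity=0.8):
    """Group adjacent camera cuts into scenes by embedding similarity.

    `cut_embeddings` holds one vector per camera cut. A new scene starts
    wherever the cosine similarity of adjacent cut embeddings drops below
    `min_similarity` (a hypothetical threshold)."""
    boundaries = [0]  # the first cut always opens a scene
    for i in range(1, len(cut_embeddings)):
        a, b = cut_embeddings[i - 1], cut_embeddings[i]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if cos < min_similarity:
            boundaries.append(i)
    return boundaries

# Toy usage: three narratively similar cuts, then an abrupt change.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]])
print(scene_boundaries(emb))  # [0, 3]
```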

2. Characters & people – the foreground elements

These content entities are the active foreground elements: the characters or people performing in the content. Detecting them involves identifying the key characters that drive the narrative in a particular scene. In a movie it could be the protagonist; in a game of soccer, the players; in a news debate, the anchor and panellists.

However, defining content entities through meaningful relationships and parameters is more important than merely identifying the people present in a scene. AI builds these definitions by picking up frequent overlaps in screen presence and layering in contextual information.

Following are some examples of meaningful character tracks.

  • Relationships Tracks – Ex. father-son, protagonist-antagonist, lead couple
  • Age & Gender Tracks – Ex. Family with kids, kids with grandparents, teenagers
  • Gameplay Tracks – Ex. Arch-rivals, team-mates, team celebration
  • Studio Panel Tracks – Ex. Presidential nominees, political rivals, a celebrity couple
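As a minimal sketch of the screen-presence-overlap idea behind these tracks, the code below computes how long pairs of characters share the screen from their appearance intervals. The character names, intervals and the `min_overlap` threshold are hypothetical; the contextual layering that turns co-occurrence into a named relationship track is not shown.

```python
def overlap_seconds(a, b):
    """Total seconds two characters share the screen,
    given lists of (start, end) appearance intervals in seconds."""
    return sum(max(0.0, min(e1, e2) - max(s1, s2))
               for s1, e1 in a for s2, e2 in b)

def cooccurrence_pairs(tracks, min_overlap=10.0):
    """Return character pairs whose shared screen time meets `min_overlap`
    seconds (a hypothetical threshold). Frequent overlaps are the raw
    signal from which relationship tracks are later built."""
    names = sorted(tracks)
    return [(x, y)
            for i, x in enumerate(names) for y in names[i + 1:]
            if overlap_seconds(tracks[x], tracks[y]) >= min_overlap]

# Toy usage: two leads share most scenes; a third character appears alone.
tracks = {
    "sherlock": [(0, 60), (120, 180)],
    "watson": [(10, 60), (130, 170)],
    "villain": [(200, 230)],
}
print(cooccurrence_pairs(tracks))  # [('sherlock', 'watson')]
```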

3. Content production & quality – the background elements

Here the content entities constitute the backdrop of the story. They are passive elements in the overall viewing experience, but they affect it nonetheless. This is the canvas on which the foreground elements operate and the narrative unfolds.

  • Camera angles – Ex. Close-Up, Long shot, Goal-line cam, Netcam, Penalty cam
  • Set locale – Ex. In-studio, Indoor, Outdoor, Crowd, Field
  • Objects – Ex. Car, Ball, Trees, Door, Windows, Red-Yellow Cards, Brand Logos, Trophies 
  • Screen graphics – Ex. Split Screen, Scoreboard, Text elements, credits, Transitions
  • Overall ambience – Ex. Music, Silence, High decibel drama, Color palette & lighting

Not all metadata is relevant!

The effort of creating content entities must justify its business utility. As mentioned, not all the metadata generated by the AI may be relevant to the use case, and even the relevant metadata may have pockets of contextual inaccuracy.

Therefore, once the content entities are created, several techniques are used to improve the overall relevance of the metadata, and data quality parameters are defined to track the quality of what is generated. Some of the post-processing techniques are as follows.

  1. In identifying scene boundaries, spurious scene cuts can be introduced by unexpected visual artefacts. To filter these out, it helps to triangulate with sharp dips in the audio profile, which are an excellent indicator of a genuine scene boundary.
  2. When character cluster tracks are created, a character sometimes leaves the screen for just a couple of camera cuts. Applying empirical thresholds to treat the character’s presence as continuous across such short gaps often gives better results.
  3. Content entities like faces and objects detected in a non-central position with a tiny region of interest are relevant neither for analysis nor for editorial use cases, and can therefore be filtered out.
  4. Some content entities, such as camera angles, can be self-curated using the meta-context around other entities. Close-ups and medium shots, for instance, can be curated from the screen share and screen placement of the faces detected in those shots.
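Technique 2 above can be sketched as a simple interval merge: absences shorter than an empirical `max_gap` threshold are treated as continuous presence. The interval format (camera-cut indices) and the threshold value are hypothetical.

```python
def merge_track_gaps(intervals, max_gap=2):
    """Merge a character's appearance intervals across short absences.

    `intervals` are (start_cut, end_cut) index pairs sorted by start.
    If the character is off screen for at most `max_gap` camera cuts
    (a hypothetical empirical threshold), the presence is treated as
    continuous and the intervals are merged."""
    if not intervals:
        return []
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # short gap: extend
        else:
            merged.append([start, end])  # long absence: new interval
    return [tuple(iv) for iv in merged]

# Toy usage: a two-cut absence is merged; an eleven-cut absence is not.
print(merge_track_gaps([(0, 5), (7, 9), (20, 22)]))  # [(0, 9), (20, 22)]
```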

Applications & Use-cases in Media

Before we conclude, let’s talk about some of the industry use cases where these content entities find their applications. Each use case is a complex subject in itself and will be covered in the next article in the OwlThoughts series.

  1. Viewership analytics for general entertainment content: Understanding which content elements drive ratings up and which cause attrition for TV channels and OTT platforms.
  2. Story affinity across segments: Understanding the performance of different stories and their popularity across various audience segments.
  3. Determination of editorial neutrality: Quantifying screen share of different news topics and genres to ensure editorial objectivity.
  4. Brand visibility analysis: Tracking and comparing brand screen presence, both quantitatively and qualitatively.


The purpose of this piece is two-fold:

  •  To explain how AI can help in understanding content via content entities
  •  To show why content entities are the right alternative for defining and analysing non-linear content in a decentralised and scalable way

As our media habits change and we gravitate towards consumption behaviour that isn’t bounded by time slots, content takes centre stage.

Here, the quest is to know audiences better: to understand what engages them and what interventions are possible. All of this becomes possible if we change our frame of reference to what matters most: content.

Follow us on LinkedIn for similar pieces on Artificial Intelligence in Media

Contributed by – Sankalp Chaudhary