iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

1. problems:

1.1 Most prior art in visual understanding relies solely on analyzing the “what” (e.g., event recognition) and the “where” (e.g., event localization), which, in some cases, fails to describe the correct contextual relationships between events or leads to incorrect underlying visual attention.

1.2 Common-sense reasoning [43], which raises the deeper question of “why”, is a missing piece in today’s pattern-learning-based systems, which rely solely on the likelihood of observing object Y given object X, P(Y|X).

2. contributions:

2.1 Present iPerceive, a framework that generates common-sense features by inferring causal relationships between events in videos, using contextual losses as a self-supervised training mechanism (a minimal sketch of such a loss follows this list).

2.2 Applying iPerceive to dense video captioning (iPerceive DVC) and video question answering (iPerceive VideoQA) on the ActivityNet Captions and TVQA datasets, respectively, furthers the state-of-the-art.
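The summary above does not show what a “contextual loss” looks like in code, so here is a minimal, hypothetical PyTorch sketch of one plausible formulation: the feature of an event/region is trained both to classify itself and to predict the label of a co-occurring event/region, so that the learned vector encodes contextual structure rather than appearance alone. The class name ContextualLoss, the dimensions, and the two-headed cross-entropy design are illustrative assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualLoss(nn.Module):
    """Illustrative self-supervised contextual objective (an assumption, not the
    paper's exact loss): the feature of event/region X is trained to recognize
    itself and to predict the label of a co-occurring event/region Y, pushing
    the learned vectors to capture contextual relationships."""

    def __init__(self, feat_dim: int = 1024, num_classes: int = 80):
        super().__init__()
        self.self_head = nn.Linear(feat_dim, num_classes)          # predicts X's own label
        self.context_head = nn.Linear(2 * feat_dim, num_classes)   # predicts Y's label from (X, Y)

    def forward(self, feat_x, feat_y, label_x, label_y):
        # Self-prediction: the feature must still encode what X itself is.
        loss_self = F.cross_entropy(self.self_head(feat_x), label_x)
        # Context-prediction: given X (concatenated with Y), predict what Y is.
        joint = torch.cat([feat_x, feat_y], dim=-1)
        loss_context = F.cross_entropy(self.context_head(joint), label_y)
        return loss_self + loss_context


if __name__ == "__main__":
    # Toy usage with random tensors: a batch of 4 (X, Y) event pairs.
    loss_fn = ContextualLoss()
    fx, fy = torch.randn(4, 1024), torch.randn(4, 1024)
    lx, ly = torch.randint(0, 80, (4,)), torch.randint(0, 80, (4,))
    print(loss_fn(fx, fy, lx, ly))
```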

3. introduction:

Top: an example of a cognitive error in DVC. While the girl tries to block the boy’s dunking attempt, the boy jumping (event X) eventually leads to him dunking the basketball through the hoop (event Y).

Bottom: an example of incorrect attention, where conventional DVC approaches correlate a chef and a steak with the activity of cooking without even attending to the nearby oven.

4. framework:

4.1 iPerceive DVC

iPerceive DVC generates common-sense vectors from the temporal events that the event proposal module localizes (left). Features from all modalities are sent to their corresponding encoder-decoder Transformers (middle). Upon fusing the processed features, we output the next word of the caption using a distribution over the vocabulary (right).
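As a rough illustration of the middle and right stages of this pipeline (one encoder-decoder Transformer per modality, fusion of the decoded features, and a distribution over the caption vocabulary), here is a minimal PyTorch sketch. The modality names, the use of nn.Transformer, the concatenation-based fusion, and all layer sizes are assumptions for illustration, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn


class MultimodalCaptionHead(nn.Module):
    """Sketch of per-modality encoder-decoder Transformers + fusion + vocabulary
    projection. Modalities, fusion strategy, and sizes are assumptions."""

    def __init__(self, d_model=512, vocab_size=10000,
                 modalities=("video", "audio", "speech")):
        super().__init__()
        self.transformers = nn.ModuleDict(
            {m: nn.Transformer(d_model=d_model, batch_first=True) for m in modalities}
        )
        self.fuse = nn.Linear(d_model * len(modalities), d_model)
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, features, caption_emb):
        # features: dict modality -> (B, T_m, d_model); caption_emb: (B, L, d_model)
        decoded = [self.transformers[m](features[m], caption_emb)
                   for m in self.transformers]                   # each: (B, L, d_model)
        fused = torch.relu(self.fuse(torch.cat(decoded, dim=-1)))
        return torch.log_softmax(self.to_vocab(fused), dim=-1)   # next-word distribution


if __name__ == "__main__":
    # Toy usage: batch of 2 events, 16 time steps per modality, 8 caption tokens so far.
    head = MultimodalCaptionHead()
    feats = {m: torch.randn(2, 16, 512) for m in ("video", "audio", "speech")}
    caption = torch.randn(2, 8, 512)
    print(head(feats, caption).shape)  # torch.Size([2, 8, 10000])
```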

4.2 iPerceive VideoQA

iPerceive VideoQA consists of two main components: feature fusion and frame selection.

For feature fusion, we encode features using a convolutional encoder, generate common-sense vectors from the input video sequence, and extract dense captions using iPerceive DVC (left). Features from all modalities (video, dense captions, QA, and subtitles) are then fed to dual-layer attention: word/object-level and frame-level (middle). Upon fusing the attended features, we calculate frame-relevance scores (right).
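The last step (fusing attended features and scoring frames) can be pictured with a small sketch: each frame attends over the token features (QA words, subtitles, dense captions), the attended context is fused with the frame feature, and a small head maps the result to a relevance score in [0, 1]. The single nn.MultiheadAttention layer and the scoring head below are illustrative assumptions, not the paper’s exact dual-layer attention.

```python
import torch
import torch.nn as nn


class FrameRelevanceScorer(nn.Module):
    """Sketch of frame-relevance scoring: frames attend over word/object tokens,
    the attended context is fused with each frame feature, and a small MLP
    produces a per-frame relevance score. Layout and sizes are assumptions."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.word_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, frame_feats, token_feats):
        # frame_feats: (B, F, d); token_feats: (B, L, d) from QA/subtitles/captions
        attended, _ = self.word_attn(frame_feats, token_feats, token_feats)
        fused = torch.cat([frame_feats, attended], dim=-1)    # (B, F, 2d)
        return torch.sigmoid(self.score(fused)).squeeze(-1)   # (B, F) relevance scores


if __name__ == "__main__":
    # Toy usage: 2 clips, 32 frames, 40 fused text tokens.
    scorer = FrameRelevanceScorer()
    frames, tokens = torch.randn(2, 32, 256), torch.randn(2, 40, 256)
    print(scorer(frames, tokens).shape)  # torch.Size([2, 32])
```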

5. results:

5.1 comparison with the SoTA on the TVQA validation set:

5.2 iPerceive VideoQA on TVQA:

We can see that iPerceive VideoQA furthers the SoTA across all TV shows in TVQA.

5.3 ablation analysis:

Using ablation studies, we showed that these common-sense features help the model better perceive relationships between events in videos, leading to improved performance on challenging video tasks that need cognition.
