

First, a Convolutional Neural Network (CNN) extracts a set of feature maps from the original image. This model is composed of three main components.

To compute saliency maps, we employ the Saliency Attentive Model (SAM 1) presented in, which has shown state-of-the-art results on popular saliency benchmarks, such as the MIT Saliency Benchmark and the SALICON dataset. In this way, for each subject appearing in the video frames, we can obtain a measure of his individual relevance to the social interaction. We then define the social relevance of each person as the accumulation of the saliency values inside the person's bounding box summed over time. More in detail, given a video, we first extract the saliency maps for all video frames. By providing a distribution of eye fixations over an image, this lets us estimate the amount of fixations each person on the scene would receive from the wearer, and by extension, the social relevance of each person. Saliency prediction architectures predict the distribution of eye fixation points on a given image, and are trained on data captured from eye-tracking devices. Therefore, to estimate social relevance relying only on the frames captured by a wearable camera, we choose to rely on saliency prediction. Sensors that can objectively measure the relevance of a person from the point of view of an observer, like eye-tracking glasses, are expensive and more uncomfortable to wear in public than a tiny camera. To some extent, the camera wearer could give more importance to a distant person than to a closer one, or even to somebody who is turned away. Clearly, social relevance cannot be fully estimated from features like head pose and distance. By providing complementary information to what is provided by group estimation, we argue that social relevance enables a better understanding of the social dynamics from an egocentric prospective. We name this subjective property social relevance. What cannot be unveiled by detecting social groups are all the properties of the social formation which depend on the particular observer, like the importance attached by an observer to each person in a group. Group detection estimates how people interact with each other by analyzing the geometry of their social formations. Rita Cucchiara, in Multimodal Behavior Analysis in the Wild, 2019 10.5 Social relevance estimation Recognizing social relationships from an egocentric vision perspective
