First-person Activity Recognition by Modelling Subject-Action Relevance

Manav Prabhakar; Snehasis Mukherjee

The efficacy of action recognition methods depends on the relevance of the action (Verb) to the subject (Noun). Existing methods overlook the suitability of the Noun-Verb combination in defining an action. In this work, we propose an algorithm called Reduced Verb Set Generator (RVSGen) that reduces the number of candidate verbs for an action based on the relevance of the Noun-Verb combination. We also propose a dual-modal fusion model for egocentric activity recognition that combines features extracted from the RGB channels and the optical flow vectors of an egocentric video. Unlike state-of-the-art methods, where the key objects and temporal cues are extracted simultaneously, the proposed model first extracts the spatial features, i.e., object information (Noun), from the RGB channels and then relates the Noun to a suitable action (Verb) obtained from motion information (optical flow). The verbs are predicted by a ConvLSTM architecture with the help of a modified softmax, which estimates the probability distribution from the reduced verb set obtained from RVSGen and the feature vector obtained from the ConvLSTM. The noun and verb are predicted by an architecture trained end to end and are then concatenated to form the action. Experiments on a benchmark dataset show the efficacy of the proposed method compared to the state of the art. The code for this work is available at: \url{https://github.com/mpLogics/EgoAR-RVSGen}.
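
The abstract does not give the exact form of the modified softmax, so the following is only a minimal sketch of one plausible reading: the softmax over verb scores is restricted to the reduced verb set for the predicted noun, and the winning verb is concatenated with the noun. The lookup table `NOUN_TO_VERBS`, the vocabulary `VERBS`, and the functions `reduced_verb_set`, `masked_softmax`, and `predict_action` are hypothetical names chosen for illustration; they are not taken from the authors' code, and the hand-written table merely stands in for whatever RVSGen produces.

```python
import math

# Hypothetical verb vocabulary and noun-to-verb relevance table (illustrative only).
VERBS = ["take", "put", "open", "close", "cut"]
NOUN_TO_VERBS = {
    "knife": {"cut", "take", "put"},
    "tap": {"open", "close"},
}


def reduced_verb_set(noun):
    """Return indices of verbs assumed relevant to `noun`.

    Falls back to the full vocabulary when the noun is unknown or when
    none of its listed verbs appear in VERBS.
    """
    allowed = NOUN_TO_VERBS.get(noun, set())
    indices = [i for i, verb in enumerate(VERBS) if verb in allowed]
    return indices if indices else list(range(len(VERBS)))


def masked_softmax(logits, allowed_indices):
    """Softmax over `logits` with zero probability outside `allowed_indices`."""
    allowed = set(allowed_indices)
    # Subtract the maximum allowed logit for numerical stability.
    peak = max(logits[i] for i in allowed)
    exps = [math.exp(x - peak) if i in allowed else 0.0
            for i, x in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]


def predict_action(noun, verb_logits):
    """Pick the most probable verb within the reduced set and join it with the noun."""
    probs = masked_softmax(verb_logits, reduced_verb_set(noun))
    best = max(range(len(probs)), key=probs.__getitem__)
    return f"{VERBS[best]} {noun}"


if __name__ == "__main__":
    # Stand-in scores for the verb branch; "take" has the highest raw score.
    scores = [5.0, 1.0, 2.0, 0.5, 3.0]
    print(predict_action("tap", scores))
    print(predict_action("knife", scores))
```

In this sketch, a verb with a high raw score (here "take") cannot be selected for the noun "tap" because it lies outside that noun's reduced set, which is the intuition behind pruning implausible Noun-Verb pairs before normalisation.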