The presentation briefly discusses my master's thesis research on multi-person gesture detection in artwork images, with a particular focus on improving classification performance. Current state-of-the-art transformer-based object detection models are effective at localizing persons but struggle to classify their gestures accurately. This limitation arises because object queries are confined to predefined regions of the image, which restricts access to the holistic features essential for accurate gesture classification. To address this, two key modifications to an existing transformer-based architecture are proposed: Gesture-Specific Queries and a Combined Classification Decoder. These modifications are designed to improve classification accuracy while preserving localization capabilities. Experimental evaluations demonstrate that the proposed approaches improve classification performance and establish a new state of the art for gesture detection on the SensoryArt dataset.
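To make the core idea concrete, the sketch below illustrates one way gesture-specific queries could cross-attend over the full encoder feature map so that classification sees holistic image context, while the standard object queries continue to drive localization. This is a minimal illustration in PyTorch under assumed DETR-style shapes; the module name, constructor arguments, and the way the holistic summary is fused with the object queries are hypothetical and not taken from the thesis implementation.

```python
# Minimal sketch (PyTorch assumed, DETR-style tensor shapes); illustrative only.
import torch
import torch.nn as nn

class CombinedClassificationDecoder(nn.Module):
    """Hypothetical decoder: learned gesture-specific queries cross-attend over the
    whole flattened encoder feature map, and their pooled summary is combined with
    the per-person object queries before gesture classification."""

    def __init__(self, d_model=256, n_heads=8, n_gesture_queries=25, n_classes=10):
        super().__init__()
        self.gesture_queries = nn.Parameter(torch.randn(n_gesture_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.cls_head = nn.Linear(d_model, n_classes)

    def forward(self, image_features, object_query_embeds):
        # image_features: (B, HW, d_model) flattened encoder output
        # object_query_embeds: (B, N, d_model) per-person queries from the detection decoder
        B = image_features.size(0)
        gq = self.gesture_queries.unsqueeze(0).expand(B, -1, -1)
        # Gesture queries attend over the entire image, gathering holistic context.
        holistic, _ = self.cross_attn(gq, image_features, image_features)
        holistic = self.norm(gq + holistic)
        # Fuse the pooled holistic summary with each object query, then classify.
        combined = object_query_embeds + holistic.mean(dim=1, keepdim=True)
        return self.cls_head(combined)  # (B, N, n_classes) gesture logits per detected person
```

In this sketch the localization branch is untouched; only the classification path gains access to image-wide features, which mirrors the stated goal of improving gesture classification without degrading person localization.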