joints of a human skeleton in natural images and is one of the most
important visual recognition tasks in scenes containing humans, with
numerous applications in robotics, virtual and augmented reality,
gaming, and healthcare, among others. This thesis proposes methods for
human pose estimation and articulated pose tracking in images and video
sequences.
Unlike most prior work, which focuses on pose estimation of single,
pre-localised humans, we address the case of multiple people in
real-world images, which entails several challenges, such as
person-person overlaps in highly crowded scenes, an unknown number of
people, and people entering and leaving video sequences. Our
multi-person pose estimation
algorithm is based on the bottom-up detection-by-grouping paradigm
through optimization of a multicut graph partitioning objective. Our
approach is general and is also applicable to multi-target pose tracking
in videos. To facilitate further research on this problem, we collect
and annotate a large-scale dataset and benchmark for articulated
multi-person tracking. We also propose a method for
estimating 3D body pose from a wearable setup: a pair of
downward-facing, head-mounted cameras that capture the entire body. Our
final contribution is a method for reconstructing 3D objects from weak
supervision. Our approach represents objects as 3D point clouds and
learns them from 2D supervision alone, without requiring camera pose
information at training time. We design a differentiable renderer for
point clouds, as well as a novel loss formulation for dealing with
camera pose ambiguity.