So far, pose recognition has been tackled by looking at specific body parts, such as the limbs (position of the arms and legs) ( 2, 3), hands (position of the fingers) ( 4) or face (position of the eyes, nose, eyebrows, mouth and face contour) ( 5). To achieve a global understanding of human poses, though, one must capture information about all body parts at once. Animating an avatar in a realistic fashion, for example, requires the capture not only of limb position but also of facial expression and hand gesture, as finger poses and facial movements carry a great deal of non-verbal communication. Moreover, teaching a robot arm how to perform a task goes beyond the position of the arm to the hand; nuanced information about finger movement would be required to enable manipulation of an object, for example.

We're proposing the first learning-based framework that can detect people in an image and estimate their whole-body pose (including body, face and hands) in both 2D and 3D, as shown in Figure 1 ( 6). Importantly, our aim is to tackle this problem in the wild, which means that our method must be robust to occlusion (where the object being tracked is partially obscured by another object or another person) and to truncation at the image boundary (where the object is cut off at the corner or edge of the image). Our framework should also be able to accurately estimate poses in images that depict multiple people interacting either among themselves or with objects in the scene.

We propose leveraging these part-specific datasets to train a single network for whole-body pose estimation using distillation. More precisely, we use distillation to transfer the knowledge of several body-part experts into a unified network that outputs a more complete representation of the whole human body. An overview of our training framework is shown in Figure 2.

Given a training image, a whole-body network requires ground-truth annotations, i.e. information about the exact 2D and 3D locations of every joint for each part, for the bodies, hands and faces observed in the image. As we don't have such ground-truth data, we propose instead using an expert (a network specialized for 2D/3D pose estimation of a given body part) for each part. In the example shown in Figure 2, we run a body expert, a hand expert and a face expert to obtain detections and 2D/3D poses for each. We then combine the estimations of the three experts to obtain detections for whole bodies that can then be used as pseudo-ground-truth annotations to train our whole-body network. Note that because we assume that these experts are already trained on dedicated datasets, they are frozen (i.e. do not change) during training of the whole-body network. We also employ a distillation loss, which ensures the network makes predictions that are as close as possible to those made by the experts. Since we are distilling the knowledge of each part expert into a single network for whole-body pose estimation, we call our method 'distillation of part experts', or DOPE.
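To make the training scheme more concrete, here is a minimal PyTorch-style sketch of one distillation step, in which the frozen part experts produce pseudo-ground-truth whole-body poses that the student network learns to reproduce. This is an illustration under stated assumptions rather than the actual DOPE code: the tiny MLPs, the keypoint counts and the smooth-L1 distillation loss are placeholders chosen so the snippet runs end to end.

```python
import torch
import torch.nn as nn

# Illustrative keypoint counts (placeholders, not necessarily the paper's values):
NUM_BODY, NUM_HANDS, NUM_FACE = 13, 2 * 21, 84
POSE_DIM = 5  # (x, y) in 2D plus (X, Y, Z) in 3D per keypoint

def make_net(num_keypoints, feat_dim=128):
    """Stand-in for an LCR-Net++-style part expert or the whole-body student."""
    return nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                         nn.Linear(256, num_keypoints * POSE_DIM))

body_expert, hand_expert, face_expert = make_net(NUM_BODY), make_net(NUM_HANDS), make_net(NUM_FACE)
experts = [body_expert, hand_expert, face_expert]
whole_body_net = make_net(NUM_BODY + NUM_HANDS + NUM_FACE)  # student predicts all parts at once

# The part experts are assumed to be pre-trained on dedicated datasets and stay frozen.
for expert in experts:
    expert.eval()
    for p in expert.parameters():
        p.requires_grad_(False)

optimizer = torch.optim.Adam(whole_body_net.parameters(), lr=1e-4)
distill_loss = nn.SmoothL1Loss()  # assumed loss; the paper's exact formulation may differ

def train_step(image_features):
    """One distillation step on a batch of (already extracted) image features."""
    with torch.no_grad():
        # Run the frozen experts to build pseudo-ground-truth whole-body poses.
        pseudo_gt = torch.cat([expert(image_features) for expert in experts], dim=-1)
    pred = whole_body_net(image_features)
    loss = distill_loss(pred, pseudo_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random features standing in for a batch of eight images.
print(train_step(torch.randn(8, 128)))
```

In the real system the experts and the student share the detection pipeline described next; the sketch only captures the frozen-teacher and pseudo-ground-truth idea.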
DOPE uses a detection architecture in which bodies, hands and faces are the objects detected in each image. To achieve this, we extend LCR-Net++ ( 2), a recently developed body-pose-detection architecture that's robust in a variety of challenging real-world scenarios (an overview is shown in Figure 3). As well as estimating body poses, LCR-Net++ has been adapted ( 4) to address the challenge of hand-pose estimation; here, the anchor poses represent a set of particular hand poses and the regression is applied to the hand key points. Additionally, we adapted LCR-Net++ to tackle face-pose detection, with facial features as key points. The original LCR-Net++ architecture and the adapted versions for hands and faces comprise the three part experts that we required for our architecture.

With a given image, LCR-Net++ extracts convolutional features and feeds them into a region proposal network (RPN) to generate candidate boxes that contain potential body instances. After the features of the image are pooled according to these boxes, additional convolutions are applied to separate the data into two branches. Of these, the classification branch detects the most similar pose from a discrete set of predefined anchor poses. Then, the regression branch applies an anchor-pose-specific regression to refine the predicted pose in both 2D and 3D (a minimal sketch of this two-branch head is given below).

Footage from a live demo of our approach (Video 1) shows that DOPE is fast, even when running in real time on relatively inexpensive equipment (a laptop with a GTX 1080 graphics card). For clarity, we only show the 2D poses, but 3D poses are also estimated.
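As a rough illustration of the two-branch prediction head described above, here is a small PyTorch-style sketch: the classification branch scores a set of predefined anchor poses (plus a background class), and the regression branch outputs anchor-specific offsets that refine the selected anchor in 2D and 3D. The feature dimension, the number of anchors and joints, and the random anchor poses are assumptions made for illustration; they are not the values used in LCR-Net++.

```python
import torch
import torch.nn as nn

class AnchorPoseHead(nn.Module):
    """Toy two-branch head: classify the most similar anchor pose for each
    candidate box, then refine it with anchor-pose-specific offsets."""

    def __init__(self, feat_dim=256, num_anchors=20, num_joints=13):
        super().__init__()
        self.num_anchors = num_anchors
        self.num_joints = num_joints
        # Classification branch: one score per predefined anchor pose, plus background.
        self.cls = nn.Linear(feat_dim, num_anchors + 1)
        # Regression branch: per-anchor refinement of every joint in 2D (x, y) and 3D (X, Y, Z).
        self.reg = nn.Linear(feat_dim, num_anchors * num_joints * 5)
        # Predefined anchor poses (random here; in practice derived from training poses).
        self.register_buffer("anchor_poses", torch.randn(num_anchors, num_joints, 5))

    def forward(self, pooled_feats):
        scores = self.cls(pooled_feats)                                  # (N, K + 1)
        deltas = self.reg(pooled_feats).view(-1, self.num_anchors, self.num_joints, 5)
        best = scores[:, 1:].argmax(dim=1)                               # most similar anchor per box
        idx = torch.arange(pooled_feats.shape[0])
        # Refined pose = selected anchor pose + its anchor-specific offsets.
        refined = self.anchor_poses[best] + deltas[idx, best]
        return scores, refined

# Example: four candidate boxes, each with a 256-d RoI-pooled feature vector.
head = AnchorPoseHead()
scores, poses = head(torch.randn(4, 256))
print(scores.shape, poses.shape)  # torch.Size([4, 21]) torch.Size([4, 13, 5])
```

At inference time, a head of this kind would run on the RoI-pooled features of each candidate box produced by the RPN; the final aggregation of per-box predictions is not shown here.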