Discussion

Moderation: The weights from the moderators tell us how much to trust the features (information) from the body image versus the part (head/hand) crops. The animation figure visualizes these weights: a lighter color means the model trusts the body more.
Face from SMPL-X model: In PIXIE, we use a common shape variable for the body and the face. The shape space of SMPL-X is learned from 3800 real full-body scans and captures the correlation between body and face shape. In the samples script we support the option of predicting the shape of the full body from a face-only image. If you are only interested in accurate face shape, we suggest trying other face-specific work, such as DECA.
Model conversion: If you want to convert the predicted SMPL-X body to a different body model, e.g. SMPL, please take a look here
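
To make the moderation note above more concrete, here is a minimal sketch of how a moderator weight can blend a body feature with a part feature. It is an illustration under stated assumptions, not the PIXIE implementation: the function name `moderate`, the `temperature` parameter, and the hand-supplied `logit` are hypothetical. In PIXIE the weight is predicted by a learned moderator network rather than passed in by hand.

```python
import numpy as np


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def moderate(body_feat, part_feat, logit, temperature=1.0):
    """Blend a body feature with a part (head/hand) feature.

    `logit` stands in for the output of a learned moderator network
    (hypothetical here; it is supplied by hand for illustration).
    The returned weight w lies in (0, 1):
      w near 1 -> trust the body feature,
      w near 0 -> trust the part-crop feature.
    """
    w = sigmoid(temperature * logit)
    fused = w * body_feat + (1.0 - w) * part_feat
    return fused, w


# Toy features: the body branch says "1", the part branch says "0".
body = np.ones(4)
part = np.zeros(4)

fused_even, w_even = moderate(body, part, logit=0.0)   # equal trust
fused_body, w_body = moderate(body, part, logit=10.0)  # body dominates
```

With a logit of zero the two sources are averaged. A large positive logit pushes the weight toward 1, which corresponds to the lighter colors in the visualization.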

Cropping matters: Even though we applied extensive data augmentation during training, the results will still vary a bit due to cropping differences in the input images. For simplicity, we chose Faster-RCNN as the person detector here, while in the paper we used OpenPose keypoints to compute a bounding box.
Perspective projection: We use a scaled-orthographic/weak-perspective camera model in PIXIE, which does not work well for images with strong perspective distortion.
Misalignment issue: For regression methods that output model (SMPL/SMPL-X) parameters, misalignment with the person in the image is a persistent issue. This can be improved by using PIXIE results as initialization for an optimization method like SMPLify-X to refine the pose. Note that the moderator weights could also be used as a confidence measure during optimization.
Speed: The main bottleneck of PIXIE is the need for three separate encoders for the body, head and hand images. Replacing the current backbone (ResNet-50 or HRNet) with a lighter one, such as MobileNet, should accelerate inference at the cost of accuracy. We will attempt to provide different options when we release the training code.
Texture: Similar to DECA, we rely upon the Basel face model for our albedo space. Its lack of ethnic diversity in the albedo causes the model to often compensate for skin tone with lighting.
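
The perspective projection limitation above can be seen in a small numerical sketch. It compares one common parameterization of a scaled-orthographic camera (a scale and a 2D translation, with depth ignored) against a pinhole camera. The function names, the focal length, and the toy points are assumptions for illustration, and the exact camera convention in PIXIE may differ.

```python
import numpy as np


def weak_perspective(points, scale, trans):
    """Scaled-orthographic projection: drop depth, translate, then scale.

    points: (N, 3) array of 3D points
    scale:  scalar camera scale
    trans:  (2,) translation in the image plane
    """
    points = np.asarray(points, dtype=float)
    return scale * (points[:, :2] + np.asarray(trans, dtype=float))


def perspective(points, focal):
    """Pinhole projection: x and y are divided by depth z."""
    points = np.asarray(points, dtype=float)
    return focal * points[:, :2] / points[:, 2:3]


# Two points with the same x/y offset but different depth.
pts = np.array([
    [0.5, 0.0, 1.0],  # e.g. a hand reaching toward the camera
    [0.5, 0.0, 2.0],  # e.g. a point on the torso, farther away
])

wp = weak_perspective(pts, scale=1.0, trans=[0.0, 0.0])
pp = perspective(pts, focal=1.0)
```

Under the weak-perspective camera both points land on the same 2D location, because depth is ignored. Under the pinhole camera the farther point moves halfway toward the image center. When a person is close to the camera, this depth-dependent scaling is large and a single global scale cannot reproduce it.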