I am currently a research engineer at valeo.ai and a PhD student in the IMAGINE lab at École Nationale des Ponts et Chaussées (ENPC), working on 4D Human Pose and Shape Estimation from LiDAR data. I am supervised by Dr. Nermin Samet on the valeo.ai side and Prof. David Picard on the IMAGINE lab side. Previously, I was an intern and then a research engineer at NAVER LABS Europe in the 3D Humans team. I hold an MSc in decision theory from Université Paris Dauphine - PSL and an engineering degree in Industrial Computing and Automation from the National Institute of Applied Science and Technology (INSAT).
In this paper, we present a comprehensive review of 3D human pose estimation and human mesh recovery from in-the-wild LiDAR point clouds. We compare existing approaches across several key dimensions, and propose a structured taxonomy to classify these methods. Following this taxonomy, we analyze each method’s strengths, limitations, and design choices. In addition, (i) we perform a quantitative comparison of the three most widely used datasets, detailing their characteristics; (ii) we compile unified definitions of all evaluation metrics; and (iii) we establish benchmark tables for both tasks on these datasets to enable fair comparisons and promote progress in the field. We also outline open challenges and research directions critical for advancing LiDAR-based 3D human understanding. Moreover, we maintain an accompanying webpage that organizes papers according to our taxonomy and continuously update it with new studies: https://github.com/valeoai/3D-Human-Pose-Shape-Estimation-from-LiDAR
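For readers unfamiliar with the metrics unified in the review, the two most common ones in this literature are MPJPE and its Procrustes-aligned variant, PA-MPJPE. Below is a minimal NumPy sketch of their standard definitions (joint order and units, typically millimeters, are dataset-dependent); it follows the usual formulations, not necessarily the exact notation of the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joints. pred, gt: (J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-Aligned MPJPE: MPJPE after the optimal similarity
    transform (rotation, scale, translation) mapping pred onto gt,
    which factors out global placement and scale errors."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)          # 3x3 cross-covariance
    sign = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    scale = (S * np.array([1.0, 1.0, sign])).sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R.T + mu_g, gt)

# A rigidly translated prediction has zero PA-MPJPE but non-zero MPJPE.
rng = np.random.default_rng(0)
gt = rng.normal(size=(24, 3))              # e.g. 24 SMPL body joints
pred = gt + np.array([0.0, 0.0, 0.5])      # 0.5 m translation error
print(mpjpe(pred, gt), pa_mpjpe(pred, gt)) # 0.5 vs ~0.0
```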
@article{galaaoui3DHumanPose2025,
  title    = {3D Human Pose and Shape Estimation from LiDAR Point Clouds: A Review},
  author   = {Galaaoui, Salma and Valle, Eduardo and Picard, David and Samet, Nermin},
  journal  = {arXiv},
  year     = {2025},
  website  = {https://github.com/valeoai/3D-Human-Pose-Shape-Estimation-from-LiDAR},
  preprint = {https://arxiv.org/abs/2509.12197},
  type     = {Preprint}
}
CondiMen: Conditional Multi-Person Mesh Recovery
Romain Brégier, Fabien Baradel, Thomas Lucas, and 4 more authors
Multi-person human mesh recovery (HMR) consists of detecting all individuals in a given input image and predicting the body shape, pose, and 3D location of each detected person. The dominant approaches to this task rely on neural networks trained to output a single prediction for each detected individual. In contrast, we propose CondiMen, a method that outputs a joint parametric distribution over likely poses, body shapes, intrinsics and distances to the camera, using a Bayesian network. This approach offers several advantages. First, a probability distribution can handle some inherent ambiguities of this task, such as the uncertainty between a person’s size and their distance to the camera, or more generally the loss of information that occurs when projecting 3D data onto a 2D image. Second, the output distribution can be combined with additional information to produce better predictions, e.g., by using known camera or body shape parameters, or by exploiting multi-view observations. Third, one can efficiently extract the most likely predictions from this output distribution, making the proposed approach suitable for real-time applications. Empirically, we find that our model i) achieves performance on par with or better than the state of the art, ii) captures uncertainties and correlations inherent in pose estimation, and iii) can exploit additional information at test time, such as multi-view consistency or body shape priors. CondiMen spices up the modeling of ambiguity, using just the right ingredients on hand.
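To make the "combine the distribution with extra information" point concrete, here is a toy NumPy sketch using a single joint Gaussian as a stand-in for CondiMen's Bayesian network (the actual model and its parametrization differ). Conditioning in closed form shows how observing the body shape sharpens the distance estimate in the size/distance ambiguity; for a Gaussian, the most likely prediction is simply the conditional mean.

```python
import numpy as np

def condition_gaussian(mu, cov, known_idx, known_val):
    """Condition N(mu, cov) on x[known_idx] = known_val; return the mean
    and covariance of the remaining coordinates:
    mu_a + C_ab C_bb^{-1} (v - mu_b),  C_aa - C_ab C_bb^{-1} C_ba."""
    all_idx = np.arange(len(mu))
    a = np.setdiff1d(all_idx, known_idx)   # unobserved block
    b = np.asarray(known_idx)              # observed block
    gain = cov[np.ix_(a, b)] @ np.linalg.inv(cov[np.ix_(b, b)])
    mu_cond = mu[a] + gain @ (known_val - mu[b])
    cov_cond = cov[np.ix_(a, a)] - gain @ cov[np.ix_(b, a)]
    return mu_cond, cov_cond

# Toy size/distance ambiguity: a taller person farther away projects to
# the same 2D silhouette, so height and camera distance are correlated.
mu = np.array([1.70, 5.00])                # [height (m), distance (m)]
cov = np.array([[0.04, 0.18],
                [0.18, 1.00]])             # correlation ~0.9
mu_d, cov_d = condition_gaussian(mu, cov, [0], np.array([1.85]))
print(mu_d)    # distance belief shifts to ~5.68 m ...
print(cov_d)   # ... and its variance shrinks from 1.00 to 0.19
```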
@article{bregierCondimenConditionalMultiPerson2025,
  author   = {Brégier, Romain and Baradel, Fabien and Lucas, Thomas and Galaaoui, Salma and Armando, Matthieu and Weinzaepfel, Philippe and Rogez, Grégory},
  title    = {CondiMen: Conditional Multi-Person Mesh Recovery},
  journal  = {IEEE CVPR RHOBIN Workshop},
  doi      = {10.1109/CVPRW67362.2025.00373},
  year     = {2025},
  preprint = {https://arxiv.org/abs/2412.13058},
  type     = {Conference Paper}
}
Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot
Fabien Baradel, Matthieu Armando, Salma Galaaoui, and 4 more authors
We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on 448x448 images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.
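Below is a minimal PyTorch sketch of the two-stage idea described above: score every ViT patch token with a person-center heatmap, then let one query per detection cross-attend to the full token set to regress its parameters. All module names, dimensions, and the simple per-token thresholding (the real model works on a 2D heatmap, with hands and face as well) are illustrative assumptions, not the released Multi-HMR code.

```python
import torch
import torch.nn as nn

class ToyHumanPredictionHead(nn.Module):
    """Hypothetical sketch: heatmap-based person detection over patch
    tokens, then one query per detected person cross-attending to all
    tokens to regress pose, shape and depth."""
    def __init__(self, dim=256, n_pose=53 * 6, n_shape=10):
        # n_pose: e.g. SMPL-X joint rotations in a 6D representation
        super().__init__()
        self.heatmap = nn.Linear(dim, 1)        # person-center score per token
        self.query_init = nn.Linear(dim, dim)   # seed a query from the detected token
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.pose_head = nn.Linear(dim, n_pose)
        self.shape_head = nn.Linear(dim, n_shape)
        self.depth_head = nn.Linear(dim, 1)     # distance along the camera ray

    def forward(self, tokens, score_thresh=0.5):
        # tokens: (1, N, dim) patch features from a ViT backbone
        scores = self.heatmap(tokens).sigmoid().squeeze(-1)  # (1, N)
        keep = scores[0] > score_thresh                      # detected centers
        queries = self.query_init(tokens[:, keep])           # (1, K, dim)
        attended, _ = self.cross_attn(queries, tokens, tokens)
        return {
            "pose": self.pose_head(attended),
            "shape": self.shape_head(attended),
            "depth": self.depth_head(attended),
        }

feats = torch.randn(1, 32 * 32, 256)  # e.g. a 448x448 image, 14-pixel patches
out = ToyHumanPredictionHead()(feats)
```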
@article{baradelMultiHMRMultipersonWholeBody2024,
  author   = {Baradel, Fabien and Armando, Matthieu and Galaaoui, Salma and Brégier, Romain and Weinzaepfel, Philippe and Rogez, Grégory and Lucas, Thomas},
  title    = {Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot},
  journal  = {Springer Nature - ECCV 2024},
  doi      = {10.1007/978-3-031-73337-6_12},
  year     = {2024},
  preprint = {https://arxiv.org/abs/2402.14654},
  type     = {Conference Paper}
}
Cross-view and Cross-pose Completion for 3D Human Understanding
Matthieu Armando, Salma Galaaoui, Fabien Baradel, and 5 more authors
Human perception and understanding is a major domain of computer vision which, like many other vision subdomains, recently stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy, relying on general-purpose object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth, such as 2D or 3D labels, does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked, and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos in order to learn priors about 3D as well as human motion. We pre-train a model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and obtain state-of-the-art performance, for instance, when fine-tuning for model-based and model-free human mesh recovery.
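The pre-training objective lends itself to a compact sketch. Below is a hypothetical PyTorch version of the cross-view/cross-pose completion loss: mask most patches of a first image of a person and reconstruct them from its visible patches plus a second image (another viewpoint, or another instant from a video). It uses a single joint encoder and omits positional embeddings for brevity, whereas the actual model follows a CroCo-style encoder-decoder; all sizes are illustrative.

```python
import torch
import torch.nn as nn

def patchify(img, p=16):
    """(B, 3, H, W) -> (B, N, 3*p*p) non-overlapping patches."""
    B, C, H, W = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

class ToyCrossViewCompletion(nn.Module):
    """Hypothetical sketch of the pre-training objective: reconstruct the
    masked patches of image A from A's visible patches and image B."""
    def __init__(self, p=16, dim=256):
        super().__init__()
        self.p = p
        self.embed = nn.Linear(3 * p * p, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, 3 * p * p)  # predict raw pixels

    def forward(self, img_a, img_b, mask_ratio=0.75):
        a = self.embed(patchify(img_a, self.p))
        b = self.embed(patchify(img_b, self.p))
        B, N, D = a.shape
        masked = torch.rand(B, N, device=a.device) < mask_ratio
        a = torch.where(masked.unsqueeze(-1), self.mask_token.expand(B, N, D), a)
        # Joint encoding: masked tokens of A can attend to all of B.
        out = self.encoder(torch.cat([a, b], dim=1))[:, :N]
        target = patchify(img_a, self.p)
        return ((self.head(out) - target) ** 2)[masked].mean()

loss = ToyCrossViewCompletion()(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
loss.backward()
```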
@article{armandoCrossViewCrossPoseCompletion2024,
  author   = {Armando, Matthieu and Galaaoui, Salma and Baradel, Fabien and Lucas, Thomas and Leroy, Vincent and Brégier, Romain and Weinzaepfel, Philippe and Rogez, Grégory},
  title    = {Cross-view and Cross-pose Completion for 3D Human Understanding},
  journal  = {IEEE - CVPR 2024},
  doi      = {10.1109/CVPR52733.2024.00150},
  year     = {2024},
  preprint = {https://arxiv.org/abs/2311.09104},
  type     = {Conference Paper}
}
You can contact me through LinkedIn or via my work email.