Diversity and Size
• 3.6 million 3D human poses and corresponding images
• 11 professional actors (6 male, 5 female)
• 17 scenarios (discussion, smoking, taking photo, talking on the phone, ...)
Accurate Capture and Synchronization
• High-resolution 50 Hz video from 4 calibrated cameras
• Accurate 3D joint positions and joint angles from a high-speed motion capture system
• Pixel-level labels for 24 body parts in each configuration
• Time-of-flight range data
• 3D laser scans of the actors
• Accurate background subtraction and person bounding boxes
Support for Development
• Precomputed image descriptors
• Software for visualization and discriminative human pose prediction
• Performance evaluation on a withheld test set
References
The datasets, large-scale learning techniques, and related experiments are described in:
Catalin Ionescu, Dragos Papava, Vlad Olaru and Cristian Sminchisescu, Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, July 2014 [pdf][bibtex]

Catalin Ionescu, Fuxin Li and Cristian Sminchisescu, Latent Structured Models for Human Pose Estimation, International Conference on Computer Vision, 2011 [pdf][bibtex]
The data usage license agreement requires citation of the two papers above. Note that citing the dataset URL instead of the publications is not compliant with this license agreement.
Besides the laboratory test sets, we also focused on providing test data that covers variations in clothing, complex backgrounds, camera motion, and occlusion. We are not aware of any test setting of this level of difficulty in the literature. Real images contain people in complex poses, but the diverse backgrounds, scene illumination, and occlusions can vary independently and represent important nuisance factors that vision systems should be robust against. Although approaches to handle such cases exist in principle, real images remain difficult to annotate. This section of our dataset was designed specifically to address these issues.
We create movies by inserting high-quality 3D rigged animation models into real videos, yielding realistic, complex backgrounds, good-quality image data, and very accurate 3D pose information. The mixed-reality movies were created by inserting and rendering 3D models of a fully clothed man and woman into real videos. The poses used for animating the models were extracted directly from our laboratory test set. The insertion required solving for the camera motion of the background footage, as well as its internal parameters, in order to render at good quality. The scene was set up and rendered using the mental ray (ray-tracing) renderer, with several well-placed area lights and skylights. To improve quality, we placed a transparent plane on the ground to receive shadows. Scenes with occlusion were also created. The dataset contains 5 different dynamic backgrounds obtained with a moving camera, for a total of 10,350 examples, of which 1,270 frames contain various degrees of occlusion.
Code
Visualization: For inspecting the data we provide visualization code, available once you log in.
Baseline pose estimation: We provide implementations of a set of baseline prediction methods. This also includes code for data manipulation, feature extraction, and large-scale discriminative learning methods based on Fourier approximations.
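The Fourier approximations mentioned above refer to approximating a shift-invariant kernel with an explicit random feature map, so that large-scale kernel regression reduces to linear regression. A minimal sketch of this general technique (random Fourier features for an RBF kernel, with illustrative function names and synthetic data, not the dataset's actual baseline code):

```python
import numpy as np

def random_fourier_features(X, n_features=256, gamma=1.0, rng=None):
    """Random feature map z(x) such that z(x) @ z(y) approximates
    the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Frequencies sampled from the kernel's spectral density.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def ridge_fit(Z, y, lam=1e-3):
    """Linear ridge regression in the random feature space; this
    approximates kernel ridge regression at a fraction of the cost."""
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ y)
```

Because the feature map is explicit, training scales linearly in the number of examples rather than quadratically as with an exact kernel matrix, which is what makes such methods practical at the scale of 3.6 million poses.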
Precomputed Segments
We provide two types of segmentation for our video data. These are precomputed on the raw image data for the most accurate results, and are available in the download section of this website.
Bounding Box: To obtain very accurate bounding boxes, we reproject our calibrated 3D poses into each view and fit rectangles around the projections.
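The reprojection step above can be sketched as follows, assuming a standard 3x4 pinhole projection matrix (a hypothetical camera model for illustration; the dataset ships its own calibration data and code):

```python
import numpy as np

def project_points(P, X):
    """Project Nx3 world points X through a 3x4 camera matrix P."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])   # homogeneous coordinates
    x = Xh @ P.T                                    # Nx3 image-plane points
    return x[:, :2] / x[:, 2:3]                     # perspective divide

def bbox_from_joints(P, joints_3d, margin=10):
    """Fit an axis-aligned rectangle around the projected joints,
    padded by a small margin in pixels."""
    uv = project_points(P, joints_3d)
    x0, y0 = uv.min(axis=0) - margin
    x1, y1 = uv.max(axis=0) + margin
    return x0, y0, x1, y1
```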
Background Subtraction: Background models are built from image data acquired separately for this purpose. A graph cut, with the unary potential given by this background model and the binary potential given by image edges, is used to obtain the final background subtraction result. The graph cut is performed only inside the bounding box.
Precomputed Features
For both segmentations (bounding boxes and background subtraction) we provide precomputed pyramid HoG features with different parameter settings, extracted both on the silhouettes alone and on the silhouettes with internal edges.
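The idea behind pyramid HoG can be sketched as follows: orientation histograms are computed over grids of cells at several resolutions and concatenated. This is a simplified illustration (no block normalization or bin interpolation, and the grid levels are assumptions, not the dataset's actual parameter settings):

```python
import numpy as np

def hog_cells(img, n_bins=9, cells=(2, 2)):
    """Gradient-orientation histograms over a grid of cells
    (simplified HoG descriptor)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    H, W = img.shape
    feats = []
    for i in range(cells[0]):
        for j in range(cells[1]):
            ys = slice(i * H // cells[0], (i + 1) * H // cells[0])
            xs = slice(j * W // cells[1], (j + 1) * W // cells[1])
            h = np.bincount(bins[ys, xs].ravel(),
                            weights=mag[ys, xs].ravel(),
                            minlength=n_bins)
            feats.append(h / (np.linalg.norm(h) + 1e-6))  # L2-normalize per cell
    return np.concatenate(feats)

def pyramid_hog(img, levels=(1, 2, 4), n_bins=9):
    """Concatenate cell histograms over a spatial pyramid
    (1x1, 2x2, and 4x4 grids by default)."""
    return np.concatenate([hog_cells(img, n_bins, (l, l)) for l in levels])
```

Running this on a silhouette image yields one fixed-length descriptor whose coarse levels capture global shape and whose fine levels capture local edge structure.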