Develop Environment

Compare to radar and image processing, lidar point cloud processing has a lot more compuational accelerateions that is coded with c++. Compilation with c/c++ is not pleasant yet a necessary step for lidar based perception.

While there are lot cuda code for 3d inforamtion processing in the public domain, very few of them are actively maintained. MMlab is one exception and that is what we decided to use for this and future projects. You can learn more about mmlab here.

User Docker Image

It is easier for future users to just pull the docker image which has been complied. Specific instructions are the following:

install docker

install nvidia container runtime

install nvidia container runtime which will allow docker to interface with your local gpu. Detailed instructions can be find here

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L | sudo apt-key add - \
   && curl -s -L$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2

Set nvidia-container-runtime as default docker runtime by edit the /etc/docker/daemon.json file:

    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
    "default-runtime": "nvidia" 

restart docker sudo systemctl restart docker

get the docker image

pull image from docker hub with docker pull bailiping/mmlab:v3.0 or build it locally with the docker file provided. the base image for the docker file is spconv compilation environment. You will have to set nvidia container runtime as default for the building process as well.

run container

to run the docker image with command:

sudo docker run  --gpus all -it -v '/home/zhubinglab/Desktop/NuScenes_Project':/root/NuScenes_Project  --name mmlab bailiping/mmlab:v3.0 bash
  • --gpus all means expose all gpus to docker
  • -it means interactive
  • -v means mount a volumn

verify container

use to check if this environment works in your local machine

  • 1
  • 1

Setup your own machine

If you prefer to compile things for your own machine, here is how we set up ours for your reference. A key take away is that don’t try to manually uninstall nvidia related stuff. Coz most likely there would be broken links/package left and would make things tidious. Do it once, restart your computer right away to make sure the driver is working properly before you proceed.

  • current system
  • current system
  • current system

Alternatively, here is another setup that can work

  • alternative setup

You will have to download pretrained models in order for the code to work. The following is an sample script for how to set things up. we use this centerpoint model for our inference, yet you can download any other pretrained models. For detailed discription, please read mmlab documentation

mkdir checkpoint
cd checkpoint


Dataset Schema

Schema of NuScenes Dataset

visualization for the annotation (ground truth)

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

    Sample Data

    sample_data {
     "token":                   <str> -- Unique record identifier.
     "sample_token":            <str> -- Foreign key. Sample to which this sample_data is associated.
     "ego_pose_token":          <str> -- Foreign key.
     "calibrated_sensor_token": <str> -- Foreign key.
     "filename":                <str> -- Relative path to data-blob on disk.
     "fileformat":              <str> -- Data file format.
     "width":                   <int> -- If the sample data is an image, this is the image width in pixels.
     "height":                  <int> -- If the sample data is an image, this is the image height in pixels.
     "timestamp":               <int> -- Unix time stamp.
     "is_key_frame":            <bool> -- True if sample_data is part of key_frame, else False.
     "next":                    <str> -- Foreign key. Sample data from the same sensor that follows this in time. Empty if end of scene.
     "prev":                    <str> -- Foreign key. Sample data from the same sensor that precedes this in time. Empty if start of scene.

    Sample Annotation

    A bounding box defining the position of an object seen in a sample. All location data is given with respect to the GLOBAL coordinate system. ``` sample_annotation { “token”: -- Unique record identifier. "sample_token": -- Foreign key. NOTE: this points to a sample NOT a sample_data since annotations are done on the sample level taking all relevant sample_data into account. "instance_token": -- Foreign key. Which object instance is this annotating. An instance can have multiple annotations over time. "attribute_tokens": [n] -- Foreign keys. List of attributes for this annotation. Attributes can change over time, so they belong here, not in the instance table. "visibility_token": -- Foreign key. Visibility may also change over time. If no visibility is annotated, the token is an empty string. "translation": [3] -- Bounding box location in meters as center_x, center_y, center_z. "size": [3] -- Bounding box size in meters as width, length, height. "rotation": [4] -- Bounding box orientation as quaternion: w, x, y, z. "num_lidar_pts": -- Number of lidar points in this box. Points are counted during the lidar sweep identified with this sample. "num_radar_pts": -- Number of radar points in this box. Points are counted during the radar sweep identified with this sample. This number is summed across all radar sensors without any invalid point filtering. "next": -- Foreign key. Sample annotation from the same object instance that follows this in time. Empty if this is the last annotation for this object. "prev": -- Foreign key. Sample annotation from the same object instance that precedes this in time. Empty if this is the first annotation for this object. }

## Keyframe and Sweeps
the `keyframe` is a `2Hz`

After collecting the driving data, we sample well synchronized keyframes (image, LIDAR, RADAR) at 2Hz and send them to our annotation partner Scale for annotation.

the `lidar sweep` is at `20Hz`, image is at `12Hz`, radar is at `15Hz`

In order to achieve good cross-modality data alignment between the LIDAR and the cameras, the exposure of a camera is triggered when the top LIDAR sweeps across the center of the camera’s FOV. The timestamp of the image is the exposure trigger time; and the timestamp of the LIDAR scan is the time when the full rotation of the current LIDAR frame is achieved. Given that the camera’s exposure time is nearly instantaneous, this method generally yields good data alignment. Note that the cameras run at 12Hz while the LIDAR runs at 20Hz. The 12 camera exposures are spread as evenly as possible across the 20 LIDAR scans, so not all LIDAR scans have a corresponding camera frame. Reducing the frame rate of the cameras to 12Hz helps to reduce the compute, bandwidth and storage requirement of the perception system.

# Inference Output

## File Name
- The last 16 digits are the `timestamp` for the frame.
- use the `timestamp` in order to find out which frame it is in a scene.
## Bounding Boxes
the inference bbox information is giben in `LIDAR coordinate system`. Therefore need to be projected to the `GLOBAL coordinate system`.

class LiDARInstance3DBoxes(BaseInstance3DBoxes):
    """3D boxes of instances in LIDAR coordinates.

    Coordinates in LiDAR:

    .. code-block:: none

                            up z    x front (yaw=-0.5*pi)
                               ^   ^
                               |  /
                               | /
      (yaw=-pi) left y <------ 0 -------- (yaw=0)

    The relative coordinate of bottom center in a LiDAR box is (0.5, 0.5, 0),
    and the yaw is around the z axis, thus the rotation axis=2.
    The yaw is 0 at the negative direction of y axis, and decreases from
    the negative direction of y to the positive direction of x.

    A refactor is ongoing to make the three coordinate systems
    easier to understand and convert between each other.

        tensor (torch.Tensor): Float matrix of N x box_dim.
        box_dim (int): Integer indicating the dimension of a box.
            Each row is (x, y, z, x_size, y_size, z_size, yaw, ...).
        with_yaw (bool): If True, the value of yaw will be set to 0 as minmax

    def gravity_center(self):
        """torch.Tensor: A tensor with center of each box."""
        bottom_center = self.bottom_center
        gravity_center = torch.zeros_like(bottom_center)
        gravity_center[:, :2] = bottom_center[:, :2]
        gravity_center[:, 2] = bottom_center[:, 2] + self.tensor[:, 5] * 0.5
        return gravity_center

    def corners(self):
        """torch.Tensor: Coordinates of corners of all the boxes
        in shape (N, 8, 3).

        Convert the boxes to corners in clockwise order, in form of
        ``(x0y0z0, x0y0z1, x0y1z1, x0y1z0, x1y0z0, x1y0z1, x1y1z1, x1y1z0)``

        .. code-block:: none

                                           up z
                            front x           ^
                                 /            |
                                /             |
                  (x1, y0, z1) + -----------  + (x1, y1, z1)
                              /|            / |
                             / |           /  |
               (x0, y0, z1) + ----------- +   + (x1, y1, z0)
                            |  /      .   |  /
                            | / origin    | /
            left y<-------- + ----------- + (x0, y1, z0)
                (x0, y0, z0)
        # TODO: rotation_3d_in_axis function do not support
        #  empty tensor currently.
        assert len(self.tensor) != 0
        dims = self.dims
        corners_norm = torch.from_numpy(
            np.stack(np.unravel_index(np.arange(8), [2] * 3), axis=1)).to(
                device=dims.device, dtype=dims.dtype)

        corners_norm = corners_norm[[0, 1, 3, 2, 4, 5, 7, 6]]
        # use relative origin [0.5, 0.5, 0]
        corners_norm = corners_norm - dims.new_tensor([0.5, 0.5, 0])
        corners = dims.view([-1, 1, 3]) * corners_norm.reshape([1, 8, 3])

        # rotate around z axis
        corners = rotation_3d_in_axis(corners, self.tensor[:, 6], axis=2)
        corners += self.tensor[:, :3].view(-1, 1, 3)
        return corners

    def bev(self):
        """torch.Tensor: 2D BEV box of each box with rotation
        in XYWHR format."""
        return self.tensor[:, [0, 1, 3, 4, 6]]

    def nearest_bev(self):
        """torch.Tensor: A tensor of 2D BEV box of each box
        without rotation."""
        # Obtain BEV boxes with rotation in XYWHR format
        bev_rotated_boxes = self.bev
        # convert the rotation to a valid range
        rotations = bev_rotated_boxes[:, -1]
        normed_rotations = torch.abs(limit_period(rotations, 0.5, np.pi))

        # find the center of boxes
        conditions = (normed_rotations > np.pi / 4)[..., None]
        bboxes_xywh = torch.where(conditions, bev_rotated_boxes[:,
                                                                [0, 1, 3, 2]],
                                  bev_rotated_boxes[:, :4])

        centers = bboxes_xywh[:, :2]
        dims = bboxes_xywh[:, 2:]
        bev_boxes =[centers - dims / 2, centers + dims / 2], dim=-1)
        return bev_boxes

    def rotate(self, angle, points=None):
        """Rotate boxes with points (optional) with the given angle or \
        rotation matrix.

            angles (float | torch.Tensor | np.ndarray):
                Rotation angle or rotation matrix.
            points (torch.Tensor, numpy.ndarray, :obj:`BasePoints`, optional):
                Points to rotate. Defaults to None.

            tuple or None: When ``points`` is None, the function returns \
                None, otherwise it returns the rotated points and the \
                rotation matrix ``rot_mat_T``.
        if not isinstance(angle, torch.Tensor):
            angle = self.tensor.new_tensor(angle)
        assert angle.shape == torch.Size([3, 3]) or angle.numel() == 1, \
            f'invalid rotation angle shape {angle.shape}'

        if angle.numel() == 1:
            rot_sin = torch.sin(angle)
            rot_cos = torch.cos(angle)
            rot_mat_T = self.tensor.new_tensor([[rot_cos, -rot_sin, 0],
                                                [rot_sin, rot_cos, 0],
                                                [0, 0, 1]])
            rot_mat_T = angle
            rot_sin = rot_mat_T[1, 0]
            rot_cos = rot_mat_T[0, 0]
            angle = np.arctan2(rot_sin, rot_cos)

        self.tensor[:, :3] = self.tensor[:, :3] @ rot_mat_T
        self.tensor[:, 6] += angle

        if self.tensor.shape[1] == 9:
            # rotate velo vector
            self.tensor[:, 7:9] = self.tensor[:, 7:9] @ rot_mat_T[:2, :2]

        if points is not None:
            if isinstance(points, torch.Tensor):
                points[:, :3] = points[:, :3] @ rot_mat_T
            elif isinstance(points, np.ndarray):
                rot_mat_T = rot_mat_T.numpy()
                points[:, :3] =[:, :3], rot_mat_T)
            elif isinstance(points, BasePoints):
                # clockwise
                raise ValueError
            return points, rot_mat_T

    def flip(self, bev_direction='horizontal', points=None):
        """Flip the boxes in BEV along given BEV direction.

        In LIDAR coordinates, it flips the y (horizontal) or x (vertical) axis.

            bev_direction (str): Flip direction (horizontal or vertical).
            points (torch.Tensor, numpy.ndarray, :obj:`BasePoints`, None):
                Points to flip. Defaults to None.

            torch.Tensor, numpy.ndarray or None: Flipped points.
        assert bev_direction in ('horizontal', 'vertical')
        if bev_direction == 'horizontal':
            self.tensor[:, 1::7] = -self.tensor[:, 1::7]
            if self.with_yaw:
                self.tensor[:, 6] = -self.tensor[:, 6] + np.pi
        elif bev_direction == 'vertical':
            self.tensor[:, 0::7] = -self.tensor[:, 0::7]
            if self.with_yaw:
                self.tensor[:, 6] = -self.tensor[:, 6]

        if points is not None:
            assert isinstance(points, (torch.Tensor, np.ndarray, BasePoints))
            if isinstance(points, (torch.Tensor, np.ndarray)):
                if bev_direction == 'horizontal':
                    points[:, 1] = -points[:, 1]
                elif bev_direction == 'vertical':
                    points[:, 0] = -points[:, 0]
            elif isinstance(points, BasePoints):
            return points

    def in_range_bev(self, box_range):
        """Check whether the boxes are in the given range.

            box_range (list | torch.Tensor): the range of box
                (x_min, y_min, x_max, y_max)

            The original implementation of SECOND checks whether boxes in
            a range by checking whether the points are in a convex
            polygon, we reduce the burden for simpler cases.

            torch.Tensor: Whether each box is inside the reference range.
        in_range_flags = ((self.tensor[:, 0] > box_range[0])
                          & (self.tensor[:, 1] > box_range[1])
                          & (self.tensor[:, 0] < box_range[2])
                          & (self.tensor[:, 1] < box_range[3]))
        return in_range_flags

    def convert_to(self, dst, rt_mat=None):
        """Convert self to ``dst`` mode.

            dst (:obj:`Box3DMode`): the target Box mode
            rt_mat (np.ndarray | torch.Tensor): The rotation and translation
                matrix between different coordinates. Defaults to None.
                The conversion from ``src`` coordinates to ``dst`` coordinates
                usually comes along the change of sensors, e.g., from camera
                to LiDAR. This requires a transformation matrix.

            :obj:`BaseInstance3DBoxes`: \
                The converted box of the same type in the ``dst`` mode.
        from .box_3d_mode import Box3DMode
        return Box3DMode.convert(
            box=self, src=Box3DMode.LIDAR, dst=dst, rt_mat=rt_mat)

    def enlarged_box(self, extra_width):
        """Enlarge the length, width and height boxes.

            extra_width (float | torch.Tensor): Extra width to enlarge the box.

            :obj:`LiDARInstance3DBoxes`: Enlarged boxes.
        enlarged_boxes = self.tensor.clone()
        enlarged_boxes[:, 3:6] += extra_width * 2
        # bottom center z minus extra_width
        enlarged_boxes[:, 2] -= extra_width
        return self.new_box(enlarged_boxes)

    def points_in_boxes(self, points):
        """Find the box which the points are in.

            points (torch.Tensor): Points in shape (N, 3).

            torch.Tensor: The index of box where each point are in.
        box_idx = points_in_boxes_gpu(
        return box_idx





  • 1
    每一个bounding box的(x, y, z, yaw, pitch, roll)应该指的是,target bounding box本身这个坐标系(任何一个向量本身可以看做一个由向量起点为原点,向量朝向为x轴的一个坐标系)的原点在Lidar coordinate system下的位置(position)以及target bounding box本身这个坐标系的x轴朝向在Lidar coordinate system下的pose(yaw, pitch, roll)。而size_x, size_y, size_z即bounding box的长宽高(因为target bounding box本身这个坐标系的pose和bounding box的pose一致)。可是因为target bounding box本身这个坐标系的原点并非bounding box的Bottom平面中心点,所以想要表示bounding box的Bottom平面中心点在Lidar coordinate system下的位置话,需要在x轴平移0.5,y轴平移0.5。这是它那个The relative coordinate of bottom center in a LiDAR box is (0.5, 0.5, 0)表示的含义

Information required inorder to do the coordinate transformation.


Ego vehicle pose at a particular timestamp. Given with respect to global coordinate system of the log’s map. The ego_pose is the output of a lidar map-based localization algorithm described in our paper. The localization is 2-dimensional in the x-y plane.

ego_pose {
   "token":                   <str> -- Unique record identifier.
   "translation":             <float> [3] -- Coordinate system origin in meters: x, y, z. Note that z is always 0.
   "rotation":                <float> [4] -- Coordinate system orientation as quaternion: w, x, y, z.
   "timestamp":               <int> -- Unix time stamp.


Definition of a particular sensor (lidar/radar/camera) as calibrated on a particular vehicle. All extrinsic parameters are given with respect to the ego vehicle body frame. All camera images come undistorted and rectified.

calibrated_sensor {
   "token":                   <str> -- Unique record identifier.
   "sensor_token":            <str> -- Foreign key pointing to the sensor type.
   "translation":             <float> [3] -- Coordinate system origin in meters: x, y, z.
   "rotation":                <float> [4] -- Coordinate system orientation as quaternion: w, x, y, z.
   "camera_intrinsic":        <float> [3, 3] -- Intrinsic camera calibration. Empty for sensors that are not cameras.

3D visualization of the point cloud data(pcd)

This is an optional step since we don’t really need 3d visualization here, yet it is always a good idea to have more access to the dataset at hand.

This is a straight forward process, you convert .pcd file to .obj file, which can be done through the nuscenes_devkit or mmlab api. There are plenty 3d rendering engines, such as open3d or meshlab

The following is the result:

  • 1

Nuscenes Dataset Tutorials