ICG-Net: A Unified Approach for Instance-Centric Grasping

ICRA 2024

¹ETH Zürich, ²Texas A&M University

ICG-Net directly predicts 6-DoF grasps, segmentation masks, and mesh reconstructions for each instance in cluttered scenes.

Instance-Centric Predictions for Grasping and Scene Understanding

Figure panels: Input Pointcloud, Grasp Prediction

Accurate grasping is the key to several robotic tasks including assembly and household robotics. Executing a successful grasp in a cluttered environment requires multiple levels of scene understanding:
First, the robot needs to analyze the geometric properties of individual objects to find feasible grasps. These grasps need to be compliant with the local object geometry.
Second, for each proposed grasp, the robot needs to reason about the interactions with other objects in the scene.
Finally, the robot must compute a collision-free grasp trajectory while taking into account the geometry of the target object.
Most grasp detection algorithms directly predict grasp poses in a monolithic fashion, which does not capture the composability of the environment.

In this paper, we introduce an end-to-end architecture for object-centric grasping. The method uses pointcloud data from a single arbitrary viewing direction as an input and generates an instance-centric representation for each partially observed object in the scene. This representation is further used for object reconstruction and grasp detection in cluttered table-top scenes.

We show the effectiveness of the proposed method by extensively evaluating it against state-of-the-art methods on synthetic datasets, indicating superior performance for grasping and reconstruction. Additionally, we demonstrate real-world applicability by decluttering scenes with varying numbers of objects.

Model Architecture


Given an input pointcloud, we voxelize it and extract volumetric and surface features at multiple scales using a sparse Minkowski U-Net and a dense U-Net. The surface features are enriched with volumetric information and treated as tokens, with positional encodings based on their voxel locations.
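
The voxelization step described above can be illustrated with a minimal sketch. It is not the ICG-Net implementation: the function name `voxelize`, the use of per-voxel centroids, and the voxel size are assumptions chosen for illustration, and the code uses only the Python standard library rather than a sparse-convolution framework.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by the voxel that contains them.

    Hypothetical helper for illustration only.
    Returns a dict mapping integer voxel coordinates (i, j, k) to the
    centroid of all points that fall inside that voxel.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")

    # Bucket every point under its integer voxel index.
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))

    # Summarize each occupied voxel by the mean of its points.
    centroids = {}
    for key, pts in buckets.items():
        n = len(pts)
        centroids[key] = tuple(sum(p[axis] for p in pts) / n for axis in range(3))
    return centroids


cloud = [(0.1, 0.2, 0.3), (0.3, 0.2, 0.1), (1.1, 0.0, 0.0), (-0.1, 0.0, 0.0)]
voxels = voxelize(cloud, voxel_size=0.5)
```

Only occupied voxels appear in the result, which is the property that makes sparse feature extraction attractive for table-top scenes where most of the volume is empty. The integer keys are also what a positional encoding based on voxel locations would take as input.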

Masked attention iteratively refines instance queries by cross-attending to the extracted sparse tokens. This process allows each latent query to focus on a specific instance and to be classified either as a semantic class or as "no object". The refined queries then condition the task-specific decoders, which model the occupancy of each instance directly or predict grasp affordance scores and gripper widths.
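
To make the masked cross-attention idea concrete, the following is a minimal sketch of a single refinement step. It is an illustration under stated assumptions, not the authors' code: there are no learned projections, the tokens serve as both keys and values, the function name `masked_cross_attention` is hypothetical, and the fallback to full attention for an empty mask is an assumed safeguard rather than something stated on this page.

```python
import math


def masked_cross_attention(queries, tokens, mask):
    """One masked cross-attention step without learned projections.

    queries: list of n_q vectors of length d
    tokens:  list of n_t vectors of length d (used as keys and values)
    mask:    n_q x n_t booleans; True means the query may attend to the token
    Returns the list of updated query vectors.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q, allowed in zip(queries, mask):
        # Assumed safeguard: an empty mask falls back to attending everywhere.
        if not any(allowed):
            allowed = [True] * len(tokens)

        # Scaled dot-product scores; masked-out tokens get -inf.
        scores = [
            scale * sum(a * b for a, b in zip(q, t)) if ok else -math.inf
            for t, ok in zip(tokens, allowed)
        ]

        # Numerically stable softmax over the tokens.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]

        # Updated query is the attention-weighted sum of the token values.
        updated.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]
        )
    return updated


queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[True, False, False], [False, True, True]]
refined = masked_cross_attention(queries, tokens, mask)
```

In this toy example the first query can only see the first token, so it collapses onto it, while the second query averages the two tokens it is allowed to see. Restricting each query to the tokens of its current mask estimate is what lets a query specialize on one instance over repeated refinement steps.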

Simulation Examples

Packed Scene
Pile Scene




  title={ICGNet: A Unified Approach for Instance-Centric Grasping},
  author={Zurbr{\"u}gg, Ren{\'e} and Liu, Yifan and Engelmann, Francis and Kumar, Suryansh and Hutter, Marco and Patil, Vaishakh and Yu, Fisher},
  journal={arXiv preprint arXiv:2401.09939},