Manipulating unseen objects is challenging without a 3D representation, since objects generally have occluded surfaces; building a complete internal representation therefore requires physical interaction with the object. This paper presents an approach that enables a robot to rapidly learn the complete 3D model of a given object for manipulation in unfamiliar orientations. We use an ensemble of partially constructed NeRF models to quantify model uncertainty and determine the next action (a visual or re-orientation action) by optimizing informativeness and feasibility. Further, our approach determines when and how to grasp and re-orient an object given its partial NeRF model, and re-estimates the object pose to rectify misalignments introduced during the interaction. Experiments with a simulated Franka Emika robot manipulator operating in a tabletop environment with benchmark objects demonstrate an improvement over current methods of (i) 14% in visual reconstruction quality (PSNR), (ii) 20% in the geometric/depth reconstruction of the object surface (F-score), and (iii) 71% in the task success rate of manipulating objects in a priori unseen orientations/stable configurations in the scene.
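As an illustration of the uncertainty-driven action selection, the sketch below (not our exact implementation) treats the per-pixel variance across the ensemble's renders as the informativeness of a candidate view and trades it off against an action cost; `render_rgb` and `action_cost` are hypothetical helper functions assumed for this example.

```python
import numpy as np

# Minimal sketch, assuming hypothetical helpers `render_rgb(model, pose)` -> (H, W, 3)
# array and `action_cost(pose)` -> scalar; not the exact implementation.

def view_uncertainty(ensemble, pose, render_rgb):
    """Mean per-pixel variance of RGB renders across the NeRF ensemble."""
    renders = np.stack([render_rgb(model, pose) for model in ensemble])  # (K, H, W, 3)
    return float(renders.var(axis=0).mean())

def select_next_view(ensemble, candidate_poses, render_rgb, action_cost, cost_weight=0.1):
    """Choose the candidate view that best trades informativeness against cost."""
    scores = [view_uncertainty(ensemble, pose, render_rgb) - cost_weight * action_cost(pose)
              for pose in candidate_poses]
    return candidate_poses[int(np.argmax(scores))]
```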
First, RGB and depth images are rendered from the object's current NeRF model. Using these, AnyGrasp detects potential grasps, which are then pruned based on the geometry of the generated point cloud and NeRF's material density on the grasp patches. The best grasp is selected from the remaining candidates using our uncertainty-aware grasp score. The robot executes the chosen grasp to re-orient the object, and a modified iNeRF is employed to re-acquire the object's pose in its new orientation. We show the quality of the object models before and after the flip. The post-flip model is obtained by capturing images of the re-oriented object and adding them to the training dataset.
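The grasp pruning and scoring step can be sketched as follows. The snippet assumes hypothetical helpers `patch_density(nerf, grasp)` (mean NeRF material density over the contact patches) and `patch_uncertainty(ensemble, grasp)` (mean ensemble variance there), and treats AnyGrasp's detection confidence as an attribute of each candidate grasp; the exact thresholds and score used in our pipeline are not reproduced here.

```python
import numpy as np

# Illustrative sketch only: prune grasps on weakly reconstructed geometry, then
# rank the remainder with an uncertainty-aware score. `patch_density`,
# `patch_uncertainty`, and `grasp.confidence` are assumptions for this example.

def select_grasp(candidate_grasps, nerf, ensemble, patch_density, patch_uncertainty,
                 density_thresh=0.5, uncertainty_weight=1.0):
    feasible = [g for g in candidate_grasps if patch_density(nerf, g) > density_thresh]
    if not feasible:
        return None  # no reliable grasp; gather more views first
    scores = [g.confidence - uncertainty_weight * patch_uncertainty(ensemble, g)
              for g in feasible]
    return feasible[int(np.argmax(scores))]
```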
We show the RGB images and uncertainty maps rendered from the trained models during our active learning process, with the ground-truth (GT) images for reference. Note that before flipping, the bottom surface of the object has high uncertainty, which diminishes only once we perform the flip and acquire observations of the bottom surface. The robot then uses the acquired object model to manipulate the object in any orientation.
Below is a quantitative comparison with the baseline ActiveNeRF [Pan et al., 2022], which integrates uncertainty estimation into the NeRF pipeline itself. It optimizes a negative log-likelihood (NLL) loss in addition to the MSE loss used by vanilla NeRF to learn the uncertainty associated with each point in 3D space. It treats next-best-view selection as a purely visual problem; hence, (i) it learns a model of the whole scene instead of just the object, (ii) it has no notion of action costs, and (iii) it does not consider any physical interaction with the object. For a fair comparison, we incorporate a Flip() action into ActiveNeRF in the experiments where the Flip() action is allowed.
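For reference, ActiveNeRF models each rendered pixel color as a Gaussian whose variance is rendered alongside the color, and trains by minimizing the resulting negative log-likelihood. The PyTorch snippet below paraphrases that loss; it is not the authors' code.

```python
import torch

def activenerf_nll_loss(pred_rgb, pred_var, gt_rgb, eps=1e-6):
    """Gaussian negative log-likelihood of the ground-truth color, paraphrasing
    the uncertainty loss that ActiveNeRF adds on top of the standard MSE term."""
    var = pred_var.clamp_min(eps)
    return (((gt_rgb - pred_rgb) ** 2) / (2.0 * var) + 0.5 * torch.log(var)).mean()
```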
We analyze the quality of the acquired models using a manipulation task in which the object is placed at a random pose in the workspace and the robot has to attempt a grasp using the learned models. The figure shows the task success rates and failure scenarios. Active denotes vanilla ActiveNeRF [Pan et al., 2022] trained on captured images without segmentation and with no Flip() action; S denotes object segmentation; F denotes the availability of the Flip() action.
1. [Pan et al., 2022] Pan, X., Lai, Z., Song, S., & Huang, G. (2022). ActiveNeRF: Learning where to see with uncertainty estimation. In European Conference on Computer Vision (ECCV), pp. 230-246. Springer.