Geometry Aware Field-to-field Transformations for 3D Semantic Segmentation

Dominik Hollidt^*†, Clinton Wang^*, Polina Golland^*, Marc Pollefeys^†

^*MIT ^†ETH Zürich

Abstract

We present a novel approach to perform 3D semantic segmentation solely from 2D supervision by leveraging Neural Radiance Fields (NeRFs). By extracting features along a surface point cloud, we achieve a compact representation of the scene which is sample-efficient and conducive to 3D reasoning. Learning this feature space in an unsupervised manner via masked autoencoding enables few-shot segmentation. Our method is agnostic to the scene parameterization, working on scenes fit with any type of NeRF.

A visualization of our method. (1) We shoot rays from several poses into the scene and sample points according to the proposal network of the NeRF. (2) We obtain a point cloud capturing the surface geometry of the scene by applying scene boundary filtering, density filtering, surface sampling and ground removal. (3) The point cloud gets segmented via a semantic segmentation network and integrated via volumetric rendering (4) The network is trained with the 2D semantic maps and cross-entropy loss.

The SPT outperforms all other methods on KLEVR. PointNet++ and Custom are significantly inferior. * indicates the use of S3DIS pretrained weights.

KLEVR qualitative predictions: PointNet++ and the plain transformer produces visually less good results. The SPT performs the best but the Field Head yields smoother predictions.

The SPT outperforms all other methods on ToyBox5. PointNet++ and Custom are significantly inferior. * indicates the use of S3DIS pretrained weights.

Pretraining

We utilize pretraining in data scarce scenarios to reduce the number of training data required. Within the pretraining an auto encoder structure is used to recover the rgb values or normals of masked points. Pretraining on normals can bootstrap the accuracy on the downstream task of semantic segmentation.

Your image here MY ALT TEXT

In the pretraining masked autoencoding stage part of the point cloud gets masked and properties of it have to be recovered from features extracted from the non-masked point cloud and just the position of the masked point cloud.

Your image here MY ALT TEXT

Pretraining has a positive effect for the SPT on KLEVR where only 10% of the data is used. Normal pretraining particularly boosts the accuracy.

Your image here MY ALT TEXT

Using only 20% of the training scenes and 10 images per scene shows that RGB pretraining harms the downstream task's performance significantly. Normal pretraining has the same effect as using S3DIS pretrained weights.

Your image here MY ALT TEXT

Using only 20% of the training scenes shows that RGB pretraining harms the downstream task's performance significantly. Normal pretraining has the same effect as using S3DIS pretrained weights.

Ablation Studies

Within ablation studies we investigate the impact of the design decisions (ground removal, proximity loss, surface sampling) and show that pretraining on normals provides a more accurate normal estimation.

Ground removal is benefitial in accuracy and reduces the number of points significantly.

Surface sampling can be benefitial in accuracy and reduces the number of points significantly.

The proximity loss improves the accuracy.

The model pretrained on normals predicts more accurate normals than the underlying NeRF.

BibTeX

@misc{hollidt2023geometry,
      title={Geometry Aware Field-to-field Transformations for 3D Semantic Segmentation}, 
      author={Dominik Hollidt and Clinton Wang and Polina Golland and Marc Pollefeys},
      year={2023},
      eprint={2310.05133},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}