Laser scanning produces exceptionally detailed 3D point clouds of real-world locations, but the scans often contain missing areas wherever the scanner's view was blocked. Is it possible to bring them back somehow?
I had the opportunity to investigate this question at Umbra for my Master's thesis, and to focus on applying modern neural network techniques to the problem. The approach taken here is to cast surface reconstruction as a 2D image inpainting problem, and do all the magic in image space. This way a neural network model can be used to iteratively repair the scan one small patch at a time.
Image inpainting is about filling in missing pixels of an image. There are many ways to do it, and recently Generative Adversarial Networks (GANs) have been used to great effect. See for example Globally and Locally Consistent Image Completion. GANs are also used in the method presented in this post.
But how do you fill holes in surface geometry? Images have just colors, right? Well, what if you represented a part of the scan as an image? The image could be a heightmap that stores a distance in every pixel instead of a color. Then holes in the heightmap could be filled in with any image inpainting method, which is exactly what we are going to do. Of course, this is not a new idea but recent advances in neural networks make it an interesting research project.
You can view the Umbrafied end result with the widget below.
An interactive 3D model that shows the input scan (left) and the repaired version (right).
In this post I'll describe how the method works, and then we shall look at some of its results. We conclude with some notes of its limitations and failure cases. Note that this is a research prototype and not part of any product.
So how does it work? I'm glad you asked...
A 105 million point segment of Trimble's Pharsalia dataset that is used as an example in this post. Each point has an RGB color which makes the scan very realistic looking. The stone stairs shown in the first picture are encircled in red.
The idea is this: extract circular surface patches near holes in the scan, and then repair them using a neural network. We assume the input is a point cloud that comes with RGB colors and surface normals. The processing steps are the following:

1. Find the hole boundary points in the scan.
2. Pick an unvisited boundary point and extract the circular surface patch around it.
3. Project the patch into a 2D heightmap (and a matching color map).
4. Fill in the missing pixels with an inpainting neural network.
5. Reproject the repaired pixels back into the point cloud, and repeat until all boundaries have been processed.
Let's go over the steps one-by-one, using Trimble's laser scanned Pharsalia dataset as an example.
To find holes in the scan, we simply look for points at the edges of the scan. The holes can have very irregular shapes, as can be seen in the picture below. The boundaries are found with the heuristic presented in Detecting Holes in Point Set Surfaces.
A segment of the Pharsalia dataset with boundaries colored in red.
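The boundary heuristic can be sketched in a few lines of NumPy. This is a simplified illustration, not the thesis implementation: the function name, the fixed 90-degree threshold, and the assumption that the neighbors are already gathered are all my own choices. A point counts as a boundary point if its neighbors, projected onto the tangent plane, leave a large angular gap around it.

```python
import numpy as np

def is_boundary_point(p, normal, neighbors, gap_threshold=np.pi / 2):
    """Angle-gap criterion: project the neighbors of p onto its tangent
    plane; if the largest angular gap between consecutive neighbor
    directions exceeds the threshold, p lies on a hole boundary."""
    # build an orthonormal basis (t1, t2) for the tangent plane at p
    n = normal / np.linalg.norm(normal)
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t1 = np.cross(n, a); t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    # project neighbor offsets to the plane and sort their polar angles
    d = neighbors - p
    angles = np.sort(np.arctan2(d @ t2, d @ t1))
    # include the wrap-around gap from the last angle back to the first
    gaps = np.diff(np.concatenate([angles, [angles[0] + 2 * np.pi]]))
    return gaps.max() > gap_threshold
```

A point in the middle of a well-sampled surface is surrounded from all directions, so its largest gap stays small; a point at a hole edge sees nothing in roughly half of its directions.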
Then we pick a boundary point (one of the red points pictured above), but first we make sure none of its neighbors has already been visited. The scans have millions of points, and we want to repair only areas that haven't been touched yet.
A boundary point is chosen. The black circle represents the boundary of a circular surface patch surrounding the point. The patch radius is roughly four centimeters.
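The seed selection can be sketched as follows. This is a toy version with my own function name and a brute-force distance check; a real implementation would use a spatial index for the millions of points.

```python
import numpy as np

def next_seed(boundary_pts, visited, min_dist):
    """Pick the next boundary point to repair, skipping any point whose
    neighborhood was already visited so each area is processed only once."""
    for i, p in enumerate(boundary_pts):
        if visited[i]:
            continue
        near = np.linalg.norm(boundary_pts - p, axis=1) < min_dist
        if not visited[near].any():
            visited[near] = True      # mark the whole neighborhood as done
            return i
    return None                       # all boundary regions processed
```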
Now the 2D projection happens: take all the points inside a small sphere of roughly 4 cm radius and splat them into a heightmap. The illustration below shows how this would be done in 2D.
Projecting the neighborhood points. The fuchsia colored patch origin and its normal define a tangent plane (the dashed line) of the sampled surface (dark points). All neighbors inside the circle get projected to the tangent plane. We store the orthogonal distance d (highlighted in blue) of each point in the heightmap.
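The projection step above can be sketched like this. The function name, the 32x32 resolution, and the nearest-pixel splatting are illustrative assumptions; pixels that receive no point are reported in a separate mask, which is what the inpainting network later fills in.

```python
import numpy as np

def splat_heightmap(points, origin, normal, radius, res=32):
    """Project points inside a sphere around `origin` onto the tangent
    plane and rasterize the orthogonal distance d of each point into a
    heightmap. Returns the heightmap and a mask of covered pixels."""
    # orthonormal tangent-plane basis (t1, t2) from the patch normal
    n = normal / np.linalg.norm(normal)
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t1 = np.cross(n, a); t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)

    offsets = points - origin
    offsets = offsets[np.linalg.norm(offsets, axis=1) < radius]

    u = offsets @ t1          # tangent-plane coordinates
    v = offsets @ t2
    d = offsets @ n           # orthogonal distance stored per pixel

    # map [-radius, radius] to pixel indices and splat
    px = np.clip(((u / radius + 1) * 0.5 * (res - 1)).round().astype(int), 0, res - 1)
    py = np.clip(((v / radius + 1) * 0.5 * (res - 1)).round().astype(int), 0, res - 1)
    height = np.zeros((res, res))
    mask = np.zeros((res, res), dtype=bool)   # True where a point landed
    height[py, px] = d
    mask[py, px] = True
    return height, mask
```

For color patches the same routine applies, storing RGB values instead of d.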
The result of the inpainting neural network. The artificially corrupted patch (left) is missing a square region and has some extra salt-and-pepper noise added in. The network generates a new image (middle) without missing pixels and most of the noise removed. The ground truth image (right) has some subtle noise added to stabilize GAN training.
Now the neural network comes in. The network takes the heightmap and fills in the missing parts. Hopefully it won’t make too many mistakes!
The inpainter neural network is a simple U-Net-like autoencoder. A GAN also has a second network, the discriminator, which is used to make the results more realistic. Here it's just a small five-layer convolutional neural network (CNN).
*The GAN training setup. The corrupted heightmap x and a corresponding binary mask M are repaired by the generator, producing y. The discriminator then judges the realism of y based on real images it has seen. The switch symbol represents how a discriminator is trained in GANs: it is alternately shown real and fake images, and it is expected to classify both correctly. Finally, the L1 loss calculates a pixel-wise difference between y and the ground truth. This makes sure the correct high-level structure is preserved.*
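The two loss terms in the diagram can be written out numerically. This is a generic sketch of a standard GAN objective with an added L1 term, not the exact losses or weight used in the thesis; `l1_weight` is a hypothetical balance parameter.

```python
import numpy as np

def generator_loss(y, ground_truth, d_fake, l1_weight=100.0):
    """Composite generator objective: an adversarial term rewarding
    the generator for fooling the discriminator, plus a pixel-wise L1
    term that keeps the high-level structure close to the ground truth."""
    adv = -np.mean(np.log(d_fake + 1e-8))       # non-saturating GAN loss
    l1 = np.mean(np.abs(y - ground_truth))      # pixel-wise difference
    return adv + l1_weight * l1

def discriminator_loss(d_real, d_fake):
    """The discriminator is alternately shown real and fake patches and
    must classify both correctly (binary cross-entropy)."""
    return (-np.mean(np.log(d_real + 1e-8))
            - np.mean(np.log(1.0 - d_fake + 1e-8)))
```

The adversarial term pushes toward sharp, realistic texture while the heavily weighted L1 term anchors the overall shape, which is why neither alone works as well.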
Note that we use two inpainting networks: one for heightmaps and one for colors. It would be possible to do everything in a single joint network, but that could make it difficult to balance the losses between the two outputs so that neither is favored.
After the heightmap has been repaired, we can reproject its pixels back to the point cloud. This way they can show up later in other neighborhoods. This allows us to repair slightly larger missing areas.
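Reprojection is the inverse of the splatting step: each repaired pixel's position on the tangent plane, plus its stored distance along the normal, gives a new 3D point. A minimal sketch, assuming the tangent basis vectors t1 and t2 are passed in and an illustrative function name:

```python
import numpy as np

def reproject(height, inpaint_mask, origin, normal, t1, t2, radius):
    """Lift repaired heightmap pixels back to 3D: pixel (px, py) maps to
    tangent-plane coordinates (u, v), and the stored distance d offsets
    the point along the patch normal."""
    res = height.shape[0]
    py, px = np.nonzero(inpaint_mask)          # only the newly filled pixels
    u = (px / (res - 1) * 2 - 1) * radius      # pixel index -> tangent coords
    v = (py / (res - 1) * 2 - 1) * radius
    d = height[py, px]
    return origin + u[:, None] * t1 + v[:, None] * t2 + d[:, None] * normal
```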
That's it: the patch has now been repaired. Then we pick another random boundary point and repeat the same steps, until all boundary regions have been processed.
The pixel grid pattern of the reprojected points is visible in the output. The bright colors correspond to different patches. Inaccuracies caused by the rasterization step still leave some visible gaps.
Note that only the 2D convex hull of the input points is actually reprojected back to the scan, because otherwise the patches would extend the surface outwards.
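The convex hull test happens in the 2D tangent-plane coordinates. A self-contained sketch (my own helper names, using Andrew's monotone chain rather than whatever the thesis code used): compute the hull of the originally projected points, then keep only inpainted pixels that fall inside it.

```python
import numpy as np

def _cross(o, a, b):
    # z-component of (a - o) x (b - o)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(uv):
    """Andrew's monotone chain: hull vertices of 2D points in CCW order."""
    pts = sorted(map(tuple, uv))
    def chain(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and _cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = chain(pts), chain(reversed(pts))
    return np.array(lower[:-1] + upper[:-1])

def inside_hull(p, hull):
    """A point lies inside a CCW convex polygon iff it is left of
    (or exactly on) every edge."""
    m = len(hull)
    return all(_cross(hull[i], hull[(i + 1) % m], p) >= 0 for i in range(m))
```

Pixels whose (u, v) coordinates fail `inside_hull` are simply not reprojected, so the patch can only fill holes, never grow the surface.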
Colors work the same way as heightmaps. Instead of storing the orthogonal distance d to a heightmap, the projection step stores the RGB color. This is one of the strengths of the method: both geometry and colors are handled the same way.
A color patch and an increasing number of missing pixels. Top row: corrupted input patches. Bottom row: inpainting network outputs. Some artifacts appear when a large number of pixels are missing.
Results on two scenes are presented here: the backyard, and the stone stairs we already saw in the beginning. To get an idea of performance, the stone stairs example takes 86 seconds to process on an Intel Core i7-7700K and an NVIDIA GTX 1070. Each patch takes roughly 20 ms, of which 15 ms is spent evaluating the neural network.
Before and after animation of the backyard. The clutter causes a lot of occlusions.
A partially scanned wooden barrel in the backyard scene. Note how the generated regions follow the shape of the original surface.
If two different scanners capture the same surface from different directions, the result can be messy. Here the method still produces a reasonable result despite the poorly sampled input data.
The stone stairs scene has occlusions caused by bushes and the scanner location.
Before and after animation of the stone stairs scene.
From left to right: input, output, and the output with new points highlighted in blue.
There are some limitations with the approach taken here. The main problem is how to tell a corruption from a real surface shape. For example a window should obviously be kept intact, but to the boundary detection logic here it's just another nasty artifact. In the examples of this post I simply segmented the objects (ground plane, a barrel, chairs etc.) by hand and then processed each object separately.
A problem with low resolution color data. The green color of some occluding plants gets projected onto the wall. The neural network then obediently blends the color into the generated regions.
Another obvious problem is the small patch size: there just isn't that much data to work with. See the picture above for an example. Extra input in the form of voxels could be added, and this combined with depth maps has been used to get some seriously impressive results.
A big problem with neural networks is that they usually require large amounts of labeled training data. In this case the training data can be synthesized, which works around the issue: the point cloud already contains many perfectly scanned regions, so we can simply corrupt those artificially. The original uncorrupted surface acts as the ground truth. See the picture below for some examples of the corruptions.
Examples of artificially corrupted heightmaps. Top row: original heightmap, bottom row: corrupted heightmap. Each training sample is also randomly rotated.
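The corruption step can be sketched like this. The function name, the hole and noise fractions, and the restriction to 90-degree rotations are illustrative choices (the original augmentation uses arbitrary random rotations); the point is that a clean patch yields a (corrupted, mask) training pair with no manual labeling.

```python
import numpy as np

def corrupt(heightmap, rng, hole_frac=0.4, noise_frac=0.02):
    """Synthesize a training pair from a clean patch: cut out a random
    rectangle, sprinkle salt-and-pepper noise, and rotate. The untouched
    input serves as the ground truth."""
    res = heightmap.shape[0]
    x = heightmap.copy()
    mask = np.zeros_like(x, dtype=bool)

    # random rectangular hole for the network to fill in
    h = w = max(1, int(res * hole_frac))
    r, c = rng.integers(0, res - h + 1, size=2)
    x[r:r + h, c:c + w] = 0.0
    mask[r:r + h, c:c + w] = True

    # salt-and-pepper noise over the patch
    noise = rng.random(x.shape) < noise_frac
    x[noise] = rng.choice([x.min(), x.max()], size=noise.sum())

    # random rotation as augmentation (multiples of 90 degrees here)
    k = rng.integers(0, 4)
    return np.rot90(x, k).copy(), np.rot90(mask, k).copy()
```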
In the results shown in this post the network was trained only on uncorrupted patches of the same scan. This means the inpainting model could severely overfit and only work on the training patches. However, thanks to the large training set (200k data points) and the augmentations used, this isn't really a problem. Of course it's also possible to train a more generic model with a more diverse dataset.
If the input is clean, it's possible to fix small occlusions locally one patch at a time. No labeled training data is needed, so the neural network model can be adapted to many different domains. The current approach still requires the user to process each object separately, but the system could be made fully automatic if individual objects were detected as a preprocessing step.
The output quality of the neural network depends heavily on the dataset, and of course a better GAN training technique could also make the results sharper.
For more details, you can read the full thesis. Note that the GAN training approach there is a bit different from the one used in this blog post.
The method presented here is by no means the only way to process 3D data with neural networks. For completeness, here are links to some interesting papers on the subject:
Thanks to Ilkka for help with the illustrations and to Hannu for proofreading and for Umbrafying the examples.