Overview
Until now we have mainly discussed models that work with 2D images and visualizations. Now let's shift our attention by adding a new dimension.
One topic receiving great attention is the ability to reconstruct a 3D scene from 2D images taken at different angles. This is very exciting because such hyper-realistic reconstructions could offer more realistic previews in online shopping malls (even including texture), or even be used for game scene design!
[Figure: input images from multiple angles → 3D reconstructed result, with visible texture detail]
Background
Now, let's delve into the terminology and models behind this exciting technology, expanding our scope to 3D space. How is 3D space different from 2D? Each point in 3D space has its own color and density.
For example, the shoe above shows different shadows and highlights depending on the angle from which we view it, which changes the snapshot of the object we see. The color observed at a particular point differs based on how far away we are viewing it from, as well as the viewing direction.
How would we predict all of this with our models?
Our neural network would have to learn these relationships and be able to generalize them.
NeRF
Overview of 3D Computer Graphics Models
• NeRF mainly uses density and color information to represent a scene, and it learns this along rays of light. A ray is essentially the unit that captures the viewer's position (how far away), the viewing direction, and the color and density values along that ray's path.
Note that the variable t used here does not represent time; it parameterizes position along the ray.
t_f denotes the far end point of the ray, while
t_n denotes the near starting point of the ray.
r(t) = o + t·d: position of the point on the ray at parameter t, given the ray origin o (the viewer's position) and the direction d.
d: direction of the ray
σ(r(t)): volume density at that point of the ray (how strongly light is absorbed there)
c(r(t), d): color at the point r(t) seen from direction d
T(t) = exp(−∫_{t_n}^{t} σ(r(s)) ds): transmittance, the fraction of light that survives from t_n up to t, i.e. the accumulated density along the way.
We need both the transmittance (the density accumulated along the ray so far) and the volume density at the specific point, so that each point's local contribution and the global effect of everything in front of it are both captured.
C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt: now we integrate this information from t_n to t_f. We are essentially integrating the contribution of the color c along the ray, weighted by the transmittance and the density at each point. This ties the position (how far), the direction, and the color and density information together into one rendered value.
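In practice, NeRF evaluates this integral by numerical quadrature over samples along the ray. Below is a minimal NumPy sketch of that discretization; the function and variable names are my own, not taken from the NeRF codebase.

import numpy as np

def render_ray(sigma, color, t_vals):
    """Discretized volume rendering of one ray (NeRF-style quadrature).

    sigma:  (N,) densities at the sampled points r(t_i)
    color:  (N, 3) RGB colors c(r(t_i), d)
    t_vals: (N,) sample locations between t_n and t_f
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)      # distance between consecutive samples
    alpha = 1.0 - np.exp(-sigma * deltas)                   # opacity contributed by each segment
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): light that survives up to sample i
    trans = np.exp(-np.cumsum(np.concatenate([[0.0], sigma[:-1] * deltas[:-1]])))
    weights = trans * alpha
    return (weights[:, None] * color).sum(axis=0)           # C(r) ≈ sum_i T_i * alpha_i * c_i

# toy usage: 64 random samples along one ray
t = np.linspace(2.0, 6.0, 64)
C = render_ray(np.random.rand(64), np.random.rand(64, 3), t)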
Limitation of NeRF
However, the limitation of NeRF is that it requires:
Heavy computation / high memory usage
A large number of parameters must be learned during training, and, as the equation above suggests, every ray requires many network evaluations. Scaling to larger scenes suffers even more from the same limitation.
The Rise of Gaussian Splatting
Gaussian Splatting was introduced to address these limitations. The basic mechanism is somewhat similar to the diffusion models we discussed in the previous post, in the sense that both try to estimate distributions. It builds on the
3D Gaussian Distribution
• a probability distribution that describes how data points are spread out in three dimensions, defined by a mean (center) and a covariance (spread)
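Concretely, each 3D Gaussian is an (unnormalized) density G(x) = exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)) defined by its mean μ (center) and covariance Σ (spread). A tiny NumPy illustration, with toy numbers of my own:

import numpy as np

def gaussian_3d(x, mean, cov):
    """Unnormalized 3D Gaussian: exp(-0.5 * (x - mu)^T Sigma^-1 (x - mu))."""
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

mu = np.zeros(3)                        # center of the Gaussian
Sigma = np.diag([0.5, 0.1, 0.1])        # covariance: how the density spreads along each axis
print(gaussian_3d(np.array([0.3, 0.0, 0.0]), mu, Sigma))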
The authors present:
1. State-of-the-art quality (equivalent to Mip-NeRF 360)
2. Real-time rendering (more than 100 fps)
3. Fast training (less than 1 hour)
Some other terms to keep in mind
“The second component of our method is optimization of the properties of the 3D Gaussians – 3D position, opacity 𝛼, anisotropic covariance, and spherical harmonic (SH) coefficients”
-page 2
Isotropic:
uniform in all directions; the same spread in every direction
Anisotropic:
varies depending on direction (different variance along different axes)
This "anisotropic covariance" is what allows the scene to be reconstructed well: each Gaussian can stretch more along some axes than others to fit the local geometry.
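To make the difference concrete, here is a tiny comparison with toy numbers of my own: an isotropic covariance spreads the Gaussian equally along every axis (a sphere), while an anisotropic one stretches it along particular directions (an ellipsoid).

import numpy as np

iso = 0.04 * np.eye(3)                  # isotropic: same variance along x, y and z
aniso = np.diag([0.09, 0.01, 0.01])     # anisotropic: stretched along x, thin along y and z

print(np.linalg.eigvalsh(iso))          # [0.04 0.04 0.04] -> equal spread, a sphere
print(np.linalg.eigvalsh(aniso))        # [0.01 0.01 0.09] -> unequal spread, an ellipsoid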
| | NeRF | 3D GS |
| --- | --- | --- |
| Representation | Neural network | Gaussian distributions |
| Rendering | Integrating along rays | Summing (splatting) Gaussians |
| Parameters | Model weights | Means, covariances, and weights of the Gaussians |
| Efficiency | Slow | Faster |
| Unit / Samples | Samples along rays | Individual Gaussians |
Algorithm Overview:
Now let's dive deeper into the important parts of the algorithm.
SfM (Structure from Motion):
Uses feature detection, feature matching, and camera pose estimation to reconstruct a 3D scene from multi-view 2D image inputs.
Given multiple images of a scene from multiple viewpoints, SfM detects and matches corresponding features across the images. From these correspondences, the camera poses are determined.
SfM points:
This is the initial set of 3D points representing the scene, produced by SfM, and it is used as the starting basis. From it we obtain M (positions), S (covariances), C (colors), and A (opacities).
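As a rough sketch of how such an initialization could look (the function name, default values, and shapes below are illustrative assumptions of mine, not the repo's actual code):

import numpy as np

def init_from_sfm(points, colors):
    """Hypothetical initialization of Gaussian parameters from an SfM point cloud.

    points: (N, 3) 3D positions from SfM (e.g. a COLMAP reconstruction)
    colors: (N, 3) RGB colors of those points
    """
    n = points.shape[0]
    means = points.copy()                               # M: Gaussian centers start at the SfM points
    scales = np.full((n, 3), 0.01)                      # S: small initial spread (placeholder value)
    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (n, 1))   # identity quaternions (w, x, y, z)
    opacities = np.full((n, 1), 0.1)                    # A: low initial opacity (placeholder value)
    return means, scales, rotations, colors, opacities  # C: colors taken directly from the points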
Rasterize (Algorithm 2):
Let's further delve into Algorithm 2, which is invoked from within Algorithm 1.
Algorithm 2: Rasterize(M, S, C, A, V)
• Cull Gaussians: keep only the Gaussians that are visible from the viewpoint V (inside the view frustum)
• Frustum culling: the check of whether a Gaussian lies within the view frustum
• Screen-space transformation:
projection from 3D coordinates (Gaussian coordinates) → 2D image coordinates
→ calculate S', the corresponding covariance after this projection
J: Jacobian of the projective transformation (camera → image)
W: viewing transformation (world → camera)
Σ: input Gaussian covariance
Performing the above transformations takes us from:
World → Camera → Image
For a linear transformation y = Ax, the covariance transforms as Cov(y) = A Σ Aᵀ. With A = JW this gives the projected covariance Σ' = J W Σ Wᵀ Jᵀ, so the equation above is simply computing the new covariance matrix that accounts for this transformation.
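A minimal NumPy sketch of this covariance projection, Σ' = J W Σ Wᵀ Jᵀ. The matrices below are placeholders of mine: in the paper J is the Jacobian of an affine approximation of the projective transform, and using a 2×3 J here simply yields the 2D covariance of the splat directly.

import numpy as np

def project_covariance(Sigma, W, J):
    """Sigma' = J W Sigma W^T J^T: world-space covariance -> image-space covariance."""
    return J @ W @ Sigma @ W.T @ J.T

Sigma = np.diag([0.04, 0.01, 0.01])      # world-space covariance of one Gaussian
W = np.eye(3)                            # viewing transformation (world -> camera), placeholder
J = np.array([[1.0, 0.0, 0.0],           # Jacobian of the projective transformation
              [0.0, 1.0, 0.0]])          # (camera -> image), placeholder values
print(project_covariance(Sigma, W, J))   # 2x2 covariance of the projected 2D splat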
HOWEVER,
“An obvious approach would be to directly optimize the covariance matrix Σ to obtain 3D Gaussians that represent the radiance field. However, covariance matrices have physical meaning only when they are positive semi-definite. For our optimization of all our parameters, we use gradient descent that cannot be easily constrained to produce such valid matrices, and update steps and gradients can very easily create invalid covariance matrices.”
Due to this limitation:
“The covariance matrix Σ of a 3D Gaussian is analogous to describing the configuration of an ellipsoid. Given a scaling matrix 𝑆 and rotation matrix 𝑅, we can find the corresponding Σ”:
Thus, the covariance is built as Σ = R S Sᵀ Rᵀ, which is positive semi-definite by construction.
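A small sketch of this factorization from a scale vector and a rotation quaternion (I use SciPy for the quaternion-to-matrix conversion; this is an illustration, not the repo's implementation):

import numpy as np
from scipy.spatial.transform import Rotation

def build_covariance(scale, quat_xyzw):
    """Sigma = R S S^T R^T, positive semi-definite by construction."""
    S = np.diag(scale)                                # scaling matrix
    R = Rotation.from_quat(quat_xyzw).as_matrix()     # rotation matrix from a unit quaternion
    return R @ S @ S.T @ R.T

Sigma = build_covariance([0.2, 0.05, 0.05], [0.0, 0.0, 0.0, 1.0])
print(np.linalg.eigvalsh(Sigma))                      # all eigenvalues >= 0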
The rendering is then organized into tiles (a grid), giving a tile-based rasterization rather than the per-pixel splatting of previous work. This is why the authors state:
"We introduce a new approach that combines the best of both worlds: our 3D Gaussian representation allows optimization with state-of-the-art (SOTA) visual quality and competitive training times, while our tile-based splatting solution ensures real-time rendering at SOTA quality for 1080p resolution on several previously published datasets"
The splats assigned to each tile are then sorted by depth using radix sort:
“Sorting. Our design is based on the assumption of a high load of small splats, and we optimize for this by sorting splats once for each frame using radix sort at the beginning”
Finally, the sorted splats are alpha-blended to give the 2D rasterized image.
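A simplified sketch of what this amounts to for a single pixel once the splats covering it have been depth-sorted: front-to-back alpha blending with early termination. This is an illustration of mine, not the paper's CUDA rasterizer.

import numpy as np

def blend_pixel(colors, alphas):
    """Front-to-back alpha blending of depth-sorted splats at one pixel.

    colors: (N, 3) splat colors, sorted near -> far
    alphas: (N,) per-splat opacities at this pixel (Gaussian falloff already applied)
    """
    C = np.zeros(3)
    T = 1.0                               # remaining transmittance
    for c, a in zip(colors, alphas):
        C += T * a * c
        T *= 1.0 - a
        if T < 1e-4:                      # early termination once the pixel is saturated
            break
    return C

# depth sort (the real implementation uses one GPU radix sort over tile/depth keys per frame)
depths = np.array([2.0, 0.5, 1.2])
order = np.argsort(depths)
print(blend_pixel(np.random.rand(3, 3)[order], np.array([0.8, 0.4, 0.6])[order]))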
Going back to Algorithm 1:
Now that we have this rasterized image, we compare it to the ground-truth image.
Loss
Combine L1 and L_D-SSIM: L = (1 − λ) · L1 + λ · L_D-SSIM
L1: focuses more on per-pixel similarity
D-SSIM: focuses on structural similarity
Then, in the refinement iteration:
If a Gaussian is nearly transparent (opacity α < ε):
there is no point in keeping it, so we perform RemoveGaussian().
We also allocate more Gaussians to the regions that need them:
CloneGaussian
to account for missing geometric features (under-reconstruction): the Gaussian is duplicated so the copies can spread out and cover more area
SplitGaussian
to account for cases where one Gaussian covers a large area and cannot reconstruct the detailed region (over-reconstruction): it is split into smaller Gaussians
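A hedged, illustrative sketch of this refinement logic (the Gaussian dataclass and thresholds below are simplifications of mine, not the repo's API; the 1.6 split factor follows the paper):

from dataclasses import dataclass, replace

@dataclass
class Gaussian:                       # illustrative container, not the repo's class
    opacity: float
    max_scale: float
    position_gradient: float

def refine(gaussians, grad_threshold=0.0002, opacity_eps=0.005, scale_threshold=0.01):
    """One refinement pass: prune transparent Gaussians, densify where needed."""
    refined = []
    for g in gaussians:
        if g.opacity < opacity_eps:
            continue                                   # RemoveGaussian: too transparent to keep
        if g.position_gradient > grad_threshold:
            if g.max_scale < scale_threshold:
                refined += [g, replace(g)]             # CloneGaussian: under-reconstructed region
            else:
                half = replace(g, max_scale=g.max_scale / 1.6)
                refined += [half, replace(half)]       # SplitGaussian: one big Gaussian -> two smaller ones
        else:
            refined.append(g)
    return refined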
Summary
Rasterize the 3D component to 2D → Compute the Loss / Differentiate → Update Gaussian Parameters
Code Snippet
def training(dataset, opt, pipe, testing_iterations, saving_iterations, checkpoint_iterations, checkpoint, debug_from):
    first_iter = 0
    tb_writer = prepare_output_and_logger(dataset)
    gaussians = GaussianModel(dataset.sh_degree)
    scene = Scene(dataset, gaussians)
    gaussians.training_setup(opt)
    if checkpoint:
        (model_params, first_iter) = torch.load(checkpoint)
        gaussians.restore(model_params, opt)
    ...
Rasterize the 3D Component to 2D
render_pkg = render(viewpoint_cam, gaussians, pipe, bg)
image, viewspace_point_tensor, visibility_filter, radii = render_pkg["render"], render_pkg["viewspace_points"], render_pkg["visibility_filter"], render_pkg["radii"]
Compute the Loss / Differentiate
We see both the L1 Loss and L(D-SSIM)
gt_image = viewpoint_cam.original_image.cuda()
Ll1 = l1_loss(image, gt_image)
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
loss.backward()
Update Gaussian Parameters
if iteration < opt.iterations:
    gaussians.optimizer.step()
    gaussians.optimizer.zero_grad(set_to_none = True)
Refinement Iteration
We see that the pruning is performed based on the opacity
# Densification
if iteration < opt.densify_until_iter:
    # Keep track of max radii in image-space for pruning
    gaussians.max_radii2D[visibility_filter] = torch.max(gaussians.max_radii2D[visibility_filter], radii[visibility_filter])
    gaussians.add_densification_stats(viewspace_point_tensor, visibility_filter)

    if iteration > opt.densify_from_iter and iteration % opt.densification_interval == 0:
        size_threshold = 20 if iteration > opt.opacity_reset_interval else None
        gaussians.densify_and_prune(opt.densify_grad_threshold, 0.005, scene.cameras_extent, size_threshold)

    if iteration % opt.opacity_reset_interval == 0 or (dataset.white_background and iteration == opt.densify_from_iter):
        gaussians.reset_opacity()
Hope this helped everyone!
Written By: Terry (Taehan) Kim,
Reviewed by Metown Corp. (Sangbin Jeon, PhD Researcher)