Overview
Until now we have mainly discussed models that work with 2D images and visualizations. Now let's shift our attention by adding a new dimension.
One topic receiving great attention is the ability to reconstruct a 3D scene from 2D images taken at different angles. This is very exciting because such hyper-realistic reconstructions could offer more realistic previews in online shopping malls (even including texture), or even be used for game scene design!
[Figure: input images from multiple angles → 3D reconstructed result, with visible texture detail]
Background
Now, let's delve into the terminology and models behind this exciting technology, expanding our scope to 3D space. How is 3D space different from 2D? Each point in 3D space has its own color and density.
For example, the shoe above shows different shadows and highlights depending on the angle from which we view it, which changes the snapshot of the object we see. The color observed at a particular point differs based on how far away we are viewing it from, as well as the viewing direction.
How would we predict all of this with our models?
Our neural network would have to learn these relationships and be able to generalize them.
NeRF
Overview of 3D Computer Graphics Models
• NeRF mainly uses density and color information to represent a scene, and it learns this along rays of light. A ray is essentially the unit that captures the viewer's position (how far away), the viewing direction, and the color and density values along that ray's path.
Note that the variable t used here does not represent time; it parameterizes position along the ray.
t_f denotes the far end point of the ray, while
t_n denotes the near starting point of the ray.
r(t) = o + t·d: position of the point on the ray at parameter t, given the ray origin o (the viewer's position) and the direction d.
d: direction of the ray
σ(r(t)): volume density at that point of the ray (how strongly light is absorbed there)
c(r(t), d): color at the point r(t) seen from direction d
T(t) = exp(−∫_{t_n}^{t} σ(r(s)) ds): transmittance, the fraction of light that survives from t_n up to t, i.e. the accumulated density along the way.
We need both the transmittance (the density accumulated along the ray so far) and the volume density at the specific point, so that each point's local contribution and the global effect of everything in front of it are both captured.
C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt: now we integrate this information from t_n to t_f. We are essentially integrating the contribution of the color c along the ray, weighted by the transmittance and the density at each point. This ties the position (how far), the direction, and the color and density information together into one rendered value.
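In practice, NeRF evaluates this integral by numerical quadrature over samples along the ray. Below is a minimal NumPy sketch of that discretization; the function and variable names are my own, not taken from the NeRF codebase.

import numpy as np

def render_ray(sigma, color, t_vals):
    """Discretized volume rendering of one ray (NeRF-style quadrature).

    sigma:  (N,) densities at the sampled points r(t_i)
    color:  (N, 3) RGB colors c(r(t_i), d)
    t_vals: (N,) sample locations between t_n and t_f
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)      # distance between consecutive samples
    alpha = 1.0 - np.exp(-sigma * deltas)                   # opacity contributed by each segment
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): light that survives up to sample i
    trans = np.exp(-np.cumsum(np.concatenate([[0.0], sigma[:-1] * deltas[:-1]])))
    weights = trans * alpha
    return (weights[:, None] * color).sum(axis=0)           # C(r) ≈ sum_i T_i * alpha_i * c_i

# toy usage: 64 random samples along one ray
t = np.linspace(2.0, 6.0, 64)
C = render_ray(np.random.rand(64), np.random.rand(64, 3), t)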
Limitation of NeRF
However, the limitation of NeRF is that it requires:
Heavy computation / high memory usage
A large number of parameters must be learned during training, and, as the equation above suggests, every ray requires many network evaluations. Scaling to larger scenes suffers even more from the same limitation.
The Rise of Gaussian Splatting
Gaussian Splatting was introduced to address these limitations. The basic mechanism is somewhat similar to the diffusion models we discussed in the previous post, in the sense that both try to estimate distributions. It builds on the
3D Gaussian Distribution
• a probability distribution that describes how data points are spread out in three dimensions, defined by a mean (center) and a covariance (spread)
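Concretely, each 3D Gaussian is an (unnormalized) density G(x) = exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)) defined by its mean μ (center) and covariance Σ (spread). A tiny NumPy illustration, with toy numbers of my own:

import numpy as np

def gaussian_3d(x, mean, cov):
    """Unnormalized 3D Gaussian: exp(-0.5 * (x - mu)^T Sigma^-1 (x - mu))."""
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

mu = np.zeros(3)                        # center of the Gaussian
Sigma = np.diag([0.5, 0.1, 0.1])        # covariance: how the density spreads along each axis
print(gaussian_3d(np.array([0.3, 0.0, 0.0]), mu, Sigma))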
The authors present:
1. State-of-the-art quality (equivalent to Mip-NeRF 360)
2. Real-time rendering (more than 100 fps)
3. Fast training (less than 1 hour)
Some other terms to keep in mind
“The second component of our method is optimization of the properties of the 3D Gaussians – 3D position, opacity 𝛼, anisotropic covariance, and spherical harmonic (SH) coefficients”
-page 2
Isotropic:
uniform in all directions; the same spread in every direction
Anisotropic:
varies depending on direction (different variance along different axes)
This "anisotropic covariance" is what allows the scene to be reconstructed well: each Gaussian can stretch more along some axes than others to fit the local geometry.
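To make the difference concrete, here is a tiny comparison with toy numbers of my own: an isotropic covariance spreads the Gaussian equally along every axis (a sphere), while an anisotropic one stretches it along particular directions (an ellipsoid).

import numpy as np

iso = 0.04 * np.eye(3)                  # isotropic: same variance along x, y and z
aniso = np.diag([0.09, 0.01, 0.01])     # anisotropic: stretched along x, thin along y and z

print(np.linalg.eigvalsh(iso))          # [0.04 0.04 0.04] -> equal spread, a sphere
print(np.linalg.eigvalsh(aniso))        # [0.01 0.01 0.09] -> unequal spread, an ellipsoid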
| | NeRF | 3D GS |
| --- | --- | --- |
| Representation | Neural network | Gaussian distributions |
| Rendering | Integrating along rays | Summing (splatting) Gaussians |
| Parameters | Model weights | Means, covariances, and weights of the Gaussians |
| Efficiency | Slow | Faster |
| Unit / Samples | Samples along rays | Individual Gaussians |
Algorithm Overview:
Now let's dive deeper into the important parts of the algorithm.
SfM (Structure from Motion):
Uses feature detection, feature matching, and camera pose estimation to reconstruct a 3D scene from multi-view 2D image inputs.
Given multiple images of a scene from multiple viewpoints, SfM detects and matches corresponding features across the images. From these correspondences, the camera poses are determined.
SfM points:
This is the initial set of 3D points representing the scene, produced by SfM, and it is used as the starting basis. From it we obtain M (positions), S (covariances), C (colors), and A (opacities).
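As a rough sketch of how such an initialization could look (the function name, default values, and shapes below are illustrative assumptions of mine, not the repo's actual code):

import numpy as np

def init_from_sfm(points, colors):
    """Hypothetical initialization of Gaussian parameters from an SfM point cloud.

    points: (N, 3) 3D positions from SfM (e.g. a COLMAP reconstruction)
    colors: (N, 3) RGB colors of those points
    """
    n = points.shape[0]
    means = points.copy()                               # M: Gaussian centers start at the SfM points
    scales = np.full((n, 3), 0.01)                      # S: small initial spread (placeholder value)
    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (n, 1))   # identity quaternions (w, x, y, z)
    opacities = np.full((n, 1), 0.1)                    # A: low initial opacity (placeholder value)
    return means, scales, rotations, colors, opacities  # C: colors taken directly from the points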
Rasterize (Algorithm 2):
Let's further delve into Algorithm 2, which is invoked from within Algorithm 1.
Algorithm 2: Rasterize(M, S, C, A, V)
• Cull Gaussians: keep only the Gaussians that are visible from the viewpoint V (inside the view frustum)
• Frustum culling: the check of whether a Gaussian lies within the view frustum
• Screen-space transformation:
projection from 3D coordinates (Gaussian coordinates) → 2D image coordinates
→ calculate S', the corresponding covariance after this projection
J: Jacobian of the projective transformation (camera → image)
W: viewing transformation (world → camera)
Σ: input Gaussian covariance
Performing the above transformations takes us from:
World → Camera → Image
For a linear transformation y = Ax, the covariance transforms as Cov(y) = A Σ Aᵀ. With A = JW this gives the projected covariance Σ' = J W Σ Wᵀ Jᵀ, so the equation above is simply computing the new covariance matrix that accounts for this transformation.
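A minimal NumPy sketch of this covariance projection, Σ' = J W Σ Wᵀ Jᵀ. The matrices below are placeholders of mine: in the paper J is the Jacobian of an affine approximation of the projective transform, and using a 2×3 J here simply yields the 2D covariance of the splat directly.

import numpy as np

def project_covariance(Sigma, W, J):
    """Sigma' = J W Sigma W^T J^T: world-space covariance -> image-space covariance."""
    return J @ W @ Sigma @ W.T @ J.T

Sigma = np.diag([0.04, 0.01, 0.01])      # world-space covariance of one Gaussian
W = np.eye(3)                            # viewing transformation (world -> camera), placeholder
J = np.array([[1.0, 0.0, 0.0],           # Jacobian of the projective transformation
              [0.0, 1.0, 0.0]])          # (camera -> image), placeholder values
print(project_covariance(Sigma, W, J))   # 2x2 covariance of the projected 2D splat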
HOWEVER,
“An obvious approach would be to directly optimize the covariance matrix Σ to obtain 3D Gaussians that represent the radiance field. However, covariance matrices have physical meaning only when they are positive semi-definite. For our optimization of all our parameters, we use gradient descent that cannot be easily constrained to produce such valid matrices, and update steps and gradients can very easily create invalid covariance matrices.”
Due to this limitation:
“The covariance matrix Σ of a 3D Gaussian is analogous to describing the configuration of an ellipsoid. Given a scaling matrix 𝑆 and rotation matrix 𝑅, we can find the corresponding Σ”:
Thus, the covariance is built as Σ = R S Sᵀ Rᵀ, which is positive semi-definite by construction.
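A small sketch of this factorization from a scale vector and a rotation quaternion (I use SciPy for the quaternion-to-matrix conversion; this is an illustration, not the repo's implementation):

import numpy as np
from scipy.spatial.transform import Rotation

def build_covariance(scale, quat_xyzw):
    """Sigma = R S S^T R^T, positive semi-definite by construction."""
    S = np.diag(scale)                                # scaling matrix
    R = Rotation.from_quat(quat_xyzw).as_matrix()     # rotation matrix from a unit quaternion
    return R @ S @ S.T @ R.T

Sigma = build_covariance([0.2, 0.05, 0.05], [0.0, 0.0, 0.0, 1.0])
print(np.linalg.eigvalsh(Sigma))                      # all eigenvalues >= 0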
The rendering is then organized into tiles (a grid), giving a tile-based rasterization rather than the per-pixel splatting of previous work. This is why the authors state:
"We introduce a new approach that combines the best of both worlds: our 3D Gaussian representation allows optimization with state-of-the-art (SOTA) visual quality and competitive training times, while our tile-based splatting solution ensures real-time rendering at SOTA quality for 1080p resolution on several previously published datasets"
The splats assigned to each tile are then sorted by depth using radix sort:
“Sorting. Our design is based on the assumption of a high load of small splats, and we optimize for this by sorting splats once for each frame using radix sort at the beginning”
Finally, the sorted splats are alpha-blended to give the 2D rasterized image.
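A simplified sketch of what this amounts to for a single pixel once the splats covering it have been depth-sorted: front-to-back alpha blending with early termination. This is an illustration of mine, not the paper's CUDA rasterizer.

import numpy as np

def blend_pixel(colors, alphas):
    """Front-to-back alpha blending of depth-sorted splats at one pixel.

    colors: (N, 3) splat colors, sorted near -> far
    alphas: (N,) per-splat opacities at this pixel (Gaussian falloff already applied)
    """
    C = np.zeros(3)
    T = 1.0                               # remaining transmittance
    for c, a in zip(colors, alphas):
        C += T * a * c
        T *= 1.0 - a
        if T < 1e-4:                      # early termination once the pixel is saturated
            break
    return C

# depth sort (the real implementation uses one GPU radix sort over tile/depth keys per frame)
depths = np.array([2.0, 0.5, 1.2])
order = np.argsort(depths)
print(blend_pixel(np.random.rand(3, 3)[order], np.array([0.8, 0.4, 0.6])[order]))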
Going back to Algorithm 1:
Now that we have this rasterized image, we compare it to the ground-truth image.
Loss
Combine L1 and L_D-SSIM: L = (1 − λ) · L1 + λ · L_D-SSIM
L1: focuses more on per-pixel similarity
D-SSIM: focuses on structural similarity
Then, in the refinement iteration:
If a Gaussian is nearly transparent (opacity α < ε):
there is no point in keeping it, so we perform RemoveGaussian().
We also allocate more Gaussians to the regions that need them:
CloneGaussian
to account for missing geometric features (under-reconstruction): the Gaussian is duplicated so the copies can spread out and cover more area
SplitGaussian
to account for cases where one Gaussian covers a large area and cannot reconstruct the detailed region (over-reconstruction): it is split into smaller Gaussians
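A hedged, illustrative sketch of this refinement logic (the Gaussian dataclass and thresholds below are simplifications of mine, not the repo's API; the 1.6 split factor follows the paper):

from dataclasses import dataclass, replace

@dataclass
class Gaussian:                       # illustrative container, not the repo's class
    opacity: float
    max_scale: float
    position_gradient: float

def refine(gaussians, grad_threshold=0.0002, opacity_eps=0.005, scale_threshold=0.01):
    """One refinement pass: prune transparent Gaussians, densify where needed."""
    refined = []
    for g in gaussians:
        if g.opacity < opacity_eps:
            continue                                   # RemoveGaussian: too transparent to keep
        if g.position_gradient > grad_threshold:
            if g.max_scale < scale_threshold:
                refined += [g, replace(g)]             # CloneGaussian: under-reconstructed region
            else:
                half = replace(g, max_scale=g.max_scale / 1.6)
                refined += [half, replace(half)]       # SplitGaussian: one big Gaussian -> two smaller ones
        else:
            refined.append(g)
    return refined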
Summary
Rasterize the 3D component to 2D → Compute the Loss / Differentiate → Update Gaussian Parameters
Code Snippet
def training(dataset, opt, pipe, testing_iterations, saving_iterations, checkpoint_iterations, checkpoint, debug_from):
    first_iter = 0
    tb_writer = prepare_output_and_logger(dataset)
    gaussians = GaussianModel(dataset.sh_degree)
    scene = Scene(dataset, gaussians)
    gaussians.training_setup(opt)
    if checkpoint:
        (model_params, first_iter) = torch.load(checkpoint)
        gaussians.restore(model_params, opt)
    ...
Rasterize the 3D Component to 2D
render_pkg = render(viewpoint_cam, gaussians, pipe, bg)
image, viewspace_point_tensor, visibility_filter, radii = render_pkg["render"], render_pkg["viewspace_points"], render_pkg["visibility_filter"], render_pkg["radii"]
Compute the Loss / Differentiate
We see both the L1 Loss and L(D-SSIM)
gt_image = viewpoint_cam.original_image.cuda()
Ll1 = l1_loss(image, gt_image)
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
loss.backward()
Update Gaussian Parameters
if iteration < opt.iterations:
    gaussians.optimizer.step()
    gaussians.optimizer.zero_grad(set_to_none = True)
Refinement Iteration
We see that the pruning is performed based on the opacity
# Densification
if iteration < opt.densify_until_iter:
    # Keep track of max radii in image-space for pruning
    gaussians.max_radii2D[visibility_filter] = torch.max(gaussians.max_radii2D[visibility_filter], radii[visibility_filter])
    gaussians.add_densification_stats(viewspace_point_tensor, visibility_filter)

    if iteration > opt.densify_from_iter and iteration % opt.densification_interval == 0:
        size_threshold = 20 if iteration > opt.opacity_reset_interval else None
        gaussians.densify_and_prune(opt.densify_grad_threshold, 0.005, scene.cameras_extent, size_threshold)

    if iteration % opt.opacity_reset_interval == 0 or (dataset.white_background and iteration == opt.densify_from_iter):
        gaussians.reset_opacity()
Hope this helped everyone!
Written By: Terry (Taehan) Kim,
Reviewed by Metown Corp. (Sangbin Jeon, PhD Researcher)