CIS 580, Machine Perception, Fall 2023 Final Project Part B
1 Part 1: Fitting a 2D image (25 points)
Let’s consider a color image to be a mapping I : R^2 → R^3, where the input is the pixel coordinates (x, y) and the output is the RGB channels. We define a Multilayer Perceptron (MLP) network FΩ that will learn this mapping I. We say we fit the network F to the image I.
1.1 Positional Encoding (5 points)
Positional encoding is used to map continuous input coordinates into a higher dimensional space to
enable a neural network to approximate a higher frequency function. Despite the fact that neural
networks are universal function approximators, it has been shown that having a network directly
operate on input coordinates (x, y) results in renderings that perform poorly at representing high
frequency variation in color and texture. Mapping the inputs to a higher dimensional space using
high-frequency functions before passing them to the network enables a better approximation of high
frequency variation.
In this homework, we will use the sinusoidal periodic function for positional encoding. Note that other periodic functions, or even non-periodic functions, can be used to map your data into higher dimensions.
Let x ∈ R^D be a D-dimensional vector. Then we define the positional encoding of x as:

γ(x) = ( sin(2^0 π x), cos(2^0 π x), sin(2^1 π x), cos(2^1 π x), . . . , sin(2^(L−1) π x), cos(2^(L−1) π x) ),

and thus the positional encoding γ is a mapping from R^D → R^(2DL), where L is fixed and chosen. Note, for example, that for a 2-dimensional input x = (x1, x2), sin(x) = (sin(x1), sin(x2)).
For this part, complete the function positional_encoding() that takes a [N, D] vector as input,
where N is the batch size and D the feature dimension, and returns the positional encoding mapping
of the input. The function also has a Boolean argument that determines whether x itself should be prepended to γ(x) (think about how this changes the output dimension).
1.2 MLP Design (5 points)
The architecture of your MLP should be the following:
• It should consist of three linear layers. The first one will map the feature dimensions (after the
positional encoding, including the prepended x) to the filter_size dimension. The second
one should keep the feature space dimension the same. The final linear layer should map the
feature space dimension to the output dimension of the MLP.
• The first two linear layers should be followed by a ReLU activation, while the last linear layer should be followed by a Sigmoid activation function.
For this part, complete the class MLP() and specifically the functions __init__() and forward() that define the neural network that will be used to fit the 2D image.
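One way to lay this out in PyTorch is sketched below; the constructor arguments and the default filter_size are illustrative, not the values fixed in the starter code.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Three-layer MLP mapping encoded pixel coordinates to RGB values."""
    def __init__(self, in_dim, filter_size=128, out_dim=3):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, filter_size)        # encoded features -> filter_size
        self.layer2 = nn.Linear(filter_size, filter_size)   # keeps the feature dimension
        self.layer3 = nn.Linear(filter_size, out_dim)       # filter_size -> RGB

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        return torch.sigmoid(self.layer3(x))                # colors constrained to [0, 1]
```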
1.3 Fitting the image (15 points)
For this part, complete the function train_2d_model(). We define the learning rate and number of
iterations for you. For the optimizer we will use the Adam optimizer and for the loss function, the
mean square error. You should transform the image to a vector of 2D points and then apply the
positional encoding to them. The pixel coordinates should be normalized to [0, 1]. Train the model by passing the encoded points through the MLP, transforming the output back to an image, and computing the loss between the original and reconstructed image. Finally, calculate the PSNR between the images
which is given by:

PSNR = 10 · log10( R^2 / MSE ),

where R is the maximum valid value of a normalized pixel and MSE is the mean squared error between the images. The PSNR computes the peak signal-to-noise ratio, in decibels, between two images and is used as a quality measurement between the original and the reconstructed image.
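In code, this computation is a one-liner; the sketch below assumes images normalized to [0, 1], so R = 1.

```python
import torch

def psnr(img1, img2, R=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, 1]."""
    mse = torch.mean((img1 - img2) ** 2)
    return 10.0 * torch.log10(R ** 2 / mse)
```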
After completing the function, train the model to fit the given image without applying positional
encoding to the input, and by applying positional encoding of two different frequencies to the input;
L = 2 and L = 6. What’s the effect of positional encoding and the effect of different numbers of
frequencies? To pass the autograder for this part, you need a PSNR higher than 15.5, 16.2, and 26 for the three cases, respectively, after 10,000 iterations. You should upload the weights of the trained neural network along with the fitted image, for all three cases.
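For orientation, a minimal training-loop sketch is shown below; it reuses the positional_encoding, MLP, and psnr sketches above, and its signature and hyperparameter defaults are illustrative rather than the values fixed in the starter code.

```python
import torch
import torch.nn.functional as F

def train_2d_model(image, num_frequencies, num_iters=10000, lr=5e-4):
    H, W, _ = image.shape
    # Normalized pixel coordinates in [0, 1], flattened to an [H*W, 2] grid.
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H),
                            torch.linspace(0, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    inputs = positional_encoding(coords, num_frequencies)   # from Section 1.1
    target = image.reshape(-1, 3)

    model = MLP(in_dim=inputs.shape[-1])                     # from Section 1.2
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for it in range(num_iters):
        optimizer.zero_grad()
        pred = model(inputs)                                 # predicted RGB per pixel
        loss = F.mse_loss(pred, target)                      # mean squared error
        loss.backward()
        optimizer.step()

    recon = pred.detach().reshape(H, W, 3)                   # reconstructed image
    return model, recon, psnr(recon, image)
```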
2 Part 2: Fitting a 3D scene (75 points)
In this part of the homework, our goal is to represent a 3D scene in a convenient and compact way,
to be later used for rendering 2D images. A 3D scene can be considered a set of points that contain
color and density. While every point in the scene has a fixed density, the same is not true of its color: each point can have a different color depending on the viewing direction if we assume the
surface to be non-Lambertian.
Thus a 3D scene is represented by the mapping (x, y, z, θ, ϕ) = (x, d) → (c, σ), where x = (x, y, z)
is the 3D position of a point in the scene, d = (θ, ϕ) is the viewing direction, c = (r, g, b) is the color
of the point, and σ is its density. The problem we will try to solve here, and the one that the NeRF paper addressed, is the following: given a number of 2D views of the same static scene, render novel views of that scene.
2.1 Computing the images’ rays (10 points)
For every 2D image I, we are given the transformation between the camera and the world coordinates,
along with the intrinsic parameters of the camera. We need to calculate the origins and the directions of the rays of each camera frame with respect to the world coordinate frame.
For this part, you need to complete the function get_rays() that returns the origins and the
directions of H × W rays of an image. Note that the origin is the same for every ray of an image, while the directions differ slightly depending on the pixel each ray passes through.
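A possible structure for get_rays() is sketched below; the argument layout (a 3×3 intrinsics matrix plus a camera-to-world rotation and translation) is an assumption, and depending on the dataset's camera-axis convention some axes may need their signs flipped.

```python
import torch

def get_rays(H, W, intrinsics, w_R_c, w_T_c):
    """Ray origins and directions in world coordinates for every pixel of an image.

    intrinsics: 3x3 camera matrix K; w_R_c, w_T_c: camera-to-world rotation/translation.
    """
    # Pixel grid (u along the width, v along the height).
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Back-project each pixel to a direction in the camera frame.
    dirs_cam = torch.stack([(u - cx) / fx, (v - cy) / fy, torch.ones_like(u)], dim=-1)
    # Rotate directions into the world frame; every ray shares the camera origin.
    ray_directions = dirs_cam @ w_R_c.T                     # [H, W, 3]
    ray_origins = w_T_c.reshape(1, 1, 3).expand(H, W, 3)    # [H, W, 3]
    return ray_origins, ray_directions
```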
Finally, you should complete the function plot_all_poses() which plots the vectors (origin and
direction) from all the frames used to capture the 3D scene. This will offer a good visualization of the
whole setup of the data. The function should calculate two vectors that contain all the origins from
each image and all the directions that pass through each image’s center (u0, v0). The plotting part is
implemented for you.
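A corresponding sketch of the pose-collection part of plot_all_poses(), assuming a list of 4×4 camera-to-world pose matrices:

```python
import torch

def plot_all_poses(poses, intrinsics, H, W):
    """Collect one ray origin and one center-pixel ray direction per frame."""
    origins, directions = [], []
    u0, v0 = int(intrinsics[0, 2]), int(intrinsics[1, 2])   # principal point (u0, v0)
    for pose in poses:                                       # pose: 4x4 camera-to-world matrix
        rays_o, rays_d = get_rays(H, W, intrinsics, pose[:3, :3], pose[:3, 3])
        origins.append(rays_o[v0, u0])                       # every ray of a frame shares this origin
        directions.append(rays_d[v0, u0])                    # ray through the image center
    origins, directions = torch.stack(origins), torch.stack(directions)
    # ... the actual 3D quiver plot of these vectors is provided in the starter code
```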
2.2 Sampling points along a ray (10 points)
Now that we have all rays calculated, we need to be able to sample a number of points along a ray.
Recall that for a ray with origin o and direction d, all the points that lie on the ray are given by the equation r(t) = o + t · d, ∀t ∈ R.
For this part, you need to complete the function stratified_sampling(), where you need to sample a number of ti, i = 1, . . . , N, between tnear and tfar. The points should be chosen evenly along this interval. The function should return the (x, y, z) position of each point sampled from each ray of the given image, along with the depth points (ti) of each ray.
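A sketch of stratified_sampling() under these assumptions (evenly spaced depths; argument names illustrative):

```python
import torch

def stratified_sampling(ray_origins, ray_directions, near, far, samples):
    """Sample `samples` depths evenly in [near, far] and compute r(t) = o + t*d.

    ray_origins, ray_directions: [H, W, 3].
    Returns ray_points [H, W, samples, 3] and depth_points [H, W, samples].
    """
    t = torch.linspace(near, far, samples)                        # evenly spaced depths t_i
    depth_points = t.expand(*ray_origins.shape[:-1], samples)     # [H, W, samples]
    # Broadcast r(t) = o + t * d over the sample dimension.
    ray_points = (ray_origins[..., None, :]
                  + depth_points[..., None] * ray_directions[..., None, :])
    return ray_points, depth_points
```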
2.3 NeRF MLP Design (20 points)
The neural network architecture of the NeRF paper is rather simple for the task it aims to solve. The input of the neural network is the position (x, y, z) of the sample points along the ray, and the viewing direction (θ, ϕ) of those points, after applying positional encoding to both of them. Figure 4 below depicts the architecture in more detail.
The network is composed of 12 fully connected layers (arrows), followed by ReLU activation (black arrows), no activation (orange arrow), or sigmoid activation (dashed black arrow). The number inside the blue boxes depicts the output dimension of the preceding fully connected layer. A skip connection is included
that concatenates the input to the fifth layer’s activation. After the eighth layer with no activation
function, there is a layer that outputs the volume density σ and a layer that produces a 256-dimensional
feature vector. This feature vector is concatenated with the positional encoding of the input viewing
direction (γ(d)) and is processed by an additional fully connected ReLU layer with 128 channels. A
final layer (with a sigmoid activation) outputs the emitted RGB radiance at position x, as viewed by
a ray with direction d. Note that the network predicts the density of a point before it is given the
direction of that point, to enforce that the density is view-invariant. On the other hand, the color
needs both the position and view direction of the point to model plausible non-Lambertian properties
of the 3D scene. We will be using 10 and 4 as the number of frequencies for the positional encoding of the position and direction inputs, respectively. You should also pass σ through a ReLU activation function to avoid infinite values in the volumetric rendering part.
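One possible reading of this architecture in PyTorch is sketched below; the skip-connection placement and the 12-layer count follow the description above, and the input dimensions assume the raw input is prepended by the positional encoding (3 + 3·2·10 = 63 for positions, 3 + 3·2·4 = 27 for directions).

```python
import torch
import torch.nn as nn

class nerf_model(nn.Module):
    """Sketch of the NeRF MLP: position branch with a skip connection,
    a density head, and a view-dependent color head."""
    def __init__(self, filter_size=256, num_x_frequencies=10, num_d_frequencies=4):
        super().__init__()
        in_x = 3 + 3 * 2 * num_x_frequencies              # encoded position dimension
        in_d = 3 + 3 * 2 * num_d_frequencies              # encoded direction dimension

        # Layers 1-5: ReLU MLP on the encoded position.
        self.layers_pre = nn.ModuleList(
            [nn.Linear(in_x, filter_size)] +
            [nn.Linear(filter_size, filter_size) for _ in range(4)])
        # Layers 6-7 (ReLU) and layer 8 (no activation); layer 6 receives the skip input.
        self.layers_post = nn.ModuleList(
            [nn.Linear(filter_size + in_x, filter_size),
             nn.Linear(filter_size, filter_size),
             nn.Linear(filter_size, filter_size)])
        self.sigma_head = nn.Linear(filter_size, 1)              # volume density sigma
        self.feature_head = nn.Linear(filter_size, filter_size)  # 256-d feature vector
        self.rgb_hidden = nn.Linear(filter_size + in_d, 128)     # feature + gamma(d), ReLU
        self.rgb_head = nn.Linear(128, 3)                        # sigmoid RGB output

    def forward(self, x, d):
        h = x
        for layer in self.layers_pre:
            h = torch.relu(layer(h))
        h = torch.cat([h, x], dim=-1)                     # skip connection after layer 5
        h = torch.relu(self.layers_post[0](h))
        h = torch.relu(self.layers_post[1](h))
        h = self.layers_post[2](h)                        # layer 8: no activation
        sigma = torch.relu(self.sigma_head(h))            # ReLU keeps the density finite
        feat = self.feature_head(h)
        h = torch.relu(self.rgb_hidden(torch.cat([feat, d], dim=-1)))
        rgb = torch.sigmoid(self.rgb_head(h))
        return rgb, sigma
```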
For this part, complete the class nerf_model() and specifically the functions __init__() and forward() that define the neural network that will be used to fit the 3D scene. In addition to this class, you need to complete the function get_batch(), which prepares the data for the neural
network. Specifically, this function takes as input the ray points and directions with [H, W, nsamples, 3]
and [H, W, 3] dimensions respectively. The function should normalize the directions, populate them
along each ray (repeat the direction of the ray to every point), flatten the vector, apply positional
encoding to it, and then call the helper function get_chunks(). Similarly, the function should flatten
the vector of the ray positions and then call get_chunks().
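Finally, a sketch of get_batch() under the assumption that get_chunks() (provided in the starter code) splits a flat tensor into manageable pieces and that positional_encoding() is the function from Part 1; argument names are illustrative.

```python
def get_batch(ray_points, ray_directions, num_x_frequencies=10, num_d_frequencies=4):
    """Flatten and positionally encode one image's sample points and ray directions.

    ray_points: [H, W, samples, 3]; ray_directions: [H, W, 3].
    """
    H, W, samples, _ = ray_points.shape
    # Normalize the directions and repeat each ray's direction for all of its samples.
    d = ray_directions / ray_directions.norm(dim=-1, keepdim=True)
    d = d[..., None, :].expand(H, W, samples, 3).reshape(-1, 3)
    d = positional_encoding(d, num_d_frequencies)
    # Flatten the sample positions and encode them as well.
    p = positional_encoding(ray_points.reshape(-1, 3), num_x_frequencies)
    return get_chunks(p), get_chunks(d)
```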