CIS 580, Machine Perception, Fall 2023 Final Project Part B

1 Part 1: Fitting a 2D image (25 points)
Let’s consider a color image to be a mapping I : ℝ² → ℝ³, where the input is the pixel coordinates
(x, y) and the output is the RGB channels. We define a Multilayer Perceptron (MLP) network F
that will learn this mapping I. We say we fit the network F to the image I.
1.1 Positional Encoding (5 points)
Positional encoding is used to map continuous input coordinates into a higher dimensional space to
enable a neural network to approximate a higher frequency function. Despite the fact that neural
networks are universal function approximators, it has been shown that having a network directly
operate on input coordinates (x, y) results in renderings that perform poorly at representing high
frequency variation in color and texture. Mapping the inputs to a higher dimensional space using
high-frequency functions before passing them to the network enables a better approximation of high
frequency variation.
In this homework, we will use a sinusoidal periodic function for positional encoding. Note that other periodic, or even non-periodic, functions can also be used to map your data into higher dimensions.
Let x ∈ ℝ^D be a D-dimensional vector. Then we define the positional encoding of x as:

γ(x) = (sin(2⁰πx), cos(2⁰πx), . . . , sin(2^(L−1)πx), cos(2^(L−1)πx)),

and thus the positional encoding γ is a mapping from ℝ^D → ℝ^(2DL), where L is fixed and chosen. Note, for example, that for a 2-dimensional input x = (x₁, x₂), sin(x) = (sin(x₁), sin(x₂)).
For this part, complete the function positional_encoding() that takes an [N, D] tensor as input, where N is the batch size and D is the feature dimension, and returns the positional encoding of the input. The function also has a Boolean argument that determines whether x itself should be prepended to γ(x) (think about how this changes the output dimension).
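Below is a minimal sketch of what such a function could look like, assuming a PyTorch implementation and the standard NeRF frequency schedule 2^i · π; the argument names num_frequencies and incl_input are placeholders, since the exact signature is fixed by the starter code.

```python
import torch

def positional_encoding(x, num_frequencies=6, incl_input=True):
    # x: [N, D] tensor of coordinates. Returns gamma(x) of shape
    # [N, 2 * D * num_frequencies] (plus D extra columns if incl_input).
    results = []
    if incl_input:
        results.append(x)                      # prepend x itself when requested
    for i in range(num_frequencies):
        freq = (2.0 ** i) * torch.pi           # frequencies 2^0*pi, ..., 2^(L-1)*pi
        results.append(torch.sin(freq * x))
        results.append(torch.cos(freq * x))
    return torch.cat(results, dim=-1)
```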
1.2 MLP Design (5 points)
The architecture of your MLP should be the following:
- It should consist of three linear layers. The first one maps the feature dimension (after the positional encoding, including the prepended x) to the filter_size dimension. The second one keeps the feature-space dimension the same. The final linear layer maps the feature-space dimension to the output dimension of the MLP.
- The first two linear layers should be followed by a ReLU activation, while the last linear layer should be followed by a Sigmoid activation function.
For this part, complete the class MLP() and specifically the functions __init__() and forward() that define the neural network that will be used to fit the 2D image.
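A minimal sketch of the class, assuming PyTorch and a 2D input concatenated with its positional encoding; filter_size comes from the text above, while the default values are assumptions.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    # Three linear layers: encoded (x, y) -> filter_size -> filter_size -> 3 (RGB).
    def __init__(self, filter_size=128, num_frequencies=6):
        super().__init__()
        in_dim = 2 + 2 * 2 * num_frequencies   # (x, y) prepended to its encoding
        self.layer1 = nn.Linear(in_dim, filter_size)
        self.layer2 = nn.Linear(filter_size, filter_size)
        self.layer3 = nn.Linear(filter_size, 3)

    def forward(self, x):
        x = torch.relu(self.layer1(x))          # ReLU after the first two layers
        x = torch.relu(self.layer2(x))
        return torch.sigmoid(self.layer3(x))    # Sigmoid keeps colors in [0, 1]
```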
1.3 Fitting the image (15 points)
For this part, complete the function train_2d_model(). We define the learning rate and number of
iterations for you. For the optimizer we will use the Adam optimizer, and for the loss function, the mean squared error. You should transform the image to a vector of 2D points and then apply the
positional encoding to them. The pixel coordinates should be normalized between [0, 1]. Train the
model by fitting the points to the MLP, transforming the output back to an image, and computing the
loss between the original and reconstructed image. Finally, calculate the PSNR between the images, which is given by:

PSNR = 10 · log₁₀(R² / MSE),

where R is the maximum valid value of a normalized pixel and MSE is the mean squared error
between the images. The PSNR computes the peak signal-to-noise ratio, in decibels, between two
images and is used as a quality measurement between the original and the reconstructed image.
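As a rough sketch of the training loop and the PSNR computation, assuming the image is an [H, W, 3] tensor with values in [0, 1] and using positional_encoding() from Section 1.1; the provided skeleton may structure this differently.

```python
import torch

def train_2d_model(image, model, num_frequencies, lr=1e-3, iterations=10000):
    H, W = image.shape[:2]
    # Normalized pixel coordinates in [0, 1], flattened to [H*W, 2].
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H),
                            torch.linspace(0, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    encoded = positional_encoding(coords, num_frequencies)
    target = image.reshape(-1, 3)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(iterations):
        optimizer.zero_grad()
        pred = model(encoded)
        loss = torch.mean((pred - target) ** 2)    # MSE between reconstruction and image
        loss.backward()
        optimizer.step()

    # PSNR = 10 * log10(R^2 / MSE), with R = 1 for normalized pixels.
    psnr = 10.0 * torch.log10(1.0 / loss.detach())
    return model, psnr
```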
After completing the function, train the model to fit the given image first without applying positional encoding to the input, and then applying positional encoding with two different numbers of frequencies, L = 2 and L = 6. What is the effect of positional encoding, and what is the effect of the different numbers of frequencies? To pass the autograder for this part, you need a PSNR higher than 15.5, 16.2, and 26 for the three cases respectively, after 10,000 iterations. You should upload the weights of the trained neural network along with the fitted image for all three cases.
2 Part 2: Fitting a 3D scene (75 points)
In this part of the homework, our goal is to represent a 3D scene in a convenient and compact way,
to be later used for rendering 2D images. A 3D scene can be considered a set of points that contain
color and density. While every point in the scene has a fixed density, that is not the case for the color: each point can have a different color depending on the viewing direction if we assume the surfaces to be non-Lambertian.
Thus, a 3D scene is represented by the mapping (x, y, z, θ, ϕ) = (x, d) → (c, σ), where x = (x, y, z) is the 3D position of a point in the scene, d = (θ, ϕ) is the viewing direction, c = (r, g, b) is the color of the point, and σ is its density. The problem we will try to solve here, the one the NeRF paper approached, is the following: given a number of 2D views of the same static scene, render novel views of that scene.
2.1 Computing the images’ rays (10 points)
For every 2D image I, we are given the transformation between the camera and the world coordinates,
along with the intrinsic parameters of the camera. We need to calculate the origins and the directions of the rays of each camera frame with respect to the world coordinate frame.
For this part, you need to complete the function get_rays() that returns the origins and the
directions of H × W rays of an image. Note that the origin should be the same for every ray of a given image, while the directions differ slightly depending on the pixel each ray passes through.
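As a sketch of the geometry, assuming a pinhole camera with focal length f and principal point (u0, v0), a camera-to-world rotation R_cw and translation t_cw, and a camera looking along +z; the actual argument layout and sign conventions are fixed by the starter code and the dataset.

```python
import torch

def get_rays(H, W, intrinsics, R_cw, t_cw):
    # Returns ray origins and directions, both [H, W, 3], in world coordinates.
    f = intrinsics[0, 0]
    u0, v0 = intrinsics[0, 2], intrinsics[1, 2]

    # Pixel grid; u and v both have shape [H, W].
    u, v = torch.meshgrid(torch.arange(W, dtype=torch.float32),
                          torch.arange(H, dtype=torch.float32), indexing="xy")
    # Ray directions in the camera frame.
    dirs_cam = torch.stack([(u - u0) / f, (v - v0) / f, torch.ones_like(u)], dim=-1)
    # Rotate into the world frame; the origin is identical for every ray.
    ray_directions = dirs_cam @ R_cw.T          # [H, W, 3]
    ray_origins = t_cw.expand(H, W, 3)          # [H, W, 3]
    return ray_origins, ray_directions
```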
Finally, you should complete the function plot_all_poses() which plots the vectors (origin and
direction) from all the frames used to capture the 3D scene. This will offer a good visualization of the
whole setup of the data. The function should calculate two vectors that contain all the origins from
each image and all the directions that pass through each image’s center (u0, v0). The plotting part is
implemented for you.
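One way to collect the two vectors is to take, for every pose, the shared ray origin and the direction of the ray through the image center; a sketch assuming get_rays() from above, an [N, 4, 4] stack of camera-to-world poses, and a principal point approximated by the center pixel.

```python
import torch

def plot_all_poses(poses, intrinsics, H, W):
    # poses: [N, 4, 4] camera-to-world transforms.
    origins, directions = [], []
    for pose in poses:
        o, d = get_rays(H, W, intrinsics, pose[:3, :3], pose[:3, 3])
        origins.append(o[H // 2, W // 2])        # shared origin of this frame
        directions.append(d[H // 2, W // 2])     # ray through (approximately) (u0, v0)
    origins, directions = torch.stack(origins), torch.stack(directions)
    # The quiver plot of these vectors is provided in the starter code.
    return origins, directions
```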
2.2 Sampling points along a ray (10 points)
Now that we have all rays calculated, we need to be able to sample a number of points along a ray.
Recall that for a ray with origin o and direction d, all the points that lie on the ray are given by the equation: r(t) = o + t · d, t ∈ ℝ.
For this part, you need to complete the function stratified_sampling() where you need to sample
a number of depths tᵢ, i = 1, . . . , N, between t_near and t_far. The points should be spaced evenly along this interval. The function should return the (x, y, z) position of each point sampled from each ray of the given image, along with the depth values (tᵢ) of each ray.
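A brief sketch, assuming the depths are spaced evenly between t_near and t_far and broadcast against every ray; argument names and output shapes are placeholders.

```python
import torch

def stratified_sampling(ray_origins, ray_directions, near, far, samples):
    # ray_origins, ray_directions: [H, W, 3].
    # Returns points of shape [H, W, samples, 3] and the depths t of shape [samples].
    t = torch.linspace(near, far, samples)       # evenly spaced depths in [near, far]
    # r(t) = o + t * d, broadcast over all rays and all samples.
    points = ray_origins[..., None, :] + t[:, None] * ray_directions[..., None, :]
    return points, t
```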
2.3 NeRF MLP Design (20 points)
The neural network architecture of the NeRF paper is rather simple for the task it aims to solve.
The input of the neural network is the position (x, y, z) of the sample points along the ray, and the
direction of these points (θ, ϕ), after applying positional encoding to both of them. Figure 4 below
depicts the architecture in more detail.
The network is comprised of 12 fully connected layers (arrows), followed by ReLU activation (black
arrows), no activation (orange arrow), or sigmoid activation (dashed black arrow). The number inside
the blue boxes denotes the output dimension of the preceding fully connected layer. A skip connection is included
that concatenates the input to the fifth layer’s activation. After the eighth layer with no activation
function, there is a layer that outputs the volume density σ and a layer that produces a 256-dimensional
feature vector. This feature vector is concatenated with the positional encoding of the input viewing
direction (γ(d)) and is processed by an additional fully connected ReLU layer with 128 channels. A
final layer (with a sigmoid activation) outputs the emitted RGB radiance at position x, as viewed by
a ray with direction d. Note that the network predicts the density of a point before it is given the
direction of that point, to enforce that the density is view-invariant. On the other hand, the color
needs both the position and view direction of the point to model plausible non-Lambertian properties
of the 3D scene. We will be using 10 and 4 as the number of frequencies for the positional encoding of the position and direction inputs, respectively. You should also pass σ through a ReLU activation function to avoid infinite values in the volumetric rendering part.
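A hedged sketch of this architecture in PyTorch: the widths (256 and 128 channels) and the frequency counts (10 and 4) follow the description above, but the input dimensions and the exact placement of the no-activation layer should be checked against Figure 4 and the starter code.

```python
import torch
import torch.nn as nn

class nerf_model(nn.Module):
    def __init__(self, filter_size=256, num_x_frequencies=10, num_d_frequencies=4):
        super().__init__()
        in_x = 3 + 2 * 3 * num_x_frequencies     # gamma(x), with x prepended (assumption)
        in_d = 3 + 2 * 3 * num_d_frequencies     # gamma(d), with d prepended (assumption)

        # Layers 1-8: 256-channel ReLU layers, with a skip connection at layer 5.
        self.layers_pre = nn.ModuleList(
            [nn.Linear(in_x, filter_size)] +
            [nn.Linear(filter_size, filter_size) for _ in range(4)])
        self.layers_post = nn.ModuleList(
            [nn.Linear(filter_size + in_x, filter_size)] +
            [nn.Linear(filter_size, filter_size) for _ in range(2)])

        self.sigma = nn.Linear(filter_size, 1)               # density head
        self.feature = nn.Linear(filter_size, filter_size)   # 256-d feature vector
        self.rgb_hidden = nn.Linear(filter_size + in_d, 128) # after concatenating gamma(d)
        self.rgb = nn.Linear(128, 3)

    def forward(self, x, d):
        h = x
        for layer in self.layers_pre:
            h = torch.relu(layer(h))
        h = torch.cat([h, x], dim=-1)            # skip connection with the input
        for layer in self.layers_post:
            h = torch.relu(layer(h))
        sigma = torch.relu(self.sigma(h))        # ReLU keeps the density non-negative/finite
        feat = self.feature(h)                   # no activation
        h = torch.relu(self.rgb_hidden(torch.cat([feat, d], dim=-1)))
        rgb = torch.sigmoid(self.rgb(h))         # view-dependent color
        return rgb, sigma
```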
For this part, complete the class nerf_model() and specifically the functions __init__() and forward() that define the neural network that will be used to fit the 3D scene. In addition to this, you need to complete the function get_batch(), which prepares the data for the neural network. Specifically, this function takes as input the ray points and directions with dimensions [H, W, nsamples, 3] and [H, W, 3], respectively. The function should normalize the directions, populate them along each ray (repeat the direction of the ray for every point on it), flatten the vector, apply positional encoding to it, and then call the helper function get_chunks(). Similarly, the function should flatten the vector of the ray positions and then call get_chunks().
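A sketch of get_batch() under these assumptions, using positional_encoding() from Part 1 and the provided get_chunks() helper; the exact shapes and ordering are fixed by the starter code.

```python
import torch

def get_batch(ray_points, ray_directions, num_x_frequencies=10, num_d_frequencies=4):
    # ray_points: [H, W, nsamples, 3]; ray_directions: [H, W, 3].
    H, W, nsamples, _ = ray_points.shape

    # Normalize the directions and repeat them for every sample along the ray.
    d = ray_directions / torch.norm(ray_directions, dim=-1, keepdim=True)
    d = d[..., None, :].expand(H, W, nsamples, 3)

    # Flatten, apply positional encoding, and split into chunks.
    d_flat = positional_encoding(d.reshape(-1, 3), num_d_frequencies)
    x_flat = positional_encoding(ray_points.reshape(-1, 3), num_x_frequencies)
    return get_chunks(x_flat), get_chunks(d_flat)
```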