18-213/18-613 Proxy Lab: Writing a Caching Web Proxy
1 Introduction
A proxy server is a computer program that acts as an intermediary between clients making requests to access
resources and the servers that satisfy those requests by serving content. A web proxy is a special type of proxy
server whose clients are typically web browsers and whose servers are web servers providing web content.
When a web browser uses a proxy, it contacts the proxy instead of communicating directly with the web
server; the proxy forwards the client’s request to the web server, reads the server’s response, then forwards the
response to the client.
Proxies are useful for many purposes. Sometimes, proxies are used in firewalls, so that browsers behind a
firewall can only contact a server beyond the firewall via the proxy. A proxy may also perform translations
on pages, for example, to make them viewable on web-enabled phones. Importantly, proxies are used as
anonymizers: by stripping requests of all identifying information, a proxy can make the browser anonymous
to web servers. Proxies can even be used to cache web objects by storing local copies of objects from servers
and then responding to future requests by reading them out of the cache rather than by communicating again
with a remote server.
This lab has three parts. An implementation of the first part will be submitted as your checkpoint. Your final
submission will then incorporate the extensions forming the second and third parts. For the first part, you
will create a proxy that accepts incoming connections, reads and parses requests, forwards requests to web
servers, reads the servers’ responses, and forwards the responses to the corresponding clients. The first part
will involve learning about basic HTTP operation and how to use sockets to write programs that communicate
over network connections. In the second part, you will upgrade your proxy to deal with multiple concurrent
connections. This will introduce you to dealing with concurrency, a crucial systems concept. In the third and
last part, you will add caching to your proxy using a simple main memory cache of recently accessed web
content.
You will debug and test your program with PxyDrive, a testing framework we provide, as well as by accessing
your proxy via standard tools, including a web browser. The grading of your code will involve automated
testing. Your code will also be reviewed for correctness and for style.
2 Logistics
This is an individual project. You are allowed only one grace day for the checkpoint and one grace day for the final.
3 Handout instructions
Create your GitHub Classroom repository by clicking the “Download handout” button on the proxylab Autolab
page. Then do the following on a Shark machine:
• Clone the repository that you just created using the git clone command. Do not download and
extract the zip file from GitHub.
• Type your name and Andrew ID in the header comment at the top of proxy.c.
3.1 Robust I/O package
The handout directory contains the files csapp.c and csapp.h, which comprise the CS:APP package
discussed in the CS:APP3e textbook. The CS:APP package includes the robust I/O (RIO) package. When
reading and writing socket data, you should use the RIO package instead of low-level I/O functions, such as
read and write, or standard I/O functions, such as fread and fwrite.
The CS:APP package also contains a collection of wrapper functions for system calls that check the return
code and exit when there’s an error. You will find that the set of wrapper functions provided is a subset of
those from the textbook and the lecture notes. We have disabled ones for which exiting upon error is not the
correct behavior for a server program. For these, you must check the return code and devise ways to handle
these errors that minimize their impact.
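To illustrate the kind of error handling this implies, here is a minimal sketch of a write loop that copes with short writes and interrupted system calls and reports failure to its caller instead of exiting. The name write_all is hypothetical and not part of the handout; in your proxy, the RIO functions in csapp.c already deal with these cases, so treat this only as a picture of the idea.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_all (hypothetical helper, not part of csapp.c): write exactly n
 * bytes from buf to fd. Retries when write() is interrupted by a signal
 * or writes fewer bytes than requested. Returns n on success, or -1 on
 * error with errno set. It never exits, so the caller can abandon one
 * connection and keep serving others.
 */
ssize_t write_all(int fd, const void *buf, size_t n) {
    assert(buf != NULL || n == 0);
    const char *p = buf;
    size_t left = n;

    while (left > 0) {
        ssize_t written = write(fd, p, left);
        if (written < 0) {
            if (errno == EINTR) {
                continue; /* interrupted before any byte was written: retry */
            }
            return -1; /* real error, reported to the caller */
        }
        p += written;
        left -= (size_t)written;
    }
    return (ssize_t)n;
}
```

A caller would typically respond to a -1 return by closing the affected connection and freeing its resources, rather than terminating the whole proxy.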
3.2 HTTP parsing library
The handout directory contains the file http_parser.h, which defines the API for a small HTTP string
parsing library. The library includes functions for extracting important data fields from HTTP response
headers and storing them in a parser_t struct. A brief overview of the library is given below. Please refer to
the source files in your handout for the full documentation of the types, structs, and functions available for use
in the library.
To create a new instance of a parser struct, call parser_new(). The returned pointer can then be used as the
first argument to the other functions. parser_parse_line() will parse a line of an HTTP request and store
the result in the provided parser_t struct. Parsed fields of specified types may be retrieved from the struct
by calling parser_retrieve() and by providing a string pointer for the function to write to. Particular
headers may also be retrieved by name via parser_lookup_header(). Headers may instead be accessed in
an iterative fashion by successive calls to parser_retrieve_next_header().
3.3 Modularity
The skeleton file proxy.c, provided in the handout, contains a main function that does practically nothing.
You should fill in that file with your proxy implementation. Modularity should be an important
consideration, though: separate the individual modules of your implementation into
different files. For example, your cache should be largely (or completely) decoupled from the rest of your
proxy, so one good idea is to move the implementation of the cache into separate code and header files
cache.c and cache.h.
3.4 Makefile
You are free to add your own source and header files for this lab. The Makefile will automatically link all
.c files into the final binary. While you are free to update the provided Makefile (for example to define the
DEBUG macro), the autograder will use the original Makefile to grade your solution. As such, the entire project
should compile without warnings.
3.5 Other provided resources
Included with your starter code, in the pxy directory, is a pair of programs PxyDrive and PxyRegress (given
as files pxydrive.py and pxyregress.py, respectively). PxyDrive is a testing framework for your proxy.
PxyRegress provides a way to run a series of standard tests on your proxy using PxyDrive. Both programs
are documented in the PxyDrive user manual, available at:
http://www.cs.cmu.edu/~18213/proxylab/pxydrive-manual.pdf.
Also included, in the tests directory, is a series of 51 test files to test various aspects of your proxy. Each of
these is a command file for PxyDrive. You will want to learn about the operation of PxyDrive and how each
of these tests operates.
Finally, you are provided with a reference implementation of a proxy, named proxy-ref. It is compiled to
execute on a Linux machine.
4 Part I: Implementing a sequential web proxy
The first step is implementing a basic sequential proxy that handles HTTP/1.0 GET requests. Your proxy need
not handle other request types, such as POST requests, but it should respond appropriately, as described below.
Your proxy also need not handle HTTPS requests (only HTTP).
When started, your proxy should listen for incoming connections on a port whose number is specified on the
command line. Once a connection is established, your proxy should read the entirety of the request from the
client and parse the request. It should determine whether the client has sent a valid HTTP request; if so, it
should 1) establish its own connection to the appropriate web server, 2) request the object the client specified,
and 3) read the server’s response and forward it to the client.
4.1 HTTP/1.0 GET requests
When a user enters a URL such as http://www.cmu.edu:8080/hub/index.html into the address bar of a web
browser, the browser will send an HTTP request to the proxy that begins with a request line such as the
following:
GET http://www.cmu.edu:8080/hub/index.html HTTP/1.1\r\n
The proxy should parse the request URL into the host, in this case www.cmu.edu:8080, and the path,
consisting of the / character and everything following it. That way, the proxy can determine that it should
open a connection to hostname www.cmu.edu on port 8080 and send an HTTP request of its own, starting
with its own request line of the following form:
GET /hub/index.html HTTP/1.0\r\n
As these examples show, all lines in an HTTP request end with a carriage return (‘\r’) followed by a newline
(‘\n’). Also important is that every HTTP request must be terminated by an empty line, consisting of just the
string “\r\n”.
Notice in the above example that the web browser’s request line ends with HTTP/1.1, while the proxy’s
request line ends with HTTP/1.0. Modern web browsers will generate HTTP/1.1 requests, but your proxy
should handle them and forward them as HTTP/1.0 requests.
Additionally, in the above example, a port number of 8080 was specified as part of the host. If no port is
specified, the default HTTP port of 80 should be used.
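The URL handling described in this section can be sketched as follows. This is an illustration under stated assumptions, not the required approach: the type url_parts_t and the functions split_url and build_request_line are hypothetical names, only http:// URLs with a plain hostname are handled, and the port is not checked for being numeric. The provided HTTP parsing library may already extract these fields for you.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the pieces of a request URL. */
typedef struct {
    char host[256];  /* hostname only, without the port */
    char port[8];    /* port as a decimal string; "80" if absent */
    char path[1024]; /* always begins with '/' */
} url_parts_t;

/*
 * split_url (hypothetical helper): split "http://host[:port][/path]".
 * Returns 0 on success, or -1 if the URL does not start with http://
 * or one of the pieces does not fit in its buffer.
 */
int split_url(const char *url, url_parts_t *out) {
    assert(url != NULL && out != NULL);
    static const char scheme[] = "http://";
    size_t scheme_len = sizeof(scheme) - 1;

    if (strncmp(url, scheme, scheme_len) != 0) {
        return -1;
    }
    const char *host = url + scheme_len;
    const char *path = strchr(host, '/');
    const char *host_end = (path != NULL) ? path : host + strlen(host);

    /* Look for ':' only inside the host part, never inside the path. */
    const char *colon = memchr(host, ':', (size_t)(host_end - host));
    const char *name_end = (colon != NULL) ? colon : host_end;

    size_t name_len = (size_t)(name_end - host);
    if (name_len == 0 || name_len >= sizeof(out->host)) {
        return -1;
    }
    memcpy(out->host, host, name_len);
    out->host[name_len] = '\0';

    if (colon != NULL) {
        size_t port_len = (size_t)(host_end - colon - 1);
        if (port_len == 0 || port_len >= sizeof(out->port)) {
            return -1;
        }
        memcpy(out->port, colon + 1, port_len);
        out->port[port_len] = '\0';
    } else {
        strcpy(out->port, "80"); /* default HTTP port */
    }

    if (path == NULL) {
        path = "/"; /* a URL with no path refers to the root */
    }
    if (strlen(path) >= sizeof(out->path)) {
        return -1;
    }
    strcpy(out->path, path);
    return 0;
}

/*
 * build_request_line (hypothetical helper): format the proxy's own
 * HTTP/1.0 request line. Returns 0 on success, -1 if it does not fit.
 */
int build_request_line(char *buf, size_t size, const char *path) {
    assert(buf != NULL && path != NULL);
    int n = snprintf(buf, size, "GET %s HTTP/1.0\r\n", path);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Keeping the port as a string is a deliberate choice in this sketch, since functions such as getaddrinfo take the service as a string.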
4.2 Request headers
Request headers are very important elements of an HTTP request. Headers are key-value pairs provided
line-by-line following the first request line of an HTTP request, with the key and value separated by the
colon (‘:’) character. Of particular importance for this lab are the Host, User-Agent, Connection, and
Proxy-Connection headers. Your proxy must perform the following operations with regard to the listed
HTTP request headers:
• Always send a Host header. This header is necessary to coax sensible responses out of many web
servers, especially those that use virtual hosting.
The Host header describes the host of the web server your proxy is trying to access. For example, to
access http://www.cmu.edu:8080/hub/index.html, your proxy would send the following header:
Host: www.cmu.edu:8080\r\n
It is possible that the client will attach its own Host header to its HTTP requests. If that is the case,
your proxy should use the same Host header as the client.
4.3 Port numbers
There are two significant classes of port numbers for this lab: HTTP request ports and your proxy’s listening port.
The HTTP request port is an optional field in the URL of an HTTP request. That is, the URL may be
of the form http://www.cmu.edu:8080/hub/index.html, in which case your proxy should connect to
the host www.cmu.edu on port 8080, and it should include the port number in the Host header (e.g., Host:
www.cmu.edu:8080).
Your proxy must properly function whether or not the port number is included in the URL. If no port is
specified, the default HTTP port number of 80 should be used, which should not be included in the Host
header.
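The rule for including the port in the Host header can be sketched as below. The function name build_host_header is hypothetical, and passing NULL to mean "no port in the URL" is an assumption of this sketch. When the client supplied its own Host header, you would forward that value instead of building one.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * build_host_header (hypothetical helper): format the Host header for a
 * request. Pass port == NULL when the URL did not specify a port, in
 * which case the port is left out of the header. Returns 0 on success,
 * or -1 if the hostname is empty or the result does not fit in buf.
 */
int build_host_header(char *buf, size_t size, const char *hostname,
                      const char *port) {
    assert(buf != NULL && hostname != NULL);
    if (strlen(hostname) == 0) {
        return -1;
    }

    int n;
    if (port == NULL) {
        n = snprintf(buf, size, "Host: %s\r\n", hostname);
    } else {
        n = snprintf(buf, size, "Host: %s:%s\r\n", hostname, port);
    }
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For example, build_host_header(buf, sizeof buf, "www.cmu.edu", "8080") produces the header shown in Section 4.2, while passing NULL for the port produces a header containing only the hostname.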
The listening port is the port on which your proxy should listen for incoming connections. Your proxy should
accept a command line argument specifying the listening port number for your proxy. For example, with the
following command, your proxy should listen for connections on port 12345:
linux> ./proxy 12345
The proxy must be given a port number every time it runs. When using PxyDrive, this will be done
automatically, but when you run your proxy on its own, you must provide a port number. You may select
any non-privileged port (greater than 1,024 and less than 32,768) as long as it is not used by other processes.
Since each proxy must use a unique listening port, and many students may be working simultaneously on
each machine, the script port-for-user.pl is provided to help you pick your own personal port number.
Use it to generate a port number based on your Andrew ID:
linux> ./port-for-user.pl bovik
bovik: 5232
The port, p, returned by port-for-user.pl is always an even number. So if you need an additional port
number, say for the Tiny server, you can safely use ports p and p + 1.
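A minimal sketch of validating the listening-port argument follows. The name parse_port is hypothetical. The helper only checks that the argument is a decimal number in the valid TCP port range; choosing a non-privileged port that no other process is using remains your responsibility.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * parse_port (hypothetical helper): convert a command-line argument to
 * a TCP port number. Returns the port (1 to 65535), or -1 if the
 * argument is empty, has trailing non-digit characters, or is out of
 * range. Note that strtol itself tolerates leading whitespace and a
 * leading sign.
 */
int parse_port(const char *arg) {
    assert(arg != NULL);
    char *end;

    errno = 0;
    long value = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0') {
        return -1; /* overflow, no digits, or trailing junk */
    }
    if (value < 1 || value > 65535) {
        return -1; /* not a valid TCP port */
    }
    return (int)value;
}
```

In main, a proxy might call parse_port(argv[1]) after checking argc, and print a usage message and exit if the result is -1.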
4.4 Error handling
In the case of invalid requests, or valid requests that your proxy is unable to handle, it should try to send the
appropriate HTTP status code back to the client (see clienterror() in tiny.c). Read more about HTTP
status codes at: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
In particular, your proxy must be able to respond to a POST request with the 501 Not Implemented status
code. The request line for a POST request will resemble the following:
POST http://exams.ugrad.cs.cmu.edu/Shibboleth.sso/SAML2/POST HTTP/1.1\r\n
In other cases, it is acceptable for your proxy to simply close the connection to the client when an error occurs,
using close(). Note that in all error cases, you should always clean up all resources being used to handle a
given request, including file descriptors and allocated memory.
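As one possible shape for such an error reply, the sketch below formats a small HTTP/1.0 error response into a caller-supplied buffer. The name format_error_response and the exact headers and HTML body are assumptions of this sketch; compare with clienterror() in tiny.c before settling on your own version.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * format_error_response (hypothetical helper): write a small HTTP/1.0
 * error response (status line, headers, blank line, HTML body) into
 * buf. Returns the length of the response, or -1 if it does not fit.
 */
int format_error_response(char *buf, size_t size, int code,
                          const char *reason, const char *detail) {
    assert(buf != NULL && reason != NULL && detail != NULL);
    char body[512];

    int written = snprintf(body, sizeof(body),
                           "<html><head><title>%d %s</title></head>\r\n"
                           "<body><h1>%d %s</h1><p>%s</p></body></html>\r\n",
                           code, reason, code, reason, detail);
    if (written < 0 || (size_t)written >= sizeof(body)) {
        return -1; /* body was truncated */
    }

    int n = snprintf(buf, size,
                     "HTTP/1.0 %d %s\r\n"
                     "Content-Type: text/html\r\n"
                     "Content-Length: %zu\r\n"
                     "\r\n"
                     "%s",
                     code, reason, strlen(body), body);
    if (n < 0 || (size_t)n >= size) {
        return -1; /* response was truncated */
    }
    return n;
}
```

For a POST request, a proxy could call this with code 501 and reason "Not Implemented", send the resulting bytes to the client, and then close the connection and free any memory allocated for the request.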
Note: Upon normal execution, your proxy should not print anything. However, you should consider having a
verbose mode (set with -v on the command line) that prints useful information for debugging.
Completing Part I satisfies the requirements for the project checkpoint. See Section 7 regarding how your
proxy will be evaluated for the checkpoint.
5 Part II: Dealing with multiple concurrent requests
Production web proxies usually do not process requests sequentially; they process multiple requests in parallel.
This is particularly important when handling a single request can involve a lengthy delay (as it might when
contacting a remote web server). While your proxy waits for a response from the remote web server, it
can work on a pending request from another client. Indeed, most web browsers reduce latency by issuing
concurrent requests for the multiple URLs embedded in a single web page requested by a single client. Once
you have a working sequential proxy, you should alter it to simultaneously handle multiple requests.