CS533 Parallel Computing
1. (10 points) Parallel Performance
Amdahl’s law assumes that the sequential portion of execution cannot be sped up and that the parallel portion can be indefinitely parallelized. In more general models, both assumptions can be relaxed: some parallel segments might have limited parallelism and there are techniques to speed up the execution of sequential sections (as we will see later in the course).
Assume you have a multi-core processor with n Cores. You can configure the system before running a program so that r cores work together as a single larger r-Core that can run sequential code √
r times as fast as a primitive core; the resulting system therefore will have one r-Core and n − r primitive cores.
You have an application that runs in three phases. Phase 1 (20% of computation) is sequential; Phase 2 (50% of computation) is linearly and indefinitely parallelizable while Phase 3 (30% of computation) can be linearly parallelized up to 3 ways. Assume that
parallelization adds no overhead.
(a) For the application on a system with one r-Core and n − r primitive cores (n ≥ 1, n ≥ r), give the general formula for the speedup over sequential execution on a single primitive core.
(b) Find the optimal r that offers the highest possible speedup for the following systems: n = 2, n = 4 and n = 16.
2. (12 points) Cache Coherence vs. Memory Consistency
(a) Define the concept of coherence.
(b) Define the concept of memory consistency model.
(c) What are the differences between the two?
(d) Are the ideas of coherency and consistency relevant to a uniprocessor system? Why?
(e) Can a multithreaded program tell if it is running on a cache coherent multiprocessor machine or a uniprocessor machine with preemptive scheduling? The program should only decide based on the values returned by its loads.
(f) Can a multithreaded program tell if the hardware is not SC by observing the values returned by the loads? Is it possible to decide whether the hardware is TSO in this way?
4. (12 points) Directory Coherence Protocols
(a) Give examples of at least two commercial processors that use directory based coherence schemes.
(b) The Compute Express Link (CXL) 1
technology expands the cache coherency to memories of devices connected in PCIE. Does it use directory-based or snoopy coherence protocol? Explain.
(c) Consider a 128-processor shared memory system using Illinois coherence with 32GB of memory per node. The cache block size is 64B.
(i) What is the directory memory overhead for DirN (full bit vector scheme)?
(ii) What is the directory memory overhead for Dir4CV ?
(iii) At how many processors does the DirN overhead become 100%?
(iv) At how many processors does the Dir4CV overhead become 100%?
(d) Coherence Protocol implementations in real systems are notoriously hard to verify. This is because the many transient states that need to be included in actual implementations exponentially increase the number of transitions which must be checked.
(i) Why are there are extra transient states in real coherence protocols?
(ii) Give an example of a coherence protocol verifier. Describe at a high level how the verifier works.
5. (8 points) Sharing Patterns
(a) Which coherence protocol would you prefer in each of the following situations: Illinois (MESI) or Firefly? Why?
(i) Multiprogramming workload with no sharing
(ii) Producer-consumer through a FIFO buffer
(iii) Producer-consumer where producer writes the consumed location multiple times before a consumer reads
(iv) Physics simulation with N particles
(b) What is the purpose of the “exclusive” state in the Firefly and Illinois (MESI) protocols? What sharing patterns benefit from this state?
8. (10 points) Interaction of Consistency and Coherence
Consider a directory-based machine with Weak Ordering (i.e., reordering of data reads and writes is allowed) using invalidation-based coherence over a network that does not preserve message ordering. In a correct implementation of this system, the directory
serializes accesses to each line l as follows: Upon receiving a write from some node in the system, the directory sends invalidations to all sharers of l and waits for all sharers to acknowledge the invalidation before servicing any subsequent read requests for l.
Now consider a faulty implementation of the above system where the directory does not wait for all invalidations to be received before providing the new value to subsequent readers. The following code demonstrates the coherence/consistency bug. The fence instructions in the code are full (RW → RW) memory fences. When run on the faulty system, the resulting vector {r1, r2, r3, r4} can take on a final value that is illegal under WO. What is the illegal value and why is it illegal?