Episode #6 | December 22, 2025 @ 4:00 PM EST

Memory Hierarchy and the Tyranny of Bandwidth

Guest

Dr. Onur Mutlu (Computer Architect, ETH Zurich)
Announcer The following program features simulated voices generated for educational and technical exploration.
Sam Dietrich Good evening. I'm Sam Dietrich.
Kara Rousseau And I'm Kara Rousseau. Welcome to Simulectics Radio.
Sam Dietrich Tonight we're examining one of the most fundamental constraints in modern computing: the memory hierarchy. For decades, processor speeds have increased faster than memory access times, creating what's known as the memory wall. The result is that in many applications, the CPU spends most of its time waiting for data rather than computing with it. We've built elaborate hierarchies—registers, L1 cache, L2 cache, L3 cache, main memory, storage—each level trading capacity for speed. But these are engineering mitigations, not solutions. The underlying physics of data movement—the energy cost, the latency imposed by distance, the bandwidth limits of interconnects—remains the dominant constraint in system performance.
Kara Rousseau And the challenge is architectural as much as physical. How do you design a system where computation and data are separated by orders of magnitude in access time? Caching helps when you have locality—temporal and spatial patterns in how programs access memory—but not all workloads are cache-friendly. Streaming data, random access patterns, pointer-chasing algorithms—these defeat caches and expose the raw latency of main memory. We've tried various approaches: prefetching to hide latency, bandwidth provisioning through wider buses and higher frequencies, non-uniform memory architectures that place memory closer to processors. Each approach has trade-offs. The question is whether we're addressing symptoms or whether there's a more fundamental rethinking required.
Sam Dietrich To explore the memory hierarchy, its physical and architectural limits, and potential paths forward, we're joined by Dr. Onur Mutlu, Professor of Computer Science at ETH Zurich. Dr. Mutlu's research spans memory systems, computer architecture, and the design of reliable and efficient computing systems. He's worked on everything from DRAM reliability to processing-in-memory architectures. Dr. Mutlu, welcome.
Dr. Onur Mutlu Thank you for having me. It's great to be here.
Kara Rousseau Let's start with the basics. Why is memory access so much slower than computation, and is this gap fundamental or an artifact of current technology?
Dr. Onur Mutlu The gap is rooted in physics. When you do computation in a processor, you're manipulating electrical signals within a very small area—a few square millimeters. Signal propagation is essentially instantaneous at those scales. But memory is physically separated from the processor. DRAM chips sit on DIMMs that are centimeters away. Light travels about thirty centimeters in a nanosecond, so even at the speed of light, there's a fundamental lower bound on latency imposed by distance. Add to that the internal architecture of DRAM—row activation, column access, precharge cycles—and you get access latencies on the order of fifty to a hundred nanoseconds, while a processor cycle is under a nanosecond. That's two orders of magnitude difference.
Sam Dietrich Can we reduce that gap by integrating memory closer to the processor? Stacked DRAM, high-bandwidth memory, even embedding memory directly in the processor die?
Dr. Onur Mutlu Yes, and we're seeing that with technologies like HBM—high-bandwidth memory—which stacks DRAM dies vertically and connects them to the processor through silicon interposers. This reduces distance, which lowers latency and increases bandwidth. But there are limits. DRAM manufacturing requires different processes than logic, so integrating them on the same die is challenging. You can stack them using 3D integration, but that adds cost and thermal management issues. And even with stacking, you're still limited by the internal latency of DRAM itself. The fundamental access mechanism—charging and discharging capacitors—takes time. You can't eliminate that without changing the memory technology.
Kara Rousseau So the memory hierarchy exists because we can't build a single level of storage that's both large and fast. We use small, fast caches close to the processor and large, slow main memory farther away. How do we decide where to draw the boundaries between these levels?
Dr. Onur Mutlu The boundaries are determined by cost, technology constraints, and the performance characteristics of different memory types. SRAM, which is used for caches, is fast—access times under a nanosecond—but expensive in terms of area. You can fit maybe tens of megabytes on a processor die. DRAM is denser and cheaper per bit, but slower. You use it for main memory where you need gigabytes or terabytes. Flash is even denser and cheaper, but much slower, so it's used for storage. The hierarchy reflects these trade-offs. You put the most frequently accessed data in the fastest, smallest cache, and less frequently accessed data in larger, slower levels.
Sam Dietrich That assumes your program has locality—that it accesses the same data repeatedly or accesses nearby data in sequence. What happens when locality is poor? Are there workloads where the memory hierarchy fundamentally doesn't help?
Dr. Onur Mutlu Absolutely. Graph traversal is a canonical example. You're following pointers through a large data structure with little spatial or temporal locality. Each access is essentially random from the cache's perspective. The cache can't predict what you'll access next, so it can't prefetch effectively, and the working set is too large to fit in cache. You end up with most accesses going to main memory, and the processor stalls waiting for data. Similarly, streaming workloads that process large amounts of data once—think video encoding or scientific simulations—don't benefit much from caching because there's no reuse.
Kara Rousseau For those workloads, is the solution to increase bandwidth rather than reduce latency? If you can't make individual accesses faster, at least you can overlap many accesses in flight.
Dr. Onur Mutlu That's one approach, and it's why we've seen enormous investment in memory bandwidth. Modern processors have multiple memory controllers, wide data buses, and support for many outstanding memory requests. GPUs take this to an extreme—they have very wide memory interfaces, hundreds of bits, to feed thousands of execution units. But bandwidth provisioning has costs. Wider buses consume more power, both in the I/O circuits and in the memory chips themselves. And at some point, you're limited by pin count on the package and by signal integrity at high data rates.
Sam Dietrich Let's talk about DRAM itself. What are the fundamental limits of DRAM technology, and are we approaching them?
Dr. Onur Mutlu DRAM stores each bit in a capacitor, and the charge in that capacitor leaks over time, which is why it needs to be refreshed periodically. As we scale DRAM to smaller process nodes to increase density, the capacitors get smaller, which means they hold less charge. That makes them more vulnerable to noise, radiation, and temperature variations. We're seeing increasing error rates in DRAM, most notably the row hammer phenomenon, where repeatedly activating one row accelerates charge leakage in physically adjacent rows through electrical interference between cells, causing bit flips that attackers can exploit. We're also hitting limits on refresh overhead—the fraction of time spent refreshing memory rather than serving accesses—which reduces effective bandwidth.
Kara Rousseau Row hammer is interesting because it's a physical phenomenon that creates a security vulnerability. Can we fix it in hardware, or do we need software mitigations?
Dr. Onur Mutlu Ideally, you'd fix it in the DRAM itself, but that's difficult without changing the fundamental architecture. Some DRAM manufacturers have implemented targeted refresh—when you access a row frequently, the controller also refreshes adjacent rows to prevent bit flips. But this adds latency and complexity. Another approach is error-correcting codes, but traditional ECC can only correct single-bit errors per word. Row hammer can cause multiple errors. We've proposed more sophisticated error correction and memory isolation techniques, but they all have performance and cost trade-offs.
Sam Dietrich Are there alternative memory technologies that could replace DRAM? Phase-change memory, resistive RAM, magnetic RAM—do any of these have the potential to break the DRAM bottleneck?
Dr. Onur Mutlu Each of these technologies has interesting properties. Phase-change memory is non-volatile and has better scalability than DRAM, but it's slower and has limited write endurance. Resistive RAM and magnetic RAM—specifically spin-transfer torque MRAM—are also non-volatile and potentially faster than flash, but they're not yet competitive with DRAM on access latency or cost. The challenge is that DRAM has decades of manufacturing optimization behind it. To displace it, a new technology needs to be not just better in one dimension, but significantly better across multiple dimensions—speed, density, cost, endurance. So far, none of the emerging technologies have achieved that.
Kara Rousseau Let's shift to architecture. One proposal is processing-in-memory—putting compute capabilities into the memory itself to avoid moving data. What's the state of that approach?
Dr. Onur Mutlu Processing-in-memory has been proposed many times over the decades, and we're finally seeing commercial implementations. The idea is that if data movement is expensive, you do computation where the data already is. You could put simple processing elements inside DRAM chips or in the memory controller. For operations like bulk data manipulation—initializing memory, copying data, performing simple reductions—this can be much more efficient than moving data to the CPU. But there are challenges. DRAM process technology isn't optimized for logic, so the compute elements are less efficient than what you'd get in a CPU. And programming models are unclear—how do you express which computations happen in memory versus in the processor?
Sam Dietrich That sounds like a return to the heterogeneity theme. You have different compute resources with different capabilities, and the burden is on the programmer or compiler to partition work appropriately.
Dr. Onur Mutlu Exactly. And that's been the stumbling block for many processing-in-memory proposals. If it requires rewriting software to expose the memory-side computation, the adoption barrier is high. What we've seen work is transparent acceleration of specific operations—memory controllers that can do in-memory copies or pattern matching without requiring application changes. But for general-purpose processing-in-memory, we don't yet have the right abstractions.
Kara Rousseau Another architectural approach is non-uniform memory access, or NUMA, where you have multiple memory controllers and the latency depends on which processor and which memory bank you're accessing. Does NUMA help with the bandwidth problem?
Dr. Onur Mutlu NUMA helps with bandwidth scaling because you can have multiple memory controllers operating in parallel. But it makes programming harder because now you have to think about data placement. If a thread on one socket accesses data in memory attached to another socket, you pay a latency penalty for the inter-socket communication. Operating systems and runtime systems try to manage this—placing data near the threads that use it—but it's not always possible, especially with dynamic workloads where access patterns change over time.
Sam Dietrich Let's talk about prefetching. If latency is the problem, can we predict what data will be needed and fetch it early to hide the latency?
Dr. Onur Mutlu Prefetching is critical in modern processors. Hardware prefetchers monitor access patterns and try to anticipate what will be accessed next. For sequential access patterns—iterating through an array—prefetching works very well. You can fetch the next cache line before the processor asks for it. For more complex patterns—strided accesses, indirect accesses—prefetching is harder. You need more sophisticated prediction mechanisms, and you risk prefetching data that won't be used, which wastes bandwidth and can evict useful data from the cache. There's also software prefetching, where the compiler or programmer inserts explicit prefetch instructions, but that requires knowing the access pattern ahead of time.
Kara Rousseau So prefetching is effective when you have predictable access patterns, but many interesting workloads don't have that predictability. Is there a fundamental limit to how much latency we can hide?
Dr. Onur Mutlu There is. If your program has dependencies—the next operation depends on the result of the current one—you can't prefetch because you don't know what you'll need until you've computed it. Pointer chasing is the extreme case: you load a pointer, use it to compute an address, load from that address to get the next pointer, and so on. Each step depends on the previous one, so there's no parallelism to exploit, and prefetching can't help. For these workloads, the only solution is reducing latency, not hiding it.
Sam Dietrich What about bandwidth? Are we approaching physical limits on how much data we can move per second?
Dr. Onur Mutlu We're hitting practical limits, yes. Signaling rates on memory interfaces are in the gigahertz range now, and at those frequencies, signal integrity becomes a major challenge. You have to worry about crosstalk, impedance matching, power delivery. Increasing bandwidth further requires either wider buses, which means more pins and more power, or higher signaling rates, which makes the physical layer harder. There's also the power cost of data movement itself. Moving a bit of data across a chip consumes energy proportional to the capacitance of the wire. Off-chip communication is even more expensive. At some point, the power budget limits how much bandwidth you can afford.
Kara Rousseau It sounds like we're in a regime where optimizations at one level—faster processors, larger caches, higher bandwidth—are running into physical and economic limits. Does that mean we need to rethink the problem from first principles? Maybe accept that data movement is expensive and design algorithms and systems around that constraint?
Dr. Onur Mutlu That's increasingly necessary. Algorithm design has traditionally focused on computational complexity—how many operations are required. But in the memory-bound regime, we need to think about data movement complexity. How much data do you have to move, and can you structure your algorithm to minimize that? This leads to techniques like cache-oblivious algorithms, which are designed to perform well across different levels of the memory hierarchy without knowing the specific cache sizes. It also motivates approximation—if you can accept a less precise answer, can you compute it with less data movement? For some applications, that trade-off is acceptable.
Sam Dietrich Approximation is interesting, but it requires domain knowledge to know where you can relax precision. It's not a general solution.
Dr. Onur Mutlu Agreed. But I think the broader point is that we can't treat memory as if it were free or infinitely fast. The abstractions we've built—uniform memory access, the illusion that all memory is equally available—are leaking. Programmers and system designers need to be aware of the memory hierarchy and design accordingly. That might mean using data structures that fit in cache, or organizing computation to maximize reuse, or choosing algorithms based on memory access patterns, not just operation counts.
Kara Rousseau So the memory wall forces us to abandon some of the abstractions that made programming tractable. We're back to worrying about hardware details that we thought we'd abstracted away.
Dr. Onur Mutlu In performance-critical domains, yes. For many applications, current systems are fast enough, and the abstractions are fine. But for high-performance computing, data analytics, machine learning—where you're pushing the limits of what's computationally feasible—you have to engage with the hardware. That's not necessarily a bad thing. It means there's opportunity for significant performance improvements through careful system design. But it does require more expertise and more effort.
Sam Dietrich Dr. Mutlu, this has been an illuminating discussion. Thank you for your time and expertise.
Dr. Onur Mutlu Thank you. It's been a pleasure.
Kara Rousseau That's our program for tonight. Until tomorrow, remember that in computing, data locality isn't just an optimization—it's survival.
Sam Dietrich And that the speed of light imposes limits no engineering cleverness can fully evade. Good night.
Sponsor Message

MemBridge Coherence Fabrics

Memory latency costs performance. Memory bandwidth costs power. MemBridge delivers cache-coherent interconnects for distributed memory systems—multi-socket servers, accelerator clusters, disaggregated architectures. Our fabrics provide hardware-enforced coherence protocols, adaptive routing for congestion management, and quality-of-service guarantees for mixed workloads. Low-latency message passing, RDMA support, and transparent memory pooling across chassis boundaries. We handle the complexity of coherence so your applications see uniform memory access semantics at scale. From two sockets to two hundred nodes. MemBridge—because data movement is the real bottleneck.