Episode #2 | December 18, 2025 @ 4:00 PM EST

Power Walls and Performance Ceilings: Life After Dennard Scaling

Guest

Dr. Mark Horowitz (Electrical Engineer, Stanford University)
Announcer The following program features simulated voices generated for educational and technical exploration.
Sam Dietrich Good evening. I'm Sam Dietrich.
Kara Rousseau And I'm Kara Rousseau. Welcome to Simulectics Radio.
Sam Dietrich Tonight we're examining what happened when the fundamental bargain of processor design broke down. For decades, Dennard scaling meant that as transistors got smaller, you could run them faster without increasing power density. Shrink the dimensions by thirty percent, reduce the voltage proportionally, and you get forty percent better performance at constant power per unit area. That ended around two thousand six. Voltage scaling hit a wall—you can't reduce gate threshold voltages arbitrarily without leakage currents overwhelming active switching power. The question is what architectural strategies remain when you can't simply clock everything faster.
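Sam's numbers can be sanity-checked with the textbook dynamic-power model, P = C·V²·f. A minimal sketch, assuming the classic first-order scaling rules with scale factor k ≈ 0.7:

```python
# Classic Dennard scaling: shrink linear dimensions and supply voltage
# by k = 0.7 per generation.
k = 0.7
cap = k        # gate capacitance scales with linear dimensions
volt = k       # supply voltage scales down with dimensions
freq = 1 / k   # gate delay shrinks, so frequency can rise ~1.4x
area = k ** 2  # transistor footprint shrinks ~2x

power = cap * volt ** 2 * freq  # dynamic power per transistor: C * V^2 * f
density = power / area          # power per unit area

print(f"frequency gain: {freq:.2f}x")     # ~40% faster
print(f"power density:  {density:.2f}x")  # ~constant
```

Roughly forty percent more frequency at constant power density, which is exactly the bargain described above; the bargain breaks once volt can no longer follow k downward.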
Kara Rousseau And whether performance itself needs redefinition. We've trained ourselves to think performance means clock frequency, but that was always a proxy for what we actually care about—work completed per unit time, per watt, per dollar. When the easy path disappears, maybe we discover that we were optimizing the wrong metric all along.
Sam Dietrich To understand both the physics of what broke and the engineering of what comes next, we're joined by Dr. Mark Horowitz, Professor of Electrical Engineering at Stanford, whose work spans circuit design, computer architecture, and the energy efficiency of computation. Dr. Horowitz, welcome.
Dr. Mark Horowitz Thank you. Glad to be here.
Kara Rousseau Let's start with the mechanism. Why did voltage scaling stop working? What's the physical constraint that prevents us from continuing to reduce supply voltages?
Dr. Mark Horowitz The fundamental issue is subthreshold leakage. A transistor is supposed to be off when the gate voltage is below the threshold, but carriers still diffuse thermally across the channel, so some current flows anyway. That subthreshold current increases exponentially as you reduce the threshold voltage. At the same time, you need some minimum threshold voltage to ensure the transistor actually turns off. When supply voltages drop below about seven hundred millivolts, the ratio between on-current and off-current degrades to the point where the transistor stops being an effective switch. You leak too much power even when supposedly idle.
Sam Dietrich So there's a floor imposed by the physics of the transistor itself—specifically, the subthreshold swing, the minimum gate-voltage change needed to modulate the current by a factor of ten. At room temperature, that's about sixty millivolts per decade, and you need several decades of on-off ratio to build reliable logic. That sets a lower bound on threshold voltage, which in turn sets a lower bound on supply voltage.
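The sixty-millivolts-per-decade figure makes the floor concrete: each sixty millivolts of threshold voltage buys roughly one decade of on/off current ratio. A small sketch of that arithmetic, using only the ideal room-temperature swing as input:

```python
import math

# Ideal subthreshold swing at room temperature: (kT/q) * ln(10) ≈ 60 mV
# per decade of current. Real transistors are somewhat worse than ideal.
k_B = 1.380649e-23     # Boltzmann constant, J/K
q = 1.602176634e-19    # electron charge, C
S = (k_B * 300 / q) * math.log(10)  # volts per decade, ~0.0595 V

def on_off_ratio(vth):
    """Approximate on/off current ratio for threshold voltage vth (volts)."""
    return 10 ** (vth / S)

for vth_mv in (150, 300, 450):
    print(f"Vth = {vth_mv} mV -> on/off ratio ≈ 10^{vth_mv / 1000 / S:.1f}")
```

A 300 mV threshold gives about five decades of on/off ratio; halve the threshold and you lose half of those decades, which is why supply and threshold voltages stalled where they did.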
Dr. Mark Horowitz Exactly. And once voltage scaling stops, you lose the mechanism that kept power density constant as you added more transistors. You can still shrink transistors and fit more per unit area, but now they're all consuming power, and you can't cool the chip fast enough. This is the origin of the power wall—the observation that we have dark silicon, transistors we physically can't turn on simultaneously because the chip would melt.
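The dark-silicon arithmetic can be sketched in a few lines. Assume, pessimistically, that per-transistor switching power stays fixed once voltage stops scaling while transistor counts keep doubling (real process nodes still shave some capacitance each generation, so this overstates the trend):

```python
# With a fixed chip power budget and fixed per-transistor switching power,
# doubling the transistor count each generation halves the fraction of
# the chip that can switch simultaneously.
budget = 1.0       # fixed thermal/power budget (normalized)
transistors = 1.0  # normalized transistor count
fractions = []
for gen in range(5):
    fractions.append(min(1.0, budget / transistors))
    transistors *= 2

for gen, f in enumerate(fractions):
    print(f"generation {gen}: usable fraction ≈ {f:.4f}")
```

After four doublings, under these assumptions, barely six percent of the chip can be lit at once; the rest is the dark silicon Dr. Horowitz describes.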
Kara Rousseau Dark silicon is such a strange situation from a software perspective. We have the transistors, they're manufactured and working, but we deliberately leave them idle because of thermal constraints. It's like having a datacenter where you can't power on all the servers. What does that imply for how we think about computation?
Dr. Mark Horowitz It means we need to be selective about what we accelerate. You can't build a faster general-purpose core because you're power-limited. But you can build specialized accelerators that do specific tasks much more efficiently—less energy per operation—and use the power budget to run those when needed. The future is heterogeneous: many different types of computational units that excel at different tasks, with the system dynamically choosing which to activate based on workload.
Sam Dietrich This is where I want to push on the notion of efficiency. A GPU is more energy-efficient than a CPU for certain workloads, but it achieves that by exploiting parallelism and regularity. You need thousands of identical operations on different data. That's not always available. How general-purpose can these specialized accelerators actually be?
Dr. Mark Horowitz That's the right question. There's a fundamental tension between generality and efficiency. A Turing-complete processor can run any program, but it carries overhead from that flexibility—instruction fetch and decode, branch prediction, speculative execution. An accelerator discards flexibility to eliminate overhead. The challenge is identifying which operations are common enough to justify dedicated hardware. Matrix multiplication for machine learning is an obvious candidate. Video encoding, cryptography, compression—these have well-defined, high-volume use cases. But if you try to accelerate everything, you're back to building a general-purpose processor.
Kara Rousseau And you face an abstraction problem. How does software discover and utilize these accelerators? Do we expose them through the instruction set, through library calls, through compiler intrinsics? Each choice has implications for portability and forward compatibility. If my program is written for today's accelerators, does it still run efficiently on next year's chip with a different mix of specialized units?
Dr. Mark Horowitz The industry hasn't converged on a single answer. GPUs hide their complexity behind APIs like CUDA or OpenCL, but that creates vendor lock-in and requires programmers to think explicitly about parallelism. Neural network accelerators are typically accessed through frameworks like TensorFlow that compile high-level graphs to hardware. We're still experimenting with the right abstraction layers.
Sam Dietrich One thing that strikes me about the post-Dennard era is that it privileges certain kinds of computation. Embarrassingly parallel workloads benefit enormously from having many simple cores or accelerators. Deeply serial, dependent computations don't parallelize and can't be accelerated. Are we creating a computational divide where some problems get exponentially better while others stagnate?
Dr. Mark Horowitz Absolutely. We're seeing divergence in performance trends based on problem structure. Machine learning inference is hundreds of times faster than it was ten years ago, but single-threaded integer code is maybe twice as fast. That's a profound shift from the era when all programs rode the same Moore's Law curve together. It means the nature of the problem increasingly determines achievable performance, not just the quality of the implementation.
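The divide Sam raises is Amdahl's law in action: the serial fraction of a workload bounds the speedup achievable from any number of parallel units. A minimal illustration:

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Overall speedup when only parallel_fraction of the work
    can be spread across n_units (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

# Even a 95%-parallel workload saturates near 20x, no matter how wide
# the machine gets; a 50%-serial workload can never even double.
for n in (10, 100, 1000, 10000):
    print(f"{n:6d} units -> {amdahl_speedup(0.95, n):6.2f}x")
```

This is why the machine-learning curve and the single-threaded curve diverged: one workload has a vanishing serial fraction, the other does not.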
Kara Rousseau Which has implications for algorithmic design. If I can reformulate my problem to expose more parallelism or regularity, I might unlock orders of magnitude in performance. The algorithm isn't just about asymptotic complexity anymore—it's about mapping to available hardware. That's a very different optimization space.
Dr. Mark Horowitz And it's not static. As hardware capabilities evolve, the optimal algorithmic strategy changes. An algorithm designed for single-core execution might be suboptimal on a many-core system, and vice versa. We're seeing co-evolution of algorithms and architectures, which makes the system harder to reason about but potentially more efficient.
Sam Dietrich Let's talk about memory. Even if we solve the power problem and can run all our transistors, we hit the memory wall—the observation that memory bandwidth and latency haven't scaled as fast as compute capability. Are there architectural solutions, or is this a fundamental bottleneck?
Dr. Mark Horowitz Memory is the defining constraint of modern computing. DRAM latency has barely improved in twenty years. Bandwidth increases through wider buses and higher clock rates, but that consumes power and area. The energy cost of moving data from DRAM to the processor is now much higher than the cost of the actual computation. This is why caches are so critical—you can't afford to go to main memory frequently. It's also why we're seeing more computation moved to the data, with processing-in-memory or near-memory architectures.
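Dr. Horowitz has quantified this point elsewhere; the figures below are the widely reproduced 45 nm estimates from his 2014 ISSCC keynote on computing's energy problem, and should be read as order-of-magnitude values rather than exact costs:

```python
# Approximate per-operation energies at 45 nm (order-of-magnitude values
# from Horowitz, ISSCC 2014). The headline: fetching a word from DRAM
# costs thousands of times more energy than adding two integers.
energy_pj = {
    "32-bit integer add":      0.1,
    "32-bit float multiply":   3.7,
    "32-bit SRAM read (8 KB)": 5.0,
    "32-bit DRAM read":        640.0,
}

base = energy_pj["32-bit integer add"]
for op, pj in energy_pj.items():
    print(f"{op:24s} {pj:7.1f} pJ  ({pj / base:8.0f}x an integer add)")
```

The ratios, not the absolute numbers, are the point: even a cache hit is an order of magnitude costlier than the arithmetic it feeds, and a DRAM access is two orders of magnitude beyond that.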
Kara Rousseau Processing-in-memory seems compelling in principle—avoid data movement by computing where the data lives. But doesn't that require very different programming models? Memory is traditionally passive storage, not an active computational element.
Dr. Mark Horowitz Correct. Most PIM proposals require explicitly managing what computation happens in memory versus in the processor, which breaks the abstraction of uniform memory access. There are also technical challenges—DRAM process technology is optimized for density, not logic speed. Integrating high-performance logic with DRAM is difficult. Some of the most promising approaches use 3D stacking to put logic dies directly on top of memory dies, connected through high-bandwidth vertical interconnects. That gives you the bandwidth without mixing incompatible manufacturing processes.
Sam Dietrich The 3D integration path makes sense thermally—you can potentially cool both the logic and memory layers. But it adds design complexity and cost. Are we approaching a point where the marginal returns on architectural complexity outweigh the benefits?
Dr. Mark Horowitz That's a real concern. Each new optimization adds design and verification burden. We have limited engineering resources and limited time to market. At some point, the complexity becomes unsustainable. This is one reason why domain-specific architectures are attractive—if you narrow the scope, you can manage more complexity in that narrower domain. A neural network accelerator doesn't need branch prediction or cache coherence or virtual memory. Simplifying the requirements lets you optimize more aggressively.
Kara Rousseau But then we're back to the fragmentation problem. If every domain has its own architecture, how do we build coherent systems? How do these components communicate? Who manages resource allocation between them?
Dr. Mark Horowitz These are open questions. Current solutions are ad-hoc—each accelerator has its own driver, its own memory management, its own interface to the rest of the system. There's work on unified memory spaces and coherent interconnects, but it's hard. You're trying to provide flexibility without sacrificing the efficiency that came from specialization. It's a fundamental tension.
Sam Dietrich I want to return to power for a moment. We've talked about power as a constraint, but it's also a cost—operational cost for datacenters, battery life for mobile devices. How much of the drive toward efficiency is about hitting thermal limits versus reducing energy bills?
Dr. Mark Horowitz Both matter, but in different contexts. For mobile, it's battery life and thermal comfort—users don't want phones that get hot or run out of charge. For datacenters, it's operational cost and capacity. Power delivery and cooling infrastructure are major capital expenses. If you can do the same work with half the power, you can either fit twice as many servers in the same facility or build a smaller facility. Energy efficiency directly translates to economics.
Kara Rousseau And there are environmental implications, though maybe that's secondary to the economic drivers. If computation becomes a larger fraction of global energy consumption, efficiency matters at societal scale.
Dr. Mark Horowitz It's significant and growing. Datacenters are a few percent of global electricity consumption now, and that fraction is increasing. AI training runs, in particular, are extremely energy-intensive. There's legitimate concern about whether we're creating unsustainable demand. Efficiency improvements help, but if demand grows faster than efficiency, total consumption still increases.
Sam Dietrich That leads to an interesting question about diminishing returns. We're putting enormous effort into squeezing out incremental efficiency gains—five percent here, ten percent there. At what point do we accept that we're near the practical limits of silicon-based digital logic and need fundamentally different computational substrates?
Dr. Mark Horowitz We're not close to fundamental physical limits—transistors can still get more efficient. But we are hitting practical limits of what's economically viable. The question is whether the problems we're trying to solve are worth the engineering investment. For some applications, the answer is clearly yes. For others, maybe we should accept good enough rather than pushing for optimal. That's a value judgment, not a technical one.
Kara Rousseau And it varies by domain. For scientific computing or national security applications, you might justify extraordinary expense. For consumer applications, the bar is different. The industry is stratifying into these tiers with very different performance and cost characteristics.
Sam Dietrich Dr. Horowitz, this has been tremendously insightful. Thank you for joining us.
Dr. Mark Horowitz My pleasure. Thank you for having me.
Kara Rousseau That's our program for tonight. Until tomorrow, mind the gap between compute and memory.
Sam Dietrich And question whether faster is always better. Good night.
Sponsor Message

Faraday Mesh Networks

Tired of electromagnetic interference corrupting your high-speed serial links? Faraday Mesh provides military-grade signal integrity for demanding environments. Our shielded interconnect topology uses active noise cancellation and adaptive equalization to maintain gigabit throughput in the presence of intentional jamming. Each node includes hardware authentication and encrypted channels resistant to physical layer attacks. From industrial control systems to secure communications, when your data path traverses hostile territory, Faraday Mesh ensures clean delivery. Independently certified to exceed TEMPEST standards. Faraday Mesh Networks—because copper still matters.
