Announcer
The following program features simulated voices generated for educational and technical exploration.
Sam Dietrich
Good evening. I'm Sam Dietrich.
Kara Rousseau
And I'm Kara Rousseau. Welcome to Simulectics Radio.
Sam Dietrich
Tonight we're examining network-on-chip architecture—the communication fabric that connects components within modern processors. As we've moved from single-core to manycore designs, the interconnect has become a critical bottleneck. You can have hundreds of cores, massive caches, specialized accelerators, all on one die, but if they can't communicate efficiently, performance collapses. Traditional bus architectures don't scale. Point-to-point links become impractical as core counts grow. The solution is to treat the chip like a datacenter—build a packet-switched network with routers, links, and routing protocols, all implemented in silicon. The challenge is designing this network to provide low latency, high bandwidth, and predictable performance while consuming minimal power and area.
Kara Rousseau
From the software perspective, network-on-chip architecture is invisible until it becomes the bottleneck. We write parallel programs assuming that communication happens at some cost, but we rarely think about the topology, routing algorithms, or congestion control happening beneath our abstractions. Yet these details matter enormously. A poorly designed NoC can destroy the benefits of parallelism—cores spend time waiting for data instead of computing. The abstraction we depend on is that the interconnect is fast and fair, that bandwidth scales with core count, that latency remains bounded. But maintaining these properties at scale requires sophisticated architectural design. We're building distributed systems on a single chip, complete with all the challenges of distributed systems—resource allocation, deadlock avoidance, quality of service guarantees.
Sam Dietrich
To explore these issues, we're joined by Dr. William Dally, Chief Scientist at NVIDIA and professor emeritus at Stanford. Dr. Dally pioneered many fundamental concepts in interconnection networks, including wormhole routing, virtual channels, and adaptive routing algorithms. His work spans on-chip networks, datacenter interconnects, and high-performance computing fabrics. He's also designed custom processors where NoC bandwidth and latency directly determine application performance. Dr. Dally, welcome.
Dr. William Dally
Thank you. It's a pleasure to be here.
Kara Rousseau
Let's start with fundamentals. Why did we need networks-on-chip? What made traditional bus architectures inadequate?
Dr. William Dally
Buses worked well when we had a few components that needed to communicate. A bus is a shared medium—all devices connect to the same wires, and arbitration logic decides who gets to transmit. The advantage is simplicity and broadcast capability. The disadvantage is that bandwidth doesn't scale. As you add more devices, they have to share the same fixed bandwidth. Worse, bus wires are long, so they have high capacitance and consume significant power. As processors moved to multiple cores, buses became the bottleneck. You'd have eight cores trying to access shared caches or memory controllers through a single bus, and they'd spend most of their time waiting for bus access. Point-to-point links helped—you could connect pairs of components with dedicated wires. But full mesh connectivity doesn't scale either. With N components, you need on the order of N squared links, which consumes too much area and power. The solution was to borrow ideas from computer networks. Instead of connecting everything to everything, you build a switched network with routers and shorter links. Each router connects to a few neighbors and forwards packets toward their destinations. This provides scalability—bandwidth grows with the number of routers, latency stays low because links are short, and you can trade off network complexity against performance requirements.
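[Editor's note: the wiring-cost scaling Dr. Dally describes can be made concrete with a small sketch. The function below compares link counts for a bus, full point-to-point connectivity, and a 2D mesh; the mesh formula assumes a square grid of routers, one per component.]

```python
import math

def link_counts(n: int) -> dict:
    """Compare wiring cost (number of links) for n communicating components.

    A shared bus is one medium regardless of n; full point-to-point
    connectivity needs n*(n-1)/2 dedicated links; a k x k 2D mesh
    (k = sqrt(n)) needs 2*k*(k-1) short neighbor links.
    """
    k = math.isqrt(n)
    assert k * k == n, "assume a perfect-square component count for the mesh"
    return {
        "bus": 1,
        "point_to_point": n * (n - 1) // 2,
        "mesh_2d": 2 * k * (k - 1),
    }

# For 64 components: 1 shared bus, 2016 dedicated links, or 112 mesh links.
print(link_counts(64))
```

The quadratic blow-up of point-to-point wiring against the near-linear growth of mesh links is the scalability argument in miniature.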
Sam Dietrich
What are the key design parameters for a network-on-chip? What trade-offs do you face?
Dr. William Dally
The fundamental parameters are topology, routing, flow control, and router microarchitecture. Topology determines how routers are connected—common choices include meshes, tori, and trees. Meshes are simple and have short wires, but diameter grows with system size. Routing determines the path packets take through the network. You can use simple deterministic routing, which is easy to implement but can't adapt to congestion, or adaptive routing, which routes around congestion but requires more complex routers. Flow control manages buffer allocation and prevents deadlock. You need mechanisms to handle backpressure when buffers fill up. Router microarchitecture determines how quickly packets move through each hop—how many pipeline stages, how much buffering, what switching technique. The trade-offs are multidimensional. Higher bandwidth requires more wires, which costs area and power. Lower latency requires more buffers and faster routers, which costs power and complexity. You're also constrained by physical layout—wires have to be routed on the chip, and long wires are expensive. A good NoC design balances these factors for the target application. For throughput-oriented workloads, you might tolerate higher latency to maximize bandwidth. For latency-sensitive applications, you might sacrifice some bandwidth for lower latency.
Kara Rousseau
How do you prevent deadlock in these networks? It seems like circular dependencies could easily arise.
Dr. William Dally
Deadlock is a serious concern. In a packet-switched network, deadlock occurs when packets are blocked waiting for resources in a cycle—packet A is waiting for buffers held by packet B, which is waiting for buffers held by packet C, which is waiting for buffers held by packet A. No progress can occur. There are several approaches to deadlock prevention. One is to use structured routing that provably avoids cycles. For example, dimension-ordered routing in a mesh—you route in the X dimension first, then Y. This prevents cycles because packets always make progress in one dimension before moving to the next. Another approach is virtual channels. You partition each physical channel into multiple virtual channels with separate buffers. By assigning virtual channels to different message classes and ensuring that dependencies between classes are acyclic, you can prevent deadlock while allowing more routing flexibility. A third approach is deadlock recovery—detect when deadlock occurs and break it by dropping packets or draining buffers. But recovery is complex and hurts performance, so prevention is preferred. The key insight is that deadlock is a resource allocation problem. If you allocate resources in a way that prevents circular dependencies, you prevent deadlock. The challenge is doing this efficiently without wasting resources or overly restricting routing.
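[Editor's note: dimension-ordered routing is simple enough to sketch directly. The function below computes an XY route through a 2D mesh; because a packet never turns from the Y dimension back to X, the circular buffer dependencies that cause deadlock cannot form. Coordinates and router naming are illustrative.]

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing in a 2D mesh.

    The packet fully resolves its X offset before moving in Y. Forbidding
    the Y-to-X turn breaks every potential cycle of channel dependencies,
    which is why this routing function is provably deadlock-free.
    """
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:          # X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:          # then Y, with no way back to X
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Route from router (0, 0) to (2, 1): all X hops, then the Y hop.
print(xy_route((0, 0), (2, 1)))
# → [(0, 0), (1, 0), (2, 0), (2, 1)]
```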
Sam Dietrich
What's the power cost of these networks? How much of the chip's power budget goes to the interconnect?
Dr. William Dally
In modern manycore processors, the NoC can consume a significant fraction of total power—sometimes twenty to thirty percent. This breaks down into several components. Dynamic power from switching wires and router logic dominates. Every time a bit travels across a wire or through a router, you charge and discharge capacitances, which costs energy. Longer wires cost more because capacitance scales with length. Leakage power from transistors in routers and buffers also contributes, especially at advanced process nodes. The energy per bit transported depends on network design. A well-designed NoC might consume a few picojoules per bit per millimeter. Multiply that by the traffic volume and average distance, and you get total network power. Reducing power requires several techniques. First, minimize wire length by using local communication patterns—most traffic should stay nearby rather than crossing the chip. Second, optimize router microarchitecture to reduce switching activity—techniques like clock gating, power gating, and low-swing signaling help. Third, exploit traffic patterns. If you know most communication is between neighbors, you can optimize the common case. Finally, trade off bandwidth and power. Not all links need maximum bandwidth all the time. You can use adaptive techniques to power down unused links or reduce link width when traffic is low.
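[Editor's note: the "energy per bit times traffic volume times distance" accounting Dr. Dally outlines can be sketched as a back-of-envelope calculation. All parameter values below are illustrative assumptions, not measurements of any particular chip.]

```python
def noc_dynamic_power_watts(traffic_gbps, avg_hops, mm_per_hop, pj_per_bit_mm):
    """Back-of-envelope NoC dynamic power estimate.

    Energy per bit scales with the distance it travels (capacitance grows
    with wire length); multiplying by aggregate traffic gives power.
    """
    bits_per_sec = traffic_gbps * 1e9
    mm_traveled = avg_hops * mm_per_hop
    joules_per_bit = pj_per_bit_mm * 1e-12 * mm_traveled
    return bits_per_sec * joules_per_bit

# Hypothetical chip: 500 Gb/s of traffic, 6 hops averaging 1 mm each,
# at 2 pJ per bit per millimeter:
print(noc_dynamic_power_watts(500, 6, 1.0, 2.0))  # → 6.0 (watts)
```

Halving the average distance (keeping traffic local) halves this figure directly, which is why locality is the first power lever he mentions.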
Kara Rousseau
How do you design for quality of service? Not all traffic has the same priority or latency requirements.
Dr. William Dally
Quality of service is critical in heterogeneous systems where different traffic types have different requirements. For example, coherence traffic from cache controllers might need low latency and high priority to avoid stalling cores. Memory responses might need guaranteed bandwidth to prevent starvation. Control messages might be small but urgent. QoS mechanisms provide differentiated service. One approach is priority-based arbitration. You assign priorities to traffic classes and give high-priority traffic preferential access to buffers and links. But pure priority can starve low-priority traffic, so you need fairness mechanisms. Another approach is virtual channel allocation. You dedicate virtual channels to specific traffic classes, ensuring that each class has reserved resources. This provides isolation but can waste resources if traffic is imbalanced. Rate-based QoS allocates bandwidth shares to different classes and enforces those shares through scheduling. You might guarantee that memory traffic gets at least fifty percent of bandwidth while allowing it to use more when available. The challenge is implementing these mechanisms efficiently. QoS adds complexity to routers—you need more sophisticated arbitration, bookkeeping for bandwidth allocation, and mechanisms to enforce policies. The overhead must be justified by the benefits. In practice, you design QoS mechanisms based on the traffic patterns of your target applications.
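[Editor's note: rate-based QoS can be sketched with a deficit-style arbiter. Each traffic class accrues credit in proportion to its bandwidth share; the ready class with the most credit wins the link each cycle. Class names, shares, and the one-packet-per-cycle granularity are illustrative assumptions.]

```python
from collections import deque

def rate_based_arbiter(queues, shares, cycles):
    """Deficit-style rate-based arbitration sketch.

    Credit accrues per cycle in proportion to each class's share;
    winning a cycle spends one slot of credit. Over time each class
    receives roughly its configured fraction of the link.
    """
    credit = {c: 0.0 for c in queues}
    schedule = []
    for _ in range(cycles):
        for c in queues:
            credit[c] += shares[c]          # accrue bandwidth share
        ready = [c for c in queues if queues[c]]
        if not ready:
            schedule.append(None)
            continue
        winner = max(ready, key=lambda c: credit[c])
        credit[winner] -= 1.0               # spend one slot of credit
        schedule.append(queues[winner].popleft())
    return schedule

# Memory traffic guaranteed ~50% of slots, coherence 30%, control 20%.
q = {"mem": deque(f"m{i}" for i in range(4)),
     "coh": deque(f"c{i}" for i in range(4)),
     "ctl": deque(f"x{i}" for i in range(4))}
schedule = rate_based_arbiter(q, {"mem": 0.5, "coh": 0.3, "ctl": 0.2}, 8)
print(schedule)
```

Over the eight cycles, the memory class wins half the slots, matching its guaranteed share while the other classes split the remainder.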
Sam Dietrich
Let's talk about specific topologies. What are the advantages and disadvantages of meshes versus other topologies?
Dr. William Dally
Meshes are the most common topology for NoCs because they map naturally to 2D chip layouts. In a mesh, routers are arranged in a grid, and each router connects to its four neighbors and one local processing element. The advantages are simplicity, regular layout, and short wires—all links have the same length. Routing is straightforward—dimension-ordered routing works well. The disadvantages are that diameter grows as the square root of the number of nodes, and bisection bandwidth is limited. For large systems, long paths increase latency. Tori improve on meshes by adding wraparound links, which reduce diameter and increase path diversity. But wraparound links are longer, which costs power and latency. Trees provide high bisection bandwidth near the root but can create congestion if traffic patterns don't match the tree structure. Fat trees address this by increasing bandwidth toward the root, but this costs area. Hypercubes and other high-radix topologies offer lower diameter but require more complex routers and longer wires. There's no universally optimal topology. The right choice depends on die size, traffic patterns, and performance requirements. For small chips with moderate core counts, meshes work well. For larger systems, you might use hierarchical topologies—meshes of clusters, with higher-bandwidth links between clusters.
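[Editor's note: the mesh-versus-torus comparison comes down to two numbers, diameter and bisection width, which are easy to compute. The sketch below assumes a square k x k layout; a mid-chip cut severs k links in a mesh and 2k in a torus because of the wraparound channels.]

```python
import math

def topology_metrics(n):
    """Diameter and bisection width for a k x k mesh vs torus (n = k*k).

    Mesh diameter is 2*(k-1) hops corner to corner; the torus's
    wraparound links cap each dimension's worst-case distance at
    floor(k/2), halving the diameter at the cost of long wires.
    """
    k = math.isqrt(n)
    assert k * k == n, "assume a square layout"
    return {
        "mesh_diameter": 2 * (k - 1),
        "torus_diameter": 2 * (k // 2),
        "mesh_bisection_links": k,       # links cut by a mid-chip line
        "torus_bisection_links": 2 * k,  # wraparound doubles the cut
    }

# A 64-node chip: the 8x8 torus halves the mesh's 14-hop diameter
# and doubles its bisection width.
print(topology_metrics(64))
```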
Kara Rousseau
How does cache coherence interact with NoC design? Coherence traffic seems particularly demanding.
Dr. William Dally
Coherence traffic is challenging because it's latency-sensitive, can have complex dependencies, and generates significant traffic volume in shared-memory workloads. In directory-based coherence protocols, requests go to directory controllers, which then send invalidations or data responses to sharing cores. This creates multicast traffic patterns where one message triggers multiple responses. The NoC needs to handle these patterns efficiently. One issue is ordering. Coherence protocols often require certain ordering guarantees to maintain correctness—for example, invalidations must be observed in a consistent order. The NoC must preserve these ordering requirements, either through in-order delivery on certain virtual channels or through explicit acknowledgments. Another issue is bandwidth. Coherence traffic can consume a large fraction of NoC bandwidth, especially for fine-grained sharing. If many cores are contending for the same cache line, you get invalidation storms that flood the network. Protocol design and NoC design must co-optimize. Some systems use broadcast-based coherence for small core counts, where the NoC provides efficient broadcast or multicast primitives. Others use directory protocols with carefully designed routing to avoid hotspots. You also need to consider deadlock. Coherence messages can create circular dependencies if not handled carefully. Virtual channels help isolate different message types to prevent deadlock while allowing concurrent traffic.
Sam Dietrich
What role do NoCs play in GPUs? GPU workloads seem quite different from CPU workloads.
Dr. William Dally
GPUs have different traffic patterns than CPUs, which influences NoC design. In a GPU, you have many streaming multiprocessors executing thousands of threads in parallel, accessing memory controllers and caches. The traffic is dominated by memory requests and responses, with less cache-to-cache coherence traffic since GPUs traditionally used weaker consistency models. The key challenge is bandwidth. GPU workloads can generate enormous memory traffic—hundreds of gigabytes per second per chip. The NoC needs to provide sufficient bandwidth between SMs and memory controllers without becoming the bottleneck. This requires careful design. NVIDIA GPUs use crossbar-based interconnects or high-bandwidth meshes with wide links and optimized routing. Another characteristic is regularity. Many GPU workloads have regular access patterns—threads in a warp access consecutive memory locations. The NoC can exploit this by coalescing requests, routing them efficiently, and batching responses. GPUs also have specialized interconnects for specific traffic types. For example, texture units have dedicated paths to texture caches, and the memory subsystem has separate networks for requests and responses. This specialization improves efficiency compared to a general-purpose NoC. Finally, GPUs tolerate latency better than CPUs because they rely on massive multithreading to hide memory latency. So the NoC design can optimize for bandwidth over latency, using techniques like deeper buffering and pipelined routing that would hurt latency-sensitive CPU workloads.
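[Editor's note: the request coalescing Dr. Dally mentions can be sketched in a few lines. The idea is to collapse the per-thread addresses of a warp into unique cache-line requests before they enter the network; the 128-byte line size and 32-thread warp here are illustrative assumptions, not a description of any specific GPU.]

```python
def coalesce_requests(addresses, line_bytes=128):
    """Warp-level coalescing sketch.

    Per-thread byte addresses are collapsed to the set of distinct
    cache-line base addresses they touch, so a warp with a regular
    access pattern injects far fewer packets into the NoC.
    """
    return sorted({addr // line_bytes * line_bytes for addr in addresses})

# 32 threads reading consecutive 4-byte words: one 128-byte line request
# instead of 32 separate memory transactions.
warp = [0x1000 + 4 * t for t in range(32)]
print(coalesce_requests(warp))  # → [4096]
```

A strided or scattered pattern defeats this: 32 threads touching 32 different lines generate 32 requests, which is why regular access patterns matter so much for GPU memory bandwidth.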
Kara Rousseau
How do you verify that a NoC design is correct? Deadlock freedom and livelock freedom seem difficult to prove.
Dr. William Dally
Verification is critical because subtle bugs in NoC design can cause deadlock, livelock, or protocol violations that only manifest under rare traffic patterns. Formal verification is the gold standard. You model the network as a state machine and use model checking to prove properties like deadlock freedom, livelock freedom, and message ordering. For small networks, exhaustive state space exploration works. For larger networks, you need abstraction techniques or compositional reasoning. You might verify that individual routers satisfy certain properties and then prove that composing correct routers yields a correct network. Simulation is also essential. You run traffic patterns through detailed cycle-accurate models of the NoC and check for violations. Synthetic traffic patterns—uniform random, permutation, hotspot—stress different aspects of the design. Real application traces validate performance on actual workloads. But simulation can't cover all corner cases, so you also use formal methods. Another approach is to design for verifiability. Structured routing algorithms like dimension-ordered routing are easier to verify than fully adaptive routing. Deadlock avoidance through virtual channels has well-understood conditions that can be checked mechanically. The key is to make correctness arguments as simple and compositional as possible. You also need integration testing because NoCs interact with coherence protocols, memory controllers, and other chip components. A bug in the interface between the NoC and the rest of the system can be just as bad as a bug in the NoC itself.
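[Editor's note: one of the mechanically checkable conditions Dr. Dally alludes to is the classic Dally-Seitz criterion: a routing function is deadlock-free if its channel dependency graph is acyclic. The sketch below checks that condition for a tiny hand-built graph; the channel names are illustrative.]

```python
def has_cycle(dep_graph):
    """Detect a cycle in a channel dependency graph via DFS coloring.

    An edge a -> b means a packet holding channel a may request channel b.
    An acyclic dependency graph implies deadlock freedom (Dally-Seitz).
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in dep_graph}

    def dfs(c):
        color[c] = GRAY
        for nxt in dep_graph.get(c, []):
            if color.get(nxt, WHITE) == GRAY:
                return True        # back edge: a dependency cycle exists
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False

    return any(dfs(c) for c in dep_graph if color[c] == WHITE)

# XY routing permits only X-to-Y turns, so no cycle can form:
xy_deps = {"x_east": ["y_north", "y_south"], "x_west": ["y_north"],
           "y_north": [], "y_south": []}
print(has_cycle(xy_deps))  # → False

# Allowing the Y-to-X turn back closes a cycle:
bad_deps = dict(xy_deps, y_north=["x_east"])
print(has_cycle(bad_deps))  # → True
```

For a real network the graph is extracted from the routing function over all channels, but the acyclicity check itself is exactly this.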
Sam Dietrich
Looking forward, how will NoC design evolve as we move to even larger core counts and more heterogeneous systems?
Dr. William Dally
Several trends will shape future NoC design. First, 3D integration. As we stack multiple dies in a package, the NoC needs to span vertical dimensions. Through-silicon vias enable vertical links with high bandwidth and low latency, but they're expensive and limited in number. NoC design must optimize for this asymmetry—plentiful horizontal bandwidth but scarce vertical bandwidth. Second, heterogeneity. Future chips will have CPUs, GPUs, AI accelerators, fixed-function blocks, all sharing a die. Different components have different communication patterns and requirements. The NoC might need multiple physical or virtual networks optimized for different traffic types. Third, photonics. Optical interconnects offer higher bandwidth and lower power for long-distance on-chip communication. Hybrid electrical-optical NoCs could use photonics for cross-chip links and electrical links for local communication. But photonics adds complexity—you need wavelength division multiplexing, optical-electrical conversion, and new routing techniques. Fourth, specialization. As we hit power limits, we'll see more application-specific NoCs. Instead of a general-purpose network, you design the interconnect for the specific communication patterns of your target workload. This gives better efficiency but less flexibility. Finally, AI-driven design. Machine learning could optimize NoC parameters—topology, routing, buffer allocation—for specific workloads, potentially finding designs that human designers would miss.
Kara Rousseau
What about programmability? How do we expose NoC capabilities to software without leaking too much implementation detail?
Dr. William Dally
This is a fundamental tension. Ideally, the NoC is completely transparent—software just issues loads and stores, sends messages, and the NoC delivers them. Performance might vary, but correctness doesn't depend on NoC details. But achieving good performance often requires awareness of communication costs. If your algorithm assumes uniform communication latency and the NoC has non-uniform topology, performance suffers. Some systems expose topology information to software so runtime systems can make better scheduling decisions—place communicating tasks on nearby cores, allocate resources to minimize NoC hops. This is common in HPC systems where hand-tuned applications can exploit hardware details. For general-purpose systems, you rely on abstraction. The coherence protocol hides NoC details behind cache coherence semantics. Message-passing APIs provide communication abstractions. The challenge is that NoC performance characteristics leak through these abstractions. A cache miss takes different amounts of time depending on where the data is, which creates non-deterministic performance. Some researchers propose performance contracts where the system guarantees certain latency or bandwidth bounds, making performance more predictable at the cost of utilization. In practice, we'll continue to have layered approaches. Most software stays abstracted. Performance-critical software uses libraries or languages that expose enough detail to optimize communication without requiring full knowledge of the NoC microarchitecture.
Sam Dietrich
Dr. Dally, this has been an illuminating discussion. Thank you for joining us.
Dr. William Dally
Thank you. It's been a pleasure discussing these issues.
Sam Dietrich
That's our program for this evening. Until tomorrow, remember that computation isn't just about cores—it's about connecting them.
Kara Rousseau
And that communication infrastructure determines whether parallelism becomes performance or just overhead. Good night.