Using RF communications for broadcast operations in NoC.
Speeding up parallel programming with broadcast communications based on a hybrid wireless/wired network
Context
The de facto way of programming multi/manycore chips assumes that memory is shared, and the hardware support for this is cache coherence throughout the memory hierarchy. How to scale the protocol that ensures coherence is a recurrent question, because the protocol must broadcast coherence messages to all the caches, or multicast them to an identified subset of the caches. Similarly, collective synchronizations such as barriers or condition signaling hardly scale. By nature, radio communications provide broadcast capabilities and negligible latency. They thus have the potential to disseminate information very quickly at the scale of a circuit, and so to open a way to solving these issues.
The figure above presents the typical architecture of a manycore using the RAKES project results: n × m clusters of different types connected through a NoC whose routers are also equipped with an RF transmitter and receiver. Clusters consist of p processors and a local L2 cache, and may also include a portion of the distributed last-level cache (LLC) or a DRAM controller.
Wireless links in NoCs have emerged as a solution to reduce the latency of multi-hop paths. In RAKES, we aim to solve the current challenges that impede the exploration of the promised lands expected by the parallel computing community, namely the use of broadcast capabilities for cache coherence protocols and parallel programming mechanisms. The availability of broadcast is a key feature of the project that will allow three scientific breakthroughs compared to current solutions.
- the first one is the virtualization of communication channels, which can be dynamically allocated by means of radio access techniques. We believe a relevant example of such an allocation method is CDMA (Code Division Multiple Access). With CDMA, communication addresses (e.g. cluster i) are replaced by codes, and the communication medium can be shared among the different users (i.e., computing clusters). With this approach we intend first to solve the multipath issues of the wireless channel and second to partially replace addresses by codes. However, the efficient implementation of such a code-based channel allocation requires a new, adaptive (application-aware) and energy-efficient transceiver.
- the second one addresses the cache coherence challenge in manycore architectures. At first glance one might think that a classic snooping cache protocol would be the solution, but two immediate constraints show up. First, the area and power consumption induced by RF communications do not make it possible to have a transmitter/receiver per cache, which requires the protocol to use partly the NoC and partly the radio, i.e. a hybrid solution. Second, broadcast on a wired NoC is serialized. In the case of radio, two caches connected to a transmitter/receiver and wishing to broadcast must share the frequency band. One approach would be to use CDMA, which makes it possible in particular to multicast data, a strategy that is relatively difficult to implement in a NoC but is immediately available on radio.
- the third one addresses the design and implementation of NoC-based support for the efficient execution of parallel programming primitives (locks, conditions, synchronization barriers), which can also benefit from fast broadcast/multicast mechanisms in two ways. The first is the reduction of notification latencies, and the second is the possibility of implementing cooperative scheduling policies. Such improvements depend on the distribution of processing and data over the manycore architecture, so this third topic will be explored in close collaboration with the work on the cache protocols.
The demonstration of our proposal will rely on the following achievements:
- definition and use of realistic propagation channels within the chip,
- design of a radio transceiver including advanced dynamic power management,
- realistic power estimation model including radio and wired communications, routers and network interfaces,
- a new hybrid and possibly manyfold NoC designed to minimize the latency of cache protocols and synchronization mechanisms, including new pre/post-processing capabilities,
- a network-on-chip simulator implementing wired and wireless routers and multicast/broadcast mechanisms,
- a multiprocessor simulator cooperating with the NoC simulator to run standard and real-life parallel benchmarks,
- the cost, performance and power consumption of the proposed solutions will be estimated with hardware implementations. The new mechanisms, at the cache, network interface and router levels, will be integrated in a small multicore prototype based on an open-source RISC-V processor and an open-source NoC.
- Funding: ANR
- Project website
- SLS team of the TIMA
- CAIRN team of the IRISA Lab
- MOCS team of Lab-STICC
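
As an illustration of the code-based channel allocation discussed in the first breakthrough, the following minimal sketch shows how CDMA lets two senders share one medium: each sender spreads its bits with its own orthogonal code, the medium sums the signals, and a receiver recovers one message by correlating with the matching code. This is not the RAKES transceiver design. The choice of Walsh codes, the function names, and the idealized channel (chip-synchronous, noise-free, no multipath) are all assumptions made for the example.

```python
def walsh_codes(order):
    """Return 2**order mutually orthogonal Walsh codes with chips in {+1, -1}."""
    codes = [[1]]
    for _ in range(order):
        # Sylvester construction: H_2n = [[H, H], [H, -H]]
        codes = ([row + row for row in codes]
                 + [row + [-c for c in row] for row in codes])
    return codes


def spread(bits, code):
    """Spread each data bit (0/1) over the chips of one code."""
    chips = []
    for bit in bits:
        symbol = 1 if bit else -1
        chips.extend(symbol * c for c in code)
    return chips


def superpose(*signals):
    """Idealized shared medium (assumption): chip-wise sum of all transmissions."""
    return [sum(chips) for chips in zip(*signals)]


def despread(channel, code):
    """Recover the bits sent with `code` by correlating each chip window."""
    n = len(code)
    bits = []
    for start in range(0, len(channel), n):
        window = channel[start:start + n]
        corr = sum(x * c for x, c in zip(window, code))
        bits.append(1 if corr > 0 else 0)
    return bits


# Hypothetical scenario: two clusters transmit at the same time on the same band.
codes = walsh_codes(2)          # four codes of length 4
msg_a = [1, 0, 1, 1]            # sent with codes[1]
msg_b = [0, 0, 1, 0]            # sent with codes[2]
channel = superpose(spread(msg_a, codes[1]), spread(msg_b, codes[2]))

# Any receiver that knows a code decodes that message, regardless of the other one.
decoded_a = despread(channel, codes[1])
decoded_b = despread(channel, codes[2])
```

In this model a multicast group is simply the set of receivers configured with the same code, which is why the code can partially play the role of an address. A real on-chip channel adds noise, multipath and synchronization constraints that this sketch ignores.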