However, the energy inefficiency associated with shuttling synaptic weight values between memory and processing units has posed a significant challenge for hardware implementations of these networks.
In response, researchers have explored Analog In-Memory Computing (AIMC), which keeps synaptic weights spatially instantiated on a chip and performs matrix-vector multiplications (MVMs) directly in the memory that stores them.
Overcoming Energy Inefficiency Challenges
Repeatedly moving synaptic weight values between memory and processing units is a major source of energy inefficiency in hardware implementations of artificial neural networks. AIMC avoids this transfer by executing MVMs where the weights are stored. However, end-to-end gains in latency and energy efficiency also require combining AIMC with on-chip digital operations and on-chip communication, so that full inference workloads can be processed entirely on the chip.
The IBM HERMES Project Chip: A Breakthrough in AIMC
The IBM HERMES Project Chip is a multi-core AIMC chip fabricated in 14-nm complementary metal-oxide-semiconductor (CMOS) technology and integrated with phase-change memory (PCM). This fully integrated chip comprises 64 AIMC cores interconnected through an on-chip communication network. It also includes the digital activation functions and processing elements needed for Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks.
Achieving Near Software-Equivalent Inference Accuracy
One of the primary goals of the IBM HERMES Project was to demonstrate near software-equivalent inference accuracy with AIMC. To do so, all computations associated with the weight layers and activation functions were implemented on-chip. Inference accuracy was validated with widely used ResNet and LSTM networks, and the results were close to those of the software baseline.
Impressive Throughput and Energy Efficiency
The multi-core AIMC chip reaches a maximum throughput of 63.1 TOPS (tera-operations per second) at an energy efficiency of 9.76 TOPS per watt for 8-bit input/output matrix-vector multiplications. Together, these figures indicate that the chip can sustain a high operation rate within a modest power budget.
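As a rough sanity check on these figures, the sketch below derives the power draw and per-MVM latency they imply. It rests on assumptions that the article does not state: one multiply-accumulate is counted as two operations, and all 64 cores compute one 256×256 MVM concurrently at peak throughput. The function names are illustrative only.

```python
# Back-of-the-envelope check of the reported throughput and efficiency.
# Assumptions (not from the article): 1 MAC = 2 operations, and all
# 64 cores each perform one 256x256 MVM in parallel at peak throughput.

PEAK_TOPS = 63.1       # reported peak throughput (tera-operations per second)
TOPS_PER_WATT = 9.76   # reported energy efficiency
CORES = 64             # cores on the chip
ROWS = COLS = 256      # crossbar dimensions per core


def implied_power_watts(tops: float, tops_per_watt: float) -> float:
    """Power implied by throughput divided by efficiency."""
    return tops / tops_per_watt


def ops_per_chip_mvm(cores: int, rows: int, cols: int) -> int:
    """Operations in one MVM on every core, counting 2 ops per MAC."""
    return cores * 2 * rows * cols


def implied_latency_ns(tops: float, ops: int) -> float:
    """Time to complete `ops` operations at `tops` throughput, in ns."""
    return ops / (tops * 1e12) * 1e9


power = implied_power_watts(PEAK_TOPS, TOPS_PER_WATT)
ops = ops_per_chip_mvm(CORES, ROWS, COLS)
latency = implied_latency_ns(PEAK_TOPS, ops)

print(f"implied power:       {power:.2f} W")     # roughly 6.5 W
print(f"ops per chip MVM:    {ops}")             # 8388608
print(f"implied MVM latency: {latency:.1f} ns")  # roughly 133 ns
```

Under these assumptions the two reported numbers are consistent with a chip drawing a few watts and completing a full set of MVMs in a little over a hundred nanoseconds.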
Advancements Over Previous Implementations
Early AIMC implementations relied on off-chip software or hardware for digital-to-analog and analog-to-digital conversions, activation functions, and other digital operations. The HERMES Project Chip, in contrast, integrates these functions fully on-chip. Moreover, while earlier AIMC chips were limited to small networks that could fit entirely on a single core, the HERMES Project Chip supports larger networks and offers competitive energy efficiency.
Harnessing Non-Volatile Memory for Improved Efficiency
A notable feature of the IBM HERMES Project Chip is its integration of phase-change memory (PCM), a non-volatile memory technology. PCM enables high weight capacity and density due to its analog storage capability, with only four PCM devices required to encode a weight. This characteristic addresses the challenge of holding extensive weight data on-chip, eliminating the need for off-chip weight buffers and further enhancing efficiency.
Chip Architecture
This section describes the design and layout of the IBM HERMES Project Chip: its core components, its connectivity, and its integration of phase-change memory (PCM) technology.
- Layout and Core Structure: The chip is square, measuring 12 mm × 12 mm, and contains 64 cores arranged in an 8×8 grid. Each core occupies 1.2 mm × 1.16 mm. The grid arrangement simplifies interconnection and distributes processing resources evenly.
- PCM-Based Weight Storage and Computation: At the heart of each core lies a PCM crossbar array that stores a 256×256 weight matrix and performs analog MVMs on it in place. Each unit cell contains four PCM devices, which together encode a positive or negative weight value.
- Global Digital Processing Units (GDPUs): A row of eight GDPUs sits between the fourth and fifth rows of cores and provides digital post-processing. These units handle computations needed by Long Short-Term Memory (LSTM) networks, which broadens the range of workloads the chip can run.
- Communication Architecture: A communication fabric carries data between cores. A grid of horizontal and vertical digital communication links interconnects the core outputs and GDPU inputs. The fabric uses a Parallel Prism topology with a total of 418 physical communication links.
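The layout figures listed above can be combined into a quick capacity estimate. The sketch below uses only the numbers given in this section (grid size, crossbar size, devices per weight, chip and core dimensions). The variable names are illustrative, and the area fraction is simple arithmetic rather than a figure reported by the authors.

```python
# Capacity and area arithmetic from the layout figures in this section.

CHIP_SIDE_MM = 12.0                 # chip is 12 mm x 12 mm
CORE_W_MM, CORE_H_MM = 1.2, 1.16    # footprint of one core
GRID = 8                            # cores are arranged 8 x 8
ARRAY_ROWS = ARRAY_COLS = 256       # weights per crossbar: 256 x 256
DEVICES_PER_WEIGHT = 4              # PCM devices per unit cell

cores = GRID * GRID
weights_per_core = ARRAY_ROWS * ARRAY_COLS
total_weights = cores * weights_per_core
total_devices = total_weights * DEVICES_PER_WEIGHT

# Share of the die covered by the 64 cores; the rest is available for
# GDPUs, communication links, and other circuitry.
core_area_fraction = (cores * CORE_W_MM * CORE_H_MM) / (CHIP_SIDE_MM ** 2)

print(f"cores:              {cores}")                    # 64
print(f"weights per core:   {weights_per_core}")         # 65536
print(f"total weights:      {total_weights}")            # 4194304
print(f"total PCM devices:  {total_devices}")            # 16777216
print(f"core area fraction: {core_area_fraction:.2f}")   # about 0.62
```

In other words, the chip can hold roughly 4.2 million weights on-chip, using about 16.8 million PCM devices.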
The Computational Memory Core
The computational memory core is central to the performance of the IBM HERMES Project Chip. This section gives an overview of the core’s architecture: its components, its PCM-based storage, and its role in analog MVM operations.
- PCM Crossbar Array: The core’s central component is a 256×256 crossbar array built from PCM-based unit cells. Each unit cell has four PCM devices: two represent positive weights and two represent negative weights. The phase configuration of each device determines its conductance, which is how the weight is encoded.
- Programming and Calibration: Individual conductances are programmed using a diagonal decoding scheme. A per-core programming finite-state machine (FSM) selects and programs specific devices within the crossbar array. Current-steering digital-to-analog converters (DACs) generate the pulse shapes required to program the PCM devices.
- Analog MVM Operations: After programming, the core can perform analog MVMs. The input modulator applies a pulse-width-modulated (PWM) read voltage to the PCM array, and an array of 256 time-based current ADCs digitizes the resulting analog signals. The ADCs are placed at the left and right sides of the PCM array and are designed for accurate, noise-resistant current measurement.
- Digital Post-Processing: The ADC results pass through several digital post-processing steps. The 12-bit unsigned integer outputs are converted to half-precision floating-point format (FP16), and fused multiply-add (FMA) units correct gain and offset errors. Digital processing units then apply activation functions, with ReLU available as an option.
- Inter-Core Communication: The core’s link controller sends and receives data across neighboring cores. Each core-to-core link consists of eight parallel channels that transfer data on every clock cycle. Routing tables and routing registers select the data sources and destinations for on-chip data exchange.
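The data path described in the list above (differential weight encoding, PWM input, ADC readout, FP16 conversion, multiply-add correction, ReLU) can be illustrated with a small idealized model. This is a sketch under stated assumptions, not the chip's actual implementation: conductances are in arbitrary units, the weight is split evenly across each device pair, inputs use a signed 8-bit pulse width, the ADC maps the full current range linearly onto 12-bit codes, and device noise, drift, and nonlinearity are ignored. All names here are hypothetical.

```python
import struct

G_MAX = 1.0          # max conductance of one PCM device, arbitrary units (assumed)
ADC_MAX_CODE = 4095  # 12-bit unsigned ADC output, as described above
IN_MAX = 127         # signed 8-bit pulse-width range (assumed encoding)


def encode_weight(w, w_max):
    """Map a signed weight to four conductances: (pos, pos, neg, neg)."""
    half = min(abs(w) / w_max, 1.0) * G_MAX  # even split per pair (assumed policy)
    return (half, half, 0.0, 0.0) if w >= 0 else (0.0, 0.0, half, half)


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def analog_mvm(weights, x, w_max=1.0, x_max=1.0):
    """Idealized in-memory MVM: returns ReLU(W^T x) for a rows x cols W."""
    rows, cols = len(weights), len(weights[0])
    cells = [[encode_weight(w, w_max) for w in row] for row in weights]

    # Input modulator: quantize each input to a signed pulse width.
    pulses = [round(max(-1.0, min(1.0, xi / x_max)) * IN_MAX) for xi in x]

    # Largest column signal possible, used as the ADC full-scale range.
    i_max = rows * IN_MAX * 2 * G_MAX

    # Ideal correction factors; a real chip would calibrate gain and offset.
    scale = (w_max / (2 * G_MAX)) * (x_max / IN_MAX)
    gain = 2 * i_max / ADC_MAX_CODE * scale
    offset = -i_max * scale

    out = []
    for c in range(cols):
        # Each unit cell contributes pulse * (G_pos - G_neg) to its column.
        q = sum(p * (cells[r][c][0] + cells[r][c][1]
                     - cells[r][c][2] - cells[r][c][3])
                for r, p in enumerate(pulses))
        # ADC: map [-i_max, +i_max] linearly onto codes 0..4095.
        code = round((q + i_max) / (2 * i_max) * ADC_MAX_CODE)
        # FP16 conversion, then one multiply-add with a single final rounding.
        y = to_fp16(gain * to_fp16(float(code)) + offset)
        out.append(max(0.0, y))  # ReLU
    return out


W = [[0.5, -0.5],
     [0.25, 0.5]]
result = analog_mvm(W, [1.0, -1.0])
print(result)  # close to [0.25, 0.0]; the exact result is [0.25, -1.0] before ReLU
```

The small deviation from the exact result comes from the ADC and FP16 quantization steps in the model. Splitting each weight into a positive and a negative conductance pair lets a purely positive physical quantity represent signed values, at the cost of extra devices per weight.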
Conclusion
The development of the IBM HERMES Project Chip represents a significant milestone in the realm of Analog In-Memory Computing. By addressing key challenges related to energy efficiency, weight capacity, accuracy, and on-chip operations, this multi-core AIMC chip paves the way for the realization of highly efficient and accurate neural network inference. Its integration of phase-change memory, combined with advanced analog storage and high-precision digital units, positions the chip at the forefront of AI hardware acceleration. With its exceptional throughput, energy efficiency, and inference accuracy, the IBM HERMES Project Chip offers a glimpse into the future of AI hardware innovation.
Reference: https://arxiv.org/abs/2212.02872