Advancing In-Memory Computing: The IBM HERMES Project Chip

Artificial neural networks have demonstrated remarkable capabilities in a variety of applications, from image recognition to natural language processing.

However, the energy inefficiency associated with shuttling synaptic weight values between memory and processing units has posed a significant challenge for hardware implementations of these networks.

In response to this challenge, researchers have explored Analog In-Memory Computing (AIMC), which instantiates synaptic weights spatially on a chip and performs matrix-vector multiplications (MVMs) directly on those stored weights.

This approach holds the promise of reducing energy consumption and improving inference latency. In this article, we delve into the innovative multi-core AIMC chip developed as part of the IBM HERMES Project, detailing its architecture, advancements, and impact.

Overcoming Energy Inefficiency Challenges

The need to repeatedly move synaptic weight values between memory and processing units has been a major contributor to energy inefficiency in hardware implementations of artificial neural networks. AIMC addresses this by executing MVMs directly on the weights where they are stored on the chip, which removes the energy-intensive weight transfer. However, realizing complete end-to-end improvements in latency and energy efficiency requires combining AIMC with on-chip digital operations and communication, so that full inference workloads can be processed entirely on the chip.
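To make the idea of computing "on the stored weights" concrete, the following is a minimal sketch of an idealized crossbar MVM. It assumes that inputs are applied as voltages on the rows and that each column sums the currents of its devices (Ohm's law per device, Kirchhoff's current law per column). The function name, the row/column convention, and the noise-free behavior are assumptions made for illustration, not details taken from the chip.

```python
def crossbar_mvm(conductances, voltages):
    """Idealized crossbar: current on column j = sum over rows i of G[i][j] * V[i].

    conductances: list of rows, each a list of per-device conductances (the weights)
    voltages:     one input voltage per row
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("one input voltage per row is required")
    return [
        sum(conductances[i][j] * voltages[i] for i in range(n_rows))
        for j in range(n_cols)
    ]


# Hypothetical 2x2 example in normalized units.
G = [[1.0, 2.0],
     [3.0, 4.0]]
V = [0.5, 0.25]
currents = crossbar_mvm(G, V)
```

In this picture the weights never leave the array: only the input vector goes in and the column sums come out, which is where the saving in data movement comes from.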

The IBM HERMES Project Chip: A Breakthrough in AIMC

The IBM HERMES Project introduces a groundbreaking multi-core AIMC chip fabricated using 14-nm complementary metal-oxide-semiconductor (CMOS) technology and integrated with phase-change memory (PCM). This fully integrated chip comprises 64 AIMC cores interconnected through an on-chip communication network. Notably, the chip also includes the digital activation functions and processing elements needed for Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks.

Achieving Near Software-Equivalent Inference Accuracy

One of the primary goals of the IBM HERMES Project was to demonstrate near software-equivalent inference accuracy with AIMC. This was accomplished through the implementation of computations associated with weight layers and activation functions entirely on-chip. Inference accuracy was validated with widely used ResNet and LSTM networks. The chip exhibited outstanding accuracy results, rivaling those achieved by digital accelerators.

Impressive Throughput and Energy Efficiency

The multi-core AIMC chip achieved exceptional performance metrics. It showcased a maximum throughput of 63.1 TOPS (Tera Operations Per Second) while maintaining an energy efficiency of 9.76 TOPS per watt for 8-bit input/output matrix-vector multiplications. These metrics highlight the chip’s ability to process complex operations swiftly while maintaining high energy efficiency, a critical factor in contemporary computing solutions.
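As a rough sanity check, the two quoted figures can be combined to estimate the implied power draw and the energy per operation. This sketch assumes that both numbers describe the same operating point, which the article implies but does not state explicitly. The function names are illustrative.

```python
# Figures quoted in the article for 8-bit input/output MVMs.
THROUGHPUT_TOPS = 63.1
EFFICIENCY_TOPS_PER_W = 9.76


def implied_power_watts(tops, tops_per_watt):
    """Power = throughput / efficiency (the TOPS units cancel, leaving watts)."""
    return tops / tops_per_watt


def energy_per_op_picojoules(tops_per_watt):
    """1 TOPS/W equals 1e12 operations per joule, so energy per op is 1/x pJ."""
    return 1.0 / tops_per_watt


power_w = implied_power_watts(THROUGHPUT_TOPS, EFFICIENCY_TOPS_PER_W)
energy_pj = energy_per_op_picojoules(EFFICIENCY_TOPS_PER_W)
print(f"{power_w:.2f} W, {energy_pj:.4f} pJ per operation")
```

Under that assumption the chip would draw roughly 6.5 W at peak throughput, at about 0.1 pJ per operation.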

Advancements Over Previous Implementations

Early AIMC implementations integrated digital-to-analog and analog-to-digital conversions, activation functions, and other digital operations through off-chip software or hardware. The HERMES Project Chip, in contrast, represents a substantial leap by fully integrating these functionalities on-chip. Moreover, while earlier AIMC chips were limited to small networks that could fit entirely on a core, the HERMES Project Chip supports larger networks and offers competitive energy efficiency.

Harnessing Non-Volatile Memory for Improved Efficiency

A notable feature of the IBM HERMES Project Chip is its integration of phase-change memory (PCM), a non-volatile memory technology. PCM enables high weight capacity and density due to its analog storage capability, with only four PCM devices required to encode a weight. This characteristic addresses the challenge of holding extensive weight data on-chip, eliminating the need for off-chip weight buffers and further enhancing efficiency.
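The article describes each weight as encoded by four PCM devices, two for positive and two for negative values. The sketch below shows one way such a differential encoding could work. The normalized conductance range `G_MAX` and the rule of filling one device before using the second are assumptions for illustration; the article does not say how a weight is split across the two devices of each polarity.

```python
G_MAX = 1.0  # hypothetical normalized maximum conductance of a single device


def encode_weight(w, g_max=G_MAX):
    """Map a signed weight to (g_pos_a, g_pos_b, g_neg_a, g_neg_b).

    Assumed scheme: the magnitude goes onto the pair matching the sign,
    filling the first device up to g_max before using the second.
    """
    limit = 2 * g_max
    if not -limit <= w <= limit:
        raise ValueError("weight exceeds what two devices can represent")
    magnitude = abs(w)
    first = min(magnitude, g_max)
    second = magnitude - first
    if w >= 0:
        return (first, second, 0.0, 0.0)
    return (0.0, 0.0, first, second)


def decode_weight(cell):
    """Effective weight = total positive conductance minus total negative conductance."""
    g_pos_a, g_pos_b, g_neg_a, g_neg_b = cell
    return (g_pos_a + g_pos_b) - (g_neg_a + g_neg_b)


cell_pos = encode_weight(1.5)
cell_neg = encode_weight(-0.25)
```

Here `decode_weight(cell_pos)` returns the original 1.5 and `decode_weight(cell_neg)` returns -0.25, showing how a signed value can be recovered from conductances that are themselves always non-negative.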

Chip Architecture

The architecture of the IBM HERMES Project Chip represents a significant leap in the field of Analog In-Memory Computing (AIMC). This section describes the design and layout of the chip, detailing its core components, connectivity, and the integration of phase-change memory (PCM) technology.

  • Layout and Core Structure: The physical layout of the chip reveals a square design with dimensions measuring 12 mm × 12 mm. The chip comprises a total of 64 cores, organized in an 8×8 grid formation. Each core occupies an area of 1.2 mm × 1.16 mm, contributing to the chip’s overall architecture. The core arrangement facilitates efficient interconnection while ensuring a balanced distribution of processing power.
  • PCM-Based Weight Storage and Computation: At the heart of each core lies a PCM crossbar array, capable of storing a 256×256 weight matrix. This matrix storage is essential for performing analog MVMs directly within the core. PCM devices enable analog weight storage, with each unit-cell accommodating four PCM devices for encoding both positive and negative weight values.
  • Global Digital Processing Units (GDPUs): Strategically positioned between the fourth and fifth rows of cores, a row of eight GDPUs provides essential digital post-processing capabilities. These units play a crucial role in handling computations related to Long Short-Term Memory (LSTM) networks, enhancing the chip’s versatility.
  • Communication Architecture: The architecture of the chip revolves around a communication fabric that facilitates data exchange between cores. A grid of horizontal and vertical digital communication links interconnects the core outputs and GDPU inputs. This setup enables a Parallel Prism communication fabric topology, enhancing data transmission efficiency and supporting a total of 418 physical communication links.
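The figures above (64 cores, a 256×256 crossbar per core, four PCM devices per weight) determine the chip's total weight capacity. The short calculation below works this out and adds a naive estimate of how many cores a larger layer would occupy if it were split into 256×256 tiles. The tiling helper `cores_for_layer` is a hypothetical illustration; the article does not describe the chip's actual mapping strategy.

```python
import math

CORES = 64
ROWS = COLS = 256
DEVICES_PER_WEIGHT = 4

weights_per_core = ROWS * COLS
total_weights = CORES * weights_per_core
total_devices = total_weights * DEVICES_PER_WEIGHT


def cores_for_layer(in_features, out_features, tile=ROWS):
    """Naive tiling: number of tile x tile crossbars needed to hold a weight matrix."""
    return math.ceil(in_features / tile) * math.ceil(out_features / tile)


print(f"{weights_per_core:,} weights per core")
print(f"{total_weights:,} weights on chip, {total_devices:,} PCM devices")
print(cores_for_layer(512, 1000), "cores for a hypothetical 512x1000 layer")
```

This gives 65,536 weights per core and about 4.2 million weights across the chip, stored in about 16.8 million PCM devices. A layer that does not fit in one crossbar has to be spread over several cores, which is why the on-chip communication fabric matters.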

The Computational Memory Core

The computational memory core lies at the heart of the IBM HERMES Project Chip’s performance. This section provides an overview of the core’s architecture, highlighting its components, PCM-based storage, and its role in analog MVM operations.

  • PCM Crossbar Array: The core’s central component is a 256×256 crossbar array, constructed using PCM-based unit-cells. Four PCM devices are allocated per unit-cell, with two devices dedicated to representing positive weights and two for negative weights. The phase configuration of the PCM devices determines their conductance, allowing for precise weight encoding.
  • Programming and Calibration: To program individual conductances, a diagonal decoding scheme is employed. A per-core programming finite-state machine (FSM) manages the programming process, enabling the selection and programming of specific devices within the crossbar array. Current-steering digital-to-analog converters (DACs) execute the programming, generating the various pulse shapes required for PCM device programming.
  • Analog MVM Operations: After programming, the core is capable of performing analog MVMs. A pulse-width modulated (PWM) read voltage is applied to the PCM array through the input modulator. The resulting analog signals are digitized by an array of 256 time-based current ADCs. These ADCs, placed at the left and right sides of the PCM array, contribute to accurate and noise-resistant current measurements.
  • Digital Post-Processing: The measured results from the ADCs undergo a series of digital post-processing steps. These include converting 12-bit unsigned integer outputs to half-precision floating-point format (FP16) and using fused multiply-add (FMA) units to eliminate gain and offset errors. Digital processing units handle activation functions, including an optional ReLU operation.
  • Inter-Core Communication: The core’s link controller enables data transmission and reception across neighboring cores. Each core-to-core link consists of eight parallel channels transferring data at every clock cycle. Routing tables and routing registers facilitate the selection of data sources and destinations for efficient on-chip data exchange.
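The digital post-processing chain described above (12-bit unsigned ADC output, conversion to FP16, a multiply-add that corrects gain and offset, and an optional ReLU) can be sketched in a few lines. The per-column `gains` and `offsets` and the function names are assumptions for illustration; in practice such values would come from calibration, which the article does not detail. FP16 rounding is emulated with Python's `struct` format code `"e"`.

```python
import struct

ADC_MAX = (1 << 12) - 1  # largest 12-bit unsigned ADC count (4095)


def to_fp16(x):
    """Round a Python float to the nearest half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def postprocess(adc_counts, gains, offsets, relu=True):
    """Apply y = count * gain + offset per column, then an optional ReLU.

    The multiply-add mirrors the affine gain/offset correction in the text.
    FP16 has an 11-bit significand, so odd counts above 2048 are rounded here.
    """
    results = []
    for count, gain, offset in zip(adc_counts, gains, offsets):
        if not 0 <= count <= ADC_MAX:
            raise ValueError("ADC count outside the 12-bit range")
        y = to_fp16(to_fp16(count) * gain + offset)
        if relu:
            y = max(y, 0.0)
        results.append(y)
    return results


# Hypothetical calibration: a gain of 0.5 and an offset that maps mid-scale to zero.
counts = [2048, 1000, 3000]
outputs = postprocess(counts, gains=[0.5] * 3, offsets=[-1024.0] * 3)
```

With these made-up calibration values, the mid-scale count maps to zero, the count below mid-scale is clipped to zero by the ReLU, and the count above mid-scale passes through as a positive activation.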

Conclusion

The development of the IBM HERMES Project Chip represents a significant milestone in the realm of Analog In-Memory Computing. By addressing key challenges related to energy efficiency, weight capacity, accuracy, and on-chip operations, this multi-core AIMC chip paves the way for the realization of highly efficient and accurate neural network inference. Its integration of phase-change memory, combined with advanced analog storage and high-precision digital units, positions the chip at the forefront of AI hardware acceleration. With its exceptional throughput, energy efficiency, and inference accuracy, the IBM HERMES Project Chip offers a glimpse into the future of AI hardware innovation.


Reference: https://arxiv.org/abs/2212.02872
