We’ve come a long way since Intel introduced the first microprocessor in 1971. Their 4004 held 2,300 transistors, with today’s best chips exceeding billions, harnessing more and more power since their birth.
But every time Intel releases a new computer chip, it’s a costly investment, as they need to add new instructions of computer programs that tell it which data to process and how to process it. These are things a user doesn’t see, but that power tasks like image processing, machine learning, and video coding.
However, the programs that process this new information, called compilers, can’t always use these more complex instructions. The burden then often falls on expert developers to do more of the work by hand, and to perform error-prone and cumbersome tasks like writing assembly code.
Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) came up with a way to better navigate the complexity of supporting these instructions. Their tool “VeGen” (pronounced “vegan”) automatically generates compiler plugins to effectively use more complicated instructions.
CSAIL Ph.D. student Yishen Chen says that VeGen takes in the same documentation that Intel gives to software developers, and automatically generates a compiler plugin that lets the compiler exploit something called non-Single Instruction Multiple Data (SIMD), which are instructions that are more complicated to accelerate a given user-supplied program.
“Without VeGen, compiler engineers have to read the documentation and manually modify the compiler to use these instructions,” says Chen, an author on a new paper about VeGen. “The problems here are that this is still manual, and current compiler algorithms are not effective at using these instructions.”
Processors also support something called vector instructions, which are instructions that do multiple but identical operations at once, such as “A1=B1+C1 and A2=B2+C2.” These are both considered more traditional “SIMD” instructions.
“Non-SIMD” instructions are more complicated, but even more powerful and efficient, such as instructions that perform both additions and subtractions simultaneously. Chen says that VeGen is mostly motivated by instructions that don’t fit the SIMD model, in one way or another.
Think of the whole process like a restaurant:
- The programmer is the celebrity chef who thinks of a program (a dish)
- The compiler is the sous chef who takes the program (the dish) and creates instructions to be executed (recipe) knowing the resources available in the kitchen
- The processor is the kitchen
- The instructions (the recipe) are executed (prepared) by the line cooks taking advantage of different equipment in the kitchen
- A new processor design that adds new capabilities to the processor is the remodeling of the kitchen, or the addition of new kitchen equipment.
If the sous chef and his team don’t know how to use the new equipment, the restaurant owners who spend all the money remodeling the kitchen will not be happy.
“With the advent of complex instructions, it’s become hard for compiler developers to keep code generation strategies up-to-date in order to harness the full potential supported by the underlying hardware,” says Charith Mendis, professor at the University of Illinois at Urbana-Champaign, an author on a paper about the tool.
“VeGen’s approach to building code generator generators alleviates this burden by automatically generating parts of the compiler responsible for identifying code sequences that can exploit new hardware features. We hope that VeGen’s approach to building compiler components will lead to more sustainable and maintainable compiler infrastructures in the future.”
Initial results showed that, for example, on select video coding kernels, VeGen could automatically use non-SIMD vector instructions and get speedup from 27 percent to 300 percent.
“Putting all the Intel instruction manuals together is more than one foot wide, going into thousands of pages,” says MIT professor Saman Amarasinghe, an author on the paper about VeGen. “Normally, the compiler writer has to pour over the fine details of instruction changes, spread over hundreds of pages, but VeGen totally bypasses the tedious work.”
“As hardware becomes more complicated to accelerate compute-intensive domains, we believe VenGen is a valuable contribution,” says Chen. “The long-term goal is that, whenever you add new features on your hardware, we can automatically figure out a way – without having to rewrite your code – to use those hardware accelerators.”
Chen wrote the paper alongside Mendis, and MIT professors Michael Carbin and Saman Amarasinghe. They will present the paper virtually at the Architectural Support for Programming Languages and Operating Systems (ASPLOS) conference in April.
Vector instructions are ubiquitous in modern processors. Previous work on auto-vectorization has focused on single instruction multiple data (SIMD) instructions, but there is little research on systematically targeting non-SIMD vector instructions, which has applications in domains such as digital signal processing, image processing, and machine learning (e.g., Intel’s VNNI extension and the dot-product instructions in ARMv8 ).
In contrast with the SIMD instruction shown in Figure 1(a), Figures 1(b)ś1(d) show three examples of the non-SIMD instructions from the AVX2 instruction set. Figure 1(b) shows a single instruction, multiple operations, mul- tiple data (SIMOMD) instruction  that performs additions and subtractions on alternating lanes (vaddsubpd); Figure 1(c) shows a horizontal addition with lane interleaving (vhaddpd); and Figure 1(d) shows an instruction computing dot-products (vpmaddwd). To date, there is no unified model of parallelism that captures the capabilities of these instructions.
Automatic Vectorization. There are two mainstream techniques for extracting SIMD parallelism: loop vectorization [1, 20, 21] and superword level parallelism (SLP) based vectorization [15, 17, 24]. Both techniques make two fundamental assumptions about vector instructions: a SIMD instruction performs isomorphic operations across all lanes, and the instruction applies the operations element-wise (i.e., there is no cross-lane operation). Relying on these two as- sumptions, these algorithms enable compiler developers to support SIMD instructions across a variety of architectures with relatively little incremental effort.
Existing Support for Non-SIMD Instructions. Because non- SIMD instructions violate the two fundamental assumptions of existing vectorization algorithms, compiler developers support non- SIMD instructions using ad hoc approaches that are cumbersome and often ineffective.
For most non-SIMD instructions, compiler developers support them with backend peephole rewrites. However, because these peephole rewrites do not generate vector instructions by themselvesÐthey fuse sequences of SIMD instructions and vec- tor shuffles into more non-SIMD instructionsÐrelying on peephole rewrites alone is ineffective.
A relatively more effective but more labor-intensive strategy involves coordinating with the compiler’s vectorizers to generate SIMD vector patterns that are tailored for those rewrite rules. For instance, the initial support in LLVM  for the addsub instruction family (Figure 1(b)) required three co- ordinated changes to LLVM: refactoring LLVM’s SLP vectorizer to support alternating opcodes, changing LLVM’s cost model to recognize a special case of vector shuffle (blending odd and even lanes), and modifying LLVM’s backend lowering logic to detect the special patterns generated by the SLP vectorizer.
As processor vendors continue to add more complex non-SIMD instructions, this methodology is not sustainable. Compilers are falling behind in identifying the complex code sequences that can be mapped to these instructions, and these multibillion-dollar investments by the processor vendors in enhancing the vector instruction sets go un- derutilized without expert developers manually writing assembly or compiler intrinsics.
Our Approach: VeGen. In this paper, we describe an extensible framework for systematically targeting non-SIMD vector instruc- tions. We define a new model of vector parallelism more general than SIMD parallelism, and we present a vectorizer generator that can effectively extract this new model of parallelism using non- SIMD instructions.
To broaden the parallelism modeled by existing vectorizers, we introduce Lane Level Parallelism (LLP), which generalizes super- word level parallelism (SLP)  beyond SIMD in two ways: (1) An instruction can execute multiple non-isomorphic operations, and
- the operation on each output lane can use values from arbitrary input lanes. These two properties of LLP depend on the semantics of a given target vector instruction. Consequently, our framework encapsulates the two LLP properties (i.e., which operation exe- cutes on a given lane and which values the operation uses) in a
couple of target-dependent vectorization utility functions. By inter- facing with these utilities, the core vectorization algorithm in our framework remains target-independent, as traditional vectorization algorithms do.
We realize this framework with VeGen, a system that automat- ically generates target-architecture-aware vectorizers to uncover LLP in straight-line code sequences while using only instruction semantics as input. From these instruction semantics, VeGen au- tomatically generates the implementation of the aforementioned vectorization utilities as a compiler library to describe the specific kind of LLP supported by the target architecture. With this auto- matically generated target-description library, VeGen’s vectorizer can automatically use non-SIMD vector instructions. We added support for newer classes of non-SIMD vector instructions (e.g., those found in AVX512-VNNI, which are not fully supported by LLVM) by providing only their semantics.
We make the following contributions in this paper:
- We introduce Lane Level Parallelism, which captures the type of parallelism implemented by both SIMD and non- SIMD vector instructions.
- We describe a code-generation framework that jointly per- forms vectorization and vector instruction selection while maintaining the modularity of traditional target-independent vectorizers designed for SIMD instructions.
- We present VeGen, a vectorizer generator that automati- cally uses complex non-SIMD instructions using only their documented semantics as input.
- We integrated VeGen into LLVM. VeGen can use non-SIMD vector instructions effectively, e.g., getting speedup 3× (com- pared to Clang’s vectorizer) on x265’s idct4 kernel.
referencel link: https://groups.csail.mit.edu/commit/papers/2021/vegen.pdf
Provided by Massachusetts Institute of Technology