Researchers named the vulnerability Spectre because the flaw was built into modern computer processors that get their speed from a technique called “speculative execution,” in which the processor predicts instructions it might end up executing and preps by following the predicted path to pull the instructions from memory.
A Spectre attack tricks the processor into executing instructions along the wrong path. Even though the processor recovers and correctly completes its task, hackers can access confidential data while the processor is heading the wrong way.
Since Spectre was discovered, the world’s most talented computer scientists from industry and academia have worked on software patches and hardware defenses, confident they’ve been able to protect the most vulnerable points in the speculative execution process without slowing down computing speeds too much.
They will have to go back to the drawing board.
A team of University of Virginia School of Engineering computer science researchers has uncovered a line of attack that breaks all Spectre defenses, meaning that billions of computers and other devices across the globe are just as vulnerable today as they were when Spectre was first announced.
The researchers, led by Ashish Venkat, William Wulf Career Enhancement Assistant Professor of Computer Science at UVA Engineering, found a whole new way for hackers to exploit something called a “micro-op cache,” which speeds up computing by storing simple commands and allowing the processor to fetch them quickly and early in the speculative execution process. Micro-op caches have been built into Intel computers manufactured since 2011.
Venkat’s team discovered that hackers can steal data when a processor fetches commands from the micro-op cache.
“Think about a hypothetical airport security scenario where TSA lets you in without checking your boarding pass because (1) it is fast and efficient, and (2) you will be checked for your boarding pass at the gate anyway,” Venkat said. “A computer processor does something similar. It predicts that the check will pass and could let instructions into the pipeline.
Ultimately, if the prediction is incorrect, it will throw those instructions out of the pipeline, but this might be too late because those instructions could leave side-effects while waiting in the pipeline that an attacker could later exploit to infer secrets such as a password.”
Because all current Spectre defenses protect the processor in a later stage of speculative execution, they are useless in the face of Venkat’s team’s new attacks. Two variants of the attacks the team discovered can steal speculatively accessed information from Intel and AMD processors.
“Intel’s suggested defense against Spectre, which is called LFENCE, places sensitive code in a waiting area until the security checks are executed, and only then is the sensitive code allowed to execute,” Venkat said. “But it turns out the walls of this waiting area have ears, which our attack exploits. We show how an attacker can smuggle secrets through the micro-op cache by using it as a covert channel.”
Venkat’s team includes three of his computer science graduate students, Ph.D. student Xida Ren, Ph.D. student Logan Moody and master’s degree recipient Matthew Jordan. The UVA team collaborated with Dean Tullsen, professor of the Department of Computer Science and Engineering at the University of California, San Diego, and his Ph.D. student Mohammadkazem Taram to reverse-engineer certain undocumented features in Intel and AMD processors.
They have detailed the findings in their paper: “I See Dead μops: Leaking Secrets via Intel/AMD Micro-Op Caches.”
This newly discovered vulnerability will be much harder to fix.
“In the case of the previous Spectre attacks, developers have come up with a relatively easy way to prevent any sort of attack without a major performance penalty” for computing, Moody said. “The difference with this attack is you take a much greater performance penalty than those previous attacks.”
“Patches that disable the micro-op cache or halt speculative execution on legacy hardware would effectively roll back critical performance innovations in most modern Intel and AMD processors, and this just isn’t feasible,” Ren, the lead student author, said.
“It is really unclear how to solve this problem in a way that offers high performance to legacy hardware, but we have to make it work,” Venkat said. “Securing the micro-op cache is an interesting line of research and one that we are considering.”
Venkat’s team has disclosed the vulnerability to the product security teams at Intel and AMD. Ren and Moody gave a tech talk at Intel Labs worldwide April 27 to discuss the impact and potential fixes. Venkat expects computer scientists in academia and industry to work quickly together, as they did with Spectre, to find solutions.
The team’s paper has been accepted by the highly competitive International Symposium on Computer Architecture, or ISCA. The annual ISCA conference is the leading forum for new ideas and research results in computer architecture and will be held virtually in June.
Venkat is also working in close collaboration with the Processor Architecture Team at Intel Labs on other microarchitectural innovations, through the National Science Foundation/Intel Partnership on Foundational Microarchitecture Research Program.
Venkat was well prepared to lead the UVA research team into this discovery. He has forged a long-running partnership with Intel that started in 2012 when he interned with the company while he was a computer science graduate student at the University of California, San Diego.
This research, like other projects Venkat leads, is funded by the National Science Foundation and Defense Advanced Research Projects Agency.
Venkat is also one of the university researchers who co-authored a paper with collaborators Mohammadkazem Taram and Tullsen from UC San Diego that introduce a more targeted microcode-based defense against Spectre. Context-sensitive fencing, as it is called, allows the processor to patch running code with speculation fences on the fly.
Introducing one of just a handful more targeted microcode-based defenses developed to stop Spectre in its tracks, “Context-Sensitive Fencing: Securing Speculative Execution via Microcode Customization” was published at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems in April 2019. The paper was also selected as a top pick among all computer architecture, computer security, and VLSI design conference papers published in the six-year period between 2014 and 2019.
The new Spectre variants Venkat’s team discovered even break the context-sensitive fencing mechanism outlined in Venkat’s award-winning paper. But in this type of research, breaking your own defense is just another big win. Each security improvement allows researchers to dig even deeper into the hardware and uncover more flaws, which is exactly what Venkat’s research group did.
Last night, a group of computer security researchers lead by Ashish Venkat (University of Virginia) published a paper titled “I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches” which they have submitted for ISCA ’21 (International Symposium on Computer Architecture, a prestigious academic conference).
The paper concerns a structure called the micro-op (uop) cache that is commonly used within modern microprocessors. A thorough analysis of the organization and implementation of these structures allows the research team to propose novel timing side-channel attacks similar to those of “Spectre“.
Modern microprocessors are decoupled into a “front-end” that decodes the instructions as presented by the programmer (in the form of compiled machine code) and a “back-end” that actually executes instructions (quite likely in the form of a data flow model, Out-of-Order with respect to how the program was actually written, but nonetheless respecting the actual dependencies).
Notice I didn’t say “these instructions” a second time. What the back-end actually executes may look similar to the machine code presented at the front, but it might also look very different. Perhaps it has simply been optimized (e.g. fusing two – or more – adjacent instructions together into a more efficient one) but more often than not the backend of the machine will be executing very different instructions known as micro-ops.
A micro-op (uop) is a very simple instruction. For example, it might be an ADD instruction. It may take a couple of inputs and produce one output. In some cases (such as a simple add), the “macro” ops of the machine code written by the tooling used by a programmer map 1:1 on to the uops used by the machine. But often, a single macro op is actually decomposed into many individual uops.
In RISC-like machines (Arm, RISC-V, etc.) such cracking into multiple uops is usually limited to relatively complex instructions (such as certain loads that also perform complex address generation logic), but most modern CISC machines (e.g. x86) will actually decompose every x86 macro instruction into a stream of RISC uops.
In such a machine, the processor frontend will be coupled to the instruction cache through a “fetch” unit that will retrieve a certain number of bytes (e.g. 16 in the paper cited) per processor cycle (clock “tick”) from the “L1” (level 1) instruction cache. The L1 instruction cache contains the program machine instruction “code”, represented as a sequence of bytes, typically stored into 64 byte cache “lines”.
The L1 is small but extremely fast. It is typically “coherent” with respect to other caches in the machine, meaning that if you were to write self-modifying code out via the data cache, the instruction cache automatically observes this. The L1 might do many other things, such as pre-decode instructions (e.g. tag those that are branches ahead of time, etc.), or be the target of other timing side-channel attacks (such as the original Spectre attacks), but that is not the focus of this latest analysis.
Once bytes of instructions are fetched, they are run through decoders. Particularly with x86 (but also elsewhere), decoding macro operations is expensive. Worse, if a piece of code is “hot” because it is part of a loop that is being executed hundreds or millions of times, decoding the same macro instructions over and over can add needless expense (both to performance overhead, but also energy use).
For this reason, it is common in some designs to cache decoded uops and replay them out of the uop cache as needed. Think of the uop cache as a companion to the other caches we already have in our modern processors – the instruction and data caches being just two of many different examples in modern designs.
Keeping a cache of decoded uops can save a lot of overhead, and increase throughput, especially when doing a CISC->RISC translation. Individual x86 macro instructions are variable width, up to 15 bytes in length (perhaps theoretically 16?), and decoding them is extremely unpleasant. Worse, the throughput per cycle is dependent upon the complexity of the instructions.
Unlike a clean RISC architecture where you might build a wide uniform decoder that can fetch and stream 4, 8, or even more instructions to the processor backend each cycle, doing this at a sustained rate on x86 would require a much wider fetch (as each instruction might be from 1 to 15 bytes in length), and a lot of bit and byte swizzling, instruction length boundary detection, and so on. It’s a huge potential bottleneck, so the uop cache allows a higher sustained rate of instruction throughput than otherwise.
Furthermore, adding the uop cache means that the backend of a modern x86 processor isn’t all that different from any other RISC machine. If you want to learn more about that, search for the original “RISC86” work done by Greg Favor at NexGen (later AMD).
The problem with having a uop cache is that it adds an observable timing side-channel into the mix. Depending upon which instructions are executing, they may be fetched from the uop cache, or they may have to be decoded directly from the x86 macro ops.
Worse, the uop cache is populated during instruction fetch/decode prior to any part of the backend getting involved. This means that existing “Spectre” mitigations that rely upon controlling execution happen too late in the pipeline – the offending uops are already present by then.
The paper first determines the “organization” (cache size and structure – sets and ways, etc.) and then conclusively shows that the uop cache implementation used by modern Intel processors features a “hot replacement” policy that guarantees recently executed uops will be present. It then introduces a “tiger” / “zebra” displacement attack that can be used by an attacker in order to determine which uops were recently executed.
The uop cache structure used by Intel in their popular “Skylake” derived designs is revealed to feature lines of 6 uops that are organized into an 8-way set-associative cache with 32-sets. Instructions are “streamed” out of the uop cache set-by-set sequentially, following various rules.
It is claimed that Intel processors do not competitively share uop cache space between sibling “Hyperthreads” of an SMT (Simultaneous Multi-Threading) core. This means that the uop cache is split physically (but evenly) between two SMT threads in a way that one cannot measure the other.
However, the researchers claim that certain AMD processors do have competitive sharing of SMT resources, meaning that a sibling thread might be able to monitor the uop cache footprint of the other. At the same time, it should be noted that the entire industry is moving away from a model of mixing trusted and untrusted workloads across sibling hyperthreads. Today, it is more common to give siblings only to the same workload, precisely because of the number of concerns that have emerged about SMT.
The rules for which instructions can live in the uop cache include a limit on the number of uops per macro instruction (“a given 32-byte code region may consume a maximum of 3 lines in the set (i.e. up to 18 micro-ops)”), as well as requiring that an unconditional branch must be last instruction in a uop group. There are many other constraints, which when taken together with the determined replacement policy allow sequences of instructions to be manufactured that will have a known layout in the uop cache. One group of “tigers” interfere with other “tigers” because they intentionally map to the same uop cache sets, while “tigers” and “zebra” are constructed to be mutually exclusive with one another.
As a result of determining the behavior of the uop cache, it is possible to construct sequences of code that can be used for covert communication through the uop cache (fun), or (worse) leverage measurements of the uop cache to determine when a data dependent branch has caused instructions for that dependent branch to be populated into the cache.
The latter case is most dangerous. It allows for (potential) circumvention of some of the Spectre-v1 (bounds check bypass) mitigations commonly adopted in software. Let’s remind ourselves how Spectre worked. It all comes down to memory being dramatically slower than compute, meaning that processors need to frequently wait for reads (loads) from memory to complete.
Out-of-Order execution allows the backend to get around this in two ways. First, if a load is outstanding but some immediately subsequent code does not depend upon the result, then the subsequent code can be executed and the ordering restored later. Second, the processor can enter a mode of “speculation” in which it guesses the result of the load. Ignoring value prediction for this article, consider this instead:
char array[1024];
int array_size = 1024;
char secret[1024];
extern uint8_t victim_function(size_t i) {
// bounds check:
if (i >= 0 && i < array_size) {
// misspeculation of this branch
// bypasses the bounds check
return array[i];
}
return -1;
}
Listing: Victim Method for the Variant-1 Attack (as given in “Listing 4” in the paper)
When the processor reaches the “if” branch above, which performs the bounds check on the attacker controlled “i” variable, it does not know whether to proceed until it has “resolved” the branch by performing the necessary load of array_size and the comparison.
While this load is happening, the processor may nonetheless speculate ahead that the bounds check will succeed, especially if it has done so in the past (and thus the branch predictor has been “trained” to behave a certain way). In Spectre, the attacker causes secret leakage by first exceeding a bounds check such that “secret” which immediately follows “array” in memory becomes the target, and then finding some “gadget” in the code following the bounds check that performs a second memory access based upon the value of the first. This data-dependent second access leaves breadcrumbs in the L1 data cache of the processor.
In Spectre, the second access after the first is important because it causes the L1 data cache at a predictable location (based upon its organization) to become populated based upon whether or not the first (secret) memory had a certain value. Timing analysis of the data cache access latency at that second location can subsequently be used to determine whether it was populated or not. Through careful analysis of target programs, it is possible to find sufficient Spectre “gadgets” to leak secrets one bit at a time (or better).
Mitigating Spectre-v1 typically requires restraining speculation beyond sensitive branches (either identified by manual review, or in an automated fashion using tooling or compilers). One way in which this is performed is to use a form of serializing memory barrier known as a load fence (LFENCE) on x86. Inserting an LFENCE as a Spectre mitigation works because it seeks to prevent the backend of the machine from executing any further loads beyond the potentially dangerous load. However, it apparently doesn’t prevent the frontend of the machine from potentially fetching code (e.g. from a branch) beyond the bounds check.
In the paper, the researchers replace a secret-dependent second load with a secret-dependent jump or function call. The latter causes the processor to begin preparing to execute the instructions following the branch, ensuring they are present in the uop cache.
Using the “tiger” and “zebra” analysis methods previously developed in the paper, the researchers are subsequently able to determine the presence or absence of uops and to reconstruct the secret. Finally (of course) they were able to find a few example “gadgets” in the Linux kernel source that could be theoretically exploited to leak data in this manner.
The researchers primarily focus on same address space / same privilege level attacks where a different context (e.g. managed code runtime, such as Java or JavaScript) might be interpreting untrusted code within a sandbox. Practical attacks of this form are likely to be fairly minimal, however, due to the extensive work done by the industry in the wake of Spectre (and its companions) to separate untrusted code into separate address spaces. It is, for example, no longer the case that web browsers are running untrusted JavaScript code within an interpreter that exists within the same address space as trusted browser code.
More worrying, however, is the potential for cross-domain attacks (one privilege level to another), or cross-SMT thread attacks. The researchers claim that they were able to perform user-to-kernel type attacks (and presumably this could be a kernel to hypervisor in place of user to kernel).
They were not able to target Intel’s SGX (Software Guard Extensions, also known as enclaves) because entry into an enclave causes the processor’s instruction TLB (Translation Lookaside Buffer, a structure that caches memory translations between the “virtual” memory view used by programs, and the physical memory underneath) to be invalidated and this has the side-effect of invalidating the uop cache as a benefit.
But the cross-domain attack is a potential concern. If it turns out to be significant, then a logical fix would be to invalidate the uop cache when crossing this privilege boundary. We know that this can be done by invalidating the iTLB (and we know this is also a side effect of the original “Meltdown” Page Table Isolation or “PTI” solution as it causes the kernel and userspace to use entirely different address spaces – what we don’t know is whether all of the optimizations for this such as PCID invalidations also cause uop cache invalidation).
Overall, the paper is interesting reading. It’s far from the world-ending sensationalism implied by the “Defenseless” language on the Virginia site, and in the press pick up thus far. Quite the contrary really. The industry had a huge problem on its hands with Spectre, and as a direct consequence a great deal of effort was invested in separating privilege, isolating workloads, and using different contexts. There may be some cleanup needed in light of this latest paper, but there are mitigations available, albeit always at some performance cost.
Provided by University of Virginia School of Engineering and Applied Science