The tool, named SugarCoat, targets scripts that harm users’ privacy – for example, by tracking their browsing history around the Web – yet are essential for the websites that embed them to function.
SugarCoat replaces these scripts with scripts that have the same properties, minus the privacy-harming features. SugarCoat is designed to be integrated into existing privacy-focused browsers like Brave, Firefox, and Tor, and browser extensions like uBlock Origin. SugarCoat is open source and is currently being integrated into the Brave browser.
“SugarCoat is a practical system designed to address the lose-lose dilemma that privacy-focused tools face today: Block privacy-harming scripts, but break websites that rely on them; or keep sites working, but give up on privacy,” said Deian Stefan, an assistant professor in the UC San Diego Department of Computer Science and Engineering.
The researchers will describe their work at the ACM Conference on Computer and Communications Security (CCS) taking place in Seoul, Korea, Nov. 14 to 19, 2021.
“SugarCoat integrates with existing content-blocking tools, like ad blockers, to empower users to browse the Web without giving up their privacy,” said Michael Smith, a Ph.D. student in Stefan’s research group, who is leading the project.
In practice, though, some scripts are both privacy-harming and necessary for websites to function – and most tools inevitably choose to make an exception and allow these scripts to run. Today, there are more than 6,000 exception rules letting through these privacy-harming scripts.
There is a better approach, though. Instead of blocking a script entirely or allowing it to run, content-blocking tools can replace its source code with an alternative privacy-preserving version. For example, instead of loading popular website analytics scripts that also track users, content-blocking tools can substitute fake versions that look the same to the page that embeds them.
This ensures that the content-blocking tools are not breaking web pages that embed these scripts and that the scripts can’t access private data (and thus report it back to the analytics companies).
To date, crafting such privacy-preserving replacement scripts has been a slow, manual task even for privacy engineering experts. uBlock Origin, for example, maintains replacements for only 27 scripts, compared to the over 6,000 exception rules.
How SugarCoat changes the game
The researchers developed SugarCoat precisely to address this gap by automatically generating privacy-preserving replacement scripts. The tool uses the PageGraph tracing framework – Smith was key to the development of the framework – to follow the behavior of privacy-harming scripts throughout the browser engine.
SugarCoat then rewrites the scripts’ source code to talk to fake “SugarCoated” APIs instead, which look like the Web Platform APIs but don’t actually expose any private data.
To evaluate the impact of SugarCoat on Web functionality and performance, the team integrated the rewritten scripts into the Brave browser; they found that SugarCoat effectively protected users’ private data without impacting functionality or page load performance. SugarCoat is now being deployed in production at Brave.
“Brave is excited to start deploying the results of the year-long SugarCoat research project,” said Peter Snyder, senior privacy researcher and director of privacy at Brave Software. “SugarCoat gives Brave and other privacy projects a powerful, new capability for defeating online trackers, and helps keep users in control of the Web.”
SUGARCOAT DESIGN
In this section we present the design of SugarCoat, a system for programmatically generating privacy-preserving resource replacements. SugarCoat combines dynamic browser instrumentation with static code analysis to patch out the privacy-harming portions of real-world JavaScript code.
Privacy developers can use SugarCoat to solve the privacy/compatibility trade-off without manually reverse-engineering scripts or writing individual resource replacements.
Generating resource replacements with SugarCoat is a three-step process (see Figure 4):
► The privacy developer visits Web pages using our modified PageGraph-instrumented browser, which dynamically traces the execution of all scripts embedded by the visited pages (§3.1).
► The developer marks certain scripts that they consider privacy-harming, and feeds this target script set into the first stage of the SugarCoat pipeline. Using graph analysis, this stage builds behavioral profiles of the target scripts from the collected PageGraph browser data, concretizing privacy-relevant Web API accesses to textual locations in the JavaScript source (§3.2).
Privacy developers collect this behavioral data by visiting Web pages in a modified browser equipped with PageGraph [37], an instrumentation system for Blink- and V8-based browser engines. The browser can be driven either manually or by scripted automation (e.g., in our evaluation we use Puppeteer [13]).
For all pages loaded in the instrumented browser, PageGraph records page “actions” that occur during execution (e.g., DOM node modifications, Web API calls, HTTP requests), the “actors” responsible for the actions (e.g., the parser, running scripts), and the “receivers” which are acted upon (e.g., DOM nodes, network resources, other actors), along with relevant attributes and metadata.
This history of actions is represented as an interconnected directed graph, with nodes representing actors and receivers, and edges representing actions as well as the DOM tree relationships in the page.
Figure 5 illustrates the graph structure resulting from a script inserting a DOM node at runtime, recorded by PageGraph as (a) a node representing the script (the “actor”), (b) a node representing the inserted DOM node (the “receiver”), (c) an edge connecting the two nodes, representing the insertion (the “action”), and (d) an edge connecting the inserted DOM node to its new parent DOM node. The nodes and edges are annotated with metadata, like the source URL and V8 script ID for the script actor node, and references to parent and sibling nodes for the insertion action edge.
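The actor/receiver/action structure described above can be sketched as a small directed graph. This is only an illustration: the `ActionGraph` class, the node and edge kind strings, the metadata field names, and the `example.test` URL are all hypothetical and do not reflect PageGraph's actual schema or file format.

```javascript
// Hypothetical, simplified model of a PageGraph-style recording.
// Field names and kind strings are illustrative assumptions only.
class ActionGraph {
  constructor() {
    this.nodes = new Map(); // id -> { id, kind, ...metadata }
    this.edges = [];        // { from, to, kind, ...metadata }
  }

  addNode(id, kind, metadata = {}) {
    this.nodes.set(id, { id, kind, ...metadata });
    return id;
  }

  addEdge(from, to, kind, metadata = {}) {
    if (!this.nodes.has(from) || !this.nodes.has(to)) {
      throw new Error(`unknown endpoint: ${from} -> ${to}`);
    }
    this.edges.push({ from, to, kind, ...metadata });
  }

  // All edges of a given kind leaving the given node.
  outgoing(id, kind) {
    return this.edges.filter((e) => e.from === id && e.kind === kind);
  }
}

// A script inserting a DOM node at runtime:
const graph = new ActionGraph();
// (a) the actor: a script node annotated with its source URL and script ID
const script = graph.addNode('script:1', 'script', {
  url: 'https://example.test/tracking.js',
  scriptId: 17,
});
const parent = graph.addNode('dom:body', 'dom-node', { tag: 'body' });
// (b) the receiver: the inserted DOM node
const inserted = graph.addNode('dom:42', 'dom-node', { tag: 'img' });
// (c) the action: an insertion edge carrying parent/sibling references
graph.addEdge(script, inserted, 'insert-node', {
  parent: parent,
  previousSibling: null,
});
// (d) the DOM tree relationship between the inserted node and its parent
graph.addEdge(inserted, parent, 'child-of');
```

Keeping actions as edges means questions such as "which script inserted this node?" reduce to simple edge lookups.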
For this work, we extended PageGraph with additional capabilities for tracking the Web API accesses performed by scripts. By hooking into the JavaScript binding layer, PageGraph can now track accesses to arbitrary Web APIs as actions in the graph, as long as simple annotations are added to the WebIDL code defining the APIs (see Appendix C for an example).
For each access, we record the concretized JavaScript source text location within every script on the stack at the point the access occurs, saving this as metadata in the graph. This data is collected for all scripts embedded in pages visited with the PageGraph browser, and then extracted from the PageGraph recordings by the SugarCoat pipeline.
Identifying Call Sites via Graph Analysis
The first stage of the SugarCoat pipeline builds behavioral profiles of privacy-harming scripts. It takes as input a set of target scripts, in source code form, and PageGraph recordings of pages which embed the target scripts. SugarCoat matches the target script source code with script actor nodes in the PageGraph graphs. It then performs two graph traversals: one to expand the target script set, and one to generate a trace map to drive script rewriting later in the pipeline.
Expanding the Target Script Set.
A script can dynamically inject other scripts into the page at runtime; this is a common pattern for ad and tracking scripts in particular. Blocking one script has the knock-on effect of blocking the scripts it would have injected into the page if allowed to run. These additional scripts may not be in filter lists, but may still be privacy-harming. To provide equivalent coverage without blocking, SugarCoat expands the input target script set to include all scripts injected by another target script (and all scripts injected by those scripts, recursively).
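The recursive expansion amounts to a reachability computation over "script A injected script B" relationships. The sketch below assumes those relationships have already been extracted from the recordings into a plain `Map`; the function name `expandTargetSet`, the input shape, and the script names are hypothetical.

```javascript
// Expand a target script set to include every script transitively injected
// by a target script. `injections` maps a script name to the scripts it
// injected at runtime (an assumed, pre-extracted representation).
function expandTargetSet(initialTargets, injections) {
  const expanded = new Set(initialTargets);
  const worklist = [...initialTargets];
  while (worklist.length > 0) {
    const current = worklist.pop();
    for (const injected of injections.get(current) || []) {
      // The membership check also guards against injection cycles.
      if (!expanded.has(injected)) {
        expanded.add(injected);
        worklist.push(injected);
      }
    }
  }
  return expanded;
}

const injections = new Map([
  ['tracking.js', ['pixel.js']],
  ['pixel.js', ['beacon.js', 'tracking.js']], // cycle back to the target
  ['userInterface.js', ['widgets.js']],       // not reachable from a target
]);

const expandedTargets = expandTargetSet(['tracking.js'], injections);
// expandedTargets contains tracking.js, pixel.js, and beacon.js
```

Scripts injected only by non-target scripts, such as `widgets.js` here, stay outside the set and are left untouched.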
The most common method by which scripts inject other scripts is to insert
Then, instead of guessing which pieces of a script might correspond to privacy-relevant API calls, SugarCoat mines the input PageGraph recordings for calls observed at runtime. For each “Web API access” graph node linked to a privacy-relevant API, SugarCoat loops through the connected scripts, starting with the script most recently pushed to the JavaScript stack at the time of the call. The call is attributed to the first script that is in the target script set.
In Figure 7, for example, tracking.js is at the top of the stack when a call is logged to localStorage, and if tracking.js is in the target script set, then SugarCoat attributes this call to tracking.js at source location 58 (i.e., the 58th character in the script source text). Otherwise, userInterface.js is next on the stack, and is checked for its membership in the target script set; and so on.
The output of this process is a trace map linking target script source locations to sets of privacy-relevant APIs accessed at those locations, in the form (target script, code location) → {Web API, ...}.
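The attribution rule and the resulting trace map can be sketched as follows. This assumes each observed access has already been reduced to an API name plus a stack of frames ordered from most recently pushed to oldest; the function name `buildTraceMap`, the record shape, the `script@offset` key encoding, and the offset 120 are hypothetical.

```javascript
// Build a trace map: "script@offset" -> Set of privacy-relevant APIs.
// Each access is attributed to the first stack frame (most recent first)
// whose script belongs to the target script set.
function buildTraceMap(accesses, targetScripts) {
  const traceMap = new Map();
  for (const { api, stack } of accesses) {
    const frame = stack.find((f) => targetScripts.has(f.script));
    if (!frame) continue; // no target script on the stack: not attributed
    const key = `${frame.script}@${frame.offset}`;
    if (!traceMap.has(key)) traceMap.set(key, new Set());
    traceMap.get(key).add(api);
  }
  return traceMap;
}

// One observed localStorage access, with tracking.js on top of the stack.
const accesses = [
  {
    api: 'localStorage',
    stack: [
      { script: 'tracking.js', offset: 58 },
      { script: 'userInterface.js', offset: 120 },
    ],
  },
];

// With tracking.js targeted, the access lands at tracking.js offset 58.
const mapA = buildTraceMap(accesses, new Set(['tracking.js']));
// With only userInterface.js targeted, the next frame down is used instead.
const mapB = buildTraceMap(accesses, new Set(['userInterface.js']));
```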
Analyzing and Rewriting JavaScript
Next, SugarCoat transforms the original source code of the target privacy-harming scripts into non-privacy-harming resource replacements, by redirecting privacy-relevant Web API accesses to harmless “mock” implementations of those APIs. Source code locations where these accesses occur are drawn from the trace map produced by the previous pipeline stage. For each API that should be intercepted, the privacy developer supplies a mock implementation, written in JavaScript, which emulates its expected behavior in a compatible but privacy-preserving way (e.g., a Web Storage API mock would keep all data in memory, while a Fetch API mock would return fake responses for network requests). Mock implementations are written once per API, not per script, and are reusable and shareable between resource replacements. Privacy developers can optionally specify policies which enable and disable mocks for each target script, controlling the capabilities available to the rewritten versions. Table 1 lists the initial set of Web APIs for which we implemented mocks; see Section 5.2 for a discussion of the scalability of developing additional mocks for SugarCoat.
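To make the idea of a mock concrete, here is a minimal sketch of an in-memory stand-in for the Web Storage API. It is not SugarCoat's own mock: the class name `MockStorage` is hypothetical, and it covers only the method-based interface (`getItem`, `setItem`, `removeItem`, `clear`, `key`, `length`), not property-style access such as `localStorage.foo`.

```javascript
// In-memory stand-in for a Storage object. Nothing is persisted, so values
// written by a rewritten script disappear when the page goes away.
class MockStorage {
  constructor() {
    this._items = new Map();
  }

  get length() {
    return this._items.size;
  }

  // Name of the key at the given position, or null if out of range.
  key(index) {
    const keys = [...this._items.keys()];
    return index >= 0 && index < keys.length ? keys[index] : null;
  }

  // Like the real API, missing keys yield null rather than undefined.
  getItem(key) {
    const k = String(key);
    return this._items.has(k) ? this._items.get(k) : null;
  }

  // Like the real API, keys and values are coerced to strings.
  setItem(key, value) {
    this._items.set(String(key), String(value));
  }

  removeItem(key) {
    this._items.delete(String(key));
  }

  clear() {
    this._items.clear();
  }
}

const mockStorage = new MockStorage();
mockStorage.setItem('trackingId', 12345);
const stored = mockStorage.getItem('trackingId'); // the string '12345'
const missing = mockStorage.getItem('unknown');   // null
```

Because the script still sees reads and writes succeed, it keeps running, but no identifier survives across page loads.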
SugarCoat produces resource replacements in three steps. First, it parses target scripts into abstract syntax trees (ASTs) using the Esprima JavaScript parser. Then, it rewrites the script ASTs to redirect privacy-relevant API calls to mock implementations. Finally, it transforms the rewritten ASTs into JavaScript source and bundles the source alongside the mock API implementations in a form consumable by off-the-shelf content blocking tools. We describe the most interesting step, the script rewriting, next.
3.3.1 Script AST Rewriting. Given the trace map of source code locations to privacy-relevant Web API calls made at those locations, SugarCoat selectively rewrites the target script ASTs so that the calls are redirected to mock implementations of the same APIs.
A naive approach to rewriting scripts would be to perform an in-place replacement of the exact JavaScript expressions encoding the Web API accesses (e.g., window.localStorage) with expressions that access the mocks instead (e.g., $mockLocalStorage). This, unfortunately, is fragile: Figure 8 shows how this could unintentionally change the meaning of the getTrackingId function from Figure 6 and break compatibility.
Our approach is to work at the JavaScript scope level instead of the expression level.
A “scope” here refers to a function body or the top-level statements in a script; we ignore block scoping. We wrap each scope containing a privacy-relevant Web API call with entry and exit guards. While control flow is inside a wrapped scope, references in the JavaScript environment to the specific APIs called within that scope are temporarily replaced with their mock equivalents.
As shown in Figure 9, the original code is wrapped in a try-finally block; when control flow enters the block, entry guards overwrite localStorage and sessionStorage with mocks; when control flow exits the block, exit guards restore them. In Appendix A we describe how SugarCoat rewrites different constructs to preserve the scoping of code placed within these try-finally blocks.
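The entry/exit guard pattern can be illustrated with a small, self-contained sketch. It is not the code from Figure 9: the `env` object stands in for the page's global object so the example runs outside a browser, and `realStorage`, `guardMock`, `sharedLibraryRead`, and both `getTrackingId` variants are hypothetical names.

```javascript
// `env` stands in for the page's global object (`window` in a browser).
const realStorage = { getItem: () => 'real-user-id' };
const guardMock = { getItem: () => null }; // privacy-preserving stand-in
const env = { localStorage: realStorage };

// A shared helper that is NOT rewritten; it reads whatever is installed.
function sharedLibraryRead(key) {
  return env.localStorage.getItem(key);
}

// Original (hypothetical) target function: reads the real identifier.
function getTrackingIdOriginal() {
  return sharedLibraryRead('trackingId');
}

// Rewritten version: the body is unchanged, but wrapped in guards.
function getTrackingIdRewritten() {
  const saved = env.localStorage;  // entry guard: remember the real API
  env.localStorage = guardMock;    // entry guard: install the mock
  try {
    return sharedLibraryRead('trackingId'); // original body, untouched
  } finally {
    env.localStorage = saved;      // exit guard: restore, even on throw
  }
}

const before = getTrackingIdOriginal();   // 'real-user-id'
const during = getTrackingIdRewritten();  // null, served by the mock
const after = env.localStorage === realStorage; // true: API was restored
```

Using `finally` means the real API is restored whether the wrapped body returns normally or throws, so code outside the guarded scope does not observe the mock.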
This scope-based approach ensures that calls to privacy-relevant APIs can be redirected even when the calls themselves are performed by separate, shared libraries like jQuery. Such shared libraries may be used legitimately by other, non-privacy-harming scripts on the page, and therefore aren’t targeted for rewriting.
As discussed in Section 3.2.2, such calls are attributed to the target script most recently pushed onto the stack at the time the call occurs. When a target script calls into a shared library, and that shared library calls a privacy-sensitive API on behalf of the target script, we inject mocks in the calling target script before control is transferred to the library and remove them after the library returns control to the target script.
Scope-Narrowing Rewriting Algorithm.
SugarCoat’s AST rewriter is tasked with inserting guards into the AST such that all code locations in the trace map are correctly covered by corresponding guards, while minimizing performance overhead, code bloat from excessive guard insertion, and impact on the rest of the JavaScript environment.
To do this, the rewriter follows the algorithm in Figure 10, starting from the AST node corresponding to the top-level script scope and descending recursively into nested function scopes. Each scope “consumes” from the trace map the privacy-relevant code locations between the start and end points covered by the scope’s AST node.
The rewriter then descends into the scopes nested within the current scope, which in turn “consume” the code locations that belong to them. After traversal, the current scope is left with a list of privacy-relevant code locations for which it is the narrowest, most deeply-nested containing scope. Whenever this list is non-empty, SugarCoat wraps the scope with entry and exit guards corresponding to the Web APIs used at the code locations.
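The consume-and-descend idea can be sketched over a simplified scope tree rather than a real AST. This is an illustration of the approach described above, not a transcription of Figure 10: the function `assignGuards`, the scope shape (`name`, `start`, `end`, `children`), the character offsets other than 58, and the `fetch` access are hypothetical.

```javascript
// Returns true if a code location falls inside a scope's source range.
function covers(scope, location) {
  return location.offset >= scope.start && location.offset < scope.end;
}

// For each scope, find the locations for which it is the narrowest
// containing scope, and record the set of APIs its guards must cover.
function assignGuards(scope, locations, result = new Map()) {
  // This scope consumes the locations within its own range...
  let remaining = locations.filter((l) => covers(scope, l));
  for (const child of scope.children || []) {
    // ...then nested scopes consume the ones that belong to them.
    assignGuards(child, remaining, result);
    remaining = remaining.filter((l) => !covers(child, l));
  }
  // Whatever is left has this scope as its narrowest container.
  if (remaining.length > 0) {
    const apis = new Set();
    for (const l of remaining) {
      for (const api of l.apis) apis.add(api);
    }
    result.set(scope.name, apis);
  }
  return result;
}

const topLevel = {
  name: 'top-level', start: 0, end: 300,
  children: [{
    name: 'initializeTracking', start: 10, end: 250,
    children: [{ name: 'getTrackingId', start: 40, end: 120, children: [] }],
  }],
};

const locations = [
  { offset: 58, apis: ['localStorage'] },
  { offset: 90, apis: ['sessionStorage'] },
  { offset: 200, apis: ['fetch'] }, // hypothetical extra access
];

const guards = assignGuards(topLevel, locations);
// getTrackingId      -> { localStorage, sessionStorage }
// initializeTracking -> { fetch }
// top-level          -> no guard (its list of locations ends up empty)
```

Guarding only the narrowest scope keeps the mocks installed for as little of the script's execution as possible.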
Figure 9 shows how rewriting applies to the tracking.js script from Figure 6. In the original script, references to localStorage and sessionStorage occur in a nested scope: the getTrackingId function scope, contained within the initializeTracking function scope, which is contained in the top-level script scope. Since the getTrackingId function scope is the narrowest scope containing the code location, the rewriter selects this scope.
Code Generation and Bundling.
As a final step, SugarCoat turns the rewritten ASTs into JavaScript resource replacements. Each resource replacement script is prefixed with mock implementations of the privacy-relevant Web APIs used in the original script (we give a simple Fetch API mock in Appendix B).
The rewritten AST is converted to JavaScript code like the sample in Figure 9, and then appended after the mock implementations. SugarCoat packages the resulting source code files into a resource replacement bundle, and generates accompanying EasyList-style filter rules to intercept requests to the original scripts and redirect them to the resource replacements. The output can be dropped into any compatible content blocking tool, such as uBlock Origin [19], AdGuard [2], or the Brave Browser’s adblock-rust engine.
More information: Michael Smith et al, SugarCoat: Programmatically Generating Privacy-Preserving, Web-Compatible Resource Replacements for Content Blocking is available as a PDF at brave.com/wp-content/uploads/2 … garcoat-ccs-2021.pdf