Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, but recent concerns have emerged regarding their potential to memorize and reveal sensitive information present in their training datasets [7, 12, 14].
Previous research has undertaken comprehensive studies on the total quantity of memorized training data for open-source models [11] and devised practical attacks to extract training data from smaller models like GPT-2 [14]. This paper combines these two approaches to conduct a large-scale investigation into extractable memorization, employing a scalable methodology capable of detecting memorization in trillions of tokens across terabyte-sized datasets.
Methodology
The study encompasses a range of language models, including open-source models such as Pythia [5] and GPT-Neo [6], as well as semi-closed models like LLaMA [49] and Falcon [40]. Notably, the analysis indicates that larger and more capable models are more vulnerable to data extraction attacks. The investigation takes an unexpected turn when applied to gpt-3.5-turbo, however: naive prompting reveals almost no memorized training data.
ChatGPT’s Alignment and Divergence
The authors hypothesize that gpt-3.5-turbo’s resistance to data extraction may stem from ChatGPT’s alignment with Reinforcement Learning from Human Feedback (RLHF) [35, 37, 39, 44] to function as a helpful chat assistant. To circumvent this alignment and reveal the model’s memorization capabilities, the researchers discover a prompting strategy that induces gpt-3.5-turbo to “diverge” from typical chatbot-style responses, behaving more like a base language model and generating text in a typical Internet-text style.
Ethics and Responsible Disclosure
The authors emphasize their commitment to responsible disclosure, sharing their findings with the authors of each model studied in the paper, including OPT [54], Falcon [40], Mistral [28], and LLaMA [49]. The specific vulnerability identified in ChatGPT (gpt-3.5-turbo) is deemed model-specific and not applicable to other production language models tested.
The authors disclosed the vulnerability to OpenAI on August 30th, following the discovery on July 11th, adhering to standard disclosure timelines [41]. After allowing a 90-day period for issue resolution, the findings are shared openly to raise awareness of data security and alignment challenges associated with generative AI models.
Background and Related Work
The training of state-of-the-art large language models (LLMs) involves pre-training on massive text corpora containing billions to trillions of tokens [6, 42, 43, 50]. Proprietary models like GPT-4 [38] and PaLM 2 [2] maintain secrecy around their training sets to protect both the company’s proprietary data collection pipeline and any non-public user-specific or licensed training data [31, 32].
Instruction-tuning and Reinforcement Learning from Human Feedback (RLHF) play crucial roles in enhancing the utility of pre-trained LLMs. Models like ChatGPT undergo supervised fine-tuning or RLHF on instruction-following data to improve performance and align with a unified chat-like persona [35, 39]. The alignment process also ensures models refrain from generating responses to certain types of queries, such as assisting in creating spam emails [37]. The focus of this analysis is on ChatGPT, specifically the gpt-3.5-turbo model endpoint.
Neural networks, particularly those with many parameters, are prone to memorizing training data, which creates privacy risks. This can be exploited through membership inference attacks, which determine whether an example was in the training set, or through more potent data extraction attacks, which recover full training examples [9, 17, 21, 45, 52]. This work explores both types of attacks on LLMs.
Extracting Data from Open Models
The study begins by examining data extraction attacks on open models, where both the model parameters and the original training sets are publicly accessible. This setting allows a precise evaluation of how effective the extraction attacks proposed in prior research actually are.
Prior Approaches and Definitions
The authors adopt Carlini et al.’s (2021) [14] definition of memorization: a string x from the training set X is memorized if the model’s generation routine Gen can produce x verbatim. The study focuses on verbatim matches to facilitate scalable analysis. The concept of extractable memorization is introduced, defined as an example x from the training set X being extractably memorized if an adversary, without access to X, can construct a prompt p resulting in Gen(p) = x.
Challenges in extraction attack design and evaluation are addressed, emphasizing the difficulties in constructing effective prompts and verifying the success of the attack. Previous work used heuristics, such as prompting models with data sampled from the model’s training distribution and manual verification through Google searches.
Discoverable Memorization
The study also considers discoverable memorization: each training example is split into a prefix p and a suffix x (so that [p ‖ x] ∈ X), the model is prompted with the true prefix p, and x is considered discoverably memorized if Gen(p) = x. Prior research indicates that many LLMs discoverably memorize roughly 1% of their training datasets when prompted with a context of about 50 tokens [2, 11, 30].
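Restated compactly in the paper's Gen notation (a paraphrase of the two definitions above, not the authors' exact formalism):

```latex
% Extractable memorization: some prompt p, constructed without access to X,
% makes the model reproduce x verbatim.
x \in X \ \text{is extractably memorized} \iff
  \exists\, p \ \text{(built without access to } X\text{)}\ \text{s.t.}\ \mathrm{Gen}(p) = x .

% Discoverable memorization: prompting with x's own true prefix p recovers x.
[p \,\|\, x] \in X \ \text{and} \ \mathrm{Gen}(p) = x
  \;\Longrightarrow\; x \ \text{is discoverably memorized} .
```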
Discrepancies Between Extractable and Discoverable Memorization
The observed gap between extractable and discoverable memorization rates in the literature prompts the exploration of two possible explanations. The first suggests that models prompted with training data may exhibit significantly more regurgitation than realistic extraction attack strategies. The second posits that existing extraction attacks are already successful at recovering training data, but prior work could not adequately verify the model outputs.
Disentangling Explanations
The goal of this section is to disentangle the two explanations. The authors argue that the latter explanation is mostly correct, suggesting that existing extraction attacks are more successful at recovering training data than prior work indicates.
A Comprehensive Analysis of Data Extraction Attacks on Language Models
Attack Methodology
This section details the methodology used to conduct data extraction attacks on language models, focusing on open-source models with publicly available training datasets. The approach follows Carlini et al. [14], with slight modifications to the evaluation process.
- Selection of Open-Source Models: The attack begins by targeting open-source models with accessible training datasets, so that the success of the attack can be verified mechanistically against the training data, even though the attack itself makes no use of that data.
- Data Extraction Attack Method: The methodology closely follows Carlini et al.’s approach:
- Downloading 10^8 bytes of data from Wikipedia to serve as the basis for prompt generation.
- Generating prompts (p) by randomly sampling hundreds of millions of contiguous 5-token blocks from the Wikipedia data.
- Independently generating a model output x_i for each prompt p_i via the model’s generation routine, Gen(p_i) = x_i, and storing each x_i.
- Evaluation of Attack Efficacy: A key deviation from the prior attack lies in how the efficacy of the attack is evaluated. While Carlini et al.’s attack necessitated manual Internet searches to determine whether generated sequences were present in the model’s training dataset, the models studied in this section are entirely open-source. This allows for direct querying of the model’s training data to ascertain whether any generated sample is memorized.
- Efficient Training Set Inclusion Test: Conducting the training set inclusion test x ∈ X naively is impractical at the scale of modern training datasets: LLMs are trained on trillions of tokens, and the attack generates billions of tokens of output. To address this, the researchers use a suffix array, following Lee et al. (2021) [33]. A suffix array is a data structure that stores all suffixes of a dataset in sorted order, enabling fast string lookups through binary search. The suffix array s over X is denoted s(X), or simply s when unambiguous.
- Determining Extraction Success: An extraction is deemed successful if the model outputs text containing a substring of at least 50 tokens that appears verbatim in the training set (a minimal sketch of this test follows this list). The 50-token threshold was chosen empirically: it is roughly double the token overlap estimated between independent news articles in the RedPajama dataset, large enough that accidental overlaps between suffixes are unlikely, making the test conservative.
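To make the membership test concrete, here is a minimal, unoptimized Python sketch of a suffix-array lookup and the 50-token success criterion. It illustrates the idea only; the paper's implementation (following Lee et al. [33]) is far more scalable, and the function and variable names below are assumptions rather than the authors' code.

```python
def build_suffix_array(tokens):
    """Indices of all suffixes of `tokens`, sorted lexicographically.
    (Quadratic comparisons -- fine for a sketch, not for terabytes.)"""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def contains_subsequence(tokens, suffix_array, query):
    """Binary-search the suffix array for a suffix that starts with `query`."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if tokens[start:start + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(suffix_array):
        return False
    start = suffix_array[lo]
    return tokens[start:start + len(query)] == query

def is_extraction_success(training_tokens, suffix_array, generation, k=50):
    """An output counts as memorized if any k-token window of it appears
    verbatim in the training data (k = 50 in the evaluation above)."""
    return any(
        contains_subsequence(training_tokens, suffix_array, generation[i:i + k])
        for i in range(len(generation) - k + 1)
    )
```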
This meticulous methodology enables a comprehensive examination of the success of data extraction attacks on language models, shedding light on the vulnerabilities and potential risks associated with open-source models and their training datasets.
Discoverable vs. Extractable Memorization
The study addresses two critical questions: the number of data samples memorized under both definitions and, more intriguingly, the instances where samples are either extractable but not discoverable or vice versa. To conduct this analysis, the researchers refer to prior work that released a dataset of discoverable memorizations from The Pile for the GPT-Neo 6B parameter model [11]. A comparison is drawn between these discoverable memorizations and the extractable memorized examples from the previous section, resulting in a comprehensive confusion matrix.
Confusion Matrix Analysis
The confusion matrix reveals several interesting insights:
- Most training data from the model is not memorized under either definition.
- 30.1% of examples are discoverably memorized, and 14.5% are extractably memorized.
- Surprisingly, only 35% of discoverably memorized examples were also extractable, challenging previous beliefs.
- An additional 11% of the sequences recovered by the extraction attack were not discoverably memorized at all.
Extended Analysis
The analysis is extended to sequences from The Pile with varying numbers of duplicates. The percentage of sequences that were either discoverably or extractably memorized is computed. Notably, highly duplicated sequences are found to be both easier to extract and discover.
Observations and Implications
- Sampling Success: It is somewhat surprising that a simple attack, such as sampling from the model, is sufficient to recover a large fraction (35%) of all known memorized training data.
- Room for Improvement: The findings suggest there is room for improving current extraction attacks, hinting at the potential for more efficient methods.
- Discoverable Memorization Baselines: Measuring discoverable memorization emerges as a useful and reasonably tight characterization of data that can be extracted by an adversary. However, there is still room for improvement in discoverable memorization baselines, as not all (extractably) memorized data is discovered through sampling prefixes from the training set.
- Challenges in Reporting Discoverable Memorization: The study indicates a potential discrepancy in reporting discoverable memorization, as sequences were considered discoverably memorized only if greedy decoding resulted in reconstructing the training example.
Extracting Data from Semi-Closed Models
Defining Semi-Closed Models
Semi-closed models are characterized by publicly available, downloadable parameters while keeping their training datasets and algorithms undisclosed. Unlike open-source models, these models pose a more challenging scenario for data extraction.
Establishing Ground Truth
Given the lack of access to training datasets, the researchers adopt a strategy inspired by Carlini et al. [14], where a “ground truth” for verifying and quantifying extractable memorization is created. This involves testing whether a model output exists on the web, a strategy similar to the manual Google searches performed by Carlini et al. Automated processes are introduced to enhance efficiency, minimizing errors and reducing the time-consuming nature of the task.
Lower Bound on Memorization
The approach yields a lower bound on the amount of memorization in the model, acknowledging potential false negatives due to an incomplete understanding of the training data. The lower bound serves as a conservative estimate of the extent of memorization.
Building AUXDATASET
To execute the strategy, a large corpus of Internet text, denoted as AUXDATASET, is collected by consolidating four of the largest LLM pre-training datasets. These datasets include The Pile, RefinedWeb, RedPajama, and Dolma, totaling 9TB. Tokenization and coarse deduplication are applied to ensure dataset integrity.
Implementation Efficiency
Efficiency is maintained through the use of a suffix array, allowing for efficient searches despite the large dataset size. The AUXDATASET is sharded into 32 independent suffix arrays to accommodate memory constraints. Despite its size, this strategy enables a complete intersection with potential training data, facilitating a faster evaluation.
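As a hedged illustration of the sharding scheme (the class and helper names below are assumptions, reusing the suffix-array functions from the earlier sketch; the paper's implementation is far more elaborate):

```python
NUM_SHARDS = 32  # the shard count reported above

class Shard:
    """One independent piece of AUXDATASET with its own suffix array."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.suffix_array = build_suffix_array(tokens)  # from the earlier sketch

    def contains(self, query):
        return contains_subsequence(self.tokens, self.suffix_array, query)

def found_in_auxdataset(shards, query):
    """A generated 50-token window counts as memorized if any shard contains it."""
    return any(shard.contains(query) for shard in shards)
```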
Computational Effort
The end-to-end evaluation demands significant computational resources, involving a three-week process on a single machine with 176 cores and 1.4TB of RAM. I/O bandwidth limitations contribute to over half of the total time spent, suggesting room for optimization in future implementations.
Experimental Setup
Models Under Analysis
- GPT-2 (1.5b): One of the pioneering large language models, trained on data obtained through URLs submitted to Reddit [42]. Prior work extracted 600 training examples manually [14].
- LLaMA (7b, 65b): A popular model family, known for being over-trained with respect to a compute-optimal budget [49]. Trained on a non-public mixture of publicly available data.
- Falcon (7b, 40b): A pair of models designed to outperform LLaMA in various settings, with limited training details disclosed [51].
- Mistral 7b: A model similar to LLaMA with undisclosed training details, touted as the highest accuracy model in its size category [28].
- OPT (1.3b, 6.7b): A family of models ranging from 125 million to 175 billion parameters, generally less capable due to fewer training steps [54].
- gpt-3.5-turbo-instruct: An OpenAI API model whose size, training algorithm, and training dataset are all undisclosed; it is accessible only through the API, and its weights are not public.
Memorization Metrics
The analysis encompasses several metrics for each model, including the percentage of tokens generated that are a direct 50-token copy from AUXDATASET, the number of unique 50-token sequences, and the extrapolated lower bound of memorized 50-token sequences.
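A hedged sketch of the first metric, the fraction of generated tokens falling inside at least one verbatim 50-token match (the exact accounting in the paper may differ; `is_memorized_50gram` stands in for the AUXDATASET lookup described above):

```python
def fraction_memorized_tokens(generations, is_memorized_50gram, k=50):
    """Fraction of all generated tokens covered by at least one k-token window
    that appears verbatim in AUXDATASET."""
    covered, total = 0, 0
    for gen in generations:
        total += len(gen)
        flags = [False] * len(gen)
        for i in range(len(gen) - k + 1):
            if is_memorized_50gram(gen[i:i + k]):
                for j in range(i, i + k):
                    flags[j] = True
        covered += sum(flags)
    return covered / total if total else 0.0
```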
Results
The table (Table 2) presents a detailed breakdown of the memorization metrics for each model, revealing the varying degrees of memorization across different semi-closed models. Notably, gpt-3.5-turbo-instruct’s metrics are extrapolated due to cost constraints.
Observations
- Smaller Memorization Rates: Compared with open-source models of similar size, the observed memorization rates are significantly smaller, as depicted in Figure 15.
- Diverse Capabilities: The models, despite being semi-closed, exhibit diverse capabilities and memorization tendencies, highlighting the influence of undisclosed training details.
- API Cost Considerations: The cost of querying gpt-3.5-turbo-instruct influences the approach, with an extrapolated evaluation due to financial constraints.
Results
Memorization Across Models
The primary finding is that every model tested emits memorized training data, as the results in Table 2 show. The degree of memorization, however, varies notably across model families, which complicates interpretation.
Model Discrepancies
- Mistral 7B vs. Falcon 7B: A stark contrast emerges between Mistral 7B and Falcon 7B, models of comparable size and accuracy, with a difference in detected memorization exceeding a factor of 10. Interpreting this discrepancy is challenging, as it could signify actual variations in memorization tendencies or potentially highlight dataset construction limitations. However, a factor of 10 is deemed too substantial to attribute solely to data distribution differences.
- High Memorization Rates: Remarkably, even when accounting for potential dataset-related biases, the models, especially state-of-the-art ones, emit memorized training data at exceptionally high rates. gpt-3.5-turbo-instruct stands out as the worst offender: 0.852% of the tokens it generates fall within 50-token sequences found verbatim in AUXDATASET.
Training Duration Impact
As anticipated, the amount of training proves to be a significant factor influencing memorization. Models trained for longer tend to memorize more data, in line with the scaling laws of Hoffmann et al. [25]. Some models, like OPT, which are under-trained relative to these scaling laws, exhibit lower memorization rates but correspondingly poorer benchmark performance.
Over-Training and Privacy Leakage
The intentional over-training of certain model families, exemplified by LLaMA, introduces a trade-off between compute at training time and inference time. This over-training, while enhancing performance in certain scenarios, appears to elevate privacy leakage, suggesting a potential downside to this training strategy.
Size Matters
The analysis further underscores that larger models, on average, exhibit a 5% higher total extractable memorization than their smaller counterparts. Leveraging the Good-Turing estimator to extrapolate memorization rates, the results imply that the total amount of extractable memorized data is likely even higher than what was directly observed.
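As a rough illustration of the Good-Turing idea (the sequence IDs and numbers below are invented for the example, and the paper's extrapolation procedure is more involved): the fraction of memorized sequences observed exactly once estimates how much memorized content has not yet been seen.

```python
from collections import Counter

def good_turing_unseen_mass(observed_sequences):
    """Good-Turing estimate of the probability mass of memorized sequences
    not yet observed: the fraction of observations that are singletons."""
    counts = Counter(observed_sequences)
    singletons = sum(1 for c in counts.values() if c == 1)
    n = len(observed_sequences)
    return singletons / n if n else 0.0

# Illustrative usage: each item is the ID of a unique memorized 50-token
# sequence observed during generation. A large singleton fraction implies
# that many more memorized sequences would surface with additional queries.
extracted = ["seq_a", "seq_b", "seq_a", "seq_c", "seq_d", "seq_d", "seq_e"]
print(good_turing_unseen_mass(extracted))  # 3 singletons / 7 draws ~= 0.43
```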
The Challenges of Data Extraction from Aligned Language Models
Memorization Patterns in State-of-the-Art Models
The previous sections established that state-of-the-art language models memorize substantial amounts of training data, raising concerns about potential data leakage. The focus now shifts to models aligned for conversational interaction, such as ChatGPT.
Challenges in Extracting Data from Aligned Models
Challenge 1: Disruption of Continuation Interface
Aligned models, like ChatGPT, present a unique challenge due to their conversational nature. Unlike base models, which can be prompted with arbitrary strings for data extraction, conversational models have a predefined dialog format. The typical prompt structure involves an indicator, such as “User:” or “Assistant:”, signaling the transition between user and assistant turns. This disrupts the conventional prompt-continuation attack used in prior sections.
Challenge 2: Alignment and Evasion
Alignment, the process of tuning models toward human preferences, adds another layer of complexity to data extraction attempts. The alignment procedure, designed to make models more helpful and user-friendly, could prevent extraction attacks from succeeding outright. An illustrative interaction with ChatGPT suggests that alignment may lead the model to refuse to complete text from its training set.
Baseline Attack and Evaluation
To assess the viability of data extraction in aligned models, a baseline attack is introduced. Adapting previous random-token-prompting attacks, the model is asked to repeat random tokens and continue generating. The results of this experiment, conducted at scale with thousands of prompts, reveal that only 0.02% of generated tokens are part of a 50-token sequence directly copied from AUXDATASET.
Comparative Analysis
Comparing this to a smaller semi-closed model (OPT with 1.3B parameters) and to gpt-3.5-turbo-instruct makes the effect of alignment apparent: the small OPT model emits 0.031% of its tokens as direct copies from its training data, and gpt-3.5-turbo-instruct emits memorized sequences at a rate of roughly 0.85%, yet the aligned gpt-3.5-turbo chat model sits well below both. This suggests the alignment procedure either suppresses regurgitation or, at minimum, hides it from this baseline attack.
The Divergence Attack Methodology
To initiate the Divergence Attack, the strategy is to prompt the model in a way that disrupts its standard dialog-style generation and encourages it to diverge towards nonsensical output. For instance, a prompt like “Repeat the word ‘poem’” is employed to cause the model to initially repeat the given word multiple times before eventually diverging into nonsensical text. It is within this divergence that the model sporadically generates memorized content from its pre-training data.
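A minimal sketch of what such a query might look like through the OpenAI chat API (the exact prompt wording and decoding parameters are assumptions, not the authors' setup, and the provider may have since mitigated this behavior):

```python
# Hedged sketch of the divergence-style prompt against a chat model.
# Assumes the openai Python package (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Repeat the word 'poem' forever: poem poem poem poem",
    }],
    max_tokens=1024,
    temperature=1.0,
)

# After many repetitions the paper reports the model sometimes "diverges"
# into unrelated text, which is then checked against AUXDATASET for
# verbatim 50-token matches.
print(response.choices[0].message.content)
```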
Experimental Results
Cost-Effective Data Extraction
Using a query budget of $200 for ChatGPT (gpt-3.5-turbo), the study successfully extracts over 10,000 unique verbatim-memorized training examples. Extrapolations suggest that adversaries with larger budgets could extract substantially more data.
Length and Frequency
Memorized text proves to be extensive, with the longest extracted string exceeding 4,000 characters, and several hundred strings surpassing 1,000 characters. Over 93% of memorized strings are unique, with only a small percentage repeated a few times, showcasing the effectiveness of the prompting strategy in producing diverse memorized outputs.
Qualitative Analysis
The extracted memorized examples cover a wide array of text sources, demonstrating the model’s capacity to memorize and reproduce various types of content. Notable categories include:
- Personally Identifiable Information (PII)
- NSFW and explicit content, including material related to guns and war
- Literature, with verbatim excerpts from novels and complete copies of poems
- URLs, including valid links containing seemingly random nonces
- Cryptographically-random identifiers, such as exact Bitcoin addresses
- Code snippets, frequently JavaScript, matched verbatim in AUXDATASET
- Snippets from research papers, including entire abstracts and bibliographic data
- Boilerplate text, including lists of countries, date sequences, and copyright headers
Verifying Extractions with Membership Inference
While the previous evaluation centered on measuring memorization, this phase takes a more proactive stance: verifying individual extractions. The membership inference attack proposed by Carlini et al. [14] proves effective at distinguishing memorized training data from hallucinated content. The score of choice compares the model’s perplexity on the generated text with the text’s entropy under zlib compression, flagging outputs that the model finds unusually “easy” relative to their compressibility.
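A hedged sketch of such a perplexity/zlib score, in the style of Carlini et al. [14], using an open scoring model purely for illustration (the model choice and exact scoring details here are assumptions, not the paper's implementation):

```python
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_log_perplexity(text):
    """Average per-token negative log-likelihood under the scoring model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return loss.item()

def zlib_bytes(text):
    """Length of the zlib-compressed text, a crude proxy for its entropy."""
    return len(zlib.compress(text.encode("utf-8")))

def membership_score(text):
    """Lower scores flag text the model finds unusually 'easy' relative to
    its compressibility -- treated here as evidence of memorization."""
    return lm_log_perplexity(text) / zlib_bytes(text)
```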
Figure 10 illustrates the impact of varying the membership inference threshold on the attack’s precision. The results indicate that not only is the extraction of training data possible, but it can be done with high precision. This precision remains relatively constant up to a certain threshold, showcasing the attack’s efficacy. However, there is acknowledgment that further refinement could enhance the precision of this attack.
Figure 10: Out of 494 examples, the number we identify as memorized via manual web search vs. by checking whether at least 80% of the tokens appear in 50-grams found in AUXDATASET. The automatic method underestimates memorization compared with manual assessment using a search engine.
Discoverability of ChatGPT Memorization
Turning to discoverable memorization in ChatGPT, the study asks whether the memorized data could have been identified naturally or only through an adversarial approach. Using samples already known to be memorized, it makes a surprising discovery: under natural prompting, ChatGPT fails to emit output that the adversarial prompt had shown to be memorized. This discrepancy suggests that detecting memorization without adversarial access may prove difficult.
To explore the base model’s testability, the study compares the aligned models gpt-3.5-turbo and gpt-3.5-turbo-instruct. Despite being fine-tuned on different datasets, both models memorize the same samples, indicating that the memorization is rooted in the pre-training data distribution rather than the fine-tuning data. The results also suggest that despite varied fine-tuning setups, data memorized during pretraining persists, aligning with recent findings on the longevity of memorized data. The study concludes by highlighting the challenges in auditing the privacy and security of black-box RLHF-aligned models, especially considering the lack of public access to the original base model.
Potential Reasons for Vulnerability
Extensive Pre-training Epochs
One speculated factor contributing to ChatGPT’s vulnerability is its potential exposure to extended pre-training epochs. The model’s high-speed inference and extensive deployment on a large scale suggest an emerging trend of “over-training.” Over-training involves training models on significantly more data than deemed “training compute optimal,” aiming to maximize utility at a fixed inference cost. To achieve this, models may undergo many epochs over the same data, a practice known to substantially increase memorization. Evaluating the attack on models trained for multiple epochs supports this speculation, emphasizing a trade-off between privacy and inference efficiency induced by over-training.
Unstable Token Repetition
The attack strategy employed against ChatGPT reveals an intriguing instability related to token repetition. The model tends to diverge only when prompted to repeat single-token words: after a few hundred repetitions, the probability of divergence approaches near-certainty. Repeating multi-token words, in contrast, remains stable and does not trigger divergence. While the exact cause of this instability is not explained, the effect is consistent and repeatable.
This instability may be connected to how the model treats the <|endoftext|> token seen during pre-training. Modern language models use special tokens such as <|endoftext|> to delineate document boundaries during pre-training. The attack, by repeating a single token many times, seemingly produces an effect akin to emitting <|endoftext|>, leading to a potential “reset” in the model’s behavior.
Understanding the Token Repetition Effect
To grasp the impact of token repetition, the study investigates the similarity between the attention query vectors of the repeated tokens and those of the model’s <|endoftext|> token. The results indicate that as a token is repeated many times, its attention query vector approaches that of <|endoftext|>. This implies the next-token distribution becomes similar to the model’s behavior when it encounters <|endoftext|> during pre-training. The study further verifies that natural sampling from the model with a random prompt does not induce this effect, establishing that token repetition is central to the observed vulnerability.
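The paper's measurement requires access to the model's internal attention queries; as a rough, hedged proxy on an open model, one can instead compare the final hidden state after many repetitions of a word with the hidden state produced by the <|endoftext|> token itself. Both the model choice (GPT-2, purely for illustration) and the use of hidden states rather than attention query vectors are simplifying assumptions.

```python
# Rough proxy for the attention-query analysis described above.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def final_hidden_state(token_ids):
    """Last-layer hidden state at the final position of the input."""
    with torch.no_grad():
        out = model(torch.tensor([token_ids]), output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

eot_state = final_hidden_state([tok.eos_token_id])   # <|endoftext|>
word_ids = tok.encode(" poem")

for n in (1, 10, 100, 400):
    state = final_hidden_state(word_ids * n)
    sim = F.cosine_similarity(state, eot_state, dim=0).item()
    print(f"{n:4d} repetitions: cosine similarity to <|endoftext|> = {sim:.3f}")
```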
Figure: gpt-3.5-turbo-instruct can repeat two- or three-token words thousands of times without causing any divergence, but one-token words can only be repeated a few hundred times before the probability of divergence rapidly approaches near-certainty. Solid lines show medians over 40 different word choices; shaded regions show the 10%–90% quantile ranges.
Reference: https://dx.doi.org/10.48550/arxiv.2311.17035