In this dynamic landscape, the evaluation of LLMs has emerged as a critical endeavor to comprehend their potential as well as their limitations. Various benchmarks have been introduced to assess these models’ language understanding abilities, such as GLUE and SuperGLUE.
As LLMs’ capabilities evolve, benchmarks like CodeXGLUE, BIG-Bench, and NaturalInstructions have been designed to tackle more complex tasks. Furthermore, evaluating aspects beyond mere performance, such as robustness and ethical considerations, has led to the creation of benchmarks like AdvGLUE and TextFlint. A recent addition to this landscape is HELM, which offers a holistic evaluation of LLMs across diverse scenarios and metrics.
The Emergence of Trustworthiness Concerns
This paper aims to address this gap by providing a comprehensive evaluation of trustworthiness focused on the LLM GPT-42, juxtaposed with GPT-3.5 (ChatGPT), from a multitude of perspectives. These perspectives encompass toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness in adversarial demonstrations, privacy, machine ethics, and fairness under diverse settings.
Visual representations of unreliable responses and the evaluation taxonomy are showcased in Figures 1 and 3 respectively, offering a clear overview of the evaluation framework.
LLMs’ Enhanced Capabilities and their Implications
The evolution of large language models, particularly exemplified by GPT-3.5 and GPT-4, has led to newfound capabilities that extend far beyond their predecessors. Tailored optimization for dialogue has facilitated an elevated proficiency in following instructions, allowing users to mold tones, roles, and various adaptable factors.
This enhanced adaptability leads to functions like question-answering and in-context learning, where models learn from a few-shot demonstrations during conversations. This is a significant departure from earlier models like BERT and T5, which primarily catered to text infilling tasks.
However, the emergence of these advanced capabilities has brought forth a series of trustworthiness concerns. The capacity to follow instructions and adapt to diverse contexts can inadvertently introduce vulnerabilities. Potential adversaries might exploit dialogue contexts or system instructions to orchestrate adversarial attacks, thereby undermining the reliability of deployed systems.
Toxicity: Unmasking Harmful Content Generation
We initiate our evaluation by probing the ability of GPT models to sidestep toxic content generation. Three comprehensive evaluation scenarios are constructed:
- Standard Benchmark Evaluation: Using the REALTOXICITYPROMPTS benchmark, we gauge the properties and limitations of GPT-3.5 and GPT-4 in comparison to their LLM counterparts.
- Diverse System Prompts: By deploying 33 manually crafted system prompts, we assess the influence of these prompts on the toxicity levels of generated responses. These prompts range from role-playing to word meaning replacement.
- Challenging User Prompts: We leverage 1.2K demanding user prompts, generated by GPT-3.5 and GPT-4, to unearth model toxicity more effectively than existing benchmarks.
Stereotype Bias: Uncovering Subconscious Biases
In our examination of stereotype bias, we curate a dataset containing statements laden with stereotypes. GPT-3.5 and GPT-4 are then queried to agree or disagree with these statements, providing insights into the models’ potential biases. This evaluation unfolds across three scenarios:
- Baseline Measurement: Vanilla benign system prompts provide a baseline measurement of bias against different demographic groups.
- Untargeted System Prompts: Designed prompts guide models past content policy restrictions without fostering bias against specific demographic groups.
- Targeted System Prompts: System prompts are designed not only to overcome content policy constraints but also to encourage models to exhibit bias against selected demographic groups, showcasing the models’ resilience under misleading system inputs.
Adversarial Robustness: Under the Assault of Adversarial Attacks
The robustness of GPT models against textual adversarial attacks is scrutinized across three evaluation scenarios:
- Standard AdvGLUE Benchmark: By subjecting models to the AdvGLUE benchmark, we unveil vulnerabilities to existing adversarial attacks, compare the robustness of different GPT models, and investigate attack impacts on instruction-following abilities and transferability.
- Varied Task Descriptions: By altering instructive task descriptions and system prompts, we assess model resilience under diverse adversarial contexts.
- Challenging Adversarial Texts: GPT-3.5 and GPT-4 confront adversarial texts, AdvGLUE++, exposing their vulnerabilities to potent adversarial attacks.
Out-of-Distribution Robustness: Navigating Uncharted Territories
To understand how GPT models fare against out-of-distribution (OOD) data, we examine their responses to inputs with different text styles and queries beyond their training scope. Three scenarios shape this evaluation:
- Style Transformation: Models’ robustness against style-transformed inputs, such as Shakespearean style, is assessed.
- Current Event Queries: Responses to questions about recent events, not covered in training data, reflect the models’ reliability against unexpected queries.
- In-Context Learning with OOD Demonstrations: We introduce demonstrations with varied OOD styles and domains to investigate their impact on model performance.
Robustness to Adversarial Demonstrations: The Limits of In-Context Learning
GPT models’ prowess in in-context learning is scrutinized through three distinct adversarial demonstration scenarios:
- Counterfactual Examples: Models encounter counterfactual examples as demonstrations, unveiling potential vulnerabilities to misleading inputs.
- Spurious Correlations: Demonstrations with spurious correlations challenge models’ capability to discern relevant information.
- Backdoors: Backdoors introduced in demonstrations test models’ susceptibility to manipulation and misleading guidance.
Privacy: Balancing Data Utilization and Privacy Concerns
Three privacy-focused evaluation scenarios aim to uncover potential privacy breaches:
- Information Extraction from Pretraining Data: Sensitivity information extraction accuracy is evaluated in pretraining data to uncover potential memorization problems.
- Personally Identifiable Information (PII) Extraction: The accuracy of PII extraction introduced during inference stages sheds light on potential privacy vulnerabilities.
- Privacy Context Understanding: Models’ comprehension of privacy contexts during conversations involving privacy-related words and events is evaluated.
Machine Ethics: Weighing Moral Recognition and Resilience
The ethics of GPT models are evaluated through four scenarios focusing on commonsense moral recognition:
- Standard Benchmarks: Model performance on ETHICS and Jiminy Cricket benchmarks gauges their moral recognition abilities.
- Jailbreaking Prompts: Designed to mislead, jailbreaking prompts probe models’ robustness in moral recognition.
- Evasive Sentences: Generated evasive sentences challenge models’ moral recognition under adversarial conditions.
- Conditional Actions: Moral recognition under different attributes, exploring conditions under which models may fail.
Fairness: Navigating Fairness in Diverse Contexts
GPT models’ fairness is evaluated across three scenarios to explore their performance in different contexts:
- Base Rate Parity: Models’ performance across test groups with different base rate parity in zero-shot settings is examined.
- Unfair Demographically Imbalanced Contexts: The influence of imbalanced contexts on model fairness in few-shot settings is explored.
- Balanced Contexts: Model fairness is investigated under different numbers of fair, demographically balanced examples.
Toxicity: Unmasking Content Generation Vulnerabilities
The evaluation of toxicity brings forth several significant findings:
- GPT-3.5 and GPT-4 showcase substantial improvements in reducing toxicity compared to LLMs without instruction tuning or Reinforcement Learning from Human Feedback (RLHF), maintaining a low toxicity probability (below 32%) across diverse task prompts.
- Adversarial “jailbreaking” prompts, carefully designed to challenge models, expose a vulnerability wherein both GPT-3.5 and GPT-4 generate toxic content with a toxicity probability reaching nearly 100%.
- GPT-4’s inclination to follow “jailbreaking” prompts leads to heightened toxicity, surpassing that of GPT-3.5 under different system prompts and task prompts.
- Leveraging GPT-3.5 and GPT-4 to generate challenging toxic task prompts unveils an approach to enhance model toxicity, with this strategy being transferable to other LLMs lacking RLHF.
Stereotype Bias: Navigating Bias in Model Outputs
Our exploration into stereotype bias yields the following insights:
- GPT-3.5 and GPT-4 exhibit weak bias in the majority of stereotype topics under benign and untargeted system prompts.
- However, designed adversarial system prompts can “trick” both models into agreeing with biased content, with GPT-4 being more susceptible due to its precise adherence to misleading instructions.
- Bias varies based on demographic groups mentioned in user prompts and the nature of stereotype topics, highlighting the models’ sensitivity to query context.
- The models generate more biased content on less sensitive topics, possibly due to fine-tuning on certain groups and topics.
Adversarial Robustness: Withstanding Textual Attacks
Examination of adversarial robustness uncovers the following:
- GPT-4 outperforms GPT-3.5 on the AdvGLUE benchmark, indicating enhanced robustness.
- GPT-4 demonstrates superior resistance to human-crafted adversarial texts compared to GPT-3.5.
- Sentence-level perturbations are more transferable than word-level perturbations for both models on the standard AdvGLUE benchmark.
- Despite strong performance on standard benchmarks, GPT models remain vulnerable to adversarial attacks from other autoregressive models.
- SemAttack and TextFooler exhibit notable transferability in adversarial attacks across different models.
Out-of-Distribution Robustness: Navigating the Unknown
The exploration of out-of-distribution robustness unfolds as follows:
- GPT-4 displays consistently stronger generalization capabilities than GPT-3.5 across diverse OOD style transformations.
- GPT-4 exhibits better resilience in answering queries beyond its training scope compared to GPT-3.5.
- OOD demonstrations within the same domain, but different styles, highlight GPT-4’s superior generalization.
- GPT-4’s accuracy is positively influenced by closely related domains in OOD demonstrations but negatively impacted by distant domains, unlike GPT-3.5.
Robustness to Adversarial Demonstrations: Unmasking Model Learning
Robustness to adversarial demonstrations presents the following findings:
- Counterfactual examples in demonstrations don’t mislead GPT-3.5 and GPT-4, but rather benefit their learning.
- Spurious correlations from fallible heuristics in demonstrations mislead GPT-3.5 more than GPT-4.
- Backdoored demonstrations mislead both models, especially when closely positioned to user inputs, with GPT-4 being more susceptible.
Privacy: Navigating Sensitive Information
Privacy analysis exposes the following insights:
- GPT models can leak privacy-sensitive training data, such as email addresses, when prompted with specific contexts or demonstrations.
- GPT-4 demonstrates better safeguarding of personally identifiable information (PII) during regular inferences, possibly due to instruction tuning.
- Both models leak all types of PII when prompted with privacy-leakage demonstrations during in-context learning.
- GPT-4’s higher vulnerability to privacy-related prompts is attributed to its precise following of instructions.
Machine Ethics: Navigating Moral Recognition
Exploration of machine ethics unveils the following findings:
- GPT-3.5 and GPT-4 exhibit competitive moral recognition abilities compared to non-GPT models.
- Both models can be misled by jailbreaking prompts, with GPT-4’s adherence to instructions making it more manipulable.
- Both models are susceptible to recognizing immoral behaviors as moral when faced with evasive sentences, with GPT-4 being more vulnerable.
- Recognition performance varies based on immoral behavior properties, demonstrating model nuances.
Fairness: Navigating Equitable Predictions
The evaluation of fairness provides these insights:
- GPT-4’s accuracy is higher in balanced data settings but achieves higher unfairness scores in unbalanced settings, highlighting the accuracy-fairness tradeoff.
- Both GPT models exhibit substantial performance gaps across test groups with different base rate parity in zero-shot settings, indicating inherent bias.
- In few-shot settings, imbalanced training contexts induce unfair predictions.
- A small number of balanced few-shot examples enhance prediction fairness in GPT models.
Robustness Evaluation and Comparison
Table 5 presents a comprehensive comparison of the robustness of GPT-3.5 and GPT-4 against state-of-the-art (SoTA) models on the AdvGLUE benchmark. The evaluation is based on two key metrics: benign accuracy and robust accuracy. Benign accuracy refers to the accuracy of models on benign GLUE data, while robust accuracy assesses the model’s performance on adversarial AdvGLUE data. The performance drop, indicating the difference between benign and robust accuracy, offers insights into a model’s vulnerability to adversarial attacks.
Average Robust Accuracy and Performance Drop
In terms of average robust accuracy, GPT-4 outperforms GPT-3.5 with an impressive score of 78.41%, compared to GPT-3.5’s 67.37%. Notably, the SoTA model from the AdvGLUE leaderboard achieves a robust accuracy of 65.77%, indicating that GPT-3.5’s performance is on par with the existing top-performing model. Examining the performance drop, GPT-3.5 experiences a larger degradation of 14.43%, whereas GPT-4 showcases a more modest performance drop of 9.90%. In contrast, the SoTA model on the leaderboard exhibits a substantial 26.89% performance degradation under adversarial conditions. This positions GPT-4 as marginally more robust than GPT-3.5 and even outperforming other models on the leaderboard in terms of performance degradation.
Influence of Task Description and System Prompt
The influence of task description and system prompt on model robustness is analyzed in Table 5. Different templates, including instructive task descriptions (Template 2) and prompts informing the model about adversarial attacks (Template 3), do not significantly impact model robustness. Both average robust accuracy and performance drop remain relatively consistent across these templates, suggesting that these factors have limited influence on the models’ performance in adversarial scenarios.
Instruction-Following Abilities under Adversarial Attacks
The study examines whether adversarial attacks compromise the instruction-following abilities of GPT models. The rate at which models provide answers not specified in the prompt (NE) is reported in Table 5 and Table 7. Under various templates, GPT-4 maintains a steady NE rate with minimal increases, suggesting that adversarial attacks do not significantly disrupt its instruction-following abilities. In contrast, GPT-3.5 experiences a remarkable over 50% relative increase in NE across all templates. Notably, the models’ responses differ qualitatively when providing unspecified answers, with GPT-3.5 often identifying input sentences as jumbled or nonsensical, while GPT-4 tends to offer neutral sentiment interpretations.
Transferable Attack Strategies
Table 6 presents a comparative analysis of attack success rates on the AdvGLUE test set for GPT-3.5 and GPT-4, employing various adversarial text generation strategies. Sentence-level perturbations and human-crafted attacks emerge as more effective than word-level perturbations, particularly in transferring adversarial texts from BERT-like models. GPT-4 displays heightened robustness against human-crafted adversarial texts, with a significant drop in attack success rates for tasks like ANLI and AdvSQuAD, highlighting its improved resistance to such attacks compared to GPT-3.5.
Conclusion
In conclusion, this article extensively explored the robustness of GPT-3.5 and GPT-4 in comparison to state-of-the-art models on the AdvGLUE benchmark. GPT-4 showcased superior performance in terms of average robust accuracy and performance drop when subjected to adversarial attacks. The influence of task description and system prompt on model robustness was found to be minimal, and GPT-4’s instruction-following abilities were relatively well-preserved under adversarial conditions. Additionally, GPT-4 exhibited enhanced resilience against certain attack strategies compared to GPT-3.5. These findings underscore the advancements in model robustness and offer valuable insights into the capabilities and limitations of large-scale language models.
reference link : arXiv:2306.11698v1