Assessments
TRUSTLLM Preliminaries
The Preliminaries of TRUSTLLM section lays the groundwork for understanding the benchmark design used to evaluate various Large Language Models (LLMs).
The inclusion of both proprietary and open-weight LLMs reflects a thorough and inclusive approach.
Moreover, the emphasis on experimental setup – detailing datasets, tasks, prompt templates, and evaluation methods – provides a clear and systematic approach for assessment.
The ethical consideration highlighted reflects a responsible and conscientious approach to research, especially considering the potential impact of LLM outputs on individuals.
This foundational approach aims to enhance the development of more reliable and trustworthy LLMs, which is a commendable and crucial objective in the rapidly evolving field of AI.
Agreement: 85% Stars: ⭐⭐⭐⭐✩
Curated List of LLMs
The Curated List of LLMs in the study is impressive, featuring a diverse set of 16 LLMs, including both proprietary and open-weight models.
This comprehensive collection, spanning a wide range of model sizes, training data, methodologies, and capabilities, provides a robust and detailed landscape for evaluation.
The inclusion of specific examples like ChatGPT & GPT-4 and Vicuna, with their unique features and applications, showcases the depth of the study.
This section not only serves as an informative resource but also highlights the significant strides made in the field of conversational AI and natural language processing.
The detailed approach taken in this curation is pivotal for understanding the current landscape and future potential of LLMs.
Agreement: 90% Stars: ⭐⭐⭐⭐✩
Original Text
-------------------------------------------------
Preliminaries of TRUSTLLM
In this section, we introduce the design of our benchmark.
As shown in Figure 2, Section 5.1 covers the model selection of LLMs, including proprietary and open-weight LLMs.
Section 5.2 describes our experimental setup, including datasets, tasks, prompt templates, and evaluation methods.
Ethical consideration. In illustrating examples within the assessment tasks, certain outputs produced by LLMs may be disconcerting to some individuals.
We emphasize that our work is solely for research purposes, and no one should misuse the datasets/methods of TRUSTLLM in illegal ways.
The ultimate goal of our work is to foster the development of more reliable and trustworthy LLMs.
Curated List of LLMs
In this study, we meticulously curate a diverse set of 16 LLMs, encompassing proprietary and open-weight examples.
This collection represents a broad spectrum of model size, training data, methodologies employed, and functional capabilities, offering a comprehensive landscape for evaluation.
We summarize the information of each LLM in Table 3.
ChatGPT & GPT-4 [324]. ChatGPT and GPT-4, developed by OpenAI, represent specialized adaptations of the GPT architecture explicitly tailored for conversational AI tasks.
These models signify the dawn of the authentic era of LLMs. Trained on extensive collections of internet text data, they can generate responses that closely mimic human conversational patterns. Further refinement is achieved through fine-tuning with RLHF [43], which enhances their proficiency in producing coherent and contextually appropriate responses.
GPT models represent a monumental leap in conversational AI, establishing a benchmark for future LLM developments and solidifying their position at the forefront of this technological revolution.
Vicuna [82]. The Vicuna series (7b, 13b, and 33b) is developed by researchers from LMSYS [325], targeting a wide array of natural language processing tasks.
Central to Vicuna is an emphasis on intricate performance and structural nuance, with models fine-tuned on a substantial dataset comprising approximately 70,000 user-shared ChatGPT conversations.
Table 3: The details of LLMs in the benchmark.
For the PaLM 2 API, we removed the safety restrictions [323], because these restrictions caused many of the returned responses to be empty.
Model | Model Size | Open-Weight | Version | Creator | Source
GPT-3.5-turbo (ChatGPT) | unknown | No | - | OpenAI | OpenAI API
GPT-4 | unknown | No | - | OpenAI | OpenAI API
ERNIE-3.5-turbo | unknown | No | - | Baidu Inc. | ERNIE API
text-bison-001 (PaLM 2) | unknown | No | - | Google | Google API
Llama2-7b-chat | 7b | Yes | - | Meta | HuggingFace
Llama2-13b-chat | 13b | Yes | - | Meta | HuggingFace
Llama2-70b-chat | 70b | Yes | - | Meta | HuggingFace
Mistral-7b | 7b | Yes | v0.1 | Mistral AI | HuggingFace
Vicuna-33b | 33b | Yes | v1.3 | LMSYS | HuggingFace
Vicuna-13b | 13b | Yes | v1.3 | LMSYS | HuggingFace
Vicuna-7b | 7b | Yes | v1.3 | LMSYS | HuggingFace
ChatGLM2 | 6b | Yes | v1.0 | Tsinghua & Zhipu | HuggingFace
Baichuan-13b | 13b | Yes | - | Baichuan Inc. | HuggingFace
Wizardlm-13b | 13b | Yes | v1.2 | Microsoft | HuggingFace
Koala-13b | 13b | Yes | - | UCB | HuggingFace
Oasst-12b | 12b | Yes | - | LAION | HuggingFace
Vicuna-33b employs advanced memory optimization techniques to manage longer conversational content during training, achieving cost-effective efficiency.
ChatGLM2 [326]. ChatGLM2 was released by the KEG Lab [327] of Tsinghua University and Zhipu AI [328] in 2023, advancing from its predecessor ChatGLM.
With 6 billion parameters and the General Language Model (GLM) architecture, it supports various NLP tasks like natural language generation, text classification, and machine translation.
ChatGLM2-6B benefits from robust pre-training on 1.4T Chinese and English tokens and from fine-tuning aligned with human preferences, which leads to substantial performance boosts on several benchmarks.
The model also adopts flash attention [329] and multi-query attention, extending the context length to 32K and improving inference efficiency, respectively.
These enhancements make ChatGLM2-6B a competitive model in the open-source community, with more extended context handling and efficient inference, marking a notable evolution in the ChatGLM series.
Koala-13b [330]. Koala-13b is developed by BAIR [331] for academic research with a parameter count of 13 billion.
It has undergone extensive human evaluations on various test sets, including real user queries, showcasing its effectiveness in assistant-like applications.
Llama2 [69]. The Llama2 series, developed by Meta [68], consists of models ranging from 7b to 70b parameters.
These models are notable for being trained on 2 trillion tokens. The series includes specialized variants like Llama Chat, fine-tuned with over 1 million human annotations.
Llama2 excels in external benchmarks, showcasing its proficiency in reasoning, coding, and knowledge tests. To bolster the safety aspect of Llama2, measures such as a toxicity filter, context distillation learning, and red teaming are incorporated.
WizardLM-13b [332]. WizardLM-13b is a powerful language model developed by Microsoft Research [333]. Unlike traditional training methods, WizardLM-13b leverages an innovative process known as EvolInstruct [332], which uses LLMs to automatically generate open-domain instructions of varying complexity levels.
This process involves evolving existing instructions to increase complexity and difficulty and creating new instructions to enhance diversity.
Oasst-12b [81]. Oasst (Open Assistant), developed by the LAION organization [334], represents the initial English SFT iteration of the Open-Assistant project.
Its training data is based on the basic data structure of conversation trees, and the model is fine-tuned on approximately 22,000 human demonstrations of assistant conversations.
Baichuan-13b [335]. Baichuan-13b is developed by Baichuan AI [181]. With a parameter count of 13 billion, Baichuan-13b is a large-scale language model known for its exceptional performance on Chinese benchmarks.
It distinguishes itself by being trained on a massive corpus of 1.4 trillion tokens and supports both Chinese and English, using ALiBi [336] position encoding with a context window length of 4096.
ERNIE [79]. Ernie is an LLM developed by Baidu [337], which exemplifies a generative AI product that is augmented with a knowledge-enhanced framework.
This model’s robust pre-training on numerous Chinese and English tokens, combined with its fine-tuning in line with human preferences, highlights its pivotal contribution to the advancement of AI in China. Ernie’s versatile applications range from everyday household tasks to industrial and manufacturing innovations.
Mistral 7B [338]. Mistral 7B, a 7b-parameter LLM by Mistral AI [339], effectively handles text generation and diverse NLP tasks; its benchmark coverage spans areas like commonsense reasoning, world knowledge, math, and reading comprehension, showcasing its broad applicability.
It utilizes a sliding window attention mechanism [340, 341], supports English and coding languages, and operates with an 8k context length.
PaLM 2 [36]. PaLM 2 is a capable language model developed by Google [342]. It shows strong multilingual language processing, code generation, and reasoning capabilities, reflecting advancements in computational scaling, dataset diversity, and architectural improvements.
Experimental Settings
We categorize the tasks in the benchmark into two main groups: Generation and Classification. Drawing from prior studies [71], we employ a temperature setting of 0 for classification tasks to ensure more precise outputs.
Conversely, for generation tasks, we set the temperature to 1, fostering a more diverse range of results and exploring potential worst-case scenarios.
For instance, recent research suggests that elevating the temperature can enhance the success rate of jailbreaking [242]. For other settings like decoding methods, we use the default setting of each LLM.
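As a rough illustration only, the sketch below shows how these two temperature regimes could be applied per task type when querying an API-based model; the query_llm helper, the OpenAI client usage, and the default model name are assumptions for illustration, not the benchmark's actual harness.

```python
# Hypothetical sketch: dispatching benchmark prompts with the temperature
# settings described above (0 for classification, 1 for generation).
# The OpenAI client and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def query_llm(prompt: str, task_type: str, model: str = "gpt-3.5-turbo") -> str:
    # Temperature 0 -> more deterministic outputs for classification tasks;
    # temperature 1 -> more diverse outputs (worst-case probing) for generation.
    temperature = 0.0 if task_type == "classification" else 1.0
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        # Other decoding settings are left at the model's defaults.
    )
    return response.choices[0].message.content
```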
Datasets. In the benchmark, we introduce a collection of 30 datasets that have been meticulously selected to ensure a comprehensive evaluation of the diverse capabilities of LLMs. Each dataset provides a unique set of challenges.
They benchmark the LLMs across various dimensions of trustworthiness. A detailed description and the specifications of these datasets are provided in Table 4.
Tasks. In specific subsections, we have crafted a variety of tasks and datasets to augment the thoroughness of our findings.
Additionally, in light of the expansive and diverse outputs generated by LLMs compared to conventional LMs, we have incorporated a range of new tasks to evaluate this unique aspect. Table 5 lists all the tasks encompassed in the benchmark.
Prompts. In most tasks, particularly for classification, our prompts are designed for LLMs to incorporate specific keywords, aiding our evaluation process.
For example, we expect LLMs to generate relevant category labels (such as “yes” or “no”), which allows for efficient regular expression matching in automated assessments. Furthermore, except for the privacy leakage evaluation (where we aim to increase the probability of LLMs leaking private information), we deliberately exclude few-shot examples from the prompts.
A key reason for this is the complexity involved in choosing examples [362, 363, 364], as varying exemplars may significantly influence the final performance of LLMs. Moreover, although various prompting methods have been proposed in prior studies, such as Chain of Thought (CoT) [365, 366, 367, 368], Tree of Thoughts (ToT) [369], and others [370], we do not involve these methods in our benchmark, as the benchmark aims to measure the plain performance of LLMs.
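To make the keyword-matching idea concrete, a minimal sketch of such an extractor is shown below; the exact keywords, regular expressions, and fallback behavior are illustrative assumptions rather than the benchmark's actual evaluation scripts.

```python
import re


# Hypothetical sketch: map a classification-style response to a label by
# keyword/regex matching, deferring ambiguous cases to an LLM-based extractor
# (see the Evaluation paragraph below). The keywords here are assumptions.
def extract_label(response: str) -> str | None:
    text = response.strip().lower()
    has_yes = re.search(r"\byes\b", text) is not None
    has_no = re.search(r"\bno\b", text) is not None
    if has_yes and not has_no:
        return "yes"
    if has_no and not has_yes:
        return "no"
    # Ambiguous or free-form answer: fall back to ChatGPT/GPT-4 keyword extraction.
    return None
```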
Evaluation. Our benchmark includes numerous generative tasks, posing the challenge of defining a standard ground-truth for assessment.
To avoid manual evaluation’s high cost and low efficiency, we’ve integrated a specialized classifier [73] and ChatGPT/GPT-4 into our evaluation framework.
For the tasks with ground-truth labels, our evaluation focuses on keyword matching and regular expressions.
When the approach fails to assess particular responses accurately, we utilize ChatGPT/GPT-4 to extract keywords in answers before the evaluation process.
Generative tasks yield diverse answers, often including reasoning and explanations, which makes traditional keyword/regex matching ineffective. Recent studies have validated the effectiveness of LLMs in evaluation [371, 372, 73, 373, 374], enabling their use as cost-effective alternatives to human evaluators.
Table 4: Datasets and metrics in the benchmark. In the "Exist?" column, "Prior" means the dataset is from prior work, and "New" means the dataset is first proposed in our benchmark.

Dataset | Description | Num. | Exist? | Section
SQUAD2.0 [343] | It combines questions in SQuAD1.1 [344] with over 50,000 unanswerable questions. | 100 | Prior | Misinformation Generation (§6.1)
CODAH [345] | It contains 28,000 commonsense questions. | 100 | Prior | Misinformation Generation (§6.1)
HOTPOTQA [346] | It contains 113k Wikipedia-based question-answer pairs for complex multi-hop reasoning. | 100 | Prior | Misinformation Generation (§6.1)
ADVERSARIALQA [347] | It contains 30,000 adversarial reading comprehension question-answer pairs. | 100 | Prior | Misinformation Generation (§6.1)
CLIMATE-FEVER [348] | It contains 7,675 climate change-related claims manually curated by human fact-checkers. | 100 | Prior | Misinformation Generation (§6.1)
SCIFACT [349] | It contains 1,400 expert-written scientific claims paired with evidence abstracts. | 100 | Prior | Misinformation Generation (§6.1)
COVID-FACT [350] | It contains 4,086 real-world COVID claims. | 100 | Prior | Misinformation Generation (§6.1)
HEALTHVER [351] | It contains 14,330 health-related claims checked against scientific articles. | 100 | Prior | Misinformation Generation (§6.1)
TRUTHFULQA [219] | Multiple-choice questions to evaluate whether a language model is truthful in generating answers to questions. | 352 | Prior | Hallucination (§6.2)
HALUEVAL [190] | It contains 35,000 generated and human-annotated hallucinated samples. | 300 | Prior | Hallucination (§6.2)
LM-EXP-SYCOPHANCY [352] | A dataset of human questions, each with one sycophantic and one non-sycophantic response example. | 179 | Prior | Sycophancy in Responses (§6.3)
OPINION PAIRS | It contains 120 pairs of opposite opinions. | 240 / 120 | New | Sycophancy in Responses (§6.3) / Preference Bias in Subjective Choices (§8.3)
CROWS-PAIR [353] | It contains examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. | 1000 | Prior | Stereotypes (§8.1)
STEREOSET [354] | It contains sentences that measure model preferences across gender, race, religion, and profession. | 734 | Prior | Stereotypes (§8.1)
ADULT [355] | The dataset, containing attributes like sex, race, age, education, work hours, and work type, is used to predict salary levels for individuals. | 810 | Prior | Disparagement (§8.2)
JAILBREAK TRIGGER | The dataset contains prompts based on 13 jailbreak attacks. | 1300 | New | Jailbreak (§7.1), Toxicity (§7.3)
MISUSE (ADDITIONAL) | This dataset contains prompts crafted to assess how LLMs react when confronted by attackers or malicious users seeking to exploit the model for harmful purposes. | 261 | New | Misuse (§7.4)
DO-NOT-ANSWER [73] | It is curated and filtered to consist only of prompts to which responsible LLMs do not answer. | 344 + 95 | Prior | Misuse (§7.4), Stereotypes (§8.1)
ADVGLUE [266] | A multi-task dataset with different adversarial attacks. | 912 | Prior | Robustness against Input with Natural Noise (§9.1)
ADVINSTRUCTION | 600 instructions generated by 11 perturbation methods. | 600 | New | Robustness against Input with Natural Noise (§9.1)
TOOLE [140] | A dataset of user queries that may trigger LLMs to use external tools. | 241 | Prior | OOD (§9.2)
FLIPKART [356] | A product review dataset, collected starting from December 2022. | 400 | Prior | OOD (§9.2)
DDXPLUS [357] | A 2022 medical diagnosis dataset comprising synthetic data representing about 1.3 million patient cases. | 100 | Prior | OOD (§9.2)
ETHICS [358] | It contains numerous descriptions of morally relevant scenarios and their moral correctness. | 500 | Prior | Implicit Ethics (§11.1)
SOCIAL CHEMISTRY 101 [359] | It contains various social norms, each consisting of an action and its label. | 500 | Prior | Implicit Ethics (§11.1)
MORALCHOICE [360] | It consists of different contexts with morally correct and wrong actions. | 668 | Prior | Explicit Ethics (§11.2)
CONFAIDE [201] | It contains descriptions of how information is used. | 196 | Prior | Privacy Awareness (§10.1)
PRIVACY AWARENESS | It includes different privacy information queries about various scenarios. | 280 | New | Privacy Awareness (§10.1)
ENRON EMAIL [84] | It contains approximately 500,000 emails generated by employees of the Enron Corporation. | 400 | Prior | Privacy Leakage (§10.2)
XSTEST [361] | It is a test suite for identifying exaggerated safety behaviors in LLMs. | 200 | Prior | Exaggerated Safety (§7.2)
Consequently, for complex generative tasks such as “Adversarial Factuality” (§6.4), we employ GPT-4, whereas, for more straightforward generative tasks, ChatGPT (GPT-3.5) is used to ensure cost-effectiveness.
Additionally, we employ a previously researched evaluator (i.e., a trained classifier) [73] to categorize responses based on whether LLMs refuse to answer (e.g., responses like “As an AI language model, I cannot ...").
This evaluator, a fine-tuned Longformer classifier (600M)† [73], has shown evaluation performance closely mirroring that of human evaluators and GPT-4. It categorizes LLMs’ responses into either refusing or not refusing to answer.
† https://huggingface.co/LibrAI/longformer-harmful-ro
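A minimal sketch of how such a refuse-to-answer check could be run with the released classifier is shown below; the pipeline usage and the label-matching logic are assumptions for illustration (the classifier's actual label names may differ), not the benchmark's exact evaluation code.

```python
from transformers import pipeline

# Hypothetical sketch: flag refusals ("RtA") with the fine-tuned Longformer
# evaluator referenced above. The label name checked below is an assumption;
# consult the model card for the classifier's actual label set.
rta_classifier = pipeline(
    "text-classification",
    model="LibrAI/longformer-harmful-ro",
)


def refuses_to_answer(response: str) -> bool:
    result = rta_classifier(response, truncation=True)[0]
    return "refus" in result["label"].lower()  # assumed label naming

# A task's RtA score would then be the fraction of responses flagged as refusals.
```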
Table 5: Task Overview. ○ means evaluation through automatic scripts (e.g., keyword matching), ● means automatic evaluation by ChatGPT, GPT-4, or Longformer, and ◐ means mixed evaluation. RtA stands for Refuse to Answer. ASR means Attack Success Rate. RS is the Robustness Score. More trustworthy LLMs are expected to have a higher value of the metrics marked with ↑ and a lower value of those marked with ↓.
Task Name | Metrics | Type | Eval | Subsection
Closed-book QA | Accuracy (↑) | Generation | ● | Misinformation (Internal)
Fact-Checking | Macro F-1 (↑) | Classification | ● | Misinformation (External)
Multiple Choice QA | Accuracy (↑) | Classification | ○ | Hallucination
Hallucination Classification | Accuracy (↑) | Classification | ○ | Hallucination
Persona Sycophancy | Embedding similarity (↑) | Generation | ○ | Sycophancy
Opinion Sycophancy | Percentage change (↓) | Generation | ◐ | Sycophancy
Factuality Correction | Percentage change (↑) | Generation | ● | Adversarial Factuality
Jailbreak Attack Evaluation | RtA (↑) | Generation | ● | Jailbreak
Toxicity Measurement | Toxicity Value (↓) | Generation | ● | Toxicity
Misuse Evaluation | RtA (↑) | Generation | ○ | Misuse
Exaggerated Safety Evaluation | RtA (↓) | Generation | ● | Exaggerated Safety
Agreement on Stereotypes | Accuracy (↑) | Generation | ● | Stereotype
Recognition of Stereotypes | Agreement Percentage (↓) | Classification | ◐ | Stereotype
Stereotype Query Test | RtA (↑) | Generation | ◐ | Stereotype
Preference Selection | RtA (↑) | Generation | ● | Preference
Salary Prediction | p-value (↑) | Generation | ● | Disparagement
Adversarial Perturbation in Downstream Tasks | ASR (↓), RS (↑) | Generation | ○ | Natural Noise
Adversarial Perturbation in Open-Ended Tasks | Embedding similarity (↑) | Generation | ◐ | Natural Noise
OOD Detection | RtA (↑) | Generation | ◐ | OOD
OOD Generalization | Micro F1 (↑) | Classification | ● | OOD
Agreement on Privacy Information | Pearson’s correlation (↑) | Classification | ● | Privacy Awareness
Privacy Scenario Test | RtA (↑) | Generation | ○ | Privacy Awareness
Probing Privacy Information Usage | RtA (↑), Accuracy (↓) | Generation | ● | Privacy Leakage
Moral Action Judgement | Accuracy (↑) | Classification | ◐ | Implicit Ethics
Moral Reaction Selection (Low-Ambiguity) | Accuracy (↑) | Classification | ◐ | Explicit Ethics
Moral Reaction Selection (High-Ambiguity) | RtA (↑) | Generation | ◐ | Explicit Ethics
Emotion Classification | Accuracy (↑) | Classification | ○ | Emotional Awareness