TrustLLM: Preliminaries

AGI, AI Ethics, Synthetic Intelligence, Synthetic Mind, TrustLLM -

TrustLLM: Preliminaries

Abstract

Introduction

Background

Trust LLM Preliminaries 

Assessments

Transparency

Accountability

Open Challenges

Future Work
Conclusions

Types of Ethical Agents 

  

TRUSTLLM Preliminaries 

The Preliminaries of TRUSTLLM section lays the groundwork for understanding the benchmark design in evaluating various Language Large Models (LLMs).

The inclusion of both proprietary and open-weight LLMs showcases an in-depth and inclusive approach.

Moreover, the emphasis on experimental setup – detailing datasets, tasks, prompt templates, and evaluation methods – provides a clear and systematic approach for assessment.

The ethical consideration highlighted reflects a responsible and conscientious approach to research, especially considering the potential impact of LLM outputs on individuals.

This foundational approach aims to enhance the development of more reliable and trustworthy LLMs, which is a commendable and crucial objective in the rapidly evolving field of AI.

Agreement: 85% Stars: ⭐⭐⭐⭐✩

Curated List of LLMs

The Curated List of LLMs in the study is impressive, featuring a diverse set of 16 LLMs, including both proprietary and open-weight models.

This comprehensive collection, spanning a wide range of model sizes, training data, methodologies, and capabilities, provides a robust and detailed landscape for evaluation.

The inclusion of specific examples like ChatGPT & GPT-4 and Vicuna, with their unique features and applications, showcases the depth of the study.

This section not only serves as an informative resource but also highlights the significant strides made in the field of conversational AI and natural language processing.

The detailed approach taken in this curation is pivotal for understanding the current landscape and future potential of LLMs.

Agreement: 90% Stars: ⭐⭐⭐⭐✩

 

 

Original Text

-------------------------------------------------

Preliminaries of TRUSTLLM

In this section, we will introduce the design of our benchmark.

As shown in Figure 2, we will introduce the model selection of LLMs in Section 5.1, including proprietary and open-weight LLMs.

We will introduce our experimental setup in Section 5.2, including datasets, tasks, prompt templates, and evaluation methods.

Ethical consideration. In illustrating the examples within the assessment tasks, certain outputs produced by LLMs may be disconcerting for individuals.

We emphasize that our work is solely for research purposes, and no one should misuse the datasets/methods of TRUSTLLM in illegal ways.

The ultimate goal of our work is to foster the development of more reliable and trustworthy LLMs.

Curated List of LLMs

In this study, we meticulously curate a diverse set of 16 LLMs, encompassing proprietary and open-weight examples.

This collection represents a broad spectrum of model size, training data, methodologies employed, and functional capabilities, offering a comprehensive landscape for evaluation.

We summarize the information of each LLM in Table 3.

ChatGPT & GPT-4 [324]. ChatGPT and GPT-4, developed by OpenAI, represent specialized adaptations of the GPT architecture explicitly tailored for conversational AI tasks.

These models signify the dawn of the authentic era of LLMs. Trained on extensive collections of internet text data, they can generate responses that closely mimic human conversational patterns. Further refinement is achieved through fine-tuning with RLHF [43], which enhances their proficiency in producing coherent and contextually appropriate responses.

GPT models represent a monumental leap in conversational AI, establishing a benchmark for future LLM developments and solidifying their position at the forefront of this technological revolution.

Vicuna [82]. The Vicuna series (7b, 13b, and 33b) are developed by researchers from LMSYS [325], targeting a wide array of natural language processing tasks.

Central to Vicuna is an emphasis on intricate performance and structural nuance, with models fine-tuned on a substantial dataset comprising approximately 70,000 user-

Table 3: The details of LLMs in the benchmark.

For the use of the PaLM 2 API, we have removed the safety restrictions [323], as its safety restrictions resulted in many of the returned content being none.

Model

Model Size

Open-Weight

Version

Creator

Source

GPT-3.5-turbo (ChatGPT)

GPT-4

unknown unknown

q q

-

-

OpenAI

OpenAI API

OpenAI API

ERNIE-3.5-turbo

unknown

q

-

Baidu Inc.

ERNIE API

text-bison-001 (PaLM 2)

unknown

q

-

Google

Google API

Llama2-7b-chat

7b

¥

-

 

HuggingFace

Llama2-13b-chat

Llama2-70b-chat

13b

70b

¥

¥

-

-

Meta

HuggingFace

HuggingFace

Mistral-7b

7b

¥

v0.1

Mistral AI

HuggingFace

Vicuna-33b

33b

¥

v1.3

 

HuggingFace

Vicuna-13b

Vicuna-7b

13b

7b

¥

¥

v1.3 v1.3

LMSYS

HuggingFace

HuggingFace

ChatGLM2

6b

¥

v1.0

Tsinghua & Zhipu

HuggingFace

Baichuan-13b

13b

¥

-

Baichuan Inc.

HuggingFace

Wizardlm-13b

13b

¥

v1.2

Microsoft

HuggingFace

Koala-13b

13b

¥

-

UCB

HuggingFace

Oasst-12b

12b

¥

-

LAION

HuggingFace

shared ChatGPT conversations.

Vicuna-33b employs advanced memory optimization techniques to manage longer conversational content during training, achieving cost-effective efficiency.

ChatGLM2 [326]. ChatGLM2 is released by the KEG Lab [327] of Tsinghua University and Zhipu AI [328] in 2023, advancing from its predecessor ChatGLM.

With 6 billion parameters and the General Language Model (GLM) architecture, it supports various NLP tasks like natural language generation, text classification, and machine translation.

ChatGLM2-6B benefits from robust pre-training on 1.4T Chinese and English tokens and fine-tuning aligning with human preferences, which lead to substantial performance boosts on several benchmarks.

The model also adopts flash attention [329] and multi-query attention, extending the context length to 32K and improving inference efficiency, respectively.

These enhancements make ChatGLM2-6B a competitive model in the open-source community, with more extended context handling and efficient inference, marking a notable evolution in the ChatGLM series.

Koala-13b [330]. Koala-13b is developed by BAIR [331] for academic research with a parameter count of 13 billion.

It has undergone extensive human evaluations on various test sets, including real user queries, showcasing its effectiveness in assistant-like applications.

Llama2 [69]. The Llama2 series, developed by Meta [68], consists of models ranging from 7b to 70b parameters.

These models are notable for being trained on 2 trillion tokens. The series includes specialized variants like Llama Chat, fine-tuned with over 1 million human annotations.

Llama2 excels in external benchmarks, showcasing its proficiency in reasoning, coding, and knowledge tests. To bolster the safety aspect of Llama2, measures such as a toxicity filter, context distillation learning, and red teaming are incorporated.

WizardLM-13b [332]. WizardLM-13b is a powerful language model developed by Microsoft Research

[333]. Unlike traditional training methods, WizardLM-13b leverages an innovative process known as EvolInstruct [332], which utilizes LLMs to automatically generate various open-domain instructions of varying complexity levels.

This process involves evolving existing instructions to increase complexity and difficulty and creating new instructions to enhance diversity.

Oasst-12b [81]. Oasst(Open Assistant), developed by the LAION organization [334], represents the initial English SFT iteration of the Open-Assistant project.

Its training data is based on the basic data structure of conversation trees, and the model is fine-tuned on approximately 22,000 human demonstrations of assistant conversations.

Baichuan-13b [335]. Baichuan-13b is developed by Baichuan AI [181]. With a parameter count of 13 billion, Baichuan-13b is a large-scale language model known for its exceptional performance on Chinese benchmarks.

It distinguishes itself by being trained on a massive corpus of 1.4 trillion tokens and supports both Chinese and English, using ALiBi [336] position coding with a context window length of 4096.

ERNIE [79]. Ernie is an LLM developed by Baidu [337], which exemplifies a generative AI product that is augmented with a knowledge-enhanced framework.

This model’s robust pre-training on numerous Chinese and English tokens, combined with its fine-tuning in line with human preferences, highlights its pivotal contribution to the advancement of AI in China. Ernie’s versatile applications range from everyday household tasks to industrial and manufacturing innovations.

Mistral 7B [338]. Mistral 7B, a 7b-parameter LLM by Mistral AI [339], effectively handles text generation and diverse NLP tasks, whose benchmark covers areas like commonsense reasoning, world knowledge, math and reading comprehension, showcasing its broad applicability.

It utilizes a sliding window attention mechanism [340, 341], supports English and coding languages, and operates with an 8k context length.

PaLM 2 [36]. PaLM 2 is a capable language model developed by Google [342]. It shows strong multilingual language processing, code generation, and reasoning capabilities, reflecting advancements in computational scaling, dataset diversity, and architectural improvements.

Experimental Settings

We categorize the tasks in the benchmark into two main groups: Generation and Classification. Drawing from prior studies [71], we employ a temperature setting of 0 for classification tasks to ensure more precise outputs.

Conversely, for generation tasks, we set the temperature to 1, fostering a more diverse range of results and exploring potential worst-case scenarios.

For instance, recent research suggests that elevating the temperature can enhance the success rate of jailbreaking [242]. For other settings like decoding methods, we use the default setting of each LLM.

Datasets. In the benchmark, we introduce a collection of 30 datasets that have been meticulously selected to ensure a comprehensive evaluation of the diverse capabilities of LLMs. Each dataset provides a unique set of challenges.

They benchmark the LLMs across various dimensions of trustworthy tasks. A detailed description and the specifications of these datasets are provided in Table 4.

Tasks. In specific subsections, we have crafted a variety of tasks and datasets to augment the thoroughness of our findings.

Additionally, in light of the expansive and diverse outputs generated by LLMs compared to conventional LMs, we have incorporated a range of new tasks to evaluate this unique aspect. Table 5 lists all the tasks encompassed in the benchmark.

Prompts. In most tasks, particularly for classification, our prompts are designed for LLMs to incorporate specific keywords, aiding our evaluation process.

For example, we expect LLMs to generate relevant category labels (such as “yes" or “no"), which allows for efficient regular expression matching in automated assessments. Furthermore, except for privacy leakage evaluation (where we aim to increase the probability of LLMs leaking privacy information), we deliberately exclude few-shot learning from the prompts.

A key reason for this is the complexity involved in choosing examples [362, 363, 364], as varying exemplars may significantly influence the final performance of LLMs. Moreover, even though there are various prompt methods proposed in prior studies like Chain of Thoughts (CoT) [365, 366, 367, 368], Tree of Thoughts (ToT) [369], and so on [370], we do not involve these methods in our benchmark as the benchmark aims at a plain result of LLMs.

Evaluation. Our benchmark includes numerous generative tasks, posing the challenge of defining a standard ground-truth for assessment.

To avoid manual evaluation’s high cost and low efficiency, we’ve integrated a specialized classifier [73] and ChatGPT/GPT-4 into our evaluation framework.

For the tasks with ground-truth labels, our evaluation focuses on keyword matching and regular expressions.

When the approach fails to assess particular responses accurately, we utilize ChatGPT/GPT-4 to extract keywords in answers before the evaluation process.

Regarding generative tasks, they yield various answers, often including reasoning and explanations, making traditional keyword/regex matching ineffective. Recent studies have validated the effectiveness of LLMs

Table 4: Datasets and metrics in the benchmark. ¥ means the dataset is from prior work, and q means the dataset is first proposed in our benchmark.

Dataset

Description

Num.

Exist?

Section

 

SQUAD2.0 [343]

It combines questions in SQuAD1.1 [344] with over 50,000 unanswerable questions.

100

¥

Misinformation tion(§6.1)

Genera-

CODAH [345]

It contains 28,000 commonsense questions.

100

¥

Misinformation tion(§6.1)

Genera-

HOTPOTQA [346]

It contains 113k Wikipedia-based question-answer pairs for complex multihop reasoning.

100

¥

Misinformation tion(§6.1)

Genera-

ADVERSARIALQA [347]

It contains 30,000 adversarial reading comprehension question-answer pairs.

100

¥

Misinformation tion(§6.1)

Genera-

CLIMATE-FEVER [348]

It contains 7,675 climate change-related claims manually curated by human fact-checkers.

100

¥

Misinformation tion(§6.1)

Genera-

SCIFACT [349]

It contains 1,400 expert-written scientific claims pairs with evidence abstracts.

100

¥

Misinformation tion(§6.1)

Genera-

COVID-FACT [350]

It contains 4,086 real-world COVID claims.

100

¥

Misinformation tion(§6.1)

Genera-

HEALTHVER [351]

It contains 14,330 health-related claims against scientific articles.

100

¥

Misinformation tion(§6.1)

Genera-

TRUTHFULQA [219]

The multiple-choice questions to evaluate whether a language model is truthful in generating answers to questions.

352

¥

Hallucination(§6.2)

 

HALUEVAL [190]

It contains 35,000 generated and human-annotated hallucinated samples.

300

¥

Hallucination(§6.2)

 

LM-EXP-SYCOPHANCY [352]

A dataset consists of human questions with one sycophancy response example and one non-sycophancy response example.

179

¥

Sycophancy in Responses(§6.3)

OPINION PAIRS

It contains 120 pairs of opposite opinions.

240

q

Sycophancy in Responses(§6.3)

 

 

120

 

Preference Bias in Subjective

Choices(§8.3)

CROWS-PAIR [353]

It contains examples that cover stereotypes dealing with nine types of bias, like race, religion, and age.

1000

¥

Stereotypes(§8.1)

STEREOSET [354]

It contains the sentences that measure model preferences across gender, race, religion, and profession.

734

¥

Stereotypes(§8.1)

ADULT [355]

The dataset, containing attributes like sex, race, age, education, work hours, and work type, is utilized to predict salary levels for individuals.

810

¥

Disparagement(§8.2)

JAILBRAEK TRIGGER

The dataset contains the prompts based on 13 jailbreak attacks.

1300

q

Jailbreak(§7.1) ,Toxicity(§7.3)

MISUSE (ADDITIONAL)

This dataset contains prompts crafted to assess how LLMs react when confronted by attackers or malicious users seeking to exploit the model for harmful purposes.

261

q

Misuse(§7.4)

DO-NOT-ANSWER [73]

It is curated and filtered to consist only of prompts to which responsible LLMs do not answer.

344 + 95

¥

Misuse(§7.4), Stereotypes(§8.1)

ADVGLUE [266]

A multi-task dataset with different adversarial attacks.

912

¥

Robustness against Input with

Natural Noise(§9.1)

ADVINSTRUCTION

600 instructions generated by 11 perturbation methods.

600

q

Robustness against Input with

Natural Noise(§9.1)

TOOLE [140]

A dataset with the users’ queries which may trigger LLMs to use external tools.

241

¥

OOD (§9.2)

FLIPKART [356]

A product review dataset, collected starting from December 2022.

400

¥

OOD (§9.2)

DDXPLUS [357]

A 2022 medical diagnosis dataset comprising synthetic data representing about 1.3 million patient cases.

100

¥

OOD (§9.2)

ETHICS [358]

It contains numerous morally relevant scenarios descriptions and their moral correctness.

500

¥

Implicit Ethics(§11.1)

SOCIAL CHEMISTRY 101 [359]

It contains various social norms, each consisting of an action and its label.

500

¥

Implicit Ethics(§11.1)

MORALCHOICE [360]

It consists of different contexts with morally correct and wrong actions.

668

¥

Explicit Ethics(§11.2)

CONFAIDE [201]

It contains the description of how information is used.

196

¥

Privacy Awareness(§10.1)

PRIVACY AWARENESS

It includes different privacy information queries about various scenarios.

280

q

Privacy Awareness(§10.1)

ENRON EMAIL [84]

It contains approximately 500,000 emails generated by employees of the Enron Corporation.

400

¥

Privacy Leakage(§10.2)

XSTEST [361]

It’s a test suite for identifying exaggerated safety behaviors in LLMs.

200

¥

Exaggerated Safety(§7.2)

in evaluation [371, 372, 73, 373, 374], enabling their use as cost-effective alternatives to human evaluators.

Consequently, for complex generative tasks such as “Adversarial Factuality" (§6.4), we employ GPT-4, whereas, for more straightforward generative tasks, ChatGPT (GPT-3.5) is used to ensure cost-effectiveness.

Additionally, we employ a previously researched evaluator (i.e., a trained classifier) [73] to categorize responses based on whether LLMs refuse to answer (e.g., responses like “As an AI language model, I cannot ...").

This evaluator, a finely-tuned Longformer classifier (600M) [73], has shown an evaluation performance closely

https://huggingface.co/LibrAI/longformer-harmful-ro

evaluation by ChatGPT, GPT-4 or longformer, and# G# means the mixture evaluation. RtA stands for Refuse to Answer.

ASR means Table 5: Task Overview. means evaluation through the automatic scripts (e.g., keywords matching), means the automatic

Attack Success Rate. RS is the Robustness Score. More trustworthy LLMs are expected to have a higher value of the metrics with ↑ and a lower value with ↓.

     Task Name                                                             Metrics                                  Type                Eval     Subsection

Closed-book QA

Accuracy (↑)

Generation

 

Misinformation(Internal)

Fact-Checking

Macro F-1 (↑)

Classification

 

Misinformation(External)

Multiple Choice QA

Accuracy (↑)

Classification

#

Hallucination

Hallucination Classification

Accuracy (↑)

Classification

#

Hallucination

Persona Sycophancy

Embedding similarity (↑)

Generation

#

Sycophancy

Opinion Sycophancy

Percentage change (↓)

Generation

G#

Sycophancy

Factuality Correction

Percentage change (↑)

Generation

 

Adversarial Factuality

Jailbreak Attack Evaluation

RtA (↑)

Generation

 

Jailbreak

Toxicity Measurement

Toxicity Value (↓)

Generation

 

Toxicity

Misuse Evaluation

RtA (↑)

Generation

#

Misuse

Exaggerated Safety Evaluation

RtA (↓)

Generation

 

Exaggerated Safety

Agreement on Stereotypes

Accuracy (↑)

Generation

 

Stereotype

Recognition of Stereotypes

Agreement Percentage (↓)

Classification

G#

Stereotype

Stereotype Query Test

RtA (↑)

Generation

G#

Stereotype

Preference Selection

RtA (↑)

Generation

 

Preference

Salary Prediction

p-value (↑)

Generation

 

Disparagement

Adversarial Perturbation in Downstream Tasks

ASR (↓), RS (↑)

Generation

#

Natural Noise

Adversarial Perturbation in Open-Ended Tasks

Embedding similarity (↑)

Generation

G#

Natural Noise

OOD Detection

RtA (↑)

Generation

G#

OOD

OOD Generalization

Micro F1 (↑)

Classification

 

OOD

Agreement on Privacy Information

Pearson’s correlation (↑)

Classification

 

Privacy Awareness

Privacy Scenario Test

RtA (↑)

Generation

#

Privacy Awareness

Probing Privacy Information Usage

RtA (↑), Accuracy (↓)

Generation

 

Privacy Leakage

Moral Action Judgement

Accuracy (↑)

Classification

G#

Implicit Ethics

Moral Reaction Selection (Low-Ambiguity)

Accuracy (↑)

Classification

G#

Explicit Ethics

Moral Reaction Selection (High-Ambiguity)

RtA (↑)

Generation

G#

Explicit Ethics

Emotion Classification

Accuracy (↑)

Classification

 

Emotional Awareness

#

mirroring that of human evaluators and GPT-4.

It categorizes LLMs’ responses into either refusing or not refusing to answer.

 


0 comments

Leave a comment

#WebChat .container iframe{ width: 100%; height: 100vh; }