Assessments
Background
Large Language Models (LLMs)
A language model (LM) aims to predict the probability distribution over a sequence of tokens.
By scaling up both model size and data size, large language models (LLMs) have shown “emergent abilities” [87, 88, 89] in solving a series of complex tasks that regular-sized LMs cannot handle.
For instance, GPT-3 can handle few-shot tasks by learning in context, in contrast to GPT-2, which struggles in this regard.
The success of LLMs is primarily attributed to the Transformer architecture [80]. Specifically, almost all the existing LLMs employ a stack of transformer blocks, each consisting of a Multi-Head Attention layer followed by a feedforward layer interconnected by residual links.
Built upon this transformer-based architecture, there are three primary designs of LLMs: encoder-decoder architecture [90], causal-decoder architecture, and prefix-decoder architecture.
Among them, the most widely used architecture is the causal decoder, which employs an attention mask to ensure that each input token only attends to previous tokens and itself. In this survey, we mainly focus on the causal-decoder architecture. The training of LLMs is usually composed of three steps: pre-training, instruction finetuning, and alignment tuning. We will introduce each step in detail.
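To make the causal-masking mechanism described above concrete, the following minimal single-head attention sketch in PyTorch shows how the attention mask restricts each token to itself and earlier tokens; the tensor shapes and function name are illustrative assumptions, not code from any particular LLM.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (seq_len, head_dim). Each position may
    attend only to itself and to earlier positions.
    """
    seq_len, head_dim = q.shape
    scores = q @ k.T / head_dim ** 0.5                 # (seq_len, seq_len)

    # Causal mask: entries above the diagonal correspond to future tokens.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))

    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # (seq_len, head_dim)

# Toy usage: 4 tokens, 8-dimensional head; self-attention uses x as q, k, and v.
x = torch.randn(4, 8)
print(causal_self_attention(x, x, x).shape)            # torch.Size([4, 8])
```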
During pre-training, LLMs learn world knowledge and basic language abilities on large-scale corpora.
To improve model capacity, researchers have established scaling laws that characterize the compute-optimal ratio between model size and data size, including the KM scaling law [91] and the Chinchilla scaling law [92].
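As a concrete illustration (a sketch of the functional form reported in the Chinchilla study [92], not a restatement of its fitted coefficients), the Chinchilla scaling law models the pre-training loss as

$$
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}},
$$

where $N$ is the number of parameters, $D$ the number of training tokens, $E$ the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ fitted constants. Minimizing $L$ under a fixed compute budget $C \approx 6ND$ yields $N_{\mathrm{opt}} \propto C^{a}$ and $D_{\mathrm{opt}} \propto C^{b}$ with $a \approx b \approx 0.5$, i.e., model size and data size should be scaled in roughly equal proportion (on the order of 20 training tokens per parameter).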
When the scale reaches a certain level, LLMs exhibit emergent abilities to solve complex tasks, such as instruction following, in-context learning, and step-by-step reasoning.
These abilities enable LLMs to serve as general-purpose task solvers. To further elicit the instruction-following and in-context learning abilities of LLMs, instruction tuning constructs appropriate task instructions or particular in-context learning examples to enhance the ability of LLMs to generalize to tasks they have not encountered before.
During the alignment training phase, LLMs are trained to align with human values, e.g., being helpful, honest, and harmless, instead of producing harmful content.
For this purpose, two alignment training methods, supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF), were proposed in InstructGPT, which is the fundamental algorithm behind ChatGPT.
SFT guides the LLM to understand the prompts and generate meaningful responses, and can be defined as follows. Given an instruction prompt $x$, we want the LLM to generate a response aligned with the human-written response $y$. The SFT loss is defined as the cross-entropy between the human-written response and the LLM's prediction, i.e., $\mathcal{L}_{\mathrm{SFT}} = -\sum_{t} \log p(y_t \mid x, y_{<t})$, where $y_{<t}$ represents the sequence of tokens preceding the current token $y_t$.
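To make this objective concrete, the following minimal PyTorch sketch computes the SFT loss from next-token logits; the random logits standing in for an LLM, the `prompt_len` masking convention, and the function name are illustrative assumptions rather than part of the original text.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, tokens, prompt_len):
    """Cross-entropy of the human-written response under the model.

    logits: (seq_len, vocab_size) next-token logits for the concatenated
        sequence [prompt x ; response y].
    tokens: (seq_len,) token ids of the same sequence.
    prompt_len: number of prompt tokens; only response tokens enter the loss.
    """
    # Logits at positions 0..T-2 predict the tokens at positions 1..T-1.
    pred = logits[:-1]
    target = tokens[1:]

    # Target index i corresponds to token index i + 1, so response tokens
    # (token index >= prompt_len) have target index >= prompt_len - 1.
    response_mask = torch.arange(len(target)) >= (prompt_len - 1)

    nll = F.cross_entropy(pred, target, reduction="none")
    return (nll * response_mask).sum() / response_mask.sum()

# Toy usage with random logits standing in for an actual LLM.
vocab, seq_len, prompt_len = 100, 12, 5
logits = torch.randn(seq_len, vocab)
tokens = torch.randint(0, vocab, (seq_len,))
print(sft_loss(logits, tokens, prompt_len))
```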
However, the limitation of SFT is that it provides only a single human-written response for each prompt, which is insufficient for fine-grained comparison against sub-optimal responses and fails to capture the diversity of human preferences.
To address this issue, RLHF [43] is proposed to provide fine-grained human feedback with pair-wise comparison labeling.
Typical RLHF includes three main steps:
1) SFT on a high-quality instruction set;
2) collecting manually ranked comparison response pairs and training a reward model for quality assessment;
3) optimizing the SFT model under the Proximal Policy Optimization (PPO) [93] reinforcement learning framework with the reward model from the second step.
To prevent over-optimization in step 3), a KL-divergence regularization term between the current and SFT models is added to the loss function.
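As a sketch of the KL-regularized objective in step 3), the snippet below combines the reward model score with a Monte-Carlo estimate of the KL term computed from per-token log-probabilities; the coefficient `beta` and all function names are illustrative assumptions, not the exact formulation used by InstructGPT.

```python
import torch

def kl_regularized_reward(reward_model_score, policy_logprobs, ref_logprobs, beta=0.1):
    """KL-regularized reward used to optimize the policy with PPO.

    reward_model_score: scalar score of the sampled response from the
        learned reward model (step 2).
    policy_logprobs, ref_logprobs: (response_len,) per-token log-probs of
        the sampled response under the current policy and the frozen SFT
        (reference) model.
    beta: strength of the KL penalty keeping the policy close to the SFT model.
    """
    # Single-sample Monte-Carlo estimate of KL(policy || SFT) on this response.
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    return reward_model_score - beta * kl_estimate

# Toy usage with made-up numbers for a 6-token response.
score = torch.tensor(1.3)
policy_lp = torch.log(torch.rand(6))
ref_lp = torch.log(torch.rand(6))
print(kl_regularized_reward(score, policy_lp, ref_lp))
```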
However, the PPO algorithm is not stable during training.
Thus, Reward rAnked Fine-Tuning (RAFT) [94] is proposed to replace PPO training with direct learning on the high-ranked samples filtered by the reward model.
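The sketch below conveys this reward-ranked filtering idea under simplifying assumptions; the `generate`, `reward_model`, and `sft_update` callables are hypothetical placeholders rather than RAFT's actual implementation.

```python
def raft_round(prompts, generate, reward_model, sft_update, k=4, keep=1):
    """One round of reward-ranked fine-tuning (illustrative sketch).

    generate(prompt, n)     -> list of n sampled responses from the current model
    reward_model(prompt, y) -> scalar quality score for a response
    sft_update(pairs)       -> supervised fine-tuning on (prompt, response) pairs
    """
    selected = []
    for prompt in prompts:
        candidates = generate(prompt, k)
        # Rank sampled responses by reward and keep only the best ones.
        ranked = sorted(candidates, key=lambda y: reward_model(prompt, y), reverse=True)
        selected.extend((prompt, y) for y in ranked[:keep])
    # Train on the filtered, high-reward samples with the ordinary SFT loss.
    sft_update(selected)
    return selected
```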
Nevertheless, these online algorithms require interaction among the policy, behavior policy, reward model, and value model, which demands careful hyper-parameter tuning to achieve stability and generalizability.
To avoid these issues, offline algorithms are proposed, including ranking-based approaches such as Direct Preference Optimization (DPO) and Preference Ranking Optimization (PRO), and language-based approaches such as Conditional Behavior Cloning [95], Chain of Hindsight [96], and Stable Alignment [97].
These methods eliminate the risk of overfitting a reward model and improve training stability using preference ranking data.
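As one concrete example of a ranking-based offline method, the per-pair DPO loss can be sketched as follows, given summed log-probabilities of the preferred and dispreferred responses under the policy and the frozen reference (SFT) model; the value of `beta` is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (y_w preferred over y_l).

    Each argument is the summed log-probability of the corresponding response
    under the current policy or the frozen reference (SFT) model.
    """
    # Implicit reward margins of each response relative to the reference model.
    margin_w = policy_logp_w - ref_logp_w
    margin_l = policy_logp_l - ref_logp_l
    # Maximize the log-sigmoid of the scaled margin difference.
    return -F.logsigmoid(beta * (margin_w - margin_l))

# Toy usage with scalar sequence log-probabilities.
print(dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
               torch.tensor(-13.0), torch.tensor(-14.0)))
```

The policy is thereby pushed to assign a larger implicit reward margin to the preferred response than the reference model does, without ever training an explicit reward model.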
1.2 Evaluation on LLMs
Evaluation of LLMs is a fast-evolving field involving multi-dimensional evaluation across various tasks, datasets, and benchmarks [98].
It encompasses a wide range of domains, starting with traditional NLP tasks, where LLMs are assessed for natural language understanding, including tasks like sentiment analysis [99, 100, 101], text classification [102, 103], natural language inference [101, 104], etc. The evaluation of LLMs also extends to reasoning tasks [98], covering mathematical reasoning [101, 105], logical reasoning [106, 107], and other forms of reasoning; alongside natural language generation tasks like summarization [101, 108] and question answering [101, 109]; as well as multilingual tasks [110].
The evaluation also requires careful studies on robustness, especially in challenging situations such as out-of-distribution (OOD) and adversarial robustness [98, 111, 112], and learning rate tuning [113].
For trustworthiness, some work indicates that LLMs tend to absorb and express harmful biases and toxic content in their training data [114, 115].
This underscores the need for comprehensive evaluation methodologies and a heightened focus on various trustworthiness aspects of LLMs [71], which we discuss in Section 3.4.
Moreover, the application of LLMs expands into many other fields [116], including computational social science [117], legal tasks [118, 119, 120], and psychology [121].
Besides, evaluating LLMs in natural science and engineering provides insights into their capabilities in mathematics [122, 123], general science [29, 124], and engineering [125, 126] domains.
In the medical field, LLMs have been evaluated for their proficiency in addressing medical queries [127, 128], medical examinations [129, 130], and functioning as medical assistants [131, 132]. In addition, some benchmarks are designed to evaluate specific language abilities of LLMs like Chinese [133, 134, 135, 136].
Besides, agent applications [137] underline their capabilities for interaction and tool use [138, 139, 140, 141].
Beyond these areas, LLMs contribute to different domains, such as education [142], finance [143, 144, 145, 146], search and recommendation [147, 148], and personality testing [149].
Other specific applications, such as game design [150] and log parsing [151], illustrate the broad scope of the application and evaluation of LLMs.
In addition to conventional text generation evaluations, the evaluations of LLMs have expanded to include their code generation capabilities [152].
Recent studies have highlighted this emerging direction, revealing both the potential and the challenges in LLM-driven code synthesis [152, 153, 154, 155].
In text generation evaluation, diverse untrained automatic evaluation metrics are utilized, including metrics based on n-gram overlap, distance-based measures, diversity metrics, content overlap metrics, and those with grammatical features [156].
Standard traditional metrics, such as BLEU [157] and ROUGE [158], classified as n-gram overlap metrics, estimate the degree of overlap between the reference text and the text generated by the model.
However, these metrics face limitations, particularly in scenarios where multiple correct outputs exist, as often seen in tasks involving latent content planning or selection, which can lead to accurate solutions receiving low scores [159, 160].
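To illustrate what n-gram overlap metrics actually measure, here is a minimal clipped n-gram precision computation in the spirit of BLEU's modified precision (a simplified sketch without brevity penalty or smoothing, not the official implementation):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of a candidate against a single reference.

    Both inputs are lists of tokens. Counts of each candidate n-gram are
    clipped by their counts in the reference, as in BLEU's modified precision.
    """
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(ngram_precision(candidate, reference, n=2))  # 0.6: 3 of 5 bigrams overlap
```

Because such a score rewards only surface overlap with the reference, a correct paraphrase or an alternative but valid content plan can still receive a low value, which is precisely the limitation noted above.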
LLM evaluation datasets and benchmarks are vital for evaluating language models across tasks that reflect complex real-world language processing scenarios. Benchmarks like GLUE [161] and SuperGLUE [162] encompass various tasks, from text categorization and machine translation to dialogue generation.
These evaluations are crucial for understanding the capabilities of LLMs in general-purpose language tasks. Additionally, automatic and human evaluations serve as critical methods for LLM evaluation [98].
Developers and Their Approaches to Enhancing Trustworthiness in LLMs
Since trustworthiness has emerged as a critical concern, leading LLM developers have employed various strategies and methodologies to enhance the trustworthiness of their models.
This section explores the diverse approaches taken by industry giants like OpenAI, Meta, Anthropic, Microsoft, and Google, highlighting their unique contributions and the shared challenges they face in this vital endeavor.
OpenAI. As one of the most renowned companies in the field of LLMs, OpenAI [67] has taken various measures to ensure the trustworthiness of LLMs across training data, training methods, and downstream applications.
In terms of pre-training data, OpenAI implements management and filtering [163] to remove harmful content. During the alignment phase, OpenAI has introduced WebGPT [7] to assist human evaluation in identifying inaccurate information in LLM responses.
Additionally, a Red Teaming Network [164] has been established to ensure LLMs’ security. They have also defined usage policies [165] for users and reference moderation tools [76] for content review.
Meta. Meta [68], dedicated to responsible AI, bases its approach on five pillars: privacy, fairness, robustness, transparency, and accountability.
The introduction of Llama2 [69] sets new safety alignment benchmarks for
LLMs, encompassing extensive safety investigations in pretraining, fine-tuning, and red teaming. Llama2’s safety fine-tuning involves supervised techniques, RLHF, and safe context distillation.
This includes query/answer pair assessments and extensive red teaming efforts by a large team aiming to identify and mitigate unsafe model responses.
Recently, Meta proposed LLama Guard [166], demonstrating performance on par
with or surpassing existing content moderation tools.
Anthropic. Anthropic [167] has introduced the excellent Claude model [168], which has made significant contributions to the field of trustworthiness.
For instance, Anthropic has released a dataset of 38,961 red team attacks for others to analyze [169]. In addition, their researchers have proposed the Self-Correction method,
which enables language models to learn complex normative harm concepts, such as stereotypes, biases, and discrimination.
Furthermore, Anthropic has put forth General Principles for Constitutional AI [170] and found that relying solely on a list of written principles can replace human feedback.
Microsoft. Microsoft has developed, assessed, and deployed AI systems in a safe, trustworthy, and ethical way by proposing a Responsible AI Standard [171], which includes fairness, reliability & safety, privacy & security, inclusiveness, transparency, and accountability.
Moreover, it has proposed DecodingTrust [71], a comprehensive assessment of trustworthiness in GPT models, which considers diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. In addition, PromptBench [172] comprehensively evaluates the robustness of LLMs on prompts with both natural (e.g., typos and synonyms) and adversarial perturbations.
Google. Google has also proposed many measures to improve the trustworthiness of its LLMs. For instance, for the PaLM API, Google provides users with safety filters [173] to prevent generating harmful content. Regarding responsible AI practices, Google’s work focuses on promoting fairness [174], privacy [175], and safety [176].
For instance, their seminal work, "Ethical and social risks of harm from Language Models," delves into the potential adverse effects and underscores the necessity for responsible AI development [177].
Furthering their commitment to ethical AI, DeepMind has formulated a framework to evaluate AI systems in the face of novel threats [178, 179].
Gemini, described as Google’s most advanced and versatile model, has been enhanced with various technologies to ensure its trustworthiness.
Google has thoroughly researched potential risks [179] to ensure Gemini is trustworthy, applying advanced techniques from Google Research for adversarial testing [180].
This helps identify and resolve key safety issues during Gemini’s deployment.
Baichuan. Baichuan [181], a rising company in multilingual LLMs, is adopting a multi-stage development process to bolster the trustworthiness of its models. Baichuan2 enforces strict data filtering for safety in its Pre-training Stage, employs expert-driven red-teaming for robustness in the Alignment Stage, and integrates DPO and PPO for ethical response tuning in the Reinforcement Learning Optimization Stage [182].
IBM. Before the prevalence of foundation models and generative AI applications, IBM had developed several trustworthy AI products and open-source libraries, such as AIF360, AIX360, ART360, and AI FactSheets 360.
Recently, IBM announced Watsonx.ai [183] as an enterprise studio to facilitate the development and deployment of foundation models. Specifically, to assist with building trustworthy and responsible LLMs and generative AI applications, IBM also introduced Watsonx.governance framework [184] for automated performance assessment and risk mitigation in the lifecycle of foundation models.
Trustworthiness-related Benchmarks
In the domain of trustworthiness-related evaluation, a number of benchmarks have already been proposed. For example,
- DecodingTrust [185] aims to thoroughly assess several perspectives of trustworthiness in GPT models.
- Do-Not-Answer [73] introduces a dataset specifically designed to test the safeguard mechanisms of LLMs, containing only prompts that responsible models should refuse to answer.
- SafetyBench [186] is a comprehensive benchmark for evaluating the safety of LLMs comprising diverse multiple-choice questions that span seven distinct categories of safety concerns.
- HELM [70] is dedicated to enhancing the transparency of language models by comprehensively examining their capabilities and limitations across various scenarios and metrics.
Concurrently, the Red-Teaming benchmark [187] conducts security tests on LLMs to investigate their responses to potential threats. CVALUES [188] focuses on measuring the safety and responsibility of Chinese large language models, while PromptBench [172] examines the robustness of these models against adversarial prompts.
Moreover, the GLUE-x [189] is centered on the open-domain robustness of language models. HaluEval [190] assesses the performance of LLMs in generating misinformation, and Latent Jailbreak [191] tests the safety and output robustness of models when presented with text containing malicious instructions.

Table 1: Comparison between TRUSTLLM and other trustworthiness-related benchmarks across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics.
Finally, SC-Safety [192] engages Chinese LLMs with multi-turn open-ended questions to test their safety and trustworthiness.
However, most of these benchmarks cover only specific aspects of trustworthiness and are therefore not comprehensive.
We compare these studies with TRUSTLLM in Table 1.