TrustLLM: Trustworthiness


1. Truthfulness

Score: 85/100 Stars: ⭐⭐⭐⭐✩

The emphasis on truthfulness in LLMs is well-placed, considering the impact misinformation can have.

The use of diverse datasets and benchmarks for evaluating truthfulness is a strong approach, but the reliance on large-scale internet data for training LLMs does pose significant challenges in ensuring consistent accuracy.

The dual approach of internal knowledge evaluation and adaptability to evolving information is commendable.

However, the persistence of misinformation in training datasets is a notable concern.

2. Safety

Score: 80/100 Stars: ⭐⭐⭐⭐✩

Safety in LLMs is critical, and the comprehensive approach to assess safety against various forms of misuse, including jailbreak attacks, is praiseworthy.

The taxonomy of jailbreak attacks adds a structured approach to understanding and mitigating risks.

However, the dynamic nature of adversarial behaviors and the evolving techniques in jailbreaking pose ongoing challenges that might not be fully addressed by current methodologies.

3. Fairness

Score: 75/100 Stars: ⭐⭐⭐✩✩

The focus on fairness and the recognition of biases in LLMs are crucial.

The multi-dimensional approach to assess stereotypes, disparagement, and preference biases is comprehensive.

However, the inherent biases in training data and the subjective nature of fairness pose significant challenges to achieving true equity. More proactive strategies in training data curation could enhance the effectiveness of fairness measures.

4. Robustness

Score: 82/100 Stars: ⭐⭐⭐⭐✩

The definition of robustness in terms of stability and performance under various conditions is appropriate. The distinction between robustness and resilience is a good approach. However, the evolving nature of inputs and the increasing complexity of use-cases for LLMs mean that achieving consistent robustness is a moving target. The benchmarks used for robustness evaluation may need continual updates to remain relevant.

5. Privacy

Score: 90/100 Stars: ⭐⭐⭐⭐⭐

Privacy is rightly highlighted as a major concern, especially considering the extensive data LLMs are trained on.

The methodology to assess privacy awareness and potential data leakage is robust, addressing both inadvertent disclosure and intentional extraction of sensitive information. However, the challenge remains in balancing the need for comprehensive training data and safeguarding privacy, especially with data sourced from the internet.

6. Machine Ethics

Score: 78/100 Stars: ⭐⭐⭐⭐✩

Machine ethics is an emerging and complex field, and this content addresses it with a nuanced approach. Dividing machine ethics into implicit ethics, explicit ethics, and emotional awareness is a thoughtful strategy.

However, the practical implementation of these ethical standards in LLMs remains a significant challenge, especially considering the subjective nature of ethics and the diversity of cultural and moral standards.

7. Transparency

Score: 85/100 Stars: ⭐⭐⭐⭐✩

Transparency in LLMs is crucial for trust and accountability. The approach to contextualize various perspectives on transparency and examining the challenges in LLMs is well-conceived. However, the inherent complexity of LLMs and their 'black box' nature make transparency a difficult goal to achieve fully. Efforts to enhance transparency need to be ongoing and adapt to evolving technologies and user expectations.

8. Accountability

Score: 80/100 Stars: ⭐⭐⭐⭐✩

Accountability in LLMs is a critical aspect, especially as these systems become more integrated into decision-making processes. The reference to Nissenbaum's barriers to accountability provides a structured framework for discussion. However, the actual implementation of accountability in LLMs is complex, given the multiple stakeholders involved in their development and use. The challenge of aligning technical, ethical, and legal perspectives remains significant.

9. Regulations and Laws

Score: 75/100 Stars: ⭐⭐⭐✩✩

The recognition of the need for updated regulations and laws specific to LLMs and AI is essential. The discussion on the inadequacy of current frameworks like the EU AI Act for LLMs is insightful. However, the rapidly evolving nature of AI technology makes it challenging to create comprehensive and enduring regulations. Continuous adaptation and international collaboration are needed to ensure regulations remain effective.

10. Overall Benchmark Design in TRUSTLLM

Score: 88/100 Stars: ⭐⭐⭐⭐✩

The overall design of the TRUSTLLM benchmark for evaluating LLMs is well-structured and comprehensive, covering a wide range of aspects crucial for trustworthiness.

The incorporation of both existing and new datasets enhances its relevance.

However, as LLMs continue to evolve, this benchmark will need regular updates and revisions to remain effective and reflective of current challenges.

Original Text


To create guidelines for assessing the trustworthiness of LLMs, we conducted an extensive literature review.

First, we searched multiple academic databases, including ACM, IEEE Xplore, and arXiv, focusing on papers published in the past five years. We utilized a range of keywords such as “Large Language Models” or “LLM”, “Trustworthy”, and “Trustworthiness”.

Two researchers independently screened the publications to determine their relevance and methodological soundness.

This process helped us distill the literature that most accurately defines and contextualizes trustworthiness in LLMs. We then conducted a qualitative analysis of the selected papers. We coded the literature for emerging themes and concepts, categorizing them into different areas, such as “safety mechanisms,” “ethical considerations,” and “fairness implementations.”

Our coding was cross-verified by two team members to ensure analytical consistency. Our review work leads to a set of guidelines to evaluate the trustworthiness of LLMs.

In the following sections, we present the principal dimensions of trustworthy LLMs, outlining their respective definitions and descriptions.

The keywords of each principal dimension are cataloged within Table 2.

Table 2: The definitions of the eight identified dimensions.

Truthfulness: The accurate representation of information, facts, and results by an AI system.

Safety: The outputs from LLMs should only engage users in a safe and healthy conversation [72].

Fairness: The quality or state of being fair, especially fair or impartial treatment [207].

Robustness: The ability of a system to maintain its performance level under various circumstances [83].

Privacy: The norms and practices that help to safeguard human and data autonomy, identity, and dignity [83].

Machine ethics: Ensuring moral behaviors of man-made machines that use artificial intelligence, otherwise known as artificial intelligent agents [85, 86].

Transparency: The extent to which information about an AI system and its outputs is available to individuals interacting with such a system [83].

Accountability: An obligation to inform and justify one’s conduct to an authority [208, 209, 210, 211, 212].


1.1       Truthfulness

Intricately linked to factuality, truthfulness stands out as an essential challenge for Generative AI models, including LLMs. It has garnered extensive discussion and scholarly attention [58, 213, 214, 215]. To critically evaluate LLMs’ adherence to truthfulness, datasets and benchmarks, such as MMLU [216], Natural Questions [217], TriviaQA [218], and TruthfulQA [219], have been employed in prior works [220]. Some tools also assess specific aspects of general truthfulness: HaluEval [190] assesses hallucinations; SelfAware [221] explores awareness of knowledge limitations; FreshQA [222] and Pinocchio [223] inspect the adaptability to rapidly evolving information.

While accuracy remains a predominant metric for evaluating truthfulness [216, 190, 221, 222], the need for human evaluation is also recognized, particularly in benchmarks like TruthfulQA [219] and FreshQA [222]. However, the challenge of ensuring truthfulness is compounded by the inherent imperfections in training data [224]. LLMs, being trained on vast troves of text on the Internet, are susceptible to absorbing and propagating misinformation, outdated facts, and even intentionally misleading content embedded within their training datasets [225, 226], making the pursuit of truthfulness in LLMs an ongoing and intricate challenge.

In this work, we define the truthfulness of LLMs as the accurate representation of information, facts, and results. Our assessment of the truthfulness of LLMs focuses on 1) evaluating their inclination to generate misinformation under two scenarios: relying solely on internal knowledge and retrieving external knowledge; 2) testing LLMs’ propensity to hallucinate across four tasks: multiple-choice question-answering, open-ended question-answering, knowledge-grounded dialogue, and summarization; 3) assessing the extent of sycophancy in LLMs, encompassing two types: persona sycophancy and preference sycophancy; and 4) testing the capabilities of LLMs to correct adversarial facts when, e.g., a user’s input contains incorrect information. More details are presented in section 6.
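As a rough illustration of the multiple-choice portion of such an evaluation, the sketch below computes accuracy over toy items in the style of benchmarks like TruthfulQA. The item format and the `model_answer` stub are assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical sketch: scoring truthfulness on multiple-choice items.
# A real evaluation would replace `stub` with an actual LLM call.

def truthfulness_accuracy(items, model_answer):
    """Fraction of items where the model picks the factually correct choice."""
    correct = sum(
        1 for item in items
        if model_answer(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)

# Toy items and a stub "model" that always picks the first choice.
items = [
    {"question": "Capital of France?", "choices": ["Paris", "Lyon"], "answer": "Paris"},
    {"question": "2 + 2?", "choices": ["4", "5"], "answer": "4"},
]
stub = lambda question, choices: choices[0]
print(truthfulness_accuracy(items, stub))  # 1.0 on this toy set
```

The same accuracy metric generalizes to the open-ended and dialogue tasks once a judgment function (exact match, model-based grading, or human evaluation) replaces the equality check.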

1.2       Safety

With the pervasive integration of LLMs into various domains, safety and security concerns have emerged, necessitating comprehensive research and mitigation strategies [227, 228, 229, 230, 187, 231, 232, 233, 192, 234, 235, 196, 236, 237, 238, 239, 240, 69, 241]. Although LLMs should be designed to be safe and harmless, their vulnerability to adversarial behaviors, such as jailbreaking, has been extensively documented [61]. Commonly used jailbreaking methods range from generation exploitation attacks [242] and straightforward queries [243] to sophisticated techniques involving genetic algorithms [244].

The repercussions of jailbreaking extend to the generation of toxic content and the misuse of LLMs, with the potential to significantly impact user interactions and downstream applications [245]. Furthermore, the role assigned to LLMs, dictated by their system parameters, can profoundly influence their propensity to generate toxic content, underscoring the need for vigilant role assignment and parameter tuning [246]. A prevalent form of misuse is misinformation, which exemplifies the potential harms associated with LLMs, and has been shown to result in tangible negative outcomes [226, 225, 247].

Prior work has attempted to analyze the safety issues surrounding LLMs, tracing the origins of these issues and evaluating their impacts. Tools and datasets, such as Toxigen [248] and Realtoxicityprompts [249] have been developed to facilitate the detection of toxic content and assess the harm posed by LLMs. Integrating these tools into LLMs’ development and deployment pipelines is crucial for ensuring that these powerful models are used safely and responsibly.

In TRUSTLLM, we define Safety as the ability of LLMs to avoid unsafe, illegal outputs and only engage users in a healthy conversation [72]. We first assess LLMs’ safety against jailbreak attacks, by introducing a comprehensive taxonomy of jailbreak attacks comprising five major classes and 13 subclasses. Secondly, we evaluate the issue of over-alignment (i.e., exaggerated safety). Furthermore, we measure the toxicity levels in the outputs of LLMs that have been compromised by jailbreak attacks. Finally, we assess the LLMs’ resilience against various misuse scenarios using the Do-Not-Answer dataset [73], the Do-Anything-Now dataset [250], and an additional dataset specifically curated for this study. The details can be found in section 7.
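One simple quantity behind such misuse evaluations is a refusal rate over harmful prompts. The sketch below uses a keyword-based refusal detector, which is a simplifying assumption; real evaluations typically rely on a trained classifier or human judgment rather than string matching.

```python
# Illustrative refusal-rate metric for misuse prompts, in the spirit of
# Do-Not-Answer-style datasets. The marker list is a toy heuristic.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude check: does the response contain a refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses declining the request (higher is safer here)."""
    return sum(is_refusal(r) for r in responses) / len(responses)

responses = [
    "I can't help with that request.",
    "Sure, here is how you do it...",
    "As an AI, I won't provide instructions for that.",
]
print(refusal_rate(responses))  # 2 of 3 responses refuse
```

Note that the over-alignment evaluation mentioned above inverts this logic: on benign prompts, a high refusal rate counts against the model.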

1.3       Fairness

Ensuring fairness in LLMs is crucial, as it encapsulates the ethical principle that necessitates the equitable design, training, and deployment of LLMs and related AI systems, preventing biased or discriminatory outcomes [251]. The significance of this issue is underscored by the increasing number of countries implementing legal frameworks that mandate adherence to fairness and anti-discrimination principles in AI models [72, 252].

There is a growing body of research dedicated to understanding the stages of model development and deployment where fairness could be jeopardized, including training data preparation, model building, evaluation, and deployment phases [253, 254, 255]. Fairness compromised due to the prevalence of bias in training datasets is often considered a top concern and has been the subject of extensive recent scrutiny [256, 257, 258]. Various strategies have been proposed to mitigate fairness issues in LLMs, ranging from holistic solutions to reducing specific biases, such as biases in internal components of LLMs and biases from user interactions [256, 259, 260]. Other work has unearthed pervasive biases and stereotypes in LLMs, particularly against individuals from certain demographic groups, such as different genders [261], LGBTQ+ communities [262], and across various political spectrums [263]. The fairness of specific LLMs like GPT-3 and GPT-4 has also been extensively examined [264, 193].

We define fairness as the ethical principle of ensuring that LLMs are designed, trained, and deployed in ways that do not lead to biased or discriminatory outcomes and that they treat all users and groups equitably. In TRUSTLLM, we assess the fairness of LLMs in three main aspects: stereotypes, disparagement, and preference biases. As detailed in Section 8, our initial focus is on identifying potential stereotypes embedded within LLMs. This is achieved through three tasks: analyzing agreement on stereotypes, recognizing stereotypical content, and conducting stereotype query tests. Next, we investigate the issue of disparagement by examining how LLMs might attribute different salaries to individuals based on various characteristics, thus revealing potential biases. Finally, we explore LLMs’ tendencies for preference bias by observing their decision-making in scenarios presenting contrasting opinion pairs.
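The disparagement probe described above can be reduced to a simple statistic: compare the salaries a model attributes to otherwise-identical profiles that differ in a single attribute. The profiles and stubbed predictions below are hypothetical, chosen only to make the computation concrete.

```python
# Minimal sketch of a disparagement probe: a nonzero gap between group
# means in model-attributed salaries signals potential bias.

from statistics import mean

def salary_gap(predictions, group_key):
    """Absolute difference in mean attributed salary between two groups."""
    groups = {}
    for p in predictions:
        groups.setdefault(p[group_key], []).append(p["salary"])
    group_means = [mean(values) for values in groups.values()]
    return abs(group_means[0] - group_means[1])

# Toy model outputs for identical profiles differing only in gender.
preds = [
    {"gender": "female", "salary": 52000},
    {"gender": "male",   "salary": 58000},
    {"gender": "female", "salary": 54000},
    {"gender": "male",   "salary": 60000},
]
print(salary_gap(preds, "gender"))  # 6000: group means of 53000 vs 59000
```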

1.4       Robustness

Robustness refers to the ability of AI systems to perform well under varying conditions and to properly handle exceptions, anomalies, or unexpected inputs. Recent benchmarks and studies [265, 266, 172, 267, 243, 268, 269] on LLMs have collectively underscored a critical consensus: robustness is not an inherent quality of current LLMs. For instance, GPT-3.5 is not robust to seemingly simple inputs, such as emojis [270].

In the context of TRUSTLLM, we assess robustness in terms of stability and performance when LLMs are faced with various input conditions. Note that we distinguish robustness from the concept of resilience against malicious attacks, which is covered under the safety dimension (Section 7). Here, we specifically explore robustness in the context of ordinary user interactions. This involves examining how LLMs cope with natural noise in inputs (as detailed in Section 9.1) and how they handle out-of-distribution (OOD) challenges (discussed in Section 9.2). These aspects provide a comprehensive view of an LLM’s stability and reliability under typical usage scenarios.
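A natural-noise robustness check can be sketched as follows: perturb an input with simple typo-style edits and measure how often a model's output changes. The character-swap perturbation and the toy stand-in model are assumptions for illustration; TRUSTLLM's actual perturbations are richer.

```python
# Sketch of a noise-robustness check: flip rate of a model's output
# under typo-style perturbations of the prompt.

import random

def add_typo(text, rng):
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def flip_rate(model, prompts, n_perturbations=5, seed=0):
    """Fraction of perturbed prompts whose output differs from the clean one."""
    rng = random.Random(seed)
    flips = total = 0
    for prompt in prompts:
        clean_output = model(prompt)
        for _ in range(n_perturbations):
            flips += model(add_typo(prompt, rng)) != clean_output
            total += 1
    return flips / total

# Toy "model" that classifies by string-length parity; since a character
# swap preserves length, this toy model is trivially robust (flip rate 0).
toy_model = lambda s: len(s) % 2
print(flip_rate(toy_model, ["what is the capital of france"]))
```

A real LLM, by contrast, often yields a nonzero flip rate under such perturbations, which is exactly what the natural-noise evaluation quantifies.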

1.5       Privacy

The privacy challenges associated with LLMs have garnered significant attention due to their ability to memorize and subsequently (unintentionally) leak private information, a concern that also exists for traditional machine learning models [271].

This issue is exacerbated by the heavy reliance of LLM training on Internet-sourced data, which inevitably includes personal information. Once such information is embedded within LLMs, it becomes susceptible to extraction through malicious prompts, posing a substantial risk [272].

Recent studies have delved into various aspects of privacy risks in LLMs. These include efforts to reveal personal data from user-generated text, employing predefined templates to probe and unveil sensitive information, and even attempting to jailbreak LLMs to access confidential information [273, 274, 275, 71, 276]. To address these challenges, a range of frameworks and tools have been proposed and developed [277, 278, 279, 280, 281], alongside methods of differential privacy, to mitigate the risk of privacy breaches and enhance the privacy of LLMs [282, 283]. Using cryptographic techniques such as secure computation [284], recent works have also explored ways to provide privacy by running LLM-related computation within secure computation protocols [285, 286].

Our Privacy guideline refers to the norms and practices that help to safeguard human and data autonomy, identity, and dignity. Specifically, we focus on evaluating LLMs’ privacy awareness and potential leakage. We first assess how well LLMs recognize and handle privacy-sensitive scenarios, including their tendency to inadvertently disclose learned information (section 10.1). Then, we investigate the risk of privacy leakage from their training datasets, examining if sensitive data might be unintentionally exposed when LLMs are prompted in certain ways (section 10.2). Overall, this analysis aims to understand LLMs’ ability to safeguard privacy and the inherent risks of private data exposure in their outputs.
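The template-based probing mentioned above can be sketched as a leakage-rate computation: fill prompt templates with known (synthetic) identities and check whether the completion reproduces the associated private value. The templates, records, and stub model below are invented for illustration.

```python
# Hypothetical template-based privacy probe. Real evaluations prompt an
# actual LLM and use known training-data pairs rather than a stub.

TEMPLATES = [
    "The email address of {name} is",
    "{name}'s phone number is",
]

def leakage_rate(model, records):
    """Fraction of (template, record) probes where the secret appears verbatim."""
    hits = total = 0
    for record in records:
        for template in TEMPLATES:
            completion = model(template.format(name=record["name"]))
            hits += record["secret"] in completion
            total += 1
    return hits / total

records = [{"name": "Alice Example", "secret": "alice@example.com"}]
# Toy "model" that leaks the email only when prompted about email addresses.
leaky_model = lambda prompt: "alice@example.com" if "email" in prompt else "unknown"
print(leakage_rate(leaky_model, records))  # leaks on 1 of 2 templates
```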

1.6       Machine Ethics

Machine ethics is ethics for machines, where machines, instead of humans, are the subjects. The most famous machine ethics principle is the “three laws of robotics” proposed and investigated by Isaac Asimov [287]. Earlier research in this field focused on discussing the emerging field of machine ethics and the challenges faced in representing ethical principles in machines [85, 86]. These foundational investigations have also explored the motivations behind the need for machine ethics, highlighting the pursuit of ethical decision-making abilities in computers and robots [288], and examined the nature and significance of machine ethics, discussing the challenges in defining what constitutes machine ethics and proposing potential implementation strategies [289].

Subsequent research has expanded the discourse, providing nuanced analyses of contemporary ethical dilemmas and the particular challenges that arise in the context of LLMs. While specific studies have concentrated on individual models, such as Delphi [290], GPT-3 [291], and GPT-4 [292], others have interrogated the responses of LLMs across specific domains. Two sectors frequently subject to scrutiny are the academic realm [293, 294, 295] and healthcare research [296, 297, 298].

Defining the term machine ethics for LLMs is rendered nearly infeasible by our current insufficient grasp of a comprehensive ethical theory [289]. Instead, we divide it into three segments: implicit ethics, explicit ethics, and emotional awareness. Implicit ethics refers to the internal values of LLMs, such as the judgment of moral situations. In section 11.1, we assess LLMs’ alignment with human ethical standards by evaluating their moral action judgments. In contrast, explicit ethics focuses on how LLMs should react in different moral environments. In section 11.2, we evaluate how LLMs should behave in various moral contexts. The assessment of LLMs’ ability to take morally appropriate actions in ethical scenarios is a crucial aspect, because LLMs increasingly serve as intelligent agents, engaging in action planning and decision-making.

Lastly, emotional awareness reflects LLMs’ capacity to recognize and empathize with human emotions, a critical component of ethical interaction. In section 11.3, we evaluate this through a series of complex scenarios, drawing from insights in psychology and sociology.
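The implicit-ethics assessment reduces to comparing a model's moral judgments on short scenarios against human-annotated labels, in the spirit of ETHICS-style datasets. The scenarios, label vocabulary, and stub judge below are illustrative assumptions.

```python
# Sketch of an implicit-ethics alignment score: agreement between a
# model's moral judgments and human labels on short scenarios.

def ethical_alignment(judge, scenarios):
    """Fraction of scenarios where the model's judgment matches the human label."""
    matches = sum(judge(s["text"]) == s["label"] for s in scenarios)
    return matches / len(scenarios)

scenarios = [
    {"text": "I returned the wallet I found to its owner.", "label": "acceptable"},
    {"text": "I read my coworker's private messages.", "label": "unacceptable"},
]
# Toy keyword-based "judge" standing in for an LLM's moral judgment.
stub_judge = lambda text: "unacceptable" if "private" in text else "acceptable"
print(ethical_alignment(stub_judge, scenarios))  # 1.0 on this toy pair
```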


1.7       Transparency

Transparency was not a problem when linear classifiers and decision trees dominated AI systems. On the contrary, such models were considered interpretable, as any observer can examine the inferred tree from the root to the leaves and understand how input variables influence the output [299].

However, with the development of high-dimensional machine learning models (e.g., deep neural networks) and the pursuit of accuracy, transparency is often sacrificed due to the opaque, “black-box” nature of complex machine learning systems [300].

Systems with opaque decision-making processes are challenging to trust, particularly in critical areas such as finance, autonomous driving, and aerospace engineering, where decisions have significant ethical and safety implications.

To address these concerns, various interpretation methods have been developed in recent years [301], aiming to explain how deep learning models form their predictions.

These methods are crucial for ensuring transparency and fostering trust in the predictions of advanced models in critical sectors.

As for LLMs, the lack of transparency is still noted as a core challenge [302] and a potential pitfall [303].

Reasons for this absence are often associated with characteristics of LLMs, such as their complexity and massive architecture [304].

Transparency is also hard to evaluate as not all situations require the same level of transparency [304].

The evaluation should also involve human factors, such as why people seek information [305, 306]. Thus, transparency is often not evaluated directly in prior works on LLMs.

In this work, transparency of LLMs refers to how much information about LLMs and their outputs is available to individuals interacting with them. In section 12, we first contextualize various perspectives on transparency.

Then, we delve into specific aspects of LLM transparency, examining the unique challenges it presents and reviewing the existing research aimed at addressing these issues.


1.8       Accountability

In 1996, Nissenbaum [307] described four barriers to accountability that computerization presented.

Developing machine learning systems requires revisiting those concepts and brings new challenges [308]. For LLMs and the AI systems they power, the lack of transparency often leads to a lack of accountability [299].

Moreover, data openness deserves major scholarly and societal credit, as data work is often dismissed as low-level grunt work [309], and data citation is a crucial but missing component in LLMs [310].

Current works on the accountability of LLMs often focus on the healthcare [311, 312] and academic [313] domains.

However, achieving overall accountability is still far from practical.

For a person or an organization, accountability is a virtue [314].

We believe this is also applicable to LLMs. LLMs should autonomously provide explanations and justifications for their behavior. In section 13, we follow the framework of the four barriers to the accountability of computer systems as identified by Helen Nissenbaum [307], and discuss these barriers in the context of LLMs.

The “problem of many hands” makes it difficult to pinpoint responsibility within the collaborative development of LLMs, while the inherent “bugs” in these systems further complicate accountability. The tendency to use the computer as a “scapegoat” and the issue of “ownership without liability” where companies disclaim responsibility for errors, further blur the lines of accountability. Furthermore, as LLMs become more sophisticated, differentiating their output from human text grows more challenging.

Concurrently, the extensive use of training data in LLMs raises significant copyright concerns, underscoring the urgent need for a clear legal framework to navigate the intricate relationship between technology, ethics, and law in the AI domain.

1.9       Regulations and Laws

LLMs and other Large Generative AI Models (LGAIMs) dramatically change how we interact with, depict, and create information and technologies. However, current AI regulation has primarily focused on conventional AI models [315, 316]. The EU Artificial Intelligence Act defines four risk categories for general-purpose AI: unacceptable, high, limited, and minimal. However, it is inadequate to regulate LLMs [317]. Concerns have been raised regarding their compliance with existing data privacy legislation, such as the General Data Protection Regulation (GDPR) [318], as LLMs might unintentionally disclose private information or reconstruct protected data from their training datasets.

As a result, Italy blocked ChatGPT temporarily in April 2023 due to privacy concerns and the lack of proper regulation [319]. The EU also drafted the Digital Services Act to curb the spread of misinformation and harmful material, though LLMs were not the center of public interest then.

The Blueprint for an AI Bill of Rights was released in 2022 as a non-binding white paper in the US. The AI Risk Management Framework released by the National Institute of Standards and Technology provides guidelines to better manage the potential risks of LLMs and other AI systems. However, its use is still voluntary.

The most recent executive order from the White House on the development and use of AI has the force of law, representing the first major binding government action on AI in the United States [320]. The Food and Drug Administration (FDA) has started regulating Software as a Medical Device (SaMD) but does not have specific categories exclusively for AI-based technologies; instead, it evaluates them within the existing regulatory framework for medical devices [321].



Figure 2: The design of the benchmark in TRUSTLLM.

Building upon the evaluation principles in prior research [322, 71], we design the benchmark to evaluate the trustworthiness of LLMs on six aspects: truthfulness, safety, fairness, robustness, privacy, and machine ethics.

We incorporate both existing datasets and new datasets first proposed in this work (as shown in Table 4).

The benchmark involves categorizing tasks into classification and generation, as detailed in Table 5.

Through diverse metrics and evaluation methods, we assess the trustworthiness of a range of LLMs, encompassing both proprietary and open-weight variants.
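As a minimal sketch of how per-dimension results might be rolled up into a single trustworthiness report, the snippet below averages scores over the six aspects the benchmark covers. The unweighted mean is an assumption for illustration, not TRUSTLLM's actual aggregation scheme.

```python
# Illustrative aggregation of per-dimension scores (each in [0, 1]) into
# an overall trustworthiness number. Scores here are made-up examples.

DIMENSIONS = [
    "truthfulness", "safety", "fairness",
    "robustness", "privacy", "machine_ethics",
]

def overall_score(scores: dict) -> float:
    """Unweighted mean over the six dimensions; KeyError if any is missing."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

scores = {
    "truthfulness": 0.85, "safety": 0.80, "fairness": 0.75,
    "robustness": 0.82, "privacy": 0.90, "machine_ethics": 0.78,
}
print(round(overall_score(scores), 3))
```

In practice, a benchmark report would keep the per-dimension breakdown alongside any aggregate, since a single number hides trade-offs (e.g., high privacy but low fairness).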




