TrustLLM: Trustworthiness in Large Language Models - Introduction


Abstract

Introduction

Background

TrustLLM Preliminaries

Assessments

Transparency

Accountability

Open Challenges

Future Work
Conclusions

Types of Ethical Agents 

Introduction

Score: 92/100 Stars: ⭐⭐⭐⭐✩

The introduction provides a comprehensive and well-articulated overview of the diverse applications and significance of large language models (LLMs) across various domains. It effectively outlines the advanced capabilities of LLMs, their underlying technologies, and the ethical and trustworthiness concerns associated with their use. The scope of applications, from software engineering to arts, and the detailed mention of specific models like Code Llama and BloombergGPT, showcase the depth of research and understanding. However, it could benefit from a more balanced discussion on the limitations and potential risks associated with LLMs, especially in the context of ethical considerations and data biases. The reference to real-world examples and studies strengthens the credibility of the content. Overall, it presents a well-rounded perspective on the state of LLMs in contemporary technology.

Trustworthiness and LLMs

Score: 85/100 Stars: ⭐⭐⭐⭐✩

This segment does an admirable job of highlighting the critical issues of trustworthiness in LLMs. The discussion on the complexity of outputs, data biases, and high user expectations provides a nuanced view of the challenges in this domain. However, it could delve deeper into the implications of these concerns, particularly regarding data privacy and ethical dilemmas. The mention of efforts by OpenAI and Meta to enhance trustworthiness is commendable, but it would be beneficial to see more diverse examples from the industry. The segment underscores the need for a comprehensive framework to evaluate trustworthiness, yet it might overemphasize technical aspects at the expense of discussing societal impacts in depth.

Observations and Insights

Score: 90/100 Stars: ⭐⭐⭐⭐✩

The observations and insights section provides valuable perspectives on the relationship between trustworthiness and utility, the over-alignment of LLMs, and the performance gap between proprietary and open-weight LLMs. The analysis of individual dimensions like truthfulness, safety, fairness, and others is insightful, revealing both strengths and weaknesses in current LLMs. However, there's room for a more critical examination of the methodologies used in these evaluations, especially regarding the representativeness of tasks and datasets. The insights are grounded in empirical findings, which lends credibility, but a discussion on the broader societal and ethical implications of these findings would enhance the depth of analysis.

Conclusion

Based on the provided content, the comprehensive approach to evaluating LLMs in various dimensions—truthfulness, safety, fairness, etc.—is commendable.

The methodology of using a wide range of tasks and datasets for benchmarking provides a robust foundation for the evaluation.

However, the analysis could benefit from incorporating a more diverse range of perspectives, particularly focusing on the ethical and societal implications of LLM deployment.

The reliance on technical evaluations is strong, but a balanced view that includes the potential impacts on society, culture, and individual rights would provide a more holistic understanding of the trustworthiness of LLMs.

 ------------------------------------------------------------------------------------------------------------

Original text

Introduction

The advent of large language models (LLMs) marks a significant milestone in natural language processing (NLP) and generative AI, as evidenced by numerous foundational studies [1, 2].

The exceptional capabilities of these models in NLP have garnered widespread attention, leading to diverse applications that impact every aspect of our lives.

LLMs are employed in a variety of language-related tasks, including automated article writing [3], the creation of blog and social media posts, and translation [4]. Additionally, they have improved search functionalities, as seen in platforms like Bing Chat [5, 6, 7], and other applications [8].

The efficacy of LLMs is distinctly evident in various other areas of human endeavor.

For example, models such as Code Llama [9] offer considerable assistance to software engineers [10]. In the financial domain, LLMs like BloombergGPT [11] are employed for tasks including sentiment analysis, named entity recognition, news classification, and question answering.

Furthermore, LLMs are increasingly being applied in scientific research [12, 13, 14, 15], spanning areas like medical applications [16, 17, 18, 19, 20, 21, 22, 23, 24, 25], political science [26], law [27, 28], chemistry [29, 30], oceanography [31, 32], education [33], and the arts [34], highlighting their extensive and varied impact.

The outstanding capabilities of LLMs can be attributed to multiple factors, such as the usage of large-scale raw texts from the Web as training data (e.g., PaLM [35, 36] was trained on a large dataset containing more than 700 billion tokens [37]), the design of transformer architecture with a large number of parameters (e.g., GPT-4 is estimated to have in the range of 1 trillion parameters [38]), and advanced training schemes that accelerate the training process, e.g., low-rank adaptation (LoRA) [39], quantized LoRA [40], and pathway systems [41].
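To make the parameter-efficient training schemes mentioned above more concrete, the following is a minimal PyTorch sketch of the core LoRA idea: a frozen pretrained linear layer augmented with a trainable low-rank update. It is an illustrative simplification of the method in [39], not the implementation used by any particular model discussed here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank correction (W + (alpha/r) * B A)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Low-rank factors: A is small-random, B starts at zero so training begins at W.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the low-rank update; only A and B receive gradients.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Wrapping an existing projection, e.g. `LoRALinear(nn.Linear(4096, 4096))`, trains only the two small factor matrices, which is why such schemes cut fine-tuning cost so sharply.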

Moreover, their outstanding instruction following capabilities can be primarily attributed to the implementation of alignment with human preference [42].

Prevailing alignment methods use reinforcement learning from human feedback (RLHF) [43] along with various alternative approaches [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55].
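As one concrete example of the kind of alternative alignment objective referenced above, the sketch below implements the Direct Preference Optimization (DPO) loss, which replaces the explicit RLHF reward model with log-probability ratios against a frozen reference model. The function signature (summed log-probabilities of chosen and rejected responses under the policy and the reference model) is an assumption chosen for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are log-probability ratios between the policy and the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```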

These alignment strategies shape the behavior of LLMs to more closely align with human preferences, thereby enhancing their utility and ensuring adherence to ethical considerations.

However, the rise of LLMs also introduces concerns about their trustworthiness. Unlike traditional language models, LLMs possess unique characteristics that can potentially lead to trustworthiness issues.

1) Complexity and diversity of outputs from LLMs, coupled with their emergent generative capabilities. LLMs demonstrate an unparalleled ability to handle a broad spectrum of complex and diverse topics.

Yet, this very complexity can result in unpredictability and, consequently, the possibility of generating inaccurate or misleading outputs [56, 57, 58].

Simultaneously, their advanced generative capabilities open avenues for misuse by malicious actors, including the propagation of false information [59] and facilitating cyberattacks [60].

For instance, attackers might use LLMs to craft deceptive and misleading text that lures users to click on malicious links or download malware.

Furthermore, LLMs can be exploited for automated cyberattacks, such as generating numerous fake accounts and comments to disrupt the regular operation of websites.

A significant threat also comes from techniques designed to bypass the safety mechanisms of LLMs, known as jailbreaking attacks [61], which allow attackers to misuse LLMs illicitly.

2) Data biases and private information in large training datasets.

One primary challenge to trustworthiness arises from potential biases in training datasets, which have significant implications for the fairness of content generated by LLMs.

For example, a male-centric bias in the data may yield outputs that mainly reflect male perspectives, thereby overshadowing female contributions and viewpoints [62].

In a similar vein, a bias favoring a particular cultural background can result in responses biased toward that culture, thus disregarding the diversity present in other cultural contexts [63].

Another critical issue concerns the inclusion of sensitive personal information within training datasets. In the absence of stringent safeguards, this data becomes susceptible to misuse, potentially leading to privacy breaches [64].

This issue is especially acute in the healthcare sector, where maintaining the confidentiality of patient data is of utmost importance [65].

3) High user expectations. Users may have high expectations regarding the performance of LLMs, expecting accurate and insightful responses that emphasize the model’s alignment with human values.

Many researchers are expressing concerns about whether LLMs align with human values. A misalignment could significantly impact their broad applications across various domains.

For instance, an LLM may consider a behavior appropriate in some situations that humans view as inappropriate, leading to conflicts and contradictions in its applications, as highlighted in specific cases [66].

The developers of LLMs have undertaken significant efforts to address the concerns mentioned above. OpenAI [67] has taken measures to ensure LLMs’ trustworthiness in the training data phase, training methods, and downstream applications.

WebGPT [7] is introduced to assist human evaluation in identifying inaccurate information in LLM responses. Meta [68], dedicated to responsible AI, bases its approach on five pillars: privacy, fairness, robustness, transparency, and accountability.

The introduction of Llama2 [69] sets new safety alignment benchmarks for LLMs, encompassing extensive safety investigations in pretraining, fine-tuning, and red teaming.

Further discussion on the various strategies employed by developers to ensure the trustworthiness of LLMs can be found in Section 3.3.

Despite these concerted efforts, a persistent question remains: To what extent can we genuinely trust LLMs?

To tackle this crucial question, it is essential to address the fundamental issue of benchmarking how trustworthy LLMs are.

What key elements define the trustworthiness of large language models, and from various perspectives, how should this trustworthiness be assessed?

Furthermore, exploring methodologies to practically evaluate trustworthiness across these dimensions is vital. However, answering these questions is far from straightforward. The primary challenges include:

1) Definition of comprehensive aspects. One of the main obstacles is the absence of a universally accepted set of criteria that comprehensively encapsulates all facets of trustworthiness.

This lack of standardized metrics makes it difficult to uniformly assess and compare the trustworthiness of different LLMs.

2) Scalability and generalizability: Creating benchmarks that are scalable across different sizes and types of LLMs and generalizable across various domains and applications is a complex task;

3) Practical evaluation methodologies.

Effective prompts need to be designed to test obvious trustworthiness issues and uncover more subtle biases and errors that might not be immediately apparent. This requires a deep understanding of both the technology and the potential societal impacts of its outputs.

Previous studies [70, 71, 72] have established foundational insights into the trustworthiness of LLMs.

These studies have proposed approaches for evaluating LLMs and formulated taxonomies to measure their trustworthiness.

However, certain taxonomies [70, 73] have not fully encompassed all aspects related to LLM trustworthiness.

Additionally, some taxonomies [71, 72] focus on fine-grained distinctions, resulting in overlapping subcategories that complicate the establishment of clear evaluation benchmarks.

Consequently, there is a need for a more comprehensive and nuanced approach to accurately assess the trustworthiness of LLMs.

Here, we present TRUSTLLM, a unified framework to support a comprehensive analysis of trustworthiness in LLM, including a survey of existing work, organizing principles of different dimensions of trustworthy LLMs, a novel benchmark, and a thorough evaluation of trustworthiness for mainstream LLMs.

Specifically, we address the three challenges above as follows.

  • Identification of eight facets of trustworthiness. To explore how trustworthy LLMs are, we incorporated domain knowledge from across AI, machine learning, data mining, human–computer interaction (HCI), and cybersecurity. We conducted an extensive review of 500 papers on LLM trustworthiness published in the past five years and identified eight key aspects that define the trustworthiness of LLMs, which are truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability. In this work, to facilitate our investigation, we separate utility (i.e., functional effectiveness) from the eight identified dimensions and define trustworthy LLMs as “to be trustworthy, LLMs must appropriately reflect characteristics such as truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability.” The detailed discussion can be found in Section 4.
  • Selection of comprehensive and diverse LLMs for investigation. By evaluating 16 LLMs, encompassing both proprietary and open-source models, we cover a broad spectrum of model sizes, training strategies, and functional capabilities. This diversity guarantees that TRUSTLLM is not confined to a specific type or size of LLM. It also establishes a comprehensive evaluation framework for assessing the trustworthiness of future LLMs.
  • Benchmarking and evaluation across various tasks and datasets: We benchmark 30 datasets to comprehensively evaluate the functional capabilities of LLMs, ranging from simple classification to complex generation tasks. Each dataset presents unique challenges and benchmarks the LLMs across multiple dimensions of trustworthiness. Meanwhile, diverse evaluation metrics are employed for understanding the capabilities of LLMs. This approach ensures that the evaluation is thorough and multifaceted.
Contributions

The outcomes of the TRUSTLLM evaluation are summarized in Figure 1, with observations and insights presented in Section 2. We briefly highlight our contributions in this work as follows.

(1) Firstly, we have proposed a set of guidelines, based on a comprehensive literature review, for evaluating the trustworthiness of LLMs: a taxonomy encompassing eight aspects, including truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability.
 

Figure 1: Ranking card of 16 LLMs' trustworthiness performance on TRUSTLLM. The card covers both proprietary and open-weight LLMs across the sub-dimensions internal knowledge, external knowledge, hallucination, persona sycophancy, preference sycophancy, adversarial factuality, jailbreak, toxicity, misuse, exaggerated safety, stereotype (Tasks 1-3), disparagement (sex, race), preference, natural noise (AdvGLUE, AdvInstruction), OOD detection, OOD generalization, privacy awareness (Task 1, Task 2-Normal, Task 2-Aug), privacy leakage (RtA, TD, CD), explicit ethics (Social Norm, ETHICS), implicit ethics (low- and high-ambiguity), and emotional awareness. If a model's performance ranks among the top eight, its ranking is displayed, with darker blue indicating better performance. In each subsection, rankings are based on overall performance unless specified otherwise.


(2) Secondly, we have established a benchmark for six of these aspects due to the difficulty of benchmarking transparency and accountability.

This is the first comprehensive and integrated benchmark comprising over 18 subcategories, covering more than 30 datasets and 16 LLMs, including proprietary and open-weight ones.

Besides the trustworthiness ranking of these models illustrated in Figure 1, we present the evaluation details in each subsequent section. 

(3) Last but not least, drawing from extensive experimental results, we have derived insightful findings (detailed in Section 2).

Our evaluation of trustworthiness in LLMs takes into account both the overall observation and individual findings based on each dimension, emphasizing the relationship between effectiveness and trustworthiness, the prevalent over-alignment in most LLMs, the disparity between proprietary and open-weight LLMs, and the opacity of current trustworthiness-related technologies.

We aim to provide valuable insights for future research, contributing to a more nuanced understanding of the trustworthiness landscape in large language models.

Roadmap

First, in Section 2, we summarize and present the empirical findings of TRUSTLLM.

Then, in Section 3, we review LLMs and related work on trustworthiness, including current trustworthy technologies and benchmarks.

Following this, in Section 4, we propose guidelines and principles for trustworthy LLMs.

Section 5 introduces the selected LLMs, tasks, datasets, and experimental settings used in our benchmark.

Sections 6-13 offer an overview and assessment of trustworthy LLMs from eight different perspectives.

In Section 14, we identify and discuss the current and upcoming challenges that TrustLLM faces.

Section 15 is dedicated to discussing future directions.

Finally, our conclusions are presented in Section 16. 

Observations and Insights

To facilitate the overall understanding of our study, in this section, we first present the observations and insights we have drawn based on our extensive empirical studies in this work.

Overall Observations

Trustworthiness is closely related to utility[1]. Our findings indicate a positive correlation between trustworthiness and utility, particularly evident in specific tasks.

For example, in moral behavior classification (Section 11.1) and stereotype recognition tasks (Section 8.1), LLMs like GPT-4 that possess strong language understanding capabilities tend to make more accurate moral judgments and reject stereotypical statements more reliably.

Similarly, Llama2-70b and GPT-4, known for their proficiency in natural language inference, demonstrate enhanced resilience against adversarial attacks.

Furthermore, we observed that the trustworthiness rankings of LLMs often mirror their positions on utility-focused leaderboards, such as MT-Bench [74], OpenLLM Leaderboard [75], and others.

This observation underscores the intertwined nature of trustworthiness and utility, highlighting the importance for both developers and users to consider these aspects simultaneously when implementing and utilizing LLMs.
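The claim that trustworthiness rankings mirror utility leaderboards can be quantified with a rank correlation between the two orderings. The sketch below is a minimal illustration with made-up ranks, not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical ranks of the same models on a trustworthiness benchmark and on a
# utility leaderboard (1 = best); the values are illustrative only.
trust_rank   = [1, 2, 3, 5, 4, 7, 6, 8]
utility_rank = [1, 3, 2, 4, 5, 6, 8, 7]

rho, p_value = spearmanr(trust_rank, utility_rank)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```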

Most LLMs are “overly aligned”. We have found that many LLMs exhibit a certain degree of over-alignment (i.e., exaggerated safety), which can compromise their overall trustworthiness.

Such LLMs may identify many innocuous prompt contents as harmful, thereby impacting their utility. For instance, Llama2-7b obtained a 57% rate of refusal in responding to prompts that were, in fact, not harmful.

Consequently, it is essential to train LLMs to understand the intent behind a prompt during the alignment process, rather than merely memorizing examples. This will help in lowering the false positive rate in identifying harmful content.
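A rough way to measure exaggerated safety of the kind described above is the refusal rate on prompts known to be benign, where every refusal counts as a false positive. The keyword heuristic and example responses below are deliberately simplistic placeholders for the refusal classifiers used in practice.

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "as an ai")

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses that look like refusals (simple keyword heuristic)."""
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses) if responses else 0.0

# On benign prompts, every detected refusal is a false positive of the safety filter.
benign_responses = [
    "I'm sorry, I can't help with that.",
    "Sure, here is a simple recipe for bread...",
]
print(f"Over-refusal rate on benign prompts: {refusal_rate(benign_responses):.0%}")
```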

Generally, proprietary LLMs outperform most open-weight LLMs in trustworthiness.

However, a few open-source LLMs can compete with proprietary ones. We found a gap in the performance of open-weight and proprietary LLMs regarding trustworthiness. Generally, proprietary LLMs (e.g., ChatGPT, GPT-4) tend to perform much better than the majority of open-weight LLMs.

This is a serious concern because open-weight models can be widely downloaded. Once integrated into application scenarios, they may pose severe risks.

However, we were surprised to discover that Llama2 [69], a series of open-weight LLMs, surpasses proprietary LLMs in trustworthiness in many tasks.

This indicates that open-weight models can demonstrate excellent trustworthiness even without adding external auxiliary modules (such as a moderator [76]).

This finding provides a significant reference value for relevant open-weight developers.

Both the model itself and trustworthiness-related technology should be transparent (e.g., open-sourced).

Given the significant gap in performance regarding trustworthiness among different LLMs, we emphasize the importance of transparency, both in the models themselves and in the technologies aimed at enhancing trustworthiness.

As highlighted in recent studies [77, 78], a thorough understanding of the training mechanisms of models, including aspects such as parameter and architecture design, forms the cornerstone of researching LLMs.

Our experiments found that while some proprietary LLMs exhibit high trustworthiness (e.g., ERNIE [79]), the specifics of the underlying technologies remain undisclosed.

Making such trustworthy technologies transparent or open-source can promote the broader adoption and improvement of these techniques, significantly boosting the trustworthiness of LLMs.

This, in turn, makes LLMs more reliable and strengthens the AI community’s overall trust in these models, thereby contributing to the healthy evolution of AI technology.

Novel Insights into Individual Dimensions of Trustworthiness

Truthfulness. Truthfulness in AI systems refers to the accurate representation of information, facts, and results. Our findings indicate that:

1) Proprietary LLMs like GPT-4 and open-source LLMs like Llama2 often struggle to provide truthful responses when relying solely on their internal knowledge. This issue is primarily due to noise in their training data, including misinformation or outdated information, and the lack of generalization capability in the underlying Transformer architecture [80].

2) Furthermore, all LLMs face challenges in zero-shot commonsense reasoning tasks, suggesting difficulty in tasks that are relatively straightforward for humans.

3) In contrast, LLMs with augmented external knowledge demonstrate significantly improved performance, surpassing state-of-the-art results reported on original datasets.

4) We observe a notable discrepancy among different hallucination tasks. Most LLMs show fewer hallucinations in multiple-choice question-answering tasks compared to more open-ended tasks such as knowledge-grounded dialogue, likely due to prompt sensitivity (Section 14).

5) Additionally, we find a positive correlation between sycophancy and adversarial factuality. Models with lower sycophancy levels are more effective in identifying and highlighting factual errors in user inputs.

 

Safety

Safety in LLMs is crucial for avoiding unsafe or illegal outputs and ensuring engagement in healthy conversations [72]. In our experiments (Section 7), we found that:

 1) The safety of most open-source LLMs remains a concern and significantly lags behind that of proprietary LLMs, particularly in areas like jailbreak, toxicity, and misuse.

 2) Notably, LLMs do not uniformly resist different jailbreak attacks. Our observations revealed that various jailbreak attacks, particularly leetspeak attacks [61], vary in their success rates against LLMs. This underscores the need for LLM developers to adopt a comprehensive defense strategy against diverse attack types.

3) Balancing safety is a challenge for most LLMs; those with stringent safety protocols often show exaggerated caution, as evident in the Llama2 series and ERNIE. This suggests that many LLMs are not fully aligned and may rely on superficial alignment knowledge.

Fairness. Fairness is the ethical principle of ensuring that LLMs are designed, trained, and deployed in ways that do not lead to biased or discriminatory outcomes and that they treat all users and groups equitably. In our experiments (Section 8), we found that:

1) The performance of most LLMs in identifying stereotypes is not satisfactory, with even the best-performing GPT-4 having an overall accuracy of only 65%. When presented with sentences containing stereotypes, the agreement rates of different LLMs vary widely: the best-performing model agrees with only 0.5% of stereotypical statements, while the worst-performing one agrees with nearly 60%.

2) Only a few LLMs, such as Oasst-12b [81] and Vicuna-7b [82], exhibit fairness in handling disparagement; most LLMs still display biases towards specific attributes when dealing with questions containing disparaging tendencies.

3) Regarding preferences, most LLMs perform very well on the plain baseline, maintaining objectivity and neutrality or refusing to answer directly. However, when forced to choose an option, the performance of LLMs significantly decreases.

 

Robustness

 Robustness is defined as a system’s ability to maintain its performance level under various circumstances [83]. In our experiments (Section 9), we found that:

1) The Llama2 series and most proprietary LLMs surpass other open-source LLMs in traditional downstream tasks.

2) However, LLMs exhibit significant variability in open-ended task performance. The least effective model shows an average semantic similarity of only 88% before and after perturbation, substantially lower than the top performer at 97.64% (a minimal sketch of this similarity measurement appears after this list).

3) In terms of OOD robustness, LLMs demonstrate considerable performance variation. The top-performing model, GPT-4, exhibits a RtA (Refuse to Answer) rate of over 80% in OOD detection and an average F1 score of over 92% in OOD generalization. In contrast, the least effective models show an RtA rate of merely 0.4% and an F1 score of around 30%.

4) Additionally, our observations reveal no consistent positive correlation between parameter size and OOD performance, as evidenced by the varied performance levels of Llama2 models regardless of their parameter size.
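One way to approximate the open-ended robustness metric in point 2 above is to embed each response before and after perturbation and average the cosine similarities. The embedding model named below is an assumption for illustration and may differ from the setup used in TRUSTLLM.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative choice of embedding model; TRUSTLLM's exact setup may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

responses_clean     = ["The capital of France is Paris.",
                       "Water boils at 100 degrees Celsius."]
responses_perturbed = ["The capital of France is Paris, of course.",
                       "Water starts boiling around 100 C."]

emb_clean = model.encode(responses_clean, convert_to_tensor=True)
emb_pert  = model.encode(responses_perturbed, convert_to_tensor=True)

# Cosine similarity between each response and its perturbed counterpart, then averaged.
sims = util.cos_sim(emb_clean, emb_pert).diagonal()
print(f"Average semantic similarity: {sims.mean().item():.4f}")
```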

Privacy

Privacy encompasses the norms and practices aimed at protecting human autonomy, identity, and dignity [83]. In our experiments (Section 10), we found that:

1) Most LLMs demonstrate a certain level of privacy awareness, as evidenced by a significant increase in the likelihood of these models refusing to respond to queries about private information when informed that they must adhere to privacy policy.

2) The Pearson correlation coefficient measuring agreement between humans and LLMs on the use of privacy information varies greatly.

The best-performing model, ChatGPT, achieves a correlation of 0.665, while Oasst-12b exhibits a surprising negative correlation, indicating an understanding of privacy that diverges from humans (a minimal sketch of this agreement measurement appears after this list).

3) We observed that nearly all LLMs show some degree of information leakage when tested on the Enron Email Dataset [84].
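The human-LLM agreement figure in point 2 above is a Pearson correlation between human ratings and model judgments over the same privacy scenarios. The score arrays below are hypothetical and only illustrate how such a coefficient is computed.

```python
from scipy.stats import pearsonr

# Hypothetical agreement scores (e.g., on a -100 to 100 scale) for the same privacy
# scenarios, as rated by human annotators and by an LLM; values are illustrative only.
human_scores = [90, -80, 50, -100, 30, -60]
llm_scores   = [85, -70, 40,  -90, 10, -40]

r, p_value = pearsonr(human_scores, llm_scores)
print(f"Pearson correlation: {r:.3f} (p = {p_value:.3f})")
```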

Machine Ethics

Machine ethics ensure the moral behaviors of man-made machines utilizing AI, commonly referred to as AI agents [85, 86].

In our experiments (Section 11), we found that: 1) LLMs have developed a specific set of moral values, yet there remains a significant gap in fully aligning with human ethics.

The accuracy of most LLMs in implicit tasks within low-ambiguity scenarios falls below 70%, irrespective of the dataset. In high-ambiguity scenarios, performance varies considerably among different LLMs; for instance, the Llama2 series achieves an RtA of 99.9%, while others score less than 70%. 2) In terms of emotional awareness, LLMs show higher accuracy, with the best-performing models like GPT-4 exceeding an accuracy rate of 94%. 

[1] In this work, utility refers to the functional effectiveness of the model in natural language processing tasks, including abilities in logical reasoning, content summarization, text generation, and so on.

 -----------------------------------------------------------------------------------------------

 

