Accounting for cognitive bias in human evaluation of large language models

Large language models (LLMs) can generate extremely fluent natural-language text, and that fluency can fool the human mind into overlooking the quality of the content. For example, psychological studies have shown that highly fluent content can be perceived as more truthful and useful than less fluent content.

The preference for fluent speech is an example of a cognitive bias, a shortcut the mind takes that, although evolutionarily useful, can lead to systematic errors. In a position paper we presented at this year’s meeting of the Association for Computational Linguistics (ACL), we draw practical insights about cognitive bias by comparing real-world evaluations of LLMs with studies in human psychology.

Science depends on the reliability of experimental results, and in the age of LLMs, measuring the right things in the right way is crucial to ensuring that reliability. For example, in an experiment to determine whether an LLM’s output is truthful and useful in an application context, such as providing legal or medical advice, it is important to control for factors such as fluency and the user’s cognitive load (i.e., mental effort). If long, fluent content causes evaluators to overlook critical errors and rate defective content highly, the experimental design needs a rethink.
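
As a concrete illustration of what controlling for fluency could look like, the minimal sketch below (our own example, not a procedure from the paper; the data and variable names are hypothetical) regresses human ratings on both the number of verified factual errors and a fluency score. If the fluency coefficient dominates, evaluators are probably rewarding fluent but defective content.

```python
import numpy as np

# Hypothetical evaluation records: one row per rated response.
# Columns: human rating (1-5), number of verified factual errors, fluency score (0-1).
ratings = np.array([4.5, 4.0, 2.0, 4.8, 3.0, 1.5, 4.2, 2.5])
errors  = np.array([2,   1,   3,   2,   0,   4,   1,   3  ])
fluency = np.array([0.9, 0.8, 0.3, 0.95, 0.5, 0.2, 0.85, 0.4])

# Ordinary least squares: rating ~ intercept + errors + fluency.
X = np.column_stack([np.ones_like(fluency), errors, fluency])
coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
intercept, beta_errors, beta_fluency = coef

print(f"effect of factual errors on rating: {beta_errors:+.2f}")
print(f"effect of fluency on rating:        {beta_fluency:+.2f}")

# If beta_fluency dominates beta_errors, evaluators are likely rewarding
# fluent-but-wrong content, and the evaluation protocol should be revisited.
```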

In the recommended evaluation setup, content is divided into individual facts, and human evaluators simply judge whether each fact is correct.

For tasks such as truthfulness evaluation, we therefore recommend that content be divided into individual facts and that the human evaluator merely assess whether a given fact is correct, rather than, say, assigning a numerical rating to the content as a whole. It is also important to account for human context in responsible-AI (RAI) evaluation: toxicity and stereotyping are in the eye of the beholder, so a model’s evaluators should be as diverse as possible.
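
A minimal sketch of this decompose-and-verify setup, assuming a set of atomic facts has already been extracted from a response (the data structures and function names are our own illustration, not an interface from the paper):

```python
from dataclasses import dataclass

@dataclass
class FactJudgment:
    fact: str         # a single atomic claim extracted from the response
    is_correct: bool  # binary verdict from a human evaluator

def truthfulness_score(judgments: list[FactJudgment]) -> float:
    """Fraction of atomic facts judged correct, instead of one holistic 1-5 rating."""
    if not judgments:
        return 0.0
    return sum(j.is_correct for j in judgments) / len(judgments)

# Example: a legal-advice response decomposed into atomic facts,
# each judged independently by a human evaluator.
judgments = [
    FactJudgment("The statute of limitations for this claim is two years.", True),
    FactJudgment("The claim must be filed in federal court.", False),
    FactJudgment("Filing fees can be waived for qualifying applicants.", True),
]
print(f"truthfulness: {truthfulness_score(judgments):.2f}")  # 0.67
```

Binary judgments on atomic facts are also easier to aggregate and compare across evaluators than holistic scores.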

When evaluating LLMs, it is also important to investigate their strengths and weaknesses relative to specific use cases. End users ask LLMs all kinds of questions. Accounting for this diversity is especially important in safety-critical applications such as medicine, where the cost of errors can be high.

Similarly, the same prompt can be framed in many ways, and test scenarios have to reflect this variation. If they don’t, the numbers we get back may not reflect the performance of the model in the wild.
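
One lightweight way to build that variation into a test set is to pair each underlying question with several surface framings, as in the hypothetical sketch below:

```python
# Hypothetical sketch: expand each underlying test question into several
# surface framings, so scores reflect robustness to phrasing rather than one wording.
FRAMINGS = [
    "{q}",
    "I'm not an expert -- could you explain: {q}",
    "Answer briefly and explain your reasoning: {q}",
    "My friend insists the opposite is true. {q}",
]

def expand_test_case(question: str) -> list[str]:
    """Return one prompt per framing for a single underlying question."""
    return [template.format(q=question) for template in FRAMINGS]

for prompt in expand_test_case("Is ibuprofen safe to take with this medication?"):
    print(prompt)
```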

Evaluation criteria also matter. While there are good general approaches to evaluation, such as the helpful, honest, and harmless (HHH) criteria, domain-specific criteria are also crucial. For example, in the legal domain, we may want to know how good the model is at predicting case outcomes given the evidence.
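
As an illustration of how general and domain-specific criteria might be combined into a single rubric (the criteria and weights below are hypothetical, not drawn from the paper):

```python
# Illustrative rubric mixing general criteria with legal-domain criteria.
# Criterion names and weights are hypothetical; weights sum to 1.0.
RUBRIC = {
    "general": {
        "readability": 0.15,
        "helpfulness": 0.15,
        "harmlessness": 0.20,
    },
    "legal_domain": {
        "correct_case_outcome_prediction": 0.30,
        "cites_relevant_precedent": 0.20,
    },
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each in [0, 1]) into one weighted score."""
    total = 0.0
    for group in RUBRIC.values():
        for criterion, weight in group.items():
            total += weight * scores.get(criterion, 0.0)
    return total

print(weighted_score({
    "readability": 0.9,
    "helpfulness": 0.8,
    "harmlessness": 1.0,
    "correct_case_outcome_prediction": 0.5,
    "cites_relevant_precedent": 0.7,
}))
```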

Another basic principle of scientific experimentation is reproducibility, and again, it is a principle that applies to LLM evaluation as well. While automated evaluation procedures are reproducible, human evaluation may vary depending on the personalities, backgrounds, moods, and cognitive states of the evaluators. In our paper, we emphasize that human evaluation does not in itself establish a gold standard: we need to understand the cognitive behavior of the users evaluating our systems.
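
One standard way to quantify how much evaluators differ is an inter-annotator agreement statistic. The sketch below computes Cohen’s kappa for two evaluators making binary fact judgments on the same items; the data is illustrative, and this is our own example rather than a measure prescribed by the paper.

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two raters labeling the same items with binary labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Two human evaluators judging the same ten atomic facts (1 = correct, 0 = incorrect).
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(f"Cohen's kappa: {cohens_kappa(rater_1, rater_2):.2f}")
```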

Finally, there are the practical aspects of human evaluation: time and cost. Human evaluation is an expensive process, and understanding which aspects of evaluation can be automated or simplified is critical to its broader adoption.

In our paper, we distill these arguments into six key principles for performing human evaluation of large language models, which we consolidate under the acronym ConSiDERS: consistency, scoring criteria, differentiating, user experience, responsibility, and scalability:

  • Consistency of human evaluation: The results of human evaluation must be reliable and generalizable.
  • Scoring criteria: The scoring criteria must include general criteria, such as readability, and also be tailored to the goals of the target tasks or domains.
  • Differentiating: The evaluation test set must be able to differentiate the capabilities and weaknesses of generative LLMs.
  • User experience: The evaluation must take into account the evaluators’ user experience, including their emotions and cognitive biases, in both the design of experiments and the interpretation of results.
  • Responsibility: The evaluation must comply with responsible-AI standards, accounting for factors such as bias, safety, robustness, and privacy.
  • Scalability: To promote widespread adoption, human evaluation must be scalable.

For more information about the use of the framework, see our paper, “ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models”.
