Accounting for cognitive bias in human evaluation of large language models

Large language models (LLMs) can generate extremely fluent natural-language text, and that fluency can fool the human mind into overlooking the quality of the content. For example, psychological studies have shown that highly fluent content can be perceived as more truthful and useful than less fluent content.

The preference for fluent speech is an example of a cognitive bias, a shortcut the mind takes that, although evolutionarily useful, can lead to systematic errors. In a position paper we presented at this year’s meeting of the Association for Computational Linguistics (ACL), we draw practical insights about cognitive bias by comparing real-world evaluations of LLMs with studies in human psychology.

Science depends on the reliability of experimental results, and in the age of LLMs, measuring the right things in the right way is crucial to ensuring that reliability. For example, in an experiment to determine whether an LLM’s output is truthful and useful in an application context, such as providing legal or medical advice, it is important to control for factors such as fluency and the user’s cognitive load (i.e., mental effort). If long, fluent content causes evaluators to overlook critical errors and rate defective content highly, the experimental design needs a rethink.
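
As a concrete illustration of what controlling for fluency could look like, the minimal sketch below (our own example, not a procedure from the paper; the data and variable names are hypothetical) regresses human ratings on both the number of verified factual errors and a fluency score. If the fluency coefficient dominates, evaluators are probably rewarding fluent but defective content.

```python
import numpy as np

# Hypothetical evaluation records: one row per rated response.
# Columns: human rating (1-5), number of verified factual errors, fluency score (0-1).
ratings = np.array([4.5, 4.0, 2.0, 4.8, 3.0, 1.5, 4.2, 2.5])
errors  = np.array([2,   1,   3,   2,   0,   4,   1,   3  ])
fluency = np.array([0.9, 0.8, 0.3, 0.95, 0.5, 0.2, 0.85, 0.4])

# Ordinary least squares: rating ~ intercept + errors + fluency.
X = np.column_stack([np.ones_like(fluency), errors, fluency])
coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
intercept, beta_errors, beta_fluency = coef

print(f"effect of factual errors on rating: {beta_errors:+.2f}")
print(f"effect of fluency on rating:        {beta_fluency:+.2f}")

# If beta_fluency dominates beta_errors, evaluators are likely rewarding
# fluent-but-wrong content, and the evaluation protocol should be revisited.
```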

In the recommended evaluation setup, content is divided into individual facts, and human evaluators simply judge whether each fact is correct.

For tasks such as truthfulness evaluation, we therefore recommend that content be divided into individual facts and that the human evaluator merely assess whether a given fact is correct, rather than, say, assigning a numerical rating to the content as a whole. It is also important to account for human context in responsible-AI (RAI) evaluation: toxicity and stereotyping are in the eye of the beholder, so a model’s evaluators should be as diverse as possible.
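
A minimal sketch of this decompose-and-verify setup, assuming a set of atomic facts has already been extracted from a response (the data structures and function names are our own illustration, not an interface from the paper):

```python
from dataclasses import dataclass

@dataclass
class FactJudgment:
    fact: str         # a single atomic claim extracted from the response
    is_correct: bool  # binary verdict from a human evaluator

def truthfulness_score(judgments: list[FactJudgment]) -> float:
    """Fraction of atomic facts judged correct, instead of one holistic 1-5 rating."""
    if not judgments:
        return 0.0
    return sum(j.is_correct for j in judgments) / len(judgments)

# Example: a legal-advice response decomposed into atomic facts,
# each judged independently by a human evaluator.
judgments = [
    FactJudgment("The statute of limitations for this claim is two years.", True),
    FactJudgment("The claim must be filed in federal court.", False),
    FactJudgment("Filing fees can be waived for qualifying applicants.", True),
]
print(f"truthfulness: {truthfulness_score(judgments):.2f}")  # 0.67
```

Binary judgments on atomic facts are also easier to aggregate and compare across evaluators than holistic scores.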

When evaluating LLMs, it is also important to investigate their strengths and weaknesses relative to specific use cases. End users ask LLMs all kinds of questions. Accounting for this diversity is especially important in safety-critical applications such as medicine, where the cost of errors can be high.

Similarly, the same prompt can be framed in many ways, and test scenarios have to reflect this variation. If they don’t, the numbers we get back may not reflect the performance of the model in the wild.
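
One lightweight way to build that variation into a test set is to pair each underlying question with several surface framings, as in the hypothetical sketch below:

```python
# Hypothetical sketch: expand each underlying test question into several
# surface framings, so scores reflect robustness to phrasing rather than one wording.
FRAMINGS = [
    "{q}",
    "I'm not an expert -- could you explain: {q}",
    "Answer briefly and explain your reasoning: {q}",
    "My friend insists the opposite is true. {q}",
]

def expand_test_case(question: str) -> list[str]:
    """Return one prompt per framing for a single underlying question."""
    return [template.format(q=question) for template in FRAMINGS]

for prompt in expand_test_case("Is ibuprofen safe to take with this medication?"):
    print(prompt)
```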

Evaluation criteria also matter. While there are good general approaches to evaluation, such as the helpful, honest, and harmless (HHH) criteria, domain-specific criteria are also crucial. For example, in the legal domain, we may want to know how good the model is at predicting case outcomes given the evidence.
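
As an illustration of how general and domain-specific criteria might be combined into a single rubric (the criteria and weights below are hypothetical, not drawn from the paper):

```python
# Illustrative rubric mixing general criteria with legal-domain criteria.
# Criterion names and weights are hypothetical; weights sum to 1.0.
RUBRIC = {
    "general": {
        "readability": 0.15,
        "helpfulness": 0.15,
        "harmlessness": 0.20,
    },
    "legal_domain": {
        "correct_case_outcome_prediction": 0.30,
        "cites_relevant_precedent": 0.20,
    },
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each in [0, 1]) into one weighted score."""
    total = 0.0
    for group in RUBRIC.values():
        for criterion, weight in group.items():
            total += weight * scores.get(criterion, 0.0)
    return total

print(weighted_score({
    "readability": 0.9,
    "helpfulness": 0.8,
    "harmlessness": 1.0,
    "correct_case_outcome_prediction": 0.5,
    "cites_relevant_precedent": 0.7,
}))
```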

Another basic principle of scientific experimentation is reproducibility, and again, it is a principle that applies to LLM evaluation as well. While automated evaluation procedures are reproducible, human evaluation may vary depending on the personalities, backgrounds, moods, and cognitive states of the evaluators. In our paper, we emphasize that human evaluation does not in itself establish a gold standard: we need to understand the cognitive behavior of the users evaluating our systems.
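
One standard way to quantify how much evaluators differ is an inter-annotator agreement statistic. The sketch below computes Cohen’s kappa for two evaluators making binary fact judgments on the same items; the data is illustrative, and this is our own example rather than a measure prescribed by the paper.

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two raters labeling the same items with binary labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Two human evaluators judging the same ten atomic facts (1 = correct, 0 = incorrect).
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(f"Cohen's kappa: {cohens_kappa(rater_1, rater_2):.2f}")
```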

Finally, there are the practical aspects of human evaluation: time and cost. Human evaluation is an expensive process, and understanding which aspects of evaluation can be automated or simplified is critical to its broader adoption.

In our paper, we distill these arguments into six key principles for performing human evaluation of large language models, which we consolidate under the acronym ConSiDERS: consistency, scoring criteria, differentiating, user experience, responsibility, and scalability:

  • Consistency of human evaluation: The results of human evaluation must be reliable and generalizable.
  • Scoring criteria: The scoring criteria must include general criteria, such as readability, and also be tailored to the goals of the target tasks or domains.
  • Differentiating: The evaluation test set must be able to differentiate the capabilities and weaknesses of generative LLMs.
  • User experience: The evaluation must take into account the evaluators’ user experience, including their emotions and cognitive biases, in both the design of experiments and the interpretation of results.
  • Responsibility: The evaluation must comply with responsible-AI standards, accounting for factors such as bias, safety, robustness, and privacy.
  • Scalability: To promote widespread adoption, human evaluation must be scalable.

For more information about the use of the framework, see our paper, “ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models”.
