# AI Safety: SORRY-Bench's Taxonomy for LLM Refusal Behavior Analysis

Large language models (LLMs) are gaining attention, but ensuring their safe and ethical use remains a challenge. Researchers are developing alignment procedures that calibrate LLMs to human values and intentions, so that the models refuse unsafe user requests. Existing methodologies struggle to evaluate LLM safety comprehensively, which led to the development of SORRY-Bench. This framework introduces a fine-grained safety taxonomy, linguistic mutations of prompts, and efficient evaluation methods to assess LLM safety refusal behaviors.

SORRY-Bench evaluates over 40 LLMs across 45 safety categories, revealing wide variation in safety refusal behaviors. Key findings include differences in overall refusal rates between models, category-specific results (e.g., high refusal rates for “Harassment” and low rates for legal advice), and the impact of linguistic mutations on refusal rates. The study also shows that smaller-scale LLMs, when used as automated safety evaluators, can achieve accuracy comparable to larger models at lower computational cost.

The benchmark employs a binary classification approach to evaluate model responses to unsafe instructions, using a large-scale human judgment dataset. By assessing diverse refusal behaviors, SORRY-Bench provides insights for researchers and developers to improve LLM safety. The framework offers a balanced, granular, and efficient tool for responsible AI deployment.
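To make the binary classification idea concrete, here is a minimal sketch of how per-category refusal rates could be aggregated once each response has been labeled as refused or not. The function name `refusal_rates`, the `(category, refused)` input format, and the sample labels are assumptions made for illustration; they are not taken from the SORRY-Bench code or results.

```python
from collections import defaultdict


def refusal_rates(judgments):
    """Aggregate binary judgments into a refusal rate per category.

    `judgments` is an iterable of (category, refused) pairs, where
    `refused` is True if the model declined the unsafe instruction.
    This input format is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for category, refused in judgments:
        totals[category] += 1
        if refused:
            refusals[category] += 1
    # Refusal rate = refused responses / all responses in that category.
    return {category: refusals[category] / totals[category] for category in totals}


# Made-up labels for illustration only; not data from the study.
sample = [
    ("Harassment", True),
    ("Harassment", True),
    ("Harassment", False),
    ("Legal advice", False),
    ("Legal advice", False),
    ("Legal advice", True),
    ("Legal advice", False),
]

rates = refusal_rates(sample)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.2f}")
```

Keeping the judgment binary means the same aggregation works for any of the 45 categories and for any model, so results can be compared side by side.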

The research paper, “SORRY-Bench,” addresses deficiencies in existing LLM safety evaluations and provides a systematic approach to assessing LLM safety refusal behaviors. Researchers from multiple universities collaborated on this project, aiming to enhance the safe and ethical use of large language models.

Source link: https://www.marktechpost.com/2024/07/02/45-shades-of-ai-safety-sorry-benchs-innovative-taxonomy-for-llm-refusal-behavior-analysis/?amp