[Episode 171] DivPO: Diverse Preference Optimization


Seventy3: using NotebookLM to turn papers into podcasts, so everyone can keep learning alongside AI.

Today's topic:

Diverse Preference Optimization

Summary

The research introduces Diverse Preference Optimization (DivPO), a novel training method designed to enhance the diversity of language model outputs while maintaining quality. Current optimization techniques often lead to a reduction in diversity, especially in creative tasks. DivPO addresses this by selecting preference pairs based on both quality and diversity, contrasting diverse, high-reward responses with less diverse, lower-reward ones. The method demonstrates significant improvements in diversity across tasks like persona generation and story writing. Experiments show that DivPO outperforms standard optimization methods by increasing both the reward and the diversity of the generated content, making it a valuable tool for creative applications and synthetic data generation. DivPO's effectiveness is validated through offline and online training regimes, showcasing its robustness and potential for wider application.
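
To make the pair-selection idea above concrete, here is a minimal Python sketch of how such a criterion could look. The helper names (`select_divpo_pair`, `word_overlap_diversity`), the single reward threshold, and the toy reward and diversity scores in the demo are illustrative assumptions rather than the paper's exact procedure; in practice the selected pair would then be trained on with a standard preference-optimization loss such as DPO.

```python
from typing import Callable, List, Optional, Tuple


def word_overlap_diversity(candidate: str, others: List[str]) -> float:
    """Toy diversity score: 1 minus the largest Jaccard word overlap
    between the candidate and any other response in the pool."""
    cand = set(candidate.lower().split())
    if not others:
        return 1.0
    overlaps = [
        len(cand & set(o.lower().split())) / max(len(cand | set(o.lower().split())), 1)
        for o in others
    ]
    return 1.0 - max(overlaps)


def select_divpo_pair(
    responses: List[str],
    reward_fn: Callable[[str], float],
    diversity_fn: Callable[[str, List[str]], float],
    reward_threshold: float,
) -> Optional[Tuple[str, str]]:
    """Form one (chosen, rejected) preference pair from a pool of sampled responses.

    chosen   = the MOST diverse response whose reward clears the threshold
    rejected = the LEAST diverse response whose reward falls below it
    Returns None when one side of the pair cannot be formed.
    """
    rewards = {r: reward_fn(r) for r in responses}
    high = [r for r in responses if rewards[r] >= reward_threshold]
    low = [r for r in responses if rewards[r] < reward_threshold]
    if not high or not low:
        return None
    chosen = max(high, key=lambda r: diversity_fn(r, [x for x in responses if x != r]))
    rejected = min(low, key=lambda r: diversity_fn(r, [x for x in responses if x != r]))
    return chosen, rejected


if __name__ == "__main__":
    pool = [
        "A shy astronomer who maps dead stars.",
        "A shy astronomer who maps dying stars.",
        "A retired pirate turned lighthouse keeper.",
        "ok",
    ]
    # Hypothetical stand-in reward: longer responses score higher.
    pair = select_divpo_pair(
        pool,
        reward_fn=lambda r: float(len(r.split())),
        diversity_fn=word_overlap_diversity,
        reward_threshold=4.0,
    )
    print(pair)  # -> ('A retired pirate turned lighthouse keeper.', 'ok')
```

In this sketch, diversity only ranks responses within the quality buckets, which mirrors the idea in the summary: high-reward but distinctive responses are preferred over low-reward, redundant ones, so the model is not rewarded for being different at the expense of being good.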

Original paper: arxiv.org