GPT-3 Close Reading: Language Models are Few-Shot Learners


Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
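The abstract's key point is that tasks and few-shot demonstrations are "specified purely via text interaction with the model", with no gradient updates. A minimal sketch of that idea follows, assuming a generic text-completion call; the `complete` function and the prompt format are illustrative assumptions, not an API from the paper.

```python
# Minimal sketch of few-shot "in-context learning": the task is described in
# text and followed by K worked examples, then a new query. The model's
# weights stay frozen; no fine-tuning or gradient updates are involved.

def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a prompt from a task description, K demonstrations, and a query."""
    lines = [task_description, ""]
    for example_input, example_output in demonstrations:
        lines.append(f"Q: {example_input}")
        lines.append(f"A: {example_output}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)

# Example: 3-digit addition, one of the on-the-fly tasks the abstract mentions.
prompt = build_few_shot_prompt(
    "Add the two numbers.",
    [("123 + 456", "579"), ("250 + 250", "500")],  # K = 2 demonstrations
    "314 + 272",
)
# answer = complete(prompt)  # hypothetical LM completion call on frozen weights
print(prompt)
```

Setting K = 0 (no demonstrations, instructions only) gives the zero-shot setting, and K = 1 the one-shot setting discussed in the paper.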
