# Unlearning in LLMs: Abstract and Introduction #HarryPotter

In this paper, Ronen Eldan and Mark Russinovich of Microsoft Research propose a technique for unlearning a subset of training data from a Large Language Model (LLM) without retraining it from scratch. They demonstrate it by unlearning the Harry Potter books from the Llama2-7b model: about 1 GPU hour of finetuning erases the model’s ability to generate or recall Harry Potter-related content while maintaining its performance on common benchmarks. The technique has three parts: a reinforced model, further trained on the target data, is used to identify the tokens most related to it; idiosyncratic expressions in the target data are replaced with generic counterparts; and the model is finetuned on the resulting alternative labels, which erases the original text from its memory.
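To make the reinforced-model step concrete, here is a minimal sketch of how alternative labels might be derived from two sets of next-token logits. It assumes the combination rule from the full paper, generic = baseline - alpha * ReLU(reinforced - baseline), which this summary does not spell out. The function names, the toy vocabulary, the logit values, and the `alpha` default are all hypothetical and chosen only for illustration.

```python
def generic_logits(baseline, reinforced, alpha=1.0):
    """Lower the score of every token that the reinforced model boosts.

    Assumed rule: generic = baseline - alpha * max(0, reinforced - baseline).
    Tokens the reinforced model does not boost keep their baseline score.
    """
    return [b - alpha * max(0.0, r - b) for b, r in zip(baseline, reinforced)]


def generic_label(baseline, reinforced, vocab, alpha=1.0):
    """Return the highest-scoring token under the combined logits."""
    scores = generic_logits(baseline, reinforced, alpha)
    best = max(range(len(scores)), key=scores.__getitem__)
    return vocab[best]


# Toy example (hypothetical numbers): the reinforced model strongly
# prefers the book-specific token, so that token is pushed down.
vocab = ["Hogwarts", "school", "the"]
baseline = [3.0, 2.5, 1.0]
reinforced = [6.0, 2.0, 1.0]

print(generic_logits(baseline, reinforced))        # [0.0, 2.5, 1.0]
print(generic_label(baseline, reinforced, vocab))  # school
```

In this toy case the alternative label becomes the generic word "school" rather than "Hogwarts". Finetuning the baseline model on such labels is the step the summary describes as erasing the original text.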

The authors highlight the ethical, legal, and technological challenges posed by LLMs trained on internet corpora containing copyrighted or problematic content. They emphasize the need for techniques to selectively unlearn specific subsets of training data in LLMs to address concerns such as copyright infringement, toxic data, and fake content. The authors present empirical evidence of their technique’s efficacy and suggest that it could lead to more responsible, adaptable, and legally compliant LLMs in the future.

The paper also surveys related work on unlearning in generative models and LLMs, noting how slim the literature in this area is and highlighting both the relevance and the limitations of existing approaches. It compares the baseline and fine-tuned models on various benchmarks and shows how next-token probabilities shift during fine-tuning.

Source link: https://hackernoon.com/whos-harry-potter-approximate-unlearning-in-llms-abstract-and-introduction