AI Seminar: Data governance and transparency for Large Language Models: lessons from the BigScience Workshop

Join us for a talk by Anna Rogers, Assistant Professor in the Center for Social Data Science at the University of Copenhagen. Everybody is welcome to attend.

Title

Data governance and transparency for Large Language Models: lessons from the BigScience Workshop

Abstract

The continued growth of LLMs and their wide-scale adoption in commercial applications such as ChatGPT make it increasingly important to (a) develop ways to source their training data more transparently, and (b) investigate that data, both for research purposes and for ethical issues. This talk will discuss the current state of affairs and some data governance lessons learned from BigScience, an open-source effort to train a multilingual LLM, including an ongoing effort to investigate the 1.6 TB multilingual ROOTS corpus.