AI Seminar: What is the best atomic unit to represent text?
Matthias Gallé, Group Lead of the Natural Language Processing group, Naver Labs Europe.
Text Representation Units for Neural Machine Translation
What is the best atomic unit to represent text?
This important decision lies at the intersection of the continuous representations used in modern NLP and the discrete world.
To understand the effectiveness of byte-pair encoding (BPE), we test the hypothesis that it stems from the algorithm's compression capacity. We do so by linking BPE to the broader family of dictionary-based compression algorithms.
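For readers unfamiliar with BPE as a compression scheme, here is a minimal Python sketch of the core idea: repeatedly replace the most frequent adjacent pair of symbols with a single new symbol. The function names (`most_frequent_pair`, `merge_pair`, `bpe_compress`) are hypothetical and chosen for illustration. The sketch runs on one raw string and ignores word boundaries, which is a simplification and not necessarily the setup used in the talk.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair, or None if no pair occurs twice."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]
    return pair if freq > 1 else None


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(text, num_merges):
    """Apply up to `num_merges` merges; return the symbol sequence and the merge list."""
    seq = list(text)  # start from characters as atomic symbols
    merges = []       # the learned "dictionary" of merged pairs
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:  # nothing repeats, so no further compression is possible
            break
        seq = merge_pair(seq, pair)
        merges.append(pair)
    return seq, merges


if __name__ == "__main__":
    symbols, merges = bpe_compress("abababab", num_merges=10)
    print(symbols)  # ['abab', 'abab']
    print(merges)   # [('a', 'b'), ('ab', 'ab')]
```

Each merge shortens the symbol sequence while adding one entry to the merge list, which is the trade-off shared by dictionary-based compression algorithms.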
We then study character-based NMT with Transformer models, showing the consequences of using characters as atomic symbols for overall translation quality and robustness, as well as the need for deeper models.
This is joint work with Rohit Gupta, Laurent Besacier and Marc Dymetman.
The seminar is free and open to everyone.
This seminar is part of the AI Seminar Series organised by SCIENCE AI Centre. The series highlights advances and challenges in research within Machine Learning, Data Science, and AI. Like the AI Centre itself, the seminar series has a broad scope, covering new methodological contributions, ground-breaking applications, and impacts on society.