AI Seminar: What is the best atomic unit to represent text?


Matthias Gallé, Group Lead of the Natural Language Processing group, Naver Labs Europe.


Text Representation Units for Neural Machine Translation


What is the best atomic unit to represent text?

This important decision lies at the intersection of the continuous representations of modern NLP and the discrete world.

To understand why byte-pair encoding (BPE) is effective, we test the hypothesis that its effectiveness stems from the algorithm's compression capacity. We do so by linking BPE to the broader family of dictionary-based compression algorithms.
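To make the compression view concrete, here is a minimal sketch of BPE used as a dictionary-based compressor. It is an illustration only, not the speakers' implementation: the function name learn_bpe and the sample string are hypothetical, and the text is treated as one flat character sequence with no word-boundary handling, which NMT toolkits typically add.

```python
from collections import Counter


def learn_bpe(text, num_merges):
    """Greedy BPE sketch: repeatedly replace the most frequent adjacent
    symbol pair with a single new symbol (an entry in the dictionary)."""
    seq = list(text)  # start from characters as the atomic units
    merges = []       # learned dictionary, in merge order
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is left to compress
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            # Scan left to right, fusing each non-overlapping (a, b) pair.
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges


# Hypothetical sample text; compare sequence length before and after merging.
sample = "low lower lowest low low"
symbols, dictionary = learn_bpe(sample, 10)
print(len(sample), len(symbols), dictionary)
```

Each merge shortens the sequence at the cost of one extra dictionary entry, which is the trade-off shared by other dictionary-based compression algorithms. With zero merges the output is simply the character sequence, the other extreme of the unit choice discussed in the talk.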

We then study character-based NMT with Transformer models, showing the consequences of using characters as atomic symbols for overall translation quality and robustness, as well as the need for deeper models.

This is joint work with Rohit Gupta, Laurent Besacier and Marc Dymetman.

The seminar is free and open to everyone.