RETVec: Resilient and Efficient Text Vectorizer

This paper describes RETVec, an efficient, resilient, and multilingual text vectorizer designed for neural-based text processing. RETVec combines a novel character encoding with an optional small embedding model to embed words into a 256-dimensional vector space. The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks. In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks. RETVec is available under the Apache 2 license at https://github.com/google-research/retvec
Conference | Neural Information Processing Systems (NeurIPS), 2023 |
Authors | Elie Bursztein, Marina Zhang, Owen Vallis |
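The pair-wise metric-learning objective mentioned in the abstract can be sketched with a toy example. Everything here is an illustrative assumption, not RETVec's actual design: a hypothetical trigram-hashing character encoder stands in for RETVec's character encoding, and a standard contrastive loss stands in for its training objective.

```python
import zlib
import numpy as np

def char_encode(word, dim=256):
    """Toy character encoder (hypothetical stand-in for RETVec's encoder):
    hash character trigrams of the padded word into a fixed-size vector,
    then L2-normalize so distances are comparable across word lengths."""
    vec = np.zeros(dim)
    padded = f"^{word}$"
    for i in range(len(padded) - 2):
        trigram = padded[i : i + 3]
        # crc32 gives a deterministic hash, unlike Python's salted hash().
        vec[zlib.crc32(trigram.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def pairwise_contrastive_loss(a, b, same_word, margin=0.5):
    """Pair-wise metric-learning objective: pull embeddings of matching
    pairs (e.g. a word and its typo) together, push non-matching pairs
    at least `margin` apart."""
    d = np.linalg.norm(a - b)
    return d ** 2 if same_word else max(0.0, margin - d) ** 2

# A typo'd word stays close to the clean word; an unrelated word is far,
# which is the property the pair-wise training encourages.
clean = char_encode("resilient")
typo = char_encode("resiliant")
other = char_encode("vectorizer")
print(np.linalg.norm(clean - typo) < np.linalg.norm(clean - other))  # True
```

Training with this kind of objective over many (clean word, perturbed word) pairs is what lets the resulting embedding map typo'd and adversarially perturbed inputs near their clean counterparts at inference time.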