RETSim: Resilient and Efficient Text Similarity

Marina Zhang; Owen Vallis; Aysegul Bumin; Tanay Vakharia; Elie Bursztein

This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate how to use RETSim's retrieval capability to build a local LLM RAG system in this post.

In the paper, we demonstrate through comprehensive evaluation that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License as part of the UniSim package, available at https://github.com/google/unisim.
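To make the idea of near-duplicate detection with metric embeddings concrete, here is a minimal sketch of the general workflow: embed each text as a unit vector, then flag pairs whose cosine similarity exceeds a threshold. This is not RETSim and does not use the UniSim API. The `toy_embed` function (hashed character trigram counts), the `near_duplicates` helper, and the threshold value are all hypothetical stand-ins chosen so the example runs with the standard library only. A trained model such as RETSim would replace `toy_embed` in practice.

```python
import hashlib
import math
from collections import Counter


def toy_embed(text, dim=512, n=3):
    """Stand-in embedding (NOT RETSim): hashed character n-gram counts, L2-normalized."""
    padded = " " + text.lower() + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    vec = [0.0] * dim
    for gram, count in counts.items():
        # Use a stable hash so bucket assignment does not vary between runs.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (reduces to a dot product)."""
    return sum(x * y for x, y in zip(a, b))


def near_duplicates(texts, threshold=0.8):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    embeddings = [toy_embed(t) for t in texts]
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            sim = cosine(embeddings[i], embeddings[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs


if __name__ == "__main__":
    docs = [
        "RETSim produces robust metric embeddings",
        "RETSim produces robst metric embedings",  # typo-perturbed near-duplicate
        "A completely different sentence about cooking pasta",
    ]
    for i, j, sim in near_duplicates(docs, threshold=0.6):
        print(f"docs[{i}] ~ docs[{j}] (cosine = {sim:.2f})")
```

The all-pairs loop is quadratic and only suitable for small collections. Larger deduplication jobs would typically use an approximate nearest-neighbor index over the embeddings instead.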

Available Media	Publication (PDF)
Conference	International Conference on Learning Representations (ICLR) - 2024
Authors	Marina Zhang, Owen Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein
Citation	BibTeX Citation
@inproceedings{NANRETSIM:,
  title = {RETSim: Resilient and Efficient Text Similarity},
  author = {Marina Zhang and Owen Vallis and Aysegul Bumin and Tanay Vakharia and Elie Bursztein},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  organization = {ICLR}
}
