RETSim: Resilient and Efficient Text Similarity


Conference: International Conference on Learning Representations (ICLR), 2024
Authors: Marina Zhang, Owen Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein
BibTeX Citation

@inproceedings{zhang2024retsim,
  title = {RETSim: Resilient and Efficient Text Similarity},
  author = {Marina Zhang and Owen Vallis and Aysegul Bumin and Tanay Vakharia and Elie Bursztein},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  organization = {ICLR}
}

This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We also demonstrate how to use RETSim's retrieval capability to build a local LLM RAG system in a companion post.

In the paper, through comprehensive evaluation, we demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual near-duplicate text retrieval under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License as part of the UniSim package, available at https://github.com/google/unisim.
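To illustrate how metric embeddings support near-duplicate retrieval, the sketch below embeds texts into unit-norm vectors and ranks candidates by cosine similarity. Note the `embed` function here is a toy character-trigram hashing stand-in, not RETSim's learned model, and the `near_duplicates` helper and its 0.8 threshold are illustrative assumptions; in practice you would obtain embeddings from the UniSim package.

```python
# Sketch of near-duplicate retrieval over metric embeddings.
# ASSUMPTION: `embed` is a toy character-trigram hashing embedding,
# standing in for RETSim's learned multilingual embeddings.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash character trigrams into a unit-norm vector."""
    v = np.zeros(dim)
    t = f"  {text.lower()}  "  # pad so boundary trigrams are captured
    for i in range(len(t) - 2):
        v[hash(t[i : i + 3]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def near_duplicates(query: str, corpus: list[str], threshold: float = 0.8):
    """Return (text, similarity) pairs above `threshold`, best first."""
    q = embed(query)
    embs = np.stack([embed(c) for c in corpus])
    sims = embs @ q  # cosine similarity, since all vectors are unit-norm
    order = np.argsort(-sims)
    return [(corpus[i], float(sims[i])) for i in order if sims[i] >= threshold]

corpus = [
    "RETSim is a lightweight multilingual text similarity model.",
    "RETSim is a lightw3ight multilingual text similar1ty model!",  # adversarial typos
    "MinHash is a classic technique for set similarity.",
]
matches = near_duplicates(
    "RETSim is a lightweight multilingual text similarity model.", corpus
)
```

The exact duplicate ranks first, and the adversarially perturbed copy still scores high because most trigrams survive the edits, while the unrelated MinHash sentence falls below the threshold; RETSim's learned embeddings are trained to provide this robustness far beyond what trigram hashing can.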
