Investigating Privacy Leakage in Text Embeddings Through Inversion
“How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when reembedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.” https://arxiv.org/pdf/2310.06816v1.pdf
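
To make the "correct and re-embed" loop from the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's method: the abstract describes a trained model, while this sketch swaps in a plain greedy search. The vocabulary `VOCAB`, the position-weighted bag-of-words `embed` function, and the `invert` helper are all hypothetical names invented for illustration. The only idea carried over is the control loop: propose an edit, re-embed the hypothesis, and keep the edit if it moves closer to the target embedding.

```python
import math

# Hypothetical toy vocabulary; a real embedding model has a far larger one.
VOCAB = ["the", "patient", "doctor", "saw", "alice", "bob", "on", "monday", "friday"]


def embed(tokens):
    """Toy stand-in for an embedding model (assumption, not a real encoder).

    Builds a position-weighted bag-of-words vector and L2-normalizes it.
    """
    vec = [0.0] * len(VOCAB)
    for pos, tok in enumerate(tokens):
        vec[VOCAB.index(tok)] += 1.0 + 0.1 * pos
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def invert(target, length, max_rounds=10):
    """Greedy search standing in for the paper's learned corrector.

    Starts from a fixed guess, tries single-token substitutions, re-embeds
    each candidate, and keeps a substitution only if it raises the cosine
    similarity to the target embedding.
    """
    hypothesis = [VOCAB[0]] * length
    best = cosine(embed(hypothesis), target)
    for _ in range(max_rounds):
        improved = False
        for pos in range(length):
            for tok in VOCAB:
                candidate = hypothesis[:pos] + [tok] + hypothesis[pos + 1:]
                score = cosine(embed(candidate), target)
                if score > best + 1e-12:
                    hypothesis, best, improved = candidate, score, True
        if not improved:
            break  # no single-token edit helps any more
    return hypothesis, best


if __name__ == "__main__":
    secret = ["the", "patient", "saw", "alice"]
    guess, score = invert(embed(secret), len(secret))
    print(guess, round(score, 4))
```

The attacker here sees only `embed(secret)` and the text length, never the tokens themselves. A greedy search like this can stall in a local optimum and is not guaranteed to recover the input exactly, which is presumably part of why the paper relies on a trained model for the correction step.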
