Different types of security and safety concerns
- Trying to extract information from a model's neural representations
- Trying to detect whether particular texts were part of the training set (membership inference attacks); see the sketch after this list
- Trying to distill a new model from a bigger one (for example, Vicuna from ChatGPT), exposing large model providers to “model theft”
- Trying to make an LLM output something that is not allowed (jailbreaking)
- Trying to use LLMs for unpermitted purposes (disinformation, essay writing…)
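As a concrete illustration of the membership-attack setting, here is a minimal sketch of a loss-threshold membership inference attack: texts to which the target model assigns an unusually low average token loss are flagged as likely training-set members. The model name and threshold below are illustrative placeholders, not part of any particular published attack.

```python
# Minimal sketch of a loss-threshold membership inference attack:
# low average next-token loss is taken as evidence that the text
# appeared in the model's training data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # stand-in for the target model (assumption)
LOSS_THRESHOLD = 3.0     # illustrative cut-off, in nats per token

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def avg_token_loss(text: str) -> float:
    """Average next-token cross-entropy the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)   # HF shifts the labels internally
    return out.loss.item()

def likely_training_member(text: str) -> bool:
    """Flag texts with suspiciously low loss as probable training data."""
    return avg_token_loss(text) < LOSS_THRESHOLD
```

In practice the threshold is calibrated against reference texts that are known not to be in the training set, rather than fixed by hand as above.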
Watermarking
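One well-known family of LLM watermarking schemes (e.g., Kirchenbauer et al., 2023) biases generation toward a pseudo-random “green” subset of the vocabulary at each step, seeded by the previous token; a detector then checks whether a text contains statistically too many green tokens. The sketch below shows only the detection side, under assumed parameters: the hash function, green fraction, and z-score threshold are illustrative, not taken from any particular implementation.

```python
# Minimal sketch of a "green-list" watermark detector: each token's green
# list is derived from a hash seeded by the previous token, and
# watermarked text shows an unusually high fraction of green tokens.
import hashlib
import math

GREEN_FRACTION = 0.5    # gamma: fraction of the vocabulary marked "green"
Z_THRESHOLD = 4.0       # detection threshold on the z-score (illustrative)

def is_green(prev_token: int, token: int) -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by `prev_token`."""
    h = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < GREEN_FRACTION

def watermark_z_score(token_ids: list[int]) -> float:
    """Z-score of the observed green-token count against the unwatermarked rate."""
    n = len(token_ids) - 1
    if n < 1:
        return 0.0
    green = sum(is_green(p, t) for p, t in zip(token_ids, token_ids[1:]))
    expected = GREEN_FRACTION * n
    variance = GREEN_FRACTION * (1 - GREEN_FRACTION) * n
    return (green - expected) / math.sqrt(variance)

def looks_watermarked(token_ids: list[int]) -> bool:
    return watermark_z_score(token_ids) > Z_THRESHOLD
```

The generation side would add a small bias to the logits of green tokens at every decoding step, so that watermarked text accumulates far more green tokens than chance would predict.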
Trying to make an LLM output something that is not allowed
Information erasure setup
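One way to formalize this setup, under the assumption that “erasure” here means removing an attribute from neural representations, is: given representations and a sensitive attribute, transform the representations so the attribute can no longer be predicted from them. Below is a minimal sketch of a single step of linear erasure in the spirit of INLP (Ravfogel et al., 2020), on synthetic data; the dimensions, the synthetic attribute, and the single-direction projection are simplifying assumptions.

```python
# Minimal sketch of one-step linear information erasure: fit a linear
# probe for the attribute to be erased, then project the representations
# onto the null space of the probe's weight direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                               # stand-in representations
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)   # synthetic attribute to erase

# 1. Fit a linear probe that predicts the attribute from the representations.
probe = LogisticRegression().fit(X, y)
w = probe.coef_ / np.linalg.norm(probe.coef_)   # direction carrying the attribute

# 2. Project every representation onto the null space of that direction.
P = np.eye(X.shape[1]) - w.T @ w                # rank-1 null-space projection
X_erased = X @ P

# 3. A fresh probe on the erased representations should drop toward chance
#    accuracy (a single step; in practice several iterations may be needed).
print("probe accuracy before erasure:", probe.score(X, y))
print("probe accuracy after erasure :",
      LogisticRegression().fit(X_erased, y).score(X_erased, y))
```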