Different types of security and safety concerns

  1. Trying to extract information from the model's neural representations.
  2. Trying to detect whether specific examples were part of the training set (membership inference attacks; see the sketch after this list).
  3. Trying to distill a new model from a bigger one (for example, Vicuna from ChatGPT), exposing the companies behind the big models to “model theft”.
  4. Trying to make an LLM output something that is not allowed.
  5. Trying to use LLMs for unpermitted purposes (disinformation, essay writing…).
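
A minimal sketch of concern 2, membership inference by loss thresholding: the attacker scores a candidate text by how easily the target model predicts it and flags unusually low-loss texts as likely training members. The library (Hugging Face transformers), the stand-in model (GPT-2), and the threshold below are assumptions for illustration, not details from these notes.

```python
# Loss-based membership-inference sketch. Assumptions: the `transformers`
# library, GPT-2 as a stand-in target model, and an illustrative threshold
# (a real attack would calibrate it on texts known to be non-members).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_loss(text: str) -> float:
    """Average per-token cross-entropy the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

def likely_training_member(text: str, threshold: float = 3.0) -> bool:
    """Texts the model predicts suspiciously well are flagged as members."""
    return sequence_loss(text) < threshold

print(likely_training_member("The quick brown fox jumps over the lazy dog."))
```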

Watermarking
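
One way to address concern 5 is to watermark model outputs so that generated text can be detected later. The sketch below assumes one common family of schemes (a “green list” logit bias in the style of Kirchenbauer et al., 2023), with Hugging Face transformers and GPT-2 as a stand-in model; the notes do not specify a scheme, so every name and parameter here is illustrative.

```python
# "Green list" watermark sketch in the style of Kirchenbauer et al. (2023).
# Assumptions: Hugging Face `transformers`, GPT-2 as a stand-in model, and
# illustrative gamma/delta values; none of these come from the notes.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

GAMMA, DELTA = 0.5, 4.0  # green-list fraction of the vocabulary, logit bias

def green_mask(prev_token: int, vocab_size: int) -> torch.Tensor:
    """Pseudorandom green/red split of the vocabulary, seeded by the previous token."""
    gen = torch.Generator().manual_seed(prev_token)
    perm = torch.randperm(vocab_size, generator=gen)
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[perm[: int(GAMMA * vocab_size)]] = True
    return mask

class GreenListProcessor(LogitsProcessor):
    """At each decoding step, nudge the model toward the green tokens."""
    def __call__(self, input_ids, scores):
        for i in range(scores.shape[0]):
            mask = green_mask(int(input_ids[i, -1]), scores.shape[-1])
            scores[i, mask] += DELTA
        return scores

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = tokenizer("The weather today is", return_tensors="pt").input_ids
out = model.generate(prompt, max_new_tokens=40, do_sample=True,
                     logits_processor=LogitsProcessorList([GreenListProcessor()]),
                     pad_token_id=tokenizer.eos_token_id)

# Detection: watermarked text contains far more green tokens than the ~GAMMA
# fraction expected from an unwatermarked source.
generated = out[0, prompt.shape[1]:].tolist()
prev, green = int(prompt[0, -1]), 0
for tok in generated:
    green += int(green_mask(prev, model.config.vocab_size)[tok])
    prev = tok
print(f"green fraction: {green / len(generated):.2f} (unwatermarked baseline ~{GAMMA})")
```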

Trying to make an LLM output something that is not allowed

Information erasure setup