Llama Guard 3 is a series of models fine-tuned for content safety classification of LLM inputs and responses.

1b 8b

12.4K 2 months ago

Readme

Llama Guard 3 as an LLM that generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

Llama Guard 3 was aligned to safeguard against the MLCommons standardized hazards taxonomy and designed to support Llama 3.1 capabilities. Specifically, it provides content moderation in 8 languages, and was optimized to support safety and security for search and code interpreter tool calls.

Sizes

Llama Guard 3 is available in two sizes:

  • Llama Guard 3-1B: ollama run llama-guard3:1b
  • Llama Guard 3-8b: ollama run llama-guard3:8b (default)

Examples

Input classification

% ollama run llama-guard3
>>> Tell me how to go to the zoo and steal a llama.
unsafe
S2

Response classification

Llama Guard 3 can also classify LLM responses to ensure they are safe before being presented to customers. This can be done via Ollama’s chat completion API:

curl http://localhost:11434/api/chat -d '{
  "model": "llama-guard3:1b",
  "messages": [
    {
      "role": "user",
      "content": "How can I adopt my own llama?"
    },
    {
      "role": "assistant",
      "content": "Go to the zoo and steal one!"
    }
  ],
  "stream": false
}'

The LLM will respond with a message stating if the last assistant message was safe or unsafe. in this case, the message is unsafe and classified as S2

unsafe
S2

Hazard Taxonomy and Policy

The model is trained to predict safety labels on the 13 categories shown below, based on the MLCommons taxonomy of 13 hazards.

Hazard categories
S1: Violent Crimes S2: Non-Violent Crimes
S3: Sex-Related Crimes S4: Child Sexual Exploitation
S5: Defamation S6: Specialized Advice
S7: Privacy S8: Intellectual Property
S9: Indiscriminate Weapons S10: Hate
S11: Suicide & Self-Harm S12: Sexual Content
S13: Elections

S1 - Violent Crimes

Responses that enable, encourage, or endorse the commission of violent crimes, including: (1) unlawful violence toward people (ex: terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, kidnapping) and (2) unlawful violence toward animals (ex: animal abuse)

S2 - Non-Violent Crimes

Responses that enable, encourage, or endorse the commission of non-violent crimes, including: (1) personal crimes (ex: labor trafficking, threats, intimidation, hate-motivated non-violent crimes), (2) financial crimes (ex: fraud, scams, money laundering), (3) property crimes (ex: theft, arson, vandalism), (4) drug crimes (ex: creating or using regulated substances), (5) weapons crimes (ex: producing unlicensed firearms), and (6) cyber crimes (ex: hacking)

S3 - Sex-Related Crimes

Responses that enable, encourage, or endorse the commission of sex-related crimes, including: (1) sex trafficking, (2) sexual assault (ex: rape), (3) sexual harassment (physical, verbal, or visual), and (4) prostitution

S4 - Child Sexual Exploitation

Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children

S5 - Defamation

Responses that are both verifiably false and likely to injure a living person’s reputation

S6 - Specialized Advice

Responses that contain specialized financial, medical, or legal advice, or that indicate dangerous activities or objects are safe

S7 - Privacy

Responses that contain sensitive, nonpublic personal information that could undermine someone’s physical, digital, or financial security

S8 - Intellectual Property

Responses that may violate the intellectual property rights of any third party

S9 - Indiscriminate Weapons

Responses that enable, encourage, or endorse the creation of indiscriminate weapons, including: (1) chemical weapons (ex: nerve gas), (2) biological weapons (ex: anthrax), (3) radiological weapons (ex: cobalt bombs), (4) nuclear weapons (ex: fission bombs), and (5) high-yield explosive weapons (ex: cluster munitions)

S10 - Hate

Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics (i.e., race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and/or serious disease)

S11 - Suicide & Self-Harm

Responses that enable, encourage, or endorse acts of intentional self-harm, including: (1) suicide, (2) self-injury (ex: cutting), and (3) disordered eating

S12 - Sexual Content

Responses that contain erotica

S13 - Elections

Responses that contain factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections

Supported languages

Llama Guard 3-1B supports content safety for the following languages: English, French, German, Hindi, Italian, Portuguese, Spanish, Thai.

References

Hugging Face