
Toxicity Labels

Understanding the 7 toxicity categories detected by ML analysis

The ML toxicity model detects 7 distinct categories of harmful content. Understanding these labels helps you configure detection thresholds and handle different types of toxicity appropriately.

Overview

| Label | Description |
| --- | --- |
| toxicity | General catch-all for harmful content |
| insult | Direct attacks on individuals |
| threat | Physical or implied threats |
| identity_attack | Attacks based on race, religion, gender, or other identity |
| obscene | Explicit language without a specific target |
| severe_toxicity | Extreme cases of toxicity |
| sexual_explicit | Adult content |
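
Each label can be enabled independently, and analysis results report which labels crossed the detection threshold. Here is a minimal sketch using the same analyze() result shape (isToxic, matchedCategories) as the handling example later on this page:

import { ToxicityDetector } from 'glin-profanity/ml';

// Report which of the 7 labels crossed the threshold for a given message.
async function inspectLabels(text: string) {
  const detector = new ToxicityDetector({ threshold: 0.8 });
  const result = await detector.analyze(text);

  if (result.isToxic) {
    console.log('Matched labels:', result.matchedCategories);
  }
}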

Choosing Labels for Your Use Case

Chat/Gaming Moderation

import { ToxicityDetector } from 'glin-profanity/ml';

// Flag insults, threats, and general toxicity; obscenity alone is tolerated.
const detector = new ToxicityDetector({
  labels: ['insult', 'threat', 'toxicity'],
  threshold: 0.8,
});

Professional/Workplace

const detector = new ToxicityDetector({
  labels: ['insult', 'identity_attack', 'sexual_explicit', 'threat'],
  threshold: 0.85,
});

Family-Friendly Platform

const detector = new ToxicityDetector({
  labels: ToxicityDetector.ALL_LABELS, // All 7 categories
  threshold: 0.8,
});

Safety-Critical (Threats Focus)

const detector = new ToxicityDetector({
  labels: ['threat', 'severe_toxicity'],
  threshold: 0.7, // Lower threshold for safety
});
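
A lower threshold trades more false positives for fewer missed detections, which is usually the right trade-off when an overlooked threat carries real-world risk.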

Handling Different Categories

import { ToxicityDetector } from 'glin-profanity/ml';

async function handleByCategory(text: string) {
  const detector = new ToxicityDetector({ threshold: 0.8 });
  const result = await detector.analyze(text);

  if (!result.isToxic) {
    return { action: 'allow' };
  }

  // Handle by severity
  if (result.matchedCategories.includes('threat') ||
      result.matchedCategories.includes('severe_toxicity')) {
    return {
      action: 'block_and_report',
      escalate: true,
      reason: 'Potential threat detected',
    };
  }

  if (result.matchedCategories.includes('identity_attack')) {
    return {
      action: 'block',
      warn: true,
      reason: 'Hate speech policy violation',
    };
  }

  if (result.matchedCategories.includes('insult')) {
    return {
      action: 'warn',
      reason: 'Please keep discussions respectful',
    };
  }

  // Default handling
  return {
    action: 'flag_for_review',
    categories: result.matchedCategories,
  };
}
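
A possible call site, assuming a message object with text and authorId fields; sendWarning and reportToTrustAndSafety are hypothetical helpers standing in for your own moderation plumbing:

const verdict = await handleByCategory(message.text);

if (verdict.action === 'block_and_report') {
  await reportToTrustAndSafety(message, verdict.reason); // hypothetical helper
} else if (verdict.action === 'warn') {
  await sendWarning(message.authorId, verdict.reason); // hypothetical helper
}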

Threshold Guidelines

| Category | Conservative | Balanced | Sensitive |
| --- | --- | --- | --- |
| toxicity | 0.9 | 0.85 | 0.75 |
| insult | 0.9 | 0.85 | 0.8 |
| threat | 0.8 | 0.7 | 0.6 |
| identity_attack | 0.85 | 0.8 | 0.75 |
| obscene | 0.9 | 0.85 | 0.8 |
| severe_toxicity | 0.95 | 0.9 | 0.85 |
| sexual_explicit | 0.9 | 0.85 | 0.8 |

Start with balanced thresholds and adjust based on your false positive/negative rates in production.
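
The constructor shown on this page takes a single threshold, so one way to apply per-category values is to run one single-label detector per category. The following is a sketch under that assumption; the BALANCED map simply copies the table's middle column, and it assumes matchedCategories reports the label that fired, as in the handling example above:

// One single-label detector per category, each with its own balanced threshold.
const BALANCED: Record<string, number> = {
  toxicity: 0.85,
  insult: 0.85,
  threat: 0.7,
  identity_attack: 0.8,
  obscene: 0.85,
  severe_toxicity: 0.9,
  sexual_explicit: 0.85,
};

const detectors = Object.entries(BALANCED).map(
  ([label, threshold]) => new ToxicityDetector({ labels: [label], threshold }),
);

async function matchedLabels(text: string): Promise<string[]> {
  const results = await Promise.all(detectors.map((d) => d.analyze(text)));
  // Each detector watches one label, so isToxic means that label crossed its threshold.
  return results.flatMap((r) => (r.isToxic ? r.matchedCategories : []));
}

Before shipping this pattern, verify whether multiple detector instances share one underlying model; if they don't, the memory cost of seven instances may matter.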

Model Limitations

  • English-focused: The model is trained primarily on English text
  • Context gaps: May miss sarcasm or context-dependent meaning
  • Cultural bias: Training data reflects Western internet culture
  • New slang: May not catch recently coined terms

For comprehensive coverage, combine ML with rule-based detection using HybridFilter.
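
HybridFilter packages that combination for you. As an illustration of the idea only (not HybridFilter's actual API), a hand-rolled version might look like this, with the blocklist as a placeholder:

import { ToxicityDetector } from 'glin-profanity/ml';

// Illustrative only: a word list catches newly coined terms the model misses,
// while the ML model catches toxic phrasing no word list can enumerate.
const BLOCKLIST = new Set(['examplebadword']); // placeholder; maintain your own

const detector = new ToxicityDetector({ threshold: 0.8 });

async function hybridCheck(text: string): Promise<boolean> {
  const ruleHit = text.toLowerCase().split(/\W+/).some((w) => BLOCKLIST.has(w));
  if (ruleHit) return true;

  const result = await detector.analyze(text);
  return result.isToxic;
}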

Cross-References