Language dictionaries: structure, customization, and extension with new words and languages

This guide covers Glin-Profanity's multi-language dictionary system: how to extend dictionaries with custom words, add new languages, and manage language-specific configuration for production deployments.

Glin-Profanity uses a shared dictionary system with 23 supported languages, allowing for consistent profanity detection across JavaScript and Python implementations.

Dictionary Structure

The dictionary system is organized in a shared structure that both JavaScript and Python implementations access for consistent behavior across platforms.

Project Organization

english.json

spanish.json

french.json

german.json

italian.json

portuguese.json

russian.json

japanese.json

korean.json

chinese.json

arabic.json

hindi.json

turkish.json

polish.json

dutch.json

swedish.json

norwegian.json

danish.json

finnish.json

czech.json

hungarian.json

persian.json

thai.json

globalWhitelist.json

metadata.json

Supported Languages

Glin-Profanity supports 23 languages with comprehensive profanity dictionaries maintained by the community:

Category	Count	Languages
European Languages	15 languages	english, spanish, french, german, italian, portuguese, russian, polish, dutch, swedish, norwegian, danish, finnish, czech, hungarian
Asian Languages	6 languages	japanese, korean, chinese, arabic, hindi, thai
Middle Eastern	2 languages	persian, turkish
Constructed Languages	1 language	esperanto
Special Files	System files	globalWhitelist.json, metadata.json

Supported Language Codes

// All 23 supported languages (alphabetical order)
type Language = 
  | "arabic" | "chinese" | "czech" | "danish" | "english" 
  | "esperanto" | "finnish" | "french" | "german" | "hindi" 
  | "hungarian" | "italian" | "japanese" | "korean" | "norwegian" 
  | "persian" | "polish" | "portuguese" | "russian" | "spanish" 
  | "swedish" | "thai" | "turkish"

Dictionary File Format

Each language dictionary follows a consistent JSON format for easy maintenance and updates:

english.json (Example)

[
  "damn",
  "shit", 
  "fuck",
  "bitch",
  "ass",
  "bastard",
  "hell",
  "crap"
]

Format Requirements:

JSON array of strings
Lowercase entries only
No duplicates within language
Alphabetical ordering preferred (not required)
UTF-8 encoding for international characters
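
The requirements above can be checked mechanically before a dictionary file is committed. The following is a minimal sketch, not part of Glin-Profanity itself; the function name `validate_dictionary` is hypothetical, and it treats alphabetical order as a note rather than an error because ordering is only preferred.

```python
import json


def validate_dictionary(raw_json: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file passes."""
    data = json.loads(raw_json)

    if not isinstance(data, list) or not all(isinstance(w, str) for w in data):
        return ["top-level value must be a JSON array of strings"]

    problems: list[str] = []
    seen: set[str] = set()
    for word in data:
        if word != word.lower():
            problems.append(f"not lowercase: {word!r}")
        if word in seen:
            problems.append(f"duplicate entry: {word!r}")
        seen.add(word)

    # Alphabetical order is preferred, not required, so report it as a note only.
    if data != sorted(data):
        problems.append("note: entries are not in alphabetical order")
    return problems


# Hypothetical sample input with one uppercase entry and one duplicate.
for problem in validate_dictionary('["damn", "Damn", "damn"]'):
    print(problem)
```

When reading a real file, open it with `encoding="utf-8"` and pass the contents to the function, so the UTF-8 requirement is exercised as well.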

spanish.json (Extended Example)

[
  "cabron",
  "cabrón",
  "coño", 
  "hijo de puta",
  "joder",
  "maldito",
  "mierda",
  "pendejo",
  "puto"
]

Advanced Features:

Unicode Support: Accented characters (á, é, í, ó, ú, ñ)
Multi-word Phrases: "hijo de puta" (Spanish), "va te faire foutre" (French)
Regional Variants: Both "cabron" and "cabrón" included
Cultural Context: Language-specific profanity patterns
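
To illustrate why multi-word phrases and accented characters need care, here is a small sketch of whole-phrase matching. It is an assumption about how such matching could be done, not a description of Glin-Profanity's matcher; `contains_phrase` is a hypothetical helper.

```python
import re


def contains_phrase(text: str, phrase: str) -> bool:
    """Case-insensitive whole-phrase match that tolerates extra whitespace between words."""
    words = [re.escape(w) for w in phrase.split()]
    # (?<!\w) and (?!\w) act as word boundaries; in Python 3 str patterns,
    # \w is Unicode-aware, so letters such as "ñ" and "ó" count as word characters.
    pattern = r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_phrase("eres un hijo  de puta", "hijo de puta"))  # True
print(contains_phrase("hijos de nadie", "hijo de puta"))         # False
```

Matching the phrase as a unit avoids flagging each harmless component word ("hijo", "de") on its own.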

metadata.json (Dictionary Information)

{
  "version": "2.3.2",
  "lastUpdated": "2024-08-01",
  "languages": {
    "english": {
      "wordCount": 847,
      "contributors": ["community", "moderators"],
      "lastModified": "2024-07-15",
      "encoding": "utf-8",
      "notes": "Comprehensive English profanity dictionary"
    },
    "spanish": {
      "wordCount": 523,
      "contributors": ["native-speakers", "linguists"],
      "lastModified": "2024-06-20",
      "encoding": "utf-8", 
      "notes": "Includes Latin American and European variants"
    }
  },
  "globalWhitelist": {
    "wordCount": 156,
    "purpose": "Context-aware false positive reduction",
    "categories": ["gaming", "movies", "products", "technical"]
  }
}
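
Because `wordCount` is stored separately from the word lists, the two can drift apart. The sketch below shows one way to detect that; it assumes the metadata layout shown above and takes already-loaded dictionaries as input. The function name is hypothetical.

```python
from typing import Optional


def find_word_count_mismatches(
    metadata: dict, dictionaries: dict[str, list[str]]
) -> dict[str, tuple[Optional[int], int]]:
    """Map each language to (declared, actual) where metadata and dictionary disagree."""
    mismatches: dict[str, tuple[Optional[int], int]] = {}
    for language, info in metadata.get("languages", {}).items():
        declared = info.get("wordCount")
        actual = len(dictionaries.get(language, []))
        if declared != actual:
            mismatches[language] = (declared, actual)
    return mismatches


# Tiny illustrative data, not the real dictionaries.
metadata = {"languages": {"english": {"wordCount": 3}, "spanish": {"wordCount": 5}}}
dictionaries = {"english": ["crap", "damn", "hell"], "spanish": ["joder", "mierda"]}
print(find_word_count_mismatches(metadata, dictionaries))  # {'spanish': (5, 2)}
```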

Adding Custom Words

Extend existing language dictionaries with domain-specific or updated profanity terms using configuration options:

Configure Custom Words

Add custom profanity terms to your filter configuration without modifying core dictionary files:

JavaScript Custom Words

import { Filter } from 'glin-profanity';

const customFilter = new Filter({
  languages: ['english', 'spanish'],
  customWords: [
    // Gaming-specific terms
    'noob', 'scrub', 'pwned', 'rekt',
    
    // Internet slang
    'simp', 'karen', 'chad',
    
    // Domain-specific profanity
    'corporate-buzzword', 'synergy-bs'
  ],
  
  // Additional configuration
  severityLevels: true,
  allowObfuscatedMatch: true
});

// Test custom words
console.log(customFilter.isProfane('That noob got rekt!')); // true
console.log(customFilter.isProfane('Stop being a simp'));   // true

Python Custom Words

from glin_profanity import Filter

custom_filter = Filter({
    "languages": ["english", "spanish"],
    "custom_words": [
        # Gaming-specific terms
        "noob", "scrub", "pwned", "rekt",
        
        # Internet slang
        "simp", "karen", "chad",
        
        # Domain-specific profanity  
        "corporate-buzzword", "synergy-bs"
    ],
    
    # Additional configuration
    "severity_levels": True,
    "allow_obfuscated_match": True
})

# Test custom words
print(custom_filter.is_profane("That noob got rekt!"))  # True
print(custom_filter.is_profane("Stop being a simp"))    # True

Validate Custom Word Integration

Test that custom words integrate properly with existing dictionary detection:

Custom Word Validation

import { Filter } from 'glin-profanity';

const filter = new Filter({
  languages: ['english'],
  customWords: ['noob', 'scrub', 'rekt'],
  severityLevels: true,
  replaceWith: '***'
});

// Test mixed content (dictionary + custom words)
const testCases = [
  'You damn noob!',           // Dictionary: damn, Custom: noob
  'That was fucking rekt',    // Dictionary: fucking, Custom: rekt  
  'Stop being a scrub',       // Custom: scrub only
  'This is normal text'       // No profanity
];

testCases.forEach(text => {
  const result = filter.checkProfanity(text);
  console.log(`Text: "${text}"`);
  console.log(`- Profanity detected: ${result.containsProfanity}`);
  console.log(`- Detected words: ${result.profaneWords.join(', ')}`);
  console.log(`- Processed: ${result.processedText}`);
  console.log('---');
});

Implement Ignore Lists

Use ignore lists to exclude words that might be false positives in your domain:

Custom Ignore Lists

import { Filter } from 'glin-profanity';

const contextualFilter = new Filter({
  languages: ['english'],
  customWords: ['kill', 'dead', 'murder'], // Potentially problematic in gaming
  ignoreWords: [
    // Gaming context exceptions
    'kill',      // "kill the enemy" in gaming
    'dead',      // "dead zone" in networking
    'execution', // "code execution" in programming
    
    // Professional context exceptions  
    'master',    // "master branch", "master key"
    'slave',     // "slave server", "master-slave"
    'penetration' // "penetration testing" in security
  ]
});

// These won't trigger profanity detection
console.log(contextualFilter.isProfane('Kill the enemy boss'));      // false
console.log(contextualFilter.isProfane('Master-slave architecture')); // false
console.log(contextualFilter.isProfane('Penetration testing'));       // false

// But these still will
console.log(contextualFilter.isProfane('That was murder'));           // true (custom word "murder" is not ignored)
console.log(contextualFilter.isProfane('You fucking idiot'));         // true (profanity)
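
The example lists `kill` and `dead` in both the custom words and the ignore list and expects them not to be flagged, which implies that the ignore list takes precedence. That precedence can be pictured as a set difference. The sketch below is an assumption for illustration, not the library's internal logic; `effective_words` is a hypothetical name.

```python
def effective_words(
    dictionary: list[str], custom_words: list[str], ignore_words: list[str]
) -> set[str]:
    """Dictionary plus custom words, minus ignored words (compared case-insensitively)."""
    ignored = {w.lower() for w in ignore_words}
    return {w.lower() for w in [*dictionary, *custom_words]} - ignored


words = effective_words(
    dictionary=["damn", "crap"],
    custom_words=["kill", "dead", "murder"],
    ignore_words=["kill", "dead", "execution"],
)
print(sorted(words))  # ['crap', 'damn', 'murder']
```

Ignored words that never appear in the dictionary or custom list, such as `execution` here, simply have no effect.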

Adding New Languages

Extend Glin-Profanity with additional language support by following the established dictionary format:

Create Language Dictionary File

Create a new JSON dictionary file following the established format:

new-language.json (Template)

[
  "profane-word-1",
  "profane-word-2", 
  "profane-word-3",
  "multi word phrase",
  "unicode-characters-ñáéíóú"
]

File Creation Guidelines:

Use lowercase for all entries
Include accented/special characters where appropriate
Add multi-word phrases for comprehensive coverage
Sort alphabetically for maintainability
Ensure UTF-8 encoding
Target 100-500+ words for comprehensive coverage
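
Several of these guidelines (lowercase, sorted, no duplicates, UTF-8) can be applied automatically when turning a raw word list into a dictionary file. A minimal sketch follows; the helper name `build_dictionary_json` is hypothetical.

```python
import json


def build_dictionary_json(raw_words: list[str]) -> str:
    """Lowercase, trim, de-duplicate, and sort words, then serialize as a JSON array."""
    cleaned = sorted({w.strip().lower() for w in raw_words if w.strip()})
    # ensure_ascii=False keeps characters such as "ñ" readable instead of \u escapes.
    return json.dumps(cleaned, ensure_ascii=False, indent=2)


# Mixed case, stray whitespace, a duplicate, and an empty entry are all cleaned up.
dictionary_text = build_dictionary_json(["Mierda", "joder ", "coño", "mierda", ""])
```

Write the result with an explicit encoding, for example `Path("new-language.json").write_text(dictionary_text, encoding="utf-8")`, so the file is UTF-8 regardless of the platform default.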

Update Language Type Definitions

Add your new language to the supported language types:

types.ts (JavaScript/TypeScript)

// Add new language to the Language union type
type Language = 
  | "arabic" | "chinese" | "czech" | "danish" | "english" 
  | "esperanto" | "finnish" | "french" | "german" | "hindi" 
  | "hungarian" | "italian" | "japanese" | "korean" | "norwegian" 
  | "persian" | "polish" | "portuguese" | "russian" | "spanish" 
  | "swedish" | "thai" | "turkish"
  | "your-new-language"  // Add your language here

types.py (Python)

# Add new language to the Language Literal type
Language = Literal[
    "arabic", "chinese", "czech", "danish", "english", "esperanto", 
    "finnish", "french", "german", "hindi", "hungarian", "italian", 
    "japanese", "korean", "norwegian", "persian", "polish", 
    "portuguese", "russian", "spanish", "swedish", "thai", "turkish",
    "your-new-language"  # Add your language here
]
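
A `Literal` type is only enforced by static type checkers. If language names arrive at runtime (for example from a config file), the same type can drive a runtime check via `typing.get_args`. This is a sketch with a shortened `Language` type for brevity; `unsupported` is a hypothetical helper.

```python
from typing import Literal, get_args

# Shortened for illustration; the real Literal lists every supported language.
Language = Literal["english", "spanish", "french"]

# get_args returns the literal values as a tuple of strings.
SUPPORTED_LANGUAGES: frozenset[str] = frozenset(get_args(Language))


def unsupported(requested: list[str]) -> list[str]:
    """Return the requested language names that are missing from the Literal type."""
    return [name for name in requested if name not in SUPPORTED_LANGUAGES]


print(unsupported(["english", "klingon"]))  # ['klingon']
```

Deriving the set from the type keeps a single source of truth, so adding a language to the `Literal` updates the runtime check as well.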

Update Dictionary Loaders

Modify the dictionary loading system to include your new language:

dictionary.ts (JavaScript)

const dictionary: Record<Language, string[]> = {
  arabic: require('../../../shared/dictionaries/arabic.json'),
  chinese: require('../../../shared/dictionaries/chinese.json'),
  // ... other languages
  turkish: require('../../../shared/dictionaries/turkish.json'),
  // Add your language, loaded the same way as the existing entries
  'your-new-language': require('../../../shared/dictionaries/your-new-language.json')
};

dictionary.py (Python)

class DictionaryLoader:
    def __init__(self) -> None:
        self._dictionaries = {
            "arabic": self._load_dictionary("arabic"),
            "chinese": self._load_dictionary("chinese"),
            # ... other languages
            "turkish": self._load_dictionary("turkish"),
            "your-new-language": self._load_dictionary("your-new-language")  # Add your language
        }

Test New Language Integration

Verify that your new language works correctly with the profanity detection system:

New Language Testing

import { Filter } from 'glin-profanity';

// Test single language
const newLanguageFilter = new Filter({
  languages: ['your-new-language'],
  severityLevels: true
});

// Test multi-language with new language
const multiLanguageFilter = new Filter({
  languages: ['english', 'spanish', 'your-new-language'],
  allowObfuscatedMatch: true
});

// Test cases for your new language
const testCases = [
  'Clean text in your language',
  'Text with profane-word-1 in it',
  'Multiple profane-word-1 and profane-word-2 words',
  'Mixed english damn and your-language profanity'
];

testCases.forEach(text => {
  const result = multiLanguageFilter.checkProfanity(text);
  console.log(`Text: "${text}"`);
  console.log(`- Profanity: ${result.containsProfanity}`);
  console.log(`- Words: ${result.profaneWords.join(', ')}`);
  console.log('---');
});

Update Metadata and Documentation

Add your new language to the metadata files and documentation:

metadata.json Update

{
  "version": "2.4.0",
  "languages": {
    // ... existing languages
    "your-new-language": {
      "wordCount": 234,
      "contributors": ["your-name", "community"],
      "lastModified": "2024-08-01",
      "encoding": "utf-8",
      "notes": "Comprehensive profanity detection for Your Language"
    }
  }
}

Documentation Updates:

Update language count in README (24 languages instead of 23)
Add language to supported languages list
Include usage examples in documentation
Add any cultural or linguistic notes for maintainers

JavaScript Path Resolution

// packages/js/src/data/dictionary.ts
const dictionaryPath = '../../../shared/dictionaries/';

const dictionary: Record<Language, string[]> = {
  english: require(`${dictionaryPath}english.json`),
  spanish: require(`${dictionaryPath}spanish.json`),
  // ... other languages
};

Python Path Resolution

# packages/py/glin_profanity/data/dictionary.py
import json
from pathlib import Path

class DictionaryLoader:
    def __init__(self) -> None:
        # Resolve shared dictionary path
        self.dictionary_path = Path(__file__).parent.parent.parent.parent / "shared" / "dictionaries"
        
    def _load_dictionary(self, language: str) -> list[str]:
        dictionary_file = self.dictionary_path / f"{language}.json"
        with open(dictionary_file, 'r', encoding='utf-8') as f:
            return json.load(f)

Encoding Consistency

Both implementations must handle Unicode characters consistently:

Unicode Handling (JavaScript)

// Proper Unicode normalization for consistent matching
const normalizeText = (text) => {
  return text
    .normalize('NFD')  // Decompose accented characters
    .replace(/\p{Mn}/gu, ''); // Remove all nonspacing marks (the same category the Python version strips)
};

// Example: "café" and "cafe" both normalize to "cafe"
console.log(normalizeText('café')); // "cafe"
console.log(normalizeText('naïve')); // "naive"

Unicode Handling (Python)

import unicodedata

def normalize_text(text: str) -> str:
    """Normalize Unicode text for consistent matching."""
    # Decompose accented characters and remove combining marks
    normalized = unicodedata.normalize('NFD', text)
    return ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

# Example: "café" and "cafe" both normalize to "cafe"  
print(normalize_text('café'))   # "cafe"
print(normalize_text('naïve'))  # "naive"

TypeWeaver Dictionary API

Access dictionary updates and statistics through the official TypeWeaver API:

API Endpoints

Check for Dictionary Updates

// Check current version and available updates
const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/version');
const versionInfo = await response.json();

console.log(`Current: ${versionInfo.currentVersion}`);
console.log(`Latest: ${versionInfo.latestVersion}`);

if (versionInfo.updateAvailable) {
  console.log('Updates available:');
  versionInfo.changelog.forEach(change => console.log(`- ${change}`));
  
  // Updated languages
  console.log('Updated languages:', versionInfo.languages.updated);
  console.log('Word counts:', versionInfo.languages.wordCounts);
}

Response Format:

{
  "currentVersion": "2.3.2",
  "latestVersion": "2.4.0",
  "updateAvailable": true,
  "changelog": ["Added 15 new Spanish terms", "Fixed Unicode handling"],
  "languages": {
    "total": 23,
    "updated": ["spanish", "french", "chinese"],
    "wordCounts": { "english": 847, "spanish": 538 }
  }
}
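
The response already carries an `updateAvailable` flag, but a client may also want to compare `latestVersion` against the version it actually has installed. The sketch below assumes plain dotted numeric versions like those in the sample response; both function names are hypothetical.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn "2.3.2" into (2, 3, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def update_available(current: str, latest: str) -> bool:
    """True when `latest` is strictly newer than `current`."""
    return parse_version(latest) > parse_version(current)


print(update_available("2.3.2", "2.4.0"))   # True
print(update_available("2.9.0", "2.10.0"))  # True; a plain string comparison would say False
```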

Download Dictionary Files

// Download all dictionaries as ZIP
const downloadAll = async () => {
  const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/download?type=full&format=zip');
  const blob = await response.blob();
  
  // Save to file (browser)
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = 'glin-profanity-dictionaries.zip';
  a.click();
  URL.revokeObjectURL(url);
};

// Download specific languages
const downloadLanguages = async (languages) => {
  const params = new URLSearchParams({
    type: 'single',
    languages: languages.join(','),
    format: 'json'
  });
  
  const response = await fetch(`https://typeweaver.com/api/glin-profanity/dictionary/download?${params}`);
  const data = await response.json();
  
  return data.dictionaries;
};

// Example usage
await downloadAll();
const dictionaries = await downloadLanguages(['english', 'spanish', 'french']);

Available Parameters:

type: full, incremental, single
languages: comma-separated language codes
format: zip, json
fromVersion: for incremental updates
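
As a sketch of how these parameters combine, the helper below builds a download URL for the endpoint used in the examples above. Whether the server accepts `languages` together with `type=incremental` is an assumption; the helper name and its argument names are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the download examples above.
BASE_URL = "https://typeweaver.com/api/glin-profanity/dictionary/download"


def build_download_url(update_type, languages=None, fmt="json", from_version=None):
    """Assemble the query string from the documented parameters, skipping unset ones."""
    params = {"type": update_type, "format": fmt}
    if languages:
        params["languages"] = ",".join(languages)
    if from_version:
        params["fromVersion"] = from_version
    # urlencode percent-encodes the values, so the comma becomes %2C.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_download_url("incremental", ["spanish", "french"], from_version="2.3.2"))
```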

Dictionary Statistics API

// Get comprehensive statistics
const getStats = async () => {
  const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/stats');
  const stats = await response.json();
  
  console.log(`Total Languages: ${stats.overview.totalLanguages}`);
  console.log(`Total Words: ${stats.overview.totalWords}`);
  console.log(`Quality Score: ${stats.quality.overallScore}%`);
  
  // Language breakdown
  Object.entries(stats.languageBreakdown).forEach(([lang, data]) => {
    console.log(`${lang}: ${data.wordCount} words (${data.coverage})`);
  });
};

// Get specific language stats
const getLanguageStats = async (language) => {
  const response = await fetch(`https://typeweaver.com/api/glin-profanity/dictionary/stats?language=${encodeURIComponent(language)}`);
  const stats = await response.json();
  
  return {
    language: stats.language,
    wordCount: stats.wordCount,
    coverage: stats.coverage,
    globalRanking: stats.globalRanking,
    percentageOfTotal: stats.percentageOfTotal
  };
};

// Example usage
await getStats();
const englishStats = await getLanguageStats('english');
console.log(`English: #${englishStats.globalRanking} with ${englishStats.wordCount} words`);

API Features

No Authentication Required: Public API for community use
CORS Enabled: Works from any domain
Rate Limited: 1000 requests per hour per IP
Cached Responses: Optimized performance with appropriate cache headers
Comprehensive Documentation: Full API reference at /api/glin-profanity
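
Given the stated limit of 1000 requests per hour per IP, a client that polls frequently may want to track its own request budget. The following is a client-side approximation only: the server's exact window semantics are not documented here, and the class name `RequestBudget` is hypothetical.

```python
import time
from collections import deque


class RequestBudget:
    """Sliding-window counter: allows at most `limit` requests per `window` seconds."""

    def __init__(self, limit=1000, window=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the class can be tested without waiting
        self._sent = deque()  # timestamps of requests still inside the window

    def try_acquire(self) -> bool:
        """Record a request and return True, or return False if the budget is spent."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) >= self.limit:
            return False
        self._sent.append(now)
        return True
```

Call `try_acquire()` before each API request and skip or delay the request when it returns `False`.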

Cross-References

Installation Guide - Setting up Glin-Profanity with language selection
Configuration - Language configuration options and customWords setup
Core Functions - Using languages parameter in API calls
Filter Class - Object-oriented approach to language management
Python API - Cross-language dictionary access patterns
Context Analysis - Language-specific context understanding
TypeWeaver API Documentation - Complete API reference and examples
