Dictionary Management

Language dictionaries structure, customization, and extending with new words and languages

This guide covers Glin-Profanity's multi-language dictionary system: how to extend dictionaries with custom words, add new languages, and manage language-specific configuration for production deployments.

Glin-Profanity uses a shared dictionary system with 23 supported languages, allowing for consistent profanity detection across JavaScript and Python implementations.

Dictionary Structure

The dictionary system is organized in a shared structure that both JavaScript and Python implementations access for consistent behavior across platforms.

Project Organization

english.json
spanish.json
french.json
german.json
italian.json
portuguese.json
russian.json
japanese.json
korean.json
chinese.json
arabic.json
hindi.json
turkish.json
polish.json
dutch.json
swedish.json
norwegian.json
danish.json
finnish.json
czech.json
hungarian.json
persian.json
thai.json
globalWhitelist.json
metadata.json

Supported Languages

Glin-Profanity supports 23 languages with comprehensive profanity dictionaries maintained by the community:

  • European Languages (15 languages): english, spanish, french, german, italian, portuguese, russian, polish, dutch, swedish, norwegian, danish, finnish, czech, hungarian
  • Asian Languages (6 languages): japanese, korean, chinese, arabic, hindi, thai
  • Middle Eastern (2 languages): persian, turkish
  • Constructed Languages (1 language): esperanto
  • Special Files (system files): globalWhitelist.json, metadata.json

Complete Language List

Supported Language Codes
// All 23 supported languages (alphabetical order)
type Language = 
  | "arabic" | "chinese" | "czech" | "danish" | "english" 
  | "esperanto" | "finnish" | "french" | "german" | "hindi" 
  | "hungarian" | "italian" | "japanese" | "korean" | "norwegian" 
  | "persian" | "polish" | "portuguese" | "russian" | "spanish" 
  | "swedish" | "thai" | "turkish"

Dictionary File Format

Each language dictionary follows a consistent JSON format for easy maintenance and updates:

english.json (Example)
[
  "damn",
  "shit", 
  "fuck",
  "bitch",
  "ass",
  "bastard",
  "hell",
  "crap"
]

Format Requirements:

  • JSON array of strings
  • Lowercase entries only
  • No duplicates within language
  • Alphabetical ordering preferred (not required)
  • UTF-8 encoding for international characters

spanish.json (Extended Example)
[
  "cabron",
  "cabrón",
  "coño", 
  "hijo de puta",
  "joder",
  "maldito",
  "mierda",
  "pendejo",
  "puto"
]
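The format requirements listed earlier (a JSON array of lowercase strings with no duplicates, stored as UTF-8) can be checked mechanically before a dictionary file is committed. The following is a minimal sketch, not part of Glin-Profanity itself; the function name `validate_dictionary` is hypothetical.

```python
import json
from pathlib import Path


def validate_dictionary(path: Path) -> list[str]:
    """Return a list of human-readable problems found in one dictionary file."""
    # Reading as UTF-8 raises an error if the file uses another encoding.
    entries = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]

    problems: list[str] = []
    seen: set[str] = set()
    for index, entry in enumerate(entries):
        if not isinstance(entry, str):
            problems.append(f"entry {index} is not a string: {entry!r}")
            continue
        if entry != entry.lower():
            problems.append(f"entry {index} is not lowercase: {entry!r}")
        if entry in seen:
            problems.append(f"duplicate entry: {entry!r}")
        seen.add(entry)
    return problems
```

An empty result means the file satisfies the listed requirements. Alphabetical order is only a preference, so this sketch does not check it.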

Advanced Features:

  • Unicode Support: Accented characters (á, é, í, ó, ú, ñ)
  • Multi-word Phrases: "hijo de puta" (Spanish), "va te faire foutre" (French)
  • Spelling Variants: Both the unaccented "cabron" and the accented "cabrón" are included
  • Cultural Context: Language-specific profanity patterns

metadata.json (Dictionary Information)
{
  "version": "2.3.2",
  "lastUpdated": "2024-08-01",
  "languages": {
    "english": {
      "wordCount": 847,
      "contributors": ["community", "moderators"],
      "lastModified": "2024-07-15",
      "encoding": "utf-8",
      "notes": "Comprehensive English profanity dictionary"
    },
    "spanish": {
      "wordCount": 523,
      "contributors": ["native-speakers", "linguists"],
      "lastModified": "2024-06-20",
      "encoding": "utf-8", 
      "notes": "Includes Latin American and European variants"
    }
  },
  "globalWhitelist": {
    "wordCount": 156,
    "purpose": "Context-aware false positive reduction",
    "categories": ["gaming", "movies", "products", "technical"]
  }
}
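Because metadata.json records a `wordCount` per language, the counts can drift when a dictionary file is edited. The sketch below compares the recorded counts with the actual file lengths. It assumes the layout shown above (one `<language>.json` per key under `languages`, alongside metadata.json); the function name `find_stale_word_counts` is hypothetical.

```python
import json
from pathlib import Path


def find_stale_word_counts(dictionary_dir: Path) -> dict[str, tuple[int, int]]:
    """Map language -> (recorded, actual) wherever metadata.json is out of date."""
    metadata_path = dictionary_dir / "metadata.json"
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))

    stale: dict[str, tuple[int, int]] = {}
    for language, info in metadata["languages"].items():
        words_path = dictionary_dir / f"{language}.json"
        words = json.loads(words_path.read_text(encoding="utf-8"))
        if info["wordCount"] != len(words):
            stale[language] = (info["wordCount"], len(words))
    return stale
```

Running this in CI and failing on a non-empty result is one way to keep the metadata trustworthy.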

Adding Custom Words

Extend existing language dictionaries with domain-specific or updated profanity terms using configuration options:

Configure Custom Words

Add custom profanity terms to your filter configuration without modifying core dictionary files:

JavaScript Custom Words
import { Filter } from 'glin-profanity';

const customFilter = new Filter({
  languages: ['english', 'spanish'],
  customWords: [
    // Gaming-specific terms
    'noob', 'scrub', 'pwned', 'rekt',
    
    // Internet slang
    'simp', 'karen', 'chad',
    
    // Domain-specific profanity
    'corporate-buzzword', 'synergy-bs'
  ],
  
  // Additional configuration
  severityLevels: true,
  allowObfuscatedMatch: true
});

// Test custom words
console.log(customFilter.isProfane('That noob got rekt!')); // true
console.log(customFilter.isProfane('Stop being a simp'));   // true

Python Custom Words
from glin_profanity import Filter

custom_filter = Filter({
    "languages": ["english", "spanish"],
    "custom_words": [
        # Gaming-specific terms
        "noob", "scrub", "pwned", "rekt",
        
        # Internet slang
        "simp", "karen", "chad",
        
        # Domain-specific profanity  
        "corporate-buzzword", "synergy-bs"
    ],
    
    # Additional configuration
    "severity_levels": True,
    "allow_obfuscated_match": True
})

# Test custom words
print(custom_filter.is_profane("That noob got rekt!"))  # True
print(custom_filter.is_profane("Stop being a simp"))    # True

Validate Custom Word Integration

Test that custom words integrate properly with existing dictionary detection:

Custom Word Validation
import { Filter } from 'glin-profanity';

const filter = new Filter({
  languages: ['english'],
  customWords: ['noob', 'scrub', 'rekt'],
  severityLevels: true,
  replaceWith: '***'
});

// Test mixed content (dictionary + custom words)
const testCases = [
  'You damn noob!',           // Dictionary: damn, Custom: noob
  'That was fucking rekt',    // Dictionary: fucking, Custom: rekt  
  'Stop being a scrub',       // Custom: scrub only
  'This is normal text'       // No profanity
];

testCases.forEach(text => {
  const result = filter.checkProfanity(text);
  console.log(`Text: "${text}"`);
  console.log(`- Profanity detected: ${result.containsProfanity}`);
  console.log(`- Detected words: ${result.profaneWords.join(', ')}`);
  console.log(`- Processed: ${result.processedText}`);
  console.log('---');
});

Implement Ignore Lists

Use ignore lists to exclude words that might be false positives in your domain:

Custom Ignore Lists
import { Filter } from 'glin-profanity';

const contextualFilter = new Filter({
  languages: ['english'],
  customWords: ['kill', 'dead', 'murder'], // Potentially problematic in gaming
  ignoreWords: [
    // Gaming context exceptions
    'kill',      // "kill the enemy" in gaming
    'dead',      // "dead zone" in networking
    'execution', // "code execution" in programming
    
    // Professional context exceptions  
    'master',    // "master branch", "master key"
    'slave',     // "slave server", "master-slave"
    'penetration' // "penetration testing" in security
  ]
});

// These won't trigger profanity detection
console.log(contextualFilter.isProfane('Kill the enemy boss'));      // false
console.log(contextualFilter.isProfane('Master-slave architecture')); // false
console.log(contextualFilter.isProfane('Penetration testing'));       // false

// But these still will
console.log(contextualFilter.isProfane('That was murder'));           // true (custom word, not ignored)
console.log(contextualFilter.isProfane('You fucking idiot'));         // true (profanity)
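One way to reason about how dictionary words, custom words, and ignore words interact is as set arithmetic: the union of dictionary and custom words, minus the ignore list. The sketch below illustrates that model only. It assumes the ignore list takes precedence, which the library's actual matching may or may not do, and the helper names are hypothetical.

```python
def build_effective_words(
    dictionary_words: list[str],
    custom_words: list[str],
    ignore_words: list[str],
) -> set[str]:
    """Combine dictionary and custom words, then drop everything on the ignore list."""
    blocked = {word.lower() for word in dictionary_words}
    blocked |= {word.lower() for word in custom_words}
    ignored = {word.lower() for word in ignore_words}
    return blocked - ignored


def is_flagged(text: str, effective_words: set[str]) -> bool:
    """Flag text when any whitespace-separated token, stripped of punctuation, is blocked."""
    tokens = (token.strip(".,!?;:\"'").lower() for token in text.split())
    return any(token in effective_words for token in tokens)


words = build_effective_words(
    dictionary_words=["damn"],
    custom_words=["kill", "dead", "murder"],
    ignore_words=["kill", "dead"],
)
print(is_flagged("Kill the enemy boss", words))  # False: "kill" is on the ignore list
print(is_flagged("That was murder!", words))     # True: "murder" is custom and not ignored
```

Under this model, a word that appears in both `customWords` and `ignoreWords` is never flagged, so listing it in both has no effect.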

Adding New Languages

Extend Glin-Profanity with additional language support by following the established dictionary format:

Create Language Dictionary File

Create a new JSON dictionary file following the established format:

new-language.json (Template)
[
  "profane-word-1",
  "profane-word-2", 
  "profane-word-3",
  "multi word phrase",
  "unicode-characters-ñáéíóú"
]

File Creation Guidelines:

  • Use lowercase for all entries
  • Include accented/special characters where appropriate
  • Add multi-word phrases for comprehensive coverage
  • Sort alphabetically for maintainability
  • Ensure UTF-8 encoding
  • Target 100-500+ words for comprehensive coverage
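The guidelines above (lowercase entries, alphabetical order, UTF-8 encoding) can be applied automatically when generating a new dictionary file from a raw word list. This is a minimal sketch under those guidelines; `write_dictionary` is a hypothetical helper, not a function provided by Glin-Profanity.

```python
import json
from pathlib import Path


def write_dictionary(words: list[str], destination: Path) -> list[str]:
    """Lowercase, de-duplicate, and sort entries, then save them as a UTF-8 JSON array."""
    cleaned = sorted({word.strip().lower() for word in words if word.strip()})
    # ensure_ascii=False keeps accented characters readable instead of \u escapes.
    destination.write_text(
        json.dumps(cleaned, ensure_ascii=False, indent=2) + "\n",
        encoding="utf-8",
    )
    return cleaned
```

Sorting here is by Unicode code point, which is stable and reproducible but not necessarily the collation order a native speaker would expect.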

Update Language Type Definitions

Add your new language to the supported language types:

types.ts (JavaScript/TypeScript)
// Add new language to the Language union type
type Language = 
  | "arabic" | "chinese" | "czech" | "danish" | "english" 
  | "esperanto" | "finnish" | "french" | "german" | "hindi" 
  | "hungarian" | "italian" | "japanese" | "korean" | "norwegian" 
  | "persian" | "polish" | "portuguese" | "russian" | "spanish" 
  | "swedish" | "thai" | "turkish"
  | "your-new-language"  // Add your language here
types.py (Python)
# Add new language to the Language Literal type
Language = Literal[
    "arabic", "chinese", "czech", "danish", "english", "esperanto", 
    "finnish", "french", "german", "hindi", "hungarian", "italian", 
    "japanese", "korean", "norwegian", "persian", "polish", 
    "portuguese", "russian", "spanish", "swedish", "thai", "turkish",
    "your-new-language"  # Add your language here
]

Update Dictionary Loaders

Modify the dictionary loading system to include your new language:

dictionary.ts (JavaScript)
// Import your new dictionary
import yourNewLanguage from '../../../shared/dictionaries/your-new-language.json';

const dictionary: Record<Language, string[]> = {
  arabic: require('../../../shared/dictionaries/arabic.json'),
  chinese: require('../../../shared/dictionaries/chinese.json'),
  // ... other languages
  turkish: require('../../../shared/dictionaries/turkish.json'),
  'your-new-language': yourNewLanguage  // Add your language
};

dictionary.py (Python)
class DictionaryLoader:
    def __init__(self) -> None:
        self._dictionaries = {
            "arabic": self._load_dictionary("arabic"),
            "chinese": self._load_dictionary("chinese"),
            # ... other languages
            "turkish": self._load_dictionary("turkish"),
            "your-new-language": self._load_dictionary("your-new-language")  # Add your language
        }
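Since a new language must be registered in the type definitions, the loaders, and the dictionary folder, it is easy for these to fall out of sync. The sketch below compares a set of declared language names against the JSON files on disk. The special file names come from the project layout described earlier; the function name `compare_languages_to_files` is hypothetical.

```python
from pathlib import Path

# Files in the dictionary folder that are not language word lists.
SPECIAL_FILES = {"globalWhitelist", "metadata"}


def compare_languages_to_files(
    declared: set[str], dictionary_dir: Path
) -> tuple[set[str], set[str]]:
    """Return (declared languages with no file, files with no declared language)."""
    on_disk = {path.stem for path in dictionary_dir.glob("*.json")} - SPECIAL_FILES
    return declared - on_disk, on_disk - declared
```

Both returned sets should be empty once a language has been added consistently.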

Test New Language Integration

Verify that your new language works correctly with the profanity detection system:

New Language Testing
import { Filter } from 'glin-profanity';

// Test single language
const newLanguageFilter = new Filter({
  languages: ['your-new-language'],
  severityLevels: true
});

// Test multi-language with new language
const multiLanguageFilter = new Filter({
  languages: ['english', 'spanish', 'your-new-language'],
  allowObfuscatedMatch: true
});

// Test cases for your new language
const testCases = [
  'Clean text in your language',
  'Text with profane-word-1 in it',
  'Multiple profane-word-1 and profane-word-2 words',
  'Mixed english damn and your-language profanity'
];

testCases.forEach(text => {
  const result = multiLanguageFilter.checkProfanity(text);
  console.log(`Text: "${text}"`);
  console.log(`- Profanity: ${result.containsProfanity}`);
  console.log(`- Words: ${result.profaneWords.join(', ')}`);
  console.log('---');
});

Update Metadata and Documentation

Add your new language to the metadata files and documentation:

metadata.json Update
{
  "version": "2.4.0",
  "languages": {
    // ... existing languages
    "your-new-language": {
      "wordCount": 234,
      "contributors": ["your-name", "community"],
      "lastModified": "2024-08-01",
      "encoding": "utf-8",
      "notes": "Comprehensive profanity detection for Your Language"
    }
  }
}

Documentation Updates:

  • Update language count in README (24 languages instead of 23)
  • Add language to supported languages list
  • Include usage examples in documentation
  • Add any cultural or linguistic notes for maintainers

Dictionary Maintenance

Best practices for maintaining and updating language dictionaries in production environments:

Cross-Platform Considerations

Ensure dictionary compatibility across JavaScript and Python implementations:

File Path Resolution

JavaScript Path Resolution
// packages/js/src/data/dictionary.ts
// Literal paths (rather than a template string) keep each require statically resolvable by bundlers.
const dictionary: Record<Language, string[]> = {
  english: require('../../../shared/dictionaries/english.json'),
  spanish: require('../../../shared/dictionaries/spanish.json'),
  // ... other languages
};

Python Path Resolution
# packages/py/glin_profanity/data/dictionary.py
import json
from pathlib import Path

class DictionaryLoader:
    def __init__(self) -> None:
        # Resolve shared dictionary path
        self.dictionary_path = Path(__file__).parent.parent.parent.parent / "shared" / "dictionaries"
        
    def _load_dictionary(self, language: str) -> list[str]:
        dictionary_file = self.dictionary_path / f"{language}.json"
        with open(dictionary_file, 'r', encoding='utf-8') as f:
            return json.load(f)

Encoding Consistency

Both implementations must handle Unicode characters consistently:

Unicode Handling (JavaScript)
// Proper Unicode normalization for consistent matching
const normalizeText = (text) => {
  return text
    .normalize('NFD')  // Decompose accented characters
    .replace(/[\u0300-\u036f]/g, ''); // Remove combining marks
};

// Example: "café" and "cafe" both normalize to "cafe"
console.log(normalizeText('café')); // "cafe"
console.log(normalizeText('naïve')); // "naive"

Unicode Handling (Python)
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize Unicode text for consistent matching."""
    # Decompose accented characters and remove combining marks
    normalized = unicodedata.normalize('NFD', text)
    return ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

# Example: "café" and "cafe" both normalize to "cafe"  
print(normalize_text('café'))   # "cafe"
print(normalize_text('naïve'))  # "naive"
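To show why consistent normalization matters for matching, the sketch below applies the same accent stripping to both the dictionary entries and the input text, so an entry such as "cabrón" also matches the unaccented input "cabron". It is an illustration only, not the library's matching algorithm; `strip_accents` and `contains_entry` are hypothetical names, and punctuation handling is omitted for brevity.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Lowercase the text and remove combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def contains_entry(text: str, entries: list[str]) -> bool:
    """Check whether any entry appears as a whole word or phrase, ignoring accents."""
    # Collapse whitespace and pad with spaces so only whole tokens match.
    normalized_text = f" {' '.join(strip_accents(text).split())} "
    return any(f" {strip_accents(entry)} " in normalized_text for entry in entries)


print(contains_entry("Eres un CABRON", ["cabrón"]))  # True
print(contains_entry("cabronazo", ["cabrón"]))       # False: not a whole-word match
```

Note that removing every character in category `Mn` also strips marks that carry meaning in scripts such as Arabic, Devanagari, and Thai, so this kind of normalization may need to be limited to languages where dropping diacritics is acceptable.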

TypeWeaver Dictionary API

Access dictionary updates and statistics through the official TypeWeaver API:

API Endpoints

Check for Dictionary Updates
// Check current version and available updates
const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/version');
const versionInfo = await response.json();

console.log(`Current: ${versionInfo.currentVersion}`);
console.log(`Latest: ${versionInfo.latestVersion}`);

if (versionInfo.updateAvailable) {
  console.log('Updates available:');
  versionInfo.changelog.forEach(change => console.log(`- ${change}`));
  
  // Updated languages
  console.log('Updated languages:', versionInfo.languages.updated);
  console.log('Word counts:', versionInfo.languages.wordCounts);
}

Response Format:

{
  "currentVersion": "2.3.2",
  "latestVersion": "2.4.0",
  "updateAvailable": true,
  "changelog": ["Added 15 new Spanish terms", "Fixed Unicode handling"],
  "languages": {
    "total": 23,
    "updated": ["spanish", "french", "chinese"],
    "wordCounts": { "english": 847, "spanish": 538 }
  }
}
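Given a response in the format above, a client can decide which of its installed languages need refreshing. The sketch below works on an already-parsed response dictionary and compares versions numerically rather than as strings. The field names come from the response format shown; the function names are hypothetical, and purely numeric dotted versions are assumed.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '2.3.2' into a tuple that compares numerically."""
    return tuple(int(part) for part in version.split("."))


def languages_to_refresh(version_info: dict, installed: list[str]) -> list[str]:
    """List installed languages that changed, or nothing if no newer version exists."""
    current = parse_version(version_info["currentVersion"])
    latest = parse_version(version_info["latestVersion"])
    if latest <= current:
        return []
    updated = set(version_info["languages"]["updated"])
    return [language for language in installed if language in updated]
```

Tuple comparison avoids the string-comparison pitfall where "2.10.0" would sort before "2.9.1".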
Download Dictionary Files
// Download all dictionaries as ZIP
const downloadAll = async () => {
  const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/download?type=full&format=zip');
  const blob = await response.blob();
  
  // Save to file (browser)
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = 'glin-profanity-dictionaries.zip';
  a.click();
  URL.revokeObjectURL(url);
};

// Download specific languages
const downloadLanguages = async (languages) => {
  const params = new URLSearchParams({
    type: 'single',
    languages: languages.join(','),
    format: 'json'
  });
  
  const response = await fetch(`https://typeweaver.com/api/glin-profanity/dictionary/download?${params}`);
  const data = await response.json();
  
  return data.dictionaries;
};

// Example usage
await downloadAll();
const dictionaries = await downloadLanguages(['english', 'spanish', 'french']);

Available Parameters:

  • type: full, incremental, single
  • languages: comma-separated language codes
  • format: zip, json
  • fromVersion: for incremental updates
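The parameters listed above can be assembled into a download URL with basic validation of the documented values. This is a minimal sketch: the endpoint is taken from the download examples in this guide, `build_download_url` is a hypothetical helper, and which parameter combinations the API actually accepts (for example, whether `incremental` requires `fromVersion`) is not checked here.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint taken from the download examples in this guide.
DOWNLOAD_URL = "https://typeweaver.com/api/glin-profanity/dictionary/download"


def build_download_url(
    download_type: str,
    file_format: str,
    languages: Optional[list[str]] = None,
    from_version: Optional[str] = None,
) -> str:
    """Build a download URL from the documented query parameters."""
    if download_type not in {"full", "incremental", "single"}:
        raise ValueError(f"unsupported type: {download_type!r}")
    if file_format not in {"zip", "json"}:
        raise ValueError(f"unsupported format: {file_format!r}")

    params = {"type": download_type, "format": file_format}
    if languages:
        params["languages"] = ",".join(languages)
    if from_version:
        params["fromVersion"] = from_version
    # urlencode percent-encodes the comma separator as %2C.
    return f"{DOWNLOAD_URL}?{urlencode(params)}"
```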
Dictionary Statistics API
// Get comprehensive statistics
const getStats = async () => {
  const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/stats');
  const stats = await response.json();
  
  console.log(`Total Languages: ${stats.overview.totalLanguages}`);
  console.log(`Total Words: ${stats.overview.totalWords}`);
  console.log(`Quality Score: ${stats.quality.overallScore}%`);
  
  // Language breakdown
  Object.entries(stats.languageBreakdown).forEach(([lang, data]) => {
    console.log(`${lang}: ${data.wordCount} words (${data.coverage})`);
  });
};

// Get specific language stats
const getLanguageStats = async (language) => {
  const response = await fetch(`https://typeweaver.com/api/glin-profanity/dictionary/stats?language=${encodeURIComponent(language)}`);
  const stats = await response.json();
  
  return {
    language: stats.language,
    wordCount: stats.wordCount,
    coverage: stats.coverage,
    globalRanking: stats.globalRanking,
    percentageOfTotal: stats.percentageOfTotal
  };
};

// Example usage
await getStats();
const englishStats = await getLanguageStats('english');
console.log(`English: #${englishStats.globalRanking} with ${englishStats.wordCount} words`);

API Features

  • No Authentication Required: Public API for community use
  • CORS Enabled: Works from any domain
  • Rate Limited: 1000 requests per hour per IP
  • Cached Responses: Optimized performance with appropriate cache headers
  • Comprehensive Documentation: Full API reference at /api/glin-profanity
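Because the API is rate limited to 1000 requests per hour per IP, a client that polls frequently may want to track its own request budget. The sketch below keeps a sliding window of request timestamps and reports how long to wait. How the server actually counts requests (fixed or sliding window) is not documented here, so this is a conservative client-side assumption; `HourlyBudget` is a hypothetical class.

```python
import time
from collections import deque
from typing import Callable


class HourlyBudget:
    """Track request timestamps and report how long to wait before the next call."""

    def __init__(
        self,
        limit: int = 1000,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent: deque[float] = deque()

    def seconds_until_allowed(self) -> float:
        """Return 0.0 if a request may be sent now, else the remaining wait in seconds."""
        now = self._clock()
        # Forget requests that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record_request(self) -> None:
        """Note that a request was just sent."""
        self._sent.append(self._clock())
```

A caller would check `seconds_until_allowed()` before each request, sleep for the returned duration if it is non-zero, and call `record_request()` after sending. The injectable `clock` exists so the behavior can be tested without waiting.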

Cross-References