Dictionary Management
How language dictionaries are structured, customized, and extended with new words and languages
Comprehensive guide to Glin-Profanity's multi-language dictionary system, including how to extend dictionaries with custom words, add new languages, and manage language-specific configurations for production deployments.
Glin-Profanity uses a shared dictionary system with 23 supported languages, allowing for consistent profanity detection across JavaScript and Python implementations.
Dictionary Structure
The dictionary system is organized in a shared structure that both JavaScript and Python implementations access for consistent behavior across platforms.
Project Organization
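Based on the file paths referenced later in this guide (`shared/dictionaries/`, `packages/js/src/data/dictionary.ts`, `packages/py/glin_profanity/data/dictionary.py`), the repository layout looks roughly like this; treat it as illustrative, since the exact structure may differ:

```
glin-profanity/
├── shared/
│   └── dictionaries/
│       ├── english.json
│       ├── spanish.json
│       ├── ...                  # one JSON file per supported language
│       ├── globalWhitelist.json
│       └── metadata.json
└── packages/
    ├── js/
    │   └── src/data/dictionary.ts
    └── py/
        └── glin_profanity/data/dictionary.py
```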
Supported Languages
Glin-Profanity supports 23 languages with comprehensive profanity dictionaries maintained by the community:
| Category | Count | Languages |
|---|---|---|
| European Languages | 14 languages | english, spanish, french, german, italian, portuguese, russian, polish, swedish, norwegian, danish, finnish, czech, hungarian |
| Asian Languages | 6 languages | japanese, korean, chinese, arabic, hindi, thai |
| Middle Eastern | 2 languages | persian, turkish |
| Constructed Languages | 1 language | esperanto |
| Special Files | System files | globalWhitelist.json, metadata.json |
Complete Language List
// All 23 supported languages (alphabetical order)
type Language =
| "arabic" | "chinese" | "czech" | "danish" | "english"
| "esperanto" | "finnish" | "french" | "german" | "hindi"
| "hungarian" | "italian" | "japanese" | "korean" | "norwegian"
| "persian" | "polish" | "portuguese" | "russian" | "spanish"
  | "swedish" | "thai" | "turkish"
Dictionary File Format
Each language dictionary follows a consistent JSON format for easy maintenance and updates:
[
"damn",
"shit",
"fuck",
"bitch",
"ass",
"bastard",
"hell",
"crap"
]
Format Requirements:
- JSON array of strings
- Lowercase entries only
- No duplicates within language
- Alphabetical ordering preferred (not required)
- UTF-8 encoding for international characters
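These requirements can be checked mechanically. The following sketch (a hypothetical helper, not part of the library) validates a parsed word list against them:

```python
import json

def validate_dictionary(words):
    """Check a parsed word list against the dictionary format requirements."""
    errors = []
    if not all(isinstance(w, str) for w in words):
        errors.append("all entries must be strings")
    if any(w != w.lower() for w in words):
        errors.append("entries must be lowercase")
    if len(set(words)) != len(words):
        errors.append("duplicate entries found")
    return errors

# A well-formed file parses to a clean list; violations are reported one by one.
words = json.loads('["ass", "damn", "shit"]')
print(validate_dictionary(words))                     # []
print(validate_dictionary(["Damn", "damn", "damn"]))  # lowercase + duplicate errors
```

Running a check like this in CI keeps community contributions consistent across all 23 language files.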
Example from the Spanish dictionary, showing Unicode and multi-word entries:
[
"cabron",
"cabrón",
"coño",
"hijo de puta",
"joder",
"maldito",
"mierda",
"pendejo",
"puto"
]
Advanced Features:
- Unicode Support: Accented characters (á, é, í, ó, ú, ñ)
- Multi-word Phrases: "hijo de puta", "va te faire foutre"
- Regional Variants: Both "cabron" and "cabrón" included
- Cultural Context: Language-specific profanity patterns
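One way multi-word phrases like "hijo de puta" can be matched is with word-boundary regexes. This sketch is only illustrative; the library's real matcher is more sophisticated (obfuscation handling, severity levels, context analysis):

```python
import re

def find_phrases(text, entries):
    """Return dictionary entries present in text, including multi-word phrases."""
    found = []
    lowered = text.lower()
    for entry in entries:
        # Word-boundary anchors prevent matches inside longer words
        # (e.g. "ass" inside "class"); re.escape keeps accents literal.
        if re.search(r"\b" + re.escape(entry) + r"\b", lowered):
            found.append(entry)
    return found

entries = ["hijo de puta", "mierda", "cabrón"]
print(find_phrases("Eres un HIJO DE PUTA", entries))  # ['hijo de puta']
print(find_phrases("Qué cabrón", entries))            # ['cabrón']
print(find_phrases("my class", ["ass"]))              # []
```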
The shared metadata.json file tracks versioning and per-language statistics:
{
"version": "2.3.2",
"lastUpdated": "2024-08-01",
"languages": {
"english": {
"wordCount": 847,
"contributors": ["community", "moderators"],
"lastModified": "2024-07-15",
"encoding": "utf-8",
"notes": "Comprehensive English profanity dictionary"
},
"spanish": {
"wordCount": 523,
"contributors": ["native-speakers", "linguists"],
"lastModified": "2024-06-20",
"encoding": "utf-8",
"notes": "Includes Latin American and European variants"
}
},
"globalWhitelist": {
"wordCount": 156,
"purpose": "Context-aware false positive reduction",
"categories": ["gaming", "movies", "products", "technical"]
}
}
Adding Custom Words
Extend existing language dictionaries with domain-specific or updated profanity terms using configuration options:
Configure Custom Words
Add custom profanity terms to your filter configuration without modifying core dictionary files:
import { Filter } from 'glin-profanity';
const customFilter = new Filter({
languages: ['english', 'spanish'],
customWords: [
// Gaming-specific terms
'noob', 'scrub', 'pwned', 'rekt',
// Internet slang
'simp', 'karen', 'chad',
// Domain-specific profanity
'corporate-buzzword', 'synergy-bs'
],
// Additional configuration
severityLevels: true,
allowObfuscatedMatch: true
});
// Test custom words
console.log(customFilter.isProfane('That noob got rekt!')); // true
console.log(customFilter.isProfane('Stop being a simp')); // true
from glin_profanity import Filter
custom_filter = Filter({
"languages": ["english", "spanish"],
"custom_words": [
# Gaming-specific terms
"noob", "scrub", "pwned", "rekt",
# Internet slang
"simp", "karen", "chad",
# Domain-specific profanity
"corporate-buzzword", "synergy-bs"
],
# Additional configuration
"severity_levels": True,
"allow_obfuscated_match": True
})
# Test custom words
print(custom_filter.is_profane("That noob got rekt!")) # True
print(custom_filter.is_profane("Stop being a simp"))  # True
Validate Custom Word Integration
Test that custom words integrate properly with existing dictionary detection:
const filter = new Filter({
languages: ['english'],
customWords: ['noob', 'scrub', 'rekt'],
severityLevels: true,
replaceWith: '***'
});
// Test mixed content (dictionary + custom words)
const testCases = [
'You damn noob!', // Dictionary: damn, Custom: noob
'That was fucking rekt', // Dictionary: fucking, Custom: rekt
'Stop being a scrub', // Custom: scrub only
'This is normal text' // No profanity
];
testCases.forEach(text => {
const result = filter.checkProfanity(text);
console.log(`Text: "${text}"`);
console.log(`- Profanity detected: ${result.containsProfanity}`);
console.log(`- Detected words: ${result.profaneWords.join(', ')}`);
console.log(`- Processed: ${result.processedText}`);
console.log('---');
});
Implement Ignore Lists
Use ignore lists to exclude words that might be false positives in your domain:
const contextualFilter = new Filter({
languages: ['english'],
customWords: ['kill', 'dead', 'murder'], // Potentially problematic in gaming
ignoreWords: [
// Gaming context exceptions
'kill', // "kill the enemy" in gaming
'dead', // "dead zone" in networking
'execution', // "code execution" in programming
// Professional context exceptions
'master', // "master branch", "master key"
'slave', // "slave server", "master-slave"
'penetration' // "penetration testing" in security
]
});
// These won't trigger profanity detection
console.log(contextualFilter.isProfane('Kill the enemy boss')); // false
console.log(contextualFilter.isProfane('Master-slave architecture')); // false
console.log(contextualFilter.isProfane('Penetration testing')); // false
// But these still will
console.log(contextualFilter.isProfane('Go kill yourself')); // true (context)
console.log(contextualFilter.isProfane('You fucking idiot')); // true (profanity)
Adding New Languages
Extend Glin-Profanity with additional language support by following the established dictionary format:
Create Language Dictionary File
Create a new JSON dictionary file following the established format:
[
"profane-word-1",
"profane-word-2",
"profane-word-3",
"multi word phrase",
"unicode-characters-ñáéíóú"
]
File Creation Guidelines:
- Use lowercase for all entries
- Include accented/special characters where appropriate
- Add multi-word phrases for comprehensive coverage
- Sort alphabetically for maintainability
- Ensure UTF-8 encoding
- Target 100-500+ words for comprehensive coverage
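A raw contributed word list can be brought into line with these guidelines automatically. This sketch (a hypothetical helper, not part of the repo) lowercases, trims, dedupes, and sorts entries into the file format:

```python
import json

def build_dictionary_file(raw_words):
    """Lowercase, trim, dedupe, and sort a raw word list into dictionary-file JSON."""
    cleaned = sorted({w.strip().lower() for w in raw_words if w.strip()})
    # ensure_ascii=False keeps accented characters readable in the file
    return json.dumps(cleaned, ensure_ascii=False, indent=2)

raw = ["Profane-Word-2", "profane-word-1", "PROFANE-WORD-1", "  multi word phrase "]
print(build_dictionary_file(raw))
# ["multi word phrase", "profane-word-1", "profane-word-2"]
```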
Update Language Type Definitions
Add your new language to the supported language types:
// Add new language to the Language union type
type Language =
| "arabic" | "chinese" | "czech" | "danish" | "english"
| "esperanto" | "finnish" | "french" | "german" | "hindi"
| "hungarian" | "italian" | "japanese" | "korean" | "norwegian"
| "persian" | "polish" | "portuguese" | "russian" | "spanish"
| "swedish" | "thai" | "turkish"
  | "your-new-language" // Add your language here
# Add new language to the Language Literal type
Language = Literal[
"arabic", "chinese", "czech", "danish", "english", "esperanto",
"finnish", "french", "german", "hindi", "hungarian", "italian",
"japanese", "korean", "norwegian", "persian", "polish",
"portuguese", "russian", "spanish", "swedish", "thai", "turkish",
"your-new-language" # Add your language here
]
Update Dictionary Loaders
Modify the dictionary loading system to include your new language:
// Import your new dictionary
import yourNewLanguage from '../../../shared/dictionaries/your-new-language.json';
const dictionary: Record<Language, string[]> = {
arabic: require('../../../shared/dictionaries/arabic.json'),
chinese: require('../../../shared/dictionaries/chinese.json'),
// ... other languages
turkish: require('../../../shared/dictionaries/turkish.json'),
'your-new-language': yourNewLanguage // Add your language
};
class DictionaryLoader:
def __init__(self) -> None:
self._dictionaries = {
"arabic": self._load_dictionary("arabic"),
"chinese": self._load_dictionary("chinese"),
# ... other languages
"turkish": self._load_dictionary("turkish"),
"your-new-language": self._load_dictionary("your-new-language") # Add your language
        }
Test New Language Integration
Verify that your new language works correctly with the profanity detection system:
import { Filter } from 'glin-profanity';
// Test single language
const newLanguageFilter = new Filter({
languages: ['your-new-language'],
severityLevels: true
});
// Test multi-language with new language
const multiLanguageFilter = new Filter({
languages: ['english', 'spanish', 'your-new-language'],
allowObfuscatedMatch: true
});
// Test cases for your new language
const testCases = [
'Clean text in your language',
'Text with profane-word-1 in it',
'Multiple profane-word-1 and profane-word-2 words',
'Mixed english damn and your-language profanity'
];
testCases.forEach(text => {
const result = multiLanguageFilter.checkProfanity(text);
console.log(`Text: "${text}"`);
console.log(`- Profanity: ${result.containsProfanity}`);
console.log(`- Words: ${result.profaneWords.join(', ')}`);
console.log('---');
});
Update Metadata and Documentation
Add your new language to the metadata files and documentation:
{
"version": "2.4.0",
"languages": {
// ... existing languages
"your-new-language": {
"wordCount": 234,
"contributors": ["your-name", "community"],
"lastModified": "2024-08-01",
"encoding": "utf-8",
"notes": "Comprehensive profanity detection for Your Language"
}
}
}
Documentation Updates:
- Update language count in README (24 languages instead of 23)
- Add language to supported languages list
- Include usage examples in documentation
- Add any cultural or linguistic notes for maintainers
Dictionary Maintenance
Best practices for maintaining and updating language dictionaries in production environments:
Cross-Platform Considerations
Ensure dictionary compatibility across JavaScript and Python implementations:
File Path Resolution
// packages/js/src/data/dictionary.ts
const dictionaryPath = '../../../shared/dictionaries/';
const dictionary: Record<Language, string[]> = {
english: require(`${dictionaryPath}english.json`),
spanish: require(`${dictionaryPath}spanish.json`),
// ... other languages
};
# packages/py/glin_profanity/data/dictionary.py
import json
import os
from pathlib import Path
class DictionaryLoader:
def __init__(self) -> None:
# Resolve shared dictionary path
self.dictionary_path = Path(__file__).parent.parent.parent.parent / "shared" / "dictionaries"
def _load_dictionary(self, language: str) -> list[str]:
dictionary_file = self.dictionary_path / f"{language}.json"
with open(dictionary_file, 'r', encoding='utf-8') as f:
            return json.load(f)
Encoding Consistency
Both implementations must handle Unicode characters consistently:
// Proper Unicode normalization for consistent matching
const normalizeText = (text) => {
return text
.normalize('NFD') // Decompose accented characters
.replace(/[\u0300-\u036f]/g, ''); // Remove combining marks
};
// Example: "café" and "cafe" both normalize to "cafe"
console.log(normalizeText('café')); // "cafe"
console.log(normalizeText('naïve')); // "naive"
import unicodedata
def normalize_text(text: str) -> str:
"""Normalize Unicode text for consistent matching."""
# Decompose accented characters and remove combining marks
normalized = unicodedata.normalize('NFD', text)
return ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
# Example: "café" and "cafe" both normalize to "cafe"
print(normalize_text('café')) # "cafe"
print(normalize_text('naïve'))  # "naive"
TypeWeaver Dictionary API
Access dictionary updates and statistics through the official TypeWeaver API:
API Endpoints
// Check current version and available updates
const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/version');
const versionInfo = await response.json();
console.log(`Current: ${versionInfo.currentVersion}`);
console.log(`Latest: ${versionInfo.latestVersion}`);
if (versionInfo.updateAvailable) {
console.log('Updates available:');
versionInfo.changelog.forEach(change => console.log(`- ${change}`));
// Updated languages
console.log('Updated languages:', versionInfo.languages.updated);
console.log('Word counts:', versionInfo.languages.wordCounts);
}
Response Format:
{
"currentVersion": "2.3.2",
"latestVersion": "2.4.0",
"updateAvailable": true,
"changelog": ["Added 15 new Spanish terms", "Fixed Unicode handling"],
"languages": {
"total": 23,
"updated": ["spanish", "french", "chinese"],
"wordCounts": { "english": 847, "spanish": 538 }
}
}
// Download all dictionaries as ZIP
const downloadAll = async () => {
const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/download?type=full&format=zip');
const blob = await response.blob();
// Save to file (browser)
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = 'glin-profanity-dictionaries.zip';
a.click();
URL.revokeObjectURL(url);
};
// Download specific languages
const downloadLanguages = async (languages) => {
const params = new URLSearchParams({
type: 'single',
languages: languages.join(','),
format: 'json'
});
const response = await fetch(`https://typeweaver.com/api/glin-profanity/dictionary/download?${params}`);
const data = await response.json();
return data.dictionaries;
};
// Example usage
await downloadAll();
const dictionaries = await downloadLanguages(['english', 'spanish', 'french']);
Available Parameters:
- type: full, incremental, single
- languages: comma-separated language codes
- format: zip, json
- fromVersion: for incremental updates
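As a minimal sketch of combining these parameters, the helper below assembles a download URL; the endpoint is as documented above, while the helper itself and its argument names are illustrative:

```python
from urllib.parse import urlencode

BASE = "https://typeweaver.com/api/glin-profanity/dictionary/download"

def build_download_url(type_, languages=None, fmt="json", from_version=None):
    """Assemble a dictionary download URL from the documented query parameters."""
    params = {"type": type_, "format": fmt}
    if languages:
        params["languages"] = ",".join(languages)  # comma-separated codes
    if from_version:
        params["fromVersion"] = from_version       # for incremental updates
    return f"{BASE}?{urlencode(params)}"

print(build_download_url("single", languages=["english", "spanish"]))
print(build_download_url("incremental", fmt="zip", from_version="2.3.2"))
```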
// Get comprehensive statistics
const getStats = async () => {
const response = await fetch('https://typeweaver.com/api/glin-profanity/dictionary/stats');
const stats = await response.json();
console.log(`Total Languages: ${stats.overview.totalLanguages}`);
console.log(`Total Words: ${stats.overview.totalWords}`);
console.log(`Quality Score: ${stats.quality.overallScore}%`);
// Language breakdown
Object.entries(stats.languageBreakdown).forEach(([lang, data]) => {
console.log(`${lang}: ${data.wordCount} words (${data.coverage})`);
});
};
// Get specific language stats
const getLanguageStats = async (language) => {
const response = await fetch(`https://typeweaver.com/api/glin-profanity/dictionary/stats?language=${language}`);
const stats = await response.json();
return {
language: stats.language,
wordCount: stats.wordCount,
coverage: stats.coverage,
globalRanking: stats.globalRanking,
percentageOfTotal: stats.percentageOfTotal
};
};
// Example usage
await getStats();
const englishStats = await getLanguageStats('english');
console.log(`English: #${englishStats.globalRanking} with ${englishStats.wordCount} words`);
API Features
- No Authentication Required: Public API for community use
- CORS Enabled: Works from any domain
- Rate Limited: 1000 requests per hour per IP
- Cached Responses: Optimized performance with appropriate cache headers
- Comprehensive Documentation: Full API reference at /api/glin-profanity
Cross-References
- Installation Guide - Setting up Glin-Profanity with language selection
- Configuration - Language configuration options and customWords setup
- Core Functions - Using languages parameter in API calls
- Filter Class - Object-oriented approach to language management
- Python API - Cross-language dictionary access patterns
- Context Analysis - Language-specific context understanding
- TypeWeaver API Documentation - Complete API reference and examples