v0.8.0 released: NLP Engine + Privacy Engine (PDP) modules for Indonesian text processing and UU PDP compliance.

NLP Engine

Indonesian NLP utilities for text processing without external dependencies.

Overview

The NLP module provides algorithmic text processing for the Indonesian language:

  • Stemming: Strip common Indonesian affixes (prefixes/suffixes)
  • Phonetics: Encode names for fuzzy matching (Soundex-like for Indonesian)
  • Tokenization: Split sentences while preserving abbreviations
  • Normalization: Clean whitespace and remove unwanted characters

Features

  • Zero external dependencies (pure algorithms)
  • Works in browser and Node.js environments
  • Heuristic stemming (no dictionary required)
  • Indonesian-specific phonetic normalization (Dutch/Arabic spellings)

Installation

npm install @indodev/toolkit

Quick Start

import {
  stemText,
  encodePhonetic,
  isPhoneticMatch,
  tokenizeIndo,
  normalizeWhitespace,
} from '@indodev/toolkit/nlp';

// Stemming
stemText('mempertanggungjawabkan'); // 'tanggung jawab'

// Phonetic matching
encodePhonetic('Syahruddin'); // 'S635'
encodePhonetic('Sjahruddin'); // 'S635' (same!)
isPhoneticMatch('Fakhri', 'Fahri'); // true

// Tokenization (preserves abbreviations)
tokenizeIndo("Kpd Yth. Bpk. Budi di Jl. Sudirman. Harap datang.");
// ["Kpd Yth. Bpk. Budi di Jl. Sudirman.", "Harap datang."]

// Normalization
normalizeWhitespace(" Budi pergi ke sekolah "); // "Budi pergi ke sekolah"

API Reference

stemText()

Strips Indonesian affixes using algorithmic rules.

function stemText(text: string): string;

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| text | string | Text to stem |

Returns: Stemmed word (affix-stripped)

Examples:

stemText('mempertanggungjawabkan'); // 'tanggung jawab'
stemText('berkewarganegaraan'); // 'warganegara'
stemText('dikerjakan'); // 'kerja'
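To make the idea of heuristic, dictionary-free stemming concrete, here is a minimal sketch of single-word affix stripping. It is not the library's implementation: the function name `sketchStemWord`, the affix tables, and the minimum stem length are all assumptions chosen for illustration, and nasal assimilation (for example `men` + `tulis`) is not handled.

```typescript
// Hypothetical affix tables, ordered longest first.
// A real stemmer would cover many more forms and exceptions.
const PREFIXES: string[] = [
  'memper', 'diper', 'meng', 'meny',
  'mem', 'men', 'ber', 'ter', 'per',
  'me', 'di', 'ke', 'se',
];
const SUFFIXES: string[] = ['kan', 'nya', 'an', 'i'];

// Assumed guard: never strip an affix if fewer than 3 letters would remain.
const MIN_STEM_LENGTH = 3;

function sketchStemWord(word: string): string {
  let stem = word.toLowerCase();

  // Strip at most one prefix (the first, i.e. longest, that fits).
  for (const prefix of PREFIXES) {
    if (stem.startsWith(prefix) && stem.length - prefix.length >= MIN_STEM_LENGTH) {
      stem = stem.slice(prefix.length);
      break;
    }
  }

  // Strip at most one suffix under the same length guard.
  for (const suffix of SUFFIXES) {
    if (stem.endsWith(suffix) && stem.length - suffix.length >= MIN_STEM_LENGTH) {
      stem = stem.slice(0, stem.length - suffix.length);
      break;
    }
  }

  return stem;
}

sketchStemWord('dikerjakan'); // 'kerja'
sketchStemWord('makanan');    // 'makan'
sketchStemWord('berlari');    // 'lar' (over-stemmed: the root is 'lari')
```

The last call shows why a rule-only approach can over-stem: without a dictionary, the sketch cannot tell that the final `i` in `lari` belongs to the root.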

encodePhonetic()

Encodes text to phonetic representation for fuzzy matching.

function encodePhonetic(text: string): string;

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| text | string | Text to encode |

Returns: Phonetic code (letter + 3 digits)

Examples:

encodePhonetic('Syahruddin'); // 'S635'
encodePhonetic('Sjahrudin'); // 'S635'
encodePhonetic('Budi'); // 'B300'
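As an illustration of how a Soundex-like code with Indonesian spelling normalization could be computed, the sketch below rewrites a few old Dutch-era and Arabic-derived digraphs and then applies the classic Soundex digit groups. The name `sketchEncode`, the rule table, and the simplified handling of repeated codes are assumptions; the library's actual rules may differ.

```typescript
// Hypothetical spelling rules applied before encoding.
const SPELLING_RULES: Array<[RegExp, string]> = [
  [/oe/g, 'u'],  // Soekarno -> Sukarno
  [/sj/g, 'sy'], // Sjahrir -> Syahrir
  [/dj/g, 'j'],  // Djoko -> Joko
  [/tj/g, 'c'],  // Tjipto -> Cipto
  [/kh/g, 'h'],  // Fakhri -> Fahri
  [/bh/g, 'b'],
  [/dh/g, 'd'],
  [/ie/g, 'i'],
];

// Classic Soundex consonant groups; vowels and h, w, y carry no digit.
const SOUNDEX_CODES: Record<string, string> = {
  b: '1', f: '1', p: '1', v: '1',
  c: '2', g: '2', j: '2', k: '2', q: '2', s: '2', x: '2', z: '2',
  d: '3', t: '3',
  l: '4',
  m: '5', n: '5',
  r: '6',
};

function sketchEncode(text: string): string {
  let normalized = text.toLowerCase().replace(/[^a-z]/g, '');
  for (const [pattern, replacement] of SPELLING_RULES) {
    normalized = normalized.replace(pattern, replacement);
  }
  if (normalized.length === 0) return '';

  let digits = '';
  let previous = SOUNDEX_CODES[normalized.charAt(0)] ?? '';
  for (let i = 1; i < normalized.length; i++) {
    const code = SOUNDEX_CODES[normalized.charAt(i)] ?? '';
    // Emit a digit only when it differs from the immediately preceding code.
    if (code !== '' && code !== previous) digits += code;
    // Simplification: any uncoded letter (vowel, h, w, y) resets the run.
    previous = code;
  }

  // First letter plus exactly three digits, zero-padded.
  return normalized.charAt(0).toUpperCase() + (digits + '000').slice(0, 3);
}

sketchEncode('Syahruddin'); // 'S635'
sketchEncode('Sjahrudin');  // 'S635'
sketchEncode('Budi');       // 'B300'
sketchEncode('Bagus');      // 'B220'
sketchEncode('Bagas');      // 'B220' (same code, different name)
```

For these inputs the sketch happens to produce the same codes as the documented examples, but that should not be read as a guarantee for other names.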

False Positives: Phonetic encoding may produce the same code for different words (e.g., Bagus ≈ Bagas). Use isPhoneticMatch when you need a boolean comparison, and do not treat a match as identity verification.

isPhoneticMatch()

Compares two strings for phonetic equality.

function isPhoneticMatch(text1: string, text2: string): boolean;

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| text1 | string | First text |
| text2 | string | Second text |

Returns: true if both texts encode to the same phonetic code

Examples:

isPhoneticMatch('Fakhri', 'Fahri'); // true
isPhoneticMatch('Bhoedie', 'Budi'); // true
isPhoneticMatch('John', 'Jane'); // false
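Because a phonetic match is only a hint, one possible pattern is to use it to shortlist candidates and keep track of which hits are also identical spellings. The sketch below is an assumption about how a caller might do this: `shortlistCandidates`, `Candidate`, and `PhoneticPredicate` are hypothetical names, and the predicate is passed in so that any function with the `(text1, text2) => boolean` shape, such as isPhoneticMatch, could be supplied.

```typescript
// Any function with the same shape as isPhoneticMatch can be passed in.
type PhoneticPredicate = (text1: string, text2: string) => boolean;

// Hypothetical result shape for a shortlist entry.
interface Candidate {
  name: string;
  exact: boolean; // true when the spelling is identical, ignoring case
}

function shortlistCandidates(
  query: string,
  names: string[],
  matches: PhoneticPredicate,
): Candidate[] {
  const queryKey = query.toLowerCase();
  return names
    // Keep only names the predicate considers phonetically equal.
    .filter((name) => matches(query, name))
    // Flag identical spellings so callers can tell them apart from fuzzy hits.
    .map((name) => ({ name, exact: name.toLowerCase() === queryKey }));
}
```

A call such as `shortlistCandidates('Fakhri', names, isPhoneticMatch)` would then return every phonetically similar name, leaving the decision about fuzzy hits to a later review step.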

tokenizeIndo()

Splits Indonesian text into sentences, preserving abbreviations.

function tokenizeIndo(text: string): string[];

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| text | string | Text to tokenize |

Returns: Array of sentences

Preserved abbreviations: Yth., Bpk., Ibu., Sdr., S.Kom., M.Kom., Dr., Sp., Mk., Jl., D.a., D.l.

Examples:

tokenizeIndo("Kpd Yth. Bpk. Budi di Jl. Sudirman. Harap datang."); // ["Kpd Yth. Bpk. Budi di Jl. Sudirman.", "Harap datang."]
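The abbreviation-aware splitting described above can be sketched as follows. This is an illustrative approach rather than the library's algorithm: `sketchSplitSentences` is a hypothetical name, the abbreviation set simply mirrors the documented list, and the sketch rejoins words with single spaces.

```typescript
// Lowercased copy of the documented abbreviation list.
const ABBREVIATIONS = new Set<string>([
  'yth.', 'bpk.', 'ibu.', 'sdr.', 's.kom.', 'm.kom.',
  'dr.', 'sp.', 'mk.', 'jl.', 'd.a.', 'd.l.',
]);

function sketchSplitSentences(text: string): string[] {
  const words = text.trim().split(/\s+/).filter((word) => word.length > 0);
  const sentences: string[] = [];
  let current: string[] = [];

  for (const word of words) {
    current.push(word);
    // A word closes the sentence if it ends in ., ! or ?
    // and is not a known abbreviation.
    const endsSentence =
      /[.!?]$/.test(word) && !ABBREVIATIONS.has(word.toLowerCase());
    if (endsSentence) {
      sentences.push(current.join(' '));
      current = [];
    }
  }

  // Keep any trailing text that lacks closing punctuation.
  if (current.length > 0) sentences.push(current.join(' '));
  return sentences;
}

sketchSplitSentences('Kpd Yth. Bpk. Budi di Jl. Sudirman. Harap datang.');
// ['Kpd Yth. Bpk. Budi di Jl. Sudirman.', 'Harap datang.']
```

One trade-off of this simple rule is that a sentence that genuinely ends with a listed abbreviation is merged with the sentence that follows it.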

normalizeWhitespace()

Normalizes whitespace (collapses multiple spaces, trims).

function normalizeWhitespace(text: string): string;

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| text | string | Text to normalize |

Examples:

normalizeWhitespace("  Budi   pergi  ke   sekolah  "); // "Budi pergi ke sekolah"
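Collapsing and trimming whitespace can be expressed with a single regular expression, as in the sketch below. `sketchNormalizeWhitespace` is a hypothetical name, and treating tabs and newlines the same as spaces is an assumption; the library may only handle plain spaces.

```typescript
function sketchNormalizeWhitespace(text: string): string {
  // Replace every run of whitespace with one space, then trim both ends.
  // Assumption: \s (spaces, tabs, newlines) are all treated alike.
  return text.replace(/\s+/g, ' ').trim();
}

sketchNormalizeWhitespace('  Budi   pergi  ke   sekolah  '); // 'Budi pergi ke sekolah'
sketchNormalizeWhitespace('Budi\tpergi\nke sekolah');        // 'Budi pergi ke sekolah'
```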

stripNonAlphanumeric()

Removes non-alphanumeric characters except spaces.

function stripNonAlphanumeric(text: string): string;

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| text | string | Text to clean |

Examples:

stripNonAlphanumeric("Budi123@#$%^&*()"); // "Budi123"
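A character-class regular expression is one simple way to get this behavior. In the sketch below, `sketchStripNonAlphanumeric` is a hypothetical name, and limiting "alphanumeric" to ASCII letters and digits is an assumption; the library may treat accented or non-Latin letters differently.

```typescript
function sketchStripNonAlphanumeric(text: string): string {
  // Remove everything except ASCII letters, digits, and the space character.
  return text.replace(/[^a-zA-Z0-9 ]/g, '');
}

sketchStripNonAlphanumeric('Budi123@#$%^&*()'); // 'Budi123'
sketchStripNonAlphanumeric('Budi & Ani');       // 'Budi  Ani' (two spaces remain)
```

As the second call shows, removing a symbol between two spaces leaves a double space, so whitespace normalization is a natural follow-up step.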

Type Reference

PhoneticResult

interface PhoneticResult {
  original: string;
  encoded: string;
}

TokenizationResult

interface TokenizationResult {
  sentences: string[];
  abbreviationCount: number;
}

StemmingResult

interface StemmingResult {
  original: string;
  stemmed: string;
  removedAffixes: string[];
}

Known Limitations

Stemming

  • Heuristic-only (no dictionary)
  • May over-stem or under-stem in some cases
  • Best for search/indexing, not grammatical accuracy

Phonetics

  • False positives possible (same code, different words)
  • Dutch/Arabic normalization may affect accuracy for other languages
  • Not suitable for identity verification

See Also

  • Text - Additional text manipulation utilities
  • Privacy (PDP) - PII scanning and masking