NLP Engine

Indonesian NLP utilities for text processing without external dependencies.

Overview

The NLP module provides algorithmic text processing for the Indonesian language:

Stemming: Strip common Indonesian affixes (prefixes/suffixes)
Phonetics: Encode names for fuzzy matching (Soundex-like for Indonesian)
Tokenization: Split sentences while preserving abbreviations
Normalization: Clean whitespace and remove unwanted characters

Features

Zero external dependencies (pure algorithms)
Works in browser and Node.js environments
Heuristic stemming (no dictionary required)
Indonesian-specific phonetic normalization (Dutch/Arabic spellings)

Installation

npm


npm install @indodev/toolkit

pnpm


pnpm add @indodev/toolkit

yarn


yarn add @indodev/toolkit

bun


bun add @indodev/toolkit

Quick Start


import { stemText, encodePhonetic, isPhoneticMatch, tokenizeIndo, normalizeWhitespace } from '@indodev/toolkit/nlp';
 
// Stemming
stemText('mempertanggungjawabkan'); // 'tanggung jawab'
 
// Phonetic matching
encodePhonetic('Syahruddin');  // 'S635'
encodePhonetic('Sjahruddin');   // 'S635' (same!)
isPhoneticMatch('Fakhri', 'Fahri'); // true
 
// Tokenization (preserves abbreviations)
tokenizeIndo("Kpd Yth. Bpk. Budi di Jl. Sudirman. Harap datang.");
// ["Kpd Yth. Bpk. Budi di Jl. Sudirman.", "Harap datang."]
 
// Normalization
normalizeWhitespace("  Budi   pergi  ke   sekolah  ");
// "Budi pergi ke sekolah"

API Reference

stemText()

Strips Indonesian affixes using algorithmic rules.


function stemText(text: string): string;

Parameters:

Name	Type	Description
`text`	`string`	Text to stem

Returns: Stemmed word (affix-stripped)

Examples:


stemText('mempertanggungjawabkan'); // 'tanggung jawab'
stemText('berkewarganegaraan');       // 'warganegara'
stemText('dikerjakan');               // 'kerja'
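
To make "algorithmic rules" more concrete, here is a minimal sketch of heuristic affix stripping. It is not the library's implementation: `sketchStem`, the affix lists, and the minimum stem length are all assumptions chosen for illustration, and many real rules (for example the me- prefix family) are left out.

```typescript
// Hypothetical, deliberately small affix lists (assumed for illustration only).
const SUFFIXES: string[] = ['kan', 'an', 'i'];
const PREFIXES: string[] = ['di', 'ke', 'se', 'ber', 'ter'];
const MIN_STEM_LENGTH = 3; // assumed guard so short words are not stripped to nothing

function sketchStem(word: string): string {
  let stem = word.toLowerCase();

  // Strip at most one suffix, trying the longest suffix first.
  for (const suffix of SUFFIXES) {
    const rest = stem.length - suffix.length;
    if (rest >= MIN_STEM_LENGTH && stem.slice(rest) === suffix) {
      stem = stem.slice(0, rest);
      break;
    }
  }

  // Strip at most one prefix.
  for (const prefix of PREFIXES) {
    const rest = stem.length - prefix.length;
    if (rest >= MIN_STEM_LENGTH && stem.slice(0, prefix.length) === prefix) {
      stem = stem.slice(prefix.length);
      break;
    }
  }
  return stem;
}

console.log(sketchStem('dikerjakan')); // 'kerja'
console.log(sketchStem('berjalan'));   // 'jal' (over-stemmed; the root is 'jalan')
```

The second call shows the kind of over-stemming a dictionary-free approach is prone to: nothing in the rules can tell that the trailing "an" of "jalan" belongs to the root.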

encodePhonetic()

Encodes text to phonetic representation for fuzzy matching.


function encodePhonetic(text: string): string;

Parameters:

Name	Type	Description
`text`	`string`	Text to encode

Returns: Phonetic code (letter + 3 digits)

Examples:


encodePhonetic('Syahruddin'); // 'S635'
encodePhonetic('Sjahrudin');   // 'S635'
encodePhonetic('Budi');        // 'B300'

False Positives: Phonetic encoding can produce the same code for different words (e.g., Bagus ≈ Bagas). Treat a match from isPhoneticMatch as a fuzzy-matching signal, not as identity verification.
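
The "letter + 3 digits" format is the shape of a classic Soundex code. The sketch below pairs plain Soundex with three spelling rewrites to show how such an encoder could work. It is an assumption-laden illustration, not the library's algorithm: `sketchEncodePhonetic` and the `SPELLING_RULES` list are hypothetical, and the real rule set is not documented here.

```typescript
// Hypothetical spelling rules, assumed for illustration; the library's rules may differ.
const SPELLING_RULES: Array<[RegExp, string]> = [
  [/oe/g, 'u'],  // Dutch-era "oe" -> "u"      (Bhoedie ~ Budi)
  [/sj/g, 'sy'], // Dutch-era "sj" -> "sy"     (Sjahrudin ~ Syahrudin)
  [/kh/g, 'h'],  // Arabic-derived "kh" -> "h" (Fakhri ~ Fahri)
];

// Classic Soundex consonant classes.
const DIGITS: Record<string, string> = {
  b: '1', f: '1', p: '1', v: '1',
  c: '2', g: '2', j: '2', k: '2', q: '2', s: '2', x: '2', z: '2',
  d: '3', t: '3',
  l: '4',
  m: '5', n: '5',
  r: '6',
};

function sketchEncodePhonetic(text: string): string {
  // Keep ASCII letters only, then apply the spelling rewrites.
  let s = text.toLowerCase().replace(/[^a-z]/g, '');
  for (const [pattern, replacement] of SPELLING_RULES) {
    s = s.replace(pattern, replacement);
  }
  if (s.length === 0) return '';

  let code = s.charAt(0).toUpperCase();
  let prev = DIGITS[s.charAt(0)] ?? '';
  for (let i = 1; i < s.length && code.length < 4; i++) {
    const ch = s.charAt(i);
    if (ch === 'h' || ch === 'w') continue; // h and w do not separate equal codes
    const digit = DIGITS[ch] ?? '';         // vowels and y map to ''
    if (digit !== '' && digit !== prev) code += digit;
    prev = digit;
  }
  return (code + '000').slice(0, 4); // pad to letter + 3 digits
}

console.log(sketchEncodePhonetic('Syahruddin')); // 'S635'
console.log(sketchEncodePhonetic('Sjahrudin'));  // 'S635'
console.log(sketchEncodePhonetic('Budi'));       // 'B300'
```

Because this is plain Soundex plus three rewrites, it will not necessarily agree with encodePhonetic on every input.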

isPhoneticMatch()

Compares two strings for phonetic equality.


function isPhoneticMatch(text1: string, text2: string): boolean;

Parameters:

Name	Type	Description
`text1`	`string`	First text
`text2`	`string`	Second text

Returns: true if both strings encode to the same phonetic code

Examples:


isPhoneticMatch('Fakhri', 'Fahri');   // true
isPhoneticMatch('Bhoedie', 'Budi');   // true
isPhoneticMatch('John', 'Jane');      // false
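
isPhoneticMatch compares one pair at a time. When many names need to be matched against each other, one possible approach is to encode each name once and bucket names by code. The sketch below assumes exactly that; `groupByPhoneticCode` and `demoEncode` are hypothetical helpers, and the encoder is passed in as a parameter so that the real encodePhonetic could be supplied in its place.

```typescript
type Encoder = (text: string) => string;

// Group names whose phonetic codes are equal. The encoder is injected so the
// sketch does not depend on any particular encoding algorithm.
function groupByPhoneticCode(names: string[], encode: Encoder): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const name of names) {
    const code = encode(name);
    if (code === '') continue; // skip inputs that produce no code
    (groups[code] = groups[code] || []).push(name);
  }
  return groups;
}

// Stand-in encoder for demonstration: returns the codes listed in the
// encodePhonetic() examples above, and '' for anything else.
const DOCUMENTED_CODES: Record<string, string> = {
  Syahruddin: 'S635',
  Sjahrudin: 'S635',
  Budi: 'B300',
};
const demoEncode: Encoder = (text) => DOCUMENTED_CODES[text] ?? '';

const groups = groupByPhoneticCode(['Syahruddin', 'Budi', 'Sjahrudin'], demoEncode);
console.log(groups['S635']); // ['Syahruddin', 'Sjahrudin']
console.log(groups['B300']); // ['Budi']
```

Names that share a bucket are candidates for the same person or word, not confirmed matches.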

tokenizeIndo()

Splits Indonesian text into sentences, preserving abbreviations.


function tokenizeIndo(text: string): string[];

Parameters:

Name	Type	Description
`text`	`string`	Text to tokenize

Returns: Array of sentences

Preserved abbreviations: Yth., Bpk., Ibu., Sdr., S.Kom., M.Kom., Dr., Sp., Mk., Jl., D.a., D.l.

Examples:


tokenizeIndo("Kpd Yth. Bpk. Budi di Jl. Sudirman. Harap datang.");
// ["Kpd Yth. Bpk. Budi di Jl. Sudirman.", "Harap datang."]
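
As a rough illustration of abbreviation-aware splitting, the sketch below walks the text word by word and closes a sentence only when a word ends in a terminator and is not a known abbreviation. `sketchTokenize` is a hypothetical name, the abbreviation list is a subset of the one above, and the library may well use a different strategy.

```typescript
// Subset of the abbreviations listed above, lowercased for comparison.
const ABBREVIATIONS: string[] = ['yth.', 'bpk.', 'sdr.', 'dr.', 'jl.'];

function sketchTokenize(text: string): string[] {
  const sentences: string[] = [];
  let current: string[] = [];

  for (const word of text.split(/\s+/)) {
    if (word === '') continue; // leading or trailing whitespace yields empty pieces
    current.push(word);

    const endsWithTerminator = /[.!?]$/.test(word);
    const isAbbreviation = ABBREVIATIONS.indexOf(word.toLowerCase()) !== -1;
    if (endsWithTerminator && !isAbbreviation) {
      sentences.push(current.join(' '));
      current = [];
    }
  }

  // Keep trailing text that has no terminator.
  if (current.length > 0) sentences.push(current.join(' '));
  return sentences;
}

console.log(sketchTokenize('Kpd Yth. Bpk. Budi di Jl. Sudirman. Harap datang.'));
// ['Kpd Yth. Bpk. Budi di Jl. Sudirman.', 'Harap datang.']
```

One trade-off of this approach: a sentence that genuinely ends with an abbreviation is not split at that point.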

normalizeWhitespace()

Normalizes whitespace (collapses multiple spaces, trims).


function normalizeWhitespace(text: string): string;

Parameters:

Name	Type	Description
`text`	`string`	Text to normalize

Examples:


normalizeWhitespace("  Budi   pergi  ke   sekolah  ");
// "Budi pergi ke sekolah"
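
The behavior described above can be approximated with a single regular expression. This sketch assumes that tabs and newlines count as whitespace, which the reference does not state; `sketchNormalizeWhitespace` is a hypothetical name.

```typescript
function sketchNormalizeWhitespace(text: string): string {
  // Collapse every run of whitespace to one space, then trim both ends.
  return text.replace(/\s+/g, ' ').trim();
}

console.log(sketchNormalizeWhitespace('  Budi   pergi  ke   sekolah  '));
// 'Budi pergi ke sekolah'
```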

stripNonAlphanumeric()

Removes non-alphanumeric characters except spaces.


function stripNonAlphanumeric(text: string): string;

Parameters:

Name	Type	Description
`text`	`string`	Text to clean

Examples:


stripNonAlphanumeric("Budi123@#$%^&*()");
// "Budi123"
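
Removing punctuation that sits between spaces can leave double spaces behind, so stripping is often followed by whitespace collapsing. The sketch below shows that effect under the assumption that only ASCII letters, digits, and spaces are kept; the reference does not say how non-ASCII letters are treated, and `sketchStripNonAlphanumeric` is a hypothetical name.

```typescript
function sketchStripNonAlphanumeric(text: string): string {
  // Keep ASCII letters, digits, and spaces; drop everything else.
  return text.replace(/[^A-Za-z0-9 ]/g, '');
}

const stripped = sketchStripNonAlphanumeric('Jl. Sudirman, No. 5 - Jakarta!');
console.log(stripped); // 'Jl Sudirman No 5  Jakarta' (double space where " - " was)

// Collapse the leftover spaces in a second step.
const cleaned = stripped.replace(/ +/g, ' ').trim();
console.log(cleaned); // 'Jl Sudirman No 5 Jakarta'
```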

Type Reference

PhoneticResult


interface PhoneticResult {
  original: string;
  encoded: string;
}

TokenizationResult


interface TokenizationResult {
  sentences: string[];
  abbreviationCount: number;
}

StemmingResult


interface StemmingResult {
  original: string;
  stemmed: string;
  removedAffixes: string[];
}

Known Limitations

Stemming

Heuristic-only (no dictionary)
May over-stem or under-stem in some cases
Best for search/indexing, not grammatical accuracy

Phonetics

False positives possible (same code, different words)
Dutch/Arabic normalization may affect accuracy for other languages
Not suitable for identity verification

Related Modules

Text - Additional text manipulation utilities
Privacy (PDP) - PII scanning and masking
