Text Cleaner runs a configurable pipeline of small string transforms over the input. Each operation is implemented with native String methods and RegExp — no external NLP library is loaded — and the operations execute in a fixed order so combining them produces predictable output. The full pipeline covers more than twenty independent transformations spanning whitespace, line endings, quotes, HTML, casing, and Unicode normalization.
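A fixed-order pipeline of this kind can be sketched as an ordered table of named transforms. The operation names, the three sample transforms, and the runPipeline helper below are illustrative assumptions, not the tool's actual list or order.

```javascript
// Hypothetical operation table; the array order is the execution order.
const OPERATIONS = [
  ["normalizeLineEndings", (s) => s.replace(/\r\n?/g, "\n")],
  ["trimTrailingWhitespace", (s) => s.replace(/[ \t]+$/gm, "")],
  ["collapseSpaces", (s) => s.replace(/[ ]{2,}/g, " ")],
];

// Apply every enabled operation in table order, ignoring the order
// in which the caller happened to list them.
function runPipeline(text, enabled) {
  const on = new Set(enabled);
  return OPERATIONS.reduce(
    (acc, [name, fn]) => (on.has(name) ? fn(acc) : acc),
    text
  );
}

// The caller's ordering does not matter; the table's ordering does.
const cleaned = runPipeline("a  b \r\nc", [
  "collapseSpaces",
  "trimTrailingWhitespace",
  "normalizeLineEndings",
]);
console.log(JSON.stringify(cleaned)); // "a b\nc"
```

Keeping the order in one table rather than in the caller is what makes combined options predictable: the same set of enabled switches always yields the same result.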
Whitespace cleanup uses three regex passes. The extra-spaces pass collapses runs of two or more space characters to a single space using /[ ]{2,}/g (tabs are handled separately). Trailing whitespace on each line is stripped with /[ \t]+$/gm; the multiline flag makes $ match at every line end rather than only at the end of the input. Empty lines are detected with /^\s*$/gm and either removed or collapsed to a single blank line.
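The three passes can be sketched as follows. The function names are hypothetical, and the blank-line handler is one possible approach: it assumes LF line endings and applies the same blank-line test per line after splitting, rather than using the global multiline regex directly.

```javascript
// Collapse runs of two or more spaces (tabs are deliberately untouched).
function collapseSpaces(text) {
  return text.replace(/[ ]{2,}/g, " ");
}

// Strip spaces and tabs at the end of every line; `m` makes $ match per line.
function trimTrailing(text) {
  return text.replace(/[ \t]+$/gm, "");
}

// Per-line form of the blank-line test (no g/m flags needed after split).
const BLANK = /^\s*$/;

// mode "remove" drops every blank line; mode "collapse" keeps one per run.
function handleBlankLines(text, mode) {
  const out = [];
  let prevBlank = false;
  for (const line of text.split("\n")) {
    const blank = BLANK.test(line);
    if (blank && (mode === "remove" || prevBlank)) continue;
    out.push(blank ? "" : line);
    prevBlank = blank;
  }
  return out.join("\n");
}

console.log(JSON.stringify(collapseSpaces("a   b  c")));                     // "a b c"
console.log(JSON.stringify(trimTrailing("x \t\ny  ")));                      // "x\ny"
console.log(JSON.stringify(handleBlankLines("a\n\n\n  \nb", "collapse")));   // "a\n\nb"
```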
Line-ending normalization handles the historical mess of CRLF (Windows, U+000D U+000A), LF (Unix and macOS, U+000A), and standalone CR (classic Mac, U+000D). The tool detects which style is present and rewrites the entire input to a single chosen style. This matters when pasted text mixes line endings — git, diff, and most code formatters treat CRLF and LF differently.
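Detection and rewriting might look like the sketch below. The function names, the style keys, and the idea of reporting a count per style are assumptions made for illustration.

```javascript
const EOL = { lf: "\n", crlf: "\r\n", cr: "\r" };

// Count each line-ending style. A CRLF pair is counted once as CRLF,
// not additionally as a lone CR or a lone LF.
function detectLineEndings(text) {
  const count = (re) => (text.match(re) || []).length;
  const crlf = count(/\r\n/g);
  const cr = count(/\r(?!\n)/g);   // CR not followed by LF
  const lf = count(/\n/g) - crlf;  // every LF that is not part of a CRLF
  return { crlf, cr, lf };
}

// Rewrite every line ending to one style. The alternation tries \r\n first,
// so a CRLF pair is replaced as a unit instead of producing two line breaks.
function normalizeLineEndings(text, style = "lf") {
  return text.replace(/\r\n|\r|\n/g, EOL[style]);
}

const mixed = "a\r\nb\nc\rd";
console.log(detectLineEndings(mixed));                            // { crlf: 1, cr: 1, lf: 1 }
console.log(JSON.stringify(normalizeLineEndings(mixed, "crlf"))); // "a\r\nb\r\nc\r\nd"
```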
Quote normalization swaps between straight quotes (U+0022, U+0027) and the smart-quote family (U+201C, U+201D, U+2018, U+2019, plus the prime characters U+2032, U+2033). The toggle uses a small lookup map rather than a regex. When straight quotes become smart quotes, each apostrophe or quotation mark is converted directionally: the opening or closing form is chosen based on whether the surrounding character is alphanumeric or whitespace.
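A minimal sketch of both directions follows. The map contents mirror the code points listed above, but the function names and the exact opening/closing heuristic (look only at the preceding character, treat whitespace or an opening bracket as "opening") are assumptions; the tool's actual rule may differ.

```javascript
// Smart-to-straight needs no context, so a plain lookup map is enough.
const TO_STRAIGHT = {
  "\u201C": '"', "\u201D": '"', "\u2033": '"',
  "\u2018": "'", "\u2019": "'", "\u2032": "'",
};

function toStraightQuotes(text) {
  return Array.from(text, (ch) => TO_STRAIGHT[ch] ?? ch).join("");
}

// Straight-to-smart needs direction. Assumed heuristic: a quote at the start
// of the input, or after whitespace or an opening bracket, opens; otherwise
// it closes (which also covers apostrophes inside words).
function toSmartQuotes(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch !== '"' && ch !== "'") {
      out += ch;
      continue;
    }
    const prev = i === 0 ? " " : text[i - 1];
    const opening = /[\s(\[{]/.test(prev);
    if (ch === '"') out += opening ? "\u201C" : "\u201D";
    else out += opening ? "\u2018" : "\u2019";
  }
  return out;
}

console.log(toSmartQuotes('"hi" it\'s'));                 // “hi” it’s
console.log(toStraightQuotes("\u201Chi\u201D it\u2019s")); // "hi" it's
```

This heuristic is deliberately simple and gets nested quotes wrong (a single quote directly after an opening double quote is treated as closing), so it should be read as an illustration of the directional idea only.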
Non-printable cleanup removes the C0 control characters (U+0000 to U+001F), DEL (U+007F), and the C1 control characters (U+0080 to U+009F), excluding tab, newline, and carriage return. The byte-order mark (U+FEFF) is also stripped if present at the start of the input; copying from a file saved as UTF-8 with a BOM is the most common way that ghost character ends up in pasted prose. The zero-width space, the zero-width non-joiner and joiner, and the word joiner (U+200B through U+200D, U+2060) are also removed because they are invisible but break search and diff tools.
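The character classes described above can be written out as in this sketch. The constant and function names are hypothetical; the ranges follow the code points listed in the paragraph, with gaps left for tab (U+0009), newline (U+000A), and carriage return (U+000D).

```javascript
// C0 controls minus tab/LF/CR, then DEL and the C1 controls.
const CONTROL_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/g;

// Zero-width space, zero-width non-joiner, zero-width joiner, word joiner.
const ZERO_WIDTH = /[\u200B-\u200D\u2060]/g;

function stripNonPrintable(text) {
  // Only a leading U+FEFF is treated as a byte-order mark.
  const withoutBom = text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
  return withoutBom.replace(CONTROL_CHARS, "").replace(ZERO_WIDTH, "");
}

const dirty = "\uFEFFa\u0000b\tc\u200Bd\r\n\u0085e";
console.log(JSON.stringify(stripNonPrintable(dirty))); // "ab\tcd\r\ne"
```

One consequence worth knowing: U+200D is also what glues multi-part emoji together, so removing it splits such sequences into their component emoji.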
HTML stripping uses the regex /<[^>]+>/g to remove tags, then optionally decodes entities (&amp;, &lt;, &gt;, &quot;, &#39;, and the named-entity subset that a detached element created with document.createElement() resolves when the text is read back through textContent). HTML entity encoding does the reverse using a small lookup table. Neither operation parses HTML; both treat the input as a plain string, so malformed markup with unbalanced angle brackets may produce surprising results.
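The string-level approach can be sketched without any DOM access. The table below covers only the five basic entities; the function names are hypothetical, and the browser-based decoding of further named entities is left out of this sketch.

```javascript
const ENCODE = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

// Invert the table for decoding: "&amp;" -> "&", and so on.
const DECODE = Object.fromEntries(
  Object.entries(ENCODE).map(([ch, entity]) => [entity, ch])
);

function stripTags(html) {
  return html.replace(/<[^>]+>/g, "");
}

function encodeEntities(text) {
  return text.replace(/[&<>"']/g, (ch) => ENCODE[ch]);
}

// A single pass, so "&amp;lt;" decodes to "&lt;" and is not decoded twice.
function decodeEntities(text) {
  return text.replace(/&(?:amp|lt|gt|quot|#39);/g, (entity) => DECODE[entity]);
}

console.log(stripTags("<p>Hi <b>there</b></p>"));   // Hi there
console.log(encodeEntities('<a href="x">'));        // &lt;a href=&quot;x&quot;&gt;
console.log(decodeEntities("&amp;lt;"));            // &lt;

// The "surprising results" case: bare comparison operators look like a tag.
console.log(JSON.stringify(stripTags("1 < 2 and 3 > 2"))); // "1  2"
```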
Settings persist via localStorage under the key 'text-cleaner-settings' and the live-preview pipeline is debounced by 200 ms to keep typing responsive. The full clean pass runs in O(n) over the input length, so multi-megabyte documents are interactive on modern hardware. No request leaves the browser tab — confidential drafts and unredacted logs stay local.
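Persistence and debouncing could be wired up roughly as below. Only the storage key and the 200 ms delay come from the description above; the function names, the JSON encoding of the settings object, and the injectable storage parameter (which lets the code run outside a browser) are assumptions.

```javascript
const SETTINGS_KEY = "text-cleaner-settings";

// `storage` defaults to the browser's localStorage; any object with
// getItem/setItem can be passed instead.
function saveSettings(settings, storage = globalThis.localStorage) {
  storage.setItem(SETTINGS_KEY, JSON.stringify(settings));
}

// Stored values override defaults; missing or corrupt data falls back to defaults.
function loadSettings(defaults, storage = globalThis.localStorage) {
  try {
    const raw = storage.getItem(SETTINGS_KEY);
    return raw ? { ...defaults, ...JSON.parse(raw) } : { ...defaults };
  } catch {
    return { ...defaults };
  }
}

// Delay `fn` until `delayMs` have passed without another call.
function debounce(fn, delayMs = 200) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}
```

In a browser, the preview handler would then be attached along the lines of input.addEventListener("input", debounce(updatePreview)), where input and updatePreview are placeholders for the page's own element and render function.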