Text Cleaner

Clean up messy text: remove extra spaces, normalize line endings, fix punctuation, change case, strip HTML, and more. Everything runs client-side, and the pipeline is linear in the input length, so large inputs stay practical.

Strip the formatting artefacts that creep into text when it travels through PDFs, emails, and copy-paste: smart quotes (“ ”), non-breaking spaces, zero-width characters, double spaces, and trailing whitespace. Useful before pasting into a CMS that mangles smart punctuation, or for cleaning text scraped from sources with broken Unicode. Each rule is a toggleable checkbox, so you keep what you want. The cleaning runs as a sequence of regex passes in the page.

Runs right inside your browser tab. No uploads. Your files stay private.

How the Text Cleaner Pipeline Works

Text Cleaner runs a configurable pipeline of small string transforms over the input. Each operation is implemented with native String methods and RegExp — no external NLP library is loaded — and the operations execute in a fixed order so combining them produces predictable output. The full pipeline covers more than twenty independent transformations spanning whitespace, line endings, quotes, HTML, casing, and Unicode normalization.
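The fixed-order behaviour described above can be sketched as an ordered array of steps, each gated by a toggle. This is a minimal illustration, not the tool's source: the names `CleanStep`, `PIPELINE_STEPS`, and `runPipeline`, the three stand-in steps, and their order are all assumptions made for the example.

```typescript
// Hypothetical sketch of a fixed-order, toggleable pipeline.
// None of these names come from the tool itself.
type CleanStep = {
  id: string;
  apply: (text: string) => string;
};

// The array order is the execution order, regardless of the order
// in which the checkboxes were ticked.
const PIPELINE_STEPS: CleanStep[] = [
  { id: "collapseSpaces", apply: (t) => t.replace(/[ \t\u00A0]+/g, " ") },
  { id: "trimLines", apply: (t) => t.split("\n").map((l) => l.trim()).join("\n") },
  { id: "lowercase", apply: (t) => t.toLowerCase() },
];

function runPipeline(text: string, enabled: ReadonlySet<string>): string {
  // Skip disabled steps; pass the running result through enabled ones.
  return PIPELINE_STEPS.reduce(
    (acc, step) => (enabled.has(step.id) ? step.apply(acc) : acc),
    text,
  );
}
```

Because the order lives in the array rather than in the set of enabled ids, enabling the same options always yields the same output for the same input.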
Whitespace cleanup runs a few independent passes. Extra-space collapsing replaces runs of spaces and tabs with a single space and also folds non-breaking spaces (U+00A0) into a regular space so Word and PDF artefacts disappear. Trimming removes whitespace from the start and end of each line. Empty lines are either removed entirely or collapsed so that no more than one blank line remains between blocks.
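The three whitespace passes could look roughly like the following. The function names are hypothetical, and the sketch assumes line endings have already been normalized to LF and that lines are trimmed before blank lines are collapsed.

```typescript
// Hypothetical helper names; each function is one independent pass.

// Collapse runs of spaces, tabs, and non-breaking spaces (U+00A0) to one space.
function collapseExtraSpaces(text: string): string {
  return text.replace(/[ \t\u00A0]+/g, " ");
}

// Remove leading and trailing spaces, tabs, and NBSP from every line.
function trimEachLine(text: string): string {
  return text.replace(/^[ \t\u00A0]+|[ \t\u00A0]+$/gm, "");
}

// Either drop blank lines entirely or keep at most one between blocks.
// "collapse" assumes blank lines are truly empty (already trimmed).
function handleEmptyLines(text: string, mode: "remove" | "collapse"): string {
  if (mode === "remove") {
    return text.split("\n").filter((line) => line.trim() !== "").join("\n");
  }
  return text.replace(/\n{3,}/g, "\n\n");
}
```

For example, `handleEmptyLines("a\n\n\n\nb", "collapse")` returns `"a\n\nb"`, leaving exactly one blank line between the two blocks.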
Line-ending normalization handles the historical mess of CRLF (Windows, U+000D U+000A), LF (Unix and macOS, U+000A), and standalone CR (classic Mac, U+000D). The tool detects which style is present and rewrites the entire input to a single chosen style. This matters when pasted text mixes line endings — git, diff, and most code formatters treat CRLF and LF differently.
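A minimal sketch of detection and rewriting, assuming a target style chosen by the user. `EolStyle`, `detectLineEndings`, and `normalizeLineEndings` are illustrative names, not the tool's API.

```typescript
type EolStyle = "LF" | "CRLF" | "CR";

const EOL_SEQUENCES: Record<EolStyle, string> = { LF: "\n", CRLF: "\r\n", CR: "\r" };

// Count each style so a caller could report what was detected.
function detectLineEndings(text: string): Record<EolStyle, number> {
  const crlf = (text.match(/\r\n/g) || []).length;
  const cr = (text.match(/\r(?!\n)/g) || []).length; // CR not followed by LF
  const lf = (text.match(/\n/g) || []).length - crlf; // LF not part of a CRLF
  return { LF: lf, CRLF: crlf, CR: cr };
}

// The alternation lists CRLF first so it is treated as one ending, not two.
function normalizeLineEndings(text: string, target: EolStyle = "LF"): string {
  return text.replace(/\r\n|\r|\n/g, EOL_SEQUENCES[target]);
}
```

Matching `\r\n` before the single-character alternatives is the important design choice: with the order reversed, each CRLF would be rewritten as two line breaks.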
Quote normalization swaps between straight quotes (U+0022, U+0027) and the smart-quote family (U+201C, U+201D, U+2018, U+2019). Smart-to-straight maps each curly quote back to its ASCII form. Straight-to-smart pairs quotes with a regex — each matched "..." or '...' becomes an opening/closing curly pair — so balanced quotation marks are converted directionally.
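Both directions can be sketched with plain regex replacements. The function names are hypothetical. The word-boundary guard on single quotes is an assumption added here so that apostrophes in words like "don't" are not mistaken for quotation marks; the page does not say whether the tool does the same.

```typescript
// Map each curly quote back to its ASCII form.
function smartToStraight(text: string): string {
  return text.replace(/[\u201C\u201D]/g, '"').replace(/[\u2018\u2019]/g, "'");
}

// Convert only balanced pairs that sit on one line; unpaired quotes stay as-is.
function straightToSmart(text: string): string {
  return text
    // "..." becomes an opening/closing double curly pair.
    .replace(/"([^"\n]*)"/g, "\u201C$1\u201D")
    // '...' becomes a single curly pair, but only when the opening quote is
    // not preceded by a letter or digit and the closing quote is not followed
    // by one (assumed guard against apostrophes).
    .replace(/(^|[^A-Za-z0-9])'([^'\n]*)'(?![A-Za-z0-9])/g, "$1\u2018$2\u2019");
}
```

With this sketch, `straightToSmart('She said "hi"')` yields `She said “hi”`, while `straightToSmart("don't say it's")` is returned unchanged.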
Non-printable cleanup removes characters in the C0 control range (U+0000 to U+001F), DEL (U+007F), and the C1 control range (U+0080 to U+009F), excluding tab, newline, and carriage return. The byte-order-mark (U+FEFF) is also stripped if present at the start of the input; copying from a file saved as UTF-8 with a BOM is a common way that ghost character ends up in pasted prose. Zero-width spaces and joiner characters (U+200B through U+200D, U+2060) are also removed because they are invisible but break search and diff tools.
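The ranges listed above translate directly into three character-class replacements. The function name `removeNonPrintable` is an assumption for the example.

```typescript
function removeNonPrintable(text: string): string {
  return text
    // Byte-order mark, only when it is the first character.
    .replace(/^\uFEFF/, "")
    // C0 controls except tab (U+0009), LF (U+000A), and CR (U+000D),
    // plus DEL (U+007F) and the C1 controls (U+0080 to U+009F).
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/g, "")
    // Zero-width space, non-joiner, joiner, and word joiner.
    .replace(/[\u200B-\u200D\u2060]/g, "");
}
```

One consequence worth keeping in mind: U+200D is also the joiner inside composite emoji sequences, so a pass like this splits those sequences into their component emoji.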
HTML stripping removes anything between angle brackets with a single regex pass. Entity decoding is separate: it assigns the text to a detached <textarea> element and reads back its decoded value, so the full set of named and numeric entities the browser knows (&amp;, &lt;, &gt;, &quot;, &#39;, and the rest) is resolved. Entity encoding does the reverse for the five core characters using a small lookup table. Tag stripping and entity encoding do not parse HTML; they treat the input as a plain string, so malformed markup with unbalanced angle brackets may produce surprising results.
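Tag stripping and entity encoding are simple enough to sketch as string operations. The names `stripTags`, `encodeEntities`, and `CORE_ENTITIES` are hypothetical. Decoding is left out because the textarea approach described above needs a browser DOM and would not run outside one.

```typescript
// Lookup table for the five core characters.
const CORE_ENTITIES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Remove anything between a "<" and the next ">". No HTML parsing is involved.
function stripTags(text: string): string {
  return text.replace(/<[^>]*>/g, "");
}

// Replace each core character with its entity in a single pass, so the "&"
// produced by one replacement is never encoded a second time.
function encodeEntities(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => CORE_ENTITIES[ch] || ch);
}
```

The string-based approach explains the surprising results with unbalanced brackets: `stripTags("1 < 2 and 3 > 2")` returns `"1  2"`, because everything from the `<` to the `>` is treated as a tag.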
Settings persist via localStorage under the key 'text-cleaner-settings'. Live preview recomputes the pipeline as you type; for very large pastes you can switch it off and use the Clean button to run the pipeline once on demand. The full clean pass runs in O(n) over the input length. No request leaves the browser tab — confidential drafts and unredacted logs stay local.
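Persistence under that key might be sketched as follows. Only the key name comes from the text above; the `CleanerSettings` shape, its default values, and the `KeyValueStore` interface are assumptions for illustration. The interface is a subset of the Web Storage API, so `window.localStorage` can be passed in a browser and an in-memory object elsewhere.

```typescript
// Subset of the Web Storage API used by this sketch.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const SETTINGS_KEY = "text-cleaner-settings"; // key named in the text above

// Hypothetical settings shape; the real option names are not documented here.
interface CleanerSettings {
  livePreview: boolean;
  lineEnding: "LF" | "CRLF" | "CR";
}

// Assumed defaults, used when nothing valid is stored.
const DEFAULT_SETTINGS: CleanerSettings = { livePreview: true, lineEnding: "LF" };

function loadSettings(store: KeyValueStore): CleanerSettings {
  try {
    const raw = store.getItem(SETTINGS_KEY);
    // Merge over the defaults so fields missing from storage fall back.
    return raw ? { ...DEFAULT_SETTINGS, ...JSON.parse(raw) } : { ...DEFAULT_SETTINGS };
  } catch {
    // Corrupt JSON or blocked storage: fall back to the defaults.
    return { ...DEFAULT_SETTINGS };
  }
}

function saveSettings(store: KeyValueStore, settings: CleanerSettings): void {
  store.setItem(SETTINGS_KEY, JSON.stringify(settings));
}
```

In a browser, the calls would be `loadSettings(window.localStorage)` on page load and `saveSettings(window.localStorage, settings)` whenever a checkbox changes.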

Common Use Cases

01

PDF copy-paste cleanup

Remove the stray spaces, smart quotes, and broken line wraps that PDF readers introduce when copying body text into a Word document or email.

02

Word-to-CMS migration

Normalize Word's smart quotes to straight ASCII and collapse Microsoft non-breaking spaces (U+00A0) before pasting into a CMS that expects plain text.

03

Code linting prep

Normalize tabs to spaces, fix CRLF vs LF mismatches, and strip trailing whitespace per line before committing pasted-in source code.

04

Email body sanitation

Remove forwarded-message indentation, soft hyphens, and zero-width spaces that break search and reply formatting in long email threads.

Frequently Asked Questions

Control characters in U+0000 to U+001F (excluding tab, newline, carriage return) and U+007F to U+009F, the byte-order-mark U+FEFF when at the start of input, and the zero-width family U+200B (zero-width space), U+200C (zero-width non-joiner), U+200D (zero-width joiner), and U+2060 (word joiner). These are common pollution sources from rich-text copy operations.
Smart-to-straight is mechanical: U+201C and U+201D both become U+0022, U+2018 and U+2019 both become U+0027. Straight-to-smart works on balanced pairs — a regex matches each "..." or '...' span and rewrites it as an opening/closing curly pair, so well-formed quotations are converted directionally. An unpaired straight quote is left as-is.
When 'fix line endings' is on, every CRLF, CR, and LF in the input is rewritten to whichever single style you select (LF by default, the Unix and macOS convention). This is essential for paste-from-Windows-into-shell-script scenarios, where stray carriage returns cause errors such as `$'\r': command not found` in bash.
No. It uses the regex /<[^>]*>/g, which removes anything between angle brackets. Self-closing tags are caught, and so are comments and CDATA blocks as long as they contain no '>' of their own. Only the tags are removed, so the text inside <script> and <style> elements stays in the output. Malformed input with unbalanced angle brackets may leak text; for those cases use a real HTML parser. For typical pasted web content, the regex approach is fast and usually sufficient.
The smart-quotes toggle works in both directions. If you converted straight to smart and your text had ASCII apostrophes, they were rewritten to U+2019 (right single quotation mark). To reverse, run the same input through with the opposite direction selected, or paste a fresh copy and disable the option.
Yes by default. CJK, Cyrillic, Arabic, emoji, and accented Latin all pass through unchanged. The 'remove accents' toggle is the explicit opt-in that decomposes via NFD and strips combining marks, turning 'café' into 'cafe'. Without that option enabled, all Unicode is preserved.
Memory-bound only. The full pipeline is O(n) over input length. Live preview recomputes on every keystroke, so very large inputs may show a small lag while you type — switch live preview off and click 'Clean' to run the pipeline once on demand instead.
No. Settings are stored in localStorage under the key 'text-cleaner-settings' and live in the current browser only. There is no account system and no cloud sync — clear the key via DevTools > Application > Local Storage to reset to defaults.
It corrects spacing around common punctuation: removes spaces before commas, periods, and other sentence punctuation, ensures a single space after it (without splitting runs like '...'), and trims the space just inside brackets and parentheses. It also normalizes ellipsis based on the selected quote style — three or more periods collapse to U+2026 when 'smart' is chosen, and U+2026 expands back to three periods when 'straight' is chosen.
No. The cleaner runs synchronously inside the browser tab. There is no fetch call and no analytics on the input. You can disconnect from the network after the page loads and the tool keeps working.
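Two of the transforms described in the answers above, accent removal and punctuation fixing, can be sketched in a few lines. The function names are hypothetical, and the spacing rules are a simplified subset of what the answer describes. As an assumption of this sketch, a missing space is inserted only after commas and semicolons, to avoid breaking URLs, decimals, and abbreviations.

```typescript
// Decompose to NFD, then drop combining diacritical marks (U+0300 to U+036F).
function removeAccents(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Simplified spacing fixes around punctuation.
function fixPunctuationSpacing(text: string): string {
  return text
    // No space before sentence punctuation.
    .replace(/[ \t]+([,.;:!?])/g, "$1")
    // Collapse two or more spaces after punctuation to one.
    .replace(/([,.;:!?])[ \t]{2,}/g, "$1 ")
    // Insert a missing space after a comma or semicolon before a letter.
    .replace(/([,;])(?=[A-Za-z])/g, "$1 ")
    // No space just inside brackets and parentheses.
    .replace(/([(\[])[ \t]+/g, "$1")
    .replace(/[ \t]+([)\]])/g, "$1");
}

// Ellipsis follows the selected quote style.
function normalizeEllipsis(text: string, style: "smart" | "straight"): string {
  return style === "smart"
    ? text.replace(/\.{3,}/g, "\u2026")
    : text.replace(/\u2026/g, "...");
}
```

With these definitions, `removeAccents("café")` returns `"cafe"`, and `fixPunctuationSpacing("Hello , world !  Wait... ( yes ) a,b")` returns `"Hello, world! Wait... (yes) a, b"`.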
Maintained by the WebToolVerse team