Fix HTML entity encoding/decoding in markdown conversion#7565
🦋 Changeset detected
Latest commit: 3774f9e
The changes in this PR will be included in the next version bump. This PR includes changesets to release 72 packages.
Pull request overview
Fixes HTML entity handling in the Markdown conversion pipeline to ensure safe/consistent roundtripping of special characters (notably <, >, &) between markdown ↔︎ editor JSON, with explicit exceptions for code contexts.
Changes:
- Added `decodeHtmlEntities` / `encodeHtmlEntities` utilities to `@tiptap/core` and re-exported them from `@tiptap/markdown` utils for compatibility.
- Decoded entities when parsing markdown text tokens and encoded special characters when serializing text nodes back to markdown (skipping code blocks / inline code).
- Added regression tests covering decoding, encoding, roundtrips, doubly-encoded sequences, and empty-paragraph behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/markdown/src/utils.ts | Re-exports core html entity utilities to keep local utils imports stable. |
| packages/markdown/src/MarkdownManager.ts | Applies entity decoding during parsing and entity encoding during serialization (with code-context exclusions). |
| packages/markdown/__tests__/conversion.spec.ts | Adds targeted tests for entity decode/encode + roundtrip behavior; includes Code mark in setup. |
| packages/extension-text/src/text.ts | Decodes common entities when parsing markdown text tokens into text nodes. |
| packages/core/src/utilities/index.ts | Exposes the new html entity utility module via core utilities exports. |
| packages/core/src/utilities/htmlEntities.ts | Implements encode/decode helpers with ordering to handle doubly-encoded sequences. |
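
As a rough illustration of the ordering that the `htmlEntities.ts` row refers to, the helpers might look something like the sketch below. This is not the PR's actual code: the function bodies are assumptions pieced together from the commit notes (decode `&lt;`, `&gt;`, `&quot;` and then `&amp;`; encode `&` first, then `<` and `>`, leaving `"` alone).

```typescript
// Hypothetical sketch; the real htmlEntities.ts in @tiptap/core may differ.

/** Decode the entities named in the PR to their literal characters. */
function decodeHtmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    // `&amp;` is decoded last so that `&amp;lt;` becomes `&lt;`, not `<`.
    .replace(/&amp;/g, '&')
}

/** Encode `&`, `<` and `>`; double quotes are intentionally left untouched. */
function encodeHtmlEntities(text: string): string {
  return text
    // `&` is encoded first so the `&` emitted for `<` and `>` is not re-encoded.
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
}
```

Under these assumptions, `decodeHtmlEntities(encodeHtmlEntities(s))` returns `s` for any input string, which is the roundtrip property the tests in this PR are described as checking.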
@tiptap/extension-character-count
@tiptap/extension-focus
@tiptap/extension-dropcursor
@tiptap/extension-gapcursor
@tiptap/extension-history
@tiptap/extension-list-item
@tiptap/extension-list-keymap
@tiptap/extension-placeholder
@tiptap/extension-table-header
@tiptap/extension-table-cell
@tiptap/extension-table-row
@tiptap/extension-task-item
@tiptap/extension-task-list
@tiptap/core
@tiptap/extension-blockquote
@tiptap/extension-bold
@tiptap/extension-audio
@tiptap/extension-bullet-list
@tiptap/extension-bubble-menu
@tiptap/extension-code
@tiptap/extension-code-block-lowlight
@tiptap/extension-code-block
@tiptap/extension-collaboration
@tiptap/extension-collaboration-caret
@tiptap/extension-color
@tiptap/extension-details
@tiptap/extension-document
@tiptap/extension-drag-handle
@tiptap/extension-drag-handle-react
@tiptap/extension-drag-handle-vue-3
@tiptap/extension-drag-handle-vue-2
@tiptap/extension-floating-menu
@tiptap/extension-file-handler
@tiptap/extension-emoji
@tiptap/extension-hard-break
@tiptap/extension-font-family
@tiptap/extension-heading
@tiptap/extension-highlight
@tiptap/extension-horizontal-rule
@tiptap/extension-image
@tiptap/extension-invisible-characters
@tiptap/extension-italic
@tiptap/extension-link
@tiptap/extension-list
@tiptap/extension-mathematics
@tiptap/extension-mention
@tiptap/extension-node-range
@tiptap/extension-ordered-list
@tiptap/extension-strike
@tiptap/extension-subscript
@tiptap/extension-paragraph
@tiptap/extension-superscript
@tiptap/extension-table
@tiptap/extension-table-of-contents
@tiptap/extension-text
@tiptap/extension-text-align
@tiptap/extension-text-style
@tiptap/extension-typography
@tiptap/extension-twitch
@tiptap/extension-underline
@tiptap/extension-unique-id
@tiptap/extension-youtube
@tiptap/extensions
@tiptap/markdown
@tiptap/html
@tiptap/react
@tiptap/starter-kit
@tiptap/pm
@tiptap/static-renderer
@tiptap/suggestion
@tiptap/vue-3
@tiptap/vue-2
…ndtrip (#7539) Decode HTML entities (&lt;, &gt;, &amp;, &quot;) to literal characters during markdown parsing so the editor displays them correctly, and re-encode them during serialization so they survive markdown roundtrips. Code blocks and inline code are excluded from encoding to preserve literal characters in code contexts. https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
…cation Move decodeHtmlEntities and encodeHtmlEntities from markdown/utils.ts and extension-text into a shared @tiptap/core utility. The markdown package re-exports from core for backward compatibility. Also removes the issue number reference from the test description. https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
encodeHtmlEntities now encodes `"` → `&quot;` to match the `&quot;` → `"` decoding already present in decodeHtmlEntities. Also adds roundtrip tests for the encode/decode pair. https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
Text nodes with a `code` mark should preserve literal characters like `<`, `>`, and `&` rather than encoding them to HTML entities. This mirrors the existing code-mark check in renderNodesWithMarkBoundaries. https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
Add parse, serialize, and roundtrip tests for `&quot;` ⇄ `"` to match the existing coverage for <, >, and &. https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
Double quotes are ordinary markdown characters and don't need entity encoding. Keep decoding `&quot;` → `"` (markdown-it may emit it) but don't encode `"` back; this avoids mangling quoted text in serialized markdown. https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
Force-pushed from dd4e4ff to f73da0f.
Instead of checking `parentNode?.type === 'codeBlock'` and `mark.type === 'code'`, build a set of code-like extension names from the `code: true` spec property at registration time. This respects custom extensions that set `code: true` and won't break if users rename the built-in code/codeBlock node types. https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
… case
- Fix changeset to clarify that `&quot;` is decoded but not re-encoded
- Add regression test proving `&amp;nbsp;` roundtrips correctly and is not misinterpreted as an empty paragraph marker
https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
…ports
- Remove decodeHtmlEntities from extension-text (dead code; MarkdownManager already decodes text tokens before the extension handler is reached)
- Drop @tiptap/extension-text from changeset since it has no behavioral change
- Import decode/encode utilities directly from @tiptap/core in MarkdownManager instead of re-exporting through markdown/utils.ts
- Replace [...currentMarks.keys()].some() with node.marks check to avoid unnecessary array allocation
https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
- Replace "markdown-it" with "the markdown tokenizer" in htmlEntities.ts JSDoc since this repo uses marked, not markdown-it.
- Extract isInsideCode detection + entity encoding into a shared private `encodeTextForMarkdown` method on MarkdownManager, deduplicating the logic between renderNodeToMarkdown and renderNodesWithMarkBoundaries.
https://claude.ai/code/session_01BhDQNLqwkb5XMwHqiRA9Mz
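
The code-context exclusion described in the commits above (a set of code-like names built from `code: true`, consulted by a shared `encodeTextForMarkdown` step) could be sketched roughly as follows. The interfaces, the `collectCodeTypes` helper and the function signature are hypothetical simplifications for illustration; they are not the real Tiptap or ProseMirror types, and the actual method on MarkdownManager may be shaped differently.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ExtensionLike { name: string; code?: boolean }
interface MarkLike { type: string }
interface TextNodeLike { text: string; marks?: MarkLike[] }

// Assumed encoder: `&` first, then `<` and `>`; double quotes are left alone.
function encodeHtmlEntities(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

// Collect the names of every extension whose spec sets `code: true`,
// so renamed or custom code nodes/marks are still recognised.
function collectCodeTypes(extensions: ExtensionLike[]): Set<string> {
  return new Set(extensions.filter(ext => ext.code === true).map(ext => ext.name))
}

// Return the text unchanged inside code contexts, entity-encoded elsewhere.
function encodeTextForMarkdown(
  node: TextNodeLike,
  parentType: string | undefined,
  codeTypes: Set<string>,
): string {
  const insideCodeBlock = parentType !== undefined && codeTypes.has(parentType)
  const hasCodeMark = (node.marks || []).some(mark => codeTypes.has(mark.type))
  return insideCodeBlock || hasCodeMark ? node.text : encodeHtmlEntities(node.text)
}
```

Keying the check on a set built at registration time, rather than on the literal names `codeBlock` and `code`, is what lets the same check cover custom extensions, as the earlier commit message notes.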
Changes Overview
This PR fixes HTML entity handling in markdown parsing and serialization to ensure proper roundtripping of special characters like `<`, `>`, and `&`. Previously, these characters were not being properly encoded/decoded, causing data loss or corruption during markdown conversion.
Implementation Approach
- Added HTML entity decoding during parsing: When parsing markdown tokens, HTML entities (`&lt;`, `&gt;`, `&quot;`, `&amp;`) are now decoded to their literal character equivalents so they display correctly in the editor.
- Added HTML entity encoding during serialization: When serializing editor content back to markdown, special characters are encoded to their HTML entity equivalents to ensure safe roundtripping.
- Preserved literal characters in code contexts: Code blocks and inline code marks are excluded from entity encoding since they should preserve literal `<`, `>`, and `&` characters without escaping.
- Implemented proper encoding order:
  - `&lt;`, `&gt;`, `&quot;` are decoded first, then `&amp;` last to handle doubly-encoded sequences correctly (e.g., `&amp;lt;` → `&lt;`)
  - `&` is encoded first to avoid double-encoding (e.g., `<` → `&lt;`, not `&amp;lt;`)
- Added Code extension to test setup: The `Code` extension was added to the test configuration to support inline code mark testing.
Testing Done
Comprehensive test suite added covering:
- … (`<`, `>`, `&`)
- … (`&amp;lt;`)
- … in empty paragraphs
All tests pass and verify correct behavior across parsing, serialization, and roundtripping scenarios.
Verification Steps
- `npm test -- packages/markdown/__tests__/conversion.spec.ts`
- … `<`, `>`, `&` characters without encoding
Additional Notes
The implementation handles edge cases like doubly-encoded entities and preserves the special behavior of `&nbsp;` for empty paragraphs. The entity encoding/decoding logic is centralized in utility functions for consistency across the codebase.
Checklist
Related Issues
Fixes #7539