[Bug]: ISO 8859/1 messages display corrupted characters (encoding mismatch) #78

@Pappet

Affected Component

HL7 Parser (message parsing, delimiter detection)

Bug Description

Messages encoded in ISO 8859-1 (Latin-1) are displayed with broken/corrupted characters in HL7 Forge. Accented characters (ä, ö, ü, ß, é, etc.) appear as ?, as the Unicode replacement character (U+FFFD), or as garbled sequences such as Ã¶ instead of their correct glyphs.

Example: A patient name like Thöni is rendered as ThÃ¶ni or Th?ni.

The sending system correctly declares the encoding in MSH-18 (8859/1), but HL7 Forge ignores this field entirely.


Root Cause

File: src/mllp.rs, line 232

let message = String::from_utf8_lossy(message_bytes).to_string();

String::from_utf8_lossy unconditionally assumes UTF-8. In ISO 8859-1 every byte above 0x7F is a complete character on its own (the printable extended Latin characters occupy 0xA0-0xFF). A lone byte in that range is not a valid UTF-8 sequence, so it is silently replaced with U+FFFD before the message ever reaches the parser or the UI.
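To make the failure concrete, here is a minimal standard-library sketch that decodes the same bytes both ways. The function names decode_utf8_lossy and decode_latin1 are made up for this illustration and do not exist in the code base; the first mirrors the call at src/mllp.rs line 232, the second relies on ISO 8859-1 bytes mapping one-to-one onto the first 256 Unicode code points.

```rust
/// Mirrors the current code path: treat the bytes as UTF-8 and replace
/// every invalid sequence with U+FFFD.
fn decode_utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Decode ISO 8859-1: each byte becomes the Unicode code point with the
/// same numeric value, so this conversion cannot fail.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

Fed the Latin-1 bytes for Thöni (where ö is the single byte 0xF6), the first function loses the character and the second keeps it.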

MSH-18 is never read. The parser (src/hl7/parser.rs) operates on the already-corrupted string. The Hl7Message struct (src/hl7/types.rs) has no charset field. There is no code path anywhere that inspects or respects MSH-18.


Affected Code Paths

File               Line  Issue
src/mllp.rs        232   Hard-coded from_utf8_lossy(): always assumes UTF-8
src/hl7/parser.rs  -     Receives already-corrupted &str; MSH-18 never extracted
src/hl7/types.rs   -     Hl7Message has no charset/encoding field

Proposed Fix

  1. Two-pass decode in extract_mllp_frame (src/mllp.rs):

    • Do a minimal ASCII-only scan of the raw bytes to locate and read the MSH-18 field value (safe, since all HL7 delimiters and segment names are ASCII).
    • Select decoder based on MSH-18: 8859/1 → Latin-1, UTF-8/absent → current behavior.
    • Recommended crate: encoding_rs
  2. Add charset field to Hl7Message (src/hl7/types.rs):

    pub charset: Option<String>
  3. Surface charset in the UI: show detected encoding in the message header row.
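The two-pass idea from step 1 could look roughly like the sketch below. It is an assumption-laden illustration, not the actual extract_mllp_frame code: the Charset enum and the functions msh18, charset_from_msh18 and decode_message are hypothetical names, the input is assumed to be the payload with MLLP framing bytes already stripped, and only the exact value 8859/1 is recognized (repetitions in MSH-18 and other Table 0211 values fall back to the current UTF-8 behavior). It uses only the standard library to stay self-contained; with encoding_rs, as recommended above, encoding_rs::mem::decode_latin1 should do the same job as the byte-to-char mapping here.

```rust
/// Hypothetical charset tag; not part of the existing code base.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Charset {
    Utf8,
    Latin1,
}

/// Pass 1: ASCII-only scan of the raw bytes for the MSH-18 value.
/// Returns None if the data does not start with MSH or the field is
/// missing or empty.
fn msh18(raw: &[u8]) -> Option<&[u8]> {
    if !raw.starts_with(b"MSH") || raw.len() < 4 {
        return None;
    }
    // MSH-1 is the field separator itself (normally '|').
    let field_sep = raw[3];
    // The MSH segment ends at the first carriage return (or end of input).
    let end = raw.iter().position(|&b| b == b'\r').unwrap_or(raw.len());
    // Splitting on the separator yields "MSH", MSH-2, MSH-3, ...,
    // so MSH-n is piece n - 1 and MSH-18 is piece 17.
    raw[..end]
        .split(|&b| b == field_sep)
        .nth(17)
        .filter(|field| !field.is_empty())
}

/// Map the MSH-18 value to a decoder; anything unrecognized keeps the
/// current UTF-8 behavior.
fn charset_from_msh18(value: Option<&[u8]>) -> Charset {
    match value {
        Some(v) if v == b"8859/1" => Charset::Latin1,
        _ => Charset::Utf8,
    }
}

/// Pass 2: decode the whole message with the selected decoder.
fn decode_message(raw: &[u8]) -> (String, Charset) {
    let charset = charset_from_msh18(msh18(raw));
    let text: String = match charset {
        Charset::Latin1 => raw.iter().map(|&b| char::from(b)).collect(),
        Charset::Utf8 => String::from_utf8_lossy(raw).into_owned(),
    };
    (text, charset)
}
```

The returned Charset (or its string form) is what would populate the proposed charset field on Hl7Message and the header row in the UI.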

Steps to Reproduce

  1. Start hl7-forge with default settings.
  2. Send an MLLP message with MSH-18 set to 8859/1 containing Latin-1 extended characters (e.g. patient name Müller or Thöni).
  3. Open the Web UI and click on the received message.
  4. Observe corrupted characters in both the Segments tab and Raw tab.
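For step 2, a test frame can be built with a few lines of Rust. This is only a sketch: encode_latin1 and mllp_frame are hypothetical helper names, and the framing bytes are the standard MLLP ones (0x0B to start the block, 0x1C 0x0D to end it).

```rust
use std::convert::TryFrom;

/// Encode text as ISO 8859-1. Returns None if any character lies
/// outside U+0000..=U+00FF and therefore has no Latin-1 byte.
fn encode_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| u8::try_from(u32::from(c)).ok())
        .collect()
}

/// Wrap a payload in MLLP framing: <VT> payload <FS><CR>.
fn mllp_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(payload.len() + 3);
    frame.push(0x0B); // start of block (VT)
    frame.extend_from_slice(payload);
    frame.extend_from_slice(&[0x1C, 0x0D]); // end of block (FS) + CR
    frame
}
```

Encoding the sample MSH segment followed by a PID segment containing Thöni, framing the result, and writing it to the listener with std::net::TcpStream and write_all should reproduce the problem. The listener address is not stated in this report, so use whatever host and port your hl7-forge instance is configured with.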

Expected Behavior

Characters like ä, ö, ü, ß display correctly based on the encoding declared in MSH-18.

Actual Behavior

Extended Latin characters are corrupted: displayed as garbled sequences such as Ã¶ or Ã¤, or as ?, depending on the byte sequence.

HL7 Message Sample

MSH|^~\&|XXX RIS|XXX|Test-ORU|XXX|20260309102241||ORU^R01|137975750|P|2.4|||AL|NE||8859/1

Operating System

Windows

HL7 Forge Version

v0.4.0

Deployment Method

Pre-built binary (.exe / release)

Additional Context

HL7 Standard Reference

MSH-18 (Character Set): HL7 v2.x Table 0211

Common values: ASCII, 8859/1, 8859/2, UNICODE UTF-8
Many real-world EMR/RIS systems send ISO 8859-1 data and declare it in MSH-18.

Metadata

Labels

bug (Something isn't working), triage (Needs initial review and classification)