Affected Component
HL7 Parser (message parsing, delimiter detection)
Bug Description
Messages encoded in ISO 8859-1 (Latin-1) are displayed with corrupted characters in HL7 Forge. Accented characters (ä, ö, ü, é, etc.) appear as ? or as garbled sequences containing the Unicode replacement character (â�, �¶, etc.) instead of their correct glyphs.
Example: A patient name like Thöni is rendered as Th�¶ni or Th?ni.
The sending system correctly declares the encoding in MSH-18 (8859/1), but HL7 Forge ignores this field entirely.
Root Cause
File: src/mllp.rs, line 232
let message = String::from_utf8_lossy(message_bytes).to_string();
String::from_utf8_lossy unconditionally assumes UTF-8. ISO 8859-1 encodes extended Latin characters as single bytes in the range 0x80-0xFF. On their own these bytes are not valid UTF-8 sequences, so they are silently replaced with U+FFFD before the message ever reaches the parser or the UI.
MSH-18 is never read. The parser (src/hl7/parser.rs) operates on the already-corrupted string. The Hl7Message struct (src/hl7/types.rs) has no charset field. There is no code path anywhere that inspects or respects MSH-18.
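The replacement described above can be reproduced in isolation. The following is a minimal sketch using only the standard library; the function names and the constant are illustrative and do not exist in HL7 Forge.

```rust
/// Decode bytes the same way as the line quoted from src/mllp.rs:
/// every byte sequence that is not valid UTF-8 becomes U+FFFD.
fn decode_as_utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).to_string()
}

/// Decode ISO 8859-1: each byte maps to the Unicode code point
/// with the same numeric value (0x00..=0xFF).
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// "Thöni" encoded as ISO 8859-1; the byte 0xF6 is 'ö'.
const THOENI_LATIN1: &[u8] = &[0x54, 0x68, 0xF6, 0x6E, 0x69];
```

Passing THOENI_LATIN1 to decode_as_utf8_lossy loses the ö, while decode_latin1 recovers the original name.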
Affected Code Paths
| File | Line | Issue |
|---|---|---|
| src/mllp.rs | 232 | Hard-coded from_utf8_lossy(); always assumes UTF-8 |
| src/hl7/parser.rs | - | Receives already-corrupted &str; MSH-18 never extracted |
| src/hl7/types.rs | - | Hl7Message has no charset/encoding field |
Proposed Fix
- Two-pass decode in extract_mllp_frame (src/mllp.rs):
  - Do a minimal ASCII-only scan of the raw bytes to locate and read the MSH-18 field value (safe, since all HL7 delimiters and segment names are ASCII).
  - Select decoder based on MSH-18: 8859/1 -> Latin-1; UTF-8 or absent -> current behavior.
  - Recommended crate: encoding_rs
- Add charset field to Hl7Message (src/hl7/types.rs): pub charset: Option<String>
- Surface charset in the UI: show detected encoding in the message header row.
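The two-pass decode proposed above could look roughly like the sketch below. It uses only the standard library so that it runs without dependencies. `Charset`, `msh18`, `charset_from_msh18`, and `decode_message` are hypothetical names, not existing HL7 Forge code, and repetitions or components inside MSH-18 are deliberately not handled.

```rust
/// Hypothetical charset selection result; variant names are illustrative.
#[derive(Debug, PartialEq)]
enum Charset {
    Latin1,
    Utf8,
}

/// Pass 1: read MSH-18 from the raw bytes using ASCII comparisons only.
fn msh18(raw: &[u8]) -> Option<&[u8]> {
    if raw.len() < 4 || !raw.starts_with(b"MSH") {
        return None;
    }
    // MSH-1 is the field separator itself (normally '|').
    let sep = raw[3];
    // The MSH segment ends at the first segment terminator.
    let end = raw
        .iter()
        .position(|&b| b == b'\r' || b == b'\n')
        .unwrap_or(raw.len());
    // After splitting on the separator, index 0 is "MSH" and index 1 is
    // MSH-2, so MSH-18 sits at index 17.
    raw[..end].split(|&b| b == sep).nth(17)
}

/// Map the MSH-18 value to a decoder; anything else keeps current behavior.
fn charset_from_msh18(value: Option<&[u8]>) -> Charset {
    match value {
        Some(v) if v == b"8859/1" => Charset::Latin1,
        _ => Charset::Utf8,
    }
}

/// Pass 2: decode the whole message with the selected charset.
fn decode_message(raw: &[u8]) -> (String, Charset) {
    let charset = charset_from_msh18(msh18(raw));
    let text: String = match charset {
        // Latin-1 bytes map one-to-one onto U+0000..=U+00FF.
        Charset::Latin1 => raw.iter().map(|&b| b as char).collect(),
        Charset::Utf8 => String::from_utf8_lossy(raw).into_owned(),
    };
    (text, charset)
}
```

The returned charset is what the proposed charset field on Hl7Message would carry. If encoding_rs is used instead of the manual byte mapping, it is worth checking how it treats the 8859/1 case: following the WHATWG Encoding Standard, it resolves the iso-8859-1 label to windows-1252, which differs from strict Latin-1 only in the range 0x80-0x9F.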
Steps to Reproduce
- Start hl7-forge with default settings.
- Send an MLLP message with MSH-18 set to 8859/1 containing Latin-1 extended characters (e.g. patient name Müller or Thöni).
- Open the Web UI and click on the received message.
- Observe corrupted characters in both the Segments tab and Raw tab.
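For the sending step, a test frame can be built with a few lines of Rust. This is only a sketch: the PID segment and all function names are made up for illustration, and the listener address has to be taken from the local hl7-forge configuration. MLLP wraps the payload in a 0x0B start byte and a 0x1C 0x0D trailer.

```rust
use std::io::Write;
use std::net::TcpStream;

/// Sample message from this report plus a hypothetical PID segment.
const REPRO_MESSAGE: &str = "MSH|^~\\&|XXX RIS|XXX|Test-ORU|XXX|20260309102241||ORU^R01|137975750|P|2.4|||AL|NE||8859/1\rPID|1||12345||Thöni^Anna\r";

/// Encode text as ISO 8859-1; returns None if a character is above U+00FF.
fn encode_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let cp = c as u32;
            if cp <= 0xFF { Some(cp as u8) } else { None }
        })
        .collect()
}

/// Wrap a payload in MLLP framing: <VT> payload <FS><CR>.
fn mllp_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(payload.len() + 3);
    frame.push(0x0B);
    frame.extend_from_slice(payload);
    frame.extend_from_slice(&[0x1C, 0x0D]);
    frame
}

/// Send one frame to an MLLP listener; `addr` is deployment-specific.
fn send_frame(addr: &str, frame: &[u8]) -> std::io::Result<()> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(frame)
}
```

Calling mllp_frame on the result of encode_latin1(REPRO_MESSAGE) and passing the frame to send_frame should be enough to trigger the corrupted display described in the last two steps.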
Expected Behavior
Characters such as ä, ö, ü are displayed correctly, decoded according to the encoding declared in MSH-18.
Actual Behavior
Extended Latin characters are corrupted: displayed as â�, �¶, �¤, or ? depending on the byte sequence.
HL7 Message Sample
MSH|^~\&|XXX RIS|XXX|Test-ORU|XXX|20260309102241||ORU^R01|137975750|P|2.4|||AL|NE||8859/1
Operating System
Windows
HL7 Forge Version
v0.4.0
Deployment Method
Pre-built binary (.exe / release)
Additional Context
HL7 Standard Reference
MSH-18 (Character Set): HL7 v2.x Table 0211
Common values: ASCII, 8859/1, 8859/2, UNICODE UTF-8
Many real-world EMR/RIS systems send ISO 8859-1 data and declare it in MSH-18.