Description
stripControlChars in src/engines/pdf/pdfjs.ts strips characters that are part of common typographic ligatures (fi, fl, ff, ffi, ffl). This causes words containing these ligatures to be silently corrupted in the output.
Reproduction
Parse any PDF containing words with fl or fi ligatures (common in French pharmaceutical/medical PDFs using professional fonts).
Example: the word ofloxacine (an antibiotic) is output as ooxacine — the fl pair is dropped.
import { LiteParse } from '@llamaindex/liteparse';
const parser = new LiteParse({ ocrEnabled: false, outputFormat: 'text' });
const response = await fetch('https://pharmactuel.com/index.php/pharmactuel/article/download/1571/1844');
const buffer = Buffer.from(await response.arrayBuffer());
const result = await parser.parse(buffer, true);
console.log(result.text.includes('ofloxacine')); // false — should be true
console.log(result.text.includes('ooxacine')); // true — corrupted output
For comparison, pdftotext (poppler) extracts the same PDF correctly with ofloxacine intact.
Root Cause
In src/engines/pdf/pdfjs.ts around line 582, stripControlChars is applied to decoded text items. The comment says it handles "form feed chars that sneak into ligatures like fi", but the stripping is too aggressive: it removes the ligature characters themselves instead of normalizing them to their ASCII equivalents.
Expected Behavior
Ligature characters (U+FB00–U+FB06) should be decomposed to their ASCII equivalents (ff, fi, fl, ffi, ffl, and st for both U+FB05 and U+FB06) rather than stripped. Calling .normalize('NFKC') on the text items before stripping control chars would handle this; an explicit replacement inside stripControlChars would also work.
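A minimal sketch of the explicit-replacement option, to make the expected behavior concrete. The names decomposeLigatures and stripControlCharsSketch are hypothetical, and the control-character range is an assumption; I have not copied the real stripControlChars implementation.

```typescript
// Hypothetical helper, not the project's actual code: maps each Latin
// ligature code point in U+FB00..U+FB06 to its ASCII expansion.
const LIGATURES: Record<string, string> = {
  '\uFB00': 'ff',
  '\uFB01': 'fi',
  '\uFB02': 'fl',
  '\uFB03': 'ffi',
  '\uFB04': 'ffl',
  '\uFB05': 'st', // long s + t ligature
  '\uFB06': 'st',
};

function decomposeLigatures(text: string): string {
  // Replace each ligature with its expansion; leave everything else alone.
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => LIGATURES[ch] ?? ch);
}

// Assumed stand-in for the existing stripping step: decompose first, then
// drop C0 control characters (keeping tab, LF and CR) and DEL.
function stripControlCharsSketch(text: string): string {
  return decomposeLigatures(text).replace(
    /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g,
    '',
  );
}

console.log(decomposeLigatures('o\uFB02oxacine')); // 'ofloxacine'
```

The explicit map only touches the seven ligature code points. .normalize('NFKC') is shorter but also rewrites other compatibility characters (for example, superscript digits become plain digits), so it may change more of the extracted text than intended.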
Impact
Affects any PDF using fonts with ligature substitution (very common in professional typesetting). Particularly impactful for medical/pharmaceutical documents, where drug names such as ofloxacine, fluoroquinolone, and fluconazole contain the fl pair.
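A small regression check along these lines could guard against the corruption coming back. This is only a sketch: findMissingTerms is a hypothetical helper, and the sample strings are simulated rather than taken from real parser output.

```typescript
// Hypothetical regression helper: returns the expected terms that do not
// appear in the extracted text (case-insensitive).
function findMissingTerms(
  extracted: string,
  expected: readonly string[],
): string[] {
  const haystack = extracted.toLowerCase();
  return expected.filter((term) => !haystack.includes(term.toLowerCase()));
}

const expectedTerms = ['ofloxacine', 'fluoroquinolone', 'fluconazole'];

// Simulated outputs: one with the fl pair dropped, one intact.
const corruptedText = 'ooxacine, uoroquinolone, uconazole';
const intactText = 'Ofloxacine, fluoroquinolone, fluconazole';

console.log(findMissingTerms(corruptedText, expectedTerms));
console.log(findMissingTerms(intactText, expectedTerms));
```

In a real test, the extracted string would come from parsing a fixture PDF known to contain these words.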