[Bug] stripControlChars drops characters from fi/fl/ff ligatures #111

@TheVortex8

Description

stripControlChars in src/engines/pdf/pdfjs.ts strips characters that are part of common typographic ligatures (fi, fl, ff, ffi, ffl). This causes words containing these ligatures to be silently corrupted in the output.

Reproduction

Parse any PDF containing words with fl or fi ligatures (common in French pharmaceutical/medical PDFs using professional fonts).

Example: the word ofloxacine (an antibiotic) is output as ooxacine — the fl pair is dropped.

import { LiteParse } from '@llamaindex/liteparse';

const parser = new LiteParse({ ocrEnabled: false, outputFormat: 'text' });
const response = await fetch('https://pharmactuel.com/index.php/pharmactuel/article/download/1571/1844');
const buffer = Buffer.from(await response.arrayBuffer());
const result = await parser.parse(buffer, true);

console.log(result.text.includes('ofloxacine')); // false — should be true
console.log(result.text.includes('ooxacine'));    // true — corrupted output

For comparison, pdftotext (poppler) extracts the same PDF correctly with ofloxacine intact.

Root Cause

In src/engines/pdf/pdfjs.ts around line 582, stripControlChars is applied to decoded text items. The comment says it handles "form feed chars that sneak into ligatures like fi", but the stripping is too aggressive — it removes the ligature characters themselves rather than normalizing them to their ASCII equivalents.
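One way to check this diagnosis is to dump the code points of a raw text item before stripControlChars runs, to see whether the font emits a ligature code point (U+FB00 to U+FB06) or a control character at that position. The sketch below is a standalone helper; the names toCodePoints and findSuspectChars are hypothetical and do not come from pdfjs.ts.

```typescript
// Hypothetical debugging helpers; not part of @llamaindex/liteparse.

// Render each character of a string as a "U+XXXX" label.
function toCodePoints(text: string): string[] {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0")
  );
}

// Return the labels of characters that are either Latin ligatures
// (U+FB00 to U+FB06) or C0/C1 control characters, since either kind
// could be what a strip step removes.
function findSuspectChars(text: string): string[] {
  const suspect = /[\uFB00-\uFB06\u0000-\u001F\u007F-\u009F]/;
  return Array.from(text)
    .filter((ch) => suspect.test(ch))
    .map((ch) => toCodePoints(ch)[0]);
}
```

Calling findSuspectChars on the raw string for the affected word shows which case applies: a result containing U+FB02 points at ligature code points being dropped, while a result such as U+000C would mean the font maps the glyph to a control character instead.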

Expected Behavior

Ligature characters (U+FB00–U+FB06) should be decomposed to their ASCII equivalents (ff, fi, fl, ffi, ffl, st) rather than stripped. A simple .normalize('NFKC') on the text items before stripping control chars would handle this, or an explicit replacement in stripControlChars.
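A minimal sketch of the explicit-replacement option, assuming the text items really do carry ligature code points when they reach the strip step. The function names and the control-character pattern are illustrative assumptions, not the current code in src/engines/pdf/pdfjs.ts.

```typescript
// Hypothetical helpers; names and regex are assumptions for illustration.

// Latin ligatures in the Alphabetic Presentation Forms block.
const LIGATURES: Record<string, string> = {
  "\uFB00": "ff",
  "\uFB01": "fi",
  "\uFB02": "fl",
  "\uFB03": "ffi",
  "\uFB04": "ffl",
  "\uFB05": "st", // long s + t ligature
  "\uFB06": "st",
};

// Replace each ligature code point with its ASCII expansion.
function expandLigatures(text: string): string {
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => LIGATURES[ch] ?? ch);
}

// Expand ligatures first, then remove C0 controls (except tab, LF, CR),
// DEL, and C1 controls.
function stripControlCharsKeepingLigatures(text: string): string {
  const expanded = expandLigatures(text);
  return expanded.replace(
    /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/g,
    ""
  );
}
```

The explicit map is narrower than .normalize('NFKC'), which also folds other compatibility characters (for example superscript digits) into their plain forms; which of the two is preferable depends on how much of the original text the parser is expected to preserve.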

Impact

Affects any PDF using fonts with ligature substitution (very common in professional typesetting). Particularly impactful for medical/pharmaceutical documents where drug names like ofloxacine, fluoroquinolone, fluconazole, rifampicine contain fl/fi pairs.
