
Pathological memory on oversize string literals (~95x source size) #478

@mgajda

On haskell-src-exts-1.23.1 (the version pinned by homplexity-0.4.8.1),
Language.Haskell.Exts.parseModuleWithComments uses far more memory
than the input size when it hits a string literal with a few MB of
\-escaped bytes on a single line. Observed while running
homplexity over the GHC repository:

| File | Source | Peak RSS | Ratio |
| --- | --- | --- | --- |
| libraries/base/GHC/Unicode/.../GeneralCategory.hs | 3.2 MB | 304 MB | ~95x |

Cause. The file contains a single literal of the form "\25\25\25..."
with a few MB of \-escaped bytes on one line. The lexer decodes each
escape into a Char and the parser materialises the literal as a String
(a linked list of boxed Chars) in Literal. A 3 MB escaped literal
expands to roughly 120 MB of : cons cells plus boxed chars, and the
lexer holds additional intermediate state for the duration.
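
For reference, the 120 MB figure is what you get from a back-of-envelope estimate of about 40 bytes per list element. The sketch below assumes 64-bit GHC, a fully evaluated String, and no sharing of Char boxes (so it overestimates when characters repeat). The function names are made up for illustration, and the element count is an assumption: the numbers above do not say whether the ~3 million characters are decoded values, retained source text, or both.

```haskell
-- Back-of-envelope heap cost of a fully evaluated String on 64-bit GHC.
-- Assumption: every Char is separately boxed (no sharing).

-- Size of a machine word in bytes on a 64-bit platform.
wordBytes :: Int
wordBytes = 8

-- A (:) cell is a header word plus two pointer fields.
consWords :: Int
consWords = 3

-- A boxed Char (C#) is a header word plus one payload word.
charWords :: Int
charWords = 2

-- Estimated bytes for a String of n characters.
stringHeapBytes :: Int -> Int
stringHeapBytes n = n * (consWords + charWords) * wordBytes

-- The same figure in units of 10^6 bytes, rounded down.
stringHeapMB :: Int -> Int
stringHeapMB n = stringHeapBytes n `div` 1000000
```

With n = 3000000 this gives 120 (MB), in line with the estimate above; the remaining gap to the 304 MB peak RSS would then be lexer state and GC overhead, which this sketch does not model.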

Suggested fix. Store string-literal bytes as Text or ByteString
(or an offset+length into the source buffer) rather than eagerly
materialising a String. This changes the public AST type for string
literals, so it would be a breaking change, but it eliminates the
pathology entirely and is also a meaningful memory win on ordinary
source with many small literals.
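
To make the offset+length variant concrete, here is a minimal sketch. LitSpan and litRaw are hypothetical names, not part of haskell-src-exts, and the sketch assumes the source is held as a strict ByteString for the lifetime of the AST. Escape decoding is deliberately left out; a consumer would decode on demand.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Hypothetical literal representation: a span into the source buffer
-- instead of an eagerly decoded String.
data LitSpan = LitSpan
  { litOffset :: !Int  -- byte offset of the first byte after the opening quote
  , litLength :: !Int  -- number of raw (still escaped) bytes
  } deriving (Eq, Show)

-- Raw bytes of the literal. take and drop on a strict ByteString share the
-- underlying buffer, so this costs O(1) and does not copy the literal.
litRaw :: BC.ByteString -> LitSpan -> BC.ByteString
litRaw src (LitSpan off len) = BC.take len (BC.drop off src)
```

The per-literal cost is then two unboxed Ints regardless of literal size, at the price of keeping the source buffer alive as long as any AST node refers to it.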

Meanwhile a pre-parse blob filter (reject lines > 1 KiB or string
literals > 4 KiB of raw escaped bytes) catches the worst ~0.1% of
blobs cheaply; that is what we ended up doing in homplexity-benchmark.
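
A rough sketch of such a filter follows, using the thresholds above. This is not the homplexity-benchmark code; all names are illustrative, and the literal scanner is a heuristic that does not understand comments, character literals such as '"', or quasi-quotes, so it can misjudge literal boundaries. It counts raw bytes between double quotes, treating a backslash as escaping the next byte.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Thresholds from the description: 1 KiB per line, 4 KiB per literal.
maxLineBytes, maxLiteralBytes :: Int
maxLineBytes = 1024
maxLiteralBytes = 4096

-- True if any line is longer than maxLineBytes.
hasLongLine :: BC.ByteString -> Bool
hasLongLine = any ((> maxLineBytes) . BC.length) . BC.lines

-- Scanner state: inside a literal?, previous byte was a backslash?,
-- bytes in the current literal, longest literal seen so far.
data Scan = Scan
  { inLit   :: !Bool
  , escaped :: !Bool
  , cur     :: !Int
  , best    :: !Int
  }

-- Raw byte length of the longest double-quoted run (heuristic).
-- An unterminated literal at end of input is counted as well.
longestLiteral :: BC.ByteString -> Int
longestLiteral src = max (best end) (cur end)
  where
    end = BC.foldl' step (Scan False False 0 0) src
    step s c
      | not (inLit s) = if c == '"' then s { inLit = True, cur = 0 } else s
      | escaped s     = s { escaped = False, cur = cur s + 1 }
      | c == '\\'     = s { escaped = True, cur = cur s + 1 }
      | c == '"'      = s { inLit = False, best = max (best s) (cur s), cur = 0 }
      | otherwise     = s { cur = cur s + 1 }

-- Reject a blob before handing it to the parser.
isOversize :: BC.ByteString -> Bool
isOversize src = hasLongLine src || longestLiteral src > maxLiteralBytes
```

The line check alone catches any single-line literal over 1 KiB; the literal scan only matters for literals continued across lines with string gaps.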

Happy to provide more reproduction material if useful.
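
As a starting point, a synthetic input of the same shape can be generated with a few lines. This is only a sketch: syntheticModule and the module name Blob are made up here, and the output is not the GHC file from the table, just a module whose single binding is one long "\25\25\25..." literal on one line.

```haskell
-- Source text of a module whose only binding is a string literal made of
-- n copies of the escape \25, all on a single line.
syntheticModule :: Int -> String
syntheticModule n = unlines
  [ "module Blob where"
  , "blob :: String"
  , "blob = \"" ++ concat (replicate n "\\25") ++ "\""
  ]
```

Each escape is 3 source bytes, so n around 1000000 gives a literal of roughly 3 MB. Writing the result to a file and parsing it with parseModuleWithComments under +RTS -s should show whether the blow-up reproduces outside the GHC tree.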
