On `haskell-src-exts-1.23.1` (the version pinned by `homplexity-0.4.8.1`),
`Language.Haskell.Exts.parseModuleWithComments` allocates dramatically
more heap than the input size when it hits a string literal with a few
MB of `\`-escaped bytes on a single line. Observed while running
homplexity over the GHC repository:

| File | Source size | Peak RSS | Ratio |
|---|---|---|---|
| libraries/base/GHC/Unicode/.../GeneralCategory.hs | 3.2 MB | 304 MB | ~95x |

Cause. The file contains a single literal of the form `"\25\25\25..."`
with a few MB of `\`-escaped bytes on one line. The lexer decodes each
escape into a `Char`, and the parser materialises the literal as a
`String` (a linked list of boxed `Char`s) inside `Literal`. A 3 MB
escaped literal expands to roughly 120 MB of `:` cons cells plus boxed
chars, and the lexer holds additional intermediate state for the duration.
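
For reproduction, here is a minimal sketch of a generator for such an input, using only `base`. The file name `Repro.hs`, the helper `reproModule`, and the count of one million escapes are assumptions chosen for illustration; they are not taken from the GHC source.

```haskell
-- | Hypothetical helper: source text for a module holding one string
-- literal made of n copies of the three-character escape \25, all on
-- a single line.
reproModule :: Int -> String
reproModule n =
  unlines
    [ "module Repro where"
    , ""
    , "blob :: String"
    , "blob = \"" ++ concat (replicate n "\\25") ++ "\""
    ]

main :: IO ()
main =
  -- One million escapes give roughly 3 MB of literal on one line.
  writeFile "Repro.hs" (reproModule 1000000)
```

Passing the generated file's contents to `parseModuleWithComments` with a default parse mode, in a program run with `+RTS -s`, should make the allocation visible. The exact numbers will vary with the GHC version and RTS settings.
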

Suggested fix. Store string-literal bytes as `Text` or `ByteString`
(or as an offset+length into the source buffer) rather than eagerly
materialising a `String`. This changes the public AST type for string
literals, so it would be a breaking change, but it eliminates the
pathology entirely and is also a meaningful memory win on ordinary
source with many small literals.
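
The offset+length variant could look roughly like the sketch below. `LitSlice` and `rawText` are hypothetical names, not part of haskell-src-exts, and the buffer is a plain `String` only to keep the example within `base`; a real change would presumably index into a `ByteString` or `Text` buffer.

```haskell
-- | Hypothetical replacement for an eagerly decoded literal: a slice
-- of the source buffer, left undecoded until a consumer asks for it.
data LitSlice = LitSlice
  { sliceOffset :: !Int  -- index of the first character after the opening quote
  , sliceLength :: !Int  -- number of raw (still escaped) characters
  } deriving (Eq, Show)

-- | Recover the raw, still escaped literal text from the buffer.
rawText :: String -> LitSlice -> String
rawText buf (LitSlice off len) = take len (drop off buf)

main :: IO ()
main = do
  let buf = "x = \"\\25\\25\""   -- the source text: x = "\25\25"
      lit = LitSlice 5 6          -- covers the six characters \25\25
  putStrLn (rawText buf lit)
```

With this shape the AST node costs two machine words plus a header regardless of the literal's length, and escapes are decoded only by consumers that need the value.
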

In the meantime, a pre-parse blob filter (reject lines longer than 1 KiB
or string literals with more than 4 KiB of raw escaped bytes) cheaply
catches the worst ~0.1% of blobs; that is what we ended up doing in
homplexity-benchmark.
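
A minimal sketch of such a filter follows. It is not the code from homplexity-benchmark: the names `isBlob` and `longestLiteral` are hypothetical, the filter is assumed to reject a whole file as soon as one line or literal exceeds a threshold, and the literal scanner is deliberately naive.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Thresholds quoted in the text; all names here are hypothetical.
maxLineLen, maxLiteralLen :: Int
maxLineLen    = 1024  -- 1 KiB
maxLiteralLen = 4096  -- 4 KiB

-- | Length of the longest raw (still escaped) string literal.
-- Counts Chars, which equals bytes for ASCII source. The scanner is
-- naive: it tracks only double quotes and backslash escapes, so a
-- quote inside a comment or a character literal can mislead it.
longestLiteral :: String -> Int
longestLiteral = outside 0
  where
    outside !best []           = best
    outside !best ('"' : rest) = inside best 0 rest
    outside !best (_ : rest)   = outside best rest

    inside !best !n []                = max best n  -- unterminated literal
    inside !best !n ('\\' : _ : rest) = inside best (n + 2) rest
    inside !best !n ('"' : rest)      = outside (max best n) rest
    inside !best !n (_ : rest)        = inside best (n + 1) rest

-- | True when the source should be skipped before parsing.
isBlob :: String -> Bool
isBlob src =
  any ((> maxLineLen) . length) (lines src)
    || longestLiteral src > maxLiteralLen

main :: IO ()
main = do
  let small = "x = \"hello\"\n"
      big   = "y = \"" ++ concat (replicate 2000 "\\25") ++ "\"\n"
  print (isBlob small)  -- False
  print (isBlob big)    -- True
```

The line-length check alone catches single-line literals like the one above; the literal check matters for literals that are split across lines with string gaps.
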
Happy to provide more reproduction material if useful.