Doctype after <?xml> or with internal subsets results in parse errors #50

@EmilGedda

Currently, Xeno.DOM.Robust does not handle XML doctypes properly.

Doctypes are removed only if they appear at the very start of the document. Usually, however, the doctype is placed after the XML declaration: <?xml ...?><!DOCTYPE html>.

For example, this test fails:

describe "skipDoctype" $ do
  it "strips doctype after xml declaration" $ do
    skipDoctype "<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>Hello" `shouldBe` "<?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello"

One thought is that skipDoctype should check whether either of the first two < is followed by !DOCTYPE and, if so, remove the matching node.
I don't think supporting a doctype at the end of a document is worth bothering with.

On top of that, skipDoctype also does not handle doctypes with internal subsets, such as:

<!DOCTYPE html [
  <!-- an internal subset can be embedded here -->
]>

Appropriate test:

describe "skipDoctype" $ do
  it "strips doctype with internal subsets" $ do
    skipDoctype "<!DOCTYPE html [ <!-- --> ]><?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello" `shouldBe` "<?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello"

In this case, skipDoctype returns a ByteString that starts with ]>.
Ideally, skipDoctype should drop input until it reaches [ or >; if a [ was matched, it should then continue dropping until ]>.
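
A minimal sketch of what this could look like, covering both cases above. This is not xeno's actual code: the names skipDoctypeSketch, dropDoctype and skipBlanks are hypothetical, and it assumes only the bytestring package. It keeps a leading XML declaration, then drops a doctype with or without an internal subset.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as S8
import           Data.Char             (isSpace)

-- | Hypothetical replacement for skipDoctype (a sketch, not the library's code).
-- Keeps a leading XML declaration and removes a doctype that follows it,
-- or a doctype at the very start of the input.
skipDoctypeSketch :: ByteString -> ByteString
skipDoctypeSketch input
  | "<?xml" `S8.isPrefixOf` trimmed =
      let (decl, rest) = S8.breakSubstring "?>" trimmed
      in if S8.null rest
           then input -- unterminated declaration: leave the input untouched
           else decl <> "?>" <> dropDoctype (S8.drop 2 rest)
  | otherwise = dropDoctype trimmed
  where
    trimmed = skipBlanks input

-- | Drop one doctype at the start of the input, if there is one.
dropDoctype :: ByteString -> ByteString
dropDoctype bs
  | "<!DOCTYPE" `S8.isPrefixOf` body =
      -- Drop until the first '[' or '>'.
      case S8.uncons (S8.dropWhile (\c -> c /= '[' && c /= '>') body) of
        -- Internal subset: continue dropping until the closing "]>".
        Just ('[', subset) ->
          let (_, rest) = S8.breakSubstring "]>" subset
          in if S8.null rest then bs else skipBlanks (S8.drop 2 rest)
        -- Plain doctype: the matched character is the closing '>'.
        Just (_, rest) -> skipBlanks rest
        -- Unterminated doctype: leave the input untouched.
        Nothing -> bs
  | otherwise = bs
  where
    body = skipBlanks bs

skipBlanks :: ByteString -> ByteString
skipBlanks = S8.dropWhile isSpace
```

Known gaps in this sketch: it does not accept whitespace between ] and > (which XML allows), and it would be confused by a [ or > inside a quoted system or public identifier, or by ]> inside a comment or entity value in the subset.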
