Strategies for dealing with broken clients? (invalid documents) #664

@packplusplus

I have a client I can't change, which submits a malformed SOAP document. The client assumes that a namespace prefix called soap is already bound to the URI http://schemas.xmlsoap.org/soap/. Who knows why the original server accepted this. I'm not in a position to judge or fix it, only to build something bug-for-bug compatible.

Let's demonstrate the behavior in a simplified way.

from lxml import etree
s = "<soapenv:Header><soap:authentication><soap:username>some_user</soap:username><soap:password>some_pass</soap:password></soap:authentication></soapenv:Header>" 
etree.fromstring(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 1
etree.XMLSyntaxError: Namespace prefix soapenv on Header is not defined, line 1, column 16

I had originally gone down the path of trying to use the listeners, like before_deserialize, but all of those seem to fire after lxml.etree.XMLID tries to parse the document and throws a fault, at this line:

root, xmlids = etree.XMLID(string.encode(charset), parser)

I ended up monkey patching Soap11's _parse_xml_string to do some nasty string manipulation with a few regexes: check whether the namespace is declared in the envelope and, if not, modify the incoming string to include it.

  • Will this fail on super large payloads? Probably, but my payloads are under 4k, so it's not going to be an issue.
  • Do I think this was the right thing to do? Not really, but the POC works.
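For anyone wanting something concrete, here is a minimal sketch of the kind of string fix-up described above. It is not the actual monkey patch: the function name, the prefix map, and the soapenv URI (the standard SOAP 1.1 envelope namespace) are assumptions, and it parses with the standard library's ElementTree as a stand-in so it runs without lxml. The cleaned string could presumably be handed to lxml instead.

```python
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this sketch

# Assumed prefix -> URI map. The "soap" URI is the one from this post;
# the "soapenv" URI is the usual SOAP 1.1 envelope namespace (an assumption).
ASSUMED_NAMESPACES = {
    "soap": "http://schemas.xmlsoap.org/soap/",
    "soapenv": "http://schemas.xmlsoap.org/soap/envelope/",
}

# First element start tag; skips "<?xml ...?>", comments and other "<!" constructs.
_FIRST_START_TAG = re.compile(r"<[A-Za-z_][^\s/>]*")


def inject_missing_namespaces(xml_text, namespaces=ASSUMED_NAMESPACES):
    """Declare, on the root tag, every known prefix that is used but never declared."""
    missing = [
        (prefix, uri)
        for prefix, uri in namespaces.items()
        # the prefix appears on an element or attribute name ...
        if re.search(r"[<\s]%s:" % re.escape(prefix), xml_text)
        # ... and no xmlns:prefix="..." declaration exists anywhere
        and not re.search(r"xmlns:%s\s*=" % re.escape(prefix), xml_text)
    ]
    match = _FIRST_START_TAG.search(xml_text)
    if not missing or match is None:
        return xml_text
    declarations = "".join(' xmlns:%s="%s"' % (p, u) for p, u in missing)
    # Splice the declarations in right after the root element's name.
    return xml_text[:match.end()] + declarations + xml_text[match.end():]


broken = (
    "<soapenv:Header><soap:authentication>"
    "<soap:username>some_user</soap:username>"
    "<soap:password>some_pass</soap:password>"
    "</soap:authentication></soapenv:Header>"
)
fixed = inject_missing_namespaces(broken)
root = ET.fromstring(fixed)  # parses now that both prefixes are bound
```

Being regex based, this inherits the caveats in the bullets above: it scans the whole payload and can be fooled by prefix-like text inside comments or CDATA, so it only seems reasonable for small, predictable documents.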

It makes sense that an invalid XML doc would cause a failure, but I'm wondering if there are other strategies I should be considering instead of this approach. Or perhaps the intent of those listeners in the Soap11 object was to help cope with bad documents, and that functionality was lost over time. If it's the former, I'm all ears; if it's the latter, I'm not totally sure how I would fix it.

I could see a path where create_in_document fires an event that allows you to manipulate the XML string before ctx.in_document gets a parsed XML object, and I would take a swing at a PR for that functionality if anyone sees merit in it.
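To make the proposal a bit more concrete, here is a rough, library-independent sketch of the shape such a hook could take. Everything in it is hypothetical: the InDocumentBuilder class, add_before_parse_listener, and the handler are illustrative names rather than existing spyne API, the soapenv URI is assumed, and the standard library's ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this sketch


class InDocumentBuilder:
    """Hypothetical stand-in for the step that turns the raw request into a parsed document."""

    def __init__(self):
        self._before_parse = []  # callables taking and returning the raw XML string

    def add_before_parse_listener(self, handler):
        # Hypothetical registration call; handlers run in registration order.
        self._before_parse.append(handler)

    def create_in_document(self, in_string):
        # Give every listener a chance to rewrite the raw string *before* parsing,
        # which is the point where the existing listeners come too late.
        for handler in self._before_parse:
            in_string = handler(in_string)
        return ET.fromstring(in_string)


def declare_soap_prefixes(raw):
    """Example handler: bind the two prefixes the broken client forgets to declare."""
    return raw.replace(
        "<soapenv:Header>",
        '<soapenv:Header'
        ' xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"'  # assumed URI
        ' xmlns:soap="http://schemas.xmlsoap.org/soap/">',
        1,  # only touch the first occurrence
    )


builder = InDocumentBuilder()
builder.add_before_parse_listener(declare_soap_prefixes)
document = builder.create_in_document(
    "<soapenv:Header><soap:authentication>"
    "<soap:username>some_user</soap:username>"
    "</soap:authentication></soapenv:Header>"
)
```

The appeal of this shape is that the workaround lives in user code as an ordinary listener, so nothing needs to be monkey patched and well-formed requests pass through untouched.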

Thoughts?
