Techworld

XXE attacks and disabling remote entity loading when using Python's sax library

Tutorial: How to prevent XXE attacks when using XML and Python

If you need to work with XML in Python, there are several libraries you can use. One of the more popular ones is xml.sax, which is included in the default installation of Python.

Building a basic parser with xml.sax is easy. For example, suppose we want to write a parser that loads the following XML:

my_xml="""
<person>
<name>John Smith</name>
<company>Smith Pty Ltd</company>
</person>
"""

We would like the code to parse it into a simple Python object:

class MyObject:
    def __init__(self):
        self.name = None
        self.company = None

    def __repr__(self):
        return "%s (%s)" % (self.name, self.company)

Doing so in xml.sax takes only a few lines of code. A simple content handler for this type of XML looks like this:

import xml.sax

class MyContentHandler(xml.sax.ContentHandler):
    def __init__(self, obj):
        xml.sax.ContentHandler.__init__(self)
        self.object = obj
        self.chars = ""

    def startElement(self, name, attrs):
        self.chars = ""

    def endElement(self, name):
        # Store the collected text on the object passed to the constructor
        if name == "name":
            self.object.name = self.chars
        elif name == "company":
            self.object.company = self.chars

    def characters(self, content):
        self.chars += content

Given these bits of code, most programmers will then parse the XML like this:

obj = MyObject()

handler = MyContentHandler(obj)
xml.sax.parseString(my_xml, handler)

print(repr(obj))

Running this code will output:

John Smith (Smith Pty Ltd)

There is a problem with this kind of parsing, however. Untrusted XML has become a growing concern because of the attack vectors that XML parsers introduce. One of the main ones is known as an XML External Entity (XXE) vulnerability. An XML document can load content from outside the file in two ways: through an external ENTITY declaration or through an external DTD declaration.

Let's check out an example by modifying our XML to look as follows:

my_xml="""
<!DOCTYPE person [<!ENTITY remote SYSTEM 
"http://www.computerworld.com.au/xml.test">]>
<person>
<name>John Smith &remote;</name>
<company>Smith Pty Ltd</company>
</person>
"""

If we run our previous code on this XML, the output will be:

John Smith Hello (Smith Pty Ltd)

This is because the entity called remote causes SAX to reach out to that address and fetch the result, in this case the text Hello. This can cause problems if the target is huge (or never-ending), such as file:///dev/random, or if it contains sensitive information, such as file:///etc/passwd.
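To make the local-file risk concrete without touching the network, here is a minimal Python 3 sketch that points an external entity at a temporary file standing in for something like /etc/passwd. Since Python 3.7.1 the standard SAX parser no longer processes external general entities by default, so the sketch switches feature_external_ges on explicitly to reproduce the behaviour described above. NameHandler and parse_name are illustrative names, not part of xml.sax.

```python
import io
import pathlib
import tempfile
import xml.sax
import xml.sax.handler


class NameHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects the text of the <name> element."""

    def __init__(self):
        super().__init__()
        self.chars = ""
        self.name = None

    def startElement(self, name, attrs):
        self.chars = ""

    def characters(self, content):
        self.chars += content

    def endElement(self, name):
        if name == "name":
            self.name = self.chars


def parse_name(xml_text, external_ges):
    """Parse xml_text and return the <name> text (illustrative helper)."""
    handler = NameHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # Explicitly choose whether external general entities are resolved
    parser.setFeature(xml.sax.handler.feature_external_ges, external_ges)
    parser.parse(io.StringIO(xml_text))
    return handler.name


with tempfile.TemporaryDirectory() as tmp:
    # A throwaway file that plays the role of a sensitive local file
    secret = pathlib.Path(tmp) / "secret.txt"
    secret.write_text("TOP-SECRET")
    xml_text = (
        '<!DOCTYPE person [<!ENTITY local SYSTEM "%s">]>'
        "<person><name>John Smith &local;</name></person>" % secret.as_uri()
    )
    leaked = parse_name(xml_text, external_ges=True)
    safe = parse_name(xml_text, external_ges=False)

print(repr(leaked))
print(repr(safe))
```

With the feature switched on, the contents of the file end up inside the parsed name; with it switched off, the entity reference contributes nothing.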

You can verify which HTTP calls are being made by using a tool such as tcpdump, or the great HTTPretty library. For example, the HTTPretty library lets us do the following:

from httpretty import HTTPretty

HTTPretty.enable()
obj = MyObject()

handler = MyContentHandler(obj)
xml.sax.parseString(my_xml, handler)

print(repr(obj))
print(HTTPretty.latest_requests)
HTTPretty.disable()

This will print out the HTTP requests made between the calls to HTTPretty.enable() and HTTPretty.disable(), in this case:
HTTPrettyRequest(headers=Host: www.computerworld.com.au
User-Agent: Python-urllib/1.17
, body="")
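
If you would rather not depend on a third-party library, the lookups can also be observed with the standard library alone. The sketch below (Python 3) installs a custom EntityResolver that records every external identifier the parser asks for and answers with canned in-memory content instead of going to the network. RecordingResolver, TextHandler, lookups_for, and the canned Hello payload are illustrative assumptions; the feature is enabled explicitly because Python 3.7.1 and later leave it off by default.

```python
import io
import xml.sax
import xml.sax.handler
from xml.sax.xmlreader import InputSource


class RecordingResolver(xml.sax.handler.EntityResolver):
    """Hypothetical resolver: logs each lookup and returns canned bytes."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        # Supplying a byte stream means the parser never opens the URL itself
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b"Hello"))
        return source


class TextHandler(xml.sax.ContentHandler):
    """Collects all character data seen in the document."""

    def __init__(self):
        super().__init__()
        self.text = ""

    def characters(self, content):
        self.text += content


def lookups_for(xml_text, external_ges):
    """Return (requested system ids, collected text) for one parse."""
    resolver = RecordingResolver()
    handler = TextHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setEntityResolver(resolver)
    parser.setFeature(xml.sax.handler.feature_external_ges, external_ges)
    parser.parse(io.StringIO(xml_text))
    return resolver.requested, handler.text


doc = (
    "<!DOCTYPE person [<!ENTITY remote SYSTEM "
    '"http://www.computerworld.com.au/xml.test">]>'
    "<person><name>John Smith &remote;</name></person>"
)
print(lookups_for(doc, external_ges=True))
print(lookups_for(doc, external_ges=False))
```

The first call reports the address from the ENTITY declaration; the second reports no lookups at all, which is the behaviour we want for untrusted input.
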
Unfortunately, disabling the loading of external entities is a little counterintuitive. The main issue we are faced with is that xml.sax.parseString (and its sibling, xml.sax.parse) gives you no way to set any parser options. This means we have to re-implement part of those functions to get access to the setting we want to change. In our case, the code would look as follows:

import StringIO  # Python 2 module; on Python 3 use io.StringIO

obj = MyObject()

handler = MyContentHandler(obj)
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.setFeature(xml.sax.handler.feature_external_ges, False)
parser.parse(StringIO.StringIO(my_xml))

print(repr(obj))

Remote entity loading is turned off through the parser's setFeature method. Once we have made this change, running the code will no longer reach out to the network. It also disables the loading of remote DTDs, which is another vector for such attacks, and it will probably make parsing faster because no DTD files have to be fetched over the network. This is something that organisations such as the W3C, whose servers host many commonly referenced DTDs, will appreciate.
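
The steps above can be wrapped in a small helper so that every call site gets the safe setting. The following is a sketch for Python 3, where io.StringIO and io.BytesIO take the place of the Python 2 StringIO module. parse_string_safely and FieldCollector are hypothetical names, not part of xml.sax. The helper follows the same steps as the code above but accepts either str or bytes, and setting the feature explicitly is harmless on Python 3.7.1 and later, where it is already off by default.

```python
import io
import xml.sax
import xml.sax.handler
from xml.sax.xmlreader import InputSource


def parse_string_safely(data, handler):
    """Parse str or bytes XML with external entity loading disabled.

    Hypothetical helper, not part of xml.sax.
    """
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setFeature(xml.sax.handler.feature_external_ges, False)

    source = InputSource()
    if isinstance(data, str):
        source.setCharacterStream(io.StringIO(data))
    else:
        source.setByteStream(io.BytesIO(data))
    parser.parse(source)


class FieldCollector(xml.sax.ContentHandler):
    """Illustrative handler: maps each element name to its text."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self.chars = ""

    def startElement(self, name, attrs):
        self.chars = ""

    def characters(self, content):
        self.chars += content

    def endElement(self, name):
        self.fields[name] = self.chars
        self.chars = ""


# The entity points at a path that does not exist; it is never opened.
untrusted = (
    "<!DOCTYPE person [<!ENTITY remote SYSTEM "
    '"file:///nonexistent/xxe-test">]>'
    "<person><name>John Smith &remote;</name>"
    "<company>Smith Pty Ltd</company></person>"
)
collector = FieldCollector()
parse_string_safely(untrusted, collector)
print(collector.fields["name"], "|", collector.fields["company"])
```

Keeping the setFeature call in one place makes it harder for a later caller to fall back to xml.sax.parseString and whatever default the running interpreter happens to have.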

Tags: python, security
