Coverage of 'Race to Zero' has focussed attention, at least for a short while, on the very real problem that polymorphism poses for those who are trying to filter out all the different types of malware that can arrive on a user's system.
In Information Security terms, polymorphism is used to describe a malware sample that can exist in multiple different forms (usually different binary executables) yet still have the same active payload.
Because polymorphism isn't a new concept, a number of techniques have been introduced over the years to automatically morph software so that it can slip past protective software. Fortunately for those writing the detection tools, many of these early attempts left obvious signatures in the resulting files, making it fairly straightforward to detect the payload even the first time a file with that exact byte structure had been created.
Over time the code used to generate the morphed variants improved, and it began to take more effort from antimalware developers to keep up, with many suggesting that the malware developers are winning.
PDF files have been targeted in the past as a means to slip malware past scanners. The approach is made easier by the fact that a PDF file is not just the formatted document that most people are familiar with: it is a set of instructions that can tell the PDF interpreter to perform various autonomous actions through some simple scripting commands. The general belief is that PDFs are a 'safe' document format, but an increasing amount of research is being invested in uncovering vulnerabilities in this file format.
Didier Stevens recently privately reported a discovery (to be posted here) showing some simple tricks that make any PDF-embedded malware polymorphic, to the extent that effectively the only way to know that something is hiding inside is to actually interpret the PDF as if it were being displayed.
From Stevens' work it seems that PDF interpreters are more than happy to interpret alternative string encodings (hexadecimal, octal, and ANSI are the examples he uses) as if they were straight text. This should be fairly straightforward to check for, but it does force any scanning application to devote more resources to each file it scans. Extending the finding is the discovery that effectively unlimited amounts of whitespace (" ") can be placed between each character and PDF interpreters will still correctly interpret the content.
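To illustrate why this matters for a byte-signature scanner, here is a minimal Python sketch that is not taken from Stevens' work. It assumes the standard PDF string syntax, in which a literal string in parentheses may contain octal escapes and a hexadecimal string in angle brackets may have whitespace between its digits. The function names are hypothetical, and only these two encodings are handled.

```python
import re

def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hex string such as '<4A 53>'.

    Whitespace between the hex digits is ignored; an odd trailing
    digit is padded with 0.
    """
    digits = re.sub(r"\s+", "", token.strip()[1:-1])
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)

def decode_literal_string(token: str) -> bytes:
    """Decode a PDF literal string such as '(\\112S)'.

    Simplification: only octal escapes (\\ddd) are handled here.
    """
    body = token.strip()[1:-1].encode("latin-1")
    return re.sub(
        rb"\\([0-7]{1,3})",
        lambda m: bytes([int(m.group(1), 8) & 0xFF]),
        body,
    )

# Three different byte sequences on disk, one value once interpreted.
plain = decode_literal_string("(JS)")
octal = decode_literal_string(r"(\112\123)")
spaced_hex = decode_hex_string("<4A     53>")
```

All three tokens decode to the same two bytes, yet a scanner matching raw file bytes would see three unrelated sequences, and the amount of padding inside the hex form can be varied at will.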
Probably worst of all, the effectively unlimited variants produced by the tricks above can be obfuscated even further by encrypting the malware payload using PDF encryption support.
In order to counter these problems, an antimalware scanner will now need to decrypt all encrypted content that is not password protected, reduce all string representations in a PDF file to a common type (canonicalisation), and then strip all whitespace from the file before scanning for a malware payload. This effectively reverses the polymorphism; however, it does impose a significant increase in the level of resources required to scan each PDF file.
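The canonicalisation and whitespace-stripping steps can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not a real scanner: the names `canonicalise` and `contains_signature` are hypothetical, decryption is left out entirely, and only hexadecimal strings and octal escapes are normalised.

```python
import re

# The six whitespace characters defined by the PDF syntax.
_WS = rb"[\x00\t\n\x0c\r ]"

# A hex string <...>, excluding the << and >> dictionary delimiters.
_HEX_STRING = re.compile(rb"(?<!<)<((?:[0-9A-Fa-f]|" + _WS + rb")*)>(?!>)")
_OCTAL_ESCAPE = re.compile(rb"\\([0-7]{1,3})")

def _hex_to_literal(match):
    digits = re.sub(_WS, b"", match.group(1))
    if len(digits) % 2:
        digits += b"0"  # an odd trailing digit is padded with 0
    # Simplification: decoded bytes are not re-escaped, so a decoded
    # parenthesis or backslash would need extra handling in real use.
    return b"(" + bytes.fromhex(digits.decode("ascii")) + b")"

def _octal_to_byte(match):
    return bytes([int(match.group(1), 8) & 0xFF])

def canonicalise(data: bytes) -> bytes:
    """Rewrite hex strings and octal escapes as plain bytes, then drop whitespace."""
    data = _HEX_STRING.sub(_hex_to_literal, data)
    data = _OCTAL_ESCAPE.sub(_octal_to_byte, data)
    return re.sub(_WS + rb"+", b"", data)

def contains_signature(data: bytes, signature: bytes) -> bool:
    """Search the canonical form; the signature itself must be whitespace-free."""
    return signature in canonicalise(data)

# Three hypothetical variants of the same made-up payload marker.
variants = [
    b"/JS (evil_payload)",
    b"/JS <65 76 69 6C 5F 70  61 79 6C 6F 61 64>",
    rb"/JS (\145vil_\160ayload)",
]
canonical_forms = {canonicalise(v) for v in variants}
```

Every variant collapses to the same canonical byte sequence, so a single signature matches all of them. The cost is an extra pass (or several) over each file, which is the resource increase described above.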
Stevens reassures readers that there is "nothing alarmist" about what he has discovered. What he considers important is that "you have to be careful with PDF documents from an unknown source (mail), because you cannot completely rely on your AV, antispam or NIDS software to block malicious PDF documents." When "the bad guys can read the PDF specs too" and PDF interpreters choose to apply the PDF specifications in slightly different ways, a payload that might prompt for user interaction in Adobe's official PDF reader might trigger automatically in another, according to Stevens.