De-duplication has become popular for backup data, but not for primary storage. Now, US start-up company Ocarina Networks wants to change that, with a data reduction technology which it claims can shrink live production data too - even if the file formats are already compressed.
The technology has already been picked up by Photoways Group, which runs a British photo-sharing site, Photobox. It expects to save millions of euros in deferred storage hardware purchases as a result, according to its CTO.
In effect, the Ocarina technology disassembles stored files into their constituent parts in order to compress them, via an out-of-band hardware appliance. The compressed files are then restored when needed via a file system filter driver.
The problem with using current de-dupe schemes on primary storage is that "you're much less likely to find duplicate blocks in an online subdirectory, say," explained Carter George, Ocarina's products VP and co-founder. He pointed out that where there are duplicates, they are often not redundant - on replicated storage arrays, for instance.
However, that doesn't mean there's no redundancy within the files, he added: "For example, a PowerPoint, a PDF, a Word document and a Jpeg all might contain the same picture, but it's rescaled, or pasted in a different format, or whatever, and while a human would say 'It's the same picture', on disk there's no common bytes."
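To make George's point concrete, here is a minimal sketch of conventional block-level de-duplication. It illustrates the general technique only and is not Ocarina's code: the function name dedupe_ratio, the 4,096-byte block size and the synthetic "picture" data are all assumptions made for the example.

```python
import hashlib


def dedupe_ratio(blobs, block_size=4096):
    """Return the fraction of fixed-size blocks already seen in earlier data."""
    seen = set()
    total = 0
    duplicates = 0
    for blob in blobs:
        for offset in range(0, len(blob), block_size):
            digest = hashlib.sha256(blob[offset:offset + block_size]).digest()
            total += 1
            if digest in seen:
                duplicates += 1
            else:
                seen.add(digest)
    return duplicates / total if total else 0.0


# Deterministic, non-repeating stand-in for an image: 1,024 digests = 32,768 bytes.
picture = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(1024))

# Backup-style data: a second identical copy de-duplicates completely.
backup_ratio = dedupe_ratio([picture, picture])

# Primary-storage-style data: the same bytes behind a hypothetical 3-byte
# container header are shifted off the block boundaries, so nothing matches.
embedded_ratio = dedupe_ratio([picture, b"HDR" + picture])
```

In this toy case the two identical copies share every block, while the embedded copy shares none, even though a person would call it the same picture. Real documents go further and re-encode or rescale the image, which leaves even less in common on disk.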
So, in a process the company calls ECO (for extract, correlate, optimize), Ocarina's storage optimizer appliance cracks open the file format and de-duplicates its constituent elements by looking for patterns at the information level, he claimed.
Using this method, even compressed image formats such as Jpeg can be compressed still further, George claimed. That's because a set of photos of the same event will share image elements - and therefore some of their underlying mathematical properties - and those can be de-duplicated.
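Ocarina has not published how ECO works, so the following is only a rough sketch of the extract-and-correlate idea in its simplest form. It assumes the documents are zip-based containers (as modern Word and PowerPoint files are) and that the shared picture is byte-identical inside each one. The helper names make_container, extract_elements and unique_element_bytes are hypothetical.

```python
import hashlib
import io
import zipfile


def make_container(members: dict[str, bytes]) -> bytes:
    """Build an in-memory zip archive, standing in for a document file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def extract_elements(container: bytes) -> list[bytes]:
    """Extract step: unpack every member of the container."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return [archive.read(name) for name in archive.namelist()]


def unique_element_bytes(containers: list[bytes]) -> tuple[int, int]:
    """Correlate step: return (total, unique) element bytes across containers."""
    sizes_by_digest = {}
    total = 0
    for container in containers:
        for element in extract_elements(container):
            total += len(element)
            sizes_by_digest.setdefault(hashlib.sha256(element).digest(), len(element))
    return total, sum(sizes_by_digest.values())


# The same made-up 16,384-byte "image" embedded in two different documents.
image = bytes(range(256)) * 64
report = make_container({
    "word/document.xml": b"<doc>quarterly report</doc>",
    "word/media/image1.png": image,
})
slides = make_container({
    "ppt/slides/slide1.xml": b"<slide>summary</slide>",
    "ppt/media/image1.png": image,
})
total, unique = unique_element_bytes([report, slides])
```

Here the saving, total minus unique, is exactly one copy of the image. That is the easy part: matching a picture that has been rescaled or re-encoded, or finding shared structure between different photos of the same event, needs comparison at the information level, which this sketch does not attempt.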
"The maths to do this is really hard," George said. "Most companies concentrate on the D part of R&D. We have seven PhD mathematicians doing breakthrough mathematical research on how to find patterns."
The ECO process is extremely processor-intensive, so the optimizer box is a 16-core Linux appliance. It works out-of-band, pulling files off your NAS system, compressing them and then putting them back in Ocarina format - a size-reduced shadow format, with bit-for-bit consistency checks.
File reconstruction is much faster and is handled by reader software, also Linux-based. You can install it as a filter on a web or application server, or on a workstation, or buy a complete Ocarina Reader appliance.
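The write-then-verify flow and the separate read path can be sketched as follows. This is an illustration under stated assumptions, not Ocarina's format: zlib stands in for the undisclosed shadow format, and optimize_file and restore_file are hypothetical names.

```python
import zlib


def optimize_file(original: bytes) -> bytes:
    """Write path: compress, then confirm the round trip is bit-for-bit exact."""
    shadow = zlib.compress(original, 9)  # level 9: slowest, smallest output
    if zlib.decompress(shadow) != original:
        raise ValueError("round-trip check failed; keep the original file")
    return shadow


def restore_file(shadow: bytes) -> bytes:
    """Read path: rebuild the original bytes on demand."""
    return zlib.decompress(shadow)


# Made-up, highly repetitive sample data.
document = b"quarterly figures, " * 1000
shadow = optimize_file(document)
restored = restore_file(shadow)
```

The split mirrors the one described above: the expensive work and the consistency check happen once, away from the data path, while the read side only has to decompress, which is the cheaper direction for zlib as well.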
The reconstruction process adds around 4ms of latency, George said, and because you can have multiple readers - Ocarina sells unlimited site licences - the reader shouldn't be a single point of failure.
He added that, as well as selling the technology in appliance form, Ocarina is working with other suppliers to develop integrated tier-2 storage subsystems.