The Australian Bureau of Statistics (ABS) is ramping up for a massive change to its use of data standards, with plans to move to open source formats from next year.
The 2011 Census of Population and Housing is expected to become the first dry run of the bureau’s implementation of XML-based Data Documentation Initiative (DDI) and Statistical Data and Metadata Exchange (SDMX) formats, with the ABS directing software developer Space-Time Research to utilise the standards for both input and output of all data collected next year.
Successful use of the formats will ultimately allow users to access Web services from the bureau, including better search capabilities for both statistics and metadata.
Though previously reliant on proprietary data and metadata formats for its census data, the bureau has begun exploring the simultaneous use of the standards as part of its Remote Execution Environment for Microdata (REEM) and Electronic Data Reporting (EDR) projects. In an internal working document, the bureau stipulated the standards were part of a greater effort to facilitate machine-to-machine communication.
“SDMX and DDI are not cure-all's [sic] that will solve all of the ABS' current data and metadata issues,” the document read. “However... their implementation will represent a significant step in improving our current capabilities in the capture, reuse and communication of data and metadata.”
Michael Beahan, branch manager of data management and classification at ABS, told Computerworld Australia that the new standards were a step forward from the do-it-yourself data retrieval services released following the 2006 census.
“When you get your data, this is what it’s going to look like, it’s in this format,” he said. “DDI and SDMX are good at describing things, and we’re testing the very notion that you can actually consume this stuff and make it discoverable metadata for your search engines.”
Though the two standards were developed by separate international statistical communities, they are typically seen as complementary - while DDI is useful for describing and inputting data in a structured way, SDMX is better suited to searching for specific variables in a dataset.
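To give a flavour of the machine-to-machine consumption the bureau is aiming for, the sketch below pulls a specific variable out of an SDMX-style XML fragment. It is illustrative only: the element and attribute names are simplified placeholders rather than the real SDMX-ML schema (which involves namespaces, key families and registries), and the figures are invented.

```python
# Illustrative only: a simplified, SDMX-flavoured XML fragment.
# Element/attribute names are placeholders, not the actual SDMX-ML
# schema, and the observation values are invented for the example.
import xml.etree.ElementTree as ET

sample = """
<DataSet>
  <Series FREQ="A" REGION="AUS">
    <Obs TIME_PERIOD="2006" OBS_VALUE="100"/>
    <Obs TIME_PERIOD="2011" OBS_VALUE="110"/>
  </Series>
</DataSet>
"""

def observations(xml_text, region):
    """Return {period: value} for the series matching a region code."""
    root = ET.fromstring(xml_text)
    for series in root.iter("Series"):
        if series.get("REGION") == region:
            return {obs.get("TIME_PERIOD"): int(obs.get("OBS_VALUE"))
                    for obs in series.iter("Obs")}
    return {}  # no matching series

print(observations(sample, "AUS"))  # → {'2006': 100, '2011': 110}
```

Because the structure (series keyed by dimensions, observations keyed by time period) is standardised rather than ad hoc, the same small consumer works against any dataset published in the format - which is the point of moving off proprietary formats.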
The bureau’s investigations into the standards have extended to the public through its recent competition CodePlay, which provides DDI and SDMX-based datasets to tertiary education students for creation of mash-ups and other data tools. Based on the Gov 2.0 mash-up competitions previously run by NSW and Victorian state governments, the bureau is hoping to find bright statisticians while also exploring unique ways of utilising the formats.
“It’s not outsourcing our work but it is seeing how people can be creative with our statistics,” said Rachel Oldmeadow, technical architect and project manager of CodePlay at the agency. “We definitely want to see who’s keen, who’s interested in statistics and metadata, open data, data linking and what people can do with it as well.
“If people come up with good ideas it benefits not just the ABS but the standards communities as well.”
As no Australian datasets currently conform to either standard, CodePlay only includes international data for the time being. Winners are expected to be announced after the competition closes on 15 April 2011.
The ABS has been working with the Australian Government Information Management Office (AGIMO) on ways of standardising metadata ahead of the launch of the Federal Government’s final data repository, but it remains unclear whether DDI and SDMX will be put forward as possible standards for other agencies.
“We’re not saying to the world ‘you have to use DDI and SDMX’,” Beahan said. “What we’re saying is ‘hey, you’re putting data out into the community, think about what standards you want to use’.”
Beahan agreed with one winner of the recent Apps4NSW competition, who argued greater metadata capabilities were needed to ensure datasets were easily accessible and readable, and could be disseminated by third parties for mash-ups and reuse.
Current attempts at defining datasets across agencies - including the AGLS format - weren’t good enough, according to Beahan.
“You can look at the dataset name and make some assumptions about what’s in there, but if you’ve got no idea by looking at the name it’s very difficult to discover stuff without simply opening up every dataset and having a look,” he said.
“If this is going to work effectively, there needs to be some sort of minimum set of requirements for what people need to do to make sure that what they’re putting up there is actually useful and consumable by somebody else.”
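The kind of minimum-requirements check Beahan describes might look something like the sketch below. The required fields here are invented for illustration - they are not an actual ABS, AGLS or AGIMO requirement list.

```python
# Sketch only: the required fields are invented for illustration,
# not drawn from any actual ABS, AGLS or AGIMO specification.
REQUIRED_FIELDS = {"title", "description", "publisher", "temporal_coverage"}

def missing_metadata(record):
    """Return the required fields a dataset record fails to supply."""
    supplied = {key for key, value in record.items() if value}
    return sorted(REQUIRED_FIELDS - supplied)

dataset = {"title": "Population by region", "publisher": "ABS",
           "description": "", "temporal_coverage": "2006-2011"}
print(missing_metadata(dataset))  # → ['description']
```

A publishing pipeline could refuse to list a dataset until this returns an empty list, which is one way to make “putting data out into the community” actually discoverable by others.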
The bureau has been readying IT systems for next year’s census, establishing a large data processing centre in Melbourne and exploring the notion of a private cloud that would host census data on-premises, rather than outsourcing it to IBM.
Follow James Hutchinson on Twitter: @j_hutch
Follow Computerworld Australia on Twitter: @ComputerworldAU