censusfail, cryptography, security, open data, research
Rosie Williams, BA (Sociology)
28th Mar 2019
We Are All Special Snowflakes And Why It Matters
Macquarie University researcher responds to ABS TableBuilder claims.

Researchers at Macquarie University recently discovered a vulnerability in the algorithm that governs public access to census data:

Today, we release our work that identifies and demonstrates a vulnerability of the Perturbation Algorithm used by the Australian Bureau of Statistics for its online tool, TableBuilder, that enables querying the Australian Census Data.

The ABS was quick to dismiss concerns:

The ABS argues that recent claims from academics about the TableBuilder tool are incorrect and alarmist.

Given these conflicting accounts, I interviewed one of the researchers, Hassan Asghar, to see whether I could clarify the situation.

The vulnerability was discovered by Dr Dali Kaafar and Hassan Jameel Asghar, both of whom split their time between lecturing at Macquarie University and consulting for the CSIRO.

Our chat ranged across several topics, from the significance of the algorithms we rely on to protect our personal information to the independence of cryptography research from government influence.

I was initially struck by the realisation that this newly identified vulnerability is likely to have been part of TableBuilder for its entire history. I suspect that the significance of this was lost on a lot of people when the story originally broke.

Further to this, Hassan also made it clear that simply because his team decided to take a look at TableBuilder and found a flaw, this is no guarantee that other vulnerabilities do not still exist. We can't know what we don't know.

At this point it occurred to me what a hit-and-miss prospect security is when vulnerabilities can exist for years and only come to light because a cryptographer somewhere decides, perhaps by chance, to take a poke around.

For all the expertise of the Macquarie University cryptographers, Hassan was at pains to explain to me that the math behind the vulnerability is relatively straightforward. He even went so far as to write a blog post explaining the logic behind the hack.

This explanation helps to clarify the original statements regarding re-identification and the ability to 'reconstruct the original census data', a claim the ABS has disputed.

It is apparent from Hassan's blog post that what was intended by 'the original census data' is a little different from how this claim was interpreted and disputed by the ABS.

The claim that a sophisticated attacker could reconstruct the entire Census database through the use of ABS TableBuilder is factually incorrect because the entire Census database is not available in TableBuilder.

TableBuilder is the name of the API that provides the interface between unit-record-level data and the public - but it is the algorithm used within that API that safeguards data from unintended publication.

This API was created to allow people to derive high-level statistics from the individual fields that collectively form the census data. Each of those fields contains your answer to one of the questions asked on the census form.

TableBuilder is intended to count individual responses to questions to create overall totals (statistics), but to perturb, or change, those totals ever so slightly before showing them to the public for download.

This process is called adding 'noise' to the data and is significant here because while we may end up feeling like a statistic, one of many, we are actually very unique snowflakes.

Our individual profiles, when each field is considered against everyone else's, are often unique, and it is this uniqueness that makes us easier to identify if a hacker can find an alternative data source - the electoral roll perhaps - to fill in the blanks and effectively 'reconstruct' the census dataset.
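To make the snowflake point concrete, here is a minimal sketch in Python. The records, field names and the `unique_fraction` helper are all made up for illustration and have nothing to do with real census data. It simply measures what share of people become one-of-a-kind as more fields are taken into account.

```python
from collections import Counter

# Hypothetical de-identified records: (age, sex, postcode, occupation).
# These rows are invented for illustration only.
FIELDS = ("age", "sex", "postcode", "occupation")
records = [
    (34, "F", "2109", "nurse"),
    (34, "F", "2109", "teacher"),
    (34, "M", "2109", "nurse"),
    (51, "F", "2113", "nurse"),
    (51, "F", "2113", "nurse"),
    (34, "F", "2113", "engineer"),
]

def unique_fraction(rows, n_fields):
    """Share of rows whose first n_fields values are shared with no other row."""
    keys = [row[:n_fields] for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

# Print the share of unique profiles as each extra field is considered.
for n in range(1, len(FIELDS) + 1):
    print(FIELDS[:n], round(unique_fraction(records, n), 2))
```

In this toy table nobody is unique on age alone, yet most rows are unique once all four fields are combined. A real dataset with dozens of fields would be expected to behave the same way, only more so.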

But first we have to get the unperturbed data out of the census to have something to match against the electoral roll. Hassan's blog post explains how this was possible.

The blog post details the steps a hacker would go through in order to reconstruct the data in each individual cell, by performing multiple queries on TableBuilder that would allow the true value of each cell to be calculated.
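The general idea of cancelling out noise with multiple queries can be sketched in a few lines of Python. This is a toy illustration, not the researchers' actual method and not the real TableBuilder algorithm. It assumes the noise is a small whole number between -2 and 2, averages to zero, and is fixed for a given set of contributing records, so that simply repeating one query gains nothing. The names `perturbed_count` and `averaging_attack` and all of the data are hypothetical.

```python
import hashlib
import random

NOISE_BOUND = 2  # assumption: noise is an integer in [-2, 2]

def perturbed_count(record_ids):
    """Toy stand-in for a perturbed query: true count plus bounded noise.

    The noise is derived from the set of contributing records, so the same
    query always returns the same answer (an assumed behaviour).
    """
    ids = sorted(record_ids)
    digest = hashlib.sha256(",".join(map(str, ids)).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return len(ids) + rng.randint(-NOISE_BOUND, NOISE_BOUND)

def averaging_attack(target_ids, aux_attributes):
    """Estimate the true size of a cell from many different ways of splitting it.

    Each auxiliary attribute splits the target group into two sub-groups.
    The two perturbed sub-counts add up to the true total plus fresh noise,
    and averaging over many splits lets that noise cancel out.
    """
    estimates = []
    for aux in aux_attributes:
        part_a = [i for i in target_ids if aux[i] == 0]
        part_b = [i for i in target_ids if aux[i] == 1]
        estimates.append(perturbed_count(part_a) + perturbed_count(part_b))
    return round(sum(estimates) / len(estimates))

# Synthetic example: a cell of 37 people and 2000 made-up yes/no attributes.
data_rng = random.Random(2019)
target_ids = list(range(37))
aux_attributes = [
    {i: data_rng.randint(0, 1) for i in target_ids} for _ in range(2000)
]

published = perturbed_count(target_ids)  # what a single query would show
recovered = averaging_attack(target_ids, aux_attributes)
print("single query:", published, "| averaged estimate:", recovered)
```

The number of splits here is far larger than anything a real table would offer and was chosen only to make the cancellation obvious. Hassan's blog post remains the place to go for the actual steps.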

While the ABS would have us believe that simply removing name and address fields from the census data means that the census data cannot be reconstructed in full, Hassan demonstrated that it was possible to get around the perturbation and access the original values.

That name and address have been removed does not mean that a hacker could not then take the available information and match it with other datasets.

The more fields available in the census data, the easier it is to match with identifying information.
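A rough sketch of how such matching works is shown below. Both tables are invented, the names are placeholders, and `link_records` is a hypothetical helper rather than any real tool. It joins a 'de-identified' table to a named one on the fields they share and keeps only the pairs where the combination points to exactly one person on each side.

```python
from collections import defaultdict

# Hypothetical de-identified rows: no name or address, invented values.
census_rows = [
    {"age": 34, "sex": "F", "postcode": "2109", "occupation": "nurse"},
    {"age": 51, "sex": "M", "postcode": "2113", "occupation": "teacher"},
    {"age": 51, "sex": "M", "postcode": "2113", "occupation": "engineer"},
]

# Hypothetical second dataset that does carry names (placeholders only).
named_rows = [
    {"name": "Alex Example", "age": 34, "sex": "F", "postcode": "2109"},
    {"name": "Sam Sample", "age": 51, "sex": "M", "postcode": "2113"},
    {"name": "Kim Placeholder", "age": 51, "sex": "M", "postcode": "2113"},
]

QUASI_IDENTIFIERS = ("age", "sex", "postcode")

def link_records(deidentified, named, keys=QUASI_IDENTIFIERS):
    """Pair a de-identified row with a name when the shared fields
    match exactly one row in each dataset."""
    def index(rows):
        buckets = defaultdict(list)
        for row in rows:
            buckets[tuple(row[k] for k in keys)].append(row)
        return buckets

    left, right = index(deidentified), index(named)
    matches = []
    for key, rows in left.items():
        candidates = right.get(key, [])
        if len(rows) == 1 and len(candidates) == 1:
            matches.append((candidates[0]["name"], rows[0]))
    return matches

matches = link_records(census_rows, named_rows)
for name, row in matches:
    print(name, "->", row["occupation"])
```

Only the one person whose age, sex and postcode are unique in both tables is re-identified in this example. Adding further shared fields to `QUASI_IDENTIFIERS` would be expected to split the remaining pair apart as well.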

This process is called re-identification, and the Commonwealth government advises agencies that protecting the anonymity of data relies on a lot more than simply removing names and addresses.

Removing identifying details, such as name, from a dataset does not necessarily protect identity, as other variables can be used to deduce the identity of an individual or organisation in the dataset. For example, the identity of a person with a very rare disease or health condition could be deduced even in highly aggregated data.

Despite this, the ABS has claimed the vulnerability is a non-issue so far as protecting your privacy is concerned. But there is more to be concerned about here than just the census data.

It is worth noting that the ABS employs the same or similar individuals and teams to build and evaluate the systems they publicly declare confidence in.

It seems contradictory to claim that the experts finding flaws in its security are misinformed when the ABS can only rely on this very expertise to develop and defend its solutions.

People who have been following my writing since the last census will be aware that privacy experts were up in arms due to names and addresses from the census being used for the first time to create a key to join other datasets together.

The #CensusFail furore died down but the ABS has been quietly beavering away in the background to combine our personal data from other agencies for the first time in panopticon-like research projects.

With the exception of health data, which has a carve-out in the legislation, it is actually a breach of the Privacy Act 1988 to use what is termed 'administrative data' for purposes other than providing a service to us.

It is one of the protections of the Privacy Act 1988 that an organisation covered by it cannot collect data without our permission for any reason other than providing a service. The Australian Privacy Principles state that organisations cannot collect data for one reason and then decide to use it for another purpose for which we have not given our informed consent.

In November 2015 (and coinciding with the very brief public consultation to de-identify the census) the Commonwealth data governance guidelines were quietly updated. Despite the fact that the authorising legislation had not (and still has not) been put before parliament, the Commonwealth prohibition against the re-purposing of administrative data for research was removed.

Given that the Privacy Act 1988 actually militates against such plans, the government has introduced new legislation called The Data Sharing and Release Act, which, when passed, will override the Privacy Act where the two laws conflict.

This matters to the conversation about TableBuilder (which only allows access to census data) because the exact same algorithm has been trialled for use with the API developed by Data61 for users to query the integrated data projects.

The ABS has developed a new API for querying the integrated data projects which provide access to a range of linked datasets including our health, social security, tax and education data. The API that has been developed for this purpose is called Protari.

Therefore, any vulnerability that affects that algorithm will also affect any other instance where that algorithm is put to use.

Given that to date Protari is the best shot the ABS has for protecting the privacy of our combined administrative data, the flaw recently identified by Macquarie University cryptographers raises the question of whether we can really rely on the ABS to protect so much sensitive data.

The ABS is conducting user trials of 'Protari' - a tool to enable more users to gain insights from integrated data while preserving confidentiality. Protari is being developed collaboratively with Data61 under the National Innovation and Science Agenda's Platforms for Open Data Program. In the initial stage of the trial, the ABS is inviting participation from analysts associated with Australian Commonwealth or State Government agencies, particularly partner agencies of the Multi-Agency Data Integration Project.

In an email the ABS stated:

Protari was developed jointly with Data61 and uses the same confidentiality algorithm as TableBuilder. The ABS is still testing Protari and has not yet made a decision on whether or not to productionise it.