Data Anonymization Techniques in Cloud: Literature Review,
Research Opportunities and a Web Tool for Data Protection
Ítalo O. Santos1, Emanuel F. Coutinho1, Leonardo O. Moreira1
1Instituto Universidade Virtual (UFC Virtual)
Universidade Federal do Ceará (UFC) – Fortaleza, CE – Brasil
oliveira.italo07@gmail.com, {emanuel,leoomoreira}@virtual.ufc.br
Abstract. Cloud Computing provides an infrastructure to run applications with
quality of service. When the user hires a cloud provider, there is a loss of control
over data security and privacy issues. Privacy in cloud is the ability of a user or
organization to control what information can be revealed about themselves and
take control of who can access certain information. Due to this, there is a need
to use techniques or strategies to preserve the security and privacy of user data
in cloud. This paper surveys the literature about data anonymization techniques
and presents a tool called SMDAnonymizer to generate anonymized files using
the k-anonymity algorithm to increase the privacy of data sets. We highlight
some research challenges and opportunities in data anonymization and present
results obtained with our tool applied in real data. Finally, we discuss future
work that we intend to carry out to improve our tool.
1. Introduction
With the advancement of modern human society, basic and essential services are almost
all delivered transparently. Utilities such as water, electricity, gas and telephone have
become fundamental to our lives, being exploited through the use-based payment model
[Vecchiola et al. 2009]. Nowadays, existing business models provide these services any-
where and anytime. These services are charged according to the various collection policies applied to the end user. This also applies to services offered in technology areas, and this is
due to the growth and spread of Cloud Computing (CC).
CC is a recent technology trend aimed at providing on-demand, pay-for-use In-
formation Technology (IT) services. Trends prior to CC were limited to a particu-
lar class of users or focused on making available a specific demand for IT resources
[Buyya et al. 2009]. CC proposes to be global, delivering services that range from the end user who hosts personal documents such as texts, videos and images on the Internet to companies that outsource their entire IT infrastructure, offering their services through the cloud.
CC greatly simplifies how users consume services, requiring only an operating system, a browser and Internet access. Computational resources are available in the cloud, so the user's machines do not need high computational power, which reduces the cost of acquiring machines and makes access to this environment easier for end users. The
CC model was designed and developed with the objective of providing services that are
easy to access, low cost, and guaranteed availability and scalability.
CC is a recent area of research that, in addition to being subject to common
risks relevant to IT environments, has its own set of security problems, identified by
[Krutz and Vines 2010] in seven categories: network security, interfaces, data security,
virtualization, governance, compliance, and legal issues. In order for a CC environment
to be exploited by corporations, the security and privacy of data stored in the cloud is a
fundamental requirement. Research papers that discuss the topic of privacy cover disci-
plines from diverse fields of knowledge such as philosophy, political science, information
science, engineering and computer science [Branco Jr et al. 2014].
Privacy is a concept directly related to people, and it is a human right, such as free-
dom, justice or equality before the law, and is directly related to people’s interest in main-
taining a personal space, without the interference of other people or organizations. More-
over, it ensures that individuals control or influence what information related to them can
be collected and stored by someone and with whom they can be shared [Stallings 2007].
Privacy in CC is the ability of a user or organization to control what information can be
revealed about themselves in the cloud, that is, take control of who can access certain
information and how it can occur. [Jr et al. 2010] defined three privacy dimensions:
• Territorial Privacy: Protection of the region close to an individual;
• Individual Privacy: Protection against moral damages and unwanted interference;
• Information Privacy: Protection for personal data collected, stored, processed and
propagated to third parties.
In a cloud, it is important to note that developers and users deliver their appli-
cations and data to be managed on the infrastructures and platforms provided by cloud
providers. In this sense, the need arises to adopt techniques so that the delivered data are protected from unauthorized internal or external access in an environment that can be controlled by third parties, especially for data considered sensitive or highly private [Sousa et al. 2009]. There
are some techniques currently proposed by the academic community for data protection, which can be used and applied with the aim of anonymizing data, such as the following (an illustrative sketch appears after the list):
• Generalization: Replaces quasi-identifier attribute values with less specific but se-
mantically consistent values representing them. The technique categorizes the at-
tributes, creating a taxonomy of values with levels of abstraction going from the par-
ticular level to the generic level;
• Suppression: Deletes some identifier and/or quasi-identifier attribute values from the
anonymized table;
• Encryption: Uses cryptography schemes normally based on public key or symmetric
key to replace sensitive data (identifiers, quasi-identifiers and sensitive attributes) by
encrypted data;
• Perturbation: It is used for data mining privacy preservation or for replacing actual
data values with dummy data for masking test or training databases.
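As a minimal illustration of the first two techniques (generalization and suppression), the sketch below coarsens an age into a range and removes an identifier. The record layout, the hierarchy and the helper names are our own assumptions, not part of any tool discussed in this paper.

```python
# Illustrative sketch of generalization and suppression (assumed record
# layout and age hierarchy; not the implementation of any surveyed tool).

def generalize_age(age: int, level: int) -> str:
    """Replace an exact age with a progressively wider interval."""
    if level == 0:
        return str(age)                      # particular level: 23
    width = 10 * level                       # level 1 -> decade, level 2 -> 20-year band, ...
    low = (age // width) * width
    return f"{low}-{low + width - 1}"        # generic level: 20-29, 20-39, ...

def suppress(value: str) -> str:
    """Delete an identifier value from the released record."""
    return "*"

record = {"name": "Alice", "age": 23, "zip": "60455", "salary": "<=50K"}
anonymized = {
    "name": suppress(record["name"]),        # identifier: suppressed
    "age": generalize_age(record["age"], 1), # quasi-identifier: generalized to 20-29
    "zip": record["zip"][:3] + "**",         # quasi-identifier: partially generalized
    "salary": record["salary"],              # sensitive attribute kept for analysis
}
print(anonymized)  # {'name': '*', 'age': '20-29', 'zip': '604**', 'salary': '<=50K'}
```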
In this paper, we investigate the anonymization of data related to the information privacy dimension specified above: we discuss data generalization techniques that help protect users' data in cloud environments and also investigate tools used for data anonymization in such environments. In the face of this situation, we ask ourselves: how much do we know about data anonymization techniques? The general
objective of this research is to study the data generalization techniques currently used to
protect the information in cloud environments. The specific objectives are: (i) to make a
taxonomy of the generalization techniques that were analyzed; (ii) to point out some
research challenges that can be unraveled in the area of data anonymization; (iii) design
and develop a web tool for data anonymization that uses the k-anonymity technique to
anonymize user data; (iv) show the applicability of the web tool to real data; and (v)
present new opportunities related to future versions of the web tool.
2. Cloud Computing
According to the National Institute of Standards and Technology (NIST)
[Mell and Grance 2009], CC is defined as an evolving paradigm. Its definitions,
use cases, technologies, problems, risks and benefits will be redefined in discussions
between the public and private sectors, and these definitions, attributes, and characteris-
tics will evolve over time. In dealing specifically with the definition, a broadly accepted
definition is not yet available. NIST presents the following definition for CC: “CC is a
model that enables convenient and on-demand access to a set of configurable computing
resources (for example, networks, servers, storage, applications, and services) which can
be quickly acquired and released with minimal managerial effort or interaction with the
service provider.”
Another work proposes the following definition: “CC is a set of
enabled network services, providing scalability, quality of service, inexpensive on-
demand computing infrastructure and can be accessed in a simple and pervasive
way” [Armbrust et al. 2009]. In this paper, we adopt the view presented by NIST,
which describes that the CC model consists of five essential characteristics, three service
models and four deployment models [Mell and Grance 2009].
CC has essential features that, taken together, exclusively define CC and distin-
guish it from other paradigms. These features are: Self-service on demand: The user
unilaterally acquires a computational resource, such as server processing time or net-
work storage, to the extent that it needs and does not require human interaction with the
providers of each service; Wide access: Features are made available through the net-
work and accessed through standardized mechanisms that enable use by thin or thick client platforms, such as cell phones, laptops, and PDAs. Users can change their working conditions and environments, such as programming languages and operating systems. Client software systems installed locally for cloud access are lightweight, such as an Internet browser; Resource pooling: Provider computing resources are organized into one pool to serve multiple users using a multi-tenant model [Jacobs et al. 2007], with
different physical and virtual resources dynamically assigned and adjusted according to
users’ demand. These users need not be aware of the physical location of the computa-
tional resources, and can only specify the location at a higher level of abstraction, such
as the country, state or data center; Fast elasticity: Resources can be acquired quickly and elastically, in some cases automatically, when there is a need to scale up with increasing demand, and released when this demand retracts; Measured service: Cloud sys-
tems automatically control and optimize the use of resources by means of a measurement
capability. Automation is performed at some level of abstraction appropriate to the type
of service, such as storage, processing, bandwidth, and active user accounts. The use of
resources can be monitored and controlled, allowing transparency for the provider and
the user of the service. To ensure Quality of Service (QoS), it is possible to use a Service
Level Agreement (SLA) approach. The SLA provides information about levels of avail-
ability, functionality, performance, or other attributes of the service such as billing and
even penalties for violations of these levels.
The CC environment is composed of three service models. These models are im-
portant because they define an architectural standard for CC applications. These models
are: Software-as-a-Service (SaaS): it provides software for specific purposes that are
available to users over the Internet. Software systems are accessible from multiple user
devices through a thin client interface as a web browser. In SaaS, you do not manage
or control the underlying infrastructure, including network, servers, operating systems,
storage, or even the characteristics of the application, except for specific settings. As a
result, developers focus on innovation rather than infrastructure, leading to the rapid de-
velopment of software systems; Platform as a Service (PaaS): a high-level integration
infrastructure is offered to deploy and test applications in the cloud. The users cannot manage or control the underlying infrastructure, including network, servers, operating systems, or storage, but they have control over the deployed applications and possibly the
configurations of the applications hosted on this infrastructure. PaaS provides an oper-
ating system, programming languages and development environments for applications,
helping to implement software systems, as it contains development tools and collabora-
tion between developers; Infrastructure as a Service (IaaS): responsible for providing
all the necessary infrastructure for PaaS and SaaS. It refers to a computational infrastruc-
ture based on computing resource virtualization techniques. The primary goal of IaaS is
to make it easier to provide resources such as servers, network, storage, and other critical
computing resources to build an on-demand environment that can include operating sys-
tems and applications. IaaS has some features, such as a single interface for infrastructure
administration, the Application Programming Interface (API) for interaction with hosts,
switches, routers and the support for adding new equipment in a simple and transparent
way. In general, you do not manage or control the cloud infrastructure, but you have con-
trol over the operating systems, storage, and deployed applications, and eventually select
network components such as firewalls.
CC deployment models can be divided into public, private, community, and hy-
brid cloud, they will be described as follows: Private cloud: the cloud infrastructure is
used exclusively for an organization, this cloud being local or remote and managed by
the company itself or by third parties. In this deployment model, service access poli-
cies are employed. The techniques used to provide such features can be at the level of
network management, service provider configurations, and the use of authentication and
authorization technologies; Public cloud: the cloud infrastructure is made available to the public and can be accessed by any user who knows the location of the service. In this deployment model, access restrictions at the network management level cannot be applied, nor can authentication and authorization techniques; Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community with shared concerns, such as mission, security requirements, policy, and compliance considerations. This type of deployment model may exist locally
or remotely and is generally administered by some community company or by a third
party; Hybrid cloud: there is a composition of two or more clouds, which can be private,
community or public, and which remain as single entities, linked by a standardized or
proprietary technology that allows the portability of data and applications.
3. Privacy
As the amount of personal information transferred to the cloud increases, so does the concern of individuals and organizations about how this data will be stored and processed. The fact that the data is stored in multiple locations, often in a way that is transparent to their owners, causes uncertainty as to the degree of privacy to which they are exposed. Ac-
cording to [Pearson 2013], terminology for dealing with data privacy issues in the cloud
includes the following concepts:
• Data Controller: An entity (individual or legal entity, public authority, agency or or-
ganization) that alone or in conjunction with others determines the manner and purpose
for which personal information is processed;
• Data Processor: An entity (individual or legal entity, public authority, agency or orga-
nization) that processes personal information in accordance with the Data Controller’s
instructions;
• Data Subject: An identified or identifiable individual to whom personal information
refers, either by direct or indirect identification (for example by reference to an identi-
fication number or by one or more physical, psychological, mental, economic, cultural or social factors).
The NIST Computer Security Handbook defines computer security as the protec-
tion afforded to an automated information system in order to achieve the proposed ob-
jectives of preserving the integrity, availability, and confidentiality of information system
resources [Guttman and Roback 1995].
The process of developing and deploying applications to the CC platform, which
follow the Software as a Service (SaaS) model, should consider the following security
aspects of data stored in the cloud [Subashini and Kavitha 2011]:
• Data Security: In SaaS model, data is stored outside the boundaries of the organiza-
tion’s technology infrastructure, so the cloud provider must provide mechanisms that
ensure data security. For instance, this can be done using strong encryption techniques and fine-grained mechanisms for authorization and access control (a brief sketch follows the list);
• Network Security: Client data is processed by SaaS applications and stored on cloud
servers. The transfer of organization data to the cloud must be protected to prevent loss
of sensitive information;
• Data Location: In SaaS model, the client uses the SaaS applications to process their
data, but does not know where the data will be stored. This may be a problem due
to privacy legislation in some countries that prohibit data being stored outside their
geographical boundaries;
• Data Integrity: The SaaS model is composed of multi-tenant cloud-hosted applications. These applications use XML-based Application Programming Interfaces (APIs) to expose their functionalities in the form of web services;
• Data Segregation: In the SaaS model, data from multiple clients may be stored on the same server or database. The SaaS application must ensure the segregation of customer data at the physical level and at the application layer;
• Access to Data: The multi-tenant environment of the cloud can generate problems
related to the lack of flexibility of SaaS applications to incorporate specific policies of
access to data by the users of SaaS client organizations.
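To make the Data Security aspect concrete, the sketch below encrypts a sensitive field before the record leaves the organization. It assumes the third-party Python cryptography package and is only an illustration, not a mechanism taken from the surveyed works.

```python
# Illustrative sketch of the "strong encryption" idea from the Data Security
# aspect: encrypt a sensitive field before sending a record to the cloud.
# Assumes the third-party "cryptography" package (pip install cryptography);
# this is not taken from any tool discussed in this paper.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice the key stays with the data owner
cipher = Fernet(key)

record = {"name": "Alice", "salary": "48000"}
outgoing = {
    "name": record["name"],
    "salary": cipher.encrypt(record["salary"].encode()).decode(),  # stored encrypted in the cloud
}
# Only the key holder can recover the original value:
print(cipher.decrypt(outgoing["salary"].encode()).decode())        # 48000
```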
Table 1. Adult Data Set Records
Gender Age Race Education Native-country Workclass Salary
Male 39 White Bachelors United-States State-gov <=50K
Male 38 White HS-grad United-States Private <=50K
Male 53 Black 11th United-States Private <=50K
Female 37 White Masters United-States Private <=50K
Female 31 White Masters United-States Private >50K
Male 42 White Bachelors United-States Private >50K
Male 37 Black Some-college United-States Private >50K
Female 23 White Bachelors United-States Private <=50K
Male 32 Black Assoc-acdm United-States Private <=50K
Male 32 White HS-grad United-States Private <=50K
3.1. Data Anonymization
Data anonymization is used to preserve privacy over data publishing. Large public and
private corporations have increasingly been required to publish their “raw” data in electronic format, rather than providing only statistical or tabulated data. These “raw” data
are called microdata. In this case, prior to publication, data must be “sanitized” by re-
moving explicit identifiers such as names, addresses, and telephone numbers. For this,
one can use anonymization techniques. In Table 1, we have an adapted example provided
by UCI (University of California at Irvine) Machine Learning Repository1. Each record
corresponds to the personal information for an individual person.
The term anonymity derives from the Greek word anonymia, which means “without a name” or “namelessness”. In informal use, “anonymous” usually refers to a person whose identifying information is unknown, and it expresses the fact that the subject is not uniquely characterized within a set of subjects. In this case, the set is said to be anonymized. The subject concept refers to an active entity, such as a person or a computer; a group of subjects can be a group of people or a network of computers [Pfitzmann and Köhntopp 2001]. A record or transaction is considered anonymous when the data, individually or combined with other data, cannot be associated with a particular subject [Clarke 1999].
From the perspective of disseminating individuals' data, the attributes can be classified as follows [Camenisch et al. 2011] (an illustrative mapping for Table 1 appears after the list):
• Identifiers: Attributes that uniquely identify individuals (e.g. social security number,
name, identity number);
• Quasi-Identifiers (QI): Attributes that can be combined with external information to
expose some or all individuals, or reduce uncertainty about their identities (e.g. date
of birth, ZIP code, work position, function, blood type);
• Sensitive Attributes (SAs): Attributes that contain sensitive information about indi-
viduals (e.g. salary, medical examinations, credit card postings).
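As an illustration only, the mapping below groups the columns of the Adult records from Table 1 into the three classes above; the assignment is our own reading and is not prescribed by the cited authors.

```python
# Hypothetical classification of the Table 1 columns into the three classes
# above; the assignment is illustrative, not taken from the surveyed papers.
ADULT_ATTRIBUTES = {
    "identifiers": [],                                   # Table 1 already omits names/IDs
    "quasi_identifiers": ["Gender", "Age", "Race",
                          "Education", "Native-country", "Workclass"],
    "sensitive_attributes": ["Salary"],
}

def columns_to_generalize(schema: dict) -> list:
    """Quasi-identifiers are the columns an anonymizer generalizes or suppresses."""
    return schema["quasi_identifiers"]

print(columns_to_generalize(ADULT_ATTRIBUTES))
```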
Information disclosure is the means by which a particular individual's identity can be reached. By using anonymization techniques, the privacy of data sets is preserved against various forms of disclosure, according to [Gokila and Venkateswari 2014]:
1 https://archive.ics.uci.edu/ml/datasets/adult
• Identity Disclosure: An individual is usually associated with a record in the published data set. If his or her identity is disclosed, then the corresponding sensitive value of that individual is also divulged;
• Attribute Disclosure: Attribute disclosure occurs when information about an individual's record is revealed, i.e., when attributes of an individual can be inferred with high confidence from the released data;
• Membership Disclosure: Membership information in the released table can imply the identity of an individual through various attacks; when the criterion used to select the records in the data set is itself sensitive, revealing that an individual belongs to the released table leads to a membership disclosure.
3.2. Anonymization Tools
According to [Fung et al. 2010], personal records are collected and used for data analysis by various organizations in the public and private sectors. In such cases, privacy should be ensured so that personal information is not disclosed at the time of data sharing and anal-
ysis. Although anonymization is an important method for privacy protection, there is
a lack of tools which are both comprehensive and readily available to informatics re-
searchers and also to non-IT experts, e.g., researchers responsible for the sharing of data
[Prasser et al. 2014]. Graphical user interfaces (GUIs) and the option of using a wide vari-
ety of intuitive and replicable methods are needed. Tools have to offer interfaces allowing
their integration into pipelines comprising further data processing modules. Moreover,
extensive testing, documentation and openness to reviews by the community are of high
importance. Informatics researchers who want to use or evaluate existing anonymization
methods or to develop novel methods will benefit from well-documented, open-source
software libraries [Prasser et al. 2014]. We briefly review the related work as follows.
The µ-Argus [Argus 2017] is a software program designed to create safe micro-
data files, and is a closed-source application that implements a broad spectrum of tech-
niques, but it is no longer under active development. The sdcMicro [sdcMicro 2017] is a package for the R statistics software used for the generation of anonymized microdata,
i.e. for the creation of public and scientific-use files, which implements many primitives
required for data anonymization but offers only a limited support for using them to find
data transformations that are suitable for a specific context.
The UTD Anonymization Toolbox [Toolbox 2017] was developed by the UT Dallas Data Security and Privacy Lab, which implemented various anonymization methods in a toolbox for public use by researchers. These algorithms can either be applied directly to a data set or be used as library functions inside other applications. The Cornell Anonymization Toolkit (CAT) [Toolkit 2017] was designed for interactively anonymizing published data sets to limit identity disclosure of records under various attacker models. Both are research prototypes that have mainly been developed for demonstration purposes. Problems with these tools include scalability issues when handling
large data sets, complex configuration requiring IT-expertise, and incomplete support of
privacy criteria and methods of data transformation.
The ARX tool, an open-source data anonymization framework, features a cross-platform user interface that is oriented towards non-IT experts and utilizes a well-known and highly efficient anonymization algorithm [Prasser et al. 2014].
4. Literature Review
A systematic review according to [Russell et al. 2009] is a comprehensive protocol-
oriented review and synthesis of studies focusing on a research topic or related key is-
sues. Using a controlled and formal process of bibliographic research, it is expected that
the topics investigated return relevant gaps, challenges, processes, tools and techniques.
In this research, we follow a simplified version of the classical systematic review process, based on the guidelines of [Kitchenham 2004], with some adaptations made by [Coutinho et al. 2015]. In this work, relevant studies have been researched and selected
that address work related to data anonymization and the generalization techniques used to
maintain data security and integrity in CC environments. Figure 1 shows the systematic
review process used in this work.
Figure 1. Systematic Review Adaptation Process
4.1. Activity 1: Plan Review
In this section, all planning activities of the systematic review will be described: the need for the review, the definition of the review protocol, the search queries, the search string, the search sources, the search procedure, the inclusion and exclusion criteria, the study selection procedure and the data extraction procedure.
Privacy in CC is the ability for a user or organization to control what information
needs to be revealed about themselves in the cloud, that is, control who has access to
the information and how it can occur. There are several generalization techniques for
anonymizing data used for data protection, but it is difficult to find a study that classifies
these techniques in a more detailed way. The systematic review in this paper proposes to
answer the following research questions:
• Main Question (QP): What is the state of the art about generalization techniques for
anonymizing data in the cloud?
• Secondary Question 1 (SQ1): What is the most used data anonymization technique?
• Secondary Question 2 (SQ2): What are the research challenges and opportunities?
Initially, we defined the following keywords for the research: “Privacy”, “Data
Anonymization”, “Cloud Computing”, “Techniques”, “Algorithms” and “Evaluation”.
Some search tests were done, but the final result was not satisfactory because the returned studies often showed little relevance to the main theme of this work. After some tests, the search
string was refined generating the following string: “(Privacy AND “Cloud Computing”
AND “Data Anonymization” AND Generalization AND (Techniques OR Algorithms OR
Evaluation))”. The string construction took into account the terms used and the order in which they were arranged in the search sequence. Several tests were performed
until the final string was reached, taking into account the works found.
The repositories used to search for papers in this work were IEEE Xplore2 and Science Direct3. The number of results initially varied between 1,600 and 16,000 according to the repository and was progressively refined as the search string changed.
The same search string was used in the two sources, using the advanced search
engine. In the IEEE Xplore search site, the following steps were done: 1) clicking
the “Command Search” option; 2) by checking the “Search” checkbox in the “Meta-
data Only” option; and 3) by using the string: “(Privacy AND “Cloud Computing” AND
“Data Anonymization” AND Generalization AND (Techniques OR Algorithms OR Eval-
uation))”. Using the Science Direct search site, the following steps were done: 1) clicking
the “Expert search” option; 2) using the string: “(Privacy AND “Cloud Computing” AND
“Data Anonymization” AND Generalization AND (Techniques OR Algorithms OR Eval-
uation))”; 3) selecting the fields: “Computer Science”, “Engineering”, “Mathematics”;
and 4) restricting publication years from 2013.
In the systematic review, the definition of Inclusion Criteria (IC) and Exclusion Criteria (EC) contributes to including primary studies that are relevant and answer the research questions raised previously, and to excluding those works that do not address them. Thus, the primary studies included in the systematic review should meet the inclusion criteria described below:
• Inclusion Criteria 1 (IC1): The primary study should propose or report an approach
of a data anonymization technique;
• Inclusion Criteria 2 (IC2): The keyword “data anonymization” must be in the paper;
• Inclusion Criteria 3 (IC3): The primary study should discuss future work or research
opportunities;
• Inclusion Criteria 4 (IC4): The primary study must have some evaluation where the
technique is tested.
2 http://ieeexplore.ieee.org/Xplore/home.jsp
3 http://www.sciencedirect.com/
The exclusion criteria chosen in this study were:
• Exclusion Criteria 1 (EC1): The study presents contributions in areas other than data anonymization in the cloud;
• Exclusion Criteria 2 (EC2): The primary study is an earlier version of a more com-
plete study of the same research;
• Exclusion Criteria 3 (EC3): The date of publication of the primary study is before
2013.
The following steps were defined to select the studies:
• Step 1 — Search for Keywords: Search strings were applied to each listed search
source;
• Step 2 — First Selection: For each primary study obtained as a result of the searches,
the title and abstract were read and the inclusion and exclusion criteria were applied.
In case of doubt about whether or not to select a study, the introduction and conclusion should be read;
• Step 3 — Second Selection: The primary studies selected in Step 2 should be read in
full and the inclusion and exclusion criteria again applied.
Extracting information from papers was done based on a spreadsheet with ques-
tions oriented to get answers to the review’s research questions. The spreadsheet is di-
vided with items to be filled by each paper, which are: title, year of publication, publica-
tion vehicle, authors, countries, research group, keywords, proposal or approach, obser-
vations, type of analysis, anonymization techniques, tools, future work and ideas.
4.2. Activity 2: Conduct Review
The conduction of the review consists of four activities: identifying primary studies, se-
lecting primary studies, evaluating the quality of the studies and extracting the data, and
these activities will be described in the following subsections.
• Identify Primary Studies: Data were collected in March 2017 and updated in June
2017. As a result, we found 8 papers in the Science Direct repository, and 6 papers
were found in IEEE Xplore. As for the number of results obtained, it is important to note that the number of articles found varied according to the type of search string used: search engines have several configurations, each with its own peculiarities, and a small change in these settings, in the keywords used, or in the operators applied in the search can result in a different number of works, influencing data consolidation and analysis. For the review, the results were sufficient, given the several tests simulating different search strings in pursuit of the largest number of high-quality works.
• Select Primary Studies: The reading of all 14 abstracts was used as a refinement
criterion, divided into 8 from Science Direct and 6 from IEEE Xplore, and 5 studies of
this total were excluded because they did not meet the specified criteria.
• Evaluate Quality of the Studies: The evaluation of the quality of the primary pa-
pers was simplified, verifying the presence or not of some type of data anonymization
technique and some type of experiment.
• Extract Data: Once the set of papers selected for complete reading has been defined,
the process for extracting data has been performed according to the planning specified
above. A spreadsheet was filled out for each paper with specified information. This
activity took about two months to complete.
4.3. Activity 3: Result Review
This activity presents general results of the review and the results for each research ques-
tion. An overview of the results of the review will be presented. These results show some general information related to data anonymization techniques, and the results of the research questions will also be presented, arranged in tables.
5. Results Overview
The number of selected papers per year in both journals and conferences from 2013 to
2016 were: 2013 (1), 2014 (3), 2015 (4) and 2016 (1).
Data anonymization refers to hiding identity and/or sensitive data for owners
of data records. In this way, the privacy of an individual can be effectively preserved while
certain aggregate information is exposed to data users for diverse analysis and min-
ing. [Zhang et al. 2013] investigated the scalability issue of sub-tree anonymization over
big data on cloud, proposing a hybrid approach that combines Top-Down Specializa-
tion (TDS) and Bottom-Up Generalization (BUG) together. Existing TDS and BUG ap-
proaches are developed individually for sub-tree generalization scheme. Both of them
lack the awareness of the user-specified k-anonymity parameter. In fact, the values of
the k-anonymity parameter can impact their performance. Intuitively, if parameter k is large, TDS is more suitable while BUG will probably perform poorly; the case is reversed when k is small. The hybrid approach automatically selects one of the two components by comparing the user-specified k-anonymity parameter. Both TDS and BUG
have been accomplished in a highly scalable way via a series of deliberately designed
MapReduce jobs. Experimental results demonstrated that the hybrid approach significantly
improves the scalability and efficiency of sub-tree data anonymization compared with
existing approaches.
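A minimal sketch of the selection step just described is shown below; the threshold value and the function are our own simplification of the idea, not the authors' actual derivation.

```python
# Simplified illustration of the hybrid approach's selection step: compare the
# user-specified k with a threshold derived from the data set and pick TDS or
# BUG accordingly (the real threshold derivation is defined by the authors).
def choose_component(k: int, threshold: int) -> str:
    return "TDS" if k >= threshold else "BUG"

print(choose_component(k=50, threshold=20))   # large k -> TDS
print(choose_component(k=5, threshold=20))    # small k -> BUG
```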
TDS is an iterative process starting from the topmost domain values in the taxon-
omy trees of attributes, each round of iteration consists of three steps: (i) finding the best
specialization; (ii) performing specialization; (iii) updating values of the search metric
for the next round. [Zhang et al. 2014] investigated the scalability problem of large-scale
data anonymization by TDS, proposing a highly scalable two-phase TDS approach using
MapReduce on cloud. Data sets are partitioned and anonymized in parallel in the first
phase, producing intermediate results. Then, the intermediate results are merged and fur-
ther anonymized to produce consistent k-anonymous data sets in the second phase. They
applied MapReduce on cloud to data anonymization, and they deliberately designed a
group of innovative MapReduce jobs to concretely accomplish the specialization compu-
tation in a highly scalable way. The experimental evaluation was carried out on real-world data sets and showed that, with their approach, the scalability and efficiency of TDS are significantly improved.
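The toy sketch below only illustrates the partition-then-merge idea behind this two-phase approach; the real approach relies on deliberately designed MapReduce jobs on Hadoop, and the placeholder generalization used here is our own.

```python
# Toy illustration of the partition-then-merge idea behind two-phase TDS
# (the real approach runs deliberately designed MapReduce jobs on Hadoop;
# the "generalization" below, coarsening ages to decades, is our stand-in).
from concurrent.futures import ThreadPoolExecutor

def anonymize_partition(partition):
    """Phase 1 (map side): each partition is generalized independently."""
    return [dict(r, age=f"{(r['age'] // 10) * 10}-{(r['age'] // 10) * 10 + 9}")
            for r in partition]

def merge_intermediate(results):
    """Phase 2 (reduce side): merge intermediate results; a real implementation
    would keep specializing until a consistent k-anonymous level is reached."""
    return [record for part in results for record in part]

records = [{"age": a, "salary": "<=50K"} for a in (23, 27, 31, 38, 42, 55)]
partitions = [records[:3], records[3:]]                  # split the large data set
with ThreadPoolExecutor() as pool:                       # stands in for parallel mappers
    intermediate = list(pool.map(anonymize_partition, partitions))
print(merge_intermediate(intermediate))
```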
[Balusamy and Muthusundari 2014] discussed different techniques to provide security to user data, such as the multidimensional k-anonymity technique for protecting users' private data, the suppression technique for anonymization, and generalization, which is the process of replacing an original value with a less specific but semantically consistent value. As the scale of user data increases, providing security to that data becomes a challenge. In this case, the generalization approach is considered the best way to protect user private data effectively and quickly, since it minimizes information and privacy losses with less execution time and better quality of service. This work only presented an overview of some generalization techniques and did not show any experimental evaluation.
[Logeswari et al. 2014] focused on providing efficient analysis of shared Personal Health Records (PHRs) through the proposed Efficient K-Means Clustering (EKMC) algorithm, which clusters the PHRs into several partitions and is superior to the traditional k-means algorithm by improving the time complexity and enhancing the speed of clustering. In general, Data Aggregation and Deduplication (DAD), which integrates both data aggregation and data deduplication, is an information mining process that searches, gathers and presents a summarized report to achieve specific business objectives. Data deduplication is a data compression method that removes duplicate copies of repeated data: if repeated records are found during comparison, they are eliminated and their count is incremented. The proposed DAD algorithm is used to greatly reduce the cost of cloud storage when dealing with huge numbers of records. The privacy of the patients' PHRs is preserved through various data anonymization techniques.
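The deduplication step described above can be pictured as follows; this is a simplified illustration of eliminating repeated records while incrementing a counter, not the authors' DAD implementation.

```python
# Simplified illustration of deduplication with counting: repeated records are
# eliminated and the count of the corresponding retained record is incremented.
def deduplicate_with_count(records):
    counts = {}
    for record in records:
        key = tuple(sorted(record.items()))   # records with identical fields are duplicates
        counts[key] = counts.get(key, 0) + 1
    return [dict(key, count=n) for key, n in counts.items()]

phrs = [
    {"age": 34, "diagnosis": "flu"},
    {"age": 34, "diagnosis": "flu"},
    {"age": 51, "diagnosis": "asthma"},
]
print(deduplicate_with_count(phrs))
# [{'age': 34, 'diagnosis': 'flu', 'count': 2}, {'age': 51, 'diagnosis': 'asthma', 'count': 1}]
```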
[Panackal and Pillai 2015] proposed an approach based on association mining, namely Adaptive Utility-based Anonymization (AUA). Initially, the model is tested with sample instances of the original data set of the National Family Health Survey (NFHS), an important source of data on population, health, and nutrition for India and its states. The paper includes a performance evaluation of the AUA model using these data sets and shows that data anonymization can be done without compromising the quality of data mining results. The model obtains the benefits of k-anonymity in terms of privacy protection and provides maximum information to the users. The AUA approach is a two-step iterative process based on association mining. Using support and confidence, the authors performed association mining on a given data set either for filtering risky instances or for retaining maximum information based on the utility of the data. The first step is based on quasi-sensitive associations among all instances of the given data set, and the second step is based on quasi-quasi associations among instances of the non-frequent set and among certain instances of the frequent set. From this iterative process, multiple versions of anonymized data sets can be served according to the user's needs. This work carried out an experimental evaluation but only specified the type of data set used; it did not describe the technical environment where the experiments were performed.
[S et al. 2015] implemented Two Phase Top Down Specialization (Two Phase
TDS) to improve the scalability and efficiency over Centralized TDS on cloud. This
approach has a map and reduce phase. In map phase the large scale data set is parti-
tioned into smaller data sets. These data sets are anonymized in parallel and produce
anonymized intermediate results. In reduce phase the intermediate results are combined
and further anonymized to produce k-anonymous data that are consistent. In Two Phase TDS, the intermediate data sets are produced after anonymization by the Map function, and the final consistent anonymized data sets are produced after the specialization, i.e., by applying anonymization once the intermediate results are integrated in the Reduce phase. In this technique the data can be either generalized or suppressed using various algorithms. TDS with k-anonymity is the most used generalization algorithm for data anonymization. This approach anonymizes data sets in a highly scalable way and significantly improves the efficiency over existing approaches.
[Taneja et al. 2015] proposed an approach based on reducing the re-identification
risk to preserve the privacy of the EMRs. This solution is based on k-Anonymity, l-
Diversity, t-Closeness & δ-Presence and is implemented through ARX Anonymization
tool. The ARX Anonymization tool has recently been used by a few researchers for preserving privacy through anonymization techniques, and the solution was executed in four different phases: Configure Transformation, Explore results, Analyze utility and Analyze risk. They implemented the solution on a randomly generated medical data set based on an extension of publicly available Electronic Medical Records (EMRs). Privacy is measured through the re-identification risk based on the uniqueness of records in the data set. The proposed technique helped in reducing the average re-identification risk from 100% to 2.33%. The authors only described that the experimental evaluation was made using the ARX tool; they did not specify further information about the technical environment used.
[Zhang et al. 2015] investigated the local-recoding problem for big data
anonymization against proximity privacy breaches as a proximity-aware clustering (PAC)
problem, and proposed a scalable two-phase clustering approach accordingly. Techni-
cally, a proximity-aware distance is introduced over both quasi-identifier and sensitive
attributes to facilitate clustering algorithms. To address the scalability problem, they pre-
sented a proximity privacy model which allows semantic proximity of sensitive values and
multiple sensitive attributes, and they model the problem of local recoding as a proximity-
aware clustering problem. A scalable two-phase clustering approach consisting of a t-
ancestors clustering (similar to k-means) algorithm and a proximity-aware agglomerative
clustering algorithm was proposed. The first phase splits an original data set into t parti-
tions that contain similar data records in terms of quasi-identifiers. In the second phase,
data partitions are locally recoded by the proximity-aware agglomerative clustering al-
gorithm in parallel. The algorithms were designed with MapReduce in order to gain high scalability by performing data-parallel computation over multiple computing nodes in the cloud.
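The sketch below shows what a proximity-aware distance of this kind might look like; it is our own simplified formulation combining a quasi-identifier distance with a sensitive-value proximity penalty, not the formula defined by [Zhang et al. 2015].

```python
# Simplified proximity-aware distance: combine a quasi-identifier distance with
# a sensitive-attribute proximity term so that records with semantically close
# sensitive values are less likely to be merged (illustrative formulation only).
def qi_distance(r1, r2, numeric_qis):
    return sum(abs(r1[a] - r2[a]) for a in numeric_qis)

def sensitive_proximity(r1, r2, sa, scale):
    """1.0 when sensitive values coincide, decaying toward 0 as they diverge."""
    return max(0.0, 1.0 - abs(r1[sa] - r2[sa]) / scale)

def proximity_aware_distance(r1, r2, numeric_qis, sa, scale, weight=0.5):
    # A small QI distance normally favors merging; the proximity term penalizes
    # merging records whose sensitive values are close to each other.
    return qi_distance(r1, r2, numeric_qis) + weight * sensitive_proximity(r1, r2, sa, scale)

a = {"age": 31, "salary": 48000}
b = {"age": 33, "salary": 49000}
print(proximity_aware_distance(a, b, ["age"], "salary", scale=50000))
```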
[Aldeen et al. 2016] proposed a new anonymization technique to attain better pri-
vacy protection with high data utility over distributed and incremental data sets on CC
called the incremental anonymization technique. In this technique, the data are partitioned into a variety of relatively small blocks that are then stored in the cloud; the original anonymized data sets are divided according to an anonymization level K. Upon adding new data, the updates can be handled after initializing the original anonymized data sets. The integration of the anonymized data sets is performed through the privacy preservation metric together with additional metrics, including the computation and the storage. A performance evaluation of the developed incremental anonymization technique was made for different K values and compared with the classical one. The performance of their technique was found to be insensitive to the variation of K, and improved data privacy preservation and fulfillment of the confidentiality requirement were established. The authors made an evaluation, but they did not describe anything about the technical environment used; they only mentioned that the data set was provided by the UCI Machine Learning Repository.
Table 2 presents the experimental evaluation per work, describing the environment used for the tests, such as the programming language, the data set, the cloud environment and the operating system, among other technologies. We can perceive some similarities between the environments used by the researchers: Java was the common implementation language, Ubuntu was the most used operating system, and U-Cloud and CloudBees were used as cloud environments. The researchers favored generic databases, such as the Adult data set pointed out in the table, which has attribute information like age, work-class, education, marital-status, occupation, relationship, race, sex and native-country. The tests that used MapReduce relied on the Hadoop API and Hadoop clusters.
Table 2. Experimental Evaluation
Work Experimental Evaluation? Environment described
[Zhang et al. 2013] YES
U-Cloud - Ubuntu - Java -
Hadoop MapReduce API - KVM - OpenStack -
Hadoop clusters - Adult data set
[Zhang et al. 2014] YES
U-Cloud - Ubuntu - Java -
Hadoop MapReduce API - KVM - OpenStack -
Hadoop clusters - Adult data set
[Balusamy and Muthusundari 2014] NO -
[Logeswari et al. 2014] YES Java - CloudBees
[Panackal and Pillai 2015] YES Adult data set
[S et al. 2015] YES
Java - Hadoop MapReduce API -
Eclipse - Adult data set
[Taneja et al. 2015] YES ARX Anonymization tool
[Zhang et al. 2015] YES
U-Cloud - Java -
Hadoop MapReduce API - OpenStack -
Hadoop clusters - Census-Income data set
[Aldeen et al. 2016] YES Nothing was described
6. Research Questions, Challenges and Opportunities
The purpose of this section is to present the data extracted from the selected studies in
order to answer the research questions. The following are the answers for each question,
according to the authors.
Main Question (QP): What is the state of the art about generalization techniques for
anonymizing data in the cloud?
As presented in the papers, we can highlight the advantages for a company that stores records with sensitive information: data anonymization techniques offer several benefits. One of them is financial: storing records in the cloud is less expensive, since companies can decrease investments in hardware, software, and support by reducing the size of the data caches stored in-house, and they are protected against the risks of accidental data spills or deliberate attacks in the cloud. Since the data is de-identified, and only non-sensitive information is stored outside the organization, it is difficult for an attacker to make use of the data even if the cloud-based data cache is made public. Another advantage is that organizations can redirect the savings from cloud storage to invest in greater security for other financial data.
For the clients who use cloud services to store their private data, these techniques are important to assure the confidence of the data owners. According to [Paul et al. 2015], for such data owners, data anonymization techniques enhance privacy and confidentiality: even if cloud-based data about health, finances, purchases or transactions with a particular company becomes public, consumers are assured that their individual details cannot be linked to their identities. Second, the cost savings that accrue from using cloud-based storage should enable companies to lower prices, leaving consumers with additional disposable income.
[Kaur and Sofat 2016] point out some important parameters for comparing privacy-preserving data mining techniques (a small computational sketch follows the list):
• Information loss: Information loss means that when the information is required in its original form, it should be retrievable with the same values; i.e., if somebody's age is changed from ‘23’ to ‘>20’, ‘<25’ or some other value, it should be possible to revert it to 23. The information loss factor must be as low as possible for the technique.
• Privacy Preserved: This means the level of information hidden or distorted from the original values. This parameter needs to have the maximum value for a technique to be a good privacy-preserving technique.
• Computational time: The time factor is always an important parameter to calculate the efficiency of any technique or algorithm. A technique is considered an efficient privacy-preserving data mining technique if it achieves maximum privacy in minimum time.
• Complexity: For a technique to be a good privacy-preserving technique, the algorithm needs to be easy to understand and implement.
• Dependency on the size of data: As the amount of data increases exponentially, the performance of the technique as the data size grows is an unavoidable parameter. A technique with 100% performance on all the other factors but only on a small data set is not acceptable.
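One simple way to make the first parameter concrete is sketched below; this is an illustrative metric of our own (average generalization height normalized by the hierarchy height), not a formula given by [Kaur and Sofat 2016].

```python
# Illustrative information-loss measure: average generalization height divided
# by the hierarchy height (an assumed metric, not the cited authors' formula).
def information_loss(generalization_levels, hierarchy_heights):
    """0.0 means the original data is released; 1.0 means everything is fully suppressed."""
    ratios = [lvl / hierarchy_heights[attr]
              for attr, lvl in generalization_levels.items()]
    return sum(ratios) / len(ratios)

# age generalized one of three levels, ZIP two of five levels
loss = information_loss({"age": 1, "zip": 2}, {"age": 3, "zip": 5})
print(f"information loss = {loss:.2f}")   # 0.37
```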
Table 3. New approaches proposed per work
Work New approach? What?
[Zhang et al. 2013] YES Hybrid approach
[Zhang et al. 2014] YES Highly scalable two-phase TDS approach
[Balusamy and Muthusundari 2014] NO -
[Logeswari et al. 2014] YES
Efficient K-Means Clustering,
Data Aggregation and Deduplication (DAD)
[Panackal and Pillai 2015] YES Adaptive Utility-based Anonymization (AUA)
[S et al. 2015] NO -
[Taneja et al. 2015] YES Combination of anonymization techniques
[Zhang et al. 2015] YES Scalable two-phase clustering approach
[Aldeen et al. 2016] YES Incremental anonymization technique
Regarding the state of the art of generalization techniques, we found many studies that try to show that their approaches can anonymize data in the best possible way considering the parameters mentioned above. However, there are many new approaches built on existing techniques, and each author claims that their own technique is better than the others, yet we have no evidence regarding the effectiveness of these techniques over the others, nor of which variables could affect their performance depending on the environment in which they are implemented.

Even with many studies that explore anonymization techniques in cloud environments, there is a lack of guarantees that a technique assures full anonymity from the initial raw data to the final anonymized data; it is extremely important that the customers who share their data in the cloud have their privacy fully guaranteed.
As presented in Table 3, we specified the new approaches proposed per work. Only two papers do not propose new approaches: [Balusamy and Muthusundari 2014] only described existing techniques, while [S et al. 2015] carried out an experimental evaluation using k-anonymity on the cloud. However, we have seen that researchers are increasingly trying to propose new approaches to improve anonymization techniques, and they are concerned with issues such as how to guarantee anonymization for huge volumes of information that demand more computational processing and how to share these data in cloud environments; besides, it is important to adapt these techniques to run in cloud environments at their full potential.
Secondary Question 1 (SQ1): What is the most used data anonymization technique?
A large number of existing techniques anonymize data based on the concept of
k-anonymization [Sweeney 2002]. We can verify this statement by looking at Table 4, which shows the techniques mentioned per work; the generalization technique most adopted by the authors was k-anonymity.
Table 4. Techniques mentioned per work
Work Techniques
[Zhang et al. 2013]
Top-Down Specialization,
Bottom-Up Generalization,
K-Anonymity
[Zhang et al. 2014]
Top-Down Specialization,
K-anonymity
[Balusamy and Muthusundari 2014]
K-Anonymity,
Top-Down Specialization,
Multidimensional k-anonymity
[Logeswari et al. 2014] K-Means Clustering
[Panackal and Pillai 2015] K-Anonymity
[S et al. 2015]
Two Phase TDS,
Centralized TDS,
K-Anonymity
[Taneja et al. 2015]
K-Anonymity,
L-Diversity,
T-Closeness,
δ-Presence
[Zhang et al. 2015]
T-ancestors Clustering,
Proximity-aware Agglomerative Clustering
[Aldeen et al. 2016]
K-Anonymity,
L-Diversity
The k-anonymity model requires that any combination of quasi-identifier at-
tributes be shared by at least k records in an anonymous database [Samarati 2001], where
k is a positive integer value defined by the data owner, possibly as a result of negotiations
with other interested parties. A high value of k indicates that the anonymized database has
low disclosure risk because the probability of re-identifying a record is 1/k, but this does
not protect the data against disclosure of attributes. Even if the attacker does not have the
ability to re-identify the registry, he may discover sensitive attributes in the anonymized
database.
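As an illustration of this requirement, the minimal sketch below groups records by an assumed set of quasi-identifiers and verifies that every group contains at least k records; the table, the column choice and the code are illustrative and do not reproduce the SMDAnonymizer implementation.

```python
# Minimal k-anonymity check: every combination of quasi-identifier values
# must be shared by at least k records (assumed QI columns; illustrative only).
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())

table = [
    {"age": "30-39", "zip": "604**", "salary": "<=50K"},
    {"age": "30-39", "zip": "604**", "salary": ">50K"},
    {"age": "20-29", "zip": "605**", "salary": "<=50K"},
    {"age": "20-29", "zip": "605**", "salary": "<=50K"},
]
k = 2
print(is_k_anonymous(table, ["age", "zip"], k))           # True
print(f"re-identification probability bound: 1/{k} = {1 / k}")
```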
K-anonymization techniques are a key component of any comprehensive solution
to data privacy and have been the focus of intense research in the last few years. An
important requirement for such techniques is to ensure anonymization of data while at
the same time minimizing the information loss resulting from data modifications such as
generalization and suppression [Byun et al. 2007]. In Figure 2, we can observe that the k-anonymity technique was the one most used by the authors to create new techniques. Among the techniques derived from k-anonymity we have: Incremental anonymization technique, Combination of anonymization techniques, Adaptive Utility-based Anonymization (AUA), Data Aggregation and Deduplication (DAD) and Efficient K-Means Clustering.

Figure 2. Data Anonymization Taxonomy Techniques Derived from K-anonymity

We will briefly describe the other techniques mentioned in each work, as well as the new techniques shown in Table 3:
• Top-Down Specialization (TDS): is an iterative process starting from the topmost
domain values in the taxonomy trees of attributes, each round of iteration consists of
three main steps: (i) finding the best specialization, (ii) performing specialization and
(iii) updating values of the search metric for the next round.
• Bottom-Up Generalization (BUG): is an iterative process starting from the low-
est anonymization level, the lowest anonymization level contains the internal domain
nodes in the lowest level of taxonomy trees.
• Hybrid Approach [Zhang et al. 2013]: combines TDS and BUG together for sub-
tree anonymization over big data, and automatically determines which component is
used to conduct the anonymization when a data set is given, by comparing the user-
specified k-anonymity parameter with a threshold derived from the data set.
• Highly Scalable Two-phase TDS Approach [Zhang et al. 2014]: the two phases of
this approach are based on the two levels of parallelization provisioned by MapRe-
duce on cloud, i.e., job level and task level: (i) the job level parallelization means that
multiple MapReduce jobs can be executed simultaneously to make full use of cloud in-
frastructure resources and the (ii) task level parallelization refers to that multiple map-
per/reducer tasks in a MapReduce job are executed simultaneously over data splits.
To achieve high scalability, this approach parallelizes multiple jobs on data partitions in the first phase, but the resultant anonymization levels are not identical. To obtain
finally consistent anonymous data sets, the second phase is necessary to integrate the
intermediate results and further anonymize entire data sets.
• Multidimensional K-anonymity: According to [LeFevre et al. 2006], it is a global recoding technique that achieves anonymity by mapping the domains of the quasi-identifier attributes to generalized or altered values.
• K-Means Clustering: The traditional k-means clustering algorithm calculates the dis-
tance between each data object to all the cluster centers, which consumes a lot of time
especially when dealing with large records.
• Efficient K-Means Clustering [Logeswari et al. 2014]: arranges the numerical attribute to be clustered in ascending order, and the threshold value between the current cluster center and the next cluster center is calculated. Then the distance between the data object and the current cluster center is calculated. If the calculated distance is smaller than or equal to the threshold value, the data object stays in the same cluster; if not, the data object moves to the next cluster. This process is repeated until all data objects are grouped to their cluster centers.
• Data Aggregation and Deduplication (DAD) [Logeswari et al. 2014]: compares
each record with all other records. If repeated records are found during comparison
then they are eliminated and their count is incremented to the corresponding record.
• Adaptive Utility-based Anonymization (AUA) [Panackal and Pillai 2015]: is a
two-step iterative process based on association mining. The first step is based on quasi-
sensitive associations among entire instances of the given data set and the second step
is based on quasi-quasi associations among instances of the non-frequent set and among
certain instances of frequent set.
• Two Phase TDS: this approach has a map and reduce phase. In map phase the large
scale data set is partitioned into smaller data sets. These data sets are anonymized in
parallel and produce anonymized intermediate results. In reduce phase the intermedi-
ate results are combined and further anonymized to produce k-anonymous data which
are consistent.
• Centralized TDS: exploits the data structure Taxonomy Indexed PartitionS (TIPS) to
improve the scalability and efficiency by indexing anonymous data records and retaining statistical information in TIPS; however, centralized approaches tend to suffer from low efficiency and scalability when handling large-scale data sets.
• L-Diversity: this model, proposed by [Machanavajjhala et al. 2007], captures the risk of discovery of sensitive attributes in an anonymous database. The l-diversity model requires that, for each combination of semi-identifier attributes (SI group), there must be at least l “well-represented” values for each sensitive attribute (a minimal check is sketched after this list).
• T-Closeness: According to [Branco Jr et al. 2014], this technique uses the concept of “global background knowledge”, which assumes that the opponent can infer information about sensitive attributes from knowledge of the frequency of occurrence of these attributes in the table. This model estimates the disclosure risk by computing the distance between the distribution of confidential attributes within the SI group and in the entire table.
• δ-Presence: This technique is used to preserve membership disclosure. It is based on
the background knowledge of the attacker with respect to the larger data set as a super
set of the disclosed data set.
• Combination of Anonymization Techniques [Taneja et al. 2015]: a combination of anonymization techniques such as k-Anonymity, l-Diversity, t-Closeness and δ-Presence applied together to reduce the re-identification risk and hence preserve the privacy of the patients.
• T-ancestors Clustering: this algorithm splits an original data set into t partitions of records that are similar with respect to their quasi-identifiers. An ancestor of a cluster is a data record whose attribute value for each categorical quasi-identifier is the lowest common ancestor of the original values in the cluster, and whose value for each numerical quasi-identifier is the median of the original values in the cluster.
• Proximity-aware Agglomerative Clustering: in the agglomerative clustering
method, each data record is regarded as a cluster initially, and then two clusters are
picked to be merged in each round of iteration until some stopping criteria are sat-
isfied. Usually, two clusters with the shortest distance are merged. Thus, one core
problem of the agglomerative clustering method is how to define the distance between
two clusters.
• Scalable Two-phase Clustering Approach [Zhang et al. 2015]: in this technique, the first phase splits an original data set into t partitions that contain similar data records in terms of quasi-identifiers. In the second phase, the data partitions are locally recoded in parallel by the proximity-aware agglomerative clustering algorithm. The approach was designed with MapReduce in order to gain high scalability by performing data-parallel computation over multiple computing nodes in the cloud.
• Incremental Anonymization Technique [Aldeen et al. 2016]: the data are partitioned into a number of relatively small blocks that are then stored in the cloud. The technique divides the original anonymized data sets according to an anonymization level K; when new data are added, the updates can be handled after initializing the original anonymized data sets.
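To make the assignment step of the Efficient K-Means Clustering technique more concrete, we sketch it below for a single numerical attribute. This is only a minimal illustration derived from the description above, not the implementation of [Logeswari et al. 2014]; in particular, taking the threshold as half the distance between consecutive cluster centers is our own assumption.

import java.util.Arrays;

public class EfficientKMeansSketch {

    // Assigns each value of a numerical attribute (taken in ascending order) to a
    // cluster, using a threshold between the current and the next cluster center.
    // Returns the cluster index of each value in sorted order.
    public static int[] assign(double[] values, double[] centers) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);              // attribute arranged in ascending order
        Arrays.sort(centers);
        int[] clusterOf = new int[sorted.length];
        int current = 0;                  // index of the current cluster center
        for (int i = 0; i < sorted.length; i++) {
            while (current < centers.length - 1) {
                double threshold = (centers[current + 1] - centers[current]) / 2.0;
                double distance = Math.abs(sorted[i] - centers[current]);
                if (distance <= threshold) {
                    break;                // the object stays in the current cluster
                }
                current++;                // otherwise it moves to the next cluster
            }
            clusterOf[i] = current;
        }
        return clusterOf;
    }

    public static void main(String[] args) {
        double[] ages = {21, 23, 35, 37, 58, 60};
        double[] centers = {22, 36, 59};
        System.out.println(Arrays.toString(assign(ages, centers)));  // [0, 0, 1, 1, 2, 2]
    }
}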
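The Data Aggregation and Deduplication step can be summarized in the same spirit. In the sketch below, records are plain string arrays and the per-record count is kept in a map; this representation is our own simplification rather than the original implementation.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeduplicationSketch {

    // Eliminates repeated records and keeps a count per distinct record.
    public static Map<List<String>, Integer> deduplicate(List<String[]> records) {
        Map<List<String>, Integer> counts = new LinkedHashMap<>();
        for (String[] record : records) {
            List<String> key = Arrays.asList(record);   // records compared field by field
            counts.merge(key, 1, Integer::sum);          // increment count for repeated records
        }
        return counts;
    }
}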
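The l-diversity requirement can likewise be checked per QI group in a few lines. The sketch below uses the simplest reading of “well-represented”, namely at least l distinct sensitive values per group (distinct l-diversity); [Machanavajjhala et al. 2007] also define stronger entropy-based and recursive variants.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LDiversitySketch {

    // Checks distinct l-diversity: every QI group must contain at least l
    // distinct values of the sensitive attribute.
    public static boolean isLDiverse(Map<String, List<String>> sensitiveByGroup, int l) {
        for (List<String> sensitiveValues : sensitiveByGroup.values()) {
            Set<String> distinct = new HashSet<>(sensitiveValues);
            if (distinct.size() < l) {
                return false;  // this QI group does not have l well-represented values
            }
        }
        return true;
    }
}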
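Finally, the distance computation at the core of t-closeness can be illustrated as follows. For simplicity, this sketch uses the total variation distance between the QI group distribution and the overall distribution of the sensitive attribute; the original t-closeness proposal uses the Earth Mover's Distance, so this is a simplified stand-in rather than the exact metric.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TClosenessSketch {

    // Empirical distribution of the sensitive attribute values.
    static Map<String, Double> distribution(List<String> values) {
        Map<String, Double> dist = new HashMap<>();
        for (String v : values) dist.merge(v, 1.0 / values.size(), Double::sum);
        return dist;
    }

    // Total variation distance between a QI group's distribution and the whole table's
    // distribution (simplification of the Earth Mover's Distance used by t-closeness).
    static double distance(List<String> group, List<String> table) {
        Map<String, Double> p = distribution(group);
        Map<String, Double> q = distribution(table);
        Map<String, Double> all = new HashMap<>(q);
        p.forEach((k, v) -> all.merge(k, 0.0, Double::sum));  // union of observed values
        double sum = 0.0;
        for (String k : all.keySet()) {
            sum += Math.abs(p.getOrDefault(k, 0.0) - q.getOrDefault(k, 0.0));
        }
        return sum / 2.0;
    }

    // A table satisfies t-closeness if every QI group is within distance t of the table.
    static boolean satisfiesTCloseness(Map<String, List<String>> groups, List<String> table, double t) {
        return groups.values().stream().allMatch(g -> distance(g, table) <= t);
    }
}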
Figure 3. Taxonomy of Data Anonymization Techniques
In Figure 3, we can see the existing relationships between the techniques described
above. The arrows specify which techniques were used as the basis for the new techniques
proposed by each author. Thus, the Hybrid Approach is based on the Bottom-Up
Generalization and Top-Down Specialization techniques; the latter was used by the
Centralized TDS and Two Phase TDS techniques, which in turn were important for the
creation of the Highly Scalable Two-phase TDS Approach and of the Scalable Two-phase
Clustering Approach, the latter of which was also proposed based on T-ancestors
Clustering and Proximity-aware Agglomerative Clustering.
Secondary Question 2 (SQ2): What are the research challenges and opportunities?
Privacy concerns in CC have attracted the attention of researchers in different re-
search communities, but ensuring the privacy preservation of large-scale data sets still
needs extensive investigation. Figure 4 shows keywords that summarize the challenges
and research opportunities related to data anonymization; we discuss each topic below.
Figure 4. Key Challenges
In big data applications, data privacy is one of the issues of greatest concern, be-
cause processing large-scale data sets often requires computational power provided by
public cloud services. CC and big data are two disruptive trends at present, imposing
significant impacts on the current IT industry and research communities. Privacy is one of
the main concerns in big data applications that involve multiple parties, and this concern
is aggravated in the context of CC, although some privacy issues are not new
[Chaudhuri 2012]. Given the variety of data-intensive applications running in the cloud,
processing huge volumes of anonymized data sets is becoming an important research area.
Privacy preservation for such data sets is an important yet challenging research issue,
and it needs thorough investigation.
Working with data anonymization within the cloud environment brings major
challenges: we have to understand how the techniques work, comprehend their limitations,
and identify the best ways to apply them so that they can reach their full potential in the
cloud. As a research opportunity pointed out by [Zhang et al. 2013], privacy preservation
for data analysis, sharing and mining in the cloud is a challenging research issue due to
increasingly large volumes of data sets, thereby requiring intensive investigation. In a cloud
environment, preserving privacy in data publishing is a big problem. Data often need to be
shared for research purposes: in fields such as medicine, it may be necessary to disclose
sensitive data about patients, such as the type of disease and the history of consultations;
population data may describe the characteristics of each region and of the individuals
living there, such as addresses and demographic concentration; and governments need to
be accountable to society by publicizing public spending on public web sites, including
social security numbers and the salaries of public servants and service agencies.
Anonymization techniques make it possible to publish such data without the actual
individuals risking being identified by malicious users.
The need for anonymization is motivated by many legal and ethical requirements
for protecting private, personal data. The intent is that anonymized data can be shared
freely with other parties, who can perform their own analysis and investigation of the
data [Cormode and Srivastava 2009]. Anonymization techniques are useful to protect
users' sensitive information from malicious users: if malicious users manage to obtain
private information, this may cause financial losses or damage to social reputation. Aside
from the techniques mentioned in the papers studied, a variety of other techniques have
been proposed for protecting data privacy. Developing new anonymization techniques
and improving existing ones are challenges that must be overcome and intensively
investigated by researchers in the area. It is also interesting to better understand the
existing techniques and to conduct experiments exploring the capabilities of each
technique and identifying their performance in the cloud.
Anonymization techniques should safeguard the privacy of user data and ensure
that, even if these data are shared, the actual user will not be identified through possible
attacks or security breaches. Medical information is considered among the most
confidential information, as it directly contains the personal data of patients. It has become
of utmost concern to preserve the confidentiality of patients' data, despite the fact that
these data need to be shared with other medical bodies when required. Although the
platforms providing cloud-based data are increasing, this has also raised privacy and
security concerns about the medical data stored in the cloud. It remains an open challenge
to verify whether these generalization techniques can really ensure that, once a data set is
anonymized, it cannot be used by third parties to re-identify the actual patients.
So it is essential that the community continues developing these techniques to reach their
full potential. It is also important to think of new quality metrics to regulate the levels of
security achieved by the application of each anonymization technique.
Interdisciplinarity is another factor that can be further investigated by researchers. It is a
great challenge to work on data anonymization in CC environments, as it involves
knowledge from several areas such as CC, security and big data. Therefore, some
knowledge of these areas is necessary when starting studies on this topic, because it
involves different areas and tends to aggregate the difficulties existing in each of them,
such as performance and scalability problems.
An increasing growth of data within organizations and lower maintenance costs
are two factors that push data processing to public clouds instead of private clouds. De-
spite the change in the location of data processing, the need for privacy preservation of
sensitive data remains the same. Thus, it is beneficial to process data according to their
sensitivity, splitting the work between the organization's private cloud and public clouds
[Derbeko et al. 2016]. The management of anonymization techniques must also be treated
as highly important: ways to correctly manage the application of each technique and the
operating environment must be created, so that we can be sure that the shared data are
really safe.
7. Experimental Evaluation
[Prasser et al. 2014] presents the ARX tool, a comprehensive open-source data
anonymization framework that implements a simple three-step process. It provides sup-
port for all common privacy criteria, as well as for arbitrary combinations of them. It
utilizes a well-known and highly efficient anonymization algorithm. Moreover, it implements
a carefully chosen set of techniques that can handle a broad spectrum of data anonymiza-
tion tasks, while being efficient, intuitive and easy to understand. The tool features a
cross-platform user interface oriented towards non-IT experts.
In our approach, we used the ARX API to create a generic web tool for data
anonymization; we explain more about the developed tool in the next section.
[Prasser et al. 2014] provides a stand-alone software library with an easy-to-use public
API for integration into other systems. Their code base is extensible, well-tested and
extensively documented. As such, it provides a solid basis for developing novel privacy
methods.
7.1. Experiment Settings
The proposed SMDAnonymizer tool is implemented in Java using NetBeans4 as the
integrated development environment (IDE). Our experiments were conducted in a cloud
environment5 hosted by the IBITURUNA research group. We collected a data set from a
federal government website6. The data set, called BolsaFamilia, has the following
attributes: UF, Code-SIAFI, Municipality, Code-Function, Code-Subfunction, Code-Program,
Code-Action, NIS-Favored, Name-Favored, Source-Purpose, Value-Month. The tool
implements the k-anonymity algorithm, with generalization as the anonymization technique.
The k-anonymity parameter is set to 2 for the results presented below; a minimal sketch of
how such a configuration can be expressed with the ARX API is shown next.
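The sketch below expresses a k = 2 k-anonymization of such a data set with the ARX Java API. It is only a minimal sketch based on ARX's public documentation: the exact method names (e.g., addPrivacyModel versus the older addCriterion), the suppression limit and the hierarchy file names are assumptions that vary across ARX versions, and the sketch does not reproduce the actual SMDAnonymizer code.

import java.nio.charset.StandardCharsets;

import org.deidentifier.arx.ARXAnonymizer;
import org.deidentifier.arx.ARXConfiguration;
import org.deidentifier.arx.ARXResult;
import org.deidentifier.arx.AttributeType;
import org.deidentifier.arx.Data;
import org.deidentifier.arx.criteria.KAnonymity;

public class KAnonymizationSketch {

    public static void main(String[] args) throws Exception {
        // Load the BolsaFamilia data set (CSV; the file name and ';' separator are assumptions).
        Data data = Data.create("bolsafamilia.csv", StandardCharsets.UTF_8, ';');

        // Quasi-identifiers receive generalization hierarchies (cf. Figure 7);
        // the hierarchy file names are hypothetical.
        data.getDefinition().setAttributeType("UF",
                AttributeType.Hierarchy.create("hierarchy_uf.csv", StandardCharsets.UTF_8, ';'));
        data.getDefinition().setAttributeType("Codigo-Programa",
                AttributeType.Hierarchy.create("hierarchy_programa.csv", StandardCharsets.UTF_8, ';'));

        // Directly identifying attributes are removed from the output.
        data.getDefinition().setAttributeType("Name-Favored", AttributeType.IDENTIFYING_ATTRIBUTE);

        // k-anonymity with k = 2, as in the experiments.
        ARXConfiguration config = ARXConfiguration.create();
        config.addPrivacyModel(new KAnonymity(2));
        config.setSuppressionLimit(0.05d);   // tolerated fraction of suppressed records (assumed)

        ARXResult result = new ARXAnonymizer().anonymize(data, config);
        result.getOutput().save("bolsafamilia_anonymized.csv", ';');
    }
}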
7.2. Experiment Process and Results
We present the tool interface in Figure 5, which shows the initial screen where the user
can perform the first steps. In this interface the user can download an example data file
that illustrates the type of file the tool accepts; our tool supports the .csv format (short for
comma-separated values), which is often used to exchange data between different
applications.
After uploading the file to be anonymized and selecting the anonymization algorithm that
will be applied to the data set, the user can click the confirm button; alternatively, the user
can click the clear option, which deletes the filled fields so that the upload and the
algorithm selection can be started again.
4https://netbeans.org/
5http://app.ibituruna.virtual.ufc.br/
6http://www.transparencia.gov.br/
Figure 5. Step 1: Selecting the data set and the anonymization algorithm
Figure 6. Step 2: Selecting the anonymization hierarchies of the fields
After the upload and the algorithm selection, the tool reads and interprets the fields
corresponding to the columns of the uploaded data set and shows these columns as
checkboxes; the user can then select the fields to be anonymized, as presented in Figure 6.
After selecting the fields to be anonymized, the user must upload the corresponding hi-
erarchies. For generalization, a hierarchy is created for each attribute, defining its privacy
level. A hierarchy is created for each quasi-identifier based on the type of values the
attribute holds. For instance, the hierarchies of the attributes UF, Code-Subfunction, Code-
Program and Code-Action are shown in Figure 7; a hypothetical excerpt of such a hierarchy
file is sketched below.
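As an illustration, a generalization hierarchy for the UF attribute could be provided as a small CSV file in which each row maps an original value to increasingly general values. The rows below are hypothetical examples, consistent with the generalized values shown later in Table 5 (state, then region, then full suppression):

CE;Nordeste;*
PE;Nordeste;*
GO;Centro-Oeste;*
PA;Norte;*
SP;Sudeste;*
RS;Sul;*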
Figure 7. Hierarchy of attributes applied
Therefore, after selecting the fields and uploading their hierarchies, the user confirms the
operation, and the tool then exports and saves the anonymized data to a file in csv format.
Table 5 shows the final file generated by the tool.
Table 5. BolsaFamilia data set result
UF Codigo-Subfuncao Codigo-Programa Codigo-Acao
Nordeste 24* 13** 844*
Nordeste 24* 13** 844*
Centro-Oeste 24* 13** 844*
Norte 24* 13** 844*
Sudeste 24* 13** 844*
Sul 24* 13** 844*
Centro-Oeste 24* 13** 844*
By multimedia, we understand all the programs and systems in which communi-
cation between humans and the computer occurs through multiple means of representation
of information. Our tool qualifies as a multimedia product according to the characteristics
described by [Paula Filho 2011]: (i) Non-linear access: information is quickly accessible in
a non-linear way; the user is not stuck in a time sequence like the reader of a book, the
listener of a lecture or the spectator of a movie; (ii) Interactivity: the user in front of the
computer is not a passive spectator, but a participant in an activity; and (iii) Integration
with application programs: the computer can perform calculations, searches on databases
and other normal tasks of any application program.
Therefore, considering the multimedia characteristics mentioned above, our tool meets the
requirement of interactivity, since it requires the user to upload the file that will be
anonymized and to select in the checkboxes the hierarchies that will be applied, and it
meets the requirement of integration with application programs, because the tool executes
the k-anonymity algorithm used for data anonymization and makes use of an API to
execute the other functionalities present in the application.
8. Conclusion and Future Work
In this paper, we presented a survey on data anonymization in CC based on an adaptation
of a classic systematic review. We also identified concepts presented in the literature for
data anonymization. Moreover, we presented concepts related to CC and privacy focusing
on anonymization techniques.
This paper also reveals the state of the art related to the main topic, where we found
that the main anonymization technique used is k-anonymity. To support this, we built a
taxonomy of the techniques presented by the works shown in Table 4 and observed that
k-anonymity is used as the base to create new approaches. In addition, we discussed the
works proposed by the authors and presented insights to better understand research
opportunities and challenges, which could guide researchers who are interested in
studying this area.
Furthermore, we presented our tool, called SMDAnonymizer, and described its use as a
tool to anonymize raw data and generate a new file with anonymized data. As future work,
we intend to further develop the tool, implementing new anonymization algorithms and
testing different types of data, comparing the efficiency of each implemented algorithm.
We also want to integrate other APIs to make our tool more extensible and increase its
functionality. We will also work on a friendlier interface so that the tool can be used for
research purposes and by non-IT experts.
References
Aldeen, Y. A. A. S., Salleh, M., and Aljeroudi, Y. (2016). An innovative privacy pre-serving technique for incremental datasets on cloud computing. Journal of Biomedical
Informatics, 62:107 – 116.
Argus (2017). µ-argus manual. Available from: http://neon.vb.cbs.nl/casc/
Software/MuManual4.2.pdf.
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., Lee, G.,
Patterson, D. A., Rabkin, A., Stoica, I., et al. (2009). Above the clouds: A berkeley
view of cloud computing. Technical report, Technical Report UCB/EECS-2009-28,
EECS Department, University of California, Berkeley.
Balusamy, M. and Muthusundari, S. (2014). Data anonymization through generalization
using map reduce on cloud. In Proceedings of IEEE International Conference on
Computer Communication and Systems ICCCS14, pages 039–042.
Branco Jr, E. C., Machado, J. C., and Monteiro, J. M. (2014). Estratégias para proteção da
privacidade de dados armazenados na nuvem. Simpósio Brasileiro de Banco de Dados.
Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., and Brandic, I. (2009). Cloud comput-
ing and emerging it platforms: Vision, hype, and reality for delivering computing as
the 5th utility. Future Generation computer systems, 25(6):599–616.
Byun, J.-W., Kamra, A., Bertino, E., and Li, N. (2007). Efficient k-anonymization using
clustering techniques. In International Conference on Database Systems for Advanced
Applications, pages 188–200. Springer.
Camenisch, J., Fischer-Hübner, S., and Rannenberg, K. (2011). Privacy and identity
management for life. Springer Science & Business Media.
Chaudhuri, S. (2012). What next?: A half-dozen data management research goals for
big data and the cloud. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI
Symposium on Principles of Database Systems, PODS ’12, pages 1–4, New York, NY,
USA. ACM.
Clarke, R. (1999). Introduction to dataveillance and information privacy, and definitions
of terms. Roger Clarke’s Dataveillance and Information Privacy Pages.
Cormode, G. and Srivastava, D. (2009). Anonymized data: Generation, models, usage.
In Proceedings of the 2009 ACM SIGMOD International Conference on Management
of Data, SIGMOD ’09, pages 1015–1018, New York, NY, USA. ACM.
Coutinho, E. F., de Carvalho Sousa, F. R., Rego, P. A. L., Gomes, D. G., and de Souza,
J. N. (2015). Elasticity in cloud computing: a survey. annals of telecommunications-
annales des télécommunications, 70(7-8):289–309.
Derbeko, P., Dolev, S., Gudes, E., and Sharma, S. (2016). Security and privacy aspects in
mapreduce on clouds: A survey. Computer Science Review, 20(Supplement C):1 – 28.
Fung, B., Wang, K., Chen, R., and Yu, P. S. (2010). Privacy-preserving data publishing:
A survey of recent developments. ACM Computing Surveys (CSUR), 42(4):14.
Gokila, S. and Venkateswari, P. (2014). A survey on privacy preserving data publishing.
International Journal on Cybernetics & Informatics (IJCI) Vol, 3.
Guttman, B. and Roback, E. A. (1995). An introduction to computer security: the NIST
handbook. DIANE Publishing.
Jacobs, D., Aulbach, S., et al. (2007). Ruminations on multi-tenant databases. In BTW,
volume 103, pages 514–521.
Jr, A. M., Laureano, M., Santin, A., and Maziero, C. (2010). Aspectos de segurança e
privacidade em ambientes de computação em nuvem.
Kaur, A. and Sofat, S. (2016). A proposed hybrid approach for privacy preserving data
mining. In 2016 International Conference on Inventive Computation Technologies
(ICICT), volume 1, pages 1–6.
Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele
University, 33(2004):1–26.
Krutz, R. L. and Vines, R. D. (2010). Cloud security: A comprehensive guide to secure
cloud computing. Wiley Publishing.
LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2006). Mondrian multidimensional
k-anonymity. In Data Engineering, 2006. ICDE’06. Proceedings of the 22nd Interna-
tional Conference on, pages 25–25. IEEE.
Logeswari, G., Sangeetha, D., and Vaidehi, V. (2014). A cost effective clustering based
anonymization approach for storing phr’s in cloud. In 2014 International Conference
on Recent Trends in Information Technology, pages 1–5.
Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. (2007). L-
diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery
from Data (TKDD), 1(1):3.
Mell, P. and Grance, T. (2009). The NIST definition of cloud computing. National Institute
of Standards and Technology (NIST), Information Technology Laboratory. [Online].
Available: http://csrc.nist.gov/groups/SNS/cloud-computing/index.html.
Panackal, J. J. and Pillai, A. S. (2015). Adaptive utility-based anonymization model:
Performance evaluation on big data sets. Procedia Computer Science, 50:347 – 352.
Big Data, Cloud and Computing Challenges.
Paul, M., Collberg, C., and Bambauer, D. (2015). A possible solution for privacy preserv-
ing cloud data storage. In 2015 IEEE International Conference on Cloud Engineering,
pages 397–403.
Paula Filho, W. P. (2011). Multimı́dia - Conceitos e Aplicações. Editora LTC, 2 edition.
Pearson, S. (2013). Privacy, security and trust in cloud computing. In Privacy and Security
for Cloud Computing, pages 3–42. Springer.
Pfitzmann, A. and Köhntopp, M. (2001). Anonymity, unobservability, and
pseudonymity—a proposal for terminology. In Designing privacy enhancing tech-
nologies, pages 1–9. Springer.
Prasser, F., Kohlmayer, F., Lautenschläger, R., and Kuhn, K. A. (2014). Arx-a comprehen-
sive tool for anonymizing biomedical data. In AMIA Annual Symposium Proceedings,
volume 2014, page 984. American Medical Informatics Association.
Russell, R., Chung, M., Balk, E. M., Atkinson, S., Giovannucci, E. L., Ip, S., Taylor,
M. S., Raman, G., Ross, A. C., Trikalinos, T., et al. (2009). Issues and challenges
in conducting systematic reviews to support development of nutrient reference values:
Workshop summary: Nutrition research series, vol. 2.
S, K., S, Y., and P, R. V. (2015). An evaluation on big data generalization using k-
anonymity algorithm on cloud. In 2015 IEEE 9th International Conference on Intelli-
gent Systems and Control (ISCO), pages 1–5.
Samarati, P. (2001). Protecting respondents identities in microdata release. IEEE trans-
actions on Knowledge and Data Engineering, 13(6):1010–1027.
sdcMicro (2017). Data-Analysis. Available from: https://cran.r-project.
org/web/packages/sdcMicro/.
Sousa, F. R., Moreira, L. O., and Machado, J. C. (2009). Computação em nuvem: Con-
ceitos, tecnologias, aplicações e desafios. II Escola Regional de Computação Ceará,
Maranhão e Piauı́ (ERCEMAPI), pages 150–175.
Stallings, W. (2007). Network security essentials: applications and standards. Pearson
Education India.
Subashini, S. and Kavitha, V. (2011). A survey on security issues in service delivery
models of cloud computing. Journal of network and computer applications, 34(1):1–
11.
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal
of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570.
Taneja, H., Kapil, and Singh, A. K. (2015). Preserving privacy of patients based on re-
identification risk. Procedia Computer Science, 70:448 – 454. Proceedings of the 4th
International Conference on Eco-friendly Computing and Communication Systems.
Toolbox, U. A. (2017). UT Dallas Data Security and Privacy Lab. Available from:
http://cs.utdallas.edu/dspl/cgi-bin/toolbox/.
Toolkit, C. A. (2017). Cornell Database Group. Available from: https://
sourceforge.net/projects/anony-toolkit/.
Vecchiola, C., Chu, X., and Buyya, R. (2009). Aneka: a software platform for .net-based
cloud computing. High Speed and Large Scale Scientific Computing, 18:267–295.
Zhang, X., Dou, W., Pei, J., Nepal, S., Yang, C., Liu, C., and Chen, J. (2015). Proximity-
aware local-recoding anonymization with mapreduce for scalable big data privacy
preservation in cloud. IEEE Transactions on Computers, 64(8):2293–2307.
Zhang, X., Liu, C., Nepal, S., Yang, C., Dou, W., and Chen, J. (2013). Combining top-
down and bottom-up: Scalable sub-tree anonymization over big data using MapReduce
on cloud.
