Data Anonymization Techniques in Cloud: Literature Review, Research Opportunities and a Web Tool for Data Protection

Ítalo O. Santos1, Emanuel F. Coutinho1, Leonardo O. Moreira1
1Instituto Universidade Virtual (UFC Virtual) Universidade Federal do Ceará (UFC) – Fortaleza, CE – Brasil
oliveira.italo07@gmail.com, {emanuel,leoomoreira}@virtual.ufc.br

Abstract. Cloud Computing provides an infrastructure to run applications with quality of service. When a user hires a cloud provider, there is a loss of control over data security and privacy issues. Privacy in the cloud is the ability of a user or organization to control what information can be revealed about themselves and who can access that information. Due to this, there is a need for techniques or strategies to preserve the security and privacy of user data in the cloud. This paper surveys the literature on data anonymization techniques and presents a tool called SMDAnonymizer, which generates anonymized files using the k-anonymity algorithm to increase the privacy of data sets. We highlight some research challenges and opportunities in data anonymization and present results obtained with our tool applied to real data. Finally, we discuss future work aimed at improving our tool.

1. Introduction

With the advancement of modern human society, basic and essential services are almost all delivered transparently. Utilities such as water, electricity, gas and telephone have become fundamental to our lives, being exploited through the use-based payment model [Vecchiola et al. 2009]. Nowadays, existing business models provide these services anywhere and anytime. These services are charged considering the various collection policies for the end user. This also applies to services offered in technology areas, due to the growth and spread of Cloud Computing (CC). CC is a recent technology trend aimed at providing on-demand, pay-for-use Information Technology (IT) services.
Trends prior to CC were limited to a particular class of users or focused on making available a specific demand for IT resources [Buyya et al. 2009]. CC proposes to be global: it delivers services both to end users who host their personal documents, such as texts, videos and images, on the Internet, and to companies that outsource their entire IT infrastructure, offering their services through the cloud. CC greatly facilitates how users consume services, requiring only an operating system, a browser and Internet access. Computational resources are available in the cloud, so high computational capacity is not required on users' machines, reducing the cost of acquiring hardware and making access to this environment easier for end users. The CC model was designed and developed with the objective of providing services that are easy to access and low cost, with guaranteed availability and scalability. CC is a recent area of research that, in addition to being subject to common risks relevant to IT environments, has its own set of security problems, identified by [Krutz and Vines 2010] in seven categories: network security, interfaces, data security, virtualization, governance, compliance, and legal issues. For a CC environment to be adopted by corporations, the security and privacy of data stored in the cloud is a fundamental requirement. Research papers that discuss the topic of privacy cover disciplines from diverse fields of knowledge such as philosophy, political science, information science, engineering and computer science [Branco Jr et al. 2014]. Privacy is a concept directly related to people, and it is a human right, such as freedom, justice or equality before the law; it is directly related to people's interest in maintaining a personal space, without the interference of other people or organizations.
Moreover, it ensures that individuals control or influence what information related to them can be collected and stored by someone and with whom it can be shared [Stallings 2007]. Privacy in CC is the ability of a user or organization to control what information can be revealed about themselves in the cloud, that is, to take control of who can access certain information and how that access can occur. [Jr et al. 2010] defined three privacy dimensions:
• Territorial Privacy: Protection of the region close to an individual;
• Individual Privacy: Protection against moral damages and unwanted interference;
• Information Privacy: Protection for personal data collected, stored, processed and propagated to third parties.
In a cloud, it is important to note that developers and users deliver their applications and data to be managed on the infrastructures and platforms provided by cloud providers. In this sense, the need arises to adopt techniques so that the delivered data are free from internal or external access in an environment that can be controlled by third parties, especially for data considered sensitive or highly private [Sousa et al. 2009]. There are some techniques currently proposed by the academic community for data protection, which can be applied with the aim of anonymizing data, such as:
• Generalization: Replaces quasi-identifier attribute values with less specific but semantically consistent values representing them.
The technique categorizes the attributes, creating a taxonomy of values with levels of abstraction going from the particular level to the generic level;
• Suppression: Deletes some identifier and/or quasi-identifier attribute values from the anonymized table;
• Encryption: Uses cryptographic schemes, normally based on public-key or symmetric-key cryptography, to replace sensitive data (identifiers, quasi-identifiers and sensitive attributes) with encrypted data;
• Perturbation: Used for privacy preservation in data mining or for replacing actual data values with dummy data to mask test or training databases.
In this paper, we investigate the anonymization of data related to the information privacy dimension specified above: we discuss data generalization techniques that work to protect user data in cloud environments, and we also investigate tools used for data anonymization in cloud environments. In the face of this situation, we ask ourselves: how much do we know about data anonymization techniques? The general objective of this research is to study the data generalization techniques currently used to protect information in cloud environments. The specific objectives are: (i) to build a taxonomy of the generalization techniques that were analyzed; (ii) to point out some research challenges that can be tackled in the area of data anonymization; (iii) to design and develop a web tool for data anonymization that uses the k-anonymity technique to anonymize user data; (iv) to show the applicability of the web tool to real data; and (v) to present new opportunities related to future versions of the web tool.

2. Cloud Computing

According to the National Institute of Standards and Technology (NIST) [Mell and Grance 2009], CC is defined as an evolving paradigm.
Its definitions, use cases, technologies, problems, risks and benefits will be redefined in discussions between the public and private sectors, and these definitions, attributes, and characteristics will evolve over time. Regarding the definition itself, a broadly accepted one is not yet available. NIST presents the following definition for CC: "CC is a model that enables convenient and on-demand access to a set of configurable computing resources (for example, networks, servers, storage, applications, and services) which can be quickly acquired and released with minimal managerial effort or interaction with the service provider." Another work proposes the following definition: "CC is a set of network-enabled services, providing scalability, quality of service, inexpensive on-demand computing infrastructure that can be accessed in a simple and pervasive way" [Armbrust et al. 2009]. In this paper, we consider the NIST view, which describes the CC model as consisting of five essential characteristics, three service models and four deployment models [Mell and Grance 2009]. CC has essential features that, taken together, exclusively define CC and distinguish it from other paradigms. These features are: Self-service on demand: The user unilaterally acquires a computational resource, such as server processing time or network storage, to the extent that it is needed, without requiring human interaction with the providers of each service; Wide access: Features are made available through the network and accessed through standardized mechanisms that enable use by thin or thick client platforms, such as cell phones, laptops, and PDAs. Users can change their working conditions and environments, such as programming languages and operating systems.
Client software systems installed locally for cloud access are lightweight, such as an Internet browser; Resource pooling: The provider's computing resources are organized into a pool to serve multiple users using a multi-tenant model [Jacobs et al. 2007], with different physical and virtual resources dynamically assigned and adjusted according to users' demand. These users need not be aware of the physical location of the computational resources, and can only specify the location at a higher level of abstraction, such as the country, state or data center; Fast elasticity: Resources can be acquired quickly and elastically, in some cases automatically, if there is a need to scale with increasing demand, and released when this demand retracts; Measured service: Cloud systems automatically control and optimize the use of resources by means of a measurement capability. Measurement is performed at some level of abstraction appropriate to the type of service, such as storage, processing, bandwidth, and active user accounts. The use of resources can be monitored and controlled, allowing transparency for both the provider and the user of the service. To ensure Quality of Service (QoS), it is possible to use a Service Level Agreement (SLA) approach. The SLA provides information about levels of availability, functionality, performance, or other attributes of the service, such as billing, and even penalties for violations of these levels. The CC environment is composed of three service models. These models are important because they define an architectural standard for CC applications. These models are: Software as a Service (SaaS): provides software for specific purposes that is available to users over the Internet. Software systems are accessible from multiple user devices through a thin client interface such as a web browser.
In SaaS, users do not manage or control the underlying infrastructure, including network, servers, operating systems, storage, or even the characteristics of the application, except for specific settings. As a result, developers focus on innovation rather than infrastructure, leading to the rapid development of software systems; Platform as a Service (PaaS): a high-level integration infrastructure is offered to deploy and test applications in the cloud. Users cannot manage or control the underlying infrastructure, including network, servers, operating systems, or storage, but they have control over the deployed applications and possibly the configurations of the applications hosted on this infrastructure. PaaS provides an operating system, programming languages and development environments for applications, helping to implement software systems, as it contains development tools and supports collaboration between developers; Infrastructure as a Service (IaaS): responsible for providing all the infrastructure necessary for PaaS and SaaS. It refers to a computational infrastructure based on computing resource virtualization techniques. The primary goal of IaaS is to make it easier to provide resources such as servers, network, storage, and other critical computing resources to build an on-demand environment that can include operating systems and applications. IaaS has some features, such as a single interface for infrastructure administration, an Application Programming Interface (API) for interaction with hosts, switches and routers, and support for adding new equipment in a simple and transparent way. In general, users do not manage or control the cloud infrastructure, but they have control over the operating systems, storage, and deployed applications, and can eventually select network components such as firewalls.
CC deployment models can be divided into public, private, community, and hybrid cloud, described as follows: Private cloud: the cloud infrastructure is used exclusively by one organization, with the cloud being local or remote and managed by the company itself or by third parties. In this deployment model, service access policies are employed; the techniques used to provide such features can be at the level of network management, service provider configurations, and the use of authentication and authorization technologies; Public cloud: the cloud infrastructure is made available to the public, being accessed by any user who knows the location of the service. In this deployment model, access restrictions at the network management level cannot be applied, much less authentication and authorization techniques; Community cloud: the cloud infrastructure is shared by several organizations supporting a specific community with shared concerns, such as mission, security requirements, and policy considerations. This type of deployment model may exist locally or remotely and is generally administered by one of the community companies or by a third party; Hybrid cloud: there is a composition of two or more clouds, which can be private, community or public, and which remain single entities, linked by a standardized or proprietary technology that allows the portability of data and applications.

3. Privacy

As the amount of personal information transferred to the cloud increases, so does the concern of individuals and organizations about how this data will be stored and processed. The fact that the data is stored in multiple locations, often transparently with respect to their location, causes uncertainty as to the degree of privacy to which they are exposed.
According to [Pearson 2013], terminology for dealing with data privacy issues in the cloud includes the following concepts:
• Data Controller: An entity (individual or legal entity, public authority, agency or organization) that alone or in conjunction with others determines the manner and purpose for which personal information is processed;
• Data Processor: An entity (individual or legal entity, public authority, agency or organization) that processes personal information in accordance with the Data Controller's instructions;
• Data Subject: An identified or identifiable individual to whom personal information refers, either by direct or indirect identification (for example, by reference to an identification number or to one or more physical, psychological, mental, economic, cultural or social characteristics).
The NIST Computer Security Handbook defines computer security as the protection afforded to an automated information system in order to achieve the proposed objectives of preserving the integrity, availability, and confidentiality of information system resources [Guttman and Roback 1995]. The process of developing and deploying applications on the CC platform that follow the Software as a Service (SaaS) model should consider the following security aspects of data stored in the cloud [Subashini and Kavitha 2011]:
• Data Security: In the SaaS model, data is stored outside the boundaries of the organization's technology infrastructure, so the cloud provider must provide mechanisms that ensure data security. For instance, this can be done using strong encryption techniques and fine-grained mechanisms for authorization and access control;
• Network Security: Client data is processed by SaaS applications and stored on cloud servers.
The transfer of organization data to the cloud must be protected to prevent loss of sensitive information;
• Data Location: In the SaaS model, the client uses SaaS applications to process their data, but does not know where the data will be stored. This may be a problem due to privacy legislation in some countries that prohibits data from being stored outside their geographical boundaries;
• Data Integrity: The SaaS model is composed of multi-tenant cloud-hosted applications. These applications use XML-based Application Programming Interfaces (APIs) to expose their functionalities in the form of web services;
• Data Segregation: In the SaaS model, data from multiple clients may be stored on the same server or database. The SaaS application must ensure the segregation of customer data at the physical level and at the application layer;
• Access to Data: The multi-tenant environment of the cloud can generate problems related to the lack of flexibility of SaaS applications to incorporate specific data access policies defined by the users of SaaS client organizations.

Table 1. Adult Data Set Records
Gender  Age  Race   Education     Native-country  Workclass  Salary
Male    39   White  Bachelors     United-States   State-gov  <=50K
Male    38   White  HS-grad       United-States   Private    <=50K
Male    53   Black  11th          United-States   Private    <=50K
Female  37   White  Masters       United-States   Private    <=50K
Female  31   White  Masters       United-States   Private    >50K
Male    42   White  Bachelors     United-States   Private    >50K
Male    37   Black  Some-college  United-States   Private    >50K
Female  23   White  Bachelors     United-States   Private    <=50K
Male    32   Black  Assoc-acdm    United-States   Private    <=50K
Male    32   White  HS-grad       United-States   Private    <=50K

3.1. Data Anonymization

Data anonymization is used to preserve privacy in data publishing. Large public and private corporations have increasingly been required to publish their "raw" data in electronic format, rather than providing only statistical or tabulated data. These "raw" data are called microdata.
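As an illustration, the generalization and suppression techniques introduced in Section 1 can be combined on Adult-like microdata so that every quasi-identifier combination occurs at least k times. The sketch below is hypothetical (it is not the paper's SMDAnonymizer; the records, the decade-band age hierarchy and the education taxonomy are invented for illustration):

```python
from collections import Counter

# Adult-like microdata records (invented, for illustration only).
records = [
    {"gender": "Male",   "age": 39, "education": "Bachelors", "salary": "<=50K"},
    {"gender": "Male",   "age": 38, "education": "HS-grad",   "salary": "<=50K"},
    {"gender": "Female", "age": 37, "education": "Masters",   "salary": "<=50K"},
    {"gender": "Female", "age": 31, "education": "Masters",   "salary": ">50K"},
    {"gender": "Male",   "age": 42, "education": "Bachelors", "salary": ">50K"},
    {"gender": "Male",   "age": 32, "education": "HS-grad",   "salary": "<=50K"},
]

def generalize(rec):
    # Generalization: replace the exact age with a decade band and the
    # education value with a coarser level of an assumed taxonomy.
    lo = rec["age"] // 10 * 10
    band = f"{lo}-{lo + 9}"
    higher = {"Bachelors", "Masters", "Doctorate"}
    edu = "Higher-ed" if rec["education"] in higher else "Secondary"
    return {"gender": rec["gender"], "age": band, "education": edu,
            "salary": rec["salary"]}

def k_anonymize(recs, k):
    gen = [generalize(r) for r in recs]
    qi = lambda r: (r["gender"], r["age"], r["education"])
    counts = Counter(qi(r) for r in gen)
    # Suppression: drop records whose quasi-identifier group is still
    # smaller than k after generalization.
    return [r for r in gen if counts[qi(r)] >= k]

anon = k_anonymize(records, k=2)
for r in anon:
    print(r)
```

After generalization, two records remain in quasi-identifier groups of size one and are suppressed; the four published records each share their (gender, age band, education) combination with at least one other record.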
In this case, prior to publication, data must be "sanitized" by removing explicit identifiers such as names, addresses, and telephone numbers. For this, one can use anonymization techniques. In Table 1, we have an adapted example provided by the UCI (University of California at Irvine) Machine Learning Repository1. Each record corresponds to the personal information of an individual person. The term anonymity comes from the Greek word anonymia, which means "without a name" or "namelessness". In informal use, "anonymous" usually refers to a person whose identifying information is unknown, and represents the fact that the subject is not uniquely characterized within a set of subjects. In this case, the set is said to be anonymized. The subject concept refers to an active entity, such as a person or a computer; a group of subjects can be a group of people or a network of computers [Pfitzmann and Köhntopp 2001]. A record or transaction is considered anonymous when the data, individually or combined with other data, cannot be associated with a particular subject [Clarke 1999]. From the perspective of disseminating data about individuals, the attributes can be classified as follows [Camenisch et al. 2011]:
• Identifiers: Attributes that uniquely identify individuals (e.g. social security number, name, identity number);
• Quasi-Identifiers (QIs): Attributes that can be combined with external information to expose some or all individuals, or reduce uncertainty about their identities (e.g. date of birth, ZIP code, work position, function, blood type);
• Sensitive Attributes (SAs): Attributes that contain sensitive information about individuals (e.g. salary, medical examinations, credit card postings).
Disclosure is the process by which one can arrive at a particular individual's identity. By using anonymization techniques, the privacy of data sets is preserved against various kinds of disclosure, according to [Gokila and Venkateswari 2014]:
1https://archive.ics.uci.edu/ml/datasets/adult
• Identity Disclosure: An individual is usually associated with a record in the published data set. If their identity is disclosed, then the corresponding sensitive value of that individual is divulged;
• Attribute Disclosure: Attribute disclosure occurs when information about an individual record is revealed, i.e., when attributes of an individual can be inferred with high confidence from the released data;
• Membership Disclosure: Membership information in the released table can imply the identity of an individual through various attacks. If the criterion used to select records for the data set is itself a sensitive attribute value, then learning that an individual is in the table constitutes a membership disclosure.

3.2. Anonymization Tools

According to [Fung et al. 2010], personal records are collected and used for data analysis by various organizations in the public and private sectors. In such cases, privacy should be ensured by not disclosing personal information at the time of data sharing and analysis. Although anonymization is an important method for privacy protection, there is a lack of tools that are both comprehensive and readily available to informatics researchers and also to non-IT experts, e.g., researchers responsible for the sharing of data [Prasser et al. 2014]. Graphical user interfaces (GUIs) and the option of using a wide variety of intuitive and replicable methods are needed. Tools have to offer interfaces allowing their integration into pipelines comprising further data processing modules. Moreover, extensive testing, documentation and openness to reviews by the community are of high importance. Informatics researchers who want to use or evaluate existing anonymization methods or to develop novel methods will benefit from well-documented, open-source software libraries [Prasser et al. 2014]. We briefly review the related work as follows.
µ-Argus [Argus 2017] is a software program designed to create safe microdata files; it is a closed-source application that implements a broad spectrum of techniques, but it is no longer under active development. sdcMicro [sdcMicro 2017] is a package for the R statistics software used for the generation of anonymized microdata, i.e. for the creation of public- and scientific-use files; it implements many primitives required for data anonymization but offers only limited support for using them to find data transformations that are suitable for a specific context. The UTD Anonymization Toolbox [Toolbox 2017] was developed by the UT Dallas Data Security and Privacy Lab, which implemented various anonymization methods in a toolbox for public use by researchers; these algorithms can either be applied directly to a data set or be used as library functions inside other applications. The Cornell Anonymization Toolkit (CAT) [Toolkit 2017] was designed for interactively anonymizing published data sets to limit identity disclosure of records under various attacker models. Both are research prototypes that have mainly been developed for demonstration purposes. Problems with these tools include scalability issues when handling large data sets, complex configuration requiring IT expertise, and incomplete support for privacy criteria and methods of data transformation. The ARX tool, an open-source data anonymization framework, features a cross-platform user interface that is oriented towards non-IT experts and utilizes a well-known and highly efficient anonymization algorithm [Prasser et al. 2014].

4. Literature Review

A systematic review, according to [Russell et al. 2009], is a comprehensive protocol-oriented review and synthesis of studies focusing on a research topic or related key issues.
Using a controlled and formal process of bibliographic research, it is expected that the topics investigated reveal relevant gaps, challenges, processes, tools and techniques. In this research, we propose a simplified version of a classical systematic review based on the orientation of [Kitchenham 2004], with some adaptations made by [Coutinho et al. 2015]. In this work, relevant studies were researched and selected that address work related to data anonymization and the generalization techniques used to maintain data security and integrity in CC environments. Figure 1 shows the systematic review process used in this work.

Figure 1. Systematic Review Adaptation Process

4.1. Activity 1: Plan Review

In this section, all planning activities of the systematic review are described: the need for the review, the definition of the review protocol, the research questions, the search string, the search sources, the search procedure, the inclusion and exclusion criteria, the study selection procedures and the data extraction procedure. Privacy in CC is the ability of a user or organization to control what information needs to be revealed about themselves in the cloud, that is, to control who has access to the information and how that access can occur. There are several generalization techniques for anonymizing data used for data protection, but it is difficult to find a study that classifies these techniques in a more detailed way. The systematic review of this paper proposes to answer the following research questions:
• Main Question (MQ): What is the state of the art of generalization techniques for anonymizing data in the cloud?
• Secondary Question 1 (SQ1): What is the most used data anonymization technique?
• Secondary Question 2 (SQ2): What are the research challenges and opportunities?
Initially, we defined the following keywords for the research: "Privacy", "Data Anonymization", "Cloud Computing", "Techniques", "Algorithms" and "Evaluation".
Some search tests were done, but the results were not satisfactory, often showing little relevance to the main theme of this work. After some tests, the search string was refined, generating the following string: "(Privacy AND "Cloud Computing" AND "Data Anonymization" AND Generalization AND (Techniques OR Algorithms OR Evaluation))". The string construction took into account the terms used and the order in which they were arranged in the search sequence. Several tests were performed until the final string was reached, taking into account the works found. The repositories used to search for papers in this work were IEEE Xplore2 and Science Direct3; the number of results initially varied between 1600 and 16000 according to the repository, and was refined according to the search string changes. The same search string was used in the two sources, using the advanced search engine. On the IEEE Xplore search site, the following steps were taken: 1) clicking the "Command Search" option; 2) checking the "Search" checkbox in the "Metadata Only" option; and 3) using the string: "(Privacy AND "Cloud Computing" AND "Data Anonymization" AND Generalization AND (Techniques OR Algorithms OR Evaluation))". On the Science Direct search site, the following steps were taken: 1) clicking the "Expert search" option; 2) using the string: "(Privacy AND "Cloud Computing" AND "Data Anonymization" AND Generalization AND (Techniques OR Algorithms OR Evaluation))"; 3) selecting the fields: "Computer Science", "Engineering", "Mathematics"; and 4) restricting publication years from 2013 onwards. In the systematic review, the definition of Inclusion Criteria (IC) and Exclusion Criteria (EC) contributes to including primary studies that are relevant and answer the research questions previously raised, and to excluding those works that do not answer them.
Thus, the primary studies included in the systematic review should meet the following inclusion criteria:
• Inclusion Criterion 1 (IC1): The primary study should propose or report an approach of a data anonymization technique;
• Inclusion Criterion 2 (IC2): The keyword "data anonymization" must be in the paper;
• Inclusion Criterion 3 (IC3): The primary study should discuss future work or research opportunities;
• Inclusion Criterion 4 (IC4): The primary study must have some evaluation where the technique is tested.
2http://ieeexplore.ieee.org/Xplore/home.jsp
3http://www.sciencedirect.com/
The exclusion criteria chosen in this study were:
• Exclusion Criterion 1 (EC1): The study presents contributions in areas other than data anonymization in the cloud;
• Exclusion Criterion 2 (EC2): The primary study is an earlier version of a more complete study of the same research;
• Exclusion Criterion 3 (EC3): The date of publication of the primary study is before 2013.
The following steps were defined to select the studies:
• Step 1 — Search for Keywords: Search strings were applied to each listed search source;
• Step 2 — First Selection: For each primary study obtained as a result of the searches, the title and abstract were read and the inclusion and exclusion criteria were applied. In case of doubt about whether to select a study, the introduction and conclusion were also read;
• Step 3 — Second Selection: The primary studies selected in Step 2 were read in full and the inclusion and exclusion criteria applied again.
Extracting information from papers was done based on a spreadsheet with questions oriented to obtaining answers to the review's research questions. The spreadsheet is divided into items to be filled in for each paper: title, year of publication, publication vehicle, authors, countries, research group, keywords, proposal or approach, observations, type of analysis, anonymization techniques, tools, future work and ideas.

4.2. Activity 2: Conduct Review

The conduction of the review consists of four activities: identifying primary studies, selecting primary studies, evaluating the quality of the studies and extracting the data. These activities are described below.
• Identify Primary Studies: Data were collected in March 2017 and updated in June 2017. As a result, we found 8 papers in the Science Direct repository and 6 papers in IEEE Xplore. As for the number of results obtained, it is important to note that the number of articles found varied according to the search string used; search engines have several configurations, each with its own peculiarities, and a small change in these settings, in the keywords used or in the operators applied in the search can result in a different number of works, influencing data consolidation and analysis. For this review, the results were sufficient, due to the several tests simulating different search strings, looking for the highest number of quality works.
• Select Primary Studies: The reading of all 14 abstracts was used as a refinement criterion, divided into 8 from Science Direct and 6 from IEEE Xplore; 5 studies of this total were excluded because they did not meet the specified criteria.
• Evaluate Quality of the Studies: The evaluation of the quality of the primary papers was simplified, verifying the presence or absence of some type of data anonymization technique and some type of experiment.
• Extract Data: Once the set of papers selected for complete reading was defined, the data extraction process was performed according to the planning specified above. A spreadsheet was filled out for each paper with the specified information. This activity took about two months to complete.

4.3. Activity 3: Result Review

This activity presents general results of the review and the results for each research question.
An overview of the results of the review will be presented. These results showed some general information related to data anonymization techniques. And the results of the research questions will also be presented arranged in tables. 5. Results Overview The number of selected papers per year in both journals and conferences from 2013 to 2016 were: 2013 (1), 2014 (3), 2015 (4)and 2016 (1). Data anonymization refers to hiding identity and/or sensitive data for owners of data records. Then, the privacy of an individual can be effectively preserved while certain aggregate information is exposed to data users for diverse analysis and min- ing. [Zhang et al. 2013] investigated the scalability issue of sub-tree anonymization over big data on cloud, proposing a hybrid approach that combines Top-Down Specializa- tion (TDS) and Bottom-Up Generalization (BUG) together. Existing TDS and BUG ap- proaches are developed individually for sub-tree generalization scheme. Both of them lack the awareness of the user-specified k-anonymity parameter. In fact, the values of the k-anonymity parameter can impact their performance. Intuitively, if parameter k is large, TDS is more suitable while BUG will probably get bad performance, the case is reversed when k is small. The hybrid approach automatically selects one of the two com- ponents via comparing the user specified k-anonymity parameter. Both TDS and BUG have been accomplished in a highly scalable way via a series of deliberately designed MapReduce jobs. Experimental results demonstrated the hybrid approach significantly improves the scalability and efficiency of sub-tree data anonymization compared with existing approaches. TDS is an iterative process starting from the topmost domain values in the taxon- omy trees of attributes, each round of iteration consists of three steps: (i) finding the best specialization; (ii) performing specialization; (iii) updating values of the search metric for the next round. [Zhang et al. 
2014] investigated the scalability problem of large-scale data anonymization by TDS, proposing a highly scalable two-phase TDS approach using MapReduce on cloud. Data sets are partitioned and anonymized in parallel in the first phase, producing intermediate results. The intermediate results are then merged and further anonymized to produce consistent k-anonymous data sets in the second phase. They applied MapReduce on cloud to data anonymization and deliberately designed a group of innovative MapReduce jobs to accomplish the specialization computation in a highly scalable way. Experiments were conducted on real-world data sets, and with their approach the scalability and efficiency of TDS are significantly improved.
[Balusamy and Muthusundari 2014] discussed different techniques to secure user data, such as the multidimensional k-anonymity technique to protect users' private data, the suppression technique for anonymization, and generalization, which is the process of replacing an original value with a less specific, semantically consistent value. When data scalability increases, providing security for users' data is a challenge. In this case, the generalization approach is best for protecting private user data in an effective and fast way: it minimizes information and privacy losses with less execution time and better quality of service. This work only presents an overview of some generalization techniques and does not show any experimental evaluation.
[Logeswari et al. 2014] focused on providing efficient analysis of shared Personal Health Records (PHRs) through the proposed Efficient K-Means Clustering (EKMC) algorithm, which clusters the PHRs into several partitions and is superior to the traditional k-means algorithm, improving the time complexity and enhancing the speed of clustering.
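EKMC avoids computing the distance from every object to every cluster center by sorting the attribute values and comparing each value against a threshold between consecutive cluster centers. The following one-dimensional sketch is our own illustrative interpretation of that thresholding idea (not the authors' implementation); the use of midpoints as thresholds is an assumption:

```python
def ekmc_1d(values, centers):
    """Assign sorted 1-D values to clusters using thresholds between
    consecutive cluster centers, instead of all pairwise distances."""
    values, centers = sorted(values), sorted(centers)
    # Threshold between center i and center i+1 (assumed here to be the midpoint).
    thresholds = [(centers[i] + centers[i + 1]) / 2 for i in range(len(centers) - 1)]
    clusters = [[] for _ in centers]
    current = 0
    for v in values:
        # Move on to the next cluster once the value exceeds the current threshold.
        while current < len(thresholds) and v > thresholds[current]:
            current += 1
        clusters[current].append(v)
    return clusters

print(ekmc_1d([1, 2, 8, 9, 20], [2, 9, 20]))  # [[1, 2], [8, 9], [20]]
```

Because the values are sorted, each value is compared against at most a few thresholds instead of all centers, which is the source of the speed-up claimed for EKMC.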
In general, Data Aggregation and Deduplication (DAD), which integrates both data aggregation and data deduplication, is an information mining process that searches, gathers and presents a summarized report to achieve specific business objectives. Data deduplication is a data compression method that removes duplicate copies of repeated data: if repeated records are found during comparison, they are eliminated and their count is incremented. The proposed DAD algorithm reduces the cost of cloud storage to a great extent, since it deals with huge numbers of records. The privacy of the patients' PHRs is preserved through various data anonymization techniques.
[Panackal and Pillai 2015] proposed an approach based on association mining, namely Adaptive Utility-based Anonymization (AUA). Initially the model is tested with sample instances of the original data set of the National Family Health Survey (NFHS), an important source of data on population, health and nutrition for India and its states. The paper includes a performance evaluation of the AUA model using data sets and shows that data anonymization can be done without compromising the quality of data mining results. The model obtains the benefits of k-anonymity in terms of privacy protection while providing maximum information to the users. The AUA approach is a two-step iterative process based on association mining. Using support and confidence, the authors performed association mining on a given data set, either to filter risky instances or to retain maximum information based on the utility of the data. The first step is based on quasi-sensitive associations among all instances of the given data set, and the second step is based on quasi-quasi associations among instances of the non-frequent set and among certain instances of the frequent set. From this iterative process, multiple versions of anonymized data sets can be served according to the user's needs.
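AUA relies on the standard support and confidence measures from association mining. As a reminder of how these measures are computed, here is a generic, self-contained sketch (our own illustration with made-up record values, not the AUA implementation):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Conditional frequency of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical records, each a set of attribute values (quasi-identifier + sensitive).
records = [{"age<25", "flu"}, {"age<25", "flu"}, {"age<25", "cold"}, {"age>60", "flu"}]
print(support(records, {"age<25"}))              # 0.75
print(confidence(records, {"age<25"}, {"flu"}))  # 2/3
```

A strong quasi-sensitive association (high support and confidence between a quasi-identifier value and a sensitive value) is exactly the kind of risky instance AUA filters for.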
This work presents an experimental evaluation, but it only specifies the type of data set used and does not describe the technical environment in which the experiments were run.
[S et al. 2015] implemented Two Phase Top-Down Specialization (Two Phase TDS) to improve scalability and efficiency over Centralized TDS on cloud. This approach has a map and a reduce phase. In the map phase, the large-scale data set is partitioned into smaller data sets, which are anonymized in parallel, producing anonymized intermediate results. In the reduce phase, the intermediate results are combined and further anonymized to produce consistent k-anonymous data. In Two Phase TDS, the intermediate data sets are produced after anonymization by the Map function, and the final consistent anonymized data sets are produced after the specialization, i.e., by applying anonymization once the intermediate results are integrated in the Reduce phase. In this technique the data can be either generalized or suppressed using various algorithms. TDS under k-anonymity is the most used generalization algorithm for data anonymization. It anonymizes data sets in a highly scalable way and significantly improves efficiency over existing approaches.
[Taneja et al. 2015] proposed an approach based on reducing the re-identification risk to preserve the privacy of EMRs. The solution is based on k-Anonymity, l-Diversity, t-Closeness and δ-Presence and is implemented through the ARX Anonymization tool, which has recently been used by researchers for preserving privacy through anonymization techniques. The solution is executed in four phases: Configure Transformation, Explore Results, Analyze Utility and Analyze Risk. The authors implemented the solution on a randomly generated medical data set based on an extension of publicly available Electronic Medical Records (EMR). Privacy is measured through the re-identification risk, based on the uniqueness of records in the data set.
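One common way to estimate this risk treats the chance of re-identifying a record as the inverse of the size of its equivalence class (the group of records sharing its quasi-identifier values), so a record that is unique on its quasi-identifiers carries 100% risk. A minimal sketch of the average-risk computation (our own illustration with hypothetical data, not the ARX implementation):

```python
from collections import Counter

def avg_reidentification_risk(records, quasi_identifiers):
    """Average over records of 1/|equivalence class|; 1.0 means every
    record is unique on its quasi-identifiers (100% risk)."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    classes = Counter(key(r) for r in records)
    return sum(1 / classes[key(r)] for r in records) / len(records)

raw = [{"age": 23, "zip": "60000"}, {"age": 27, "zip": "60100"}]
anon = [{"age": "<30", "zip": "60*"}, {"age": "<30", "zip": "60*"}]
print(avg_reidentification_risk(raw, ["age", "zip"]))   # 1.0 (all records unique)
print(avg_reidentification_risk(anon, ["age", "zip"]))  # 0.5
```

Generalization shrinks the risk by merging records into larger equivalence classes, which is the mechanism behind the risk reduction reported next.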
The proposed technique helped reduce the average re-identification risk from 100% to 2.33%. The authors only state that the experimental evaluation was made using the ARX tool; they do not specify further information about the technical environment used.
[Zhang et al. 2015] investigated the local-recoding problem for big data anonymization against proximity privacy breaches as a proximity-aware clustering (PAC) problem, and proposed a scalable two-phase clustering approach accordingly. Technically, a proximity-aware distance is introduced over both quasi-identifier and sensitive attributes to facilitate clustering algorithms. To address the scalability problem, they presented a proximity privacy model which allows semantic proximity of sensitive values and multiple sensitive attributes, and they modeled the problem of local recoding as a proximity-aware clustering problem. A scalable two-phase clustering approach was proposed, consisting of a t-ancestors clustering algorithm (similar to k-means) and a proximity-aware agglomerative clustering algorithm. The first phase splits the original data set into t partitions that contain similar data records in terms of quasi-identifiers. In the second phase, the data partitions are locally recoded by the proximity-aware agglomerative clustering algorithm in parallel. The algorithms were designed with MapReduce in order to gain high scalability by performing data-parallel computation over multiple computing nodes in cloud.
[Aldeen et al. 2016] proposed a new anonymization technique, called the incremental anonymization technique, to attain better privacy protection with high data utility over distributed and incremental data sets on CC. In this technique, the data are partitioned into a variety of relatively small blocks that are then stored in the cloud; the original anonymized data sets are divided according to an anonymization level K.
Upon adding new data, the updates can be handled after initializing the original anonymized data sets. The integration of the anonymized data sets is performed through the privacy preservation metric together with additional metrics, including computation and storage. A performance evaluation of the incremental anonymization technique was made for different K values and compared with the classical one. The performance of the technique was found to be insensitive to the variation of K, and improved data privacy preservation and the confidentiality requirement were established. The authors made an evaluation, but they describe nothing about the technical environment used; they only mention that the data set was provided by the UCI Machine Learning Repository.
Table 2 presents the experimental evaluation per work, describing the environment used in the tests, such as the programming language, the data set, the cloud environment and the operating system, among other technologies. We can perceive some similarities between the environments used by the researchers: Java was the common implementation language, Ubuntu was the most used operating system, U-Cloud and CloudBees were used as cloud environments, and the researchers favored generic databases, such as the Adult data set (with attributes like age, work-class, education, marital-status, occupation, relationship, race, sex and native-country); tests that used MapReduce relied on the Hadoop API and Hadoop clusters.

Table 2. Experimental Evaluation
Work | Experimental Evaluation? | Environment described
[Zhang et al. 2013] | YES | U-Cloud - Ubuntu - Java - Hadoop MapReduce API - KVM - OpenStack - Hadoop clusters - Adult data set
[Zhang et al. 2014] | YES | U-Cloud - Ubuntu - Java - Hadoop MapReduce API - KVM - OpenStack - Hadoop clusters - Adult data set
[Balusamy and Muthusundari 2014] | NO | -
[Logeswari et al. 2014] | YES | Java - CloudBees
[Panackal and Pillai 2015] | YES | Adult data set
[S et al. 2015] | YES | Java - Hadoop MapReduce API - Eclipse - Adult data set
[Taneja et al. 2015] | YES | ARX Anonymization tool
[Zhang et al. 2015] | YES | U-Cloud - Java - Hadoop MapReduce API - OpenStack - Hadoop clusters - Census-Income data set
[Aldeen et al. 2016] | YES | Nothing was described

6. Research Questions, Challenges and Opportunities

The purpose of this section is to present the data extracted from the selected studies in order to answer the research questions. The answers for each question, according to the authors, follow.
Main Question (QP): What is the state of the art about generalization techniques for anonymizing data in the cloud?
As presented in the papers, we can highlight the advantages of data anonymization for a company that stores records with sensitive information. One of them is financial: storing records in the cloud is less expensive, so companies can decrease investments in hardware, software and support, since they reduce the size of the data caches stored in-house; companies are also protected against the risks of accidental data spills or deliberate attacks in the cloud. Since the data is de-identified, and only non-sensitive information is stored outside the organization, it is difficult for an attacker to make use of the data even if the cloud-based data cache is made public. Another advantage is that organizations can redirect the savings from cloud storage to invest in greater security for other financial data. For the clients who use cloud services to store their private data, these techniques are important to assure the confidence of the data owners. According to [Paul et al.
2015], for such data owners, data anonymization techniques enhance privacy and confidentiality: even if cloud-based data about health, finances, purchases or transactions with a particular company becomes public, consumers are assured that their individual details cannot be linked to their identities. Second, the cost savings that accrue from using cloud-based storage should enable companies to lower prices, leaving consumers with additional disposable income.
[Kaur and Sofat 2016] list some important parameters for comparing privacy-preserving data mining techniques:
• Information loss: when the information is required in its original form, it should be retrievable with the same values; i.e., if somebody's age is changed from '23' to '>20', '<25' or some other value, it should be revertible to 23. The information loss of a technique must be minimal.
• Privacy preserved: the level of information hidden or distorted from the original values. This parameter must be maximal for a technique to be a good privacy-preserving technique.
• Computational time: time is always an important parameter for measuring the efficiency of any technique or algorithm. A privacy-preserving data mining technique is considered efficient if it achieves maximum privacy in minimum time.
• Complexity: for a technique to be a good privacy-preserving technique, the algorithm needs to be easy to understand and implement.
• Dependency on the size of data: as the amount of data increases exponentially, the performance of a technique as data grows is an unavoidable parameter. A technique with 100% performance on all the other factors, but only on a small set of data, is never acceptable.

Table 3. New approaches proposed per work
Work | New approach? | What?
[Zhang et al. 2013] | YES | Hybrid approach
[Zhang et al.
2014] | YES | Highly scalable two-phase TDS approach
[Balusamy and Muthusundari 2014] | NO | -
[Logeswari et al. 2014] | YES | Efficient K-Means Clustering, Data Aggregation and Deduplication (DAD)
[Panackal and Pillai 2015] | YES | Adaptive Utility-based Anonymization (AUA)
[S et al. 2015] | NO | -
[Taneja et al. 2015] | YES | Combination of anonymization techniques
[Zhang et al. 2015] | YES | Scalable two-phase clustering approach
[Aldeen et al. 2016] | YES | Incremental anonymization technique

As the state of the art on generalization techniques, we found many studies that try to prove that their approaches can anonymize data in the best possible way, considering the parameters mentioned above. However, there are many new approaches built over existing techniques, and each author claims that their own technique is better than the others; we have no evidence regarding the effectiveness of these techniques over the others, nor about which variables could affect their performance according to the environment in which they are implemented. Even with many studies exploring anonymization techniques in cloud environments, there is a lack of guarantees that a technique ensures full anonymity from the initial raw data to the final anonymized data; it is extremely important that the customers who share their data in the cloud have their privacy fully guaranteed.
As presented in Table 3, we specified the new approaches proposed per work. Only two papers do not propose new approaches: [Balusamy and Muthusundari 2014] only describe existing techniques, while [S et al. 2015] present an experimental evaluation using k-anonymity on cloud.
However, we have seen that researchers are increasingly trying to propose new approaches to improve anonymization techniques, and they are concerned with issues such as how to guarantee the anonymization of huge volumes of information that demand more computational processing, and how to share this data in cloud environments; it is also important to adapt these techniques to run in cloud environments at their full potential.
Secondary Question 1 (SQ1): What is the most used data anonymization technique?
A large number of existing techniques anonymize data based on the concept of k-anonymization [Sweeney 2002]. We can verify this affirmation in Table 4, which shows the techniques mentioned per work: the generalization technique most adopted by the authors was the k-anonymity technique.

Table 4. Techniques mentioned per work
Work | Techniques
[Zhang et al. 2013] | Top-Down Specialization, Bottom-Up Generalization, K-Anonymity
[Zhang et al. 2014] | Top-Down Specialization, K-Anonymity
[Balusamy and Muthusundari 2014] | K-Anonymity, Top-Down Specialization, Multidimensional k-anonymity
[Logeswari et al. 2014] | K-Means Clustering
[Panackal and Pillai 2015] | K-Anonymity
[S et al. 2015] | Two Phase TDS, Centralized TDS, K-Anonymity
[Taneja et al. 2015] | K-Anonymity, L-Diversity, T-Closeness, δ-Presence
[Zhang et al. 2015] | T-ancestors Clustering, Proximity-aware Agglomerative Clustering
[Aldeen et al. 2016] | K-Anonymity, L-Diversity

The k-anonymity model requires that any combination of quasi-identifier attributes be shared by at least k records in an anonymous database [Samarati 2001], where k is a positive integer defined by the data owner, possibly as a result of negotiations with other interested parties. A high value of k indicates that the anonymized database has a low disclosure risk, because the probability of re-identifying a record is 1/k, but this does not protect the data against disclosure of attributes.
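The requirement above can be verified directly: a table is k-anonymous exactly when its smallest equivalence class over the quasi-identifiers has at least k records. A minimal, self-contained sketch (our own illustration with hypothetical records; values such as "2*" and "130**" denote generalized cells):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True when every combination of quasi-identifier values occurs
    in at least k records of the table."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values()) >= k

table = [
    {"age": "2*", "zip": "130**", "disease": "flu"},
    {"age": "2*", "zip": "130**", "disease": "cold"},
    {"age": "3*", "zip": "148**", "disease": "flu"},
    {"age": "3*", "zip": "148**", "disease": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], 2))  # True
print(is_k_anonymous(table, ["age", "zip"], 3))  # False
```

Note that the sensitive attribute plays no role in the check, which is precisely why k-anonymity alone does not prevent attribute disclosure.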
Even if the attacker does not have the ability to re-identify the record, he may discover sensitive attributes in the anonymized database. K-anonymization techniques are a key component of any comprehensive solution to data privacy and have been the focus of intense research in the last few years. An important requirement for such techniques is to ensure the anonymization of data while minimizing the information loss resulting from data modifications such as generalization and suppression [Byun et al. 2007].
In Figure 2, we can observe that the k-anonymity technique was the one most used by the authors to create new techniques. Among the techniques derived from k-anonymity we have: Incremental anonymization technique, Combination of anonymization techniques, Adaptive Utility-based Anonymization (AUA), Data Aggregation and Deduplication (DAD) and Efficient K-Means Clustering.

Figure 2. Data Anonymization Taxonomy Techniques Derived from K-anonymity

We will briefly describe the other techniques mentioned in each work, as well as the new techniques shown in Table 3:
• Top-Down Specialization (TDS): an iterative process starting from the topmost domain values in the taxonomy trees of attributes; each round of iteration consists of three main steps: (i) finding the best specialization, (ii) performing the specialization and (iii) updating the values of the search metric for the next round.
• Bottom-Up Generalization (BUG): an iterative process starting from the lowest anonymization level, which contains the internal domain nodes in the lowest level of the taxonomy trees.
• Hybrid Approach [Zhang et al. 2013]: combines TDS and BUG for sub-tree anonymization over big data, and automatically determines which component is used to conduct the anonymization for a given data set by comparing the user-specified k-anonymity parameter with a threshold derived from the data set.
• Highly Scalable Two-phase TDS Approach [Zhang et al.
2014]: the two phases of this approach are based on the two levels of parallelization provided by MapReduce on cloud, i.e., the job level and the task level: (i) job-level parallelization means that multiple MapReduce jobs can be executed simultaneously to make full use of cloud infrastructure resources, and (ii) task-level parallelization means that multiple mapper/reducer tasks in a MapReduce job are executed simultaneously over data splits. To achieve high scalability, this approach parallelizes multiple jobs on data partitions in the first phase, but the resulting anonymization levels are not identical. To finally obtain consistent anonymous data sets, a second phase is necessary to integrate the intermediate results and further anonymize the entire data sets.
• Multidimensional K-anonymity: according to [LeFevre et al. 2006], a global recoding that achieves anonymity by mapping the domains of the quasi-identifier attributes to generalized or altered values.
• K-Means Clustering: the traditional k-means clustering algorithm calculates the distance between each data object and all the cluster centers, which consumes a lot of time, especially when dealing with large numbers of records.
• Efficient K-Means Clustering [Logeswari et al. 2014]: arranges the numerical attribute to be clustered in ascending order; the threshold value between the current cluster center and the next cluster center is calculated, and then the distance between the data object and the current cluster center. If the calculated distance is smaller than or equal to the threshold value, the data object stays in the same cluster; otherwise it moves to the next cluster. This process is repeated until all data objects are grouped to their cluster centers.
• Data Aggregation and Deduplication (DAD) [Logeswari et al. 2014]: compares each record with all other records.
If repeated records are found during comparison, they are eliminated and their count is incremented in the corresponding record.
• Adaptive Utility-based Anonymization (AUA) [Panackal and Pillai 2015]: a two-step iterative process based on association mining. The first step is based on quasi-sensitive associations among all instances of the given data set, and the second step is based on quasi-quasi associations among instances of the non-frequent set and among certain instances of the frequent set.
• Two Phase TDS: this approach has a map and a reduce phase. In the map phase, the large-scale data set is partitioned into smaller data sets, which are anonymized in parallel, producing anonymized intermediate results. In the reduce phase, the intermediate results are combined and further anonymized to produce consistent k-anonymous data.
• Centralized TDS: exploits the Taxonomy Indexed PartitionS (TIPS) data structure to improve scalability and efficiency by indexing anonymous data records and retaining statistical information in TIPS; centralized approaches probably suffer from low efficiency and scalability when handling large-scale data sets.
• L-Diversity: this model, proposed by [Machanavajjhala et al. 2007], captures the risk of discovery of sensitive attributes in an anonymous database. The l-diversity model requires that, for each combination of semi-identifier attributes (SI group), there must be at least l “well-represented” values for each sensitive attribute.
• T-Closeness: according to [Branco Jr et al. 2014], this technique uses the concept of “global backward knowledge”, which assumes the opponent can infer information about sensitive attributes from knowledge of the frequency of occurrence of these attributes in the table; the model estimates the disclosure risk by computing the distance between the distribution of confidential attributes within the SI group and in the entire table.
• δ-Presence: this technique is used to prevent membership disclosure. It is based on the attacker's background knowledge with respect to a larger data set that is a superset of the disclosed data set.
• Combination of Anonymization Techniques [Taneja et al. 2015]: a combination of anonymization techniques such as k-Anonymity, l-Diversity, t-Closeness and δ-Presence applied to reduce the re-identification risk, so that the privacy of the patients is preserved.
• T-ancestors Clustering: this algorithm splits an original data set into t partitions of records that are similar with respect to their quasi-identifiers. An ancestor of a cluster is a data record whose value for each categorical quasi-identifier is the lowest common ancestor of the original values in the cluster; each numerical quasi-identifier of an ancestor record is the median of the original values in the cluster.
• Proximity-aware Agglomerative Clustering: in the agglomerative clustering method, each data record is initially regarded as a cluster, and then two clusters are picked to be merged in each round of iteration until some stopping criteria are satisfied. Usually, the two clusters with the shortest distance are merged; thus, one core problem of the agglomerative clustering method is how to define the distance between two clusters.
• Scalable Two-phase Clustering Approach [Zhang et al. 2015]: the first phase splits an original data set into t partitions that contain similar data records in terms of quasi-identifiers. In the second phase, data partitions are locally recoded by the proximity-aware agglomerative clustering algorithm in parallel. It was designed with MapReduce in order to gain high scalability by performing data-parallel computation over multiple computing nodes in cloud.
• Incremental Anonymization Technique [Aldeen et al. 2016]: the data are partitioned into a variety of relatively small blocks that are then stored in the cloud.
The technique divides the original anonymized data sets according to an anonymization level K; upon adding new data, the updates can be handled after initializing the original anonymized data sets.

Figure 3. Data Anonymization Taxonomy Techniques

In Figure 3, we can see the existing relationships between the techniques described above; the arrows specify which techniques were used to create the new techniques proposed by each author. Thus we have: the Hybrid Approach is based on Bottom-Up Generalization and Top-Down Specialization; the latter was used by the Centralized TDS and Two Phase TDS techniques, which were important for the creation of the Highly Scalable Two-phase TDS Approach; and the Scalable Two-phase Clustering Approach was proposed based on T-ancestors Clustering and Proximity-aware Agglomerative Clustering.
Secondary Question 2 (SQ2): What are the research challenges and opportunities?
Privacy concerns in CC have attracted the attention of researchers in different research communities, but ensuring the privacy preservation of large-scale data sets still needs extensive investigation. Figure 4 shows some keywords pointing to challenges and research opportunities in data anonymization; we discuss each topic below.

Figure 4. Key Challenges

In big data applications, data privacy is one of the most pressing issues, because processing large-scale data sets often requires computational power provided by public cloud services. CC and big data are two disruptive trends at present, imposing significant impacts on the current IT industry and research communities. Privacy is one of the most pressing issues in big data applications that involve multiple parties, and the concern is aggravated in the context of CC, although some privacy issues are not new [Chaudhuri 2012].
In line with the various data-intensive applications on cloud, processing huge volumes of anonymized data sets is becoming an important research area. Privacy preservation for such data sets is an important yet challenging research issue that needs thorough investigation. Working with data anonymization in the cloud environment brings major challenges: we have to understand how the techniques work, comprehend their limitations, and identify the best ways to apply them so that they can reach their full potential in the cloud. As a research opportunity pointed out by [Zhang et al. 2013], in cloud environments privacy preservation for data analysis, sharing and mining is a challenging research issue due to the increasingly large volumes of data sets, thereby requiring intensive investigation.
In a cloud environment, preserving privacy in data publishing is a big problem. Data must be shared for research purposes in several fields: in medicine, it is necessary to disclose sensitive data about patients, informing the type of disease and the history of consultations; population data informs characteristics of each region, referring to the individuals living in those places, such as address and demographic concentration; and the government needs to be accountable to society by publicizing public funds on public sites, informing social security numbers and the salaries of public servants and service agencies. Anonymization techniques make this data available without the actual users risking identification by malicious users. The need for anonymization is motivated by many legal and ethical requirements for protecting private, personal data. The intent is that anonymized data can be shared freely with other parties, who can perform their own analysis and investigation of the data [Cormode and Srivastava 2009]. Anonymization techniques are useful to protect users' sensitive information from malicious users.
If malicious users are able to obtain private information, this may cause financial losses or damage to social reputation. Aside from the techniques mentioned in the papers studied, a variety of other techniques have been proposed for protecting data privacy. Developing new anonymization techniques and improving existing ones are challenges that must be overcome and intensively investigated by researchers in the area; it is also worthwhile to better understand the existing techniques and to conduct experiments exploring the capabilities of each one, identifying their performance in the cloud.
Anonymization techniques should safeguard the privacy of user data and ensure that, even if this data is shared, the actual user cannot be identified through possible attacks or security breaches. Medical information is regarded as the most confidential information, as it directly contains the personal data of patients. It has become an utmost concern to preserve the confidentiality of patients' data, despite the fact that this data needs to be shared with other medical bodies when required. Although the platforms providing cloud-based data are increasing, this has also resulted in privacy and security concerns about the medical data stored in the cloud. It remains an open question whether these generalization techniques can really ensure that, once a data set has been anonymized, the data cannot be used by third parties to re-identify the actual patients. So it is essential that the community continue developing these techniques to reach their maximal potential. It is also important to devise new quality metrics to regulate the levels of security achieved by the implementation of each anonymization technique.
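As a trivial illustration of what such a quality metric can look like (our own example, not a metric taken from the surveyed papers), one can measure the fraction of cells that were generalized or suppressed relative to the raw table, where 0 means no information was lost and 1 means everything was hidden:

```python
def cell_change_ratio(raw, anonymized, attributes):
    """Fraction of cells whose value changed between the raw and the
    anonymized table; a crude proxy for information loss."""
    changed = sum(r[a] != g[a]
                  for r, g in zip(raw, anonymized)
                  for a in attributes)
    return changed / (len(raw) * len(attributes))

raw = [{"age": 23, "zip": "60001"}, {"age": 27, "zip": "60105"}]
anon = [{"age": "<30", "zip": "60001"}, {"age": "<30", "zip": "60105"}]
print(cell_change_ratio(raw, anon, ["age", "zip"]))  # 0.5 (only age was generalized)
```

Real metrics (e.g., the information-loss measures discussed in Section 6) weigh how far up the hierarchy each value was generalized rather than treating all changes equally, but even a crude ratio like this makes the privacy/utility trade-off measurable.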
Interdisciplinarity is a factor that can be further investigated by researchers. It is a great challenge to work on data anonymization in CC environments, as it involves knowledge from several areas, such as CC, security and big data. Therefore, some knowledge in these areas is necessary when starting studies on this topic, because it involves different areas and tends to aggregate the difficulties existing in each of them, such as performance and scalability problems.
The increasing growth of data within organizations and lower maintenance costs are two factors that push data processing onto public clouds instead of private clouds. Despite the change in the location of data processing, the need for privacy preservation of sensitive data remains the same. Thus, it is beneficial to process data according to its sensitivity, between the organization's private cloud and public clouds [Derbeko et al. 2016]. The management of anonymization techniques must be treated with extreme importance: ways must be created to correctly manage the application of the technique and the operating environment, so that we can be sure that the shared data is really safe.

7. Experimental Evaluation

[Prasser et al. 2014] present the ARX tool, a comprehensive open-source data anonymization framework that implements a simple three-step process. It provides support for all common privacy criteria, as well as for arbitrary combinations of them. It utilizes a well-known and highly efficient anonymization algorithm. Moreover, it implements a carefully chosen set of techniques that can handle a broad spectrum of data anonymization tasks while being efficient, intuitive and easy to understand. The tool features a cross-platform user interface oriented towards non-IT experts. In our approach, we used the ARX API to create a generic web tool for data anonymization; we will explain more about the tool developed in the next section.
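At its core, the generalization step that our tool delegates to the ARX API replaces each quasi-identifier value with a coarser value from a user-supplied hierarchy. The following is a minimal, self-contained sketch of that idea in Python (our own illustration; the actual tool is implemented in Java on top of the ARX library, and the hierarchy values below are hypothetical):

```python
def generalize(records, hierarchies, level=1):
    """Replace each quasi-identifier value with its generalized form at
    the given level of the attribute's hierarchy (level 0 = original)."""
    out = []
    for r in records:
        g = dict(r)
        for attr, table in hierarchies.items():
            g[attr] = table[r[attr]][level]
        out.append(g)
    return out

# Hierarchy: original value -> [level 0, level 1, ...], mirroring the
# per-attribute hierarchy files the tool asks the user to upload.
hierarchies = {"Codigo-Acao": {"8442": ["8442", "844*"],
                               "8446": ["8446", "844*"]}}
rows = [{"Codigo-Acao": "8442"}, {"Codigo-Acao": "8446"}]
print(generalize(rows, hierarchies))
# [{'Codigo-Acao': '844*'}, {'Codigo-Acao': '844*'}]
```

After generalization, distinct codes collapse into the same value ("844*"), enlarging the equivalence classes until the chosen k-anonymity constraint is met.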
[Prasser et al. 2014] provides a stand-alone software library with an easy-to-use public API for integration into other systems. Their code base is extensible, well-tested and extensively documented. As such, it provides a solid basis for developing novel privacy methods.

7.1. Experiment Settings

The proposed SMDAnonymizer tool is implemented in Java using NetBeans4 as the integrated development environment (IDE). Our experiments were conducted in a cloud environment5 hosted by the IBITURUNA research group. We collected a data set from a federal government website6. The data set, named BolsaFamilia, has the following attributes: UF, Code-SIAFI, Municipality, Code-Function, Code-Subfunction, Code-Program, Code-Action, NIS-Favored, Name-Favored, Source-Purpose and Value-Month. The tool implements the k-anonymity algorithm as the generalization technique for data anonymization. The k-anonymity parameter is set to 2 for the results that will be presented.

7.2. Experiment Process and Results

The tool interface is presented in Figure 5, which shows the initial screen where the first steps are performed. From this interface the user can download an example data file that shows the type of file the tool receives: our tool supports the .csv format (short for comma-separated values), which is often used to exchange data between dissimilar applications. After uploading the file to be anonymized and selecting the anonymization algorithm that will be applied to the data set, the user can click the confirm button; alternatively, the clear option deletes the filled fields so that the user can start the upload and algorithm selection again.

4https://netbeans.org/
5http://app.ibituruna.virtual.ufc.br/
6http://www.transparencia.gov.br/

Figure 5. Step 1: Selecting the data set and the anonymization algorithm

Figure 6.
Step 2: Selecting the anonymization hierarchies of the fields

After the upload and the algorithm selection, the tool reads and interprets the columns of the uploaded data set and shows them as checkboxes, so the user can select the fields to be anonymized, as presented in Figure 6.

After selecting the fields to be anonymized, the user must upload the hierarchies. For generalization, a hierarchy is created for each attribute, defining its privacy level. A hierarchy is created for each quasi-identifier based on the type of values the attribute holds. For instance, the hierarchies of the attributes UF, Code-Subfunction, Code-Program and Code-Action are shown in Figure 7.

Figure 7. Hierarchy of attributes applied

Therefore, after selecting the fields and uploading their hierarchies, the user confirms the operation, and the tool then exports and saves the anonymized data to a file in csv format. Table 5 shows the final file generated by the tool.

Table 5. BolsaFamilia data set result

UF            Codigo-Subfuncao  Codigo-Programa  Codigo-Acao
Nordeste      24*               13**             844*
Nordeste      24*               13**             844*
Centro-Oeste  24*               13**             844*
Norte         24*               13**             844*
Sudeste       24*               13**             844*
Sul           24*               13**             844*
Centro-Oeste  24*               13**             844*

By multimedia, we understand all programs and systems in which communication between man and computer occurs through multiple means of representing information.
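The masked values in Table 5 illustrate generalization by suppression: low-order characters of a code are replaced with * so that more records share the same quasi-identifier values. A minimal sketch of one such suppression step in Java (the code values and method name are illustrative; this is not the exact procedure implemented by ARX or by our tool):

```java
public class Generalizer {
    // Replaces the last `level` characters of a code with '*',
    // e.g. suppress("244", 1) yields "24*".
    static String suppress(String code, int level) {
        int keep = Math.max(code.length() - level, 0);
        StringBuilder sb = new StringBuilder(code.substring(0, keep));
        for (int i = keep; i < code.length(); i++) {
            sb.append('*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(suppress("244", 1));   // 24*
        System.out.println(suppress("13003", 2)); // 130**
    }
}
```

Each additional suppression level moves one step up the attribute's generalization hierarchy, trading precision for larger, harder-to-re-identify groups.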
Our tool fits as a multimedia product according to the characteristics described by [Paula Filho 2011]: (i) non-linear access: information is quickly accessible in a non-linear way, so the user does not get stuck in a time sequence like the reader of a book, the listener of a lecture or the spectator of a movie; (ii) interactivity: the user in front of the computer is not a passive spectator but a participant in an activity; and (iii) integration with application programs: the computer can perform calculations, searches on databases and other normal tasks of any application program. Given the characteristics above, our tool meets the interactivity requirement, since it requires the user to upload the file that will be anonymized and to select in the checkboxes the hierarchies that will be applied, and it meets the integration-with-application-programs requirement, because the tool executes the k-anonymity algorithm used for data anonymization and makes use of an API to execute the other functionalities present in the application.

8. Conclusion and Future Work

In this paper, we presented a survey on data anonymization in CC based on an adaptation of a classic systematic review. We also identified concepts presented in the literature for data anonymization. Moreover, we presented concepts related to CC and privacy, focusing on anonymization techniques.

This paper also reveals the state of the art on the main topic, where we discovered that the main anonymization technique used is k-anonymity. To support this, we built a taxonomy of the techniques presented by the works shown in Table 4 and observed that k-anonymity is used as the base for creating new approaches.
In addition, we discussed the works proposed by the authors and interesting insights to better understand research opportunities and challenges that could guide researchers interested in studying this area. Furthermore, we presented our tool, called SMDAnonymizer, and described its use as a tool to anonymize raw data and generate a new file with anonymized data. As future work, we intend to further develop the tool, implementing new anonymization algorithms and testing different types of data, comparing the efficiency of each implemented algorithm. We also want to integrate other APIs to make our tool more extensible and increase its functionality. We intend to give the tool a friendly interface so that it can be used for research purposes and by non-IT experts.

References

Aldeen, Y. A. A. S., Salleh, M., and Aljeroudi, Y. (2016). An innovative privacy preserving technique for incremental datasets on cloud computing. Journal of Biomedical Informatics, 62:107–116.

Argus (2017). µ-argus manual. Available from: http://neon.vb.cbs.nl/casc/Software/MuManual4.2.pdf.

Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., Lee, G., Patterson, D. A., Rabkin, A., Stoica, I., et al. (2009). Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley.

Balusamy, M. and Muthusundari, S. (2014). Data anonymization through generalization using map reduce on cloud. In Proceedings of IEEE International Conference on Computer Communication and Systems ICCCS14, pages 039–042.

Branco Jr, E. C., Machado, J. C., and Monteiro, J. M. (2014). Estratégias para proteção da privacidade de dados armazenados na nuvem. Simpósio Brasileiro de Banco de Dados.

Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., and Brandic, I. (2009).
Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6):599–616.

Byun, J.-W., Kamra, A., Bertino, E., and Li, N. (2007). Efficient k-anonymization using clustering techniques. In International Conference on Database Systems for Advanced Applications, pages 188–200. Springer.

Camenisch, J., Fischer-Hübner, S., and Rannenberg, K. (2011). Privacy and identity management for life. Springer Science & Business Media.

Chaudhuri, S. (2012). What next?: A half-dozen data management research goals for big data and the cloud. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '12, pages 1–4, New York, NY, USA. ACM.

Clarke, R. (1999). Introduction to dataveillance and information privacy, and definitions of terms. Roger Clarke's Dataveillance and Information Privacy Pages.

Cormode, G. and Srivastava, D. (2009). Anonymized data: Generation, models, usage. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 1015–1018, New York, NY, USA. ACM.

Coutinho, E. F., de Carvalho Sousa, F. R., Rego, P. A. L., Gomes, D. G., and de Souza, J. N. (2015). Elasticity in cloud computing: a survey. Annals of Telecommunications, 70(7-8):289–309.

Derbeko, P., Dolev, S., Gudes, E., and Sharma, S. (2016). Security and privacy aspects in mapreduce on clouds: A survey. Computer Science Review, 20:1–28.

Fung, B., Wang, K., Chen, R., and Yu, P. S. (2010). Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (CSUR), 42(4):14.

Gokila, S. and Venkateswari, P. (2014). A survey on privacy preserving data publishing. International Journal on Cybernetics & Informatics (IJCI), 3.

Guttman, B. and Roback, E. A. (1995). An introduction to computer security: the NIST handbook. DIANE Publishing.

Jacobs, D., Aulbach, S., et al.
(2007). Ruminations on multi-tenant databases. In BTW, volume 103, pages 514–521.

Jr, A. M., Laureano, M., Santin, A., and Maziero, C. (2010). Aspectos de segurança e privacidade em ambientes de computação em nuvem.

Kaur, A. and Sofat, S. (2016). A proposed hybrid approach for privacy preserving data mining. In 2016 International Conference on Inventive Computation Technologies (ICICT), volume 1, pages 1–6.

Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University, 33(2004):1–26.

Krutz, R. L. and Vines, R. D. (2010). Cloud security: A comprehensive guide to secure cloud computing. Wiley Publishing.

LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2006). Mondrian multidimensional k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), pages 25–25. IEEE.

Logeswari, G., Sangeetha, D., and Vaidehi, V. (2014). A cost effective clustering based anonymization approach for storing PHRs in cloud. In 2014 International Conference on Recent Trends in Information Technology, pages 1–5.

Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3.

Mell, P. and Grance, T. (2009). The NIST definition of cloud computing. National Institute of Standards and Technology (NIST), Information Technology Laboratory. Available: http://csrc.nist.gov/groups/SNS/cloud-computing/index.html.

Panackal, J. J. and Pillai, A. S. (2015). Adaptive utility-based anonymization model: Performance evaluation on big data sets. Procedia Computer Science, 50:347–352.

Paul, M., Collberg, C., and Bambauer, D. (2015). A possible solution for privacy preserving cloud data storage. In 2015 IEEE International Conference on Cloud Engineering, pages 397–403.

Paula Filho, W. P. (2011). Multimídia - Conceitos e Aplicações. Editora LTC, 2 edition.
Pearson, S. (2013). Privacy, security and trust in cloud computing. In Privacy and Security for Cloud Computing, pages 3–42. Springer.

Pfitzmann, A. and Köhntopp, M. (2001). Anonymity, unobservability, and pseudonymity: a proposal for terminology. In Designing Privacy Enhancing Technologies, pages 1–9. Springer.

Prasser, F., Kohlmayer, F., Lautenschläger, R., and Kuhn, K. A. (2014). ARX - a comprehensive tool for anonymizing biomedical data. In AMIA Annual Symposium Proceedings, volume 2014, page 984. American Medical Informatics Association.

Russell, R., Chung, M., Balk, E. M., Atkinson, S., Giovannucci, E. L., Ip, S., Taylor, M. S., Raman, G., Ross, A. C., Trikalinos, T., et al. (2009). Issues and challenges in conducting systematic reviews to support development of nutrient reference values: Workshop summary. Nutrition Research Series, vol. 2.

S, K., S, Y., and P, R. V. (2015). An evaluation on big data generalization using k-anonymity algorithm on cloud. In 2015 IEEE 9th International Conference on Intelligent Systems and Control (ISCO), pages 1–5.

Samarati, P. (2001). Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027.

sdcMicro (2017). Data-Analysis. Available from: https://cran.r-project.org/web/packages/sdcMicro/.

Sousa, F. R., Moreira, L. O., and Machado, J. C. (2009). Computação em nuvem: Conceitos, tecnologias, aplicações e desafios. II Escola Regional de Computação Ceará, Maranhão e Piauí (ERCEMAPI), pages 150–175.

Stallings, W. (2007). Network Security Essentials: Applications and Standards. Pearson Education India.

Subashini, S. and Kavitha, V. (2011). A survey on security issues in service delivery models of cloud computing. Journal of Network and Computer Applications, 34(1):1–11.

Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570.
Taneja, H., Kapil, and Singh, A. K. (2015). Preserving privacy of patients based on re-identification risk. Procedia Computer Science, 70:448–454. Proceedings of the 4th International Conference on Eco-friendly Computing and Communication Systems.

Toolbox, U. A. (2017). UT Dallas Data Security and Privacy Lab. Available from: http://cs.utdallas.edu/dspl/cgi-bin/toolbox/.

Toolkit, C. A. (2017). Cornell Database Group. Available from: https://sourceforge.net/projects/anony-toolkit/.

Vecchiola, C., Chu, X., and Buyya, R. (2009). Aneka: a software platform for .NET-based cloud computing. High Speed and Large Scale Scientific Computing, 18:267–295.

Zhang, X., Dou, W., Pei, J., Nepal, S., Yang, C., Liu, C., and Chen, J. (2015). Proximity-aware local-recoding anonymization with mapreduce for scalable big data privacy preservation in cloud. IEEE Transactions on Computers, 64(8):2293–2307.

Zhang, X., Liu, C., Nepal, S., Yang, C., Dou, W., and Chen, J. (2013). Combining top-down and bottom-up: