2302 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 20, NO. 3, MAY/JUNE 2023

A Stochastic Grammar Approach to Mass Classification in Mammograms

Ricardo Wandré Dias Pedro, Ana Luiza Silveira Ferreira, Rodolph Vinicius Siqueira Pessoa, Almir Galvão Vieira Bitencourt, Ariane Machado-Lima, and Fátima L. S. Nunes

Abstract—Breast cancer is responsible for approximately 15% of all cancer-related deaths among women worldwide, and early and accurate diagnosis increases the chances of survival. Over the last decades, several machine learning approaches have been used to improve the diagnosis of this disease, but most of them require a large set of samples for training. Syntactic approaches have rarely been used in this context, although they can achieve good results even when the training set has few samples. This article presents a syntactic approach to classify masses as benign or malignant. Features extracted from a polygonal representation of masses were combined with a stochastic grammar approach to discriminate the masses found in mammograms. The results were compared with other machine learning techniques, and the grammar-based classifiers showed superior performance in the classification task. The best accuracies achieved ranged from 96% to 100%, indicating that grammatical approaches are robust and able to discriminate the masses even when trained with small samples of images. Syntactic approaches could be employed more frequently in the classification of masses, since they can learn the pattern of benign and malignant masses from a small sample of images while achieving results similar to the state of the art.

Index Terms—Breast cancer, classification, diagnosis, mammogram, pattern recognition, stochastic grammars, syntactic approach.

Manuscript received 2 July 2022; revised 26 December 2022; accepted 16 February 2023. Date of publication 22 February 2023; date of current version 5 June 2023. This work was supported in part by Brazilian National Council of Scientific and Technological Development (CNPq) under Grant #309030/2019-6; in part by CNPq and São Paulo Research Foundation (FAPESP): National Institute of Science and Technology – Medicine Assisted by Scientific Computing (INCT-MACC) under Grant #157535/2017-7. (Corresponding author: Ricardo Wandré Dias Pedro.)
Ricardo Wandré Dias Pedro is with the Electrical Engineering, Polytechnic School, University of São Paulo, São Paulo 05508, Brazil (e-mail: rwandre@usp.br).
Ana Luiza Silveira Ferreira, Rodolph Vinicius Siqueira Pessoa, and Almir Galvão Vieira Bitencourt are with the A.C.Camargo Cancer Center, São Paulo 01525-001, Brazil (e-mail: analusilveiraf@gmail.com; rodolph.vini@gmail.com; almir.bitencourt@accamargo.org.br).
Ariane Machado-Lima and Fátima L. S. Nunes are with the Information Systems, School of Arts, Sciences and Humanities, University of São Paulo, São Paulo 05508, Brazil (e-mail: ariane.machado@usp.br; fatima.nunes@usp.br).
Digital Object Identifier 10.1109/TCBB.2023.3247144

I. INTRODUCTION

Breast cancer is the most common type of cancer according to the World Health Organization (WHO) [1]. Some of the most used machine learning techniques to develop computer-aided diagnosis systems are artificial neural networks (including deep learning), support vector machines, k-nearest neighbors and random forests [2]. Few studies have explored the theory of formal languages to classify masses in mammographic images [3]. The syntactic approach has the advantage of providing a concise hierarchical representation of parts of the images and their relationships. Syntactic approaches are useful even when few samples are available to learn the target pattern. As far as we know, grammars were used in the process of discriminating masses only in [4], [5], and [6].
In [5] and [6] grammars are used to classify the masses directly, while in [4] the output generated by the syntactic analysis is used as input to an artificial neural network, which makes our approach more straightforward.

The goal of this paper is to show the ability of grammar-based classifiers to discriminate masses as benign or malignant without the aid of other machine learning methods. This paper expands our previous studies by providing these main contributions:
1) classification is based on features extracted from a polygonal representation of mass boundaries using the Ramer-Douglas-Peucker algorithm, unlike our previous studies, in which shape features were extracted from the polygonal model proposed in [7];
2) Hu moments were used as input features for the grammar-based classifiers, in addition to the shape features used in the previous studies;
3) in addition to the dataset used in the previous studies, a new dataset with 202 images was introduced, creating a combined dataset with 313 images;
4) the robustness of the syntactic approach was validated by training the models with images from one dataset and testing them with images from the other dataset.

The results show that the proposed model is able to learn the pattern of the masses even with a small number of training images. In addition, the created models can be more easily understood by physicians (similar to decision trees, whose structure describes the decision process [8]) and proved to be robust when dealing with images from different datasets.

1545-5963 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: UNIVERSIDADE DE SAO PAULO. Downloaded on September 29,2023 at 12:39:34 UTC from IEEE Xplore. Restrictions apply.
https://orcid.org/0000-0002-3728-4540 https://orcid.org/0000-0003-0192-9885 https://orcid.org/0000-0003-0040-0752

II. BACKGROUND

A. Breast Cancer and Suspicious Findings

Breast cancer is the second most common malignant neoplasia. The decrease in mortality rates of this cancer depends on an adequate therapeutic plan based on screening programs and early detection. Mammography is still considered the preferred method for screening in average-risk women [9].

Suspicious findings on screening mammography should be submitted to percutaneous biopsy to confirm malignancy and plan treatment. However, mammography interpretation is a challenge even for breast imaging specialists, and many of these findings are associated with benign conditions. False positive results are one of the main limitations of mammography screening, as they are associated with increased patient anxiety, invasive procedures, and increased healthcare costs [10].

Mass lesions found on mammography are challenging due to the wide range of different diagnoses. The risk of malignancy is evaluated based on the mass shape, margins and density. However, the overlapping imaging features of benign and malignant lesions make it hard to avoid histological analysis, resulting in a large number of benign biopsies.

B. Grammars

Grammars are a formalism derived from the theory of formal languages, developed to represent a set of sequences (a language). This theory can be applied to problems in different areas, for example, to understand the content of images. In this paper a stochastic context-free grammar was used.
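Before the formal definition, a small sketch may help to fix ideas. It is an illustration only, not the authors' implementation: the two toy grammars, their rule probabilities and the function names are invented here, and the tokens 'co1', 'co2', 'fc1', 'fc2' are borrowed from the discretization example in Section IV-D. The sketch assumes right-hand sides of length one or two and no cycles of unit rules. It computes the probability of the most probable syntactic tree of a sequence under each grammar and assigns the sequence to the class whose grammar gives the larger value.

```python
from functools import lru_cache
from math import isclose

# Toy grammars (hypothetical): non-terminal -> list of (right-hand side, probability).
# Any symbol that is not a key of the dictionary is treated as a terminal (token).
G_BENIGN = {
    "S": [(("CO", "FC"), 1.0)],
    "CO": [(("co1",), 0.8), (("co2",), 0.2)],
    "FC": [(("fc1",), 0.7), (("fc2",), 0.3)],
}
G_MALIGNANT = {
    "S": [(("CO", "FC"), 1.0)],
    "CO": [(("co1",), 0.3), (("co2",), 0.7)],
    "FC": [(("fc1",), 0.4), (("fc2",), 0.6)],
}


def is_normalized(grammar):
    """Check that the probabilities of rules sharing a left side sum to one."""
    return all(isclose(sum(p for _, p in rules), 1.0) for rules in grammar.values())


def best_tree_probability(grammar, tokens, start="S"):
    """Return max over trees t of P(x, t | G), or 0.0 if x cannot be parsed.

    Assumption of this sketch: every right-hand side has one or two symbols
    and the grammar has no cycles of unit rules.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def best(symbol, i, j):
        if symbol not in grammar:  # terminal: must match exactly one token
            return 1.0 if j - i == 1 and tokens[i] == symbol else 0.0
        result = 0.0
        for rhs, p in grammar[symbol]:
            if len(rhs) == 1:
                result = max(result, p * best(rhs[0], i, j))
            else:
                left, right = rhs
                for k in range(i + 1, j):  # every split point of tokens[i:j]
                    result = max(result, p * best(left, i, k) * best(right, k, j))
        return result

    return best(start, 0, len(tokens))


def classify(tokens):
    """Assign the class whose grammar gives the most probable tree."""
    p_benign = best_tree_probability(G_BENIGN, tokens)
    p_malignant = best_tree_probability(G_MALIGNANT, tokens)
    return "benign" if p_benign > p_malignant else "malignant"
```

With these invented probabilities, the sequence 'co1 fc1' has tree probability 1.0 × 0.8 × 0.7 under the benign grammar and 1.0 × 0.3 × 0.4 under the malignant one, so `classify(["co1", "fc1"])` selects the benign class.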
A stochastic context-free grammar is a quintuple $G_s = (V_N, V_T, R, S, P)$, where:
• $V_N$ is a set containing the non-terminal symbols, which are auxiliary symbols used to structure the grammar rules;
• $V_T$ is a set containing the terminal symbols, which are used to represent each sequence;
• $R$ is a set of substitution rules of the form $A \to \beta$, with $A \in V_N$ and $\beta \in (V_T \cup V_N)^*$;
• $S \in V_N$ is the initial symbol of the grammar;
• $P$ is the set of probability distributions over the rules having the same left side. Therefore, for each non-terminal $A$, considering all rules $\{A \to \beta_i, p_i\} \in R$, $\beta_i \in (V_N \cup V_T)^*$, we have $\sum_i p_i = 1$.

A parser is an algorithm that, for a given sequence and grammar, provides at least one syntactic tree if the sequence belongs to the language generated by the grammar; otherwise it produces an error. A parser for a stochastic grammar provides not only the syntactic trees but also the probabilities associated with these trees.

Given a stochastic grammar $G_s$ and a sequence $x$, $P(x, t \mid G_s)$ is the probability of $x$ with the syntactic tree $t$, given by the product of the probabilities of the $G_s$ grammar rules used in the parsing process. In this study, we focus on the tree that maximizes this probability: $t_{max} = \arg\max_t P(x, t \mid G_s)$.

In a classification context, a stochastic grammar $G_{c_j}$ is used to represent each class $c_j$ of the problem, $j = 1, \ldots, n$. Then, a new instance (sequence) $x$ is classified as belonging to the class $c_j$ that maximizes $P(x, t_{max} \mid G_{c_j})$, $j = 1, \ldots, n$.

III. RELATED WORK

Over the last decades, numerous studies have addressed mass classification in digital mammograms using different combinations of techniques, datasets and features extracted from them. The results vary and are usually reported as accuracy or area under the Receiver Operating Characteristic (ROC) curve (AUC).
When features are extracted for subsequent classification, different feature categories are considered, usually depending on the image set evaluated [2].

A feature describing the concavities of the mass boundaries was used in [7], [11] together with compactness and spiculation index to classify masses into the classes benign, malignant, circumscribed or spiculated. The accuracy was 81% in [11] and 91% in [7]. The studies were performed with 53 images from the Mammographic Image Analysis Society database (MIAS) [12] and from the Alberta Program for the Early Detection of Breast Cancer database (ALB) [13]. In [14], the authors used genetic programming in the classification process, and the approach was able to correctly classify 95% of the benign masses and 97.3% of the malignant ones. The features (edge-sharpness, shape and texture features) were extracted from 57 images from the ALB dataset. Fractal dimensions, compactness, spiculation index and fractional concavity were used in [15] to classify the masses, achieving an AUC of 0.92 on 111 images from the MIAS and ALB datasets. In [16] a combination of gradient and texture features was employed as input to the classifier, obtaining an AUC of 0.76 when applied to images from the MIAS and ALB datasets.

Hu moments were used as shape descriptors in [17], [18], [19]. The authors of [17], [18] used texture features, central moments and Hu moments extracted from 200 images from the Digital Database for Screening Mammography (DDSM) [20] as input to a support vector machine algorithm, achieving an accuracy of 93%. A comparison of different descriptors was performed in [19], showing the best performance for texture features extracted from Gray Level Run Length Matrices. For shape features, the best results were achieved with Hu moments, using 322 images from the MIAS dataset.

A syntactic approach was used in [4] to classify masses as benign or malignant.
The output of a syntactic analysis was used as input to an artificial neural network. Texture features and Zernike moments were used as features, achieving an AUC of 0.86. More recently, in [5] a new syntactic approach was developed using shape and texture features to classify 111 images belonging to two distinct datasets, obtaining accuracies from 96% to 100% depending on the model and the dataset used. In [6] the results achieved with grammars were compared with the results of other classifiers such as artificial neural network (ANN), support vector machine (SVM), k-nearest neighbors (KNN) and random forest (RF), showing that grammars outperformed these techniques in this problem.

Although we are using grammars to classify nodules, other approaches are also being used to handle this classification problem. In [21] an optimized kernel extreme learning machine was proposed and used with Haralick's features as input. The classifier achieved accuracies ranging from 99% to 100% for binary classification and from 87% to 100% for multi-class classification with 150 images from the DDSM dataset. The study [22] proposed an approach where deep learning techniques, especially Convolutional Neural Networks and Transfer Learning, were used together with Softmax and SVM classifiers to detect and classify masses as benign, malignant or normal. The study used a dataset provided by MIAS consisting of 322 images (62 benign, 51 malignant and 209 normal), achieving a best accuracy of 98%.

Fig. 1. Tasks executed in order to classify masses as benign or malignant using the proposed grammatical approach.

SVM models were employed in [23] to classify masses as benign and malignant.
The authors showed that when the features were selected using a random projection algorithm (RPA), the accuracy of 75% outperformed that of SVM models built with other feature selection mechanisms.

IV. PROPOSED GRAMMAR APPROACH

Fig. 1 shows the pipeline of tasks used in this study to discriminate masses as benign or malignant. In the next sections these tasks are presented in detail.

A. Polygonal Model Representation

The Ramer-Douglas-Peucker (RDP) algorithm [24] was used to generate a polygonal representation from the original boundaries delimited by specialists. The polygonal representation is used to extract the shape features and the Hu moments. This algorithm has two input parameters: the threshold ε and the number N of points representing the mass boundary. Initially, the N points (P1 to PN) forming the boundary are considered. Next, P1 and PN (the first and the last points) are connected with a straight line. Then, the point Pi (i = 2, . . ., N − 1) most distant from the line segment P1PN is found. If this distance is higher than the threshold ε, the segment P1PN is divided into P1Pi and PiPN. This step is repeated for each new segment as long as some point lies farther than ε from its segment. The points retained as segment endpoints form the polygonal model; all other points are discarded. In this study we used ε = 0.4% of the perimeter of the mass boundary, since the RDP proved to be more robust with this value of ε, as shown in [25]. Fig. 2 shows the original boundaries of benign and malignant masses and their polygonal representation.

B. Feature Extraction

To handle the mass classification problem, shape features and Hu moments were used as input to the classifiers. A total of eight shape features and seven Hu moments were extracted from 313 gray-level images. These shape features were originally used in [7], [11], [14], [15], [16], while Hu moments were used in [17], [18], [19].
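The RDP simplification described in Section IV-A can be sketched as follows. This is a minimal recursive version written for illustration, not the code used in the study. Two choices are assumptions of the sketch: the distance is measured to the line segment (with a fallback to point distance when the two endpoints coincide, as can happen for a closed contour), and the helper `perimeter` treats the point list as a closed polygon.

```python
import math


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, epsilon):
    """Return the subset of points kept by Ramer-Douglas-Peucker."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax > epsilon:
        # Split at the farthest point and simplify both halves.
        left = rdp(points[: index + 1], epsilon)
        right = rdp(points[index:], epsilon)
        return left[:-1] + right  # drop the duplicated split point
    return [first, last]  # all intermediate points are discarded


def perimeter(points):
    """Length of the closed polygon through the given points."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


# Toy boundary (hypothetical data) with one nearly collinear point.
boundary = [(0.0, 0.0), (1.0, 0.01), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
epsilon = 0.004 * perimeter(boundary)  # 0.4% of the perimeter, as in the text
polygon = rdp(boundary, epsilon)
```

In this toy example the point (1.0, 0.01) lies closer than ε to the segment joining its neighbors, so it is dropped while the corner points are kept.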
1) Shape Features: The shape features used in this study are briefly described here. Additional details of how these features were computed can be found in [25].

a) Compactness (CC): It is a measure of the efficiency of a given contour in covering a specific area [7], [15]. This feature is important because malignant masses tend to possess higher compactness values than benign ones [11]. It was computed according to (1), where $P$ represents the perimeter of the mass and $A$ is its area.

$$CC = 1 - \frac{4\pi A}{P^2} \tag{1}$$

b) Spiculation Index (SI): This index measures how spiculated the contour of a mass is. Malignant masses tend to present more irregular boundaries, resulting in higher SI values. Equation (2), proposed in [7], is used to compute the index, where $S_i$ and $\theta_i$, for $i = 1, 2, \ldots, N$, are respectively the length and the angle of the two segments representing the $i$-th spicule.

$$SI = \frac{\sum_{i=1}^{N} (1 + \cos\theta_i)\, S_i}{\sum_{i=1}^{N} S_i} \tag{2}$$

c) Fractal Dimension (FD): It is a measure of how self-similar a pattern is and, in general, it can be used to describe the boundary complexity [15]. Equation (3) is used to compute the self-similarity FD, where $a$ is the number of self-similar parts obtained with a reduction factor of $1/s$ (which represents the measurement precision); $a$ is obtained using (4). FD can be estimated as the slope of a straight line fitted to the plot of $\log(a)$ versus $\log(1/s)$ [15].

$$FD = \frac{\log(a)}{\log(1/s)} \tag{3}$$

$$a = \frac{1}{s^{D}} \tag{4}$$

In this study four different measures of FD were employed. Two of them were obtained from the two-dimensional boundaries of the masses using the ruler and box counting methods [15]. The other two were computed from the one-dimensional signature of the boundaries using the same methods (ruler and box counting) [15]. The 1D signature of a boundary was defined as the radial distance from the centroid to each boundary point as a function of the index of the boundary point.

To normalize each 2D boundary the following steps were performed: the wider axis of the boundary was found and all values along that axis were normalized to range from 0 to 1. Then, the values along the other axis were also normalized, this time based on the length of the wider axis. This normalization was applied because it preserves the ratio between the width and the height of the boundaries in the dataset. The 1D signatures were normalized to range between 0 and 1 on both axes [15].

Fig. 2. (a) Example of a benign mass boundary; (b) polygonal model representing the benign mass using RDP algorithm with ε = 0.4%. (c) Example of an original malignant mass boundary; (d) polygonal model representing the malignant mass using RDP algorithm with ε = 0.4%.

d) Fractional Concavity (FC): This measure is based on the concave parts that a mass boundary contains. The FC is useful in the classification process because benign masses tend to have more convex parts, while malignant masses are generally composed of both concave and convex parts. Equation (5) is used to compute the length of the boundary $T_l$, where $S_i$, $i = 1, 2, 3, \ldots, M$, is the length of each one of the $M$ segments that form the mass. $CC_l$, given by (6), is the length of all concave segments in the boundary, where $CC_i$, $i = 1, 2, 3, \ldots, P$, is the length of each one of the $P$ concave segments. Equation (7) gives the FC as described in [7], [11].
$$T_l = \sum_{i=1}^{M} S_i \tag{5}$$

$$CC_l = \sum_{i=1}^{P} CC_i \tag{6}$$

$$FC = \frac{CC_l}{T_l} \tag{7}$$

e) Fourier Factor (FF): It can be used to measure the presence of high-frequency components in a boundary, that is, its roughness (pixel values that change rapidly in space) [15]. Equation (8) shows how this measure is computed, where $Z_0(k)$ are the normalized Fourier descriptors (obtained by (9)), $Z(k)$ are the Fourier descriptors computed according to (10) for $k = -N/2, \ldots, -1, 0, 1, 2, \ldots, N/2 - 1$, and $z(n) = x(n) + j\,y(n)$, $n = 0, 1, \ldots, N - 1$, represents the sequence of pixels in the boundary [15], where $N$ is the number of pixels in the boundary.

$$FF = 1 - \frac{\sum_{k=-N/2+1}^{N/2} |Z_0(k)| / |k|}{\sum_{k=-N/2+1}^{N/2} |Z_0(k)|} \tag{8}$$

$$Z_0(k) = \begin{cases} 0, & k = 0; \\ \dfrac{Z(k)}{|Z(1)|}, & \text{otherwise.} \end{cases} \tag{9}$$

$$Z(k) = \frac{1}{N} \sum_{n=0}^{N-1} z(n) \exp\left[-j \frac{2\pi}{N} n k\right] \tag{10}$$

2) Hu Moments: These moments describe the spatial distribution of the points contained in an image or in a region. They are useful as shape descriptors because they are invariant to translation, rotation and scale of the shape. The definition of Hu moments is explained in detail in [26]. They are summarized in (11) to (15), where $I(x, y)$ is the pixel intensity at position $(x, y)$ of an image $I$ represented by a two-dimensional matrix, $(p + q)$ is the order of the moment, $\bar{x}$ and $\bar{y}$ are the components of the centroid, $\eta_{pq}$ are the central moments, $\mu_{pq}$ are the invariant moments and $H_1$ to $H_7$ are the Hu moments.

$$M_{ij} = \sum_{x} \sum_{y} x^{i} y^{j} I(x, y) \tag{11}$$

$$\bar{x} = \frac{M_{10}}{M_{00}}, \qquad \bar{y} = \frac{M_{01}}{M_{00}} \tag{12}$$

$$\eta_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q} I(x, y) \tag{13}$$

$$\mu_{pq} = \frac{\eta_{pq}}{\eta_{00}^{\gamma}}, \qquad \gamma = \frac{p + q}{2} \tag{14}$$

$$\begin{aligned}
H_1 &= \mu_{20} + \mu_{02} \\
H_2 &= (\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2 \\
H_3 &= (\mu_{30} - 3\mu_{12})^2 + (\mu_{03} - 3\mu_{21})^2 \\
H_4 &= (\mu_{30} + \mu_{12})^2 + (\mu_{03} + \mu_{21})^2 \\
H_5 &= (\mu_{30} - 3\mu_{12})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right] \\
&\quad + (3\mu_{21} - \mu_{03})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^2 - (\mu_{03} + \mu_{21})^2\right] \\
H_6 &= (\mu_{20} - \mu_{02})\left[(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right] + 4\mu_{11}(\mu_{30} + \mu_{12})(\mu_{21} + \mu_{03}) \\
H_7 &= (3\mu_{21} - \mu_{03})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right] \\
&\quad + (\mu_{30} - 3\mu_{12})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^2 - (\mu_{03} + \mu_{21})^2\right]
\end{aligned} \tag{15}$$

In the present study the classifiers were created using only the shape features (eight features), only the Hu moments (seven features), and a combination of shape features and Hu moments, as detailed in Section IV-C.

C. Feature Selection

The "Gini importance" was employed as a feature selection mechanism in the present study. To compute this metric, a Random Forest (RF) classifier was trained with 100 trees. The training was performed using images from the training dataset considering the benign and malignant classes, and one of the outputs of the classifier is the Gini importance. This value measures how often a feature is selected for a split, based on how discriminating the feature is for the classification [27]. In the scenario where the combination of features was used, we selected the eight most important features according to the Gini importance. In the other scenarios, where the classification was performed using only shape features or only Hu moments, the feature selection step was not employed.

D. Data Discretization

Data discretization is necessary because every feature value must be represented by a token (symbol) in the grammars. For example, assume that compactness and fractional concavity can take the values {0.1, 0.3, 0.5} and {0.1, 0.2}, respectively. A discretization process could, for example, label the compactness values 0.1 and 0.3 as 'co1' and the value 0.5 as 'co2'.
Similarly, the discretization process could label the fractional concavity value 0.1 as 'fc1' and the value 0.2 as 'fc2'. Thus, the sequences 'co1 fc1', 'co1 fc2', 'co2 fc1' and 'co2 fc2' are the possible sequences used to represent benign and malignant masses in this example.

The Omega algorithm [28] was used to discretize the continuous features extracted from the masses in this study. This algorithm has two input parameters: Hmin and ζmax. Hmin determines the minimum number of elements that each bin must contain, i.e., the minimum number of values that each token must represent. ζmax represents the maximum inconsistency level; it specifies that two consecutive bins are merged only when their elements have the same majority class and the inconsistency level of the merged bin is below ζmax. In this project, the features were discretized considering the labels Circumscribed Benign (CB), Circumscribed Malignant (CM), Spiculated Benign (SB), and Spiculated Malignant (SM).

In [6] a calibration process was performed to find the best parameters for the algorithm, and the values Hmin = 2 and ζmax = 0.35 were found. In the present study we consider Hmin = 2, 4, 6, 8 and 10, since this parameter proved to have a major impact on the discretization process, and we kept ζmax = 0.35, since it has only a minor impact.

E. Grammar Learning

AND-OR graphs were used to visually represent context-free grammars. The internal nodes (AND/OR) are mapped to the non-terminal symbols, while the leaf nodes are mapped to the terminal symbols of the grammar. The AND nodes decompose each entity into its parts and the OR nodes produce alternative substructures [29].

Two AND-OR graphs are created, one representing the benign masses and the other representing the malignant ones. The generated graphs consider whether the masses are circumscribed or spiculated. Fig. 3(a) shows an AND-OR graph with two generic features (F1 and F2) in combination with two internal labels, CIRCUMSCRIBED (cF1 and cF2) or SPICULATED (sF1 and sF2), to represent the masses. Fig. 3(b) shows the context-free grammar equivalent to the AND-OR graph shown in Fig. 3(a).

In Fig. 3(a) two generic features were used as an example. However, to represent the real benign and malignant masses, the proposed approach creates AND-OR graphs considering all eight shape features, all the Hu moments, and a combination of shape features and Hu moments.

A maximum a posteriori algorithm proposed in [30] was used to convert the context-free grammars into stochastic context-free grammars. Each production rule has a pseudocounter that is first initialized with the value 0.1, so that no production rule has a probability of zero at the end of the process. Next, the initial grammar is used to parse all the sequences in the training set, and whenever a production rule is used, its counter is incremented by one. When all the sequences have been analyzed, the counters are normalized so that the probabilities of the production rules with the same left side sum to one.

To parse the sequences that represent the masses, the Earley algorithm [31] for stochastic grammars was used. The classifier is composed of a stochastic parser generated from the grammar that represents the benign masses ($G_b$) and another generated from the grammar for malignant masses ($G_m$). Given a sequence $x$ representing a mass, the parsers provide $P(x, t^* \mid G_i)$, where $t^* = \arg\max_t P(x, t \mid G_i)$ over all syntactic trees $t$ of $x$ given $G_i$, and $i \in \{b, m\}$. When $P(x, t^* \mid G_b) > P(x, t^* \mid G_m)$ the mass is classified as benign; otherwise it is classified as malignant, which corresponds to the likelihood-ratio test.
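The counting and normalization steps of the grammar learning stage can be sketched as follows. This is a simplified illustration rather than the algorithm of [30] or a full Earley parser: it assumes that the rules used to parse each training sequence are already available in a hypothetical `derivations` list, and the rule and token names are invented for the example.

```python
from collections import defaultdict

PSEUDOCOUNT = 0.1  # initial value of every rule counter, as stated in the text


def estimate_rule_probabilities(rules, derivations, pseudocount=PSEUDOCOUNT):
    """Turn rule usage counts into probabilities per left-hand side.

    rules: iterable of (lhs, rhs) pairs, one per production rule.
    derivations: for each training sequence, the list of rules used to parse it
    (assumed to be given; a real system would obtain them from the parser).
    """
    counts = {rule: pseudocount for rule in rules}
    for derivation in derivations:
        for rule in derivation:
            counts[rule] += 1.0  # one increment per use of the rule
    totals = defaultdict(float)
    for (lhs, _), count in counts.items():
        totals[lhs] += count
    # Normalize so that rules sharing a left side sum to one.
    return {rule: count / totals[rule[0]] for rule, count in counts.items()}


# Hypothetical example: two alternatives for one non-terminal, three training
# sequences that all use the first alternative.
rules = [("CO", "co1"), ("CO", "co2")]
derivations = [[("CO", "co1")], [("CO", "co1")], [("CO", "co1")]]
probabilities = estimate_rule_probabilities(rules, derivations)
```

In this example the unused rule keeps a small non-zero probability (0.1 / 3.2) instead of zero, which is the purpose of the pseudocounter: a test sequence that needs that rule can still be parsed, only with a low probability.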
Fig. 3. (a) Example of mass representation with AND-OR graph. The AND nodes are represented by the dashed circles, the OR nodes are represented by the solid circles, and the squares are the leaf nodes. (b) The equivalent grammar of the AND-OR graph, where the terminal symbols are represented in italic. The symbols '.' and '|' represent the logic conditions 'AND' and 'OR,' respectively.

V. EXPERIMENTAL EVALUATION

A. Datasets

This study used images from two datasets. The first one (ALB dataset) contains images from the Foothills Hospital in Calgary [7] (pixel size of 62 μm), images from the MIAS dataset (pixel size of 50 μm) and images from the Screen Test dataset [13] (pixel size of 50 μm). This dataset contains 111 images (66 benign and 46 malignant) in which specialists manually drew the boundaries of the masses. The images are labeled as circumscribed benign (CB), spiculated benign (SB), circumscribed malignant (CM) and spiculated malignant (SM).

The second dataset (ACC dataset) contains 202 images (104 benign and 98 malignant) from A. C. Camargo Cancer Center (a Brazilian hospital specialized in cancer research and treatment). The images have a pixel size of 70 μm and the boundaries of the masses were also marked by specialists. As in the first dataset, each mass was labeled as benign or malignant (classes) and as circumscribed or not circumscribed.

Although the images from the two datasets have different pixel sizes, no preprocessing was performed to make the pixel sizes consistent, because the feature extraction methods are invariant to image resolution: the features are not extracted directly from the pixels.

B. Evaluation Scenarios

Different scenarios were built to evaluate the ability of the grammar-based classifiers to classify masses from different datasets.
Three experiments were performed using stratified k-fold cross-validation: using only ALB dataset (k = 23); using only ACC dataset (k = 20) and using the combined dataset (k = 20). In these scenarios, the feature selection was performed without considering the test fold. Two additional experiments were performed training with one dataset and testing with the other. For these additional experiments, the feature selection was employed considering only the training dataset.

Although k-fold cross-validation experiments are traditionally performed using k = 10, we used higher values in order to increase the number of training images in each fold, since both ALB and ACC datasets have a limited number of images. Furthermore, the different values for k (20 and 23) intend to better accommodate the image distribution through the folds. For instance, considering the ALB dataset, with k = 23, the classifier is trained with around 106 images and tested with five images in each fold iteration.

The evaluation of the grammar-based classifiers was performed considering the accuracy since the used datasets are balanced.

VI. RESULTS AND DISCUSSION

A. The Effect of Discretization Process

Fig. 4 shows the effect of the discretization process in the overall classification. As the value of the Hmin parameter increases, the accuracy tends to decrease. The greatest difference was noticed using Hu moments, when the training was performed with ACC dataset and the tests were performed with ALB dataset (Hu_ACC_ALB in Fig. 4), where the accuracy decreased from 96.42% (using Hmin = 2) to 62.16% (using Hmin = 10). The smallest difference between the accuracy values was achieved when the combination of shape features and Hu moments were evaluated using ACC dataset (Shape_Hu_ACC in Fig. 4), showing an accuracy of 100% when Hmin = 2 and accuracy of 98.05% when Hmin = 10.
The effect of the discretization was also reported in [6]. It happens because, when the value of Hmin increases, the number of bins decreases, so each bin contains a mixture of values belonging to different labels (CB, CM, SB, SM) used in the discretization process. For that reason, the next tests were performed considering Hmin = 2.

Fig. 4. The effect of discretization on classification of masses. In the legend, Shape means the use of shape features; Hu means the use of Hu moments; combined means the combined dataset; ACC_ALB means trained in ACC dataset and tested in ALB dataset; ALB_ACC means trained in ALB dataset and tested in ACC dataset.

Fig. 5. Accuracies achieved with the grammar-based approach. combined means the combination of ALB and ACC datasets; ACC_ALB means trained in ACC dataset and tested in ALB dataset; ALB_ACC means trained in ALB dataset and tested in ACC dataset. The error bars indicate the standard deviation when applicable (all scenarios except ALB_ACC and ACC_ALB, which did not use cross-validation).

B. Analysis of the Proposed Grammatical Approach

Fig. 5 shows the classification results achieved using the proposed grammatical approach. The achieved accuracies ranged from 98.72% to 100% considering only dataset ALB, only dataset ACC and the combined dataset (all scenarios using k-fold cross-validation). In addition, to verify whether the discretization process using Hmin = 2 was overfitting (since each bin contains just a small number of continuous values), we trained our model with one dataset and tested with the other.
For these scenarios, the best accuracy was 100% when training was performed using images from ALB and the combination of shape features and Hu moments. When training was performed with images from ACC, the best accuracy was 100% using only shape features. The worst result in terms of accuracy was 96.42%, obtained when only Hu moments were used as input, images from ACC were used for training, and images from ALB were used for testing.

The results shown in Fig. 5 indicate that the classifiers built using grammars were able to classify the images with accuracies ranging from 96.42% to 100% for Hmin = 2. However, the accuracy tends to fall as the value of Hmin increases, demonstrating that our approach depends on a good discretization process (Fig. 4).

Fig. 6. Accuracy achieved by each model using the shape features, Hu moments and the combined features. Combined dataset means the combination of the ALB and ACC datasets; ACC_ALB means trained on the ACC dataset and tested on the ALB dataset; ALB_ACC means trained on the ALB dataset and tested on the ACC dataset.

TABLE I
CLASSIFIERS CREATED AND THEIR HYPERPARAMETERS, WHERE N_FEATURES IS THE NUMBER OF FEATURES AND X.VAR() IS THE TRAIN DATASET VARIANCE

Regarding the shape features and Hu moments, the grammar approach was able to use both types of features to classify the images with good results regardless of the dataset. However, the shape features seem to be more robust, since the accuracies fall more slowly as Hmin increases than when the Hu moments are used (Fig. 4). Images from the ALB dataset were harder to classify correctly than images from the ACC dataset, especially as Hmin increases.
This may be because the ALB dataset contains a greater number of malignant masses with circumscribed shapes and benign masses with spiculated shapes than the ACC dataset. Another factor that may contribute to this behavior is that the ACC dataset has more images than the ALB dataset (202 and 111, respectively), and the classifier can be more stable as the number of images increases.

It is quite difficult to compare the results obtained by different studies in this research area, as the studies differ in the datasets, images, and segmentation methods used prior to the classification stage, as well as in the evaluation metrics. However, the results obtained in the present study are comparable to some of the most recent studies found in the literature, as can be seen in Section III. For instance, the best accuracies obtained in this study (96.42% to 100%) are similar to the accuracies presented in [21] (99% to 100%) and [22] (98%) and superior to the accuracy presented in [23] (75%).

C. Comparison of the Grammatical and Non-Grammatical Classifiers

To compare the results achieved using the grammar approach with other classifiers, the images from the ALB and ACC datasets were classified using the following models: Artificial Neural Network (ANN), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF) and Light Gradient Boosting Machine (LGBM). More precisely, three experiments were performed for each feature category (shape features, Hu moments and both): one cross-validation with the two datasets combined (the same fold division used for the grammatical approach) and two experiments training with one dataset and testing with the other. Furthermore, the same features used for the grammatical approach were also used by the non-grammatical classifiers. Before training the models, all feature values were scaled using min-max normalization, so that their new values ranged from 0 to 1.
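The min-max normalization mentioned above maps each feature value v to (v - min) / (max - min). The following is a minimal sketch of that formula, not the code used in the experiments. The text does not state whether the minimum and maximum were estimated on the training data only; the sketch assumes so, and the function names and sample values are hypothetical.

```python
def fit_min_max(rows):
    """Return per-feature minima and maxima estimated from the given rows."""
    n_features = len(rows[0])
    mins = [min(r[j] for r in rows) for j in range(n_features)]
    maxs = [max(r[j] for r in rows) for j in range(n_features)]
    return mins, maxs


def apply_min_max(rows, mins, maxs):
    """Scale each feature with (v - min) / (max - min).

    A constant feature (max == min) is mapped to 0.0 to avoid dividing by zero.
    """
    return [
        [0.0 if hi == lo else (v - lo) / (hi - lo)
         for v, lo, hi in zip(r, mins, maxs)]
        for r in rows
    ]


# Hypothetical training samples with two features each.
train = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
mins, maxs = fit_min_max(train)
scaled_train = apply_min_max(train, mins, maxs)

# A hypothetical unseen sample, scaled with the training statistics.
scaled_test = apply_min_max([[8.0, 5.0]], mins, maxs)
```

Under this assumption the training values fall in [0, 1], whereas an unseen sample can fall outside that range when its raw value lies beyond the training minimum or maximum.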
Table I shows, for each classifier induction algorithm, the tested hyperparameter values. Fig. 6 shows the classification results for all tested scenarios.

Comparing the proposed grammatical approach with the other classifiers in the task of classifying masses as benign or malignant, the results achieved by the proposed approach surpassed those of the other models. In particular, Fig. 6 shows that the grammar-based approach was able to classify the masses using Hu moments as input, while the other methods showed more difficulty in this task.

Considering the non-grammatical approaches, the best accuracies were 92.32% and 92.57% with LGBM and ANN, respectively, when shape features were used. When only the Hu moments were employed, the non-grammatical approaches had their worst performance, showing that these classifiers were not able to learn the pattern of benign and malignant masses with this type of feature.

Tables II, III and IV summarize the comparison between the proposed method and the non-grammatical approaches. Besides accuracy, these tables show the following performance measures: sensitivity, specificity, F-score and the Matthews correlation coefficient (MCC).

The superior results of the grammar approach are likely related to the number of images in the dataset. It seems that the grammar-based classifiers demand fewer images to learn the pattern of benign and malignant masses than the other classifiers, as previously found in [6].
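As a side note on the performance measures listed for Tables II, III and IV, the sketch below computes them from the four cells of a binary confusion matrix using their standard definitions. It assumes that malignant is the positive class and that F-score denotes the F1 score, neither of which is stated in this section; the function name and the counts are hypothetical and are not results from this study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, F1 and MCC from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 when a denominator is empty instead of raising an error.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)           # true positive rate
    specificity = ratio(tn, tn + fp)           # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f_score = ratio(2 * tp, 2 * tp + fp + fn)  # F1, harmonic mean of precision and recall
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_score": f_score,
        "mcc": mcc,
    }


# Hypothetical counts: 50 malignant and 50 benign test images.
example = binary_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which makes it a useful complement when the test set of a fold is small.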
A possible reason for the performance achieved by the grammar-based models is the set of rules chosen to create the grammars that represent the benign and malignant masses, together with the fact that the classifier is composed of two distinct parsers (created from the grammars used to represent benign and malignant masses) that compute the probability that the sentence representing the mass (a sequence of tokens/features) belongs to the language of each grammar. For this reason, grammatical methods could be used more often in this classification problem, especially when a large number of images is not available.

Despite the good results obtained by the grammar classifiers, the proposed approach has its drawbacks. The first limitation is the need to extract features from mass boundaries marked by a specialist or by an automatic segmentation process. The images in this study had their boundaries marked by specialists, since developing a segmentation process is itself a very challenging research area, especially for dense breasts. Another issue that can severely impact the results when grammar classifiers are used is the discretization process, as discussed in Section VI-A. However, this issue can be overcome by using a proper algorithm, as demonstrated in [6]. Additionally, a deep knowledge of the application is required to compose the process and perform the previously cited actions. This can decrease the ability of the grammar-based models to generalize to other application areas, which can sometimes be more easily reached by machine learning methods. These drawbacks can be more significant when comparing the proposed approach with deep learning techniques, where there is no need for prior segmentation or a discretization process.

TABLE II
PERFORMANCE MEASURES ACHIEVED BY EACH MODEL USING SHAPE FEATURES: SENSITIVITY (SEN.), SPECIFICITY (SPE.), ACCURACY (ACC), F-SCORE AND MATTHEWS CORRELATION COEFFICIENT (MCC). COMBINED DATASET: TRAINING = BOTH, TEST = BOTH; ACC → ALB DATASETS: TRAINING = ACC, TEST = ALB; ALB → ACC DATASETS: TRAINING = ALB, TEST = ACC

TABLE III
PERFORMANCE MEASURES ACHIEVED BY EACH MODEL USING HU MOMENTS: SENSITIVITY (SEN.), SPECIFICITY (SPE.), ACCURACY (ACC), F-SCORE AND MATTHEWS CORRELATION COEFFICIENT (MCC). COMBINED DATASET: TRAINING = BOTH, TEST = BOTH; ACC → ALB DATASETS: TRAINING = ACC, TEST = ALB; ALB → ACC DATASETS: TRAINING = ALB, TEST = ACC

TABLE IV
PERFORMANCE MEASURES ACHIEVED BY EACH MODEL USING THE COMBINED FEATURES: SENSITIVITY (SEN.), SPECIFICITY (SPE.), ACCURACY (ACC), F-SCORE AND MATTHEWS CORRELATION COEFFICIENT (MCC). COMBINED DATASET: TRAINING = BOTH, TEST = BOTH; ACC → ALB DATASETS: TRAINING = ACC, TEST = ALB; ALB → ACC DATASETS: TRAINING = ALB, TEST = ACC

D. Advantages of the Grammatical Classifiers

The first advantage of our approach when compared with the other classifiers, and especially with deep learning techniques, is that the pattern of benign and malignant masses can be learned from a small sample of images. For instance, the ALB dataset has 111 images while the ACC dataset has 202 images.
A deep learning approach to the same problem would demand many more images in order to learn all the features and weights necessary to represent the benign and malignant patterns [32]. To achieve results similar to the ones presented in this study, deep learning strategies have been combined with data augmentation to increase the number of images, or with other techniques such as transfer learning [22], [33].

Next, the grammar-based classifiers seem to be more stable than the non-grammatical classifiers used in this study, since their accuracies vary less than those obtained by the other models (Fig. 6).

Another advantage of our approach is the simple representation of the masses used to build the classifiers, which can be seen as a hierarchical structure (Fig. 3) showing what the algorithm does, especially when compared to artificial neural networks and other more complex approaches that are closer to black boxes. Using Fig. 3 as an example, we can easily explain to health professionals that dashed circles represent an “OR” node/rule, solid circles represent an “AND” node/rule, and squares are the leaf nodes with the feature values. Thus, we believe that even with no technical knowledge about computational grammars, a health professional can understand the rules used to classify a mass as benign or malignant.

VII. CONCLUSION

This article presented a syntactic approach to classify masses found in mammograms. Two datasets were used: i) the first contains 111 images and was provided by researchers from the University of Calgary (Canada); ii) the other contains 202 images and was provided by researchers from the A.C.Camargo Cancer Center (Brazil).
The results show that the syntactic approach is robust, can learn the pattern of benign and malignant masses even with small samples of images, and had superior results when compared with artificial neural network, support vector machine, k-nearest neighbors, and random forest techniques. Furthermore, the results obtained were similar to those reported in some of the most recent studies in this area of research.

Considering all the experiments performed, the best accuracy achieved by the grammar-based classifier was 100%, while the best accuracies achieved by the ANN, SVM, KNN, RF and LGBM were 92.57%, 90.98%, 89.91%, 90.38% and 92.32%, respectively. Furthermore, the shape features seem to be better suited to this classification process, as the accuracies tend to be higher when they are used.

Although the syntactic approach depends on good data discretization, the cross-dataset experiments showed no sign of overfitting: the tests yielded accuracies from 96.42% to 100% when training with images from one dataset and testing with images from the other. This may indicate that the proposed approach has potential for classifying structures in contexts similar to that shown in this article.

REFERENCES
[1] WHO - World Health Organization. Accessed: Apr. 17, 2020. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cancer
[2] R. Pedro, A. Machado-Lima, and F. Nunes, “Is mass classification in mammograms a solved problem? - A critical review over the last 20 years,” Expert Syst. Appl., vol. 119, pp. 90–103, 2019.
[3] R. Pedro, F. Nunes, and A. Machado-Lima, “Using grammars for pattern recognition in images: A systematic review,” ACM Comput. Surv., vol. 46, no. 2, pp. 26:1–26:34, Nov. 2013. [Online]. Available: http://doi.acm.org/10.1145/2543581.2543593
[4] A. Tahmasbi, F. Saki, and S. Shokouhi, “CWLA: A novel cognitive classifier for breast mass diagnosis,” in Proc. 18th Iranian Conf. Biomed. Eng., 2011, pp. 255–259.
[5] R. Pedro, A. Machado-Lima, and F. Nunes, “A new syntactic approach for masses classification in digital mammograms,” in Proc. IEEE 32nd Int. Symp. Comput.-Based Med. Syst., 2019, pp. 385–390.
[6] R. Pedro, A. Machado-Lima, and F. Nunes, “Towards an approach using grammars for automatic classification of masses in mammograms,” Comput. Intell., vol. 37, no. 4, pp. 1515–1544, 2021.
[7] R. Rangayyan, N. Mudigonda, and J. Desautels, “Boundary modelling and shape analysis methods for classification of mammographic masses,” Med. Biol. Eng. Comput., vol. 38, no. 5, pp. 487–496, 2000. [Online]. Available: http://dx.doi.org/10.1007/BF02345742
[8] G. Stiglic, S. Kocbek, I. Pernek, and P. Kokol, “Comprehensive decision tree models in bioinformatics,” PLoS One, vol. 7, 2012, Art. no. e33812.
[9] M. B. Mainiero et al., “ACR appropriateness criteria breast cancer screening,” J. Amer. College Radiol., vol. 14, no. 11, pp. S383–S390, Nov. 2017. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1546144017310992
[10] H. D. Nelson, M. Pappas, A. Cantor, J. Griffin, M. Daeges, and L. Humphrey, “Harms of breast cancer screening: Systematic review to update the 2009 U.S. preventive services task force recommendation,” Ann. Intern. Med., vol. 164, no. 4, Feb. 2016, Art. no. 256. [Online]. Available: http://annals.org/article.aspx?doi=10.7326/M15-0970
[11] N. Mudigonda, R. Rangayyan, and J. Desautels, “Concavity and convexity analysis of mammographic masses via an iterative boundary segmentation algorithm,” in Proc. Eng. Solutions Next Millennium IEEE Can. Conf. Elect. Comput. Eng., 1999, pp. 1489–1494.
[12] J. Suckling et al., “The mammographic image analysis society digital mammogram database,” in Proc. 2nd Int. Workshop Digit. Mammogr., 1994, pp. 375–378.
[13] C. Alberta Cancer Board, Alberta, “Screen test: Alberta program for the early detection of breast cancer,” 2001/03 - Biennial Report, 2004. [Online]. Available: http://www.cancerboard.ab.ca/screentest
[14] R. Nandi, A. Nandi, R. Rangayyan, and D. Scutt, “Classification of breast masses in mammograms using genetic programming and feature selection,” Med. Biol. Eng. Comput., vol. 44, no. 8, pp. 683–694, 2006.
[15] R. Rangayyan and T. Nguyen, “Fractal analysis of contours of breast masses in mammograms,” J. Digit. Imag., vol. 20, no. 3, pp. 223–237, 2007.
[16] N. Mudigonda, R. Rangayyan, and J. Desautels, “Gradient and texture analysis for the classification of mammography masses,” IEEE Trans. Med. Imag., vol. 19, no. 10, pp. 1032–1043, Oct. 2000.
[17] N. Azizi, N. Zemmal, M. Sellami, and N. Farah, “A new hybrid method combining genetic algorithm and support vector machine classifier: Application to CAD system for mammogram images,” in Proc. Int. Conf. Multimedia Comput. Syst., 2014, pp. 415–420.
[18] N. Zemmal, N. Azizi, and M. Sellami, “CAD system for classification of mammographic abnormalities using transductive semi supervised learning algorithm and heterogeneous features,” in Proc. 12th Int. Symp. Program. Syst., 2015, pp. 1–9.
[19] R. Chaieb, A. Bacha, K. Kalti, and F. B. Lamine, “Image features extraction for masses classification in mammograms,” in Proc. 6th Int. Conf. Soft Comput. Pattern Recognit., 2014, pp. 203–208.
[20] University of South Florida, “Digital database for screening mammography,” 2004. [Online]. Available: http://marathon.csee.usf.edu/Mammography/Database.html
[21] F. Mohanty, S. Rup, and B. Dash, “Automated diagnosis of breast cancer using parameter optimized kernel extreme learning machine,” Biomed. Signal Process. Control, vol. 62, 2020, Art. no. 102108.
[22] A. Saber, M. Sakr, O. M. Abo-Seida, A. Keshk, and H. Chen, “A novel deep-learning model for automatic detection and classification of breast cancer using the transfer-learning technique,” IEEE Access, vol. 9, pp. 71194–71209, 2021.
[23] M. Heidari et al., “Applying a random projection algorithm to optimize machine learning model for breast lesion classification,” IEEE Trans. Biomed. Eng., vol. 68, no. 9, pp. 2764–2775, Sep. 2021.
[24] U. Ramer, “An iterative procedure for the polygonal approximation of plane curves,” Comput. Graph. Image Process., vol. 1, no. 3, pp. 244–256, 1972.
[25] R. Hirama, R. Pedro, V. Graciliano, A. Machado-Lima, and F. Nunes, “Evaluating the impact of polygonal representations on mass classification,” in Proc. Int. Conf. Syst. Signals Image Process., 2020, pp. 75–80.
[26] M.-K. Hu, “Visual pattern recognition by moment invariants,” IRE Trans. Inf. Theory, vol. 8, no. 2, pp. 179–187, 1962.
[27] B. Menze et al., “A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data,” BMC Bioinf., vol. 10, no. 1, Jul. 2009, Art. no. 213. [Online]. Available: https://doi.org/10.1186/1471-2105-10-213
[28] M. Ribeiro, M. Ferreira, C. Traina Jr., and A. Traina, “Data pre-processing: A new algorithm for feature selection and data discretization,” in Proc. 5th Int. Conf. Soft Comput. Transdisciplinary Sci. Technol., New York, NY, USA, 2008, pp. 252–257. [Online]. Available: http://doi.acm.org/10.1145/1456223.1456277
[29] S. Zhu and D. Mumford, “A stochastic grammar of images,” Found. Trends Comput. Graph. Vis., vol. 2, no. 4, pp. 259–362, Jan. 2006. [Online]. Available: http://dx.doi.org/10.1561/0600000018
[30] K. Fu, Syntactic Pattern Recognition and Applications. Englewood Cliffs, NJ, USA: Prentice-Hall, 1982.
[31] J. Earley, “An efficient context-free parsing algorithm,” Commun. ACM, vol. 13, no. 2, pp. 94–102, Feb. 1970.
[32] H. N. Khan, A. R. Shahid, B. Raza, A. H. Dar, and H. Alquhayz, “Multi-view feature fusion based four views model for mammogram classification using convolutional neural network,” IEEE Access, vol. 7, pp. 165724–165733, 2019.
[33] S. J. Malebary and A. Hashmi, “Automated breast mass classification system using deep learning and ensemble learning in digital mammogram,” IEEE Access, vol. 9, pp. 55312–55328, 2021.

Ricardo Wandré Dias Pedro received the bachelor’s degree in computer science and the MSc degree in information systems, both from the University of Sao Paulo - Brazil, in 2009 and 2013, respectively. He is currently working toward the PhD degree with the University of Sao Paulo - Brazil. His current research interests include machine learning and syntactic pattern recognition.

Ana Luiza Silveira Ferreira received the MD degree from Faculdades Integradas Pitágoras de Montes Claros - Brazil, in 2017. She completed her medical residency in radiology and diagnostic imaging (2018-2021) at the A.C.Camargo Cancer Center - Brazil and a fellowship (R4) in general magnetic resonance at the Santa Izabel Hospital - Brazil (2021-2022), and is in a fellowship (R5) in women’s imaging at the CAM Group - Brazil (2022).

Rodolph Vinicius Siqueira Pessoa received the medical degree from the University of the State of Rio Grande do Norte - Brazil, in 2019. He completed a medical residency in radiology and diagnostic imaging with Fundação Antônio Prudente – A.C. Camargo Cancer Center - Brazil (2022).
Currently, he is a fellow with the Interventional Radiology and Angioradiology Program at the same institution. His current research interests include oncology, interventional radiology, and breast imaging.

Almir Galvão Vieira Bitencourt received the MD degree from the Federal University of Bahia, in 2007, and the PhD degree in oncology from the Antônio Prudente Foundation / A.C.Camargo Cancer Center - Brazil, in 2012. He completed his medical residency in radiology and diagnostic imaging (2008-2010) and a fellowship in breast imaging (2011) at the A.C.Camargo Cancer Center - Brazil. He currently works as a radiologist and researcher at the same institution in the following areas: diagnostic imaging, oncology, and breast imaging.

Ariane Machado-Lima received the MSc degree in computer science and the PhD degree in bioinformatics, both from the University of Sao Paulo - Brazil, in 2002 and 2006, respectively, with a sandwich PhD period at the Washington University School of Medicine - USA. She is a computer scientist with the University of Sao Paulo - Brazil. Her current research interests include machine learning and syntactic pattern recognition.

Fátima L. S. Nunes received the bachelor’s degree in computer science, and the master’s and PhD degrees in sciences, in the Graphics area. She is a full professor with the University of São Paulo, where she teaches and supervises master’s and PhD students in computer science and computer engineering. Her main research areas in computer science are virtual reality, image processing, and content-based image retrieval. Most of her research is related to solving real problems in the health area. Currently, she is working on problems related to quality of life involving computational techniques to aid the diagnosis of Autism Spectrum Disorder, breast cancer, cardiomyopathies, and mild cognitive impairment.
She has also been developing techniques to allow fast development and automatic adaptation of serious games applied to rehabilitation processes.
/NOR <FEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200073006f006d002000650072002000650067006e0065007400200066006f00720020007000e5006c006900740065006c006900670020007600690073006e0069006e00670020006f00670020007500740073006b007200690066007400200061007600200066006f0072007200650074006e0069006e006700730064006f006b0075006d0065006e007400650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002e> /PTB <FEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f00620065002000500044004600200061006400650071007500610064006f00730020007000610072006100200061002000760069007300750061006c0069007a006100e700e3006f002000650020006100200069006d0070007200650073007300e3006f00200063006f006e0066006900e1007600650069007300200064006500200064006f00630075006d0065006e0074006f007300200063006f006d0065007200630069006100690073002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002e> /SUO 
<FEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a0061002c0020006a006f0074006b006100200073006f0070006900760061007400200079007200690074007900730061007300690061006b00690072006a006f006a0065006e0020006c0075006f00740065007400740061007600610061006e0020006e00e400790074007400e4006d0069007300650065006e0020006a0061002000740075006c006f007300740061006d0069007300650065006e002e0020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002e> /SVE <FEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400200073006f006d00200070006100730073006100720020006600f60072002000740069006c006c006600f60072006c00690074006c006900670020007600690073006e0069006e00670020006f006300680020007500740073006b007200690066007400650072002000610076002000610066006600e4007200730064006f006b0075006d0065006e0074002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002e> /ENU (Use these settings to create PDFs that match the "Suggested" settings for PDF Specification 4.0) >> >> setdistillerparams<< /HWResolution [600 600] /PageSize [612.000 792.000] >> setpagedevice