2302 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 20, NO. 3, MAY/JUNE 2023
A Stochastic Grammar Approach to Mass
Classification in Mammograms
Ricardo Wandré Dias Pedro, Ana Luiza Silveira Ferreira, Rodolph Vinicius Siqueira Pessoa,
Almir Galvão Vieira Bitencourt, Ariane Machado-Lima, and Fátima L. S. Nunes
Abstract—Breast cancer is responsible for approximately 15% of all cancer-related deaths among women worldwide, and early and accurate diagnosis increases the chances of survival. Over the last decades, several machine learning approaches have been used to improve the diagnosis of this disease, but most of them require a large set of samples for training. Syntactic approaches have rarely been used in this context, although they can produce good results even when the training set has few samples. This article presents a syntactic approach to classify masses as benign or malignant. Features extracted from a polygonal representation of the masses were combined with a stochastic grammar approach to discriminate the masses found in mammograms. The results were compared with other machine learning techniques, and the grammar-based classifiers showed superior performance in the classification task. The best accuracies achieved ranged from 96% to 100%, indicating that grammatical approaches are robust and able to discriminate the masses even when trained with small samples of images. Syntactic approaches could be employed more frequently in the classification of masses, since they can learn the pattern of benign and malignant masses from a small sample of images while achieving results similar to the state of the art.
Index Terms—Breast cancer, classification, diagnosis,
mammogram, pattern recognition, stochastic grammars, syntactic
approach.
I. INTRODUCTION
BREAST cancer is the most common type of cancer according to the World Health Organization (WHO) [1]. Some of the most used machine learning techniques to develop computer-aided diagnosis systems are artificial neural networks (including deep learning), support vector machines, k-nearest neighbors and random forests [2]. Few studies have explored the theory of formal languages to classify masses in mammographic images [3]. The syntactic approach has the advantage of providing a concise hierarchical representation of parts of the images and their relationships. Syntactic approaches are useful even when there are not numerous samples to learn the target pattern.
Manuscript received 2 July 2022; revised 26 December 2022; accepted 16 February 2023. Date of publication 22 February 2023; date of current version 5 June 2023. This work was supported in part by Brazilian National Council of Scientific and Technological Development (CNPq) under Grant #309030/2019-6; in part by CNPq and São Paulo Research Foundation (FAPESP): National Institute of Science and Technology – Medicine Assisted by Scientific Computing (INCT-MACC) under Grant #157535/2017-7. (Corresponding author: Ricardo Wandré Dias Pedro.)
Ricardo Wandré Dias Pedro is with the Electrical Engineering, Polytechnic School, University of São Paulo, São Paulo 05508, Brazil (e-mail: rwandre@usp.br).
Ana Luiza Silveira Ferreira, Rodolph Vinicius Siqueira Pessoa, and Almir Galvão Vieira Bitencourt are with the A.C.Camargo Cancer Center, São Paulo 01525-001, Brazil (e-mail: analusilveiraf@gmail.com; rodolph.vini@gmail.com; almir.bitencourt@accamargo.org.br).
Ariane Machado-Lima and Fátima L. S. Nunes are with the Information Systems, School of Arts, Sciences and Humanities, University of São Paulo, São Paulo 05508, Brazil (e-mail: ariane.machado@usp.br; fatima.nunes@usp.br).
Digital Object Identifier 10.1109/TCBB.2023.3247144
As far as we know, grammars have been used to discriminate masses only in [4], [5], and [6]. In [5] and [6] grammars are used to classify the masses directly, while in [4] the output generated by the syntactic analysis is used as input to an artificial neural network, which makes our approach more straightforward.
The goal of this paper is to show the ability of grammar-based classifiers to discriminate masses as benign or malignant without the aid of other machine learning methods.
This paper expands our previous studies by providing these
main contributions:
1) classification is based on features extracted from a polygonal representation of mass boundaries obtained with the Ramer-Douglas-Peucker algorithm, unlike our previous studies, where shape features were extracted from the polygonal model proposed in [7];
2) Hu moments were used as input features for the grammar-based classifiers, in addition to the shape features used in the previous studies;
3) in addition to the dataset used in the previous studies, a new dataset with 202 images was introduced, creating a combined dataset with 313 images;
4) the robustness of the syntactic approach was validated by training the models with images from one dataset and testing them with images from the other dataset.
The results show that the proposed model is able to learn the pattern of the masses even with a small number of training images. In addition, the resulting models can be more easily understood by physicians (similar to decision trees, whose structure describes the decision process [8]) and proved to be robust when dealing with images from different datasets.
II. BACKGROUND
A. Breast Cancer and Suspicious Findings
Breast cancer is the second most common malignant neoplasia. The decrease in mortality rates of this cancer depends on an adequate therapeutic plan based on screening programs and early detection. Mammography is still considered the preferred method for screening in average-risk women [9].
1545-5963 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: UNIVERSIDADE DE SAO PAULO. Downloaded on September 29,2023 at 12:39:34 UTC from IEEE Xplore. Restrictions apply.
Suspicious findings on screening mammography should be submitted to percutaneous biopsy to confirm malignancy and plan treatment. However, mammography interpretation is a challenge even for breast imaging specialists, and many of these findings are associated with benign conditions. False positive results are one of the main limitations of mammography screening, as they are associated with increased patient anxiety, invasive procedures, and increased healthcare costs [10].
Mass lesions found on mammography are challenging due to the wide range of possible diagnoses. The risk of malignancy is evaluated based on the mass shape, margins and density. However, the overlapping imaging features of benign and malignant lesions make it hard to avoid histological analysis, resulting in a large number of benign biopsies.
B. Grammars
Grammars are a formalism derived from the theory of formal
languages developed to represent a set of sequences (language).
This theory can be applied to deal with problems in different
areas, for example, to understand the content of images. In this
paper a stochastic context-free grammar was used.
A stochastic context-free grammar is a quintuple $G_s = (V_N, V_T, R, S, P)$, where:
• $V_N$ is a set containing the non-terminal symbols, which are auxiliary symbols used to structure the grammar rules;
• $V_T$ is a set containing the terminal symbols, which are used to represent each sequence;
• $R$ is a set of substitution rules, where the rules follow the pattern $A \rightarrow \beta$, $A \in V_N$, $\beta \in (V_T \cup V_N)^*$;
• $S \in V_N$ is the initial symbol of the grammar;
• $P$ is the set of probability distributions over the rules having the same left side. Therefore, for each non-terminal $A$, considering all rules $\{A \rightarrow \beta_i, p_i\} \in R$, $\beta_i \in (V_N \cup V_T)^*$, we have $\sum_i p_i = 1$.
A parser is an algorithm that, for a given sequence and grammar, provides at least one syntactic tree if the sequence belongs to the language generated by the grammar; otherwise it produces an error. A parser for a stochastic grammar provides not only the syntactic trees but also the probability associated with these trees.
Given a stochastic grammar $G_s$ and a sequence $x$, $P(x, t \mid G_s)$ is the probability of $x$ being derived through the syntactic tree $t$, given by the product of the probabilities of the rules of $G_s$ used in the parsing process. In this study, we focus on the tree that maximizes this probability: $t_{max} = \arg\max_t P(x, t \mid G_s)$.
In a classification context, a stochastic grammar $G_{c_j}$ is used to represent each class $c_j$ of the problem, $j = 1, \ldots, n$. Then, a new instance (sequence) $x$ is classified as belonging to the class $c_i$ that maximizes $P(x, t_{max} \mid G_{c_j})$, $j = 1, \ldots, n$.
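The classification rule above can be illustrated with a small sketch. It assumes two toy grammars whose rules and probabilities are invented purely for illustration (they are not taken from the study), and it assumes each sequence has a single syntactic tree under each grammar, so that tree is trivially the maximizing one. The names `tree_probability`, `derivation` and `classify` are hypothetical helpers.

```python
from math import prod

# Hypothetical toy grammars: each maps a rule (as a string) to its probability.
# Rules sharing the same left side sum to one, as required by the definition.
G_BENIGN = {
    "S -> F1 F2": 1.0,
    "F1 -> co1": 0.8, "F1 -> co2": 0.2,
    "F2 -> fc1": 0.7, "F2 -> fc2": 0.3,
}
G_MALIGNANT = {
    "S -> F1 F2": 1.0,
    "F1 -> co1": 0.3, "F1 -> co2": 0.7,
    "F2 -> fc1": 0.4, "F2 -> fc2": 0.6,
}


def tree_probability(rules_used, grammar):
    """P(x, t | G): product of the probabilities of the rules used in tree t."""
    return prod(grammar[rule] for rule in rules_used)


def derivation(sequence):
    """Rules of the only tree that derives a two-token sequence in the toy grammars."""
    first, second = sequence
    return ["S -> F1 F2", f"F1 -> {first}", f"F2 -> {second}"]


def classify(sequence, grammars):
    """Return the class whose grammar gives the highest tree probability, plus all scores."""
    scores = {
        label: tree_probability(derivation(sequence), grammar)
        for label, grammar in grammars.items()
    }
    return max(scores, key=scores.get), scores


GRAMMARS = {"benign": G_BENIGN, "malignant": G_MALIGNANT}
label, scores = classify(("co1", "fc1"), GRAMMARS)
```

With an ambiguous grammar, `derivation` would have to be replaced by a parser that searches for the most probable tree; the decision step itself would stay the same.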
III. RELATED WORK
Over the last decades, numerous studies have addressed mass classification in digital mammograms by using different combinations of techniques, datasets and features extracted from them. The results vary and are usually reported as accuracy or area under the Receiver Operating Characteristic (ROC) curve (AUC). When features are extracted for subsequent classification, different feature categories are considered, usually depending on the image set evaluated [2].
A feature considering the concavities of the mass boundaries was used in [7], [11] together with compactness and spiculation index to classify masses considering the classes benign, malignant, circumscribed or spiculated. The accuracy was 81% in [11] and 91% in [7]. The studies were performed with 53 images from the Mammographic Image Analysis Society database (MIAS) [12] and from the Alberta Program for the Early Detection of Breast Cancer database (ALB) [13]. In [14], the authors used genetic programming in the classification process, and the approach was able to correctly classify 95% of the benign masses and 97.3% of the malignant ones. The features (edge-sharpness, shape and texture features) were extracted from 57 images from the ALB dataset. Fractal dimensions, compactness, spiculation index and fractional concavity were used in [15] to classify the masses, achieving an AUC of 0.92 on 111 images from the MIAS and ALB datasets. In [16] a combination of features based on gradient and texture was employed as input to the classifier, obtaining an AUC of 0.76 when applied to images from the MIAS and ALB datasets.
Hu moments were used as shape descriptors in [17], [18], [19]. The authors of [17], [18] used texture features and central and Hu moments extracted from 200 images from the Digital Database for Screening Mammography (DDSM) [20] as input to a support vector machine algorithm, achieving an accuracy of 93%. A comparison of different descriptors was performed in [19], showing the best performance for texture features extracted from Gray Level Run Length Matrices. For shape features, the best results were achieved using Hu moments on 322 images from the MIAS dataset.
A syntactic approach was used in [4] to classify masses as benign or malignant. The output of a syntactic analysis was used as input to an artificial neural network. Texture features and Zernike moments were used as features, achieving an AUC of 0.86. More recently, in [5] a new syntactic approach was developed using shape and texture features to classify 111 images belonging to two distinct datasets, obtaining accuracies from 96% to 100% depending on the model and the dataset used. In [6] the results achieved with grammars were compared with the results of other classifiers, such as artificial neural network (ANN), support vector machine (SVM), k-nearest neighbors (KNN) and random forest (RF), and grammars outperformed these techniques in this problem.
Although we use grammars to classify masses, other approaches are also being used to handle this classification problem. In [21] an optimized kernel extreme learning machine was proposed and used with Haralick’s features as input. The classifier achieved accuracies ranging from 99% to 100% for binary classification and from 87% to 100% for multi-class classification with 150 images from the DDSM dataset. The study [22] proposed an approach where deep learning techniques, especially Convolutional Neural Networks and Transfer Learning, were used together with Softmax and SVM classifiers to detect and classify masses as benign, malignant or normal. The study used a dataset provided by MIAS consisting of 322 images (62 benign, 51 malignant and 209 normal), achieving a best accuracy of 98%. SVM models were employed in [23] to classify masses as benign or malignant. The authors showed that, when the features were selected using a random projection algorithm (RPA), the accuracy of 75% outperformed that of SVM models using other feature selection mechanisms.
Fig. 1. Tasks executed in order to classify masses as benign or malignant using the proposed grammatical approach.
IV. PROPOSED GRAMMAR APPROACH
Fig. 1 shows the pipeline of tasks used in this study to discriminate masses as benign or malignant. The next sections present these tasks in detail.
A. Polygonal Model Representation
The Ramer-Douglas-Peucker (RDP) algorithm [24] was used to generate a polygonal representation from the original boundaries delimited by specialists. The polygonal representation is used to extract the shape features and the Hu moments.
This algorithm has two inputs: the threshold ε and the $N$ points representing the mass boundary. Initially, the $N$ points ($P_1$ to $P_N$) are considered as forming the boundary. Next, $P_1$ and $P_N$ (the first and the last points) are connected with a straight line. Then, the point $P_i$ ($i = 2, \ldots, N-1$) most distant from the line segment $P_1P_N$ is found. If this distance is greater than the threshold ε, the segment $P_1P_N$ is divided into $P_1P_i$ and $P_iP_N$. This step is repeated for each of the two new segments for as long as some point lies farther than ε from its segment. The endpoints of the final segments are used in the polygonal model, while the other points are discarded. In this study we used ε = 0.4% of the perimeter of the mass boundary, since the RDP proved to be more robust with this value of ε, as shown in [25].
Fig. 2 shows the original boundaries of benign and malignant masses and their polygonal representation.
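The recursive procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes the boundary is an ordered list of (x, y) points treated as an open chain from the first to the last point, and the function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, eps):
    """Ramer-Douglas-Peucker simplification of an ordered list of points."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the segment first-last.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax > eps:
        # Split at the farthest point and simplify both halves.
        left = rdp(points[: index + 1], eps)
        right = rdp(points[index:], eps)
        return left[:-1] + right  # the split point would otherwise appear twice
    return [first, last]


def closed_perimeter(points):
    """Perimeter of the closed contour through the points, in order."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


boundary = [(0, 0), (1, 0.05), (2, 0), (2, 1), (2, 2)]
simplified = rdp(boundary, eps=0.1)
```

To mirror the setting reported in the text, the threshold could be passed as `eps = 0.004 * closed_perimeter(boundary)`.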
B. Feature Extraction
To handle the mass classification problem, shape features and
Hu moments were used as input to the classifiers. A total of
eight shape features and seven Hu moments were extracted from
313 gray-level images. These shape features were originally
used in [7], [11], [14], [15], [16], while Hu moments were used
in [17], [18], [19].
1) Shape Features: The shape features used in this study are
briefly described here. Additional details of how these features
were computed can be found in [25].
a) Compactness (CC): It is a measure of the efficiency of a given contour in covering a specific area [7], [15]. This feature is important because malignant masses tend to possess a higher value for it than benign ones [11]. It was computed according to (1), where $P$ represents the perimeter of the mass and $A$ its area.
$$CC = 1 - \frac{4\pi A}{P^2} \tag{1}$$
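As an illustration of (1), the sketch below computes the compactness of a polygonal contour. It assumes a simple (non-self-intersecting) polygon given as ordered vertices and uses the shoelace formula for the area; the function names are hypothetical.

```python
import math


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Perimeter of the closed polygon through the vertices, in order."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def compactness(points):
    """CC = 1 - 4*pi*A / P^2, as in (1)."""
    area = polygon_area(points)
    perimeter = polygon_perimeter(points)
    return 1.0 - 4.0 * math.pi * area / (perimeter * perimeter)


unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
cc_square = compactness(unit_square)  # equals 1 - pi/4 for a unit square
```

A near-circular polygon gives a value close to 0, while elongated or jagged contours give values closer to 1.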
b) Spiculation Index (SI): This index measures how spiculated the contour of a mass is. Malignant masses tend to present more irregular boundaries, resulting in higher SIs. Equation (2), proposed in [7], is used to compute the index, where $S_i$ and $\theta_i$, for $i = 1, 2, \ldots, N$, are respectively the length and the angle of the two segments representing a spicule.
$$SI = \frac{\sum_{i=1}^{N} (1 + \cos\theta_i)\, S_i}{\sum_{i=1}^{N} S_i} \tag{2}$$
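Equation (2) is a length-weighted average of the factor (1 + cos θ), so narrow spicules (small angles) contribute values near 2 and flat ones (angles near π) contribute values near 0. A minimal sketch, assuming the spicule lengths and angles (in radians) have already been extracted; the function name is hypothetical:

```python
import math


def spiculation_index(lengths, angles):
    """SI of (2): length-weighted mean of (1 + cos(theta_i)); angles in radians."""
    weighted = sum((1.0 + math.cos(theta)) * s for s, theta in zip(lengths, angles))
    return weighted / sum(lengths)


si = spiculation_index([1.0, 3.0], [0.0, math.pi / 2])
```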
Fig. 2. (a) Example of a benign mass boundary; (b) polygonal model representing the benign mass using RDP algorithm with ε = 0.4%. (c) Example of an original malignant mass boundary; (d) polygonal model representing the malignant mass using RDP algorithm with ε = 0.4%.
c) Fractal Dimension (FD): It is a measure of how self-similar a pattern is and, in general, it can be used to describe the boundary complexity [15]. Equation (3) is used to compute the self-similarity FD, where $a$ is the number of self-similar parts obtained with a reduction factor of $1/s$ (which represents the measurement precision), as given by (4). FD can be estimated as the slope of a straight-line approximation to the plot of $\log(a)$ versus $\log(1/s)$ [15]. In this study four different measures of FD were employed. Two of them were obtained from the two-dimensional boundaries of the masses using the ruler and box counting methods [15]. The other two were computed from the one-dimensional signature of the boundaries using the same methods (ruler and box counting) [15]. The 1D signature of a boundary was defined as the radial distance from the centroid to each boundary point as a function of the index of the boundary point. To normalize each 2D boundary, the following steps were performed: the wider axis of the boundary was found and all values along that axis were normalized to range from 0 to 1. Then, the values along the other axis were also normalized, but this time based on the length of the wider axis. This normalization was applied because it preserves the ratio between the width and the height of the boundaries in the dataset. The 1D signatures were normalized to range between 0 and 1 on both axes [15].
$$FD = \frac{\log(a)}{\log(1/s)} \tag{3}$$
$$a = \frac{1}{s^{D}} \tag{4}$$
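The slope-based estimate of FD can be sketched with the box counting method. The code below is a simplified illustration, not the procedure of [15]: it assumes the boundary is given as a dense set of points already normalized to the unit square, counts occupied grid boxes for several box sizes, and fits a least-squares line to log(a) versus log(1/s). All names are hypothetical.

```python
import math


def box_count(points, s):
    """Number of grid boxes of side s that contain at least one point."""
    return len({(math.floor(x / s), math.floor(y / s)) for x, y in points})


def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def box_counting_fd(points, sizes):
    """Estimate FD as the slope of log(a) versus log(1/s)."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(points, s)) for s in sizes]
    return least_squares_slope(xs, ys)


SIZES = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 32]
line = [(i / 10000, 0.3) for i in range(10000)]  # densely sampled straight segment
fd_line = box_counting_fd(line, SIZES)           # close to 1 for a straight line
```

The point set must be sampled more finely than the smallest box size; otherwise empty boxes along the contour bias the estimate downward.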
d) Fractional Concavity (FC): This measure is based on the concave parts that a mass boundary contains. FC is useful in the classification process because benign masses tend to have more convex parts, while malignant masses are generally composed of concave and convex parts.
Equation (5) is used to compute the length of the boundary $T_l$, where $S_i$, $i = 1, 2, 3, \ldots, M$, is the length of each one of the $M$ segments that form the mass boundary. $CC_l$, given by (6), is the total length of the concave segments in the boundary, where $CC_i$, $i = 1, 2, 3, \ldots, P$, is the length of each one of the $P$ concave segments. Equation (7) gives the FC as described in [7], [11].
$$T_l = \sum_{i=1}^{M} S_i \tag{5}$$
$$CC_l = \sum_{i=1}^{P} CC_i \tag{6}$$
$$FC = \frac{CC_l}{T_l} \tag{7}$$
e) Fourier Factor (FF): It can be used to measure the presence of high-frequency components in a boundary, i.e., its roughness (pixel values that change rapidly in space) [15]. Equation (8) shows how this measure is computed, where $Z_0(k)$ are the normalized Fourier descriptors (obtained by (9)), $Z(k)$ are the Fourier descriptors computed according to (10) for $k = -N/2, \ldots, -1, 0, 1, 2, \ldots, N/2 - 1$, and $z(n) = x(n) + jy(n)$, $n = 0, 1, \ldots, N - 1$, represents the sequence of pixels in the boundary [15], where $N$ is the number of pixels in the boundary.
$$FF = 1 - \frac{\sum_{k=-N/2+1}^{N/2} |Z_0(k)|/|k|}{\sum_{k=-N/2+1}^{N/2} |Z_0(k)|} \tag{8}$$
$$Z_0(k) = \begin{cases} 0, & k = 0; \\ \dfrac{Z(k)}{|Z(1)|}, & \text{otherwise.} \end{cases} \tag{9}$$
$$Z(k) = \frac{1}{N} \sum_{n=0}^{N-1} z(n) \exp\left[-j \frac{2\pi}{N} nk\right] \tag{10}$$
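A direct, unoptimized sketch of (8) to (10) is given below. It makes a few assumptions not stated in the text: the boundary has an even number of points, it is traversed counter-clockwise (so that Z(1) is non-zero), and the k = 0 term is skipped in both sums because Z0(0) = 0. The function names are hypothetical.

```python
import cmath
import math


def fourier_descriptors(boundary):
    """Z(k) of (10) for k = -N/2+1 .. N/2, returned as a dict k -> complex."""
    n_pts = len(boundary)
    z = [complex(x, y) for x, y in boundary]
    descriptors = {}
    for k in range(-(n_pts // 2) + 1, n_pts // 2 + 1):
        total = sum(
            z[n] * cmath.exp(-2j * math.pi * n * k / n_pts) for n in range(n_pts)
        )
        descriptors[k] = total / n_pts
    return descriptors


def fourier_factor(boundary):
    """FF of (8), using the normalized descriptors Z0(k) of (9)."""
    descriptors = fourier_descriptors(boundary)
    scale = abs(descriptors[1])
    # Z0(0) = 0, so the k = 0 term adds nothing to either sum and is skipped.
    magnitudes = {k: abs(v) / scale for k, v in descriptors.items() if k != 0}
    numerator = sum(m / abs(k) for k, m in magnitudes.items())
    denominator = sum(magnitudes.values())
    return 1.0 - numerator / denominator


N_POINTS = 64
circle = [
    (math.cos(2 * math.pi * n / N_POINTS), math.sin(2 * math.pi * n / N_POINTS))
    for n in range(N_POINTS)
]
ff_circle = fourier_factor(circle)  # a smooth circle has almost no high-frequency content
```

For long boundaries, the quadratic-time sum in `fourier_descriptors` could be replaced by a fast Fourier transform.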
2) Hu Moments: These moments describe the spatial distribution of the points contained in an image or region and are used as shape descriptors because they are invariant to translation, rotation and scale of the shape. The definition of Hu moments is explained in detail in [26]. They are summarized in (11) to (15), where $I(x, y)$ is the pixel intensity at position $(x, y)$ of an image $I$ represented by a two-dimensional matrix, $(p + q)$ is the order of the moment, $\bar{x}$ and $\bar{y}$ are the components of the centroid, $\eta_{pq}$ are the central moments, $\mu_{pq}$ are the invariant moments and $H_1$ to $H_7$ are the Hu moments.
$$M_{ij} = \sum_x \sum_y x^i y^j I(x, y) \tag{11}$$
$$\bar{x} = \frac{M_{10}}{M_{00}}, \quad \bar{y} = \frac{M_{01}}{M_{00}} \tag{12}$$
$$\eta_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q I(x, y) \tag{13}$$
$$\mu_{pq} = \frac{\eta_{pq}}{\eta_{00}^{\gamma}}, \quad \gamma = \frac{p + q}{2} \tag{14}$$
$$\begin{aligned}
H_1 &= \mu_{20} + \mu_{02} \\
H_2 &= (\mu_{20} - \mu_{02})^2 + 4(\mu_{11})^2 \\
H_3 &= (\mu_{30} - 3\mu_{12})^2 + (\mu_{03} - 3\mu_{21})^2 \\
H_4 &= (\mu_{30} + \mu_{12})^2 + (\mu_{03} + \mu_{21})^2 \\
H_5 &= (\mu_{30} - 3\mu_{12})(\mu_{30} + \mu_{12})\left((\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right) \\
    &\quad + (3\mu_{21} - \mu_{03})(\mu_{21} + \mu_{03})\left(3(\mu_{30} + \mu_{12})^2 - (\mu_{03} + \mu_{21})^2\right) \\
H_6 &= (\mu_{20} - \mu_{02})\left((\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right) + 4\mu_{11}(\mu_{30} + \mu_{12})(\mu_{21} + \mu_{03}) \\
H_7 &= (3\mu_{21} - \mu_{03})(\mu_{30} + \mu_{12})\left((\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right) \\
    &\quad + (\mu_{30} - 3\mu_{12})(\mu_{21} + \mu_{03})\left(3(\mu_{30} + \mu_{12})^2 - (\mu_{03} + \mu_{21})^2\right)
\end{aligned} \tag{15}$$
In the present study the classifiers were created using only
the shape features (eight features), only the Hu moments (seven
features), as well as a combination of shape and Hu moments as
detailed in Section IV-C.
C. Feature Selection
The “Gini importance” was employed as a feature selection
mechanism in the present study. To compute this metric, a
Random Forest (RF) classifier was trained with 100 trees. The
training was performed using images from the training dataset
considering benign and malignant classes, and one of the outputs
of the classifier is the Gini importance. This value is used as a
measure of how often a feature is selected for a split based on
how discriminating the feature is for the classification [27].
In the scenario where the combination of features was used, we selected the eight most important features according to the Gini importance. For the other scenarios, where the classification was performed using only shape features or only Hu moments, the feature selection step was not employed.
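The quantity behind the Gini importance can be illustrated with a single split. The sketch below is a deliberate simplification: it computes the Gini impurity decrease produced by one threshold on one feature, whereas a random forest accumulates such decreases over all splits that use the feature in all trees. The data, threshold and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Gini impurity decrease obtained by splitting on value <= threshold."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted_children


labels = ["benign", "benign", "malignant", "malignant"]
informative = impurity_decrease([0.1, 0.2, 0.8, 0.9], labels, threshold=0.5)
uninformative = impurity_decrease([0.1, 0.8, 0.2, 0.9], labels, threshold=0.5)
```

A feature that separates the classes well yields large decreases and therefore ranks high, which is the basis for keeping the eight top-ranked features.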
D. Data Discretization
Data discretization is necessary since every feature value must be represented by a token or symbol in the grammars. For example, assume that compactness and fractional concavity can take the values {0.1, 0.3, 0.5} and {0.1, 0.2}, respectively. A discretization process could label the compactness values 0.1 and 0.3 as ‘co1’ and the value 0.5 as ‘co2’. Similarly, it could label the fractional concavity value 0.1 as ‘fc1’ and the value 0.2 as ‘fc2’. Thus, ‘co1 fc1’, ‘co1 fc2’, ‘co2 fc1’ and ‘co2 fc2’ are the possible sequences used to represent benign and malignant masses in this example.
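The toy example above can be reproduced in a few lines. The cut points 0.4 and 0.15 are assumptions chosen only so that the values fall into the bins named in the example; they are not produced by any discretization algorithm.

```python
# Each bin is (upper bound, token); the cut points are hypothetical.
CO_BINS = [(0.4, "co1"), (float("inf"), "co2")]
FC_BINS = [(0.15, "fc1"), (float("inf"), "fc2")]


def to_token(value, bins):
    """Return the token of the first bin whose upper bound is not exceeded."""
    for upper, token in bins:
        if value <= upper:
            return token
    raise ValueError("value does not fall into any bin")


def to_sequence(compactness, fractional_concavity):
    """Sequence of tokens that represents one mass in the toy example."""
    return f"{to_token(compactness, CO_BINS)} {to_token(fractional_concavity, FC_BINS)}"


sequences = sorted(
    {to_sequence(cc, fc) for cc in (0.1, 0.3, 0.5) for fc in (0.1, 0.2)}
)
```

Here `sequences` holds the four token sequences listed in the example.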
The Omega algorithm [28] was used to discretize the continuous features extracted from the masses in this study. This algorithm has two input parameters: Hmin and ζmax. Hmin determines the minimum number of elements that each bin must contain, i.e., the minimum number of values that each token must represent. ζmax represents the maximum inconsistency level; it specifies that two consecutive bins must be merged only when their elements have the same majority class and the inconsistency level of the new merged bin is below ζmax. In this project, the features were discretized considering the labels Circumscribed Benign (CB), Circumscribed Malignant (CM), Spiculated Benign (SB), and Spiculated Malignant (SM).
In [6] a calibration process was performed to find the best parameters for the algorithm, and the values Hmin = 2 and ζmax = 0.35 were found. In the present study we considered Hmin = 2, 4, 6, 8 and 10, since this parameter proved to have a major impact on the discretization process, and kept ζmax = 0.35, since it has only a minor impact.
E. Grammar Learning
AND-OR graphs were used to visually represent context-free
grammars. The internal nodes (AND/OR) are mapped to the
non-terminal symbols, while the leaf nodes are mapped to the
terminal symbols of the grammar. The AND nodes decompose
each entity into their parts and the OR nodes produce alternative
substructures [29].
Two AND-OR graphs are created, one representing the benign
masses and the other representing the malignant ones. The
generated graphs consider whether the masses are circumscribed
or spiculated. Fig. 3(a) shows an AND-OR graph with two
generic features (F1 and F2) in combination with two internal
labels CIRCUMSCRIBED (cF1 and cF2) or SPICULATED (sF1 and
sF2) to represent the masses. Fig. 3(b) shows the equivalent
context-free grammar of the AND-OR graph shown in Fig. 3(a).
In Fig. 3(a) two generic features were used as an example.
However, to represent the real benign and malignant masses,
the proposed approach creates AND-OR graphs considering all
the eight shape features, all the Hu moments and a combination
of shape features and Hu moments.
A maximum a posteriori algorithm proposed in [30] was used to convert the context-free grammars into stochastic context-free grammars. Each production rule has a pseudocounter that is first initialized with the value 0.1, so that no production rule has probability zero at the end of the process. Next, the initial grammar is used to parse all the sequences in the training set, and each time a production rule is used, its counter is incremented by one. After all the sequences are analyzed, the counters are normalized so that the probabilities of the production rules with the same left side sum to one.
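The counting and normalization steps described above can be sketched as follows. The sketch assumes the parsing has already been done, so each training sequence is given directly as the list of rules used in its tree; the rules and the tiny training set are hypothetical.

```python
from collections import defaultdict

PSEUDOCOUNT = 0.1  # initial value of every rule counter, as stated in the text


def estimate_rule_probabilities(rules, parsed_training_set):
    """Estimate rule probabilities from rule usage counts.

    rules: list of (left_side, right_side) tuples.
    parsed_training_set: one list of used rules per training sequence.
    """
    counts = {rule: PSEUDOCOUNT for rule in rules}
    for rules_used in parsed_training_set:
        for rule in rules_used:
            counts[rule] += 1
    # Normalize over the rules that share the same left side.
    totals = defaultdict(float)
    for (left_side, _), count in counts.items():
        totals[left_side] += count
    return {rule: count / totals[rule[0]] for rule, count in counts.items()}


RULES = [("F1", "co1"), ("F1", "co2"), ("F2", "fc1"), ("F2", "fc2")]
TRAINING = [
    [("F1", "co1"), ("F2", "fc1")],
    [("F1", "co1"), ("F2", "fc2")],
    [("F1", "co1"), ("F2", "fc1")],
]
probabilities = estimate_rule_probabilities(RULES, TRAINING)
```

In this toy set the rule F1 -> co2 is never used, yet it keeps a small non-zero probability because of the pseudocount.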
In order to parse the sequences that represent the masses, the Earley algorithm [31] for stochastic grammars was used. The classifier is composed of a stochastic parser generated from the grammar that represents the benign masses ($G_b$) and another generated from the grammar for malignant masses ($G_m$). Given a sequence $x$ representing a mass, the parsers provide $P(x, t^* \mid G_i)$, where $t^* = \arg\max_t P(x, t \mid G_i)$ over all syntactic trees $t$ of $x$ given $G_i$, and $i \in \{b, m\}$. When $P(x, t^* \mid G_b) > P(x, t^* \mid G_m)$ the mass is classified as benign; otherwise it is classified as malignant, which corresponds to the likelihood-ratio test.
Fig. 3. (a) Example of mass representation with AND-OR graph. The AND nodes are represented by the dashed circles, the OR nodes are represented by the
solid circles, and the squares are the leaf nodes. (b) The equivalent grammar of the AND-OR graph, where the terminal symbols are represented in italic. The
symbols ‘.’ and ‘|’ represent the logic conditions ‘AND’ and ‘OR,’ respectively.
V. EXPERIMENTAL EVALUATION
A. Datasets
This study used images from two datasets. The first one (ALB dataset) contains images from the Foothills Hospital in Calgary [7] (pixel size of 62 μm), images from the MIAS dataset (pixel size of 50 μm) and images from the Screen Test dataset [13] (pixel size of 50 μm). This dataset contains 111 images (66 benign and 46 malignant) in which specialists manually drew the boundaries of the masses. The images are labeled as circumscribed benign (CB), spiculated benign (SB), circumscribed malignant (CM) and spiculated malignant (SM).
The second dataset (ACC dataset) contains 202 images (104 benign and 98 malignant) from A. C. Camargo Cancer Center (a Brazilian hospital specialized in cancer research and treatment). The images have a pixel size of 70 μm and the boundaries of the masses were also marked by specialists. As in the first dataset, each mass was labeled as benign or malignant (classes) and as circumscribed or not circumscribed.
Although the images from the different datasets have different pixel sizes, no processing was performed to make the pixel size of the two datasets consistent, because the feature extraction methods are invariant to image resolution, since the features are not extracted directly from the pixels.
B. Evaluation Scenarios
Different scenarios were built to evaluate the ability of the grammar-based classifiers to classify masses from different datasets. Three experiments were performed using stratified k-fold cross-validation: using only the ALB dataset (k = 23), using only the ACC dataset (k = 20) and using the combined dataset (k = 20). In these scenarios, the feature selection was performed without considering the test fold. Two additional experiments were performed by training with one dataset and testing with the other. For these additional experiments, the feature selection considered only the training dataset.
Although k-fold cross-validation experiments are traditionally performed using k = 10, we used higher values in order to increase the number of training images in each fold, since both the ALB and ACC datasets have a limited number of images. Furthermore, the different values of k (20 and 23) are intended to better accommodate the image distribution across the folds. For instance, for the ALB dataset with k = 23, the classifier is trained with around 106 images and tested with around five images in each fold iteration. The grammar-based classifiers were evaluated in terms of accuracy, since the datasets used are balanced.
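The fold sizes quoted above follow from simple arithmetic, sketched below. The helper spreads the images as evenly as possible over the k folds and ignores the per-class stratification, so it only illustrates the train/test sizes; the function name is hypothetical.

```python
def fold_sizes(n_images, k):
    """Return a list of (train_size, test_size) pairs, one per fold."""
    base, extra = divmod(n_images, k)
    # The first `extra` folds receive one additional test image.
    test_sizes = [base + 1 if i < extra else base for i in range(k)]
    return [(n_images - test, test) for test in test_sizes]


alb_folds = fold_sizes(111, 23)       # ALB dataset, k = 23
acc_folds = fold_sizes(202, 20)       # ACC dataset, k = 20
combined_folds = fold_sizes(313, 20)  # combined dataset, k = 20
```

For ALB this gives test folds of four or five images and training sets of 106 or 107 images, in line with the figures mentioned in the text.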
VI. RESULTS AND DISCUSSION
A. The Effect of Discretization Process
Fig. 4 shows the effect of the discretization process on the overall classification. As the value of the Hmin parameter increases, the accuracy tends to decrease. The greatest difference was observed using Hu moments when the training was performed with the ACC dataset and the tests were performed with the ALB dataset (Hu_ACC_ALB in Fig. 4), where the accuracy decreased from 96.42% (using Hmin = 2) to 62.16% (using Hmin = 10). The smallest difference between the accuracy values was obtained when the combination of shape features and Hu moments was evaluated using the ACC dataset (Shape_Hu_ACC in Fig. 4), with an accuracy of 100% when Hmin = 2 and 98.05% when Hmin = 10.
The effect of the discretization was also reported in [6]. It happens because, when the value of Hmin increases, the number of bins decreases, so each bin contains a mixture of values belonging to different labels (CB, CM, SB, SM) used in the discretization process. For that reason, the next tests were performed considering Hmin = 2.
B. Analysis of the Proposed Grammatical Approach
Fig. 5 shows the classification results achieved using the proposed grammatical approach. The achieved accuracies ranged from 98.72% to 100% considering only the ALB dataset, only the ACC dataset and the combined dataset (all scenarios using k-fold cross-validation). In addition, to verify whether the discretization process using Hmin = 2 was overfitting (since each bin contains just a small number of continuous values), we trained our model with one dataset and tested it with the other. For these scenarios, the best accuracy was 100% when the training was performed using images from ALB and the combination of shape features and Hu moments as features. When the training was performed with images from ACC, the best accuracy was 100% using only shape features. The worst result in terms of accuracy was 96.42%, when only Hu moments were used as input in the scenario where images from ACC were used for training and images from ALB were used for testing.
Fig. 4. The effect of discretization on classification of masses. In the legend, Shape means the use of shape features; Hu means the use of Hu moments; combined means the combined dataset; ACC_ALB means trained in ACC dataset and tested in ALB dataset; ALB_ACC means trained in ALB dataset and tested in ACC dataset.
Fig. 5. Accuracies achieved with the grammar-based approach. combined means the combination of ALB and ACC datasets; ACC_ALB means trained in ACC dataset and tested in ALB dataset; ALB_ACC means trained in ALB dataset and tested in ACC dataset. The error bars indicate the standard deviation when applicable (all scenarios except those that did not use cross-validation - ALB_ACC and ACC_ALB).
The results shown in Fig. 5 indicate that the classifiers built
using grammars were able to classify the images with accuracies
ranging from 96.42% to 100% for Hmin = 2. However, the
accuracy tends to fall as the value of Hmin increases, demon-
strating that our approach depends on a good discretization
process (Fig. 4).
Regarding the shape features and Hu moments, the grammar
approach was able to use both types of features to classify the
images with good results regardless of the dataset. However,
the shape features seem to be more robust, since when Hmin
increases the accuracies tend to fall more slowly than when the
Hu moments are used (Fig. 4).
Fig. 6. Accuracy achieved by each model using the shape features, Hu moments and the combined features. Combined dataset means the combination of ALB
and ACC datasets; ACC_ALB means trained in ACC dataset and tested in ALB dataset; ALB_ACC means trained in ALB dataset and tested in ACC dataset.
TABLE I
CLASSIFIERS CREATED AND THEIR HYPERPARAMETERS, WHERE N_FEATURES
IS THE NUMBER OF FEATURES AND X.VAR() IS THE TRAIN DATASET VARIANCE
Images from ALB dataset were harder to classify correctly
than images from ACC dataset, especially when Hmin increases.
This may be because the ALB dataset contains a greater number
of images with circumscribed shapes that are malignant and
images with spiculated shapes that are benign than the ACC
dataset. Another important factor that may be driving this be-
havior is that there are more images in the ACC dataset than in
the ALB dataset (202 and 111, respectively) and the classifier
can be more stable as the number of images increases.
It is quite difficult to compare the results obtained by differ-
ent studies in this research area, as different datasets, images,
segmentation methods and evaluation metrics are used prior to
the classification stage. However, it can be noted that the results
obtained in the present study are comparable to some of the most
recent studies found in the literature, as can be seen in Section III.
For instance, the best accuracies obtained in this study (96.42%
to 100%) are similar to the accuracies presented in studies [21]
(99% to 100%) and [22] (98%) and superior to the accuracy
presented in [23] (75%).
C. Comparison of the Grammatical and Non-Grammatical
Classifiers
To compare the results achieved using the grammar approach
with other classifiers, the classification of images from the ALB
and ACC datasets was performed using the following models:
Artificial Neural Network (ANN), Support Vector Machine (SVM),
K-Nearest Neighbors (KNN), Random Forest (RF) and Light
Gradient Boosting Machine (LGBM). More precisely, three
experiments were performed for each feature category (shape
features, Hu moments and both): one cross-validation with the
two datasets combined (the same fold division used for the
grammatical approach) and two experiments training with
one dataset and testing with the other. Furthermore, the same
features used for the grammatical approach were also used by
the non-grammatical classifiers. Before training the models, all
feature values were scaled using min-max normalization,
so their new values ranged from 0 to 1. Table I shows, for
each classifier induction algorithm, the tested values for the
hyperparameters.
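As an illustration of the scaling step, min-max normalization maps each feature value x to (x - min) / (max - min), computed per feature. The sketch below is a minimal plain-Python version; the function names and the toy rows are assumptions, and the text does not state whether the minimum and maximum were taken from the training split only.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def scale_features(rows):
    """Apply min-max scaling independently to each feature (column)."""
    columns = list(zip(*rows))
    scaled_columns = [min_max_scale(list(col)) for col in columns]
    return [list(row) for row in zip(*scaled_columns)]


# Three hypothetical masses described by two features each.
rows = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
scaled = scale_features(rows)
print(scaled)
```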
Fig. 6 shows the classification results for all tested scenarios.
The proposed grammatical approach surpassed the other
classifiers in classifying masses as benign or malignant. In
particular, Fig. 6 shows that the grammar-based approach was
able to classify the masses using Hu moments as input, while
the other methods showed more difficulty in this task.
Considering the non-grammatical approaches, the best
accuracies were 92.32% and 92.57% with LGBM and ANN,
respectively, when shape features were used. When only the Hu
moments were employed, the non-grammatical approaches had
their worst performance, showing that these classifiers were not
able to learn the pattern of benign and malignant masses with
this type of feature.
Tables II, III and IV summarize the comparison between the
proposed method and the non-grammatical approach. Besides
accuracy, these tables show the following performance mea-
sures: sensitivity, specificity, F-score and the Matthews correla-
tion coefficient (MCC).
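For reference, the measures listed above can all be derived from the four cells of a binary confusion matrix. The sketch below assumes that malignant is the positive class and that the F-score is the F1 variant; neither choice is stated in this section, and the counts are hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common measures from confusion-matrix counts.

    Assumes every denominator except the MCC one is non-zero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_score": f_score,
        "mcc": mcc,
    }


# Hypothetical counts, not taken from Tables II to IV.
metrics = binary_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```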
The superior results of the grammar approach are likely
related to the number of images in the dataset. It seems that
the grammar-based classifiers demand fewer images to learn the
pattern of benign and malignant masses than the other classifiers,
which was previously found in [6]. A possible reason for the
performance achieved by the grammar-based models is the set of
rules chosen to create the grammars that represent the benign
and malignant masses, together with the fact that the classifier
is composed of two distinct parsers (one for each grammar) that
compute the probability that the sentence representing the mass
(a sequence of tokens/features) belongs to the language of each
grammar. For this reason, grammatical methods could be used
more often in this classification problem, in particular when
a large number of images is not available.
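The two-parser decision rule can be sketched as follows. The example is a toy stand-in, not the grammars or the parser of this study: it uses two small stochastic regular grammars with a single nonterminal, and the token names (`smooth`, `spiky`) and all probabilities are hypothetical. Each grammar assigns a probability to the token sequence, and the mass receives the label of the grammar with the higher probability.

```python
# Each nonterminal maps to productions (probability, terminal, next_nonterminal).
# A next_nonterminal of None means the production ends the sentence.
BENIGN = {
    "S": [(0.80, "smooth", "S"), (0.10, "spiky", "S"),
          (0.08, "smooth", None), (0.02, "spiky", None)],
}
MALIGNANT = {
    "S": [(0.10, "smooth", "S"), (0.80, "spiky", "S"),
          (0.02, "smooth", None), (0.08, "spiky", None)],
}


def sentence_probability(grammar, tokens, start="S"):
    """Probability that the grammar generates exactly this token sequence."""
    state = {start: 1.0}  # probability mass reaching each live nonterminal
    finished = 0.0
    for position, token in enumerate(tokens):
        is_last = position == len(tokens) - 1
        next_state = {}
        for nonterminal, mass in state.items():
            for prob, terminal, nxt in grammar[nonterminal]:
                if terminal != token:
                    continue
                if nxt is None and is_last:
                    finished += mass * prob
                elif nxt is not None and not is_last:
                    next_state[nxt] = next_state.get(nxt, 0.0) + mass * prob
        state = next_state
    return finished


def classify(tokens):
    """Label the sequence with the grammar that gives it higher probability."""
    p_benign = sentence_probability(BENIGN, tokens)
    p_malignant = sentence_probability(MALIGNANT, tokens)
    return "malignant" if p_malignant > p_benign else "benign"


print(classify(["spiky", "spiky", "smooth"]))
print(classify(["smooth", "smooth"]))
```

Because each class has its own model, every grammar only needs to describe its own class well, which is one plausible reading of why few training images may suffice.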
Despite the good results obtained by the grammar classifiers,
the proposed approach has its drawbacks. The first limitation
is the need to extract features from mass boundaries marked
by a specialist or by an automatic segmentation process. The
images in this study had their boundaries marked by specialists,
since developing a segmentation process is also a very
challenging research area, especially for dense breasts. Another
issue that can severely impact the results when grammar
classifiers are used is the discretization process, as discussed
in Section VI-A. However, this issue can be overcome
by using a proper algorithm, as demonstrated in [6]. Additionally,
a deep knowledge of the application is required to
compose the process and perform the previously cited actions.
This can decrease the generalization ability of the grammar-based
models to other application areas, which sometimes can
be more easily reached by machine learning methods. These
drawbacks can be more significant when comparing
the proposed approach with deep learning techniques, where
there is no need for prior segmentation or a discretization
process.
D. Advantages of the Grammatical Classifiers
The first advantage of our approach when compared with the
other classifiers, and especially with deep learning techniques,
is that the pattern of benign and malignant masses can be learned
from a small sample of images. For instance, the ALB dataset has
111 images while the ACC dataset has 202 images. A deep
learning approach to the same problem would demand
many more images in order to learn all the features and weights
necessary to represent the benign and malignant patterns [32].
To achieve results similar to the ones presented in this study,
deep learning strategies were combined with data augmentation
to increase the number of images or with other techniques such
as transfer learning [22], [33].
TABLE II
PERFORMANCE MEASURES ACHIEVED BY EACH MODEL USING SHAPE
FEATURES: SENSITIVITY (SEN.), SPECIFICITY (SPE.), ACCURACY (ACC),
F-SCORE AND MATTHEWS CORRELATION COEFFICIENT (MCC). COMBINED
DATASET: TRAINING = BOTH, TEST = BOTH; ACC → ALB DATASETS:
TRAINING = ACC; TEST = ALB; ALB → ACC DATASETS: TRAINING =
ALB, TEST = ACC
TABLE III
PERFORMANCE MEASURES ACHIEVED BY EACH MODEL USING HU MOMENTS:
SENSITIVITY (SEN.), SPECIFICITY (SPE.), ACCURACY (ACC), F-SCORE AND
MATTHEWS CORRELATION COEFFICIENT (MCC). COMBINED DATASET:
TRAINING = BOTH, TEST = BOTH; ACC → ALB DATASETS: TRAINING =
ACC; TEST = ALB; ALB → ACC DATASETS: TRAINING = ALB, TEST =
ACC
TABLE IV
PERFORMANCE MEASURES ACHIEVED BY EACH MODEL USING THE
COMBINED FEATURES: SENSITIVITY (SEN.), SPECIFICITY (SPE.), ACCURACY
(ACC), F-SCORE AND MATTHEWS CORRELATION COEFFICIENT (MCC).
COMBINED DATASET: TRAINING = BOTH, TEST = BOTH; ACC → ALB
DATASETS: TRAINING = ACC; TEST = ALB; ALB → ACC DATASETS:
TRAINING = ALB, TEST = ACC
Next, the grammar-based classifiers seem to be more stable
than the non-grammatical classifiers used in this study,
since the accuracies obtained vary less than the accuracies
obtained by the other models (Fig. 6).
Another advantage of our approach is the simple representation
of the masses used to build the classifiers, which can be
seen as a hierarchical structure (Fig. 3) showing what the
algorithm does, especially when compared to artificial neural
networks and other more complex approaches closer to
black boxes. Using Fig. 3 as an example, we
can easily explain to health professionals that dashed circles
represent an “OR” node/rule, solid circles represent an “AND”
node/rule and the squares are the leaf nodes with the feature
values. Thus, we believe that, even with no technical knowledge
about computational grammars, health professionals
can understand the rules used to classify a mass as benign or
malignant.
VII. CONCLUSION
This article presented a syntactic approach to classify masses
found in mammograms. Two datasets were used: i) the first one
contains 111 images and was provided by researchers from the
University of Calgary (Canada); ii) the other dataset contains 202
images and was provided by researchers from the A. C. Camargo
Cancer Center (Brazil). The results show that the syntactic
approach is robust, can learn the pattern of benign and malignant
masses even with small samples of images and had superior results
when compared with artificial neural network, support vector
machine, k-nearest neighbors, random forest and light gradient
boosting techniques. Furthermore, the results obtained were
similar to the results demonstrated in some of the most recent
studies in this area of research.
Considering all the experiments performed, the best accuracy
achieved by the grammar-based classifier was 100%, while the
best accuracies achieved by the ANN, SVM, KNN, RF and
LGBM were 92.57%, 90.98%, 89.91%, 90.38% and 92.32%,
respectively. Furthermore, the shape features seem to be better
suited to be used in this classification process, as the accuracies
tend to be higher when they are used.
Despite the fact that the syntactic approach depends on good
data discretization, the proposed approach proved not to overfit
the data. The tests showed accuracies from 96.42% to 100%
when training with images from one dataset and testing with im-
ages from the other dataset. This can indicate that the proposed
approach has potential for classifying structures in contexts
similar to that shown in this article.
REFERENCES
[1] WHO - World Health Organization. Accessed: Apr. 17, 2020. [Online].
Available: https://www.who.int/news-room/fact-sheets/detail/cancer
[2] R. Pedro, A. Machado-Lima, and F. Nunes, “Is mass classification in
mammograms a solved problem? - A critical review over the last 20 years,”
Expert Syst. Appl., vol. 119, pp. 90–103, 2019.
[3] R. Pedro, F. Nunes, and A. Machado-Lima, “Using grammars for pattern
recognition in images: A systematic review,” ACM Comput. Surv., vol. 46,
no. 2, pp. 26:1–26:34, Nov. 2013. [Online]. Available:
http://doi.acm.org/10.1145/2543581.2543593
[4] A. Tahmasbi, F. Saki, and S. Shokouhi, “CWLA: A novel cognitive
classifier for breast mass diagnosis,” in Proc. 18th Iranian Conf. Biomed.
Eng., 2011, pp. 255–259.
[5] R. Pedro, A. Machado-Lima, and F. Nunes, “A new syntactic approach for
masses classification in digital mammograms,” in Proc. IEEE 32nd Int.
Symp. Comput.-Based Med. Syst., 2019, pp. 385–390.
[6] R. Pedro, A. Machado-Lima, and F. Nunes, “Towards an approach us-
ing grammars for automatic classification of masses in mammograms,”
Comput. Intell., vol. 37, no. 4, pp. 1515–1544, 2021.
[7] R. Rangayyan, N. Mudigonda, and J. Desautels, “Boundary modelling and
shape analysis methods for classification of mammographic masses,” Med.
Biol. Eng. Comput., vol. 38, no. 5, pp. 487–496, 2000. [Online]. Available:
http://dx.doi.org/10.1007/BF02345742
[8] G. Stiglic, S. Kocbek, I. Pernek, and P. Kokol, “Comprehensive decision
tree models in bioinformatics,” PLoS One, vol. 7, 2012, Art. no. e33812.
[9] M. B. Mainiero et al., “ACR appropriateness criteria breast cancer
screening,” J. Amer. College Radiol., vol. 14, no. 11, pp. S383–S390,
Nov. 2017. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S1546144017310992
[10] H. D. Nelson, M. Pappas, A. Cantor, J. Griffin, M. Daeges, and L.
Humphrey, “Harms of breast cancer screening: Systematic review to
update the 2009 U.S. preventive services task force recommendation,” Ann.
Intern. Med., vol. 164, no. 4, Feb. 2016, Art. no. 256. [Online]. Available:
http://annals.org/article.aspx?doi=10.7326/M15-0970
[11] N. Mudigonda, R. Rangayyan, and J. Desautels, “Concavity and convexity
analysis of mammographic masses via an iterative boundary segmentation
algorithm,” in Proc. Eng. Solutions Next Millennium IEEE Can. Conf.
Elect. Comput. Eng., 1999, pp. 1489–1494.
[12] J. Suckling et al., “The mammographic image analysis society digital
mammogram database,” in Proc. 2nd Int. Workshop Digit. Mammogr.,
1994, pp. 375–378.
[13] C. Alberta Cancer Board, Alberta, “Screen test: Alberta program for
the early detection of breast cancer,” 2001/03 - Biennial Report, 2004.
[Online]. Available: http://www.cancerboard.ab.ca/screentest
[14] R. Nandi, A. Nandi, R. Rangayyan, and D. Scutt, “Classification of
breast masses in mammograms using genetic programming and fea-
ture selection,” Med. Biol. Eng. Comput., vol. 44, no. 8, pp. 683–694,
2006.
[15] R. Rangayyan and T. Nguyen, “Fractal analysis of contours of breast
masses in mammograms,” J. Digit. Imag., vol. 20, no. 3, pp. 223–237,
2007.
[16] N. Mudigonda, R. Rangayyan, and J. Desautels, “Gradient and texture
analysis for the classification of mammography masses,” IEEE Trans. Med.
Imag., vol. 19, no. 10, pp. 1032–1043, Oct. 2000.
[17] N. Azizi, N. Zemmal, M. Sellami, and N. Farah, “A new hybrid method
combining genetic algorithm and support vector machine classifier:
Application to CAD system for mammogram images,” in Proc. Int. Conf.
Multimedia Comput. Syst., 2014, pp. 415–420.
[18] N. Zemmal, N. Azizi, and M. Sellami, “CAD system for classification of
mammographic abnormalities using transductive semi supervised learning
algorithm and heterogeneous features,” in Proc. 12th Int. Symp. Program.
Syst., 2015, pp. 1–9.
[19] R. Chaieb, A. Bacha, K. Kalti, and F. B. Lamine, “Image features extraction
for masses classification in mammograms,” in Proc. 6th Int. Conf. Soft
Comput. Pattern Recognit., 2014, pp. 203–208.
[20] University of South Florida, “Digital database for screening
mammography,” 2004. [Online]. Available:
http://marathon.csee.usf.edu/Mammography/Database.html
[21] F. Mohanty, S. Rup, and B. Dash, “Automated diagnosis of breast cancer
using parameter optimized kernel extreme learning machine,” Biomed.
Signal Process. Control, vol. 62, 2020, Art. no. 102108.
[22] A. Saber, M. Sakr, O. M. Abo-Seida, A. Keshk, and H. Chen, “A
novel deep-learning model for automatic detection and classification of
breast cancer using the transfer-learning technique,” IEEE Access, vol. 9,
pp. 71194–71209, 2021.
[23] M. Heidari et al., “Applying a random projection algorithm to
optimize machine learning model for breast lesion classification,”
IEEE Trans. Biomed. Eng., vol. 68, no. 9, pp. 2764–2775,
Sep. 2021.
[24] U. Ramer, “An iterative procedure for the polygonal approximation of
plane curves,” Comput. Graph. Image Process., vol. 1, no. 3, pp. 244–256,
1972.
[25] R. Hirama, R. Pedro, V. Graciliano, A. Machado-Lima, and F. Nunes,
“Evaluating the impact of polygonal representations on mass classifica-
tion,” in Proc. Int. Conf. Syst. Signals Image Process., 2020, pp. 75–80.
[26] M.-K. Hu, “Visual pattern recognition by moment invariants,” IRE Trans.
Inf. Theory, vol. 8, no. 2, pp. 179–187, 1962.
[27] B. Menze et al., “A comparison of random forest and its Gini importance
with standard chemometric methods for the feature selection and classifi-
cation of spectral data,” BMC Bioinf., vol. 10, no. 1, Jul. 2009, Art. no. 213.
[Online]. Available: https://doi.org/10.1186/1471-2105-10-213
[28] M. Ribeiro, M. Ferreira, C. Traina Jr., and A. Traina, “Data pre-processing:
A new algorithm for feature selection and data discretization,” in Proc. 5th
Int. Conf. Soft Comput. Transdisciplinary Sci. Technol., New York, NY,
USA, 2008, pp. 252–257. [Online]. Available:
http://doi.acm.org/10.1145/1456223.1456277
[29] S. Zhu and D. Mumford, “A stochastic grammar of images,” Found. Trends
Comput. Graph. Vis., vol. 2, no. 4, pp. 259–362, Jan. 2006. [Online].
Available: http://dx.doi.org/10.1561/0600000018
[30] K. Fu, Syntactic Pattern Recognition and Applications. Englewood Cliffs,
NJ, USA: Prentice-Hall, 1982.
[31] J. Earley, “An efficient context-free parsing algorithm,” Commun. ACM,
vol. 13, no. 2, pp. 94–102, Feb. 1970.
[32] H. N. Khan, A. R. Shahid, B. Raza, A. H. Dar, and H. Alquhayz, “Multi-
view feature fusion based four views model for mammogram classification
using convolutional neural network,” IEEE Access, vol. 7, pp. 165724–
165733, 2019.
[33] S. J. Malebary and A. Hashmi, “Automated breast mass classification
system using deep learning and ensemble learning in digital mammogram,”
IEEE Access, vol. 9, pp. 55312–55328, 2021.
Ricardo Wandré Dias Pedro received the bachelor’s
degree in computer science and the MSc degree in
information systems, both from the University of Sao
Paulo - Brazil, in 2009 and 2013. He is currently work-
ing toward the PhD degree with the University of Sao
Paulo - Brazil. His current research interests include
machine learning and syntactic pattern recognition.
Ana Luiza Silveira Ferreira received the MD degree
from Faculdades Integradas Pitágoras de Montes
Claros - Brazil, in 2017. She performed her medical
residency in radiology and diagnostic imaging (2018-
2021) with A.C.Camargo Cancer Center - Brazil,
was a fellow (R4) in general magnetic resonance at the
Santa Izabel Hospital - Brazil (2021-2022) and is
a fellow (R5) in women's imaging with the CAM Group
- Brazil (2022).
Rodolph Vinicius Siqueira Pessoa received the
medical degree from the University of
the State of Rio Grande do Norte - Brazil, in 2019.
He completed a medical residency in radiology and
diagnostic imaging with Fundação Antônio Prudente
– A.C. Camargo Cancer Center - Brazil (2022).
Currently, he is a fellow with the Interventional
Radiology and Angioradiology Program at the
same institution. His current research interests
include oncology, interventional radiology, and breast
imaging.
Almir Galvão Vieira Bitencourt received the MD
degree from the Federal University of Bahia, in 2007,
and the PhD degree in oncology from the Antônio
Prudente Foundation / A.C.Camargo Cancer Center
- Brazil, in 2012. He performed his medical resi-
dency in radiology and diagnostic imaging (2008-
2010) and fellow in breast imaging (2011) with the
A.C.Camargo Cancer Center - Brazil. He currently
works as a radiologist and researcher with the same
institution in the following areas: Diagnostic imaging,
oncology, and breast imaging.
Ariane Machado-Lima received the MSc degree in
computer science and the PhD degree in bioinformat-
ics, both from the University of Sao Paulo - Brazil, in
2002 and 2006, and the PhD degree sandwich period
from the Washington University School of Medicine -
USA. She is a computer scientist with the University
of Sao Paulo - Brazil. Her current research interests
include machine learning and syntactic pattern recog-
nition.
Fátima L. S. Nunes received the bachelor’s degree
in computer science, and the master’s and PhD de-
grees in sciences, in the Graphics area. She is a full
professor with the University of São Paulo, where
she teaches and supervises master and PhD students
in computer science and computer engineering. Her
main research areas in computer science are virtual
reality, image processing, and content-based image
retrieval. Most of her research is related to solving
real problems in the health area. Currently, she is
working on problems related to quality of life involving
computational techniques to aid the diagnosis of Autism Spectrum Disorder,
breast cancer, cardiomyopathies, and mild cognitive impairment. She is also
developing techniques to allow fast development and automatic adaptation of
serious games applied to rehabilitation processes.
 /JPN <FEFF30d330b830cd30b9658766f8306e8868793a304a3088307353705237306b90693057305f002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e305930023053306e8a2d5b9a3067306f30d530a930f330c8306e57cb30818fbc307f3092884c3044307e30593002>
 /KOR <FEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020be44c988b2c8c2a40020bb38c11cb97c0020c548c815c801c73cb85c0020bcf4ace00020c778c1c4d558b2940020b3700020ac00c7a50020c801d569d55c002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002e>
 /NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken waarmee zakelijke documenten betrouwbaar kunnen worden weergegeven en afgedrukt. De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 5.0 en hoger.)
 /NOR <FEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200073006f006d002000650072002000650067006e0065007400200066006f00720020007000e5006c006900740065006c006900670020007600690073006e0069006e00670020006f00670020007500740073006b007200690066007400200061007600200066006f0072007200650074006e0069006e006700730064006f006b0075006d0065006e007400650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002e>
 /PTB <FEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f00620065002000500044004600200061006400650071007500610064006f00730020007000610072006100200061002000760069007300750061006c0069007a006100e700e3006f002000650020006100200069006d0070007200650073007300e3006f00200063006f006e0066006900e1007600650069007300200064006500200064006f00630075006d0065006e0074006f007300200063006f006d0065007200630069006100690073002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002e>
 /SUO <FEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a0061002c0020006a006f0074006b006100200073006f0070006900760061007400200079007200690074007900730061007300690061006b00690072006a006f006a0065006e0020006c0075006f00740065007400740061007600610061006e0020006e00e400790074007400e4006d0069007300650065006e0020006a0061002000740075006c006f007300740061006d0069007300650065006e002e0020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002e>
 /SVE <FEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400200073006f006d00200070006100730073006100720020006600f60072002000740069006c006c006600f60072006c00690074006c006900670020007600690073006e0069006e00670020006f006300680020007500740073006b007200690066007400650072002000610076002000610066006600e4007200730064006f006b0075006d0065006e0074002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002e>
 /ENU (Use these settings to create PDFs that match the "Suggested" settings for PDF Specification 4.0)
 >>
>> setdistillerparams<<
 /HWResolution [600 600]
 /PageSize [612.000 792.000]
>> setpagedevice
