Digital Chemical Engineering 3 (2022) 100018 
Contents lists available at ScienceDirect 
Digital Chemical Engineering 
journal homepage: www.elsevier.com/locate/dche 
Design of mosquito repellent molecules via the integration of hyperbox machine learning and computer aided molecular design
Mohamad Hatamleh a , Jia Wen Chong a , Raymond R. Tan b , Kathleen B. Aviso b , 
Jose Isagani B. Janairo c , Nishanth G. Chemmangattuvalappil a , ∗ 
a Department of Chemical and Environmental Engineering, University of Nottingham Malaysia, Broga Road, 43500 Selangor D.E., Malaysia 
b Center for Engineering and Sustainable Development Research, De La Salle University, 2401 Taft Avenue, 0922 Manila, Philippines 
c Department of Biology, De La Salle University, 2401 Taft Avenue, 0922 Manila, Philippines 
Article info
Keywords:
Mosquito repellents
Computer-aided molecular design
Machine learning
Hyperbox classifier
Cheminformatics
Mixed-integer linear programming

Abstract
The use of mosquito repellents is an efficient way to prevent mosquito-borne diseases. Despite the accumulation of information about repellents, their mechanism of action remains poorly understood. There is also a need for systematic methods for discovering new alternatives that mitigate the drawbacks of repellents currently in use. To address these research gaps, a computer-aided molecular design (CAMD) framework is developed for the optimal molecular design of mosquito repellents. In this framework, the mosquito repelling attribute of molecules is predicted using a data-driven hyperbox-based machine learning approach in the absence of a mechanistic prediction model. The best set of rules is selected from plausible alternative models developed. For the prediction of important physical properties, a group contribution-based method using reliable models is implemented. Subsequently, the CAMD formulation is developed as a mixed-integer linear programming model to obtain structures with minimum viscosity. Results show that, of the structures generated, the hyperbox classifier correctly predicted the repelling ability of all molecules found to be known repellents in literature. The molecules not found in the databases provide key insights on where experimental research to develop new repellents should be targeted. Thus, this newly developed framework can be applied as a systematic technique to screen and narrow down the search space for candidate mosquito repellent molecules before final experimental verification.

∗ Corresponding author at: Department of Chemical and Environmental Engineering, Nottingham Malaysia, Selangor, Malaysia.
E-mail address: Nishanth.C@nottingham.edu.my (N.G. Chemmangattuvalappil).
https://doi.org/10.1016/j.dche.2022.100018
Received 4 November 2021; Received in revised form 22 February 2022; Accepted 23 February 2022
2772-5081/© 2022 The Author(s). Published by Elsevier Ltd on behalf of Institution of Chemical Engineers (IChemE). This is an open access article under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/ )

Nomenclature

List of abbreviations
ANN: Artificial neural network
CAMD: Computer-aided molecular design
DEET: N,N-diethyl-meta-toluamide
GC: Group contribution
MILP: Mixed-integer linear programming
ML: Machine learning
QSAR: Quantitative structure-activity relationship
RSML: Rough set-based machine learning
RST: Rough set theory
SVM: Support vector machine
TI: Topological indices

Parameters of MILP-based hyperbox classifier model
$\varepsilon$: Threshold for the proportion of false negatives (Type II error)
$C^*_j$: True membership of sample j
$M$: Arbitrary large number
$N_T$: Total number of samples that are not members of a set
$P_T$: Total number of samples in the set
$X_{ji}$: Value of sample j in dimension i
$Z^L_{ik}$: Lowest possible bound of box k in dimension i
$Z^U_{ik}$: Uppermost possible bound of box k in dimension i

Decision variables of MILP-based hyperbox classifier model
$\alpha$: Proportion of false positives (Type I error)
$\beta$: Proportion of false negatives (Type II error)
$b^L_{ik}$: Binary variable which signifies the activation ($b^L_{ik} = 1$) or not ($b^L_{ik} = 0$) of the lower limit of box k in dimension i
$b^U_{ik}$: Binary variable which signifies the activation ($b^U_{ik} = 1$) or not ($b^U_{ik} = 0$) of the upper limit of box k in dimension i
$b_{jk}$: Binary variable which indicates if sample j is enclosed in box k ($b_{jk} = 1$)
$c_j$: Classification of sample j based on hyperbox
$q^L_{ijk}$: Binary variable which indicates if sample j is below the lower bound of box k in dimension i ($q^L_{ijk} = 1$)
$q^U_{ijk}$: Binary variable which indicates if sample j is above the upper bound of box k in dimension i ($q^U_{ijk} = 1$)
$x^L_{ik}$: Lower bound of box k in dimension i
$x^U_{ik}$: Upper bound of box k in dimension i

1.0. Introduction

Mosquitoes are vectors for many pathogens that infect millions of humans yearly (Tauxe et al., 2013). Major mosquito-borne diseases include malaria, dengue, chikungunya, and zika. Mosquitoes heavily rely on olfaction to locate human hosts, using two volatile cues: carbon dioxide and skin odorants (Cardé & Gibson, 2010). Consequently, targeting mosquitos' olfaction systems using repellent odorants is an efficient strategy to prevent them from locating human hosts.

One significant challenge to identifying the best insect repellent molecule is relating molecular structures to the ability to repel mosquitos. Prominent information dates back to the 1940s; it was around then that repellents such as indalone, dimethyl phthalate, and N,N-diethyl-meta-toluamide (DEET) were developed (Paluch et al., 2010). Those compounds had many drawbacks. For example, DEET requires high concentrations to be effective, is mildly toxic (Robbins & Cherniack, 1986), has an unpleasant odour, dissolves plastics and synthetic rubber (Debboun et al., 2014), and is ineffective against Anopheles mosquito species (Rutledge et al., 1978). These drawbacks initiated the quest to find alternative repellents. Although new alternatives were suggested, they still shared a few drawbacks (Afify et al., 2019). Surprisingly, despite their long history of use, the mechanism of action of synthetic repellents is not well understood (Afify et al., 2019). Improving this understanding will aid in identifying new repellents by explaining how compounds with various functional groups influence mosquito behaviour. Although some research has revealed new information about the mosquito olfactory mechanism (Hallem et al., 2005), the development of new repellents remains significantly reliant on studies from the 1940s, as those studies contain information regarding the repelling ability of different structural classes of chemicals (Paluch et al., 2010). Relying on such a conventional methodology is rather impractical and inefficient. Due to combinatorial explosion, there are numerous possible candidate structures that need to be experimentally studied to verify their performance. There is therefore a need for an initial systematic technique to screen the design of mosquito repellents. Computer-aided molecular design (CAMD) is a model-based design method with the advantage of quickly identifying promising candidates that can be further verified through experimental techniques to obtain a final optimal structure (Zhang et al., 2020).

The most commonly used property prediction models in the context of CAMD are group contribution (GC) models (Austin et al., 2016). In GC models, the properties of a certain molecule can be estimated by summing the frequency of occurrence multiplied by the contribution of each molecular group present in the molecule (Marrero & Gani, 2001). Other property prediction models used in CAMD include the utilisation of topological indices (TI) based molecular descriptors, which are calculated based on principles of chemical graph theory (Trinajstić, 1992). Recently, CAMD has been utilised in the design and development of processes in several areas of applications (Chemmangattuvalappil, 2020). A CAMD approach was applied to obtain the best ionic liquid for the liquid-liquid extraction of fuel oils in the desulphurisation process (Song et al., 2018). A candidate list of liquids was generated, and process simulation was implemented to select the superior candidate. Song et al. (2019) relied on CAMD to determine the optimum liquid to be used as the entrainer in the extractive distillation process that separates alkanes from cycloalkanes. Within a list of entrainers, the superior candidate was chosen based on the thermodynamic performance of typical extractive distillation processes. Solvent design techniques were implemented using CAMD to identify a solvent for liquid phase organic reactions (Zhou et al., 2020). The best solvent was chosen based on the obtained selectivity and reaction rates of various organic reactions.

CAMD also has numerous applications in chemical product design. A screening method for the design of fragrance molecules was developed using CAMD (Zhang et al., 2018). The characteristic odours of molecules were predicted using a data-driven machine learning approach. Similarly, a mathematical programming-based approach for the design of fragrance molecules was developed (Radhakrishnapany et al., 2020). A data-driven rough set-based machine learning (RSML) model was used to predict odour properties (Radhakrishnapany et al., 2020). A novel CAMD methodology that formulated a solvent mixture design problem that simultaneously optimised the number, identity, and composition of mixture components was introduced (Jonuzaj et al., 2016). They demonstrated their methodology via a case study to identify the superior solvent mixture that maximized the solubility of ibuprofen. A CAMD framework that identified potential solvent candidates for addition to bio-oil, such that targeted properties like viscosity and low heating value are met to use bio-oil as fuel, was presented (Mah et al., 2019). They developed a phase stability model to evaluate the miscibility of the solvent-oil blend. A CAMD framework was developed for the design of all kinds of formulated products (Zhang et al., 2017). In one of two case studies, the framework was used to design electrolytes that meet desired attributes such as electrical insulation, environment friendliness, and resistance to electrolyte degradation.

To build a good CAMD framework, property prediction models must be available to be applied. This is the basic requirement of CAMD. Important thermodynamic, transport or environment-related properties that are essential for the performance of consumer products have existing property prediction models. However, there is a lack of adequate prediction models for certain attributes that are a principal requirement of products, such as the taste of a molecule for use in food products, or the smell of a molecule for use in perfumes. In the case of designing mosquito repellent molecules, no mechanistic prediction model exists that links the mosquito repelling attribute to its molecular structure. In such cases, obtaining a prediction model for use in CAMD is rather challenging. To address this research gap, a data-driven hyperbox-based machine learning (ML) tool is proposed in this work to link the mosquito repelling attribute to the molecular structure.

2.0. Machine learning and hyperbox

ML deals with the development of models that are trained using algorithms on empirical data coming from various sources (Khuat et al., 2021). One of the most prominent uses of ML is mining patterns hidden in data repositories (Zakaryazad & Duman, 2016). ML is becoming a mainstream tool for making sense out of large quantities of data in many domains (Jordan & Mitchell, 2015). ML tools have been developed to serve practical applications in place of traditional statistics to support decision-making (Makridakis, 2017).

Two of the most common ML tools are Artificial Neural Networks (ANN) and Support Vector Machine (SVM), which rely on the availability of data with a huge number of data points. For sparse data sets, alternative techniques such as RSML and hyperbox-based ML tools can be considered as potential tools to identify interpretable patterns. In chemical product design, the availability of data is usually limited. In addition, scientific plausibility and interpretability are often important considerations in the development of ML models (Rudin, 2019). The distinct advantage of such models is their ability to generate transparent and intuitive results to predict various outcomes. A hyperbox is essentially an n-dimensional hyper-rectangle, where n is equal to the number of attributes in a particular dataset. In the training data, the hyperbox dimensions are calibrated to optimally fit the dataset in a training procedure via a second model (Xu and Papageorgiou, 2009). A hyperbox model can be readily interpreted as a set of IF-THEN rules which readily map to typical human thought processes. Mixed-integer linear programming (MILP) is favoured as the training model as it involves both continuous and integer variables, which allows for the exploration of both optimal and near-optimal solutions (Voll et al., 2015). In addition, solving MILP models does not present major computational difficulties. The validation dataset is used to evaluate the performance of the model. Hyperbox has an added advantage over related techniques such as RSML; it is based on MILP, which can be solved readily with standard branch-and-bound solvers in commercial optimisation software. Compared with other ML tools, the hyperbox classifier is yet to become mainstream. However, in areas where clear and interpretable rules are required for decision making, the hyperbox classifier is a significant potential tool to be applied (Tan et al., 2020).

Xu and Papageorgiou (2009) pioneered the implementation of MILP-based hyperbox methodology to minimise the number of incorrectly classified data samples for multi-class data classification problems. Although it was necessary to re-optimise their MILP model iteratively, they concluded that their approach is highly competitive in terms of prediction accuracy upon comparison with other standard classifiers. An enhanced algorithm was subsequently developed that reduced the number of steps to avoid the iterative procedure (Maskooki, 2013). A further improvement was made to include Type I (false positive) and Type II (false negative) prediction errors (Yang et al., 2015). MILP-based hyperbox was used in a geological reservoir classification problem for carbon dioxide storage (Tan et al., 2020). They aimed to predict whether or not a storage site is secure based on geological data. MILP-based hyperbox was also used to classify pharmaceutical compounds into either of two categories: low or high binding free energy (Tardu et al., 2016). Using 6 common molecular descriptors, their approach outperformed other techniques, obtaining an 83.55% prediction accuracy. The same approach was applied to peptide-mediated biomineralization, which is an emerging biomimetic technique for developing nanomaterials with functional properties (Janairo et al., 2020). Their developed model can predict whether a biomineralization peptide has a strong or weak binding affinity towards a gold surface based on its peptide sequence. Their model outperformed other ML algorithms, achieving 100% prediction accuracy upon validation.

Other hyperbox-based algorithms have applications in chemical engineering. A modified fuzzy min-max neural network with batch learning hyperbox algorithm was used to find anomalies in the cooling system of an industrial blast furnace, based on input data taken directly from sensors (Meneganti et al., 1998). From real sensor data collected from the circulating water system of a power generation plant in Malaysia, a fuzzy min-max neural network combined with a symbolic rule extraction hyperbox algorithm was used to identify and diagnose heat transfer equipment (Chen et al., 2004). Similarly, Quteishat & Lim (2007) utilized the same approach to identify faults occurring in condensers of a circulating water system of a power generation plant to determine whether or not heat transfer conditions are efficient. In the CAMD context, to predict attributes of molecules based on their structures, a deep ANN was utilised to predict 15 molecular properties based on the quantum chemistry QM9 database (Valencia-Marquez & Flores-Tlacuahuac, 2020). Two integrated ML models, ANN-GC and SVM-GC, were used to predict the solubility of carbon dioxide in ionic liquids (Song et al., 2020). Moreover, RSML was utilised to predict odour properties for the design of fragrant molecules (Radhakrishnapany et al., 2020).

The interpretability of hyperbox models is extremely valuable in the prediction of attributes, as the current models used for property prediction are developed based on black-box models. Most conventional ML tools are black-box models that offer no transparency into how the decision is made. In the absence of prediction models, data-driven ML tools can be utilised to establish a link between attributes of a product and its molecular structure using TIs. Together with existing GC-based property prediction models for the estimation of thermodynamic, transport, and environment-related properties, a CAMD problem with hyperbox-based ML algorithms incorporated was formulated in this work to design an optimal molecule for mosquito repellents. The data-driven hyperbox-based ML approach was utilised to predict the mosquito repelling attribute of molecules by selecting the best set of rules from plausible alternative models developed. On the other hand, a GC-based property prediction method was applied for the prediction of other important physical properties. Here, the CAMD problem was formulated as a MILP model to generate molecule structures with minimum viscosity. With the developed methodology, new molecule structures with repelling abilities can be identified, providing new insights into experimental research that develops new repellents.

Table 1
GC property prediction models used to estimate molecular properties.

| Product attribute | Technical property | Model |
|---|---|---|
| Safe to use | Flash point $F_p$ (K) | $F_p - 150.0218 = \sum_i N_i F_{p1i}$ |
| Volatile | Boiling point $T_b$ (K) | $T_b = 244.7889 \ln \sum_i N_i T_{b1i}$ |
| Non-toxic | Lethal concentration LC-50 (mol/L) | $-\log(LC_{50}) + 2.1841 = \sum_i N_i LC_{50,1i}$ |
| Stable | Hildebrand solubility parameter $\delta$ (MPa$^{0.5}$) | $\delta - 20.7339 = \sum_i N_i \delta_{1i}$ |
| Spray-able | Dynamic viscosity $\eta$ (mPa.s) | $\ln(\eta) = \sum_i N_i \eta_{1i}$ |
| Spray-able | Molar volume $V_m$ (m$^3$/kmol) | $V_m - 0.0123 = \sum_i N_i V_{m1i} + \sum_j M_j V_{m2j}$ |

Table 2
Example E-state index calculation for the hydroxyl group in 1-hexanol.

| Atom no. | Atom group | $\delta^v$ | $\delta$ | $I$ | $D$ | $r_{ij}$ | $r_{ij}^2$ | $\Delta I_i$ calculation |
|---|---|---|---|---|---|---|---|---|
| 1 | sOH | 5 | 1 | 6 | 0 | 1 | 1 | 0.000 |
| 2 | ssCH2 | 2 | 2 | 1.5 | 1 | 2 | 4 | 1.125 |
| 3 | ssCH2 | 2 | 2 | 1.5 | 2 | 3 | 9 | 0.500 |
| 4 | ssCH2 | 2 | 2 | 1.5 | 3 | 4 | 16 | 0.281 |
| 5 | ssCH2 | 2 | 2 | 1.5 | 4 | 5 | 25 | 0.180 |
| 6 | ssCH2 | 2 | 2 | 1.5 | 5 | 6 | 36 | 0.125 |
| 7 | sCH3 | 1 | 1 | 2 | 6 | 7 | 49 | 0.082 |
| | | | | | | | $\Delta I_i$ | 2.293 |
| | | | | | | | $S_i$ | 8.293 |

3.0. Methodology

The methodology developed to design mosquito repellent molecules by integrating the hyperbox ML tool and molecular design is illustrated in Figure 1. The properties that satisfy technical requirements were predicted using existing GC-based models. The hyperbox ML tool was utilized to develop a model that predicts whether a molecule possesses the ability to repel mosquitos. In this work, a hybrid of GC-hyperbox was developed to solve the CAMD problem.

[Fig. 1. CAMD model used for the design of mosquito repellent molecules.]

3.1. Step 1, 2 and 3b: Product Attribute and Model Identification

Mosquito repellents must attain several attributes to enable their effective use. It is paramount that those required attributes are identified at the initial stage of the design. Other than being effective in repelling mosquitos, which is the main function of the product, a mosquito repellent molecule must also possess several desirable attributes to be used for its application. It should be safe to use, non-toxic, stable, volatile, and sprayable. The technical attributes were translated into quantitative properties in Table 1. The flash point was used to quantify the safety attribute. A higher flash point reduces the possibility of forming a flammable mixture in the air. The Fathead minnow 96-hour Lethal Concentration 50 (LC-50) is commonly referred to as the concentration of chemical in the air that kills 50% of the tested animals during the observation period. In this case study, LC-50 was used to quantify the non-toxicity to the human body via an inhalation route. An important property of repellents is their volatility. Repellents should be volatile enough to form a vapour layer above the skin surface, which the mosquito will detect and avoid. However, the repellent must not be too volatile, since this reduces the effective duration of the repellent application (da Silva and Ricci-Júnior, 2020). The boiling point is used to quantify the volatility. The molecule must possess a Hildebrand solubility parameter that is near that of the solvent mixture used in the final product to achieve stability. A binary mixture of methanol and 1-butanol with 33 mol% methanol was assumed as the solvent mixture (Conte et al., 2011). This is because, from the list of mixtures presented by the authors, it had the lowest cost. Finally, dynamic viscosity and molar volume quantify the spray-ability attribute. The less viscous the molecule, and the higher its molar volume, the more spray-able the product is.

[Fig. 2. Representation of 3 hyperboxes in 2 dimensions used to classify positive samples.]

[Fig. 3. Molecular structure of 1-hexanol with each atom group numbered.]

To continue the CAMD procedure, a mathematical model that can estimate the abovementioned targeted properties (Table 1) is required. Within this context, GC is the most popular property prediction model to estimate the derived properties. Table 1 lists the identified GC models for flash point, Hildebrand solubility parameter and molar volume (Hukkerikar et al., 2012a), in addition to lethal concentration (Hukkerikar et al., 2012b), and dynamic viscosity (Conte et al., 2008) up to first-order groups. $N_i$ is the number of occurrences of group i.
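
As a rough illustration of how the Table 1 model forms turn group counts into property estimates, the following Python sketch evaluates three of them. The function names and the `toy_contrib` values are assumptions introduced only for this example; the contribution values are placeholders and not the published parameters of the cited GC models.

```python
import math

# Model forms follow Table 1. The contribution values used in the example
# below are made-up placeholders, NOT published GC parameters.

def group_sum(counts, contrib):
    """Return sum_i N_i * C_i over the first-order groups in `counts`."""
    return sum(n * contrib[group] for group, n in counts.items())

def flash_point(counts, contrib):
    """F_p - 150.0218 = sum_i N_i * F_p1i, solved for F_p (K)."""
    return 150.0218 + group_sum(counts, contrib)

def boiling_point(counts, contrib):
    """T_b = 244.7889 * ln(sum_i N_i * T_b1i), in K."""
    return 244.7889 * math.log(group_sum(counts, contrib))

def dynamic_viscosity(counts, contrib):
    """ln(eta) = sum_i N_i * eta_1i, solved for eta (mPa.s)."""
    return math.exp(group_sum(counts, contrib))

# 1-hexanol written as first-order groups: one CH3, five CH2, one OH.
hexanol = {"CH3": 1, "CH2": 5, "OH": 1}
# Hypothetical contributions, chosen only to keep the arithmetic easy to follow.
toy_contrib = {"CH3": 1.0, "CH2": 0.5, "OH": 2.0}

print(flash_point(hexanol, toy_contrib))
print(boiling_point(hexanol, toy_contrib))
print(dynamic_viscosity(hexanol, toy_contrib))
```

In practice each property would use its own contribution table taken from the cited sources; a single toy table is reused here purely to show the arithmetic.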
3.2. Step 3a: Develop a Prediction Model for the Repellent Requirement Using Hyperbox ML Tool

The model that predicts whether a molecule can repel mosquitos was developed by using a hyperbox-based ML tool, which is a binary classifier. A dataset consisting of 56 molecules, each classified as either 'attractant' or 'repellent' towards Aedes aegypti mosquitos, was used.

The following example given by Tan et al. (2020) serves to illustrate the concept of using hyperbox classifiers and how to interpret their results. Consider a small dataset of 10 sample students, of which 6 are classified as positive (z = 1) for having a degree, and the remaining 4 are negative (z = 0), where z is the binary output variable that can be predicted from the 2 dimensions x and y as shown in Fig. 2. Three hyperboxes are chosen for this example. The hyperboxes were determined to engulf as many positive samples as possible, whilst simultaneously excluding negative samples.

In some cases, perfect separation cannot be achieved. It is observed in Fig. 2 that hyperbox 2 mistakenly contains a negative sample. This is known as a Type I error, or false positive, whose occurrence rate is denoted by α. In this case, 1 of 4 negative samples is misclassified, giving α = 0.25. Similarly, 1 positive sample fails to lie within any one of the 3 hyperboxes. This is a Type II error, or false negative, whose occurrence rate is denoted by β. Here, 1 of 6 positive samples is misclassified, yielding β = 0.17. The hyperboxes can be interpreted as the following set of IF-THEN rules:

1 IF (x ≤ 2) AND (y ≤ 2) THEN (z = 1)
2 IF (1 ≤ x ≤ 5) AND (1 ≤ y ≤ 3) THEN (z = 1)
3 IF (6 ≤ x ≤ 7) THEN (z = 1)

The rules are disjunctive; a student satisfying any one of them is sufficient for the model to predict that a degree was obtained. Collectively, the rules comprise a rule-based model to predict the binary variable z from dimensions x and y. In this example, the hyperboxes were known a priori. In this work, however, the size and location of the hyperboxes are to be determined using a second training model. The goal is to generate a set of rules that determines if a molecule in the dataset is classified as a repellent, with reasonable Type I and II error rates. For that, each molecule in the dataset is to be characterized by a certain topological index.
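
The three IF-THEN rules above can be sketched as a small disjunctive classifier in Python. The `Hyperbox` class, the helper names, and the ten sample points are assumptions made for this illustration; the points are not the data behind Fig. 2 and were chosen only so that one negative sample falls inside hyperbox 2 and one positive sample falls outside every box, matching the error rates quoted in the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# (lower, upper) per dimension; None means that limit is not active.
Bounds = Tuple[Optional[float], Optional[float]]

@dataclass
class Hyperbox:
    bounds: Dict[str, Bounds]

    def contains(self, sample):
        """True if the sample respects every active limit of this box."""
        for dim, (lo, hi) in self.bounds.items():
            value = sample[dim]
            if lo is not None and value < lo:
                return False
            if hi is not None and value > hi:
                return False
        return True

# The three rules from the text; dimensions absent from a box are unconstrained.
RULES = [
    Hyperbox({"x": (None, 2), "y": (None, 2)}),
    Hyperbox({"x": (1, 5), "y": (1, 3)}),
    Hyperbox({"x": (6, 7)}),
]

def predict(sample, boxes=RULES):
    """Disjunctive rule set: z = 1 if the sample lies in any hyperbox."""
    return int(any(box.contains(sample) for box in boxes))

def error_rates(samples, labels, boxes=RULES):
    """Return (alpha, beta): false positive and false negative proportions."""
    preds = [predict(s, boxes) for s in samples]
    false_pos = sum(p == 1 and z == 0 for p, z in zip(preds, labels))
    false_neg = sum(p == 0 and z == 1 for p, z in zip(preds, labels))
    return false_pos / labels.count(0), false_neg / labels.count(1)

# Hypothetical points (not the Fig. 2 data): 6 positives followed by 4 negatives.
points = [(1, 1), (2, 1.5), (3, 2), (4, 2.5), (6.5, 4), (8, 5),
          (4, 1.5), (5.5, 4), (3, 4), (8, 1)]
samples = [{"x": x, "y": y} for x, y in points]
labels = [1] * 6 + [0] * 4

alpha, beta = error_rates(samples, labels)
print(alpha, round(beta, 2))  # 0.25 0.17
```

Representing an inactive limit as `None` mirrors rule 3, which constrains x only and leaves y free.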
Some TIs available to use are the connectivity index, shape index, and electrotopological-state index. The latter has been utilized to relate to the odour of aliphatic esters in a quantitative structure-activity relationship (QSAR) study (de Mello Castanho Amboni et al., 2000). No similar study was found in the surveyed literature that relates TI to the ability of a molecule to repel mosquitos. Therefore, it was hypothesized that the electrotopological-state index can be used in this work.

Using electrotopological-state indices, or E-state, each atom present in a molecular graph is represented by an E-state variable that enciphers the intrinsic electronic state of the atom as affected by other electronic influences from the remaining atoms in the molecule (Conte et al., 2011). The intrinsic state I of an atom is calculated according to Eq. 1, where N is the principal quantum number of the valence shell:

$$I = \frac{(2/N)^2\,\delta^v + 1}{\delta} \tag{1}$$

Values for $\delta^v$ and $\delta$ are given by Conte et al. (2011) for various atom groups. They are defined as:

$$\delta = \sigma - h \tag{2}$$

$$\delta^v = \sigma + \pi + n - h \tag{3}$$

where σ is the number of electrons in sigma orbitals, π is the number of electrons in pi orbitals, n is the number of electrons in lone pairs and h is the number of hydrogen atoms bonded to the atom in consideration.

The E-state index $S_i$ for atom i can be calculated as follows:

$$S_i = I_i + \Delta I_i \tag{4}$$

The perturbation term $\Delta I_i$ is defined in Eq. 5 as:

$$\Delta I_i = \sum_{j \neq i} \frac{I_i - I_j}{r_{ij}^2} \tag{5}$$

where $r_{ij}$ is the distance between atoms i and j, counted from the graph distance D as shown in Fig. 3. In Table 2, $r_{ij}$ equals D + 1, that is, the number of atoms on the shortest path between i and j.
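
Equations 1, 4 and 5 can be checked numerically against the 1-hexanol example of Table 2 with a short Python sketch. The function names are assumptions made for this illustration; the $\delta^v$ and $\delta$ values are those listed in Table 2, the chain is treated as a linear graph so that the graph distance is simply the difference in atom position, and $r_{ij} = D + 1$ follows the table.

```python
def intrinsic_state(delta_v, delta, n_quantum=2):
    """Eq. 1: I = ((2/N)^2 * delta_v + 1) / delta."""
    return ((2.0 / n_quantum) ** 2 * delta_v + 1.0) / delta

def e_state(i, intrinsic, graph_distance):
    """Eqs. 4 and 5 for atom i, using r_ij = D_ij + 1 as in Table 2.

    Returns (S_i, delta_I_i).
    """
    delta_i = 0.0
    for j, i_j in enumerate(intrinsic):
        if j == i:
            continue
        r_ij = graph_distance(i, j) + 1
        delta_i += (intrinsic[i] - i_j) / r_ij ** 2
    return intrinsic[i] + delta_i, delta_i

# 1-hexanol as a linear chain: sOH, five ssCH2, sCH3 (atom numbering of Table 2).
# (delta_v, delta) pairs are taken from Table 2; N = 2 for both C and O.
atoms = [(5, 1)] + [(2, 2)] * 5 + [(1, 1)]
intrinsic = [intrinsic_state(dv, d) for dv, d in atoms]

def chain_distance(a, b):
    """Graph distance between positions a and b in an unbranched chain."""
    return abs(a - b)

s_oh, delta_i_oh = e_state(0, intrinsic, chain_distance)
print(round(delta_i_oh, 3), round(s_oh, 3))  # 2.293 8.293
```

For branched or cyclic molecules, `chain_distance` would need to be replaced by a shortest-path computation on the molecular graph.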

Table 3
Mathematical description of the MILP model used to train the Hyperbox model.

| Eq. | Model's equations | Purpose |
|---|---|---|
| (6) | $\min\ \alpha + \beta$ | Minimize the total number of Type I and Type II errors while constraining Type II error below a certain threshold |
| (7) | $\beta < \varepsilon$ | |
| (8) | $\alpha = \dfrac{\sum_j (c_j - C^*_j)}{N_T},\ \forall j \in S_N$ | Definition of Type I error. |
| (9) | $\beta = \dfrac{\sum_j (C^*_j - c_j)}{P_T},\ \forall j \in S_P$ | Definition of Type II error. |
| (10) | $X_{ji} > x^L_{ik} - \Delta - M(1 - b_{jk}),\ \forall i, j$ | Generate the external boundaries of hyperbox k to avoid incorrectly classifying samples. |
| (11) | $X_{ji} < x^U_{ik} + \Delta + M(1 - b_{jk}),\ \forall i, j$ | |
| (12) | $X_{ji} > x^L_{ik} - M(1 - b_{jk}),\ \forall i, j$ | Generate the boundaries of hyperbox k. |
| (13) | $X_{ji} < x^U_{ik} + M(1 - b_{jk}),\ \forall i, j$ | |
| (14) | $Z^L_{ik} - M(1 - b^L_{ik}) \le x^L_{ik} \le Z^L_{ik} + M b^L_{ik},\ \forall i, k$ | Determine upper and lower bound of k in dimension i. |
| (15) | If $b^L_{ik} = 1$: $Z^L_{ik} \le x^L_{ik} \le Z^L_{ik} + M,\ \forall i, k$ | |
| (16) | If $b^L_{ik} = 0$: $Z^L_{ik} - M \le x^L_{ik} \le Z^L_{ik},\ \forall i, k$ | |
| (17) | $Z^U_{ik} - M b^U_{ik} \le x^U_{ik} \le Z^U_{ik} + M(1 - b^U_{ik}),\ \forall i, k$ | |
| (18) | If $b^U_{ik} = 1$: $Z^U_{ik} - M \le x^U_{ik} \le Z^U_{ik},\ \forall i, k$ | |
| (19) | If $b^U_{ik} = 0$: $Z^U_{ik} \le x^U_{ik} \le Z^U_{ik} + M,\ \forall i, k$ | |
| (20) | $X_{ji} \le x^L_{ik} - \Delta + M(1 - q^L_{ijk}),\ \forall i, j$ | Identify whether the sample lies outside hyperbox k in 1 or more dimensions. |
| (21) | $X_{ji} \ge x^U_{ik} + \Delta - M(1 - q^U_{ijk}),\ \forall i, j$ | |
| (22) | $\sum_i (q^L_{ijk} + q^U_{ijk}) \le M(1 - b_{jk}),\ \forall j, k$ | Reject a sample j from being within k if it is outside the bounds of k in 1 or more dimensions. |
| (23) | $\sum_i (q^L_{ijk} + q^U_{ijk}) \ge (1 - b_{jk}),\ \forall j, k$ | |
| (24) | $\sum_k b_{jk} \le M c_j,\ \forall j$ | Classify as a positive sample if j belongs in any k. |
| (25) | $\sum_k b_{jk} \ge c_j,\ \forall j$ | Tighten the constraint. |
| (26) | $b_{jk},\ b^U_{ik},\ b^L_{ik},\ q^U_{ijk},\ q^L_{ijk},\ c_j \in \{0, 1\}$ | Define all binary variables in the model. |

The E-state index is calculated for every atom group present in every molecule in the dataset. An example calculation for the E-state index of the hydroxyl group present in 1-hexanol, a molecule in the dataset, is shown in Table 2, with the aid of Fig. 3. In Table 2, 's' is a single covalent bond. The indices of the six most frequently occurring atom groups in the dataset are used as the dimensions for the hyperbox-based model.
The molecules in the dataset are split into two sets: training and validation data. The hyperbox classifier is trained on the training data via MILP. Table 3 (Eqs. 6–26) presents the MILP model used (Tan et al., 2020). This MILP formulation can be solved to global optimality by a standard branch-and-bound solver. Of the 56 molecules in the dataset, 39 randomly selected molecules were used for training. Using the generated set of rules, the remaining 17 molecules were used to evaluate the reliability of the hyperbox model.
The rules generated from the hyperbox model are based on patterns found in the dataset and are not necessarily scientifically plausible. Therefore, before the model can be used in the subsequent stages, it is checked for scientific coherency. If the model is found not to be scientifically coherent, an alternative set of rules is developed.
3.3. Step 4: structural descriptors for CAMD
Two different models were established. The first comprises a set of rules, expressed in terms of the electrotopological-state indices of the six most common atom groups, generated by the hyperbox model to predict the ability of molecules to repel mosquitos. The second utilizes existing GC models to predict properties to attain the technical requirements of the product. The two models cannot be used simultaneously to formulate the CAMD problem since electrotopological-state indices have rather unique structures.
3.4. Step 5: design of mosquito repellent molecules using CAMD
CAMD is the reverse of a property prediction problem: the potential structures of mosquito repellent molecules are generated such that they satisfy the desired target properties (Harper and Gani, 2000). It is a reverse mathematical problem that can be formulated using a MILP model, which can be solved to global optimality by a standard branch-and-bound solver. In this study, the CAMD model is formulated as follows (Eqs. 27–32). Since the target in this problem is to improve the sprayability, either viscosity or molar volume can be minimized to meet that objective. In this case, viscosity is chosen as the property to be minimized.
$$\min\ \eta \qquad (27)$$
Subject to the following property constraints:
$$F_p > 333.15\ \mathrm{K} \qquad (28)$$
$$LC_{50} > 0.05\ \mathrm{mol/L} \qquad (29)$$
$$\delta - 3 \le 24.89 \le \delta + 3\ \mathrm{MPa^{0.5}} \qquad (30)$$
$$V_m < 0.075\ \mathrm{m^3/kmol} \qquad (31)$$
The flash point constraint is based on the Globally Harmonized System (GHS) of chemicals, under which a material with a flash point above 60 °C is not regarded as flammable. An LC50 of 0.05 mol/L is an acceptable limit for a formulated product (Conte et al., 2011). The limit on the solubility parameter is chosen based on the solubility parameter of isopropyl alcohol, which is the most commonly used solvent in mosquito repellents. For insect repellents, the limit of molar volume is 0.075 m³/kmol (Conte et al., 2011).
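As a rough illustration of how the property constraints in Eqs. 28 to 31 act as a screen, the following Python sketch checks a candidate's predicted properties against the stated limits. The function name, argument names, and sample property values are hypothetical; in the study these properties are estimated by GC models inside the MILP rather than supplied by hand.

```python
def meets_property_constraints(flash_point_K, lc50_mol_per_L,
                               solubility_MPa05, molar_volume_m3_per_kmol):
    """Return True if all four property constraints (Eqs. 28-31) hold."""
    return (
        flash_point_K > 333.15                      # Eq. 28: flash point above 60 degC
        and lc50_mol_per_L > 0.05                   # Eq. 29: toxicity limit
        and abs(solubility_MPa05 - 24.89) <= 3.0    # Eq. 30: within 3 MPa^0.5 of 24.89
        and molar_volume_m3_per_kmol < 0.075        # Eq. 31: molar volume limit
    )


# Hypothetical property estimates, for illustration only
candidate_ok = meets_property_constraints(340.0, 0.08, 23.5, 0.070)
candidate_flammable = meets_property_constraints(320.0, 0.08, 23.5, 0.070)
print(candidate_ok, candidate_flammable)
```

Eq. 30 is written here in the equivalent form |δ − 24.89| ≤ 3, which keeps the solubility parameter of the candidate close to that of isopropyl alcohol.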
The total number of valences and atoms of the molecule should be selected such that the molecule is complete, without any free bonds:
$$\sum_{i=1}^{n_1} x_i + 2\sum_{i=n_1}^{n_2} x_i + 3\sum_{i=n_2}^{n_3} x_i + 4\sum_{i=n_3}^{n_4} x_i = 2\left\{\left[\sum_{i=1}^{N} x_i\right] - 1 + R\right\} \qquad (32)$$
where $x_i$ is the molecular group i; $n_1$, $n_2$, $n_3$, and $n_4$ are the number of molecular groups with valence one, two, three, and four, respectively; and N is the total number of groups in the molecule.
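The balance expressed by Eq. 32 can be checked numerically for a given set of groups. The sketch below assumes that R denotes the number of rings (zero for acyclic structures) and uses a small, hand-assigned valence table; the group names, the table, and the helper function are illustrative and not part of the original formulation.

```python
# Free valences of a few illustrative groups (assumed values for this sketch)
VALENCE = {"CH3": 1, "CH2": 2, "CH": 3, "C": 4, "CHO": 1, "OH": 1}


def is_structurally_complete(group_counts, rings=0):
    """Check Eq. 32: total valence equals 2 * (number of groups - 1 + rings)."""
    total_valence = sum(VALENCE[g] * n for g, n in group_counts.items())
    total_groups = sum(group_counts.values())
    return total_valence == 2 * (total_groups - 1 + rings)


butanal = {"CH3": 1, "CH2": 2, "CHO": 1}   # CH3-CH2-CH2-CHO
dangling = {"CH3": 1, "CH2": 2}            # one free bond left over

print(is_structurally_complete(butanal))   # valence 6 == 2 * (4 - 1 + 0)
print(is_structurally_complete(dangling))  # valence 5 != 2 * (3 - 1 + 0)
```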
The set of rules based on electrotopological-state indices generated by the hyperbox-based model to predict the molecules' ability to repel mosquitos cannot be implemented in the CAMD model directly. This is because the electrotopological state of an atom group is a path-based index that depends on the remaining groups present in the molecule. However, constraints are added to ensure the model includes at least one of each required atom group (depending on the rules generated by the hyperbox-based model) from the six most common atom groups in the resultant structures. Subsequently, the structures obtained are characterized by the same topological descriptor to ensure the rules generated by the hyperbox-based model are satisfied. Structures that fail to satisfy the rules are rejected.
3.5. Step 6: Verification
The final molecule generated by CAMD can be verified in terms of its ability to repel mosquitos by first checking whether the molecule exists in the dataset. If it is absent, a literature search can be conducted to check whether repellent properties have been reported. If the molecule cannot be identified in the literature, it is considered novel, and its performance as a mosquito repellent still needs to be verified experimentally. Such experimental verification is not within the scope of this study.
If the dataset or literature identifies a molecule as a mosquito attractant, the molecule is rejected. If all the generated molecules are identified as attractants, the hyperbox model is revisited.

Table 4
A summary of E-state indices characterisation of the dataset.

| Atom group | E-state value: minimum | E-state value: maximum | E-state value: average | Occurrence |
|---|---|---|---|---|
| sCH3 | 0 | 8.457 | 4.089 | 53 |
| ssCH2 | 0 | 28.202 | 10.563 | 46 |
| dssC | -1.185 | 3.120 | 0.08 | 39 |
| dO | -1.734 | 11.885 | 10.355 | 36 |
| ssO | 0 | 6.109 | 5.158 | 22 |
| sOH | 0 | 15.771 | 9.008 | 21 |
4.0. Results and discussion
4.1. Hyperbox-based ML model
This work analyses the underlying link between a molecule's structure and its mosquito repelling attribute. The dataset used consists of a total of 56 molecules binary classified as being either repellent or attractant towards Aedes aegypti mosquitos. To develop the hyperbox classifier, the MILP model discussed in Section 3.2 is used and solved to global optimality using a branch-and-bound solver (LINGO extended version 18.0.56). The model generates sets of disjunctive rules such that, if fulfilled, a molecule is capable of repelling mosquitos subject to Type I and Type II error criteria.
4.1.1. Dataset characterisation with electrotopological-state indices
All 56 molecules in the dataset are characterised with E-state indices based on their constituent atom groups. Table A1 of the Appendix provides the indices of every atom group of every molecule. Table 4 presents a summary of the E-state indices of the six most frequently occurring atom groups in the dataset, where 'd' and 's' refer to double and single bonds, respectively. Of the 6 most frequently occurring atom groups, 3 are oxygen heteroatoms while the remaining 3 are carbon-based.
4.1.2. Training and validation datasets
Seven variables are used. Six are continuous variables consisting of the E-state indices presented in Table 4, which serve as the hyperbox dimensions, whereas one discrete variable, 'Repellency', is used as the decision attribute. A Repellency value of 1 indicates that the molecule sample is a mosquito repellent.
Among the 56 molecule samples in the dataset, 35 were found to be repellent while the remaining 21 were attractants. The dataset was then divided into two; one set is used to train the model, while the second set is used for validation. A total of 39 randomly selected samples were allocated to the training set, and the remaining 17 samples were used for validation. The training and validation data are shown in Table A2 and Table A3 of the Appendix.
4.1.3. Solutions with varying number of hyperboxes
The number of hyperboxes is a user-defined parameter, chosen such that it improves model performance without overfitting the rules to the training data. Thus, hyperbox models with various numbers of hyperboxes are generated. The appropriate number of hyperboxes depends on the nature of the problem being solved. In this problem, overfitting of the dataset was observed when more than 3 hyperboxes were used. In all cases, Type I error (α) is minimised while constraining Type II error (β) below a certain threshold (β < ε). This is because in this case, Type I error, which is the error involved in mistakenly classifying an attractant molecule as a repellent, carries much more consequence than mistakenly classifying a repellent molecule as an attractant (Type II error). An error margin, Δ, is also defined to indicate the minimum separation of hyperboxes between positive and negative samples. Any sample that falls within the error margin indicates an ambiguity in the selected class and is considered misclassified. The value of Δ is chosen as 0.05 for this design. The results were then tested on the validation dataset.
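The role of the error margin Δ can be sketched in Python as follows. This is one possible reading of the boundary constraints in Table 3 (Eqs. 10 to 13, 20, and 21), not the MILP itself: a sample is inside a box when it lies within the bounds in every dimension, clearly outside when it lies at least Δ beyond a bound in some dimension, and ambiguous otherwise. The function name and the example box are hypothetical.

```python
def box_status(sample, lower, upper, margin=0.05):
    """Classify a sample relative to one hyperbox with an error margin."""
    dims = list(zip(sample, lower, upper))
    if all(lo <= x <= hi for x, lo, hi in dims):
        return "inside"
    # Clearly outside: at least `margin` beyond a bound in some dimension
    if any(x <= lo - margin or x >= hi + margin for x, lo, hi in dims):
        return "outside"
    # Neither inside nor clearly outside: the sample sits in the margin zone
    return "ambiguous"


# Hypothetical two-dimensional box with bounds [0, 1] in both dimensions
lower, upper = [0.0, 0.0], [1.0, 1.0]
print(box_status([0.5, 0.5], lower, upper))
print(box_status([0.5, 1.02], lower, upper))
print(box_status([1.20, 1.02], lower, upper))
```

Samples reported as ambiguous correspond to those the text treats as misclassified because they fall within the error margin.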
The resultant set of rules when three hyperboxes were used is as follows:
• Rule A1: IF (sOH ≤ 7.863) AND (sCH3 ≥ 4.376) AND (ssCH2 ≤ 22.436) AND (dssC ≥ -0.012) THEN (Repellency = 1)
• Rule A2: IF (sOH ≤ 7.863) AND (dO ≤ 11.438) AND (ssO ≥ 0.05) AND (sCH3 ≤ 8.407) AND (ssCH2 ≤ 20.887) THEN (Repellency = 1)
• Rule A3: IF (sOH ≥ 8.517) AND (sCH3 ≤ 5.893) AND (ssCH2 ≥ 0.05) THEN (Repellency = 1)
This set of rules is disjunctive; fulfilling any one of them is sufficient to classify a molecule as a potential repellent. In the training and validation data, if a molecule does not possess a certain atom group of the six most frequently occurring, a value of 0 was assigned to the missing atom group. Therefore, Rules A1 to A3, and the subsequent rules for 2 and 1 hyperbox(es), are interpreted as follows: if any of the atom groups named in the rule is present in the molecule, its E-state index should be bound by the given limits to obtain Repellency = 1. However, the presence of an atom group in the molecule is only required when the rule has a positive (or zero) lower bound or a negative upper bound on its E-state index. If the rule lacks a limit for a certain index, then that index is redundant.
The resultant set of rules when two hyperboxes were used is as follows:
• Rule B1: IF (sOH ≤ 8.161) AND (sCH3 ≤ 7.159) AND (1.534 ≤ ssCH2 ≤ 22.436) AND (dssC ≥ -0.609) THEN (Repellency = 1)
• Rule B2: IF (sOH ≥ 9.349) AND (0 < dO ≤ 11.542) AND (ssO > 0) AND (sCH3 ≥ 4.529) AND (ssCH2 ≤ 3.557) AND (dssC > 0) THEN (Repellency = 1)
In Rule B1, only the ssCH2 group is required to be present to deem a molecule as being repellent. However, if a molecule contains any of the remaining groups, their E-state indices should be bound by the rule's limits. In B2, all atom groups must be present except dssC.
The rule when one hyperbox was used is as follows:
• Rule D: IF (sOH ≤ 7.640) AND (dO > 0) AND (ssO ≤ 6.109) AND (sCH3 ≤ 7.957) AND (1.534 ≤ ssCH2 ≤ 22.725) AND (dssC ≥ -0.337) THEN (Repellency = 1)
In Rule D, only dO and ssCH2 groups need to be present to classify a compound as a repellent.
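Because the rules are plain interval checks, Rule D can be evaluated with a few lines of Python. The sketch below is illustrative: the function name and dictionary layout are assumptions, absent atom groups default to 0 as described above, and the example E-state values are taken from Table A1.

```python
def rule_d(e_state):
    """Evaluate Rule D; missing atom groups are treated as an index of 0."""
    get = lambda group: e_state.get(group, 0.0)
    return (
        get("sOH") <= 7.640
        and get("dO") > 0.0
        and get("ssO") <= 6.109
        and get("sCH3") <= 7.957
        and 1.534 <= get("ssCH2") <= 22.725
        and get("dssC") >= -0.337
    )


# E-state indices as listed in Table A1
hexanal = {"dO": 9.68, "sCH3": 2.13, "ssCH2": 4.208}
butanoic_acid = {"sOH": 7.913, "dO": 8.275, "sCH3": 1.841,
                 "ssCH2": 1.023, "dssC": -0.711}
octadecyl_propanoate = {"dO": 10.69, "ssO": 5.265, "sCH3": 3.86,
                        "ssCH2": 23.225, "dssC": 0.126}

for name, mol in [("hexanal", hexanal),
                  ("butanoic acid", butanoic_acid),
                  ("octadecyl propanoate", octadecyl_propanoate)]:
    print(name, "->", "Repellency = 1" if rule_d(mol) else "rule not satisfied")
```

With these inputs only hexanal satisfies the rule: butanoic acid violates the sOH, ssCH2, and dssC limits, and octadecyl propanoate exceeds the upper bound on ssCH2.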
4.1.4. Comparison between solutions
In the developed model, key model parameters such as the number of hyperboxes and the values of Δ and ε are user-defined. Therefore, alternative solutions must be surveyed before the final decision is made on which rule(s) to use. The best validation performance obtained by running the model with each number of hyperboxes is presented in Table 5. The solution with one hyperbox resulted in the best performance of the model when validated against the validation dataset, where the obtained α and β are 0.00 and 0.30, respectively. A total of 3 out of 17 molecules were misclassified, yielding a prediction accuracy of 82.35%. This result provides quantitative evidence that the developed decision model is reliable, especially for a risk-averse product designer who would prefer to be conservative when selecting a potential mosquito repellent molecule. In such a case, mistakenly classifying an attractant molecule as a repellent (Type I error) carries much more consequence than the reverse case (Type II error). Amongst the different rules obtained, Rule D is, therefore, the best, and will be used for the CAMD.
Using Rule D, k-fold cross-validation is performed with a k value of 3 against 3 sets of 17 randomly selected samples from the dataset. The performance is comparable with the results shown in Table 5.

Table 5
Best validation performances for solutions with different number of hyperboxes

| Number of hyperboxes | Attractants misclassified | Repellents misclassified | α | β |
|---|---|---|---|---|
| 3 | 1 | 3 | 0.14 | 0.30 |
| 2 | 2 | 4 | 0.29 | 0.40 |
| 1 | 0 | 3 | 0.00 | 0.30 |

Table 6
Characterisation of the candidate structures by E-State Indices

| Molecule no. | Atom group | E-state index | Requirement attained? |
|---|---|---|---|
| 1 | dsCH | 0.931 | - |
| 1 | dO | 9.405 | Y |
| 1 | ssCH2 | 1.684 | Y |
| 1 | sCH3 | 1.981 | Y |
| 2 | dsCH | 0.962 | - |
| 2 | dO | 9.564 | Y |
| 2 | ssCH2 | 2.902 | Y |
| 2 | sCH3 | 2.073 | Y |
| 3 | dsCH | 0.980 | - |
| 3 | dO | 9.774 | Y |
| 3 | sssCH | 0.581 | - |
| 3 | ssCH2 | 1.829 | Y |
| 3 | sCH3 | 4.169 | Y |
| 4 | dsCH | 0.982 | - |
| 4 | dO | 9.680 | Y |
| 4 | ssCH2 | 3.513 | Y |
| 4 | sCH3 | 2.130 | Y |
| 5 | dsCH | 0.992 | - |
| 5 | dO | 9.825 | Y |
| 5 | sssCH | 0.722 | - |
| 5 | ssCH2 | 2.975 | Y |
| 5 | sCH3 | 4.318 | Y |
| 6 | dssC | -0.711 | N |
| 6 | dO | 9.599 | Y |
| 6 | ssCH2 | 1.023 | N |
| 6 | sCH3 | 1.841 | Y |
| 6 | sOH | 7.913 | N |
| 7 | dssC | 0.252 | Y |
| 7 | dO | 10.332 | Y |
| 7 | ssCH2 | 1.653 | Y |
| 7 | sCH3 | 1.845 | Y |
| 7 | sNH2 | 5.084 | - |
| 8 | dssC | 0.337 | Y |
| 8 | dO | 10.579 | Y |
| 8 | ssCH2 | 9.842 | Y |
| 8 | sCH3 | 3.910 | Y |

Table 7
CAS registration number of the candidate structures

| Molecule no. | Name | CAS no. |
|---|---|---|
| 1 | Butanal | 123-72-8 |
| 2 | Pentanal | 110-62-3 |
| 3 | 3-methylpentanal | 15877-57-3 |
| 4 | Hexanal | 66-25-1 |
| 5 | 4-methylhexanal | 41065-97-8 |
| 7 | 1-aminopentan-3-one | 87156-66-9 |
| 8 | 2-Undecanone | 112-12-9 |
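The α, β, and accuracy figures reported for the validation set can be reproduced from the misclassification counts in Table 5. The sketch below assumes the 17 validation samples comprise 7 attractants and 10 repellents; this split is inferred from the reported ratios (for example 1/7 ≈ 0.14 and 3/10 = 0.30) rather than stated explicitly, and the function name is hypothetical.

```python
def validation_metrics(attractants_misclassified, repellents_misclassified,
                       n_attractants=7, n_repellents=10):
    """Return (alpha, beta, accuracy) from validation misclassification counts."""
    alpha = attractants_misclassified / n_attractants   # Type I error rate
    beta = repellents_misclassified / n_repellents      # Type II error rate
    wrong = attractants_misclassified + repellents_misclassified
    accuracy = 1 - wrong / (n_attractants + n_repellents)
    return alpha, beta, accuracy


# Counts per number of hyperboxes, as listed in Table 5
table_5 = {3: (1, 3), 2: (2, 4), 1: (0, 3)}
for boxes, (fp, fn) in table_5.items():
    alpha, beta, acc = validation_metrics(fp, fn)
    print(f"{boxes} hyperbox(es): alpha={alpha:.2f}, beta={beta:.2f}, "
          f"accuracy={acc:.2%}")
```

For the single-hyperbox solution this gives α = 0.00, β = 0.30, and an accuracy of 14/17, i.e. 82.35%, matching the values quoted in the text.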
4.2. Computer-aided molecular design of mosquito repellent molecules
The developed model is used to design mosquito repellent molecules using CAMD. The MILP model presented in the methodology (Section 3.4) is used to identify a repellent that can be used as a sprayable product with other relevant attributes. It is to be noted that attributes such as flash point, LC50, and solubility parameter only need to meet the constraints in order to be suitable for this application. Both viscosity and molar volume are related to the sprayability of the product, and the problem can be formulated by optimising either one of these properties. The desired attributes are estimated using GC property prediction models and are imposed through property constraints. After the molecules are generated, they are characterised by E-state indices to ensure that Rule D from Section 4.1.3 is satisfied. If not, the structure is rejected. Finally, a literature search is conducted to examine whether the obtained structures are identified as known repellents in the literature.
4.2.1. Generated molecular structures
From Rule D, only the dO and ssCH2 atom groups are required to be present in the structure. The lower limit of the E-state index for dO, and the lower and upper limits for ssCH2, cannot be directly constrained in the MILP model. This is because the E-state of an atom group is a path-based index that depends on the remaining groups present in the molecule. However, constraints were added to force the model to include both groups at least once in the resultant structure. In Section 4.2.2, the obtained structures are characterised by E-state indices to ensure Rule D is satisfied. The molecules generated are butanal (1), pentanal (2), 3-methylpentanal (3), hexanal (4), 4-methylhexanal (5), butyric acid (6), 1-aminopentan-3-one (7), and 2-undecanone (8). The molecular structures obtained along with their estimated properties are shown in Appendix Table A7. An E-state indices check and validation against the literature are required before deciding on the best candidate.
4.2.2. E-State Indices Check
The model was solved to obtain the optimum molecular structure. Integer cuts were used to generate seven additional alternative molecules. This procedure relies on adding constraints to the MILP model to exclude previously identified solutions, and thus identify new ones. The 8 generated structures are characterised by E-state indices; Table 6 presents the result. Rule D dictates that only the dO and ssCH2 atom groups are required to be present. However, all the generated molecules also include the sCH3 group. Nevertheless, the respective E-state index for all the generated molecules is still bounded within the limit given in the rule. Similarly, molecule number 6 includes dssC and sOH, and molecules 7 and 8 include dssC. Atom groups dsCH, sssCH, and sNH2 are not present in the rule and are hence redundant. The indices of all molecules satisfy Rule D, except for molecule number 6. Therefore, it is rejected.
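The integer-cut idea can be sketched on a toy problem. The example below assumes binary decision variables and uses the common cut Σ_{i∈B} y_i − Σ_{i∉B} y_i ≤ |B| − 1, where B is the set of variables equal to 1 in a previous solution; the exact form of the cuts used in the study is not given in the text, and brute-force enumeration stands in for the MILP solver. All names and the toy objective are hypothetical.

```python
from itertools import product


def violates_cut(y, previous):
    """True if y violates the integer cut built from a previous binary solution."""
    ones = sum(previous)
    lhs = sum(yi if pi == 1 else -yi for yi, pi in zip(y, previous))
    return lhs > ones - 1


def enumerate_with_cuts(objective, feasible, n_vars, n_solutions):
    """Repeatedly minimise, adding one cut per solution found."""
    cuts, found = [], []
    for _ in range(n_solutions):
        candidates = [y for y in product((0, 1), repeat=n_vars)
                      if feasible(y) and not any(violates_cut(y, p) for p in cuts)]
        if not candidates:
            break
        best = min(candidates, key=objective)
        found.append(best)
        cuts.append(best)  # exclude this solution from later rounds
    return found


# Toy problem: pick at least one item, minimise total weight
weights = (3, 1, 2)
cost = lambda y: sum(w * yi for w, yi in zip(weights, y))
solutions = enumerate_with_cuts(cost, lambda y: sum(y) >= 1, 3, 4)
print([(s, cost(s)) for s in solutions])
```

Each cut removes exactly one previously found binary vector from the feasible set, so successive solves return the next-best alternatives in order of non-decreasing objective value.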
4.2.3. Verification
A literature search was conducted on the feasible structures generated. All of them can be found in reputable databases such as PubChem (Kim et al., 2021). Table 7 lists their CAS registration numbers. Molecule 6 was omitted as it failed the E-state check.
The effectiveness of the molecules in repelling mosquitos was checked in the literature. It was found that butanal (molecule 1) inhibits the activity of the CO2 complex in Aedes aegypti, and hence can be used in mosquito repellent products. Ray & Turner (2015) proposed an insect repellent comprising combinations of butanal, pentanal, and hexanal. The authors claim that those 3 molecules, along with others, efficiently inhibit the carbon dioxide response in the neurones of the Drosophila antenna and play an important role in CO2 masking in mosquitos. Therefore, molecules 2 and 4 are also known repellents. 2-Undecanone (molecule 8) was also found to be a known repellent derived from a wild tomato plant (Farrar Jr. & Kennedy, 1987). It is believed that the molecule plays a prominent role in natural plant defence mechanisms against herbivores (Kennedy, 2003). Molecule 1, however, is considered the best candidate as it has the lowest value of viscosity.
4.3. Analysis and discussion
4.3.1. Hyperbox model
Rule plausibility is an important aspect to gauge whether the resulting ML classifier is detecting real or spurious patterns in the dataset. Rule D was examined for its scientific coherency. The rule is based on E-state indices, which encode the intrinsic electronic state of an atom as affected by other electronic influences from the remaining atoms in the molecule (Conte et al., 2011). The implications are as follows: atoms that have π bonds, or lone pairs of electrons, or are terminal atoms tend to have a large positive E-state index, whereas atoms that are based on σ bonds, or lack lone pairs of electrons, or are buried in the interior of the molecule have small positive or negative index values.
In Rule D, the only heteroatom required to be present in the structure is dO. No upper bound on its E-state is dictated; the mere presence of the atom group is enough. The E-state index of this atom group will always be positive since the oxygen atom has both π bonds and lone pairs of electrons. This is reflected in the model by assigning a lower bound limit of zero. The required presence of this atom group implies that functional groups such as aldehydes, ketones, esters, and amides are responsible for the mosquito repelling attribute of molecules. This is scientifically coherent when examining the structures of some common repellents. DEET, the most commonly used, has an amide group. Dimethyl phthalate has two ester groups. IR3535 has both a tertiary amide and an ester group. Aldehydes such as butanal and pentanal are known to be mosquito repellents. An interesting result suggests that the nitrogen atom in amide groups, or any other nitrogen-based functional group, is not responsible for the mosquito repelling attribute; rather it is solely the oxygen atom that makes structures effective as repellents. The model correctly classified DEET and butyl anthranilate (both contain nitrogen) without the consideration of E-state indices of nitrogen-containing atom groups. However, they were only 2 of 3 molecules in the dataset that contained nitrogen, so a more diverse dataset is needed before establishing a sound conclusion.
The ssO atom group, which can be represented by the oxygen atom in the cetonic group of an ester, is deemed optional by Rule D. However, if a molecule contains that group, its E-state index value must be equal to or below 6.109. The higher the degree of branching in the cetonic group, or the longer the straight chain, the lower the E-state value of ssO. To illustrate, the cetonic oxygen atom in methyl dodecanoate is adjacent to a terminal sCH3 group and has an E-state index value of 4.583. Therefore, the upper limit dictated by the model gives the cetonic group of esters the flexibility of having small, unbranched chains. This implies that if a mosquito repellent molecule is dependent on an ester group for its performance, the structure of its cetonic group need not be complex. This is scientifically coherent; dimethyl phthalate has a simple methyl group bonded to both its cetonic oxygen atoms, IR3535 has an ethyl group, and MGK Repellent 326 has a propyl group bonded to both its cetonic oxygen atoms.
The model also dictates the presence of the ssCH2 group. This is trivial; the atom group is not terminal and is used to link different functional groups together in almost every organic compound. However, the model dictates a small positive lower bound on its E-state index. This suggests that, in general, small chained structures based on 2 or 3 carbon atoms are not effective as repellents. Indeed, most repellents have medium to long carbon chains. The model also has a high positive upper bound for the atom group. This allows for multiple occurrences of ssCH2 groups up to a certain extent (E-state indices of occurring groups are summed). For example, the model correctly predicted that octadecyl propanoate is a mosquito attractant, as the 18 occurring ssCH2 groups have a summed E-state index of 23.225, which is slightly above the upper bound given by the model.
4.3.2. Computer-aided molecular design
A total of 8 structures were generated using the MILP model. Only hexanal (molecule no. 4) was present in the original dataset used. Had the methodology been limited to conventional approaches to find mosquito repellent molecules, the 7 other structures would have been missed. Although experimental evaluation is still necessary to verify their performance, the search for repellent molecules has been narrowed down to a large extent. The model has shown its effectiveness in generating mosquito repellents based on a dataset, identifying repellents that are absent from the dataset. Therefore, the molecules that are not yet recognised as mosquito repellents, such as molecule numbers 3, 5, and 7, provide valuable insight into where further experimental research for new repellents should be focused.
5. Conclusion
In this work, a CAMD framework that integrates a data-driven hyperbox-based machine learning model with group contribution models has been developed for the systematic design and screening of mosquito repellent molecules. Within the framework, group contribution-based models were used to predict the physical properties, whereas the hyperbox-based model was implemented to generate transparent IF-THEN rules that predict the repelling ability of the molecule based on its electrotopological-state indices. A MILP model was used to train the hyperbox classifier. Different sets of rules were generated based on the user-defined number of hyperboxes and error margin. The set of rules which performed best in terms of achieving the lowest Type I error upon validation was chosen and incorporated in the CAMD optimisation model. Results show that, of the structures generated from CAMD, the hyperbox classifier correctly predicted the repelling ability of all molecules found to be known repellents in the literature. For the structures which were not verified as repellents by the literature, further experimental verification is required to ensure their effectiveness. Only 1 of 8 generated structures is present in the original dataset used to train the hyperbox classifier. This result demonstrates the ability of this methodology to identify potential chemical structures with mosquito repellence. The developed framework can be applied as a systematic technique to screen and narrow down the search space for candidate mosquito repellent molecules before final experimental verification. This data-driven screening capability can make the development of new chemical products faster and more economical. In the future, the utilisation of a larger, more diverse dataset of known repellents and attractants can improve the scientific coherency of the hyperbox classifier. In addition, by allowing the CAMD model to synthesise cyclic and aromatic structures, potentially superior candidates may be explored. Finally, combining topological indices with group contribution models would yield more accurate property predictions, but at the expense of increased computational effort.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to express sincere gratitude to the Ministry of Higher Education, Malaysia for the realisation of this research project under the Grant FRGS/1/2019/TK02/UNIM/02/1.
https://doi.org/10.13039/501100003093
A. Appendix
Table A1
E-state indices characterisation of molecules in the dataset
Heteroatom groups: sOH to ssS. Hydrocarbon groups: sCH3 to dCH2.

| No. | Molecule | sOH | dO | ssO | sNH2 | sssN | ssNH | ssS | sCH3 | ssCH2 | dssC | aasC | aaCH | sssCH | ssssC | dsCH | dCH2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1-hexanol | 8.293 | 0 | 0 | 0 | 0 | 0 | 0 | 2.164 | 5.043 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Tridecyl octanoate | 0 | 11.536 | 5.292 | 0 | 0 | 0 | 0 | 4.479 | 21.871 | 0.008 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Methyl tetradecanoate | 0 | 10.891 | 4.618 | 0 | 0 | 0 | 0 | 3.732 | 17.992 | -0.067 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Benzoic acid | 8.385 | -1.734 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.879 | 0.331 | 8.296 | 0 | 0 | 0 | 0 |
| 5 | Elaidic acid | 8.517 | 10.33 | 0 | 0 | 0 | 0 | 0 | 2.27 | 20.199 | -0.653 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | Thymol | 9.462 | 0 | 0 | 0 | 0 | 0 | 0 | 6.124 | 0 | 0 | 2.545 | 5.805 | 0.399 | 0 | 0 | 0 |
| 7 | Cineol | 0 | 0 | 6.06 | 0 | 0 | 0 | 0 | 6.756 | 5.314 | 0 | 0 | 0 | 0.823 | 0.378 | 0 | 0 |
| 8 | Propyl octadecanoate | 0 | 11.314 | 5.071 | 0 | 0 | 0 | 0 | 4.308 | 22.486 | -0.012 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | gamma-terpinene | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.73 | 2.358 | 3.12 | 0 | 0 | 0.74 | 0 | 4.718 | 0 |
| 10 | Methyl hexadecanoate | 0 | 10.891 | 4.618 | 0 | 0 | 0 | 0 | 3.732 | 17.992 | -0.067 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | Fenchone | 0 | 11.683 | 0 | 0 | 0 | 0 | 0 | 6.373 | 3.564 | 0.499 | 0 | 0 | 0.681 | 0.057 | 0 | 0 |
| 12 | Butyl heptadecanoate | 0 | 11.414 | 5.157 | 0 | 0 | 0 | 0 | 4.39 | 22.208 | -0.002 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | Salicylic acid | 9.187 | 9.488 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.821 | 1.419 | 7.1 | 0 | 0 | 0 | 0 |
| 14 | Borneol | 9.814 | 0 | 0 | 0 | 0 | 0 | 0 | 6.895 | 3.607 | 0 | 0 | 0 | 0.749 | 0.605 | 0 | 0 |
| 15 | Linalool | 9.489 | 0 | 0 | 0 | 0 | 0 | 0 | 5.893 | 1.67 | 1.296 | 0 | 0 | 0 | -0.702 | 3.712 | 3.559 |
| 16 | n-heneicosane | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.597 | 28.202 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 1,4-cineole | 0 | 0 | 6.109 | 0 | 0 | 0 | 0 | 6.846 | 5.152 | 0 | 0 | 0 | 0.701 | 0.527 | 0 | 0 |
| 18 | butyl anthranilate | 0 | 11.438 | 5.03 | 5.625 | 0 | 0 | 0 | 2.046 | 2.36 | -0.337 | 0.915 | 6.922 | 0 | 0 | 0 | 0 |
| 19 | Tetradecanoic acid | 8.457 | 10.26 | 0 | 0 | 0 | 0 | 0 | 2.247 | 14.36 | -0.657 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | Heptyl tetradecanoate | 0 | 11.57 | 5.275 | 0 | 0 | 0 | 0 | 4.483 | 21.829 | 0.011 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | Hexadecyl pentanoate | 0 | 11.314 | 5.212 | 0 | 0 | 0 | 0 | 4.376 | 22.214 | -0.012 | 0 | 0 | 0 | 0 | 0 | 0 |
| 22 | 1-hexanoic acid | 8.14 | 9.874 | 0 | 0 | 0 | 0 | 0 | 2.057 | 3.277 | -0.682 | 0 | 0 | 0 | 0 | 0 | 0 |
| 23 | Hexanoic acid | 8.14 | 9.874 | 0 | 0 | 0 | 0 | 0 | 2.057 | 3.277 | -0.682 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | Alpha-pinene | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.159 | 2.836 | 1.667 | 0 | 0 | 1.01 | 1.384 | 2.444 | 0 |
| 25 | Borneol Acetate | 0 | 11.007 | 5.445 | 0 | 0 | 0 | 0 | 8.457 | 3.601 | -0.125 | 0 | 0 | 0.908 | 0.565 | 0 | 0 |
| 26 | Methyl octadecanoate | 0 | 10.923 | 4.63 | 0 | 0 | 0 | 0 | 3.723 | 20.937 | -0.065 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | Phenol | 8.632 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.322 | 8.713 | 0 | 0 | 0 | 0 |
| 28 | p-cymene | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.541 | 0 | 0 | 2.759 | 8.712 | 0.653 | 0 | 0 | 0 |
| 29 | N,N-diethyl-meta-toluamide | 0 | 11.885 | 0 | 0 | 1.828 | 0 | 0 | 5.993 | 1.534 | 0.124 | 1.914 | 7.722 | 0 | 0 | 0 | 0 |
| 30 | Octadecyl propanoate | 0 | 10.69 | 5.265 | 0 | 0 | 0 | 0 | 3.86 | 23.225 | 0.126 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31 | Butanoic acid | 7.913 | 8.275 | 0 | 0 | 0 | 0 | 0 | 1.841 | 1.023 | -0.711 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32 | 5-isopropyl-2-methyl phenol | 9.349 | 0 | 0 | 0 | 0 | 0 | 0 | 6.132 | 0 | 0 | 2.526 | 5.837 | 0.488 | 0 | 0 | 0 |
| 33 | Dimethyl disulfide | 0 | 0 | 0 | 0 | 0 | 0 | 3.546 | 4.12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34 | Skatole | 0 | 0 | 0 | 0 | 0 | 3.118 | 0 | 2.11 | 0 | 1.315 | 2.544 | 8.311 | 0 | 0 | 2.032 | 0 |
| 35 | Pulegone | 0 | 11.388 | 0 | 0 | 0 | 0 | 0 | 6.214 | 2.953 | 2.674 | 0 | 0 | 0.602 | 0 | 0 | 0 |
| 36 | Undecyl decanoate | 0 | 11.592 | 5.308 | 0 | 0 | 0 | 0 | 4.499 | 21.757 | 0.012 | 0 | 0 | 0 | 0 | 0 | 0 |
| 37 | Dodecanoic acid | 8.413 | 10.206 | 0 | 0 | 0 | 0 | 0 | 2.228 | 11.77 | -0.659 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38 | Nonyl dodecanoate | 0 | 11.602 | 5.303 | 0 | 0 | 0 | 0 | 4.5 | 21.749 | 0.013 | 0 | 0 | 0 | 0 | 0 | 0 |
| 39 | Eugenol | 9.252 | 0 | 4.948 | 0 | 0 | 0 | 0 | 1.534 | 0.788 | 0 | 1.764 | 5.273 | 0 | 0 | 1.811 | 3.631 |
| 40 | 1-octene-3-ol | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.429 | 3.95 | 0 | 0 | 0 | 0.718 | 0 | 2.023 | 3.714 |
| 41 | Citronellal | 0 | 10.097 | 0 | 0 | 0 | 0 | 0 | 6.322 | 2.942 | 1.364 | 0 | 0 | 0.545 | 0 | 3.231 | 0 |
| 42 | Heptadecyl butanoate | 0 | 11.169 | 5.157 | 0 | 0 | 0 | 0 | 4.289 | 22.582 | -0.03 | 0 | 0 | 0 | 0 | 0 | 0 |
| 43 | 1-nonanol | 8.469 | 0 | 0 | 0 | 0 | 0 | 0 | 2.23 | 9.301 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 44 | hexanal | 0 | 9.68 | 0 | 0 | 0 | 0 | 0 | 2.13 | 4.208 | 0 | 0 | 0 | 0 | 0 | 0.982 | 0 |
| 45 | Pentadecyl hexanoate | 0 | 11.414 | 5.25 | 0 | 0 | 0 | 0 | 4.427 | 22.08 | -0.002 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46 | 1-hexen-3-ol | 8.74 | 0 | 0 | 0 | 0 | 0 | 0 | 2.034 | 1.858 | 0 | 0 | 0 | -0.287 | 0 | 1.559 | 3.429 |
| 47 | Tetradecyl heptanoate | 0 | 11.485 | 5.275 | 0 | 0 | 0 | 0 | 4.458 | 21.942 | 0.004 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48 | trans-anethole | 0 | 0 | 5.06 | 0 | 0 | 0 | 0 | 3.675 | 0 | 0 | 2.103 | 7.962 | 0 | 0 | 4.071 | 0 |
| 49 | Myristic acid | 8.457 | 10.26 | 0 | 0 | 0 | 0 | 0 | 2.247 | 14.36 | -0.657 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50 | Methyl dodecanoate | 0 | 10.797 | 4.583 | 0 | 0 | 0 | 0 | 3.697 | 12.162 | -0.071 | 0 | 0 | 0 | 0 | 0 | 0 |
| 51 | 3-methylbutanoic acid | 8.084 | 9.81 | 0 | 0 | 0 | 0 | 0 | 3.766 | 0.278 | -0.713 | 0 | 0 | 0.275 | 0 | 0 | 0 |
| 52 | ethyl anthranilate | 0 | 11.183 | 4.81 | 5.606 | 0 | 0 | 0 | 1.771 | 0.382 | -0.334 | 1.066 | 6.062 | 0 | 0 | 0 | 0 |
| 53 | heptaldehyde | 0 | 9.768 | 0 | 0 | 0 | 0 | 0 | 2.168 | 5.57 | 0.996 | 0 | 0 | 0 | 0 | 0 | 0 |
| 54 | Heptanoic acid | 8.213 | 9.962 | 0 | 0 | 0 | 0 | 0 | 2.112 | 4.553 | -0.675 | 0 | 0 | 0 | 0 | 0 | 0 |
| 55 | lactic acid | 15.771 | 9.449 | 0 | 0 | 0 | 0 | 0 | 1.197 | 0 | -1.185 | 0 | 0 | -1.231 | 0 | 0 | 0 |
| 56 | Methyl propyl disulfide | 0 | 0 | 0 | 0 | 0 | 0 | 3.768 | 4.309 | 2.59 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Table A2 
Training dataset used to train the Hyperbox model 
Sample Molecule E-State Indices Repellency 
sOH dO ssO sCH3 ssCH2 dssC 
1 1-hexanol 8.293 0 0 2.164 5.043 0 1 
2 Tridecyl octanoate 0 11.536 5.292 4.479 21.871 0.008 1 
3 Methyl tetradecanoate 0 10.891 4.618 3.732 17.992 -0.067 1 
4 Benzoic acid 8.385 -1.734 0 0 0 -0.879 0 
5 Elaidic acid 8.517 10.33 0 2.27 20.199 -0.653 1 
6 Thymol 9.462 0 0 6.124 0 0 1 
7 Cineol 0 0 6.06 6.756 5.314 0 1 
8 Propyl octadecanoate 0 11.314 5.071 4.308 22.486 -0.012 0 
9 gamma-terpinene 0 0 0 6.73 2.358 3.12 1 
10 Methyl hexadecanoate 0 10.891 4.618 3.732 17.992 -0.067 1 
11 Fenchone 0 11.683 0 6.373 3.564 0.499 1 
12 Butyl heptadecanoate 0 11.414 5.157 4.39 22.208 -0.002 1 
13 Salicylic acid 9.187 9.488 0 0 0 -0.821 0 
14 Borneol 9.814 0 0 6.895 3.607 0 0 
15 Linalool 9.489 0 0 5.893 1.67 1.296 1 
16 n-heneicosane 0 0 0 4.597 28.202 0 0 
17 1,4-cineole 0 0 6.109 6.846 5.152 0 1 
18 butyl anthranilate 0 11.438 5.03 2.046 2.36 -0.337 1
19 Tetradecanoic acid 8.457 10.26 0 2.247 14.36 -0.657 0 
20 Heptyl tetradecanoate 0 11.57 5.275 4.483 21.829 0.011 1 
21 Hexadecyl pentanoate 0 11.314 5.212 4.376 22.214 -0.012 1 
22 1-hexanoic acid 8.14 9.874 0 2.057 3.277 -0.682 1 
23 Hexanoic acid 8.14 9.874 0 2.057 3.277 -0.682 0 
24 Alpha-pinene 0 0 0 7.159 2.836 1.667 1 
25 Borneol Acetate 0 11.007 5.445 8.457 3.601 -0.125 0 
26 Methyl octadecanoate 0 10.923 4.63 3.723 20.937 -0.065 1 
27 Phenol 8.632 0 0 0 0 0 0 
28 p-cymene 0 0 0 6.541 0 0 1 
29 N,N-diethyl-meta-toluamide 0 11.885 0 5.993 1.534 0.124 1 
30 Octadecyl propanoate 0 10.69 5.265 3.86 23.225 0.126 0 
31 Butanoic acid 7.913 8.275 0 1.841 1.023 -0.711 0 
32 5-isopropyl-2-methyl phenol 9.349 0 0 6.132 0 0 1 
33 Dimethyl disulfide 0 0 0 4.12 0 0 0 
34 Skatole 0 0 0 2.11 0 1.315 0 
35 Pulegone 0 11.388 0 6.214 2.953 2.674 1 
36 Undecyl decanoate 0 11.592 5.308 4.499 21.757 0.012 1 
37 Dodecanoic acid 8.413 10.206 0 2.228 11.77 -0.659 0 
38 Nonyl dodecanoate 0 11.602 5.303 4.5 21.749 0.013 1 
39 Eugenol 9.252 0 4.948 1.534 0.788 0 1 
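Rows in this table are whitespace-delimited, but molecule names may themselves contain spaces (for example "Borneol Acetate"), so splitting on whitespace alone is not enough. The following is a minimal Python sketch, not part of the original paper, that reads one row by taking the last seven tokens as the six E-State indices plus the repellency label. The names `Sample` and `parse_row` are hypothetical, and reading the label 1 as "repellent" is an assumption (consistent with N,N-diethyl-meta-toluamide being labeled 1).

```python
from dataclasses import dataclass

# Column order as printed in the table header.
DESCRIPTORS = ("sOH", "dO", "ssO", "sCH3", "ssCH2", "dssC")


@dataclass(frozen=True)
class Sample:
    """One table row (hypothetical container, not from the paper)."""
    sample: int
    molecule: str
    estate: dict
    repellency: int  # assumed: 1 = repellent, 0 = not repellent


def parse_row(line: str) -> Sample:
    """Parse 'index name... six indices label' from one table row."""
    tokens = line.split()
    # Need at least: index, one name token, six indices, one label.
    if len(tokens) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(tokens)}: {line!r}")
    values = [float(t) for t in tokens[-7:-1]]   # six E-State indices
    name = " ".join(tokens[1:-7])                # everything between index and values
    return Sample(
        sample=int(tokens[0]),
        molecule=name,
        estate=dict(zip(DESCRIPTORS, values)),
        repellency=int(tokens[-1]),
    )


if __name__ == "__main__":
    row = parse_row("25 Borneol Acetate 0 11.007 5.445 8.457 3.601 -0.125 0")
    print(row.molecule, row.estate["dO"], row.repellency)
```

A row in which two fields have been fused during extraction has fewer than nine tokens and is rejected with a `ValueError` rather than silently misread.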
Table A3 
Validation dataset used to validate the developed Hyperbox model. 
Sample Molecule sOH dO ssO sCH3 ssCH2 dssC Repellency
(sOH through dssC are E-State indices)
40 1-octene-3-ol 0 0 0 4.429 3.95 0 0 
41 Citronellal 0 10.097 0 6.322 2.942 1.364 1 
42 Heptadecyl butanoate 0 11.169 5.157 4.289 22.582 -0.03 1 
43 1-nonanol 8.469 0 0 2.23 9.301 0 0 
44 hexanal 0 9.68 0 2.13 4.208 0 1 
45 Pentadecyl hexanoate 0 11.414 5.25 4.427 22.08 -0.002 1 
46 1-hexen-3-ol 8.74 0 0 2.034 1.858 0 1 
47 Tetradecyl heptanoate 0 11.485 5.275 4.458 21.942 0.004 1 
48 trans-anethole 0 0 5.06 3.675 0 0 1 
49 Myristic acid 8.457 10.26 0 2.247 14.36 -0.657 0 
50 Methyl dodecanoate 0 10.797 4.583 3.697 12.162 -0.071 1 
51 3-methylbutanoic acid 8.084 9.81 0 3.766 0.278 -0.713 0 
52 ethyl anthranilate 0 11.183 4.81 1.771 0.382 -0.334 1 
53 heptaldehyde 0 9.768 0 2.168 5.57 0.996 1 
54 Heptanoic acid 8.213 9.962 0 2.112 4.553 -0.675 0 
55 lactic acid 15.771 9.449 0 1.197 0 -1.185 0 
56 Methyl propyl disulfide 0 0 0 4.309 2.59 0 0 
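To make concrete how a validation set such as this one can be scored, the sketch below checks a hyperbox rule against a few rows copied from the table. It assumes the usual reading of a hyperbox as one lower and one upper bound per descriptor, with a sample predicted repellent (1) if it falls inside any box. The single box `toy_box` is purely hypothetical and chosen for illustration; it is not a box reported by the authors, and the names `Hyperbox`, `predict` and `accuracy` are likewise assumptions.

```python
from dataclasses import dataclass

# Descriptor order: sOH, dO, ssO, sCH3, ssCH2, dssC
INF = float("inf")


@dataclass(frozen=True)
class Hyperbox:
    """Axis-aligned box: one (lower, upper) bound per descriptor."""
    lower: tuple
    upper: tuple

    def contains(self, x) -> bool:
        return all(lo <= v <= hi for lo, v, hi in zip(self.lower, x, self.upper))


def predict(boxes, x) -> int:
    """Return 1 if the sample lies inside at least one box, else 0."""
    return int(any(box.contains(x) for box in boxes))


def accuracy(boxes, samples) -> float:
    """Fraction of (descriptors, label) pairs predicted correctly."""
    hits = sum(predict(boxes, x) == y for x, y in samples)
    return hits / len(samples)


# HYPOTHETICAL box for illustration only: constrains dO, leaves the rest open.
toy_box = Hyperbox(
    lower=(-INF, 9.5, -INF, -INF, -INF, -INF),
    upper=(INF, 12.0, INF, INF, INF, INF),
)

# Four rows copied from the validation table: (descriptors, repellency).
validation_rows = [
    ((0, 10.097, 0, 6.322, 2.942, 1.364), 1),      # Citronellal
    ((8.469, 0, 0, 2.23, 9.301, 0), 0),            # 1-nonanol
    ((0, 9.68, 0, 2.13, 4.208, 0), 1),             # hexanal
    ((8.457, 10.26, 0, 2.247, 14.36, -0.657), 0),  # Myristic acid
]

if __name__ == "__main__":
    print(accuracy([toy_box], validation_rows))
```

With this toy box, myristic acid is misclassified because its dO value lies inside the interval even though its label is 0; a fitted model would need additional bounds (or additional boxes) to separate such cases.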
Table A4 
Set 1 of K-Fold Cross Validation 
Molecule sOH dO ssO sCH3 ssCH2 dssC Repellency
(sOH through dssC are E-State indices)
5-isopropyl-2-methyl phenol 9.349 0 0 6.132 0 0 1 
Borneol Acetate 0 11.007 5.445 8.457 3.601 -0.125 0 
Heptyl tetradecanoate 0 11.57 5.275 4.483 21.829 0.011 1 
Tetradecyl heptanoate 0 11.485 5.275 4.458 21.942 0.004 1 
Cineol 0 0 6.06 6.756 5.314 0 1 
3-methylbutanoic acid 8.084 9.81 0 3.766 0.278 -0.713 0 
n-heneicosane 0 0 0 4.597 28.202 0 0 
Nonyl dodecanoate 0 11.602 5.303 4.5 21.749 0.013 1 
hexanal 0 9.68 0 2.13 4.208 0 1 
Octadecyl propanoate 0 10.69 5.265 3.86 23.225 0.126 0 
Heptadecyl butanoate 0 11.169 5.157 4.289 22.582 -0.03 1 
Butyl heptadecanoate 0 11.414 5.157 4.39 22.208 -0.002 1 
Elaidic acid 8.517 10.33 0 2.27 20.199 -0.653 1 
Fenchone 0 11.683 0 6.373 3.564 0.499 1 
Borneol 9.814 0 0 6.895 3.607 0 0 
Phenol 8.632 0 0 0 0 0 0 
Hexadecyl pentanoate 0 11.314 5.212 4.376 22.214 -0.012 1 
Table A5 
Set 2 of K-Fold Cross Validation 
Molecule sOH dO ssO sCH3 ssCH2 dssC Repellency
(sOH through dssC are E-State indices)
1,4-cineole 0 0 6.109 6.846 5.152 0 1 
Tridecyl octanoate 0 11.536 5.292 4.479 21.871 0.008 1 
Pulegone 0 11.388 0 6.214 2.953 2.674 1 
Borneol Acetate 0 11.007 5.445 8.457 3.601 -0.125 0 
Hexadecyl pentanoate 0 11.314 5.212 4.376 22.214 -0.012 1 
Linalool 9.489 0 0 5.893 1.67 1.296 1 
Nonyl dodecanoate 0 11.602 5.303 4.5 21.749 0.013 1 
p-cymene 0 0 0 6.541 0 0 1 
Borneol 9.814 0 0 6.895 3.607 0 0 
Butyl heptadecanoate 0 11.414 5.157 4.39 22.208 -0.002 1 
1-hexen-3-ol 8.74 0 0 2.034 1.858 0 1 
Citronellal 0 10.097 0 6.322 2.942 1.364 1 
n-heneicosane 0 0 0 4.597 28.202 0 0 
Heptanoic acid 8.213 9.962 0 2.112 4.553 -0.675 0 
Heptyl tetradecanoate 0 11.57 5.275 4.483 21.829 0.011 1 
Hexanoic acid 8.14 9.874 0 2.057 3.277 -0.682 0 
heptaldehyde 0 9.768 0 2.168 5.57 0.996 1 
Table A6 
Set 3 of K-Fold Cross Validation 
Molecule sOH dO ssO sCH3 ssCH2 dssC Repellency
(sOH through dssC are E-State indices)
Dodecanoic acid 8.413 10.206 0 2.228 11.77 -0.659 0 
Phenol 8.632 0 0 0 0 0 0 
Benzoic acid 8.385 -1.734 0 0 0 -0.879 0 
Methyl octadecanoate 0 10.923 4.63 3.723 20.937 -0.065 1 
Propyl octadecanoate 0 11.314 5.071 4.308 22.486 -0.012 0 
lactic acid 15.771 9.449 0 1.197 0 -1.185 0 
Methyl hexadecanoate 0 10.891 4.618 3.732 17.992 -0.067 1 
Skatole 0 0 0 2.11 0 1.315 0 
N,N-diethyl-meta-toluamide 0 11.885 0 5.993 1.534 0.124 1 
1-octene-3-ol 0 0 0 4.429 3.95 0 0 
1-hexanol 8.293 0 0 2.164 5.043 0 1 
trans-anethole 0 0 5.06 3.675 0 0 1 
Thymol 9.462 0 0 6.124 0 0 1 
Tetradecanoic acid 8.457 10.26 0 2.247 14.36 -0.657 0 
hexanal 0 9.68 0 2.13 4.208 0 1 
Hexadecyl pentanoate 0 11.314 5.212 4.376 22.214 -0.012 1
Methyl dodecanoate 0 10.797 4.583 3.697 12.162 -0.071 1
Table A7 
Generated Molecular Structures 
Property Description
Molecule number 1 2
Molecular name Butanal Pentanal
Molecular formula C4H8O C5H10O
Molecular weight (kg/kmol) 72.11 86.13
Molecular structure [structure drawings not reproduced]
Viscosity (mPa.s) 1.59 1.97
Flash point (K) 406.73 418.14
Lethal Concentration (mol/L) 65.01 32.15
Solubility parameter (MPa^0.5) 23.81 23.68
Molar volume (m^3/kmol) 0.11 0.12
Property Description
Molecule number 3 4
Molecular name 3-Methylpentanal Hexanal
Molecular formula C6H12O C6H12O
Molecular weight (kg/kmol) 100.16 100.16
Molecular structure [structure drawings not reproduced]
Viscosity (mPa.s) 2.13 2.43
Flash point (K) 422.08 429.55
Lethal Concentration (mol/L) 31.15 15.90
Solubility parameter (MPa^0.5) 23.02 23.55
Molar volume (m^3/kmol) 0.14 0.14
Property Description
Molecule number 5 6
Molecular name 4-Methylhexanal Butyric acid
Molecular formula C7H14O C4H8O2
Molecular weight (kg/kmol) 114.19 88.11
Molecular structure [structure drawings not reproduced]
Viscosity (mPa.s) 2.63 17.73
Flash point (K) 433.49 468.30
Lethal Concentration (mol/L) 15.40 308.46
Solubility parameter (MPa^0.5) 22.89 26.52
Molar volume (m^3/kmol) 0.16 0.13
Property Description
Molecule number 7 8
Molecular name 1-Aminopentan-3-one 2-Undecanone
Molecular formula C5H11NO C11H22O
Molecular weight (kg/kmol) 101.15 170.29
Molecular structure [structure drawings not reproduced]
Viscosity (mPa.s) 17.22 5.77
Flash point (K) 479.80 499.52
Lethal Concentration (mol/L) 228.40 6.68
Solubility parameter (MPa^0.5) 23.71 20.50
Molar volume (m^3/kmol) 0.17 0.24
