Buscar

IgorAugustoBrandao_DISSERT


Prévia do material em texto

UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE
INSTITUTO METRÓPOLE DIGITAL
PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA
MESTRADO EM BIOINFORMÁTICA
Igor Augusto Brandão
Abordagens da biologia de sistemas na
investigação dos pontos de articulação nas
rotas metabólicas do KEGG
Natal - RN
2020
UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE
INSTITUTO METRÓPOLE DIGITAL
PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA
MESTRADO EM BIOINFORMÁTICA
Igor Augusto Brandão
Abordagens da biologia de sistemas na investigação
dos pontos de articulação nas rotas metabólicas do
KEGG
Universidade Federal do Rio Grande do Norte
Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin – UFRN
Coorientador: Prof. Dr. José Miguel Ortega – UFMG
Natal - RN
2020
UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE
INSTITUTO METRÓPOLE DIGITAL
PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA
MESTRADO EM BIOINFORMÁTICA
Abordagens da biologia de sistemas na investigação
dos pontos de articulação nas rotas metabólicas do
KEGG
Dissertação apresentada ao Programa de Pós-
graduação em BioInformática da Universidade Federal
do Rio Grande do Norte por Igor Augusto Brandão
como parte dos requisitos para a obtenção do tı́tulo
de Mestre em BioInformática.
Área de concentração: Bioinformática
Linha de Pesquisa: Biologia de Sistemas
Orientador: Prof. Dr. Rodrigo Juliani S. Dalmolin
Coorientador: Prof. Dr. José Miguel Ortega
Natal, RN
2020
Brandão, Igor Augusto.
 Abordagens da biologia de sistemas na investigação dos pontos
de articulação nas rotas metabólicas do KEGG / Igor Augusto
Brandão. - Natal, 2020.
 61.: il.
 Dissertação (Mestrado) - Universidade Federal do Rio Grande
do Norte. Instituto Métropole Digital. Programa de Pós-graduação
em Bioinformática.
 Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin.
 Coorientador: Prof. Dr. José Miguel Ortega.
 1. Biologia de sistemas - Dissertação. 2. KEGG - Dissertação.
3. Rotas metabólicas - Dissertação. 4. Topologia de redes -
Dissertação. 5. Pontos de articulação - Dissertação. 6.
Visualização de redes - Dissertação. I. Dalmolin, Rodrigo
Juliani Siqueira. II. Ortega, José Miguel. III. Universidade
Federal do Rio Grande do Norte. IV. Título.
RN/UF/BSCB CDU 57:004
Universidade Federal do Rio Grande do Norte - UFRN
Sistema de Bibliotecas - SISBI
Catalogação de Publicação na Fonte. UFRN - Biblioteca Setorial Prof. Leopoldo Nelson - ­Centro de Biociências - CB
Elaborado por KATIA REJANE DA SILVA - CRB-15/351
IGOR AUGUSTO BRANDÃO 
 
ABORDAGENS DA BIOLOGIA DE SISTEMAS NA INVESTIGAÇÃO DOS PONTOS 
DE ARTICULAÇÃO NAS ROTAS METABÓLICAS DO KEGG 
Defesa de Mestrado apresentanda ao Programa de Pós-Graduação em Bioinformática da 
Universidade Federal do Rio Grande do Norte. 
 
Área de concentração: Bioinformática 
Linha de Pesquisa: Biologia de Sistemas 
Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin 
Natal, 14 de agosto de 2020. 
 
 
 
BANCA EXAMINADORA 
 
 
 
 
Prof. Dr. Rodrigo Juliani Siqueira Dalmolin 
Universidade Federal do Rio Grande do Norte 
(Presidente) 
(por videoconferência) 
 
 
 
 
Prof. Dr. César Rennó Costa 
Universidade Federal do Rio Grande do Norte 
(Examinador Interno) 
(por videoconferência) 
 
 
 
 
Prof. Dr. Ricardo D’Oliveira Albanus 
Universidade de Michigan 
(Examinador Externo à Instituição) 
(por videoconferência) 
Agradecimentos
Este trabalho constitui uma importante fase do desenvolvimento acadêmico e exigiu
diversas habilidades que não se adquirem à portas fechadas: paciência, capacidade
de criar parcerias, perseverança e acima de tudo sonho - o sonho de trazer todo o
aprendizado para a vida.
Primeiramente agradeço à Alice pelas importantes contribuições para este trabalho;
Aos meus pais que mesmo distantes estão sempre presentes em minha vida; Ao meu
professor e orientador, Rodrigo Dalmolin, por acreditar em minha capacidade e por
me guiar neste inı́cio de minha jornada acadêmica; Ao professor Lucélio do Instituto
Metrópole Digital, pelo apoio ao longo destes anos em diversas empreitadas; Aos
colegas do grupo de pesquisa, em especial à Clóvis e Diego, pelas contribuições para
este trabalho; Aos colegas do BioME pela convivência e por tornar o caminho mais
agradável; Aos demais professores e colaboradores do BioME, em especial Jessica e
Aldo, pela disposição e paciência em me ajudar nos momentos em que precisei.
O presente trabalho foi realizado no PPg-Bioinfo/UFRN com apoio da Coordenação
de Aperfeiçoamento de Pessoal de Nı́vel Superior - Brasil (CAPES) - Código de Finan-
ciamento 001.
Sumário
Lista de ilustrações . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Lista de tabelas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1 INTRODUÇÃO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1 Biologia de sistemas . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 O estudo de redes aplicado a vias metabólicas . . . . . . . . . . . 13
1.3 Caracterı́sticas topológicas de uma rede . . . . . . . . . . . . . . 15
1.4 Estudo das vias metabólicas do KEGG . . . . . . . . . . . . . . . 18
2 JUSTIFICATIVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 OBJETIVOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Objetivo geral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Objetivos especı́ficos . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Artigo 23
Discussão 48
Conclusão 56
REFERÊNCIAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
APÊNDICE A – MATERIAL SUPLEMENTAR DO ARTIGO . . . . . 62
6
Lista de ilustrações
Figura 1 – Representação de uma via metabólica como grafo . . . . . . . . . . 14
Figura 2 – Avaliação topológica de uma rede . . . . . . . . . . . . . . . . . . . 16
Figura 3 – Impacto da remoção de um AP sobre a rede . . . . . . . . . . . . . 50
Figura 4 – Ferramenta de visualização de redes KPV . . . . . . . . . . . . . . . 52
7
Lista de tabelas
Tabela 1 – Propriedades topológicas de uma rede . . . . . . . . . . . . . . . . 17
Tabela 2 – Identificadores do KEGG . . . . . . . . . . . . . . . . . . . . . . . . 19
8
Lista de Sı́mbolos e Abreviações
KEGG – Kyoto Encyclopedia of Genes and Genomes
AP – Articulation point
EC – Enzyme commission number
KO – KEGG Orthology
GO – Gene Ontology
XML – Extensible Markup Language
KGML – KEGG XML
KPV – KEGG Pathway Viewer
9
Resumo
O estudo da essencialidade das proteı́nas por meio de métodos laboratoriais é caro
e não escalável para grandes quantidades de proteı́nas, desta forma é relevante
avaliar a essencialidade das várias proteı́nas de uma via metabólica como um todo
através de ferramentas computacionais. Em geral, uma via metabólica pode ser
analisada como grafos, os quais fornecem diferentes recursos para o estudo das
caracterı́sticas topológicas de redes, como os seus pontos de articulação e disposição
dos nós. Atualmente, pesquisas em bioinformática estudam a essencialidade de
proteı́nas com base nas métricas de betweenness e degree, contudo a teoria dos grafos
sugere que os pontos de articulação podem ser nós importante em uma rede, resta
avaliar se esses pontos de articulação são de fato essenciais para as vias metabólicas
e seu impacto topológico na rede. Utilizando análises baseadas em métricas de
rede, o nosso objetivo é verificar se de fato esses pontos de articulação representam
gargalos na rede, sendo estes caracterizados como proteı́nas de frequências elevadas
e localizadas no centro das redes. Para tanto, identificamos os pontos de articulação
em diferentes vias metabólicas do KEGG, avaliamos o impacto de cada um deles,
calculamos sua frequência e comparamos suas ocorrências com as demais proteı́nas.
Inicialmente, fizemos o levantamento das vias metabólicas do KEGG que estavam
disponı́veis através dos arquivosKGML associados às redes. Após a listagem das vias
disponı́veis, os dados estruturais de cada uma delas foram convertidos em objetos do
tipo grafo. Os parâmetros ponto de articulação, betweenness e degree foram utilizados
para classificar as proteı́nas constantes em cada via metabólica. Aproximadamente
20% das proteı́nas foram classificadas como pontos de articulação, das quais 3,75%
foram identificadas pela alta frequência e localização em regiões centrais da rede.
Além disso, a maior concentração dos pontos de articulação ocorreu na faixa de
frequência dos 80 a 90%. Um padrão de não aleatoriedade na distribuição dos pontos
de articulação foi identificado nos grupos com frequências acima de 74,5%.Finalmente,
a biossı́ntese de esteroides foi a via metabólica com o maior número de pontos de
articulação com frequências superiores a 80% em sua constituição. A oxidoredutase foi
a classe dos pontos de articulação presente no maior número de vias metabólicas. As
descobertas sugerem que os gargalos das redes avaliadas são pontos de articulação
com as frequências mais altas e localizados no centro da rede. Resta realizar análises
mais aprofundadas a respeito dos papéis biológicos destes pontos de articulação
encontrados.
Palavras-chaves: Biologia de sistemas, KEGG, ponto de articulação, via metabólica,
topologia de redes, visualização de redes.
10
Abstract
The study of proteins essentiality through laboratory methods is expensive, time-
consuming and not scalable for large amounts of proteins. Besides, it is relevant
to evaluate the essentiality of several proteins of a metabolic pathway as a whole. The
metabolic pathways can be analyzed as graphs, which provide several tools to study
the topological features such as the articulation points. Nowadays, research in bioinfor-
matics studies the essentiality of proteins based on betweenness and degree metrics,
however, graph theory suggests that articulation points could be essential nodes in a
network. It remains to be determined whether these articulation points are essential in
metabolic pathways and their topological impact on the network. Using network analysis
via metrics and biologic curation, we aim to verify if bottlenecks are proteins with the
highest frequencies and located in the center of KEGG metabolic pathways. For this
purpose, we identified the articulation points in different networks, evaluate the impact
of each articulation point, calculate their frequency and compare them with occurrences
of non-articulation points. We consulted KEGG pathways available as KGML files. After,
the data was transformed into a graph object. Two centrality parameters including
articulation points and degree are determined and the essential proteins based on
these parameters are classified. Approximately 20% of the proteins are articulation
points. The articulation points with high-frequency which are located in central regions
of the network were considered the most important (3.75%). In addition, the highest
concentration of articulation points occurred in the frequency range of 80-90%. A pattern
of non-randomness of articulation points was identified in the protein groups that have
a frequency of at least 74.5%. Finally, steroid biosynthesis is the metabolic pathway
with the highest number of articulation points with frequency higher than 80%. Besides,
oxidoreductase is the articulation point class present in the highest number of metabolic
pathways. Overall, the findings suggest that bottlenecks are articulation points with
highest frequencies and located in the center of the network. It remains to perform a
deep analysis on the articulation points biological roles.
Key words: Systems biology, KEGG, articulation point, metabolic pathway, network
topology, network visualization.
11
12
1 Introdução
1.1 Biologia de sistemas
O desafio de compreender sistemas biológicos complexos é o que caracteriza o
campo de estudo da biologia, a qual requer a integração de pesquisas experimentais
e computacionais [1]. A biologia de sistemas é uma disciplina que estuda sistemas
biológicos de uma perspectiva holı́stica e interdisciplinar, reunindo profissionais de dife-
rentes campos de estudo tais como biólogos, matemáticos, cientistas da computação,
fı́sicos e engenheiros, de forma a obter componentes orientados para biologia e com-
ponentes orientados para sistemas, tais como bases de dados. Além disso, a biologia
de sistemas é caracterizada por grandes conjuntos de dados e modelagem, sendo
estes decorrentes do desenvolvimento de tecnologias de alta vazão (high-throughput)
nos anos 90. Sem modelos matemáticos e simulações computacionais, no entanto, os
dados não podiam ser entendidos na época [2]. Os primeiros biólogos de sistemas
fizeram diversas tentativas para melhorar a metodologia de estudo destas informações,
tais como o desenvolvimento de algoritmos para a manipulação de grandes conjuntos
de dados e formalismos que modelam uma rede especı́fica, incluindo redes reguladoras
de genes, redes de transdução de sinal e redes metabólicas [3].
Um aspecto relevante sobre as informações coletadas por meio do projeto Genoma
Humano é a hierarquia que estas seguem na natureza, a saber: DNA → mRNA →
proteı́na → interações proteicas → vias de informação → redes de informação →
células → tecidos ou redes de células → um organismo → populações → ecologias.
A tarefa central da biologia de sistemas é (I) coletar informações de cada um desses
nı́veis distintos para sistemas biológicos individuais e (II) integrar esses dados para
gerar modelos matemáticos preditivos do sistema [4]. A informação biológica possui
caracterı́sticas importantes:
• Opera em vários nı́veis hierárquicos de organização;
• É processada em redes complexas;
• Essas redes de informação são tipicamente robustas, de modo que muitas
perturbações únicas não as afetam significativamente;
• Existem nós-chave na rede, onde perturbações podem ter efeitos profundos;
estes nós oferecem alvos poderosos para a compreensão e manipulação do
sistema [4].
Introdução 13
Nesse contexto, é importante ressaltar que sistemas biológicos, como células,
são considerados ”sistemas complexos”. Uma noção geral de sistemas complexos é
composta por um número muito grande de elementos simples e idênticos que interagem
de maneira a produzir comportamentos ”complexos” [5], no entanto a realidade dos
sistemas biológicos é um pouco diferente. Nesse caso, um grande número de conjuntos
de elementos funcionalmente diversos e frequentemente multifuncionais interage de
maneira seletiva e não linear para produzir comportamentos coerentes. Ao contrário
dos sistemas complexos de elementos simples (nos quais as funções emergem das
propriedades das redes que formam e não de qualquer elemento especı́fico), as funções
nos sistemas biológicos dependem de uma combinação da rede e dos elementos
especı́ficos envolvidos [1].
Por muito tempo, assumiu-se que os componentes de sistemas complexos, como
a célula, são aleatoriamente conectados. Entretanto, na década passada um grande
número de pesquisas mostrou que muitas redes reais, independentemente de idade,
função e escopo, convergem para arquiteturas semelhantes, uma universalidade que
permitiu pesquisadores de diferentes disciplinas adotar a teoria das redes como um
paradigma comum [6].
1.2 O estudo de redes aplicado a vias metabólicas
As redes regulatórias e de interação são representações importantes das redes
de proteı́nas e são caracterizadas por possuir arestas direcionais e não direcionais,
respectivamente [7]. O estudo e a aplicação da teoria dos grafos no contexto de redes
biológicas ganharam notoriedade, pois é necessário entender sistematicamente como
as moléculas interagem umas com as outras e como suas funções são determinadas
na complexa maquinaria celular, isoladamente ou em conjunto com outras células [8].
As viasmetabólicas e os ciclos são cadeias de reação em que produtos quı́micos
se tornam o substrato para a próxima etapa [9]. Além disso, a análise estrutural e
funcional de redes metabólicas baseadas no genoma é importante para entender
os princı́pios de design e a regulação do metabolismo em nı́vel sistêmico. A rede
metabólica é organizada em muitos módulos pequenos e altamente conectados, os
quais se combinam hierarquicamente em unidades maiores e menos coesas [10].
Portanto, é importante uma redução racional da rede metabólica em sua estrutura
central e uma compreensão mais profunda de seus módulos funcionais [11].
A principal caracterı́stica das vias metabólicas é o fato de suas reações serem
conectadas por seus intermediários, sendo os produtos de uma reação os reagentes
das reações subsequentes [12]. Em geral, as vias metabólicas podem ser classificadas
como redes de fluxo, onde uma variável especı́fica, como fluxo de massa ou energia,
pode ser conservada em cada nó. As redes metabólicas têm propriedades únicas,
Introdução 14
como as restrições de conservação que devem ser satisfeitas em cada nó. Outra
propriedade é a representação dos nós das redes metabólicas como metabólitos e as
arestas como reações catalisadas por produtos genéticos especı́ficos [13]. Abaixo, é
apresentada a Figura 1 exemplificando a representação de uma via metabólica como
grafo.
Figura 1 – Representação da via metabólica relativa a biossı́ntese de estilbenoide
como grafo (desenvolvido pelo autor1).
Em termos de grafos, existem diversas maneiras de se construir uma rede a partir de
um modelo metabólico [14]. Por exemplo, (I) é possı́vel representar uma via metabólica
como um grafo no qual os metabólitos são os nós e arestas são as reações que
transformam um metabólito em outro [11, 15, 16, 17], (II) como um grafo em que as
reações são os nós e arestas os metabólitos compartilhados entre elas [18, 19, 20],
ou até mesmo (III) como um grafo bipartido com reações e metabólitos como nós [21].
O estudo de redes sugere que as redes biológicas têm duas propriedades estruturais
importantes. Primeira, a maioria dessas redes (incluindo as vias metabólicas) são
scale-free e possuem uma propriedade de ”small-world”, caracterizada por grande parte
de suas conexões serem estabelecidas entre os vértices mais próximos [7, 11, 13]. No
1 Disponı́vel em: <〈https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway module/viewer?
pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09〉>. Acesso em: 08 jun. 2020.
https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway_module/viewer?pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09
https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway_module/viewer?pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09
Introdução 15
entanto, é importante observar a existência de forças motrizes que podem influenciar na
criação de um novo nó na rede biológica, como embaralhamento de exons, retroposição,
elementos móveis, transferência horizontal de genes, duplicação gênica e o fato de
uma nova conexão deste nó refletir sua origem com a natureza de suas propriedades
[22]. Segunda, sugere-se que as redes scale-free tenham alta tolerância a erros
(resistência a falhas aleatórias) e baixa tolerância a ataques (vulnerabilidade à falha
dos nós altamente conectados) [13].
1.3 Caracterı́sticas topológicas de uma rede
Os perfis topológicos de redes podem ser avaliados através da observação de
caracterı́sticas especı́ficas da via, como por exemplo:
1.3.1 Betweenness
A métrica betweenness centrality avalia o número total de caminhos mais curtos,
não redundantes, e que passam por um determinado nó ou aresta [23].
1.3.2 Degree
O degree informa quantos links o nó possui para se conectar a outros nós [7].
É importante ressaltar que a distribuição do degree é ampla, com uma cauda que
geralmente segue uma lei de potência: portanto, muitos vértices com baixo degree
coexistem com alguns vértices com alto degree [24].
A avaliação do degree dos vértices de uma rede traz a tona a ideia de nós hub,
definidos como nós que estão conectados a um número substancial de outros nós
na rede, e em particular, a nós que são mais densamente conectados do que os
nós mais altamente conectado em uma rede scale-free [25]. Contudo, definir este
número substancial de conexões pode ser algo subjetivo, por isso utilizou-se como
referência a definição de Yu et al., na qual os nós hubs são aqueles com as 20%
maiores distribuições de degree [23].
1.3.3 Comunidades
A distribuição das arestas em uma rede não é apenas global, mas também lo-
calmente não homogênea, com altas concentrações de arestas dentro de grupos
especı́ficos de vértices e baixas concentrações entre esses grupos. Essa caracterı́stica
de redes é conhecida como estrutura de comunidade [26].
Introdução 16
1.3.4 Ponto de articulação
Um nó é um AP se a sua remoção desconectar a rede ou aumentar o número
de componentes conectados a ela [27, 28]. Além disso, cada AP possui um impacto
diferente sobre a rede, o qual pode ser definido como o número de vértices que são
desconectados do principal (maior) componente conectado (CC) sobrevivente após a
remoção do vértice [29].
1.3.5 Pontes
Uma aresta pode ser classificada como ponte se a sua remoção desconectar o
grafo ou aumentar o número de componentes conectados [30].
1.3.6 Exemplo de avaliação topológica de uma rede
A partir da Figura 2 é possı́vel identificar alguns dos elementos topológicos citados
anteriormente:
Figura 2 – Avaliação topológica de uma rede. Os cı́rculos numerados na figura
representam os vértices da rede, as linhas conectando os vértices são as
arestas (neste caso não direcionadas). Esta rede possui três comunidades:
A, B e C, demarcadas pelos cı́rculos tracejados e pela cor dos vértices que
as compõem. Além disso, é possı́vel identificar que os nós participantes de
uma mesma comunidade são altamente conectados entre si, porém
possuem poucas ligações com as demais comunidades. A aresta que
conecta os nós [1, 4] representa uma ponte do grafo e os nós 1, 3, 6, 7 e 8
são os pontos de articulação da rede, uma vez que se um deles for
removido o grafo será desconectado (desenvolvido pelo autor).
Introdução 17
Adicionalmente, é possı́vel avaliar outras propriedades topológicas da rede não
visı́veis em uma primeira observação através da tabela 1:
Tabela 1 – Propriedades topológicas de uma rede. A numeração presente na coluna
node é equivalente aos números dos vértices da figura 2. A coluna
betweenness mostra que os vértices 6, 7, 8 são aqueles que mais estão
presentes nas conexões de ponta-a-ponta da rede, como por exemplo [2,
11] ou [1, 12]. Além disso, fica claro que os nós 11 e 12 são os mais
densamente conectados, uma vez que estes fazem vizinhança com quatro
nós. Por fim, a coluna diameter mostra que o número necessário de passos
para conectar as extremidados da rede é sete, no caso para conectar os
nós [4, 10] (desenvolvido pelo autor).
Esta foi somente uma sugestão para a realização de uma análise topológica de rede,
existem outras maneiras de se estudar um grafo e outras propriedades que podem ser
exploradas conforme a necessidade da pesquisa. Além disso, é possı́vel aplicar este
tipo de metodologia para estudos massivos em bases de dados.
Introdução 18
1.4 Estudo das vias metabólicas do KEGG
Atualmente existem diversas bases de dados contendo informações sobre vias
metabólicas e seus componentes. No entanto, foi utilizada a plataforma KEGG, pelo fato
desta possuir uma base de conhecimento manualmente curada em mapas metabólicos
de interação molecular e relações de ortologia entre genes e produtos gênicos [31].
No KEGG, os nós representam enzimas e as arestas reações que transformam um
metabólito em outro [11, 15, 16]. O KEGG consiste em trêsbases de dados: (I)
PATHWAY, para representação de funções de ordem superior em termos de redes de
interação, (II) GENES, para a coleção de catálogos de genes para todos os genomas
completamente sequenciados e alguns parciais [32], e (III) LIGAND, para a coleção de
compostos quı́micos na célula, moléculas e reações enzimáticas [33].
O banco de dados KEGG/PATHWAY, especializado em vias metabólicas, é repre-
sentado por 500 diagramas gráficos para as vias metabólicas de referência. Cada via
pode ser vista como uma rede de enzimas ou uma rede de ECs. Uma vez que os genes
das enzimas são identificados no genoma (com base na semelhança de sequência
e na correlação posicional dos genes) e os ECs adequadamente atribuı́dos, as vias
especı́ficas do organismo podem ser construı́das computacionalmente, correlacionando
genes no genoma com produtos gênicos (enzimas) nas vias de referência conforme os
ECs correspondentes [32].
Os diagramas de rotas do KEGG podem ser utilizados como referência para
interpretação funcional, além de conter relações binárias que representam interações e
relações moleculares, as quais podem ser utilizadas para calcular e comparar rotas [31].
Vale ressaltar que em termos de estrutura informacional, cada via metabólica do KEGG
possui um arquivo KGML com a identificação das proteı́nas, reações quı́micas e suas
interações. Cada diagrama do KEGG é identificado pelo KEGG identifier, conforme
tabela 2. Embora estes diagramas sejam informativos, um estudo topológico de rede
sobre a via metabólica pode ser importante para entender caracterı́sticas especı́ficas
de cada rede. Nesse sentido, a representação via grafos pode ser importante, uma vez
que a representação visual de redes oferece uma forma rápida e intuitiva de compreen-
der seus resultados, além de possibilitar o estudo de diversas caracterı́sticas destas
redes por meio da avaliação topológica.
Introdução 19
Identificador Significado
map via de referência desenhada manualmente
ko via de referência que destaca os KOs
ec via de referência que destaca os ECs
rn via de referência que destaca as reações
<org> via especı́fica do organismo gerada pela conversão de KOs em identi-
ficadores de genes (Entrez Gene ID)
Numeração
011 via global (arestas vinculadas aos KOs)
012 via geral (arestas vinculadas aos KOs)
010 via de estrutura quı́mica
07 via de estrutura de medicamento
outras via regular (nós vinculados aos KOs)
Tabela 2 – Identificadores do KEGG. Cada mapa metabólico é identificado pela
combinação de um prefixo com 2-4 letras e um número de 5 dı́gitos
(desenvolvido pelo Laboratório Kanehisa2).
2 Disponı́vel em: <〈https://www.genome.jp/kegg/pathway.html〉>. Acesso em: 19 jun. 2020.
https://www.genome.jp/kegg/pathway.html
20
2 Justificativa
O desafio no processo de obtenção e mineração de dados em larga escala consiste
em uma atividade recorrente no âmbito da pesquisa cientı́fica e também em projetos
comerciais. Diante disso, o desenvolvimento e utilização de ferramentas para análise
de dados massivos em diversas bases de conhecimento é necessário para fornecer
aos biólogos uma visão integrada dos dados biológicos [34]. Neste contexto, a prin-
cipal contribuição deste trabalho é a avaliação sistemática da topologia dos mapas
metabólicos disponı́veis na plataforma KEGG com o intuito de encontrar os pontos de
vulnerabilidade destas vias, uma vez que a disruptura delas pode estar associadas a
desordens ou doenças para os organismos em geral [35].
A determinação experimental de genes essenciais em laboratório ainda é cara,
demorada e não escalável. Neste sentido, previsões computacionais de genes es-
senciais podem ser utilizadas para restringir o universo de estudo e fornecer uma
lista de prioridades para validação experimental [36]. Adicionalmente, a utilização de
propriedades baseadas em rede são capazes de prever melhor os genes essenciais
entre diferentes organismos, uma vez que estes genes são efetivamente transferidos
em organismos intimamente relacionados por compartilharem genes ortólogos [36].
21
3 Objetivos
3.1 Objetivo geral
Verificar se os pontos de vulnerabilidade das vias metabólicas do KEGG possuem
as seguintes caracterı́sticas: (I) são os nós com as maiores ocorrências entre as
vias metabólicas de diferentes organismos, (II) desconectam a rede em componentes
distintos após sua remoção, (III) estão localizados no centro destas vias metabólicas.
3.2 Objetivos especı́ficos
– Avaliar a ocorrência de cada proteı́na no conjunto das vias metabólicas do KEGG
para os diferentes organismos disponı́veis;
– Identificar quais nós são APs e analisar suas caracterı́sticas topológicas para
cada via metabólica em diferentes organismos;
– Detectar as faixas de frequência em que os APs não estão aleatoriamente dis-
tribuı́dos;
– Analisar funcionalmente os APs para compreender o seu papel biológico;
– Desenvolver uma ferramenta para visualização interativa das redes de maneira a
permitir o acesso as caracterı́sticas topológicas de seus componentes e manipulação
das mesmas.
Artigo:
Systems biology approaches in the
investigation of articulation points in
KEGG metabolic pathways
Autores: Igor A. Brandão, Clovis F. Reis, Alice B. Câmara, Diego A. A. Morais,
Leonardo R. S. Campos, Rodrigo J. S. Dalmolin
 
Systems biology approaches in the investigation of articulation points in KEGG metabolic pathways 
 
 
 
Igor A. Brandão​1​, Clovis F. Reis​1​, Alice B. Câmara​2​,​ ​Diego A. A. Morais​1​, Leonardo R. S. Campos​1​, 
Rodrigo J. S. Dalmolin​1* 
 
 
1 ​Bioinformatics multidisciplinary environment, Federal University of Rio Grande do Norte, Natal, Rio 
Grande do Norte, Brasil 
2​ Biophysics and Pharmacology Department, Federal University of Rio Grande do Norte, Natal, Rio 
Grande do Norte, Brasil 
 
 
 
1 ​1722, Odilon Gomes de Lima Av, Natal, Rio Grande do Norte 59078-400 
2 ​3000, Sen. Salgador Filho Av, Natal, Rio Grande do Norte 59064-741 
 
 
*Corresponding author 
E-mail:​ rodrigo.dalmolin@imd.ufrn.br 
Running title: ​Articulation points Investigation in KEGG metabolic pathways 
 
 
 
Key-words:​ KEGG, articulation point, metabolic pathway, network topology, network visualization. 
 
 
1 
 
 
Abstract 
 
The metabolic pathways can be analyzed as 
graphs, which provide several tools to study the 
topological features such as the articulation points. 
Nowadays, research in bioinformatics studies the 
essentiality of proteins based on betweenness and 
degree metrics, however, graph theory suggests 
that articulation points could be essential nodes in 
a network. It remains to be determined whether 
these articulation points are essential in metabolic 
pathways and their topological impact on the 
network. Here, we constructed biological networks 
of 153 canonical metabolic pathways from the 
KEGG database. We identified the articulation 
points of each canonical network and evaluated 
their frequency in 6221 species annotated in the 
KEGG database. Using network analysis via 
metrics and biologic curation, we verified that 
articulation points are proteins with high 
frequencies across species when comparing with 
non-articulation points. The articulation points 
with high-frequency are located in central regions 
of the network, while those articulation points with 
low frequency are in the network periphery. 
Finally, steroid biosynthesis is the metabolic 
pathway with the highest number of articulation 
points with a frequency higher than 80%. Besides, 
oxidoreductase is the articulation point class 
present in the highest number of metabolic 
pathways. It remains to perform a deep analysis of 
the articulation point biological roles. 
 
Introduction 
 
Metabolic pathways are composedof several 
components, such as enzymes and other proteins, 
which need to work coordinately to perform a 
given function (1). There are metabolic pathways 
widely distributed in the tree of life, and others 
restricted to a few taxonomic groups (2). The 
distribution of different metabolic pathways 
suggests that some of them are important to almost 
all organisms while others are important to 
organisms in a specific niche or from specific taxa 
(2). It is common to observe variations in the same 
metabolic pathway comparing different organisms. 
For example, both humans and yeast have 
carbohydrate metabolism. However, the yeast 
Saccharomyces cerevisiae has the enzyme 
pyruvate decarboxylase (EC:4.1.1.1), an enzyme 
absent in humans, while both species have enolase 
(EC:4.2.1.11) (3). 
 
A central characteristic of metabolic pathways is 
that reactions are connected by their intermediates. 
The products of a reaction are the reactants of 
subsequent pathway steps (4). In general, the 
metabolic pathways can be classified as flow 
networks, where a specific variable, such as mass 
or energy, can be conserved at each node. The 
study and application of graph theory in the 
context of metabolic networks has gained 
notoriety since it is necessary to systematically 
understand how molecules interact with each other 
and how their functions are determined within the 
complex cell machinery, alone or together with 
other cells (5). Additionally, structural and 
functional analysis of genome-based large-scale 
metabolic networks is important for understanding 
the design principles and regulation of metabolism 
at a system level. The metabolic networks are 
organized into many small and highly connected 
modules that combine hierarchically into larger, 
less cohesive units (6)​. ​Therefore, it is important 
for a rational reduction of the metabolic networks 
to their core structure and a deeper understanding 
of its functional modules (7). 
 
A great source of curated functional information 
about metabolic networks is the KEGG database 
(Kyoto Encyclopedia of Genes and Genomes - 
https://www.genome.jp/kegg/). It holds a 
knowledge base on metabolic pathway maps of 
molecular interaction and orthology relationships 
between genes/gene products in different species 
(8). In KEGG, nodes depict enzymes or other 
proteins, and edges represent the reactions that 
transform one metabolite into another (9-11). 
Although KEGG diagrams are informative and 
easy to understand, a network topological study 
about the metabolic pathway might be important to 
2 
 
understand specific network characteristics. In this 
sense, the representation via graphs could stand 
out because it enables the study of several network 
properties (12). 
 
Network analysis suggests that most biological 
networks, including metabolic networks, are 
scale-free and possess a "small-world" property 
characterized by a short average path length (7, 13, 
14). There are biological phenomena that can 
influence a novel node attachment in a biological 
network, such as exon shuffling, retroposition, 
mobile elements, horizontal gene transfer, gene 
duplication (15). Scale-free networks are 
suggested to have high error tolerance (against 
random failure) and low attack tolerance 
(vulnerability to the failure of the highly connected 
nodes) (14). The network topological properties, 
such as betweenness, degree, articulation points 
(AP), and bridges, helps to understand network 
characteristics. Betweenness centrality measures 
the total number of non-redundant shortest paths 
going through a certain node or edge (16). The 
degree tells how many links a specific node has 
(13). Also, an edge can be classified as a bridge if 
its removal disconnects the graph or increases the 
number of connected components otherwise (17). 
A node is an AP if its removal disconnects the 
network or increases the number of connected 
components of the network (18, 19). 
 
Articulation points could be particularly crucial to 
metabolic networks. Taking into account that 
metabolic networks commonly involve mass flux, 
the deletion of an AP might interrupt the flux and, 
therefore, deeply impact the pathway. To date, no 
research systematically evaluated the topology of 
metabolic maps as a whole to study the APs. The 
study of APs is relevant because it is possible to 
detect the most vulnerable sites in a metabolic 
pathway by identifying these points. The APs can 
be measured by the number of vertices that get 
disconnected from the main (largest) surviving 
connected component after the removal of the AP 
(20). 
Here, we built the respective metabolic network 
from each KEGG canonical metabolic map 
available in the KEGG website and identified their 
articulation points. We investigated the frequency 
of each node from the canonical metabolic 
networks by evaluating their presence in the 
metabolic maps for each species present in the 
KEGG database. According to our results, there 
are two sets of APs: peripheral AP, which have 
low frequency and low betweenness; and central 
AP, with high frequency and high betweenness. 
Finally, we develop an interactive visualization 
enabling access to each protein topological 
features and the simulation of creation and 
remotion of nodes to provide a user-friendly 
interaction with the assessed networks. 
 
Results 
 
Metabolic pathways from KEGG maps 
 
In total, we retrieved 153 metabolic pathways 
from the KEGG database. A script was developed 
in R language using the KEGGREST package to 
load the list of metabolic pathways with its 
characteristics and the list of 6221 organisms with 
their respective taxonomic information. 
Approximately 32% of metabolic pathways are 
related to less than one thousand ​species (Figure 
1)​. This might be due to the existence of specific 
pathways for certain plants, bacteria, and fungi, 
which are involved with their cellular functions in 
these organisms. An example would be 
Benzoxazinoid biosynthesis route 00402 
(Supporting information S1a) responsible for the 
defense of some plant species (Supporting 
information S1b). Also, approximately 31% of 
metabolic pathways are related to at least 83% of 
KEGG species. These pathways being important to 
different types of living beings (Supporting 
information S2), such as glycolysis and 
gluconeogenesis route 00010, present in 6084 
species (Supporting information S3). 
 
3 
 
 
Figure 1. ​Distribution of species quantity by metabolic pathways available in KEGG. X-axis: metabolic 
pathway. Y-axis: organisms count. The orange bars represent pathways related to eukaryotes. Eukaryotes 
represent 8.6% (534) of total species (6221). Blue bars represent pathways related to prokaryotes. 
Prokaryotes represent 91.4% (5687) of total species. 
 
We then converted each 153 KEGG metabolic 
maps into metabolic networks. Figure 2A shows 
the original KEGG map (map 0010) while Figure 
2B shows the respective metabolic network, where 
nodes represent enzymes and links represent the 
reaction. The KEGG representation (Figure 2A), 
focused on displaying the chemical reaction 
sequentiallyto understand its logic and to 
emphasize the subproducts of each step. On the 
other hand, the graph representation (Figure 2B) 
provides a perspective regarding the topologic 
organization, the role of each node in the network 
arrangement, the communities and the hubs that 
connect them, and the network bottlenecks which 
could be interpreted as points of network 
vulnerability. 
4 
 
 
Figure 2. ​Comparison between KEGG and graph representation of glycolysis and gluconeogenesis 
pathway. ​A. ​Canonical KEGG map representing the 0010 - glycolysis and gluconeogenesis pathway. The 
rectangles represent enzymes and the circles represent metabolites (extracted from the KEGG website - 
https://www.genome.jp/kegg/pathway.html). ​B. Graph visualization generated with 58 nodes and 190 
edges. Nodes represent enzymes and edges represent reactions. Node size represents protein frequency, 
colored borders indicate APs, color in the center of the node indicates betweenness level. The 
classification and EC code of the protein are present on the node label. The network was made using the 
R programming language (visNetwork package). 
 
Articulation points identification 
 
Articulation points are nodes whose removal 
disconnects the network in more than one 
component (21). We then identified the 
articulation points (AP) of each 153 metabolic 
networks. The APs represented 20.46% (n=1125) 
of the total evaluated nodes (n=5497), while the 
Non-APs nodes (n=4372) corresponded to 79.54% 
(Figure 3A). The number of APs varies depending 
on the pathway, ranging from 1 to 34 (Figure 3B). 
 
It is noteworthy that some metabolic pathways 
were composed mostly of APs, for example, the 
Benzoxazinoid biosynthesis pathway 00402 
(Supporting information S1), which is composed 
of eight proteins in which six are APs representing 
a one-way bridge in terms of mass flow. Other 
pathways had just peripheral APs, such as ​Caffeine 
metabolism pathway 00232 (Supporting 
information S4). In this case, there are few 
communities, which are overcrowded and some of 
them are connected through bridges that represent 
one-way paths. These bridges are composed of 
APs, which connect with only one protein. 
Another example is the ​Fatty acid synthesis 
pathway 00061 (Supporting information S5), 
formed of an overcrowded central community with 
5 
 
7 peripheries highly connected to the center and 
another periphery connected to the center through 
just one bridge. This pathway contains 3 
components disconnected from the main network 
component, each with their own APs. However, 
there are several highly connected nodes in the 
network central area, although none of them are an 
AP due to the connection handles, which 
guarantee redundancy in this area of the network. 
 
 
Figure 3. ​Distribution of articulation points. The AP represents 20.46% while the Non-AP (79.54%) of 
the total studied nodes ​A. ​The number of AP varies depending on the pathway. B. The x-axis represents 
each pathway and the y-axis depicts the number of nodes per pathway. Protein count per pathway varies 
from 2 to 170 nodes. 
 
Articulation points distribution across species 
 
Despite the possibility to find a given metabolic 
pathway in many species, not the same reactions 
are observed in each organism, especially when 
comparing evolutionarily distant species. 
Therefore, the node frequency in a given 
metabolic network might vary when evaluating 
different species. We evaluated the node frequency 
of each of the 153 metabolic networks. We first 
6 
 
identified the species in which a given metabolic 
network occurs and then ranked network nodes 
according to the observed frequency of occurring 
species. Through the hypergeometric analysis, we 
identified that the upper-frequency stratum (>= 
75% of frequency) were enriched by APs (p-value 
< 0.01). Within this filter, 1047 more frequent 
proteins were selected, which presented a higher 
frequency than the cut-off point, of which 236 are 
APs (22.54%). Regarding the total of studied 
proteins, 1125 APs (20.46%) were evaluated in 
which a higher AP density was observed above the 
cut-off point when compared to non-AP proteins 
(Figure 4A). Besides, at frequencies groups 
>=80% and <30%, we found the highest 
concentration of AP (Figure 4A). To differentiate 
the profile of the APs in the groups of high 
frequency (>=80%) and low frequency (<30%), 
we studied the betweenness of the APs as an 
indicator of centrality in the assessed metabolic 
pathways. In this sense, we observed that the APs 
in the group greater/equal than 80% presented 
betweenness higher than the group lower than 30% 
(Figure 4B), suggesting that these APs are located 
at more strategic points of the metabolic pathways 
(Supporting information S6). Additionally, we 
evaluated the APs distribution by each frequency 
range and also the overall proportion of the APs 
over the set of studied proteins. Our analysis 
showed that the highest concentration of APs 
occurred in the frequency range of 80-90%, which 
was confirmed by the binomial test (Figure 4C). It 
suggests that this set of proteins is essential for its 
metabolic pathways since they are present in most 
of the species containing a particular metabolic 
pathway. 
 
 
7 
 
Figure 4. ​AP proteins profile assessment. A. ​Heatmap showing the concentration of proteins by 
frequency. The proteins were divided into 2 groups: AP and Non-AP protein. The darker shades of blue 
indicate a higher concentration of proteins. The lighter shades of blue indicate lower concentrations of 
proteins. The dashed line represents the cut-off point of the hypergeometric. Proteins that fall below the 
cut-off point are randomly distributed when compared to proteins above the cut of the hypergeometric. ​B. 
Betweenness in APs frequencies groups <30% and >=80%. Distribution of betweenness in proteins 
groups lower than 30% (853) or greater/equal than 80% (206) of frequency. Analyses were performed 
using 1059 APs. Upper and lower whiskers (dashed line in blue) mark Q1-1.5 x IQR (interquartile range) 
and Q3+1.5 x IQR points, respectively. Points above and below those marks are considered outliers. The 
red line represents the median, and the red dot the mean of the entire distribution. The p-value assigned 
refers to a Wilcoxon-Mann-Whitney test, comparing both distributions. ​C. The ratio of APs by protein 
frequency ranges. The bars correspond to APs over total proteins found in each frequency range. Each bin 
represents an interval of 10%. The blue bar corresponds to the interval where the proportion of APs is 
considered significant (Binomial test, p-value<0.01), which p-value is highlighted. The dashed line 
represents the expected ratio of APs in the entire protein dataset (20.46%). 
 
 
Functional analysis 
 
We observed that aspartate-semialdehyde 
dehydrogenase and alcohol dehydrogenase, 
enzymes that catalyze the electrons transfer from 
the reductant to the oxidant (electron acceptor), 
were the APs associated with a greater number of 
metabolic pathways (6 metabolic pathways) 
(Figure 5). 351types of oxidoreductase can be 
found in the DEG, a database about essential 
GENES, demonstrating the essentiality of this AP 
class. In this context, 132 oxidoreductase types are 
present in bacterias, 9 oxidoreductase types are 
expressed in archaea, and 210 oxidoreductase 
types are expressed in eukaryotes. 
Additionally, we could observe that the steroid 
biosynthesis is the metabolic pathway with the 
highest number of articulation points with a 
frequency higher than 80% (16 APs). Moreover, 
Isoquinoline biosynthesis and carotenoid 
biosynthesis are the metabolic pathways with the 
highest number of articulation points with a 
frequency lower than 30% (27 APs). Besides, we 
evaluated the functions and pathways of the APs 
associated with the greatest number of pathways. 
Amino acid biosynthesis is the main function 
related to these APs (Supporting information S7). 
Cysteine and methionine metabolism, as well as 
glycine, serine, and threonine metabolism, are the 
main pathways related to these APs (Supporting 
information S7). 
8 
 
 
Figure 5. ​Distribution of APs with frequencies >=80% and <30% through metabolic pathways. A. Top 
10 metabolic pathways with the greatest number of APs with frequency <30% ​B. Top 10 APs with 
frequency <30% present in the highest number of metabolic pathways. ​C. Top 10 metabolic pathways 
with the greatest number of APs with frequency >=80%. ​D. Top 10 APs with frequency >=80% present in 
the highest number of metabolic pathways. ​E. Top 10 metabolic pathways with the greatest number of 
9 
 
APs with frequency >=80% and betweenness higher than the median. ​F. Top 10 APs with frequency 
>=80% present in the highest number of metabolic pathways and betweenness higher than the median. 
 
Network Visualization 
 
We provided dynamic HTML visualizations 
available at 
https://igorabrandao.com.br/kegg-pathway-bottlen
eck, which allow the observation of several 
metabolic pathways. It is possible to adjust the 
network size and position, as well as to select 
proteins according to their classification, making it 
possible to highlight protein groups of interest. 
Finally, the AP remotion can disconnect the 
network and the proteins metrics (impact, degree, 
and betweenness) can be easily viewed. 
 
Discussion 
 
The present study aimed to investigate the APs for 
all metabolic pathways available in the KEGG 
database. The remotion of one of these APs will 
cause the network rupture since it is a central point 
that connects various network complexes and/or 
peripheral regions (20). Therefore we can suggest 
that APs might be essential proteins to the 
pathways they belong to. Complementary to the 
network topological analysis, it is important to 
understand the profile of proteins classified as AP. 
We found a considerable concentration of APs in 
the protein groups with higher frequencies. We 
verified that APs tend to be high-frequency 
proteins, suggesting that these proteins may be 
more essential to the metabolic pathway when 
compared with proteins with non-APs. However, 
we identified two groups of APs; a group with 
high frequency and high betweenness and a group 
with low frequency and low betweenness. In 
practice, this implies that APs with low 
betweenness disconnect peripheral regions from 
the network, resulting in a disbalance between the 
components of the remaining network, while the 
APs with high betweenness are probably located in 
central regions. 
 
A study demonstrated that an important node in a 
network can be calculated based on the 
betweenness centrality metric plus its degrees (16). 
Other studies suggest performing the identification 
and removal of APs to provide a new perspective 
on the organizational principles of complex 
networks (21). We observed that, in the metabolic 
networks evaluated here, the APs could be 
important since they play a key role in ensuring 
the robustness and connectivity of these networks. 
Studies have shown that network integrity can be 
disrupted, contributing to serious impairments in 
disorders such as depression, psychotic disorders, 
motor diseases, or even lethal mutations enriched 
in the group of highly connected proteins that are 
APs (22). Investigations into the integration and 
coordination of large-scale functional networks 
subserving various conditions must be explored. 
For example, assessing non-static functional 
connectivity within whole-brain networks can be 
important to unravel some mysteries in the central 
nervous system. Studies assessing network 
integrity in the central nervous system were 
developed. However, it is also crucial to identify 
essential proteins in these networks (23-25). 
Another study related to inflammatory bowel 
disease (IBD) found that there are seven 
hub-bottleneck proteins (which can be understood 
as APs) in the IBD network responsible to 
maintain the network integrity. Network 
evaluation and complex analysis of IBD essential 
proteins are important to provide a new glance of 
the disease. Proteins are in a complex interactome 
organization that any small changes in each 
individual may lead to dysfunction of the whole 
system (26). 
 
We performed a functional analysis to determine 
the main metabolic pathways regarding the APs 
using KEGG ECs. A study also used KEGG to 
perform functional analyses and studies used Gene 
Ontology to determine the biological process and 
molecular functions of proteins related to cancer 
(27, 28). However, it is essential to highlight that 
no other study evaluated the main metabolic 
pathways associated with all APs. Most APs with 
10 
 
the greatest number of associated pathways are 
oxidoreductases (AP classes) and the synthesis of 
steroids was the metabolic pathway with the 
highest number of APs with a frequency higher 
than 80%. In fact, on the ​Steroid hormone 
biosynthesis pathway 00140, 14 of 16 APs are 
oxidoreductases. Oxidoreductases participate in 
the metabolism of some ethers. A 5α-reductase 
(5α-R) and 3α- or 3β-hydroxysteroid 
oxidoreductase (HSOR), is important to the 
generation of reduced metabolites of progesterone 
and testosterone, acting on several steroid and 
neurotransmitter receptors and regulating the 
neural function and behavior. 
 
Progesterone is metabolized into 
dihydro-progesterone by the action of 5α-R and 
subsequently into 3α, 5α-tetrahydro-progesterone, 
or into 3β, 5α-tetrahydro-progesterone, by the 
action of 3α- or 3β-HSOR, respectively. Besides, 
testosterone is converted into dihydrotestosterone 
and subsequently into 5α-androstane-3α,17β-diol 
or 5α-androstane-3β,17β-diol. The enzymatic 
reaction performed by 5α-R is the rate-limiting 
step during the 3α,5αreduced synthesis. The 
Oxidoreductase can also convert corticosterone 
and deoxycorticosterone into 
dihydro-corticosteroids and 
dihydro-deoxycorticosterone, respectively, and 
this latter metabolite can be converted into 
3α,5α-tetrahydro-desoxycorticosterone (29). 
 
Finally, to simplify the understanding of the 
topology of the metabolic pathway, our work 
proposes the visualization of these pathways as 
dynamic networks. Network visualization is an 
importantaspect of this work since the dynamic 
visualization provides the possibility to explore 
network topological features graphically, 
providing rich details, facilitating the network 
understanding, and extracting relevant biological 
information (12). 
 
In summary, we suggested that bottlenecks are 
articulation points with the highest frequencies and 
located in the center of the network. New studies 
will be required to establish more relationships 
between these key proteins (APs) and topological 
and biological attributes of the metabolic 
pathways. Additional studies with metabolomics, 
which is a powerful technology that allows for the 
assessment of global metabolic profiles, can be 
used to distinguish between diseased and 
non-diseased status information (30). Another 
future possibility is the usage of machine learning 
(ML) algorithms to create a predictive model to 
identify potential essential proteins in various 
metabolic pathways. Finally, knockout studies can 
be applied to experimentally evaluate whether 
specific APs are essential for a given metabolic 
pathway. For example, recently a receptor 
involved in memory (NOP receptor) has been 
experimentally associated with metabolic 
pathways involved in depression; it remains to 
assess this receptor essentiality on the path through 
bioinformatics tools. 
 
Experimental procedures 
 
Extraction and processing of metabolic 
pathway data 
 
Initially, the list of all species available in KEGG 
was generated along with the list of metabolic 
pathways for each species, and they were the 
reference for the selection of metabolic pathways 
used in the study. Each pathway has an associated 
KGML (KEGG XML) file that contains the 
enzymes and reactions data necessary to generate 
the graph object. It is important to note that in 
some KEGG metabolic pathways there are 
different nodes with the same enzyme commission 
number (EC). Since it was not possible to identify 
these nodes uniquely via the EC or KEGG 
identifier, we needed to combine the following 
node attributes to differentiate them from each 
other: EC, reaction code, and KEGG diagram (x, 
y) coordinates. 
 
We used the ​R/Bioconductor ​KEGGREST 
package (31) to load the reference pathway of 
interest from the REST API provided by KEGG. 
We used 153 metabolic pathways of all species 
available on KEGG (totalizing more than 600.000 
11 
 
datasets). KEGG data was parsed using a function 
adapted from the KEGGREST package. After that, 
compounds and interconnections between 
metabolic pathways were removed to study each 
isolated pathway. Graph properties, such as 
communities, betweenness centrality, and degree, 
were extracted using the iGraph R/Bioconductor 
package (32). 
 
APs detection 
 
AP detection was performed according to Tarjan's 
algorithm. Briefly, APs detection relies on the 
depth search algorithms (DFS) for graphs, which 
consists of a technique to visit each node in a 
given graph until its deepest level. Each node 
reached is placed on a stack and a record is kept of 
the lowest node on the stack to which it is 
connected by a path of unstacked nodes. When a 
new node cannot be reached from the top of the 
stack, the top node is deleted and the search is 
continued from the next node on the stack. If the 
top node of the stack does not connect to a node 
lower than the second node on the stack, then this 
second point is an AP of the graph. All edges 
examined during the search are placed on another 
stack. Therefore, when an AP is found, the edges 
of the corresponding biconnected component may 
be retrieved and placed in an output array (33). 
 
Protein frequency evaluation 
 
To evaluate the protein frequency for each 
metabolic pathway all the organisms that 
contained it were identified, and the occurrence of 
each node of the canonical route was verified 
taking into account the number of these organisms. 
The protein total frequency is determined by the 
number of times this protein appears in the various 
organisms pathways divided by the total 
organisms that possess the given pathway. 
The protein occurrence counting is performed via 
color attribute evaluation in the organism KGML 
file associated with a pathway, more specifically 
into its graphic section. Since KEGG annotates the 
color attributes to display the presence or absence 
of a given protein in a pathway map, we used it to 
perform the protein counting. In the organism 
pathways, nodes with ​bgcolor attribute equal to 
#BFFFBF indicate its presence, while #FFFFFF 
indicates its absence. 
 
Also, we generated a dictionary to match the 
organism nodes with a corresponding EC number 
relative to the canonical route, since in KEGG, the 
organism nodes are represented as ENTREZ code. 
This dictionary contains the attributes ​pathway 
code​, ​EC number​, ​reaction code​, ​protein x-axis​, 
and ​y-axis parsed from the KGML file. Via the 
combination of (reaction code + protein x-axis + 
y-axis) nodes attributes it's possible to match the 
EC number with ENTREZ code and generate the 
dictionary ID to represent this relation. 
 
We normalized the protein frequency according to 
the total species containing a certain metabolic 
pathway since the studied pathways showed 
different numbers of organisms. Also, we 
normalized the proteins according to the maximum 
and minimum frequencies of proteins related to a 
given metabolic pathway. The highest normalized 
value was considered 1, the lowest value was 
considered 0, and others values were within this 
scale from 0 to 1. 
 
Statistical analysis 
 
We assembled a dataset containing the protein 
information, including its frequency, from the 
evaluated pathways using the R language, and the 
nodes with a frequency of 0% were removed to 
avoid possible bias. To test the AP enrichment, we 
applied the hypergeometric test using the ​phyper 
built-in R function with ​lower.tail parameter set as 
false​. Also, we sorted the dataset by the protein 
frequency in descending order to verify the APs 
concentration in groups of higher frequencies. The 
hypergeometric test was applied ​N​ times, with ​N 
equal to the number of proteins in the dataset. 
Each hypergeometric instance evaluates the 
current and the previous proteins in the list in a 
cumulative way. A p-value < 0.01 was considered 
significant, showing that a frequency range is 
enriched with AP proteins. 
12 
 
 
Complementary, we conducted a binomial test 
(p-value < 0.01 ) grouping the proteins in ranges 
of 10% to confirm which range is the most 
enriched with APs. Since we observed an AP 
concentration higher than expected in proteins 
with lower frequencies, we divided the APs into 
the groups with higher (>=80%) and lower (<30%) 
frequencies. To establish differences between 
these groups, we performed comparative analyses 
using the Wilcoxon-Mann-Whitney test (p-value < 
0.01) over the APs topological features 
distribution of each group, for example, the 
betweenness centrality metric. 
 
Functional analysis 
 
To evaluate the main metabolic pathways related 
to the APs, we performed descriptive statistics toidentify the APs role with higher (>= 80%) and 
lower (<30%) frequency. The protein information 
was obtained via the KEGG dbget system plus the 
protein EC number, for example, the 
Aspartate-semialdehyde dehydrogenase ​with EC 
number 1.2.1.11 can be found at 
https://www.kegg.jp/dbget-bin/www_bget?1.2.1.11
. Into the dbget page, it is possible to retrieve the 
protein name, class, and related pathways. Also, 
we searched the protein function in the Uniprot 
database. 
 
Additionally, we retrieved the AP proteins 
correspondence into the Database of Essential 
Genes (DEG) using their classes to confirm their 
essentiality, since DEG holds information of 
essential genes that are indispensable for the 
survival of an organism. 
 
Metabolic pathways visualization 
 
Since a metabolic pathway can be considered a 
mass flow, we represent it by directed graphs 
where nodes are the enzymes and edges are the 
chemical reactions. Also, we provided a graphical 
visualization of the pathways displaying the main 
characteristics from its nodes such as AP status 
(border in blue), betweenness centrality 
(background color), frequency (node size), protein 
classification (label), and bridge edges (red 
arrowed line) using the visNetwork R package 
(34). 
 
Data availability​: The scripts and datasets 
presented in this paper have all been deposited in 
the Github repository with the following link: 
https://github.com/igorabrandao/kegg-pathway-bo
ttleneck​. 
 
Acknowledgments 
We acknowledge KEGG databases for providing public access to their data. We also would like to thank 
NPAD/UFRN for computational resources. 
 
Author contributions 
R.J.S.D. conceived the project. I.A.B., C.F.R., D.A.A.M., and L.R.S.C. gathered and curated the data. 
I.A.B., C.F.R., D.A.A.M., and L.R.S.C performed the analysis. I.A.B. and A.B.C. organized 
Supplementary Material. I.A.B., A.B.C., and R.J.S.D. wrote the manuscript. All authors revised and 
approved the manuscript. 
 
Funding and additional information 
The scholarships were financed by governmental Brazilian agency Coordination for the Improvement of 
Higher Education Personnel (CAPES– Portuguese: Coordenação de Aperfeiçoamento de Pessoal de Nível 
Superior), National Council of Technological and Scientific Development (CNPq) and PROPESQ-UFRN. 
The fellowship was supported by the National Council of Technological and Scientific Development 
(CNPq) (308258/2018-5). 
13 
 
Conflict of interest​: ​The authors declare that they have no conflicts of interest with the contents of this 
article. 
 
References 
 
1. Chatterjee, R., & Yuan, L. (2006) Directed evolution of metabolic pathways. ​TRENDS in 
Biotechnology​, ​24​(1), 28-38. 
2. Dalmolin, R. J., Castro, M. A., Rybarczyk Filho, J. L., Souza, L. H., de Almeida, R. M., & Moreira, J. 
C. (2011) Evolutionary plasticity determination by orthologous groups distribution. ​Biology Direct​; ​6​(1), 
1-18. 
3. Paludo, G. P., Lorenzatto, K. R., Bonatto, D., & Ferreira, H. B. (2015) Systems biology approach 
reveals possible evolutionarily conserved moonlighting functions for enolase. ​Computational biology and 
chemistry​, ​58​, 1-8. 
4. Maniadi EM, Tollis IG. (2014) ​Analysis and visualization of metabolic pathways and networks: A 
hypergraph approach​. 1st Ed., Spain, Valencia, 109-112. 
5. Barabasi AL, Oltvai ZN. (2004) Network biology: understanding the cell's functional organization. 
Nature reviews genetics​; ​5​(2), 101. 
6. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási AL. (2002) Hierarchical organization of 
modularity in metabolic networks. ​Science​, ​297​(5586), 1551-1555. 
7. Ma HW, Zeng AP. (2003) The connectivity structure, giant strong component and centrality of 
metabolic networks. ​Bioinformatics​, ​19​(11), 1423-1430. 
8. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. (2016) KEGG: new perspectives on 
genomes, pathways, diseases and drugs. ​Nucleic acids research​, ​45​(1), 353-361. 
9. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL. (2000) The large-scale organization of 
metabolic networks. ​Nature​, ​407​(6804), 651. 
10. Wagner A, Fell DA. (2001) The small world inside large metabolic networks. ​Proceedings of the 
Royal Society of London​, ​268​(1478), 1803-1810. 
11. Ma HW, Zeng AP. (2003) The connectivity structure, giant strong component and centrality of 
metabolic networks. ​Bioinformatics​, ​19​(11), 1423-1430. 
12. Beguerisse-Díaz M, Bosque G, Oyarzún D, Picó J, Barahona M. (2018) Flux-dependent graphs for 
metabolic networks. ​NPJ systems biology and applications​, ​4​(1), 32. 
13. Jeong H, Mason SP, Barabási AL, Oltvai ZN. (2001) Lethality and centrality in protein networks. 
Nature​, ​411​(6833), 41. 
14. Mahadevan R, Palsson BO. (2005) Properties of metabolic networks: structure versus function. 
Biophysical jornal​, ​88​(1), 7-9. 
15. Ferreira RM, Rybarczyk-Filho JL, Dalmolin RJ, Castro MA, Moreira JC, Brunnet LG, de Almeida 
RM. (2013) Preferential duplication of intermodular hub genes: an evolutionary signature in eukaryotes 
genome networks. ​PloS one,​ ​8​(2). 
16. Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M. (2007) The importance of bottlenecks in protein 
networks: correlation with gene essentiality and expression dynamics. ​PLoS computational biology​, ​3​(4), 
e59. 
17. Ausiello G, Firmani D, Laura AL. (2012) Real-time monitoring of undirected networks: Articulation 
points, bridges, and connected and biconnected components. ​Networks,​ ​59​(3), 275-288. 
14 
 
18. Behzad M, Chartrand G. (1972) ​Introduction to the Theory of Graphs​. 6st Ed., Allyn and Bacon, 
EUA, Boston. 
19. Harary F. (1996) ​Graph theory​. 6st Ed., Addison-Wesley, England, London. 
20. Farina G. (2015) A linear time algorithm to compute the impact of all the articulation points, ​1,​ 1-4. 
21. Tian L, Bashan A, Shi D-N, Liu Y-Y. (2017) Articulation points in complex networks. ​Nat Commun​, 
8​(1), 14223. 
22. Pržulj, N., Wigle, D. A., & Jurisica, I. (2004) Functional topology in a network of protein interactions. 
Bioinformatics​, ​20​(3), 340-348. 
23. Ge R, Downar J, Blumberger DM, Daskalakis ZJ, Lam RW, Vila-Rodriguez F. (2019) Structural 
network integrity of the central executive network is associated with the therapeutic effect of rTMS in 
treatment resistant depression. ​Progress in Neuro-Psychopharmacology and Biological Psychiatry, ​92, 
217-225. 
24. Larivière S, Ward NS, Boudrias MH. (2018) Disrupted functional network integrity and flexibility 
after stroke: Relation to motor impairments. ​NeuroImage: Clinical​, ​19,​ 883-891. 
25. Sheffield JM, Kandala S, Tamminga CA, Pearlson GD, Keshavan MS, Sweeney JA et al. (2017) 
Transdiagnostic associations between functional brain network integrity and cognition. ​JAMA psychiatry​, 
74​(6), 605-613. 
26. Asadzadeh-Aghdaee H, Shahrokh S, Norouzinia M, Hosseini M, Keramatinia A, Naghibzadeh B et al. 
(2016) Introduction of inflammatory bowel disease biomarkers panel using protein-protein interaction 
(PPI) network analysis, ​9​(1), 8-13. 
27. Barabási AL, Gulbahce N, Loscalzo J. (2011) Network medicine: a network-based approach to human 
disease. ​Nature reviews genetics​, ​12​(1), 56-68. 
28. Ritz A, Poirel CL, Tegge AN, Sharp N, Simmons K, Powell AM et al. (2016) Pathways on demand: 
automatedreconstruction of human signaling networks. ​NPJ systems biology and applications​, ​2​, 16002. 
29. Supper J, Spangenberg L, Planatscher H, Dräger A, Schröder A, Zell A. (2009) BowTieBuilder: 
modeling signal transduction pathways. ​BMC Syst Biol​, ​3​(1), 67. 
30. Khan FM, Gupta SK, Wolkenhauer O. (2018) Integrative workflows for network analysis. ​Essays in 
Biochemistry​, ​62​(4), 549–61. 
31. Tenenbaum D. (2016) KEGGREST: Client-side REST access to KEGG. ​R package version​, ​1​(1). 
32. Csardi G, Nepusz T. (2006) The igraph software package for complex network research. ​InterJournal, 
Complex Systems​, ​1695​(5): 1-9. 
33. Hopcroft J, Tarjan R. (1973) Algorithm 447: efficient algorithms for graph manipulation. 
Communications of the ACM,​ ​16​(6), 372-378. 
34. Almende BV, Thieurmel B Robert T. (2016) visNetwork: Network Visualization using ‘vis. js’ 
Library, ​R package version​, ​1​(1). 
 
Supporting Information 
 
Supporting information S1. ​Pathway 00402 - Benzoxazinoid biosynthesis. ​A. Dynamic visualization 
generated with 8 nodes and 7 edges. ​B.​ KEGG metabolic pathway diagram. 
 
15 
 
 
 
 
 
16 
 
Supporting information S2. ​Classification and quantity of organisms used in the study. ​A. Prokaryotes; 
B.​ Eukaryotes. 
 
 
17 
 
Supporting information S3. ​Visualization of pathway 0010 (glycolysis and gluconeogenesis) in the 
KEGG database. KEGG presents information about the metabolic pathway as a sequential block diagram. 
In which colored or white boxes represent proteins. Colored boxes are proteins present in the metabolic 
pathway, while whites are absent proteins in these pathways. Source: 
https://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa& mapno=00010. 
 
 
18 
 
Supporting information S4. ​Pathway 00232 - Caffeine metabolism. ​A. Dynamic visualization 
with 26 nodes and 76 edges. ​B.​ KEGG metabolic pathway diagram. 
 
19 
 
Supporting information S5. ​Pathway 00061 - Fatty acid biosynthesis. ​A. Dynamic visualization with 
170 nodes and 1063 edges. ​B.​ KEGG metabolic pathway diagram. 
 
 
20 
 
 
21 
 
Supporting information S6. A. APs with high betweenness and frequency of pathway 00010 - glycolysis 
and gluconeogenesis. I) AP ec:4.1.2.13 with betweenness of 0.0255 and frequency of 96.98%. II) AP 
ec:4.2.1.11 with betweenness of 0.0490 and frequency of 98.26%. III) AP ec:2.7.1.40 with betweenness 
of 0.0282 and frequency of 91.73%. ​B. APs with low betweenness and frequency of pathway 00640 - 
propanoate metabolism. IV) AP ec:1.1.1.80 with betweenness of 0.0023 and frequency of 2.23%. V) AP 
ec:4.1.1.4 with betweenness of 0.0034 and frequency of 7.69%. VI) AP ec:2.8.3.8 with betweenness of 
0.0042 and frequency of 9.79%. VII) AP ec:1.1.1.77 with betweenness of 0.0017 and frequency of 
6.89%. 
 
 
 
 
22 
 
 
 
23 
 
Supporting information S7. ​Functions and metabolic pathways related to the APs present in a higher 
number of metabolic pathways. The metabolic pathways were obtained from KEGG, the APs functions 
were obtained from Uniprot. 
 
AP name and class Metabolic Pathways Functions in prokaryotes 
● EC 1.2.1.11 - 
Aspartate-semialdeh
yde dehydrogenase 
● Oxidoreductase 
1. Glycine, serine and 
threonine metabolism 
2. Cysteine and methionine 
metabolism 
3. Lysine biosynthesis 
Biosynthesis of amino acids in prokaryotes, fungi, and some 
higher plants. It forms an early branch point in the metabolic 
pathway forming lysine, methionine, leucine, and isoleucine from 
aspartate 
● EC 2.2.1.6 - 
Acetolactate 
synthase 
● Transferases 
1. Valine, leucine and 
isoleucine biosynthesis 
2. Butanoate metabolism 
3. C5-Branched dibasic 
acid metabolism 
Catalyzes the first step in the synthesis of the branched-chain 
amino acids in plants, animals, and microorganisms, transferring 
aldehyde or ketone residues. 
● EC 1.1.1.1 - Alcohol 
dehydrogenase 
● Oxidoreductase 
1. Chloroalkane and 
chloroalkane degradation 
2. Metabolism of 
xenobiotics by cytochrome 
P450 
3. Tyrosine metabolism 
Facilitate the interconversion between alcohols and aldehydes or 
ketones with the reduction of nicotinamide adenine dinucleotide 
(NAD+) to NADH. In yeast, plants, and many bacteria, some 
alcohol dehydrogenases catalyze the opposite reaction as part of 
fermentation to ensure a constant supply of NAD+. In humans 
and animals, they serve to break down alcohols that otherwise are 
toxic, and they also participate in the generation of useful 
aldehyde, ketone, or alcohol groups during biosynthesis of 
various metabolites. 
● EC 6.3.2.9 - 
D-glutamate ligase 
● Ligases 
1. D-Glutamine and 
D-glutamate metabolism 
2. Peptidoglycan 
biosynthesis 
Form carbon-nitrogen bonds as acid-D-amino-acid ligases 
(peptide synthases). 
● EC 4.1.3.27 - 
Anthranilate 
synthase 
● Lyases 
1. Phenylalanine, tyrosine 
and tryptophan 
biosynthesis 
2. Phenazine biosynthesis 
Catalyzes the two-step biosynthesis of anthranilate, an 
intermediate in the biosynthesis of L-tryptophan. In the first step, 
the glutamine-binding beta subunit (TrpG) of anthranilate 
synthase (AS) provides the glutamine amidotransferase activity 
which generates ammonia as a substrate that, along with 
chorismate, is used in the second step, catalyzed by the large 
alpha subunit of AS (TrpE) to produce anthranilate. In the 
absence of TrpG, TrpE can synthesize anthranilate directly from 
chorismate and high concentrations of ammonia. 
● EC 1.1.1.3 - 
Homoserine 
dehydrogenase 
● Oxidoreductase 
1. Glycine, serine and 
threonine metabolism 
2. Cysteine and methionine 
metabolism 
Catalyzes the third step in the aspartate pathway; the 
NAD(P)-dependent reduction of aspartate beta-semialdehyde into 
homoserine. Homoserine is an intermediate in the biosynthesis of 
threonine, isoleucine, and methionine. In photosynthetic 
organisms, glutamine, glutamate, and aspartate accumulate during 
the day and are used to synthesize other amino acids. At night, 
aspartate is converted to asparagine for storage. Mammals lack 
the enzymes involved in the aspartate metabolic pathway, 
including homoserine dehydrogenase. As lysine, threonine, 
24 
 
methionine, and isoleucine are made in this pathway, they are 
considered essential amino acids for mammals. 
● EC 4.2.1.2 - 
Fumarate hydratase 
● Lyases 
1. Citrate cycle (TCA 
cycle) 
2. Pyruvate metabolism 
Catalyzes the reversible hydration/dehydration of fumarate to 
malate. Fumarase comes in two forms: mitochondrial and 
cytosolic. The mitochondrial isoenzyme is involved in the Krebs 
Cycle and the cytosolic isoenzyme is involved in the metabolism 
of amino acids and fumarate. 
● EC 1.14.14.91 - 
Cinnamate 
4-monooxygenase 
● Oxidoreductases 
1. Flavonoid biosynthesis 
2. Stilbenoid, 
diarylheptanoid and 
gingerol biosynthesis 
Acts on paired donors, with O2 as oxidant and incorporation or 
reduction of oxygen. The oxygen incorporated need not be 
derived from O2 with NADH or NADPH as one donor, and 
incorporation of one oxygen atom into the other donor. 
● EC 1.1.1.100 - 
Beta-ketoacyl 
reductase 
● Oxidoreductases 
1. Fatty acid biosynthesis 
2. Biotin metabolism 
Catalyzes the reaction of conversion of 
3-hydroxyacyl-[acyl-carrier-protein] to 
3-oxoacyl-[acyl-carrier-protein] using NADPH as a reducing 
agent. Acts on the CH-OH group as hydride donors with NAD+ 
or NADP+ as hydride acceptors.