IgorAugustoBrandao_DISSERT

Biologia

•

Escuela Universidad Nacional

Sebastian Gutierrez

20/05/2024

Prévia do material em texto

UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE
INSTITUTO METRÓPOLE DIGITAL
PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA
MESTRADO EM BIOINFORMÁTICA
Igor Augusto Brandão
Abordagens da biologia de sistemas na
investigação dos pontos de articulação nas
rotas metabólicas do KEGG
Natal - RN
2020
UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE
INSTITUTO METRÓPOLE DIGITAL
PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA
MESTRADO EM BIOINFORMÁTICA
Igor Augusto Brandão
Abordagens da biologia de sistemas na investigação
dos pontos de articulação nas rotas metabólicas do
KEGG
Universidade Federal do Rio Grande do Norte
Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin – UFRN
Coorientador: Prof. Dr. José Miguel Ortega – UFMG
Natal - RN
2020
UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE
INSTITUTO METRÓPOLE DIGITAL
PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA
MESTRADO EM BIOINFORMÁTICA
Abordagens da biologia de sistemas na investigação
dos pontos de articulação nas rotas metabólicas do
KEGG
Dissertação apresentada ao Programa de Pós-
graduação em BioInformática da Universidade Federal
do Rio Grande do Norte por Igor Augusto Brandão
como parte dos requisitos para a obtenção do tı́tulo
de Mestre em BioInformática.
Área de concentração: Bioinformática
Linha de Pesquisa: Biologia de Sistemas
Orientador: Prof. Dr. Rodrigo Juliani S. Dalmolin
Coorientador: Prof. Dr. José Miguel Ortega
Natal, RN
2020
Brandão, Igor Augusto.
Abordagens da biologia de sistemas na investigação dos pontos
de articulação nas rotas metabólicas do KEGG / Igor Augusto
Brandão. - Natal, 2020.
61.: il.
Dissertação (Mestrado) - Universidade Federal do Rio Grande
do Norte. Instituto Métropole Digital. Programa de Pós-graduação
em Bioinformática.
Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin.
Coorientador: Prof. Dr. José Miguel Ortega.
1. Biologia de sistemas - Dissertação. 2. KEGG - Dissertação.
3. Rotas metabólicas - Dissertação. 4. Topologia de redes -
Dissertação. 5. Pontos de articulação - Dissertação. 6.
Visualização de redes - Dissertação. I. Dalmolin, Rodrigo
Juliani Siqueira. II. Ortega, José Miguel. III. Universidade
Federal do Rio Grande do Norte. IV. Título.
RN/UF/BSCB CDU 57:004
Universidade Federal do Rio Grande do Norte - UFRN
Sistema de Bibliotecas - SISBI
Catalogação de Publicação na Fonte. UFRN - Biblioteca Setorial Prof. Leopoldo Nelson - Centro de Biociências - CB
Elaborado por KATIA REJANE DA SILVA - CRB-15/351
IGOR AUGUSTO BRANDÃO

ABORDAGENS DA BIOLOGIA DE SISTEMAS NA INVESTIGAÇÃO DOS PONTOS
DE ARTICULAÇÃO NAS ROTAS METABÓLICAS DO KEGG
Defesa de Mestrado apresentanda ao Programa de Pós-Graduação em Bioinformática da
Universidade Federal do Rio Grande do Norte.

Área de concentração: Bioinformática
Linha de Pesquisa: Biologia de Sistemas
Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin
Natal, 14 de agosto de 2020.

BANCA EXAMINADORA

Prof. Dr. Rodrigo Juliani Siqueira Dalmolin
Universidade Federal do Rio Grande do Norte
(Presidente)
(por videoconferência)

Prof. Dr. César Rennó Costa
Universidade Federal do Rio Grande do Norte
(Examinador Interno)
(por videoconferência)

Prof. Dr. Ricardo D’Oliveira Albanus
Universidade de Michigan
(Examinador Externo à Instituição)
(por videoconferência)
Agradecimentos
Este trabalho constitui uma importante fase do desenvolvimento acadêmico e exigiu
diversas habilidades que não se adquirem à portas fechadas: paciência, capacidade
de criar parcerias, perseverança e acima de tudo sonho - o sonho de trazer todo o
aprendizado para a vida.
Primeiramente agradeço à Alice pelas importantes contribuições para este trabalho;
Aos meus pais que mesmo distantes estão sempre presentes em minha vida; Ao meu
professor e orientador, Rodrigo Dalmolin, por acreditar em minha capacidade e por
me guiar neste inı́cio de minha jornada acadêmica; Ao professor Lucélio do Instituto
Metrópole Digital, pelo apoio ao longo destes anos em diversas empreitadas; Aos
colegas do grupo de pesquisa, em especial à Clóvis e Diego, pelas contribuições para
este trabalho; Aos colegas do BioME pela convivência e por tornar o caminho mais
agradável; Aos demais professores e colaboradores do BioME, em especial Jessica e
Aldo, pela disposição e paciência em me ajudar nos momentos em que precisei.
O presente trabalho foi realizado no PPg-Bioinfo/UFRN com apoio da Coordenação
de Aperfeiçoamento de Pessoal de Nı́vel Superior - Brasil (CAPES) - Código de Finan-
ciamento 001.
Sumário
Lista de ilustrações . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Lista de tabelas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1 INTRODUÇÃO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1 Biologia de sistemas . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 O estudo de redes aplicado a vias metabólicas . . . . . . . . . . . 13
1.3 Caracterı́sticas topológicas de uma rede . . . . . . . . . . . . . . 15
1.4 Estudo das vias metabólicas do KEGG . . . . . . . . . . . . . . . 18
2 JUSTIFICATIVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 OBJETIVOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Objetivo geral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Objetivos especı́ficos . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Artigo 23
Discussão 48
Conclusão 56
REFERÊNCIAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
APÊNDICE A – MATERIAL SUPLEMENTAR DO ARTIGO . . . . . 62
6
Lista de ilustrações
Figura 1 – Representação de uma via metabólica como grafo . . . . . . . . . . 14
Figura 2 – Avaliação topológica de uma rede . . . . . . . . . . . . . . . . . . . 16
Figura 3 – Impacto da remoção de um AP sobre a rede . . . . . . . . . . . . . 50
Figura 4 – Ferramenta de visualização de redes KPV . . . . . . . . . . . . . . . 52
7
Lista de tabelas
Tabela 1 – Propriedades topológicas de uma rede . . . . . . . . . . . . . . . . 17
Tabela 2 – Identificadores do KEGG . . . . . . . . . . . . . . . . . . . . . . . . 19
8
Lista de Sı́mbolos e Abreviações
KEGG – Kyoto Encyclopedia of Genes and Genomes
AP – Articulation point
EC – Enzyme commission number
KO – KEGG Orthology
GO – Gene Ontology
XML – Extensible Markup Language
KGML – KEGG XML
KPV – KEGG Pathway Viewer
9
Resumo
O estudo da essencialidade das proteı́nas por meio de métodos laboratoriais é caro
e não escalável para grandes quantidades de proteı́nas, desta forma é relevante
avaliar a essencialidade das várias proteı́nas de uma via metabólica como um todo
através de ferramentas computacionais. Em geral, uma via metabólica pode ser
analisada como grafos, os quais fornecem diferentes recursos para o estudo das
caracterı́sticas topológicas de redes, como os seus pontos de articulação e disposição
dos nós. Atualmente, pesquisas em bioinformática estudam a essencialidade de
proteı́nas com base nas métricas de betweenness e degree, contudo a teoria dos grafos
sugere que os pontos de articulação podem ser nós importante em uma rede, resta
avaliar se esses pontos de articulação são de fato essenciais para as vias metabólicas
e seu impacto topológico na rede. Utilizando análises baseadas em métricas de
rede, o nosso objetivo é verificar se de fato esses pontos de articulação representam
gargalos na rede, sendo estes caracterizados como proteı́nas de frequências elevadas
e localizadas no centro das redes. Para tanto, identificamos os pontos de articulação
em diferentes vias metabólicas do KEGG, avaliamos o impacto de cada um deles,
calculamos sua frequência e comparamos suas ocorrências com as demais proteı́nas.
Inicialmente, fizemos o levantamento das vias metabólicas do KEGG que estavam
disponı́veis através dos arquivosKGML associados às redes. Após a listagem das vias
disponı́veis, os dados estruturais de cada uma delas foram convertidos em objetos do
tipo grafo. Os parâmetros ponto de articulação, betweenness e degree foram utilizados
para classificar as proteı́nas constantes em cada via metabólica. Aproximadamente
20% das proteı́nas foram classificadas como pontos de articulação, das quais 3,75%
foram identificadas pela alta frequência e localização em regiões centrais da rede.
Além disso, a maior concentração dos pontos de articulação ocorreu na faixa de
frequência dos 80 a 90%. Um padrão de não aleatoriedade na distribuição dos pontos
de articulação foi identificado nos grupos com frequências acima de 74,5%.Finalmente,
a biossı́ntese de esteroides foi a via metabólica com o maior número de pontos de
articulação com frequências superiores a 80% em sua constituição. A oxidoredutase foi
a classe dos pontos de articulação presente no maior número de vias metabólicas. As
descobertas sugerem que os gargalos das redes avaliadas são pontos de articulação
com as frequências mais altas e localizados no centro da rede. Resta realizar análises
mais aprofundadas a respeito dos papéis biológicos destes pontos de articulação
encontrados.
Palavras-chaves: Biologia de sistemas, KEGG, ponto de articulação, via metabólica,
topologia de redes, visualização de redes.
10
Abstract
The study of proteins essentiality through laboratory methods is expensive, time-
consuming and not scalable for large amounts of proteins. Besides, it is relevant
to evaluate the essentiality of several proteins of a metabolic pathway as a whole. The
metabolic pathways can be analyzed as graphs, which provide several tools to study
the topological features such as the articulation points. Nowadays, research in bioinfor-
matics studies the essentiality of proteins based on betweenness and degree metrics,
however, graph theory suggests that articulation points could be essential nodes in a
network. It remains to be determined whether these articulation points are essential in
metabolic pathways and their topological impact on the network. Using network analysis
via metrics and biologic curation, we aim to verify if bottlenecks are proteins with the
highest frequencies and located in the center of KEGG metabolic pathways. For this
purpose, we identified the articulation points in different networks, evaluate the impact
of each articulation point, calculate their frequency and compare them with occurrences
of non-articulation points. We consulted KEGG pathways available as KGML files. After,
the data was transformed into a graph object. Two centrality parameters including
articulation points and degree are determined and the essential proteins based on
these parameters are classified. Approximately 20% of the proteins are articulation
points. The articulation points with high-frequency which are located in central regions
of the network were considered the most important (3.75%). In addition, the highest
concentration of articulation points occurred in the frequency range of 80-90%. A pattern
of non-randomness of articulation points was identified in the protein groups that have
a frequency of at least 74.5%. Finally, steroid biosynthesis is the metabolic pathway
with the highest number of articulation points with frequency higher than 80%. Besides,
oxidoreductase is the articulation point class present in the highest number of metabolic
pathways. Overall, the findings suggest that bottlenecks are articulation points with
highest frequencies and located in the center of the network. It remains to perform a
deep analysis on the articulation points biological roles.
Key words: Systems biology, KEGG, articulation point, metabolic pathway, network
topology, network visualization.
11
12
1 Introdução
1.1 Biologia de sistemas
O desafio de compreender sistemas biológicos complexos é o que caracteriza o
campo de estudo da biologia, a qual requer a integração de pesquisas experimentais
e computacionais [1]. A biologia de sistemas é uma disciplina que estuda sistemas
biológicos de uma perspectiva holı́stica e interdisciplinar, reunindo profissionais de dife-
rentes campos de estudo tais como biólogos, matemáticos, cientistas da computação,
fı́sicos e engenheiros, de forma a obter componentes orientados para biologia e com-
ponentes orientados para sistemas, tais como bases de dados. Além disso, a biologia
de sistemas é caracterizada por grandes conjuntos de dados e modelagem, sendo
estes decorrentes do desenvolvimento de tecnologias de alta vazão (high-throughput)
nos anos 90. Sem modelos matemáticos e simulações computacionais, no entanto, os
dados não podiam ser entendidos na época [2]. Os primeiros biólogos de sistemas
fizeram diversas tentativas para melhorar a metodologia de estudo destas informações,
tais como o desenvolvimento de algoritmos para a manipulação de grandes conjuntos
de dados e formalismos que modelam uma rede especı́fica, incluindo redes reguladoras
de genes, redes de transdução de sinal e redes metabólicas [3].
Um aspecto relevante sobre as informações coletadas por meio do projeto Genoma
Humano é a hierarquia que estas seguem na natureza, a saber: DNA → mRNA →
proteı́na → interações proteicas → vias de informação → redes de informação →
células → tecidos ou redes de células → um organismo → populações → ecologias.
A tarefa central da biologia de sistemas é (I) coletar informações de cada um desses
nı́veis distintos para sistemas biológicos individuais e (II) integrar esses dados para
gerar modelos matemáticos preditivos do sistema [4]. A informação biológica possui
caracterı́sticas importantes:
• Opera em vários nı́veis hierárquicos de organização;
• É processada em redes complexas;
• Essas redes de informação são tipicamente robustas, de modo que muitas
perturbações únicas não as afetam significativamente;
• Existem nós-chave na rede, onde perturbações podem ter efeitos profundos;
estes nós oferecem alvos poderosos para a compreensão e manipulação do
sistema [4].
Introdução 13
Nesse contexto, é importante ressaltar que sistemas biológicos, como células,
são considerados ”sistemas complexos”. Uma noção geral de sistemas complexos é
composta por um número muito grande de elementos simples e idênticos que interagem
de maneira a produzir comportamentos ”complexos” [5], no entanto a realidade dos
sistemas biológicos é um pouco diferente. Nesse caso, um grande número de conjuntos
de elementos funcionalmente diversos e frequentemente multifuncionais interage de
maneira seletiva e não linear para produzir comportamentos coerentes. Ao contrário
dos sistemas complexos de elementos simples (nos quais as funções emergem das
propriedades das redes que formam e não de qualquer elemento especı́fico), as funções
nos sistemas biológicos dependem de uma combinação da rede e dos elementos
especı́ficos envolvidos [1].
Por muito tempo, assumiu-se que os componentes de sistemas complexos, como
a célula, são aleatoriamente conectados. Entretanto, na década passada um grande
número de pesquisas mostrou que muitas redes reais, independentemente de idade,
função e escopo, convergem para arquiteturas semelhantes, uma universalidade que
permitiu pesquisadores de diferentes disciplinas adotar a teoria das redes como um
paradigma comum [6].
1.2 O estudo de redes aplicado a vias metabólicas
As redes regulatórias e de interação são representações importantes das redes
de proteı́nas e são caracterizadas por possuir arestas direcionais e não direcionais,
respectivamente [7]. O estudo e a aplicação da teoria dos grafos no contexto de redes
biológicas ganharam notoriedade, pois é necessário entender sistematicamente como
as moléculas interagem umas com as outras e como suas funções são determinadas
na complexa maquinaria celular, isoladamente ou em conjunto com outras células [8].
As viasmetabólicas e os ciclos são cadeias de reação em que produtos quı́micos
se tornam o substrato para a próxima etapa [9]. Além disso, a análise estrutural e
funcional de redes metabólicas baseadas no genoma é importante para entender
os princı́pios de design e a regulação do metabolismo em nı́vel sistêmico. A rede
metabólica é organizada em muitos módulos pequenos e altamente conectados, os
quais se combinam hierarquicamente em unidades maiores e menos coesas [10].
Portanto, é importante uma redução racional da rede metabólica em sua estrutura
central e uma compreensão mais profunda de seus módulos funcionais [11].
A principal caracterı́stica das vias metabólicas é o fato de suas reações serem
conectadas por seus intermediários, sendo os produtos de uma reação os reagentes
das reações subsequentes [12]. Em geral, as vias metabólicas podem ser classificadas
como redes de fluxo, onde uma variável especı́fica, como fluxo de massa ou energia,
pode ser conservada em cada nó. As redes metabólicas têm propriedades únicas,
Introdução 14
como as restrições de conservação que devem ser satisfeitas em cada nó. Outra
propriedade é a representação dos nós das redes metabólicas como metabólitos e as
arestas como reações catalisadas por produtos genéticos especı́ficos [13]. Abaixo, é
apresentada a Figura 1 exemplificando a representação de uma via metabólica como
grafo.
Figura 1 – Representação da via metabólica relativa a biossı́ntese de estilbenoide
como grafo (desenvolvido pelo autor1).
Em termos de grafos, existem diversas maneiras de se construir uma rede a partir de
um modelo metabólico [14]. Por exemplo, (I) é possı́vel representar uma via metabólica
como um grafo no qual os metabólitos são os nós e arestas são as reações que
transformam um metabólito em outro [11, 15, 16, 17], (II) como um grafo em que as
reações são os nós e arestas os metabólitos compartilhados entre elas [18, 19, 20],
ou até mesmo (III) como um grafo bipartido com reações e metabólitos como nós [21].
O estudo de redes sugere que as redes biológicas têm duas propriedades estruturais
importantes. Primeira, a maioria dessas redes (incluindo as vias metabólicas) são
scale-free e possuem uma propriedade de ”small-world”, caracterizada por grande parte
de suas conexões serem estabelecidas entre os vértices mais próximos [7, 11, 13]. No
1 Disponı́vel em: <〈https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway module/viewer?
pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09〉>. Acesso em: 08 jun. 2020.
https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway_module/viewer?pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09
https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway_module/viewer?pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09
Introdução 15
entanto, é importante observar a existência de forças motrizes que podem influenciar na
criação de um novo nó na rede biológica, como embaralhamento de exons, retroposição,
elementos móveis, transferência horizontal de genes, duplicação gênica e o fato de
uma nova conexão deste nó refletir sua origem com a natureza de suas propriedades
[22]. Segunda, sugere-se que as redes scale-free tenham alta tolerância a erros
(resistência a falhas aleatórias) e baixa tolerância a ataques (vulnerabilidade à falha
dos nós altamente conectados) [13].
1.3 Caracterı́sticas topológicas de uma rede
Os perfis topológicos de redes podem ser avaliados através da observação de
caracterı́sticas especı́ficas da via, como por exemplo:
1.3.1 Betweenness
A métrica betweenness centrality avalia o número total de caminhos mais curtos,
não redundantes, e que passam por um determinado nó ou aresta [23].
1.3.2 Degree
O degree informa quantos links o nó possui para se conectar a outros nós [7].
É importante ressaltar que a distribuição do degree é ampla, com uma cauda que
geralmente segue uma lei de potência: portanto, muitos vértices com baixo degree
coexistem com alguns vértices com alto degree [24].
A avaliação do degree dos vértices de uma rede traz a tona a ideia de nós hub,
definidos como nós que estão conectados a um número substancial de outros nós
na rede, e em particular, a nós que são mais densamente conectados do que os
nós mais altamente conectado em uma rede scale-free [25]. Contudo, definir este
número substancial de conexões pode ser algo subjetivo, por isso utilizou-se como
referência a definição de Yu et al., na qual os nós hubs são aqueles com as 20%
maiores distribuições de degree [23].
1.3.3 Comunidades
A distribuição das arestas em uma rede não é apenas global, mas também lo-
calmente não homogênea, com altas concentrações de arestas dentro de grupos
especı́ficos de vértices e baixas concentrações entre esses grupos. Essa caracterı́stica
de redes é conhecida como estrutura de comunidade [26].
Introdução 16
1.3.4 Ponto de articulação
Um nó é um AP se a sua remoção desconectar a rede ou aumentar o número
de componentes conectados a ela [27, 28]. Além disso, cada AP possui um impacto
diferente sobre a rede, o qual pode ser definido como o número de vértices que são
desconectados do principal (maior) componente conectado (CC) sobrevivente após a
remoção do vértice [29].
1.3.5 Pontes
Uma aresta pode ser classificada como ponte se a sua remoção desconectar o
grafo ou aumentar o número de componentes conectados [30].
1.3.6 Exemplo de avaliação topológica de uma rede
A partir da Figura 2 é possı́vel identificar alguns dos elementos topológicos citados
anteriormente:
Figura 2 – Avaliação topológica de uma rede. Os cı́rculos numerados na figura
representam os vértices da rede, as linhas conectando os vértices são as
arestas (neste caso não direcionadas). Esta rede possui três comunidades:
A, B e C, demarcadas pelos cı́rculos tracejados e pela cor dos vértices que
as compõem. Além disso, é possı́vel identificar que os nós participantes de
uma mesma comunidade são altamente conectados entre si, porém
possuem poucas ligações com as demais comunidades. A aresta que
conecta os nós [1, 4] representa uma ponte do grafo e os nós 1, 3, 6, 7 e 8
são os pontos de articulação da rede, uma vez que se um deles for
removido o grafo será desconectado (desenvolvido pelo autor).
Introdução 17
Adicionalmente, é possı́vel avaliar outras propriedades topológicas da rede não
visı́veis em uma primeira observação através da tabela 1:
Tabela 1 – Propriedades topológicas de uma rede. A numeração presente na coluna
node é equivalente aos números dos vértices da figura 2. A coluna
betweenness mostra que os vértices 6, 7, 8 são aqueles que mais estão
presentes nas conexões de ponta-a-ponta da rede, como por exemplo [2,
11] ou [1, 12]. Além disso, fica claro que os nós 11 e 12 são os mais
densamente conectados, uma vez que estes fazem vizinhança com quatro
nós. Por fim, a coluna diameter mostra que o número necessário de passos
para conectar as extremidados da rede é sete, no caso para conectar os
nós [4, 10] (desenvolvido pelo autor).
Esta foi somente uma sugestão para a realização de uma análise topológica de rede,
existem outras maneiras de se estudar um grafo e outras propriedades que podem ser
exploradas conforme a necessidade da pesquisa. Além disso, é possı́vel aplicar este
tipo de metodologia para estudos massivos em bases de dados.
Introdução 18
1.4 Estudo das vias metabólicas do KEGG
Atualmente existem diversas bases de dados contendo informações sobre vias
metabólicas e seus componentes. No entanto, foi utilizada a plataforma KEGG, pelo fato
desta possuir uma base de conhecimento manualmente curada em mapas metabólicos
de interação molecular e relações de ortologia entre genes e produtos gênicos [31].
No KEGG, os nós representam enzimas e as arestas reações que transformam um
metabólito em outro [11, 15, 16]. O KEGG consiste em trêsbases de dados: (I)
PATHWAY, para representação de funções de ordem superior em termos de redes de
interação, (II) GENES, para a coleção de catálogos de genes para todos os genomas
completamente sequenciados e alguns parciais [32], e (III) LIGAND, para a coleção de
compostos quı́micos na célula, moléculas e reações enzimáticas [33].
O banco de dados KEGG/PATHWAY, especializado em vias metabólicas, é repre-
sentado por 500 diagramas gráficos para as vias metabólicas de referência. Cada via
pode ser vista como uma rede de enzimas ou uma rede de ECs. Uma vez que os genes
das enzimas são identificados no genoma (com base na semelhança de sequência
e na correlação posicional dos genes) e os ECs adequadamente atribuı́dos, as vias
especı́ficas do organismo podem ser construı́das computacionalmente, correlacionando
genes no genoma com produtos gênicos (enzimas) nas vias de referência conforme os
ECs correspondentes [32].
Os diagramas de rotas do KEGG podem ser utilizados como referência para
interpretação funcional, além de conter relações binárias que representam interações e
relações moleculares, as quais podem ser utilizadas para calcular e comparar rotas [31].
Vale ressaltar que em termos de estrutura informacional, cada via metabólica do KEGG
possui um arquivo KGML com a identificação das proteı́nas, reações quı́micas e suas
interações. Cada diagrama do KEGG é identificado pelo KEGG identifier, conforme
tabela 2. Embora estes diagramas sejam informativos, um estudo topológico de rede
sobre a via metabólica pode ser importante para entender caracterı́sticas especı́ficas
de cada rede. Nesse sentido, a representação via grafos pode ser importante, uma vez
que a representação visual de redes oferece uma forma rápida e intuitiva de compreen-
der seus resultados, além de possibilitar o estudo de diversas caracterı́sticas destas
redes por meio da avaliação topológica.
Introdução 19
Identificador Significado
map via de referência desenhada manualmente
ko via de referência que destaca os KOs
ec via de referência que destaca os ECs
rn via de referência que destaca as reações
<org> via especı́fica do organismo gerada pela conversão de KOs em identi-
ficadores de genes (Entrez Gene ID)
Numeração
011 via global (arestas vinculadas aos KOs)
012 via geral (arestas vinculadas aos KOs)
010 via de estrutura quı́mica
07 via de estrutura de medicamento
outras via regular (nós vinculados aos KOs)
Tabela 2 – Identificadores do KEGG. Cada mapa metabólico é identificado pela
combinação de um prefixo com 2-4 letras e um número de 5 dı́gitos
(desenvolvido pelo Laboratório Kanehisa2).
2 Disponı́vel em: <〈https://www.genome.jp/kegg/pathway.html〉>. Acesso em: 19 jun. 2020.
https://www.genome.jp/kegg/pathway.html
20
2 Justificativa
O desafio no processo de obtenção e mineração de dados em larga escala consiste
em uma atividade recorrente no âmbito da pesquisa cientı́fica e também em projetos
comerciais. Diante disso, o desenvolvimento e utilização de ferramentas para análise
de dados massivos em diversas bases de conhecimento é necessário para fornecer
aos biólogos uma visão integrada dos dados biológicos [34]. Neste contexto, a prin-
cipal contribuição deste trabalho é a avaliação sistemática da topologia dos mapas
metabólicos disponı́veis na plataforma KEGG com o intuito de encontrar os pontos de
vulnerabilidade destas vias, uma vez que a disruptura delas pode estar associadas a
desordens ou doenças para os organismos em geral [35].
A determinação experimental de genes essenciais em laboratório ainda é cara,
demorada e não escalável. Neste sentido, previsões computacionais de genes es-
senciais podem ser utilizadas para restringir o universo de estudo e fornecer uma
lista de prioridades para validação experimental [36]. Adicionalmente, a utilização de
propriedades baseadas em rede são capazes de prever melhor os genes essenciais
entre diferentes organismos, uma vez que estes genes são efetivamente transferidos
em organismos intimamente relacionados por compartilharem genes ortólogos [36].
21
3 Objetivos
3.1 Objetivo geral
Verificar se os pontos de vulnerabilidade das vias metabólicas do KEGG possuem
as seguintes caracterı́sticas: (I) são os nós com as maiores ocorrências entre as
vias metabólicas de diferentes organismos, (II) desconectam a rede em componentes
distintos após sua remoção, (III) estão localizados no centro destas vias metabólicas.
3.2 Objetivos especı́ficos
– Avaliar a ocorrência de cada proteı́na no conjunto das vias metabólicas do KEGG
para os diferentes organismos disponı́veis;
– Identificar quais nós são APs e analisar suas caracterı́sticas topológicas para
cada via metabólica em diferentes organismos;
– Detectar as faixas de frequência em que os APs não estão aleatoriamente dis-
tribuı́dos;
– Analisar funcionalmente os APs para compreender o seu papel biológico;
– Desenvolver uma ferramenta para visualização interativa das redes de maneira a
permitir o acesso as caracterı́sticas topológicas de seus componentes e manipulação
das mesmas.
Artigo:
Systems biology approaches in the
investigation of articulation points in
KEGG metabolic pathways
Autores: Igor A. Brandão, Clovis F. Reis, Alice B. Câmara, Diego A. A. Morais,
Leonardo R. S. Campos, Rodrigo J. S. Dalmolin

Systems biology approaches in the investigation of articulation points in KEGG metabolic pathways

Igor A. Brandão1, Clovis F. Reis1, Alice B. Câmara2, Diego A. A. Morais1, Leonardo R. S. Campos1,
Rodrigo J. S. Dalmolin1*

1 Bioinformatics multidisciplinary environment, Federal University of Rio Grande do Norte, Natal, Rio
Grande do Norte, Brasil
2 Biophysics and Pharmacology Department, Federal University of Rio Grande do Norte, Natal, Rio
Grande do Norte, Brasil

1 1722, Odilon Gomes de Lima Av, Natal, Rio Grande do Norte 59078-400
2 3000, Sen. Salgador Filho Av, Natal, Rio Grande do Norte 59064-741

*Corresponding author
E-mail: rodrigo.dalmolin@imd.ufrn.br
Running title: Articulation points Investigation in KEGG metabolic pathways

Key-words: KEGG, articulation point, metabolic pathway, network topology, network visualization.

Abstract

The metabolic pathways can be analyzed as
graphs, which provide several tools to study the
topological features such as the articulation points.
Nowadays, research in bioinformatics studies the
essentiality of proteins based on betweenness and
degree metrics, however, graph theory suggests
that articulation points could be essential nodes in
a network. It remains to be determined whether
these articulation points are essential in metabolic
pathways and their topological impact on the
network. Here, we constructed biological networks
of 153 canonical metabolic pathways from the
KEGG database. We identified the articulation
points of each canonical network and evaluated
their frequency in 6221 species annotated in the
KEGG database. Using network analysis via
metrics and biologic curation, we verified that
articulation points are proteins with high
frequencies across species when comparing with
non-articulation points. The articulation points
with high-frequency are located in central regions
of the network, while those articulation points with
low frequency are in the network periphery.
Finally, steroid biosynthesis is the metabolic
pathway with the highest number of articulation
points with a frequency higher than 80%. Besides,
oxidoreductase is the articulation point class
present in the highest number of metabolic
pathways. It remains to perform a deep analysis of
the articulation point biological roles.

Introduction

Metabolic pathways are composedof several
components, such as enzymes and other proteins,
which need to work coordinately to perform a
given function (1). There are metabolic pathways
widely distributed in the tree of life, and others
restricted to a few taxonomic groups (2). The
distribution of different metabolic pathways
suggests that some of them are important to almost
all organisms while others are important to
organisms in a specific niche or from specific taxa
(2). It is common to observe variations in the same
metabolic pathway comparing different organisms.
For example, both humans and yeast have
carbohydrate metabolism. However, the yeast
Saccharomyces cerevisiae has the enzyme
pyruvate decarboxylase (EC:4.1.1.1), an enzyme
absent in humans, while both species have enolase
(EC:4.2.1.11) (3).

A central characteristic of metabolic pathways is
that reactions are connected by their intermediates.
The products of a reaction are the reactants of
subsequent pathway steps (4). In general, the
metabolic pathways can be classified as flow
networks, where a specific variable, such as mass
or energy, can be conserved at each node. The
study and application of graph theory in the
context of metabolic networks has gained
notoriety since it is necessary to systematically
understand how molecules interact with each other
and how their functions are determined within the
complex cell machinery, alone or together with
other cells (5). Additionally, structural and
functional analysis of genome-based large-scale
metabolic networks is important for understanding
the design principles and regulation of metabolism
at a system level. The metabolic networks are
organized into many small and highly connected
modules that combine hierarchically into larger,
less cohesive units (6). Therefore, it is important
for a rational reduction of the metabolic networks
to their core structure and a deeper understanding
of its functional modules (7).

A great source of curated functional information
about metabolic networks is the KEGG database
(Kyoto Encyclopedia of Genes and Genomes -
https://www.genome.jp/kegg/). It holds a
knowledge base on metabolic pathway maps of
molecular interaction and orthology relationships
between genes/gene products in different species
(8). In KEGG, nodes depict enzymes or other
proteins, and edges represent the reactions that
transform one metabolite into another (9-11).
Although KEGG diagrams are informative and
easy to understand, a network topological study
about the metabolic pathway might be important to
2

understand specific network characteristics. In this
sense, the representation via graphs could stand
out because it enables the study of several network
properties (12).

Network analysis suggests that most biological
networks, including metabolic networks, are
scale-free and possess a "small-world" property
characterized by a short average path length (7, 13,
14). There are biological phenomena that can
influence a novel node attachment in a biological
network, such as exon shuffling, retroposition,
mobile elements, horizontal gene transfer, gene
duplication (15). Scale-free networks are
suggested to have high error tolerance (against
random failure) and low attack tolerance
(vulnerability to the failure of the highly connected
nodes) (14). The network topological properties,
such as betweenness, degree, articulation points
(AP), and bridges, helps to understand network
characteristics. Betweenness centrality measures
the total number of non-redundant shortest paths
going through a certain node or edge (16). The
degree tells how many links a specific node has
(13). Also, an edge can be classified as a bridge if
its removal disconnects the graph or increases the
number of connected components otherwise (17).
A node is an AP if its removal disconnects the
network or increases the number of connected
components of the network (18, 19).

Articulation points could be particularly crucial to
metabolic networks. Taking into account that
metabolic networks commonly involve mass flux,
the deletion of an AP might interrupt the flux and,
therefore, deeply impact the pathway. To date, no
research systematically evaluated the topology of
metabolic maps as a whole to study the APs. The
study of APs is relevant because it is possible to
detect the most vulnerable sites in a metabolic
pathway by identifying these points. The APs can
be measured by the number of vertices that get
disconnected from the main (largest) surviving
connected component after the removal of the AP
(20).
Here, we built the respective metabolic network
from each KEGG canonical metabolic map
available in the KEGG website and identified their
articulation points. We investigated the frequency
of each node from the canonical metabolic
networks by evaluating their presence in the
metabolic maps for each species present in the
KEGG database. According to our results, there
are two sets of APs: peripheral AP, which have
low frequency and low betweenness; and central
AP, with high frequency and high betweenness.
Finally, we develop an interactive visualization
enabling access to each protein topological
features and the simulation of creation and
remotion of nodes to provide a user-friendly
interaction with the assessed networks.

Results

Metabolic pathways from KEGG maps

In total, we retrieved 153 metabolic pathways
from the KEGG database. A script was developed
in R language using the KEGGREST package to
load the list of metabolic pathways with its
characteristics and the list of 6221 organisms with
their respective taxonomic information.
Approximately 32% of metabolic pathways are
related to less than one thousand species (Figure
1). This might be due to the existence of specific
pathways for certain plants, bacteria, and fungi,
which are involved with their cellular functions in
these organisms. An example would be
Benzoxazinoid biosynthesis route 00402
(Supporting information S1a) responsible for the
defense of some plant species (Supporting
information S1b). Also, approximately 31% of
metabolic pathways are related to at least 83% of
KEGG species. These pathways being important to
different types of living beings (Supporting
information S2), such as glycolysis and
gluconeogenesis route 00010, present in 6084
species (Supporting information S3).

Figure 1. Distribution of species quantity by metabolic pathways available in KEGG. X-axis: metabolic
pathway. Y-axis: organisms count. The orange bars represent pathways related to eukaryotes. Eukaryotes
represent 8.6% (534) of total species (6221). Blue bars represent pathways related to prokaryotes.
Prokaryotes represent 91.4% (5687) of total species.

We then converted each 153 KEGG metabolic
maps into metabolic networks. Figure 2A shows
the original KEGG map (map 0010) while Figure
2B shows the respective metabolic network, where
nodes represent enzymes and links represent the
reaction. The KEGG representation (Figure 2A),
focused on displaying the chemical reaction
sequentiallyto understand its logic and to
emphasize the subproducts of each step. On the
other hand, the graph representation (Figure 2B)
provides a perspective regarding the topologic
organization, the role of each node in the network
arrangement, the communities and the hubs that
connect them, and the network bottlenecks which
could be interpreted as points of network
vulnerability.
4

Figure 2. Comparison between KEGG and graph representation of glycolysis and gluconeogenesis
pathway. A. Canonical KEGG map representing the 0010 - glycolysis and gluconeogenesis pathway. The
rectangles represent enzymes and the circles represent metabolites (extracted from the KEGG website -
https://www.genome.jp/kegg/pathway.html). B. Graph visualization generated with 58 nodes and 190
edges. Nodes represent enzymes and edges represent reactions. Node size represents protein frequency,
colored borders indicate APs, color in the center of the node indicates betweenness level. The
classification and EC code of the protein are present on the node label. The network was made using the
R programming language (visNetwork package).

Articulation points identification

Articulation points are nodes whose removal
disconnects the network in more than one
component (21). We then identified the
articulation points (AP) of each 153 metabolic
networks. The APs represented 20.46% (n=1125)
of the total evaluated nodes (n=5497), while the
Non-APs nodes (n=4372) corresponded to 79.54%
(Figure 3A). The number of APs varies depending
on the pathway, ranging from 1 to 34 (Figure 3B).

It is noteworthy that some metabolic pathways
were composed mostly of APs, for example, the
Benzoxazinoid biosynthesis pathway 00402
(Supporting information S1), which is composed
of eight proteins in which six are APs representing
a one-way bridge in terms of mass flow. Other
pathways had just peripheral APs, such as Caffeine
metabolism pathway 00232 (Supporting
information S4). In this case, there are few
communities, which are overcrowded and some of
them are connected through bridges that represent
one-way paths. These bridges are composed of
APs, which connect with only one protein.
Another example is the Fatty acid synthesis
pathway 00061 (Supporting information S5),
formed of an overcrowded central community with
5

7 peripheries highly connected to the center and
another periphery connected to the center through
just one bridge. This pathway contains 3
components disconnected from the main network
component, each with their own APs. However,
there are several highly connected nodes in the
network central area, although none of them are an
AP due to the connection handles, which
guarantee redundancy in this area of the network.

Figure 3. Distribution of articulation points. The AP represents 20.46% while the Non-AP (79.54%) of
the total studied nodes A. The number of AP varies depending on the pathway. B. The x-axis represents
each pathway and the y-axis depicts the number of nodes per pathway. Protein count per pathway varies
from 2 to 170 nodes.

Articulation points distribution across species

Despite the possibility to find a given metabolic
pathway in many species, not the same reactions
are observed in each organism, especially when
comparing evolutionarily distant species.
Therefore, the node frequency in a given
metabolic network might vary when evaluating
different species. We evaluated the node frequency
of each of the 153 metabolic networks. We first
6

identified the species in which a given metabolic
network occurs and then ranked network nodes
according to the observed frequency of occurring
species. Through the hypergeometric analysis, we
identified that the upper-frequency stratum (>=
75% of frequency) were enriched by APs (p-value
< 0.01). Within this filter, 1047 more frequent
proteins were selected, which presented a higher
frequency than the cut-off point, of which 236 are
APs (22.54%). Regarding the total of studied
proteins, 1125 APs (20.46%) were evaluated in
which a higher AP density was observed above the
cut-off point when compared to non-AP proteins
(Figure 4A). Besides, at frequencies groups
>=80% and <30%, we found the highest
concentration of AP (Figure 4A). To differentiate
the profile of the APs in the groups of high
frequency (>=80%) and low frequency (<30%),
we studied the betweenness of the APs as an
indicator of centrality in the assessed metabolic
pathways. In this sense, we observed that the APs
in the group greater/equal than 80% presented
betweenness higher than the group lower than 30%
(Figure 4B), suggesting that these APs are located
at more strategic points of the metabolic pathways
(Supporting information S6). Additionally, we
evaluated the APs distribution by each frequency
range and also the overall proportion of the APs
over the set of studied proteins. Our analysis
showed that the highest concentration of APs
occurred in the frequency range of 80-90%, which
was confirmed by the binomial test (Figure 4C). It
suggests that this set of proteins is essential for its
metabolic pathways since they are present in most
of the species containing a particular metabolic
pathway.

Figure 4. AP proteins profile assessment. A. Heatmap showing the concentration of proteins by
frequency. The proteins were divided into 2 groups: AP and Non-AP protein. The darker shades of blue
indicate a higher concentration of proteins. The lighter shades of blue indicate lower concentrations of
proteins. The dashed line represents the cut-off point of the hypergeometric. Proteins that fall below the
cut-off point are randomly distributed when compared to proteins above the cut of the hypergeometric. B.
Betweenness in APs frequencies groups <30% and >=80%. Distribution of betweenness in proteins
groups lower than 30% (853) or greater/equal than 80% (206) of frequency. Analyses were performed
using 1059 APs. Upper and lower whiskers (dashed line in blue) mark Q1-1.5 x IQR (interquartile range)
and Q3+1.5 x IQR points, respectively. Points above and below those marks are considered outliers. The
red line represents the median, and the red dot the mean of the entire distribution. The p-value assigned
refers to a Wilcoxon-Mann-Whitney test, comparing both distributions. C. The ratio of APs by protein
frequency ranges. The bars correspond to APs over total proteins found in each frequency range. Each bin
represents an interval of 10%. The blue bar corresponds to the interval where the proportion of APs is
considered significant (Binomial test, p-value<0.01), which p-value is highlighted. The dashed line
represents the expected ratio of APs in the entire protein dataset (20.46%).

Functional analysis

We observed that aspartate-semialdehyde
dehydrogenase and alcohol dehydrogenase,
enzymes that catalyze the electrons transfer from
the reductant to the oxidant (electron acceptor),
were the APs associated with a greater number of
metabolic pathways (6 metabolic pathways)
(Figure 5). 351types of oxidoreductase can be
found in the DEG, a database about essential
GENES, demonstrating the essentiality of this AP
class. In this context, 132 oxidoreductase types are
present in bacterias, 9 oxidoreductase types are
expressed in archaea, and 210 oxidoreductase
types are expressed in eukaryotes.
Additionally, we could observe that the steroid
biosynthesis is the metabolic pathway with the
highest number of articulation points with a
frequency higher than 80% (16 APs). Moreover,
Isoquinoline biosynthesis and carotenoid
biosynthesis are the metabolic pathways with the
highest number of articulation points with a
frequency lower than 30% (27 APs). Besides, we
evaluated the functions and pathways of the APs
associated with the greatest number of pathways.
Amino acid biosynthesis is the main function
related to these APs (Supporting information S7).
Cysteine and methionine metabolism, as well as
glycine, serine, and threonine metabolism, are the
main pathways related to these APs (Supporting
information S7).
8

Figure 5. Distribution of APs with frequencies >=80% and <30% through metabolic pathways. A. Top
10 metabolic pathways with the greatest number of APs with frequency <30% B. Top 10 APs with
frequency <30% present in the highest number of metabolic pathways. C. Top 10 metabolic pathways
with the greatest number of APs with frequency >=80%. D. Top 10 APs with frequency >=80% present in
the highest number of metabolic pathways. E. Top 10 metabolic pathways with the greatest number of
9

APs with frequency >=80% and betweenness higher than the median. F. Top 10 APs with frequency
>=80% present in the highest number of metabolic pathways and betweenness higher than the median.

Network Visualization

We provided dynamic HTML visualizations
available at
https://igorabrandao.com.br/kegg-pathway-bottlen
eck, which allow the observation of several
metabolic pathways. It is possible to adjust the
network size and position, as well as to select
proteins according to their classification, making it
possible to highlight protein groups of interest.
Finally, the AP remotion can disconnect the
network and the proteins metrics (impact, degree,
and betweenness) can be easily viewed.

Discussion

The present study aimed to investigate the APs for
all metabolic pathways available in the KEGG
database. The remotion of one of these APs will
cause the network rupture since it is a central point
that connects various network complexes and/or
peripheral regions (20). Therefore we can suggest
that APs might be essential proteins to the
pathways they belong to. Complementary to the
network topological analysis, it is important to
understand the profile of proteins classified as AP.
We found a considerable concentration of APs in
the protein groups with higher frequencies. We
verified that APs tend to be high-frequency
proteins, suggesting that these proteins may be
more essential to the metabolic pathway when
compared with proteins with non-APs. However,
we identified two groups of APs; a group with
high frequency and high betweenness and a group
with low frequency and low betweenness. In
practice, this implies that APs with low
betweenness disconnect peripheral regions from
the network, resulting in a disbalance between the
components of the remaining network, while the
APs with high betweenness are probably located in
central regions.

A study demonstrated that an important node in a
network can be calculated based on the
betweenness centrality metric plus its degrees (16).
Other studies suggest performing the identification
and removal of APs to provide a new perspective
on the organizational principles of complex
networks (21). We observed that, in the metabolic
networks evaluated here, the APs could be
important since they play a key role in ensuring
the robustness and connectivity of these networks.
Studies have shown that network integrity can be
disrupted, contributing to serious impairments in
disorders such as depression, psychotic disorders,
motor diseases, or even lethal mutations enriched
in the group of highly connected proteins that are
APs (22). Investigations into the integration and
coordination of large-scale functional networks
subserving various conditions must be explored.
For example, assessing non-static functional
connectivity within whole-brain networks can be
important to unravel some mysteries in the central
nervous system. Studies assessing network
integrity in the central nervous system were
developed. However, it is also crucial to identify
essential proteins in these networks (23-25).
Another study related to inflammatory bowel
disease (IBD) found that there are seven
hub-bottleneck proteins (which can be understood
as APs) in the IBD network responsible to
maintain the network integrity. Network
evaluation and complex analysis of IBD essential
proteins are important to provide a new glance of
the disease. Proteins are in a complex interactome
organization that any small changes in each
individual may lead to dysfunction of the whole
system (26).

We performed a functional analysis to determine
the main metabolic pathways regarding the APs
using KEGG ECs. A study also used KEGG to
perform functional analyses and studies used Gene
Ontology to determine the biological process and
molecular functions of proteins related to cancer
(27, 28). However, it is essential to highlight that
no other study evaluated the main metabolic
pathways associated with all APs. Most APs with
10

the greatest number of associated pathways are
oxidoreductases (AP classes) and the synthesis of
steroids was the metabolic pathway with the
highest number of APs with a frequency higher
than 80%. In fact, on the Steroid hormone
biosynthesis pathway 00140, 14 of 16 APs are
oxidoreductases. Oxidoreductases participate in
the metabolism of some ethers. A 5α-reductase
(5α-R) and 3α- or 3β-hydroxysteroid
oxidoreductase (HSOR), is important to the
generation of reduced metabolites of progesterone
and testosterone, acting on several steroid and
neurotransmitter receptors and regulating the
neural function and behavior.

Progesterone is metabolized into
dihydro-progesterone by the action of 5α-R and
subsequently into 3α, 5α-tetrahydro-progesterone,
or into 3β, 5α-tetrahydro-progesterone, by the
action of 3α- or 3β-HSOR, respectively. Besides,
testosterone is converted into dihydrotestosterone
and subsequently into 5α-androstane-3α,17β-diol
or 5α-androstane-3β,17β-diol. The enzymatic
reaction performed by 5α-R is the rate-limiting
step during the 3α,5αreduced synthesis. The
Oxidoreductase can also convert corticosterone
and deoxycorticosterone into
dihydro-corticosteroids and
dihydro-deoxycorticosterone, respectively, and
this latter metabolite can be converted into
3α,5α-tetrahydro-desoxycorticosterone (29).

Finally, to simplify the understanding of the
topology of the metabolic pathway, our work
proposes the visualization of these pathways as
dynamic networks. Network visualization is an
importantaspect of this work since the dynamic
visualization provides the possibility to explore
network topological features graphically,
providing rich details, facilitating the network
understanding, and extracting relevant biological
information (12).

In summary, we suggested that bottlenecks are
articulation points with the highest frequencies and
located in the center of the network. New studies
will be required to establish more relationships
between these key proteins (APs) and topological
and biological attributes of the metabolic
pathways. Additional studies with metabolomics,
which is a powerful technology that allows for the
assessment of global metabolic profiles, can be
used to distinguish between diseased and
non-diseased status information (30). Another
future possibility is the usage of machine learning
(ML) algorithms to create a predictive model to
identify potential essential proteins in various
metabolic pathways. Finally, knockout studies can
be applied to experimentally evaluate whether
specific APs are essential for a given metabolic
pathway. For example, recently a receptor
involved in memory (NOP receptor) has been
experimentally associated with metabolic
pathways involved in depression; it remains to
assess this receptor essentiality on the path through
bioinformatics tools.

Experimental procedures

Extraction and processing of metabolic
pathway data

Initially, the list of all species available in KEGG
was generated along with the list of metabolic
pathways for each species, and they were the
reference for the selection of metabolic pathways
used in the study. Each pathway has an associated
KGML (KEGG XML) file that contains the
enzymes and reactions data necessary to generate
the graph object. It is important to note that in
some KEGG metabolic pathways there are
different nodes with the same enzyme commission
number (EC). Since it was not possible to identify
these nodes uniquely via the EC or KEGG
identifier, we needed to combine the following
node attributes to differentiate them from each
other: EC, reaction code, and KEGG diagram (x,
y) coordinates.

We used the R/Bioconductor KEGGREST
package (31) to load the reference pathway of
interest from the REST API provided by KEGG.
We used 153 metabolic pathways of all species
available on KEGG (totalizing more than 600.000
11

datasets). KEGG data was parsed using a function
adapted from the KEGGREST package. After that,
compounds and interconnections between
metabolic pathways were removed to study each
isolated pathway. Graph properties, such as
communities, betweenness centrality, and degree,
were extracted using the iGraph R/Bioconductor
package (32).

APs detection

AP detection was performed according to Tarjan's
algorithm. Briefly, APs detection relies on the
depth search algorithms (DFS) for graphs, which
consists of a technique to visit each node in a
given graph until its deepest level. Each node
reached is placed on a stack and a record is kept of
the lowest node on the stack to which it is
connected by a path of unstacked nodes. When a
new node cannot be reached from the top of the
stack, the top node is deleted and the search is
continued from the next node on the stack. If the
top node of the stack does not connect to a node
lower than the second node on the stack, then this
second point is an AP of the graph. All edges
examined during the search are placed on another
stack. Therefore, when an AP is found, the edges
of the corresponding biconnected component may
be retrieved and placed in an output array (33).

Protein frequency evaluation

To evaluate the protein frequency for each
metabolic pathway all the organisms that
contained it were identified, and the occurrence of
each node of the canonical route was verified
taking into account the number of these organisms.
The protein total frequency is determined by the
number of times this protein appears in the various
organisms pathways divided by the total
organisms that possess the given pathway.
The protein occurrence counting is performed via
color attribute evaluation in the organism KGML
file associated with a pathway, more specifically
into its graphic section. Since KEGG annotates the
color attributes to display the presence or absence
of a given protein in a pathway map, we used it to
perform the protein counting. In the organism
pathways, nodes with bgcolor attribute equal to
#BFFFBF indicate its presence, while #FFFFFF
indicates its absence.

Also, we generated a dictionary to match the
organism nodes with a corresponding EC number
relative to the canonical route, since in KEGG, the
organism nodes are represented as ENTREZ code.
This dictionary contains the attributes pathway
code, EC number, reaction code, protein x-axis,
and y-axis parsed from the KGML file. Via the
combination of (reaction code + protein x-axis +
y-axis) nodes attributes it's possible to match the
EC number with ENTREZ code and generate the
dictionary ID to represent this relation.

We normalized the protein frequency according to
the total species containing a certain metabolic
pathway since the studied pathways showed
different numbers of organisms. Also, we
normalized the proteins according to the maximum
and minimum frequencies of proteins related to a
given metabolic pathway. The highest normalized
value was considered 1, the lowest value was
considered 0, and others values were within this
scale from 0 to 1.

Statistical analysis

We assembled a dataset containing the protein
information, including its frequency, from the
evaluated pathways using the R language, and the
nodes with a frequency of 0% were removed to
avoid possible bias. To test the AP enrichment, we
applied the hypergeometric test using the phyper
built-in R function with lower.tail parameter set as
false. Also, we sorted the dataset by the protein
frequency in descending order to verify the APs
concentration in groups of higher frequencies. The
hypergeometric test was applied N times, with N
equal to the number of proteins in the dataset.
Each hypergeometric instance evaluates the
current and the previous proteins in the list in a
cumulative way. A p-value < 0.01 was considered
significant, showing that a frequency range is
enriched with AP proteins.
12

Complementary, we conducted a binomial test
(p-value < 0.01 ) grouping the proteins in ranges
of 10% to confirm which range is the most
enriched with APs. Since we observed an AP
concentration higher than expected in proteins
with lower frequencies, we divided the APs into
the groups with higher (>=80%) and lower (<30%)
frequencies. To establish differences between
these groups, we performed comparative analyses
using the Wilcoxon-Mann-Whitney test (p-value <
0.01) over the APs topological features
distribution of each group, for example, the
betweenness centrality metric.

Functional analysis

To evaluate the main metabolic pathways related
to the APs, we performed descriptive statistics toidentify the APs role with higher (>= 80%) and
lower (<30%) frequency. The protein information
was obtained via the KEGG dbget system plus the
protein EC number, for example, the
Aspartate-semialdehyde dehydrogenase with EC
number 1.2.1.11 can be found at
https://www.kegg.jp/dbget-bin/www_bget?1.2.1.11
. Into the dbget page, it is possible to retrieve the
protein name, class, and related pathways. Also,
we searched the protein function in the Uniprot
database.

Additionally, we retrieved the AP proteins
correspondence into the Database of Essential
Genes (DEG) using their classes to confirm their
essentiality, since DEG holds information of
essential genes that are indispensable for the
survival of an organism.

Metabolic pathways visualization

Since a metabolic pathway can be considered a
mass flow, we represent it by directed graphs
where nodes are the enzymes and edges are the
chemical reactions. Also, we provided a graphical
visualization of the pathways displaying the main
characteristics from its nodes such as AP status
(border in blue), betweenness centrality
(background color), frequency (node size), protein
classification (label), and bridge edges (red
arrowed line) using the visNetwork R package
(34).

Data availability: The scripts and datasets
presented in this paper have all been deposited in
the Github repository with the following link:
https://github.com/igorabrandao/kegg-pathway-bo
ttleneck.

Acknowledgments
We acknowledge KEGG databases for providing public access to their data. We also would like to thank
NPAD/UFRN for computational resources.

Author contributions
R.J.S.D. conceived the project. I.A.B., C.F.R., D.A.A.M., and L.R.S.C. gathered and curated the data.
I.A.B., C.F.R., D.A.A.M., and L.R.S.C performed the analysis. I.A.B. and A.B.C. organized
Supplementary Material. I.A.B., A.B.C., and R.J.S.D. wrote the manuscript. All authors revised and
approved the manuscript.

Funding and additional information
The scholarships were financed by governmental Brazilian agency Coordination for the Improvement of
Higher Education Personnel (CAPES– Portuguese: Coordenação de Aperfeiçoamento de Pessoal de Nível
Superior), National Council of Technological and Scientific Development (CNPq) and PROPESQ-UFRN.
The fellowship was supported by the National Council of Technological and Scientific Development
(CNPq) (308258/2018-5).
13

Conflict of interest: The authors declare that they have no conflicts of interest with the contents of this
article.

References

1. Chatterjee, R., & Yuan, L. (2006) Directed evolution of metabolic pathways. TRENDS in
Biotechnology, 24(1), 28-38.
2. Dalmolin, R. J., Castro, M. A., Rybarczyk Filho, J. L., Souza, L. H., de Almeida, R. M., & Moreira, J.
C. (2011) Evolutionary plasticity determination by orthologous groups distribution. Biology Direct; 6(1),
1-18.
3. Paludo, G. P., Lorenzatto, K. R., Bonatto, D., & Ferreira, H. B. (2015) Systems biology approach
reveals possible evolutionarily conserved moonlighting functions for enolase. Computational biology and
chemistry, 58, 1-8.
4. Maniadi EM, Tollis IG. (2014) Analysis and visualization of metabolic pathways and networks: A
hypergraph approach. 1st Ed., Spain, Valencia, 109-112.
5. Barabasi AL, Oltvai ZN. (2004) Network biology: understanding the cell's functional organization.
Nature reviews genetics; 5(2), 101.
6. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási AL. (2002) Hierarchical organization of
modularity in metabolic networks. Science, 297(5586), 1551-1555.
7. Ma HW, Zeng AP. (2003) The connectivity structure, giant strong component and centrality of
metabolic networks. Bioinformatics, 19(11), 1423-1430.
8. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. (2016) KEGG: new perspectives on
genomes, pathways, diseases and drugs. Nucleic acids research, 45(1), 353-361.
9. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL. (2000) The large-scale organization of
metabolic networks. Nature, 407(6804), 651.
10. Wagner A, Fell DA. (2001) The small world inside large metabolic networks. Proceedings of the
Royal Society of London, 268(1478), 1803-1810.
11. Ma HW, Zeng AP. (2003) The connectivity structure, giant strong component and centrality of
metabolic networks. Bioinformatics, 19(11), 1423-1430.
12. Beguerisse-Díaz M, Bosque G, Oyarzún D, Picó J, Barahona M. (2018) Flux-dependent graphs for
metabolic networks. NPJ systems biology and applications, 4(1), 32.
13. Jeong H, Mason SP, Barabási AL, Oltvai ZN. (2001) Lethality and centrality in protein networks.
Nature, 411(6833), 41.
14. Mahadevan R, Palsson BO. (2005) Properties of metabolic networks: structure versus function.
Biophysical jornal, 88(1), 7-9.
15. Ferreira RM, Rybarczyk-Filho JL, Dalmolin RJ, Castro MA, Moreira JC, Brunnet LG, de Almeida
RM. (2013) Preferential duplication of intermodular hub genes: an evolutionary signature in eukaryotes
genome networks. PloS one, 8(2).
16. Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M. (2007) The importance of bottlenecks in protein
networks: correlation with gene essentiality and expression dynamics. PLoS computational biology, 3(4),
e59.
17. Ausiello G, Firmani D, Laura AL. (2012) Real-time monitoring of undirected networks: Articulation
points, bridges, and connected and biconnected components. Networks, 59(3), 275-288.
14

18. Behzad M, Chartrand G. (1972) Introduction to the Theory of Graphs. 6st Ed., Allyn and Bacon,
EUA, Boston.
19. Harary F. (1996) Graph theory. 6st Ed., Addison-Wesley, England, London.
20. Farina G. (2015) A linear time algorithm to compute the impact of all the articulation points, 1, 1-4.
21. Tian L, Bashan A, Shi D-N, Liu Y-Y. (2017) Articulation points in complex networks. Nat Commun,
8(1), 14223.
22. Pržulj, N., Wigle, D. A., & Jurisica, I. (2004) Functional topology in a network of protein interactions.
Bioinformatics, 20(3), 340-348.
23. Ge R, Downar J, Blumberger DM, Daskalakis ZJ, Lam RW, Vila-Rodriguez F. (2019) Structural
network integrity of the central executive network is associated with the therapeutic effect of rTMS in
treatment resistant depression. Progress in Neuro-Psychopharmacology and Biological Psychiatry, 92,
217-225.
24. Larivière S, Ward NS, Boudrias MH. (2018) Disrupted functional network integrity and flexibility
after stroke: Relation to motor impairments. NeuroImage: Clinical, 19, 883-891.
25. Sheffield JM, Kandala S, Tamminga CA, Pearlson GD, Keshavan MS, Sweeney JA et al. (2017)
Transdiagnostic associations between functional brain network integrity and cognition. JAMA psychiatry,
74(6), 605-613.
26. Asadzadeh-Aghdaee H, Shahrokh S, Norouzinia M, Hosseini M, Keramatinia A, Naghibzadeh B et al.
(2016) Introduction of inflammatory bowel disease biomarkers panel using protein-protein interaction
(PPI) network analysis, 9(1), 8-13.
27. Barabási AL, Gulbahce N, Loscalzo J. (2011) Network medicine: a network-based approach to human
disease. Nature reviews genetics, 12(1), 56-68.
28. Ritz A, Poirel CL, Tegge AN, Sharp N, Simmons K, Powell AM et al. (2016) Pathways on demand:
automatedreconstruction of human signaling networks. NPJ systems biology and applications, 2, 16002.
29. Supper J, Spangenberg L, Planatscher H, Dräger A, Schröder A, Zell A. (2009) BowTieBuilder:
modeling signal transduction pathways. BMC Syst Biol, 3(1), 67.
30. Khan FM, Gupta SK, Wolkenhauer O. (2018) Integrative workflows for network analysis. Essays in
Biochemistry, 62(4), 549–61.
31. Tenenbaum D. (2016) KEGGREST: Client-side REST access to KEGG. R package version, 1(1).
32. Csardi G, Nepusz T. (2006) The igraph software package for complex network research. InterJournal,
Complex Systems, 1695(5): 1-9.
33. Hopcroft J, Tarjan R. (1973) Algorithm 447: efficient algorithms for graph manipulation.
Communications of the ACM, 16(6), 372-378.
34. Almende BV, Thieurmel B Robert T. (2016) visNetwork: Network Visualization using ‘vis. js’
Library, R package version, 1(1).

Supporting Information

Supporting information S1. Pathway 00402 - Benzoxazinoid biosynthesis. A. Dynamic visualization
generated with 8 nodes and 7 edges. B. KEGG metabolic pathway diagram.

Supporting information S2. Classification and quantity of organisms used in the study. A. Prokaryotes;
B. Eukaryotes.

Supporting information S3. Visualization of pathway 0010 (glycolysis and gluconeogenesis) in the
KEGG database. KEGG presents information about the metabolic pathway as a sequential block diagram.
In which colored or white boxes represent proteins. Colored boxes are proteins present in the metabolic
pathway, while whites are absent proteins in these pathways. Source:
https://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa& mapno=00010.

Supporting information S4. Pathway 00232 - Caffeine metabolism. A. Dynamic visualization
with 26 nodes and 76 edges. B. KEGG metabolic pathway diagram.

Supporting information S5. Pathway 00061 - Fatty acid biosynthesis. A. Dynamic visualization with
170 nodes and 1063 edges. B. KEGG metabolic pathway diagram.

Supporting information S6. A. APs with high betweenness and frequency of pathway 00010 - glycolysis
and gluconeogenesis. I) AP ec:4.1.2.13 with betweenness of 0.0255 and frequency of 96.98%. II) AP
ec:4.2.1.11 with betweenness of 0.0490 and frequency of 98.26%. III) AP ec:2.7.1.40 with betweenness
of 0.0282 and frequency of 91.73%. B. APs with low betweenness and frequency of pathway 00640 -
propanoate metabolism. IV) AP ec:1.1.1.80 with betweenness of 0.0023 and frequency of 2.23%. V) AP
ec:4.1.1.4 with betweenness of 0.0034 and frequency of 7.69%. VI) AP ec:2.8.3.8 with betweenness of
0.0042 and frequency of 9.79%. VII) AP ec:1.1.1.77 with betweenness of 0.0017 and frequency of
6.89%.

Supporting information S7. Functions and metabolic pathways related to the APs present in a higher
number of metabolic pathways. The metabolic pathways were obtained from KEGG, the APs functions
were obtained from Uniprot.

AP name and class Metabolic Pathways Functions in prokaryotes
● EC 1.2.1.11 -
Aspartate-semialdeh
yde dehydrogenase
● Oxidoreductase
1. Glycine, serine and
threonine metabolism
2. Cysteine and methionine
metabolism
3. Lysine biosynthesis
Biosynthesis of amino acids in prokaryotes, fungi, and some
higher plants. It forms an early branch point in the metabolic
pathway forming lysine, methionine, leucine, and isoleucine from
aspartate
● EC 2.2.1.6 -
Acetolactate
synthase
● Transferases
1. Valine, leucine and
isoleucine biosynthesis
2. Butanoate metabolism
3. C5-Branched dibasic
acid metabolism
Catalyzes the first step in the synthesis of the branched-chain
amino acids in plants, animals, and microorganisms, transferring
aldehyde or ketone residues.
● EC 1.1.1.1 - Alcohol
dehydrogenase
● Oxidoreductase
1. Chloroalkane and
chloroalkane degradation
2. Metabolism of
xenobiotics by cytochrome
P450
3. Tyrosine metabolism
Facilitate the interconversion between alcohols and aldehydes or
ketones with the reduction of nicotinamide adenine dinucleotide
(NAD+) to NADH. In yeast, plants, and many bacteria, some
alcohol dehydrogenases catalyze the opposite reaction as part of
fermentation to ensure a constant supply of NAD+. In humans
and animals, they serve to break down alcohols that otherwise are
toxic, and they also participate in the generation of useful
aldehyde, ketone, or alcohol groups during biosynthesis of
various metabolites.
● EC 6.3.2.9 -
D-glutamate ligase
● Ligases
1. D-Glutamine and
D-glutamate metabolism
2. Peptidoglycan
biosynthesis
Form carbon-nitrogen bonds as acid-D-amino-acid ligases
(peptide synthases).
● EC 4.1.3.27 -
Anthranilate
synthase
● Lyases
1. Phenylalanine, tyrosine
and tryptophan
biosynthesis
2. Phenazine biosynthesis
Catalyzes the two-step biosynthesis of anthranilate, an
intermediate in the biosynthesis of L-tryptophan. In the first step,
the glutamine-binding beta subunit (TrpG) of anthranilate
synthase (AS) provides the glutamine amidotransferase activity
which generates ammonia as a substrate that, along with
chorismate, is used in the second step, catalyzed by the large
alpha subunit of AS (TrpE) to produce anthranilate. In the
absence of TrpG, TrpE can synthesize anthranilate directly from
chorismate and high concentrations of ammonia.
● EC 1.1.1.3 -
Homoserine
dehydrogenase
● Oxidoreductase
1. Glycine, serine and
threonine metabolism
2. Cysteine and methionine
metabolism
Catalyzes the third step in the aspartate pathway; the
NAD(P)-dependent reduction of aspartate beta-semialdehyde into
homoserine. Homoserine is an intermediate in the biosynthesis of
threonine, isoleucine, and methionine. In photosynthetic
organisms, glutamine, glutamate, and aspartate accumulate during
the day and are used to synthesize other amino acids. At night,
aspartate is converted to asparagine for storage. Mammals lack
the enzymes involved in the aspartate metabolic pathway,
including homoserine dehydrogenase. As lysine, threonine,
24

methionine, and isoleucine are made in this pathway, they are
considered essential amino acids for mammals.
● EC 4.2.1.2 -
Fumarate hydratase
● Lyases
1. Citrate cycle (TCA
cycle)
2. Pyruvate metabolism
Catalyzes the reversible hydration/dehydration of fumarate to
malate. Fumarase comes in two forms: mitochondrial and
cytosolic. The mitochondrial isoenzyme is involved in the Krebs
Cycle and the cytosolic isoenzyme is involved in the metabolism
of amino acids and fumarate.
● EC 1.14.14.91 -
Cinnamate
4-monooxygenase
● Oxidoreductases
1. Flavonoid biosynthesis
2. Stilbenoid,
diarylheptanoid and
gingerol biosynthesis
Acts on paired donors, with O2 as oxidant and incorporation or
reduction of oxygen. The oxygen incorporated need not be
derived from O2 with NADH or NADPH as one donor, and
incorporation of one oxygen atom into the other donor.
● EC 1.1.1.100 -
Beta-ketoacyl
reductase
● Oxidoreductases
1. Fatty acid biosynthesis
2. Biotin metabolism
Catalyzes the reaction of conversion of
3-hydroxyacyl-[acyl-carrier-protein] to
3-oxoacyl-[acyl-carrier-protein] using NADPH as a reducing
agent. Acts on the CH-OH group as hydride donors with NAD+
or NADP+ as hydride acceptors.

Conteúdos escolhidos para você

15 pág.

REDES COMPLEXAS PARA COMPARAR PADRÕES DE EXPRESSÃO DE VIAS E PROTEÍNAS ASSOCIADAS AO DESENVOLVIMENTO E TRATAMENTO DO GLIOBLASTOMA MULTIFORME

292 pág.

Bioinformatica_1 12022

ESTÁCIO

IgorAugustoBrandao_DISSERT

Biologia

Escuela Universidad Nacional

Mais conteúdos dessa disciplina

Conteúdos escolhidos para você

Resumo Bioinformática

Bioinformatica_1.1

Bioinformatica_1.1 2

REDES COMPLEXAS PARA COMPARAR PADRÕES DE EXPRESSÃO DE VIAS E PROTEÍNAS ASSOCIADAS AO DESENVOLVIMENTO E TRATAMENTO DO GLIOBLASTOMA MULTIFORME

Bioinformatica_1 12022

Perguntas dessa disciplina

O estabelecimento de uma infecção, em um hospedeiro susceptível, envolve vários mecanismos, sendo um dos mais relevantes o modo de interação do mic...

Quais os principais componentes do sistema imunológico inato?

2. Quais são as categorias de meios de cultura e qual o objetivo de utilização destes meios?

4. Quais são as estratégias ou mecanismos efetores de defesa principais da imunidade adquirida para combater a maioria dos microrganismos?

Quais são os processos responsáveis pela transferência de material genético de uma célula bacteriana para outra?