Prévia do material em texto
UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE INSTITUTO METRÓPOLE DIGITAL PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA MESTRADO EM BIOINFORMÁTICA Igor Augusto Brandão Abordagens da biologia de sistemas na investigação dos pontos de articulação nas rotas metabólicas do KEGG Natal - RN 2020 UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE INSTITUTO METRÓPOLE DIGITAL PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA MESTRADO EM BIOINFORMÁTICA Igor Augusto Brandão Abordagens da biologia de sistemas na investigação dos pontos de articulação nas rotas metabólicas do KEGG Universidade Federal do Rio Grande do Norte Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin – UFRN Coorientador: Prof. Dr. José Miguel Ortega – UFMG Natal - RN 2020 UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE INSTITUTO METRÓPOLE DIGITAL PROGRAMA DE PÓS-GRADUAÇÃO EM BIOINFORMÁTICA MESTRADO EM BIOINFORMÁTICA Abordagens da biologia de sistemas na investigação dos pontos de articulação nas rotas metabólicas do KEGG Dissertação apresentada ao Programa de Pós- graduação em BioInformática da Universidade Federal do Rio Grande do Norte por Igor Augusto Brandão como parte dos requisitos para a obtenção do tı́tulo de Mestre em BioInformática. Área de concentração: Bioinformática Linha de Pesquisa: Biologia de Sistemas Orientador: Prof. Dr. Rodrigo Juliani S. Dalmolin Coorientador: Prof. Dr. José Miguel Ortega Natal, RN 2020 Brandão, Igor Augusto. Abordagens da biologia de sistemas na investigação dos pontos de articulação nas rotas metabólicas do KEGG / Igor Augusto Brandão. - Natal, 2020. 61.: il. Dissertação (Mestrado) - Universidade Federal do Rio Grande do Norte. Instituto Métropole Digital. Programa de Pós-graduação em Bioinformática. Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin. Coorientador: Prof. Dr. José Miguel Ortega. 1. Biologia de sistemas - Dissertação. 2. KEGG - Dissertação. 3. Rotas metabólicas - Dissertação. 4. Topologia de redes - Dissertação. 5. Pontos de articulação - Dissertação. 6. Visualização de redes - Dissertação. I. Dalmolin, Rodrigo Juliani Siqueira. II. Ortega, José Miguel. III. Universidade Federal do Rio Grande do Norte. IV. Título. RN/UF/BSCB CDU 57:004 Universidade Federal do Rio Grande do Norte - UFRN Sistema de Bibliotecas - SISBI Catalogação de Publicação na Fonte. UFRN - Biblioteca Setorial Prof. Leopoldo Nelson - Centro de Biociências - CB Elaborado por KATIA REJANE DA SILVA - CRB-15/351 IGOR AUGUSTO BRANDÃO ABORDAGENS DA BIOLOGIA DE SISTEMAS NA INVESTIGAÇÃO DOS PONTOS DE ARTICULAÇÃO NAS ROTAS METABÓLICAS DO KEGG Defesa de Mestrado apresentanda ao Programa de Pós-Graduação em Bioinformática da Universidade Federal do Rio Grande do Norte. Área de concentração: Bioinformática Linha de Pesquisa: Biologia de Sistemas Orientador: Prof. Dr. Rodrigo Juliani Siqueira Dalmolin Natal, 14 de agosto de 2020. BANCA EXAMINADORA Prof. Dr. Rodrigo Juliani Siqueira Dalmolin Universidade Federal do Rio Grande do Norte (Presidente) (por videoconferência) Prof. Dr. César Rennó Costa Universidade Federal do Rio Grande do Norte (Examinador Interno) (por videoconferência) Prof. Dr. Ricardo D’Oliveira Albanus Universidade de Michigan (Examinador Externo à Instituição) (por videoconferência) Agradecimentos Este trabalho constitui uma importante fase do desenvolvimento acadêmico e exigiu diversas habilidades que não se adquirem à portas fechadas: paciência, capacidade de criar parcerias, perseverança e acima de tudo sonho - o sonho de trazer todo o aprendizado para a vida. Primeiramente agradeço à Alice pelas importantes contribuições para este trabalho; Aos meus pais que mesmo distantes estão sempre presentes em minha vida; Ao meu professor e orientador, Rodrigo Dalmolin, por acreditar em minha capacidade e por me guiar neste inı́cio de minha jornada acadêmica; Ao professor Lucélio do Instituto Metrópole Digital, pelo apoio ao longo destes anos em diversas empreitadas; Aos colegas do grupo de pesquisa, em especial à Clóvis e Diego, pelas contribuições para este trabalho; Aos colegas do BioME pela convivência e por tornar o caminho mais agradável; Aos demais professores e colaboradores do BioME, em especial Jessica e Aldo, pela disposição e paciência em me ajudar nos momentos em que precisei. O presente trabalho foi realizado no PPg-Bioinfo/UFRN com apoio da Coordenação de Aperfeiçoamento de Pessoal de Nı́vel Superior - Brasil (CAPES) - Código de Finan- ciamento 001. Sumário Lista de ilustrações . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Lista de tabelas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1 INTRODUÇÃO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.1 Biologia de sistemas . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.2 O estudo de redes aplicado a vias metabólicas . . . . . . . . . . . 13 1.3 Caracterı́sticas topológicas de uma rede . . . . . . . . . . . . . . 15 1.4 Estudo das vias metabólicas do KEGG . . . . . . . . . . . . . . . 18 2 JUSTIFICATIVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3 OBJETIVOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1 Objetivo geral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2 Objetivos especı́ficos . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Artigo 23 Discussão 48 Conclusão 56 REFERÊNCIAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 APÊNDICE A – MATERIAL SUPLEMENTAR DO ARTIGO . . . . . 62 6 Lista de ilustrações Figura 1 – Representação de uma via metabólica como grafo . . . . . . . . . . 14 Figura 2 – Avaliação topológica de uma rede . . . . . . . . . . . . . . . . . . . 16 Figura 3 – Impacto da remoção de um AP sobre a rede . . . . . . . . . . . . . 50 Figura 4 – Ferramenta de visualização de redes KPV . . . . . . . . . . . . . . . 52 7 Lista de tabelas Tabela 1 – Propriedades topológicas de uma rede . . . . . . . . . . . . . . . . 17 Tabela 2 – Identificadores do KEGG . . . . . . . . . . . . . . . . . . . . . . . . 19 8 Lista de Sı́mbolos e Abreviações KEGG – Kyoto Encyclopedia of Genes and Genomes AP – Articulation point EC – Enzyme commission number KO – KEGG Orthology GO – Gene Ontology XML – Extensible Markup Language KGML – KEGG XML KPV – KEGG Pathway Viewer 9 Resumo O estudo da essencialidade das proteı́nas por meio de métodos laboratoriais é caro e não escalável para grandes quantidades de proteı́nas, desta forma é relevante avaliar a essencialidade das várias proteı́nas de uma via metabólica como um todo através de ferramentas computacionais. Em geral, uma via metabólica pode ser analisada como grafos, os quais fornecem diferentes recursos para o estudo das caracterı́sticas topológicas de redes, como os seus pontos de articulação e disposição dos nós. Atualmente, pesquisas em bioinformática estudam a essencialidade de proteı́nas com base nas métricas de betweenness e degree, contudo a teoria dos grafos sugere que os pontos de articulação podem ser nós importante em uma rede, resta avaliar se esses pontos de articulação são de fato essenciais para as vias metabólicas e seu impacto topológico na rede. Utilizando análises baseadas em métricas de rede, o nosso objetivo é verificar se de fato esses pontos de articulação representam gargalos na rede, sendo estes caracterizados como proteı́nas de frequências elevadas e localizadas no centro das redes. Para tanto, identificamos os pontos de articulação em diferentes vias metabólicas do KEGG, avaliamos o impacto de cada um deles, calculamos sua frequência e comparamos suas ocorrências com as demais proteı́nas. Inicialmente, fizemos o levantamento das vias metabólicas do KEGG que estavam disponı́veis através dos arquivosKGML associados às redes. Após a listagem das vias disponı́veis, os dados estruturais de cada uma delas foram convertidos em objetos do tipo grafo. Os parâmetros ponto de articulação, betweenness e degree foram utilizados para classificar as proteı́nas constantes em cada via metabólica. Aproximadamente 20% das proteı́nas foram classificadas como pontos de articulação, das quais 3,75% foram identificadas pela alta frequência e localização em regiões centrais da rede. Além disso, a maior concentração dos pontos de articulação ocorreu na faixa de frequência dos 80 a 90%. Um padrão de não aleatoriedade na distribuição dos pontos de articulação foi identificado nos grupos com frequências acima de 74,5%.Finalmente, a biossı́ntese de esteroides foi a via metabólica com o maior número de pontos de articulação com frequências superiores a 80% em sua constituição. A oxidoredutase foi a classe dos pontos de articulação presente no maior número de vias metabólicas. As descobertas sugerem que os gargalos das redes avaliadas são pontos de articulação com as frequências mais altas e localizados no centro da rede. Resta realizar análises mais aprofundadas a respeito dos papéis biológicos destes pontos de articulação encontrados. Palavras-chaves: Biologia de sistemas, KEGG, ponto de articulação, via metabólica, topologia de redes, visualização de redes. 10 Abstract The study of proteins essentiality through laboratory methods is expensive, time- consuming and not scalable for large amounts of proteins. Besides, it is relevant to evaluate the essentiality of several proteins of a metabolic pathway as a whole. The metabolic pathways can be analyzed as graphs, which provide several tools to study the topological features such as the articulation points. Nowadays, research in bioinfor- matics studies the essentiality of proteins based on betweenness and degree metrics, however, graph theory suggests that articulation points could be essential nodes in a network. It remains to be determined whether these articulation points are essential in metabolic pathways and their topological impact on the network. Using network analysis via metrics and biologic curation, we aim to verify if bottlenecks are proteins with the highest frequencies and located in the center of KEGG metabolic pathways. For this purpose, we identified the articulation points in different networks, evaluate the impact of each articulation point, calculate their frequency and compare them with occurrences of non-articulation points. We consulted KEGG pathways available as KGML files. After, the data was transformed into a graph object. Two centrality parameters including articulation points and degree are determined and the essential proteins based on these parameters are classified. Approximately 20% of the proteins are articulation points. The articulation points with high-frequency which are located in central regions of the network were considered the most important (3.75%). In addition, the highest concentration of articulation points occurred in the frequency range of 80-90%. A pattern of non-randomness of articulation points was identified in the protein groups that have a frequency of at least 74.5%. Finally, steroid biosynthesis is the metabolic pathway with the highest number of articulation points with frequency higher than 80%. Besides, oxidoreductase is the articulation point class present in the highest number of metabolic pathways. Overall, the findings suggest that bottlenecks are articulation points with highest frequencies and located in the center of the network. It remains to perform a deep analysis on the articulation points biological roles. Key words: Systems biology, KEGG, articulation point, metabolic pathway, network topology, network visualization. 11 12 1 Introdução 1.1 Biologia de sistemas O desafio de compreender sistemas biológicos complexos é o que caracteriza o campo de estudo da biologia, a qual requer a integração de pesquisas experimentais e computacionais [1]. A biologia de sistemas é uma disciplina que estuda sistemas biológicos de uma perspectiva holı́stica e interdisciplinar, reunindo profissionais de dife- rentes campos de estudo tais como biólogos, matemáticos, cientistas da computação, fı́sicos e engenheiros, de forma a obter componentes orientados para biologia e com- ponentes orientados para sistemas, tais como bases de dados. Além disso, a biologia de sistemas é caracterizada por grandes conjuntos de dados e modelagem, sendo estes decorrentes do desenvolvimento de tecnologias de alta vazão (high-throughput) nos anos 90. Sem modelos matemáticos e simulações computacionais, no entanto, os dados não podiam ser entendidos na época [2]. Os primeiros biólogos de sistemas fizeram diversas tentativas para melhorar a metodologia de estudo destas informações, tais como o desenvolvimento de algoritmos para a manipulação de grandes conjuntos de dados e formalismos que modelam uma rede especı́fica, incluindo redes reguladoras de genes, redes de transdução de sinal e redes metabólicas [3]. Um aspecto relevante sobre as informações coletadas por meio do projeto Genoma Humano é a hierarquia que estas seguem na natureza, a saber: DNA → mRNA → proteı́na → interações proteicas → vias de informação → redes de informação → células → tecidos ou redes de células → um organismo → populações → ecologias. A tarefa central da biologia de sistemas é (I) coletar informações de cada um desses nı́veis distintos para sistemas biológicos individuais e (II) integrar esses dados para gerar modelos matemáticos preditivos do sistema [4]. A informação biológica possui caracterı́sticas importantes: • Opera em vários nı́veis hierárquicos de organização; • É processada em redes complexas; • Essas redes de informação são tipicamente robustas, de modo que muitas perturbações únicas não as afetam significativamente; • Existem nós-chave na rede, onde perturbações podem ter efeitos profundos; estes nós oferecem alvos poderosos para a compreensão e manipulação do sistema [4]. Introdução 13 Nesse contexto, é importante ressaltar que sistemas biológicos, como células, são considerados ”sistemas complexos”. Uma noção geral de sistemas complexos é composta por um número muito grande de elementos simples e idênticos que interagem de maneira a produzir comportamentos ”complexos” [5], no entanto a realidade dos sistemas biológicos é um pouco diferente. Nesse caso, um grande número de conjuntos de elementos funcionalmente diversos e frequentemente multifuncionais interage de maneira seletiva e não linear para produzir comportamentos coerentes. Ao contrário dos sistemas complexos de elementos simples (nos quais as funções emergem das propriedades das redes que formam e não de qualquer elemento especı́fico), as funções nos sistemas biológicos dependem de uma combinação da rede e dos elementos especı́ficos envolvidos [1]. Por muito tempo, assumiu-se que os componentes de sistemas complexos, como a célula, são aleatoriamente conectados. Entretanto, na década passada um grande número de pesquisas mostrou que muitas redes reais, independentemente de idade, função e escopo, convergem para arquiteturas semelhantes, uma universalidade que permitiu pesquisadores de diferentes disciplinas adotar a teoria das redes como um paradigma comum [6]. 1.2 O estudo de redes aplicado a vias metabólicas As redes regulatórias e de interação são representações importantes das redes de proteı́nas e são caracterizadas por possuir arestas direcionais e não direcionais, respectivamente [7]. O estudo e a aplicação da teoria dos grafos no contexto de redes biológicas ganharam notoriedade, pois é necessário entender sistematicamente como as moléculas interagem umas com as outras e como suas funções são determinadas na complexa maquinaria celular, isoladamente ou em conjunto com outras células [8]. As viasmetabólicas e os ciclos são cadeias de reação em que produtos quı́micos se tornam o substrato para a próxima etapa [9]. Além disso, a análise estrutural e funcional de redes metabólicas baseadas no genoma é importante para entender os princı́pios de design e a regulação do metabolismo em nı́vel sistêmico. A rede metabólica é organizada em muitos módulos pequenos e altamente conectados, os quais se combinam hierarquicamente em unidades maiores e menos coesas [10]. Portanto, é importante uma redução racional da rede metabólica em sua estrutura central e uma compreensão mais profunda de seus módulos funcionais [11]. A principal caracterı́stica das vias metabólicas é o fato de suas reações serem conectadas por seus intermediários, sendo os produtos de uma reação os reagentes das reações subsequentes [12]. Em geral, as vias metabólicas podem ser classificadas como redes de fluxo, onde uma variável especı́fica, como fluxo de massa ou energia, pode ser conservada em cada nó. As redes metabólicas têm propriedades únicas, Introdução 14 como as restrições de conservação que devem ser satisfeitas em cada nó. Outra propriedade é a representação dos nós das redes metabólicas como metabólitos e as arestas como reações catalisadas por produtos genéticos especı́ficos [13]. Abaixo, é apresentada a Figura 1 exemplificando a representação de uma via metabólica como grafo. Figura 1 – Representação da via metabólica relativa a biossı́ntese de estilbenoide como grafo (desenvolvido pelo autor1). Em termos de grafos, existem diversas maneiras de se construir uma rede a partir de um modelo metabólico [14]. Por exemplo, (I) é possı́vel representar uma via metabólica como um grafo no qual os metabólitos são os nós e arestas são as reações que transformam um metabólito em outro [11, 15, 16, 17], (II) como um grafo em que as reações são os nós e arestas os metabólitos compartilhados entre elas [18, 19, 20], ou até mesmo (III) como um grafo bipartido com reações e metabólitos como nós [21]. O estudo de redes sugere que as redes biológicas têm duas propriedades estruturais importantes. Primeira, a maioria dessas redes (incluindo as vias metabólicas) são scale-free e possuem uma propriedade de ”small-world”, caracterizada por grande parte de suas conexões serem estabelecidas entre os vértices mais próximos [7, 11, 13]. No 1 Disponı́vel em: <〈https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway module/viewer? pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09〉>. Acesso em: 08 jun. 2020. https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway_module/viewer?pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09 https://igorabrandao.com.br/kegg-pathway-bottleneck//pathway_module/viewer?pathwayCode=cGIwWHJ0bFJNZlRtRm4rcSs3aGRBZz09 Introdução 15 entanto, é importante observar a existência de forças motrizes que podem influenciar na criação de um novo nó na rede biológica, como embaralhamento de exons, retroposição, elementos móveis, transferência horizontal de genes, duplicação gênica e o fato de uma nova conexão deste nó refletir sua origem com a natureza de suas propriedades [22]. Segunda, sugere-se que as redes scale-free tenham alta tolerância a erros (resistência a falhas aleatórias) e baixa tolerância a ataques (vulnerabilidade à falha dos nós altamente conectados) [13]. 1.3 Caracterı́sticas topológicas de uma rede Os perfis topológicos de redes podem ser avaliados através da observação de caracterı́sticas especı́ficas da via, como por exemplo: 1.3.1 Betweenness A métrica betweenness centrality avalia o número total de caminhos mais curtos, não redundantes, e que passam por um determinado nó ou aresta [23]. 1.3.2 Degree O degree informa quantos links o nó possui para se conectar a outros nós [7]. É importante ressaltar que a distribuição do degree é ampla, com uma cauda que geralmente segue uma lei de potência: portanto, muitos vértices com baixo degree coexistem com alguns vértices com alto degree [24]. A avaliação do degree dos vértices de uma rede traz a tona a ideia de nós hub, definidos como nós que estão conectados a um número substancial de outros nós na rede, e em particular, a nós que são mais densamente conectados do que os nós mais altamente conectado em uma rede scale-free [25]. Contudo, definir este número substancial de conexões pode ser algo subjetivo, por isso utilizou-se como referência a definição de Yu et al., na qual os nós hubs são aqueles com as 20% maiores distribuições de degree [23]. 1.3.3 Comunidades A distribuição das arestas em uma rede não é apenas global, mas também lo- calmente não homogênea, com altas concentrações de arestas dentro de grupos especı́ficos de vértices e baixas concentrações entre esses grupos. Essa caracterı́stica de redes é conhecida como estrutura de comunidade [26]. Introdução 16 1.3.4 Ponto de articulação Um nó é um AP se a sua remoção desconectar a rede ou aumentar o número de componentes conectados a ela [27, 28]. Além disso, cada AP possui um impacto diferente sobre a rede, o qual pode ser definido como o número de vértices que são desconectados do principal (maior) componente conectado (CC) sobrevivente após a remoção do vértice [29]. 1.3.5 Pontes Uma aresta pode ser classificada como ponte se a sua remoção desconectar o grafo ou aumentar o número de componentes conectados [30]. 1.3.6 Exemplo de avaliação topológica de uma rede A partir da Figura 2 é possı́vel identificar alguns dos elementos topológicos citados anteriormente: Figura 2 – Avaliação topológica de uma rede. Os cı́rculos numerados na figura representam os vértices da rede, as linhas conectando os vértices são as arestas (neste caso não direcionadas). Esta rede possui três comunidades: A, B e C, demarcadas pelos cı́rculos tracejados e pela cor dos vértices que as compõem. Além disso, é possı́vel identificar que os nós participantes de uma mesma comunidade são altamente conectados entre si, porém possuem poucas ligações com as demais comunidades. A aresta que conecta os nós [1, 4] representa uma ponte do grafo e os nós 1, 3, 6, 7 e 8 são os pontos de articulação da rede, uma vez que se um deles for removido o grafo será desconectado (desenvolvido pelo autor). Introdução 17 Adicionalmente, é possı́vel avaliar outras propriedades topológicas da rede não visı́veis em uma primeira observação através da tabela 1: Tabela 1 – Propriedades topológicas de uma rede. A numeração presente na coluna node é equivalente aos números dos vértices da figura 2. A coluna betweenness mostra que os vértices 6, 7, 8 são aqueles que mais estão presentes nas conexões de ponta-a-ponta da rede, como por exemplo [2, 11] ou [1, 12]. Além disso, fica claro que os nós 11 e 12 são os mais densamente conectados, uma vez que estes fazem vizinhança com quatro nós. Por fim, a coluna diameter mostra que o número necessário de passos para conectar as extremidados da rede é sete, no caso para conectar os nós [4, 10] (desenvolvido pelo autor). Esta foi somente uma sugestão para a realização de uma análise topológica de rede, existem outras maneiras de se estudar um grafo e outras propriedades que podem ser exploradas conforme a necessidade da pesquisa. Além disso, é possı́vel aplicar este tipo de metodologia para estudos massivos em bases de dados. Introdução 18 1.4 Estudo das vias metabólicas do KEGG Atualmente existem diversas bases de dados contendo informações sobre vias metabólicas e seus componentes. No entanto, foi utilizada a plataforma KEGG, pelo fato desta possuir uma base de conhecimento manualmente curada em mapas metabólicos de interação molecular e relações de ortologia entre genes e produtos gênicos [31]. No KEGG, os nós representam enzimas e as arestas reações que transformam um metabólito em outro [11, 15, 16]. O KEGG consiste em trêsbases de dados: (I) PATHWAY, para representação de funções de ordem superior em termos de redes de interação, (II) GENES, para a coleção de catálogos de genes para todos os genomas completamente sequenciados e alguns parciais [32], e (III) LIGAND, para a coleção de compostos quı́micos na célula, moléculas e reações enzimáticas [33]. O banco de dados KEGG/PATHWAY, especializado em vias metabólicas, é repre- sentado por 500 diagramas gráficos para as vias metabólicas de referência. Cada via pode ser vista como uma rede de enzimas ou uma rede de ECs. Uma vez que os genes das enzimas são identificados no genoma (com base na semelhança de sequência e na correlação posicional dos genes) e os ECs adequadamente atribuı́dos, as vias especı́ficas do organismo podem ser construı́das computacionalmente, correlacionando genes no genoma com produtos gênicos (enzimas) nas vias de referência conforme os ECs correspondentes [32]. Os diagramas de rotas do KEGG podem ser utilizados como referência para interpretação funcional, além de conter relações binárias que representam interações e relações moleculares, as quais podem ser utilizadas para calcular e comparar rotas [31]. Vale ressaltar que em termos de estrutura informacional, cada via metabólica do KEGG possui um arquivo KGML com a identificação das proteı́nas, reações quı́micas e suas interações. Cada diagrama do KEGG é identificado pelo KEGG identifier, conforme tabela 2. Embora estes diagramas sejam informativos, um estudo topológico de rede sobre a via metabólica pode ser importante para entender caracterı́sticas especı́ficas de cada rede. Nesse sentido, a representação via grafos pode ser importante, uma vez que a representação visual de redes oferece uma forma rápida e intuitiva de compreen- der seus resultados, além de possibilitar o estudo de diversas caracterı́sticas destas redes por meio da avaliação topológica. Introdução 19 Identificador Significado map via de referência desenhada manualmente ko via de referência que destaca os KOs ec via de referência que destaca os ECs rn via de referência que destaca as reações <org> via especı́fica do organismo gerada pela conversão de KOs em identi- ficadores de genes (Entrez Gene ID) Numeração 011 via global (arestas vinculadas aos KOs) 012 via geral (arestas vinculadas aos KOs) 010 via de estrutura quı́mica 07 via de estrutura de medicamento outras via regular (nós vinculados aos KOs) Tabela 2 – Identificadores do KEGG. Cada mapa metabólico é identificado pela combinação de um prefixo com 2-4 letras e um número de 5 dı́gitos (desenvolvido pelo Laboratório Kanehisa2). 2 Disponı́vel em: <〈https://www.genome.jp/kegg/pathway.html〉>. Acesso em: 19 jun. 2020. https://www.genome.jp/kegg/pathway.html 20 2 Justificativa O desafio no processo de obtenção e mineração de dados em larga escala consiste em uma atividade recorrente no âmbito da pesquisa cientı́fica e também em projetos comerciais. Diante disso, o desenvolvimento e utilização de ferramentas para análise de dados massivos em diversas bases de conhecimento é necessário para fornecer aos biólogos uma visão integrada dos dados biológicos [34]. Neste contexto, a prin- cipal contribuição deste trabalho é a avaliação sistemática da topologia dos mapas metabólicos disponı́veis na plataforma KEGG com o intuito de encontrar os pontos de vulnerabilidade destas vias, uma vez que a disruptura delas pode estar associadas a desordens ou doenças para os organismos em geral [35]. A determinação experimental de genes essenciais em laboratório ainda é cara, demorada e não escalável. Neste sentido, previsões computacionais de genes es- senciais podem ser utilizadas para restringir o universo de estudo e fornecer uma lista de prioridades para validação experimental [36]. Adicionalmente, a utilização de propriedades baseadas em rede são capazes de prever melhor os genes essenciais entre diferentes organismos, uma vez que estes genes são efetivamente transferidos em organismos intimamente relacionados por compartilharem genes ortólogos [36]. 21 3 Objetivos 3.1 Objetivo geral Verificar se os pontos de vulnerabilidade das vias metabólicas do KEGG possuem as seguintes caracterı́sticas: (I) são os nós com as maiores ocorrências entre as vias metabólicas de diferentes organismos, (II) desconectam a rede em componentes distintos após sua remoção, (III) estão localizados no centro destas vias metabólicas. 3.2 Objetivos especı́ficos – Avaliar a ocorrência de cada proteı́na no conjunto das vias metabólicas do KEGG para os diferentes organismos disponı́veis; – Identificar quais nós são APs e analisar suas caracterı́sticas topológicas para cada via metabólica em diferentes organismos; – Detectar as faixas de frequência em que os APs não estão aleatoriamente dis- tribuı́dos; – Analisar funcionalmente os APs para compreender o seu papel biológico; – Desenvolver uma ferramenta para visualização interativa das redes de maneira a permitir o acesso as caracterı́sticas topológicas de seus componentes e manipulação das mesmas. Artigo: Systems biology approaches in the investigation of articulation points in KEGG metabolic pathways Autores: Igor A. Brandão, Clovis F. Reis, Alice B. Câmara, Diego A. A. Morais, Leonardo R. S. Campos, Rodrigo J. S. Dalmolin Systems biology approaches in the investigation of articulation points in KEGG metabolic pathways Igor A. Brandão1, Clovis F. Reis1, Alice B. Câmara2, Diego A. A. Morais1, Leonardo R. S. Campos1, Rodrigo J. S. Dalmolin1* 1 Bioinformatics multidisciplinary environment, Federal University of Rio Grande do Norte, Natal, Rio Grande do Norte, Brasil 2 Biophysics and Pharmacology Department, Federal University of Rio Grande do Norte, Natal, Rio Grande do Norte, Brasil 1 1722, Odilon Gomes de Lima Av, Natal, Rio Grande do Norte 59078-400 2 3000, Sen. Salgador Filho Av, Natal, Rio Grande do Norte 59064-741 *Corresponding author E-mail: rodrigo.dalmolin@imd.ufrn.br Running title: Articulation points Investigation in KEGG metabolic pathways Key-words: KEGG, articulation point, metabolic pathway, network topology, network visualization. 1 Abstract The metabolic pathways can be analyzed as graphs, which provide several tools to study the topological features such as the articulation points. Nowadays, research in bioinformatics studies the essentiality of proteins based on betweenness and degree metrics, however, graph theory suggests that articulation points could be essential nodes in a network. It remains to be determined whether these articulation points are essential in metabolic pathways and their topological impact on the network. Here, we constructed biological networks of 153 canonical metabolic pathways from the KEGG database. We identified the articulation points of each canonical network and evaluated their frequency in 6221 species annotated in the KEGG database. Using network analysis via metrics and biologic curation, we verified that articulation points are proteins with high frequencies across species when comparing with non-articulation points. The articulation points with high-frequency are located in central regions of the network, while those articulation points with low frequency are in the network periphery. Finally, steroid biosynthesis is the metabolic pathway with the highest number of articulation points with a frequency higher than 80%. Besides, oxidoreductase is the articulation point class present in the highest number of metabolic pathways. It remains to perform a deep analysis of the articulation point biological roles. Introduction Metabolic pathways are composedof several components, such as enzymes and other proteins, which need to work coordinately to perform a given function (1). There are metabolic pathways widely distributed in the tree of life, and others restricted to a few taxonomic groups (2). The distribution of different metabolic pathways suggests that some of them are important to almost all organisms while others are important to organisms in a specific niche or from specific taxa (2). It is common to observe variations in the same metabolic pathway comparing different organisms. For example, both humans and yeast have carbohydrate metabolism. However, the yeast Saccharomyces cerevisiae has the enzyme pyruvate decarboxylase (EC:4.1.1.1), an enzyme absent in humans, while both species have enolase (EC:4.2.1.11) (3). A central characteristic of metabolic pathways is that reactions are connected by their intermediates. The products of a reaction are the reactants of subsequent pathway steps (4). In general, the metabolic pathways can be classified as flow networks, where a specific variable, such as mass or energy, can be conserved at each node. The study and application of graph theory in the context of metabolic networks has gained notoriety since it is necessary to systematically understand how molecules interact with each other and how their functions are determined within the complex cell machinery, alone or together with other cells (5). Additionally, structural and functional analysis of genome-based large-scale metabolic networks is important for understanding the design principles and regulation of metabolism at a system level. The metabolic networks are organized into many small and highly connected modules that combine hierarchically into larger, less cohesive units (6). Therefore, it is important for a rational reduction of the metabolic networks to their core structure and a deeper understanding of its functional modules (7). A great source of curated functional information about metabolic networks is the KEGG database (Kyoto Encyclopedia of Genes and Genomes - https://www.genome.jp/kegg/). It holds a knowledge base on metabolic pathway maps of molecular interaction and orthology relationships between genes/gene products in different species (8). In KEGG, nodes depict enzymes or other proteins, and edges represent the reactions that transform one metabolite into another (9-11). Although KEGG diagrams are informative and easy to understand, a network topological study about the metabolic pathway might be important to 2 understand specific network characteristics. In this sense, the representation via graphs could stand out because it enables the study of several network properties (12). Network analysis suggests that most biological networks, including metabolic networks, are scale-free and possess a "small-world" property characterized by a short average path length (7, 13, 14). There are biological phenomena that can influence a novel node attachment in a biological network, such as exon shuffling, retroposition, mobile elements, horizontal gene transfer, gene duplication (15). Scale-free networks are suggested to have high error tolerance (against random failure) and low attack tolerance (vulnerability to the failure of the highly connected nodes) (14). The network topological properties, such as betweenness, degree, articulation points (AP), and bridges, helps to understand network characteristics. Betweenness centrality measures the total number of non-redundant shortest paths going through a certain node or edge (16). The degree tells how many links a specific node has (13). Also, an edge can be classified as a bridge if its removal disconnects the graph or increases the number of connected components otherwise (17). A node is an AP if its removal disconnects the network or increases the number of connected components of the network (18, 19). Articulation points could be particularly crucial to metabolic networks. Taking into account that metabolic networks commonly involve mass flux, the deletion of an AP might interrupt the flux and, therefore, deeply impact the pathway. To date, no research systematically evaluated the topology of metabolic maps as a whole to study the APs. The study of APs is relevant because it is possible to detect the most vulnerable sites in a metabolic pathway by identifying these points. The APs can be measured by the number of vertices that get disconnected from the main (largest) surviving connected component after the removal of the AP (20). Here, we built the respective metabolic network from each KEGG canonical metabolic map available in the KEGG website and identified their articulation points. We investigated the frequency of each node from the canonical metabolic networks by evaluating their presence in the metabolic maps for each species present in the KEGG database. According to our results, there are two sets of APs: peripheral AP, which have low frequency and low betweenness; and central AP, with high frequency and high betweenness. Finally, we develop an interactive visualization enabling access to each protein topological features and the simulation of creation and remotion of nodes to provide a user-friendly interaction with the assessed networks. Results Metabolic pathways from KEGG maps In total, we retrieved 153 metabolic pathways from the KEGG database. A script was developed in R language using the KEGGREST package to load the list of metabolic pathways with its characteristics and the list of 6221 organisms with their respective taxonomic information. Approximately 32% of metabolic pathways are related to less than one thousand species (Figure 1). This might be due to the existence of specific pathways for certain plants, bacteria, and fungi, which are involved with their cellular functions in these organisms. An example would be Benzoxazinoid biosynthesis route 00402 (Supporting information S1a) responsible for the defense of some plant species (Supporting information S1b). Also, approximately 31% of metabolic pathways are related to at least 83% of KEGG species. These pathways being important to different types of living beings (Supporting information S2), such as glycolysis and gluconeogenesis route 00010, present in 6084 species (Supporting information S3). 3 Figure 1. Distribution of species quantity by metabolic pathways available in KEGG. X-axis: metabolic pathway. Y-axis: organisms count. The orange bars represent pathways related to eukaryotes. Eukaryotes represent 8.6% (534) of total species (6221). Blue bars represent pathways related to prokaryotes. Prokaryotes represent 91.4% (5687) of total species. We then converted each 153 KEGG metabolic maps into metabolic networks. Figure 2A shows the original KEGG map (map 0010) while Figure 2B shows the respective metabolic network, where nodes represent enzymes and links represent the reaction. The KEGG representation (Figure 2A), focused on displaying the chemical reaction sequentiallyto understand its logic and to emphasize the subproducts of each step. On the other hand, the graph representation (Figure 2B) provides a perspective regarding the topologic organization, the role of each node in the network arrangement, the communities and the hubs that connect them, and the network bottlenecks which could be interpreted as points of network vulnerability. 4 Figure 2. Comparison between KEGG and graph representation of glycolysis and gluconeogenesis pathway. A. Canonical KEGG map representing the 0010 - glycolysis and gluconeogenesis pathway. The rectangles represent enzymes and the circles represent metabolites (extracted from the KEGG website - https://www.genome.jp/kegg/pathway.html). B. Graph visualization generated with 58 nodes and 190 edges. Nodes represent enzymes and edges represent reactions. Node size represents protein frequency, colored borders indicate APs, color in the center of the node indicates betweenness level. The classification and EC code of the protein are present on the node label. The network was made using the R programming language (visNetwork package). Articulation points identification Articulation points are nodes whose removal disconnects the network in more than one component (21). We then identified the articulation points (AP) of each 153 metabolic networks. The APs represented 20.46% (n=1125) of the total evaluated nodes (n=5497), while the Non-APs nodes (n=4372) corresponded to 79.54% (Figure 3A). The number of APs varies depending on the pathway, ranging from 1 to 34 (Figure 3B). It is noteworthy that some metabolic pathways were composed mostly of APs, for example, the Benzoxazinoid biosynthesis pathway 00402 (Supporting information S1), which is composed of eight proteins in which six are APs representing a one-way bridge in terms of mass flow. Other pathways had just peripheral APs, such as Caffeine metabolism pathway 00232 (Supporting information S4). In this case, there are few communities, which are overcrowded and some of them are connected through bridges that represent one-way paths. These bridges are composed of APs, which connect with only one protein. Another example is the Fatty acid synthesis pathway 00061 (Supporting information S5), formed of an overcrowded central community with 5 7 peripheries highly connected to the center and another periphery connected to the center through just one bridge. This pathway contains 3 components disconnected from the main network component, each with their own APs. However, there are several highly connected nodes in the network central area, although none of them are an AP due to the connection handles, which guarantee redundancy in this area of the network. Figure 3. Distribution of articulation points. The AP represents 20.46% while the Non-AP (79.54%) of the total studied nodes A. The number of AP varies depending on the pathway. B. The x-axis represents each pathway and the y-axis depicts the number of nodes per pathway. Protein count per pathway varies from 2 to 170 nodes. Articulation points distribution across species Despite the possibility to find a given metabolic pathway in many species, not the same reactions are observed in each organism, especially when comparing evolutionarily distant species. Therefore, the node frequency in a given metabolic network might vary when evaluating different species. We evaluated the node frequency of each of the 153 metabolic networks. We first 6 identified the species in which a given metabolic network occurs and then ranked network nodes according to the observed frequency of occurring species. Through the hypergeometric analysis, we identified that the upper-frequency stratum (>= 75% of frequency) were enriched by APs (p-value < 0.01). Within this filter, 1047 more frequent proteins were selected, which presented a higher frequency than the cut-off point, of which 236 are APs (22.54%). Regarding the total of studied proteins, 1125 APs (20.46%) were evaluated in which a higher AP density was observed above the cut-off point when compared to non-AP proteins (Figure 4A). Besides, at frequencies groups >=80% and <30%, we found the highest concentration of AP (Figure 4A). To differentiate the profile of the APs in the groups of high frequency (>=80%) and low frequency (<30%), we studied the betweenness of the APs as an indicator of centrality in the assessed metabolic pathways. In this sense, we observed that the APs in the group greater/equal than 80% presented betweenness higher than the group lower than 30% (Figure 4B), suggesting that these APs are located at more strategic points of the metabolic pathways (Supporting information S6). Additionally, we evaluated the APs distribution by each frequency range and also the overall proportion of the APs over the set of studied proteins. Our analysis showed that the highest concentration of APs occurred in the frequency range of 80-90%, which was confirmed by the binomial test (Figure 4C). It suggests that this set of proteins is essential for its metabolic pathways since they are present in most of the species containing a particular metabolic pathway. 7 Figure 4. AP proteins profile assessment. A. Heatmap showing the concentration of proteins by frequency. The proteins were divided into 2 groups: AP and Non-AP protein. The darker shades of blue indicate a higher concentration of proteins. The lighter shades of blue indicate lower concentrations of proteins. The dashed line represents the cut-off point of the hypergeometric. Proteins that fall below the cut-off point are randomly distributed when compared to proteins above the cut of the hypergeometric. B. Betweenness in APs frequencies groups <30% and >=80%. Distribution of betweenness in proteins groups lower than 30% (853) or greater/equal than 80% (206) of frequency. Analyses were performed using 1059 APs. Upper and lower whiskers (dashed line in blue) mark Q1-1.5 x IQR (interquartile range) and Q3+1.5 x IQR points, respectively. Points above and below those marks are considered outliers. The red line represents the median, and the red dot the mean of the entire distribution. The p-value assigned refers to a Wilcoxon-Mann-Whitney test, comparing both distributions. C. The ratio of APs by protein frequency ranges. The bars correspond to APs over total proteins found in each frequency range. Each bin represents an interval of 10%. The blue bar corresponds to the interval where the proportion of APs is considered significant (Binomial test, p-value<0.01), which p-value is highlighted. The dashed line represents the expected ratio of APs in the entire protein dataset (20.46%). Functional analysis We observed that aspartate-semialdehyde dehydrogenase and alcohol dehydrogenase, enzymes that catalyze the electrons transfer from the reductant to the oxidant (electron acceptor), were the APs associated with a greater number of metabolic pathways (6 metabolic pathways) (Figure 5). 351types of oxidoreductase can be found in the DEG, a database about essential GENES, demonstrating the essentiality of this AP class. In this context, 132 oxidoreductase types are present in bacterias, 9 oxidoreductase types are expressed in archaea, and 210 oxidoreductase types are expressed in eukaryotes. Additionally, we could observe that the steroid biosynthesis is the metabolic pathway with the highest number of articulation points with a frequency higher than 80% (16 APs). Moreover, Isoquinoline biosynthesis and carotenoid biosynthesis are the metabolic pathways with the highest number of articulation points with a frequency lower than 30% (27 APs). Besides, we evaluated the functions and pathways of the APs associated with the greatest number of pathways. Amino acid biosynthesis is the main function related to these APs (Supporting information S7). Cysteine and methionine metabolism, as well as glycine, serine, and threonine metabolism, are the main pathways related to these APs (Supporting information S7). 8 Figure 5. Distribution of APs with frequencies >=80% and <30% through metabolic pathways. A. Top 10 metabolic pathways with the greatest number of APs with frequency <30% B. Top 10 APs with frequency <30% present in the highest number of metabolic pathways. C. Top 10 metabolic pathways with the greatest number of APs with frequency >=80%. D. Top 10 APs with frequency >=80% present in the highest number of metabolic pathways. E. Top 10 metabolic pathways with the greatest number of 9 APs with frequency >=80% and betweenness higher than the median. F. Top 10 APs with frequency >=80% present in the highest number of metabolic pathways and betweenness higher than the median. Network Visualization We provided dynamic HTML visualizations available at https://igorabrandao.com.br/kegg-pathway-bottlen eck, which allow the observation of several metabolic pathways. It is possible to adjust the network size and position, as well as to select proteins according to their classification, making it possible to highlight protein groups of interest. Finally, the AP remotion can disconnect the network and the proteins metrics (impact, degree, and betweenness) can be easily viewed. Discussion The present study aimed to investigate the APs for all metabolic pathways available in the KEGG database. The remotion of one of these APs will cause the network rupture since it is a central point that connects various network complexes and/or peripheral regions (20). Therefore we can suggest that APs might be essential proteins to the pathways they belong to. Complementary to the network topological analysis, it is important to understand the profile of proteins classified as AP. We found a considerable concentration of APs in the protein groups with higher frequencies. We verified that APs tend to be high-frequency proteins, suggesting that these proteins may be more essential to the metabolic pathway when compared with proteins with non-APs. However, we identified two groups of APs; a group with high frequency and high betweenness and a group with low frequency and low betweenness. In practice, this implies that APs with low betweenness disconnect peripheral regions from the network, resulting in a disbalance between the components of the remaining network, while the APs with high betweenness are probably located in central regions. A study demonstrated that an important node in a network can be calculated based on the betweenness centrality metric plus its degrees (16). Other studies suggest performing the identification and removal of APs to provide a new perspective on the organizational principles of complex networks (21). We observed that, in the metabolic networks evaluated here, the APs could be important since they play a key role in ensuring the robustness and connectivity of these networks. Studies have shown that network integrity can be disrupted, contributing to serious impairments in disorders such as depression, psychotic disorders, motor diseases, or even lethal mutations enriched in the group of highly connected proteins that are APs (22). Investigations into the integration and coordination of large-scale functional networks subserving various conditions must be explored. For example, assessing non-static functional connectivity within whole-brain networks can be important to unravel some mysteries in the central nervous system. Studies assessing network integrity in the central nervous system were developed. However, it is also crucial to identify essential proteins in these networks (23-25). Another study related to inflammatory bowel disease (IBD) found that there are seven hub-bottleneck proteins (which can be understood as APs) in the IBD network responsible to maintain the network integrity. Network evaluation and complex analysis of IBD essential proteins are important to provide a new glance of the disease. Proteins are in a complex interactome organization that any small changes in each individual may lead to dysfunction of the whole system (26). We performed a functional analysis to determine the main metabolic pathways regarding the APs using KEGG ECs. A study also used KEGG to perform functional analyses and studies used Gene Ontology to determine the biological process and molecular functions of proteins related to cancer (27, 28). However, it is essential to highlight that no other study evaluated the main metabolic pathways associated with all APs. Most APs with 10 the greatest number of associated pathways are oxidoreductases (AP classes) and the synthesis of steroids was the metabolic pathway with the highest number of APs with a frequency higher than 80%. In fact, on the Steroid hormone biosynthesis pathway 00140, 14 of 16 APs are oxidoreductases. Oxidoreductases participate in the metabolism of some ethers. A 5α-reductase (5α-R) and 3α- or 3β-hydroxysteroid oxidoreductase (HSOR), is important to the generation of reduced metabolites of progesterone and testosterone, acting on several steroid and neurotransmitter receptors and regulating the neural function and behavior. Progesterone is metabolized into dihydro-progesterone by the action of 5α-R and subsequently into 3α, 5α-tetrahydro-progesterone, or into 3β, 5α-tetrahydro-progesterone, by the action of 3α- or 3β-HSOR, respectively. Besides, testosterone is converted into dihydrotestosterone and subsequently into 5α-androstane-3α,17β-diol or 5α-androstane-3β,17β-diol. The enzymatic reaction performed by 5α-R is the rate-limiting step during the 3α,5αreduced synthesis. The Oxidoreductase can also convert corticosterone and deoxycorticosterone into dihydro-corticosteroids and dihydro-deoxycorticosterone, respectively, and this latter metabolite can be converted into 3α,5α-tetrahydro-desoxycorticosterone (29). Finally, to simplify the understanding of the topology of the metabolic pathway, our work proposes the visualization of these pathways as dynamic networks. Network visualization is an importantaspect of this work since the dynamic visualization provides the possibility to explore network topological features graphically, providing rich details, facilitating the network understanding, and extracting relevant biological information (12). In summary, we suggested that bottlenecks are articulation points with the highest frequencies and located in the center of the network. New studies will be required to establish more relationships between these key proteins (APs) and topological and biological attributes of the metabolic pathways. Additional studies with metabolomics, which is a powerful technology that allows for the assessment of global metabolic profiles, can be used to distinguish between diseased and non-diseased status information (30). Another future possibility is the usage of machine learning (ML) algorithms to create a predictive model to identify potential essential proteins in various metabolic pathways. Finally, knockout studies can be applied to experimentally evaluate whether specific APs are essential for a given metabolic pathway. For example, recently a receptor involved in memory (NOP receptor) has been experimentally associated with metabolic pathways involved in depression; it remains to assess this receptor essentiality on the path through bioinformatics tools. Experimental procedures Extraction and processing of metabolic pathway data Initially, the list of all species available in KEGG was generated along with the list of metabolic pathways for each species, and they were the reference for the selection of metabolic pathways used in the study. Each pathway has an associated KGML (KEGG XML) file that contains the enzymes and reactions data necessary to generate the graph object. It is important to note that in some KEGG metabolic pathways there are different nodes with the same enzyme commission number (EC). Since it was not possible to identify these nodes uniquely via the EC or KEGG identifier, we needed to combine the following node attributes to differentiate them from each other: EC, reaction code, and KEGG diagram (x, y) coordinates. We used the R/Bioconductor KEGGREST package (31) to load the reference pathway of interest from the REST API provided by KEGG. We used 153 metabolic pathways of all species available on KEGG (totalizing more than 600.000 11 datasets). KEGG data was parsed using a function adapted from the KEGGREST package. After that, compounds and interconnections between metabolic pathways were removed to study each isolated pathway. Graph properties, such as communities, betweenness centrality, and degree, were extracted using the iGraph R/Bioconductor package (32). APs detection AP detection was performed according to Tarjan's algorithm. Briefly, APs detection relies on the depth search algorithms (DFS) for graphs, which consists of a technique to visit each node in a given graph until its deepest level. Each node reached is placed on a stack and a record is kept of the lowest node on the stack to which it is connected by a path of unstacked nodes. When a new node cannot be reached from the top of the stack, the top node is deleted and the search is continued from the next node on the stack. If the top node of the stack does not connect to a node lower than the second node on the stack, then this second point is an AP of the graph. All edges examined during the search are placed on another stack. Therefore, when an AP is found, the edges of the corresponding biconnected component may be retrieved and placed in an output array (33). Protein frequency evaluation To evaluate the protein frequency for each metabolic pathway all the organisms that contained it were identified, and the occurrence of each node of the canonical route was verified taking into account the number of these organisms. The protein total frequency is determined by the number of times this protein appears in the various organisms pathways divided by the total organisms that possess the given pathway. The protein occurrence counting is performed via color attribute evaluation in the organism KGML file associated with a pathway, more specifically into its graphic section. Since KEGG annotates the color attributes to display the presence or absence of a given protein in a pathway map, we used it to perform the protein counting. In the organism pathways, nodes with bgcolor attribute equal to #BFFFBF indicate its presence, while #FFFFFF indicates its absence. Also, we generated a dictionary to match the organism nodes with a corresponding EC number relative to the canonical route, since in KEGG, the organism nodes are represented as ENTREZ code. This dictionary contains the attributes pathway code, EC number, reaction code, protein x-axis, and y-axis parsed from the KGML file. Via the combination of (reaction code + protein x-axis + y-axis) nodes attributes it's possible to match the EC number with ENTREZ code and generate the dictionary ID to represent this relation. We normalized the protein frequency according to the total species containing a certain metabolic pathway since the studied pathways showed different numbers of organisms. Also, we normalized the proteins according to the maximum and minimum frequencies of proteins related to a given metabolic pathway. The highest normalized value was considered 1, the lowest value was considered 0, and others values were within this scale from 0 to 1. Statistical analysis We assembled a dataset containing the protein information, including its frequency, from the evaluated pathways using the R language, and the nodes with a frequency of 0% were removed to avoid possible bias. To test the AP enrichment, we applied the hypergeometric test using the phyper built-in R function with lower.tail parameter set as false. Also, we sorted the dataset by the protein frequency in descending order to verify the APs concentration in groups of higher frequencies. The hypergeometric test was applied N times, with N equal to the number of proteins in the dataset. Each hypergeometric instance evaluates the current and the previous proteins in the list in a cumulative way. A p-value < 0.01 was considered significant, showing that a frequency range is enriched with AP proteins. 12 Complementary, we conducted a binomial test (p-value < 0.01 ) grouping the proteins in ranges of 10% to confirm which range is the most enriched with APs. Since we observed an AP concentration higher than expected in proteins with lower frequencies, we divided the APs into the groups with higher (>=80%) and lower (<30%) frequencies. To establish differences between these groups, we performed comparative analyses using the Wilcoxon-Mann-Whitney test (p-value < 0.01) over the APs topological features distribution of each group, for example, the betweenness centrality metric. Functional analysis To evaluate the main metabolic pathways related to the APs, we performed descriptive statistics toidentify the APs role with higher (>= 80%) and lower (<30%) frequency. The protein information was obtained via the KEGG dbget system plus the protein EC number, for example, the Aspartate-semialdehyde dehydrogenase with EC number 1.2.1.11 can be found at https://www.kegg.jp/dbget-bin/www_bget?1.2.1.11 . Into the dbget page, it is possible to retrieve the protein name, class, and related pathways. Also, we searched the protein function in the Uniprot database. Additionally, we retrieved the AP proteins correspondence into the Database of Essential Genes (DEG) using their classes to confirm their essentiality, since DEG holds information of essential genes that are indispensable for the survival of an organism. Metabolic pathways visualization Since a metabolic pathway can be considered a mass flow, we represent it by directed graphs where nodes are the enzymes and edges are the chemical reactions. Also, we provided a graphical visualization of the pathways displaying the main characteristics from its nodes such as AP status (border in blue), betweenness centrality (background color), frequency (node size), protein classification (label), and bridge edges (red arrowed line) using the visNetwork R package (34). Data availability: The scripts and datasets presented in this paper have all been deposited in the Github repository with the following link: https://github.com/igorabrandao/kegg-pathway-bo ttleneck. Acknowledgments We acknowledge KEGG databases for providing public access to their data. We also would like to thank NPAD/UFRN for computational resources. Author contributions R.J.S.D. conceived the project. I.A.B., C.F.R., D.A.A.M., and L.R.S.C. gathered and curated the data. I.A.B., C.F.R., D.A.A.M., and L.R.S.C performed the analysis. I.A.B. and A.B.C. organized Supplementary Material. I.A.B., A.B.C., and R.J.S.D. wrote the manuscript. All authors revised and approved the manuscript. Funding and additional information The scholarships were financed by governmental Brazilian agency Coordination for the Improvement of Higher Education Personnel (CAPES– Portuguese: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior), National Council of Technological and Scientific Development (CNPq) and PROPESQ-UFRN. The fellowship was supported by the National Council of Technological and Scientific Development (CNPq) (308258/2018-5). 13 Conflict of interest: The authors declare that they have no conflicts of interest with the contents of this article. References 1. Chatterjee, R., & Yuan, L. (2006) Directed evolution of metabolic pathways. TRENDS in Biotechnology, 24(1), 28-38. 2. Dalmolin, R. J., Castro, M. A., Rybarczyk Filho, J. L., Souza, L. H., de Almeida, R. M., & Moreira, J. C. (2011) Evolutionary plasticity determination by orthologous groups distribution. Biology Direct; 6(1), 1-18. 3. Paludo, G. P., Lorenzatto, K. R., Bonatto, D., & Ferreira, H. B. (2015) Systems biology approach reveals possible evolutionarily conserved moonlighting functions for enolase. Computational biology and chemistry, 58, 1-8. 4. Maniadi EM, Tollis IG. (2014) Analysis and visualization of metabolic pathways and networks: A hypergraph approach. 1st Ed., Spain, Valencia, 109-112. 5. Barabasi AL, Oltvai ZN. (2004) Network biology: understanding the cell's functional organization. Nature reviews genetics; 5(2), 101. 6. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási AL. (2002) Hierarchical organization of modularity in metabolic networks. Science, 297(5586), 1551-1555. 7. Ma HW, Zeng AP. (2003) The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics, 19(11), 1423-1430. 8. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. (2016) KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic acids research, 45(1), 353-361. 9. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL. (2000) The large-scale organization of metabolic networks. Nature, 407(6804), 651. 10. Wagner A, Fell DA. (2001) The small world inside large metabolic networks. Proceedings of the Royal Society of London, 268(1478), 1803-1810. 11. Ma HW, Zeng AP. (2003) The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics, 19(11), 1423-1430. 12. Beguerisse-Díaz M, Bosque G, Oyarzún D, Picó J, Barahona M. (2018) Flux-dependent graphs for metabolic networks. NPJ systems biology and applications, 4(1), 32. 13. Jeong H, Mason SP, Barabási AL, Oltvai ZN. (2001) Lethality and centrality in protein networks. Nature, 411(6833), 41. 14. Mahadevan R, Palsson BO. (2005) Properties of metabolic networks: structure versus function. Biophysical jornal, 88(1), 7-9. 15. Ferreira RM, Rybarczyk-Filho JL, Dalmolin RJ, Castro MA, Moreira JC, Brunnet LG, de Almeida RM. (2013) Preferential duplication of intermodular hub genes: an evolutionary signature in eukaryotes genome networks. PloS one, 8(2). 16. Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M. (2007) The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS computational biology, 3(4), e59. 17. Ausiello G, Firmani D, Laura AL. (2012) Real-time monitoring of undirected networks: Articulation points, bridges, and connected and biconnected components. Networks, 59(3), 275-288. 14 18. Behzad M, Chartrand G. (1972) Introduction to the Theory of Graphs. 6st Ed., Allyn and Bacon, EUA, Boston. 19. Harary F. (1996) Graph theory. 6st Ed., Addison-Wesley, England, London. 20. Farina G. (2015) A linear time algorithm to compute the impact of all the articulation points, 1, 1-4. 21. Tian L, Bashan A, Shi D-N, Liu Y-Y. (2017) Articulation points in complex networks. Nat Commun, 8(1), 14223. 22. Pržulj, N., Wigle, D. A., & Jurisica, I. (2004) Functional topology in a network of protein interactions. Bioinformatics, 20(3), 340-348. 23. Ge R, Downar J, Blumberger DM, Daskalakis ZJ, Lam RW, Vila-Rodriguez F. (2019) Structural network integrity of the central executive network is associated with the therapeutic effect of rTMS in treatment resistant depression. Progress in Neuro-Psychopharmacology and Biological Psychiatry, 92, 217-225. 24. Larivière S, Ward NS, Boudrias MH. (2018) Disrupted functional network integrity and flexibility after stroke: Relation to motor impairments. NeuroImage: Clinical, 19, 883-891. 25. Sheffield JM, Kandala S, Tamminga CA, Pearlson GD, Keshavan MS, Sweeney JA et al. (2017) Transdiagnostic associations between functional brain network integrity and cognition. JAMA psychiatry, 74(6), 605-613. 26. Asadzadeh-Aghdaee H, Shahrokh S, Norouzinia M, Hosseini M, Keramatinia A, Naghibzadeh B et al. (2016) Introduction of inflammatory bowel disease biomarkers panel using protein-protein interaction (PPI) network analysis, 9(1), 8-13. 27. Barabási AL, Gulbahce N, Loscalzo J. (2011) Network medicine: a network-based approach to human disease. Nature reviews genetics, 12(1), 56-68. 28. Ritz A, Poirel CL, Tegge AN, Sharp N, Simmons K, Powell AM et al. (2016) Pathways on demand: automatedreconstruction of human signaling networks. NPJ systems biology and applications, 2, 16002. 29. Supper J, Spangenberg L, Planatscher H, Dräger A, Schröder A, Zell A. (2009) BowTieBuilder: modeling signal transduction pathways. BMC Syst Biol, 3(1), 67. 30. Khan FM, Gupta SK, Wolkenhauer O. (2018) Integrative workflows for network analysis. Essays in Biochemistry, 62(4), 549–61. 31. Tenenbaum D. (2016) KEGGREST: Client-side REST access to KEGG. R package version, 1(1). 32. Csardi G, Nepusz T. (2006) The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5): 1-9. 33. Hopcroft J, Tarjan R. (1973) Algorithm 447: efficient algorithms for graph manipulation. Communications of the ACM, 16(6), 372-378. 34. Almende BV, Thieurmel B Robert T. (2016) visNetwork: Network Visualization using ‘vis. js’ Library, R package version, 1(1). Supporting Information Supporting information S1. Pathway 00402 - Benzoxazinoid biosynthesis. A. Dynamic visualization generated with 8 nodes and 7 edges. B. KEGG metabolic pathway diagram. 15 16 Supporting information S2. Classification and quantity of organisms used in the study. A. Prokaryotes; B. Eukaryotes. 17 Supporting information S3. Visualization of pathway 0010 (glycolysis and gluconeogenesis) in the KEGG database. KEGG presents information about the metabolic pathway as a sequential block diagram. In which colored or white boxes represent proteins. Colored boxes are proteins present in the metabolic pathway, while whites are absent proteins in these pathways. Source: https://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa& mapno=00010. 18 Supporting information S4. Pathway 00232 - Caffeine metabolism. A. Dynamic visualization with 26 nodes and 76 edges. B. KEGG metabolic pathway diagram. 19 Supporting information S5. Pathway 00061 - Fatty acid biosynthesis. A. Dynamic visualization with 170 nodes and 1063 edges. B. KEGG metabolic pathway diagram. 20 21 Supporting information S6. A. APs with high betweenness and frequency of pathway 00010 - glycolysis and gluconeogenesis. I) AP ec:4.1.2.13 with betweenness of 0.0255 and frequency of 96.98%. II) AP ec:4.2.1.11 with betweenness of 0.0490 and frequency of 98.26%. III) AP ec:2.7.1.40 with betweenness of 0.0282 and frequency of 91.73%. B. APs with low betweenness and frequency of pathway 00640 - propanoate metabolism. IV) AP ec:1.1.1.80 with betweenness of 0.0023 and frequency of 2.23%. V) AP ec:4.1.1.4 with betweenness of 0.0034 and frequency of 7.69%. VI) AP ec:2.8.3.8 with betweenness of 0.0042 and frequency of 9.79%. VII) AP ec:1.1.1.77 with betweenness of 0.0017 and frequency of 6.89%. 22 23 Supporting information S7. Functions and metabolic pathways related to the APs present in a higher number of metabolic pathways. The metabolic pathways were obtained from KEGG, the APs functions were obtained from Uniprot. AP name and class Metabolic Pathways Functions in prokaryotes ● EC 1.2.1.11 - Aspartate-semialdeh yde dehydrogenase ● Oxidoreductase 1. Glycine, serine and threonine metabolism 2. Cysteine and methionine metabolism 3. Lysine biosynthesis Biosynthesis of amino acids in prokaryotes, fungi, and some higher plants. It forms an early branch point in the metabolic pathway forming lysine, methionine, leucine, and isoleucine from aspartate ● EC 2.2.1.6 - Acetolactate synthase ● Transferases 1. Valine, leucine and isoleucine biosynthesis 2. Butanoate metabolism 3. C5-Branched dibasic acid metabolism Catalyzes the first step in the synthesis of the branched-chain amino acids in plants, animals, and microorganisms, transferring aldehyde or ketone residues. ● EC 1.1.1.1 - Alcohol dehydrogenase ● Oxidoreductase 1. Chloroalkane and chloroalkane degradation 2. Metabolism of xenobiotics by cytochrome P450 3. Tyrosine metabolism Facilitate the interconversion between alcohols and aldehydes or ketones with the reduction of nicotinamide adenine dinucleotide (NAD+) to NADH. In yeast, plants, and many bacteria, some alcohol dehydrogenases catalyze the opposite reaction as part of fermentation to ensure a constant supply of NAD+. In humans and animals, they serve to break down alcohols that otherwise are toxic, and they also participate in the generation of useful aldehyde, ketone, or alcohol groups during biosynthesis of various metabolites. ● EC 6.3.2.9 - D-glutamate ligase ● Ligases 1. D-Glutamine and D-glutamate metabolism 2. Peptidoglycan biosynthesis Form carbon-nitrogen bonds as acid-D-amino-acid ligases (peptide synthases). ● EC 4.1.3.27 - Anthranilate synthase ● Lyases 1. Phenylalanine, tyrosine and tryptophan biosynthesis 2. Phenazine biosynthesis Catalyzes the two-step biosynthesis of anthranilate, an intermediate in the biosynthesis of L-tryptophan. In the first step, the glutamine-binding beta subunit (TrpG) of anthranilate synthase (AS) provides the glutamine amidotransferase activity which generates ammonia as a substrate that, along with chorismate, is used in the second step, catalyzed by the large alpha subunit of AS (TrpE) to produce anthranilate. In the absence of TrpG, TrpE can synthesize anthranilate directly from chorismate and high concentrations of ammonia. ● EC 1.1.1.3 - Homoserine dehydrogenase ● Oxidoreductase 1. Glycine, serine and threonine metabolism 2. Cysteine and methionine metabolism Catalyzes the third step in the aspartate pathway; the NAD(P)-dependent reduction of aspartate beta-semialdehyde into homoserine. Homoserine is an intermediate in the biosynthesis of threonine, isoleucine, and methionine. In photosynthetic organisms, glutamine, glutamate, and aspartate accumulate during the day and are used to synthesize other amino acids. At night, aspartate is converted to asparagine for storage. Mammals lack the enzymes involved in the aspartate metabolic pathway, including homoserine dehydrogenase. As lysine, threonine, 24 methionine, and isoleucine are made in this pathway, they are considered essential amino acids for mammals. ● EC 4.2.1.2 - Fumarate hydratase ● Lyases 1. Citrate cycle (TCA cycle) 2. Pyruvate metabolism Catalyzes the reversible hydration/dehydration of fumarate to malate. Fumarase comes in two forms: mitochondrial and cytosolic. The mitochondrial isoenzyme is involved in the Krebs Cycle and the cytosolic isoenzyme is involved in the metabolism of amino acids and fumarate. ● EC 1.14.14.91 - Cinnamate 4-monooxygenase ● Oxidoreductases 1. Flavonoid biosynthesis 2. Stilbenoid, diarylheptanoid and gingerol biosynthesis Acts on paired donors, with O2 as oxidant and incorporation or reduction of oxygen. The oxygen incorporated need not be derived from O2 with NADH or NADPH as one donor, and incorporation of one oxygen atom into the other donor. ● EC 1.1.1.100 - Beta-ketoacyl reductase ● Oxidoreductases 1. Fatty acid biosynthesis 2. Biotin metabolism Catalyzes the reaction of conversion of 3-hydroxyacyl-[acyl-carrier-protein] to 3-oxoacyl-[acyl-carrier-protein] using NADPH as a reducing agent. Acts on the CH-OH group as hydride donors with NAD+ or NADP+ as hydride acceptors.