In her book, The Invention of Nature, Andrea Wulf quotes the great naturalist and explorer Alexander von Humboldt as describing the world “as a living whole, not a dead aggregate.”
Nevertheless, the “dead aggregate” remained thrilling: in 2000, 10 years after the advent of the Human Genome Project, President Bill Clinton declared that the project was poised to “revolutionize the diagnosis, prevention, and treatment of most, if not all, human diseases.”
The Human Genome Project was merely a harbinger, and the era of big data in biology is just beginning. Price tags are falling fast: in 2023, Illumina’s NovaSeq X boasts a cost of $200 per genome, which – even excluding the cost of sample acquisition and preparation – is a minute fraction of the HGP’s cost, just 20 years after that project was completed. As a result, the largest repository of genetic sequences, GenBank, contains sequences from 504,000 formally described species as of 2023.
Attempts to read a human’s past, present, and future from their As, Ts, Cs, and Gs have stalled and diverged into a variety of broad assays. Researchers have found significant benefits in different profiling techniques, generally arranged along the Central Dogma, the model by which biology is best understood: DNA is transcribed into RNA, which is translated into proteins that form the functional basis for life. Assays have stratified to include growing sectors in transcriptomics, proteomics, and metabolomics – corresponding to the biochemical entities of RNA, proteins, and metabolites – each a market estimated in the billions of dollars and growing fast. Given that the Central Dogma is an oversimplification of biological processes, and that more and more assays measuring things like the microbiome and the epigenome are surfacing every day, the potential for life sciences data generation seems unlimited.
What’s Driving Momentum?
Companies continue to commit billions of dollars and millions of hours to omics data collection for three major reasons. First, researchers have substantially more assets at their disposal, especially given the now-understood limitations of genomics. Large datasets can be used to improve assets like biomarkers and directly benefit patients while generating intellectual property. Second, the nature of these big data experiments has fundamentally reconfigured the hypothesis-driven research landscape. Instead of experiments operating to reject or confirm a hypothesis, these big datasets function as laboratories, leading to a variety of hypothesis-generating experiments. Third, the advent of artificial intelligence is enabling novel means to interrogate the data. The petabytes of data stored in individual databases can now be classified, scoured, and corroborated in innumerable ways.
The optimistic single-omic field has been exposed as a Pollyanna. Instead, researchers have moved towards a more comprehensive understanding of these complex systems using a combination of multiple omics in the expanding and intuitively named field of “multiomics”. In some cases, this simply means analyzing two datasets in tandem; for example, do the same pathways appear enriched in both the metabolomic and the proteomic data? In more complicated studies, researchers can use advanced statistical methods like neural networks or graph learning to integrate these datasets.
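The simplest tandem analysis described above can be sketched in a few lines. This is a minimal illustration rather than a published method: the pathway names, the background count, and the helper `overlap_p_value` are all hypothetical, and the sketch assumes each omics layer has already produced its own list of enriched pathways from the same pathway database. It lists the pathways enriched in both layers and uses a hypergeometric tail probability to ask whether an overlap that large would be expected by chance.

```python
from math import comb


def overlap_p_value(n_background, n_a, n_b, n_shared):
    """Probability of sharing at least n_shared pathways when lists of size
    n_a and n_b are drawn at random from n_background pathways
    (upper tail of the hypergeometric distribution)."""
    total = comb(n_background, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_background - n_a, n_b - k)
        for k in range(n_shared, min(n_a, n_b) + 1)
    )
    return tail / total


# Hypothetical enrichment results from two omics layers (illustrative only).
metabolomic_hits = {"TCA cycle", "Glycolysis", "Fatty acid oxidation", "Urea cycle"}
proteomic_hits = {
    "TCA cycle", "Glycolysis", "Proteasome",
    "Fatty acid oxidation", "mTOR signaling",
}
n_background_pathways = 300  # assumed size of the shared pathway database

shared = sorted(metabolomic_hits & proteomic_hits)
p = overlap_p_value(
    n_background_pathways,
    len(metabolomic_hits),
    len(proteomic_hits),
    len(shared),
)

print("Enriched in both layers:", shared)
print(f"Chance of an overlap at least this large: {p:.2e}")
```

A small probability only says the two lists agree more than random lists would; it does not replace the per-layer enrichment statistics, and it is sensitive to the assumed background size.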
Omics All the Way Down
Multiomics evolved naturally from single-omic experiments, which in turn evolved from more specific assays. Biological assays – commercialized as biomarkers or research aids – are quietly a part of everyday life for many people. Many people with diabetes track their blood sugar using a continuous glucose monitor (CGM), while a metabolic blood panel – measuring key electrolytes, creatinine, glucose, and urea – can identify dysfunctions in key organs before they become catastrophic. Beyond monitoring, biomarkers can be useful for diagnosis, prognosis, susceptibility, and other functions.
Given the commercialization potential of these specific measurements, researchers are scrambling to find new ways to qualify function, dysfunction, or disease early and comprehensively. Improvements in cost and analytical capacity have enabled wider research strategies: what began as measuring one or a few metabolites, sequences, or proteins in a biospecimen has grown into many thousands of measurements. That same human genome that took over thirteen years to sequence can now be had in less than a week. Over 5,000 metabolites that might have taken months individually to elucidate can be delivered as high-quality matches in under five weeks from Metabolon. Measuring at scale has finally met industry’s tempo and cost, opening a vast array of new options for researchers.
Beyond the obvious “why not?”, these wider profiling techniques have multiple benefits. Much is unknown: for example, while those same CGMs are essential for Type 1 diabetics, other, less treatable diseases might require detection or qualification beyond monitoring. Research has demonstrated that combinations of biomarkers can be more precise than carbohydrate antigen profiling in ovarian and pancreatic cancers.
The New Laboratory
With over 63,000 known human genes, more than 100,000 postulated human proteins, and a nearly unlimited number of endogenous and exogenous human metabolites, this field is just getting started. Untargeted commercial transcriptomics assays typically provide 10,000-25,000 annotated sequences, and mass spectrometric (MS) assays for metabolomics and proteomics provide several thousand annotated entities. Notably, the process of turning “raw” analytical data into quality biochemical annotations is complex and continuously improving as researchers better understand the data that are collected. The underlying data, now required by many publications and databases in addition to the final annotated version, can provide ongoing insights as processing techniques mature. These ongoing innovations have created a new type of experimentation: massive data collection, which provides a new laboratory in which to perform hypothesis-directed research.
Public repositories and biobanks are proving to be a crucial instrument for companies of any maturity and size. For startups, having researchers who are familiar with accessing and analyzing repositories like GenBank, the Gene Expression Omnibus (GEO), and the Metabolomics Workbench is essential to providing initial insights to direct expensive lab operations. Large companies fund and leverage these datasets to support crucial operations like de-risking clinical trials.
As well engineered as these public repositories are, understanding the provenance, analytical settings, and processing steps is crucial to obtaining robust biological insights. These public datasets are prone to the “tragedy of the commons,” where they can be clouded by incorrect metadata, poor run quality, or unsuitable data pipelines. Metabolomics data from mass spectrometry, for example, can depend on a variety of analytical factors like instrument settings and platform, and inter-laboratory variability is common.
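One common guard against poor run quality is to check how reproducibly each feature was measured across repeated injections of a pooled quality-control (QC) sample before trusting a downloaded dataset. The sketch below is illustrative only: the feature names, the intensities, the 30% cutoff, and the helper names `qc_cv_percent` and `split_by_cv` are assumptions, not values or functions from any repository or standard. Real pipelines would add further checks on metadata, batch effects, and processing history.

```python
from statistics import mean, stdev


def qc_cv_percent(intensities):
    """Coefficient of variation (%) of one feature across pooled-QC injections."""
    return 100.0 * stdev(intensities) / mean(intensities)


def split_by_cv(qc_table, max_cv=30.0):
    """Split features into (kept, flagged) using their CV across QC injections.

    max_cv is an assumed cutoff; choose it to suit the platform and study.
    """
    kept, flagged = [], []
    for feature, values in qc_table.items():
        if qc_cv_percent(values) <= max_cv:
            kept.append(feature)
        else:
            flagged.append(feature)
    return kept, flagged


# Hypothetical pooled-QC intensities for three features (illustrative only).
qc_table = {
    "glucose": [1010.0, 985.0, 1002.0, 995.0],
    "lactate": [540.0, 560.0, 551.0, 548.0],
    "unknown_feature_17": [120.0, 310.0, 45.0, 260.0],
}

kept, flagged = split_by_cv(qc_table)
print("Reproducible features:", kept)
print("Flagged for review:", flagged)
```

Flagged features are not necessarily wrong, but they deserve a look at the run metadata before they feed into any downstream model.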
Biological Intelligence
Supposing that the data are high quality and well engineered, artificial intelligence is facilitating the next frontier of research. Data are relatively easy to generate, and there are established instruments and workflows to provide gigabytes of data in a few hours. The resulting files are where some of the greatest opportunities are found; the raw files contain untapped multitudes of information known as analytical “dark matter.”
Data from omics analytical platforms are not straightforward to analyze and are subject to a tortuous route of processing. Starting this process from scratch requires a significant degree of collaboration, expertise, and computational power, and has thus led to any number of intermediate, semi-raw datasets.
On the other end, biological insights can be combined in increasingly innovative ways to provide a more comprehensive picture. Classification, regression, and unsupervised learning algorithms are being increasingly applied to biological data, yielding a cottage industry of biomarker discovery engines. Models built on multiple layers of omics data at scale have consistently outperformed those built on a single layer.
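A simple way to combine layers is concatenation-based ("early") integration: scale each omics layer separately, join the features for each sample, and hand the combined table to a classifier. The sketch below shows only the mechanics, under loud assumptions: the measurements, labels, and function names are hypothetical, a nearest-centroid rule stands in for the far richer models used in practice, and for brevity the scaling includes the unlabeled sample. It makes no claim about performance.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every feature (column) of one omics layer to zero mean, unit variance."""
    scaled_columns = []
    for column in zip(*rows):
        mu, sd = mean(column), pstdev(column)
        scaled_columns.append([(v - mu) / sd if sd else 0.0 for v in column])
    return [list(sample) for sample in zip(*scaled_columns)]


def concatenate_layers(*layers):
    """Early integration: scale each layer, then join the features per sample."""
    scaled_layers = [zscore_columns(layer) for layer in layers]
    return [sum(per_sample, []) for per_sample in zip(*scaled_layers)]


def nearest_centroid(train_x, train_y, query):
    """Return the label whose mean feature vector lies closest to the query."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(column) for column in zip(*members)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], query))


# Hypothetical measurements: rows are samples, columns are features.
proteomics = [[5.1, 2.0], [4.9, 2.2], [7.8, 3.9], [8.1, 4.1], [7.6, 3.8]]
metabolomics = [[0.30, 12.0], [0.35, 11.5], [0.90, 20.1], [0.85, 19.5], [0.88, 19.9]]
labels = ["control", "control", "case", "case"]  # the fifth sample is unlabeled

combined = concatenate_layers(proteomics, metabolomics)
predicted = nearest_centroid(combined[:4], labels, combined[4])
print("Features per sample after integration:", len(combined[0]))
print("Predicted label for the unlabeled sample:", predicted)
```

Scaling each layer before joining keeps a layer with large raw values from dominating the distance calculation, which is one reason per-layer normalization usually precedes any integration step.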
Metabolomics
Metabolomics remains a relatively untapped and challenging opportunity for researchers. Where the genome exists as a static blueprint of the cell, the metabolome provides a counterpoint as the transient, interactive layer of biology. It is the closest layer of biology to the phenotype, presenting the final “decisions” of cells as they respond to internal and external signals.
Conclusions
Biological data at scale have long existed as a “dead aggregate.” While some interesting conclusions can be found in them, advances in the scale, sensitivity, and analysis of the assays that comprise multiomics are breathing new life into them. Accurately reflecting the complexity and interconnectedness of the biological systems that researchers study is already demonstrating consistent, quantitative benefit. As access to new processing tools and more curated data commons improves, the concurrent acceleration of biological insights will result in positive outcomes for patients, less risky drug development pipelines, and a more mechanistic understanding of the underlying biological processes. Metabolomics plays a key role as the interface between an individual’s phenotype and its environment, but it is a difficult field requiring substantial collaboration and expertise. Even so, multiomics is the new frontier for research and promises substantial breakthroughs in the years to come.
References
Wulf A. The Invention of Nature: Alexander von Humboldt’s New World. Alfred A. Knopf; 2015.
National Human Genome Research Institute. Draft of the Human Genome Sequence Announcement at the White House. Published online 2000.
Collins FS, McKusick VA. Implications of the Human Genome Project for Medical Science. JAMA. 2001;285(5):540-544. doi:10.1001/jama.285.5.540
Hall SS. Revolution Postponed: Why the Human Genome Project Has Been Disappointing. Sci Am. Published online October 2010.
Congressional Budget Office. Research and Development in the Pharmaceutical Industry. 2021.
Sayers EW, Cavanaugh M, Clark K, et al. GenBank 2023 update. Nucleic Acids Res. 2023;51(D1):D141-D144.
Sudlow C, Gallacher J, Allen N, et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS Med. 2015;12(3):e1001779. doi:10.1371/journal.pmed.1001779
Liao WW, Asri M, Ebler J, et al. A draft human pangenome reference. Nature. 2023;617(7960):312-324. doi:10.1038/s41586-023-05896-x
Nova One Advisor. Genomics Market Size to Hit USD 157.47 Billion by 2023. BioSpace.
Athieniti E, Spyrou GM. A guide to multi-omics data collection and integration for translational medicine. Comput Struct Biotechnol J. 2023;21:134-149. doi:10.1016/j.csbj.2022.11.050
Califf RM. Biomarker definitions and their applications. Exp Biol Med. 2018;243(3):213-221. doi:10.1177/1535370217750088
Bodaghi A, Fattahi N, Ramazani A. Biomarkers: Promising and valuable tools towards diagnosis, prognosis and treatment of Covid-19 and other diseases. Heliyon. 2023;9(2):e13323. doi:10.1016/j.heliyon.2023.e13323
PMI. Biomarker Market Size & Share to Exceed USD 288.5 Billion by 2034, at CAGR of 13.6%. “Growing Emphasis on Early Disease Detection”-By PMI. Yahoo! Finance.
Kang KN, Koh EY, Jang JY, Kim CW. Multiple biomarkers are more accurate than a combination of carbohydrate antigen 125 and human epididymis protein 4 for ovarian cancer screening. Obstet Gynecol Sci. 2022;65(4):346-354. doi:10.5468/ogs.22017
Kane LE, Mellotte GS, Mylod E, et al. Diagnostic Accuracy of Blood-based Biomarkers for Pancreatic Cancer: A Systematic Review and Meta-analysis. Cancer Research Communications. 2022;2(10):1229-1243. doi:10.1158/2767-9764.CRC-22-0190
Lee T, Zheng Jie Teng T, Shelat VG. Carbohydrate antigen 19-9 – tumor marker: Past, present, and future. World J Gastrointest Surg. 2020;12(12):468-490.
Berger AC, Garcia M, Hoffman JP, et al. Postresection CA 19-9 Predicts Overall Survival in Patients With Pancreatic Cancer Treated With Adjuvant Chemoradiation: A Prospective Validation by RTOG 9704. Journal of Clinical Oncology. 2008;26(36):5918-5922. doi:10.1200/JCO.2008.18.6288
Huerga I. Q&A: De-risking clinical trials with real-world data. Tempus.com.
Ivanisevic T, Sewduth RN. Multi-Omics Integration for the Design of Novel Therapies and the Identification of Novel Biomarkers. Proteomes. 2023;11(34).
Boer AC, Burgers LE, Mangnus L, et al. Using a reference when defining an abnormal MRI reduces false-positive MRI results—a longitudinal study in two cohorts at risk for rheumatoid arthritis. Rheumatology. 2017;56(10):1700-1706. doi:10.1093/rheumatology/kex235
Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018. doi:10.1038/sdata.2016.18
Lin Y, Caldwell GW, Li Y, Lang W, Masucci J. Inter-laboratory reproducibility of an untargeted metabolomics GC–MS assay for analysis of human plasma. Sci Rep. 2020;10(1):10918. doi:10.1038/s41598-020-67939-x
González-Domínguez R, González-Domínguez Á, Sayago A, Fernández-Recamales Á. Recommendations and Best Practices for Standardizing the Pre-Analytical Processing of Blood and Urine Samples in Metabolomics. Metabolites. 2020;10(6).
Sumner LW, Amberg A, Barrett D, et al. Proposed minimum reporting standards for chemical analysis. Metabolomics. 2007;3(3):211-221. doi:10.1007/s11306-007-0082-2
Ross JL. The Dark Matter of Biology. Biophysical Perspective. 2016;111(5):909-916.
Krassowski M, Das V, Sahu SK, Misra BB. State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing. Front Genet. 2020;11.
Karaman ED, Işık Z. Multi-Omics Data Analysis Identifies Prognostic Biomarkers across Cancers. Medical Sciences. 2023;11(3).
Wörheide MA, Krumsiek J, Kastenmüller G, Arnold M. Multi-omics integration in biomedical research – A metabolomics-centric review. Anal Chim Acta. 2021;1141:144-162. doi:10.1016/j.aca.2020.10.038
Wang T, Shao W, Huang Z, et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun. 2021;12(1):3445. doi:10.1038/s41467-021-23774-w
Rattray NJW, Deziel NC, Wallach JD, et al. Beyond genomics: understanding exposotypes through metabolomics. Hum Genomics. 2018;12(1):4. doi:10.1186/s40246-018-0134-x
Bauermeister A, Mannochio-Russo H, Costa-Lotufo L V, Jarmusch AK, Dorrestein PC. Mass spectrometry-based metabolomics in microbiome investigations. Nat Rev Microbiol. 2022;20(3):143-160. doi:10.1038/s41579-021-00621-9
Fenton SE, Ducatman A, Boobis A, et al. Per- and Polyfluoroalkyl Substance Toxicity and Human Health Review: Current State of Knowledge and Strategies for Informing Future Research. Environ Toxicol Chem. 2021;40(3):606-630. doi:10.1002/etc.4890
Adjoian TK, Firestone MJ, Eisenhower D, Yi SS. Validation of self-rated overall diet quality by Healthy Eating Index-2010 score among New York City adults, 2013. Prev Med Rep. 2016;3:127-131. doi:10.1016/j.pmedr.2016.01.001