Charting a New Atlas of “Life’s Dark Matter”: CryoSeek Database Launches, Pioneering a Structure-Led Paradigm for Biological Discovery
2025-11-21

Traditional structural biology has long adhered to the classic "function-to-structure" paradigm: researchers typically begin by identifying a target molecule through genetic, cell biological, and biochemical approaches, and then resolve its three-dimensional structure to elucidate its biological function. Although artificial-intelligence technologies, exemplified by AlphaFold, have advanced the ability to perform "sequence-to-structure prediction followed by reverse inference of function," existing approaches remain inadequate for deciphering the extensive "dark matter of life" that is not directly encoded by genes or has yet to be sequenced, such as complex glycans and lipids.

Recent rapid advances in AI are shifting scientific inquiry from traditional reductionist thinking toward an "omics-centered" systematic research approach. In structural biology, with the advent of disruptive technologies such as AlphaFold, emerging strategies, ranging from "structural docking-omics" and "structure–ligand-ome" analyses to high-throughput protein design and screening, are profoundly reshaping the field. Yet these efforts remain heavily dependent on macromolecular structural data accumulated over the past several decades. Critically, the experimental technologies capable of high-throughput determination of new structures, the very source needed to fill this data gap, are still largely absent.

To break through this conceptual boundary, Nieng Yan's team proposed CryoSeek, a strategy that transforms cryo-EM from a traditional structure-determination tool into a “discovery engine” for unknown biomacromolecules. Without relying on prior sequence or functional information, CryoSeek enables direct high-resolution structural determination of unknown biomolecules from natural waters, environmental samples, and even clinical specimens. Since introducing the strategy in 2024, the team has used CryoSeek to uncover and resolve multiple novel biological filaments, including TLP-1 and TLP-4, in samples from natural water bodies such as the lotus pond at Tsinghua University. Notably, most of these structures are glycoprotein filaments, or even pure glycan assemblies, that were previously difficult to study systematically. For example, TLP-4b, informally dubbed the “8-nm luosifen rice noodle,” contains up to 95% carbohydrate, challenging established assumptions about macromolecular assembly.

In their recent work, Professor Nieng Yan's group collaborated with Professor Mingxu Hu's team to develop a high-throughput CryoSeek structural determination system, significantly improving the efficiency of structural determination and providing initial validation of the strategy's technical feasibility for large-scale application. As the CryoSeek strategy is being increasingly applied to diverse environmental samples, both the volume and complexity of newly resolved structures are growing rapidly, overwhelming traditional data management and analysis methods. To promote data sharing and scientific collaboration, strengthen China's leadership in research data governance, and turn CryoSeek into a global public tool in service of life-science research, Yan's and Hu's teams jointly established the CryoSeek Database at SMART.  

Figure 1: Homepage of the CryoSeek Data Platform

The CryoSeek Data Platform comprises three interconnected sub-databases: (1) the Structure Database, (2) the Cryo-EM Density Map Database, and (3) the Raw Micrograph Database, with clear relational links established between them. 
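The relational links between the three sub-databases can be pictured with a small data model. The following Python sketch is purely illustrative and does not reflect the platform's actual schema or API: the class names, field names, and identifiers (`Catalog`, `trace_provenance`, `STR-1`, and so on) are all hypothetical, and it assumes the simplest possible case in which each structure derives from one density map and each map from one set of raw micrographs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class MicrographSet:
    """Entry in the raw micrograph sub-database (hypothetical fields)."""
    set_id: str
    sample_source: str


@dataclass(frozen=True)
class DensityMap:
    """Entry in the density map sub-database, linked to its raw micrographs."""
    map_id: str
    micrograph_set_id: str


@dataclass(frozen=True)
class Structure:
    """Entry in the structure sub-database, linked to its density map."""
    structure_id: str
    name: str
    map_id: str


@dataclass
class Catalog:
    """In-memory stand-in for the three linked sub-databases."""
    micrographs: Dict[str, MicrographSet] = field(default_factory=dict)
    maps: Dict[str, DensityMap] = field(default_factory=dict)
    structures: Dict[str, Structure] = field(default_factory=dict)

    def trace_provenance(self, structure_id: str) -> List[str]:
        """Follow the links structure -> density map -> raw micrographs."""
        structure = self.structures[structure_id]
        density_map = self.maps[structure.map_id]
        micrographs = self.micrographs[density_map.micrograph_set_id]
        return [structure.structure_id, density_map.map_id, micrographs.set_id]


# Placeholder identifiers, invented for illustration only.
catalog = Catalog()
catalog.micrographs["MIC-1"] = MicrographSet("MIC-1", "pond water sample")
catalog.maps["MAP-1"] = DensityMap("MAP-1", micrograph_set_id="MIC-1")
catalog.structures["STR-1"] = Structure("STR-1", "example filament", map_id="MAP-1")

print(catalog.trace_provenance("STR-1"))
```

Storing only the identifier of the upstream entry on each record keeps the three collections independent while still letting a reader walk from a model back to the raw data it was built from. A real deployment would likely need one-to-many links as well.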

It provides all functions expected of a traditional scientific database, including data browsing, searching, uploading, and downloading. From the homepage, users can access all currently released datasets through each sub-database (Figures 2, 3, and 4).

Figure 2: CryoSeek Structure Database

Figure 3: CryoSeek Electron Microscopy Density Map Database

Figure 4: CryoSeek Raw Electron Microscopy Image Database

For each data entry, users can access the corresponding information with a single click. Unlike traditional databases such as the Protein Data Bank (PDB) and the Electron Microscopy Data Bank (EMDB), the CryoSeek platform internally integrates relationships among its three data types, allowing users to trace data provenance more conveniently (Figures 5, 6, and 7).

Building on this foundation, CryoSeek also provides a cross-database, full-spectrum search function that enables seamless querying across all data categories (Figure 8).

Figure 5: Structure Data Entry Interface

Figure 6: Cryo-EM Density Map Entry Interface

Figure 7: Raw Cryo-EM Image Entry Interface

Figure 8: Search Function in the CryoSeek Database

The CryoSeek platform also enables users to upload and share data (Figure 9). Its submission workflow balances scientific rigor with collaboration-friendly design: users upload metadata files sequentially, and the system automatically identifies and validates potential internal relationships within the dataset. All entries undergo manual review and annotation by the management team before being returned to the user for final confirmation and subsequent public release.
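As a rough illustration of what automatic validation of internal relationships might involve, the Python sketch below checks that every uploaded entry points to an upstream entry of the expected type within the same submission, and then steps through the review stages described above. It is a sketch under stated assumptions, not the platform's implementation: the metadata keys (`id`, `kind`, `derived_from`), the stage names, and the functions `validate_links` and `advance` are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical rule: the kind of upstream entry each kind must derive from.
EXPECTED_PARENT: Dict[str, Optional[str]] = {
    "structure": "density_map",
    "density_map": "micrographs",
    "micrographs": None,  # raw data has no upstream entry
}

# Hypothetical stage names mirroring the workflow described in the text.
STAGES = ["uploaded", "auto_validated", "curator_reviewed",
          "user_confirmed", "released"]


def validate_links(entries: List[dict]) -> List[str]:
    """Return a list of problems found in the links between entries."""
    by_id = {entry["id"]: entry for entry in entries}
    problems = []
    for entry in entries:
        kind = entry["kind"]
        if kind not in EXPECTED_PARENT:
            problems.append(f"{entry['id']}: unknown kind {kind!r}")
            continue
        expected = EXPECTED_PARENT[kind]
        parent_id = entry.get("derived_from")
        if expected is None:
            if parent_id is not None:
                problems.append(f"{entry['id']}: unexpected upstream link")
            continue
        parent = by_id.get(parent_id)
        if parent is None:
            problems.append(
                f"{entry['id']}: missing {expected} entry {parent_id!r}")
        elif parent["kind"] != expected:
            problems.append(
                f"{entry['id']}: linked to {parent['kind']}, expected {expected}")
    return problems


def advance(stage: str) -> str:
    """Move a submission to the next stage; released entries cannot advance."""
    index = STAGES.index(stage)
    if index == len(STAGES) - 1:
        raise ValueError("submission is already released")
    return STAGES[index + 1]


# Placeholder submission, invented for illustration only.
submission = [
    {"id": "MIC-1", "kind": "micrographs"},
    {"id": "MAP-1", "kind": "density_map", "derived_from": "MIC-1"},
    {"id": "STR-1", "kind": "structure", "derived_from": "MAP-1"},
]
print(validate_links(submission))
```

An empty problem list would let a submission move from `uploaded` to `auto_validated`. The later stages stand for the manual review, user confirmation, and public release steps, which in practice depend on human decisions rather than code.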

Figure 9: Data Release Workflow of CryoSeek

Beyond these core database functions, the CryoSeek platform also integrates a variety of features, including a global community forum, practical tools, and educational resources. A notable example is "Ahaha" (Figure 10) – a tool co-developed by the research teams of Professor Nieng Yan and Dr. Mingxu Hu for determining the absolute handedness of helical structures.

Figure 10: Ahaha, a Tool for Determining the Absolute Handedness of Helical Structures Using Single-Tilt Images

Unlike the PDB and EMDB, the CryoSeek data platform specializes in non-protein biomolecules, offering professional data-sharing and analytical services dedicated to glycans and other "biological dark matter". The current version focuses on enabling data deposition and sharing, with future development centered on integrating interactive analysis tools and incorporating complementary technical data from mass spectrometry, metagenomic sequencing, and other techniques. These additions are intended to allow systematic exploration of the origins and functions of diverse glycan filament structures and to advance research on species diversity and biological entities across sampling sites. As a key vehicle for CryoSeek's strategic evolution from methodological innovation toward systematic, scalable, and community-driven science, the CryoSeek data platform will provide sustained momentum for frontier discoveries in structural glycobiology and related fields. Through this open platform, the team aims to join scientists worldwide in charting a navigational map for the new world of "biological dark matter".