In the traditional paradigm of structural biology, research typically focuses on biological macromolecules with known coding genes, elucidating their molecular functions and underlying mechanisms by determining their three-dimensional structures. However, nature contains vast numbers of unknown macromolecules that may not be directly genetically encoded or that remain unsequenced. Nieng Yan's team has pioneered an alternative approach: the CryoSeek strategy, which employs cryogenic electron microscopy as a discovery tool. By combining cryogenic electron microscopy with AI-facilitated autobuilding tools, bioinformatic analyses, and other multidisciplinary techniques, CryoSeek enables the direct determination of high-resolution structures of macromolecules from natural freshwaters, environmental samples, or clinical specimens, even without prior knowledge, and thereby opens the way to exploring and elucidating their biological functions.
In prior work, Nieng Yan's team established CryoSeek as a novel structural biology paradigm. They first reported a class of protein fibrils, TLP-1a and TLP-1b, identified in environmental samples collected from the Tsinghua Lotus Pond, and speculated on their origins and potential functions. Building on this, the team discovered a novel glycofibril, TLP-4, whose core comprises a linear chain of tetrapeptide repeats coated by a dense layer of glycans. Subsequent data processing resolved five additional glycofibrils, revealing the diversity of glycofibrils in natural environments. In addition, to address the technical challenge of determining the absolute hand of such glycofibrils, Nieng Yan's team collaborated with Mingxu Hu's team to develop a new method named Ahaha. These findings not only reveal the crucial role of glycans in the structural assembly of biological macromolecules but also provide novel insights for the discovery and structural characterization of natural complex glycans [1, 2, 3, 4].
Despite the notable progress of the CryoSeek strategy, its overall throughput remains limited by two main bottlenecks: image processing and glycan modeling. First, natural samples typically contain tens to hundreds of distinct fibrils, many of which are present in extremely low abundance. Conventional cryogenic electron microscopy image processing methods struggle to handle this heterogeneity, resulting in data analysis that lags far behind the pace of data collection. Second, even when a high-resolution density map is obtained, building an accurate atomic model remains challenging. Unlike protein modeling, which can use predictive algorithms and autobuilding tools, the modeling of glycans still relies heavily on manual, experience-driven identification. This highly manual process substantially impairs the efficiency of glycan structure determination from density maps, thereby constituting a second critical bottleneck in high-throughput structural glycobiology.
On November 21, 2025 (Beijing Standard Time), the teams led by Nieng Yan and Mingxu Hu, together with their collaborators, posted a preprint on LangTaoSha, entitled "High-throughput cryo-EM characterization and automated model building of glycofibrils via CryoSeek".

Fig. 1 LangTaoSha preprint
To overcome these two bottlenecks, the team developed two novel methods. First, they introduced a recursive bisection clustering (RBC) strategy (Fig. 2), which enables efficient, parallel processing of highly heterogeneous cryogenic electron microscopy data and brings image analysis throughput in line with the pace of data collection. Second, they created EModelG (Fig. 3), an AI-driven automated modeling framework that accurately identifies glycan residue sites and automatically assembles glycan chains, thereby freeing glycan modeling from laborious manual interpretation. By integrating the two methods, the team established a high-throughput workflow for glycofibril structure determination. Applying this workflow to samples from natural freshwaters, they determined multiple glycofibril structures. To consolidate and share these structural resources, the team has launched the CryoSeek database, which provides data and methodological support for future research in structural glycobiology.
To address the data processing bottleneck in the CryoSeek strategy caused by high sample heterogeneity and low target abundance, the team developed the recursive bisection clustering (RBC) strategy, an efficient image processing method. RBC employs a top-down, divisive clustering algorithm: starting with all particles identified by helix recognition, the algorithm recursively splits a cluster into two child clusters via hierarchical two-dimensional classification, thereby constructing a binary tree. This design offers two key advantages. First, the number of computational tasks that can be processed in parallel increases exponentially with recursion depth, dramatically improving computational efficiency. Second, a theoretical detection limit of approximately 0.24‰ abundance can be achieved at a recursion depth of 12. The team applied RBC to approximately 12.2 million particles from a water sample collected in a karst cave in Guilin. Within 1–2 days, the method isolated 18 distinct fibril structures with abundances ranging from 0.14‰ to 4.9‰, validating its efficiency and sensitivity for large-scale, highly heterogeneous cryogenic electron microscopy data.

Fig. 2 Recursive bisection clustering enables efficient isolation of 18 fibrils from 12.2 million particles
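The divisive, binary-tree structure of RBC described above can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the team's implementation: each "particle" is reduced to a single number, a one-dimensional 2-means split stands in for two-dimensional classification, and all names (Node, bisect, build, leaves, detection_limit, min_spread) are hypothetical. The detection_limit helper assumes the quoted limit corresponds to 1/2^depth, which is numerically consistent with 0.24‰ at depth 12.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One cluster in the binary tree; items are 1-D toy features."""
    items: List[float]
    depth: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bisect(items):
    """Toy 2-means split in 1-D (a stand-in for 2D classification)."""
    lo, hi = min(items), max(items)
    a, b = list(items), []
    for _ in range(20):
        a = [x for x in items if abs(x - lo) <= abs(x - hi)]
        b = [x for x in items if abs(x - lo) > abs(x - hi)]
        if not a or not b:
            break
        new_lo, new_hi = sum(a) / len(a), sum(b) / len(b)
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments are stable
        lo, hi = new_lo, new_hi
    return a, b


def build(items, depth=0, max_depth=12, min_spread=1.0):
    """Recursively split a cluster until it looks homogeneous."""
    node = Node(items, depth)
    if depth < max_depth and max(items) - min(items) > min_spread:
        a, b = bisect(items)
        if a and b:
            # The two children are independent, so they could run in parallel.
            node.left = build(a, depth + 1, max_depth, min_spread)
            node.right = build(b, depth + 1, max_depth, min_spread)
    return node


def leaves(node):
    """Collect the terminal clusters of the tree."""
    if node.left is None:
        return [node]
    return leaves(node.left) + leaves(node.right)


def detection_limit(depth):
    """Assumed abundance limit: one leaf out of 2**depth equal leaves."""
    return 1.0 / 2 ** depth


# Synthetic data: three "species" with very different abundances.
rng = random.Random(0)
data = [center + rng.uniform(-0.1, 0.1)
        for center, count in [(0.0, 1000), (10.0, 50), (30.0, 5)]
        for _ in range(count)]

tree = build(data)
for leaf in leaves(tree):
    print(f"depth {leaf.depth}: {len(leaf.items)} particles")
print(f"limit at depth 12: {detection_limit(12) * 1000:.3f} per mille")
```

At depth d the tree holds at most 2**d clusters, each of which can be classified independently; this is the source of the exponential growth in parallel tasks noted above.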
The team further developed EModelG, the first AI framework capable of fully automated modeling of glycoproteins and glycan chains directly from cryogenic electron microscopy density maps. The framework performs neural-network-based density interpretation to assign voxel-wise probabilities to protein and glycan regions. Protein regions are then modeled, with glycosylation sites serving as structural anchors. On this basis, EModelG conducts SE(3) rotational sampling of monosaccharide templates, gradient-based density fitting, and automated glycosidic linkage assembly to extend glycan chains outward from the protein surface. As demonstrated with the glycoprotein GF-1L2R5 (2.53 Å), EModelG accurately reconstructed entire glycan branches. The automatically built monosaccharide rings match the experimental density and agree closely with the manually built model. This work establishes a key technical foundation for high-throughput, standardized modeling of glycan structures within the CryoSeek workflow.

Fig. 3 EModelG AI framework for automated glycan modeling
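The general idea of sampling template orientations and then refining the fit by gradient ascent can be sketched in a few lines of Python. This is only a toy illustration and does not reproduce EModelG: the density is a synthetic sum of Gaussians, the seven-point template is a hypothetical ring plus one substituent (not real monosaccharide coordinates), only the translation is refined, and the neural-network interpretation and glycosidic linkage assembly are omitted. All function names are assumptions.

```python
import math
import random


def random_rotation(rng):
    """Uniform random rotation matrix from a normalized Gaussian quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(v * v for v in q))
    w, x, y, z = (v / n for v in q)
    return [[1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
            [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
            [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)]]


def place(R, t, points):
    """Apply rotation R and translation t to every point."""
    return [tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
            for p in points]


def score(R, t, template, centers, sigma=1.0):
    """Sum of the synthetic Gaussian density sampled at the placed atoms."""
    return sum(math.exp(-sum((p[i] - c[i]) ** 2 for i in range(3)) / (2 * sigma ** 2))
               for p in place(R, t, template) for c in centers)


def translation_gradient(R, t, template, centers, sigma=1.0):
    """Analytic gradient of the score with respect to the translation t."""
    g = [0.0, 0.0, 0.0]
    for p in place(R, t, template):
        for c in centers:
            d = [p[i] - c[i] for i in range(3)]
            w = math.exp(-sum(v * v for v in d) / (2 * sigma ** 2)) / sigma ** 2
            for i in range(3):
                g[i] -= w * d[i]
    return g


def fit(template, centers, n_rot=500, steps=300, lr=0.02, seed=0):
    """Coarse rotational sampling, then gradient ascent on the translation."""
    rng = random.Random(seed)
    zero = (0.0, 0.0, 0.0)
    rotations = [random_rotation(rng) for _ in range(n_rot)]
    best_R = max(rotations, key=lambda R: score(R, zero, template, centers))
    coarse = score(best_R, zero, template, centers)
    t = [0.0, 0.0, 0.0]
    # With sigma = 1, lr stays below 2 / (n_atoms * n_centers), so a step
    # cannot lower the score.
    for _ in range(steps):
        g = translation_gradient(best_R, t, template, centers)
        t = [t[i] + lr * g[i] for i in range(3)]
    return best_R, t, coarse, score(best_R, t, template, centers)


# Hypothetical template: a six-membered ring plus one off-ring atom.
template = [(1.4 * math.cos(k * math.pi / 3), 1.4 * math.sin(k * math.pi / 3), 0.0)
            for k in range(6)] + [(2.4, 0.0, 0.8)]
# Synthetic "map": Gaussians centered on a rotated, shifted copy of the template.
true_R = random_rotation(random.Random(1))
centers = place(true_R, (0.3, -0.2, 0.1), template)

best_R, shift, coarse, refined = fit(template, centers)
print(f"score after sampling: {coarse:.3f}, after refinement: {refined:.3f}")
```

Separating a coarse global search over orientations from a local gradient step is a common pattern for rigid-body fitting; a fuller treatment would also refine the rotation and score against a real density map.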
By integrating the RBC strategy with EModelG, the team matched the data processing throughput of CryoSeek to its pace of data collection. From 13 distinct water samples, they determined 126 three-dimensional helical structures, among which 50 representative fibril structures were identified. To describe these structures systematically, a naming system based on molecular composition and structural characteristics was established. Fibrils are first categorized by composition as protein fibrils (PF) or glycofibrils (GF). They are then further classified by suffixes denoting specific features such as protein domain, linear peptide repeat, or no protein (NP). This naming system is designed to accommodate newly discovered structures while remaining systematic for existing ones, thereby laying a standardized foundation for future large-scale studies of glycofibril structures.

Fig. 4 50 representative three-dimensional helical structures
To consolidate and share these structural resources, the team has established the CryoSeek database. This is a comprehensive platform that integrates glycofibril structures, cryogenic electron microscopy density maps, and raw micrographs while preserving their interrelationships. Beyond conventional data retrieval and uploading functions, it provides a suite of specialized tools and services, including Ahaha (an absolute hand determination tool for glycofibrils), a global collaborative forum, and a dedicated repository for teaching and training, thereby establishing an integrated infrastructure for data sharing, methodology support, and scientific collaboration in structural glycobiology.

Fig. 5 CryoSeek database (https://cryoseek.org.cn)
This study was led by co-corresponding authors Nieng Yan (Founding Dean, Shenzhen Medical Academy of Research and Translation; Director, Shenzhen Bay Laboratory), Mingxu Hu (Junior Principal Investigator, Shenzhen Medical Academy of Research and Translation), and Zhangqiang Li (PhD, School of Life Sciences, Tsinghua University). Mingxu Hu (Junior Principal Investigator, Shenzhen Medical Academy of Research and Translation), Sheng Chen (PhD, School of Life Sciences, Tsinghua University), Tongtong Wang (PhD, School of Life Sciences, Tsinghua University), Lanju Qin (Shenzhen Medical Academy of Research and Translation), and Qi Zhang (PhD, Shenzhen Medical Academy of Research and Translation) are the co-first authors. Other contributors include Yilin Zhang (Shenzhen Medical Academy of Research and Translation) and Qijun Ge (Shenzhen Medical Academy of Research and Translation). The study was funded by the Shenzhen Medical Academy of Research and Translation, the National Natural Science Foundation of China, and the Beijing Frontier Research Center for Biological Structure.
References:
[1] Wang, T., Li, Z., Xu, K., Huang, W., Huang, G., Zhang, Q. C., & Yan, N. (2024). CryoSeek: A strategy for bioentity discovery using cryoelectron microscopy. Proceedings of the National Academy of Sciences, 121(42), e2417046121.
[2] Wang, T., Huang, W., Xu, K., Sun, Y., Zhang, Q. C., Yan, C., Li, Z., & Yan, N. (2025). CryoSeek II: Cryo-EM analysis of glycofibrils from freshwater reveals well-structured glycans coating linear tetrapeptide repeats. Proceedings of the National Academy of Sciences, 122(1), e2423943122.
[3] Li, Z., Wang, T., Sun, Y., Xu, K., Huang, W., Zhang, Q. C., Yan, C., & Yan, N. (2025). CryoSeek identification of glycofibrils with diverse compositions and structural assemblies. bioRxiv, 2025.09.30.679562.
[4] Zhang, Q., Qin, L., Wang, T., Li, Z., Zhang, Y., Chen, S., Yan, N., Wang, J., & Hu, M. (2025). Absolute hand determination of glycofibrils from natural sources in cryo-EM. bioRxiv, 2025.09.30.679555.