scpFormer is a transformer-based foundation model designed to unify and interpret single-cell proteomics data that are fragmented across technologies and experimental designs. Because the field lacks a universal reference panel, the model uses a continuous semantic tokenization strategy that maps each protein into a shared embedding space based on its amino acid sequence. This design supports an open vocabulary of proteins, so datasets measured with different panels can be integrated and unmeasured markers can be predicted through in silico imputation. Pre-trained on more than 390 million cells, the model learns protein co-expression patterns that remain robust to technical noise and batch effects. Benchmarking shows that scpFormer excels at identifying rare cell types, harmonizing multi-center studies, and improving the accuracy of cancer drug response prediction. The framework thus provides a scalable, panel-agnostic tool for precision oncology and biomarker discovery.
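To make the open-vocabulary idea concrete, here is a minimal sketch of sequence-derived tokenization. It is not the scpFormer implementation: the summary above does not say how sequences are encoded, so the sketch substitutes a plain amino acid composition vector and a fixed random projection for whatever learned encoder the model uses. All function names, protein names, sequences, and abundance values are hypothetical and chosen only for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the 20-dim amino acid frequency vector of a sequence."""
    counts = [0] * len(AMINO_ACIDS)
    for residue in sequence.upper():
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # skip non-standard residues such as X or U
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [c / total for c in counts]


def make_projection(dim, seed=0):
    """Fixed random (dim x 20) matrix. ASSUMPTION: stands in for learned weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS] for _ in range(dim)]


def embed_protein(sequence, projection):
    """Map a sequence to a unit-length vector in the shared space."""
    features = composition(sequence)
    vec = [sum(w * f for w, f in zip(row, features)) for row in projection]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard against a zero vector
    return [v / norm for v in vec]


def tokenize_cell(measurements, sequences, projection):
    """Build one (name, embedding, abundance) token per measured protein.

    No fixed vocabulary is consulted: any protein with a known sequence
    can be tokenized, whichever panel measured it.
    """
    return [
        (name, embed_protein(sequences[name], projection), abundance)
        for name, abundance in measurements.items()
    ]


# Toy, made-up sequences (not real proteins) and two non-identical panels.
sequences = {
    "protein_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "protein_B": "MGSSHHHHHHSSGLVPRGSH",
    "protein_C": "MDEKRRAQHNEVERRRRDKIN",
}
projection = make_projection(dim=16, seed=42)

panel_1_cell = {"protein_A": 3.2, "protein_B": 0.4}
panel_2_cell = {"protein_B": 1.1, "protein_C": 2.7}

tokens_1 = tokenize_cell(panel_1_cell, sequences, projection)
tokens_2 = tokenize_cell(panel_2_cell, sequences, projection)

for name, embedding, abundance in tokens_1 + tokens_2:
    print(name, abundance, [round(v, 3) for v in embedding[:4]])
```

Because each embedding depends only on the sequence, `protein_B` receives the same vector in both cells even though the two panels overlap only partially. That property is what lets a panel-agnostic model place cells from different assays in one token space.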
References:
Zhou Q, Yu L, Guo Y, et al. scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics. arXiv preprint arXiv:2604.20003, 2026.

