S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction

Abstract

Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.

Motivation

Most existing approaches remain offline-global in the sense that, as new frames arrive, they repeatedly recompute cross-frame interactions over the growing history. While effective for short sequences, this paradigm scales poorly: both runtime and memory typically grow rapidly with the number of views, hindering long-horizon online scenarios. As shown in Figure 1, even on an H200 GPU equipped with 140 GB of VRAM, SIU3R (Xu et al., 2025) still encounters an out-of-memory (OOM) error after processing approximately 80 frames, exposing a fundamental limitation of current joint modeling paradigms under long input streams. This phenomenon indicates that, for long-running online systems, there is an urgent need for an incremental modeling approach that does not require repeatedly reprocessing historical frames.
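To make the scaling argument concrete, the sketch below counts frame passes under a deliberately simplified cost model. The model is an assumption of this example, not a measurement from the paper: an offline-global method reprocesses every frame seen so far each time a new frame arrives, while a streaming method processes each frame exactly once. Both function names are hypothetical, and the count ignores the per-step memory footprint, which is what actually triggers the OOM.

```python
def offline_global_frame_passes(num_frames: int) -> int:
    """Total frame passes when the full history is reprocessed at every step.

    Step t touches t frames, so the total is 1 + 2 + ... + T = T * (T + 1) / 2.
    """
    return sum(range(1, num_frames + 1))


def streaming_frame_passes(num_frames: int) -> int:
    """Total frame passes when each incoming frame is processed exactly once."""
    return num_frames


if __name__ == "__main__":
    for total in (80, 1000):
        print(total, offline_global_frame_passes(total), streaming_frame_passes(total))
```

Under this toy model the offline-global total grows quadratically with the stream length while the streaming total grows linearly, which is the qualitative gap the paragraph above describes.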

Method

Overview of S2GS. S2GS processes an uncalibrated and unposed RGB image stream in a strictly causal manner. A causal Transformer encoder, guided by geometric priors from a 3D foundation model, predicts camera parameters, depth, and Gaussian attributes to incrementally construct a 3D Gaussian representation. A decoupled semantic stream leverages a 2D foundation model and a query-driven decoder to produce per-view semantic and instance predictions. Query-level contrastive learning and an online instance memory bank (MB) stabilize instance identities over time. Semantic confidence is lifted to the 3D Gaussian field and decoded via splatting, enabling unified novel view synthesis, semantic segmentation, instance segmentation, and panoptic segmentation without revisiting past frames.
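The overview does not spell out how the instance memory bank matches new predictions to known instances, so the following is only a minimal sketch of one plausible scheme, not the authors' implementation. It assumes greedy cosine-similarity matching of per-frame identity embeddings against stored prototypes, with an exponential moving average update on a match. The class name `InstanceMemory`, the `match_threshold` and `momentum` parameters, and their default values are all hypothetical.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class InstanceMemory:
    """Toy instance memory: one unit-norm prototype embedding per known instance.

    Hypothetical illustration only; thresholds and update rule are assumptions.
    """

    def __init__(self, match_threshold=0.7, momentum=0.9):
        self.prototypes = []  # list index doubles as the persistent instance ID
        self.match_threshold = match_threshold
        self.momentum = momentum

    def associate(self, embeddings):
        """Return one persistent instance ID per embedding of the current frame."""
        ids = []
        taken = set()  # each stored instance may be claimed once per frame
        for raw in embeddings:
            emb = _normalize(raw)
            candidates = [
                (_dot(proto, emb), idx)
                for idx, proto in enumerate(self.prototypes)
                if idx not in taken
            ]
            best_sim, best_id = max(candidates, default=(-1.0, -1))
            if best_id >= 0 and best_sim >= self.match_threshold:
                # Matched a known instance: blend the prototype toward the new embedding.
                m = self.momentum
                blended = [m * p + (1.0 - m) * e
                           for p, e in zip(self.prototypes[best_id], emb)]
                self.prototypes[best_id] = _normalize(blended)
            else:
                # No sufficiently similar prototype: register a new instance.
                self.prototypes.append(emb)
                best_id = len(self.prototypes) - 1
            taken.add(best_id)
            ids.append(best_id)
        return ids


if __name__ == "__main__":
    memory = InstanceMemory()
    print(memory.associate([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
    print(memory.associate([[0.05, 0.99, 0.0], [0.98, 0.1, 0.0]]))
```

Because each frame only reads and updates a fixed-size prototype per instance, association cost depends on the number of tracked instances rather than on the number of past frames, which is consistent with the goal of not revisiting history. A real system would likely replace the greedy loop with a proper assignment step and handle instance retirement.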

Qualitative Results

Novel View Synthesis / Semantic Segmentation / Instance Tracking.

BibTeX

@misc{zhang2026s2gsstreamingsemanticgaussian,
      title={S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction}, 
      author={Renhe Zhang and Yuyang Tan and Jingyu Gong and Zhizhong Zhang and Lizhuang Ma and Yuan Xie and Xin Tan},
      year={2026},
      eprint={2603.14232},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.14232}, 
}