segger.data¶
The segger.data module in Segger is a comprehensive data processing and management system designed specifically for spatial transcriptomics datasets. It provides a unified, scalable interface for handling large-scale spatial transcriptomics data from technologies such as Xenium and Merscope, with a focus on preparing data for graph-based deep learning models.
- Sample: Core sample handling, tiling, and data management classes
- PyG Dataset: PyTorch Geometric dataset integration and utilities
- Utils: Core utility functions for data processing, filtering, and analysis
- Transcript Embedding: Transcript feature encoding and embedding utilities
- NDTree: Spatial partitioning and load balancing utilities
Module Overview¶
The segger.data package is organized into several key modules, each serving a specific purpose in the spatial transcriptomics data processing pipeline.
Sample Module¶
The main data processing module containing the core classes for handling spatial transcriptomics data.
Main Functions:
- Data loading and validation
- Spatial tiling and partitioning
- Graph construction and feature engineering
- Parallel processing coordination
Key Classes:
- STSampleParquet: Main orchestrator for data loading and processing
- STInMemoryDataset: In-memory dataset with spatial indexing
- STTile: Individual spatial tile processing
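To make the tiling step concrete, the following is a minimal sketch of grid-based spatial tiling in plain Python. It is not Segger's implementation: the function name `assign_tiles` and the fixed square grid are assumptions chosen for illustration, and the real classes may tile adaptively.

```python
from collections import defaultdict
from math import floor


def assign_tiles(points, tile_size):
    """Group (x, y) points into square tiles with side length `tile_size`.

    Hypothetical helper for illustration only. Returns a dict that maps
    (column, row) tile indices to the list of point indices in that tile.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    tiles = defaultdict(list)
    for i, (x, y) in enumerate(points):
        # Integer grid coordinates of the tile containing this point.
        key = (floor(x / tile_size), floor(y / tile_size))
        tiles[key].append(i)
    return dict(tiles)


points = [(5.0, 5.0), (12.0, 3.0), (18.0, 27.0), (7.5, 9.9)]
tiles = assign_tiles(points, tile_size=10.0)
for key, members in sorted(tiles.items()):
    print(key, members)
```

Each tile can then be processed independently, which is what makes parallel coordination across workers straightforward.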
Utils Module¶
Core utility functions for data processing, filtering, and analysis.
Key Functions:
- get_xy_extents(): Extract spatial extents from parquet files
- read_parquet_region(): Read specific spatial regions from parquet files
- filter_transcripts(): Filter transcripts by quality and gene type
- filter_boundaries(): Filter boundaries by spatial criteria
- load_settings(): Load technology-specific configurations
- find_markers(): Identify marker genes for cell types
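The kinds of operations listed above can be sketched on in-memory records. The helpers below (`compute_xy_extents`, `select_region`, `filter_by_quality`) are hypothetical stand-ins written for this example. They do not reproduce the signatures of the Segger utilities, which operate on parquet files, and the field names `x`, `y`, `qv`, and `gene` are assumptions.

```python
def compute_xy_extents(records):
    """Return (x_min, y_min, x_max, y_max) over all records."""
    xs = [r["x"] for r in records]
    ys = [r["y"] for r in records]
    return min(xs), min(ys), max(xs), max(ys)


def select_region(records, x_min, y_min, x_max, y_max):
    """Keep records inside the half-open box [x_min, x_max) x [y_min, y_max)."""
    return [
        r for r in records
        if x_min <= r["x"] < x_max and y_min <= r["y"] < y_max
    ]


def filter_by_quality(records, min_qv, exclude_prefixes=()):
    """Drop low-quality records and genes whose names start with a given prefix."""
    prefixes = tuple(exclude_prefixes)
    return [
        r for r in records
        if r["qv"] >= min_qv and not r["gene"].startswith(prefixes)
    ]


# Toy transcript records; values are made up for illustration.
transcripts = [
    {"gene": "ACTB", "x": 1.0, "y": 2.0, "qv": 35.0},
    {"gene": "NegControlProbe_001", "x": 4.0, "y": 8.0, "qv": 30.0},
    {"gene": "GAPDH", "x": 9.0, "y": 3.0, "qv": 12.0},
    {"gene": "CD3E", "x": 6.0, "y": 6.0, "qv": 28.0},
]

extents = compute_xy_extents(transcripts)
region = select_region(transcripts, 0.0, 0.0, 7.0, 7.0)
kept = filter_by_quality(transcripts, min_qv=20.0, exclude_prefixes=("NegControl",))
```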
PyG Dataset Module¶
PyTorch Geometric dataset integration for machine learning workflows.
Features:
- Automatic tile discovery and loading
- PyTorch Lightning integration
- Built-in data validation
- Efficient memory management
Key Classes:
- STPyGDataset: PyTorch Geometric dataset wrapper
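As a rough illustration of automatic tile discovery, the sketch below indexes tile files in a directory and exposes them through `__len__` and `__getitem__`. It deliberately avoids a PyTorch Geometric dependency and returns file paths rather than loaded graphs. The class name `TileDirectoryDataset` and the `*.pt` file pattern are assumptions, not the behavior of STPyGDataset.

```python
import tempfile
from pathlib import Path


class TileDirectoryDataset:
    """Hypothetical stand-in that discovers one file per tile in a directory."""

    def __init__(self, root, pattern="*.pt"):
        self.root = Path(root)
        if not self.root.is_dir():
            raise FileNotFoundError(f"tile directory not found: {self.root}")
        # Sort so that the index-to-tile mapping is deterministic.
        self.paths = sorted(self.root.glob(pattern))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # A real dataset would load and return the graph stored at this path.
        return self.paths[idx]


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("tile_2.pt", "tile_0.pt", "tile_1.pt", "notes.txt"):
        (Path(tmp) / name).touch()
    dataset = TileDirectoryDataset(tmp)
    tile_names = [dataset[i].name for i in range(len(dataset))]
```

Files that do not match the pattern (here `notes.txt`) are ignored, which is a simple form of validation at discovery time.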
Transcript Embedding Module¶
Utilities for encoding transcript features into numerical representations.
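One simple way to turn transcripts into numerical features is to map each gene name to an integer index and then one-hot encode it. The sketch below shows that baseline only; the function names are hypothetical, and the module itself may use a different encoding (for example, weighted or learned embeddings).

```python
def build_gene_index(gene_names):
    """Map each distinct gene name to a stable integer index (sorted order)."""
    return {gene: i for i, gene in enumerate(sorted(set(gene_names)))}


def one_hot_encode(genes, gene_index):
    """Return one row per transcript with a single 1.0 at the gene's index."""
    n_genes = len(gene_index)
    rows = []
    for gene in genes:
        row = [0.0] * n_genes
        row[gene_index[gene]] = 1.0
        rows.append(row)
    return rows


genes = ["CD3E", "ACTB", "CD3E", "GAPDH"]
gene_index = build_gene_index(genes)
features = one_hot_encode(genes, gene_index)
```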
NDTree Module¶
Spatial partitioning and load balancing utilities.
Key Classes:
- NDTree: N-dimensional spatial partitioning
- innernode: Internal tree node structure
Features:
- Efficient spatial data partitioning
- Load balancing for parallel processing
- Memory-optimized data structures
- Configurable region sizing
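The idea of load-balanced spatial partitioning can be sketched with a recursive median split: sort the points along one axis, cut them so each side receives a share proportional to the number of regions it must produce, and recurse on alternating axes. This is an illustrative sketch under that assumption, with a hypothetical function name; it is not the NDTree implementation.

```python
def balanced_split(points, n_regions, depth=0):
    """Split points into `n_regions` groups of nearly equal size.

    Points are tuples of coordinates. The split axis cycles through the
    dimensions as the recursion deepens.
    """
    if n_regions <= 1 or len(points) <= 1:
        return [points]
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    left_regions = n_regions // 2
    right_regions = n_regions - left_regions
    # Give each side a share of points proportional to its region count.
    cut = len(ordered) * left_regions // n_regions
    return (
        balanced_split(ordered[:cut], left_regions, depth + 1)
        + balanced_split(ordered[cut:], right_regions, depth + 1)
    )


points = [(i, (i * 3) % 8) for i in range(8)]
regions = balanced_split(points, n_regions=4)
sizes = [len(region) for region in regions]
```

Because every region holds roughly the same number of points, work assigned per region is similar even when the tissue is much denser in some areas than in others.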
Configuration and Settings¶
Settings Directory¶
Technology-specific configuration files for different spatial transcriptomics platforms.
Available Platforms:
- Xenium
- Merscope
- CosMx
- Xenium v2 (segmentation kit)
Configuration Options:
- Column mappings
- Quality thresholds
- Spatial parameters
- Platform-specific defaults
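A platform configuration of this kind can be pictured as a mapping from raw column names to a common schema, plus a few thresholds. The sketch below is illustrative only: the dictionary layout, the function names, and the threshold values are assumptions and do not reflect the contents of the packaged settings files.

```python
# Hypothetical settings; keys and values are assumptions for illustration.
PLATFORM_SETTINGS = {
    "xenium": {
        "columns": {
            "x_location": "x",
            "y_location": "y",
            "feature_name": "gene",
            "qv": "quality",
        },
        "min_quality": 20.0,
    },
    "merscope": {
        "columns": {"global_x": "x", "global_y": "y", "gene": "gene"},
        "min_quality": None,  # no quality column assumed for this platform
    },
}


def load_platform_settings(platform):
    """Look up settings by platform name, ignoring case."""
    key = platform.lower()
    if key not in PLATFORM_SETTINGS:
        raise KeyError(f"unknown platform: {platform!r}")
    return PLATFORM_SETTINGS[key]


def normalize_record(record, settings):
    """Rename mapped columns to the common schema and drop unmapped ones."""
    mapping = settings["columns"]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


raw = {"x_location": 1.5, "y_location": 2.5, "feature_name": "ACTB",
       "qv": 30.0, "cell_id": "a"}
normalized = normalize_record(raw, load_platform_settings("Xenium"))
```

Keeping the column mapping in configuration rather than in code means downstream steps can work with one schema regardless of the platform that produced the data.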
Data Flow Architecture¶
Raw Data (Parquet) → STSampleParquet     → Spatial Tiling  → STInMemoryDataset → STTile         → PyG Graph
                         ↓                     ↓                 ↓                   ↓                ↓
                     Metadata Extraction   Region Division   Data Filtering      Tile Creation    Graph Construction
                         ↓                     ↓                 ↓                   ↓                ↓
                     Settings Loading      Load Balancing    Spatial Indexing    Feature Comp.    Edge Generation
Class Hierarchy¶
STSampleParquet (Main Orchestrator)
├── STInMemoryDataset (Region Processing)
│ └── STTile (Individual Tile Processing)
├── TranscriptEmbedding (Feature Encoding)
└── NDTree (Spatial Partitioning)
STPyGDataset (ML Integration)
└── PyTorch Geometric Integration
BackendHandler (Experimental)
└── Multi-backend Support
Usage¶
from segger.data.sample import STSampleParquet
# Load and process data
sample = STSampleParquet(base_dir="path/to/data", n_workers=4)
sample.save(data_dir="./processed", tile_size=1000)