spamosaic.preprocessing
Preprocessing utilities for SpaMosaic.
Implements TF-IDF/LSI pipelines, CLR normalization, Harmony batch correction, and modality-specific preprocessing for RNA/ADT/epigenome.
- spamosaic.preprocessing.ADT_preprocess(adt_ads, batch_corr=False, favor='clr', lognorm=True, scale=False, n_comps=50, batch_key='src', key='dimred_bc')[source]
Preprocessing pipeline for ADT (protein) modality.
- Parameters:
adt_ads (list of AnnData) – ADT modality per batch.
batch_corr (bool) – Whether to perform batch correction.
favor ({'clr', 'lognorm'}) – Whether to use CLR or log-normalization.
lognorm (bool) – Apply log-normalization (if favor='lognorm').
scale (bool) – Whether to scale features.
n_comps (int) – Number of components for PCA.
batch_key (str) – Key for batch annotation.
key (str) – Key to store reduced dimension result.
- Return type:
None
- spamosaic.preprocessing.Epigenome_preprocess(epi_ads, batch_corr=False, n_peak=100000, n_comps=50, batch_key='src', key='dimred_bc', return_hvf=False)[source]
Preprocessing pipeline for epigenomic modality (e.g., ATAC-seq).
- Parameters:
epi_ads (list of AnnData) – Epigenomic modality per batch.
batch_corr (bool) – Whether to apply Harmony batch correction.
n_peak (int) – Number of variable peaks to keep.
n_comps (int) – Number of LSI components.
batch_key (str) – Batch identifier key.
key (str) – Output key in .obsm.
return_hvf (bool) – Whether to return selected peak indices.
- Returns:
If return_hvf is True, returns (peak_names, indices); otherwise None.
- Return type:
Optional[Tuple[np.ndarray, np.ndarray]]
- spamosaic.preprocessing.RNA_preprocess(rna_ads, batch_corr=False, favor='adapted', n_hvg=5000, lognorm=True, scale=False, n_comps=50, batch_key='src', key='dimred_bc', return_hvf=False)[source]
Preprocessing pipeline for RNA modality.
- Parameters:
rna_ads (list of AnnData) – RNA modality per batch.
batch_corr (bool) – Whether to perform batch correction.
favor ({'adapted', 'scanpy'}) – Which pipeline to use.
n_hvg (int) – Number of highly variable genes.
lognorm (bool) – Whether to apply log-normalization.
scale (bool) – Whether to scale features.
n_comps (int) – Number of output components.
batch_key (str) – Key in .obs indicating batch identity.
key (str) – Key to store result in .obsm.
return_hvf (bool) – If True, return indices of selected HVGs.
- Returns:
If return_hvf is True, returns (gene_names, indices); otherwise None.
- Return type:
Optional[Tuple[np.ndarray, np.ndarray]]
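The steps these parameters describe (log-normalization, selection of n_hvg variable genes, reduction to n_comps components) can be sketched with NumPy alone. This is an illustration under stated assumptions, not SpaMosaic's implementation: the variance-based gene ranking, the SVD-based PCA, and the names rna_sketch, gene_names and target_sum are all hypothetical. It only mirrors the (gene_names, indices) shape of the documented return value.

```python
import numpy as np

def rna_sketch(counts, gene_names, n_hvg=3, n_comps=2, target_sum=1e4):
    """Toy log-normalize -> pick variable genes -> PCA (assumed steps)."""
    counts = np.asarray(counts, dtype=float)
    # Library-size normalization followed by log1p.
    totals = counts.sum(axis=1, keepdims=True)
    lognorm = np.log1p(counts / np.maximum(totals, 1.0) * target_sum)
    # Rank genes by variance of the log-normalized values (a simple stand-in
    # for a real HVG method) and keep the top n_hvg in original column order.
    indices = np.sort(np.argsort(lognorm.var(axis=0))[::-1][:n_hvg])
    sub = lognorm[:, indices]
    # PCA via SVD of the column-centered matrix.
    centered = sub - sub.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    embedding = U[:, :n_comps] * S[:n_comps]
    return embedding, (np.asarray(gene_names)[indices], indices)

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(20, 6))
names = np.array([f"gene{i}" for i in range(6)])
embedding, (hvg_names, hvg_idx) = rna_sketch(counts, names)
```

The returned indices point into the original gene axis, so names[hvg_idx] reproduces hvg_names.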
- spamosaic.preprocessing.clr_normalize(adata)[source]
Perform centered log-ratio (CLR) normalization on count data.
- Parameters:
adata (AnnData) – Input data with count matrix in .X.
- Returns:
Normalized AnnData object.
- Return type:
AnnData
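For readers unfamiliar with CLR, the textbook form subtracts the mean log value of a cell from each of its log-transformed counts. The sketch below works on a plain array rather than an AnnData object and assumes a pseudocount of 1 and per-cell (row-wise) centering. The source does not state which CLR variant or axis clr_normalize uses, so treat clr_rows as a hypothetical illustration only.

```python
import numpy as np

def clr_rows(counts, pseudocount=1.0):
    """Row-wise CLR: log(x + pseudocount) minus the row mean of those logs."""
    counts = np.asarray(counts, dtype=float)
    logged = np.log(counts + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

# Two cells, three proteins; the second cell has identical counts everywhere.
adt = np.array([[0, 10, 100], [5, 5, 5]])
out = clr_rows(adt)
```

With this definition every row of the result sums to zero, and a cell with identical counts for all proteins maps to all zeros.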
- spamosaic.preprocessing.harmony(latent, batch_labels, use_gpu=True)[source]
Batch correction using Harmony.
- Parameters:
latent (np.ndarray) – Low-dimensional representation (e.g., PCA).
batch_labels (list or array) – Corresponding batch annotations.
use_gpu (bool, default=True) – Whether to use GPU acceleration.
- Returns:
Batch-corrected latent representation.
- Return type:
np.ndarray
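Because harmony takes one stacked latent matrix and one label per row, per-batch embeddings have to be concatenated before the call and split again afterwards. The sketch below shows only that bookkeeping. The helper correct_batches and its correct argument are hypothetical, and the default is an identity placeholder, not Harmony. In practice one would presumably pass spamosaic.preprocessing.harmony as correct, which is not exercised here.

```python
import numpy as np

def correct_batches(embeddings, correct=lambda latent, labels: latent):
    """Stack per-batch embeddings, call `correct`, split the result back."""
    sizes = [e.shape[0] for e in embeddings]
    latent = np.vstack(embeddings)
    # One integer label per row, in the same order as the stacked rows.
    batch_labels = np.repeat(np.arange(len(embeddings)), sizes)
    corrected = correct(latent, batch_labels)
    # np.split takes the cumulative row offsets between consecutive batches.
    return np.split(corrected, np.cumsum(sizes)[:-1]), batch_labels

parts, labels = correct_batches([np.zeros((3, 4)), np.ones((2, 4))])
```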
- class spamosaic.preprocessing.lsiTransformer(n_components: int = 20, drop_first=True, use_highly_variable=None, log=True, norm=True, z_score=True, tfidf=True, svd=True, use_counts=False, pcaAlgo='arpack')[source]
Bases: object
Latent Semantic Indexing (LSI) pipeline for dimensionality reduction.
- Parameters:
n_components (int) – Number of SVD components.
drop_first (bool) – Whether to drop the first principal component.
use_highly_variable (bool or None) – Whether to subset to highly variable features.
log (bool) – Whether to apply log1p transformation.
norm (bool) – Whether to normalize features.
z_score (bool) – Whether to z-score features.
tfidf (bool) – Whether to apply TF-IDF normalization.
svd (bool) – Whether to apply SVD transformation.
use_counts (bool) – Use .layers['counts'] instead of .X for data.
pcaAlgo (str) – SVD backend (e.g., 'arpack').
- fit(adata: AnnData)[source]
Fit the transformer on an AnnData object.
- Parameters:
adata (AnnData) – Input data.
- Return type:
None
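The constructor flags suggest a sequence of TF-IDF, log1p, SVD, dropping the first component, and z-scoring. A minimal dense NumPy sketch of that sequence follows. It rests on several assumptions the source does not confirm: the order of the steps, the 1e4 multiplier before log1p, z-scoring per cell across components, and n_components counting the components kept after the first is dropped. lsi_sketch is a hypothetical name, a dense SVD stands in for the 'arpack' backend, and the input is assumed to have no all-zero rows or columns.

```python
import numpy as np

def lsi_sketch(X, n_components=3, drop_first=True):
    """TF-IDF -> log1p -> truncated SVD -> drop first -> per-cell z-score."""
    X = np.asarray(X, dtype=float)
    tf = X / X.sum(axis=1, keepdims=True)   # term frequency per cell
    idf = X.shape[0] / X.sum(axis=0)        # inverse document frequency
    Z = np.log1p(tf * idf * 1e4)
    k = n_components + int(drop_first)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    emb = U[:, :k] * S[:k]
    if drop_first:
        # The leading component often tracks sequencing depth.
        emb = emb[:, 1:]
    emb = emb - emb.mean(axis=1, keepdims=True)
    return emb / emb.std(axis=1, ddof=1, keepdims=True)

rng = np.random.default_rng(0)
peaks = rng.poisson(1.0, size=(30, 8)) + 1  # +1 avoids empty rows and columns
emb = lsi_sketch(peaks, n_components=3)
```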
- spamosaic.preprocessing.sparse_log1p_scale(X, scale=10000.0)[source]
Apply log1p transformation to sparse or dense matrix, scaled by a factor.
- Parameters:
X (Union[scipy.sparse.spmatrix, np.ndarray]) – Input expression matrix.
scale (float, default=1e4) – Scaling factor applied before log1p.
- Returns:
Transformed matrix (same type as input).
- Return type:
Union[scipy.sparse.spmatrix, np.ndarray]
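The reason a log1p transform suits sparse input is that log1p(0) = 0, so only the stored non-zero values need to change. The sketch below assumes that "scaled by a factor" means element-wise multiplication by scale before log1p, which the source does not spell out. log1p_scaled is a hypothetical stand-in, and unlike the documented function it always returns CSR for sparse input.

```python
import numpy as np
import scipy.sparse as sp

def log1p_scaled(X, scale=1e4):
    """Compute log1p(X * scale), touching only stored values when sparse."""
    if sp.issparse(X):
        out = X.astype(float).tocsr(copy=True)
        # Zeros stay zero under log1p, so the sparsity pattern is unchanged.
        out.data = np.log1p(out.data * scale)
        return out
    return np.log1p(np.asarray(X, dtype=float) * scale)

dense = np.array([[0.0, 1.0], [2.0, 0.0]])
sparse_out = log1p_scaled(sp.csr_matrix(dense))
dense_out = log1p_scaled(dense)
```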
- class spamosaic.preprocessing.tfidfTransformer[source]
Bases: object
TF-IDF Transformer for sparse count data.
- fit(X)[source]
Compute IDF vector from input data.
- Parameters:
X (array-like or sparse matrix) – Count matrix.
- Return type:
None
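To make the fit step concrete, here is a small dense sketch of a fit/transform pair in which fit stores one IDF weight per feature. The IDF definition used (number of cells divided by each feature's total count) is one common choice for count matrices and is an assumption, as are the class name TinyTfidf, the idf_ attribute, and the transform method. The source documents only fit.

```python
import numpy as np

class TinyTfidf:
    """Hypothetical dense TF-IDF: fit() learns IDF, transform() applies it."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        col_sums = X.sum(axis=0)
        # Features never observed get IDF 0 instead of a division by zero.
        self.idf_ = np.divide(X.shape[0], col_sums,
                              out=np.zeros_like(col_sums), where=col_sums > 0)

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        row_sums = X.sum(axis=1, keepdims=True)
        # Term frequency: each cell's counts divided by that cell's total.
        tf = np.divide(X, row_sums, out=np.zeros_like(X), where=row_sums > 0)
        return tf * self.idf_

X = np.array([[1, 0], [1, 2]])
model = TinyTfidf()
model.fit(X)
weighted = model.transform(X)
```

Keeping the IDF vector on the fitted object lets the same weights be reused on data that was not seen during fit.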
Functions
ADT_preprocess | Preprocessing pipeline for ADT (protein) modality.
Epigenome_preprocess | Preprocessing pipeline for epigenomic modality (e.g., ATAC-seq).
RNA_preprocess | Preprocessing pipeline for RNA modality.
clr_normalize | Perform centered log-ratio (CLR) normalization on count data.
harmony | Batch correction using Harmony.
sparse_log1p_scale | Apply log1p transformation to sparse or dense matrix, scaled by a factor.
Classes
lsiTransformer | Latent Semantic Indexing (LSI) pipeline for dimensionality reduction.
tfidfTransformer | TF-IDF Transformer for sparse count data.