spamosaic.preprocessing

Preprocessing utilities for SpaMosaic.

Implements TF-IDF/LSI pipelines, CLR normalization, Harmony batch correction, and modality-specific preprocessing for RNA, ADT, and epigenome data.

spamosaic.preprocessing.ADT_preprocess(adt_ads, batch_corr=False, favor='clr', lognorm=True, scale=False, n_comps=50, batch_key='src', key='dimred_bc')

Preprocessing pipeline for ADT (protein) modality.

Parameters:
  • adt_ads (list of AnnData) – ADT modality per batch.

  • batch_corr (bool) – Whether to perform batch correction.

  • favor ({'clr', 'lognorm'}) – Normalization method: centered log-ratio ('clr') or log-normalization ('lognorm').

  • lognorm (bool) – Apply log-normalization (if favor='lognorm').

  • scale (bool) – Whether to scale features.

  • n_comps (int) – Number of components for PCA.

  • batch_key (str) – Key for batch annotation.

  • key (str) – Key to store reduced dimension result.

Return type:

None

spamosaic.preprocessing.Epigenome_preprocess(epi_ads, batch_corr=False, n_peak=100000, n_comps=50, batch_key='src', key='dimred_bc', return_hvf=False)

Preprocessing pipeline for epigenomic modality (e.g., ATAC-seq).

Parameters:
  • epi_ads (list of AnnData) – Epigenomic modality per batch.

  • batch_corr (bool) – Whether to apply Harmony batch correction.

  • n_peak (int) – Number of variable peaks to keep.

  • n_comps (int) – Number of LSI components.

  • batch_key (str) – Batch identifier key.

  • key (str) – Output key in .obsm.

  • return_hvf (bool) – Whether to return the selected peaks as (peak_names, indices).

Returns:

If return_hvf is True, returns (peak_names, indices); otherwise None.

Return type:

Optional[Tuple[np.ndarray, np.ndarray]]

spamosaic.preprocessing.RNA_preprocess(rna_ads, batch_corr=False, favor='adapted', n_hvg=5000, lognorm=True, scale=False, n_comps=50, batch_key='src', key='dimred_bc', return_hvf=False)

Preprocessing pipeline for RNA modality.

Parameters:
  • rna_ads (list of AnnData) – RNA modality per batch.

  • batch_corr (bool) – Whether to perform batch correction.

  • favor ({'adapted', 'scanpy'}) – Which pipeline to use.

  • n_hvg (int) – Number of highly variable genes.

  • lognorm (bool) – Whether to apply log-normalization.

  • scale (bool) – Whether to scale features.

  • n_comps (int) – Number of output components.

  • batch_key (str) – Key in .obs indicating batch identity.

  • key (str) – Key to store result in .obsm.

  • return_hvf (bool) – If True, return the selected highly variable genes as (gene_names, indices).

Returns:

If return_hvf is True, returns (gene_names, indices); otherwise None.

Return type:

Optional[Tuple[np.ndarray, np.ndarray]]
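
The three modality pipelines above share most of their arguments, so a call sequence might look like the following. This is a usage sketch based only on the documented signatures: the wrapper preprocess_modalities and its argument layout are hypothetical, and the import is deferred so the wrapper can be defined without SpaMosaic installed.

```python
def preprocess_modalities(rna_ads=None, adt_ads=None, epi_ads=None,
                          batch_corr=True, key="dimred_bc"):
    """Hypothetical convenience wrapper around the documented pipelines.

    Returns a dict with whatever the RNA and epigenome pipelines report
    as selected features when called with return_hvf=True.
    """
    if rna_ads is None and adt_ads is None and epi_ads is None:
        raise ValueError("provide at least one modality")

    # Deferred import: SpaMosaic is only needed once a pipeline is run.
    from spamosaic.preprocessing import (
        ADT_preprocess,
        Epigenome_preprocess,
        RNA_preprocess,
    )

    selected = {}
    if rna_ads is not None:
        # Documented return value: (gene_names, indices) when return_hvf=True.
        selected["rna"] = RNA_preprocess(
            rna_ads, batch_corr=batch_corr, n_hvg=5000,
            key=key, return_hvf=True,
        )
    if adt_ads is not None:
        # ADT_preprocess is documented as returning None.
        ADT_preprocess(adt_ads, batch_corr=batch_corr, favor="clr", key=key)
    if epi_ads is not None:
        # Documented return value: (peak_names, indices) when return_hvf=True.
        selected["epi"] = Epigenome_preprocess(
            epi_ads, batch_corr=batch_corr, n_peak=100000,
            key=key, return_hvf=True,
        )
    return selected
```

For the RNA and epigenome pipelines, the key parameter is documented as the .obsm entry that receives the reduced representation, so downstream code would read the result from each AnnData under that key.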

spamosaic.preprocessing.clr_normalize(adata)

Perform centered log-ratio (CLR) normalization on count data.

Parameters:

adata (AnnData) – Input data with count matrix in .X.

Returns:

Normalized AnnData object.

Return type:

AnnData
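
To make the transformation concrete, here is a minimal NumPy sketch of a centered log-ratio on a dense count matrix. It is not SpaMosaic's implementation: the function name clr_rows, the pseudocount of 1 (via log1p), and the choice to center per cell (row) are all assumptions, and CLR variants differ on exactly these points.

```python
import numpy as np

def clr_rows(counts):
    """CLR-normalize each row (cell) of a dense count matrix (illustrative)."""
    counts = np.asarray(counts, dtype=float)
    # log1p adds a pseudocount of 1, so zero counts stay finite.
    log_counts = np.log1p(counts)
    # Subtracting the per-row mean of the logs is equivalent to dividing
    # each (count + 1) by the row's geometric mean before taking the log.
    return log_counts - log_counts.mean(axis=1, keepdims=True)

adt_counts = np.array([[1, 2, 3],
                       [0, 0, 10]])
adt_clr = clr_rows(adt_counts)
```

By construction, every row of the result sums to zero, which is the defining property of a centered log-ratio.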

spamosaic.preprocessing.harmony(latent, batch_labels, use_gpu=True)

Batch correction using Harmony.

Parameters:
  • latent (np.ndarray) – Low-dimensional representation (e.g., PCA).

  • batch_labels (list or array) – Corresponding batch annotations.

  • use_gpu (bool, default=True) – Whether to use GPU acceleration.

Returns:

Batch-corrected latent representation.

Return type:

np.ndarray
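
Because latent and batch_labels must describe the same cells, a caller may want to check their shapes before handing them over. The wrapper below is a hypothetical sketch (harmony_corrected is not part of SpaMosaic); it relies only on the signature documented above and defers the import until the inputs have passed validation.

```python
import numpy as np

def harmony_corrected(latent, batch_labels, use_gpu=False):
    """Validate inputs, then call the documented harmony() function."""
    latent = np.asarray(latent)
    batch_labels = np.asarray(batch_labels)
    if latent.ndim != 2:
        raise ValueError("latent must be a 2-D array (cells x components)")
    if latent.shape[0] != batch_labels.shape[0]:
        raise ValueError("need exactly one batch label per row of latent")

    # Deferred import: only reached once the inputs look consistent.
    from spamosaic.preprocessing import harmony
    return harmony(latent, batch_labels, use_gpu=use_gpu)
```

The wrapper defaults to use_gpu=False, unlike the documented default of True, on the assumption that a GPU may not be available; pass use_gpu=True explicitly where one is.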

class spamosaic.preprocessing.lsiTransformer(n_components: int = 20, drop_first=True, use_highly_variable=None, log=True, norm=True, z_score=True, tfidf=True, svd=True, use_counts=False, pcaAlgo='arpack')

Bases: object

Latent Semantic Indexing (LSI) pipeline for dimensionality reduction.

Parameters:
  • n_components (int) – Number of SVD components.

  • drop_first (bool) – Whether to drop the first LSI (SVD) component, which in LSI often tracks sequencing depth rather than biological signal.

  • use_highly_variable (bool or None) – Whether to subset to highly variable features.

  • log (bool) – Whether to apply log1p transformation.

  • norm (bool) – Whether to normalize features.

  • z_score (bool) – Whether to z-score features.

  • tfidf (bool) – Whether to apply TF-IDF normalization.

  • svd (bool) – Whether to apply SVD transformation.

  • use_counts (bool) – Use .layers['counts'] instead of .X for data.

  • pcaAlgo (str) – SVD backend (e.g., 'arpack').

fit(adata: AnnData)

Fit the transformer on an AnnData object.

Parameters:

adata (AnnData) – Input data.

Return type:

None

fit_transform(adata)

Fit and transform an AnnData object.

Parameters:

adata (AnnData) – Input data.

Returns:

Low-dimensional representation.

Return type:

pandas.DataFrame

transform(adata)

Transform AnnData using the fitted transformer.

Parameters:

adata (AnnData) – Data to transform.

Returns:

Low-dimensional representation with index aligned to adata.obs_names.

Return type:

pandas.DataFrame
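
The stages named by the constructor flags (TF-IDF, log transform, SVD, dropping the first component, z-scoring) can be sketched end to end in NumPy. This is an illustration of a generic LSI pipeline, not the class's actual code: the function name lsi_sketch, the order of the stages, the particular TF-IDF formula, computing one extra component when drop_first is set, and z-scoring each component across cells are all assumptions.

```python
import numpy as np

def lsi_sketch(counts, n_components=20, drop_first=True, scale=1e4):
    """Illustrative LSI: TF-IDF -> log1p -> SVD -> drop first -> z-score."""
    X = np.asarray(counts, dtype=float)
    eps = 1e-8  # guards against division by zero for empty rows/columns

    # TF: each cell's counts as fractions; IDF: n_cells / feature total.
    tf = X / (X.sum(axis=1, keepdims=True) + eps)
    idf = X.shape[0] / (X.sum(axis=0, keepdims=True) + eps)
    X = np.log1p(tf * idf * scale)

    # Compute one extra component if the first is going to be discarded.
    k = n_components + int(drop_first)
    if k > min(X.shape):
        raise ValueError("too many components for this matrix")
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    emb = U[:, :k] * S[:k]
    if drop_first:
        emb = emb[:, 1:]

    # z-score every remaining component across cells.
    return (emb - emb.mean(axis=0)) / (emb.std(axis=0) + eps)

rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(30, 12))
toy_embedding = lsi_sketch(toy_counts, n_components=3)
```

A full dense SVD is used here for brevity; on real peak matrices a truncated solver (as the pcaAlgo='arpack' option suggests) would be the practical choice.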

spamosaic.preprocessing.sparse_log1p_scale(X, scale=10000.0)

Apply log1p transformation to sparse or dense matrix, scaled by a factor.

Parameters:
  • X (Union[scipy.sparse.spmatrix, np.ndarray]) – Input expression matrix.

  • scale (float, default=1e4) – Scaling factor applied before log1p.

Returns:

Transformed matrix (same type as input).

Return type:

Union[scipy.sparse.spmatrix, np.ndarray]
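
Reading the parameter description literally, the transform is log1p(X * scale). The dense-only sketch below illustrates that reading; log1p_scale_dense is a hypothetical name, and the library's handling of sparse input is not reproduced here.

```python
import numpy as np

def log1p_scale_dense(X, scale=1e4):
    """Multiply by `scale`, then apply log1p element-wise (dense input only)."""
    X = np.asarray(X, dtype=float)
    return np.log1p(X * scale)

X = np.array([[0.0, 0.5],
              [0.25, 0.0]])
Y = log1p_scale_dense(X)
```

Since log1p(0 * scale) is 0, zero entries stay zero, which is why the same operation can be applied to only the stored nonzero values of a sparse matrix without changing its sparsity pattern.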

class spamosaic.preprocessing.tfidfTransformer

Bases: object

TF-IDF Transformer for sparse count data.

fit(X)

Compute IDF vector from input data.

Parameters:

X (array-like or sparse matrix) – Count matrix.

Return type:

None

fit_transform(X)

Fit to data, then transform it.

Parameters:

X (array-like or sparse matrix) – Input count matrix.

Returns:

TF-IDF transformed matrix.

Return type:

array-like or sparse matrix

transform(X)

Apply TF-IDF transformation using precomputed IDF.

Parameters:

X (array-like or sparse matrix) – Count matrix to transform.

Returns:

TF-IDF transformed matrix.

Return type:

array-like or sparse matrix
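
The fit/transform split above separates learning the IDF vector from applying it. A minimal dense sketch of that pattern follows; SimpleTfidf is a hypothetical stand-in, and the formula used (IDF as number of cells divided by each feature's total count, TF as each cell's counts divided by its row sum) is one common formulation that is assumed here rather than taken from the library.

```python
import numpy as np

class SimpleTfidf:
    """Illustrative dense TF-IDF with a separate fit and transform step."""

    def __init__(self):
        self.idf = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # IDF: number of rows (cells) over each feature's total count.
        self.idf = X.shape[0] / (X.sum(axis=0) + 1e-8)

    def transform(self, X):
        if self.idf is None:
            raise RuntimeError("call fit() before transform()")
        X = np.asarray(X, dtype=float)
        # TF: each row rescaled to sum to (approximately) one.
        tf = X / (X.sum(axis=1, keepdims=True) + 1e-8)
        return tf * self.idf

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

train = np.array([[1, 1],
                  [0, 2]])
model = SimpleTfidf()
train_tfidf = model.fit_transform(train)
# New data reuses the IDF learned from `train`.
new_tfidf = model.transform(np.array([[2, 2]]))
```

Keeping the IDF from fit means held-out batches are weighted with the same feature weights as the data the transformer was fitted on.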
