spamosaic.preprocessing

Preprocessing utilities for SpaMosaic.

Implements TF-IDF/LSI pipelines, CLR normalization, Harmony batch correction, and modality-specific preprocessing for RNA/ADT/epigenome.

spamosaic.preprocessing.ADT_preprocess(adt_ads, batch_corr=False, favor='clr', lognorm=True, scale=False, n_comps=50, batch_key='src', key='dimred_bc')[source]

Preprocessing pipeline for ADT (protein) modality.

Parameters:

adt_ads (list of AnnData) – ADT modality per batch.
batch_corr (bool) – Whether to perform batch correction.
favor ({'clr', 'lognorm'}) – Normalization method: CLR ('clr') or log-normalization ('lognorm').
lognorm (bool) – Whether to apply log-normalization (used when favor='lognorm').
scale (bool) – Whether to scale features.
n_comps (int) – Number of components for PCA.
batch_key (str) – Key for batch annotation.
key (str) – Key to store reduced dimension result.

Return type:

None

spamosaic.preprocessing.Epigenome_preprocess(epi_ads, batch_corr=False, n_peak=100000, n_comps=50, batch_key='src', key='dimred_bc', return_hvf=False)[source]

Preprocessing pipeline for epigenomic modality (e.g., ATAC-seq).

Parameters:

epi_ads (list of AnnData) – Epigenomic modality per batch.
batch_corr (bool) – Whether to apply Harmony batch correction.
n_peak (int) – Number of variable peaks to keep.
n_comps (int) – Number of LSI components.
batch_key (str) – Batch identifier key.
key (str) – Output key in .obsm.
return_hvf (bool) – Whether to return the names and indices of the selected peaks.

Returns:

If return_hvf is True, returns (peak_names, indices); otherwise None.

Return type:

Optional[Tuple[np.ndarray, np.ndarray]]

spamosaic.preprocessing.RNA_preprocess(rna_ads, batch_corr=False, favor='adapted', n_hvg=5000, lognorm=True, scale=False, n_comps=50, batch_key='src', key='dimred_bc', return_hvf=False)[source]

Preprocessing pipeline for RNA modality.

Parameters:

rna_ads (list of AnnData) – RNA modality per batch.
batch_corr (bool) – Whether to perform batch correction.
favor ({'adapted', 'scanpy'}) – Which pipeline to use.
n_hvg (int) – Number of highly variable genes.
lognorm (bool) – Whether to apply log-normalization.
scale (bool) – Whether to scale features.
n_comps (int) – Number of output components.
batch_key (str) – Key in .obs indicating batch identity.
key (str) – Key to store result in .obsm.
return_hvf (bool) – If True, return the names and indices of the selected HVGs.

Returns:

If return_hvf is True, returns (gene_names, indices); otherwise None.

Return type:

Optional[Tuple[np.ndarray, np.ndarray]]
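
To illustrate the shape of the (gene_names, indices) pair returned when return_hvf is True, the following is a minimal, dependency-free sketch. The helper name top_variable_features is hypothetical, and it ranks genes by plain variance, which is a simplification and not necessarily the selection criterion RNA_preprocess uses.

```python
from statistics import pvariance

def top_variable_features(matrix, feature_names, n_top):
    """Return (names, indices) of the n_top highest-variance columns."""
    # Variance of each column (feature) across rows (cells).
    variances = [pvariance(column) for column in zip(*matrix)]
    # Rank column indices by decreasing variance, keep n_top, restore column order.
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    indices = sorted(ranked[:n_top])
    names = [feature_names[j] for j in indices]
    return names, indices

# Toy matrix: 3 cells x 3 genes (hypothetical values).
expression = [[0, 5, 1], [0, 1, 1], [0, 9, 2]]
genes = ["geneA", "geneB", "geneC"]
hvg_names, hvg_indices = top_variable_features(expression, genes, n_top=2)
```

The two returned sequences are aligned, so hvg_indices can be used to subset the feature axis while hvg_names records which genes were kept.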

spamosaic.preprocessing.clr_normalize(adata)[source]

Perform centered log-ratio (CLR) normalization on count data.

Parameters:

adata (AnnData) – Input data with count matrix in .X.

Returns:

Normalized AnnData object.

Return type:

AnnData
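
As a rough illustration of what a centered log-ratio transform computes, the sketch below applies the textbook form (log of each value minus the mean log of its row, with a pseudocount of 1) to plain Python lists. The function name clr_rows, the pseudocount, and the per-row (per-cell) direction are assumptions; clr_normalize may use a different CLR variant.

```python
import math

def clr_rows(counts, pseudocount=1.0):
    """Centered log-ratio transform applied independently to each row (cell)."""
    normalized = []
    for row in counts:
        logs = [math.log(value + pseudocount) for value in row]
        mean_log = sum(logs) / len(logs)  # log of the row's geometric mean
        normalized.append([value - mean_log for value in logs])
    return normalized

# Toy ADT counts: 2 cells x 3 proteins (hypothetical values).
counts = [[10, 0, 5], [3, 3, 3]]
clr = clr_rows(counts)
```

In this form each transformed row sums to zero, and a cell with identical counts for every protein maps to all zeros.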

spamosaic.preprocessing.harmony(latent, batch_labels, use_gpu=True)[source]

Batch correction using Harmony.

Parameters:

latent (np.ndarray) – Low-dimensional representation (e.g., PCA).
batch_labels (list or array) – Corresponding batch annotations.
use_gpu (bool, default=True) – Whether to use GPU acceleration.

Returns:

Batch-corrected latent representation.

Return type:

np.ndarray

class spamosaic.preprocessing.lsiTransformer(n_components: int = 20, drop_first=True, use_highly_variable=None, log=True, norm=True, z_score=True, tfidf=True, svd=True, use_counts=False, pcaAlgo='arpack')[source]

Bases: object

Latent Semantic Indexing (LSI) pipeline for dimensionality reduction.

Parameters:

n_components (int) – Number of SVD components.
drop_first (bool) – Whether to drop the first SVD component.
use_highly_variable (bool or None) – Whether to subset to highly variable features.
log (bool) – Whether to apply log1p transformation.
norm (bool) – Whether to normalize features.
z_score (bool) – Whether to z-score features.
tfidf (bool) – Whether to apply TF-IDF normalization.
svd (bool) – Whether to apply SVD transformation.
use_counts (bool) – Use .layers['counts'] instead of .X for data.
pcaAlgo (str) – SVD backend (e.g., 'arpack').

fit(adata: AnnData)[source]

Fit the transformer on an AnnData object.

Parameters:

adata (AnnData) – Input data.

Return type:

None

fit_transform(adata)[source]

Fit and transform an AnnData object.

Parameters:

adata (AnnData) – Input data.

Returns:

Low-dimensional representation.

Return type:

pandas.DataFrame

transform(adata)[source]

Transform AnnData using the fitted transformer.

Parameters:

adata (AnnData) – Data to transform.

Returns:

Low-dimensional representation with index aligned to adata.obs_names.

Return type:

pandas.DataFrame

spamosaic.preprocessing.sparse_log1p_scale(X, scale=10000.0)[source]

Apply a log1p transformation to a sparse or dense matrix after scaling it by a factor.

Parameters:

X (Union[scipy.sparse.spmatrix, np.ndarray]) – Input expression matrix.
scale (float, default=1e4) – Scaling factor applied before log1p.

Returns:

Transformed matrix (same type as input).

Return type:

Union[scipy.sparse.spmatrix, np.ndarray]
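
The sketch below shows the scale-then-log1p step on a plain list of values. It assumes the factor is simply multiplied in before log1p. Whether sparse_log1p_scale also normalizes by per-row totals is not stated here, and the helper name log1p_scaled is hypothetical.

```python
import math

def log1p_scaled(values, scale=1e4):
    """Multiply each value by `scale`, then apply log(1 + x)."""
    return [math.log1p(value * scale) for value in values]

# Sparse matrices store only their non-zero entries. Because log1p(0) == 0,
# transforming just those stored values leaves the sparsity pattern intact.
stored_values = [0.0002, 0.0010, 0.0500]  # hypothetical normalized entries
transformed = log1p_scaled(stored_values)
```

Because zeros stay zero, the same transformation can be applied to the stored entries of a sparse matrix without densifying it, which is consistent with the output having the same type as the input.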

class spamosaic.preprocessing.tfidfTransformer[source]

Bases: object

TF-IDF Transformer for sparse count data.

fit(X)[source]

Compute IDF vector from input data.

Parameters:

X (array-like or sparse matrix) – Count matrix.

Return type:

None

fit_transform(X)[source]

Fit to data, then transform it.

Parameters:

X (array-like or sparse matrix) – Input count matrix.

Returns:

TF-IDF transformed matrix.

Return type:

array-like or sparse matrix

transform(X)[source]

Apply TF-IDF transformation using precomputed IDF.

Parameters:

X (array-like or sparse matrix) – Count matrix to transform.

Returns:

TF-IDF transformed matrix.

Return type:

array-like or sparse matrix
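
The fit/transform split above can be sketched in plain Python as follows. The class name TfidfSketch is hypothetical, and the weighting (term frequency as the row-normalized count, IDF as the number of rows divided by the feature total) is an assumption based on a formulation common in LSI pipelines; tfidfTransformer may weight differently.

```python
class TfidfSketch:
    """Minimal TF-IDF with an IDF vector learned in fit() and reused in transform()."""

    def fit(self, X):
        n_rows = len(X)
        col_sums = [sum(column) for column in zip(*X)]
        # Features never observed get an IDF of 0 to avoid division by zero.
        self.idf = [n_rows / s if s > 0 else 0.0 for s in col_sums]

    def transform(self, X):
        result = []
        for row in X:
            total = sum(row)
            # Term frequency: each count divided by the row total.
            tf = [value / total if total > 0 else 0.0 for value in row]
            result.append([t * w for t, w in zip(tf, self.idf)])
        return result

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

# Fit on one count matrix, then reuse the learned IDF on new rows.
train = [[1, 0, 1], [0, 2, 2]]
model = TfidfSketch()
tfidf_train = model.fit_transform(train)
tfidf_new = model.transform([[1, 1, 0]])
```

Keeping the IDF vector from fit means new data is weighted on the same scale as the data the transformer was fitted on.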

Functions

`spamosaic.preprocessing.ADT_preprocess`	Preprocessing pipeline for ADT (protein) modality.
`spamosaic.preprocessing.Epigenome_preprocess`	Preprocessing pipeline for epigenomic modality (e.g., ATAC-seq).
`spamosaic.preprocessing.RNA_preprocess`	Preprocessing pipeline for RNA modality.
`spamosaic.preprocessing.clr_normalize`	Perform centered log-ratio (CLR) normalization on count data.
`spamosaic.preprocessing.harmony`	Batch correction using Harmony.
`spamosaic.preprocessing.sparse_log1p_scale`	Apply a log1p transformation to a sparse or dense matrix after scaling it by a factor.

Classes

`spamosaic.preprocessing.lsiTransformer`	Latent Semantic Indexing (LSI) pipeline for dimensionality reduction.
`spamosaic.preprocessing.tfidfTransformer`	TF-IDF Transformer for sparse count data.