spamosaic.MNN

Nearest-neighbor utilities and MNN matching for SpaMosaic.

Implements exact kNN (scikit-learn), approximate kNN (HNSW, Annoy), and mutual nearest-neighbor (MNN) pairing.

spamosaic.MNN.mnn(ds1, ds2, names1, names2, knn1=20, knn2=20, approx=True, metric='euclidean', way='hnsw', norm=False)

Compute mutual nearest neighbors (MNN) between two datasets.

Parameters:

ds1 (np.ndarray) – First dataset (queries), shape (N1, D).
ds2 (np.ndarray) – Second dataset (references), shape (N2, D).
names1 (list of str) – Identifiers for rows in ds1.
names2 (list of str) – Identifiers for rows in ds2.
knn1 (int, default=20) – Number of nearest neighbors in ds2 retrieved for each row of ds1.
knn2 (int, default=20) – Number of nearest neighbors in ds1 retrieved for each row of ds2.
approx (bool, default=True) – If True, use approximate search (HNSW/Annoy); otherwise exact kNN.
metric (str, default='euclidean') – Distance metric used when way='annoy'.
way (str, default='hnsw') – Approximation backend, 'hnsw' or 'annoy'; relevant only when approx=True.
norm (bool, default=False) – Whether to normalize inputs before Annoy search (ignored for HNSW/exact).

Returns:

Set of mutual nearest-neighbor pairs.

Return type:

set[tuple[str, str]]
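
The idea of mutual nearest-neighbor pairing can be illustrated with a minimal brute-force sketch. This is not SpaMosaic's implementation: the helpers knn_pairs and mutual_pairs are hypothetical names, the search is exact Euclidean kNN in pure Python, and the (ds1 name, ds2 name) ordering of the returned tuples is an assumption.

```python
import math

def knn_pairs(queries, refs, qnames, rnames, k):
    """Brute-force kNN: return {(query_name, ref_name)} for the k closest refs."""
    pairs = set()
    for q, qn in zip(queries, qnames):
        # Rank reference rows by Euclidean distance to this query row.
        order = sorted(range(len(refs)), key=lambda j: math.dist(q, refs[j]))
        for j in order[:k]:
            pairs.add((qn, rnames[j]))
    return pairs

def mutual_pairs(ds1, ds2, names1, names2, knn1, knn2):
    """Keep only the pairs found in both search directions (hypothetical helper)."""
    forward = knn_pairs(ds1, ds2, names1, names2, knn1)   # ds1 -> ds2
    backward = knn_pairs(ds2, ds1, names2, names1, knn2)  # ds2 -> ds1
    # Flip backward pairs to (ds1_name, ds2_name) before intersecting.
    return forward & {(a, b) for (b, a) in backward}

# Toy data: two points in ds1, three in ds2.
ds1 = [(0.0, 0.0), (5.0, 5.0)]
ds2 = [(0.1, 0.0), (0.2, 0.0), (5.0, 5.1)]
pairs = mutual_pairs(ds1, ds2, ["a0", "a1"], ["b0", "b1", "b2"], knn1=1, knn2=1)
print(sorted(pairs))
```

In this toy case b1 names a0 as its nearest neighbor, but a0 prefers b0, so the pair (a0, b1) is not mutual and is dropped. Raising knn1 or knn2 loosens that filter.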

spamosaic.MNN.nn(ds1, ds2, names1, names2, knn=50, metric_p=2)

Exact nearest-neighbor search using scikit-learn.

Parameters:

ds1 (np.ndarray) – Query dataset of shape (N1, D).
ds2 (np.ndarray) – Reference dataset of shape (N2, D).
names1 (list of str) – Identifiers for rows in ds1.
names2 (list of str) – Identifiers for rows in ds2.
knn (int, default=50) – Number of nearest neighbors to retrieve for each query.
metric_p (int, default=2) – Minkowski distance parameter p (e.g., 1 for Manhattan, 2 for Euclidean).

Returns:

Set of matched nearest-neighbor pairs.

Return type:

set[tuple[str, str]]
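
To show what the Minkowski parameter changes, here is a small pure-Python sketch of exact kNN. It is an illustration only: exact_knn and minkowski are hypothetical helpers, not the scikit-learn-based code behind nn, and the (query name, reference name) tuple order is assumed.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def exact_knn(ds1, ds2, names1, names2, knn, metric_p=2):
    """Brute-force kNN returning {(query_name, reference_name)} (hypothetical helper)."""
    pairs = set()
    for q, qn in zip(ds1, names1):
        # Rank every reference row by its distance to the query.
        order = sorted(range(len(ds2)), key=lambda j: minkowski(q, ds2[j], metric_p))
        pairs.update((qn, names2[j]) for j in order[:knn])
    return pairs

query = [(0.0, 0.0)]
refs = [(3.0, 3.0), (0.0, 5.0)]
names = ["diag", "axis"]

l2 = exact_knn(query, refs, ["q"], names, knn=1, metric_p=2)  # Euclidean
l1 = exact_knn(query, refs, ["q"], names, knn=1, metric_p=1)  # Manhattan
print(l2, l1)
```

With p=2 the diagonal point is closer (about 4.24 versus 5), while with p=1 the on-axis point wins (5 versus 6), so the choice of metric_p can change which pairs are returned.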

spamosaic.MNN.nn_annoy(ds1, ds2, names1, names2, norm=True, knn=20, metric='euclidean', n_trees=10, save_on_disk=False)

Approximate nearest-neighbor search using an Annoy index.

Parameters:

ds1 (np.ndarray) – Query dataset of shape (N1, D).
ds2 (np.ndarray) – Reference dataset of shape (N2, D).
names1 (list of str) – Identifiers for rows in ds1.
names2 (list of str) – Identifiers for rows in ds2.
norm (bool, default=True) – Whether to L2-normalize datasets before indexing/search.
knn (int, default=20) – Number of nearest neighbors to retrieve.
metric (str, default='euclidean') – Distance metric (e.g., 'euclidean', 'manhattan').
n_trees (int, default=10) – Number of trees to build in the Annoy index.
save_on_disk (bool, default=False) – If True, write the index to disk.

Returns:

Set of nearest-neighbor pairs.

Return type:

set[tuple[str, str]]
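
The effect of the norm option can be seen in a short sketch that does not use Annoy at all. It assumes row-wise L2 normalization followed by a Euclidean search; l2_normalize and nearest are hypothetical helpers written for this illustration.

```python
import math

def l2_normalize(rows):
    """Scale each row to unit Euclidean length (zero rows are left unchanged)."""
    out = []
    for r in rows:
        n = math.sqrt(sum(x * x for x in r))
        out.append(tuple(x / n for x in r) if n > 0 else tuple(r))
    return out

def nearest(query, refs):
    """Index of the reference row closest to the query in Euclidean distance."""
    return min(range(len(refs)), key=lambda j: math.dist(query, refs[j]))

query = (1.0, 0.0)
refs = [(10.0, 1.0), (0.5, 0.5)]  # same direction but far away vs. nearby but rotated

raw_idx = nearest(query, refs)            # driven by magnitude
q_n = l2_normalize([query])[0]
refs_n = l2_normalize(refs)
norm_idx = nearest(q_n, refs_n)           # driven by direction only

# For unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity.
cos0 = sum(a * b for a, b in zip(q_n, refs_n[0]))
sq_dist0 = math.dist(q_n, refs_n[0]) ** 2
print(raw_idx, norm_idx, sq_dist0, 2 - 2 * cos0)
```

Without normalization the nearby point at index 1 is selected; after normalization the point at index 0, which shares the query's direction, becomes the nearest. On unit vectors a Euclidean search therefore ranks neighbors the same way cosine similarity would.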

spamosaic.MNN.nn_approx(ds1, ds2, names1, names2, knn=50)

Approximate nearest-neighbor search using HNSW (hnswlib).

Parameters:

ds1 (np.ndarray) – Query dataset of shape (N1, D).
ds2 (np.ndarray) – Reference dataset of shape (N2, D).
names1 (list of str) – Identifiers for rows in ds1.
names2 (list of str) – Identifiers for rows in ds2.
knn (int, default=50) – Number of nearest neighbors to find for each query.

Returns:

Set of matched (query_name, reference_name) pairs.

Return type:

set[tuple[str, str]]
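
Because the exact and approximate searches share the same return type, one way to judge an approximate result is to compare its pair set against an exact one. The sketch below is a hypothetical helper, not part of spamosaic.MNN, and the pair sets are made-up placeholders rather than output from the library.

```python
def pair_recall(approx_pairs, exact_pairs):
    """Fraction of exact neighbor pairs that the approximate search also found."""
    if not exact_pairs:
        return 1.0  # nothing to recover, so treat recall as perfect
    return len(approx_pairs & exact_pairs) / len(exact_pairs)

# Placeholder pair sets in the set[tuple[str, str]] shape used on this page.
exact = {("q0", "r0"), ("q0", "r1"), ("q1", "r2"), ("q1", "r3")}
approx = {("q0", "r0"), ("q0", "r1"), ("q1", "r2"), ("q1", "r7")}

recall = pair_recall(approx, exact)
print(recall)
```

A recall close to 1.0 would suggest the approximate backend is recovering nearly all exact neighbors at the chosen knn; a low value may justify a larger knn or an exact search.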

Functions

`spamosaic.MNN.mnn`	Compute mutual nearest neighbors (MNN) between two datasets.
`spamosaic.MNN.nn`	Exact nearest-neighbor search using scikit-learn.
`spamosaic.MNN.nn_annoy`	Approximate nearest-neighbor search using an Annoy index.
`spamosaic.MNN.nn_approx`	Approximate nearest-neighbor search using HNSW (hnswlib).