nispace.datasets.fetch_collection

nispace.datasets.fetch_collection(collection, dataset=None, maps=None, set_size_range=None, weight_range=None, weight_quantile=None, set_top_n=None, set_specificity=None, return_maps=False, nispace_data_dir=None, overwrite=False, check_file_hash=True, verbose=True)

Fetch a collection that defines a subset (and optional grouping) of maps.

A collection is a mapping from map IDs to optional set labels and weights. The result is a DataFrame with columns ["map"], ["set", "map"], or ["set", "map", "weight"] depending on the collection content.

Three .collect file formats are supported:

Simple list: plain text with a "map" header line, then one map ID per line.
JSON set dict: {"set_name": ["map1", "map2", ...], ...}.
CSV set table: columns are ["map"], ["set", "map"], or ["set", "map", "weight"].
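
The three layouts can be illustrated with a small standard-library sketch. This is not NiSpace's own parser: the helper names (parse_simple_list, parse_json_sets, parse_csv_table) are hypothetical, and the handling of the "map" header line is an assumption based on the description above. It only shows how each format would map onto rows with "map", "set", and "weight" fields.

```python
import csv
import io
import json


def parse_simple_list(text):
    """Simple list: one map ID per line, preceded by a 'map' header line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Assumption: the first line is dropped only if it is the literal header "map".
    if lines and lines[0] == "map":
        lines = lines[1:]
    return [{"map": m} for m in lines]


def parse_json_sets(text):
    """JSON set dict: expand {"set": [maps...]} into one row per (set, map) pair."""
    sets = json.loads(text)
    return [{"set": s, "map": m} for s, members in sets.items() for m in members]


def parse_csv_table(text):
    """CSV set table: header row names the columns; weight becomes a float if present."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row = dict(row)
        if "weight" in row:
            row["weight"] = float(row["weight"])
        rows.append(row)
    return rows


simple_rows = parse_simple_list("map\nmapA\nmapB\n")
json_rows = parse_json_sets('{"set1": ["mapA", "mapB"], "set2": ["mapB"]}')
csv_rows = parse_csv_table("set,map,weight\nset1,mapA,0.9\nset1,mapB,0.2\n")
```

Note that a map may appear in more than one set (mapB above), which is why the grouped formats yield one row per (set, map) pair rather than one row per map.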

Parameters:

collection (str, Path, ndarray, DataFrame, Series, or list) – When dataset is given: the name of an integrated collection (e.g. "All", "BrainSpanWeights"). When dataset is None: a path to a .collect file, or an in-memory array-like / DataFrame that is used directly.
dataset (str, optional) – Name of an integrated NiSpace reference dataset (e.g. "mrna", "pet"). If provided, collection must be the name of one of that dataset’s registered collections.
maps (list, optional) – Restrict to this subset of map IDs after loading.
set_size_range (tuple (int, int), optional) – Keep only sets whose membership count falls within [min, max] (inclusive).
weight_range (tuple (float, float), optional) – Keep only entries whose weight is within [min, max] (inclusive). Ignored when the collection has no weights.
weight_quantile (float, optional) – Within each set, keep only entries whose weight is at or above this quantile of that set's weights. Ignored when the collection has no weights.
set_top_n (int, optional) – Within each set, keep only the set_top_n entries with the highest weight. Ignored when the collection has no weights.
set_specificity (float in (0, 1], optional) – Keep only maps that appear in at most this fraction of all sets, i.e. discard ubiquitous maps.
return_maps (bool, default False) – If True, return a tuple (collection_df, maps_avail) where maps_avail is the deduplicated list of map IDs after all filters.
nispace_data_dir (str or Path, optional) – Override the NiSpace data directory (default: $NISPACE_DATA_DIR).
overwrite (bool, default False) – Re-download the collection file even if it is already cached.
check_file_hash (bool, default True) – Verify the SHA-256 hash of the downloaded file.
verbose (bool, default True) – Print progress messages.
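
To make the filter semantics concrete, here is a minimal pure-Python sketch of four of the filters, applied to a toy collection of (set, map, weight) rows. It follows the parameter descriptions above but is not the library's implementation: the helper names are hypothetical, the order in which NiSpace applies the filters is not stated here, and weight_quantile is left out because the quantile method is not specified.

```python
from collections import Counter, defaultdict

# Toy collection: (set, map, weight). Map "a" appears in every set.
rows = [
    ("s1", "a", 0.9), ("s1", "b", 0.5), ("s1", "c", 0.1),
    ("s2", "a", 0.8), ("s2", "d", 0.4),
    ("s3", "a", 0.7),
]


def filter_set_size(rows, size_range):
    """set_size_range: keep sets whose member count lies in [min, max]."""
    lo, hi = size_range
    sizes = Counter(s for s, _, _ in rows)
    return [r for r in rows if lo <= sizes[r[0]] <= hi]


def filter_weight_range(rows, weight_range):
    """weight_range: keep entries whose weight lies in [min, max]."""
    lo, hi = weight_range
    return [r for r in rows if lo <= r[2] <= hi]


def filter_top_n(rows, n):
    """set_top_n: within each set, keep the n highest-weighted entries."""
    by_set = defaultdict(list)
    for r in rows:
        by_set[r[0]].append(r)
    kept = []
    for members in by_set.values():
        kept.extend(sorted(members, key=lambda r: r[2], reverse=True)[:n])
    return kept


def filter_specificity(rows, max_fraction):
    """set_specificity: drop maps that occur in more than max_fraction of all sets."""
    n_sets = len({s for s, _, _ in rows})
    sets_per_map = defaultdict(set)
    for s, m, _ in rows:
        sets_per_map[m].add(s)
    return [r for r in rows if len(sets_per_map[r[1]]) / n_sets <= max_fraction]


by_size = filter_set_size(rows, (2, 3))          # drops the single-member set s3
by_weight = filter_weight_range(rows, (0.4, 0.8))
top_one = filter_top_n(rows, 1)                  # one entry per set
specific = filter_specificity(rows, 0.5)         # drops "a", present in 3 of 3 sets
```

In this toy data the specificity filter removes every row for map "a" and, as a side effect, empties set s3. Whether empty sets are then reported or dropped by the real function is not covered by this sketch.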

Returns:

collection_df (DataFrame) – Columns: ["map"] for unstructured collections; ["set", "map"] for grouped collections; ["set", "map", "weight"] for weighted collections.
maps_avail (list of str) – Only returned when return_maps=True. Deduplicated map IDs present in collection_df after filtering.
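
A call that requests both return values might look like the sketch below. The wrapper name load_brainspan_sets and its default thresholds are hypothetical, and pairing the "BrainSpanWeights" collection with the "mrna" dataset is an assumption drawn from the example names above, not a confirmed registration. Only keyword arguments from the documented signature are used.

```python
def load_brainspan_sets(min_size=5, max_size=200, top_n=50):
    """Fetch a weighted integrated collection plus the map IDs that survive filtering."""
    # Imported lazily so that defining this helper does not itself require NiSpace.
    from nispace.datasets import fetch_collection

    # Assumption: "BrainSpanWeights" is registered under the "mrna" dataset.
    collection_df, maps_avail = fetch_collection(
        "BrainSpanWeights",
        dataset="mrna",
        set_size_range=(min_size, max_size),  # inclusive bounds on set membership
        set_top_n=top_n,                      # highest-weighted entries per set
        return_maps=True,                     # request the (collection_df, maps_avail) tuple
        verbose=False,
    )
    return collection_df, maps_avail
```

With return_maps left at its default of False, the same call would return only collection_df, so the tuple unpacking above would need to be removed.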