
CLiMB Separates the Known from the New to Enable Discovery in Scientific Data
A constant challenge in data science is to classify familiar patterns and detect unexpected anomalies at the same time. 🧩 Current semi-supervised clustering methods often fail at this dual objective because they assume that the guiding signals describe the whole of reality. As a result, they either impose strict boundaries that can hide surprising findings or require the number of groups to be fixed in advance, which limits the chance of finding genuine novelty.
A Framework that Decouples Exploration
CLiMB (CLustering in Multiphase Boundaries) is a framework designed to close this gap by explicitly separating the exploitation of prior knowledge from the exploration of the unknown. Its architecture avoids the rigid assumptions that characterize other approaches.
CLiMB's Two-Stage Operation:
- Anchoring Phase: Establishes and fixes the already known groups using a constrained partition, allowing maximum use of available prior information.
- Exploration Phase: Applies a density-based clustering technique to the unclassified residual data, enabling the revelation of arbitrary and unknown topologies without forcing a predefined structure.
- Fundamental Separation: This sequential division between exploiting and exploring is the basis of its operation and distinguishes it from other methods.
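The two phases above can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not CLiMB's actual implementation: the anchoring step here is a simple nearest-centroid assignment with a distance cutoff (a stand-in for the constrained partition), the exploration step is a plain DBSCAN, and the names `anchor`, `two_phase`, `radius`, `eps` and `min_pts` are hypothetical.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def anchor(points, seeds, radius):
    """Phase 1 (assumed stand-in for the constrained partition): give each
    point the label of the nearest known-group centroid if it lies within
    `radius`; otherwise leave it unassigned (None)."""
    centers = {label: centroid(examples) for label, examples in seeds.items()}
    labels = []
    for p in points:
        label, d = min(((l, math.dist(p, c)) for l, c in centers.items()),
                       key=lambda pair: pair[1])
        labels.append(label if d <= radius else None)
    return labels

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one cluster id per point, -1 for noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1            # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels

def two_phase(points, seeds, radius, eps, min_pts):
    """Anchor the known groups first, then explore only the residual."""
    labels = anchor(points, seeds, radius)
    residual_idx = [i for i, l in enumerate(labels) if l is None]
    residual = [points[i] for i in residual_idx]
    for i, c in zip(residual_idx, dbscan(residual, eps, min_pts)):
        labels[i] = "noise" if c == -1 else f"new_{c}"
    return labels

# Toy data: two known groups, one unlabeled dense group, one isolated point.
points = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5),
          (10, 0), (10.5, 0), (10, 0.5), (10.5, 0.5),
          (5, 8), (5.5, 8), (5, 8.5), (5.5, 8.5),
          (20, 20)]
seeds = {"A": [(0, 0), (0.5, 0.5)], "B": [(10, 0), (10.5, 0.5)]}
labels = two_phase(points, seeds, radius=2.0, eps=1.0, min_pts=3)
print(labels)
```

On this toy data the known groups keep their labels "A" and "B", the dense unlabeled group comes out as a new cluster, and the isolated point is left as noise. The point of the design is that the second phase never sees the anchored points, so the number and shape of the new clusters are not fixed in advance.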
If your clustering algorithm forces data to fit into known boxes, it might be suppressing the next big revelation.
Validation with Cosmic Information
The effectiveness of this framework is demonstrated on real RR Lyrae star data from Gaia Data Release 3. CLiMB achieves an Adjusted Rand Index of 0.829 with 90% coverage in recovering documented substructures of the Milky Way. 🪐
Key Results from Validation:
- Clear Superiority: It clearly outperforms heuristic methods and those that only use constraints, which plateau at values below 0.20.
- Proven Efficiency: A sensitivity analysis confirms that its performance improves monotonically as the amount of initially available knowledge increases.
- Validated Discovery: The framework successfully isolates three dynamic features (Shiva, Shakti, and the Galactic Disk) within the unlabeled data field, proving its potential for scientific discoveries.
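For readers unfamiliar with the two reported metrics, the following sketch shows one way an Adjusted Rand Index and a coverage figure can be computed together. It assumes that coverage means the fraction of points assigned to any cluster and that the index is evaluated only on those assigned points; the source does not spell out its exact protocol, so treat both as assumptions. The function names and the `unassigned` marker are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Standard pair-counting ARI between two labelings of the same points."""
    total = comb(len(truth), 2)
    if total == 0:
        return 1.0                      # fewer than two points: trivially identical
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / total    # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0                      # degenerate case, e.g. all singletons
    return (sum_ij - expected) / (max_index - expected)

def ari_with_coverage(truth, pred, unassigned="noise"):
    """ARI on the assigned points only, plus the fraction of points assigned
    (assumed definition of coverage)."""
    kept = [(t, p) for t, p in zip(truth, pred) if p != unassigned]
    coverage = len(kept) / len(truth)
    if not kept:
        return float("nan"), coverage
    t, p = zip(*kept)
    return adjusted_rand_index(t, p), coverage

# Toy example: a perfect grouping that leaves one point out of ten unassigned.
truth = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]
pred = ["x", "x", "x", "y", "y", "y", "z", "z", "z", "noise"]
ari, coverage = ari_with_coverage(truth, pred)
print(ari, coverage)

# Lumping everything into a single cluster scores no better than chance.
ari_single, coverage_single = ari_with_coverage(truth, ["x"] * 10)
print(ari_single, coverage_single)
```

Reporting the two numbers side by side matters because a method could inflate its index by assigning only the easiest points; the coverage figure shows how much of the data the score actually refers to.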
Implications for Data Analysis
CLiMB offers a practical solution to the dual problem of classifying and discovering. By decoupling the exploitation and exploration phases, it avoids suppressing unexpected patterns and allows genuine novelty to emerge in the margins of the data. Its validation with real astronomical information underscores its utility for complex scientific scenarios where not everything is predefined. 🔭