Such feature selection tasks are very demanding. The dependencies in data can be of any type (e.g., linear, non-linear, multivariate), while the data can be of any type as well (i.e., mixtures of discrete and continuous attributes). We have to accommodate all types in our analysis, because otherwise we can miss important interactions (e.g., linear methods cannot find non-linear relationships). For interpretability, the degree of dependence should be meaningfully summarized (e.g., “target Y depends 50% on X”, or “the variables in set X exhibit 80% dependence”). From a statistical perspective, the degree of dependence should be robustly measured from data samples of any size to minimize spurious discoveries. Lastly, the combinatorial optimization problem to discover the top dependencies out of all possible dependencies is NP-hard, and hence, we need efficient exhaustive algorithms. For situations where this is prohibitive (i.e., large dimensionalities), we need approximate algorithms with guarantees such that we can interpret the quality of the solution.
In this dissertation, we solve the aforementioned challenges and propose effective algorithms that discover dependencies in both supervised and unsupervised scenarios. In particular, we employ notions from Information Theory to meaningfully quantify all types of dependencies (e.g., Mutual Information and Total Correlation), which we then couple with notions from statistical learning theory to robustly estimate them from data. Lastly, we derive tight and efficient bounding functions that can be used by branch-and-bound, enabling exact solutions in large search spaces. Case studies on Materials Science data confirm that we can indeed discover high-quality dependencies from complex data.
-------------------------------------------------------------------------------------------------------------------------
The zoom link:
https://cispa-de.zoom.us/j/94010943330?pwd=VkcvMS93Y0V0eGxHb2oxMTNzZjI5Zz09
Meeting ID: 940 1094 3330
Passcode: Dr-1ng