Achinthya Sreedhar is studying for a Bachelor of Computer Science and Engineering at the RV College of Engineering, Bengaluru, India. Achinthya joined a team of three students working alongside the leader of our Machine Learning Library project, Roger Dev (Senior Architect, LexisNexis Risk Solutions Group). The main focus of the Causality Project 2021 is to research and implement features that will extend our ML Library in this field. Since this is a relatively new area, Achinthya's project involved carrying out a lot of research and some challenging mathematics, specifically in the area of probabilities and conditional probabilities. Achinthya's work will be made available as an academic paper in the near future. As well as the resources included here, read Achinthya's intern blog journal, which includes a more in-depth look at his work.
Poster Abstract
Conditional Probability is a key enabling technology for Causal Inference. For real-valued variables, calculating conditional probabilities is particularly challenging because the variables can take on an infinite set of values, and as the number of conditioning dimensions increases, the data appears sparser and sparser, making it difficult to derive accurate results. After looking at various ways of modelling conditional probabilities, we found that RKHS kernel methods made it possible to estimate the density and cumulative density of conditional probabilities with a single conditioning variable. This gave us better accuracy on sparse data than the regular discretization method, which we call D-Prob. We also explored accelerating these methods further using Random Fourier Features (RFF).
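To make the single-conditioning-variable case concrete, here is a minimal Python sketch of a Gaussian-kernel conditional density and cumulative density estimate in the spirit described above. The function names, the fixed bandwidth sigma, and the toy data are illustrative assumptions, not the project's actual implementation:

```python
import numpy as np
from scipy.special import erf

def cond_density(y, x, X, Y, sigma=0.5):
    """Kernel estimate of the conditional density p(Y = y | X = x)
    for a single real-valued conditioning variable. Each sample is
    weighted by a Gaussian kernel on its distance from x; the target
    values Y are then kernel-smoothed around y."""
    w = np.exp(-0.5 * ((X - x) / sigma) ** 2)            # weights on the conditional
    k = np.exp(-0.5 * ((Y - y) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(w * k) / np.sum(w))

def cond_cumulative(y, x, X, Y, sigma=0.5):
    """Kernel estimate of the conditional CDF P(Y <= y | X = x):
    same weighting, with the density kernel replaced by the
    Gaussian CDF of each sample's distance below y."""
    w = np.exp(-0.5 * ((X - x) / sigma) ** 2)
    c = 0.5 * (1.0 + erf((y - Y) / (sigma * np.sqrt(2.0))))
    return float(np.sum(w * c) / np.sum(w))

# Toy check: Y depends linearly on X, so p(y | x=1) should peak near
# y = 2 and the conditional CDF at that point should be roughly 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=2000)
Y = 2.0 * X + rng.normal(scale=0.5, size=2000)
print(cond_density(2.0, 1.0, X, Y), cond_cumulative(2.0, 1.0, X, Y))
```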
Unfortunately, we could not find any RKHS algorithms for conditioning on multiple variables. So, working with Roger, we developed a new approach using a multidimensional RKHS with a Multivariate Gaussian Kernel to model the full joint probability space. This new method (called J-Prob, for Joint Probability) seems to perform as well as D-Prob (the traditional method) when data is plentiful, and much better than D-Prob when the data is sparse. Data can be sparse either because there is little of it or because the conditional dimensionality is high. The traditional method works by filtering the dataset based on the conditionals, so we wondered whether combining filtering with RKHS could yield better runtime performance without losing the accuracy gains. We originally called this approach F-Prob, for Filtered Probability, and found that it had the best of both worlds. I then extended F-Prob to build an algorithm that encompasses all three methods (D-Prob, J-Prob, and F-Prob). This approach, called U-Prob (or Universal Probability), can map to a different method depending on the parameter K, which is, simply put, the percentage of conditional variables we want to filter before performing J-Prob on the remaining variables. So U-Prob(K=0) is equivalent to J-Prob, U-Prob(K=100) approximates D-Prob, and everything in between is a variation of the F-Prob family of methods. We then tailored U-Prob to adaptively choose a near-optimal value of K for a given scenario. We found this method to produce great results, surpassing both J-Prob and D-Prob in different areas.
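The relationship between these methods can be sketched in a few lines of Python, under the same illustrative assumptions as the snippet above (Gaussian kernels, a fixed bandwidth, and a hypothetical filter tolerance tol). The real U-Prob additionally chooses K adaptively, which is omitted here:

```python
import numpy as np

def j_prob_density(y, x, X, Y, sigma=0.5):
    """J-Prob sketch: p(Y = y | X = x) from a multivariate Gaussian
    kernel over the joint space -- the kernel-smoothed joint density
    divided by the kernel density of the conditioning point x alone.
    X is an (n, d) array of conditioning variables."""
    w = np.exp(-0.5 * np.sum(((X - x) / sigma) ** 2, axis=1))   # weights on conditionals
    k = np.exp(-0.5 * ((Y - y) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(w * k) / np.sum(w))

def u_prob_density(y, x, X, Y, K=50.0, tol=0.25, sigma=0.5):
    """U-Prob sketch: filter the first K percent of conditioning
    variables D-Prob style (keep samples within tol of x on those
    dimensions), then run the joint-kernel estimate on the rest.
    A real implementation would guard against an empty filter result."""
    m = int(round(K / 100.0 * X.shape[1]))                      # dimensions to filter
    keep = np.all(np.abs(X[:, :m] - x[:m]) <= tol, axis=1)      # D-Prob style filter
    return j_prob_density(y, x[m:], X[keep, m:], Y[keep], sigma)
```

Note how K=0 leaves the filter empty, so the joint kernel sees every sample (J-Prob), while K=100 filters on every conditioning dimension and reduces the joint kernel to a plain kernel density over the surviving rows (the D-Prob approximation).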
I will be providing a detailed comparison of all of these methods based on a variety of tests under different conditions. These methods will serve as a solid foundation for the HPCC Systems Causality Toolkit, which will further enhance their performance using the HPCC Systems Platform's robust parallelization capabilities.
Presentation
In this Video Recording, Achinthya provides a tour and explanation of his poster content.
Improving conditional probability calculations using kernel methods in Reproducing Kernel Hilbert Space (RKHS)