Mayank Agarwal is studying for a Bachelor of Computer Science and Engineering at the RV College of Engineering, Bengaluru, India. Mayank joined a team of three students working alongside the leader of our Machine Learning Library project, Roger Dev (Senior Architect, LexisNexis Risk Solutions Group). The main focus of this project in 2021 is to research and implement some features that will extend our ML Library in the field of Causality. Mayank's intern project focused on Independence, Conditional Independence and Directionality, which involved becoming familiar with Reproducing Kernel Hilbert Space and experimenting with various kernels. Since this is a new and groundbreaking area, Mayank had to do a lot of research by reading a number of papers written in the field as well as interpreting the experiments referred to and learning how to apply them. Mayank's project contributes greatly to our Machine Learning Library, helping to accelerate progress on the Causality project.
Poster Abstract
The new science of Causality promises to open new frontiers in Data Science and Machine Learning, but requires an accurate model of the causal relationships between variables.
This causal model takes the form of a Directed Acyclic Graph (DAG). Nature provides a few subtle cues to the structure of the causal model, the most important of which are the independencies and conditional independencies between variables. These independencies allow us to test whether a causal model is consistent with the observed data and, in some cases, to discover the causal model from data alone. Doing so is very challenging, however, because the signals are subtle, the data are noisy, and the conditioning sets are often high dimensional. Runtime performance is also critical, since it limits the depth of testing that can realistically be achieved. Of the various algorithms I studied, the one I found to have the highest accuracy and efficiency is RCoT (the Randomized Conditional Correlation Test).
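To illustrate how independencies hint at causal structure, here is a minimal sketch that simulates two small DAGs and compares marginal and conditional dependence. It is not part of the poster's work: the linear Gaussian data, the coefficients, and the helper names (`residual`, `corr`, `partial_corr`) are assumptions made for the example, and partial correlation stands in for a general conditional independence measure.

```python
import numpy as np


def residual(a, b):
    """Remove the least-squares linear effect of b (plus an intercept) from a."""
    design = np.column_stack([np.ones_like(b), b])
    coef, *_ = np.linalg.lstsq(design, a, rcond=None)
    return a - design @ coef


def corr(a, b):
    """Pearson correlation between two 1-D samples."""
    return float(np.corrcoef(a, b)[0, 1])


def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    return corr(residual(x, z), residual(y, z))


rng = np.random.default_rng(0)
n = 20_000

# Chain DAG: X -> Z -> Y (illustrative coefficients).
x_chain = rng.normal(size=n)
z_chain = x_chain + 0.5 * rng.normal(size=n)
y_chain = z_chain + 0.5 * rng.normal(size=n)

# Collider DAG: X -> Z <- Y (illustrative coefficients).
x_col = rng.normal(size=n)
y_col = rng.normal(size=n)
z_col = x_col + y_col + 0.5 * rng.normal(size=n)

chain_marginal = corr(x_chain, y_chain)
chain_given_z = partial_corr(x_chain, y_chain, z_chain)
collider_marginal = corr(x_col, y_col)
collider_given_z = partial_corr(x_col, y_col, z_col)

print(f"chain:    corr(X, Y) = {chain_marginal:+.3f}, given Z = {chain_given_z:+.3f}")
print(f"collider: corr(X, Y) = {collider_marginal:+.3f}, given Z = {collider_given_z:+.3f}")
```

In the chain, X and Y are dependent but become independent once Z is conditioned on. In the collider the pattern is reversed. The two graphs therefore leave different independence signatures in the data, which is the kind of cue that a conditional independence test looks for.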
RCoT is a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on Reproducing Kernel Hilbert Spaces (RKHS). Unlike previous kernel dependency measures, RCoT does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. RCoT is a very recently developed method, and has only been implemented in the R language. For compatibility with our Causality Framework, I redeveloped the algorithm in Python. It has been tested on a variety of synthetic and real-world models and has shown satisfactory results. The first phase of RCoT measures the dependence between data elements as described above, which yields a “test statistic”. The second phase determines whether the detected dependence is statistically significant. In RCoT, the null hypothesis is conditional independence (X _||_ Y | Z); the alternative hypothesis is that the two random variables X and Y are dependent when conditioning on the set of random variables Z. The output of RCoT is a p-value (p-val): the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one computed. The null hypothesis is rejected when the p-value is less than 5%.
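The hypothesis-testing flow described above (null hypothesis X _||_ Y | Z, a test statistic, a p-value, and a 5% threshold) can be sketched with a much simpler stand-in test. The example below is not RCoT: it uses a partial-correlation test with the Fisher z-transform, which is only valid under a linear Gaussian assumption. The function name `fisher_z_ci_test` and the simulated data are hypothetical and exist only for illustration.

```python
import math

import numpy as np


def fisher_z_ci_test(x, y, z, alpha=0.05):
    """Test H0: X _||_ Y | Z with partial correlation and the Fisher z-transform.

    A linear Gaussian stand-in for a conditional independence test (not RCoT).
    Returns (p_value, reject_null).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(x), -1)
    n, k = z.shape
    if n - k - 3 <= 0:
        raise ValueError("not enough samples for the size of the conditioning set")

    # Regress x and y on z (with an intercept) and keep the residuals.
    design = np.column_stack([np.ones(n), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]

    # Partial correlation, clipped so that atanh stays finite.
    r = float(np.corrcoef(res_x, res_y)[0, 1])
    r = min(max(r, -0.999999), 0.999999)

    # Under H0 the statistic is approximately standard normal.
    stat = math.sqrt(n - k - 3) * math.atanh(r)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # two-sided p-value
    return p_value, p_value < alpha


rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)
z = x + 0.5 * rng.normal(size=n)

# Case 1: Y depends on X only through Z, so H0 holds by construction.
y_indep = z + 0.5 * rng.normal(size=n)
# Case 2: Y also depends on X directly, so H0 is false by construction.
y_dep = z + 0.5 * x + 0.5 * rng.normal(size=n)

p_indep, reject_indep = fisher_z_ci_test(x, y_indep, z)
p_dep, reject_dep = fisher_z_ci_test(x, y_dep, z)
print(f"H0 true : p-value = {p_indep:.3g}, reject = {reject_indep}")
print(f"H0 false: p-value = {p_dep:.3g}, reject = {reject_dep}")
```

With a 5% threshold, a test of this kind is expected to reject a true null hypothesis in roughly 5% of repeated experiments, so a single run in which the null holds can still occasionally reject.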
Deciding significance requires approximating the distribution of the test statistic under the null hypothesis. One of the improved approximation techniques is the Lindsay–Pilla–Basak (LPB) method, an approximation technique for a weighted sum of chi-squared random variables. There are various other approximation methods of this kind; the one I used is the Hall–Buckley–Eagleson (HBE) method. Another major feature of RCoT is its excellent runtime performance and scalability relative to other methods. RCoT achieves this by using Random Fourier Features to closely approximate the earlier method, KCIT (Kernel-based Conditional Independence Testing), while requiring far less computation time. Random Fourier Features is a widely used, simple, and effective technique for scaling up kernel methods. It allows approximation with arbitrary precision using a lower dimensional model, giving RCoT unprecedented runtime performance and scalability, even for high dimensional conditioning. I evaluate the runtime and accuracy of the resulting RCoT module, and compare it to our previous method.
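The idea behind Random Fourier Features can be sketched in a few lines: an explicit, low dimensional feature map whose inner products approximate a Gaussian (RBF) kernel. This is a generic illustration, not the code of the RCoT module. The function names, the kernel width `sigma`, the sample data, and the feature counts are all assumptions chosen for the example.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Exact Gaussian kernel matrix: exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    sq_dist = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def random_fourier_features(data, num_features, sigma, rng):
    """Map data to num_features random cosine features.

    Frequencies are drawn from N(0, 1 / sigma^2) and phases from U[0, 2*pi),
    so that features @ features.T approximates the Gaussian kernel matrix.
    """
    dim = data.shape[1]
    freqs = rng.normal(scale=1.0 / sigma, size=(dim, num_features))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(data @ freqs + phases)


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
sigma = 1.5
exact = rbf_kernel(data, data, sigma)

# Mean absolute error of the approximation for a small and a large feature count.
errors = {}
for num_features in (25, 2000):
    feats = random_fourier_features(data, num_features, sigma, rng)
    errors[num_features] = float(np.abs(feats @ feats.T - exact).mean())
    print(f"{num_features:5d} features: mean abs error = {errors[num_features]:.4f}")
```

The approximation error shrinks as more random features are used, while the cost of working with an n-by-m feature matrix grows only linearly in the number of samples n. An exact kernel matrix, by contrast, has n-by-n entries. That trade-off is the source of the runtime advantage of feature-based tests over exact kernel tests.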
The RCoT algorithm for independence testing will provide a concrete foundation for the HPCC Systems Causality Toolkit, whose performance will be further enhanced by the HPCC Systems Platform's robust parallelization capabilities.
Presentation
In this Video Recording, Mayank provides a tour and explanation of his poster content.
Independence Testing with RCoT: Causal Validation and Discovery for HPCC Systems Causal Toolkit