Logan Patterson - 2023 Poster Contest Resources

Logan is a second-year Data Science M.S. student at New College of Florida. He enjoys working with big data and machine learning, and holds a B.S. in Neuroscience. Outside of that, you can usually find him hiking or exercising whenever he has the time.

Poster Abstract

Many current statistical methods revolve around the concept of correlation: if one event occurs in a consistent manner and changes in tandem with another event, we have tests that can tell us there is a likely relationship between the two. In contrast, causality is a rapidly evolving field that involves not only discovering relationships between variables, but also determining the direction and degree to which these relationships occur. One of the biggest goals in causal research is developing algorithms that can accurately find causal relationships within datasets. These methods of causal discovery usually involve various levels of Markovian probability, do-calculus, and information theory. However, many studies take a singular focus when approaching discovery. While this narrow focus often succeeds when dealing with data in a closed system, it can limit generalizability. The use of different datasets, scoring metrics, and even computing systems makes large-scale comparison between discovery algorithms a challenge.

This project aims to develop a more generalized series of testing algorithms within the HPCC Systems Causality Framework. The Python package “Because” offers a large-scale, comprehensive collection of methods for exploring causal relationships. We intend to add a more streamlined set of testing algorithms that allows for easy input of both pre-designed and novel causal discovery models, uniform application of novel scoring mechanisms, and customization of causal datasets. This endeavor comes with three main difficulties: model applicability to specific types of data, effective scoring metrics, and validation of models applied to datasets with unknown causal relationships.

Many causal models are designed with two types of data in mind: continuous and discrete. Some may be able to handle both types within the same dataset, but it is important to identify each algorithm’s limitations for users. If a model is applied inappropriately, the resulting scores could be misleading. The scoring metrics themselves need to be comparable across all model types, informative about the causal nature of each discovered relationship, and ideally easy to understand. This translates to a battery of standardized metrics with simple outputs, including Structural Hamming Distance (SHD), Structural Intervention Distance (SID), precision/recall curves, and an implementation of Because’s own probability metrics. Lastly, the biggest hurdle for this project is determining whether a model accurately represents the relationships contained within the data when there is no ground truth available for comparison. This occurs mainly with natural datasets, where relationships are typically more complicated and nuanced.
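As a rough illustration of how one of these metrics works, the sketch below computes a Structural Hamming Distance between a discovered graph and a ground-truth graph by counting missing, extra, and reversed edges. The edge-set representation and the function name shd are assumptions made for this example; they are not taken from the Because package:

    # Illustrative SHD sketch (not the Because implementation).
    # A graph is represented here as a set of directed edges (parent, child).
    def shd(true_edges, found_edges):
        """Count missing, extra, and reversed edges between two DAGs."""
        true_edges, found_edges = set(true_edges), set(found_edges)
        distance, seen = 0, set()
        for a, b in true_edges | found_edges:
            if (a, b) in seen or (b, a) in seen:
                continue                      # consider each node pair only once
            seen.add((a, b))
            in_true = (a, b) in true_edges or (b, a) in true_edges
            in_found = (a, b) in found_edges or (b, a) in found_edges
            if in_true != in_found:
                distance += 1                 # missing or extra edge
            elif ((a, b) in true_edges) != ((a, b) in found_edges):
                distance += 1                 # edge present but reversed
        return distance

    # Example: one reversed edge and one extra edge give a distance of 2.
    print(shd({("A", "B"), ("B", "C")}, {("B", "A"), ("B", "C"), ("A", "C")}))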
The discovery model testing algorithm was applied to four different algorithms, one of which was already implemented within Because: PC (Peter-Clark), GES (Greedy Equivalence Search), IGCI (Information Geometric Causal Inference), and RCC (Randomized Causation Coefficient). The models were compared to one another based on their performance on various datasets to determine the viability of both the testing algorithm and the models themselves. This algorithm will hopefully pave the way for easier integration and implementation of causal discovery algorithms in future developments within the HPCC Systems Causality Framework.
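Conceptually, the comparison described above amounts to running each candidate model over a common set of benchmark datasets and applying the same scoring metric to every recovered graph. The harness below is only a sketch of that loop; the callable models, the dataset structure, and the scorer are hypothetical placeholders rather than the actual interfaces of Because or the HPCC Systems Causality Framework:

    # Hypothetical comparison harness: every model runs on every dataset,
    # and each recovered edge set is scored against the known ground truth.
    # All names here are illustrative placeholders, not real Because APIs.
    def evaluate_models(models, datasets, score):
        results = {}
        for data_name, (samples, true_edges) in datasets.items():
            for model_name, discover in models.items():
                found_edges = discover(samples)   # each model maps samples to directed edges
                results[(model_name, data_name)] = score(true_edges, found_edges)
        return results

    # Usage sketch, with `pc_discover` and `ges_discover` standing in for
    # wrappers around PC and GES, and `shd` as the scorer from the sketch above:
    # results = evaluate_models(
    #     {"PC": pc_discover, "GES": ges_discover},
    #     {"synthetic_chain": (samples, true_edges)},
    #     score=shd,
    # )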

Presentation

In this Video Recording, Logan provides a tour and explanation of his poster content.

Designing Test Algorithms for Causal Model Discovery Within the HPCC Systems Causality Framework

