2021 Poster Contest (Abstracts Only)
Achinthya Sreedhar
RVCE, Bengaluru, India
Conditional Probability is a key enabling technology for Causal Inference. For real-valued variables, calculating conditional probabilities is particularly challenging because the variables can take on an infinite set of values. As the number of conditioning dimensions increases, the data appears sparser and sparser, making it difficult to derive accurate results. After looking at various ways of modelling conditional probabilities, we found that RKHS kernel methods make it possible to estimate the density and cumulative density of a conditional probability with a single conditioning variable. This gave us better accuracy on sparse data than the regular discretization method, called D-Prob. We also explored accelerating these methods further using Random Fourier Features (RFF).
Unfortunately, we could not find any RKHS algorithms that condition on multiple variables. So, working with Roger, we developed a new approach using a multidimensional RKHS with a Multivariate Gaussian Kernel to model the full joint probability space. This new method (called J-Prob, for Joint Probability) seems to perform as well as D-Prob (the traditional method) when data is plentiful, and much better than D-Prob when the data is sparse. Data can be sparse either because there is little of it or because the conditional dimensionality is high. The traditional method works by filtering the dataset on the conditioning variables, so we wondered whether combining filtering with RKHS could yield better runtime performance without losing the accuracy gains. We originally called this approach F-Prob, for Filtered Probability, and found that it had the best of both worlds. I then extended F-Prob into an algorithm that encompasses all three methods (D-Prob, J-Prob, F-Prob). This approach, called U-Prob (or Universal Probability), maps to a different method depending on the parameter K, which is, simply put, the percentage of conditioning variables we want to filter on (before performing J-Prob on the remaining variables). So U-Prob(K=0) is equivalent to J-Prob and U-Prob(K=100) approximates D-Prob; everything in between is a variation of the F-Prob family of methods. We then tailored U-Prob to adaptively choose a near-optimal value of K for a given scenario. We found this method to produce great results, surpassing J-Prob and D-Prob in different areas.
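To make the distinction between these methods concrete, the sketch below (not the toolkit's actual implementation; the variable names, bandwidth, and tolerance are hypothetical) contrasts a D-Prob-style filtered estimate, a J-Prob-style kernel estimate, and a simple U-Prob-style hybrid for a conditional CDF query.

```python
import numpy as np

def gaussian_weights(X, x, bandwidth):
    """Multivariate Gaussian kernel weights of each sample's conditioners X[i] around the query x."""
    sq_dist = np.sum(((X - x) / bandwidth) ** 2, axis=1)
    return np.exp(-0.5 * sq_dist)

def cond_cdf_dprob(X, y, x, t, tol=0.1):
    """D-Prob style: filter samples whose conditioners are near x, then take the empirical CDF of y."""
    mask = np.all(np.abs(X - x) <= tol, axis=1)
    if not mask.any():
        return np.nan          # no data survives the filter (the sparse case)
    return np.mean(y[mask] <= t)

def cond_cdf_jprob(X, y, x, t, bandwidth=0.2):
    """J-Prob style: weight every sample by a Gaussian kernel on the conditioners, discarding no data."""
    w = gaussian_weights(X, x, bandwidth)
    return np.sum(w * (y <= t)) / np.sum(w)

def cond_cdf_uprob(X, y, x, t, k=50, tol=0.1, bandwidth=0.2):
    """U-Prob style: filter on the first k% of conditioners (D-Prob), then kernel-estimate on the rest."""
    n_filter = int(round(X.shape[1] * k / 100))
    mask = np.all(np.abs(X[:, :n_filter] - x[:n_filter]) <= tol, axis=1)
    if not mask.any():
        return np.nan
    return cond_cdf_jprob(X[mask, n_filter:], y[mask], x[n_filter:], t, bandwidth)

# Toy query: P(Y <= 0 | X = 0) for Y = sum(X) + noise (should be close to 0.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X.sum(axis=1) + rng.normal(scale=0.3, size=500)
x = np.zeros(3)
print(cond_cdf_dprob(X, y, x, 0.0), cond_cdf_jprob(X, y, x, 0.0), cond_cdf_uprob(X, y, x, 0.0, k=33))
```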
I will be providing a detailed comparison between all of the methods based on a variety of tests under different conditions. These methods will serve as a solid foundation for the HPCC Systems Causality Toolkit, which will further enhance their performance using the HPCC Systems Platform's robust parallelization capabilities.
Amy Ma
Marjory Stoneman Douglas High School, FL, USA
An Ingress is an object that allows access to Kubernetes services from outside the Kubernetes cluster. Ingress is made up of an Ingress object and an Ingress Controller; the Ingress Controller is the implementation of the Ingress. In this project, two Ingress implementations, HAProxy and Nginx, were examined in an Azure environment. Both are in-cluster Ingress solutions, where load balancing is performed by pods within the cluster. My work explores the different setups used to configure Ingress features through annotations and Kubernetes Ingress specifications.
The basic functions exercised in this project include routing, authentication, and access control features such as whitelisting, rate limiting, and buffer sizing. Various TLS configurations were investigated, including using self-generated certificates with OpenSSL, using the HPCC Systems TLS implementation with cert-manager, and configuring TLS with a dynamic, externally reachable IP address. Also explored and tested were the blue-green and canary deployment patterns, implemented with the HAProxy and Nginx controllers, respectively. These configurations are very useful in real cloud application development and maintenance; for example, a canary deployment can be used to gradually roll out new application features, such as ECL Watch updates in an HPCC Systems cluster. End-to-end request-to-response times were collected for nine ECL samples, comparing HPCC Systems with and without the Nginx Ingress controller. The initial latency tests indicate that Ingress does not add much performance delay.
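As a rough illustration of how such a latency comparison can be collected (the endpoints below are placeholders, not the project's actual URLs or tooling), each request can simply be timed against the service reached directly and through the Ingress:

```python
import statistics
import time
import requests

# Hypothetical endpoints: the ESP service reached directly vs. through the Nginx Ingress.
ENDPOINTS = {
    "direct":  "http://<esp-service-ip>:8010/",
    "ingress": "http://<ingress-host>/",
}

def measure(url, runs=9):
    """Return end-to-end request-to-response latencies (seconds) for `runs` calls."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        requests.get(url, timeout=30)
        latencies.append(time.perf_counter() - start)
    return latencies

for name, url in ENDPOINTS.items():
    lats = measure(url)
    print(f"{name}: median={statistics.median(lats):.3f}s  mean={statistics.mean(lats):.3f}s")
```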
Some benefits of using Ingress in the cloud are its wide range of features, such as access control, basic authentication, a single access point for external traffic, and advanced routing.
André Fontanez Bravo Best Poster - Research
University of Sao Paulo, Brazil
Big Data and Logistic Regression applied to Analysis of Loan Requests
Big Data and its applications are becoming more and more important across many different fields. In this context, techniques and tools that are able to process the immense flow of information to create value can be powerful instruments. This study focuses on the application of data analysis to financial investments on LendingClub’s platform. LendingClub is an American peer-to-peer lending company. The company’s platform allows users to file loan requests and other users to finance them, becoming investors. Each loan is broken up into Notes that represent a fraction of said loan. These Notes can also be traded among investors, similarly to what is done in the stock market. Investors can choose the loans they wish to finance based on a plethora of information about the loan and the borrower, such as the loan’s interest rate and the borrower’s purpose and credit score. Even though LendingClub assesses all loan requests before making them available on its platform, the company’s public historical database shows that around 12.5% of loans were charged off. As investing in loans that end up not being paid evidently incurs financial losses, it would be useful to have a way to identify loan requests that have a higher probability of being paid on time.
With that in mind, the goal of this study is to develop a logistic regression model capable of identifying the best options for investment among the loan requests at LendingClub’s platform using the information available to investors. This binary model should calculate the likelihood of a loan being paid and then classify it as “good loan” or not.
Given the size of the company’s dataset (over two million records, with dozens of columns), this project was developed on the HPCC Systems platform, which is able to handle large volumes of data and also has a logistic regression module. The modelling process involved four main stages:
Data extraction
Data cleaning and preparation
Model training and evaluation
Optimization.
By the end of the study, two final models were obtained through two different methods of optimization. The first is better at filtering loan requests to obtain a higher proportion of good loans, while the second is better at filtering loan requests while discarding as few good loans as possible.
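The study itself used the HPCC Systems logistic regression module; purely as an illustrative sketch of the same four-stage pipeline, the Python outline below approximates it with scikit-learn (the file name, feature columns, and thresholds are placeholders, not the study's actual choices).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# 1. Data extraction (hypothetical file name)
loans = pd.read_csv("lendingclub_loans.csv")

# 2. Data cleaning and preparation: keep finished loans, binarize the target
loans = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])]
loans["good_loan"] = (loans["loan_status"] == "Fully Paid").astype(int)
features = ["loan_amnt", "int_rate", "annual_inc", "dti", "fico_range_low"]  # illustrative subset
X = loans[features].fillna(loans[features].median())
y = loans["good_loan"]

# 3. Model training and evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# 4. Optimization: choose the decision threshold according to either goal
for threshold in (0.5, 0.7, 0.9):
    preds = (probs >= threshold).astype(int)
    # precision -> proportion of selected loans that are good
    # recall    -> proportion of good loans that are kept
    print(threshold,
          round(precision_score(y_test, preds), 3),
          round(recall_score(y_test, preds), 3))
```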
Atreya Bain 2021 Community Choice Award
RVCE, Bengaluru, India
Improvements on HSQL: A SQL-like language for HPCC Systems
Big Data has become an important field, and there is a steep learning curve to handling it, especially in distributed systems. HSQL for HPCC Systems is a solution developed to help users get used to the platform's architecture and to ECL (Enterprise Control Language), the language with which it primarily operates. HSQL aims to provide a seamless interface for data science developers to work with data. It is designed to work in conjunction with ECL, the primary programming language for HPCC Systems, and should prove to be easy to work with and robust for general-purpose analysis.
HSQL provides a compact, easy-to-comprehend SQL-like syntax for performing visualizations, general exploratory data analysis, and training of Machine Learning models, while also allowing such programs to be structured modularly. Functions can also be written to allow for code reuse. It also integrates with the VS Code IDE, providing syntax highlighting and code completion features.
Previous work on HSQL laid the primary foundations; in this work, various improvements were made to make the compiler more usable and correct. The compiler's architecture was revised to allow it to translate more effectively, and the newer version adds support for functions, for code reusability, and for modules that help structure code. Additionally, many of the existing statements gained new features that make them easier and better to use.
Bruno Carneiro Camara
University of Sao Paulo, Brazil
Preventing Fraud by Registration Inconsistencies
Huge sums of money are lost to fraud committed by companies. Laws already exist to punish company partners for abusive acts committed for their own benefit; however, how can the authorities locate them and take the necessary actions? This is where my work comes in. Identifying registration inconsistencies, suspicious behaviors, or unusual situations may help prevent or locate fraud. Using three different public databases as the starting point, I was able to link companies and partners to suspicious behaviors, such as company partners receiving undue government benefits and reports of work analogous to slavery in companies.
The three public databases used were:
Brazilian Companies - Divided into three categories: Companies, Partners, and Establishments. Each contains specific information, such as company status, partner position, age group, location, and dates.
Government Subsidy - People who received government aid. I chose two well-known Brazilian subsidies, Bolsa Família and Auxílio Emergencial, both aimed at people with low income.
Work Analogous to Slavery - Information about companies or employers that practice slavery-like work.
All these datasets are publicly available on the Brazilian government websites.
With the three databases properly treated and cleaned, I was able to reach my goal of identifying suspicious behaviors, unusual situations, and registration inconsistencies, obtaining three main datasets: partners who received benefits and their respective companies; establishments that had some type of complaint about work analogous to slavery; and partners with reports of labor practices analogous to slavery. These three resulting datasets were analyzed descriptively in order to highlight group singularities and trends. The most relevant keys used to analyze the groups were partner position, partner’s age group, and the company’s area of activity. One of the trends highlighted: 49% of the partners who received benefits were managing partners, which is a high position for someone receiving assistance.
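As a rough illustration of how such linkages can be built (the real public datasets use Portuguese field names keyed on CPF/CNPJ identifiers; the file and column names below are hypothetical), the three result sets amount to joins across the cleaned tables:

```python
import pandas as pd

# Hypothetical cleaned tables extracted from the three public databases
partners = pd.read_csv("partners.csv")               # cnpj, partner_cpf, position, age_group
establishments = pd.read_csv("establishments.csv")   # cnpj, status, activity_area, location
subsidies = pd.read_csv("subsidies.csv")             # beneficiary_cpf, program, amount
slave_labor = pd.read_csv("slave_labor.csv")         # employer_cnpj, report_year

# 1. Partners who received government benefits, with their respective companies
partners_with_benefits = partners.merge(
    subsidies, left_on="partner_cpf", right_on="beneficiary_cpf")

# 2. Establishments with complaints of work analogous to slavery
reported_establishments = establishments.merge(
    slave_labor, left_on="cnpj", right_on="employer_cnpj")

# 3. Partners of companies with slavery-like labor reports
partners_of_reported = partners.merge(
    slave_labor, left_on="cnpj", right_on="employer_cnpj")

# Example trend: share of benefit-receiving partners who are managing partners
share = (partners_with_benefits["position"] == "managing partner").mean()
print(f"{share:.0%} of benefit-receiving partners are managing partners")
```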
In short, this project shows that combining Big Data and Analytics with HPCC Systems could be a promising way to prevent and locate fraud, using registration inconsistencies as a tool to flag potentially fraudulent companies and company partners.
Carina Wang Best Poster - Data Analytics
American Heritage School, FL, USA
Processing Student Image Data with Kubernetes and HPCC Systems GNN on Azure
In order to foster a safe learning environment, measures to bolster campus security have emerged as a top priority around the world. The developments from my internship will be applied to a tangible security system at American Heritage High School (AHS). Processing student images on the HPCC Systems Cloud Native Platform and evaluating the HPCC Systems Generalized Neural Network (GNN) bundle in the cloud ultimately facilitated a model’s classification of an individual as “AHS student” or “Not an AHS student”. Running the trained model, a security robot will help security personnel identify visitors on campus, serve as an access point for viewing various locations, and give students permission to navigate school information. The long-term goal is to process mass amounts of student/staff/visitor images with HPCC Systems. To bring HPCC Systems one step closer to that stage, this project delivered results faster and increased overall accuracy rates. HPCC Systems is transitioning from Bare Metal to the Cloud Native Platform; to facilitate this transition, this project improves HPCC Systems GNN models and GNN Thor clusters in the cloud environment to train on a dataset of 4,839 images.
The prevailing obstacles faced in Machine Learning are insufficient real-world data and the effort of developing CNN models from scratch. To combat these challenges, this project took an alternative approach to data collection and evaluated multiple pre-trained models to identify the one with peak accuracy and time efficiency. Instead of artificially augmenting photos of each student (e.g. fake background colors and manually adjusted angles), I obtained 4,000+ images by splitting a video into frames. This magnified the scope of the project by expanding the number of real images from the robot with consistent backgrounds and angles.
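A minimal sketch of the video-to-frames step, assuming OpenCV and hypothetical file paths (the actual collection pipeline may differ):

```python
import os
import cv2  # OpenCV

def video_to_frames(video_path, out_dir, every_n=5):
    """Split a recorded video into image frames, keeping every n-th frame to avoid near-duplicates."""
    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    index = saved = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        index += 1
    capture.release()
    return saved

# Hypothetical paths: one short video per student, written to a per-student folder
print(video_to_frames("student_001.mp4", "dataset/student_001"))
```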
As image classification has matured over the years, more pre-trained models are now available. This project evaluated five TensorFlow pre-trained CNN models (to compare processing speed and accuracy) and an HPCC Systems GNN model. Through the latter, this work helped test HPCC Systems Thor functionality by varying parameters on the GNN model. The application cluster, with Docker images of HPCC Systems Core and the TensorFlow libraries, was deployed on Azure. By evaluating industry-standard models, this work helps users easily train a dataset with drastically better results. The MobileNet V2 model was the fastest and achieved 100% accuracy. The results show that pre-trained models with modifications can achieve optimal results, rather than requiring models to be developed from scratch.
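As an illustration of the pre-trained-model approach (not the project's exact training code; the directory layout, batch size, and epoch count are assumptions), a MobileNetV2 backbone can be reused for the binary “AHS student” classification roughly as follows:

```python
import tensorflow as tf

IMG_SIZE = (224, 224)

# Load the labeled frames from disk (hypothetical layout:
# dataset/ahs_student/... and dataset/not_ahs_student/...)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", image_size=IMG_SIZE, batch_size=32)

# Pre-trained MobileNetV2 backbone with a new binary classification head
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained features frozen

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects inputs in [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # "AHS student" vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```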
Finally, I developed a standard procedure for collecting images and training a model with the HPCC Systems Platform on Cloud. This will allow for processing larger datasets (e.g. photos from the entire school instead of a sample). The image classification model will be compatible and work in conjunction with devices mounted on our security robot for user convenience. The cloud-based student recognition model that has been developed in this project will allow a person to receive confirmation from the robot that they are in the student database and retrieve information as part of a larger, interactive security feature.
Chirag Bapat
RVCE, Bengaluru, India
Comparative study of HPCC Systems and Hadoop
In order to constantly evolve and generate better results from any system, we require ongoing studies to assess and compare the performance of new and upcoming systems against current industry standards. Through our project, we intend to perform such a comprehensive comparative study between the current standard in Big Data Analytics systems, Hadoop, and the platform provided by HPCC Systems. This will allow us to assess both the similarities and differences between the two setups, which in turn will assist the end user or client in making a better and more informed choice about the kind of system to set up for their specific requirements.
Through the "Comparative study of HPCC Systems and Hadoop", we plan to prepare each solution from scratch, and analyse various parameters not just limited to technical performance, but overall user experience as well. This would include the ease and time to set up the environments and other similar factors. Through our presentation, we aim to compare the following parameters:
Ease of access of material regarding the concerned software
Time required to set up clusters
Ease of programming in the respective languages for each system
Running various machine learning algorithms on each system with different-sized datasets, measuring their accuracies and execution times, and contrasting the LoCs required to implement the same
During the implementation of the machine learning algorithms, we shall be working with two differently sized datasets: the smaller USA Cars dataset provided by HPCC Systems and a larger UK Housing dataset sourced from Kaggle. This will also let us compare the load performance of each system.
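One simple way to compare load performance is to wrap each platform's load command in a timer. The sketch below is only illustrative; the exact hdfs/dfuplus invocations are placeholders and depend on the cluster configuration.

```python
import subprocess
import time

# Placeholder commands for loading the same dataset on each platform;
# real spray/put invocations depend on the cluster setup (servers, targets, formats).
LOAD_COMMANDS = {
    "hadoop": ["hdfs", "dfs", "-put", "uk_housing.csv", "/data/uk_housing.csv"],
    "hpcc":   ["dfuplus", "action=spray", "srcfile=uk_housing.csv",
               "dstname=uk_housing::raw", "format=csv"],
}

def time_command(cmd):
    """Run a load command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

for platform, cmd in LOAD_COMMANDS.items():
    print(platform, round(time_command(cmd), 2), "s")
```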
Chris Connelly
North Carolina State University
Ingestion and Analysis of Collegiate Women's Basketball GPS Data in HPCC Systems and RealBI
In the past, NC State Strength and Conditioning has worked with HPCC Systems to create solutions for taking different data streams and bringing them together for a comprehensive analysis to improve athlete wellbeing and performance. Here you will see solutions using HPCC Systems and RealBI to provide insight from data collected with the NC State Women's Basketball team. You will also see some of the differences between working in a Bare Metal environment and working in a Kubernetes environment. From uploading data into a cloud-based environment to visualizing it in a streamlined, interactive dashboard, see how these solutions can improve our understanding of this data and provide better service to these student athletes.
Deeksha Shravani
RVCE, Bengaluru, India