2021 Poster Contest (Abstracts Only)
Achinthya Sreedhar
RVCE, Bengaluru, India
Conditional Probability is a key enabling technology for Causal Inference. For real-valued variables, calculating conditional probabilities is particularly challenging because they can take on an infinite set of values. As the number of conditioning dimensions increases, the data appears sparser and sparser, making it difficult to derive accurate results. After looking at various ways of modelling conditional probabilities, we found that RKHS kernel methods made it possible to estimate the density and cumulative density of conditional probabilities with a single conditioning variable. We found that this gave us better accuracy on sparse data than the regular discretization method, called D-Prob. We also explored accelerating these methods further using Random Fourier Features (RFF).
Unfortunately, we could not find any RKHS algorithms for conditioning on multiple variables. So, working with Roger, we developed a new approach using a multidimensional RKHS with a Multivariate Gaussian Kernel to model the full joint probability space. This new method (called J-Prob, for Joint Probability) seems to perform as well as D-Prob (the traditional method) when data is plentiful, and much better than D-Prob when the data is sparse. Data can be sparse either because there is little data or because the conditional dimensionality is high. The traditional method works by filtering the dataset based on the conditionals, so we wondered whether combining filtering with RKHS could give better runtime performance without losing the accuracy gains. We originally called this approach F-Prob, for Filtered Probability, and found that it had the best of both worlds. I then extended F-Prob to build an algorithm that encompasses all three methods (D-Prob, J-Prob, F-Prob). This approach, called U-Prob (or Universal Probability), can map to a different method depending on the parameter K, which is, simply put, the percentage of conditional variables we want to filter (before performing J-Prob on the remaining variables). So U-Prob(K=0) is equivalent to J-Prob and U-Prob(K=100) approximates D-Prob. Everything in between is a variation of the F-Prob family of methods. We then tailored U-Prob to adaptively choose a near-optimal value of K for a given scenario. We found this method to produce great results, surpassing J-Prob and D-Prob in different areas.
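To make the role of K more concrete, here is a rough Python sketch of the idea as described above, assuming an isotropic Gaussian kernel, a simple per-variable filtering tolerance, and a fixed ordering of the conditioning variables. It is an illustration only, not the toolkit's actual implementation.

```python
import numpy as np

def gaussian_kernel(u, bandwidth):
    """Isotropic Gaussian (RBF) kernel weight for each row of u."""
    d2 = np.sum((u / bandwidth) ** 2, axis=1)
    return np.exp(-0.5 * d2)

def cond_cdf_jprob(X, Y, x_query, y_query, bandwidth=0.5):
    """Kernel estimate of P(Y <= y | X = x) over the full joint space
    (in the spirit of J-Prob): weight every sample by its similarity
    to the conditioning point."""
    w = gaussian_kernel(X - x_query, bandwidth)
    return float(np.sum(w * (Y <= y_query)) / np.sum(w))

def cond_cdf_uprob(X, Y, x_query, y_query, k_pct=50, tol=0.1, bandwidth=0.5):
    """Sketch of the U-Prob idea: filter on the first K% of the conditioning
    variables (as D-Prob would), then run the kernel estimate on the rest.
    k_pct=0 reduces to J-Prob; k_pct=100 approximates D-Prob."""
    n_filter = int(round(X.shape[1] * k_pct / 100))
    mask = np.all(np.abs(X[:, :n_filter] - x_query[:n_filter]) <= tol, axis=1)
    Xf, Yf = X[mask][:, n_filter:], Y[mask]
    if len(Yf) == 0:
        return float("nan")
    if n_filter == X.shape[1]:                  # pure filtering, D-Prob-like
        return float(np.mean(Yf <= y_query))
    return cond_cdf_jprob(Xf, Yf, x_query[n_filter:], y_query, bandwidth)

# Tiny synthetic example: Y depends on two conditioning variables
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
Y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=5000)
print(cond_cdf_uprob(X, Y, np.array([1.0, 0.0]), y_query=1.0, k_pct=50))
```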
I will be providing a detailed comparison between all of the methods based on a variety of tests under different conditions. These methods will serve as a solid foundation for the HPCC Systems Causality Toolkit, which will further enhance their performance using the HPCC Systems Platform's robust parallelization capabilities.
Amy Ma
Marjory Stoneman Douglas High School, FL, USA
An Ingress is an object that allows access to Kubernetes services from outside the Kubernetes cluster. Ingress is made up of an Ingress object and an Ingress Controller, where the Ingress Controller is the implementation of the Ingress. In this project, two Ingress controller implementations, HAProxy and Nginx, were examined in an Azure environment. Both are in-cluster Ingress solutions, where load balancing is performed by pods within the cluster. My work explores the different setups used to configure Ingress features through annotations and Kubernetes Ingress specifications.
The basic functions exercised in this project include routing, authentication, and access control features such as whitelisting, rate limiting, and buffer sizing. Various TLS configurations were investigated, including using self-generated certificates with OpenSSL, using the HPCC TLS implementation with cert-manager, and configuring TLS with a dynamic, externally reachable IP address. Blue-green and canary deployment patterns were also explored and tested, implemented with the HAProxy and Nginx controllers, respectively. These configurations are very useful in real Cloud application development and maintenance; for example, a canary deployment can be used to gradually roll out new application features, such as changes to ECL Watch in an HPCC cluster. End-to-end request-to-response times were collected for nine ECL samples, comparing HPCC with and without the Nginx Ingress controller. The initial latency tests indicate that Ingress does not add much performance delay.
Some benefits of using Ingress in the cloud are its wide range of features, such as access control, basic authentication, a single access point for external traffic, and advanced routing.
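As a rough illustration of configuring such features through annotations, the sketch below creates an Nginx Ingress for an ECL Watch style service using the Kubernetes Python client. The hostname, namespace, service name, and annotation values are assumptions for illustration, not the project's actual configuration.

```python
from kubernetes import client, config

config.load_kube_config()   # or load_incluster_config() when running inside a pod

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(
        name="eclwatch-ingress",
        annotations={
            # Nginx ingress-controller annotations for access control
            "nginx.ingress.kubernetes.io/whitelist-source-range": "10.0.0.0/16",
            "nginx.ingress.kubernetes.io/limit-rps": "10",
            "nginx.ingress.kubernetes.io/proxy-buffer-size": "16k",
        },
    ),
    spec=client.V1IngressSpec(
        ingress_class_name="nginx",
        rules=[client.V1IngressRule(
            host="eclwatch.example.com",          # placeholder hostname
            http=client.V1HTTPIngressRuleValue(paths=[
                client.V1HTTPIngressPath(
                    path="/",
                    path_type="Prefix",
                    backend=client.V1IngressBackend(
                        service=client.V1IngressServiceBackend(
                            name="eclwatch",      # assumed ECL Watch service name
                            port=client.V1ServiceBackendPort(number=8010),
                        ),
                    ),
                ),
            ]),
        )],
    ),
)
client.NetworkingV1Api().create_namespaced_ingress(namespace="default", body=ingress)
```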
André Fontanez Bravo Best Poster - Research
University of Sao Paulo, Brazil
Big Data and Logistic Regression applied to Analysis of Loan Requests
Big Data and its applications are becoming more and more important across many different fields. In this context, techniques and tools that are able to process the immense flow of information to create value can be powerful instruments. This study focuses on the application of data analysis to financial investments on LendingClub’s platform. LendingClub is an American peer-to-peer lending company. The company’s platform allows users to file loan requests and others to finance them, becoming investors. Each loan is broken up into Notes that represent a fraction of said loan. These Notes can also be traded among investors, similarly to what is done in the stock market. The investors can choose the loans they wish to finance based on a plethora of information about the loan and the borrower, such as the loan’s interest rate and the borrower’s purpose and credit score. Even though LendingClub assesses all loan requests before making them available on its platform, the company’s public historical database shows that around 12.5% of loans were charged off. As investing in loans that end up not being paid evidently incurs financial losses, it would be useful to have a way to identify loan requests that have a higher probability of being paid on time.
With that in mind, the goal of this study is to develop a logistic regression model capable of identifying the best options for investment among the loan requests at LendingClub’s platform using the information available to investors. This binary model should calculate the likelihood of a loan being paid and then classify it as “good loan” or not.
Given the size of the company’s dataset (over two million records, with dozens of columns), this project was developed on the HPCC Systems platform, which is able to handle large volumes of data and also has a logistic regression module. The modelling process involved four main stages:
- Data extraction
- Data cleaning and preparation
- Model training and evaluation
- Optimization.
By the end of the study, two final models were obtained through two different methods of optimization. The first is better at filtering loan requests to obtain a higher proportion of good loans, while the second filters loan requests while discarding as few good loans as possible.
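The study itself used the HPCC Systems logistic regression module; purely as a Python analogue of the pipeline and of the precision-versus-recall trade-off behind the two optimized models, a minimal sketch follows. The file name, feature columns, and thresholds are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Hypothetical cleaned extract of the LendingClub data: 'good_loan' is 1 for
# fully paid loans and 0 for charged-off loans; feature names are illustrative.
df = pd.read_csv("lendingclub_clean.csv")
features = ["int_rate", "annual_inc", "dti", "fico_range_low", "loan_amnt"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["good_loan"], test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# The two optimization goals map naturally onto the decision threshold:
# a high threshold favors precision (more good loans among those kept),
# a low one favors recall (fewer good loans discarded).
probs = model.predict_proba(scaler.transform(X_test))[:, 1]
for threshold in (0.9, 0.5):
    preds = (probs >= threshold).astype(int)
    print(threshold,
          "precision:", round(precision_score(y_test, preds), 3),
          "recall:", round(recall_score(y_test, preds), 3))
```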
Atreya Bain 2021 Community Choice Award
RVCE, Bengaluru, India
Improvements on HSQL: A SQL-like language for HPCC Systems
Big Data has become an important field, and there is a steep learning curve to getting used to handling Big Data, especially in distributed systems. HSQL for HPCC Systems is a solution developed to help users get used to its architecture and to ECL (Enterprise Control Language), the language with which it primarily operates. HSQL aims to provide a seamless interface for data science developers to use when working with data. It is designed to work in conjunction with ECL, the primary programming language for HPCC Systems, and should prove easy to work with and robust for general-purpose analysis.
HSQL provides a compact, easy-to-comprehend SQL-like syntax for performing visualizations, general exploratory data analysis, and training of Machine Learning models, while also allowing such programs to be structured modularly. Functions can also be written to allow for code reuse. It also integrates with the VS Code IDE to provide syntax highlighting and code completion.
Previous work on HSQL set the primary foundations; in this work, various improvements were made to make it more usable and correct as a compiler. The architecture of the compiler has been changed to translate more effectively, and the newer version adds support for functions, for code reusability, and for modules that help structure code. Additionally, many of the existing statements have received new features that make them easier and better to use.
Bruno Carneiro Camara
University of Sao Paulo, Brazil
Preventing Fraud by Registration Inconsistencies
Large sums of money are lost because of fraud committed by companies. Laws already exist to punish company partners for these abusive acts committed for their own benefit; however, how can the authorities locate them and take the necessary actions? This is where my work comes in. Identifying registration inconsistencies, suspicious behaviors, or unusual situations may help prevent or locate fraud. Using three different public databases as the starting point, I was able to link companies and partners to suspicious behaviors, such as receipt of undue government benefits by company partners and reports of work analogous to slavery in companies.
The three public databases used were:
- Brazilian Companies - Divided into three categories: Companies, Partners, and Establishments. Each category holds specific information, such as company status, partner position, age group, location, dates, and more.
- Government Subsidies - People who received government aid. I chose two well-known Brazilian subsidies, Bolsa Família and Auxílio Emergencial, both aimed at people with low income.
- Work Analogous to Slavery - Information about companies or employers that practice slavery-like work.
All these datasets are publicly available on the Brazilian government websites.
With the three databases properly treated and cleaned, I was able to reach my goal of identifying suspicious behaviors, unusual situations, and registration inconsistencies, obtaining three main datasets: partners that received benefits and their respective companies, establishments that had some type of complaint about work analogous to slavery, and partners with reports of labor practices analogous to slavery. Those three resulting datasets were descriptively analyzed in order to highlight group singularities and trends. The most relevant keys used to analyze groups were partner position, partner age group, and the company's area of activity. One of the trends highlighted: 49% of the partners that received benefits were managing partners, which is a high position for someone receiving assistance.
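The linkage itself was done in ECL on HPCC Systems; purely to illustrate the record-linkage step described above, here is a minimal pandas sketch with hypothetical file and column names.

```python
import pandas as pd

# Illustrative column names; the real datasets use Portuguese field names and
# require the cleaning and treatment steps described above.
partners = pd.read_csv("partners.csv")      # cnpj, partner_name, partner_cpf, position, age_group
subsidies = pd.read_csv("subsidies.csv")    # beneficiary_name, beneficiary_cpf, program
slavery = pd.read_csv("slave_labor.csv")    # employer_cnpj, report_year

# Partners who received government benefits, with their respective companies
partners_with_benefits = partners.merge(
    subsidies, left_on="partner_cpf", right_on="beneficiary_cpf", how="inner")

# Establishments with complaints of work analogous to slavery
flagged_establishments = partners.merge(
    slavery, left_on="cnpj", right_on="employer_cnpj", how="inner")

# Trend highlighted above: share of benefit-receiving partners by position
print(partners_with_benefits["position"]
      .value_counts(normalize=True)
      .mul(100).round(1))
```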
In short, this project shows that combining Big Data and analytics with HPCC Systems could be a promising way to prevent and locate fraud, using registration inconsistencies as a tool to flag potentially fraudulent companies and company partners.
Carina Wang Best Poster - Data Analytics
American Heritage School, FL, USA
Processing Student Image Data with Kubernetes and HPCC Systems GNN on Azure
In order to foster a safe learning environment, measures to bolster campus security have emerged as a top priority around the world. The developments from my internship will be applied to a tangible security system at American Heritage High School (AHS). Processing student images on the HPCC Systems Cloud Native Platform and evaluating the HPCC Systems Generalized Neural Network (GNN) bundle on the cloud ultimately enabled a model to classify an individual as “AHS student” or “Not an AHS student”. Running the trained model, a security robot will help security personnel identify visitors on campus, serve as an access point for viewing various locations, and give students permission to navigate school information. The long-term goal is to process mass amounts of student/staff/visitor images with HPCC. To bring HPCC Systems one step closer to that stage, this project produced results at a faster pace and increased overall accuracy rates. HPCC Systems is transitioning from Bare Metal to the Cloud Native Platform. To facilitate this transition, this project leverages HPCC Systems by exercising HPCC GNN models and HPCC GNN Thor clusters in the cloud environment to train on a dataset of 4,839 images.
The prevailing obstacles faced in Machine Learning are insufficient real-world data and the effort of developing CNN models from scratch. To combat these challenges, this project took an alternative approach to data collection and evaluated multiple pre-trained models to identify the one with peak accuracy and time efficiency. Instead of artificially augmenting photos of each student (e.g. fake background colors and manually adjusted angles), I obtained 4,000+ images by splitting a video into frames. This magnified the scope of the project by expanding the number of real images from the robot with consistent backgrounds and angles.
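A minimal sketch of this frame-splitting step follows, assuming OpenCV, a placeholder video path, and an arbitrary sampling rate; the project's actual tooling is not specified in the abstract.

```python
import cv2
import os

video_path = "student_walkthrough.mp4"   # placeholder recording from the robot
out_dir = "frames"
every_n = 5                              # keep every 5th frame

os.makedirs(out_dir, exist_ok=True)
cap = cv2.VideoCapture(video_path)
index, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % every_n == 0:
        cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
        saved += 1
    index += 1
cap.release()
print(f"wrote {saved} frames")
```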
As image classification has matured over the years, more pre-trained models have become available. This project evaluated five TensorFlow pre-trained CNN models (to compare processing speed and accuracy) and an HPCC Systems GNN model. Through the latter, this work helped test the HPCC Thor functionality by varying parameters on the GNN model. The application cluster, with Docker images of HPCC Systems Core and TensorFlow libraries, was deployed on Azure. By evaluating industry-standard models, this work helps users easily train a dataset with drastically better results. The MobileNet V2 model was the fastest and achieved 100% accuracy. The results show that pre-trained models with modifications can achieve optimal results without developing a model from scratch.
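For readers unfamiliar with the transfer-learning pattern being compared here, a minimal Keras sketch with a frozen MobileNetV2 base and a binary head is shown below. The directory layout, image size, and training schedule are assumptions rather than the project's exact setup.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "students/train", image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "students/val", image_size=IMG_SIZE, batch_size=32)

base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False                     # keep the pre-trained features frozen

inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # AHS student / not
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)
```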
Finally, I developed a standard procedure for collecting images and training a model with the HPCC Systems Platform on Cloud. This will allow for processing larger datasets (e.g. photos from the entire school instead of a sample). The image classification model will be compatible and work in conjunction with devices mounted on our security robot for user convenience. The cloud-based student recognition model that has been developed in this project will allow a person to receive confirmation from the robot that they are in the student database and retrieve information as part of a larger, interactive security feature.
Chirag Bapat
RVCE, Bengaluru, India
Comparative study of HPCC Systems and Hadoop
To constantly evolve and generate better results from any system, we need ongoing studies that assess and compare the performance of new and upcoming systems against current industry standards. Through our project, we intend to perform such a comprehensive comparative study between the current standard in Big Data analytics systems, Hadoop, and HPCC Systems. This will allow us to assess both the similarities and differences between the two setups, which in turn will assist the end user or client in making a better and more informed choice about the kind of system to set up for their specific requirements.
Through the "Comparative study of HPCC Systems and Hadoop", we plan to prepare each solution from scratch and analyse various parameters, not limited to technical performance but covering the overall user experience as well. This includes the ease of, and time required for, setting up the environments, among other factors. Through our presentation, we aim to compare the following parameters:
- Ease of access to material regarding the software concerned
- Time required to set up clusters
- Ease of programming in each system using its respective programming language
- Running various machine learning algorithms on each system with different-sized datasets, measuring their accuracies and execution times, and contrasting the lines of code (LoC) required to implement the same
During the implementation of the machine learning algorithms, we shall be working on two differently sized datasets: the smaller USA Cars dataset provided by HPCC Systems and a large UK Housing dataset sourced from Kaggle. This also allows us to compare the data-loading performance of each system.
Chris Connelly
North Carolina State University
Ingestion and Analysis of Collegiate Women's Basketball GPS Data in HPCC Systems and RealBI
In the past, NC State Strength and Conditioning has worked with HPCC Systems to create solutions for taking different data streams and bringing them together for a comprehensive analysis to improve athlete wellbeing and performance. Here you will see some solutions using HPCC Systems and RealBI to provide insight from data collected with the NC State Women's Basketball team. You will also see some of the differences between working in a Bare Metal environment and a Kubernetes environment. From uploading data into a cloud-based environment to visualizing the data in a streamlined and interactive dashboard, see how these solutions can improve our understanding of this data and provide better service to these student athletes.
Deeksha Shravani
RVCE, Bengaluru, India
Developing a Recommendation System for a Virtual Reality based Supermarket using Big Data Platforms
This talk introduces a Virtual Reality (VR) based online shopping platform and its integration with a recommendation system with the demonstration of the virtual environment. With the advent of the pandemic, the ability of virtual reality platforms to provide a realistic shopping experience puts it in a unique position that assures safety and isolation while also offering the benefits of online shopping platforms to both customers and retailers.
To foster user adoption and improve the user experience beyond the confines of traditional shopping, a recommendation system is necessary in such a platform. For a recommendation system in a retail context, the amount of training data is vast, which warrants evaluating the various platforms and Big Data analytics frameworks that facilitate training on large-scale data.
For a dataset with 9M+ records, training is compared on a Graphics Processing Unit (GPU), HPCC Systems, and Spark on Hadoop, and various metrics are evaluated. The evaluation metrics cover not only system performance but also measures such as total training time and time to complete the initial epoch, which matter for the operational sustainability of the platform. Moreover, deriving conclusions in real time is essential for any recommendation system, so in addition to training time, inference time, i.e. the time required to provide recommendations after training, is also examined. The results indicate the advantages of Big Data platforms over GPU training. Though Spark on Hadoop is faster in training, the results feature HPCC Systems as the better platform for real-time inferencing.
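The abstract does not specify the model architecture; purely to illustrate the two measurements being compared (total training time versus per-request inference time), here is a deliberately small embedding-based recommender in Keras with synthetic data. The real dataset, model, and platforms (GPU, HPCC Systems, Spark on Hadoop) differ.

```python
import time
import numpy as np
import tensorflow as tf

n_users, n_items, n_obs = 10_000, 2_000, 500_000
rng = np.random.default_rng(1)
users = rng.integers(0, n_users, n_obs)[:, None]
items = rng.integers(0, n_items, n_obs)[:, None]
ratings = rng.random(n_obs).astype("float32")

# Dot-product of user and item embeddings as a minimal recommender
u_in = tf.keras.Input(shape=(1,), dtype="int32")
i_in = tf.keras.Input(shape=(1,), dtype="int32")
u_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_users, 32)(u_in))
i_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_items, 32)(i_in))
score = tf.keras.layers.Dot(axes=1)([u_vec, i_vec])
model = tf.keras.Model([u_in, i_in], score)
model.compile(optimizer="adam", loss="mse")

start = time.perf_counter()
model.fit([users, items], ratings, batch_size=4096, epochs=2, verbose=0)
print("training time (s):", round(time.perf_counter() - start, 2))

start = time.perf_counter()
model.predict([users[:1000], items[:1000]], batch_size=1000, verbose=0)
print("inference time for 1,000 recommendations (s):",
      round(time.perf_counter() - start, 3))
```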
Francisco Ciol Rodrigues Aveiro
Insper, Sao Paulo, Brazil
HPCC Systems Ingress Configuration with AWS ALB
In the current era of information, the use of cloud computing has become a necessity due to the amount of computational power needed. Access to storage and processing power at low cost, combined with ease of access, are some of the advantages of such services, which are available as platform as a service (PaaS), software as a service (SaaS), infrastructure as a service (IaaS), and hardware as a service (HaaS). In the IaaS model, payment normally follows a pay-as-you-go policy, where you pay for what you use. Though pricing may be cheap, misuse of resources and unnecessary uptime can drive up the cost. To minimize those costs, autoscaling features and containerized applications can be used to keep resource misuse under control.
An example of a containerized system with autoscaling capability is Kubernetes, as it can be configured to scale the cluster to what is needed and then return to a minimal state when demand goes down, all in a relatively simple manner. This prevents unnecessary resource usage while scaling to properly attend to any task needed. Another cost associated with a cloud cluster is the number of external IPs needed. Since you need to pay for each external IP exposed to the web, a full cluster with unique IPs for every service can be expensive. To reduce the number of external IPs used, a single entry point can be configured, which then redirects to each service based on a subdomain or path.
Such an implementation also centralizes access management for the cluster, improving security and governance. The objective of this study is to achieve a single entry point for accessing an AWS Kubernetes cluster, configuring Ingress and the AWS ALB to manage and redirect user access to the correct service in a cluster, and then to create a Helm chart to replicate this structure in other HPCC Systems clusters. With this implementation, an HPCC Systems cluster would only use one external IP for its multiple services, thus reducing cost and improving security.
Guilherme Santos da Silva
Universidade Tecnológica Federal do Paraná, Brazil
HPCC Systems File Usage Monitor
A cluster is a connection between two or more computers with the purpose of improving the performance of systems when performing different tasks. In a cluster, each computer is called a “node”, and there is no limit to how many nodes can be interconnected. The computers then act as a single system, working together to process, analyze, and interpret data and information and/or to perform simultaneous tasks. It is useful to know information about a cluster, such as its capacity and availability. This monitoring can help in the maintenance and management of this type of system, avoiding the upload of very large files that would occupy a large part of its capacity, or informing the user of the ideal size that can be used. Currently, there is no such control and monitoring in clusters, and the system administrator only finds out that the cluster is out of space when teams notify them.
Therefore, this project proposes a cluster monitoring system. This is a system that maintains a list of all files in each of the Thor clusters and presents a dashboard that can enable a user to drill into disk utilization on any of them. It is the most efficient way to track multiple data sources as it provides, in real time and in a single location, all the information needed to track a company's performance. As data is displayed in real time, long hours are not required to interpret all indicators and the time to communicate results is shorter and more efficient.
The project consists of a structure formed by .ecl files, which include a section to run the build hourly via cron; a section to collect the information from each server in a cluster, collate it, and create a key to be used by the service; and the service itself, plus an HTML file providing a dashboard that shows file system usage across all clusters. The user will be able to select the desired cluster and thus have access to data and usage graphs over time (virtual size and real occupied area of the disk). Each dot on these charts, when clicked, will call a Roxie service showing the five main scopes and their relative usage in a pie chart, all using the Visualizer bundle. With the tool we will also provide an alert that is triggered when the cluster is running out of space. The ultimate goal of this project is to provide the tool for people to use on any cluster.
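For context on how a dashboard can call such a published Roxie query, here is a minimal sketch using the WsEcl JSON interface via Python; the host, query name, and parameters are hypothetical placeholders rather than the project's actual service.

```python
import requests

# Hypothetical ESP host and published Roxie query name; the WsEcl JSON submit
# endpoint follows the pattern
# http://<esp-host>:8002/WsEcl/submit/query/<target>/<query>/json
ESP = "http://esp.example.com:8002"
QUERY = "file_usage_scopes"          # placeholder query name

response = requests.post(
    f"{ESP}/WsEcl/submit/query/roxie/{QUERY}/json",
    json={QUERY: {"cluster": "thor", "date": "2021-08-01"}},
    timeout=30,
)
response.raise_for_status()
print(response.json())               # e.g. top scopes and their relative usage
```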
Jeff Mao Best Poster - Use Case
Lambert High School, GA, USA
Not only was the creation of the internet the largest technological breakthrough of the 20th century, it also became a hidden double-edged sword. The internet has allowed us to access information and communicate at unprecedented levels, across the globe. Yet this comes at an enormous cost: the human cost. Hidden behind computer screens, we enjoy a security blanket of anonymity, which emboldens some to say and do things that would be labeled disturbing in a public setting.
Throughout the lifespan of the internet, these people - dubbed “trolls” - have evolved from provocative users in online chatrooms to bullies. Now, trolls infest every layer of the internet and show themselves in online activities from videogames to YouTube comment sections. This trend of online harassment has become so prevalent that it has been classified as “cyberbullying” and is a known cause of depression, self-harm, and suicide. As documented by Stop Bullying, most instances of cyberbullying take place through text messaging, online messaging, direct messaging, and online chatting.
By creating a Toxicity Detection Platform, I aim to curb this harassment and provide a healthier web environment for everyone.
Luiz Fernando Cavalcante Silva
University of Sao Paulo, Brazil
The amount of open data made available by government agencies is getting bigger over time. This results in a large number of datasets with different layouts, formats, and update frequencies that can fall under the domain of Big Data. Despite being difficult to analyze, these datasets hold a large amount of rich information that could be useful for applications involving public policies.
One such example is the dataset covering the São Paulo real estate registry, which is made publicly available by the São Paulo city government and contains a variety of information about each property in the city of São Paulo, such as its address, land square footage, building area, terrain and construction values, number of floors, and the type of the property. Since São Paulo is a large city with more than 3 million registered real estate properties, this dataset has a large amount of data that can be analyzed for different purposes. For instance, since property taxes are calculated based on the property location and its physical characteristics, this dataset can be used to group properties sharing similar characteristics and compare whether their tax values are similar. Outliers in these groups could potentially be considered candidates for tax evasion or fraud. Based on the list of outliers, the city council can assign tax inspectors to physically visit these properties and inspect their characteristics, something that could help the city hall with tax evasion problems.
To assist with challenges like this, there is a need to combine different Big Data technologies, such as a powerful platform for data extraction, transformation and loading (ETL), plus machine learning algorithms. One example of such an end-to-end Big Data management platform is HPCC Systems.
The objective of this project is to develop a machine learning pipeline using HPCC Systems that can ultimately be used to identify outliers in the São Paulo city government’s real estate registry extract. To achieve this aim, a complete ETL pipeline needed to be structured to treat the data before unsupervised machine learning algorithms could be utilized to cluster the data.
This approach required extensive use of HPCC Systems functionality in various data extraction and transformation steps. In the extraction phase, the dataset was defined to be utilized by the platform with the adequate data types. It was also necessary to reduce the amount of data in each record because of the challenge of manipulating and interpreting a large number of fields. For this, ECL code was developed to calculate the correlation between fields, and the correlation values were used to combine the fields into factors. ECL code was also developed to normalize the resulting field values using the Machine Learning Core bundle. Lastly, the resulting dataset was submitted to the K-Means clustering algorithm using the HPCC Systems K-Means bundle to group all the remaining data into clusters. Besides giving information about the organization of properties in the city of São Paulo, the clusters also allowed the identification of outliers relative to each cluster, by using the box-plot method.
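The actual pipeline was written in ECL with the HPCC Systems ML Core and K-Means bundles; as a rough Python analogue of the clustering and box-plot outlier steps described above, a sketch follows. The file name, column names, and number of clusters are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("sp_real_estate_factors.csv")   # already reduced to factor columns
X = StandardScaler().fit_transform(df)           # normalize the factor values

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
df["cluster"] = kmeans.labels_

# Box-plot (IQR) rule applied per cluster to flag unusual property values
def iqr_outliers(group, column="terrain_value"):
    q1, q3 = group[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return group[(group[column] < q1 - 1.5 * iqr) | (group[column] > q3 + 1.5 * iqr)]

outliers = df.groupby("cluster", group_keys=False).apply(iqr_outliers)
print(len(outliers), "candidate outliers flagged for inspection")
```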
Mayank Agarwal
RVCE, Bengaluru, India
Independence Testing with RCoT: Causal Validation and Discovery for the HPCC Systems Causality Toolkit
The new science of Causality promises to open new frontiers in Data Science and Machine Learning, but requires an accurate model of the causal relationships between variables.
This causal model takes the form of a Directed Acyclic Graph (DAG). Nature provides a few subtle cues to the structure of the causal model, the most important of which is the independencies or conditional independencies between variables. These independencies allow us to test a causal model to determine if it is consistent with the observed data, and in some cases to discover the causal model from data alone. Doing so, however, is very challenging because of the subtleties of the signals, the presence of noise, and the often high conditional dimensions involved. Runtime performance is also very critical, since it impacts the depth of testing that can realistically be achieved. While studying the various algorithms, I found one with the highest accuracy and efficiency called RCoT.
RCoT is a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on Reproducing Kernel Hilbert Spaces (RKHS). Unlike previous kernel dependency measures, RCoT does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. RCoT is a very recently developed method and had only been implemented in the R language. For compatibility with our Causality Framework, I redeveloped the algorithm in Python. It has been tested on various models, both synthetic and real-world, and has shown satisfactory results. The first phase of RCoT is the detection of dependence between data elements as described above. It must then determine whether the detected dependence is statistically significant, resulting in a “test statistic”. In RCoT, the null hypothesis is (X _||_ Y | Z), i.e. X is independent of Y given Z, and the alternative hypothesis is that the two random variables are dependent conditioned on the set of random variables Z. The output of RCoT is a p-value, and the null hypothesis is rejected when the p-value is less than 5%.
One of the improved approximation techniques is the Lindsay-Pilla-Basak (LPB) method, an approximation technique for a weighted sum of chi-squared random variables. Besides LPB, there are various other approximation methods; the one I used is the Hall-Buckley-Eagleson (HBE) method. The third major feature of RCoT is its excellent runtime performance and scalability relative to other methods. RCoT achieves this by using Random Fourier Features to closely approximate the behavior of the previous method, KCIT (Kernel-based Conditional Independence Testing), without taking as much computation time. Random Fourier Features is a widely used, simple, and effective technique for scaling up kernel methods. It allows approximation with arbitrary precision using a lower-dimensional model, giving RCoT unprecedented runtime performance and scalability, even for high-dimensional conditioning. I evaluate the runtime and accuracy performance of the resulting RCoT module and compare it to our previous method.
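To give a flavor of the Random Fourier Features idea referenced here, a small Python sketch of the generic RFF construction for an RBF kernel follows. This is the standard approximation RCoT builds on, not the RCoT test itself, and the dimensions and bandwidth are arbitrary.

```python
import numpy as np

def random_fourier_features(X, n_features=100, sigma=1.0, seed=0):
    """Approximate an RBF kernel feature map so that k(x, y) ~ z(x) . z(y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))   # w ~ N(0, sigma^-2 I)
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Sanity check: compare the exact RBF kernel with its RFF approximation
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
Z = random_fourier_features(X, n_features=500)
exact = np.exp(-0.5 * np.sum((X[:1] - X) ** 2, axis=1))   # k(x0, xi) with sigma = 1
approx = Z[:1] @ Z.T
print("max abs error:", np.abs(exact - approx).max())
```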
The RCoT algorithm for independence testing will form a concrete foundation for the HPCC Systems Causality Toolkit, which will further enhance its performance using the HPCC Systems Platform's robust parallelization capabilities.
Murtadha D. Hssayeni
Florida Atlantic University
The Forecast of COVID-19 Spread Risk at The County Level
The early detection of a coronavirus disease 2019 (COVID-19) outbreak is important to save people's lives and restart the economy quickly and safely. People's social behavior, reflected in their mobility data, plays a major role in spreading the disease. Therefore, we used daily mobility data aggregated at the county level, alongside COVID-19 statistics and demographic information, for short-term forecasting of COVID-19 outbreaks in the United States. The daily data are fed to a deep learning model based on Long Short-Term Memory (LSTM) to predict the accumulated number of COVID-19 cases in the next two weeks. A significant average correlation was achieved (r=0.83, p=0.005) between the model's predicted and actual accumulated cases in the interval from August 1, 2020 until January 22, 2021. The model predictions had r > 0.7 for 87% of the counties across the United States. A lower correlation was reported for the counties with total cases of <1,000 during the test interval. The average mean absolute error (MAE) was 605.4 and decreased with a decrease in the total number of cases during the testing interval. The model was able to capture the effect of government responses on COVID-19 cases. It was also able to capture the effect of age demographics on COVID-19 spread, showing that average daily cases decreased with a decrease in the retiree percentage and increased with an increase in the young percentage.
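As a rough illustration of the model family described here (not the study's exact architecture, features, or data), a minimal Keras LSTM regression sketch with placeholder data is shown below; the window length, feature count, and layer sizes are assumptions.

```python
import numpy as np
import tensorflow as tf

window, n_features = 14, 10
X = np.random.rand(1000, window, n_features).astype("float32")   # placeholder data
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, n_features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),               # accumulated cases two weeks ahead
])
model.compile(optimizer="adam", loss="mae")  # MAE is the error metric reported above
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
```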
Lessons learned from this study not only can help with managing the COVID-19 pandemic but also help with early and effective management of possible future pandemics. The project used the HPCC Systems platform for collecting, hosting, and analyzing the data. For more details, please visit https://covid19.hpccsystems.com.
Nikita Jha Best Poster - Platform Enhancement
Northview High School, GA, USA
Apply Docker Image Build and Kubernetes Security Principles
With cybersecurity attacks becoming more prevalent in the United States every year, organizations are constantly looking for ways to improve the security outlook of their platforms. HPCC Systems is an open-source, big data analytics platform that provides high-performance data processing for other companies in the form of parallel batch data processing and online query applications. Recently, the company has begun transitioning to a cloud-native platform in which Docker containers managed by Kubernetes are used to store and manage data. With this new change, it is of utmost importance that HPCC Systems has a secure cloud environment, since it is used to manage data from other companies. The two fundamental components of the cloud-native platform that need to be secured are Docker and Kubernetes.
The first of the implementable Docker security features is an option for developers to enable or disable image build caching based on the specifics of their application. For example, if a container includes applications that were recently updated, developers can disable caching for those builds, while keeping it enabled for the rest, to make sure the Docker build picks up all changes, including any security updates. To test whether this change actually improves the security of the platform, vulnerability scanners such as Trivy, Grype, and Docker Scan were used to detect differences between the “before” and “after” security threats. Results showed that builds with image build caching disabled had two fewer vulnerabilities than builds with caching enabled. While it is not always ideal to have the build cache turned off, the results indicate that disabled caching gives the company a better security outlook.
Another critical aspect of Docker that should be accounted for is the common misunderstanding of the "latest" tag. The 'latest' tag simply refers to the last build that ran without a specific tag. Due to caching issues, the latest tag often fails to hold what people expect, meaning it does not necessarily point to the newest code. Therefore, it is safer to version the tags every time, so developers know exactly which update they are working with.
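The two practices above can be combined in a build script. A minimal sketch using the Docker SDK for Python follows; the image name and version are placeholders, not HPCC Systems' actual build configuration.

```python
import docker

# Build with an explicit version tag (never rely on 'latest') and with the build
# cache disabled so updates in base layers are always picked up.
client = docker.from_env()
image, build_log = client.images.build(
    path=".",                          # directory containing the Dockerfile
    tag="hpcc-example/platform:8.2.1", # placeholder, explicitly versioned tag
    nocache=True,                      # equivalent to `docker build --no-cache`
)
for line in build_log:
    if "stream" in line:
        print(line["stream"], end="")
```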
For the Kubernetes section of the project, the first component implemented in the HPCC Systems Platform was pod security policies. These configurations, which define the security-related conditions a Kubernetes pod has to meet in order to be accepted into a cluster, support numerous best practices such as disabling privileged containers, requiring read-only file systems, and preventing privilege escalation. The last implementable best practice for Kubernetes is certificate management with HashiCorp Vault. Certificate management is important because it enables the setup of Transport Layer Security, also known as TLS. This technology encrypts HPCC data sent over the internet so attackers cannot get access to it. In order to use TLS, certificates must be generated that carry important information about the server and its public keys. While certificates can be generated manually, this process is not scalable, especially for cloud applications. Therefore, HashiCorp Vault can be used to generate the certificates instead.
Roshan Bhandari
Clemson University
Use Azure Spot Instance with HPCC Systems for Cost Optimization
Minimizing the cost of setting up cloud infrastructure is very important for all companies, and Azure Spot Instances can provide great cost savings for cloud infrastructure setups. Azure Spot Instances are Azure's unused computing resources (virtual machines), offered at a lower price than normal virtual machines; the rate can be as much as 90% below the price of a normal instance, varying by region and size. These Spot Instances do not have a service-level agreement, Azure does not provide any high-availability guarantee for them, and Azure can reclaim the machines whenever it needs them, with or without notice.
In this project, we analyze different aspects of using Azure Spot Instances with HPCC Systems.
Shivani C H
RVCE, Bengaluru, India
COVID-19 Cases and Vaccination Data Tracker in India
With the global outbreak of the COVID-19 pandemic, it has become crucial to track active cases and vaccination data in order to analyse the current situation and trends. Hence, a systematic way of collecting, processing, enhancing, analysing, and visualising the data and trends for better understanding has been very much needed. Through this project, we aim to provide users with the required information about COVID-19 cases since its outbreak, and about vaccination data, for the different states of India and the country as a whole.
We plan to analyse the collected COVID-19 data using the ECL language on the HPCC Systems platform to give the user the requested consolidated information. This process includes data preprocessing, data enhancement, analysis, and visualizations to understand the retrieved data in a better way. We aim to achieve the following objectives displayed in the poster:
- Displaying various COVID-19 vaccination details up to the current day, vaccination details for a given day, and details about COVID-19 cases and tests done up to the current day.
- Visualizations are created for every analysis performed, and the conclusions derived from the data help to understand the trends and learn better.
- State-wise vaccination trends for cumulative data till now.
- State-wise vaccination details for any given date.
- Cumulative state-wise data till any given date.
- State-wise COVID-19 cases registered up to the current day.
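The project itself uses ECL on the HPCC Systems platform; purely to illustrate the kind of state-wise, date-bounded aggregation listed above, here is a small pandas sketch with hypothetical file and column names.

```python
import pandas as pd

vacc = pd.read_csv("india_vaccination.csv", parse_dates=["date"])
cases = pd.read_csv("india_cases.csv", parse_dates=["date"])

# Cumulative state-wise vaccinations up to a given date
as_of = "2021-08-31"
cum_vacc = (vacc[vacc["date"] <= as_of]
            .groupby("state")["doses_administered"].sum()
            .sort_values(ascending=False))

# State-wise cases registered up to the current day
cum_cases = cases.groupby("state")["confirmed"].max()

print(cum_vacc.head(10))
print(cum_cases.sort_values(ascending=False).head(10))
```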