Find out more about the HPCC Systems Summer Intern Program, including how to apply, and read this blog introducing the students and their projects.

13 students joined our intern program in 2024. Our students presented their projects to the team during the year and 12 of them entered our 2024 Poster Contest hosted at the virtual HPCC Systems Community Day Summit in October 2024.

Meet the Class of 2024

Resume analyzer in NLP++

Name

Project Title

Description

Mentor(s)

Resources

Aryaman Gautam

Charan Nagaraj

Bachelor of Tech Data Science
Mukesh Patel School of Technology, Management and Computer Science RV College of Engineering, India

HPCC Systems local deployment on K3D cluster

The goal of this project was to establish an initial setup for a local deployment of HPCC Systems on K3D. K3D is a lightweight wrapper that runs K3S (Rancher Labs' minimal Kubernetes distribution) in Docker, which makes it very easy to create single-node and multi-node K3S clusters.

Xiaoming Wang

Godji Fortil

Chinmay Desai

Sidharth Ganesan 

View Poster 

2023 Community Day recording

Boqiang Li

Ph.D. in Computer Science, Clemson University, USA

Convert Generalized Neural Network bundle (GNN) to native TensorFlow 2.0

Neural Networks have emerged as a powerful tool for analyzing complex datasets like images, video, and time-series data, surpassing classical methods in their effectiveness. To leverage this potential, HPCC Systems offers the Generalized Neural Network Bundle (GNN), which combines the parallel processing capabilities of HPCC Systems with the robust Neural Network functionalities of Keras and TensorFlow. This project upgraded the GNN bundle to use the native TensorFlow 2 interface. The upgraded GNN with TensorFlow 2 demonstrated several significant advantages over its previous version.

Lili Xu

Roger Dev

View Poster

View Blog

Carlos Caceres 

High School Student American Heritage School Delray, FL, USA

Practical Application of Generative AI Technology 

During this project, a generalized interface was created for HPCC Systems to access GPT and ChatGPT. From there, steps were taken to use HPCC Systems to train a neural network model capable of classifying faces into different emotions. These emotions would then be processed by the interface to create a call to OpenAI’s API, from which an appropriate response would be generated.

Lili Xu

Roger Dev

View Poster

View Blog

2023 Community Day recording

Charvi Dave

Bachelor of Tech Data Science
Mukesh Patel School of Technology, Management and Engineering, India

A Resume Analyzer applies various techniques to analyze the resumes a company receives and retrieve their main sections. This project leveraged the NLP++ plugin to process resumes and extract the main headers and sections, such as skills, work experience, email, and education.

David de Hilster

Umesh Mahind

Nandhini Velu

View Poster

2023 Community Day recording

Migrate and Improve Regression Testing in GitHub Actions

At HPCC Systems, we use two main test systems: Overnight Build and Test (OBT) and Smoketest. Regression testing of ECL bundles, initially handled by OBT, is now integrated into Continuous Integration (CI) using GitHub Actions, automatically testing bundles when a pull request (PR) is raised. Additionally, I implemented automated testing of hyperlinks in our documentation files, also using GitHub Actions. This ensures that broken links are detected early, keeping the documentation accurate without requiring manual verification.

Attila Vamos

View Poster

View Blog

Eatesam Khan            

Masters in Computer Science California State University, USA

Create a New HPCC Command Line Tool

As part of my internship, I developed a command-line tool that simplifies interaction with HPCC Systems ESDL services, offering powerful features for describing and testing services. The describe command provides detailed information about available services, methods, and request-response structures, while the test command allows users to send test requests, supporting various formats like XML and JSON. Key options include setting authentication credentials and server details. A standout feature is dynamic tab auto-completion, which helps users input commands accurately and efficiently.

Terrence Asselin

Tim Klemm

View Poster

View Blog

El Arbi Belfarsi                 

PhD in Computer Science Kennesaw State University, USA

Update and Improve the Generation of Platform Artifacts for HPCC Systems Builds

This project focuses on transitioning the HPCC Systems CI/CD workflow from Jenkins to GitHub Actions, automating platform artifact generation using Python. A Python script replaces an existing web service, handling tasks like fetching assets, extracting metadata, and saving data as JSON. The workflow automates setup of AWS credentials, Docker image management, and uploads to GitHub and AWS S3, with security provided by GitHub secrets. This project streamlines the build process, reduces manual effort, and improves automation, benefiting the HPCC Systems platform and the open-source community.

Michael Gardner

Ming Wang

View Poster

View Blog

Elizabeth Lorti       

Bachelor of International Development,
King's College, UK

HPCC Systems Technology Marketing and Branding

As a returning HPCC Systems intern who has worked year-round on maintaining social media, this year I completed a review of my own social media contributions and strategy to see what could be improved. I also conducted interviews among stakeholders and recorded minutes to best understand and communicate the needs of the Technology Summit and Community Day stakeholders.

For this year's Tech Summit, I coordinated communication with stakeholders, collected speaker bios and abstracts for uploads, and worked closely with the project management team. I also managed all social media channels and key event aspects. Leveraging two years of prior experience, including last year's Summit, I efficiently referenced past spreadsheets to streamline bio and content management.

Jessica Lorti

View Poster

View Blog

Hiroki Sato

Gagana Premnath

Masters in Computer Science, Syracuse University, USA

Masters in Computer Science, University of Indiana, USA

Automation Integration of HPCC Systems Cloud Native Deployment to AWS with Terraform

Terraform CI with GitHub Actions

This project leveraged Terraform, integrating HPCC Systems Terraform-based infrastructure management with GitHub Actions to streamline the deployment of the HPCC Systems containerized application onto an AWS Elastic Kubernetes Service (EKS) cluster. During the internship, we developed an hpcc-aws-terraform module. This consisted of building the necessary AWS infrastructure, such as a virtual private cloud (VPC), subnets, the necessary security group, an EKS cluster, and a node group.

Wayne Carty

Godson Fortil

View Poster

View Blog

Jessie Mao                 

High School Student Lambert High School Suwanee, GA, USA

HPCC Systems Deployment with Various Helm Chart Configurations

This project provided two solutions for HPCC Systems deployments. The overrides solution uses the default values.yaml file together with other files that modify it. Overrides can be used to make small changes to values.yaml, and mainly concentrate on Roxie and Thor. HPCC-lite, on the other hand, does not require a custom values.yaml file, so it can be used with other files to create more scenarios.

Xiaoming Wang

Godson Fortil

View Poster

View Blog

Johnny Huang

clusters. Terraform modules (vnet, storage, aks, and HPCC Systems) are deployed sequentially using GitHub Actions workflows. Key steps include configuring Terraform, managing Azure authentication, handling data persistence, and securing sensitive information with GitHub Secrets. By automating deployments through GitHub Actions, the project ensures consistency, reduces manual intervention, and improves deployment efficiency, while fostering collaborative development and maintaining reliable, version-controlled infrastructure across environments.

Godji Fortil

Ming Wang

View Poster

View Blog

Girikratna Premnath            

Bachelor of Computer Tech Data Science
University of Toronto, Canada

Improve Error Handling and Reporting for Automated Test Systems

This project concentrated primarily on refining the GitHub Actions scripts, a vital tool for automated testing within the HPCC Systems environment. These scripts analyze the logs generated from tests, providing a granular breakdown of the executed tests. I also introduced enhancements to the scripts to improve the fault tolerance of our testing systems. These included adding logic to retry failed actions, increasing the resilience of the system to transient issues, reducing test failures, and decreasing the need for manual interventions.

Attila Vamos

View Poster

View Blog

K Dheemonth                    

Bachelor of Computer Science and Engineering 
RVCE, India

Sentiment Analysis in English

During my internship we created a number of parsers and an analyzer using NLP++ (VisualText). To do this, we defined rules that assign sentiments in a generic manner rather than writing rules for specific cases. NLP++ assisted in constructing the parsers that assign different sentiments depending on the user, cricket terms, player and team interests, and team support. The second phase centered on the sentiments assigned to emojis. Emojis in the dictionary, a capability offered by NLP++, were used to assign sentiments to the cricket tweets.

David de Hilster

View Poster

View Blog

2023 Community Day recording

Kruthika Pinnada

Mukesh Patel School of Technology, Management and Engineering, India

Integration of Power BI with HPCC Systems platform

My project established a connection between Power BI and HPCC Systems using WsSQL for SQL-based data retrieval. I automated SOAP requests from Power BI to HPCC Systems, enhancing data analytics and visualization workflows. Using a Bare Metal System on WSL, I handled the Power BI integration with M code/Power Query and successfully tested it on various data sample sizes, ensuring smooth functionality.

Srinivasan Kothandam

Aryaman Gautam

View Poster

Harsh Raj          

Bachelor of Tech Data Science
Mukesh Patel School of Technology, Management and Engineering, India

Vehicle Build Contributory System

The goal of this project was to develop an end-to-end pipeline that automates data extraction using Python libraries such as Beautiful Soup and Selenium. Data transformation and cleaning were performed using HPCC Systems platform capabilities, and insights were visualized through tools like Power BI, creating a streamlined process from extraction to visualization.

Srinivasan Kothandam

Aryaman Gautam

View Poster

Ilhan Gelle            

Bachelor of Computer Science and Engineering 
RVCE, India

Resume Analyzer

The project "Resume Analyzer" leverages the power of the NLP++ programming language to build a digital human reader that parses resume text in the same way humans do. The system makes use of the “zoning” of a resume (done by a previous intern) and aims to analyze the text in depth and extract valuable information the way a human does.

David de Hilster

View Poster

View Blog

2023 Community Day recording

Logan Patterson          

Masters in Data Science
New College of Florida, USA

Designing Test Algorithms for Causal Model Discovery Within the HPCC Systems Causality Framework

The discovery model testing algorithm was used on four different algorithms, one of which was already implemented within Because: PC (Peter-Clark), GES (Greedy Equivalence Search), IGCI (Information Geometric Causal Inference), and RCC (Randomized Causation Coefficient). Each of the models was compared to the others based on performance with various datasets to determine the viability of both the testing algorithm and the models themselves. This algorithm hopefully paves the way for easier integration and implementation of causal discovery algorithms for future developments within the HPCC Systems Causality Framework.

Roger Dev

Lili Xu

University of Texas, USA

Test Suite for the HPCC Systems Parquet Plugin

This project developed a comprehensive test suite for the HPCC Systems Parquet Plugin, crucial for ensuring performance, functionality, and reliability in big data workflows. The test suite validates data integrity across ECL and Arrow data types, evaluates compression algorithms and file sizes, and simulates real-world scenarios like large datasets and schema evolution. It addresses edge cases to maintain stability, enabling HPCC Systems to leverage the Parquet format’s columnar storage for faster queries and better compression compared to CSV and XML, ensuring efficient data processing and transfer.

Jack del Vecchio

View Poster

View Blog

Narayan Kandel

Nisha Bagdwal

Ph.D. in Computer Science, Clemson University, USA

Masters in Information Technology, Kennesaw State University, USA

Enhancing Performance of Distributed Neural Network with GNN Bundle

Our work addresses the challenges of parallelizing neural network training, recognizing that, in certain scenarios, superior results can be achieved by training on a single high-powered node (e.g., with GPU) or a limited number of nodes. To enhance network performance and accuracy, we pursued two distinct approaches. Firstly, we optimized neural network training by strategically setting a limit on the number of nodes used for training, thereby reducing communication overhead. 
Secondly, we investigated an alternative multi-node approach to network training, varying the starting points across multiple nodes and averaging the results. This technique shows promise, yielding improved predictions compared to a single-network setup.

Lili Xu

Roger Dev

Develop an Automated ECL Watch Test Suite

This project aims to develop a comprehensive automated test suite for the ECL Watch UI, a key component of the HPCC Systems platform for high-speed data engineering. The suite will validate functionality, usability, performance, and error handling, ensuring a seamless user experience. By simulating human interactions, the tests will verify navigation, interactive features, and data presentation. Developed using Java, Selenium, and Unix, the project will include robust documentation for future maintenance. This initiative enhances ECL Watch's reliability and contributes to overall system efficiency.

Attila Vamos

Chris Lo

View Poster

View Blog

Nivedha Sivakumar

Rohith Surya Podugu

Bachelor of Computer Science
Georgia State University, USA

Masters in Computer Science
California State University, USA

Test Suite for a Roxie Cluster on Kubernetes

My project focuses on creating a test suite for Roxie designed to provide a more in-depth understanding of how different query, cluster, and infrastructure configurations can affect the functionality and performance of Roxie in the cloud. Unlike bare metal, the cloud environment provides more options and flexibility to build and customize your cluster infrastructure. The primary goal of the test suite is to give indications or guidelines as to which configuration will be suitable for each use case of Roxie in the future.

Krishna Turlaphati

Attila Vamos

View Poster

View Blog

Noah Seligson

Refactoring and Releasing PyHPCC

PyHPCC is a Python package and wrapper for HPCC Systems web services, initially developed as an internal tool at LexisNexis Risk Solutions to automate tasks on HPCC Systems. Since its introduction in 2022, interest in PyHPCC has grown across the organization and the broader community. In response to user feedback, we have enhanced its usability, maintenance, and documentation. We are now excited to announce that PyHPCC will be open-sourced, fostering collaboration within the HPCC Systems community.

Amila de Silva

View Poster

View Blog

Sabrina Harris                             

Bachelor of Computer Science
University of Central Florida, USA

Convert Automated Test Systems from Python2 to Python3

The main objective of this project is to convert the Python files from Smoketest and OBT from Python 2 to Python 3. Tools such as 2to3 automate the conversion process by rewriting segments of code according to predefined rules. Such a tool alone is not enough to ensure a healthy conversion, which is why manual review and testing are a mandatory part of this project as well. In addition, another principal goal is a clean-up of both the Python and Bash files for the testing systems. This involves removing commented-out code that no longer serves a purpose, unused variables, and uncalled functions.

Attila Vamos

View Poster

View Blog

Ryan Rao                                

High School Student American Heritage School Delray, FL, USA

HPCC Systems Storage Support With Container Storage Interface (CSI)

The goal of my project is to create the third storage lifecycle for the EFS implementation to provide users with more permanent storage, and to complete as much as I can of the FSx implementation. This will include configuring all the necessary storage components: PVs, PVCs, storage classes, EFS access points, the CSI driver, etc. In addition, I will build the necessary Helm charts and improve any existing code and documentation.

Xiaoming Wang

Godson Fortil

View Poster

View Blog

Sarah Nash                      

Masters in Data Science
New College of Florida, USA

Causal Discovery and Validation with Categorical Data

The HPCC Systems Causality Framework “Because” is a toolkit for multiple areas of causal analysis, including discovery and validation. The discovery algorithms previously implemented in the toolkit are mainly compatible with two data types: continuous numeric and discrete numeric. This project’s focus is to expand the discovery portion of the toolkit to additionally handle the remaining data type: categorical data. In all, we were able to determine strengths and weaknesses of this particular model through various tests, as well as areas for improvement within the Causality toolkit.

Roger Dev

Lili Xu

View Poster

View Blog

Shyamaa Karthik         

High School Student Saint Andrew's School Boca Raton, FL, USA    

Processing the Tamil Wiktionary Pages into an NLP++ Dictionary

As a summer intern for HPCC Systems, I worked on creating the world’s first and most advanced Tamil dictionary with parts of speech for NLP++. My goal was to use Tamil Wiktionary pages and leverage the past English Wiktionary parser to create my own parser for Tamil. My end result was the most thorough Tamil dictionary for NLP++ to date, but my hope is that more people will come along to build on it and expand it to make it more complete, and that the same approach is carried across to more languages.

David de Hilster

View Poster

View Blog

2023 Community Day recording

Masters in Applied Data Science New College of Florida, FL, USA

HPCC Systems Machine Learning Tutorials

This project explores machine learning bundles in HPCC Systems, focusing on Gaussian Process Regression (GPR), Support Vector Machines (SVM), and the General Linear Model (GLM). It highlights the development of tutorials to help users apply these algorithms, including dataset selection, preprocessing, and coding. The project also identified and resolved a critical error in the SVM implementation, enhancing the robustness of these tools and supporting HPCC Systems educational and open-source goals.

Bob Foreman

View Poster

View Blog

Scarlett Huang 

High School Student at A. W. Dreyfoos School of the Arts West Palm Beach, FL, USA 

Investigate Third-Party Environments (Google BigQuery)

This project integrates HPCC Systems with Google Cloud's BigQuery, utilizing two data transfer methods to streamline migration and analysis. The first method involves migrating large datasets from HPCC Systems to BigQuery via Google Cloud Storage, ensuring secure transfer and automated loading using the BigQuery Data Transfer Service. The second method leverages Google Cloud Pub/Sub for real-time data streaming in JSON format, facilitating continuous data flow for immediate processing. Both methods enhance HPCC Systems capabilities in managing big data efficiently and open opportunities for further integration.

Ming Wang

Terrence Asselin

View Poster

View Blog

Shounak Joshi 

Bachelor of Computer Science University of Florida, USA    

Investigate Third-Party Environments (Azure Synapse Analytics)

This project explores integrating Azure Synapse Analytics with HPCC Systems platform endpoints. Azure Synapse, a limitless analytics service, complements HPCC Systems functionality by offering improved visualization and diverse data analysis. The "Linked Service" feature facilitates connections to various data sources, enabling efficient data ingestion into the HPCC Systems Landing Zone. Users can then query data within Synapse SQL Pools, leveraging its powerful analytics capabilities to gain valuable insights. This project demonstrates the potential of third-party environments to enhance HPCC’s capabilities.

Ming Wang

Michael Gardner

View Poster

View Blog