Watch recordings of all presentations made during our Community Day Summit held during October 2023. Find out more about this event and read our blog review of the event.
In addition to these presentations, we also held a Technical Poster Competition for students to present work they have completed on projects leveraging HPCC Systems. See our poster contest wiki to find out more about our 2023 Poster Contest participants and winners.
Resources
Presentation Tracks and Content
Plenary Sessions
Join us as Gavin Halliday, SVP and Head of Platform Engineering, LexisNexis® Risk Solutions, kicks off the 10th annual HPCC Systems Community Virtual Summit. We welcome our community keynote speakers Bill Franks, Director of the Center for Data Science and Analytics, Kennesaw State University, followed by Bahar Fardanian, Manager, Solutions Engineering, LexisNexis Risk Solutions, with an introduction to Gus Cawley, Chief Executive Officer, ZipApply, to discuss how his company uses HPCC Systems in industry. Gavin will conclude with a keynote on the latest HPCC Systems platform progress and developments before sharing key highlights of the day ahead.
Welcome & Plenary Keynotes
Gavin Halliday, SVP and Head of Platform Engineering, LexisNexis Risk Solutions, Bill Franks, Director of the Center for Data Science and Analytics, Kennesaw State University, Bahar Fardanian, Manager, Solutions Engineering, LexisNexis Risk Solutions, with an introduction to Gus Cawley, Chief Executive Officer, ZipApply
Bill Franks will provide a keynote address on Generative AI: Moving Beyond The Hype & Hysteria. Rarely has a topic rocketed from obscurity to everyday discussion point as fast as generative AI has. Only going mainstream in late 2022, it has captured the attention of not just the data science and data engineering communities, but the world at large. With that attention has come a lot of hysterical discussions about how it will ruin society, equally unrealistic discussions of how it will transform everything, and a mix of both facts and misinformation all around. This talk will provide some sober and realistic discussions about what generative AI is, how it is being misunderstood, the strengths and pitfalls that it has, and how it might realistically change our world both in the near future and the long term. After the talk, attendees should walk away with a better feel for what’s real and what’s hype, as well as tangible ideas for how generative AI can play a role in both their personal and professional lives.
Bahar Fardanian will be joined by Gus Cawley from ZipApply for a discussion on How HPCC Systems is being used in Industry and how this dynamic startup is utilizing HPCC Systems and big data analytics to help job seekers get hired faster.
...
Watch recordings of all presentations made during our Community Day Summit held during October 2024. Find out more about this event and read our blog review of the event.
In addition to these presentations, we also held a Technical Poster Competition for students to present work they have completed on projects leveraging HPCC Systems. See our poster contest wiki to find out more about our 2024 Poster Contest participants and winners.
Resources
Presentation Tracks and Content
Plenary Sessions
Join us for the opening and closing sessions from the 11th annual HPCC Systems Community Virtual Summit.
Welcome & Plenary Keynotes
Gavin Halliday, SVP and Head of Platform Engineering, LexisNexis® Risk Solutions / Michael Stefanick, Managing Director, EY
Watch Gavin Halliday kick off the 11th annual HPCC Systems Community Virtual Summit, and hear from our community keynote speaker Michael Stefanick, who presented AI + Human: How AI will Disrupt Work and Allow Workers to Thrive.
Awards Ceremony & Plenary
Trish McCall, Sr Director Program Management, Hugo Watanuki, Manager Community Tech Programs, George S Foreman, Prod Dev Technical Writer Business Analyst II, Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
Join us for the long-awaited announcement of the 2024 Community Awards and Poster Competition winners, followed by our afternoon keynote honoring our academic community and engagement.
Cloud Strategies with HPCC Systems
The latest techniques in optimizing the Cloud Native HPCC Systems platform
HPCC Systems Deployment using K3D instance
Aryaman Gautam, Sidharth Ganesan, Srinivasan Kothandam, LexisNexis Risk Solutions ITTC
This integration package enables HPCC Systems trainers and trainees to install the HPCC Systems engine on any standalone system for training and practice. It lets the user work with HPCC Systems components, including Thor, Roxie, Dali Storage, and ECL, in a local environment.
A Better Understanding of Thor and Roxie Configurations for the Containerized HPCC Systems Platform
Godji Fortil & Krishna Turlapathi, LexisNexis Risk Solutions
In the HPCC Systems Cloud Native Platform, configurations for the Thor and Roxie engines are coded in YAML, stored in a file and passed to the Helm deployment. For this presentation, our goal is to go over those settings, recommendations and best practices, then deploy the HPCC Systems Platform using Terraform.
How to Enable Azure Log Analytics for the Containerized HPCC Systems Platform Using Terraform
Godji Fortil, LexisNexis Risk Solutions
For log solutions, the HPCC Systems Cloud Native Platform provides a standard log interface to process HPCC Systems logs. During this presentation, we will demonstrate how to enable Azure Log Analytics using our newly developed Terraform module for HPCC Systems logs.
Optimizing Business Solutions with Azure Fullstack Products
Mauricio Nunes and Matheus Machado, LexisNexis Risk Solutions
Learn how our LexisNexis Risk Solutions business in Brazil launched two data products using an HPCC Systems implementation on Azure. The speakers will provide an overview of components including a Web portal, Batch and ESP API as delivery channels served from a Roxie and MySQL backend, all on commercial Azure. They will close with a product demo of identity verification and fraud solutions using fictitious data while diving into the architecture, best practices, challenges, and technical experiences shared by the team.
Productivity with HPCC Systems
Leverage the improvements and innovative features in the latest HPCC Systems 9.x release
Roxie Performance: A Deeper Technical Dive into 9.x
Gavin Halliday, LexisNexis Risk Solutions
In this presentation, Gavin will discuss some of the different factors that affect Roxie performance, as well as cover details including some of the recent platform changes which aim to make this job easier. Anyone deploying production Roxie queries or interested in the technical details of how Roxie indexes are implemented can benefit from attending this session.
HPCC Systems Landing Zone Security
Russ Whitehead, LexisNexis Risk Solutions
Data is imported to and exported from an HPCC Systems Logical File via a file system location known as a Landing Zone. Landing Zone access was previously authorized via the LDAP Security Manager feature permissions, but once a user had access to it they could access the entire Landing Zone directory. Beginning with HPCC Systems Version 9.0, admins can specify Landing Zone scopes, using the same ECL Watch interface as file scopes and workunit scopes. This presentation describes the motivation for the additional security and walks the attendee through the creation of these scopes.
ECL Watch: Redefining Progress
Kunal Aswani, LexisNexis Risk Solutions
Delve into ECL Watch and check out the snapshot queries in Roxie, logging, hot spot identification, cost estimation, and metrics for efficient data retrieval.
A Threat Detection and Mitigation Framework: HPCC Systems and Azure at the core
Sowmya Myneni, LexisNexis Risk Solutions
While it is fantastic to advertise that we are confirming various security controls (CIS, FedRAMP), how do we actually prove it? This talk will explore in depth the concepts around control effectiveness by showing the audience some example security controls we are implementing at LexisNexis Risk Solutions and how the HPCC Systems data analytics platform is used to collect, explore and analyze the data.
Programming with HPCC Systems
Extend your ECL with powerful language tools
Building a Scalable and Efficient Bitcoin Blockchain Parser on the HPCC Systems Platform / Optimization of Learning Trees using ECL on HPCC Systems
Jyothi Shetty, RVCE / Manonmani S & Sudershan K S, RVCE
This joint session includes two presentations from RV College of Engineering featuring the work from academic collaboration as part of the HPCC Systems Centre of Excellence (CoE) in Cognitive Intelligent Systems for Sustainable Solutions (CISSS).
Building a Scalable and Efficient Bitcoin Blockchain Parser on the HPCC Systems Platform, Jyothi Shetty:
Many commercial applications are interested in analyzing Bitcoin blockchain data, but due to the large volume of transactions the processing of transaction data becomes a challenge. A distributed and parallelized architecture such as HPCC Systems can be a solution to efficiently analyse such massive amounts of data. The work proposes to build a C++-based Bitcoin parsing algorithm embedded within ECL to create a robust and efficient blockchain parser. The proposed approach takes raw blockchain data in the form of blk.dat files as input and processes the headers and transaction data within each block. Furthermore, it maps input addresses to output addresses of previous transactions, constructing a connected network of blocks that accurately resembles the blockchain. By iteratively traversing this chain of blocks, the parsed contents are then written to a CSV file. The suggested implementation offers flexibility as it does not rely on external library dependencies and can be executed as a single ECL file. This is a work in progress study. An initial functional implementation of the parser code has been deployed successfully on hThor and the next step is to expand its capability to Thor so the parser can fully take advantage of HPCC Systems distributed and parallel processing capabilities. As the size of the blockchain data continues to grow, having a scalable and reliable ECL Bitcoin parser that enables retrieval of blockchain data with minimal overhead is quite beneficial.
Optimization of Learning Trees using ECL on HPCC Systems, Manonmani S & Sudershan K S:
Some core ECL functions on HPCC Systems, like the ones leveraged in the LearningTrees Machine Learning Bundle, are recursive in nature and hence tend to demand higher computational effort under certain conditions. This talk presents an alternative solution for optimized execution of the LearningTrees algorithms in ECL via embedded Python libraries. The proposed approach was tested on standard datasets, and a significant decrease in execution time was observed with no impact on the accuracy of the models: while the embedding technique provided approximately the same accuracy levels as the existing ML library functions, the execution time was reduced by a factor of 4 to 5, making it an extremely fast and efficient technique. The target audience who will benefit from this talk includes industry professionals who wish to explore the resources for solving big data problems, technology enthusiasts willing to learn the latest trends and technologies, and researchers working in big data and high-performance computing domains.
Building Trustworthy and Auditable Digital Human Readers Superior to Humans with NLP++
David de Hilster & Ashton Williamson, LexisNexis Risk Solutions & Clemson University
Machine Learning, Neural Networks and Large Language Models are statistical and opaque, human memory is fallible, and NLP++ is now stepping in to fix these problems and build trustworthy systems in areas such as law enforcement, healthcare, and sentiment analysis. Not only can NLP++ perform better than humans, but it can build systems that eventually can be better than human experts. Unlike the statistical methods of ML, NN, and LLM, which are opaque and not auditable, NLP++ is a verifiable and auditable technology which is a game-changer in today’s critical systems. NLP++ is a glass box computer programming language and framework that seamlessly integrates with HPCC Systems to provide sophisticated text processing that, until now, has been achievable only by humans. We will look at work being done at Clemson to help solve the medical coding task that is currently done by human readers and has a 50% failure rate.
Digital Human Readers Making History with NLP++ and HPCC Systems
Charvi Dave, Kruthika Pinnada, Pedro Rodrigues, Dheemonth Kodali, Shyamaa Karthik, David de Hilster, LexisNexis Risk Solutions, RVCE, USP Brazil, St Andrews School
From resume processing, to sentiment analysis, to digital dictionaries, NLP++ and HPCC Systems are making history. Searching for resumes is precarious when using simple keyword search, and new work using NLP++ is making resume searching an exact science. Sentiment analysis till now has been done with statistical and keyword methods and has been too generic to be used in real-world systems. Two sentiment analyzers break that mold, one involving soccer teams in Brazil, and the other cricket teams in India. NLP digital dictionaries have either been stuck in time for decades or are non-existent. Two Wiktionary projects use NLP++ to parse linguistic information from Wiktionary pages, processing hundreds of thousands and millions of pages using HPCC Systems, and creating dictionaries that can be used in future NLP systems. The result will be the most comprehensive English dictionary ever created for NLP and the first digital dictionary with linguistic information ever created for the Tamil language.
Parquet Support for ECL
Jack Del Vecchio, LexisNexis Risk Solutions
Introducing the Parquet Plugin, an interface between ECL and Apache Arrow that gives the ECL programmer the capability to interact with the Parquet file format. This talk will demonstrate how ECL programmers can efficiently read and write Parquet files with ease. With this interface, ECL programmers can partition datasets, read any partitioned or non-partitioned dataset, and write to a Parquet file. In the demo, all the functions of the plugin will be shown. One of the key highlights of the plugin is its capability to handle datasets larger than memory. By leveraging streaming techniques, we ensure efficient processing of large-scale datasets without sacrificing performance or data integrity. Attendees will gain insights into the integration of Parquet and how the Apache Arrow library was leveraged to give the ECL programmer efficient access to the Parquet file format. A variety of demos, including usage examples, will be shown, and there will be opportunities to ask questions and learn more about the plugin.
HPCC Systems in Action
Real World Use Cases with HPCC Systems
An Understanding of HPCC Systems Platform Metrics
Ken Rowland, LexisNexis Risk Solutions
HPCC Systems provides a metric framework for collecting low-level, platform-specific metrics. This presentation covers the framework, available metrics, component instrumentation, and metric collection. It also covers the new ESP method execution profiling. All of it is tied together with examples from a running cluster.
Robots, Drones and Generative AI: Exploring cutting edge technologies with HPCC Systems in a school campus environment
Tai Donovan, Carlos Caceres, Nick Schwartz, American Heritage School
School safety, security, and wellbeing have been a driving force for our most recent academic partnership with HPCC Systems. The first phase of our security project began in 2020 with an autonomous mobile sentry designed, built and programmed with facial recognition software developed by our students. This project was very effective, and we wanted to enhance its capabilities by improving response times for any security concern with the use of drones. This latest version uses autonomous drone technology to enhance the security features of our school by detecting unauthorized personnel on campus, assisting security in response to active shooter situations, gathering information during lockdowns and tracking the student/staff evacuation process if initiated. In parallel to these latest efforts, a recent HPCC Systems summer internship project investigated whether generative AI could be used to generate human-like responses in our autonomous security robots. During the internship, an interface through HPCC Systems was constructed to interact with GPT and ChatGPT. We therefore expect to be able to use a standard neural network to identify different emotions given video of a person; these emotions would then be processed by the interface to create a call to OpenAI’s API, from which an appropriate response would be generated. This response would be designed to change based on the emotions classified by the neural network and, most importantly, due to the use of generative AI, would be different even with the same emotional classification. Even though this is all still in the prototype phase, our projects have proven to be a great proof of concept developed by high school students to deal with the reality of our times.
Tombolo - Opensource Tools for Interacting with HPCC Systems Clusters
Matt Fancher & Yadhap Dahal, LexisNexis Risk Solutions
Introducing Tombolo version 2.0, an innovative open-source project initiated by LexisNexis Risk Solutions, aimed at providing data catalog capabilities for HPCC Systems clusters. It is a web application that caters to both technical and non-technical users, facilitating seamless interaction with HPCC Systems clusters. Join us in this enlightening session as we demonstrate the impressive capabilities of Tombolo. Discover how users can effortlessly create workflows by leveraging assets from HPCC Systems clusters, enhanced with our newly introduced versioning tool. This tool empowers users with increased confidence when editing existing workflows. Witness the application's exceptional ability to monitor assets and proactively send timely notifications via different media, ensuring crucial data management tasks are completed promptly. Moreover, we'll showcase Tombolo's intuitive dashboard and its extensive range of API endpoints, offering a comprehensive overview of asset performance. During the presentation, our team will also share insights into the future development goals of Tombolo, providing a glimpse into the exciting features currently in development and planned for future additions.
Vulnerable Victim Monitoring Connecting the Dots to Find Missing Children Quickly
Pasqua Ruggiero, LexisNexis Risk Solutions
According to Child Find of America, an estimated 2,300 children go missing each day. That equates to over half a million missing children per year. Most of these children are runaways, a population that is quite vulnerable to trafficking. Trafficked victims can be exploited for their PII in relation to account takeover and stolen or synthetic identity fraud. LexisNexis Risk Solutions Inquiries comprise searches for fraud and credit-seeking consumers, which are stored in HPCC Systems. These hundreds of billions of inquiry records can be searched retroactively and monitored proactively for matches on missing children within a reasonable geographic area. Additional topics of discussion will include data visualization capabilities in Python with data obtained from both the National Center for Missing & Exploited Children’s public RSS feed (current missing cases) as well as all missing cases from the past 20 years (obtained from the ADAM Program). We will also discuss proposals for additional future studies utilizing the power of HPCC Systems to extract other types of big data and provide further insights and information to law enforcement and social services, to better aid them in the search for missing children and the prevention of human trafficking.
*We would like to issue a content warning for this presentation for topics like human (and especially child) trafficking. We understand the difficulty of presenting information on this sensitive topic and appreciate your patience and understanding as we present our research. All data will be anonymized to not reveal any personal information.
ECL Training Workshops
The HPCC Systems Training and Support team has been very busy this year visiting universities and trade show events to promote HPCC Systems and ECL through Code Days, Hackathons, and Workshops. This year’s Community Summit workshop presents an in-depth look at these events in three 1-hour sessions:
Part 1: The “Music is Life!” Workshop
Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
Who doesn’t love music? Take a break from your daily routine datasets and join us in this first hour. We break down a popular open-source music dataset, explore normalizing the dataset and its effects, and look at a variety of data evaluation and query techniques.
Part 2: The “Find Your Paradise!” Hackathon
Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
Have you ever thought about building an application that can help people find places to live that maximize their quality of life and happiness? The goal of this challenge is to analyze different datasets across different categories and correlate them using the HPCC Systems platform. After analyzing, the participants will be asked to design an interface to query this data and assign it a scoring system, then deliver it to the user via ROXIE and show the user where they would most likely want to live. Users should be given choices in an easy-to-use form that when submitted will generate a unique set of scores based on locations.
Part 3: Better Customer Insights Through Relational Dataset Queries
Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
In this final hour, we look at the power of the open data model, transforming normalized data into a denormalized relational dataset, and use the power of implicit relationality to analyze relationships and provide better customer insights.
Platform Evolution
Take a look at the latest improvements and innovative features in the platform.
The Latest Advancements in HPCC Systems Observability
Rodrigo Pastrana & Mark Kelly, LexisNexis Risk Solutions
Learn about the latest advancements in HPCC Systems observability, focused on the OpenTelemetry-based instrumentation framework and how this feature can help you streamline your ECL query workload and ensure your HPCC Systems deployments are functioning at full strength!
How to Secure Your Containerized HPCC Systems Platform Using Terraform
Godji Fortil, LexisNexis Risk Solutions
In our presentations over the last couple of years, we showed you how to deploy the containerized HPCC Systems Platform. As a next step, knowing how to secure your cluster over the internet is very important. This presentation demonstrated how best to do that using Terraform, htpasswd, or Azure Active Directory.
Parquet Plugin Usage / Test Suite for the Parquet Plugin
Jack del Vecchio & Ilhan Gelle, LexisNexis Risk Solutions
The HPCC Systems Parquet Plugin has received many performance improvements and bug fixes over the past year. In this two-part session, Jack highlighted the improvements compared to native file formats and went into detail about the usage of the Plugin and how to leverage the Parquet file format. Ilhan then provided an overview of the robust test suite for the HPCC Systems Parquet plugin and how it helps enhance the reliability and efficiency of Parquet integration within HPCC Systems.
What’s New – ECL Watch and IDEs
Kunal Aswani, LexisNexis Risk Solutions
Catch up on the latest changes to the interface and appearance in the world of ECL development.
Data DNA
See how these projects enhanced the integration processes and capabilities of the platform to streamline, enrich and visualize data.
Streamlining Business Tasks with HPCC Systems Clusters and Tombolo
Yadhap Dahal & Matthew Fancher, LexisNexis Risk Solutions
Introducing Tombolo, an innovative open-source project designed to enhance the already robust features, speed, and capabilities of HPCC Systems clusters. Our aim is to introduce a user-friendly web application that caters to both technical and non-technical users, enabling seamless interaction with HPCC Systems clusters. During our presentation, we’ll explore the capabilities of Tombolo, including creating workflows and monitoring assets. Additionally in this session, we’ll highlight the application’s ability to proactively send timely notifications. Moreover, we’ll provide a comprehensive overview of Tombolo’s intuitive dashboard.
Power BI Integration with HPCC Systems
Harsh Raj & Srinivasan Kothandam, LexisNexis Risk Solutions
Learn more about this integration package that enables analysts and business users to read HPCC Systems native files directly from Power BI.
Using NLP++ to Build a Brazilian Address Cleaner in HPCC Systems
Guilherme da Silva, LexisNexis Risk Solutions
NLP++ is a new programming language specially designed to build deep text parsers. The main objective of this approach is to build a Brazilian address analyzer and cleaner that is capable of improving the current cleaning process, with the advantage of being a transparent process with easy problem identification and correction, demonstrating great potential for future use in production.
Enhancing Legal Assistance Through Data Enrichment with HPCC Systems
Nihar Mandahas, Skanda P R, Manvith L B, Pratheek Rao MP, Arya Hariharan, & Dr. Jyoti Shetty, RVCE
The proposed application enhances legal research efficiency and accuracy by using NLP for keyword extraction and leveraging HPCC Systems for rapid data retrieval, ensuring quick and relevant reference searches. Among those who will benefit from this application are lawyers who wish to simplify the task of finding relevant legal references, academics and law students who usually conduct extensive research, and legal organizations overall.
Integrating Microsoft Fabric and HPCC Systems for Security Analytics
Sowmya Myneni & Kushi Kiran, LexisNexis Risk Solutions
Integrating HPCC Systems with Microsoft Fabric and Power BI offers a seamless workflow for transforming, linking, and visualizing data. HPCC Systems, with its user-friendly ECL language, efficiently handles complex data transformations, which can then be imported into Microsoft Fabric. From there, Power BI can be used to create interactive and insightful visualizations and reports. This integration simplifies the process of turning raw data into actionable insights, making it an effective solution for data analysis and presentation.
Productivity Tools
These projects highlighted tools and best practices for maximizing efficiency.
We’ve Come a Long Way From README.txt – Improvements in the Platform Documentation
Jim DeFabia, LexisNexis Risk Solutions
Keeping documentation clear, concise, and user-friendly is crucial for a smooth user experience. We are always working to improve our HPCC Systems® Platform documentation and have recently implemented a few new features to increase use and improve the user experience.
LLM for ECL Code Generation Using Llama3
Connor Davis, DataSeers
Learning and exploring a new programming language can be very challenging, especially when your resources are limited. To help overcome this challenge and improve developers’ productivity, this project proposes the creation of an AI assistant that can help with ECL learning and code development tasks. To achieve this aim, the LLM Llama3 and prompt engineering techniques were leveraged to create a chatbot that can be run locally in VSCode to assist with the generation of ECL code. This is a work in progress, but the preliminary results are promising and there is a large potential for further expansion and improvements.
Implementing Conditional Cleanup after Regression Testing in HPCC Systems
Goutami Sooda, Arya Vinod, Ahana Patil, & Chandana S, RVCE
The cleanup module designed and developed during this project allows users to choose the cleanup mode and when enabled can effectively delete thousands of workunits generated by regression tests across different clusters, including Thor, Roxie, and hThor. This feature has yielded tangible benefits, including reduced operational costs, improved resource utilization, and enhanced overall efficiency of the regression test engine. By preventing overload on both the cluster and the Dali component, we have significantly enhanced cost-effectiveness and streamlined resource management within the HPCC Systems environment. Overall, this project focuses on contributing to the extensive codebase of the regression test engine.
Data 360° View Using HPCC Systems
S Dhanush & Shreyas Shankar, RVCE
In today's data-driven landscape, organizations face significant challenges in managing and leveraging large volumes of data across diverse platforms. This session presents a cohesive solution developed by the RVCE team for ProfitOps Inc., a startup company based out of Cumming, GA, USA, using HPCC Systems to achieve comprehensive data integration and transformation, ensuring seamless connectivity across MySQL, MongoDB, AWS, FTP, and other platforms. HPCC Systems serves as the core technology for this solution, facilitating streamlined data ingestion, synchronization, transformation, and versioning processes. Whether ingesting data from the HPCC Systems landing zone to MySQL or vice versa, the solution supports bidirectional data flows with robust mechanisms for incremental updates, deletions, and version control across various data formats.
Testing Best Practices of the HPCC Systems Platform (Now and in the Future)
Christopher Lo, LexisNexis Risk Solutions
Join us to hear about our current and future testing philosophies and how they are used to validate and test our platform. We'll talk about our current set of testing procedures on various environments and the source code repository. We'll also touch on how we perform our testing and how you can replicate these tests on your own environments.
Navigating the Platform Build System
Michael Gardner, LexisNexis Risk Solutions
Over the past two years we have been developing a new build system that is more open and reviewable. In this lecture we will discuss changes to our build machine image generation, leveraging vcpkg for library dependencies, our builds on GitHub Actions, and how to utilize our build workflow code to generate your own custom builds.
From In-House to Open Source: The Journey of PyHPCC
Amila de Silva & Rohith Podugu, LexisNexis Risk Solutions
PyHPCC is a Python package and wrapper built around the HPCC Systems web services that facilitates communication between Python and HPCC Systems. It was originally developed as a LexisNexis Risk Solutions internal-only tool to automate repetitive work done on HPCC Systems. Since the evangelization of PyHPCC began in 2022, there has been growing interest across the organization and the broader HPCC Systems community to adopt PyHPCC.
Fueling Success
Here are the real-world use cases and success stories leveraging HPCC Systems.
Building an NLP Pipeline for Electronic Health Records and Brain MRI Classification
Vishalakshi Prabhu, Eshaan Mathur, Nikhil Vasu, & Prashant Ronad, RVCE
A medical record can include a variety of types of "notes" entered over time by healthcare professionals, such as observations and administration of drugs/therapies, test results, X-rays, reports, etc. Accordingly, one of the biggest challenges in healthcare is the unavailability of data standardization models. At the same time, most doctors rely on their own knowledge and limited patient data when making decisions. Therefore, accessing the knowledge of many medical professionals would potentially benefit patient care.
This work aims to develop an effective disease identification/classification system using NLP for Electronic Health Records (EHR). Further, it builds a knowledge base for future reference, allowing for querying patient/disease details and for pattern finding. BioBERT is used for text embedding in this project. The selected approach has leveraged pre-processing in such a way that the symptom, duration, gender and affected organ of the patient are labeled/displayed when the whole text is given as input.
Learned Cache Size Setting for Roxie Clusters
Yifan Wang, University of Hawaii
The internal node cache of the index has a non-trivial impact on the Roxie system component. Setting the node cache size properly results in a significant speedup in data access. However, optimal internal cache sizes depend on various factors, including the access patterns within the index, the compression ratio of data, the disk IO speed, and the time needed to decompress the data.
In this talk, I present our research on setting the best cache size using machine learning methods. I first introduce the background and recent work on learning-based knob tuning, which aims at predicting the best configurations for data systems. I then present our research in two parts: (1) simulation of the Roxie system and (2) a learning method over the simulation results.
I hope this talk can provide a basic overview of knob tuning in the AI-for-databases area, as well as help inspire other researchers and developers to improve HPCC Systems using emerging AI techniques.
This talk targets researchers and developers of HPCC Systems, offering insight into improving the system with AI techniques and inspiring new directions for the platform.
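To make the link between cache size and access pattern concrete, here is a minimal Python sketch that replays a synthetic, skewed access sequence through a least-recently-used (LRU) cache at several sizes. It is not the presenter's Roxie simulator: the LRU policy, the skewed workload, and all names (`lru_hit_rate`, `skewed_accesses`) are assumptions made for illustration, and it ignores compression ratio, disk IO speed, and decompression time.

```python
from collections import OrderedDict
import random


def lru_hit_rate(accesses, cache_size):
    """Replay a sequence of node ids through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for node in accesses:
        if node in cache:
            hits += 1
            cache.move_to_end(node)  # mark as most recently used
        else:
            cache[node] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used node
    return hits / len(accesses) if accesses else 0.0


def skewed_accesses(num_nodes, length, seed=0):
    """Hypothetical workload: low-numbered nodes are requested far more often."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(num_nodes)]
    return rng.choices(range(num_nodes), weights=weights, k=length)


accesses = skewed_accesses(num_nodes=1000, length=20000)
hit_rates = {size: lru_hit_rate(accesses, size) for size in (10, 50, 200, 1000)}
for size, rate in hit_rates.items():
    print(f"cache size {size:>4}: hit rate {rate:.3f}")
```

A table of (cache size, hit rate) pairs like this one is the kind of simulation output a learning method could be trained on. A realistic cost model would also have to weigh the other factors listed in the abstract.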
Machine Learning and Cybersecurity Analytics Using the NSL-KDD Dataset
Zularbine Kamal, Kennesaw State University
HPCC Systems currently supports several machine learning algorithms, both supervised and unsupervised, and makes them available via machine learning bundles. In this project, we have leveraged all the algorithms that have data classification capability to detect network intrusion with the dual goal of getting the highest accuracy possible for the trained model while ensuring efficiency during its training. We will also demonstrate the use of the “Myriad Interface”, which can perform multiple independent machine learning tasks within a single interface invocation. Invoking the activities in parallel allows them to be distributed across the nodes in the cluster, thereby maximizing the performance while minimizing the run time. Lastly, we will also cover how the ML_Core.Preprocessing bundle can be used for data preprocessing including label encoding, scaling, and one-hot-encoding.
The content of this presentation is aimed at individuals working with machine learning, cybersecurity specialists, and HPCC Systems users and technology enthusiasts overall.
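For readers unfamiliar with the three preprocessing steps named above, the following plain-Python sketch shows what label encoding, min-max scaling, and one-hot encoding each do to a column. It does not use or mirror the ML_Core.Preprocessing API: the function names and the small sample columns are hypothetical, and min-max is only one of several possible scaling methods.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """Turn each category into a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


# Made-up sample columns in the style of network connection records.
protocols = ["tcp", "udp", "icmp", "tcp"]
durations = [0, 5, 10, 20]

codes, mapping = label_encode(protocols)
scaled = min_max_scale(durations)
vectors, categories = one_hot_encode(protocols)

print(codes)    # [1, 2, 0, 1]
print(scaled)   # [0.0, 0.25, 0.5, 1.0]
print(vectors)  # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Label encoding suits target classes, while one-hot encoding avoids implying an order among categories used as input features.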
Model Inversion Attacks with the HPCC Systems Platform
Andrew Polisetty, Kennesaw State University
In this world of machine learning, feeding the model with inputs and training the model is one thing, and securing that model is another. Since many companies leverage machine learning models for decision-making using sensitive data, attackers can target these data-sensitive models, and one of the biggest threats to these models is MIA (Model Inversion Attack). Mainly, MIA is a technique that can be leveraged to reconstruct sensitive information such as financial data. Attackers can access these models and gather predicted data, which can be used as input for training a new model similar to the original one. The attackers can then infer the sensitive data from the original model to reconstruct the training data or build a comparable model.
In this project, we have leveraged a public loan dataset and utilized the HPCC Systems platform to perform black-box attacks and to design solutions to prevent them. We first developed the machine learning model and then utilized it to build the original, threat and defender models. For the original model, we chose a credit risk assessment scenario consisting of a person’s loan and personal details. We developed the original model by using the learning trees algorithm from the HPCC Systems machine learning bundle. In this scenario, the attackers would access the inputs and outputs of the model through querying, where they can analyze the data points to perform a black-box attack. In the attack model, we simulate an attack by querying the original model and we train the attack model using its output. For the prevention model, we are currently exploring different approaches, such as adding noise to the output of the original model to manipulate the attacker. So far, the raw accuracy obtained from the learning trees algorithm is still relatively low, so we are also training models using logistic regression and continuing to explore different defenders for the prevention model.
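The noise-based defense mentioned above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: `original_model` is a hypothetical stand-in formula rather than the presenters' learning trees model, and Gaussian noise with the chosen scale is only one possible way to perturb outputs.

```python
import random


def original_model(income, loan_amount):
    """Hypothetical stand-in for a trained credit risk model: score in [0, 1]."""
    ratio = loan_amount / max(income, 1)
    return min(1.0, max(0.0, ratio / 2))


def defended_model(income, loan_amount, noise_scale=0.1, rng=random):
    """Return the original score with Gaussian noise added, clipped to [0, 1]."""
    score = original_model(income, loan_amount)
    noisy = score + rng.gauss(0.0, noise_scale)
    return min(1.0, max(0.0, noisy))


# A black-box attacker only sees query results, so with the defense in place
# the labels collected for training an attack model are perturbed.
rng = random.Random(42)
queries = [(50000, 25000), (40000, 60000), (90000, 10000)]
for income, loan in queries:
    clean = original_model(income, loan)
    observed = defended_model(income, loan, noise_scale=0.1, rng=rng)
    print(f"income={income} loan={loan} clean={clean:.3f} observed={observed:.3f}")
```

The trade-off is that more noise makes the attacker's copy less faithful but also makes the scores less useful to legitimate callers.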
School Safety and Security Using RFID and Drones
Taiowa Donovan & Nick Schwartz, American Heritage School
The continuation of our school security project has grown to incorporate drones with thermal imagery along with RFID (Radio Frequency Identification). The new drone platform allows us to collect more detailed real time data, higher resolution real time images for the facial recognition software, double our flight time with extra battery capabilities, a camera with 56x zoom, a wide-angle camera for high-precision campus mapping, and a thermal camera for heat source inspection. This new drone has increased rapid response time to investigate security threats detected by our current patrol drones and security staff. Thermal cameras allow us to track students during code red events and have assisted in the training of security personnel.
Exploring the Capabilities of HPCC Systems in Facilitating Inter Fog Communication
Henrique Antonio Buzin Vargas, Federal University of Santa Catarina (UFSC)
The increasing proliferation of devices connected to the Internet of Things (IoT) has generated significant challenges in terms of integration and interoperability due to the diversity of communication protocols. Fog computing has emerged as an effective solution to reduce latency and improve system efficiency by bringing data processing closer to the source. However, communication between different fogs and clouds still faces considerable obstacles due to the absence of a protocol conversion layer.
This session will explore how HPCC Systems can be used to facilitate efficient communication between different fogs. The presentation will address a modular architecture that supports protocol conversion and communication between fogs and clouds, providing a flexible and scalable solution for IoT environments. Key topics to be discussed include an overview of HPCC Systems and its large-scale data processing capabilities, the structure and functionalities of the modular layered architecture developed to support interoperability between different devices and systems, the methodology adopted for the implementation and testing of the architecture, including the evaluation of its efficiency, latency, security, and scalability, and preliminary results and performance analysis of the architecture in practical scenarios. Lastly, insights will be shared about the challenges faced during the development of the architecture and proposed guidelines for future implementations.
This session is aimed at researchers, developers, and technology professionals seeking solutions to improve the integration and efficiency of fog computing systems. Discover how HPCC Systems can transform communication and data processing in distributed environments and drive the evolution of IoT.
Internship to Impact: Real Life Success in the HPCC Systems Community
George S Foreman, Christopher Connelly, Jack del Vecchio, Yash Mishra, Nathalia Ribas, & Fulvio Favilla Filho, LexisNexis Risk Solutions
Join us for a moderated panel-style interview with former interns who have emerged from the HPCC Systems Summer Internship Program. In this session, the next generation of technologists will share their experiences and successes working with the HPCC Systems community, what they learned during their internships, and how they transitioned to full-time RELX employees.
ECL Training Workshop
The HPCC Systems Training and Support team has been very busy this year visiting universities and trade show events to promote HPCC Systems and ECL through Code Days, Hackathons, and Workshops. Here is this year’s Community Summit workshop:
Sleep Well with ECL: Job Automation and Scheduling
Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
This workshop showcased the latest work regarding ECL Scheduling and Process Automation. In addition, many ECL best practices were demonstrated along the way.
Topics included:
Creating and using FUNCTIONMACROs
Creating and using MACROs
The ECL Template Language
Automating ECL
ECL Scheduling