Watch recordings of all presentations made during our Community Day Summit held during October 2021. Find out more about this event and read our blog review of the event.
In addition to these presentations, we also held a Technical Poster Competition for students to present work they have completed on projects leveraging HPCC Systems. See our poster contest wiki to find out more about our 2021 Poster Contest participants and winners.
Resources
- See the full Agenda
- Go to the Video Library and listen to sessions
- Watch videos of students talking about their posters
- 2021 Poster Contest - Meet the students, read abstracts and view posters
Presentation Tracks and Content
Plenary Sessions
The 8th annual HPCC Systems Community Virtual Summit began with keynotes from top industry leaders and technologists including Microsoft, BitPay and DataSeers:
- Welcome & HPCC Systems Platform Vision
Flavio Villanustre, VP Technology & CISO and Richard Chapman, VP & Head of R&D, LexisNexis Risk Solutions
Flavio kicks off the 8th annual HPCC Systems Community Summit and reflects on 10 years of our open-source journey. Richard shares his vision and future direction of the platform and our ongoing progress towards making HPCC Systems cloud deployments seamless for our community users.
- HPCC Systems on Azure: Present and Future Possibilities
Shrikrishna Khose, Senior Cloud Solution Architect, and Steve Griffith, Principal Technical Specialist, Cloud Native Global Black Belt Team, Microsoft
The Microsoft team will discuss the strong collaboration between Microsoft and LexisNexis Risk Solutions. See how HPCC Systems can be the foundation for a Data Lake and explore Azure services integration possibilities. Bonus: high level overview on Azure certification paths.
- A Crypto Customer Journey with HPCC Systems
Stephen Pair, CEO, BitPay and Gracie Ortiz, COO, DataSeers
Hear C-level executives from the FinTech industry speak about market pressures in cryptocurrency and how the DataSeers appliance, based on HPCC Systems, is going to solve some of the problems.
- Community Recognition & Poster Awards Ceremony
Trish McCall, Director Program Management & Lorraine Chapman, Consulting Business Analyst and HPCC Systems Intern Program Manager, LexisNexis Risk Solutions Group
Join us as we announce the recipients of the 2021 HPCC Systems Community Recognition and David Kan Ambassador Awards. Winners of the 2021 Poster Competition will also be unveiled.
- Academia and Industry – BFFs (Best Friends Forever)
Moderator: Bahar Fardanian, Technology Evangelist, LexisNexis Risk Solutions Group
Guest Panelists:
Dawn Tatum, Director of CCSE Partnerships and Engagements, Kennesaw State University
Burcin Bozkaya, Director, Graduate Program in Data Science, New College of Florida
Geoffrey Machin, Metadata and Information Architect, Cirium
Jesse Shaw, Principal Data Scientist II, LexisNexis Risk Solutions
Industry and academia have a long and prosperous history of partnership, providing mutual benefit through project collaboration, real-world opportunities for students, and preparation of the next generation of professionals for the workforce. To thrive, this partnership needs work from both sides: mutual respect, equal contribution, and alignment on goals and outcomes. This discussion will focus on the opportunities and challenges of this partnership. Our guest panelists are industry and academic experts who will talk through their experience in establishing value, best practices, pitfalls, and methods to ensure a successful, long-lived relationship. This panel discussion will be moderated by Bahar Fardanian, Technology Evangelist, LexisNexis Risk Solutions Group, who works closely with our community partners.
- Wrap-up & Adjourn
Flavio Villanustre, LexisNexis Risk Solutions Group
Flavio closes the exciting day with a wrap-up and thank you to our Community.
Platform Features
Learn about the new features and enhancements in the latest HPCC Systems platform, including cloud native topics:
- What’s New in HPCC Systems and the Cloud Native Roadmap
Gavin Halliday, LexisNexis Risk Solutions Group
Gavin shares an update on the new features in the latest release including data handling in the cloud native version.
- What’s New in ECL Watch, IDEs and Visualization Framework
Gordon Smith, LexisNexis Risk Solutions Group
Gordon discusses the latest features in ECL related development tools including "Modern" ECL Watch, Visualization updates, and the VS Code ECL Extension.
- Securing Your Cloud Native HPCC Systems with Service Mesh
Manish Kumar Jaychand, Infosys
With the advent of Cloud and Kubernetes over the years, machines are no longer considered as attached to a data center. Machines are more ephemeral than ever before. The traditional architecture of the HPCC Systems environment harnessed the physical storage of each node and that in turn gave certain performance benefits. But with cloud, it is no longer necessary to have a fixed machine for a process. The true power of cloud can be harnessed only when we treat machines as ephemeral. Therefore, a Thor worker node, which is always on in a traditional HPCC Systems environment, is spun up only when it is required in a cloud environment. With the latest cloud native version of HPCC Systems, we now have the flexibility to spin up the clusters only when required. In this session, we will cover how the latest cloud native platform differs from the bare metal version, explain service mesh and how it fits into the HPCC Systems scheme of things, and compare the Istio and Linkerd service meshes.
- Cloud Security & Authentication in HPCC Systems
Russ Whitehead, Tony Fishbeck & Mark Kelly, LexisNexis Risk Solutions Group
This talk discusses some current and future technology enhancements around HPCC Systems platform security, with a primary focus on cloud deployments. This session will include a look at support for mutual transport layer security between internal components within an HPCC Systems environment, external facing TLS for securing access into the HPCC Systems environment, using cert-manager to generate TLS certificates for HPCC Systems services and components, installing externally created TLS certificates, secrets management, and a look at our plans for future support of OAuth2 based authentication and authorization, with an initial focus on support for OAuth2 integration with Azure Active Directory services.
- ROXIE Troubleshooting
Mark Kelly, LexisNexis Risk Solutions Group
ROXIE services on cloud/Troubleshooting: What changes will need to occur in the ROXIE code to run on the cloud native platform?
Machine Learning
Hear from our ML experts on the latest machine learning libraries and algorithms available in HPCC Systems:
- Contributions to HPCC Systems - From Virtual Collaboration to Virtual Reality
Dr G Shobha, RV College of Engineering
This talk focuses on the virtual collaborative work done between RV College of Engineering and LexisNexis Risk Solutions on recent contributions to the HPCC Systems Platform. These include plugins and extending Machine Learning bundles for HPCC Systems, followed by analysing the impact of skewed data distributions on most commonly used ECL operations. The talk concludes with case studies executed on HPCC Systems, including the implementation of a virtual reality application.
- HSQL: An SQL-like Language for HPCC Systems
Atreya Bain, RV College of Engineering & HPCC Systems Intern 2021 and Mahdi Kashani, LexisNexis Risk Solutions Group
There is a steep learning curve to getting used to handling Big Data, especially in distributed systems, where the task of data processing is split amongst various nodes in clusters.
HSQL is the new big-data query language of HPCC Systems and is an innovative and open-source solution to let users process their data at any scale. It is designed to work in conjunction with ECL, which is the primary programming language for HPCC Systems, and it should prove itself to be easy to work with and robust for general purpose analysis. Made to provide a compact and easy to comprehend SQL-like syntax for performing visualizations, general data analysis, and training of Machine Learning models, HSQL allows a modular structure to such programs and can easily integrate with the VS Code IDE. In this presentation, learn why HSQL is important and how it adds more value for HPCC Systems users, learn its syntax, and see a couple of examples on different datasets along with installation and setup instructions.
- New Advancements to Logistic Regression and the ML Library
Lili Xu, LexisNexis Risk Solutions Group
Logistic Regression is one of the most important analytic tools in the social and natural sciences, with applications in areas such as natural language processing and image recognition. One of our Machine Learning advancements is to renovate the current HPCC Systems Logistic Regression bundle and add the ability to handle both binary and multi-class prediction tasks. Another advancement is to improve the performance and remove the bottlenecks of the Preprocessing bundle. The improved version is more scalable and more efficient for Big Data preprocessing tasks.
- The Causality Analytics Toolkit for HPCC Systems
Roger Dev, LexisNexis Risk Solutions Group
Causal Reasoning is at the heart of most human thought and action, yet has only recently been formalized as a mathematical and scientific field of study. It is hard to conceive of achieving a true AI without such a capability. Although the science of Causality has not advanced to the threshold of AI, it can unlock capabilities that are beyond the realm of statistical observation. Current Machine Learning methods assess observational patterns, and learn to replicate the results of patterns previously detected. They make no effort to disentangle true causal effects from observed correlation. They lack the ability to respond to changes in the scenarios that generated the data, or to predict the effect of new actions on the outcome. Causal Science provides a path toward a deeper understanding of our data. It defines mechanisms that can separate causal influences from spurious correlation and infer causal effects from observational data. As these techniques evolve, they stand to revolutionize our understanding and uses of data. Causality 2021 is an HPCC Systems research and development program. The goal is to increase our understanding of the latest causal algorithms, assess and challenge the current state of the art, and develop a Causality Toolkit for the HPCC Systems Platform. This project encompasses all three levels of the "Ladder of Causality": “Seeing”, “Doing”, and “Imagining”, as well as Causal Model Validation, and Causal Discovery. This project includes work from three interns who joined the HPCC Systems Intern Program in 2021.
- The Forecast of COVID-19 Spread Risk at The County Level
Murtadha Hssayeni, Florida Atlantic University
The early detection of the coronavirus disease 2019 (COVID-19) outbreak is important to save people's lives and restart the economy quickly and safely. People's social behavior, reflected in their mobility data, plays a major role in spreading the disease. Therefore, we used the daily mobility data aggregated at the county level beside COVID-19 statistics and demographic information for short-term forecasting of COVID-19 outbreaks in the United States. The daily data are fed to a deep learning model based on Long Short-Term Memory (LSTM) to predict the accumulated number of COVID-19 cases in the next two weeks. A significant average correlation was achieved (r=0.83 (p=0.005)) between the model predicted and actual accumulated cases in the interval from August 1, 2020 until January 22, 2021. The model predictions had r > 0.7 for 87% of the counties across the United States. A lower correlation was reported for the counties with total cases of <1,000 during the test interval. The average mean absolute error (MAE) was 605.4 and decreased with a decrease in the total number of cases during the testing interval. The model was able to capture the effect of government responses on COVID-19 cases. Also, it was able to capture the effect of age demographics on the COVID-19 spread. It showed that the average daily cases decreased with a decrease in the retiree percentage and increased with an increase in the young percentage. Lessons learned from this study not only can help with managing the COVID-19 pandemic but also help with early and effective management of possible future pandemics. The project used the HPCC Systems platform for collecting, hosting, and analyzing the data.
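The forecasting abstract above reports two evaluation metrics: a Pearson correlation (r) and a mean absolute error (MAE) between predicted and actual accumulated cases. As a minimal sketch of how these two metrics are typically computed, the following assumes plain Python lists of equal length. The function names and the four-point sample series are hypothetical illustrations and are not data or code from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical accumulated case counts for one county (illustration only).
actual = [100, 150, 210, 300]
predicted = [110, 140, 220, 280]

print("r   =", round(pearson_r(actual, predicted), 3))
print("MAE =", mean_absolute_error(actual, predicted))  # MAE = 12.5
```

Because MAE is expressed in the same units as the case counts, it tends to shrink for counties with few cases, which matches the abstract's note that MAE decreased as the total number of cases decreased.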
Data Lake
Learn about efficient and secure ways for handling your data, analytics as well as cool tools and extensions:
- Data Visualization with RealBI
Dan Camper & Mahdi Kashani, LexisNexis Risk Solutions Group
RealBI is a new HPCC Systems business intelligence tool, used to empower HPCC Systems developers to shape and visualize their data in real time, regardless of the size of that data. RealBI saves users time and cost by communicating directly with HPCC Systems clusters. This eliminates the need to further secure or transport the data since it remains entirely within the cluster. RealBI gives users direct access to logical files and ROXIE queries. It also enables users to write and execute custom ECL scripts from within the application if that is desired. Users don’t need programming skills to use RealBI. Charts, filters, sorting, and many more options are all available with a click of the mouse.
- Data Cataloging with Tombolo
Roger Dev & Jerry Jacob, LexisNexis Risk Solutions Group
It is easy for a Data Lake to grow out of control if appropriate measures are not put in place. When this happens, Data Engineers’ productivity can suffer, resulting in delays in customer commitments. A Data Lake can become a Data Swamp suddenly and without warning. The critical threshold is reached when the complexity of the Data Lake exceeds the capability of key personnel to hold the pattern of the Data Lake in their head. The goal of Tombolo, a Data Lake Curation tool, is to prevent such an event and allow the data lake to continue evolving rapidly as its complexity increases and as more personnel begin to participate. Tombolo provides the central operating environment for a Data Lake. The Tombolo Data Lake Curation System 1.0 is the first open-source Data Lake Curation system for the HPCC Systems Platform. It allows creation of documentation along with the data and analyses that provides a roadmap into all aspects (assets) of the Data Lake: Data Files, Data Providers and Consumers, Data Ingestion and Analytics, and User Queries. Its global find facility allows users to rapidly locate any asset or browse hierarchically to get the lay-of-the-land.
- Design Considerations for Migrating Your HPCC Systems Data Lake to the Cloud
Krishna Turlapathi & Michael Gardner, LexisNexis Risk Solutions Group
During this session, we share lessons learned and design best practices through our own cloud migration experience. The beginning of our presentation is a simple installation of our cluster on Azure using the community helm charts. During this demo we hit topics such as how the HPCC Systems platform differs between the Kubernetes cluster that we are deploying and the bare metal installations that community members are familiar with. We dive into helm for HPCC Systems, the value of .yaml files, and a few different ways that the cluster can be configured, and explain how storage in the cloud compares to bare metal. Then learn about ROXIE and Thor usage in the cloud. Krishna covers some details about getting query lists, suspended queries, and doing package file deployments. Michael expands on basic security features that end users will want to enable in the cloud, including encryption in transit and at rest in a cloud environment such as Azure.
- Terasort with HPCC Systems on Azure Kubernetes Service and High Performance Storage
Shrikrishna Khose & Steve Griffith, Microsoft
The speakers discuss challenges, AKS considerations and storage options, including a demo covering the setup and configuration of HPCC Systems on AKS with Blob NFS 3.0 and performing a Terasort.
- Taming the Data Demon with the DataSeers HPCC Systems Appliance
Gurjot Bandasha & Adwait Joshi, DataSeers
The core of any data solution lies in data management. What is needed is a solution that will integrate and coordinate compliance, reconciliation, fraud monitoring, and visualization. Hear from the DataSeers experts how they are helping companies in the FinTech and Banking industry to manage money, fight fraud and maintain compliance using a solution built from the ground up leveraging HPCC Systems.
Proven Use Cases
Hear success stories on how HPCC Systems is being used in the industry and academia in proven solutions:
- Deploying Digital Human Readers Leveraging HPCC Systems
David de Hilster, LexisNexis Risk Solutions Group
With the newly launched NLP-Plugin for HPCC Systems and VSCode NLP Language Extension, the community now has the ability to incorporate human-like “digital readers” into HPCC Systems to mine information from free text that has, up until now, been impossible to extract. Future projects will be discussed, including reading radiology reports, business reports, and real estate documents, the latter of which could open new markets across the industry. It is important for everyone to understand this new technology in order to spot potential applications for extracting unmined data that, until now, was impossible to obtain. Sharing our own use case, the end goal is to create an NLP Center of Excellence that will serve the entire company with digital readers, first in English, then in other languages, to open new streams of revenue.
- HPCC Systems Thor Monitor
Using Workunit Services and Power BI to Monitor Thor Activity, Jessica Skaggs, LexisNexis Risk Solutions Group
The ECL Workunit Services standard library functions can be used to capture details about workunits running on Thor including processing time, errors, current state, and more. Capturing these details allows for monitoring, trending, error analysis, degradation, and other data points that can help improve the efficiency of your Thor environments. We will look at how to use this information to monitor the system with visualizations in Power BI.
- Cooperative actions between University of São Paulo and LexisNexis Risk Solutions
Renato de Oliveira Moraes, University of São Paulo
Prof. Renato discusses the successful joint initiatives between the University of São Paulo (USP) and LexisNexis Risk Solutions in Brazil for leveraging HPCC Systems for teaching and learning, research, and extension activities in academia, including recent machine learning projects.
- Processing Student Image Data with Kubernetes and HPCC Systems GNN on the Cloud
Carina Wang, American Heritage School and HPCC Systems Intern 2021
In order to foster a safe learning environment, measures to bolster campus security have emerged as a top priority around the world. In this session, I will share how HPCC Systems was leveraged to process student images with Kubernetes running on the Cloud Native Platform while utilizing the Generalized Neural Network (GNN) bundle for image classification. The result is a trained model which can be implemented on the autonomous security robot we built to help campus security personnel identify visitors, students, and staff.
- Athlete 360: Leveraging HPCC Systems and RealBI for Athlete Wellness and Performance
Christopher Connelly, North Carolina State University and HPCC Systems Intern 2021
There is a lot that plays into an athlete being able to perform at their best when it matters most. Not only are there physical demands, but also factors from outside of their sport that affect their wellbeing and readiness to perform. In team sports, there are many external variables that cannot be controlled, which makes the process of gauging performance of individual athletes difficult. The better the understanding of what an athlete does and how their body responds, the better we can support them to be at their best. Within collegiate athletics, and sports in general, there is a struggle to be able to interpret data from different streams together in a single report. Furthermore, streamlined data collection can further aid our understanding of what an athlete does and how their body responds. This involves data from all aspects of an athlete’s day including wellness questionnaires, practice training loads, weight room training loads, and weight room assessments of strength, power, and fatigue. In the past we have shown the impact of using HPCC Systems with the NC State Men’s soccer team. Here you will see some solutions using HPCC Systems and RealBI to provide insight from data collected with the NC State Women's basketball team as well as how this system can serve not only the Strength and Conditioning department, but the athletics department as a whole.
Instructional Demos
Our technical engineers explain how to complete specific tasks for configuring and using your HPCC Systems platform:
...
A Simple HPCC Systems Cloud Deployment for Open-Source Users
Xiaoming Wang & Godson Fortil, LexisNexis Risk Solutions Group
This talk will cover how to set up and implement a basic HPCC Systems cluster in the cloud using Azure Kubernetes. We will walk through the deployment configuration leveraging Terraform, GitOps/Flux2 and storage settings. Note: This talk is intended for the wider open source HPCC Systems community. It is advised to check with your organization for any specific security protocols.
...
All About the HPCC Systems Metrics Framework
Ken Rowland, LexisNexis Risk Solutions Group
This presentation is for anyone interested in HPCC Systems metrics. It covers a description of the metrics framework and how its components operate, a brief explanation of how HPCC Systems components are instrumented for metric collection, configuration using helm charts, and a discussion of how HPCC Systems is planning on using metrics in areas of cluster health and scaling.
...
HPCC Systems Logging in the Cloud and an Elastic Stack Solution
Greg Panagiotatos & Rodrigo Pastrana, LexisNexis Risk Solutions Group
As HPCC Systems continues its journey to the cloud, one major challenge faced is the ephemeral nature of log data and the accessibility of distributed application-level logs. This presentation discusses these challenges, the HPCC Systems logging architecture, and a simple Elastic Stack-based solution to the challenge. We'll demonstrate in detail the end-to-end solution, which includes Helm-based deployment, Kibana configuration, HPCC Systems log exploration, querying, and filtering. We'll also discuss an advanced topic that improves log data query performance by utilizing Elasticsearch Ingest Pipelines. Finally, we'll touch on other possible solutions such as Azure Log Analytics.
...
Watch recordings of all presentations made during our Community Day Summit held during October 2023. Find out more about this event and read our blog review of the event.
In addition to these presentations, we also held a Technical Poster Competition for students to present work they have completed on projects leveraging HPCC Systems. See our poster contest wiki to find out more about our 2023 Poster Contest participants and winners.
Resources
- See the full Agenda
- Go to the Video Library and listen to sessions
- Watch videos of students talking about their posters
- 2023 Poster Contest - Meet the students, read abstracts and view posters
Presentation Tracks and Content
Plenary Sessions
Join us as Gavin Halliday, SVP and Head of Platform Engineering, LexisNexis® Risk Solutions, kicks off the 10th annual HPCC Systems Community Virtual Summit. We welcome our community keynote speakers Bill Franks, Director of the Center for Data Science and Analytics, Kennesaw State University, followed by Bahar Fardanian, Manager, Solutions Engineering, LexisNexis Risk Solutions, with an introduction to Gus Cawley, Chief Executive Officer, ZipApply, to discuss how his company uses HPCC Systems in industry. Gavin will conclude with a keynote on the latest HPCC Systems platform progress and developments before sharing key highlights of the day ahead.
- Welcome & Plenary Keynotes
Gavin Halliday, SVP and Head of Platform Engineering, LexisNexis Risk Solutions, Bill Franks, Director of the Center for Data Science and Analytics, Kennesaw State University, Bahar Fardanian, Manager, Solutions Engineering, LexisNexis Risk Solutions, with an introduction to Gus Cawley, Chief Executive Officer, ZipApply
Bill Franks will provide a keynote address on Generative AI: Moving Beyond The Hype & Hysteria. Rarely has a topic rocketed from obscurity to everyday discussion point as fast as generative AI has. Only going mainstream in late 2022, it has captured the attention of not just the data science and data engineering communities, but the world at large. With that attention has come a lot of hysterical discussions about how it will ruin society, equally unrealistic discussions of how it will transform everything, and a mix of both facts and misinformation all around. This talk will provide some sober and realistic discussions about what generative AI is, how it is being misunderstood, the strengths and pitfalls that it has, and how it might realistically change our world both in the near future and the long term. After the talk, attendees should walk away with a better feel for what’s real and what’s hype, as well as tangible ideas for how generative AI can play a role in both their personal and professional lives.
Bahar Fardanian will be joined by Gus Cawley from ZipApply for a discussion on how HPCC Systems is being used in industry and how this dynamic startup is utilizing HPCC Systems and big data analytics to help job seekers get hired faster.
Gavin Halliday will close with the latest HPCC Systems Platform Update including the 9.x roadmap and sharing with attendees a high level overview of the day.
- Awards Ceremony & Plenary
Trish McCall, Sr Director Program Management, Hugo Watanuki, Manager Community Tech Programs, George S Foreman, Prod Dev Technical Writer II, Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
Join us for the long-awaited announcement of the 2023 Community Awards and Poster Competition winners, followed by our afternoon keynote honoring our academic community and engagement.
Cloud Strategies with HPCC Systems
The latest techniques in optimizing the Cloud Native HPCC Systems platform
- HPCC Systems Deployment using K3D instance
Aryaman Gautam, Sidharth Ganesan, Srinivasan Kothandam, LexisNexis Risk Solutions ITTC
This integration package enables HPCC Systems trainers and trainees to install the HPCC Systems engine on any standalone system for training and practice. This will enable the user to work on HPCC Systems components including Thor, Roxie, Dali Storage, and ECL in a local environment.
- A Better Understanding of Thor and Roxie Configurations for the Containerized HPCC Systems Platform
Godji Fortil & Krishna Turlapathi, LexisNexis Risk Solutions
In the HPCC Systems Cloud Native Platform, configurations for the Thor and Roxie engines are coded in YAML, stored in a file and passed to the helm deployment. For this presentation, our goal is to go over those settings, recommendations and best practices, then deploy the HPCC Systems Platform using Terraform.
- How to Enable Azure Log Analytics for the Containerized HPCC Systems Platform Using Terraform
Godji Fortil, LexisNexis Risk Solutions
For log solutions, the HPCC Systems Cloud Native Platform provides a standard log interface to process HPCC Systems logs. During this presentation, we will demonstrate how to enable Azure Log Analytics using our newly developed Terraform module for HPCC Systems logs.
- Optimizing Business Solutions with Azure Fullstack Products
Mauricio Nunes and Matheus Machado, LexisNexis Risk Solutions
Learn how our LexisNexis Risk Solutions business in Brazil launched two data products using an HPCC Systems implementation on Azure. The speakers will provide an overview of components including a Web portal, Batch and ESP API as delivery channels served from a Roxie and MySQL backend, all on commercial Azure. They will close with a product demo of identity verification and fraud solutions using fictitious data while diving into the architecture, best practices, challenges, and technical experiences shared by the team.
Productivity with HPCC Systems
Leverage the improvements and innovative features in the latest HPCC Systems 9.x release
- Roxie Performance: A Deeper Technical Dive into 9.x
Gavin Halliday, LexisNexis Risk Solutions
In this presentation, Gavin will discuss some of the different factors that affect Roxie performance, as well as cover details including some of the recent platform changes which aim to make this job easier. Anyone deploying production Roxie queries or interested in the technical details of how Roxie indexes are implemented can benefit from attending this session.
- HPCC Systems Landing Zone Security
Russ Whitehead, LexisNexis Risk Solutions
Data is imported to and exported from an HPCC Systems Logical File from a file system location known as a Landing Zone. Landing zone access was previously authorized via the LDAP Security Manager feature permissions, but once a user had access to it they could access the entire Landing Zone directory. Beginning with HPCC Systems Version 9.0, admins can specify Landing Zone scopes, using the same ECL Watch interface as file scopes and workunit scopes. This presentation describes the motivation for the additional security and walks the attendee through the creation of these scopes.
- ECL Watch: Redefining Progress
Kunal Aswani, LexisNexis Risk Solutions
Delve into ECL Watch and check out the snapshot queries in Roxie, logging, hot spot identification, cost estimation, and metrics for efficient data retrieval.
- A Threat Detection and Mitigation Framework: HPCC Systems and Azure at the core
Sowmya Myneni, LexisNexis Risk Solutions
While it is fantastic to advertise that we are confirming various security controls (CIS, FedRAMP), how do we actually prove it? This talk is going to explore in-depth the concepts around control effectiveness by showing the audience some example security controls we are implementing at LexisNexis Risk Solutions and how the HPCC Systems data analytics platform is used to collect, explore and analyze the data.
Programming with HPCC Systems
Extend your ECL with powerful language tools
- Building a Scalable and Efficient Bitcoin Blockchain Parser on the HPCC Systems Platform / Optimization of Learning Trees using ECL on HPCC Systems
Jyothi Shetty, RVCE / Manonmani S & Sudershan K S, RVCE
This joint session includes two presentations from RV College of Engineering featuring the work from academic collaboration as part of the HPCC Systems Centre of Excellence (CoE) in Cognitive Intelligent Systems for Sustainable Solutions (CISSS).
Building a Scalable and Efficient Bitcoin Blockchain Parser on the HPCC Systems Platform, Jyothi Shetty:
Many commercial applications are interested in analyzing Bitcoin blockchain data, but due to the large volume of transactions the processing of transaction data becomes a challenge. A distributed and parallelized architecture such as HPCC Systems can be a solution to efficiently analyse such massive amounts of data. The work proposes to build a C++-based Bitcoin parsing algorithm embedded within ECL to create a robust and efficient blockchain parser. The proposed approach takes raw blockchain data in the form of blk.dat files as input and processes the headers and transaction data within each block. Furthermore, it maps input addresses to output addresses of previous transactions, constructing a connected network of blocks that accurately resembles the blockchain. By iteratively traversing this chain of blocks, the parsed contents are then written to a CSV file. The suggested implementation offers flexibility as it does not rely on external library dependencies and can be executed as a single ECL file. This is a work in progress study. An initial functional implementation of the parser code has been deployed successfully on hThor and the next step is to expand its capability to Thor so the parser can fully take advantage of HPCC Systems distributed and parallel processing capabilities. As the size of the blockchain data continues to grow, having a scalable and reliable ECL Bitcoin parser that enables retrieval of blockchain data with minimal overhead is quite beneficial.
Optimization of Learning Trees using ECL on HPCC Systems, Manonmani S & Sudershan K S:
Some core ECL functions on HPCC Systems, like the ones leveraged in the LearningTrees Machine Learning Bundle, are recursive in nature and hence tend to demand higher computational effort under certain conditions. This talk presents an alternative solution for optimized execution of the LearningTrees algorithms in ECL by embedding Python libraries. The proposed approach was tested on standard datasets, and a significant decrease in execution time was observed with no impact on the accuracy of the models: the embedding technique provided approximately the same accuracy levels as the existing ML library functions, while the execution time was reduced by a factor of 4 to 5. The target audience who will benefit from this talk includes industry professionals who wish to explore resources for solving big data problems, technology enthusiasts willing to learn the latest trends and technologies, and researchers working in the big data and high-performance computing domains.
- Building Trustworthy and Auditable Digital Human Readers Superior to Humans with NLP++
David de Hilster & Ashton Williamson, LexisNexis Risk Solutions & Clemson University
Machine Learning, Neural Networks and Large Language Models are statistical and opaque, human memory is fallible, and NLP++ is now stepping in to fix these problems and build trustworthy systems in areas such as law enforcement, healthcare, and sentiment analysis. Not only can NLP++ perform better than humans, but it can build systems that eventually can be better than human experts. Unlike the statistical methods of ML, NN, and LLM, which are opaque and not auditable, NLP++ is a verifiable and auditable technology, which is a game-changer in today’s critical systems. NLP++ is a glass box computer programming language and framework that seamlessly integrates with HPCC Systems to provide sophisticated text processing that, until now, has been achievable only by humans. We will look at work being done at Clemson to help solve the medical coding task that is currently done by human readers and has a 50% failure rate.
- Digital Human Readers Making History with NLP++ and HPCC Systems
Charvi Dave, Kruthika Pinnada, Pedro Rodrigues, Dheemonth Kodali, Shyamaa Karthik, David de Hilster, LexisNexis Risk Solutions, RVCE, USP Brazil, St Andrews School
From resume processing, to sentiment analysis, to digital dictionaries, NLP++ and HPCC Systems are making history. Searching for resumes is precarious when using simple keyword search, and new work using NLP++ is making resume searching an exact science. Sentiment analysis until now has been done with statistical and keyword methods and has been too generic to be used in real-world systems. Two sentiment analyzers break that mold, one involving soccer teams in Brazil, and the other cricket teams in India. NLP digital dictionaries have either been stuck in time for decades or are non-existent. Two Wiktionary projects use NLP++ to parse linguistic information from Wiktionary pages, processing hundreds of thousands and millions of pages using HPCC Systems, and creating dictionaries that can be used in future NLP systems. The result will be the most comprehensive English dictionary ever created for NLP and the first digital dictionary with linguistic information ever created for the Tamil language.
- Parquet Support for ECL
Jack Del Vecchio, LexisNexis Risk Solutions
Introducing the Parquet Plugin, an interface between ECL and Apache Arrow that gives the ECL programmer the capability to interact with the Parquet file format. This talk will demonstrate how ECL programmers can efficiently read and write Parquet files with ease. With this interface, ECL programmers can partition datasets, read any partitioned or non-partitioned dataset, and write to a Parquet file. In the demo, all the functions of the plugin will be shown. One of the key highlights of the plugin is its capability to handle datasets larger than memory. By leveraging streaming techniques, we ensure efficient processing of large-scale datasets without sacrificing performance or data integrity. Attendees will gain insights into the integration of Parquet and how the Apache Arrow library was leveraged to give the ECL programmer efficient access to the Parquet file format. A variety of demos, including usage examples, will be shown, and there will be opportunities to ask questions and learn more about the plugin.
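The blockchain parser session in this track describes reading blk.dat files, processing each block's header, and writing the parsed contents to CSV. The first of those steps can be sketched as follows. This is a minimal illustration in Python rather than the C++-in-ECL implementation the speakers describe, and it is not their code: the function names are hypothetical, it assumes the commonly documented mainnet layout (4 magic bytes, a 4-byte little-endian size, then an 80-byte header), and it skips transaction parsing and the mapping of inputs to previous outputs entirely.

```python
import csv
import hashlib
import io
import struct

# Assumption: mainnet magic bytes as they appear on disk in blk*.dat files.
MAINNET_MAGIC = bytes.fromhex("f9beb4d9")
HEADER_SIZE = 80


def parse_header(header):
    """Decode the six fields of an 80-byte block header (hypothetical helper)."""
    version, prev_hash, merkle_root, timestamp, bits, nonce = struct.unpack(
        "<i32s32sIII", header
    )
    # The block hash is the double SHA-256 of the header, shown byte-reversed.
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return {
        "hash": digest[::-1].hex(),
        "version": version,
        "prev_hash": prev_hash[::-1].hex(),
        "merkle_root": merkle_root[::-1].hex(),
        "timestamp": timestamp,
        "bits": bits,
        "nonce": nonce,
    }


def read_blocks(stream):
    """Yield parsed headers for each block record in a blk.dat-style stream."""
    while True:
        preamble = stream.read(8)
        if len(preamble) < 8 or preamble[:4] != MAINNET_MAGIC:
            return  # end of data, padding, or an unrecognized record
        (size,) = struct.unpack("<I", preamble[4:])
        block = stream.read(size)
        if len(block) < HEADER_SIZE:
            return  # truncated record
        yield parse_header(block[:HEADER_SIZE])  # transactions are ignored here


def write_csv(rows, out):
    """Write the parsed headers to a CSV stream."""
    fields = ["hash", "version", "prev_hash", "merkle_root",
              "timestamp", "bits", "nonce"]
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


# Demo input: a header-only record built from the genesis block header fields.
# A real record would also carry the block's transactions after the header.
demo_header = bytes.fromhex(
    "01000000" + "00" * 32
    + "3ba3edfd7a7b12b27ac72c3e67768f617fc81bc3888a51323a9fb8aa4b1e5e4a"
    + "29ab5f49" + "ffff001d" + "1dac2b7c"
)
demo_record = MAINNET_MAGIC + struct.pack("<I", len(demo_header)) + demo_header

rows = list(read_blocks(io.BytesIO(demo_record)))
csv_out = io.StringIO()
write_csv(rows, csv_out)
print(csv_out.getvalue())
```

Reading record by record keeps memory use flat regardless of file size, which is the same property a distributed parser would need before splitting the work across nodes.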
HPCC Systems in Action
Real World Use Cases with HPCC Systems
- An Understanding of HPCC Systems Platform Metrics
Ken Rowland, LexisNexis Risk Solutions
HPCC Systems provides a metric framework for collecting low-level, platform-specific metrics. This presentation covers the framework, available metrics, component instrumentation, and metric collection. It also covers the new ESP method execution profiling. All of it is tied together with examples from a running cluster.
- Robots, Drones and Generative AI: Exploring cutting edge technologies with HPCC Systems in a school campus environment
Tai Donovan, Carlos Caceres, Nick Schwartz, American Heritage School
School safety, security, and wellbeing have been a driving force for our most recent academic partnership with HPCC Systems. The first phase of our security project began in 2020 with an autonomous mobile sentry designed, built and programmed with facial recognition software developed by our students. This project was very effective, and we wanted to enhance its capabilities by improving response times for any security concern with the use of drones. This latest version uses autonomous drone technology to enhance the security features of our school by detecting unauthorized personnel on campus, assisting security in response to active shooter situations, gathering information during lockdowns and tracking the student/staff evacuation process if initiated. In parallel to these latest efforts, a recent HPCC Systems summer internship project investigated whether generative AI could be used to generate human-like responses in our autonomous security robots. During the internship, an interface through HPCC Systems was constructed to interact with GPT and ChatGPT. We therefore expect to be able to use a standard neural network to identify different emotions given video of a person; these emotions would then be processed by the interface to create a call to OpenAI’s API, from which an appropriate response would be generated. This response would be designed to change based on the emotions classified by the neural network and, most importantly, due to the use of generative AI, would be different even with the same emotional classification. Even though this is all still in the prototype phase, our projects have proven to be a great proof of concept developed by high school students to deal with the reality of our times.
- Tombolo - Opensource Tools for Interacting with HPCC Systems Clusters
Matt Fancher & Yadhap Dahal, LexisNexis Risk Solutions
Introducing Tombolo version 2.0, an innovative open-source project initiated by LexisNexis Risk Solutions, aimed at providing data catalog capabilities for HPCC Systems clusters. Tombolo is a web application that caters to both technical and non-technical users, facilitating seamless interaction with HPCC Systems clusters. Join us in this session as we demonstrate the capabilities of Tombolo. Discover how users can create workflows by leveraging assets from HPCC Systems clusters, enhanced with our newly introduced versioning tool. This tool gives users increased confidence when editing existing workflows. See how the application monitors assets and proactively sends timely notifications via various media, ensuring crucial data management tasks are completed promptly. We will also showcase Tombolo's intuitive dashboard and its extensive range of API endpoints, offering a comprehensive overview of asset performance. During the presentation, our team will also share insights into the future development goals of Tombolo, providing a glimpse into the features currently in development and planned for future releases.
- Vulnerable Victim Monitoring: Connecting the Dots to Find Missing Children Quickly
Pasqua Ruggiero, LexisNexis Risk Solutions
According to Child Find of America, an estimated 2,300 children go missing each day. That equates to over half a million missing children per year. Most of these children are runaways, a population that is quite vulnerable to trafficking. Trafficked victims can be exploited for their PII in relation to account takeover and stolen or synthetic identity fraud. LexisNexis Risk Solutions inquiries comprise searches for fraud and credit-seeking consumers, which are stored in HPCC Systems. These hundreds of billions of inquiry records can be searched retroactively and monitored proactively for matches on missing children within a reasonable geographic area. Additional topics of discussion will include data visualization capabilities in Python with data obtained from both the National Center for Missing & Exploited Children’s public RSS feed (current missing cases) and all missing cases from the past 20 years (obtained from the ADAM Program). We will also discuss proposals for future studies utilizing the power of HPCC Systems to extract other types of big data and provide further insights to law enforcement and social services, to better aid them in the search for missing children and the prevention of human trafficking.
*We would like to issue a content warning for this presentation for topics like human (and especially child) trafficking. We understand the difficulty of presenting information on this sensitive topic and appreciate your patience and understanding as we present our research. All data will be anonymized to not reveal any personal information.
ECL Training Workshops
The HPCC Systems Training and Support team has been very busy this year visiting universities and trade show events to promote HPCC Systems and ECL through Code Days, Hackathons, and Workshops. This year’s Community Summit workshop presents an in-depth look at these events in three 1-hour sessions:
- Part 1: The “Music is Life!” Workshop
Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
Who doesn’t love music? Take a break from your daily routine datasets and join us in this first hour. We break down a popular open-source music dataset, explore normalizing the dataset and its effects, and look at a variety of data evaluation and query techniques.
- Part 2: The “Find Your Paradise!” Hackathon
Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
Have you ever thought about building an application that can help people find places to live that maximize their quality of life and happiness? The goal of this challenge is to analyze different datasets across different categories and correlate them using the HPCC Systems platform. After analyzing the data, participants will be asked to design an interface to query it and assign it a scoring system, then deliver it to the user via ROXIE and show the user where they would most likely want to live. Users should be given choices in an easy-to-use form that, when submitted, will generate a unique set of scores based on locations.
- Part 3: Better Customer Insights Through Relational Dataset Queries
Bob Foreman, Software Engineering Lead, LexisNexis Risk Solutions
In this final hour, we look at the power of the open data model, transform normalized data into a denormalized relational dataset, and use the power of implicit relationality to analyze relationships and provide better customer insights.
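As a rough, language-neutral picture of what turning normalized data into a denormalized relational dataset means in Part 3, the sketch below nests child records inside their parent records and then queries the nested result. It is written in Python rather than ECL, and the record layouts, field names, and the `denormalize` helper are hypothetical illustrations, not material from the workshop.

```python
from collections import defaultdict

# Hypothetical normalized data: customers and orders live in separate tables.
customers = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Raj"},
    {"id": 3, "name": "Lee"},
]
orders = [
    {"customer_id": 1, "amount": 30},
    {"customer_id": 2, "amount": 20},
    {"customer_id": 1, "amount": 45},
]


def denormalize(parents, children, parent_key, child_key, child_field):
    """Return parent records with their matching children nested inside."""
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child[child_key]].append(child)
    return [
        {**parent, child_field: by_parent.get(parent[parent_key], [])}
        for parent in parents
    ]


nested = denormalize(customers, orders, "id", "customer_id", "orders")

# With children nested in place, per-customer questions need no further join.
total_spend = {
    record["name"]: sum(order["amount"] for order in record["orders"])
    for record in nested
}
print(total_spend)
```

Once the child records travel with their parent, each per-customer question becomes a simple pass over one record instead of a join across tables.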