This project is already taken and is no longer available for the 2023 HPCC Systems Intern Program
This project is available as a student work experience opportunity with HPCC Systems. Curious about other projects we are offering? Take a look at our Ideas List.
Student work experience opportunities also exist for students who want to suggest their own project idea. Project suggestions must be relevant to HPCC Systems and of benefit to our open source community.
Find out about the HPCC Systems Summer Internship Program.
Project Description
The HPCC Systems platform includes system configurations to support two main clusters: a Thor cluster for parallel batch data processing and a Roxie cluster to support high-performance data delivery applications using indexed data files. Because the two clusters serve different purposes, each has its own requirements for functional and performance testing. This has always been the case for bare-metal (on-premises) deployments and now applies equally to containerized deployments.
This project focuses exclusively on the Roxie cluster environment and requires at least some basic knowledge of the HPCC Systems platform and test methodology. Currently, HPCC Systems has a regression test suite (https://github.com/hpcc-systems/HPCC-Platform/tree/master/testing/regress) and a performance test suite (https://github.com/hpcc-systems/PerformanceTesting), both originally developed for bare-metal deployments.
The goal of this project is to adapt these test suites to a containerized paradigm, focusing mainly on Roxie jobs in various cloud setup configurations, such as different storage types, cluster sizes, Kubernetes node sizes, etc.
The code can be developed and tested on a local Kubernetes cluster; the actual measurements will be conducted primarily on Azure.
Here are some dimensions for the test:
- Storage types: diskfile, blob
- Encryption (storage/volume): on/off
- Datasets: various datasets will be provided for the project
- Queries: various queries will be provided for the project
- Cluster size: various Roxie cluster sizes can be used for benchmarking
- Network: private/public endpoints
- Caching: TBD
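The dimensions above multiply quickly, so it can help to enumerate the combinations explicitly before deciding which ones to run. The following is a minimal sketch in Python, not part of the existing test suites. The dictionary keys, the `build_test_matrix` helper, and the Roxie sizes are hypothetical placeholders chosen for illustration; only the storage, encryption, and network values come from the list above.

```python
from itertools import product

# Hypothetical dimension values. The key names are illustrative and are
# not taken from any HPCC Systems configuration schema.
DIMENSIONS = {
    "storage": ["diskfile", "blob"],
    "encryption": ["on", "off"],
    "network": ["private", "public"],
    "roxie_size": [2, 4, 8],  # assumed cluster sizes, for illustration only
}


def build_test_matrix(dimensions):
    """Return one dict per combination of dimension values."""
    names = list(dimensions)
    value_lists = [dimensions[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


matrix = build_test_matrix(DIMENSIONS)
print(f"{len(matrix)} configurations")
for config in matrix[:3]:
    print(config)
```

A full cross product may be more than is affordable on Azure, so in practice the matrix would probably be filtered down to the combinations of interest before any cluster is deployed.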
Additional prerequisites and considerations about this project
- A good grasp of HPCC Systems and how it differs from a relational database (or from something that processes streamed data) is required
- Extend the performance test to include other activities and aspects of the system (e.g. lots of tiny subgraphs, lots of small graphs, workflow dependencies)
- Monitor and analyse whether variations in test results come from cloud noise or from HPCC Systems Platform code changes. To make this possible, the new tests should be informative and representative of the work that is done on the platform
- We want to reduce costs, which in the cloud are a combination of time taken and machine type. It would be interesting to know how the performance of different activities changes under different constraints (e.g. number of CPUs, memory, network bandwidth), and even better if the results highlighted areas in the platform where a small amount of work would give a significant reduction in cost.
- Focus on improving the performance suite, and gathering and analyzing the stats already generated by the platform (rather than using other benchmarking tools). Work closely with our performance test team to produce graphs and look at trends for the performance suite.
- Do not try to compare an RDBMS benchmark directly and exactly with Roxie. Instead, as a follow-on to the performance suite work, come up with a few types of queries (or select some from the existing performance suite) and run them at various loads and AKS cluster sizes to show performance and, possibly, scaling.
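One simple way to approach the cloud-noise question above is to run the same build several times to estimate run-to-run variation, and then check whether a new build falls outside that band. The sketch below is only an illustration under stated assumptions: `classify_change` is a hypothetical helper, the threshold of three standard deviations is an arbitrary starting point, and the timings are made-up numbers. It is a plain threshold on the difference of means, not a rigorous statistical test.

```python
from statistics import mean, stdev


def classify_change(baseline_ms, candidate_ms, threshold=3.0):
    """Compare candidate timings with repeated baseline timings.

    Returns "noise" when the candidate mean lies within `threshold`
    baseline standard deviations of the baseline mean; otherwise
    returns "regression" (slower) or "improvement" (faster).
    """
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    if not candidate_ms:
        raise ValueError("need at least one candidate run")
    base_mean = mean(baseline_ms)
    base_sd = stdev(baseline_ms)  # sample standard deviation of the baseline
    delta = mean(candidate_ms) - base_mean
    if abs(delta) <= threshold * base_sd:
        return "noise"
    return "regression" if delta > 0 else "improvement"


# Made-up query timings in milliseconds, for illustration only.
baseline = [100, 102, 98, 101, 99]
print(classify_change(baseline, [101, 103, 100]))
print(classify_change(baseline, [110, 112, 111]))
```

With these numbers the first candidate stays inside the noise band and the second is flagged as slower. Real measurements would likely need more runs per configuration and a comparison per query rather than per suite.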
A GitHub project should be created to host all files and documentation.
The student will work closely with our build and test team.
Completion of this project involves:
- Measurement on Azure AKS.
- A complete GitHub project with documentation.
By the mid term review we would expect you to have:
- A GitHub project with design and initial code implementation.
- Basic setup and measurement on Azure.
Mentor | Krishna Turlapathi |
Backup Mentor | Attila Vamos |
Skills needed | |
Deliverables | Midterm: |
| End of project: Complete GitHub project with documentation. Finish measurements for Azure. |
Other resources | |