Test suite for the HPCC Systems Parquet plugin

This project is available as a student work experience opportunity with HPCC Systems. Curious about other projects we are offering? Take a look at our Ideas List.

Student work experience opportunities also exist for students who want to suggest their own project idea. Project suggestions must be relevant to HPCC Systems and of benefit to our open source community. 

Find out about the HPCC Systems Summer Internship Program.

Project Description

Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. 

Because Parquet has become increasingly important for storing big data, HPCC Systems v9.x and later support reading and writing Parquet files directly from ECL code via a plugin.

The goal of this project is to develop a test suite for the HPCC Systems Parquet plugin. To accomplish this, a successful intern is expected to:

  • Develop an understanding of HPCC Systems and the Parquet file format.

  • Create test cases with different datasets that cover every ECL type and every Parquet type.

  • Monitor and analyze test results under multiple conditions, including different hardware configurations (e.g. number of CPUs, amount of memory).

  • Identify bottlenecks and performance improvements in the plugin code.
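
As a rough illustration of how the test cases and conditions above might be enumerated, the following Python sketch builds a test matrix from a few ECL types, dataset sizes, and CPU counts. The type list is only an illustrative subset, and the row counts, CPU counts, and the Scenario name are assumptions made for this example, not part of the plugin or the platform.

```python
from itertools import product
from typing import List, NamedTuple

# Illustrative subset of ECL value types; a real suite would cover every type.
ECL_TYPES = ["BOOLEAN", "INTEGER8", "UNSIGNED4", "REAL8",
             "DECIMAL10_2", "STRING", "UTF8", "DATA"]
ROW_COUNTS = [1_000, 1_000_000]   # hypothetical dataset sizes
CPU_COUNTS = [2, 8]               # hypothetical hardware configurations


class Scenario(NamedTuple):
    """One combination of data type, dataset size, and hardware setting."""
    ecl_type: str
    rows: int
    cpus: int


def build_matrix() -> List[Scenario]:
    """Return every combination of type, row count, and CPU count."""
    return [Scenario(*combo)
            for combo in product(ECL_TYPES, ROW_COUNTS, CPU_COUNTS)]


if __name__ == "__main__":
    for scenario in build_matrix():
        print(scenario)
```

Each scenario could then drive one write/read round trip through the plugin, so that gaps in type coverage show up as missing rows in the results.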

Completion of this project involves:

  • Creating benchmarks and analyzing the performance of the embedded Parquet plugin.

  • Comparing the plugin's performance with that of the platform's native file formats.

  • A complete GitHub project with code and documentation.

  • A blog, a recorded presentation, and a poster artifact about your project (see examples from previous years here).
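
The benchmarking and comparison work listed above could start from a small timing harness such as the Python sketch below. It is a minimal sketch under stated assumptions: the workload callables are placeholders for whatever submits a Parquet or native-format job to the platform, and the function names time_workload and relative_cost are invented for this example.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_workload(workload: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run `workload` several times and summarise wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def relative_cost(candidate: Dict[str, float], baseline: Dict[str, float]) -> float:
    """Return the candidate's median time as a multiple of the baseline's."""
    if baseline["median"] <= 0:
        raise ValueError("baseline median must be positive")
    return candidate["median"] / baseline["median"]


if __name__ == "__main__":
    # Placeholder workloads; real ones would run a Parquet job and a
    # native-format job on the same dataset.
    parquet_stats = time_workload(lambda: sum(range(200_000)))
    native_stats = time_workload(lambda: sum(range(100_000)))
    print(f"relative cost: {relative_cost(parquet_stats, native_stats):.2f}x")
```

Reporting the median alongside the minimum and maximum makes it easier to spot noisy runs before drawing conclusions about the plugin.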

By the midterm review, we would expect you to have:

  • A basic test environment set up and running the Parquet plugin.

  • Initial benchmark test code that gathers data.

Mentor

Jack Del Vecchio 
Jack.DelVecchio@lexisnexisrisk.com

Backup Mentor: TBD


Skills needed

  • Ability to code in C++.

  • Ability to build and test the HPCC Systems platform (guidance will be provided).

  • Familiarity with vector databases.

  • Ability to write test code.

Deliverables

Midterm

  • A basic test environment set up and running the Parquet plugin.

  • Initial benchmark test code that gathers data.

End of project

  • Monitoring and analyzing test results under multiple conditions, including different hardware configurations.

  • Comparing the plugin's performance with that of the platform's native file formats.

  • Documenting the project and results.

Important resources
