This project is available as a student work experience opportunity with HPCC Systems. Curious about other projects we are offering? Take a look at our Ideas List.
Student work experience opportunities also exist for students who want to suggest their own project idea. Project suggestions must be relevant to HPCC Systems and of benefit to our open source community.
Find out about the HPCC Systems Summer Internship Program.
Project Description
Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Because of Parquet's growing importance for storing big data, HPCC Systems supports reading and writing Parquet files directly from ECL code via a plugin, starting with v9.x.
The goal of this project is to develop a test suite for the HPCC Systems Parquet plugin. To accomplish this, a successful intern is expected to:
- Gain an understanding of HPCC Systems and the Parquet file format.
- Create test cases with different datasets that cover every ECL type and Parquet type.
- Monitor and analyze test results under multiple conditions, including different hardware configurations (e.g. number of CPUs, amount of memory).
- Identify bottlenecks and potential performance improvements in the plugin code.
Completion of this project involves:
- Creating benchmarks for the Parquet embedded plugin and analyzing the results.
- Comparing the plugin's performance with that of the platform's native file formats.
- Documenting the project and results.
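The benchmarking and comparison steps above can be sketched as a small timing harness. This is a minimal illustration in Python, not part of the plugin or the platform: the function names `time_runs`, `summarize` and `relative_to_baseline` are hypothetical, and the two workloads are stand-ins for whatever actually runs the Parquet and native-format jobs in your test environment.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(workload: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run `workload` `repeats` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce raw timings to the figures a benchmark report would typically quote."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def relative_to_baseline(candidate: List[float], baseline: List[float]) -> float:
    """Ratio of median times; a value above 1.0 means the candidate is slower."""
    return statistics.median(candidate) / statistics.median(baseline)


# Placeholder workloads (assumptions): in a real benchmark these would trigger
# the Parquet read/write job and the equivalent native-format job.
def parquet_workload() -> int:
    return sum(range(200_000))


def native_workload() -> int:
    return sum(range(100_000))


if __name__ == "__main__":
    parquet_times = time_runs(parquet_workload)
    native_times = time_runs(native_workload)
    print("parquet:", summarize(parquet_times))
    print("native: ", summarize(native_times))
    print("parquet/native median ratio:", relative_to_baseline(parquet_times, native_times))
```

Reporting the median alongside the minimum and maximum helps separate the typical cost of a job from one-off noise, which matters when the same test is repeated across different CPU and memory configurations.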
By the midterm review, we would expect you to have:
- A basic test environment set up, with the Parquet plugin running.
- Initial benchmark test code that gathers data.
Mentor | Jack Del Vecchio; Backup Mentor: TBD |
Skills needed | |
Deliverables | Midterm; End of project |
Other resources | |