Interfacing a Vector Database with ECL

This project is available as a student work experience opportunity with HPCC Systems. Curious about other projects we are offering? Take a look at our Ideas List.

Student work experience opportunities also exist for students who want to suggest their own project idea. Project suggestions must be relevant to HPCC Systems and of benefit to our open source community. 

Find out about the HPCC Systems Summer Internship Program.

Project Description

Efficient data processing has become more crucial than ever for applications that involve large language models, generative AI, and semantic search. All of these applications rely on vector embeddings, a vector representation of data that carries the semantic information AI systems need to understand content and to maintain a long-term memory they can draw on when executing complex tasks. A vector database indexes and stores vector embeddings for fast retrieval and similarity search, and offers capabilities such as CRUD operations, metadata filtering, and horizontal scaling.
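To make "similarity search" concrete, the following is a minimal C++ sketch of an exhaustive nearest-neighbour lookup using cosine similarity. It is only an illustration of the idea: the function names cosineSimilarity and topK are hypothetical, and a real vector database such as Milvus uses approximate indexes rather than scanning every stored vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Cosine similarity between two embeddings of equal dimension.
// Returns 0 if either vector has zero magnitude.
float cosineSimilarity(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        dot += static_cast<double>(a[i]) * b[i];
        normA += static_cast<double>(a[i]) * a[i];
        normB += static_cast<double>(b[i]) * b[i];
    }
    if (normA == 0.0 || normB == 0.0)
        return 0.0f;
    return static_cast<float>(dot / (std::sqrt(normA) * std::sqrt(normB)));
}

typedef std::pair<std::size_t, float> ScoredIndex; // (position in store, similarity)

// Hypothetical helper: scores every stored vector against the query and
// returns the k best (index, score) pairs, highest similarity first.
std::vector<ScoredIndex> topK(const std::vector<std::vector<float>>& stored,
                              const std::vector<float>& query,
                              std::size_t k)
{
    std::vector<ScoredIndex> scored;
    scored.reserve(stored.size());
    for (std::size_t i = 0; i < stored.size(); ++i)
        scored.push_back(ScoredIndex(i, cosineSimilarity(stored[i], query)));

    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const ScoredIndex& l, const ScoredIndex& r) { return l.second > r.second; });
    scored.resize(k);
    return scored;
}
```

Calling topK with a small set of stored embeddings and a query returns the indexes of the closest matches. The cost of this scan grows linearly with the number of stored vectors, which is the reason dedicated vector databases build indexes for this operation.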

The goal of this project is to add vector database support to HPCC Systems by allowing Milvus database queries to be embedded within ECL code.

Completion of this project involves:

  • Investigating the API for calling Milvus from C++ and learning the ECL embed API.

  • Creating a simple wrapper, modeled on the MongoDB plugin, that passes lists between the ECL embed API and the Milvus API.

  • Extending the simple wrapper to handle structured data.

  • Developing test cases for the plugin that exercise all Milvus functionality and ensure all data types are passed in and returned properly.

  • Developing test cases for the plugin that verify multi-threaded access from the ECL side, including measurements of performance and throughput for examples that approximate real-world usage.

  • A complete GitHub project with code and documentation.

  • A blog, a recorded presentation, and a poster artifact about your project (see examples from previous years here).
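As a rough illustration of the "passing lists" step above, the sketch below converts between a packed buffer of 4-byte floats and a std::vector<float>, which is the kind of container a C++ client library typically expects for an embedding. It assumes that an ECL SET OF REAL4 reaches C++ as a byte count plus a pointer to contiguous elements. The helper names setToVector and vectorToSet are hypothetical, and the actual binding interface should be taken from the ECL embed API and the MongoDB plugin source rather than from this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copy a packed buffer of 4-byte floats
// (total length in bytes plus a data pointer) into a std::vector<float>.
// Assumption: the ECL side hands a SET OF REAL4 over in this layout.
std::vector<float> setToVector(std::uint32_t totalBytes, const void* data)
{
    if (totalBytes % sizeof(float) != 0)
        throw std::invalid_argument("buffer length is not a multiple of sizeof(float)");

    std::vector<float> out(totalBytes / sizeof(float));
    if (totalBytes != 0)
        std::memcpy(out.data(), data, totalBytes); // memcpy avoids alignment problems
    return out;
}

// Hypothetical helper: pack a vector of floats back into a contiguous
// byte buffer that could be returned to the caller as a set.
std::vector<unsigned char> vectorToSet(const std::vector<float>& values)
{
    std::vector<unsigned char> out(values.size() * sizeof(float));
    if (!out.empty())
        std::memcpy(out.data(), values.data(), out.size());
    return out;
}
```

A round trip through both helpers should reproduce the original bytes exactly, which makes it a convenient first test case before any database connection is involved.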

By the midterm review we would expect you to have:

  • Understood the ECL embed API and implemented a simple example that makes a connection to a Milvus database.

Mentor

Jack Del Vecchio 
Jack.DelVecchio@lexisnexisrisk.com

Backup Mentor: TBD

Skills needed

  • Ability to code in C++.

  • Ability to build and test the HPCC system (guidance will be provided).

  • Familiarity with vector databases.

  • Ability to write test code. Knowledge of ECL is not a requirement, since it should be possible to reuse existing code with minimal changes for this purpose. Links are provided below to our ECL training documentation and online courses should you wish to become familiar with the ECL language.

Deliverables

Midterm

  • Understand the ECL embed API and implement a simple example that makes a connection to a Milvus database.

End of project

  • A plugin that can handle multiple connections to Milvus and implements the ECL embed API and the full Milvus C++ API.

  • Test cases demonstrating the behavior and performance of the plugin.

  • Documentation of the Milvus plugin's usage and the functions it implements.

  • A complete GitHub project with code and documentation.

  • A blog, a recorded presentation, and a poster artifact about your project (see examples from previous years here).

Other resources

All pages in this wiki are subject to our site usage guidelines.