Scarlett Huang - 2024 Poster Contest Resources
Scarlett Huang is a rising junior at A.W. Dreyfoos School of the Arts (high school). She is a varsity tennis player and a member of the Math Honor Society at her school. Scarlett has been involved in competitive math, tennis, and piano from a young age, though she has found less time for these activities since starting high school. In college, she aims to study mathematics or computer science and hopes to pursue a career in tech. In her free time, she enjoys baking, drawing, and hiking, especially when traveling to new places, whether California or Scotland.
Poster Abstract
HPCC Systems is a robust, open-source platform designed to run in diverse environments to address big data challenges. This project focuses on integrating HPCC Systems with Google Cloud's BigQuery, utilizing multiple data transfer methods to streamline data migration and analysis. This showcases the flexibility and capability of HPCC Systems in leveraging modern cloud services, thus optimizing big data workflows.
The first data transfer method migrates large datasets from HPCC Systems to BigQuery in three steps. First, a file is desprayed to the Landing Zone of HPCC Systems using an ECL script. Next, the file is uploaded from the Landing Zone into Google Cloud Storage (GCS) using a Java program, so that the data is securely stored in a cloud environment and ready for further processing. Finally, the BigQuery Data Transfer Service transfers the data from GCS into BigQuery. This is a scheduled service that automates the data loading process, so the data is imported into BigQuery for analysis without manual intervention. This method highlights a structured approach to data transfer, leveraging Google Cloud's storage and transfer services to enhance data migration from HPCC Systems.
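The hand-off between the last two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's Java program: the bucket, prefix, dataset, table, and Landing Zone path are hypothetical, and the dictionary mirrors the parameter names the BigQuery Data Transfer Service uses for Cloud Storage transfers as I understand them. It builds a description of the transfer and does not call any Google Cloud API.

```python
from posixpath import basename


def gcs_uri(bucket: str, prefix: str, landing_zone_path: str) -> str:
    """Return the gs:// URI a Landing Zone file would have after upload."""
    name = basename(landing_zone_path)
    return f"gs://{bucket}/{prefix.strip('/')}/{name}"


def transfer_config(uri: str, dataset: str, table: str) -> dict:
    """Describe a scheduled GCS-to-BigQuery load as a plain dictionary."""
    return {
        "destination_dataset_id": dataset,
        "data_source_id": "google_cloud_storage",
        "schedule": "every 24 hours",  # assumed schedule, adjust as needed
        "params": {
            "data_path_template": uri,
            "destination_table_name_template": table,
            "file_format": "CSV",  # assumes the desprayed file is CSV
            "skip_leading_rows": "1",  # assumes a single header row
        },
    }


# Hypothetical names throughout; none of these come from the project itself.
uri = gcs_uri(
    "example-bucket",
    "hpcc/exports",
    "/var/lib/HPCCSystems/mydropzone/people.csv",
)
config = transfer_config(uri, "example_dataset", "people")
print(config["params"]["data_path_template"])
```

Keeping the object URI in one helper means the upload step and the scheduled transfer agree on where the file lives.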
The second data transfer method employs Google Cloud Pub/Sub, a messaging service suited to more dynamic workloads. Pub/Sub allows for real-time data streaming by transferring data in JSON-formatted messages. This method is particularly useful for scenarios that require immediate data processing and handling. By using Pub/Sub, data can be continuously streamed from HPCC Systems (Roxie Server) to BigQuery, so that the latest data is available for analysis.
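To make the JSON-formatted messages concrete, here is a minimal Python sketch of how one record could be wrapped for publishing. It assumes the shape of the Pub/Sub REST publish request, in which each message carries a base64-encoded `data` field and optional string `attributes`. The record fields and the `source` attribute are hypothetical, and nothing is actually sent to Pub/Sub.

```python
import base64
import json


def to_pubsub_message(record: dict, source: str) -> dict:
    """Wrap one record as a Pub/Sub-style message with a JSON payload."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return {
        "data": base64.b64encode(payload).decode("ascii"),
        "attributes": {"source": source},  # hypothetical attribute
    }


def from_pubsub_message(message: dict) -> dict:
    """Decode the JSON payload back into a record (subscriber side)."""
    return json.loads(base64.b64decode(message["data"]))


# Hypothetical row, standing in for a result returned by a Roxie query.
row = {"id": 1, "name": "Ada", "city": "London"}
message = to_pubsub_message(row, source="roxie")
publish_body = {"messages": [message]}  # body of a publish request

# The round trip should return the original record unchanged.
assert from_pubsub_message(message) == row
```

Each message is self-describing, so a subscriber can decode rows one at a time as they arrive instead of waiting for a complete file.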
Both methods illustrate the seamless integration between HPCC Systems and Google Cloud services, enhancing the capability of HPCC Systems to handle and process big data efficiently. By leveraging the cloud infrastructure, the project demonstrates how data workflows can be optimized to provide scalable and efficient data management strategies. These integrations are pivotal for organizations looking to enhance their data processing abilities and make better use of big data resources.
So far, I have successfully completed the two methods of data transfer described above. Looking ahead, there is significant potential to expand the data integration and transformation further using Google Cloud's Pub/Sub service. One promising area of development involves leveraging Pub/Sub to join different data files before transferring them to BigQuery for analysis. Another path I could explore in the future is Kafka. Kafka is widely regarded as a leading messaging service because of its compatibility with existing systems, advanced stream processing capabilities, and detailed configuration control. However, Kafka requires an initial setup and configuration process. It can be a crucial part of the data pipeline, whether integrated with HPCC Systems or Google Cloud. Finally, since the user has chosen HPCC Systems for permanent storage, adding a feature to upload processed data back to HPCC Systems would be beneficial.
Presentation
In this Video Recording, Scarlett provides a tour and explanation of her poster content.
BigQuery Integration with HPCC Systems: