Poster presentation abstracts 2016
Deep Kernel: Learning kernel function from data using deep neural networks on HPCC Systems
Linh Le and Ying Xie - Kennesaw State University
1st Place Winning Entry
Kernel methods are a family of machine learning algorithms specialized in pattern analysis that use a kernel function to implicitly map the data into a high-dimensional feature space where further learning or analytics is conducted. Although kernel methods are among the most elegant parts of machine learning, it remains challenging for users to define or select a proper kernel function with optimized parameter settings for their data.
Deep Learning is a different branch of machine learning that makes use of deep neural networks to model high-level representations of the data. We propose a novel method called Deep Kernel that can automatically learn a kernel function from data using deep learning methods. Specifically, a deep neural network can be automatically created for the given data, and an optimized kernel function can be learned from the data by using the deep neural network. Accordingly, the burden on users to define, select, and/or configure a proper kernel function when applying kernel methods is relieved.
We implement the proposed Deep Kernel using ECL on HPCC Systems® to efficiently process training data with a large number of pairs. With our implementation, users of HPCC Systems® need only provide the training data; the kernel function is then generated automatically, without manual optimization. Experimental results show that the proposed Deep Kernel method outperforms optimized Gaussian kernels.
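The abstract does not reproduce the ECL implementation, so the following is a minimal single-node Python sketch of the general idea: a small neural network is trained on pairs of training points to score their similarity, and the learned score is used as a data-driven kernel inside a standard kernel method. The pair construction, network size, and use of scikit-learn are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (not the authors' ECL implementation): a small neural
# network learns a similarity score from pairs of points, and that score is
# plugged into a kernel method as a learned kernel.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Build pairwise training data: feature = |x_i - x_j|, label = "same class?".
rng = np.random.default_rng(0)
i = rng.integers(0, len(X_tr), 2000)
j = rng.integers(0, len(X_tr), 2000)
pair_feats = np.abs(X_tr[i] - X_tr[j])
pair_labels = (y_tr[i] == y_tr[j]).astype(int)

# The network learns to score how similar a pair of points is.
net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
net.fit(pair_feats, pair_labels)

def deep_kernel(A, B):
    """Learned similarity between every row of A and every row of B."""
    diffs = np.abs(A[:, None, :] - B[None, :, :]).reshape(-1, A.shape[1])
    return net.predict_proba(diffs)[:, 1].reshape(len(A), len(B))

# Use the learned kernel inside a kernel method (here an SVM); note that a
# learned similarity is not guaranteed to be positive semi-definite.
svm = SVC(kernel="precomputed").fit(deep_kernel(X_tr, X_tr), y_tr)
print("test accuracy:", svm.score(deep_kernel(X_te, X_tr), y_te))
```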
Empowering the ECL-ML Library
Vivek Nair - North Carolina State University
2nd Place Winning Entry
The ECL machine learning (ML) library hosts multiple parallel machine learning algorithms. However, it is not widely used by the community because users must learn the ECL language as well as the inherent assumptions of the algorithms. The objective of this work is to 'empower' the ML library to make it more usable. Two parallel approaches are adopted to solve the usability issues of the library: (i) building a testing suite, which compares performance scores (e.g. accuracy) with other implementations such as scikit-learn, and (ii) leveraging the DSP (a web application) to use the ML algorithms. The approach was demonstrated to the members of the analytics team during a ‘show and tell’. Their feedback highlighted the low learning curve (users do not need to know ECL) and the time saved by not downloading data, which also avoids compliance issues. This tool enables efficient machine learning development in practice. We plan to extend this plugin suite to support more state-of-the-art algorithms and to include support for hyper-parameter optimization (e.g. grid search).
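As a rough illustration of the kind of cross-check the testing suite performs, and of the planned grid-search support, the sketch below scores a scikit-learn baseline whose accuracy could be compared against the corresponding ECL-ML result; the dataset, estimator, and parameter grid are hypothetical choices.

```python
# Hedged sketch of the cross-check idea: score a scikit-learn baseline and
# compare it against the accuracy produced by the ECL-ML algorithm on HPCC.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

baseline = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5).mean()
print(f"scikit-learn baseline accuracy: {baseline:.3f}")
# The ECL-ML result would be produced on HPCC and compared against `baseline`.

# Planned hyper-parameter optimization support, illustrated with a grid search.
grid = GridSearchCV(LogisticRegression(max_iter=5000),
                    param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("best params:", grid.best_params_, "accuracy:", round(grid.best_score_, 3))
```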
Understanding high dimensional networks for continuous variables - The CSCS Algorithm
Syed Rahman - University of Florida
3rd Place Winning Entry
The availability of high dimensional data (or “big data”) has touched almost every field of science and industry. Such data, where the number of variables (features) is often much higher than the number of samples, is now more pervasive than it has ever been. Discovering meaningful relationships between the variables in such data is one of the major challenges with which modern-day data scientists have to contend. The covariance matrix of the variables is the most fundamental quantity that can help us understand the complex multivariate relationships in the data. In addition to estimating the inverse covariance matrix, CSCS can be used to detect the edges in a directed acyclic graph, as opposed to the edges of an undirected graph, for which CONCORD (presented at the 2015 summit) was used. Similar to the CONCORD algorithm, the CSCS algorithm works by minimizing a convex objective function through a cyclic coordinate minimization approach. In addition, it is theoretically guaranteed to converge to a global minimum of the objective function. One of the main advantages of CSCS is that each row can be calculated independently of the other rows, and thus we are able to harness the power of distributed computing.
Covariance estimation for high-dimensional datasets is a fundamental problem in modern-day statistics with numerous applications. In these high dimensional datasets, the number of variables p is typically larger than the sample size n. A popular way of tackling this challenge is to induce sparsity in the covariance matrix, its inverse or a relevant transformation. In particular, methods inducing sparsity in the Cholesky parameter of the inverse covariance matrix can be useful as they are guaranteed to give a positive definite estimate of the covariance matrix. Also, the estimated sparsity pattern corresponds to a Directed Acyclic Graph (DAG) model for Gaussian data. In recent years, two useful penalized likelihood methods for sparse estimation of this Cholesky parameter (with no restrictions on the sparsity pattern) have been developed. However, these methods either consider a non-convex optimization problem, which leads to convergence issues when p > n, or achieve a convex formulation by placing a strict constraint on the conditional variance parameters. In this paper, we propose a new penalized likelihood method for sparse estimation of the inverse covariance Cholesky parameter that aims to overcome some of the shortcomings of current methods, but retains their respective strengths. We obtain a jointly convex formulation for our objective function, which leads to convergence guarantees, even when p > n. The approach always leads to a positive definite and symmetric estimator of the covariance matrix. We establish high-dimensional estimation and sparsity selection consistency, and also demonstrate finite-sample performance on simulated and real data.
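To make the row-wise, cyclic coordinate minimization concrete, here is a hedged NumPy reconstruction of one row's subproblem based on the description above: a convex penalized objective, soft-thresholding updates for the penalized entries, and a positive root for the diagonal entry. It is an illustrative sketch, not the authors' HPCC/ECL code, and details such as penalty placement and stopping rules may differ from the published CSCS algorithm.

```python
# Illustrative reconstruction of the row-wise structure: each row of the
# Cholesky factor L is obtained by cyclic coordinate minimization of a
# penalized convex objective, and rows are independent (hence parallelizable).
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def cscs_row(S, i, lam, n_sweeps=100, tol=1e-8):
    """Solve for row i (0-based) of the lower-triangular factor L."""
    A = S[: i + 1, : i + 1]                    # leading submatrix of sample cov.
    eta = np.zeros(i + 1)
    eta[i] = 1.0 / np.sqrt(max(A[i, i], 1e-12))
    for _ in range(n_sweeps):
        old = eta.copy()
        for j in range(i + 1):
            c = A[j] @ eta - A[j, j] * eta[j]  # cross term excluding eta_j
            if j < i:                          # penalized off-diagonal entry
                eta[j] = soft_threshold(-2.0 * c, lam) / (2.0 * A[j, j])
            else:                              # diagonal entry: positive root
                eta[j] = (-c + np.sqrt(c * c + 4.0 * A[j, j])) / (2.0 * A[j, j])
        if np.max(np.abs(eta - old)) < tol:
            break
    return eta

# Each row can be computed on a different node; stacking the rows gives L.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
S = np.cov(X, rowvar=False)
L = np.zeros((10, 10))
for i in range(10):                            # embarrassingly parallel loop
    L[i, : i + 1] = cscs_row(S, i, lam=0.5)
Omega = L.T @ L                                # sparse inverse-covariance estimate
```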
Analyzing Clustered Latent Dirichlet Allocation (CLDA)
Christopher Gropp - Clemson University
Latent Dirichlet Allocation (LDA) forms the basis of many topic modeling approaches and can distill a large corpus of documents into a small set of topics. Dynamic Topic Models (DTM) extend this to capture change over time, but at a prohibitive performance cost. We develop and analyze Clustered Latent Dirichlet Allocation (CLDA), which uses existing highly parallel components to gain these insights at scale on large datasets.
The CLDA algorithm segments a large dataset by some marker, such as time or geography. Each segment is independently input to a parallel implementation of LDA, producing a set of topics for each segment. These topics are merged into a single collection, and then clustered using a parallel implementation of k-means. The final output contains clusters of topics and their representative centroids, along with information on the relative representation of these topics and topic clusters in the dataset.
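A compact single-node sketch of this pipeline is given below, with scikit-learn's LDA and k-means standing in for PLDA+ and the parallel k-means; the toy corpus, segment keys, topic counts, and cluster count are all hypothetical.

```python
# Illustrative CLDA pipeline with single-node stand-ins for the parallel pieces.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# 1) Segment the corpus by some marker (e.g., time slice or geography).
segments = {
    "2015": ["hpcc thor roxie cluster", "parallel data processing engine"],
    "2016": ["topic models scale to large corpora", "clusters of topics over time"],
}

vectorizer = CountVectorizer()
vectorizer.fit([doc for docs in segments.values() for doc in docs])

# 2) Run LDA independently on each segment (these runs are embarrassingly parallel).
topics = []
for docs in segments.values():
    counts = vectorizer.transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    topics.append(topic_word)

# 3) Merge all segment topics and cluster them with k-means.
all_topics = np.vstack(topics)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(all_topics)
print("topic-cluster assignments:", km.labels_)
print("cluster centroids shape:", km.cluster_centers_.shape)
```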
Our implementation of CLDA uses HPCC and R for preprocessing, and Python for intermediate data manipulation. We use PLDA+ as a strong parallel implementation of LDA, and a parallel k-means developed by Northwestern University. This method runs two orders of magnitude faster than DTM, achieves superior perplexity, and still generates similar topics.
HPCC is used for preliminary data cleaning, but an ECL implementation of LDA is in development. Once this is complete, the CLDA pipeline can be transitioned fully into HPCC, allowing for easy integration into a data exploration workflow. Our future work also includes the development of better metrics with which to analyze topic model output, as our analysis revealed problems with the perplexity metric.
Column Level Security on HPCC Systems
Suk Hwang Hong - Georgia Tech
Currently, the HPCC Systems platform does not provide a native way to control access to individual columns in logical files. Workarounds such as embedding security logic within a query are not flexible and put an extra burden on ECL programmers. Column-level security on HPCC Systems is certainly a desirable goal, and I worked on this topic during my summer internship with LexisNexis Risk Solutions Group.
In this poster presentation, I will present and describe two possible ideas for implementing column-level security on HPCC Systems: High-level CLS and Low-level CLS.
High-level CLS focuses on individual columns in delivered data and targets end-users who merely run deployed queries. Four access control modes (GRANT/DENY/MASK/HIDE) and a “view” concept are introduced to facilitate it. I have also implemented a simple proof-of-concept for High-level CLS to demonstrate its feasibility; I will present it and explain its implementation details and the challenges I faced.
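The snippet below is one hypothetical reading of the High-level CLS idea: a "view" maps each column of the delivered data to one of the four access modes, and the view is applied to every record before delivery. It is illustrative only and does not reflect the actual HPCC Systems design or API.

```python
# Hypothetical sketch of applying a High-level CLS "view" to delivered records.
GRANT, DENY, MASK, HIDE = "GRANT", "DENY", "MASK", "HIDE"

def apply_view(row, view):
    """Apply a view's per-column access modes to one delivered record."""
    out = {}
    for col, value in row.items():
        mode = view.get(col, DENY)               # unlisted columns default to DENY
        if mode == GRANT:
            out[col] = value                     # deliver the value as-is
        elif mode == MASK:
            out[col] = "*" * len(str(value))     # deliver a masked placeholder
        elif mode == HIDE:
            continue                             # silently drop the column
        else:                                    # DENY: reject the request
            raise PermissionError(f"access to column '{col}' is denied")
    return out

record = {"name": "Alice", "ssn": "123-45-6789", "salary": 90000}
analyst_view = {"name": GRANT, "ssn": MASK, "salary": HIDE}
print(apply_view(record, analyst_view))          # masked ssn, salary hidden

restricted_view = {"name": GRANT, "ssn": DENY, "salary": HIDE}
try:
    apply_view(record, restricted_view)
except PermissionError as err:
    print(err)                                   # access to column 'ssn' is denied
```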
Low-level CLS focuses on individual columns in (huge) flat files in THOR and targets ECL programmers who frequently deploy queries to THOR. Because it uses the concept of “SuperColumns”, which are similar to SuperFiles, it can leverage existing file scopes to enforce low-level column-level security.
Both types of column-level security can co-exist on HPCC Systems. I believe that, if implemented correctly, they will together provide very strong and flexible security on HPCC Systems.
Elastic Computing Support for HPCC
Chin-Jung Hsu - North Carolina State University
This research project aims to add elastic computing support to Roxie in HPCC. Because workload demands can grow and shrink, elastic computing support allows a Roxie cluster to be properly sized, so that the cluster is neither over-provisioned nor under-utilized. This project brings Roxie an elasticity mechanism along with performance optimizations for it. Our prototype tracks hot spots and enables optimal data placement for elastic operations. More specifically, our proposed distribution-aware data placement helps achieve optimal system performance for expanded and reduced Roxie clusters.
In this project, our solution includes three main components: 1) dynamic cluster membership management, which enables Roxie to grow and shrink on demand, 2) a workload monitor that tracks hot spots of partition and node access, and 3) distribution-aware data placement for optimizing Roxie performance in support of dynamic changes in cluster size and diverse access patterns.
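The following sketch illustrates the roles of components (2) and (3) with stand-in logic: access counts approximate the workload monitor's hot-spot tracking, and a greedy, load-balancing assignment approximates distribution-aware data placement when the cluster is resized. The data and the placement heuristic are assumptions, not the project's actual algorithm.

```python
# Illustrative stand-in for hot-spot tracking and distribution-aware placement.
from collections import Counter
import heapq

# 2) Workload monitor: count accesses per partition observed in the query log.
access_log = ["p1", "p3", "p1", "p2", "p1", "p3", "p4", "p1", "p2"]
hotness = Counter(access_log)                 # e.g. p1 is the hot spot here

def place_partitions(hotness, nodes):
    """3) Place the hottest partitions first, each onto the least-loaded node."""
    heap = [(0, node) for node in nodes]      # (accumulated load, node)
    heapq.heapify(heap)
    placement = {}
    for part, load in hotness.most_common():
        node_load, node = heapq.heappop(heap)
        placement[part] = node
        heapq.heappush(heap, (node_load + load, node))
    return placement

# 1) Elastic membership: the same placement routine is simply re-run when the
# cluster grows from two to three nodes.
print(place_partitions(hotness, ["node-a", "node-b"]))
print(place_partitions(hotness, ["node-a", "node-b", "node-c"]))
```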
To verify our design, we have designed and implemented a distributed benchmark suite that evaluates Roxie performance in a configurable and systematic way. This benchmark suite can evaluate Roxie performance at very large scale (thousands of queries per minute). Our preliminary results show that the proposed distribution-aware data placement can improve Roxie performance by 2.3 times in the best-case scenario.
Unsupervised Learning and Image Classification in a High Performance Computing Cluster
Itauma Itauma - Wayne State University
Feature learning and object classification in machine learning are ongoing research areas. Identifying good features benefits object classification by decreasing computational cost and increasing classification accuracy. Many research studies have focused on improving optimization methods and on the use of Graphics Processing Units (GPUs) to improve the training time of machine learning algorithms.
This study explores feature learning and object classification ideas on the HPCC Systems platform. HPCC Systems is a Big Data processing and massively parallel processing (MPP) computing platform used for solving Big Data problems. Algorithms are implemented in HPCC Systems with a language called Enterprise Control Language (ECL), which is a declarative, data-centric programming language. It is a powerful, high-level, parallel programming language ideal for Big Data-intensive applications.
This study utilized a novel multimodal learning and object identification framework on a High Performance Computing Cluster (HPCC Systems) to speed up the optimization stages and to handle data of any dimension. This framework first learns representative bases (or centroids) over unlabeled data for each modality through the K-means unsupervised learning method. Then, to extract the desired features from the labeled data, the correlation between the labeled data and the representative bases is calculated. These labeled features are fused to represent the identity and then fed to classifiers to make the final recognition. This novel framework was then evaluated on several databases, such as CALTECH-101, the AR database, and a subset of the wild PubFig83 data to which multimedia content was added. Classification accuracy was found to be improved.
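A single-node sketch of the framework's three stages is shown below, using the scikit-learn digits dataset as a stand-in for the image databases: K-means learns representative bases, correlation with those bases yields the features, and a classifier performs the final recognition. The dataset, number of bases, and classifier are illustrative choices, not those of the study.

```python
# Illustrative K-means feature-learning pipeline on a stand-in dataset.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Learn representative bases (centroids) without using the labels.
bases = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X_tr).cluster_centers_

def correlation_features(X, bases):
    """2) Feature extraction: correlation of each sample with each basis."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Bc = bases - bases.mean(axis=1, keepdims=True)
    num = Xc @ Bc.T
    den = np.linalg.norm(Xc, axis=1, keepdims=True) * np.linalg.norm(Bc, axis=1)
    return num / np.maximum(den, 1e-12)

# 3) Feed the extracted features to a classifier for the final recognition.
clf = LogisticRegression(max_iter=2000).fit(correlation_features(X_tr, bases), y_tr)
print("test accuracy:", clf.score(correlation_features(X_te, bases), y_te))
```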
Yinyang K-Means Clustering Algorithm in HPCC Systems
Lily Xu - Clemson University
Yinyang K-means is a recent attempt to improve the scalability and speed of the classic K-means algorithm. The algorithm involves two steps: the assignment step, which assigns each point to its closest center, and the update step, which relocates the K centers. Yinyang K-means improves the assignment step by applying a group filter and a local filter to avoid unnecessary calculations. The results from the optimized assignment step also reduce the computation in the update step. Speedups over standard K-means ranging from two times to an order of magnitude can be achieved.
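For intuition, here is a minimal NumPy sketch of the filtering idea using only a global filter: per-point upper and lower bounds are maintained with the triangle inequality so that exact distance computations are skipped for points whose assignment cannot change. The full Yinyang algorithm additionally groups the centers and maintains per-group and local bounds; everything below is illustrative and unrelated to the ECL implementation.

```python
# Simplified K-means with a Yinyang-style *global* filter only (illustrative).
import numpy as np

def kmeans_global_filter(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()

    # Initial exact assignment plus bounds.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    ub = d[np.arange(len(X)), assign]              # dist to own center
    d[np.arange(len(X)), assign] = np.inf
    lb = d.min(axis=1)                             # dist to 2nd-closest center

    for _ in range(n_iter):
        # Update step: recompute centers and how far each one moved.
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        drift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers

        # Maintain bounds via the triangle inequality.
        ub += drift[assign]
        lb -= drift.max()

        # Global filter: points with ub <= lb cannot change assignment.
        cand = np.where(ub > lb)[0]
        if len(cand):
            d = np.linalg.norm(X[cand, None, :] - centers[None, :, :], axis=2)
            assign[cand] = d.argmin(axis=1)
            ub[cand] = d[np.arange(len(cand)), assign[cand]]
            d[np.arange(len(cand)), assign[cand]] = np.inf
            lb[cand] = d.min(axis=1)
    return centers, assign

rng = np.random.default_rng(1)
centers, labels = kmeans_global_filter(rng.standard_normal((500, 4)), k=5)
print(centers.shape, np.bincount(labels))
```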
HPCC Systems (High-Performance Computing Cluster) is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions Group. The HPCC Systems platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC Systems platform also includes a data-centric declarative programming language for parallel data processing called ECL. [1]
The present work implements the Yinyang K-means clustering algorithm in HPCC. The current stage of the project implements the group filter and the local filter without the concept of grouping the centroids. The results have been tested in HPCC with two datasets: DP100 and IRIS. Results show that Yinyang K-means produces results consistent with standard K-means. At this stage of the project, the performance of Yinyang K-means is expected to be slower than K-means because of the missing grouping concept and because the test sets are smaller than the dataset used in the original paper.
The future work of this project is to include grouping in the current implementation of the Yinyang K-means algorithm in HPCC Systems. The implementation will be tested with larger datasets to verify the advantages of Yinyang K-means over standard K-means and to determine what role HPCC plays in improving the performance of the clustering algorithm.