Timothy C. Havens, Assistant Professor, Electrical and Computer Engineering and Computer Science and William and Gloria Jackson Assistant Professor, Michigan Technological University
Abstract: Since the early 1990s, the ubiquity of personal computing technology has produced an abundance of staggeringly large data sets: it is estimated that Facebook alone logs over 25 terabytes of data per day, and large bioinformatics data sets that integrate microarrays, sequences, and ontology annotations continue to grow. To compound the problem, these data sets are populated from disparate, often unknown, sources and come in a wide range of formats. There is a great need for systems by which one can elucidate the similarity among and between groups in these data sets and produce easy-to-understand visualizations of the results. In this talk, I will discuss methods for efficiently and accurately approximating the solution of the kernel c-means clustering algorithm, focusing on both crisp and fuzzy variants. Kernel clustering has been shown to be effective for data sets that are high-dimensional or whose groups are not linearly separable in the input space. However, kernel c-means (or k-means) algorithms present computational and storage challenges: clustering 500,000 objects requires 1 terabyte of main memory. I will show that on medium-scale data (~50,000 objects) the proposed approximate and streaming kernel k-means algorithms give up to three orders of magnitude speed-up and a constant-factor reduction in memory footprint with little-to-no degradation in performance, as compared to literal kernel k-means. I will also demonstrate that the algorithms perform well on large-scale data (>500,000 objects), including magnetic resonance imaging volumes. Last, I will apply my methods to bioinformatics data composed of genes described by Gene Ontology annotations to show how they can be used for comparative genomics.
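The memory figure quoted in the abstract can be checked with a back-of-the-envelope calculation. The sketch below assumes the dominant cost is a dense n x n kernel matrix stored in single precision (4 bytes per entry); neither assumption is stated in the abstract, and the function name `kernel_matrix_bytes` is hypothetical, introduced only for illustration.

```python
def kernel_matrix_bytes(n_objects: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed to hold a dense n x n kernel matrix.

    Assumption: every pairwise kernel value is stored explicitly, with
    4-byte (single-precision) entries unless told otherwise.
    """
    return n_objects * n_objects * bytes_per_entry


if __name__ == "__main__":
    # The two data-set sizes mentioned in the abstract.
    for n in (50_000, 500_000):
        gigabytes = kernel_matrix_bytes(n) / 1e9  # decimal gigabytes
        print(f"{n:>7,} objects: {gigabytes:,.0f} GB")
```

Under these assumptions, 500,000 objects need 500,000 x 500,000 x 4 = 10^12 bytes, which matches the 1 terabyte quoted above, while 50,000 objects need 10^10 bytes (10 gigabytes). With double-precision entries both figures would double.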
Speaker Bio: Tim Havens is an Assistant Professor in Electrical and Computer Engineering and Computer Science and the William and Gloria Jackson Assistant Professor at Michigan Technological University (MTU). He received a Ph.D. in electrical and computer engineering from the University of Missouri in 2010, and a B.S. and M.S. in electrical engineering from Michigan Tech in 1999 and 2000, respectively. Prior to joining MTU, he was an NSF/CRA Computing Innovation Fellow at Michigan State University, where he developed machine learning methods for heterogeneous and big data. From 2000 to 2006, Dr. Havens was a member of the associate technical staff at MIT Lincoln Laboratory, where he analyzed GPS and directed energy systems in support of the U.S. Air Force. He is a senior member of the IEEE and an associate editor of the IEEE Transactions on Fuzzy Systems. He received the best paper award at FUZZ-IEEE 2012, the IEEE Franklin V. Taylor Memorial Award for best paper at IEEE SMC 2011, and a best journal paper award from the Midwest Nursing Research Society in 2009, and he has published over 50 technical articles. He was on the program committees of IEEE CEC 2009 and IAPR CICB/PRIB 2013, has chaired several special sessions at IEEE WCCI and FUZZ-IEEE, and is an external reviewer for the Hong Kong Research Grants Council and several technical journals. Dr. Havens's research has been supported by the National Science Foundation, the Leonard Wood Institute, and the RAND/John A. Hartford Foundation.