EECS Colloquium — CodeNet: large-scale AI by David Kung

Online

About the event

CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks

by David Kung, IBM Distinguished Research Staff Member

Abstract

As software development becomes ubiquitous across all industries and the code infrastructure of enterprise legacy applications ages, it is more critical than ever to increase software development productivity and modernize legacy applications. Building on the phenomenal success of applying AI to natural language processing (NLP), researchers are keen to apply AI to code development as well. In this talk, I will present “CodeNet”, a first-of-its-kind, very large-scale, diverse, and high-quality dataset intended to accelerate algorithmic advances in AI for code. It consists of 14M code samples and about 500M lines of code in 55 different programming languages. I will discuss how CodeNet differs from other datasets and its potential use cases. A wide variety of code classification and code similarity experiments have been performed on CodeNet, using techniques ranging from bag of tokens to graph neural networks; these will be described in detail. I will conclude by offering some thoughts on how AI for code will evolve in the future.

Bio

David S. Kung received a BA (Physics) from U. C. Berkeley, an MA (Physics) from Harvard University and a PhD (Physics) from Stanford University. He is currently a Distinguished Research Staff Member and Manager of AI for Code in the Hybrid Cloud Platform department. Previously, he was the Senior Manager of Design Automation, responsible for the Research Design Automation strategy for IBM. He contributed to IBM’s Logic Synthesis System and led the development of IBM’s Physical Synthesis System, both of which have been used for all of IBM’s mainframe and Power server microprocessors over the past decades. He then led the effort to accelerate Deep Learning applications, especially through massively parallel distributed computing over GPUs. His current research activity is AI for Code.

David has received a Corporate Award and four Outstanding Technical Achievement Awards from IBM. He served as Chair of the Design Automation Technical Committee and on the executive committee of the International Conference on Computer-Aided Design. He has about 30 US patents and over 50 publications.
