Type: Research Highlight
Title: Declarative Knowledge Base Construction
Christopher De Sa, Alex Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, Ce Zhang
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources, including emails, webpages, and PDF reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are central tools for attacking classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
Christopher De Sa is a PhD candidate at Stanford University advised by Chris Ré and Kunle Olukotun. He primarily studies fast stochastic algorithms, such as Gibbs sampling and stochastic gradient descent. He takes particular interest in heuristics that improve performance on modern heterogeneous hardware and can be guaranteed to not affect statistical efficiency.
Alex Ratner is a PhD candidate advised by Chris Ré. His work focuses on using new algorithmic and systems paradigms to solve core challenges around the creation and maintenance of supervision data for machine learning systems. His research interests also include applying these methods to problems in bioinformatics.
Christopher (Chris) Ré is an assistant professor in the Department of Computer Science at Stanford University. The goal of his work is to enable users and developers to build applications that more deeply understand and exploit data. Chris received his PhD from the University of Washington in Seattle under the supervision of Dan Suciu. For his PhD work in probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. He then spent four wonderful years on the faculty of the University of Wisconsin, Madison, before moving to Stanford in 2013. He helped discover the first join algorithm with worst-case optimal running time, which won the best paper award at PODS 2012. He also helped develop a framework for feature engineering that won the best paper award at SIGMOD 2014. In addition, work from his group has been incorporated into scientific efforts including the IceCube neutrino detector and PaleoDeepDive, and into Cloudera’s Impala and products from Oracle, Pivotal, and Microsoft’s Adam. He received an NSF CAREER Award in 2011, an Alfred P. Sloan Fellowship in 2013, a Moore Data Driven Investigator Award in 2014, the VLDB Early Career Award in 2015, and the MacArthur Foundation Fellowship in 2015.
Jaeho Shin is a PhD candidate at Stanford University advised by Christopher Ré. His work focuses on solving data management challenges in machine learning systems and accelerating humans in the development loop of such systems through creation of languages, abstractions, and tools.
Feiran Wang is a PhD candidate in the Department of Electrical Engineering at Stanford University. His main research interests are methods for utilizing multimodal data in machine learning tasks, as well as their applications.
Sen Wu is a PhD candidate in the Department of Computer Science at Stanford University. His work focuses on building general frameworks for tera-scale knowledge base construction that can achieve high quality while requiring minimal human effort.
Ce Zhang is a postdoctoral researcher in Computer Science at Stanford University. He is working with Christopher Ré on data management and database systems. With the indispensable help of many collaborators, his PhD work produced the system DeepDive, a trained data system for automatic knowledge-base construction. As part of his PhD thesis, he led research efforts that won the 2014 SIGMOD Best Paper Award and were invited to the “Best of VLDB 2015” special issue. PaleoDeepDive, a machine-reading system for paleontologists, was featured in Nature magazine. He also led the Stanford team that produced the top-performing machine-reading system for the TAC-KBP 2014 slot-filling evaluations using DeepDive. Ce obtained his PhD from the University of Wisconsin-Madison, advised by Christopher Ré, and his Bachelor of Science degree from Peking University, advised by Bin Cui.