Quarry: "Assumption-Free" Data Management
Demonstration servers:
Medication Nomenclature (password required)
(817 Signatures, 744526 Resources, 10401670 Descriptors)
(currently 24 Signatures, 66404 Resources, 522280 Descriptors; in a previous experiment, 86 Signatures, ~1M Resources, 7.5M Descriptors)
(7 Signatures, 69 Resources, 875 Descriptors)
Through a collaboration with Portland State University researchers, we are addressing the problem of "bootstrapping" a data management application: How does one proceed from heterogeneous, unfamiliar data sources to useful knowledge in hours rather than weeks or months?
Conventional data management solutions (e.g., RDBMS) are characterized by rigid, top-down designs that require significant up-front investment: schema design, formal requirements gathering, feature triage. The concept of a dataspace encourages a different approach: begin by immediately providing simpler, baseline services over any and all data sources, rather than restricting the design to advanced services implemented over only convenient and familiar data sources.
Quarry performance over a competitive RDF management system
The Quarry system addresses the bootstrapping problem of dataspaces: given a set of data sources that one knows nothing about, what is the shortest path to gaining useful knowledge? Quarry helps manage those data over which very few assumptions hold: there is no schema available, the data need not be relational, there are no obvious constraints or patterns to exploit, and there may be millions of items with no obvious way to "start small."
To use Quarry, one first decomposes the data sources into an "assumption-free" data model -- a set of (resource, property, value) triples. These triples are then processed and indexed -- automatically -- to provide efficient query and browse services through a simple API. The Quarry platform also provides an interactive web application for profiling one's data -- testing assumptions, assessing quality and "cleanliness," and, more generally, improving one's understanding.
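As a rough illustration of the triple model, the sketch below flattens a few records into (resource, property, value) triples, builds two simple in-memory indexes, and groups resources by the set of properties they carry as a basic profiling step. This is not the Quarry API: the function names (`decompose`, `build_index`, `property_sets`), the sample records, and the choice of indexes are all hypothetical and serve only to make the idea concrete.

```python
from collections import defaultdict


def decompose(resource_id, record):
    """Flatten one record (a dict) into (resource, property, value) triples.

    Hypothetical helper: multi-valued properties yield one triple per value.
    """
    triples = []
    for prop, value in record.items():
        values = value if isinstance(value, (list, tuple, set)) else [value]
        for v in values:
            triples.append((resource_id, prop, v))
    return triples


def build_index(triples):
    """Index triples by resource and by property (an assumed, minimal layout)."""
    by_resource = defaultdict(list)   # resource -> [(property, value), ...]
    by_property = defaultdict(set)    # property -> {resource, ...}
    for resource, prop, value in triples:
        by_resource[resource].append((prop, value))
        by_property[prop].add(resource)
    return by_resource, by_property


def property_sets(by_resource):
    """Group resources by the exact set of properties they carry."""
    groups = defaultdict(list)
    for resource, pairs in by_resource.items():
        groups[frozenset(p for p, _ in pairs)].append(resource)
    return groups


# Made-up sample data from two unrelated "sources"; no schema is declared.
records = {
    "drug:1": {"name": "aspirin", "form": ["tablet", "capsule"]},
    "drug:2": {"name": "ibuprofen", "form": "tablet"},
    "obs:1": {"station": "A", "temp": 12.5},
}

triples = [t for rid, rec in records.items() for t in decompose(rid, rec)]
by_resource, by_property = build_index(triples)
groups = property_sets(by_resource)

for props, members in groups.items():
    print(sorted(props), "->", members)
```

Nothing in the sketch requires a schema up front: resources that happen to share the same properties fall into the same group, which gives a first view of the structure present in otherwise unfamiliar data.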
Publications
Quarrying Dataspaces: Schemaless Profiling of Unfamiliar Information Sources, Bill Howe, David Maier, Nicolas Rayner, James Rucker, Workshop on Information Integration Methods, Architectures, and Systems (IIMAS 2008)
Smoothing the ROI Curve for Scientific Data Management Applications, Bill Howe, David Maier, Laura Bright, Third Biennial Conference on Innovative Data Systems Research (CIDR 2007)
Emergent Semantics: Towards Self-Organizing Scientific Metadata, Bill Howe, Kuldeep Tanna, Paul Turner, David Maier, International Conference on Semantics for a Networked World (SFNW 2004), co-located with SIGMOD 2004
| Attachment | Size |
|---|---|
| howe_maier_rayner_rucker_quarry.pdf | 353.48 KB |

