Date: June 11 - 13, 2001 (three full days)
Venue: National Center for Ecological Analysis and Synthesis (NCEAS) in Santa Barbara, California, USA.
Attendants (all CODATA Working Group members): Walter Berendsohn, Lois Blaine, Kurt Bollacker, Stan Blum, Alex Chapman, Jim Croft, George Garrity, Anton Güntsch, Charles Hussey, Raúl Jimenez Rosenberg, Rudi May, Derek Munro, Satoru Miyazaki, Sabine Roscher, Paula Ross Huddleston, Hideaki Sugawara, Neil Thomson, Dave Vieglais, Greg Whitbread, John Wieczorek.
Meeting Report
The purpose of this meeting was two-fold: to draft a common specification for the data elements held in biological collection databases, and to design a software architecture through which those databases can be queried in a distributed fashion.
In keeping with current trends in Internet computing, the data specification will be expressed in XML (i.e., as either a Document Type Definition [DTD] or an XML Schema), and the software architecture will be based on SOAP (Simple Object Access Protocol).
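To make the SOAP choice concrete, the sketch below wraps a hypothetical collection-unit query in a SOAP 1.1 envelope using only the Python standard library. The element names UnitQuery and ScientificName are placeholders invented for this illustration; they do not anticipate the working group's actual element vocabulary.

```python
# Minimal sketch: a hypothetical collection-unit query wrapped in a SOAP 1.1
# envelope. UnitQuery and ScientificName are illustrative placeholders only.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
ET.register_namespace("soap", SOAP_NS)

def build_query_envelope(scientific_name):
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    query = ET.SubElement(body, "UnitQuery")               # placeholder element
    ET.SubElement(query, "ScientificName").text = scientific_name
    return ET.tostring(envelope, encoding="utf-8")

print(build_query_envelope("Puma concolor").decode("utf-8"))
```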
Group members include representatives from four existing distributed query systems.
Each of these systems provides a single portal to multiple collection data providers: a user formulates a query against a simple, generic concept of a collection unit; the query is broadcast to the data providers; and the results come back to the user as a single structured data set, which can be viewed online in a variety of ways or downloaded to the user's computer for subsequent processing.
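The following sketch shows the portal pattern just described: the same query document is broadcast to several providers and the responses are merged into one result set. The provider URLs and the assumption that each record comes back as a Unit element are inventions for illustration, not details of any of the four existing systems.

```python
# Sketch of the portal pattern: broadcast one query to several providers and
# merge their responses. URLs and the <Unit> record element are assumptions.
import urllib.request
import xml.etree.ElementTree as ET

PROVIDER_URLS = [
    "http://provider-a.example.org/query",   # hypothetical provider endpoints
    "http://provider-b.example.org/query",
]

def broadcast(query_xml):
    """Send the same XML query to every provider and merge the unit records."""
    merged = []
    for url in PROVIDER_URLS:
        request = urllib.request.Request(
            url, data=query_xml, headers={"Content-Type": "text/xml"})
        with urllib.request.urlopen(request, timeout=30) as response:
            document = ET.fromstring(response.read())
        merged.extend(document.findall(".//Unit"))   # assumed per-record element
    return merged
```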
At the meeting, participants separated into two break-out groups addressing the data specification and the software architecture.
Data Specification
All of the world's biological collections contain a number of data items, including specimen-specific elements (e.g., taxon, date, altitude, sex) and collection-specific elements (e.g., holding institution). The set of elements used varies from collection to collection, and there are no widely adopted standards for common sets of elements. The workshop aimed to create a reconciled set of element names for scientists and curators to use.
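As a rough illustration of the kind of record such a reconciled element set might describe, the snippet below builds a single collection-unit record combining specimen-specific and collection-specific elements. All element names and values are invented for this sketch.

```python
# Illustrative only: one collection-unit record pairing specimen-specific
# elements (taxon, date, altitude, sex) with a collection-specific element
# (holding institution). Element names are invented for this sketch.
import xml.etree.ElementTree as ET

unit = ET.Element("Unit")
ET.SubElement(unit, "ScientificName").text = "Puma concolor"
ET.SubElement(unit, "CollectionDate").text = "1999-07-14"
ET.SubElement(unit, "Altitude").text = "1250"        # metres above sea level
ET.SubElement(unit, "Sex").text = "female"
ET.SubElement(unit, "HoldingInstitution").text = "Example Natural History Museum"

print(ET.tostring(unit, encoding="unicode"))
```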
The data group made significant progress toward the specification using a combination of top-down conceptualisation (and organization) and bottom-up use of existing relevant specifications (e.g., the BioCISE information model and the TDWG-endorsed standard HISPID). While it was not expected (or even possible) for any collection to use more than a small fraction of the elements defined in the standard, the hope was that no elements beyond the standard would be necessary. A design goal of the data specification was therefore to be both comprehensive and general: to include a broad array of concepts that might be available in a collection database, but to mandate only the bare minimum of elements required to make the specification functional. This track ended with the creation of a rough, but mostly complete, hierarchical structure of data elements.
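One way to express "comprehensive but minimally mandatory" in an XML Schema is to mark most elements as optional (minOccurs="0") and require only a core identifier. The fragment below is a sketch of that idea, not a draft of the actual specification; it assumes the third-party xmlschema package for validation, and the element names are invented.

```python
# Sketch of "comprehensive but minimally mandatory": only UnitID is required,
# everything else is optional. Requires the third-party xmlschema package
# (pip install xmlschema); element names are illustrative only.
import xmlschema

UNIT_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Unit">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="UnitID" type="xs:string"/>
        <xs:element name="ScientificName" type="xs:string" minOccurs="0"/>
        <xs:element name="CollectionDate" type="xs:date" minOccurs="0"/>
        <xs:element name="HoldingInstitution" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

schema = xmlschema.XMLSchema(UNIT_XSD)
print(schema.is_valid("<Unit><UnitID>BGBM-0001</UnitID></Unit>"))   # True
```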
In a plenary session near the end of the meeting, the working group determined that the data specification should be cast as an XML Schema. It was estimated that several (3 to 8) months of design and testing and around $75K of skilled labor would be required to create a final, usable definition. The first publicly released draft will be available for review and comment before the annual TDWG meeting in November (Sydney, Australia).
Software Architecture
The world's collection databases represent a myriad of database and access technologies. Many of these are primitive, hard to use, platform-specific, and scale poorly. Furthermore, almost all of the existing systems are incompatible with each other. Rather than encourage database providers to take on the burden of redesigning or rebuilding existing databases, the software architecture track defined a system of "gateways" that wrap around existing databases and "portals" that access the gateways.
The software architecture group determined to cast its specification as a "search" protocol (see http://www.gils.net/search.html), which combines elements of SOAP, ANSI Z39.50, and UDDI. Portals will broadcast queries to providers as XML documents, in which the elements of the query have been "marked up" as XML elements. Providers will convert the query into SQL or another native query language, as required, and pass it to the local database. The provider will then take the result set returned from the database and return it to the portal as an XML document. The portal will then merge the documents returned from multiple providers into views or data sets of interest to the user.
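The provider (gateway) side of this flow is sketched below: an incoming XML query is translated to SQL, run against the local database (sqlite3 stands in for whatever native store a collection actually uses), and the result set is returned as an XML document. Table, column, and element names are assumptions made for this illustration.

```python
# Sketch of the provider/gateway role: XML query in, SQL against the local
# database, XML result set out. sqlite3 is a stand-in for any native store;
# table, column, and element names are assumptions for illustration.
import sqlite3
import xml.etree.ElementTree as ET

def handle_query(query_xml, connection):
    query = ET.fromstring(query_xml)
    name = query.findtext(".//ScientificName")         # marked-up search term
    rows = connection.execute(
        "SELECT unit_id, scientific_name, institution "
        "FROM units WHERE scientific_name = ?", (name,)
    ).fetchall()

    result = ET.Element("ResultSet")
    for unit_id, scientific_name, institution in rows:
        unit = ET.SubElement(result, "Unit")
        ET.SubElement(unit, "UnitID").text = unit_id
        ET.SubElement(unit, "ScientificName").text = scientific_name
        ET.SubElement(unit, "HoldingInstitution").text = institution
    return ET.tostring(result, encoding="utf-8")
```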
Two important results might now emerge from adoption of this standard. The first is the decoupling of portals and providers. With standard data provider software and capabilities, any organization with special skills in data integration (perhaps beyond biological collection data, e.g., GIS data) and in designing easy-to-use interfaces should be capable of establishing a portal to collection data. The second result is that both portal and provider software can be built in modules, using an open-source model. This should enable programmers now working on isolated (stove-piped) projects to share code with each other.
Individual networks may decide to make use of different software architectures and to use query protocols considered more appropriate for their specific tasks. However, compliance with the data-related part of the schema will ensure compatibility, and gateways to the common protocol can be installed for nodes and/or portals in the respective network.
Acknowledgements
The organizers gratefully acknowledge support from the organizations without which this work would not have been possible. The meeting was graciously hosted by the National Center for Ecological Analysis and Synthesis (NCEAS) - http://www.nceas.ucsb.edu/ - and jointly organized by Stan Blum, Walter Berendsohn, and Lois Blaine.
Next meeting: Sydney, Nov. 2001.