CODATA Working Group on Biological Collection Databases

ABCD Schema - Task Group on
Access to Biological Collection Data

A joint CODATA and TDWG initiative supported by GBIF

Informal meeting at the BioForum II workshop, Sydney, March 16, 2002

Venue: Royal Botanic Garden, Sydney

Participants: Group members: S. Blum, D. Vieglais, J. Croft, G. Whitbread, W. Berendsohn;
Guests: Anna Weitzman, Elizabeth Kolster, Jerry Cooper

1. CODATA application

Berendsohn reported on the application for an extension of the CODATA mandate and funding for the working group (the text has been circulated to group members). Given that the group has been very active, there may be a chance of attaining Task Group status.

2. Communications

The group wants the process to be as transparent as possible. The website holding documentation of the DiGIR project (protocol subgroup) will be thoroughly updated to (i) invite developers to participate directly in the process, and (ii) inform a broader community about the ideas the protocol subgroup and DiGIR have developed (Stan Blum responsible for website). DiGIR uses the sourceforge.net website; interested developers should get a login (contact any of the listed editors at digir.sourceforge.net). The website of the content definition subgroup (www.bgbm.org/tdwg/codata/) should be more broadly advertised. This will be done once the public request for comment on the schema is broadcast (see below; BioCASE team currently responsible for website, in close collaboration with BioCASE at NHM). The group would like closer collaboration with the HISCOM committee’s Australian Virtual Herbarium (AVH) project; the aims laid out at the website big.netforge.net are very similar to ours. AVH now publishes its code, which is not state of the art but works. In general, sites like www.sourceforge.net can be used to ensure the open-source character of code produced for or in projects.

3. Protocol subgroup (Blum & Vieglais)

Since November, 2001, Dave Vieglais (KUNHM) has been working on a PHP implementation of provider software, and P.J. Schwartz (CAS) continued working on a Java implementation of a portal.

P.J.'s work on DiGIR was financed with funds from a related NSF project, but this funding has come to an end. We expect John Wieczorek (Museum of Vertebrate Zoology, UC Berkeley) to continue working on the portal software under the MANIS project, which has 2 years to go. P.J. posted her Java code, Java Server Pages and XML schemas to a CVS repository on the DiGIR site (http://sourceforge.net/projects/digir) and worked with John to transfer the code to his "care".

Several aspects of the protocol still need to be worked out, including:

Query Schema vs. Federation Schema
One of our initial design goals was to make the protocol and software that implements it as generic as possible, and independent of the particular federation schema used by a community of providers. This has proven difficult because the validity and content of a query message (an XML instance) are determined in part by the federation schema. The query schema needs to be derived from a sort of merging of the federation schema and the generic query syntax. In XML this is a challenge and we would like to consult with XML experts to ensure our current approach is appropriate. (Stan and Dave to contact XML experts named).
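The dependency described above can be illustrated with a minimal Python sketch. It is not DiGIR code: the operator names, the concept names and the functions `derive_query_schema` and `is_valid_query` are all hypothetical, chosen only to show how a generic query syntax and a federation schema might be merged into the set of conditions that are valid for one federation.

```python
# Hypothetical sketch: merging a generic query syntax with a federation schema.
# None of these names come from DiGIR; they are illustrative assumptions.

# Generic query syntax: comparison operators and the data types each accepts.
GENERIC_OPERATORS = {
    "equals": {"string", "integer"},
    "like": {"string"},
    "greaterThan": {"integer"},
}

# Federation schema: concepts agreed on by one community of providers.
FEDERATION_SCHEMA = {
    "ScientificName": "string",
    "YearCollected": "integer",
}


def derive_query_schema(operators, federation_schema):
    """Return every (concept, operator) pair that is valid for this federation."""
    return {
        (concept, op)
        for concept, concept_type in federation_schema.items()
        for op, accepted_types in operators.items()
        if concept_type in accepted_types
    }


def is_valid_query(query, query_schema):
    """A query is a list of (concept, operator, value) conditions."""
    return all((concept, op) in query_schema for concept, op, _value in query)


QUERY_SCHEMA = derive_query_schema(GENERIC_OPERATORS, FEDERATION_SCHEMA)
```

The generic part stays independent of any community, but whether a given query is valid can only be decided once a federation schema is supplied, which is the difficulty noted above.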

Metadata schema for provider databases
The metadata schema is still largely unspecified. In addition to basic information about contacts, etc., the metadata should indicate:

the taxa covered in the collection (resource),

what kind of resource it is (e.g., preserved collection, living collection, observation database),

the types of queries a given provider supports for a given concept, and

the limits the provider places on query execution time, the number of records returned, etc.

If this information can be cached at portals, users can get feedback about their queries before they are sent, such as "providers A, B, and C do not support substring queries on field X".
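As a rough illustration of such feedback, the following Python sketch checks cached provider metadata before a query is sent. The metadata schema is still unspecified, so the `ProviderMetadata` fields, the function names and the sample providers are assumptions made for this example only.

```python
# Hypothetical sketch of portal-side feedback from cached provider metadata.
# Field names and sample data are assumptions, not part of any agreed schema.
from dataclasses import dataclass, field


@dataclass
class ProviderMetadata:
    name: str
    resource_kind: str  # e.g. "preserved collection"
    # Maps a concept to the set of query types the provider supports for it.
    supported_queries: dict = field(default_factory=dict)


def unsupported_providers(cache, concept, query_type):
    """Names of cached providers that cannot run query_type on concept."""
    return sorted(
        meta.name
        for meta in cache
        if query_type not in meta.supported_queries.get(concept, set())
    )


def feedback(cache, concept, query_type):
    """Build a message for the user before the query is sent, or None."""
    names = unsupported_providers(cache, concept, query_type)
    if not names:
        return None
    return (
        f"{query_type} queries on field {concept} "
        f"are not supported by: {', '.join(names)}"
    )


CACHE = [
    ProviderMetadata("A", "preserved collection", {"Locality": {"equals"}}),
    ProviderMetadata("B", "living collection", {"Locality": {"equals", "substring"}}),
    ProviderMetadata("C", "observation database", {}),
]
```

Calling `feedback(CACHE, "Locality", "substring")` would report providers A and C without contacting any of them.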

It is also probable that both the portal and the end-user will want to impose limits, such as "drop connections that have not responded in 5 minutes", or "limit responses to the first 200 records". In other cases, the user or portal may want to inform providers that they are particularly tolerant (e.g., willing to wait 48 hours to get all the data).
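One possible way to reconcile such limits is sketched below in Python. The rule used here (the strictest stated limit wins, so a tolerant user is still bound by a provider's cap) is an assumption for illustration, not a decision of the group; the function names are hypothetical.

```python
# Hypothetical sketch: reconciling limits set by the user, the portal and a provider.
# "Strictest limit wins" is an assumed policy, not an agreed protocol rule.
from typing import Optional


def effective_limit(*limits: Optional[int]) -> Optional[int]:
    """Strictest of several limits; None means 'no limit from this party'."""
    stated = [limit for limit in limits if limit is not None]
    return min(stated) if stated else None


def truncate(records: list, max_records: Optional[int]) -> list:
    """Keep only the first max_records records; None keeps everything."""
    return records if max_records is None else records[:max_records]


FIVE_MINUTES = 5 * 60          # seconds
FORTY_EIGHT_HOURS = 48 * 3600  # seconds
```

For example, `effective_limit(FORTY_EIGHT_HOURS, FIVE_MINUTES)` yields 300 seconds: the user's tolerance only takes effect where no other party states a stricter limit.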

Some of the problems are a consequence of the comprehensive approach taken by DiGIR, as compared to existing solutions such as AVH and ENHSIN.

There was considerable discussion of the potential scale of the enterprise. The system should scale up to be able to harvest information from 5,000 or more collections simultaneously, roughly two orders of magnitude more than current designs envision. In some instances query times of several days might be acceptable, and in others more reasonable timeouts will have to be imposed.

4. Content definition subgroup

The BioCASE schema definition group (NHM and BGBM) is to provide a collection-level schema (currently BioCASE only) and a Unit-level schema (CODATA/TDWG and BioCASE), so this group is currently able to dedicate personnel resources to the schema definition process. The scope of the schema was discussed. Currently, its structure is focused on biological natural history collections, but the data element definitions and types apply to biological collections in general (i.e. including living collections and observation records). The priority at present is to develop a consensus about element and type descriptions. The annotation tag has been structured to hold metadata on the element. A schema viewer was developed in Berlin to allow XML non-specialists to browse the schema and view the annotations in a structured way (see http://www.bgbm.org/scripts/ASP/TDWG/Frame.asp). General principles governing the structure of the schema were discussed: a new hierarchical level is only needed

where repeated elements occur,
for subtypes (excludable parts / domain specific), and
for complex types.

Apart from this, the AltText simple type is defined separately from normal text because it implicitly represents a concatenation of the atomized elements it is grouped with (however, this cannot be enforced).

A next step will be to single out priority elements, especially those used in existing access systems like Species Analyst, ENHSIN, and REMIB, as well as those used in existing TDWG standards (Botanical Names, HISPID; Jim Croft will undertake to copy in semantics if provided with the HISPID subset of the schema).

Agreement was reached that the controversy “broad vs. minimal schema” is a non-issue. The broad general schema developed by the content definition subgroup represents a semantic and formal definition of elements in datasets returned by data providers. The minimal schema pursued by the protocol subgroup is a subset of the general schema; it includes those elements that are to form access points for queries and thus conform to a defined structure. This query content schema may vary according to the special interests of the portal in question. Recommended access points should therefore

represent common denominators,
be queryable, and
contain a concept represented in a structured way in the result set.
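The subset relationship between the two schemas can be checked mechanically. The Python sketch below is illustrative only: the element names, the structured/unstructured flags and both function names are hypothetical and are not taken from the actual schema.

```python
# Hypothetical sketch: the minimal (query) schema as a subset of the broad schema.
# Element names and flags are invented for illustration.

# Broad schema: element name -> whether results carry it in structured form.
BROAD_SCHEMA = {
    "ScientificName": True,
    "CollectorName": True,
    "CollectingDate": True,
    "LabelText": False,  # free text, so a poor candidate for an access point
}

# Minimal schema: the access points one portal chooses to query on.
QUERY_SCHEMA = {"ScientificName", "CollectingDate"}


def undefined_access_points(query_schema, broad_schema):
    """Access points missing from the broad schema (should be empty)."""
    return sorted(set(query_schema) - set(broad_schema))


def unstructured_access_points(query_schema, broad_schema):
    """Access points whose concept is not structured in the result set."""
    return sorted(e for e in query_schema if not broad_schema.get(e, False))
```

A portal with special interests could swap in a different `QUERY_SCHEMA`; as long as both checks return empty lists, it remains a valid subset of the general schema.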

5. RfC process

In general, the process should be open and visible. An archived, uncensored bulletin board should be established for that purpose. A voting procedure (yes or no), in addition to allowing comments, may help to elicit more responses.

The entire process could probably run in “Simplify” or an open-source package (ask Bob Morris) that provides a simple, uncluttered interface.

A 30-day waiting period after broadcasting the RfC was agreed in Sydney. The results will be published in an edited version, and at least one more RfC will follow.

The content definition subgroup wants the review and versioning process to apply directly to individual elements (or complex types).

For the protocol subgroup, the first step is better documentation on the Web, followed by an open mailing list or bulletin board on SourceForge. Comments will be requested from XML experts and software developers to solve the problems indicated above.

6. Role of GBIF

TDWG has become an associate member of GBIF. The role of GBIF for the group is seen as providing buy-in across the continents and perhaps obtaining funding to get people together. The role of TDWG within GBIF will be to stress the importance of standards as the foundation for the structure, content, communication and interoperability of biodiversity databases, and to provide a link with the international information standards community.


Page hosted by the Department of Biodiversity Informatics and Laboratories of the Botanic Garden and Botanical Museum Berlin-Dahlem.
Page editor: Walter Berendsohn (w.berendsohn [at] bgbm.org).

This page last updated: 06.03.2005