Participants:
Group members: S. Blum, D. Vieglais, J. Croft, G. Whitbread, W. Berendsohn
Guests: Anna Weitzman, Elizabeth Kolster, Jerry Cooper
Since November 2001, Dave Vieglais (KUNHM) has been working on a PHP implementation of the provider software, and P.J. Schwartz (CAS) has continued working on a Java implementation of a portal.
P.J.'s work on DiGIR was financed by a related NSF project, but that funding has come to an end. We expect John Wieczorek (Museum of Vertebrate Zoology, UC Berkeley) to continue working on the portal software under the MANIS project, which has two years to run. P.J. posted her Java code, JavaServer Pages, and XML schemas to a CVS repository on the DiGIR site (http://sourceforge.net/projects/digir) and worked with John to transfer the code to his care.
Several aspects of the protocol still need to be worked out, including:
Query Schema vs. Federation Schema
One of our initial design goals was to make the protocol and software that implements it as generic as possible, and independent of the particular federation schema used by a community of providers. This has proven difficult because the validity and content of a query message (an XML instance) are determined in part by the federation schema. The query schema needs to be derived from a sort of merging of the federation schema and the generic query syntax. In XML this is a challenge and we would like to consult with XML experts to ensure our current approach is appropriate. (Stan and Dave to contact XML experts named).
Metadata schema for provider databases
The metadata schema is still largely unspecified. In addition to basic information about contacts, etc., the metadata should indicate:
- the taxa covered in the collection (resource)
- what kind of resource it is (e.g., preserved collection, living collection, observation database, etc.),
- the types of queries a given provider supports for a given concept,
- limits the provider places on query execution time, the number of records returned, etc.
If this information can be cached at portals, users can get feedback about their queries before they are sent, such as "providers A, B, and C do not support substring queries on field X".
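The pre-flight feedback described above can be sketched as follows. This is only an illustration in Python, not part of DiGIR: since the metadata schema is still unspecified, the `ProviderMetadata` class, its field names, and the operator names (`equals`, `substring`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ProviderMetadata:
    """Hypothetical cached metadata record for one provider (not a DiGIR schema)."""
    name: str
    resource_kind: str                       # e.g. "preserved collection"
    taxa: List[str] = field(default_factory=list)
    # concept name -> query operators the provider supports for that concept
    supported_ops: Dict[str, Set[str]] = field(default_factory=dict)
    max_records: Optional[int] = None        # None means no stated limit
    max_seconds: Optional[int] = None        # None means no stated limit


def unsupported_providers(providers, concept, op):
    """Names of providers whose cached metadata does not list `op` for `concept`."""
    return [p.name for p in providers
            if op not in p.supported_ops.get(concept, set())]


def feedback(providers, concept, op):
    """Message to show the user before the query is sent, or None if all providers qualify."""
    names = unsupported_providers(providers, concept, op)
    if not names:
        return None
    return f"providers {', '.join(names)} do not support {op} queries on field {concept}"
```

A portal holding such records could call `feedback(cached_providers, "X", "substring")` before dispatching a query and show the resulting message to the user.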
It is also probable that both the portal and the end-user will want to impose limits, such as "drop connections that have not responded in 5 minutes", or "limit responses to the first 200 records". In other cases, the user or portal may want to inform providers that they are particularly tolerant (e.g., willing to wait 48 hours to get all the data).
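One way to combine limits stated by the different parties is sketched below. The discussion did not settle how conflicting limits should be resolved; the rule used here (the strictest stated limit wins, and an unstated limit is ignored) is an assumption, as are the function and variable names.

```python
from typing import Optional


def effective_limit(*limits: Optional[int]) -> Optional[int]:
    """Return the smallest stated limit; None means no party stated one.

    Assumption: the strictest limit wins. The protocol does not define this.
    """
    stated = [x for x in limits if x is not None]
    return min(stated) if stated else None


# Hypothetical values based on the examples in the text.
portal_timeout_s = 5 * 60        # drop connections silent for 5 minutes
user_timeout_s = 48 * 60 * 60    # a tolerant user willing to wait 48 hours
provider_timeout_s = None        # provider states no execution-time limit

timeout_s = effective_limit(portal_timeout_s, user_timeout_s, provider_timeout_s)

# Portal caps responses at 200 records, user states nothing, provider allows 1000.
max_records = effective_limit(200, None, 1000)
```

Under this rule a tolerant user cannot override a stricter portal limit. A design in which tolerance is honoured would need an explicit negotiation step instead.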
Some of the problems are a consequence of the comprehensive approach taken by DiGIR, as compared to existing solutions such as AVH and ENHSIN.
There was considerable discussion of the potential scale of the enterprise. The system should scale up to be able to harvest information from 5,000 or more collections simultaneously, roughly two orders of magnitude more than current designs envision. In some instances query times of several days might be acceptable; in others, more reasonable timeouts will have to be imposed.
A next step will be to single out priority elements, especially those used in existing access systems like Species Analyst, ENHSIN, and REMIB, as well as those used in existing TDWG standards (Botanical Names, HISPID; Jim Croft will undertake to copy in the semantics if provided with the HISPID subset of the schema).
Agreement was reached that the controversy “broad vs. minimal schema” is a non-issue. The broad general schema developed by the contents definition subgroup represents a semantic and formal definition of elements in datasets returned by data providers. The minimal schema pursued by the protocol subgroup is a subset of the general schema, which includes those elements which are to form access points for queries and thus conform to a defined structure. This query contents schema may vary according to special interests of the portal in question. So recommended access points should
In general, the process should be open and visible. An archived, uncensored bulletin board should be established for that purpose. A yes/no voting procedure, in addition to allowing comments, may help to elicit more responses.
The entire process could probably run in “Simplify” or in open-source software (ask Bob Morris) that provides a simple, uncluttered interface.
As agreed in Sydney, a 30-day waiting period follows the broadcast of the RfC. The results are then published in an edited version, and at least one more RfC follows.
The content definition subgroup wants the review and versioning process to apply directly to individual elements (or complex types).
For the Protocol subgroup, the first step is better documentation on the Web, followed by an open mailing list or bulletin board on SourceForge. Comments will be requested from XML experts and software developers to solve the problems indicated above.
Page hosted by the Department of Biodiversity Informatics and Laboratories of the Botanic Garden and Botanical Museum Berlin-Dahlem.
Page editor: Walter Berendsohn (w.berendsohn [at] bgbm.org).
This page last updated: 06.03.2005