TDWG

Subgroup on Accession Data: Preparing for a Collection Data Standard


Convener's Report 2


Contents

Introduction

1. Objectives of the subgroup

2. Form of standard

3. Terminology

4. Scope of "Accession Data"

5. Data areas

6. Selected comments received since July 1996


Introduction

This report summarizes the discussion that took place over the past three months. Headings 1 to 5 form the proposed agenda for the Accession Subgroup session during the forthcoming TDWG meeting in Toronto, where the proposed actions will be discussed and decided.

Information on liaison with groups which are or have been working on projects involving collection data has been moved to a separate file (http://www.bgbm.org/TDWG/acc/liaison.htm).

1. Objectives of the TDWG Subgroup on Accession Data

The following is a draft which is to be discussed in Toronto. It incorporates several comments and questions posed by subgroup members regarding the aims and the circumscription of the subgroup's activities.

General Objective

Specific Objectives

  1. Disseminate information about standards, information models, and projected or existing systems
  2. Maintain compatibility with existing TDWG standards and point out areas in need of revision or extension (e.g. to cover biological as opposed to botanical data)
  3. Develop a glossary of standard terms
  4. Provide or endorse standard field definitions (elements and compound elements) for biological collection databases which can be used in data exchange and database design
  5. Provide or endorse standards for the development of common World Wide Web interfaces for "federated" collection databases, such as field definitions (usually compound elements) and standards for searching and reporting
  6. Establish or endorse data quality standards for collection databases
  7. Contribute to the development of a comprehensive content standard for biological collection metadata
  8. Provide or endorse data models for defined data areas within the scope of collection information

Proposed action: Acceptance of Objectives after discussion


2. Form of standard

2.1. Information model.

No controversy was raised as to the view that data modeling is indispensable for the analysis of the data we deal with. However, because of the complexity of the task (see 6.2 below) and the possible conflicts with regard to modeling techniques and project independence, this subgroup should not try to develop or endorse a detailed information model to be accepted as a TDWG standard. Instead, "we should carefully identify the most significant and constant part of the model (core model)" [Savov, July 29 1996], because this is decisive especially if we envision standards which are compatible over a wide range of biological collections (including microbial cultures, zoological and paleontological collections, gene banks, etc.).

The NSF project "An Interdisciplinary Information Model for Biological Collections", headed by A. Allison & S. Blum and sponsored/sanctioned by ASC, has now been approved, and the first workshop will take place at the time of the TDWG meetings. For project details please refer to http://www.bishop.hawaii.org/asc-cnc/asc-prop.htm and see 6.1 below. Like the first ASC model and the CDEFD model, the project stands out because it is not directly application-driven but a research project.

Proposed action: Blum and Berendsohn to maintain close liaison with the ASC project and inform group members about progress made.

2.2. Content metadata standard

The development of a detailed content standard for collection metadata following the examples set by the FGDC would require a funded project. However, building on the results of past and present modeling efforts, the results of projects undertaken by the libraries and general museums community, and existing metastandards, it is not a task completely out of reach of the biological community. For the time being, discussion should center on the possibilities for acceptance of parts of existing metadata standards.

Proposed action: Call for the identification of parts of defined metadata standards suitable for acceptance by the accession subgroup and TDWG.

2.3. Data dictionary, exchange standards and HISPID 2

Some members of the group strongly criticized my decision to make acceptance of HISPID an option. Although I agree in principle with much of what has been said about flat-file data dictionaries (see convener's report 1), the following should be considered:

Barry Conn, editor of the HISPID 3 document, pointed out that "Although HISPID has tried to maintain consistency, its real success has come from the ready acceptance, by those who use it, for the standard to change". HISPID is fully compatible with accepted TDWG standards, as well as with the forthcoming update of the ITF standard. Barry Conn also posited that we could use the published HISPID 3 as a discussion document and, for the reasons stated, I agree with that point of view.

HISPID was also criticized for "trying to do semantics along with exchange protocol syntax". I agree that these two things should be kept separate in a TDWG standard, which should establish a "pure" data definition catalogue, e.g. following the format of ITF 1 (or ITF-2 in the August 31, 1995 version).

Several members of the group have agreed to revise and circulate HISPID for comment. The time available for discussion has apparently been too short for detailed consideration. I think an achievable short-term goal would be to obtain a consensus on the name, semantics, domain (range/values), and description of a number of fields. I do not think that in general "they are valuable in a context of data structure only" (K. Savov, see also the comments from S. Blum). Starting with the definitions in HISPID (excluding the taxon name fields, see 4 below), I estimate that nearly half of the fields can be directly used, either because they are not controversial or because they are part of already accepted standards. Several more may be acceptable after rewording the definitions to fit general collections. The other fields may prove to be more controversial, requiring a modeling or at least a hierarchical approach to represent different levels of recommended decomposition. Additional fields must be included, for example for

To that end, the attribute lists of past and present data modeling projects should be evaluated.

It is conspicuous, however, that the herbarium exchange standard includes no fields which are specific to herbarium collections, except perhaps some descriptive fields specific to botany included in the "additional data" group. This confirms the point of view that definition of attributes should start with the common data elements, rather than with the distinguishing features of collections.

Proposed action: Over the next 6 months, the subgroup evaluates the individual fields (excluding the taxon name fields) in HISPID one-by-one to answer the following questions:

A list of fields is maintained on the WWW. Fields for which all answers are positive are to be recommended for acceptance as part of the Accessions Standard proposal.
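As an illustration of the review procedure, a field entry could be recorded along the following lines. This is only a minimal sketch: the class name, the attribute names, the example field and the question labels are all hypothetical placeholders, since neither the list of fields nor the wording of the questions is given here. The only rule taken from the text is that a field is recommended when every answer is positive.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FieldDefinition:
    """One entry in the field list (all attribute names are assumptions)."""
    name: str
    semantics: str
    domain: str
    description: str
    # Answers to the review questions, keyed by a placeholder question label.
    answers: Dict[str, bool] = field(default_factory=dict)

    def recommended(self) -> bool:
        # Recommended only if the field has been reviewed at all
        # and every recorded answer is positive.
        return bool(self.answers) and all(self.answers.values())


# Hypothetical example entry; not an actual HISPID field definition.
example = FieldDefinition(
    name="ExampleField",
    semantics="placeholder meaning",
    domain="free text",
    description="placeholder description",
    answers={"question_1": True, "question_2": True},
)
print(example.recommended())
```

A field with no recorded answers is deliberately treated as not yet recommended, so that unreviewed fields do not slip into the proposal.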


3. Terminology

Up to now, no comments were received on the definition of the following terms:

Biological object: Anything which could be the object of a biological observation or study; i.e. a unit, a site, or a concept like a taxon or syntaxon.

Biological collection data: All information related to units which have been incorporated into a collection or a system of observations such as those used in floristic mapping projects or birdwatching records.

Collection: An artificial assemblage of units.

Gathering: The act of collecting physical objects and/or information.

Site: A defined point or area in the biosphere.

Unit: A physical object which contains organisms, represents an organism, or is/was part of an organism.
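To make the relationships between some of these terms easier to discuss, they can be written down as simple types. The following is only a sketch under stated assumptions: the attribute names (description, identifier, name) and the example values are hypothetical and are not part of the definitions above. Only the relationships are taken from the text: a biological object is a unit, a site, or a concept, and a collection is an assemblage of units.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Site:
    """A defined point or area in the biosphere."""
    description: str  # hypothetical attribute


@dataclass
class Unit:
    """A physical object which contains, represents, or is/was part of an organism."""
    identifier: str  # hypothetical attribute


@dataclass
class Concept:
    """A concept such as a taxon or syntaxon."""
    name: str  # hypothetical attribute


# A biological object is a unit, a site, or a concept.
BiologicalObject = Union[Unit, Site, Concept]


@dataclass
class Collection:
    """An artificial assemblage of units."""
    name: str  # hypothetical attribute
    units: List[Unit] = field(default_factory=list)


# Hypothetical example values, for illustration only.
herbarium = Collection(name="Example herbarium")
herbarium.units.append(Unit(identifier="sheet-0001"))
```

Whether "Unit" in this sketch should be read as a specimen, a sample, or a sheet is exactly the question left open for discussion.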

Proposed action: Discussion of terms, especially of term "Unit" vs. "Specimen" vs. "Sample" vs. "Sheet"


4. Scope of "Accession Data"

4.1. Data areas to be excluded

Nobody objected to excluding data areas which have been covered by existing TDWG standards and/or subgroups. However, the need for a comprehensive name standard, which also covers groups other than extant plants, was noted. A subgroup treating that subject will be proposed in Toronto.

Proposed action: Acceptance of data areas to be excluded. Proposal for a new or renewed Names subgroup.

4.2. Data areas to be covered completely

See convener's report 1, 4.2.

Proposed action: Acceptance of data areas to be covered.


5. Data areas

David Lazarus, June 25 1996: I very much support the suggestion of Charles Hussey for doing this modularly. One immediate extension to his suggestion - does Paleo qualify as a module of its own, or do we have modules only for (for example) Zoology and Botany, each with some sort of Paleo definition? Most of what makes paleo Paleo is found in extensions (and deletions) of data types attached to the Collecting Site/Gathering.

Walter Berendsohn, July 1 1996: I would prefer to start with the core and work outwards - in CDEFD we found that most seemingly special attributes sooner or later fall within a general category (naming is difficult, however).

Proposed action: Discussion of "modular approach".


6. Selected comments received since July 1996

(excerpts, please protest if I left out something essential!)

6.1 On the scope of the new ASC model project:

Stan Blum, June 24 1996: I agree, generally, that duplication should be avoided. But in efforts like these, the process can be as important as the result. The fact that different sets of people will be participating in the different efforts almost guarantees, by definition, that the efforts won't be duplicative. The thing I fear more is that the different efforts result in very different looking recommendations or "standards". Our challenge is to make the results as comparable as possible, and/or to explain the differences rationally. ..... The ASC effort is definitely focussing on collections databases.

Stan Blum, October 4 1996: I had not intended to make living collections a high priority in this iteration, simply because we have a lot on our plate already. When I get everyone together at the November workshop, I could easily be "over-ruled" on this decision. It will depend on the participants. If we do try to cover the intricacies of living collections, I think we'll probably borrow heavily from existing treatments (with acknowledgement). Perhaps the more challenging related issue for us will be dealing with observations (information with time, place, etc. but without a specimen). I haven't yet given the "scope issue" a lot of thought. My intention will be to keep dead collections as the primary focus, and to expand only as we have time -- and I don't think there will be enough time, really.

6.2 On models and modeling:

David Lazarus, June 25 1996: And a question for those who know to answer: are our data models really independent of technology to the extent that, if the technology standard changes from the current dominance of relational systems to (who knows? - highly distributed WWW databases+search agents, like the EMBO system?) we would still be able to migrate our data fairly painlessly? Maybe this is something data modeling books cover - forgive my ignorance in this case.

Walter Berendsohn, July 1 1996: That is where the Meta-Standard question comes in. As long as our "model" or "standard" does not imply a system, but just defines the data, we are fine. However, if you keep things too abstract, you end up with something people do not use (nor understand). As said in the introduction to the CDEFD model, some kind of compromise has to be made. In any case, I am sure that a logical ER model (i.e. one which does not refer to a specific database system or implementation) is useful for any system.

Konstantin Savov, July 29 1996: While choosing methodology and techniques for the Standard, we should avoid any assumptions on implementation tools and techniques, because of variability of DBMSs and software development tools. We should also base the Standard on information structure, which supposes a wide range of patterns to be identified. Both hierarchical and relational models apply significant restrictions (corresponding to implementation techniques and tools) to real data model. We should also deal with data structures, which don't depend on functional model of a particular application. At the same time, behavior of collection objects is often very important (e.g., while discussing transactions of collection units).
From this point of view, object-oriented modelling has many important advantages, while comparing with structured analysis and design methods. In general, it corresponds with requirements mentioned above. I skip here a detailed comparison of object-oriented and structured techniques. I'd write more on this subject, if needed. Object-oriented methods usually require more skill from analyst or designer, but good logical model described in terms of object-oriented model isn't too difficult to understand for inexperienced readers. Object design usually produces much more complicated models, but system and object design are out of the Standard scope.
I'd propose Rumbaugh's Object Modelling Technique (OMT) as a basic method. A perfect book on OMT is available (James Rumbaugh et al. 1991. Object-oriented Modelling and Design. Prentice-Hall Inc. ISBN 0-13-629841-9). It is one of the best books on analysis and design (both structured or object-oriented) I've read. OMT is the most widespread object-oriented analysis and design method. After uniting OMT with Booch's method (draft on Unified Method is already available) it will become a standard in OO development, by the matter of facts. OMT proposes good set of rules and activities for all stages from analysis to implementation. It is flexible enough, at the same time. OMT also has strong support by CASE tools.
[This is cited from an extensive comment upon the aims, strategy and methods of a comprehensive approach to standardizing accession data. Although I agree with many of the points made there, I fear it far exceeds the scope of a group like TDWG. See http://www.bgbm.org/tdwg/acc/savov.htm for the complete text.]

6.3 Suggestions for data elements

Doug Yanega in reply to Richard Pankhurst's TDWG Descriptors Subgroup, August 31, 1996: I don't really have the time or inclination to get deeply involved with this, but I do have one sincere suggestion, after having worked for YEARS now both designing and implementing biological databases. I find it indispensable to build into ANY such database a quasi-redundancy of fields to store *original data* associated with a specimen as opposed to *actual data*. What I mean by this is that very often the original date/locality information may be abbreviated, illegible, or genuinely incorrect. I have seen many databases designed where there is no option to distinguish between original data and that entered by someone refining or correcting this original data. It is my opinion that ANY general recommendation for database design *must* be structured so as to permit one to make this distinction. My preferred design has one large field for "label data", and then separate fields to store each of the parameters independently; if the label is perfect and thorough, then it will all be redundant, but if there are any faults or omissions, they can be corrected without having to lose the original information (e.g., a label reading only "Hmps., Brounton, 4-96 Miller" would be expanded upon to "England / Hampshire / Braunton / April / 1896 / Joseph P. Miller / xlatN-ylongW / 350 m alt.", but the original label data is kept intact in the database). Good luck coming up with a uniform policy, and I look forward to the day when people *will* adhere to rigorous standards.
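The design described in this comment (one field for the verbatim label, plus separate fields for each interpreted parameter) might look roughly as follows. This is a minimal sketch, not a proposed standard: the class and field names are hypothetical, and the values are taken from the label example in the comment. The coordinates are left unset because the comment gives only a placeholder for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecimenRecord:
    # Verbatim transcription of the label; kept intact and never corrected.
    label_data: str
    # Interpreted values, entered by whoever refines or corrects the record.
    # All are optional, because a label may omit any of them.
    country: Optional[str] = None
    region: Optional[str] = None
    locality: Optional[str] = None
    month: Optional[int] = None
    year: Optional[int] = None
    collector: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    altitude_m: Optional[int] = None


# The abbreviated, partly misspelled label is stored exactly as written ...
record = SpecimenRecord(label_data="Hmps., Brounton, 4-96 Miller")

# ... and the interpretation is added alongside it, not in place of it.
record.country = "England"
record.region = "Hampshire"
record.locality = "Braunton"
record.month = 4
record.year = 1896
record.collector = "Joseph P. Miller"
record.altitude_m = 350

print(record.label_data)  # the original wording is still available
```

If the label were complete and correct, the interpreted fields would simply repeat it; the redundancy is the price of never losing the original information.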


Contact: Walter G. Berendsohn, subgroup convener, wgb@zedat.fu-berlin.de. This page last updated October 6, 1996.