When discussing the form of a standard, it seems reasonable to take its purposes as the starting point.
To facilitate data import/export, we need a common import/export format (or formats) for collection data. Field content definitions seem to be the most appropriate form, but they are valuable only in the context of a data structure. The standard should therefore specify a set of collection objects, with a set of attributes (fields) for each object. The logical data type (character, numeric, binary, etc.) and the maximal size of a value are important for every attribute. We also have to take into account the limited field lengths in most DBMSs when discussing an import/export format.
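Just to illustrate (the object names, field names and sizes below are my own assumptions, not proposed content of the Standard), a field content definition for an exchange format could be expressed roughly like this:

    # A minimal sketch, with invented object and field names, of how a
    # field content definition for an exchange format might be expressed.
    from dataclasses import dataclass

    @dataclass
    class AttributeSpec:
        object_name: str    # collection object the attribute belongs to
        name: str           # attribute (field) name
        logical_type: str   # "character", "numeric", "date", "binary", ...
        max_length: int     # maximal value size; kept modest because of DBMS field limits
        required: bool      # whether every exported record must supply a value

    # A few illustrative entries of such an exchange-format dictionary
    EXCHANGE_FIELDS = [
        AttributeSpec("CollectionUnit", "accession_number", "character", 20, True),
        AttributeSpec("CollectionUnit", "collection_date", "date", 10, False),
        AttributeSpec("Taxon", "scientific_name", "character", 120, True),
    ]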
Data quality depends both on a correct structure (namely, precise identification of objects and their relations) and on the set of attributes implemented. Some implementation techniques (e.g., normalization, in the case of a relational database) may affect data quality significantly. For this purpose, then, a standard should identify the objects to be supported (at least the principal ones), together with their relations and attributes. Some implementation rules should also be considered.
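As a small illustration of the normalization point (the object and field names are invented for the example, not taken from any proposal):

    # Illustration only: the collector stored as free text invites
    # inconsistent spellings, while a separate Collector object referenced
    # by an identifier keeps a single authoritative record.
    unnormalized_unit = {"accession": "B-1001", "collector": "J. Smith / Smith, J."}

    collectors = {17: {"name": "Smith, J."}}
    normalized_unit = {"accession": "B-1001", "collector_id": 17}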
Two things are needed for software integration. The first is a logical interface to the main collection objects, based on the core information model (logical formats of queries, transactions, determinations, etc.). The second is a set of data formats and APIs for queries, requests, reports, spatial data, etc. In some cases implementation details should be considered, for example when discussing a Web interface for collection data (CGI). It may be better to provide a set of extensions to the Standard for applications of that kind. Data protection should also be discussed here.
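A rough sketch of what such a logical interface could look like (the method names and parameters are my assumptions made for illustration, not part of the Standard):

    # A sketch of a logical interface to collection data; a CGI gateway, a
    # report generator, or another institution's database could all be
    # clients of the same interface.
    from abc import ABC, abstractmethod

    class CollectionDataSource(ABC):
        @abstractmethod
        def query_units(self, **criteria) -> list[dict]:
            """Return collection-unit records matching the given field criteria."""

        @abstractmethod
        def export_units(self, unit_ids: list[str], format: str = "exchange") -> bytes:
            """Serialize the selected units in an agreed exchange format."""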
A framework for software development requires a core information model, the principal design patterns, and some implementation rules (mainly concerning the user interface) to be proposed.
I agree with you that a simple data dictionary (a list of fields with descriptions) is not suitable for the Standard. Every purpose we are discussing requires at least the basic data structures to be described. For import/export, a complex data dictionary is sufficient. For data quality, software integration and development, some elements of an information model are extremely important.
I don't think it's a good idea to cast the standard as a detailed information model, because the latter depends on the purpose of the particular application, the implementation techniques, the skill of the developers, and so on. However, we should carefully identify the most significant and stable part of the model (the core model).
For all the reasons discussed, I'd like to propose the following structure for the Standard:
The Standard could also be partitioned, if needed, according to the data areas covered (e.g., gathering and identification, collection management, descriptions and experimental data, etc.).
In some cases it would be difficult to provide a strict standard; we will not be able to standardize all possible kinds of applications for accession data. It would therefore be useful to establish minimal obligatory requirements for any data set and/or software package, and in some cases to propose an optional part of the standard. For example, some objects and attributes would be marked as required, while others would be marked as optional.
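For illustration only (the particular attribute names are assumptions of mine), checking a record against such an obligatory part could look like this:

    # Illustrative sketch: a minimal obligatory part of the standard and a
    # check of one record against it; the attribute names are invented.
    REQUIRED = {"accession_number", "scientific_name"}
    OPTIONAL = {"collection_date", "altitude"}

    def missing_required(record: dict) -> set[str]:
        """Attributes a conforming data set must supply but this record lacks."""
        return REQUIRED - {key for key, value in record.items() if value}

    print(missing_required({"accession_number": "B-1001", "collection_date": ""}))
    # -> {'scientific_name'}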
In any case, I'm sure we'll have to use modelling techniques while working on the Standard. We won't be able to avoid some modelling activities even if we only provide import/export formats, because the latter should be structured. The final result (the document to publish) is another story. However, diagrams are often easier to understand than lists of fields or dictionaries like the Content Standard for Geographical Metadata.
When choosing a methodology and techniques for the Standard, we should avoid any assumptions about implementation tools and techniques, because of the variability of DBMSs and software development tools. We should also base the Standard on information structure, which requires a wide range of patterns to be identified. Both the hierarchical and the relational model impose significant restrictions (corresponding to implementation techniques and tools) on the real data model. We should also deal with data structures that do not depend on the functional model of a particular application. At the same time, the behavior of collection objects is often very important (e.g., when discussing transactions of collection units).
From this point of view, object-oriented modelling has many important advantages compared with structured analysis and design methods, and in general it meets the requirements mentioned above. I skip a detailed comparison of object-oriented and structured techniques here; I could write more on this subject if needed. Object-oriented methods usually demand more skill from the analyst or designer, but a good logical model described in object-oriented terms is not too difficult for inexperienced readers to understand. Object design usually produces much more complicated models, but system and object design are outside the scope of the Standard.
I'd like to propose Rumbaugh's Object Modelling Technique (OMT) as the basic method. An excellent book on OMT is available (James Rumbaugh et al. 1991. Object-Oriented Modeling and Design. Prentice-Hall. ISBN 0-13-629841-9); it is one of the best books on analysis and design (structured or object-oriented) I have read. OMT is the most widespread object-oriented analysis and design method, and after its unification with Booch's method (a draft of the Unified Method is already available) it will become a de facto standard in OO development. OMT offers a good set of rules and activities for all stages from analysis to implementation, while remaining flexible. OMT also has strong support from CASE tools.
In OMT, Class Association diagrams (CAD) represent classes with their attributes and operations (with a detailed description of their properties in Class Description Matrices, CDM), as well as associations between classes, generalizations, etc. With this tool we can represent almost all the information we need. In addition, Event Trace (ETD) and State Transition (STD) diagrams would be useful for describing the general behavior of collection objects and their interfaces (operations to be supported, a kind of API, etc.). Class Communication and Message Generalization diagrams are generally needed at the design stage, which we need not pursue. We'll hardly need Data Flow diagrams, but they are also available.
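This is not OMT notation, of course, but a short code sketch of the kind of information a Class Association diagram would carry - classes with attributes and operations, a generalization, and an association (all names are invented for the example):

    # Sketch only: classes with attributes and operations, a generalization
    # (Specimen is a CollectionUnit) and an association (a unit is gathered
    # at a site); names are illustrative, not proposed Standard terms.
    class GatheringSite:
        def __init__(self, locality: str, country: str):
            self.locality = locality
            self.country = country

    class CollectionUnit:
        def __init__(self, accession_number: str, site: GatheringSite):
            self.accession_number = accession_number
            self.site = site                     # association to GatheringSite

        def loan(self, to_institution: str):     # operation (a collection transaction)
            ...

    class Specimen(CollectionUnit):              # generalization
        def __init__(self, accession_number: str, site: GatheringSite, preparation: str):
            super().__init__(accession_number, site)
            self.preparation = preparation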
We would perform the steps listed below while constructing the core information model for the Standard (a rather simple general scheme, which can be refined later):
Partitioning the model according to data subareas would be done at the third step, or as the last step of the first iteration. The procedure described above corresponds to usual data modelling practice; it ensures the best possible quality of the model and prevents many mistakes.
With the core model completed, we'll be able to construct import/export formats and Standard extensions with minimal effort.
The Standard will succeed only if many development teams support it and are interested in certifying their data sets and software once the Standard is complete. At the very least, the exchange formats should be supported as widely as possible. The Subgroup should also try to obtain data models and data dictionaries from as many databases and information systems as possible.
So the participation of developers of institutions' internal information systems (museums, herbaria, botanical gardens, universities, etc.) is very important: they usually deal with computerized systems containing huge amounts of valuable data. Information systems and software packages from independent developers can be widely distributed and can hold a lot of data as well.
Contact: Walter G. Berendsohn, subgroup convener, wgb@zedat.fu-berlin.de