When discussing the form of a standard, it seems reasonable to take its purposes as the starting point.
To facilitate data import/export, we need a common import/export format (or formats) for collection data. Field content definitions seem to be the most appropriate form, but they are valuable only in the context of a data structure. The standard should therefore specify a set of collection objects, with a set of attributes (fields) for each object. The logical data type (character, numeric, binary, etc.) and the maximal size of a value are important for every attribute. We also have to take into account the limited field lengths in most DBMSs when discussing an import/export format.
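Just to illustrate (the object names, field names and sizes below are my own assumptions, not proposed content of the Standard), a field content definition for an exchange format could be expressed roughly like this:

    # A minimal sketch, with invented object and field names, of how a
    # field content definition for an exchange format might be expressed.
    from dataclasses import dataclass

    @dataclass
    class AttributeSpec:
        object_name: str    # collection object the attribute belongs to
        name: str           # attribute (field) name
        logical_type: str   # "character", "numeric", "date", "binary", ...
        max_length: int     # maximal value size; kept modest because of DBMS field limits
        required: bool      # whether every exported record must supply a value

    # A few illustrative entries of such an exchange-format dictionary
    EXCHANGE_FIELDS = [
        AttributeSpec("CollectionUnit", "accession_number", "character", 20, True),
        AttributeSpec("CollectionUnit", "collection_date", "date", 10, False),
        AttributeSpec("Taxon", "scientific_name", "character", 120, True),
    ]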
Data quality depends both on a correct structure (namely, precise identification of objects and their relations) and on the set of attributes implemented. Some implementation techniques (e.g., normalization, in the case of a relational database) may affect data quality significantly. For this purpose, then, a standard should identify the objects to be supported (at least the principal ones), together with their relations and attributes. Some implementation rules should also be considered.
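As a small illustration of the normalization point (the object and field names are invented for the example, not taken from any proposal):

    # Illustration only: the collector stored as free text invites
    # inconsistent spellings, while a separate Collector object referenced
    # by an identifier keeps a single authoritative record.
    unnormalized_unit = {"accession": "B-1001", "collector": "J. Smith / Smith, J."}

    collectors = {17: {"name": "Smith, J."}}
    normalized_unit = {"accession": "B-1001", "collector_id": 17}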
Two things are needed for software integration. The first is a logical interface to the main collection objects, based on the core information model (logical formats of queries, transactions, determinations, etc.). The second is a set of data formats and APIs for queries, requests, reports, spatial data, etc. In some cases implementation details should be considered, for example when discussing a Web interface for collection data (CGI). It may be better to provide a set of extensions to the Standard for applications of that kind. Data protection should also be discussed here.
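A rough sketch of what such a logical interface could look like (the method names and parameters are my assumptions made for illustration, not part of the Standard):

    # A sketch of a logical interface to collection data; a CGI gateway, a
    # report generator, or another institution's database could all be
    # clients of the same interface.
    from abc import ABC, abstractmethod

    class CollectionDataSource(ABC):
        @abstractmethod
        def query_units(self, **criteria) -> list[dict]:
            """Return collection-unit records matching the given field criteria."""

        @abstractmethod
        def export_units(self, unit_ids: list[str], format: str = "exchange") -> bytes:
            """Serialize the selected units in an agreed exchange format."""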
A framework for software development requires a core information model, the principal design patterns, and some implementation rules (mainly concerning the user interface) to be proposed.
I agree with you that a simple data dictionary (a list of fields with descriptions) is not suitable for the Standard. Every purpose we are discussing requires at least the basic data structures to be described. For import/export, a complex data dictionary is sufficient. For data quality, software integration and development, some elements of an information model are extremely important.
I don't think it's a good idea to cast the standard as a detailed information model, because the latter depends on the purpose of the particular application, the implementation techniques, the skill of the developers, and so on. However, we should carefully identify the most significant and stable part of the model (the core model).
For all the reasons discussed, I'd like to propose the following structure for the Standard:
The Standard could also be partitioned, if needed, according to the data areas covered (e.g., gathering and identification, collection management, descriptions and experimental data, etc.).
In some cases it would be difficult to provide a strict standard; we will not be able to standardize all possible kinds of applications for accession data. It would therefore be useful to establish minimal obligatory requirements for any data set and/or software package, and in some cases to propose an optional part of the standard. For example, some objects and attributes would be marked as required, while others would be marked as optional.
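For illustration only (the particular attribute names are assumptions of mine), checking a record against such an obligatory part could look like this:

    # Illustrative sketch: a minimal obligatory part of the standard and a
    # check of one record against it; the attribute names are invented.
    REQUIRED = {"accession_number", "scientific_name"}
    OPTIONAL = {"collection_date", "altitude"}

    def missing_required(record: dict) -> set[str]:
        """Attributes a conforming data set must supply but this record lacks."""
        return REQUIRED - {key for key, value in record.items() if value}

    print(missing_required({"accession_number": "B-1001", "collection_date": ""}))
    # -> {'scientific_name'}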
In any case, I'm sure we'll have to use modelling techniques while working on the Standard. We won't be able to avoid some modelling activities even if we only provide import/export formats, because the latter should be structured. The final result (the document to publish) is another story. However, diagrams are often easier to understand than lists of fields or dictionaries like the Content Standard for Geographical Metadata.
When choosing a methodology and techniques for the Standard, we should avoid any assumptions about implementation tools and techniques, because of the variability of DBMSs and software development tools. We should also base the Standard on information structure, which requires a wide range of patterns to be identified. Both the hierarchical and the relational model impose significant restrictions (corresponding to implementation techniques and tools) on the real data model. We should also deal with data structures that do not depend on the functional model of a particular application. At the same time, the behavior of collection objects is often very important (e.g., when discussing transactions of collection units).
From this point of view, object-oriented modelling has many important advantages compared with structured analysis and design methods, and in general it meets the requirements mentioned above. I skip a detailed comparison of object-oriented and structured techniques here; I could write more on this subject if needed. Object-oriented methods usually demand more skill from the analyst or designer, but a good logical model described in object-oriented terms is not too difficult for inexperienced readers to understand. Object design usually produces much more complicated models, but system and object design are outside the scope of the Standard.
I'd like to propose Rumbaugh's Object Modelling Technique (OMT) as the basic method. An excellent book on OMT is available (James Rumbaugh et al. 1991. Object-Oriented Modeling and Design. Prentice-Hall. ISBN 0-13-629841-9); it is one of the best books on analysis and design (structured or object-oriented) I have read. OMT is the most widespread object-oriented analysis and design method, and after its unification with Booch's method (a draft of the Unified Method is already available) it will become a de facto standard in OO development. OMT offers a good set of rules and activities for all stages from analysis to implementation, while remaining flexible. OMT also has strong support from CASE tools.
In OMT, Class Association diagrams (CAD) represent classes with their attributes and operations (with a detailed description of their properties in Class Description Matrices, CDM), as well as associations between classes, generalizations, etc. With this tool we can represent almost all the information we need. In addition, Event Trace (ETD) and State Transition (STD) diagrams would be useful for describing the general behavior of collection objects and their interfaces (operations to be supported, a kind of API, etc.). Class Communication and Message Generalization diagrams are generally needed at the design stage, which we need not pursue. We'll hardly need Data Flow diagrams, but they are also available.
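This is not OMT notation, of course, but a short code sketch of the kind of information a Class Association diagram would carry - classes with attributes and operations, a generalization, and an association (all names are invented for the example):

    # Sketch only: classes with attributes and operations, a generalization
    # (Specimen is a CollectionUnit) and an association (a unit is gathered
    # at a site); names are illustrative, not proposed Standard terms.
    class GatheringSite:
        def __init__(self, locality: str, country: str):
            self.locality = locality
            self.country = country

    class CollectionUnit:
        def __init__(self, accession_number: str, site: GatheringSite):
            self.accession_number = accession_number
            self.site = site                     # association to GatheringSite

        def loan(self, to_institution: str):     # operation (a collection transaction)
            ...

    class Specimen(CollectionUnit):              # generalization
        def __init__(self, accession_number: str, site: GatheringSite, preparation: str):
            super().__init__(accession_number, site)
            self.preparation = preparation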
We would perform the steps listed below while constructing the core information model for the Standard (a rather simple general scheme, which can be refined later):
Partitioning the model according to data subareas would be done at the third step, or as the last step of the first iteration. The procedure described above corresponds to usual data modelling practice; it ensures the best possible quality of the model and prevents many mistakes.
With the core model completed, we'll be able to construct import/export formats and Standard extensions with minimal effort.
The Standard will succeed only if many development teams support it and are interested in certifying their data sets and software once the Standard is complete. At the very least, the exchange formats should be supported as widely as possible. The Subgroup should also try to obtain data models and data dictionaries from as many databases and information systems as possible.
So the participation of developers of institutions' internal information systems (museums, herbaria, botanical gardens, universities, etc.) is very important: they usually deal with computerized systems containing huge amounts of valuable data. Information systems and software packages from independent developers can be widely distributed and can hold a lot of data as well.
Contact: Walter G. Berendsohn, subgroup convener, wgb@zedat.fu-berlin.de