[OAI-implementers] Qualified Dublin Core

Fri Aug 13 17:44:57 EDT 2004

> -----Original Message-----
> From: oai-implementers-bounces at openarchives.org 
> [mailto:oai-implementers-bounces at openarchives.org] On Behalf 
> Of Stephen Crawley
> Sent: Thursday, August 12, 2004 9:33 PM
> Subject: Re: [OAI-implementers] Qualified Dublin Core 
> 
> Hi Tim,
> ...<snip>
> > ... That's why XML
> > namespaces are so handy. The data provider can explicitly and 
> > unambiguously tie an element in his or her record to one specific, 
> > community-standard metadata semantic set.
> 
> I disagree.  The XML namespaces (i.e. OAI record formats) are 
> actually a LOSSY way of expressing semantics.  Or at least 
> that's what happens in practice ... when people try to 
> shoe-horn metadata into some existing OAI record schema that 
> isn't quite right.

XML Namespaces pre-date OAI. XML Namespace is a recommendation of the W3C
(see http://www.w3c.org/TR/xml-names11/). OAI-PMH exploits XML Namespaces,
but so do many (if not most) other large-scale XML applications today. XML
Namespaces are not peculiar to OAI-PMH and are in no way synonymous with OAI
record formats.

> 
> My point is that a real metadata schema includes something that says
> what the elements, refinements, encodings, etc all mean.   Currently,
> that something is usually English text, but in the future it 
> might be augmented with machine readable cross-references to 
> standard thesauri, ontologies, etcetera.
>

Yes, real metadata SCHEMES should be thoroughly defined and it would be even
nicer if they were all registered. I was not suggesting that the W3C XML
Schema Language be used as a primary way to define (in human terms) the
semantics of any metadata scheme. More appropriate technologies and methods
for defining / describing semantics exist and/or are in development (e.g.,
RDF Schema and OWL -- and what you're doing, I gather). At one time DCMI
used attributes suggested by the ISO/IEC 11179 Specification and
Standardization of Data Elements as a way to define DC semantics in human
understandable terms. (The multi-part 11179 standard is available free from
www.jtc1.org -- part 3 updated in 2003, about metadata registries, is
interesting relative to this discussion albeit longwinded and a bit dull).

> Current day OAI-style XML schemas are not metadata schemas.  
> Rather they are formats for transporting metadata records 
> that may (or may not)
> fully conform to some real metadata schema.   Other formats 
> include RDF,
> HTML meta tags, domain specific formats as in MARC and 
> ANZLIC, and even clunky ad-hoc mappings to spread-sheets.
> ...<snip>

Again, the W3C XML Schema Language predates OAI-PMH and is used for many
purposes other than OAI-PMH. The XML Schema Language conformant XSDs
required by OAI-PMH are intended to provide a means by which metadata
instances can be validated (in an XML sense). XML is the format; XML Schema
Language provides a means to validate that a set of XML metadata instances
(i.e., harvested XML metadata records) correctly use (in an XML sense) a
pre-defined arrangement of element names and attributes. XML Schema Language
is not an especially good method for validating correctness of the real use
of semantics in XML metadata instances, although it does have some crude
capabilities in that regard (e.g., you can't introduce a tag name not
mentioned in the XSD, other than from another namespace, and then only if
allowed by the XSD).

So I think we're arguing apples and oranges here. 

I think the confusion stems from the fact that we're each looking at different
aspects of the problem. My concern is in being able to recognize that most
or all of the metadata elements in a harvested metadata record come from a
metadata scheme (a set of metadata semantics) with which I'm already
familiar (e.g., qualified Dublin Core). How I learned about that metadata
scheme is a separate issue and not of interest to me at the moment. The
question is how do I look at a metadata record and know which of the
elements it contains are qualified DC used in the way I would expect to
describe a particular information resource. I'm not interested in elements
that aren't qualified DC, or can't at least be immediately, simply,
automatically, and safely mapped to qualified DC.

If a harvested metadata record explicitly references the official, canonical
XML Namespace and XSD for qualified Dublin Core, and if the metadata record
validates against that canonical XSD, I can be pretty certain.
Unfortunately, no such canonical XML Namespace or XSD for qualified Dublin
Core currently exists. Instead, DCMI has posted 3 separate XSDs, each
associated with its own XML Namespace, which together name all the elements
and attributes currently included in qualified DC. 

My contention is that if I run across a metadata record which makes use of
the 3 canonical component XML Namespaces for qualified DC, then I should
have good confidence that I will be able to extract out from that metadata
record qualified DC elements that I know and understand -- even if that
record references some other XML Namespace for its top-level element and
claims conformance to some other XSD (previously unknown to me). The
elements of interest to me will be labeled (in the standard XML fashion)
with prefixes tied to the 3 canonical XML Namespaces associated with
qualified DC, so I'll still be able to identify which elements come from
qualified DC. 
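
As a rough illustration of the kind of extraction I mean, here is a minimal
Python sketch. It assumes the three DCMI namespace URIs are the ones listed in
QDC_NAMESPACES; the sample record, its "ex" container namespace, and the
function names are all invented for the example and do not come from any real
data provider.

```python
import xml.etree.ElementTree as ET

# Assumption: these are the three DCMI namespaces that together make up
# qualified DC (simple elements, terms, and the DCMI type vocabulary).
QDC_NAMESPACES = {
    "http://purl.org/dc/elements/1.1/",
    "http://purl.org/dc/terms/",
    "http://purl.org/dc/dcmitype/",
}

# Hypothetical record: the container element and its namespace are made up.
SAMPLE_RECORD = """\
<ex:record xmlns:ex="http://example.org/unknown-format/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Field recordings, 1998</dc:title>
  <dcterms:created>1998-07-04</dcterms:created>
  <ex:shelfMark>B-17</ex:shelfMark>
</ex:record>
"""


def split_namespace(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def extract_qdc(xml_text):
    """Return (namespace URI, local name, text) for each qualified DC element."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():  # document order, starting with the container
        uri, local = split_namespace(elem.tag)
        if uri in QDC_NAMESPACES:
            found.append((uri, local, (elem.text or "").strip()))
    return found


for uri, local, text in extract_qdc(SAMPLE_RECORD):
    print(uri, local, text)
```

Note that the match is made on the namespace URI, not on the prefix, so it
does not matter which prefixes the data provider happened to choose or that
the container element is unknown to the harvester.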

The question is how reasonable and safe an assumption is that? Obviously if
the previously unknown XSD imports only the 3 DCMI XML Namespaces and
associated XSDs and adds no semantics of its own other than a top-level
container element, I can be highly confident. Such an XSD would allow no
foreign semantics -- essentially all that a metadata record conforming to
such an XSD could contain would be qualified DC. It wouldn't matter that I
didn't previously know of the container element XML Namespace or its XSD.

On the other hand, if the previously unknown XSD imports 25 other XML
Namespaces and associated XSDs, and limits the use of qualified DC elements
to some low level in the record's XML hierarchy, then my confidence that
I'll be able to extract useful qualified DC elements from the record is
rather low.
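
One crude way to see where an unfamiliar XSD falls between these two extremes
is to list the namespaces it imports. The following Python sketch does only
that, under the same assumption about the three DCMI namespace URIs; the
sample schema and the classify_imports name are hypothetical. It does not look
at what the schema declares in its own target namespace, nor at where in the
hierarchy the DC elements are allowed, so it is a first check and nothing
more.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Assumption: the three DCMI namespaces that together make up qualified DC.
QDC_NAMESPACES = {
    "http://purl.org/dc/elements/1.1/",
    "http://purl.org/dc/terms/",
    "http://purl.org/dc/dcmitype/",
}

# Hypothetical top-level schema; namespaces under example.org are invented.
SAMPLE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/unknown-format/">
  <xs:import namespace="http://purl.org/dc/elements/1.1/"
             schemaLocation="dc.xsd"/>
  <xs:import namespace="http://purl.org/dc/terms/"
             schemaLocation="dcterms.xsd"/>
  <xs:import namespace="http://purl.org/dc/dcmitype/"
             schemaLocation="dcmitype.xsd"/>
  <xs:import namespace="http://example.org/local-extensions/"
             schemaLocation="local.xsd"/>
</xs:schema>
"""


def classify_imports(xsd_text):
    """Return (qualified DC namespaces, other namespaces) imported by an XSD."""
    root = ET.fromstring(xsd_text)
    # xs:import elements are direct children of xs:schema.
    imported = [
        imp.get("namespace", "")
        for imp in root.findall(f"{{{XSD_NS}}}import")
    ]
    qdc = [ns for ns in imported if ns in QDC_NAMESPACES]
    foreign = [ns for ns in imported if ns not in QDC_NAMESPACES]
    return qdc, foreign


qdc, foreign = classify_imports(SAMPLE_XSD)
print(len(qdc), "qualified DC imports;", len(foreign), "other imports")
```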

Most cases of course fall somewhere in between. The OLAC top-level XSD
imports the appropriate DCMI Namespaces and XSDs, but it also imports 4
OLAC-specific XSDs that define a handful of additional non-DC elements and
attributes. Same with the NSDL and CDP top-level XSDs. But still, 90% of the
content of the metadata records harvested from data providers using these
XSDs for their homegrown metadataFormats is recognizable as qualified DC. As
a harvester, therefore, I'm comfortable for purposes of aggregation to take
the elements associated with the qualified DC namespaces and ignore the
rest.
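
A harvester could put a number on this for any given record by counting how
many of its elements fall in the qualified DC namespaces. Here is a minimal
Python sketch, again assuming the three DCMI namespace URIs shown; the sample
record and the qdc_share name are made up, and counting elements is only one
of several reasonable ways to measure "content".

```python
import xml.etree.ElementTree as ET

# Assumption: the three DCMI namespaces that together make up qualified DC.
QDC_NAMESPACES = {
    "http://purl.org/dc/elements/1.1/",
    "http://purl.org/dc/terms/",
    "http://purl.org/dc/dcmitype/",
}

# Hypothetical record mixing qualified DC with one community-specific element.
SAMPLE_RECORD = """\
<ex:record xmlns:ex="http://example.org/community-format/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Survey of river sediments</dc:title>
  <dc:creator>Example, A.</dc:creator>
  <dcterms:issued>2003-11-02</dcterms:issued>
  <ex:gradeLevel>9</ex:gradeLevel>
</ex:record>
"""


def qdc_share(xml_text):
    """Return (qualified DC element count, total element count).

    The top-level container element is left out of both counts.
    """
    root = ET.fromstring(xml_text)
    qdc = 0
    total = 0
    for elem in root.iter():
        if elem is root:
            continue  # skip the container element itself
        total += 1
        uri = elem.tag[1:].split("}", 1)[0] if elem.tag.startswith("{") else ""
        if uri in QDC_NAMESPACES:
            qdc += 1
    return qdc, total


qdc, total = qdc_share(SAMPLE_RECORD)
print(f"{qdc} of {total} elements are in the qualified DC namespaces")
```

A harvester might choose to aggregate only records whose share is above some
threshold of its own choosing and take just the qualified DC elements from
those.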

Based on your response, I think what I'm worried about, as further elaborated
here, is different from what you're focusing on.

Tim Cole
University of Illinois at UC