[UPS] Problems/Comments with Santa Fe Metadata Set

Mark Doyle doyle@aps.org
Tue, 16 Nov 1999 13:13:14 -0500


Greetings Carl,

> From: Carl Lagoze <lagoze@cs.cornell.edu>
> Date: 1999-11-15 06:41:17 -0500

Sorry, I was unable to answer this sooner... Since I was the one who  
initiated the addition of this element, I feel I should address it. I  
understand your point of view, but I think that we live in an imperfect world  
and one needs to have pragmatic solutions to otherwise vexing problems. The  
whole point of these repositories and overlaid services is to make material  
available to researchers in a variety of formats, some of which may be much  
richer than others (the variation is both within a repository and across  
repositories). Formats may be added and removed as the underlying technology  
changes. Any service which chooses to just display a subset of a repository's
formats (say, just PostScript or PDF) is likely to shortchange users. For
instance, xxx offers many flavors of PostScript, some of which require the  
user to understand additional issues (e.g., font installation). So the simple  
goal (again in the context of doing things on the six month scale) is to  
give users a path to the definitive interface of a repository, preferably  
anchored around the target that the user is actually interested in at the  
moment. I feel it is much more useful to a user of the services to wind up at a
"wrapper" page than at just the home page of the arXiv. Furthermore, the URLs
in the display ID are to be persistent and freely accessible (some
repositories may have to limit who can access certain components and a  
mechanism for authentication has to be made available).

> Our view throughout the design of Dienst (and digital object repositories in 
> general) is that a repository is not in the business of human presentation.
> It simply provides sufficient information through a protocol so that other 
> services can use its contents.  From the perspective human interaction, it 
> provides protocol requests that can be used by any user interface to
> construct "display pages" are pages that access specific disseminations or 
> parts of disseminations.  Thus,  there may be many user interfaces and many 
> "display Ids" for a particular digital object. Furthermore, a repository
> does not have any record of what these display Ids are (i.e., does the
> publisher of a book know every house, library, bookstore that their book
> sits in).

This is all well and good in theory, but where the rubber hits the road, I  
think it fails. Not all repositories are the same. The selection of  
repository services that an overlay service makes visible to the reader is  
not likely to be the complete set of services. This is a disadvantage to the  
users who may not even be aware that they have other choices for retrieval of  
information.

> The display ID metadata element presumes that not only does the repository 
> or digital object know about these URLs but endows one with the property of 
> being the "correct" one (a rather wrong concept since the display ID for an 
> Italian audience should be different than for a US audience).

I strongly disagree. The fact is that most repositories have a definitive  
wrapper page that provides links to all available repository services  
relevant to a particular item in the repository (and a dynamic set of  
services at that). To use phrasing from physics, this is a "natural" URL -  
"naturalness" is not a statement about correctness (as you imply), but  
rather, it allows for a choice of a distinguished member of a class (here,  
the class of display URLs). There may be other ways to make the choice
(just as natural), but each arXiv has a very good sense of which URL is  
potentially the most useful to end users. The mere fact that these stable,  
persistent URLs exist and are made available by the repository distinguishes  
them from the rest of the URLs.

Should all overlays be required to track all of the services and mirrors of  
their underlying repositories? I think that is what your point of view requires
(and from below, you seem to acknowledge this). You seem to want to keep  
users confined to a specific box without even giving them a chance to see  
that there may be more in the world than the box you give them.

> Furthermore
> it imprints it as part of the metadata for the digital object, which
> philosophically is a rather persistent entity - yes, objects should be
> persistent but the user interfaces that present them should be malleable.  

The URLs were meant to be as persistent as the object itself. The
malleability is in what the URL points to, not what the URL is. URLs are not  
inherently non-persistent.

> For a little idea of how this works in the Dienst software take a look at
> the following example:
>
> A document with the URN ncstrl.cornell/TR94-1418
>
> Its display page from the Cornell ncstrl user interface is:
> http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR94-1418 

> This information is put together from three protocol requests to the object 
> in the cornell repository:

[A rich set of wonderful examples deleted]

> This uses the same raw repository requests to construct its information.
>
> In fact, this is exactly the way that NCSTRL and XXX/CoRR interact.  Take a 
> look at the URL:
> http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/xxx.cs.DL/9812020
> <http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/xxx.cs.DL/9812020>
>
> and you will see a document in XXX presented through the NCSTRL user
> interface.  You could go to http://xxx.lanl.gov/archive/cs/intro.html
> and get the same document  through the XXX User interface.

This is the main counterexample right here (thanks for providing it!). I do
not object to your presentation of the information through the NCSTRL  
interface (having the uniform interface is quite nice), but I do not  
understand why you don't give the user the natural URL
http://xxx.lanl.gov/abs/cs/9812020. Why force the user to navigate from  
http://xxx.lanl.gov/archive/cs/intro.html?  Actually, this example isn't  
really the best because your article is only available in a single format.  
Instead, I give http://xxx.lanl.gov/abs/cs.CL/9911006  
(http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/xxx.cs.CL/9911006)
with source, PDF, and other formats (DVI and about 8 flavors of PS). You
suppress source and DVI, and you chose a single resolution of bitmapped fonts
for the PS (I prefer resolution-independent Type 1 fonts). Where would
NCSTRL give me an opportunity for discovering that I can choose a mirror,  
that I can choose a default download format (or even that other choices  
exist), that author names are conveniently linked for searching, or, in the  
case of some physics archives, that xxx provides "cited by" and "refers to"  
links?
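
To make the contrast concrete, here is a rough sketch of how an overlay
could show both its own display page and the repository's natural URL. The
function names and the optional display_id argument are hypothetical, and
the rule of stripping the "xxx." prefix is only an assumption drawn from the
cs.CL/9911006 example above. The cs.DL/9812020 example is written
differently in this message (abs/cs/9812020), so the guess does not
reproduce that form, which is one reason to prefer a display ID supplied by
the repository over anything the overlay infers.

```python
from typing import Dict, Optional

# Base URLs taken from the examples quoted in this message.
DIENST_DISPLAY = "http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/"
XXX_ABS = "http://xxx.lanl.gov/abs/"


def overlay_url(handle: str) -> str:
    """Display page the overlay builds for a handle like 'xxx.cs.CL/9911006'."""
    return DIENST_DISPLAY + handle


def guess_natural_url(handle: str) -> Optional[str]:
    """Guess the repository's own abstract page for an 'xxx.' handle.

    Assumption: dropping the 'xxx.' prefix gives the path under /abs/.
    Returns None for handles from other repositories.
    """
    prefix = "xxx."
    if not handle.startswith(prefix):
        return None
    return XXX_ABS + handle[len(prefix):]


def links_for(handle: str, display_id: Optional[str] = None) -> Dict[str, str]:
    """Collect the links an overlay could offer for one item.

    A display ID supplied by the repository (hypothetical field) always
    wins over the overlay's own guess.
    """
    links = {"overlay": overlay_url(handle)}
    repository = display_id or guess_natural_url(handle)
    if repository is not None:
        links["repository"] = repository
    return links


if __name__ == "__main__":
    for name, url in links_for("xxx.cs.CL/9911006").items():
        print(name, url)
```

With a supplied display ID, e.g. links_for("xxx.cs.DL/9812020",
display_id="http://xxx.lanl.gov/abs/cs/9812020"), the overlay never has to
know how the repository lays out its URLs.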

> Sorry to assault you with all this detail but we at Cornell have been
> somewhat in the business of trying to get DL protocols correct and this
> "display URL" violates some of our thinking on separation of concerns.  I
> don't have a real good answer here, since the "correct" answer (from the
> Dienst perspective) involves some more burden on the external services
> (understanding more protocol requests).

Exactly. My point is that there exist natural URLs which may give enhanced  
services to users. It may be that some repositories will just give the Dienst  
display URL and be done with it. But I submit that the majority of  
repositories will function not just as faceless warehouses, but will also  
present their own particular view of the world, will have a persistent URL  
mechanism for accessing that view, and some set of users will find benefit in  
the repository's view. I think you need to change your vocabulary a bit. Try  
"natural" or "canonical" rather than "correct."

All that said, I might be persuaded that the display ID doesn't have to be  
mandatory, but I think the act of a repository committing to persistent
natural URLs (i.e., the notion of making them readable as well as writable)
is one of the foundational principles for making them function as true  
repositories. Thus, I don't think any repository should choose to omit it.  
Nor do I think any overlay should throw away this item of information if it  
is provided.

Cheers,
Mark