[UPS] RE: UPS digest, Vol 1 #24 - 5 msgs

Carl Lagoze lagoze@cs.cornell.edu
Thu, 18 Nov 1999 08:55:01 -0500


Mark,

Thanks for your long, detailed, and provocative response.  I think what this
all comes down to, and many of us have been around and around this many
times, is what is the exact definition of a repository. 

 I have a very minimalist interpretation of a repository.  From my point of
view it is simply something that, via protocol or API or service requests
(they all mean the same to me), provides access to its digital objects.
Those objects then have a very generic API themselves which provides some
top level metadata about the object (bibliographic, structural, access
rights).  From my point of view APIs or protocols are not instruments for
humans to read but for machines to understand and then turn into human
presentations.  Hence my "pure" separation of concerns between the job of a
user interface and that of a repository.  From my point of view the notion of
a Display URL being in the metadata for an object blurs that line.

You have a more expansive view of the job of a repository.  From your view
it not only offers an API but also provides a "definitive wrapper page" for
the objects.  No doubt this is what many repositories do now.  Without
arguing whether this is right or wrong (reality is fruitless to argue with),
I will argue that the fact that most repositories do NOT provide an API for
lower level functionality (such as Dienst provides) IS wrong.  The result is
that it then becomes very difficult to build higher level services on top of
repositories without engaging in the unpleasant art of screen scraping the
HTML they produce through their definitive wrapper page.

In any case, as I said to Dale Flecker in a previous note, I'm not immune to
compromising my purity - I believe that's what makes me an "applied"
researcher rather than a "theoretical" researcher.

Regards,

Carl

Message: 4
To: ups@vole.lanl.gov
Subject: Re: [UPS] Problems/Comments with Santa Fe Metadata Set
Date: Tue, 16 Nov 1999 13:13:14 -0500
From: Mark Doyle <doyle@aps.org>
Reply-To: doyle@aps.org

Greetings Carl,

> From: Carl Lagoze <lagoze@cs.cornell.edu>
> Date: 1999-11-15 06:41:17 -0500

Sorry, I was unable to answer this sooner... Since I was the one who
initiated the addition of this element, I feel I should address it. I
understand your point of view, but I think that we live in an imperfect
world and one needs to have pragmatic solutions to otherwise vexing
problems. The whole point of these repositories and overlaid services is to
make material available to researchers in a variety of formats, some of
which may be much richer than others (the variation is both within a
repository and across repositories). Formats may be added and removed as
the underlying technology changes. Any service which chooses to just
display a subset of a repository's formats (say, just PostScript or PDF) is
likely to shortchange users. For instance, xxx offers many flavors of
PostScript, some of which require the user to understand additional issues
(e.g., font installation). So the simple goal (again in the context of
doing things on the six month scale) is to give users a path to the
definitive interface of a repository, preferably anchored around the target
that the user is actually interested in at the moment. I feel it is much
more useful to a user of the services to wind up at a "wrapper" page than
just the home page of the arXiv. Furthermore, the URLs in the display ID
are to be persistent and freely accessible (some repositories may have to
limit who can access certain components and a mechanism for authentication
has to be made available).

> Our view throughout the design of Dienst (and digital object repositories
> in general) is that a repository is not in the business of human
> presentation.  It simply provides sufficient information through a
> protocol so that other services can use its contents.  From the
> perspective of human interaction, it provides protocol requests that can
> be used by any user interface to construct "display pages" or pages that
> access specific disseminations or parts of disseminations.  Thus, there
> may be many user interfaces and many "display Ids" for a particular
> digital object.  Furthermore, a repository does not have any record of
> what these display Ids are (i.e., does the publisher of a book know every
> house, library, bookstore that their book sits in).

This is all well and good in theory, but where the rubber hits the road, I
think it fails. Not all repositories are the same. The selection of
repository services that an overlay service makes visible to the reader is
not likely to be the complete set of services. This is a disadvantage to
the users who may not even be aware that they have other choices for
retrieval of information.

> The display ID metadata element presumes that not only does the
> repository or digital object know about these URLs but endows one with
> the property of being the "correct" one (a rather wrong concept since the
> display ID for an Italian audience should be different than for a US
> audience).

I strongly disagree. The fact is that most repositories have a definitive
wrapper page that provides links to all available repository services
relevant to a particular item in the repository (and a dynamic set of
services at that). To use phrasing from physics, this is a "natural" URL -
"naturalness" is not a statement about correctness (as you imply), but
rather, it allows for a choice of a distinguished member of a class (here,
the class of display URLs).  There may be other ways to make the choice
(just as natural), but each arXiv has a very good sense of which URL is
potentially the most useful to end users. The mere fact that these stable,
persistent URLs exist and are made available by the repository
distinguishes them from the rest of the URLs.

Should all overlays be required to track all of the services and mirrors of
their underlying repositories? I think that is what your point of view
requires (and from below, you seem to acknowledge this). You seem to want
to keep users confined to a specific box without even giving them a chance
to see that there may be more in the world than the box you give them.

> Furthermore
> it imprints it as part of the metadata for the digital object, which
> philosophically is a rather persistent entity - yes, objects should be
> persistent but the user interfaces that present them should be malleable.


The URLs were meant to be as persistent as the object itself. The
malleability is in what the URL points to, not what the URL is. URLs are
not inherently non-persistent.

> For a little idea of how this works in the Dienst software take a look at
> the following example:
>
> A document with the URN ncstrl.cornell/TR94-1418
>
> Its display page from the Cornell ncstrl user interface is:
>
> http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR94-1418
>
> This information is put together from three protocol requests to the object
> in the cornell repository:

[A rich set of wonderful examples deleted]

> This uses the same raw repository requests to construct its information.
>
> In fact, this is exactly the way that NCSTRL and XXX/CoRR interact.  Take a
> look at the URL:
> http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/xxx.cs.DL/9812020
>
> and you will see a document in XXX presented through the NCSTRL user
> interface.  You could go to http://xxx.lanl.gov/archive/cs/intro.html
> and get the same document  through the XXX User interface.

This is the main counterexample right here (thanks for providing it!). I do
not object to your presentation of the information through the NCSTRL
interface (having the uniform interface is quite nice), but I do not
understand why you don't give the user the natural URL
http://xxx.lanl.gov/abs/cs/9812020. Why force the user to navigate from
http://xxx.lanl.gov/archive/cs/intro.html?  Actually, this example isn't
really the best because your article is only available in a single format.
Instead, I give http://xxx.lanl.gov/abs/cs.CL/9911006
(http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/xxx.cs.CL/9911006)
with source, pdf, and other formats (dvi and about 8 flavors of PS).   You
suppress source and dvi, and you chose a single resolution of bitmapped
fonts for the PS (I prefer resolution independent Type 1 fonts). Where
would NCSTRL give me an opportunity for discovering that I can choose a
mirror, that I can choose a default download format (or even that other
choices exist), that author names are conveniently linked for searching,
or, in the case of some physics archives, that xxx provides "cited by" and
"refers to" links?
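
As a rough illustration of the two kinds of URL being compared here, the
sketch below builds both from the same identifier. The two URL patterns are
copied from the cs.CL/9911006 example above; treating them as general
templates, and the function name display_urls, are assumptions made only for
illustration and are not part of Dienst or xxx. (The earlier 9812020 example
uses "cs" rather than "cs.DL" in its natural URL, so the pattern is not
universal.)

```python
from typing import Dict

# URL patterns copied from the cs.CL/9911006 example in this thread.
# Using them as general templates is an assumption for illustration only.
DIENST_DISPLAY = "http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/{handle}"
XXX_ABSTRACT = "http://xxx.lanl.gov/abs/{subject}/{number}"


def display_urls(subject: str, number: str) -> Dict[str, str]:
    """Return the overlay display URL and the repository's natural URL.

    `subject` is the archive and subject class as written in the thread
    (for example "cs.CL"); `number` is the paper number ("9911006").
    """
    return {
        # The NCSTRL overlay addresses the paper as "xxx.<subject>/<number>".
        "overlay": DIENST_DISPLAY.format(handle=f"xxx.{subject}/{number}"),
        # The repository's own wrapper page for the same paper.
        "natural": XXX_ABSTRACT.format(subject=subject, number=number),
    }


if __name__ == "__main__":
    for kind, url in display_urls("cs.CL", "9911006").items():
        print(f"{kind}: {url}")
```

Both URLs identify the same paper; the disagreement in this thread is only
about whether the second one belongs in the object's metadata.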

> Sorry to assault you with all this detail but we at Cornell have been
> somewhat in the business of trying to get DL protocols correct and this
> "display URL" violates some of our thinking on separation of concerns.  I
> don't have a real good answer here, since the "correct" answer (from the
> Dienst perspective) involves some more burden on the external services
> (understanding more protocol requests).

Exactly. My point is that there exist natural URLs which may give enhanced
services to users. It may be that some repositories will just give the
Dienst display URL and be done with it. But I submit that the majority of
repositories will function not just as faceless warehouses, but will also
present their own particular view of the world, will have a persistent URL
mechanism for accessing that view, and some set of users will find benefit
in the repository's view. I think you need to change your vocabulary a bit.
Try "natural" or "canonical" rather than "correct."

All that said, I might be persuaded that the display ID doesn't have to be
mandatory, but I think the act of a repository committing to persistent
natural URLs  (i.e., the notion of making them readable as well as
writable) is one of the foundational principles for making them function as
true repositories. Thus, I don't think any repository should choose to omit
it. Nor do I think any overlay should throw away this item of information
if it is provided.
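
A minimal sketch of what "optional for the repository, never discarded by
the overlay" could look like in an overlay service. The Record class, its
field names, and the links_for function are hypothetical and are not taken
from the Santa Fe metadata set or from Dienst; the overlay base URL in the
usage lines is the Cornell one quoted earlier in this thread.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str                   # e.g. "xxx.cs.CL/9911006"
    title: str
    display_id: Optional[str] = None  # repository's persistent natural URL, if supplied


def links_for(record: Record, overlay_base: str) -> Dict[str, str]:
    """Build the links an overlay would present for one record."""
    # The overlay always offers its own uniform display page.
    links = {"overlay": f"{overlay_base}/{record.identifier}"}
    # If the repository supplied a display ID, pass it through untouched
    # so the reader can reach the repository's own wrapper page.
    if record.display_id:
        links["repository"] = record.display_id
    return links


if __name__ == "__main__":
    base = "http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display"
    with_id = Record("xxx.cs.CL/9911006", "Example title",
                     "http://xxx.lanl.gov/abs/cs.CL/9911006")
    without_id = Record("ncstrl.cornell/TR94-1418", "Example title")
    print(links_for(with_id, base))
    print(links_for(without_id, base))
```

A record without a display ID still gets the overlay's page, so making the
element optional costs the overlay nothing.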

Cheers,
Mark



End of UPS Digest