
In this paper I would like to do four things. First I will present the reader with a brief survey of what can be found on the Internet. Then I will touch on some problems these objects present to the cataloger. Next, I will discuss some developments in the engineering segment of information science. Finally I will imagine what a working partnership between engineers, librarians, and business might mean for access to networked objects.
I first became aware of networked objects by volunteering to participate in OCLC's (the Online Computer Library Center's) ground-breaking attempt to assess the fit between the Anglo-American Cataloguing Rules, 2nd ed., 1988 revision (AACR2R) and the kind of computer file one comes across while surfing the Internet. In OCLC's published report of its experiment, Assessing Information on the Internet (Dillon, p. 20) there is a list of generic terms for the types of electronic objects that can be found there. This list includes:
To the relief of the participants, the OCLC experiment focused on the approximately 10% of this universe represented by text files, although other types of files were also included. Certainly text files are the objects most nearly equivalent to the majority of objects cataloged under AACR2R. But it is equally certain that AACR2R is not well enough equipped to handle many other computer file genera.
Networked sites, for example, are full of unfamiliar and evolving species, unlabeled hybrids. Directories of addresses are peppered with information on course listings at unspecified institutions. These entries then nestle up against entries for what look like title listings for actual books in actual libraries. Spelunking the Internet, one imagines that these ill-defined and poorly organized files are a somewhat strange, computerized, sedimentary paleolith, because the components do not naturally form part of one another. There are lots of fragments on the Internet. There are, for example, recipes, with a provenance as layered and ghostlike to a novice Internetworker as their copyright notices are prominent and their punctuation odd.
Example 1:
.ig
Path: decwrl!recipes
From: liz@unirot (Mamaliz)
Newsgroups: mod.recipes
Subject: Recipe: Orange Pound Cake
Message-ID: <4241@decwrl.DEC.COM>
Date: 18 Jul 86 03:42:03 GMT
Sender: recipes @ decwrl.DEC.COM
Organization: The Soup Kitchen, Edison NJ
Lines: 70
Approved: reid@decwrl.UUCP
Copyright (C) 1986 USENET Community Trust
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the USENET copyright notice and the title of the newsgroup and its date appear, and notice is given that copying is by permission of the USENET Community Trust or the original contributor.
.RH MOD.RECIPES-SOURCE POUND CAKE-1 D "19 May 86" 1986
.RZ "ORANGE POUND CAKE" "A luscious orange-flavored pound cake"
Absolutely the best cake I have ever eaten! I got the recipe from the mother of a friend. I think Don's mom got the recipe off the back of a sugar box.
.IH "1 cake"
.IG "1 lb" "butter" "450g"
.IG "1 lb" "powdered sugar" "450 g"
.IG "2 Tbsp" "grated orange rind" "30 ml"
.IG "6" "large eggs"
.IG "3\(12 cups" "sifted all purpose flour" "350 g"
.IG "\(14 tsp" "salt" "1 ml"
...
There are also instructions for mounting, running, and maintaining both services and executable programs that one does not always have or know how to get. Another kind of material is the networked equivalent of course-packs: a professor selects portions of other texts and presents them as a discrete package. There are files whose only title is on the menu through which they're accessed and files that consist of frequently asked questions (FAQs) on a variety of themes: sports trivia, Veronica services, AIDS facts, and the Cleveland FreeNet. It is not uncommon to find what appears to be an excerpt from a newspaper with no attribution, no date, and no raison d'être. All the context the following one is given is that Bert Dalmer wrote it and that it appears on "pg. 28."
Example 2:
pg. 28. Sluggers bats pound Irish as Sinak slams two homers. Sports story by Bert Dalmer, pg. 28 It seems the Illini just aren't happy with a .297 team batting average. Forry Wells and the Illinois baseball team continued to tear up their opponents with their hitting, shelling Notre Dame's top two pitchers Tuesday night in an 11-4 win in South Bend, Ind. As if Tom Sinak's two home runs were not enough, Wells, who leads the team with a .421 average, put the exclamation point on the night with a ninth-inning grand slam. The Illini finished the night with 15 hits.
...
Single poems can also be found. The poem below lacks not only the computerese introduction that came with the recipe but also any attempt to situate it within print, which the story about the Illini at least made. On the other hand, the author's name is still included.
Example 3:
Winter is icummen in,
Lhude sing Goddamm,
Raineth drop and staineth slop,
and how the wind doth ramm,
Sing: Goddamm.
Skiddeth bus and sloppeth us,
An ague hath my ham.
Freezeth river, turneth liver,
Damn you, sing: Goddamm.
Goddamm, Goddamm, 'tis why I am, Goddamm.
So 'gainst the winter's balm.
Sing goddamm, damm, sing Goddamm,
Sing goddamm, sing goddamm, DAMM.
-Ezra Pound
Because literature is something we catalog, librarians might take some comfort in knowing that poems can be usefully identified by their first lines. Making that extra effort might mean consulting Granger's Index to Poetry or an anthology of this poet's work to find out whether the poem has a title. This one does, after all, have one, even if it wasn't included here. Could it be possible that the record for this poem would be improved by a uniform title? This line of questioning will be developed later in the paper, not as an inquiry into poems, and not merely as an inquiry into titles, but as an inquiry into the relationship between traditional cataloging rules and the uncontrolled material one finds on the Internet. Would it be useful to index this material by its first lines? Are we going to see a return to the library of the incipits? Should we be cataloging snippets? AACR2R is basically a tool for the entirety and is not very good at snippets.
The Internet is not just texts and hypertexts and text fragments; it is also weather maps and other (potentially compound) multimedia objects. It is journal illustrations, and soon it will even be feature films. According to the 5/24/93 New York Times (curiously enough, this citation comes from a 5/13/93 edition of EDUPAGE) the first Internet cult movie, "Wax: Or the Discovery of Television Among the Bees" was successfully digitized into the Internet from a mid-Manhattan video recording studio on an unspecified Saturday. Although it was only transmitted at 2 frames per second, the experiment was considered a success, and more movies are in the works. "Oh good," catalogers might say. "We have chapters on maps and movies in AACR2R."
No one will be surprised to learn that cataloging Internet objects in the OCLC experiment was exceptionally difficult. A common difficulty encountered when cataloging a monograph might be, "How do I properly construct my series statement?" When cataloging an Internet object, however, the cataloger is most often challenged even before she begins a transcription of the elements of the description. The first challenge is to know and be able to name in a standardized way, "What is this?"
Example 4:
ABSTRACT 92 A1 V 8192 Trunc=8192 Size =18 Line=9 Col=1 Alt=0
|...+.....1......+......2......+......3......+......4......+..... .5......+.. ....6......+ ===***Top of File*** ===###<&///|||///&#>>>>\\\\########||||||||//////&&& ===%%%$###++++++&&//////||||||||%%%%#####||||||??**// ===***End of File***
What are the specific material designators that apply to an object like this? Is there a chapter in AACR2R that covers it?
Because definitions for the generic terms used by OCLC were not published in "Assessing Information on the Internet", there was some discussion on Autocat, the cataloging and authorities listserv, in early 1993 that questioned the actual distinctions between system files and source files, between executable files and games, between data files and every other kind of file except image and audio, and so forth. Surely we do need to do a better job defining and entitling these generic forms. However, even within the genus that we are most comfortable with, the text file, the range of possibility on the Internet is enormous, as has been illustrated above.
One of the reasons that Internet material is so misshapen is that it has not been subjected to the rigors of publishing. This launches a vicious cycle of indescribability. No attempt has been made to control most of this material because the material is ephemeral, or it is too poorly put together to afford its would-be describers any handles. Catalogers continue to resist drafting viable descriptive conventions for this material because it is too slippery to be generalized about. The material continues to be released without any standard with which to compare itself. The real (non-virtual), public appearance of commercially available information has long since been shaped to meet the consumer expectations of users. These expectations generally include title pages, colophons, summaries, accompanying manuals, and even statements of responsibility, just like books. Descriptive conventions for Internet objects are not very well developed because the objects themselves are created and released in uncontrolled and unconventional ways.
Networked government documents, technical reports, and serials, like their print equivalents, can be difficult to catalog. They are, however, something familiar, something to which entire chapters of AACR2R are devoted. They are, after all, serials; or they're functioning like monographs or like pamphlets, or like graphic materials. When they are seen as merely the electronic version of something we know about on paper, the challenges that these things pose to the cataloger do not seem to thwart the basic cataloging paradigm the way that other kinds of Internet objects do. The challenge of networked objects is in the essential mutability of virtual reality, the chameleon formatting, the effortless changes in location, the easy effacement of authorship, the transparent refreshment to accommodate newer platforms, the pointer poised from within another document that makes the original object a part of a larger whole.
When one contemplates the current state of the Net, questions are inevitable: Will the networked versions of these objects present us with chief sources we can really work from?
Example 5:
;From uakari!indri!ames!apple!rutgers!aramis.rutgers.edu!athos.rutgers.edu!mende Tue Jul 18 05:54:12 PDT 1989
;Article 69 of comp.emacs:
;Path: arkl!uakari!indri!ames!apple!rutgers!aramis.rutgers.edu!athos.rutgers.edu!mende
;>From: mende@athos.rutgers.edu (Bob Mende)
;Newsgroups: gnu.emacs,comp.emacs,alt.sex
;Subject: purity.el (part 1)
;Message-ID:
;
;
This was not an easy one to work from. It is doubtful
that "purity.el," a questionnaire about sexual experience,
exists in monograph form somewhere, but let us pretend that it
does. Its title page surely would not resemble this electronic
title screen. On the other hand, if Internet chief sources are
only the electronic equivalents of what would have appeared on a
print version, would they be enough? Probably not. One still
wants to know the formatting history and something about any
editorial changes that might have been made. The
interconnectivity status of the object should be clear. It is
possible, after all, that networked status changes the nature of
an object in ways parallel to the subtle and important ways that
the presence of an observer changes the nature of data observed,
as physicists and anthropologists have known for decades. A
networked computer file is different from a non-networked
computer file, which is in turn different from a print item. It
is more, and it demands more description.
What are the elements we want to include in the chief
source of our users' dreams? That is part of what needs to be
worked out. Appropriate labeling has attracted the attention of
some very important standards-developing bodies. The National
Information Standards Organization has worked on standards for
the construction of periodicals (Z39.1), for headers on
microfiche (Z39.32), and for manufacturer's labels on CD-ROMs
(Z39.68). At its 1993 Midwinter meeting, MARBI, a joint
committee of the American Library Association that concerns
itself with machine readable bibliographic information,
considered Proposal 93-9 about file label specifications for
machine-readable catalog (MARC) records sent according to the
File Transfer Protocol (FTP). When one sends a FAX, one fills
out an accompanying template to send along as an identifier. The
need for a chief source for Internet objects is plain.
Catalogers need a plan of action for convincing the producers of
these objects to provide a chief source for every object on the
Internet.
The Internet Engineering Task Force (IETF) is a group of
engineers, many of whom seem also to have worked on Z39.50
implementations. Clifford
Lynch, in a paper he presented to the IETF last March, said that
two groups had been working on three main problems associated
with accessing networked information. These three problems are
identification, location, and description. One group, made up of
electronic engineers and developers, has focused on structures
that can be used to identify and locate networked objects. The
other group, which he commends for its testing of AACR2R, is
the library community, by which he means the Library of Congress,
MARBI, and OCLC.
The IETF is trying to develop standards that distinguish
between identification and location. The URN
(Universal/Unique/Uniform Resource Number) is meant to identify
an object uniquely by its content. Unfortunately, neither the
library community nor the IETF has, as yet, reached a consensus on
what constitutes unique object content. Is it one that is bit-for-bit
different from any other object, or can an object's identity
transcend delible manipulation? Does the WriteNow version of a
file differ enough from the ASCII version to require a separate
identifier?
URNs are readily compared to ISBNs and ISSNs. As cited in
MARBI's Discussion Paper 68 (A 007 Physical Description Fixed
Field for Computer Files), the International Serials Data System
(ISDS) Directors feel that separate ISSNs are needed for serials
published in different media. We need a level of consensus as to
the uniqueness or nonuniqueness of a medium that ISDS Directors,
IETF engineers, and struggling Internetworkers can agree to. To
get there, we have to sort through much complexity and we may
have to shatter a lot of tradition. There is nothing traditional
about Internet objects. Is PKZIP, a program that compresses
files, a medium? Does tarring a file make it different from an
untarred manifestation? (Tarring uses the UNIX tar utility to
bundle a group of files into a single archive, which is then
often compressed as one unit rather than file by file.)
Questioning the
identity of various manifestations of computer files complements
another semantic debate about whether one can catalog a database
or only the implementation of a database. Arguments have been
made that since the database is never available except through
its implementation (GEAC, NOTIS, etc.), one has no choice
but to catalog the implementation. Counter-arguments have been
made that cataloging the Platonic form of a database once will
allow an infinity of other catalog records for the
implementations to be somehow associated with that record. These
questions of identity and difference need to be resolved so that
identifiers can be constructed for unique objects. The IETF sees
identifiers as permanent. They are not substitutes for
locators.
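The two notions of identity at issue here can be made concrete with a small sketch. It is only an illustration: the hashing approach and the normalization rule are my assumptions, not anything the IETF has proposed for URNs. One function treats an object as unique if any bit differs; the other lets identity survive a change of line endings.

```python
import hashlib


def bit_identity(data: bytes) -> str:
    """Identifier that changes whenever any bit of the file changes."""
    return hashlib.sha256(data).hexdigest()


def normalized_identity(data: bytes) -> str:
    """Identifier that ignores line-ending and trailing-space differences.

    The normalization rule is an assumption made for illustration only.
    """
    text = data.decode("ascii", errors="replace")
    lines = [line.rstrip() for line in text.splitlines()]
    canonical = "\n".join(lines).encode("ascii", errors="replace")
    return hashlib.sha256(canonical).hexdigest()


# The same two lines of the poem, saved on two different platforms.
unix_copy = b"Winter is icummen in,\nLhude sing Goddamm,\n"
dos_copy = b"Winter is icummen in,\r\nLhude sing Goddamm,\r\n"

same_bits = bit_identity(unix_copy) == bit_identity(dos_copy)
same_content = normalized_identity(unix_copy) == normalized_identity(dos_copy)
print(same_bits, same_content)
```

Under the first rule the two copies are different objects; under the second they are one. Which rule an identifier scheme should adopt is exactly the question that remains unsettled.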
The IETF locator is often referred to as the URL
(Universal/Unique/Uniform Resource Locator). The URL is still
in development, but it is safe to say that it will probably not
be much different from the kind of network address one is used to
seeing. Unlike the URN, however, the URL is not necessarily
complete or permanent with regard to any particular object. Objects
can be moved away from and into the space once occupied by
another object. An object may reside at multiple locations. The
syntax for FTP type objects is fairly straightforward: service
identifier (such as TELNET, FTP, etc.) followed by a protocol to
be used to retrieve particular objects. Some kind of registry
service is envisioned that will keep service protocols
standardized.
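The separation of a permanent identifier from its changeable locations might be sketched as follows. Everything here is hypothetical: the registry, the identifier string, and the host names are invented for illustration, and the locator strings follow the "service://host/path" form, which I assume only for the sake of the example since the syntax is still being worked out.

```python
from urllib.parse import urlsplit

# Hypothetical registry: one permanent identifier, many current locations.
REGISTRY = {
    "urn:example:orange-pound-cake": [
        "ftp://ftp.example.org/pub/recipes/pound-cake.txt",
        "gopher://gopher.example.edu/recipes/pound-cake",
    ],
}


def resolve(identifier):
    """Return (service, host, path) for each known location of an object."""
    locations = []
    for url in REGISTRY.get(identifier, []):
        parts = urlsplit(url)
        locations.append((parts.scheme, parts.hostname, parts.path))
    return locations


def relocate(identifier, old_url, new_url):
    """Record that a copy has moved; the identifier itself never changes."""
    urls = REGISTRY[identifier]
    urls[urls.index(old_url)] = new_url
```

A catalog record that stores only the identifier stays valid when `relocate` is called; a record that stores a locator does not, which is the practical reason for keeping the two apart.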
Lynch sees the library community's foray into locator
structures, like the USMARC 856 field and the NOTIS system's A22
field, as a transitional development that mirrors some
encoding problems encountered in the IETF's own early proposals
for locator syntax.
to retrieve or operate on objects via a 'server' program]. Other
extraneous information about an object (its size, data format,
authorization details, etc.) may change with time and
shouldn't be part of the name. One might expect such information
to be part of the 'header'... and for the header to be able to be
retrieved independently of the object." (Lynch 1993, 4)
Lynch encourages the library community to explore the
problems of description (he calls it "content") in a more
fundamental way (Lynch 1993, 12). As a preliminary step, let
us focus the discussion of description on the problem of the
chief source. AACR2R's rule 9.0B1 says, "The chief source of
information for computer files is the title screen(s). If there
is no title screen, take the information from other formally
presented internal evidence (e.g. main menus, program
statements)." AACR2R recognizes that all the information may not
be "available" to the cataloger because she doesn't have
appropriate machines or software, and it makes broad provision
for this circumstance until in the end, if necessary, a cataloger
can use just about any source to catalog a computer file.
However, according to 9.1B3, catalogers are not supposed to use
the filename or data set as the title proper, unless this is the
only name given in the chief source. The OCLC guidelines
(Dillon 1993,B:3) go on to say that to use the filename title,
not only can there be no other title on the chief source, but
that the cataloger must be incapable of supplying a useful
title.
It is within this context that a cataloger without the
capability of decoding a BinHex file might use the string
"Resource Info" as the title proper because of the filename
"resource-info-09.hqx", which she can read without acquiring the
software necessary to decode the contents of the
file itself. Then again, once she gets hold of BinHex, she
probably could see that the title screen reads simply, "Source
Info." Unless catalogers find a benevolent funding source that
can supply them with all the software and computing power they
need, it may make more sense to get rid of 9.1B3 for networked
resources and canonize the filename as title proper.
Conversational names, commercial names, and other natural
language-type names could be recorded as added titles.
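If the filename were accepted as the basis of the title proper, deriving a readable title could be mechanical. The sketch below is a heuristic of my own, not a rule from AACR2R or the OCLC guidelines; the function names and the record layout are hypothetical.

```python
import re
from pathlib import PurePosixPath


def title_from_filename(filename: str) -> str:
    """Turn a filename such as 'resource-info-09.hqx' into a display title."""
    stem = PurePosixPath(filename).stem            # drop the extension
    stem = re.sub(r"[-_.]?\d+$", "", stem)         # drop a trailing number
    words = re.split(r"[-_.]+", stem)              # split on separators
    return " ".join(word.capitalize() for word in words if word)


def make_record(filename: str, other_names=()):
    """Build a minimal record: derived title proper plus added titles."""
    return {
        "title_proper": title_from_filename(filename),
        "filename": filename,
        "added_titles": list(other_names),
    }


record = make_record("resource-info-09.hqx", ["Source Info"])
print(record["title_proper"])
```

The title screen's own wording ("Source Info") is kept as an added title, so a user searching under either form would still find the record.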
In a brilliantly argued paper, Preston and Lynch state that
unless network information sources can be "to some extent
self-describing" it is difficult to envision that descriptive
records will ever really be provided for them. Most
organizations that will supply these resources do not have the
"expertise to prepare appropriate descriptive records in the
appropriate standard interchange formats." The Library of
Congress or the various university libraries cannot be relied
upon to supply this cataloging. One alternative is for the
"suppliers of information resources to fund the creation
of...records, or for the overall user community to fund
development of such descriptions as a community benefit" (Lynch
and Preston 1992, 3). It seems unlikely that the user community
could become organized, knowledgeable, and munificent enough to
fund this development in a timely way. Benefits may accrue,
however, to the information suppliers, if they choose to develop
and fund appropriately self-referential records. For catalogers
to open a dialog with information suppliers along these lines is
professionally responsible. It is professionally responsible to
start defining descriptive parameters now, so that creators of
Internet objects can easily and consistently invoke them in the
resources they release. Ideally, the descriptive data embedded
in the records themselves would be protocol-independent. The
data should slip into an object available via FTP with no more
difficulty than they reside in an object available via Z39.50.
Engineers may not be able to do everything, but they surely can,
with cataloger support, do this.
Because the name is bound up with the identification and
the identification is bound up with the location, and these
three topics are the proper pursuit of engineers, there is some
hope that an elegant, standard solution to names and locations
and identifications will become available. We, on the other
hand, who come to the problem from AACR2R instead of from our
compilers, have deep, unmet needs for some indication of whether
a record should cite other editions or works or whether it
should be appended to something else as a version, or whether it
should be classed with something else. We need to know
whether more will come or if the item is complete. We also need
to have subject headings and authority work, but these are all
topics for another paper.
It is not only the time to list what we need in these
records. It is also time to list what we must omit. What
happens if we, who are the inheritors of the Library of Congress
Rule Interpretations, in all their Mandarin ornateness, are too
unused to an unadorned, engineered elegance to work toward an
object describing itself? Can we continue to apply rules written
for an item-in-hand situation (AACR2R Rule 0.24) to a space where
the same item is not the item when it is not remote?
MARBI's Discussion paper 69 ("Accommodating Online
Systems and Services in USMARC") says, "As further work is done
on directory services, it may be possible to establish a
mechanism for using existing directory services to keep USMARC
records up-to-date. For example the InterNIC Information
Services in San Diego provides a template for systems to fill in
and thus be registered in the directory service." (MARBI 69, 8)
While we're working to establish all the data elements we need
for networked resources, and we're talking with the engineers who
are creating headers drawn from marked-up text data, why not
examine the template that this company and others like it have
put together? Engineers and catalogers could learn something
from the business community.
One of the problems that was encountered back when people
tried to teach machines to catalog print materials without
professional catalogers as mediators was the fact that the
machine spoke machine language and the print material was mute.
The print material didn't flag its title. The creators of its
title page were layout artists, whose goal was not
standardization. Networked objects, on the other hand, are
written in machine language. With a few good guidelines, we
could have a title positioned in the same place or marked the
same way every time. With a good template, we could find
creators of Internet objects willing to inscribe themselves into
the header. The size, version, and up-to-dateness of an object
could be extracted from the object itself. What would the
payoffs be for indexing and abstracting businesses, or for
academia, or for the government? How much access can users
afford?
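What a self-describing object might look like can be sketched briefly. The field names (Title, Creator, Version, Last-Modified) are hypothetical, chosen only to show the idea of a title "marked the same way every time"; they are not taken from any existing template, and the sample values are borrowed from Example 1. The sketch assumes a header laid out like an electronic mail header, which Python's standard email parser can read.

```python
from email import message_from_string

# A hypothetical self-describing object: labeled header, blank line, content.
SAMPLE_OBJECT = """\
Title: Orange Pound Cake
Creator: Mamaliz
Version: 1
Last-Modified: 19 May 1986

Absolutely the best cake I have ever eaten!
"""


def describe(obj_text: str) -> dict:
    """Extract a minimal description from the object's own header."""
    msg = message_from_string(obj_text)
    body = msg.get_payload()
    return {
        "title": msg["Title"],                # None if the creator omitted it
        "creator": msg["Creator"],
        "version": msg["Version"],
        "last_modified": msg["Last-Modified"],
        "size_bytes": len(body.encode("utf-8")),  # measured, not declared
    }


description = describe(SAMPLE_OBJECT)
print(description["title"], description["size_bytes"])
```

The creator supplies only what a machine cannot know (title, authorship, version), while the size is measured from the object itself. A cataloger's judgment is still needed for everything the header leaves out.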
The Internet is a nonstatic space that is host to a
variety of information objects. Cataloging rules were not
drafted with these objects in mind, and it is difficult to apply
them. There has been some work done by computer scientists to
name, locate, and describe these objects in machine-driven ways.
Librarians can advance their profession by helping to build
bridges between the technical, economic, and service issues
surrounding access to networked objects. We should actively work
to dispel the frustrating idea that human catalogers can ever
seize the time, find the funding, or create the tools to handle
the Internet all by themselves.
MC JOURNAL: THE JOURNAL OF ACADEMIC MEDIA LIBRARIANSHIP
Vol. 1 #2
Fall 1993
ISSN 1069-6792
October 27, 1993