CATALOGING THE INTERNET

BY JUDITH M. BRUGGER

INTRODUCTION

In this paper I would like to do four things. First, I will present the reader with a brief survey of what can be found on the Internet. Then I will touch on some problems these objects present to the cataloger. Next, I will discuss some developments in the engineering segment of information science. Finally, I will imagine what a working partnership between engineers, librarians, and business might mean for access to networked objects.

THE PROBLEM OF GENERA

I first became aware of networked objects by volunteering to participate in OCLC's (the Online Computer Library Center's) ground-breaking attempt to assess the fit between the Anglo-American Cataloguing Rules, 2nd ed., 1988 revision (AACR2R) and the kind of computer file one comes across while surfing the Internet. In OCLC's published report of its experiment, Assessing Information on the Internet (Dillon, p. 20), there is a list of generic terms for the types of electronic objects that can be found there. This list includes:

system source news text PC data images games executable files unknown

To the relief of the participants, the OCLC experiment focused on the approximately 10% of this universe represented by text files, although other types of files were also included. Certainly text files are the objects most nearly equivalent to the majority of objects cataloged under AACR2R. But it is equally certain that AACR2R is not well enough equipped to handle many other computer file genera.

Networked sites, for example, are full of unfamiliar and evolving species, unlabeled hybrids. Directories of addresses are peppered with information on course listings at unspecified institutions. These entries then nestle up against entries for what look like title listings for actual books in actual libraries.
Spelunking the Internet, one imagines that these ill-defined and poorly organized files are a somewhat strange, computerized, sedimentary paleolith, because the components do not naturally form part of one another.

There are lots of fragments on the Internet. There are, for example, recipes, with a provenance as layered and ghostlike to a novice Internetworker as their copyright notices are prominent and their punctuation odd.

Example 1:

.ig
Path: decwrl!recipes
From: liz@unirot (Mamaliz)
Newsgroups: mod.recipes
Subject: Recipe: Orange Pound Cake
Message-ID: <4241@decwrl.DEC.COM>
Date: 18 Jul 86 03:42:03 GMT
Sender: recipes @ decwrl.DEC.COM
Organization: The Soup Kitchen, Edison NJ
Lines: 70
Approved: reid@decwrl.UUCP

Copyright (C) 1986 USENET Community Trust
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the USENET copyright notice and the title of the newsgroup and its date appear, and notice is given that copying is by permission of the USENET Community Trust or the original contributor.
.RH MOD.RECIPES-SOURCE POUND CAKE-1 D "19 May 86" 1986
.RZ "ORANGE POUND CAKE" "A luscious orange-flavored pound cake"
Absolutely the best cake I have ever eaten! I got the recipe from the mother of a friend. I think Don's mom got the recipe off the back of a sugar box.
.IH "1 cake"
.IG "1 lb" "butter" "450g"
.IG "1 lb" "powdered sugar" "450 g"
.IG "2 Tbsp" "grated orange rind" "30 ml"
.IG "6" "large eggs"
.IG "3\(12 cups" "sifted all purpose flour" "350 g"
.IG "\(14 tsp" "salt" "1 ml"
...

There are also instructions for mounting, running, and maintaining both services and executable programs that one does not always have or know how to get. Another kind of material is the networked equivalent of course-packs: a professor selects portions of other texts and presents them as a discrete package.
There are files whose only title is on the menu through which they're accessed and files that consist of frequently asked questions (FAQs) on a variety of themes: sports trivia, veronica services, AIDS facts, and the Cleveland FreeNet.

Finding what appears to be an excerpt from a newspaper, for which there is no attribution, no date, and no raison d'etre, is not uncommon. All the context this one is given is that Bert Dalmer wrote it and that it appears on "pg. 28."

Example 2:

pg. 28. Sluggers bats pound Irish as Sinak slams two homers.
Sports story by Bert Dalmer, pg. 28

It seems the Illini just aren't happy with a .297 team batting average. Forry Wells and the Illinois baseball team continued to tear up their opponents with their hitting, shelling Notre Dame's top two pitchers Tuesday night in an 11-4 win in South Bend, Ind. As if Tom Sinak's two home runs were not enough, Wells, who leads the team with a .421 average, put the exclamation point on the night with a ninth-inning grand slam. The Illini finished the night with 15 hits.
...

Single poems can also be found. This poem lacks not only the computer-ese introduction that came with the recipe, but also the slightest attempt to situate it within print, as did the story about the Illini. On the other hand, the author's name is still included.

Example 3:

Winter is icummen in,
Lhude sing Goddamm,
Raineth drop and staineth slop,
and how the wind doth ramm,
Sing: Goddamm.
Skiddeth bus and sloppeth us,
An ague hath my ham.
Freezeth river, turneth liver,
Damn you, sing: Goddamm.
Goddamm, Goddamm, 'tis why I am, Goddamm.
So 'gainst the winter's balm.
Sing goddamm, damm, sing Goddamm,
Sing goddamm, sing goddamm, DAMM.
-Ezra Pound

Because literature is something we catalog, librarians might take some comfort in knowing that poems can be usefully identified by their first lines.
Making that extra effort might mean consulting Granger's Index to Poetry or an anthology of this poet's work to find out whether the poem has a title. This one does, after all, have one, even if it wasn't included here. Could it be possible that the record for this poem would be improved by a uniform title? This line of questioning will be developed later in the paper, not as an inquiry into poems, and not merely as an inquiry into titles, but as an inquiry into the relationship between traditional cataloging rules and the uncontrolled material one finds on the Internet. Would it be useful to index this material by its first lines? Are we going to see a return to the library of the incipits? Should we be cataloging snippets? AACR2R is basically a tool for the entirety and is not very good at snippets.

The Internet is not just texts and hypertexts and text fragments; it's weather maps and other (potentially compound) multimedia objects. It is journal illustrations, and soon it will even be feature films. According to the 5/24/93 New York Times (curiously enough, this citation comes from a 5/13/93 edition of EDUPAGE), the first Internet cult movie, "Wax: Or the Discovery of Television Among the Bees," was successfully digitized into the Internet from a mid-Manhattan video recording studio on an unspecified Saturday. Although it was only transmitted at 2 frames per second, the experiment was considered a success, and more movies are in the works. "Oh good," catalogers might say. "We have chapters on maps and movies in AACR2R."

No one will be surprised to learn that cataloging Internet objects in the OCLC experiment was exceptionally difficult. A common difficulty encountered when cataloging a monograph might be, "How do I properly construct my series statement?" When cataloging an Internet object, however, the cataloger is most often challenged even before she begins a transcription of the elements of the description.
The first challenge is to know and be able to name in a standardized way, "What is this?"

Example 4:

ABSTRACT 92 A1 V 8192 Trunc=8192 Size =18 Line=9 Col=1 Alt=0
|...+.....1......+......2......+......3......+......4......+..... .5......+.. ....6......+
===***Top of File***
===###<&///|||///&&##>>>>\\\\########||||||||//////&&&
===%%%$###++++++&&//////||||||||%%%%#####||||||??**//
===***End of File***

What are the specific material designators that apply to an object like this? Is there a chapter in AACR2R that covers it? Because definitions for the generic terms used by OCLC were not published in "Assessing Information on the Internet", there was some discussion on Autocat, the cataloging and authorities listserv, in early 1993 that questioned the actual distinctions between system files and source files, between executable files and games, between data files and every other kind of file except image and audio, and so forth. Surely we do need to do a better job defining and entitling these generic forms. However, even within the genus that we are most comfortable with, the text file, the range of possibility on the Internet is enormous, as has been illustrated above.

CATALOGING'S CONCERNS

One of the reasons that Internet material is so misshapen is that it has not been subjected to the rigors of publishing. This launches a vicious cycle of indescribability. No attempt has been made to control most of this material because the material is ephemeral, or it is too poorly put together to afford its would-be describers any handles. Catalogers continue to resist drafting viable descriptive conventions for this material because it is too slippery to be generalized about. The material continues to be released without any standard with which to compare itself.

The real (non-virtual), public appearance of commercially available information has long since been shaped to meet the consumer expectations of users.
These expectations generally include title pages, colophons, summaries, accompanying manuals, and even statements of responsibility, just like books. Descriptive conventions for Internet objects are not very well developed because the objects themselves are created and released in uncontrolled and unconventional ways.

Networked government documents or technical reports or serials, like their print equivalents, can be difficult to catalog. They are, however, something familiar, something to which entire chapters of AACR2R are devoted. They are, after all, serials; or they're functioning like monographs or like pamphlets, or like graphic materials. When they are seen as merely the electronic version of something we know about on paper, the challenges that these things pose to the cataloger do not seem to thwart the basic cataloging paradigm the way that other kinds of Internet objects do.

The challenge of networked objects is in the essential mutability of virtual reality, the chameleon formatting, the effortless changes in location, the easy effacement of authorship, the transparent refreshment to accommodate newer platforms, the pointer poised from within another document that makes the original object a part of a larger whole.

When one contemplates the current state of the Net, questions are inevitable: Will the networked versions of these objects present us with chief sources we can really work from?

Example 5:

; f r o m uakari!indri!ames!apple!rutgers!aramis.rutgers.edu!athos.rutgers.edu!mende Tue Jul 18 05:54:12 PDT 1989
;Article 69 of comp.emacs:
; p a t h :arkl!uakari!indri!ames!apple!rutgers!aramis.rutgers.edu!athos.rutgers.edu!mende
;>From:mende@athos.rutgers.edu (Bob Mende)
;Newsgroups: gnu.emacs,comp.emacs,alt.sex
;Subject: purity.el (part 1)
;Message-ID:
;Date: 18 jul 89 04:00:08 GMT
;Organization: Rutgers Univ., New Brunswick, N.J.
;Lines: 603
;Xref: ark1 gnu.emacs:44 comp.emacs:69
;
;Since I have had over 100 requests for this, I am posting it....enjoy.
;please replace the following three characters
;
; with a real delete
; with a real ctrl-c
; with a real ctrl-s
;
;;
;;Purity.el Emacs lisp program to administer the purity test.
;; Robert Mende (mende@aramis.rutgers.edu)
;; 5/5/89
;;
;;This file is not officially part of GNU Emacs, but can be if FSF wishes
;;it to be so. Distributed under the GNU copyleft
;;GNU Emacs is distributed in the hope that it will be useful
;;but without any warranty. No author or distributor
;;accepts responsibility to anyone for the consequence of using it
;;or for whether it serves any particular purpose or works at all,
;;unless he says so in writing. Refer to the GNU Emacs General Public
;;License for full details.
...

This was not an easy one to work from. It is doubtful that "purity.el," a questionnaire about sexual experience, exists in monograph form somewhere, but let us pretend that it does. Its title page surely would not resemble this electronic title screen.

On the other hand, if Internet chief sources are only the electronic equivalents of what would have appeared on a print version, would they be enough? Probably not. One still wants to know the formatting history and something about any editorial changes that might have been made. The interconnectivity status of the object should be clear. It is possible, after all, that networked status changes the nature of an object in ways parallel to the subtle and important ways that the presence of an observer changes the nature of data observed, as physicists and anthropologists have known for decades. A networked computer file is different from a non-networked computer file, which is in turn different from a print item. It is more and it demands more description. What are the elements we want to include in the chief source of our users' dreams? That is part of what needs to be worked out.
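One way to begin working this out is to see which descriptive elements a machine can already lift from the postings shown in Examples 1 and 5. The sketch below is only an illustration, not a proposed standard: it uses Python's standard email header parser on a few header lines copied from Example 1, and the mapping from header fields to descriptive elements (CHIEF_SOURCE_FIELDS) is a hypothetical one invented here for the purpose.

```python
from email.parser import HeaderParser

# Header lines copied from Example 1 (the recipe posting), followed by
# a blank line and the start of the body.
ARTICLE = """\
Path: decwrl!recipes
From: liz@unirot (Mamaliz)
Newsgroups: mod.recipes
Subject: Recipe: Orange Pound Cake
Message-ID: <4241@decwrl.DEC.COM>
Date: 18 Jul 86 03:42:03 GMT
Organization: The Soup Kitchen, Edison NJ
Lines: 70

Absolutely the best cake I have ever eaten!
"""

# Hypothetical mapping from posting headers to descriptive elements.
# The element names are assumptions, not AACR2R or USMARC terms.
CHIEF_SOURCE_FIELDS = {
    "Subject": "title",
    "From": "statement_of_responsibility",
    "Date": "date",
    "Organization": "issuing_body",
    "Message-ID": "identifier",
}


def describe(article_text):
    """Collect whatever descriptive elements the header block supplies."""
    headers = HeaderParser().parsestr(article_text)
    record = {}
    for field, element in CHIEF_SOURCE_FIELDS.items():
        value = headers.get(field)
        if value is not None:
            record[element] = value.strip()
    return record


record = describe(ARTICLE)
for element, value in record.items():
    print(f"{element}: {value}")
```

An object with no header block, like the poem in Example 3, yields an empty record, which is precisely the situation in which the cataloger has no chief source to work from.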
Appropriate labeling has attracted the attention of some very important standards-developing bodies. The National Information Standards Organization has worked on standards for the construction of periodicals (Z39.1), for headers on microfiche (Z39.32), and for manufacturer's labels on CD-ROMs (Z39.68). At its 1993 Midwinter meeting, MARBI, a joint committee of the American Library Association that concerns itself with machine-readable bibliographic information, considered Proposal 93-9 about file label specifications for machine-readable catalog (MARC) records sent according to the File Transfer Protocol (FTP). When one sends a FAX, one fills out an accompanying template to send along as an identifier. The need for a chief source for Internet objects is plain. Catalogers need a plan of action for convincing the producers of these objects to provide a chief source for every object on the Internet.

WELCOME THE ENGINEERS

The Internet Engineering Task Force (IETF) is a group of engineers, many of whom seem also to have worked on Z39.50 implementations. Clifford Lynch, in a paper he presented to the IETF last March, said that two groups had been working on three main problems associated with accessing networked information. These three problems are identification, location, and description. One group, a group of electronic engineers and developers, has focused on structures that can be used to identify and locate networked objects. The other group, which he commends for their testing of AACR2R, is the library community, by which he means the Library of Congress, MARBI, and OCLC.

The IETF is trying to develop standards that distinguish between identification and location. The URN (Universal/Unique/Uniform Resource Number) is meant to identify an object uniquely by its content. Unfortunately, neither the library community nor the IETF has, as yet, a consensus on what constitutes unique object content.
Is it one that is bit-for-bit different from any other object, or can an object's identity transcend delible manipulation? Does the WriteNow version of a file differ enough from the ASCII version to require a separate identifier?

URNs are readily compared to ISBN and ISSN. As cited in MARBI's Discussion Paper 68 ("A 007 Physical Description Fixed Field for Computer Files"), the International Serials Data System (ISDS) Directors feel that separate ISSN are needed for serials published in different media. We need a level of consensus as to the uniqueness or non-uniqueness of a medium that ISDS Directors, IETF engineers, and struggling Internetworkers can agree to. To get there, we have to sort through much complexity and we may have to shatter a lot of tradition. There is nothing traditional about Internet objects. Is PKZIP, a program that compresses files, a medium? Does tarring a file make it different from an untarred manifestation? (Tarring uses a UNIX utility to bundle a group of files into a single archive, which is often then compressed as a unit, rather than handling the files as separates.)

Questioning the identity of various manifestations of computer files complements another semantic debate about whether one can catalog a database or only the implementation of a database. Arguments have been made that since the database is never available except through its implementation (GEAC, NOTIS, etc.), one has no choice but to catalog the implementation. Counter-arguments have been made that cataloging the Platonic form of a database once will allow an infinity of other catalog records for the implementations to be somehow associated with that record. These questions of identity and difference need to be resolved so that identifiers can be constructed for unique objects.

The IETF sees identifiers as permanent. They are not substitutes for locators. The IETF locator is often referred to as the URL (Universal/Unique/Uniform Resource Locator).
The URL is still in development, but it is safe to say that it will probably not be much different from the kind of network address one is used to seeing. Unlike the URN, however, the URL is not necessarily complete or permanent with regard to any particular object. Objects can be moved away from and into the space once occupied by another object. An object may reside at multiple locations. The syntax for FTP-type objects is fairly straightforward: a service identifier (such as TELNET, FTP, etc.) followed by a protocol to be used to retrieve particular objects. Some kind of registry service is envisioned that will keep service protocols standardized.

THE CATALOGING/ENGINEERING HANDSHAKE

Lynch sees the library community's foray into locator structures, like the USMARC 856 field and the NOTIS system's A22 field, as a transitional development that mirrors some encoding problems encountered in the IETF's own early proposals for locator syntax (Lynch 1993, 9). Hard questions are now being asked about locator structures for networked objects, many of them left unanswered for the present: Is a locator structure really the place for file size or compression information? Although each file should surely have facts like this written into itself so that they can be referred to by users, this type of information is not intrinsically part of an address or a location at all. Should the locator structure be self-referential; should it be human readable? What is the level of granularity that a locator structure needs to accommodate? Title level? Article level? Paragraph level?

To locate an object, users will need, at a minimum, its host, path, and name. Other things that users may want the object to inform them about before they go to the trouble of retrieving it are: information on the last update time, the number of links to that item in a gopher network, the names of veronica servers referencing the item, and so forth.
These are valuable data indeed, but not as part of an artificially cluttered locator.

In discussing the idea of a name, which to a cataloger reads more like the idea of a title, Tim Berners-Lee of the IETF URL Working Group says that "The life of a name is limited by any information contained within it that may become prematurely invalid. It is therefore necessary to limit the contents of a name to the information required for [allowing a 'client' program to retrieve or operate on objects via a 'server' program]. Other extraneous information about an object (its size, data format, authorization details, etc.) may change with time and shouldn't be part of the name. One might expect such information to be part of the 'header'... and for the header to be able to be retrieved independently of the object." (Lynch 1993, 4)

Lynch encourages the library community to explore the problems of description (he calls it "content") in a more fundamental way (Lynch 1993, 12). As a preliminary step, let us focus the discussion of description on the problem of the chief source. AACR2R's rule 9.0B1 says, "The chief source of information for computer files is the title screen(s). If there is no title screen, take the information from other formally presented internal evidence (e.g. main menus, program statements)." AACR2R recognizes that all the information may not be "available" to the cataloger because she doesn't have appropriate machines or software, and it makes broad provision for this circumstance until in the end, if necessary, a cataloger can use just about any source to catalog a computer file.

However, according to 9.1B3, catalogers are not supposed to use the filename or data set as the title proper, unless this is the only name given in the chief source. The OCLC guidelines (Dillon 1993, B:3) go on to say that to use the filename title, not only can there be no other title on the chief source, but that the cataloger must be incapable of supplying a useful title.
It is within this context that a cataloger without the capability of uncompressing a hex file might use the string "Resource Info" as the title proper because of the filename "resource-info-09.hqx", which she can read without acquiring the software necessary to interpret the hexadecimal characters of the file itself. Then again, after she gets hold of BinHex, she probably could see that the title screen reads simply, "Source Info." Unless catalogers find a benevolent funding source that can supply them with all the software and computing power they need, it may make more sense to get rid of 9.1B3 for networked resources and canonize the filename as title proper. Conversational names, commercial names, and other natural language-type names could be recorded as added titles.

In a brilliantly argued paper, Preston and Lynch state that unless network information sources can be "to some extent self-describing" it is difficult to envision that descriptive records will ever really be provided for them. Most organizations that will supply these resources do not have the "expertise to prepare appropriate descriptive records in the appropriate standard interchange formats." The Library of Congress or the various university libraries cannot be relied upon to supply this cataloging. One alternative is for the "suppliers of information resources to fund the creation of...records, or for the overall user community to fund development of such descriptions as a community benefit" (Lynch and Preston 1992, 3).

It seems unlikely that the user community could become organized, knowledgeable, and munificent enough to fund this development in a timely way. Benefits may accrue, however, to the information suppliers, if they choose to develop and fund appropriately self-referential records. For catalogers to dialog with information suppliers along these lines is professionally responsible.
It is professionally responsible to start defining descriptive parameters now, so that creators of Internet objects can easily and consistently invoke them in the resources they release. Ideally, the descriptive data embedded in the records themselves would be protocol-independent. The data should slip into an object available via FTP with no more difficulty than they reside in an object available via Z39.50. Engineers may not be able to do everything, but they surely can, with cataloger support, do this.

Because the name is bound up with the identification and the identification is bound up with the location, and these three topics are the proper pursuit of engineers, there is some hope that an elegant, standard solution to names and locations and identifications will become available. We, on the other hand, who come to the problem from AACR2R instead of from our compilers, have deep, unmet needs for some indication of whether a record should cite other editions or works or whether it should be appended to something else as a version, or whether it should be classed with something else. We need to know whether more will come or if the item is complete. We also need to have subject headings and authority work, but these are all topics for another paper.

It is not only the time to list what we need in these records. It is also time to list what we must omit. What happens if we, who are the inheritors of the Library of Congress Rule Interpretations, in all their Mandarin ornateness, are too unused to an unadorned, engineered elegance to work toward an object describing itself? Can we continue to apply rules written for an item-in-hand situation (AACR2R Rule 0.24) to a space where the same item is not the item when it is not remote?
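The division of labor argued for in this section, a lean locator on one side and a self-describing object on the other, can be made concrete with a small sketch. Everything in it is assumed for illustration only: the Locator fields follow the "host, path, and name" minimum mentioned above plus a service identifier, while the header layout, the HEADER_END delimiter, the field names, and the example.org address are hypothetical and do not come from any IETF draft or MARBI proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    """Only what is needed to fetch the object: no size, format, or dates."""
    service: str
    host: str
    path: str
    name: str


# Hypothetical marker that closes the embedded descriptive header.
HEADER_END = "--end-header--"


def embed_header(fields, body):
    """Write descriptive fields into the object itself, ahead of its content.

    Field values are assumed to be single-line strings.
    """
    lines = [f"{key}: {value}" for key, value in fields.items()]
    return "\n".join(lines + [HEADER_END, body])


def read_header(obj):
    """Recover the descriptive fields from an object, however it was fetched."""
    fields = {}
    for line in obj.splitlines():
        if line == HEADER_END:
            return fields
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return {}  # no delimiter found: the object does not describe itself


# Illustrative values only; the title and date echo Example 1.
where = Locator("ftp", "example.org", "/pub/recipes", "pound-cake.txt")
obj = embed_header(
    {"Title": "Orange Pound Cake", "Date": "19 May 86", "Version": "1"},
    "Absolutely the best cake I have ever eaten!",
)
print(where)
print(read_header(obj))
```

Because read_header works on the content of the object alone, the same description is available whether the object arrived by FTP, by Z39.50, or by some other route, and moving the object changes only the Locator.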
BRIGHT SPOT ON THE HORIZON

MARBI's Discussion Paper 69 ("Accommodating Online Systems and Services in USMARC") says, "As further work is done on directory services, it may be possible to establish a mechanism for using existing directory services to keep USMARC records up-to-date. For example the InterNIC Information Services in San Diego provides a template for systems to fill in and thus be registered in the directory service" (MARBI 69, 8). While we're working to establish all the data elements we need for networked resources, and we're talking with the engineers who are creating headers drawn from marked-up text data, why not examine the template that this company and others like it have put together? Engineers and catalogers could learn something from the business community.

One of the problems that was encountered back when people tried to teach machines to catalog print materials without professional catalogers as mediators was the fact that the machine spoke machine language and the print material was mute. The print material didn't flag its title. The creators of its title page were layout artists, whose goal was not standardization. Networked objects, on the other hand, are written in machine language. With a few good guidelines, we could have a title positioned in the same place or marked the same way every time. With a good template, we could find creators of Internet objects willing to inscribe themselves into the header. The size, version, and up-to-dateness of an object could be extracted from the object itself. What would the payoffs be for indexing and abstracting businesses, or for academia, or for the government? How much access can users afford?

CONCLUSION

The Internet is a non-static space that is host to a variety of information objects. Cataloging rules were not drafted with these objects in mind, and it is difficult to apply them.
There has been some work done by computer scientists to name, locate, and describe these objects in machine-driven ways. Librarians can advance their profession by helping to build bridges between the technical, economic, and service issues surrounding access to networked objects. We should actively work to dispel the frustrating idea that human catalogers can ever seize the time, find the funding, or create the tools to handle the Internet all by themselves.

REFERENCES

ALCTS/LITA/RASD. MARBI. [1992?] "Accommodating Online Systems and Services in USMARC." Discussion Paper 69. Photocopy.

ALCTS/LITA/RASD. MARBI. [1993?] "A 007 Physical Description Fixed Field for Computer Files." Discussion Paper 68. Photocopy.

Anglo-American Cataloguing Rules. 1988. 2nd ed. Chicago: American Library Association.

Berners-Lee, Tim. 1993. "Uniform Resource Locators." Internet Draft, IETF URL Working Group.

Dillon, et al. 1993. Assessing Information on the Internet: Library Services for Computer-mediated Communication. Dublin, Ohio: OCLC, Office of Research.

Lynch, Clifford A. 1993. "A Framework for Identifying, Locating, and Describing Networked Information Resources." Draft for Discussion at March-April 1993 IETF Meeting.

Lynch, Clifford A. and Cecilia M. Preston. 1992. "Describing and Classifying Networked Information Resources." Preprint. Electronic Networking: Research, Applications and Policy.

Judith M. Brugger is Catalog Management and Authorities Librarian, 107 B Olin Library, Cornell University, Ithaca, NY 14850

MC JOURNAL: THE JOURNAL OF ACADEMIC MEDIA LIBRARIANSHIP
Vol. 1 #2 Fall 1993
ISSN 1069-6792
October 27, 1993