Repository Wishlist

Just over two weeks ago I asked on JISC’s “JISC-REPOSITORIES” list what features users of repositories felt were missing from their software solutions. I got twelve responses, which is probably a fairly small percentage of the people who “read” JISC-REPOSITORIES, so you’ll have to consider for yourself whether the results are relevant or not. As an aside, only one reply came from someone you wouldn’t describe as being a repository manager. This got me thinking that it would be good to ask the same questions of academic users – the scholars whose work we’re trying to promote the communication of – to see what differences there are. Trouble is, how do you reach these people?

Anyways, that is beside the point. You’re here to hear all about the replies, right? It would be fair to say there was a Gold, Silver and Bronze, and Gold was way ahead of the competition!

The Wishlist Winners:

  1. Statistics and reporting
  2. Better item/metadata management
  3. Automatic generation of bibliography pages/CVs

Let’s expand on those:

Statistics and Reporting

Far and away the most popular request was for better support for and handling of statistics. Several respondents gave examples of the types of queries they’d like to be able to ask of the repository:

“Give me all my papers deposited in the last 6 months”

“Give me all my papers published in the last 6 months”

“How many people have followed the DOI to the publisher full-text?”

“What are the top 10 items being used in our repository?”

“Give me all the items in the repository funded under grant number x1773 last year”

“Give me a count of all the published items in the repository available as full-text”

etc.

It is clear that people are really after being able to pull out whatever they want from the system. A Library system I once worked with gave users a simplified SQL form to produce reports – a bit like Microsoft Access’ query wizard and that could be a useful feature for repository vendors to consider.
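To make that a little more concrete, here is a minimal sketch of how a few of the example questions above might map onto SQL. Everything in it is an assumption for illustration: the single "items" table, its column names and the demo rows are made up, and no real repository package stores its data this way. It uses Python's built-in sqlite3 module purely because it needs no setup.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical schema: one row per deposited item. All names are assumptions.
SCHEMA = """
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    author TEXT NOT NULL,
    deposited TEXT NOT NULL,       -- ISO date string, e.g. 2008-07-10
    grant_number TEXT,
    has_fulltext INTEGER NOT NULL  -- 1 if a full-text file is attached
)
"""

def build_demo_db():
    """Create an in-memory database with three made-up items."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO items (title, author, deposited, grant_number, has_fulltext) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            ("Paper A", "smith", "2008-07-10", "x1773", 1),
            ("Paper B", "smith", "2007-11-02", "x1773", 0),
            ("Paper C", "jones", "2008-05-20", None, 1),
        ],
    )
    return conn

def deposited_since(conn, author, today, days=182):
    """'Give me all my papers deposited in the last 6 months' (roughly 182 days)."""
    cutoff = (today - timedelta(days=days)).isoformat()
    cur = conn.execute(
        "SELECT title FROM items WHERE author = ? AND deposited >= ? ORDER BY deposited",
        (author, cutoff),
    )
    return [row[0] for row in cur]

def fulltext_count(conn):
    """'Give me a count of all the items available as full-text'."""
    return conn.execute("SELECT COUNT(*) FROM items WHERE has_fulltext = 1").fetchone()[0]

def by_grant_in_year(conn, grant_number, year):
    """'Give me all the items funded under grant number X in a given year'."""
    cur = conn.execute(
        "SELECT title FROM items WHERE grant_number = ? AND substr(deposited, 1, 4) = ? "
        "ORDER BY title",
        (grant_number, str(year)),
    )
    return [row[0] for row in cur]
```

A query-wizard style form would, in effect, be filling in the WHERE clauses of queries like these on the user's behalf; the hard part is agreeing what the columns are, not writing the SQL.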

Various outputs were requested – or implied – so of course the reports need to be available as RSS, as email (and users should be able to subscribe to and automate the running of the reports they set up), and as Web pages.

The other point identified (and it isn’t a new one) was that it is difficult to compare statistics from different software packages and that some kind of standard would be useful to enable repository users to compare like with like.

One respondent mentioned ECS@Southampton’s IRStats project and that it would be nice to have it included “out of the box” for Eprints. Seems there is some potential there, but IRStats appears to be an after-the-event logfile processor and may or may not meet the needs outlined above – on-the-fly reporting by email, for example – not yet anyway. It isn’t clear to me how IRStats helps to smooth differences in stats reporting across software solutions either – I’d welcome comments from people with more experience on that.

Better item/metadata management

This is the ability to move items around – from School to School for example – or to change metadata fields across multiple records/items – which may or may not constitute a single “eprint”. That repository software still requires users to delete an item and then resubmit it seems antiquated, I’d imagine, to most repository software users, who can drag and drop files on their desktop PCs with ease. Repository and preservation people, I suspect, consider the limitation on editing an item once deposited something of a feature and you can see their point, but if repository managers find the interfaces frustrating, imagine what the academics will think! Perhaps there is scope here to separate “repository for users” and “repository for preservation” – I’ve seen architecture diagrams like that knocking around…

Automatic generation of bibliography pages/CVs

A specialised form of reporting really, but it was high in the number of requests – people requested the ability to embed the output in scholars’ Web pages – this could be “on the fly” or with a scheduled job. Closely related to this was the ability to export references to papers in any format in order to comply with a subject’s citation style preference or to comply with an institutional reporting template/CV for the academics.

The Rest of the Wishlist:

A number of other suggestions were made and here they are in no particular order:

  • Browse Filtering – for example, display all depositors at this Institution only
  • Desktop Repository/Personal Research Manager – a desktop application to help researchers manage their work, with a “deposit” button built-in – I think this is a *great* idea!
  • Automatic Coversheet Generation – with ability to design the coversheet and include arbitrary metadata from the database
  • CRIS/IR integration and “Joined-Up” thinking
  • Linkage to subject/funder repositories – the ability to push an IR deposit to a 3rd party or make it easy to manage their harvesting
  • Ability to manage multiple affiliations for depositors – would require IDs for said authors too I guess. OpenID anyone?
  • Granular permissions – the ability to add groups and assign rights to those groups. For example, these authors may edit the following documents only
  • Related to granular permissions: Full customisation of submission forms by group, user ID, etc. – presumably this is possible with Open Source offerings provided you’re willing to get your hands dirty. The customisation needs to be manageable outside of the HTML.
  • Import and Export of “packages” – where an “eprint” consists of multiple files and needs moving around. OAI-ORE?
  • Greater flexibility in defining metadata fields, etc. (Hosted services are more restrictive here)
  • Support for the storage and delivery of (very) large collections of data and multimedia items
  • Shopping basket: the ability to collect a number of items from a repository (or, indeed, many) and then click “get them” and receive a package of all the papers requested.
  • A more sophisticated content model: currently you have “Record 1->* File”, it’d be nice to have a set bit in the middle to be able to group files… SWAP?
  • Making the links of authors live – so clicking will return all that author’s papers in the repository
  • Some way of recording where work was done – ie. did the scholar produce this paper whilst employed here?
  • Item versions – the ability to explicitly and simply link one item to another and typing the version relationship
  • Better embargo management
  • Better/easier/standardised OAI-PMH output to include all the community required fields – like grant numbers – an Application Profile maybe? What could we call it? :-)
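The "set bit in the middle" of the content model is easier to see written down. The sketch below is only one possible reading of that request: the class names (Record, FileSet, File) and their fields are assumptions for illustration, and it is not meant to represent SWAP or any existing repository's data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class File:
    name: str
    size_bytes: int

@dataclass
class FileSet:
    """The hypothetical middle layer: a labelled group of files within a record."""
    label: str
    files: List[File] = field(default_factory=list)

@dataclass
class Record:
    """Record 1->* FileSet 1->* File, instead of Record 1->* File."""
    title: str
    sets: List[FileSet] = field(default_factory=list)

    def all_files(self):
        """Flatten the sets back into the familiar 'Record 1->* File' view."""
        return [f for s in self.sets for f in s.files]

    def total_size(self):
        return sum(f.size_bytes for f in self.all_files())
```

The all_files method is the point of the design: existing code that expects a flat list of files per record can keep working, while the grouping is available to anything that wants it.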

What next?

There are some interesting feature requests here – some of which are closely related – better reporting, bibliographies and “live” author names, for example. All of them sound like useful things and on first read sound like they should be simple to implement and it’s easy to think “why are these not *just there* already?” or “surely they’ll just take ten minutes to fix?”. Maybe some of them are simple, but some of them are like icebergs, where the simple bit above the water hides a vast complex interaction below that needs to happen inside the software system. To use another metaphor, changing software can be a bit like trying to alter the pattern on the front of a knitted jumper. You look at the circle and say “surely this can be made into a star?” but to do so you have to unpick most of the wool and start over…

That said, I’m hopeful that RSP will be able to explore some of these issues further and produce some step-by-step guides – but I’ll need to reflect on the results more before I can say what these will be…

Thanks again to JISC-REPOSITORIES list members for their input!

The point will hopefully become clear…

–After I had tidied myself, I went down to Dr. Seward’s study. At the door I paused a moment, for I thought I heard him talking with some one. As, however, he had pressed me to be quick, I knocked at the door, and on his calling out, “Come in,” I entered.

To my intense surprise, there was no one with him. He was quite alone, and on the table opposite him was what I knew at once from the description to be a phonograph. I had never seen one, and was much interested.

“I hope I did not keep you waiting,” I said, “but I stayed at the door as I heard you talking, and thought there was someone with you.”

“Oh,” he replied with a smile, “I was only entering my diary.”

“Your diary?” I asked him in surprise.

“Yes,” he answered. “I keep it in this.” As he spoke he laid his hand on the phonograph. I felt quite excited over it, and blurted out, “Why, this beats even shorthand! May I hear it say something?”

“Certainly,” he replied with alacrity, and stood up to put it in train for speaking. Then he paused, and a troubled look overspread his face.

“The fact is,” he began awkwardly. “I only keep my diary in it, and as it is entirely, almost entirely, about my cases it may be awkward, that is, I mean . . .” He stopped, and I tried to help him out of his embarrassment.

“You helped to attend dear Lucy at the end. Let me hear how she died, for all that I know of her, I shall be very grateful. She was very, very dear to me.”

To my surprise, he answered, with a horrorstruck look in his face, “Tell you of her death? Not for the wide world!”

“Why not?” I asked, for some grave, terrible feeling was coming over me.

Again he paused, and I could see that he was trying to invent an excuse. At length, he stammered out, “You see, I do not know how to pick out any particular part of the diary.”

Even while he was speaking an idea dawned upon him, and he said with unconscious simplicity, in a different voice, and with the naivete of a child, “that’s quite true, upon my honor. Honest Indian!”

I could not but smile, at which he grimaced. “I gave myself away that time!” he said. “But do you know that, although I have kept the diary for months past, it never once struck me how I was going to find any particular part of it in case I wanted to look it up?”

By this time my mind was made up that the diary of a doctor who attended Lucy might have something to add to the sum of our knowledge of that terrible Being, and I said boldly, “Then, Dr. Seward, you had better let me copy it out for you on my typewriter.”

He grew to a positively deathly pallor as he said, “No! No! No! For all the world. I wouldn’t let you know that terrible story!”

Then it was terrible. My intuition was right! For a moment, I thought, and as my eyes ranged the room, unconsciously looking for something or some opportunity to aid me, they lit on a great batch of typewriting on the table. His eyes caught the look in mine, and without his thinking, followed their direction. As they saw the parcel he realized my meaning.

“You do not know me,” I said. “When you have read those papers, my own diary and my husband’s also, which I have typed, you will know me better. I have not faltered in giving every thought of my own heart in this cause. But, of course, you do not know me, yet, and I must not expect you to trust me so far.”

He is certainly a man of noble nature. Poor dear Lucy was right about him. He stood up and opened a large drawer, in which were arranged in order a number of hollow cylinders of metal covered with dark wax, and said,

“You are quite right. I did not trust you because I did not know you. But I know you now, and let me say that I should have known you long ago. I know that Lucy told you of me. She told me of you too. May I make the only atonement in my power? Take the cylinders and hear them. The first half-dozen of them are personal to me, and they will not horrify you. Then you will know me better. Dinner will by then be ready. In the meantime I shall read over some of these documents, and shall be better able to understand certain things.”

He carried the phonograph himself up to my sitting room and adjusted it for me. Now I shall learn something pleasant, I am sure. For it will tell me the other side of a true love episode of which I know one side already.

From Bram Stoker’s Dracula, 1897. Funny that we’re still having the same problems… ;-)

RSP Support II

Part two of RSP support replies – the other one is just below. If you don’t want to read them all, the key point is don’t install an IR to tick a box, but rather to fulfill an identified need.

This one was from an institution struggling to sell the repository to the academics. Nothing new there, and in this case the repository has been put on hold for that reason. I don’t see any problem with that – if there is no recognised need for a repository, don’t just get one to tick the box.

“Tell an academic about Open Access and they’ll say “but I never have a problem getting hold of papers”, tell them it’ll raise their research profile and they’ll say “I think (insert prestigious publication of choice) does a better job at that”. I don’t think the problem is limited to the academics either – software developers within Information and/or Computing services seem to have similar views as do some in the Library community.

It isn’t that clear-cut but I don’t think the academics’ concerns are unjustified. It is hard to see the direct and immediate benefits of IRs to them personally – it is just effort and time that they don’t have. To my mind the benefits of IRs seem to largely be on the part of the Institution itself – at least at the moment – a place to manage research outputs (and teaching materials) – the digital assets of the organisation that are currently lost in a morass of disk drives and departmental Web servers. This standpoint – one of asset/data management – seems to appeal to holders of purse-strings and these are issues any organisation should start grappling with – commercial, public and education sectors.

I’m interested in what you say about differentiating the “Repository” from adding records to the Library catalogue and Web/VLE, etc. While I’m not sure the Repository (the capital R is deliberate) crowd would agree, I feel that it is this need to differentiate that has caused some of the problems. If a Web site (perhaps with an associated catalogue record kept by the Library and including a persistent URI, supported by the Web site/VLE) fulfills the Institution’s requirements for research output management and dissemination (in addition to being more attractive to users) then so be it. It fulfills the role of the mythical Repository – it is a repository – in that it gets stuff “out there” but it works in the context of that Institution.

Provided the Web site is designed and implemented well (I guess that is the issue repository software is intended to help with) there is no reason why it cannot be upgraded to an IR later if people want/start to see the need for it.

One of the ways I’ve heard repositories sold to academics in teaching-led institutions is to offer them up as Flickr-like places to share resources and nothing else. You forgo complex metadata (and perhaps QA) in favour of a promiscuous deposit policy. People share their materials and use those of others. There are, of course, IPR issues here, as ever, but the benefit is that stuff that would not normally be kept can be kept safe for future staff – like the reading list one academic had one hard copy of and retyped and printed each year – but did not save the Word document they printed it from – it’s a true story! :-)

These are just some of my thoughts. I guess the bottom line is, as ever, to really consider what the repository is for. If it is to tick some checklist that says things we should do then it is right to hold on – no technology should be forced to address a need that isn’t there. If there is an identifiable need with the institution – and that may come as more and more academics start losing their work to failing hard drives or whatever – then implement it in the way that works at the institution. IRs represent a best effort at defining a standard way of addressing asset management and scholarly communication, but they are by no means a silver bullet or one of those tiny gloves that expands to fit any hand! :-)

Finally, on a practical note, RSP should in theory be able to help with convincing your academics and if it can’t then it would learn a valuable lesson on what academics really think – which would be handy for us. So bear us in mind if the R-word appears again on the horizon…”

RSP Support

Had a couple of interesting questions from people contacting RSP these last few days. I wanted to record my replies and so this post and the next are just that. The first question was about repositories -vs- “web-based storage”:

The question was put to me as this:

“an enquiry regarding the advantages of using repository software over other web based storage solutions”

There is lots in that, and in discussing it here at UKOLN I got different views on the question – one thought that the problem was with terminology – do we use the word “repository” or “web-based storage”? I’m not sure that is what you are asking, but if it is, then I think you have to call the repository whatever you and your users feel most comfortable with. “Repository” is a terrible, ugly, hugely overloaded term and not one likely to make sense to anyone! The second thought suggested you were asking about the differences between the two and a third suggested that the query was in fact “why do we not just develop our own, in house, repository software?”. All these questions are linked…

The answer, I think, is right back at the start and can be summarised as “it depends on what you want to use it for”. That is to say, what are the user requirements that the system is intended to fulfill?

In the RSP Briefing Paper “Repositories, Content Management Systems and Portals”, some of these requirements are given:

* Open Access – the ability to make available online the material in the repository and control access including time-based release, as required.

* Accessible to search engines – pretty much goes without saying these days!

* Persistent identifiers – should be self-explanatory. You could call this ongoing access – a commitment from the institution to make the items in the software system available until the end of the Internet or the world – which ever comes first. :-)

* Bibliographic metadata – a set of data associated with an item of content that records useful information such as Author, Title, etc. – in a controlled fashion – enabling search by surname, “give me a list of papers I’ve contributed to since 1986”, etc.

(I’d add here and suggest even more important and useful than bibliographic metadata:

* Management/structure metadata – this is more important than bibliographic and includes things like version – eg. is this the latest version and if not, where can I get that? Also “how many times has this paper been downloaded/cited/viewed?” and other statistical data.)

* Export/import – of content from Web pages, CVs, RAE data, research systems, etc.

* Metadata harvesting – some services are trying to make use of the stored metadata to provide more sophisticated search/browse services. At present, to achieve this they use the OAI Protocol for Metadata Harvesting (OAI-PMH). It is expected that the software system will support this protocol if it is to meet the requirements of research output management software.
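For anyone who hasn't seen OAI-PMH in action, a harvest is just an HTTP request with a handful of query parameters. The sketch below only builds the request URL for the protocol's ListRecords verb; the base URL is a made-up placeholder, and actually fetching and parsing the XML response is left out.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    'verb', 'metadataPrefix', 'from', 'until' and 'set' are the protocol's
    own argument names; base_url is whatever the repository publishes.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # e.g. "2008-01-01"
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec        # repository-defined set identifier
    return f"{base_url}?{urlencode(params)}"

if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    print(list_records_url("https://repository.example.org/oai",
                           from_date="2008-01-01"))
```

The oai_dc (unqualified Dublin Core) prefix is the one every OAI-PMH repository has to support, which is why richer fields tend to need an agreed application profile on top.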

This set of requirements has not been plucked from the air – it is a result of the “repository movement” that has over the last eight years investigated what might be required to support what a user needs from any research output storage, dissemination and management system (“ROSDMS” is less catchy than “repository” and “repository” less expensive than “publisher” :-) ). For further details of this investigation some use cases and other materials are on the Repositories Research Team wiki:

http://www.ukoln.ac.uk/repositories/digirep/index/All_the_Scenarios_and_Use_Cases_Submitted

Having outlined those “one size fits all” requirements, it is important to say that of course every institution is different, everyone has different requirements.

While I would be cautious of starting from nothing and beginning the user requirements analysis again, it is important to allow users and developers to feel ownership of any system built to achieve repository-like functions and some kind of analysis might be a useful way to achieve this. What is the institution trying to achieve by setting up a place to store, disseminate and manage its research outputs and how does that differ or ultimately align with the received wisdom on what such a system needs to support?

Another way of specifying repository-like requirements was presented at a meeting I was at a week ago where the key features were given as:

* Storage/Versioning
* Indexing
* Retrieval
* Access Control/Rights Management
* Cool URIs

None of that screams “specialist system” to me and so you are right to ask if web-based storage can do just as well. I guess the key thing missing from that list is “management” and there is a real worry that without institutional management of research outputs there will very quickly be a mess of departmental Web sites, just like the bad old days of an institution’s Web presence. The concept of a “Repository” implies, but doesn’t necessarily give, organisational control and a structured approach to looking after research outputs. That concept could be met by “web-based storage” provided it allowed for that structured approach to management of material.

So, the answer isn’t simple and at one level all a “Repository” is is a “web-based storage system”. I’ve outlined some of the specialist requirements that “web-based storage system” has been shown by the HE community to need to support. It is up to the institution to mould/map those requirements to the specifics of the institutional need to come up with its own set of requirements.

Then you consider software and you compare your requirements against them and consider the cost/benefit. Does repository-software-X meet our needs? (probably! :-) ) Does “web-based-storage-solution-X” meet our needs? What about S3, Sharepoint, Alfresco? A DIY solution?

Perhaps it helps to try not to start by thinking of the “Repository” (call it what you like – PLEASE don’t call it “Repository”! :-) ) as a piece of software. Better to think of it as a service to the institution. After that think about the technology that fulfills the needs of that service. At one level we call that technology “repository” but we could just as easily call it “web-based storage system” and whatever we call it it could still be CDSware behind the scenes.

Last week I took a flying visit to Washington DC to attend Repocamp at
the Library of Congress. The event was arranged and sponsored by CRIG
(which itself is funded by the JISC who were also
kind enough to suggest I go and fund my travel). The event was
the finale of CRIGshow.

I entered the day with some reservations. I’m not a huge fan of barcamps –
perhaps I’m a bit old fashioned, but I’d prefer an agenda, a bit of
structure maybe – at least this way you can have a hope of assessing the
usefulness of the day *before* you arrive. When the start dragged, I
started to suspect things were going to be bad. Fortunately I was wrong.

The day kicked off with David Flanders doing the intro, followed by Jim
Downing elaborating on it, mentioning that mysterious thing (in the
repository world) “the user” – suggesting that we should consider the
who in what we were trying to achieve. Needless to say, in a room of
technical developers, said user wasn’t really mentioned again and to
some extent the day was about technology, not the user.

We then moved into the “elevator pitch” part of the day whereby
people try to sell you their idea.

SWORD was first up and Jim Downing gave a good overview and talked about
SWORD2 as a revision based on feedback. Essentially SWORD is AtomPub for
repositories. After this intro, a guy from Microsoft Research got up and
outlined the other big M’s plans for global research communication
domination wielding our SWORD – that is to say they plan to use SWORD to
push documents around, from authoring tool (Word) to Conference
Management systems (MS have one called CMT -
cmt.research.microsoft.com/cmt/) to repositories (thus far no MS
product, but watch this space) and then “closing the loop” by allowing
authors to import references back to the authoring tool (did I say this
was Word?). MS seem interested in using open standards these days –
makes them a bit like a publisher, taking all that publicly funded
hard work and profiting from it. It’s the world we live in…

Next was an introduction to ORE which was the shortest I’ve ever seen
and highlights how ORE epitomizes taking the scenic route to stating
the obvious. ;-) (That is of course unfair to ORE, in which the
simplicity is the end result of an arduous process of wading through the
difficult – like art only less beautiful). It’d be fair to say the ORE
relies on a Resource Oriented Architecture (ROA) (which is a bit like
when you get them texts that say “you have a picture message, you can
pick it up at this URL” rather than just sending the picture) to
distinguish itself. This caused some discussion as at least one person
wanted the ability to add content to the resource map, enabling the
embedding of documents, or whatever, as part of the map.

This would, potentially, make ORE a packaging format – something the ORE
folks are keen to avoid (probably because there are already lots of
those). In many ways the argument is academic – is there much difference
between saying “resource X (see below) is part of aggregation Y” and
“resource X (identified by URI – and maybe at URL if you’re lucky) is
part of aggregation Y”? Well, of course there is, but the distinction is
subtle, transparent (and irrelevant) to that mythical user mentioned at
the start of day.

Next a guy outlined what he saw a repository as being:

* Storage
* Indexing
* Retrieval
* Access Control/Digital Rights
* Persistent URLs

and, as a vague aside, * Preservation.

What struck me (and others) was that none of those need anything more
than “The Web” – suggesting that the Repository is a redundant idea.

This got me thinking about repository types – those for large-scale
preservation (for example of electronic legal deposit) – National
Library of Wales, Library of Congress, University of California, etc.
and then those for academics (currently IRs) and then those for the
masses (Flickr, et al.).

The big Repositories are complex and curation/cataloguing, etc. seems
pretty important, along with vast systems of hardware and software -
which is probably why people like Sun are in this game. For IRs those
services outlined would be sufficient, which leaves Repository out of
the picture – indeed, a barrier. Yet we continue in the repository
community to treat national preservation services the same as small
scale Web publishing services.

Next up was Ben with his idea, the groan-inducing REST:ORE – someone
had to! Anyways, this was to produce a generic, Web-based file system
API for repositories to enable repository software to abstract itself
from the “storage layer” (which is not the disks in this case), meaning
you could migrate your content to another front-end system very simply.

Better, and I suspect why Ben is keen, you get ROA out of the box – the
file system would be, of course, REST based. I struggled to grasp the
“why” with this one (doesn’t HTTP on its own do the job? WebDAV?) even
discussing it with Ben after, but that isn’t to say there isn’t one.

(Later in the afternoon REST:ORE was mentioned again and Sandy Payette
confirmed my worries about types of repository by talking about
repository levels – we could have a stack like this – I think as an
attempt to rationalise similar worries to mine:

Fez/Flickr/DSpace/Eprints
Fedora/Relational DB
REST:ORE
Physical disk

each one could be and is called a repository – is it time we defined
what we meant and started to use different terminology for each? I’m all
for that! Maybe then we wouldn’t try to bundle an RSP Professional
Briefing and the SUN PASIG events into the same domain!)

Then it was the Library of Congress’ turn – Leslie Johnston (who has a
good blog at digitaleccentric.blogspot.com). LoC have a very simple
protocol called BagIt
(www.cdlib.org/inside/diglib/bagit/bagitspec.html), designed to enable
them to archive electronic stuff. It does what it says – describes a
metaphorical bag into which objects are placed and sent to LoC. The bag
can either contain the files themselves or a list of URLs to pick them
up from. It is a simple idea to meet a very specific use case and
because of this just works.

Discussing it later, Dave Tarrant pointed out that BagIt risked losing
the information stored in URIs and that it should hold on to them. In a
way I think he was inventing a use-case beyond BagIt’s (BagIt’s
“fetch.txt” file could legitimately and deliberately use temporary and
private URIs to simply move stuff around) but the discussion was
interesting all the same.
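To show just how little there is to a bag, here is a sketch that writes one to disk. It follows my understanding of the BagIt layout (a bagit.txt declaration, a data/ directory for the payload, a manifest of checksums and an optional fetch.txt of URLs to pick up later), but treat the details as assumptions: the version string and the choice of SHA-256 are placeholders rather than what LoC used at the time, and the make_bag function and its arguments are invented for this example.

```python
import hashlib
import tempfile
from pathlib import Path

def make_bag(bag_dir, payload, fetch=(), version="0.97"):
    """Write a minimal bag.

    payload: dict mapping file name -> bytes, stored under data/.
    fetch:   iterable of (url, length, name, sha256_hex) for files that are
             NOT included but should be retrieved from the given URL.
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    # Bag declaration: says "this directory is a bag".
    (bag / "bagit.txt").write_text(
        f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    manifest = []
    for name, content in sorted(payload.items()):
        (data / name).write_bytes(content)
        manifest.append(f"{hashlib.sha256(content).hexdigest()}  data/{name}")

    fetch_lines = []
    for url, length, name, digest in fetch:
        # Fetched files still get a manifest entry so they can be verified.
        manifest.append(f"{digest}  data/{name}")
        fetch_lines.append(f"{url} {length} data/{name}")

    (bag / "manifest-sha256.txt").write_text("\n".join(manifest) + "\n", encoding="utf-8")
    if fetch_lines:
        (bag / "fetch.txt").write_text("\n".join(fetch_lines) + "\n", encoding="utf-8")
    return bag

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = make_bag(Path(tmp) / "mybag", {"paper.pdf": b"not really a PDF"})
        print(sorted(p.name for p in bag.iterdir()))
```

Note that the URL in fetch.txt is only used to move the file; once the file has been fetched it is identified by its path and checksum, which is where the concern about losing the information held in the URIs comes in.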

The final pitch was by Nathan Sarr who outlined his attempt to use
GoogleDocs-mashed-with-Subversion-like approach to improving repository
deposit. That is to say he was producing a collaborative authoring
environment that was (I think) linked automatically with the repository.
People created documents, checked them out, edited them, checked them
in. Then they were available for other edits, and so on. It is a nice
idea – though I did wonder if Sharepoint (or Alfresco) had all but the
final step to the repository covered. It wasn’t received very well – in
part because it was felt the authoring process wasn’t as simple as that
and that there were problems of versions (is it a version of the work (in a
FRBR sense I guess), of the package of files, of a single Word doc?). I
think also we were suffering a little from information overload by then.

The day then moved to the second phase of barcamping – where people
break out to discuss the ideas they’ve seen or pitch their own ideas.
After that it was hoped that developers would form little groups and do
some prototyping of the ideas – with the view to producing entries into
a competition – for which you could win $2000.

This (unstructured) part of the day worked less well (for me) though I
had a good conversation with Michelle Kimpton about “the user” and about
repositories and how data management was the important thing. I
suggested that we were going about it the wrong way – we should be
encouraging people to address the fundamental information systems
architecture and data management problems faced by their Institutions
rather than just giving them more software to install. She seemed to
think in similar ways, even to the extent that OA evangelism was seen as
a bit of a problem for repositories, which was reassuring.

I took the opportunity to partake in what David Flanders described as
“participating observer” behaviour. That is I listened to and watched
other people. I also wrote some of this trip report – subsequently
deleting all of the content because I didn’t think it was very good -
suggesting that Live Blogging is not always ideal and also sketched out
my plan for a repository that worked the same as Flickr but for academic
content – ie. you email the content to it, it publishes it. Job done. (I should probably say that isn’t my idea – it is Scribd, Flickr, Slideshare, Andy’s, Paul’s, etc’s idea).

I’d say I got less from the afternoon than the morning – mostly due to
lack of direct relevance to my current role(s) and that there were not
really any prototype-suitable ideas formed in the morning – these were
projects rather than ideas. I also think there was a reluctance among
the group to form teams aside from those that already existed or to
simply catch up with each other. I suspect people in the room got a lot
from it as a networking event. It certainly helped raise the profile of
JISC and CRIG Stateside which is no doubt a good thing.

Introduction

Between about 1pm Tuesday 17th June and 12pm Monday 30th June 2008 the JISC mailing list, JISC-REPOSITORIES[1], further discussed questions of subject classification, repositories and automation. The discussion totalled some 10,284 words (not including headers and quoted text) over 67 messages and the thread (“Subject Classification”) spawned two others: “It’s Keystrokes All the Way Down” and “Current Awareness”. During the course of this discussion someone asked that a summary be created and this document represents an attempt to do just that. It does not attempt to attribute points to individuals just as it does not take any credit for the ideas expressed within.

Background

We begin with a question: “Do Institutional Repositories that make use of Library of Congress Subject Headings (LCSH) ask depositors to select the headings, or get cataloguers to do this work? Would it be better to simply use author chosen keywords (tags) or use a classification like ISI (to support REF)?”

Importantly, and implied by the reference to REF (and the subsequent discussion), any requirement to subject classify should be made in the context of usage. (This is a general principle in the creation of any metadata.) It would be right to ask: What is the purpose of subject classification of Institutional (or other) Repository content? (Note that this is not asking what the purpose of subject classification is per se.) Other questions then arise: What is the cost of subject classification and how does it compare with the benefits? Is human subject classification necessary, nice if you can get it, or simply a waste of time? These are recurring questions within the repository community, suggesting that they have not as yet been formally explored.

What is the purpose of subject classification of repository content?

In theory at least there are interesting services that can be built using the subject classified content of repositories. These include systematic searching and browsing, filtering by subject area (discussed later), support for REF, and auditing of research grants (by attaching grant codes – which may carry subject information – to papers). The latter two are not demanded by end users of repository content, but by administrators and funders. The former are standard methods used to discover resources and it was felt that to remove support for these types of discovery without proper investigation would be premature and unfair to those people who rely on them.

The discussion seemed to veer towards full-text indexing, coupled with sophisticated search algorithms (such as those used by Google) and boolean queries, as sufficient mechanisms for discovery of repository content. There was a strong feeling that subject descriptors attached to metadata records of papers would not enhance/aid discovery and that if subject classification was required it would be difficult to see the value added by “human classification” (at deposit) over automatic classification (at deposit or any time after).

That said, some posters advised caution, suggesting that to entrust scholarly research to the power of the search engines was not something to be taken lightly and that to dismiss subject classification, a standard discovery tool used by researchers and librarians, might carry some risks. Further, it was felt that there are limitations of full-text indexing and there was a question over whether or not a document’s content (devoid of context) was sufficient to facilitate discovery (or automatic classification) of that document. Some felt this was a minor problem that would only occur with a specialised set of documents and that this set of documents would perhaps have no place in an Institutional Repository. Others felt this might be a very real issue for the content of IRs.

The discussion seemed largely based on opinion and impressions rather than on studies assessing the usefulness of full-text indexing versus enriched metadata, and the question was raised of whether any such studies exist.

Subject classification to information overload

Some felt that while subject classification did not aid discovery via search engines, it was still useful to distinguish content for subject based harvesters and to filter result sets, for example in current awareness alerting services. IRs are, by default, as subject agnostic as the Institution itself. How then does a subject focussed harvester determine which full-texts to retrieve and index? Some services do not place any subject metadata into their records because it is clear from the repository in question what the subject area is. However, machine to machine interfaces do not necessarily have the luxury of knowing the subjects each repository might cover.

A further issue was raised relating to current awareness and the limitations of alerting services built on top of full-text indexes. Often such alerts (via RSS feeds) would return false positives and it was suggested that a finer grained filtering (perhaps aided by subject classification) would be of use in solving these problems.
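To make the false-positive point concrete, here is a minimal sketch of the difference between a plain keyword alert and one narrowed by subject. It is only an illustration: the item fields (`title`, `text`, `subjects`) and both function names are assumptions of mine, not anything proposed on the list, and a real alerting service would sit on top of a proper full-text index and RSS feed.

```python
def keyword_alert(items, keyword):
    """Naive alert: every item whose full text mentions the keyword."""
    kw = keyword.lower()
    return [item for item in items if kw in item["text"].lower()]


def subject_filtered_alert(items, keyword, subjects):
    """Same keyword match, but keep only items carrying a wanted subject.

    `subjects` is the reader's list of subject areas; items without any
    subject metadata are dropped, which is the trade-off of this approach.
    """
    wanted = {s.lower() for s in subjects}
    matches = []
    for item in keyword_alert(items, keyword):
        item_subjects = {s.lower() for s in item.get("subjects", [])}
        if wanted & item_subjects:
            matches.append(item)
    return matches
```

Whether the subject terms come from a human at deposit or from a machine afterwards makes no difference to the filter, which is rather the point of the discussion above.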

However, there was a strong feeling that machine classification would address these issues, adding subject classification after submission (or at retrieval), but as yet no one is very sure of how successful that would be now or how much better it might get in the future.

What is the cost (to Institutional Repositories) of subject classification?

The discussion suggested that deposit into repositories is disappointing and the poor rates of deposit can be directly and solely attributed to the effort (in terms of “keystrokes”) required to submit a paper. There was a strong feeling that reducing the metadata overhead (by, for example, not asking authors for subject headings) at submission would significantly increase the chances of authors depositing their work. That is to say the cost to IRs of subject classification is high: it prevents content deposit. (There was also the question of the author’s qualifications for cataloguing a work in accordance with a subject scheme).

As an aside to this discussion, the question was put to the list whether or not it really was the case that “keystrokes” were the main cause of the disappointing deposit rates. Some on the list felt that there were other, equally if not more significant, factors – such as copyright clearance/fears. If “keystrokes” were not the main factor, it could be argued that the cost of subject classification to deposit was less than envisaged, but there was only anecdotal evidence to support this.

That subject classification implies “keystrokes” that the authors are unwilling to make raised the question: does all metadata requested/required form a barrier to deposit? If it does, should IRs be asking for any metadata at all other than that which can be gained automatically? What if all barriers were removed and the submission interface for an IR were simply a Web site to which files could be uploaded/copied? How would such a Web site differ from an IR? (A few ways were mentioned: for example that an IR allows the institution to manage the scholarly output and that OAI-PMH was a better dissemination technology than screen scraping). However, the question remains: Are IRs themselves barriers to deposit? Barriers to Open Access?

Metadata Standards

There were implications for metadata efforts within the community and application profile work was mentioned in this context. The problem is this: if IRs will remain empty when there is an insistence on high levels of complex metadata, what role is there for things like SWAP? Should (could?) SWAP stipulate a subject classification scheme? How will it be possible to get authors to construct the relationships SWAP requires if they will not select (or are not capable of selecting) a subject heading? There was a feeling that software tools currently do not support the easy creation of complex metadata, coupled with a concern that they never will. “Developer bewilderment” was cited as the reason; that is to say that the software developers themselves do not understand or accept that structured metadata is a requirement for discovery and because of this will not invest the time and effort in developing the tools to create it.

Where now?

A number of questions were raised on the list as part of this discussion. Among these, the significant ones appear to be:

What are the requirements of IRs/services that subject classification supports?

Is subject classification an aid to resource discovery – from full-text indexing to alerting?

Do we know either way or is it just a feeling?

Is the disappointing deposit rate still attributable to just “keystrokes”?

Just where we go from here is left to the reader.


During the course of the discussion it was suggested that the thread itself might be interesting to automatically classify. The following is the output from OpenCalais:

URL:


http://www.driver-community.eu


http://www.iriss.ac.uk/openlx


http://tinyurl.com/62bmvk


http://metalogger.wordpress.com


http://search1.driver.research-infrastructures.eu

http://www.digitalpreservationeurope.eu


http://www.hull.ac.uk/golddust


http://eprints.ecs.soton.ac.uk/11006


http://search.arrow.edu.au


http://cadair.aber.ac.uk


http://www.eduserv.org.uk/foundation


http://www.digitalpreservationeurope.eu


http://eprints.ecs.soton.ac.uk/12094

http://www.iriss.ac.uk/openlx


http://eprints.utas.edu.au/view/authors/Sale


http://elpub.scix.net/cgi-bin/works/Show?


http://efoundations.typepad.com


http://metalogger.wordpress.com/>


http://zoomii.com/>


http://www.libworm.com


http://www.ukoln.ac.uk


http://openaccess.eprints.org/index.php?


http://www.franklin-consulting.co.uk


http://nzresearch.org.nz/index.php/browse/browseSubject


http://www.wired.com/science/discoveries/magazine/16-07/pb_theory


http://www.amazon.com/review/product/0691020728?filterBy=addFourStar


http://eprints.ecs.soton.ac.uk/11125


http://arxiv.org/abs/cs/0312018


http://www.eprints.org/openaccess/policysignup


http://www.intrallect.com


http://tomfranklin.blogspot.com


http://www.dcc.ac.uk


http://edina.ac.uk


http://road.aber.ac.uk


http://www.icbl.hw.ac.uk/~philb

PhoneNumber:

01970 628724

02890 974824

07989 948 221

+44 (0)23 8059

0161 434 3454

0131 451 3278

+44 870 234 3933

+44 (0)23 8059

+44 (0)131 651

+44(0)141 330

MedicalCondition:

bewilderment

Paralysis

ProvinceOrState:

Tasmania

IndustryTerm:

brilliant open web interface


http://search.arrow.edu.au/

data-mining lack

heavy-duty tools

semantic-web site:latest

subject search tool

online repository

search gateways

systematic search

web interface

gogle search

repository services

subject search tool

aggregated feeds available to other services

Internet Resources Newsletter

online repository

repository technologies

sensible human boolean search

subject search

smart text-processing software

taxonomy search

well-managed general web site

software development

boolean full-text search

learning tools

Internet users

browses repository search

browses repository search interfaces

in-house tool

online research repository

boolean full-text search

repository software development

mass-market newspaper

software developers

web services suite

search engines

magic solution

search tools

semantic web

boolean search

tomnfranklin web

search engine

friendly portal

cloud computing

wildcat Web site

City:

Glasgow

Zurich

Hand

Southampton

Technology:

ASCII

repository technologies

http

AJAX

html

ascii

My algorithms

search engine

Country:

New Zealand

Australia

Scotland

United States

United States

Scotland

United Kingdom

FaxNumber:

02890 976586

+44(0)141 330

+44 (0)23 8059

0131 451 3327

+44 (0)23 8059

Person:

Stevan Harnad

Philip J Hunter

Julian Cheal

Scott Welsh

Stevan Harnad

Andy Powell

Antony Corfield

Neil Godfrey

Gwasanaethau Gwybodaeth

John Smith

Sarah Currier

Peter Cliff

Ricky Rankin

Neil Godfrey

Carr On

Tom Franklin

Peter Crowther

Mason Ingrid Mason Digital

Tîm Cynorthwywyr Pwnc

Ian Stuart

Stevan Harnad On

Joy Davidson

Steven Harnard

Arthur Sale

Ingrid Mason Ingrid Mason

Phil Barker

Steve Hitchcock

Pete Cliff

Philip Hunter

Hugh Owen

Ingrid Mason

Ingrid Ingrid Mason Digital

Alma Swan

Philip Hunter Storelink

Rosemary Russell

Neil Just

Simeon Warner

Facility:

Library of Congress

Digital Library Section Edinburgh University Library George Square

Bureau of Statistics

Kelburn Campus

Aberystwyth University Llyfrgell Hugh Owen Library

Mountbatten Building

Computer Sciences Mountbatten Building

Library of Congress

Organization:

University of Southampton

Arthur Sale University of Tasmania   From

Heriot-Watt University

University of Edinburgh

Eduserv Foundation

University of Bath

School of Electronics

Australian Government

University of Tasmania

School of Oriental and African Studies

Queen’s University

Heriot-Watt University

School of Electronics and Computer Science University of Southampton

Harvard

University of Southampton

Victoria University of Wellington

Institute of Maths & Physics

Training Coordinator Humanities Advanced Technology and Information Institute

University of Zurich

Congress

School of Electronics and Computer Science

University of Glasgow

Australian Bureau

Arthur Sale University of Tasmania From

Australian Bureau of Statistics

Information Institute

School of Mathematical

University of Edinburgh

Bureau of Statistics

Company:

IRs Export

NARCIS

Tom Franklin Franklin Consulting

Computer Sciences

Google

Yahoo

Intrallect Ltd.

Google

Yesterday I attended the Talis Xiphos Research Day at the Talis Offices
in Birmingham. I found out about it from a leaflet I picked up at OR08,
signed up, and was glad I went as it was very interesting.

I arrived slightly late (about 15 minutes) on account of the 7am train
out of Bristol being cancelled, so missed Paul Miller’s opening talk and
a smidge of Peter Murray-Rust’s, but I don’t think that mattered.

The first two talks of the day were Peter Murray-Rust (Cambridge),
giving a typically Zebedee-like brain-dump about repositories for
Science and Andy Powell (Eduserv) questioning the Repository Movement
using arguments that he has presented before but they seem more refined
this time and more compelling.

The gist of PMR’s talk seemed to be:

1) that Big Science didn’t need any help when it came to looking after
data – they do their own thing and probably don’t even need institutions.

2) That is to say that it is the smaller research Labs (Buzzword: “Long
Tail Science”) that need support looking after their data and their
outputs.

3) Researchers need tools to help them manage datasets, research outputs
etc. and these tools did not seem to be IRs. Instead he suggested the
likes of SourceForge (he is impressed by its CVS-like versioning) and
Bioclipse, an adaptation of Eclipse (Java IDE) for researchers. That is
to say he seemed to suggest that intelligent “clients” – souped up
copies of Word for instance – that offered something useful to the
workflow of a researcher would be most helpful.

4) PDF is the source of all evil. (We’ve discussed before that this is a
misguided notion).

Following on and supporting PMR’s not dismissal so much as blanking of
IRs, Andy gave a developed version of a talk I think he probably first
gave at the Digital Deluge Conference in Manchester last year and maybe
other times since.

Andy makes a good point – the language of repositories is wrong and
counter-intuitive. If you talk to academics about “putting it in the
repository” they say “The what?” but talk to them about “putting their
work on the Web” and they at least know what you are on about. One of
the other speakers (an academic) backed this up in the questions by
saying (only a little tongue-in-cheek) “I had no idea what you were
talking about until slide 11 when you said ‘putting stuff on the Web’.
Then it became clear”.

He also talked about how IRs appear to be one way of reproducing the
print-based scholarly communication world on the Web. However, he
suggested that using the Web could require us to “re-envisage” scholarly
communication – forgetting (at least for a moment) the old ways.

Andy also suggested that “Repository” was a political word currently
being used by people with an Open Access agenda who did not want to see
that agenda sidelined by appropriating or expanding the term (to, say, content management). I feel the same way. To blindly think just having an IR will achieve perfect OA is a bit silly and naive and even OA literature says “self-archiving” is the goal rather than IRs.

So where does Andy see the future? He made two suggestions – a simple
route that involved using things like slideshare, and a complex route
where resources contained semantic markup. Neither suggestion seemed
fully developed! ;-)

Next up was Carsten Ullrich (Shanghai Jiao Tong University, China)
telling us why “Web 2.0 is Good for Learning and for Research” by
discussing Web 2.0 and its relevance and presenting his experiments in
this area. He has been using social bookmarking and semi-structured
tagging to build course resource lists (using tags like:
sjtu:type:exercise sjtu:kp:linked_list sjtu:difficulty:easy – the three
together used to identify an easy exercise on linked lists oddly
enough!) and also using Twitter to reinforce language learning.

The experiments are interesting and he has some good ideas on what can
be achieved in these lightweight ways – for example a script could, with
some knowledge of the user (eg. year of study) and the del.icio.us API
identify all the exercises a user could do right now. That said, they
still suffer in the same ways other Web 2.0 apps do – adding my own
exercises to that course would, for example, be pretty easy! :-)
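As a rough sketch of the kind of script described above: the tag format (`sjtu:type:exercise`, `sjtu:kp:linked_list`, `sjtu:difficulty:easy`) is the one from the talk, but everything else here is my assumption. The `Bookmark` class, the function names and the idea of a "topics covered so far" set are all hypothetical, and the bookmarks are plain in-memory data rather than anything fetched through the del.icio.us API.

```python
from dataclasses import dataclass, field

PREFIX = "sjtu"  # namespace seen in the tags quoted in the talk


@dataclass
class Bookmark:
    url: str
    tags: list = field(default_factory=list)


def parse_tags(tags, prefix=PREFIX):
    """Turn ['sjtu:type:exercise', ...] into {'type': 'exercise', ...}.

    Tags that do not follow the prefix:facet:value pattern are ignored.
    """
    facets = {}
    for tag in tags:
        parts = tag.split(":", 2)
        if len(parts) == 3 and parts[0] == prefix:
            facets[parts[1]] = parts[2]
    return facets


def exercises_for(bookmarks, known_topics, difficulties):
    """URLs of exercises on topics the student has covered, at allowed difficulty."""
    urls = []
    for bookmark in bookmarks:
        facets = parse_tags(bookmark.tags)
        if (facets.get("type") == "exercise"
                and facets.get("kp") in known_topics
                and facets.get("difficulty") in difficulties):
            urls.append(bookmark.url)
    return urls
```

Calling `exercises_for(bookmarks, {"linked_list"}, {"easy"})` would pick out the easy linked-list exercises. Note that nothing in it checks who created the bookmark, which is exactly the weakness mentioned above.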

The Twitter one was even more interesting because, to be successful, it
required a couple of things: firstly bribing the students to use it – a
percentage of the grade was based on the number of “tweets” a student
made – and secondly that the tutor had to be *very* active – responding
to students, correcting errors, etc. What struck me about that was that
here Web 2.0 was clearly making life *busier* for the academics (for
good reason as the teaching was improved). Too often Web 2.0 things are
sold as “making life easier” and here was an example of the opposite happening – better perhaps, but busier too. So maybe we’ve been selling Web 2.0 the wrong way all this time?

Carsten also pointed out some of the problems of 3rd party services -
most notably Twitter’s flakiness and performance and how, mid-course, the interface changed without warning! Is it time to implement a Twitter@MiskatonicU or whatever?

Next was Alan Masson from Ulster talking about a “Hybrid Learning Model”
they developed. While initially the relevance of this talk didn’t seem
clear, it was certainly an interesting talk – not least because of the
examination of how to go about asking academics to model their behaviour –
here in learning and teaching. I couldn’t help but wonder if the same
approach could be applied to modeling scholarly communication to get
some insight into repository-like activities.

Alan described his experiences building and then using what was later described as “an ontology” but was also some nicely designed cards representing stages and terminology in the teaching process. These were used to elicit a detailed model of a teaching activity. I have to admit that by this time my attention was slipping – I was pretty hungry by then – so I didn’t take all that many notes!

Lunch was a good opportunity to catch up with some old and new faces and
very tasty too! Chocolate cake is ALWAYS a good idea!

The afternoon was devoted to Talis staff and kicked off with Chris
Clarke presenting Project Xulu/Xiphos – I wasn’t clear which it was -
which was a research project by Talis to produce a “Social Network of
Scholarly Data” from a bag of 500 (well formed and consistent I assume)
metadata records from “a publisher friend” automatically. The work was
pretty impressive and a good show case for Talis’ Platform. From the 500
records the team were able to construct relationships between people – X
cited you, X has been cited by you 12 times, etc. and present this in a
useful and slick Web interface with a few bells and whistles. Pretty
good for a 4 week idea kick-around – something we should consider here…
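To give a feel for the “X cited you” relationships, here is a minimal sketch of counting who cites whom from a bag of records. This is not how Talis did it (their demo was built on the Talis Platform); the record layout (`id`, `authors`, `cites`) and the function names are hypothetical and chosen only to keep the example self-contained.

```python
from collections import Counter


def citation_counts(records):
    """Count (citing_author, cited_author) pairs across a bag of records.

    Each record is assumed to be a dict with an 'id', a list of 'authors'
    and a list of 'cites' (ids of other records).
    """
    by_id = {record["id"]: record for record in records}
    counts = Counter()
    for record in records:
        for cited_id in record.get("cites", []):
            cited = by_id.get(cited_id)
            if cited is None:
                continue  # the cited work is not in our bag of records
            for citing_author in record["authors"]:
                for cited_author in cited["authors"]:
                    if citing_author != cited_author:  # skip self-citation
                        counts[(citing_author, cited_author)] += 1
    return counts


def describe(counts, me, other):
    """One-line summary of the citation relationship between two people."""
    return (f"{other} has cited you {counts[(other, me)]} times; "
            f"you have cited {other} {counts[(me, other)]} times")
```

The `continue` branch is where the “complete picture” problem shows up: any citation pointing outside the bag of records is simply lost.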

While it is clear this would be a useful product to have, what wasn’t
clear was how Talis might run with it. The 500 records were from
publishers who could presumably do this kind of thing themselves -
which, of course, would not be very useful as no one would get a
complete picture and there’d be too many choices of which scholarly
network to use…

However, build that kind of service on top of open access repositories
and it starts to seem pretty exciting and an app that might get people
self-archiving. Trouble is I reckon that the OAI-PMH records would not
be nearly as nice nor as useful as metadata direct from the publisher… :-( Not yet anyways…

After this Zephyr was presented by Ian Corns (Talis). I couldn’t help
but feel this talk had an air of “sales” about it and I was a little put
off by the mixed metaphor in the title “Letting Students Weave Their Own
Path”… The talk felt a tad incongruous given the audience seemed
unlikely to be part of any Library procurement process – but the product
was nice and another example of Talis’ embracing of interesting slick
ways of doing things built on top of some innovative stuff. Back at Bristol we talked about the need for reading list software and maybe this’d do…

Finally Nadeem Shabir (Talis) presented a talk entitled “Open World Thinking” which was a general exploration of Openness and what that means for Talis and the sector. He had some interesting thoughts – I liked that he talked about “Designing for appropriation” with regards to data – that is, putting it out there and allowing people to use it in ways *they* conceive – not the data creator. He used the nice example of using a book to prop up a monitor… He also demonstrated Talis’ commitment to Openness by pointing out a couple of tools (ontologies) that Talis have published and spoke on how the “openness of access” – rather than Open Access – was required to allow people to build Personal Learning/Research Environments.

If I took one thing from Nadeem it was his vision of a future without “applications” as such, but simply “conceptualised perspectives” on data. Rather grand perhaps and I’m still happy to think that there are some uses for closed garden applications, but interesting none the less.

So there you have it! After that we moved into discussion, which was rather short in the end. I think people were a bit sleepy by then too (I know I was!), so maybe an afternoon break and then discussion could’ve worked – or maybe an organised trip to the pub!

Still, some interesting things came out of the discussion – a set of dichotomies (which I think Paul Miller identified) that create tension in this field (e.g. Lightweight -vs- Structured), and Cameron Neylon identified a problem with IRs – that in the past Institutions have not been very good at managing data (and still aren’t if RSP experience is anything to go by) and are very slow moving. They are a “millstone”. There is more on his thoughts here:


http://blog.openwetware.org/scienceintheopen/

Anyways, then began the trip home – not so bad as we traveled with a Talis employee who commuted from Bristol daily and knew exactly where to stand to get the seats on the train! :-)

Hope that is an interesting, if rather long, presentation of the day! :-)
