DepositAPI report

From DigiRepWiki

JISC/UKOLN Repository Deposit meeting

27th February 2006, 10.30am – 4pm

One Great George Street, Westminster, London

Attendees

  • John Bell, HarvestRoad
  • Jim Downing, University of Cambridge, DSpace / SPECTRa project
  • David Flanders, Birkbeck College
  • Richard Green, University of Hull, Fedora / RepoMMan project
  • Leah Houser, OCLC
  • Timothy Miles-Board, Southampton University, eprints.org
  • Martin Morrey, Intrallect
  • Thomas Place, Tilburg University, ARNO
  • Ben Ryan, KaiNao
  • Rachel Heery, UKOLN, JISC Digital Repositories Support team
  • Julie Allinson, UKOLN, JISC Digital Repositories Support team
  • Sarah Currier, CETIS, JISC Digital Repositories Support team
  • Phil Nicholls, Psydev, independent technical steer

Agenda

  1. Introductions
  2. Context and background (Rachel Heery)
  3. What do we want to achieve? (round table)
  4. What are your requirements for deposit? (round table)
  5. Developing scenarios
  6. Data models?
  7. Protocols and standards?
  8. Next steps

Purpose

The meeting agreed on an overall objective to specify a common interface which heterogeneous repositories can provide to enable other applications to deposit digital objects in common ways. The aim is to deliver the specification in weeks rather than months, to enable interoperability between JISC Digital Repositories Programme projects and other repositories and agents.

The meeting spent some time developing a common understanding of deposit and establishing the boundaries of that understanding.

Context

The Repository landscape is increasingly complex. Hiding this complex provision layer from the producer / consumer, whilst utilising federation, harvesting and other interoperability mechanisms, is a key consideration for repositories and the JISC Digital Repositories Programme. The developing typology and ecology of repositories, and a possible repositories reference model, will assist in the mapping and understanding of the area, and specification of a deposit API can draw on and contribute to this understanding.

Defining Deposit

"Deposit interface
Provides an interface through which content and metadata (possibly in the form 
of a ‘complex object’, i.e. a package) can be deposited and initiates ingest 
process for local storage.

Intelligence: Data format, packaging standard.

Data in: Deposit request (content, metadata, etc.)

Data out: Deposit status (success, failure, pending, etc.) and content identifier

Note that the subsequent ingest process may include both automated and manual 
procedures including format checking, editorial control, quality assurance mechanisms, etc"
Andy Powell, 'A 'service oriented' view of the JISC Information Environment', November 2005
http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/soa/

A deposit API must allow users of the repository to store data within it. It has become apparent that the straight import of data is often not the limiting factor; obtaining knowledge of the repository’s policies for importing data is what causes users the most trouble. Further, a repository should support bulk ingest as well as machine-to-machine and user-mediated ingest. At this stage, it is difficult to specify the precise structure of the API in any detail beyond these simple guidelines.
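
Although the precise structure of the API is undecided, the "data in" and "data out" of the quoted definition can be sketched as two simple records. This is only an illustration in Python: the class names, field names, the `acknowledge` function and the identifier `local:1` are assumptions made for the example, not agreed parts of any specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    # The three states named in the quoted definition.
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"


@dataclass
class DepositRequest:
    """Data in: content plus optional metadata (hypothetical field names)."""
    content: bytes
    metadata: dict = field(default_factory=dict)
    packaging: Optional[str] = None  # e.g. "METS"; None means non-packaged content


@dataclass
class DepositStatus:
    """Data out: a status and, once assigned, a content identifier."""
    status: Status
    content_id: Optional[str] = None


def acknowledge(request: DepositRequest) -> DepositStatus:
    # Toy behaviour: empty content fails; anything else is reported as pending,
    # because the subsequent ingest may involve automated or manual steps.
    if not request.content:
        return DepositStatus(Status.FAILURE)
    return DepositStatus(Status.PENDING, content_id="local:1")
```

Returning "pending" rather than "success" reflects the note in the definition that ingest may continue after the deposit call has returned.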

Ideally, the deposit API would be something that vendors can add to their products, in much the same way that OAI-PMH can be added. Vendors would not be forced to implement the API, but there would be significant market advantage in doing so.

Gathering requirements

Identifying business requirements, processes and workflows will have a significant impact on defining the deposit API. Scenarios developed within the JISC Digital Repositories Programme will help to identify requirements for a deposit API. These scenarios have been made available on the DRP wiki (see http://www.ukoln.ac.uk/repositories/digirep/index/Scenarios_and_use_cases). Use cases derived from them, along with associated UML diagrams, would further assist in extracting requirements. Also important are the workflow mapping and other process modelling ongoing in the Digital Repositories Programme.

Approach

Jim Downing suggested a layered approach, along the lines of the following:

  • Level 0 – created for the lowest common denominator, e.g. no authorisation, internal identifiers etc.
  • Level 1 – added layer of constraint, e.g. some authorisation, recognised identifier scheme etc.
  • Level 2 – more complex authorisation process, single default identifier scheme, collections support
  • and so on …

This approach would ensure support for different levels of simplicity and complexity. An API produced only for level 0 would not scale to more complex requirements.
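
One way to picture the layering is as a set of features introduced at each level, so that a client can check whether a repository's declared level covers what it needs. The sketch below assumes the levels are cumulative (each level includes everything below it), which the meeting did not state explicitly; the feature labels are illustrative strings, not agreed vocabulary.

```python
from enum import IntEnum


class Level(IntEnum):
    LEVEL_0 = 0  # lowest common denominator
    LEVEL_1 = 1  # added layer of constraint
    LEVEL_2 = 2  # more complex requirements


# Features first introduced at each level, taken from the list above.
# The labels themselves are hypothetical.
INTRODUCED_AT = {
    Level.LEVEL_0: {"internal-identifiers"},
    Level.LEVEL_1: {"authorisation", "recognised-identifier-scheme"},
    Level.LEVEL_2: {"complex-authorisation", "default-identifier-scheme", "collections"},
}


def features(level: Level) -> set:
    """Return every feature available at `level`, assuming levels are cumulative."""
    available = set()
    for lvl, introduced in INTRODUCED_AT.items():
        if lvl <= level:
            available |= introduced
    return available


def can_serve(repository_level: Level, needed: set) -> bool:
    """True if a repository at `repository_level` offers all features a client needs."""
    return needed <= features(repository_level)
```

For example, `can_serve(Level.LEVEL_0, {"collections"})` is false, which matches the observation that a level 0 API alone would not scale to more complex requirements.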

Technically, any API would need to be developed using an implementation-neutral model; UML, for example, would be very appropriate for this. Implementers can then use an appropriate language binding to implement the API. This approach also allows open source reference models of the API to be developed quickly.

At this stage, it makes the most sense to work on simple cases. Authentication, for example, can be left abstract at lower levels. In any case, there are already APIs which handle authentication; it should not be the role of the deposit API itself.

There are 3 elements of the proposed model, which reflect the discussions within the meeting:

  • Service Discovery API – akin to EXPLAIN – an information model (which may or may not be machine readable), which allows the repository to present information about its deposit policies.
  • Deposit API – a set of methods by which users can put content into the repository.
  • Receipt – the return value – an indication of what has happened to the deposited data.

Some examples of what may be within these areas are described below:

Service Discovery API

SCOPE: Contains defined statements in a controlled language. Further descriptive information might be layered underneath. A selection of possible queries are identified below:

  • Do you support packaging standards? (default)
  • What metadata standards? (default)
  • Checksum algorithm? (default)
  • Is there modification? (y/n)
  • What modification?
  • Do you support collections?
  • What collections am I authorised to deposit into?
  • Authentication required? (y/n)
  • What authentication system?
  • show me your SLA
  • What identifier schemes are supported? (native / permissible)
  • Do you support harvesting? (y/n)
  • Do you support versioning? (y/n)

Deposit API

SCOPE: the range of functions offered by an API for deposit.

  • put(Content the_content)
  • replace(ID the_target, Content the_new_content)
  • put(Content[] the_content)
  • rawPut(Content the_content)
  • getSLA() //or in discovery API?
  • getProperty(Property the_property) //or in discovery API?
  • putMetaData() //keep the content in my repository; but let you be able to find it.

Receipt

SCOPE: Information returned to the depositor detailing the success of their submission and the actions performed on their data / metadata.

  • Validation report
  • Checksum
  • Locator (resolvable)
  • Native identifier assigned
  • Audit trail
  • Packaging standards (identified by registered URI)
  • Request Status
  • Request success state
  • Error messages/codes

There is an implicit assumption here that where ranges of values are returned, the depositor can then parameterise the property on invocation. It is suggested that in this case, a default value is also returned. For example, if a repository accepts content using a range of packaging standards, one of those standards will be the default. The repository will expect content in that format unless on invocation it is overridden by the depositor.
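
The default-and-override behaviour described here can be sketched as follows. The `RangedProperty` class, the `PACKAGING` discovery answer and this toy `put` are hypothetical names invented for the example; the choice of METS as the default is likewise only an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RangedProperty:
    """A discovery answer: the accepted values plus the repository's default."""
    accepted: Tuple[str, ...]
    default: str

    def resolve(self, requested: Optional[str] = None) -> str:
        # No override: fall back to the repository's default.
        if requested is None:
            return self.default
        # Override: only values the repository advertised are allowed.
        if requested not in self.accepted:
            raise ValueError(f"{requested!r} is not one of {self.accepted}")
        return requested


# Hypothetical discovery response for the packaging-standards query.
PACKAGING = RangedProperty(accepted=("METS", "DIDL", "IMS CP"), default="METS")


def put(content: bytes, packaging: Optional[str] = None) -> str:
    """Return the packaging standard the repository will assume for this deposit."""
    return PACKAGING.resolve(packaging)
```

Calling `put(b"data")` resolves to the default, while `put(b"data", packaging="DIDL")` overrides it on invocation; a value outside the advertised range is rejected rather than silently accepted.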

Topic areas for further work

The following topic areas relevant to any deposit API work were identified through the discussion. This is not an exhaustive list.

Authentication / authorisation / permissions

Authentication and authorisation were identified as out of scope for a deposit API, but support for access controls must be a consideration and a deposit API must support pluggable authentication. Support is also needed for embargoes, moderation, quality control and different permissions models, e.g. the author, editor, publisher model. For now, this should be abstracted.

Point-of-ingest vs pre-ingest

Need to consider what has already happened at the point of ingest. Metadata creation may have already happened, but metadata may also be created automatically at ingest. The deposit API must be able to accept import data; the discovery service API must be able to report on any policies that will add metadata (presumably beyond the ‘control data’ required for the operation of the repository). An example given was where repository policies add to, or otherwise alter, rights metadata.

Packaging

Need to define what a package is. Does content need to be packaged to a standard (METS, DIDL, IMS CP), and is non-packaged content accepted (at level 0)? Further, consideration must be given to the import of an identifier, i.e. those cases where content in another repository is referenced by the target repository (metadata-only import).

Data formats

Deposit might take an agnostic approach to formats, by storing content of any registered MIME type, or it might specify acceptable data formats, depending on the level of implementation.

Metadata standards

A deposit API cannot be limited to a single simple DC format, as this would ignore the richness of metadata currently captured by different repositories. A range of metadata types and standards may be used in creating metadata pre-ingest and post-ingest. Thought should also be given to application profiles of metadata.

Data / metadata Management

A management API is out of scope for deposit, but it should be recognised that once an object is in the repository, it can be edited and modified by the repository’s editors. Users will need to be aware of any editorial policy.

Versioning and provenance

Maintaining the history of deposited information packages should be considered. There should be support for different concepts of versions, e.g. date stamp, status or variant. Item history was also considered – so that any particular piece of content can show which repositories it has been located within.

Data / metadata integrity

There may be a need to get back the exact object that was deposited. A repository could store the exact package as deposited (cf. SIP / AIP / DIP concepts from OAIS), with the repository unpacking to its own specification. This has storage implications for large datasets. There need to be some guarantees on the submitted information and a stated policy on what the repository will do with information packages submitted. Any changes must be tracked and documented, perhaps in the receipt and in the repository. Overwriting is a significant concern, whereas the addition of metadata can be seen as value-added.
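
A minimal sketch of keeping the exact package as deposited, with a checksum that could be reported in the receipt and re-checked on retrieval. The `PackageStore` class and its method names are hypothetical, and SHA-256 is chosen only for illustration; the meeting did not select a checksum algorithm.

```python
import hashlib


class PackageStore:
    """Keeps the package exactly as deposited, alongside its checksum."""

    def __init__(self):
        self._packages = {}  # identifier -> (original bytes, hex digest)

    def deposit(self, identifier: str, package: bytes) -> str:
        digest = hashlib.sha256(package).hexdigest()
        self._packages[identifier] = (package, digest)
        return digest  # could be reported to the depositor in the receipt

    def export(self, identifier: str) -> bytes:
        package, digest = self._packages[identifier]
        # Re-check before handing the package back, so corruption is detected.
        if hashlib.sha256(package).hexdigest() != digest:
            raise ValueError(f"stored package {identifier} no longer matches its checksum")
        return package
```

A depositor who kept the digest from the receipt can compare it against a digest of the exported bytes to confirm that the object returned is the one submitted. Storing the original alongside any unpacked form is what gives rise to the storage implications noted above.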

Metadata creation

Creation of metadata pre- and post-ingest, and the auto-generation and auto-extraction of metadata, should be considered.

Identifiers

Consistent use of identifiers is essential if data is to be accessible via federation, or in the case of the metadata-only deposit discussed above.

Assigning / overwriting

What happens with duplicate items with the same identifier; are existing identifiers replaced / overwritten? This needs a policy.
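
To make the policy question concrete, the sketch below shows two possible answers to a duplicate identifier: refuse the deposit, or overwrite the existing item. Both policies, and every name in the code, are assumptions for illustration; the meeting did not choose between them, and other policies (such as creating a new version) are equally possible.

```python
from enum import Enum


class DuplicatePolicy(Enum):
    REJECT = "reject"        # refuse the second deposit
    OVERWRITE = "overwrite"  # replace the existing item


class DuplicateIdentifier(Exception):
    """Raised when an identifier is already in use and the policy is REJECT."""


def deposit(store: dict, identifier: str, content: bytes, policy: DuplicatePolicy) -> str:
    """Store `content` under `identifier`, applying `policy` if it is already taken.

    Returns "created" or "overwritten" so the outcome can be reported in a receipt.
    """
    if identifier not in store:
        store[identifier] = content
        return "created"
    if policy is DuplicatePolicy.OVERWRITE:
        store[identifier] = content
        return "overwritten"
    raise DuplicateIdentifier(identifier)
```

Whichever policy a repository adopts, reporting the outcome to the depositor keeps overwriting visible rather than silent.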

Export

The question of export is one that needs to be considered at a policy level. There is an argument that a repository should support the facility to export the exact data that was deposited in the first place. Uses of this facility include transferring data from one repository to another, e.g. when an academic moves institution.

Harvesting

OAI is the de facto standard for harvesting. It is important to explore overlaps between deposit and harvest, e.g. the above ‘export’ functionality might be fulfilled by harvesting. OAI takes a simple, practical approach. A deposit API should do the same.

Collections

Support for collections is a consideration, although the notion of a collection is overloaded. Should a collection be an intellectual concept, or merely a useful level of granularity for making policy? Examples of such policy areas include publication status and access rights. The question of collections is expansive, and different institutions and individuals often attach different meanings to the word.

Service-level agreement

This covers policy considerations, preservation of object types, depositor guarantees, trust issues and rights considerations.

Workflow

Depositing at different points in the workflow, e.g. from inception to publication in a public space, is a potential new area of development for repositories. Signposting the different stages in a workflow is a consideration.

Next steps

It was agreed that lengthy standards development was not the intention of this work. Identifying existing specifications and standards was seen as a beneficial practical activity, with short-term gains to be made in further scoping the 3 areas defined through the meeting:

  • Service discovery – This means working up a good information model of the current areas of trouble when adding data to repositories. The preceding sections give a good overview of some of these areas but there are likely to be more. For now, this information model will hold pointers to policies that are human readable. An API for interrogating this model should also be considered, rather than it being bundled within deposit itself.
  • API functions (OAIS might be a guide). A set of methods that will enable content and data to be inserted into the repository. At the simplest level this might be only one or two calls.
  • Receipt. A description of the return information following a call on the deposit API.

In all cases above, the use cases should be analysed to ascertain the level of fit.

To do this, the following was agreed as a way forward:

  • Establish a working group made up of those present on the day (should others be invited?).
  • Appoint a WG leader who will establish a work programme of rolling implementation with an aim to deliver in 2-3 months.
  • Create a mailing list (JISC-DAPI@jiscmail.ac.uk – DONE).
  • Identify other work in this area, e.g. OKI OSIDs.
  • Liaise with international activity in this area.
  • Hold a further face-to-face meeting.
  • Analyse candidate protocols and specifications.
  • Invest further time on the analysis of the use cases.

Technical Considerations

The first message that arose was that the definition of the API was probably not the problem! Rather, it was having access to the policies of the repository, so that clients and users would know what would happen to their data once it had been deposited.

It was recognised that supporting the policy elements of this requires a shared semantic understanding between clients. For example, when a repository claims a particular metadata policy, there can be no ambiguity as to the meaning of that policy; a commonly accessible definition must exist. This level of infrastructure is beyond the scope of this work, but will be ‘shortcut’ in the short to medium term.

Information models and APIs must be developed in a binding-neutral manner. Web services, for example, should not be an underlying assumption at this stage. Implementers should be able to implement the API in their own target language.
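
As an illustration of separating the abstract operations from a binding, the sketch below declares two of the example operations (put and replace) as an abstract interface, then gives one toy binding held in memory. Python is used only as one possible target language; the `Receipt` fields, the class names and the `local:N` identifier format are assumptions for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Receipt:
    identifier: str  # native identifier assigned by the repository
    success: bool    # request success state


class DepositAPI(ABC):
    """Abstract operations only; says nothing about transport or storage."""

    @abstractmethod
    def put(self, content: bytes) -> Receipt: ...

    @abstractmethod
    def replace(self, target: str, new_content: bytes) -> Receipt: ...


class InMemoryRepository(DepositAPI):
    """One possible binding: a toy repository held in a dictionary."""

    def __init__(self):
        self._items = {}

    def put(self, content: bytes) -> Receipt:
        identifier = f"local:{len(self._items) + 1}"
        self._items[identifier] = content
        return Receipt(identifier, success=True)

    def replace(self, target: str, new_content: bytes) -> Receipt:
        # Replacing an unknown target is reported as a failure, not created.
        if target not in self._items:
            return Receipt(target, success=False)
        self._items[target] = new_content
        return Receipt(target, success=True)
```

A web service binding, or one in another language, would implement the same two operations; client code written against `DepositAPI` would not need to know which binding it is using.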

Collections introduce a reasonable amount of complexity to a deposit API. There are issues around whether a collection is visible at a given level of authorisation, even before one considers whether the collection can be written to. One view might be that a repository is an aggregation of collections; if this is so, then it perhaps makes sense for any service discovery API to be keyed to the individual collection. Repositories could then cascade permissions down, with them being overridden at the collection level.
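
The cascade described here, with repository-wide settings that individual collections may override, can be sketched with Python's standard `collections.ChainMap`. The policy keys, values and collection names below are hypothetical.

```python
from collections import ChainMap

# Hypothetical repository-wide policy, for illustration only.
REPOSITORY_POLICY = {"authentication_required": True, "versioning": False, "packaging": "METS"}

# Hypothetical per-collection overrides.
COLLECTION_OVERRIDES = {
    "theses": {"versioning": True},
    "datasets": {"packaging": "IMS CP", "authentication_required": False},
}


def discover(collection: str) -> ChainMap:
    """Return the effective policy for `collection`.

    ChainMap searches its mappings left to right, so a collection-level value
    wins and anything not overridden falls through to the repository default.
    """
    return ChainMap(COLLECTION_OVERRIDES.get(collection, {}), REPOSITORY_POLICY)
```

Here `discover("theses")["versioning"]` is true while its packaging is inherited from the repository, so a discovery query keyed to a collection can return a complete answer without each collection restating every policy.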

A decision needs to be made as regards the complexity of the API. It is possible to create an API that does ‘everything’, but such an API will be very complex to use and, as with any complex API, the likelihood of error will increase. The multi-layered approach is worth further consideration, although coupling everything into a deposit API is not ideal. It would be better to present different APIs for different functions, such that a user can have gone through authentication and so forth prior to attempting deposit. This creates the notion of a session, however, whereas the simplest possible model is stateless and synchronous.

Funding

Additional JISC funding for travel and WG leader would be necessary to take this work forward.

Benefits to the JISC Digital Repositories Programme

Establishing a short-term programme of work on this deposit API will enable DRP projects to gain practical implementations within the timescale of their current funding.

The infrastructure for carrying out this work is already in place:

  • The Deposit API process has been initiated by the Digital Repositories Programme support team. The team is in place to support any future activity of the working group.
  • The support wiki has already been used for gathering information for this initial meeting (http://www.ukoln.ac.uk/repositories/digirep/index/Deposit_API). It is ideally placed for dissemination of further outputs and collaborative editing.

Other benefits

  • E-Framework – providing definitions and specifications for a service genre (‘brick’).
  • Wider repositories community.

Terminology

The terms ‘ingest’ and ‘deposit’ are used interchangeably throughout this document. This interface might also be called ‘put’.