KE questionnaire

From DigiRepWiki

Questions: Research paper metadata

For institutional repositories as well as research databases and other e-science documentation systems, research papers (scholarly works or eprints) are one of the most commonly collected types of data and are a key component of the scholarly research information cycle. The following 7 questions are intended to gain an impression of how metadata is collected and used in relation to research papers across the four Knowledge Exchange partner countries. Where possible, please try to provide a broad view of the situation in your country, in addition to personal experience.

Question 1

1) In your country, which metadata formats (e.g. Dublin Core, CERIF, MODS, 
MARC, local profile etc.) are used for this purpose, and do any of these have 
the status of a national standard or best practice?

Liv Fugl (Denmark): A metadata format for document exchange (MXD) is used in The Danish National Research Database and thus data from the universities is exported in this format.

Further, the PURE format is used at a range of universities using the PURE system for research registration.

The Technical University of Denmark currently uses a local format in its research registration system, ORBIT. However, the format is based primarily on a previous format from the Danish National Research Database.

Adrian Price (Denmark): In the higher education field, most institutions use the PURE software (http://www.atira.dk/) or Orbit. Both systems are "local", but I would imagine that both are Dublin Core compliant (difficult not to be), although DC alone is not sufficient for institutional repositories (IR) as such. I cannot give a complete overview of what is in use (and the situation is very much in flux at the moment).

Danish institutions of higher education (can) contribute to The Danish National Research Database (http://www.forskningsdatabasen.dk/). (see 2)

Thomas Place (Netherlands): In DARE, the national programme for institutional repositories, (simple) Dublin Core is combined with MPEG21/DIDL. The DIDL record contains three parts: 1. metadata (at least a Dublin Core record), 2. pointers to the object files and 3. pointer to jump off page (optional). The Royal Library harvests the DIDL records for digital preservation of the object files contained in the institutional repositories of all Dutch universities.
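The three-part structure described above can be pictured with a small data model. This is only an illustrative sketch in Python, not the DIDL XML schema itself: the class name `DidlRecord`, its field names, the helper `has_object_files` and the example.org URLs are all assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DidlRecord:
    """Hypothetical model of the three parts of a DARE DIDL record."""

    # Part 1: descriptive metadata; at least a Dublin Core record.
    dublin_core: Dict[str, List[str]]
    # Part 2: pointers (URLs) to the object files.
    object_files: List[str] = field(default_factory=list)
    # Part 3: optional pointer to the jump-off page.
    jump_off_page: Optional[str] = None

    def has_object_files(self) -> bool:
        """True if a preservation harvester would find files to fetch."""
        return len(self.object_files) > 0


# Made-up example values, for illustration only.
record = DidlRecord(
    dublin_core={"title": ["An example working paper"], "creator": ["Doe, J."]},
    object_files=["http://repository.example.org/files/1234.pdf"],
    jump_off_page="http://repository.example.org/record/1234",
)
print(record.has_object_files())
```

Keeping the file pointers in their own field, separate from the descriptive metadata, is what lets a harvester such as a preservation service find the object files without guessing.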

Most parts of CERIF are implemented by Metis, the Research Information System used by all Dutch universities. CERIF is not used in the institutional repositories. In many universities, Metis and the IR are linked. Metadata records are exchanged between the two systems; in most cases the metadata records flow from Metis to the IR, and in some cases Metis is also updated by records from the IR. Metis has an upload button which allows for the upload of files. These files are not stored in Metis but sent to the IR for storage.

My institution has positive experiences with MODS.

For an international collaboration in economics (Nereus), I developed an XML schema based on Qualified Dublin Core. The most important addition to DC was for author identifiers. In this collaboration, three IRs of Dutch universities are involved (Maastricht Un., Erasmus Un. Rotterdam, Tilburg Un.).

I am chairing the DARE working group for metadata. The goal is to advise DARE on an international metadata standard. At this moment, the most important candidates are MODS, the DC application profile of JISC and MARC21 (MARCXML). This format will be combined with DIDL or with whatever digital object container will come out of the ORE project. I am chairing a second DARE working group that advises on the use of DIDL and that follows developments like ORE.

Jos Odekerken (Netherlands): In our repository (ARNO software) we have an internal, extensible metadata format. It is documented in our own local repository. For OAI-PMH purposes, it is mapped to the Dublin Core and Dare-Didl formats. But research papers are also catalogued in our national catalogue (OCLC PICA's GGC). This system uses the Pica metadata format.

Question 2

2) Are the metadata formats and/or the application profiles documented and 
made available online? If so, can you supply URLs?

Liv Fugl (Denmark): MXD and an equivalent schema are available from http://mx.forskningsdatabasen.dk/mxd/1.1.0/

Adrian Price (Denmark): With regard to harvesting by the Danish National Research Database from institutions using PURE, contributing institutions must comply with the metadata exchange format for documents (DDF-MXD).

DDF MXD Schema and documentation: http://mx.forskningsdatabasen.dk/mxd/1.1.0/DDF_MXD_v1.1.0.pdf

Thomas Place (Netherlands): DARE use of Dublin Core, version 2 (Dec. 2004): http://www.surf.nl/download/DARE%20use%20of%20DC%20v.%202.0.pdf

Nereus use of metadata (2005): http://drcwww.uvt.nl/~place/nereus/Nereus_use_of_DC.doc

Jos Odekerken (Netherlands): The Dutch documentation of the Pica format is at http://oclcpica.org/?id=17&ln=nl&reg=kmc

Question 3

3) How is metadata exchanged with or exposed to other services or 
repositories, such as internal research information systems, national 
systems, search engines such as Google Scholar, OAIster, BASE etc. (e.g. by 
OAI-PMH using oai_dc or similar)?

Liv Fugl (Denmark): Research from the Danish universities is supplied to the Danish National Research Database via OAI-PMH using the MXD format.

The Danish National Research Database also supplies the aggregated data to Google Scholar: a weekly dump of the data is placed at a specific location, from which Google can harvest it.

Adrian Price (Denmark): The Danish National Research Database will use OAI-PMH to harvest from IRs using the PURE software mentioned in (1).

Thomas Place (Netherlands): At the moment, the most important format is oai_dc using the guidelines of 'DARE use of Dublin Core', version 2.0 (Dec. 2004).

Exchange between Metis (Research Information System) and IRs is implemented in different ways. IRs that use Oracle sometimes synchronize their tables with the Oracle tables of Metis at the Oracle level. Others exchange metadata records between Metis and the IR using an XML exchange format that is based on the data structure of Metis.

The Royal Library and other service providers harvest the DIDL records (see above).

Jos Odekerken (Netherlands):

  • Internal research information system (METIS) ==> repository: text files (oracle to oracle)
  • Repository ==> OAIster and other service providers: oai_dc
  • Repository ==> dareNet (metadata only): dare_didl
  • Repository ==> Royal library (metadata + objects): dare_didl

Example dare_didl: http://arno.unimaas.nl/oai/dare.cgi?verb=GetRecord&metadataPrefix=dare_didl&identifier=oai:dare:3087
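A GetRecord request like the dare_didl example above is an ordinary HTTP GET with the OAI-PMH arguments in the query string. The following is a minimal sketch of how a harvester might build such a URL; the function name `oai_get_record_url` is an assumption for this illustration, and only the base URL, identifier and metadata prefixes come from the answers above.

```python
from urllib.parse import urlencode


def oai_get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        }
    )
    # urlencode percent-encodes the colons in the identifier (":" -> "%3A");
    # the result is the encoded equivalent of the example URL above.
    return f"{base_url}?{query}"


url = oai_get_record_url(
    "http://arno.unimaas.nl/oai/dare.cgi", "oai:dare:3087", "dare_didl"
)
print(url)
```

Passing `"oai_dc"` as the metadata prefix instead of `"dare_didl"` would request the same record in the simple Dublin Core format used for OAIster and other service providers.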

Question 4

4) In the repositories, are research papers and their metadata part 
of a wider set of research information (e.g. including metadata for persons, 
projects, organisations and perhaps primary data)? If so please describe.

Liv Fugl (Denmark): I'm not sure I understand this question completely. If you mean whether data is linked, e.g. as in a relational database with primary keys etc., then no. If you mean whether metadata on a document in the research database in general contains information on the authors and their organisational relation, then yes. There are, however, no relations to projects or other primary data.

Adrian Price (Denmark): With regard to the PURE system, the tendency is to provide a much wider set of research data, including:

  • bibliometric analysis (citations for articles over time)
  • activities connected with research (e.g. participation in research committees, conferences, peer review activity etc.)
  • registration of research projects
  • reporting at the level of detail needed by central planners (including budgeting for projects, successful/unsuccessful project applicants, degree of international cooperation etc.)
  • national statistics regarding the production of research publications for universities.

Added to this is how IR (software) can be used in the dissemination of research information via, for example, institutional home pages. Here, for example, PURE can be used for researcher profiles, including areas of competence, CVs, dissemination in the popular press etc.

The days when IRs were primarily used as a registration tool (the classical “catalogue function”) are over (looking at universities using PURE at any rate).

Thomas Place (Netherlands): The majority of the content is research papers (including dissertations, monographs, chapters, patents, annotations, etc.). But there are also examples of multimedia objects and data sets. As far as I am aware, there are no metadata of persons, projects and organisations in the IRs themselves.

In DARE, an author identifier is introduced, the DAI. The person records that include the DAI are maintained in (a thesaurus file of) the Dutch union catalogue hosted by OCLC PICA.

Research projects are described in the research information systems (Metis). There is a national service that makes available the information about the research projects in the Netherlands. This service is based on the information in the local research information systems. There is also a national service that combines the information about research projects with the metadata of research publications (by harvesting the IRs).

Jos Odekerken (Netherlands): The title contains information about the creator (author). In the near future, we will add a unique number for that author in our repository: the Dutch "digital author identifier".

Question 5

5) Can you identify problem spaces with currently applied metadata 
formats, such as oai_dc etc. (e.g. relationships between versions, precision, 
supporting citations etc.) or any particular issues for discussion 
(e.g. interoperability, metadata quality)?

Liv Fugl (Denmark): The 'problem space' of the always ongoing development of IR  :) The current version of the MXD will soon need an extension to accommodate full-text files. In relation to this, you could perhaps envision a further extension in the area of citations.

Adrian Price (Denmark): I do not have sufficiently detailed knowledge of specific formats, as I work mostly with an institution-specific format. Weak points that generally come to mind:

  • Authors’/persons’ names, i.e. persons as employees and persons as authors are difficult to keep track of.
  • Publication types (i.e. the products of research such as books, articles etc.) are finite sets defined by system/locally/nationally and largely as we did it “in the olden days”. They manifest themselves in a myriad of varieties which then have to be categorised by humans - and all too often it goes wrong! And then they are the basis of statistics which one day (soon) might be taken into account when we are allocated money. Show me a (more or less) flawless typology of how research results are documented, useable by “dummies”! We need it – urgently!
  • It would be advisable if our IR systems were equipped with metadata (at a defined level) which would support long-term preservation, either by the system itself or by adding the necessary metadata to documents which are intended to be ingested into other systems/environments. Consensus needed.

Thomas Place (Netherlands): One problem is identifying the object files. dc:identifier is used for object files but also for jump off pages and even for bibliographic citations. For digital preservation and for full text search engines it is necessary to identify unambiguously the object files. For multi-file documents it is necessary to indicate the semantic relations between the files. For this, Dare uses DIDL. Alternatives are schemas based on RDF or METS.
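The ambiguity of dc:identifier can be made concrete with a small example. The sketch below is illustrative only: the record is invented, the example.org URLs are placeholders, and the helper `guess_object_files` is a hypothetical heuristic of the kind a harvester is forced into when nothing in the metadata says which identifier is the object file.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Invented oai_dc record: three dc:identifier values with three different roles
# (jump-off page, object file, bibliographic citation), none of them labelled.
RECORD = f"""
<oai_dc:dc xmlns:oai_dc="{OAI_DC_NS}" xmlns:dc="{DC_NS}">
  <dc:title>An example article</dc:title>
  <dc:identifier>http://repository.example.org/record/1234</dc:identifier>
  <dc:identifier>http://repository.example.org/files/1234.pdf</dc:identifier>
  <dc:identifier>Example Journal 12 (2005), 34-56</dc:identifier>
</oai_dc:dc>
""".strip()


def identifiers(xml_text):
    """Return all dc:identifier values of an oai_dc record, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(f"{{{DC_NS}}}identifier")]


def guess_object_files(values):
    """Heuristic only: treat URLs with a known file extension as object files."""
    extensions = (".pdf", ".doc", ".ps")
    return [
        v for v in values
        if v.startswith("http") and v.lower().endswith(extensions)
    ]


print(guess_object_files(identifiers(RECORD)))
```

Such a heuristic breaks as soon as a file is served from a URL without an extension, which is why an explicit container format that separates file pointers from other identifiers is preferable.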

For relations between versions, especially related to workflow, we will follow (and participate in) ORE. In DIDL, we need to be able to express which files are the post print and which are (copies of) the publisher version. Working papers have separate DIDL records. The record of a working paper that is later published as a journal article is not reused to describe the journal article; a new record is created for the article. The latter can point to the working paper record (e.g., dc:relation).

For citations and quality publication lists, simple DC does not have enough granularity. In the Nereus schema, we use OpenURL ContextObjects in dcterms:bibliographicCitation. We found that this gives more expressiveness, but that it still has serious drawbacks. Moreover, this information must be expressed in XML and not as a string that has to be parsed. (Using MODS solves many of these problems.)
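Why a citation string "that has to be parsed" is a problem can be shown with a short sketch. Everything here is an assumption for illustration: the citation strings, the field names and the regular expression are made up and are not taken from the Nereus schema, OpenURL or MODS.

```python
import re
from typing import Dict, Optional

# A pattern that fits exactly one citation style: "Journal 12 (2005), 34-56".
CITATION_PATTERN = re.compile(
    r"^(?P<journal>.+?) (?P<volume>\d+) \((?P<year>\d{4})\), "
    r"(?P<spage>\d+)-(?P<epage>\d+)$"
)


def parse_citation(text: str) -> Optional[Dict[str, str]]:
    """Try to recover citation fields from a free-text string."""
    match = CITATION_PATTERN.match(text)
    return match.groupdict() if match else None


# The same information as structured fields: nothing has to be guessed.
structured = {
    "journal": "Example Journal",
    "volume": "12",
    "year": "2005",
    "spage": "34",
    "epage": "56",
}

# Matches the expected style, so the fields can be recovered.
print(parse_citation("Example Journal 12 (2005), 34-56"))
# A different but equally common style: the parser gives up.
print(parse_citation("Example Journal, vol. 12, 2005, pp. 34-56"))
```

A format that carries volume, year and page numbers as separate elements avoids this guessing altogether, which is the granularity that simple DC lacks.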

Metadata quality becomes an issue especially when the metadata are used for generating publication lists with high visibility (e.g., lists that are published on the home pages of the authors themselves). Some researchers have high standards with regard to their own publication lists.

Using internationally accepted formats is a sine qua non. Such a format must be extensible, because all future use is difficult to foresee.

Jos Odekerken (Netherlands): We need more precise metadata for exchanging information between DAREnet and the Royal Library. So we are looking for a new metadata format for information exchange. In this study we will look at MODS, MARC21 etc.

Question 6

6) Are you aware of future plans for using metadata formats or developing 
application profiles, both for research papers and/or other types of data?

Liv Fugl (Denmark): In the near future a similar exchange format for projects will be developed for use in the Danish National Research Database.

Adrian Price (Denmark): Under the auspices of the OIO (Public Information Online: http://www.oio.dk/) there is a working group working on a national metadata standard for research information. The first step in “research information” metadata concerns publications (early 2007) and later steps (details still to be defined) will include projects, people, organisations and events.

Thomas Place (Netherlands): I am aware of the JISC DC application profile for Eprints (I like the modelling approach, but I am not convinced that it is a practical proposal) and as chair of two DARE working groups I am supposed to be aware of the Dutch developments. As indicated, ORE will be important.

In the States and in Sweden, MODS and/or MARCXML are used.

My institution has an interest in the DDI for describing social science data.

LOM IEEE for learning objects.

Question 7

7) Do you have any other comments or suggestions regarding metadata 
creation and exchange?

Liv Fugl (Denmark): At this moment no, but they will probably arise at the workshop!

Thomas Place (Netherlands): No, not at this moment.