JiscEXPO Emerging Themes

In the previous post I gave an overview of the JiscEXPO project outputs available so far, and hinted at ones coming soon. In this post I focus more on the themes and issues that are starting to appear.  It can be quite difficult to distill these out of the information available, but I have been able to see a few patterns emerging, even though it is still relatively early days.

Archives Hub Record for Sir Ernest Shackleton

Given that linked data is, of course, about data, a number of issues have been appearing around this subject. Linked data will generally require some data modeling, and as the Locah project reports, this may mean having to change your data model mindset:

“it took me quite a while to get away from the idea of modelling the EAD record, rather than the actual data.”

“I found actually getting a ‘starting point’ a bit difficult. I think this is because everything can be a starting point”

There can also be inherent complexities in the existing data that can make the modeling difficult:

“perhaps one of the thorniest [questions] is that arising from one of the fundamental characteristics of the nature of archival description [which is] typically based on a “hierarchical”, “multi-level” approach”

“One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context” … the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description”

“So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”.”

The process of transforming and exposing linked data can also highlight ‘dirty’ data, and issues around disambiguation. The MusicNet project mentions problems arising from different naming conventions and input error when looking for records that represent the same musical composer in multiple data sets. They’ve been experimenting with a data alignment tool they developed to help solve these issues, and have put together this YouTube video demo:

Locah have also been finding numerous examples of inconsistencies, such as where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name for example.
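To make the disambiguation problem concrete, here is a minimal sketch in Python of one way to compare differently formatted names. It is not the MusicNet alignment tool or anything Locah have built; the function names, the normalisation rules and the 0.85 threshold are all assumptions chosen for illustration.

```python
import difflib
import re
import unicodedata


def strip_and_others(creator):
    """Separate a trailing 'and others' from a creator string."""
    match = re.fullmatch(r"(.*?)\s+and others", creator.strip(), flags=re.IGNORECASE)
    if match:
        return match.group(1), True
    return creator.strip(), False


def normalise_name(name):
    """Reduce a personal name to a crude comparison key."""
    # Strip accents so that accented and unaccented spellings compare equal.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Turn 'Surname, Forenames' into 'Forenames Surname'.
    if "," in plain:
        surname, _, forenames = plain.partition(",")
        plain = f"{forenames} {surname}"
    # Lower-case, replace punctuation with spaces, collapse whitespace.
    cleaned = re.sub(r"[^a-z\s]", " ", plain.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity of two names (0.0 to 1.0) after normalisation."""
    return difflib.SequenceMatcher(None, normalise_name(a), normalise_name(b)).ratio()


def candidate_matches(name, others, threshold=0.85):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(other, similarity(name, other)) for other in others]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

A score near 1.0 only suggests that two strings may name the same person. A real alignment workflow would still need a human to confirm or reject each candidate.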

Linkbrainz have noted some scalability challenges that can arise with some linked data:

“[the] problem becomes acute for classical composers like Bach who are credited with tens of thousands recordings … the complete RDF resource description for Bach would be immense. This would cause an unacceptable load on the database server and long wait times for dereferenced URIs.”

They suggest a solution that uses the pagination of the HTML pages for the RDF or RDFa, but note that this is not ideal from a modeling point of view. They also mention that including RDFa in the MusicBrainz HTML pages can increase the page size by somewhere between 5% and 30%.
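The pagination idea can be sketched in a few lines of Python. This is only an illustration of serving a very large resource description in slices; it is not how Linkbrainz implement it, and the `get_page` function, the page size and the `ex:` identifiers are hypothetical.

```python
def get_page(triples, page, page_size=1000):
    """Return (triples on this 1-based page, next page number or None)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    chunk = triples[start:start + page_size]
    # Only advertise a next page if some triples remain beyond this slice.
    has_more = start + page_size < len(triples)
    return chunk, (page + 1 if has_more else None)


# Hypothetical data: one composer credited on 2,500 recordings.
triples = [("ex:bach", "ex:creditedOn", f"ex:recording/{i}") for i in range(2500)]

first_page, next_page = get_page(triples, 1)
last_page, after_last = get_page(triples, 3)
```

A client follows the next-page number until it is `None`. The modeling drawback noted above remains: no single response describes the whole resource.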

Linkbrainz have also had to contend with some licensing issues:

“… some content in the MusicBrainz database is licensed as by-nc-sa… JISC considers this license incompatible with completely open data. Therefore, this small subset of the MusicBrainz database will likely be omitted from our translation moving forward.”

British Library to share millions of catalogue records

However, the JiscOpenBib project appears to have sorted out data licensing without too many problems, having recently announced that the British Library is providing bibliographic data under CC0 Public Domain Dedication Licence.

MusicNet have drawn attention to the question of how to sustain the data from the JiscEXPO projects, and the HE sector in general in the longer term, suggesting that we need provision for UK academic data to be hosted on the JANET network under a suitable .ac.uk domain. A hosted data.ac.uk is proposed, possibly JISC funded, to lower the technical and financial barrier to entry to publish RDF. One suggestion is that this could be possible via the data.gov.uk education datastore.

Locah believe there is a significant skills and training gap in the linked data area, noting a lack of domain specific examples, and a lack of helpful information about how to create a data model. They suggest that at the moment, a certain level of expertise is needed to model data and output RDF, and that efforts to address this and make it easier would help the take up of linked data. They do however note that adopting the linked data approach is already paying dividends by making development more user focused:

“the very big plus with this different kind of thinking is that by definition it puts what the user is interested in at the forefront of your thinking”

So we can see the projects are meeting a range of challenges in exposing their linked data. It’s worth noting that many of the difficulties do not uniquely arise from outputting linked data, and in fact, the projects are in many cases simply ‘exposing’ existing problems that have thus far remained hidden behind data silos. It’s good to hear about the positive effects the linked data approach can have on helping to steer development in a more user focused direction. It will be interesting to see how the projects get on and what further themes arise when the demonstration prototypes start to appear this year.

Posted in JiscEXPO Synthesis

JiscEXPO Quarterly Executive Newsletter

A key aim of this JISC programme newsletter is to highlight some of the outputs and emerging themes from across the ten projects that make up the ‘JiscEXPO’ programme funded by the 2/10 grant funding call, and, if possible, use these to identify any themes that cut across other ongoing JISC programmes. This work is invaluable to JISC, as the information is fed directly back into the overall evaluation of the programme which informs the Digital Infrastructure Directors, and which in turn is synthesised for JISC’s Senior Management Team.

JiscEXPO Programme

The broad aims of the ‘exposing content for education and research’ call (‘JiscEXPO’ for short) were to make a collection of resources available as structured linked data by adopting Tim Berners-Lee’s ‘four rules of linked data’, and to produce a prototype that meets a ‘compelling end user case’. Projects were also invited to report on opportunities and barriers in making the linked data available so other UK HE and FE institutions could learn from their experiences. The information in this newsletter is mainly taken from the projects’ blogs, and most of the links below point to these. A single blog post from a project can end up being discussed at the top of JISC, which can and often does affect new policy decisions. So projects, please keep your blog posts coming!

We’re about a third of the way through JiscEXPO, and we’re now starting to see some emergent themes appear.  I’ve attempted to summarise where JiscEXPO is up to so far, and give an overview of the themes arising.

A notable early JiscEXPO output was data.open.ac.uk from the LUCERO project which exposes the data available in the Open University’s various institutional repositories and makes it openly available for re-use. Already live are Open Research Online, the OU Podcasts, and some of the courses from the Study at the OU website.

data.open.ac.uk

Linkbrainz

The LinkBrainz project will be publishing the metadatabase from the popular MusicBrainz service as Linked Data, along with a number of tutorials for users. BBC Music is one of a number of sites that pulls in data from MusicBrainz, so we can look forward to some exciting developments following the release of their linked data. In September they announced that their RDFa test server was available, so we can now see what the data they’re embedding in MusicBrainz will look like. I had a quick look at the RDFa for John Coltrane’s ‘A Love Supreme’ using the Sindice service.

fishdelish

The fishDelish project have also just provided access to their FishBase species linked data, and the JISC OpenBibliography project expects to have data available soon as referred to in a recent progress report.

Locah

The LOCAH project will be making data from the Mimas based Archives Hub and Copac services available as Linked Data. I happen to know (as I’m managing the project) that we have linked data available via SPARQL interfaces on a number of test servers. We’re still working on refining our data models, as well as cleaning up and enhancing the data before we make these available publicly, but we have made details of our Hub and Copac modeling work available on the blog. Pete Johnston has also posted about our approach to URI patterns, and our blog post on the challenges of exposing linked data has been well received.

Jisc OpenCite

The aim of the JISC Open Citations project is to publish life science bibliographic citation data as Open Linked Data. They have recently made available their first four ontologies of SPAR, the Semantic Publishing and Referencing Ontologies, an integrated ecosystem of generic ontologies.

So, that gives an idea of where we’re up to with outputs so far. I’ve not covered every project in this post, so as not to make it too long, but I will be sure to highlight all the significant outputs in forthcoming posts (so please don’t be offended if you’re one of the JiscEXPO projects not mentioned yet :) ). In the next post I’ll be looking at the themes that have been emerging.

Posted in JiscEXPO Synthesis

Confidence, and the business of persistent identification

The persistent identification of resources is a foundational element of the JISC Information Environment. There are several schemes and technologies available to support this, with one of the most prominently used in the JISC IE being the Digital Object Identifier (DOI). Built on the Handle technology, the DOI, under the stewardship of the not-for-profit International DOI Foundation (IDF), adds the important element of collective commitment and management, based on straightforward business interests. DOIs are allocated and managed through Registration Agencies (RAs).

DOI has become somewhat synonymous with scholarly publishing, with most people working in the JISC IE having encountered them in citations for papers in online journals and repositories. However, while publishers continue to play an important role in minting and using DOIs, the use of DOIs to persistently identify datasets produced in research is growing in significance. Last year saw the creation of a new RA – DataCite, which deals with this relatively new and growing area.

There has been much debate over the years about the persistent identification of resources – especially at the technical level. Yet all technical solutions are bound, eventually, to come up against the issue of the persistence, or lack thereof, of organisations of people. In the JISC IE space we can see that publishers come and go, and that journal titles, for example, merge or change ownership from time to time. Universities, seen by many as very persistent organisations (a pre-conception which might, sadly, be tested in the next few years) do, nonetheless, merge and change.

The creation of a body which has as its primary goal the management of the persistence of identifiers – essentially the role of the Registration Agency in DOI – is an approach to addressing this lack of permanence. Within the ‘ecosystem’ of the RAs, each participant has a vested interest not only in maintaining their own identifiers, but in ensuring that the system as a whole continues to function well. From this point of view, it is in the interests of all participants that the commitment from others is strong, which means that the addition of new RAs, such as DataCite, can only be a good thing.

Over the last year or so, IDF has been working with MovieLabs as part of a project to establish the not-for-profit Entertainment Identifier Registry (EIDR). This initiative includes the establishment of a new Registration Agency for DOIs for all digital resources created for TV and film by a consortium of many of the major producers in the entertainment industry. EIDR is actively seeking more participants, and offers a variety of types of membership.

While the engagement of this new industry may not be directly relevant to many people working in the scope of the JISC IE, the confidence and investment which this industry has placed in the DOI system is significant. This development increases the viability of DOI in general and, as such, should make it a more attractive prospect to those working in the JISC IE and in HE in the UK generally.

Essentially, confidence is an important aspect of persistence – and significant buy-in to DOI from such different sectors, commercial and public, should increase confidence in this solution.

A whitepaper about EIDR is available on request.

An introduction to DOI in a higher education context (set of presentation slides)

Posted in Infrastructure, Standards

JiscEXPO Programme Synthesis

Earlier this year, JISC issued the 2/10 Grant Funding Call for ‘Deposit of research outputs and Exposing digital content for education and research’, JiscDEPO and JiscEXPO for short. In addition to managing the LOCAH Project that was successful in being funded as part of JiscEXPO, I am also now undertaking the ‘Synthesis Liaison’ role for the JiscEXPO programme (tag = #jiscexpo) working with programme manager David Flanders.

I’ve essentially only just started this role, so I’m still getting a sense of the other projects. I was already familiar with the JiscEXPO programme website as one of the participating projects, but I’ll now be getting more familiar with all the other projects too. The programme synthesis activities are described in more detail on the JiscEXPO site, but it’s basically about identifying ‘emergent patterns’ from across the projects, and ‘synthesising’ these. I’ll then be writing posts on this blog to let you all (and JISC) know what I find.  I’ll also be commenting on each of the project blogs, and will be attempting to aid discussion and cross-pollination of ideas across the projects through comments and cross-links. This diagram from the JiscEXPO site aims to describe the process pictorially:

JiscEXPO Programme Synthesis Activity Overview

In many respects this is a new way of doing programme synthesis for JISC, so it’s going to be interesting to see how it goes.

Posted in JiscEXPO Synthesis

Consuming and producing linked data in a content management system

At this summer’s Institutional Web Management Workshop in Sheffield (IWMW 2010), I demonstrated how it is becoming feasible for a content management system both to consume and to produce linked data resources. In a parallel session, I presented an overview of the current state of play in ‘Semantic content management: consuming and producing RDF in Drupal’. In a video-recorded plenary session (specifically in a nine-minute segment of the recording, from 34 through 42 minutes), I briefly reviewed how a modern CMS can enrich local datasets with remote linked datasets and, by engaging with the web of data, produce new insights. Here I explain the scope of what I demonstrated at this event, outline some practical implementation procedures, and evaluate initial results.

Scope: interim check on long sojourn towards promised land

The scope of my demonstration was limited: quickly testing current feasibility of consuming and producing linked data sets in a real-world context. Using recent developments in content management technology, work on this demo was designed to check:

  • how easily local datasets can be combined and enriched with remote datasets
  • how effectively a content management system can engage with linked data to provide new insights
  • how close we are to that promised land where linked data technology can provide real benefits to a broad range of websites

Choosing a context immediately relevant to participants in this year’s Institutional Web Management Workshop, I decided to build a ‘proof of concept’ website providing a synoptic view of institutions and speakers participating over many years in IWMW events. From IWMW organisers I understood that quite a lot of data related to these events was already available, in discrete forms.  Event information going back more than 10 years was accessible via many separate IWMW RSS feeds and web pages.  Other datasets of interest, however, languished in office spreadsheets (until now, buried within the classic ‘information silo’).  With so many sources of data available, the challenge was to find a way of presenting disparate sets of information in a unified, manageable, understandable way.

Having read with interest the explanations and arguments in “Exploiting Linked Data For Building Web Applications” by Michael Hausenblas (2009), my objective was to check how linked data technologies can have practical applications within a real-world website:

Semantic Web technologies are around now for a while, already. However, in the development of real-world Web applications these technologies have … little impact to date. With linked data this situation has changed dramatically in the past couple of months. This article shows how linked datasets can be exploited to build rich Web applications with little effort.

Hausenblas observes that “in contrast to the full-fledged Semantic Web vision … linked data is mainly about publishing structured data in RDF using URIs rather than focusing on the ontological level or inferencing”:

This simplification — just as the Web simplified the established academic approaches of Hypertext systems — lowers the entry barrier for data provider, hence fosters a wide-spread adoption.

This ‘simplicity wins’ argument rings true with regard to many technology development patterns, and with human nature.  Personally, it strongly reminds me of what I noticed during early days of the web. During the mid-1990s I could well understand how SGML adherents disliked the relatively gross simplifications of HTML and its growing preoccupation with presentation and formatting rather than semantic structure.  Nevertheless, it did seem clear then, as now, that simplification and widespread adoption were highly correlated. Is it really becoming simpler, as Hausenblas and others recently claim, so that “linked datasets can be exploited to build rich Web applications with little effort”, thanks to advances in content management systems?

This summer, as I worked on a prototype website for my IWMW presentation, I remembered how long the journey has been to the long-anticipated ‘semantic web’ promised land.  In Tim Berners-Lee’s first recorded proposal for the World Wide Web, as drafted in March 1989 and then revised in 1990, there are remarkable indications of ‘semantic web’ notation — aligned with much later development of RDF (as noted by Dan Brickley in ‘Semantic Web History: Nodes and Arcs 1989-1999′).

Tim Berners-Lee original proposal for WWW (1989/90)

Looking back again this summer, I noticed how the 20 years that have elapsed since this first proposal did seem (in comparison to normally fast-paced ‘internet time’) very much like 40 years of wandering in the desert. I wondered whether the ‘semantic web’ promised land, flowing with linked data, was at last in sight.

With its core integration of a robust RDF API and its much-heralded functionality to produce and consume linked data, forthcoming Drupal 7 promised, after two years of active planning and development, to bring linked data technologies into a widely used content management system:

While it is worthwhile to mention that the first of these [content management] systems appeared at around the same time as Semantic Web technologies emerged, with RDF being standardized in 1999, the development of CMSs and Semantic Web technologies have gone largely separate paths. Semantic Web technologies have matured to the point where they are increasingly being deployed on the Web. But the HTML Web still dwarfs this emerging Web of Data and — boosted by technologies such as CMSs — is still growing at much faster pace than the Semantic Web….
Approaching site administrators of widely used CMSs with easy-to-use tools to enhance their site with Linked Data will not only be to their benefit, but also significantly boost the Web of Data. (Corlosquet, Delbru, Clark, Polleres, Decker, 2009)

In designing a prototype website for the IWMW event, I specifically wanted to evaluate:

  • Beyond the handful of  apps and websites described by Hausenblas as exemplary integrations of linked data (Faviki, DBpedia Mobile, BBC Music, Musicbrainz),  how easy would exploiting linked data resources be for a broad range of websites managed by an open source content management system?
  • How faithfully can a modern CMS implement best-practice guidelines for exploiting linked data, such as those explained by Hausenblas?
  • Where some guidelines cannot yet be implemented, are practical benefits achievable?

Implementation procedures: from hypothetical to actual

Hausenblas describes how key linked data principles could be applied in building a hypothetical website:

imagine a historical … website http://example.org/cw/ that deals with the topic ‘Cold War’ … [and] assume the site is powered by a popular software such as WordPress or Drupal. (Hausenblas, 2009)

Whereas Hausenblas bases explanations on hypotheticals, I wanted to evaluate more closely what can actually be achieved in building a website that exploits linked data resources, using a specific, currently available CMS. Given the buzz of anticipation for the forthcoming release of version 7 with core RDF integration, I chose Drupal as the best choice for a feasibility test.

Hausenblas explains, at high level, two “steps needed for exploiting linked datasets in an exemplary Web application”:

In order to exploit linked dataset[s] properly, basically two steps are required: (i) prepare your own data, and (ii) select appropriate target datasets.

Preparing local data

As explained in my post on the prototype website entitled ‘Consuming and producing RDF: current arrangements’, my first stage of work concentrated on local datasets:

  • extracting available data from IWMW registration details kept in office spreadsheets
  • compiling event information (session abstracts and speaker bios) from RSS feeds on IWMW website
  • cross-checking IWMW web pages for detailed information about sessions and speaker affiliations

During this first stage of local data extraction and compilation, I used Perl scripts to create relevant datasets.

Overall, this first stage of work required more time and effort than the next stage. Because it needed  ad hoc data-munging scripts, this work on local data ultimately proved more tedious than the more routine retrieval of linked data resources in stage two.

Selecting linked data resources

Once these local data sets were extracted and compiled for IWMW speakers and their affiliations, it became clear how DBpedia could supply quite a lot of useful linked data.

During this second stage of work on the prototype, I used a combination of Perl scripts to retrieve and process RDF triples (including textual descriptions, statistics, geolocation coordinates, etc) from DBpedia and then Drupal utility modules (‘Feeds’ and ‘Taxonomy CSV’) to batch-load this data into relevant segments of the prototype ‘IWMW synoptic’ website.
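As a rough illustration of the retrieval step, the sketch below builds a SPARQL request for a DBpedia resource’s abstract and coordinates. It is written in Python rather than reproducing the scripts used for the demo, which are not shown here. The helper names and the example resource URI are assumptions, and the `format` parameter is the one I understand DBpedia’s public endpoint to accept, so treat it as such.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_query(resource_uri, lang="en"):
    """Build a SPARQL query for a resource's abstract and coordinates."""
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?abstract ?lat ?long WHERE {{
  <{resource_uri}> dbo:abstract ?abstract .
  OPTIONAL {{ <{resource_uri}> geo:lat ?lat ; geo:long ?long . }}
  FILTER (lang(?abstract) = "{lang}")
}}
""".strip()


def build_request_url(resource_uri):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": build_query(resource_uri),
        # Assumed result-format parameter; an Accept header is an alternative.
        "format": "application/sparql-results+json",
    }
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


# Illustrative resource: one institution a speaker might be affiliated with.
url = build_request_url("http://dbpedia.org/resource/University_of_Sheffield")
```

Fetching `url` with any HTTP client should return bindings that can be flattened into CSV rows for a batch-loading module, but the exact columns would depend on what the site needs.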

Note: Forthcoming modules such as ‘SPARQL views’, as explained by Lin Clark in a project proposal and video, are designed to enable “average users to integrate SPARQL into their website workflow” without need for external scripts. As I worked this summer on retrieving and integrating linked data into a demo website, however, this facility was missing in both Drupal 6 and the Drupal 7 alphas.

Beyond these datasets from DBpedia, a range of further resources could be integrated given more time and scope to engage with the Web of Data:

Initial results, trends, and directions of travel

Even with the limited scope and time available for working on the ‘IWMW synoptic’ demo website, it was possible to produce quite a lot of initial results. Here are some links to views of local datasets enriched with linked data:

Affiliations of IWMW speakers: example view showing local data enriched with linked data

How easy?

Short answer: CMS arrangements do make it remarkably easy to present local data enriched with linked data, accessible in both human-usable and machine-readable views.

In the currently transitional state of Drupal development (as explained in ‘Semantic content management: consuming and producing RDF in Drupal’), however, this requires quite a bit of ad hoc preparation.

This summer, I needed to write custom scripts both for preparing local data and for retrieving linked data. This latter process of retrieving linked data should become easier when utility modules such as ‘SPARQL views’ and others become available following official release of Drupal 7.

Only after a full complement of RDF modules becomes available following an official release of Drupal 7 can the optimistic vision of CMS advocates be justified:

Again, the [website] operator is in a comfortable position: for his system plug-ins exist allowing to expose data with just a few configuration changes. (Hausenblas, 2009)

My experience this summer proves that it takes more than just ‘a few configuration changes’ before a CMS manager can start consuming and producing linked data robustly. Such a ‘comfortable position’ is not yet quite a reality.

How faithful?

Hausenblas discusses three best-practice guidelines for making a content management system “Web-of-Data compliant”:

  1. re-using relevant ontologies and vocabularies (such as FOAF)
  2. exposing linked data as RDF/XML, RDFa, or in SPARQL endpoints
  3. minting URIs along the lines used by DBpedia (where machine-readable (RDF) and human-usable (HTML) versions are distinguished within URI spaces /resource and /html paths, ideally accessible via automated content negotiation)

Guideline 1: Re-using common vocabularies

Regarding the first guideline, I found that Drupal 6 RDF modules available this summer do facilitate re-use of commonly used vocabularies such as FOAF (and many others).  In fact, just a few configuration changes were required for the demo site to output RDF such as this (abridged) excerpt:

Example of FOAF output from Drupal 6 RDF modules
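For readers unfamiliar with FOAF, the following Python sketch prints a few N-Triples of the general kind such a module might expose for a speaker. It is not the actual Drupal output: the example.org URIs, the speaker name and the helper functions are invented for illustration, and only the FOAF and RDF namespace URIs are real.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def speaker_triples(speaker_uri, name, workplace_homepage):
    """Describe a speaker with three FOAF statements in N-Triples syntax."""
    subject = uri(speaker_uri)
    return [
        f"{subject} {uri(RDF_TYPE)} {uri(FOAF + 'Person')} .",
        f"{subject} {uri(FOAF + 'name')} {literal(name)} .",
        f"{subject} {uri(FOAF + 'workplaceHomepage')} {uri(workplace_homepage)} .",
    ]


# Hypothetical speaker record; none of these values come from the demo site.
lines = speaker_triples(
    "http://example.org/speaker/1", "Jane Example", "http://example.org/"
)
print("\n".join(lines))
```

Re-using `foaf:Person` and `foaf:name` in this way is what lets a generic consumer understand the data without knowing anything about the site that produced it.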

Guideline 2: Exposing linked data in various formats

With regard to this second guideline for exposing linked data as RDF/XML, RDFa, or as query output from a SPARQL endpoint, I found that:

  • Drupal 6 RDF modules can easily export a range of linked data in RDF/XML format.  (Upon official release of Drupal 7, there will be ‘out of the box’ support for RDFa output.)
  • It was easy to set up a SPARQL endpoint with just a few configuration changes, so that it could respond (in a very wide range of formats) to queries on triples compiled automatically (via cron runs) from website content.

As a result of the transitional state of module development pending final release of Drupal 7, however, I found that RDF/XML output included eccentric (‘site’) vocabulary tags.  In effect this produced redundant noise in RDF which, albeit distracting to the human eye, could be safely ignored by machine-read processes keyed to a standard vocabulary such as FOAF.

Guideline 3: Mint machine-readable and human-usable URIs

Regarding this third guideline, I found that current state of development in Drupal RDF modules could not support an ideal arrangement for automated content negotiation as implemented by DBpedia.

Drupal 6 RDF modules do, however, support parallel RDF and HTML output using URI schema such as:

Not ideal yet reasonably practical.
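The ‘ideal arrangement’ of automated content negotiation can be sketched as follows. This is a simplified Python illustration of the general technique (inspect the Accept header, then redirect with a 303 to a machine-readable or human-usable document). It is not Drupal or DBpedia code, and the /rdf/ and /html/ paths and the function names are placeholders.

```python
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml", "*/*", "text/*"}


def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    preferences = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        # Negative position keeps the header's own order among equal q values.
        preferences.append((quality, -position, media_type))
    preferences.sort(reverse=True)
    return [media for quality, _, media in preferences if quality > 0]


def negotiate(resource_path, accept_header):
    """Return (status, location) for a request to a /resource/ URI."""
    name = resource_path.rsplit("/", 1)[-1]
    # Treat a missing Accept header as 'anything will do'.
    for media_type in parse_accept(accept_header or "*/*"):
        if media_type in RDF_TYPES:
            return 303, f"/rdf/{name}"
        if media_type in HTML_TYPES:
            return 303, f"/html/{name}"
    return 406, None
```

With this in place a browser and an RDF crawler can share one URI for the thing itself, which is the property that parallel RDF and HTML URIs only approximate.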

The future?

If Hausenblas, Corlosquet and others are correct about prospects for CMS developments boosting the adoption of linked data technologies, this can dramatically broaden the numbers and types of websites engaged with the Web of Data.

Probably more than 7 million websites were using Drupal in July 2010 (including many large, high-traffic and high-profile websites in commercial, governmental, and academic contexts). As more websites transition into using the new Drupal 7, this can sharply increase the numbers of websites consuming and producing linked data.

Is this the future as illustrated in the DrupalCon Boston 2008 keynote presentation ‘Video from the future’? That keynote, which announced the start of work on integrating RDF into Drupal core, illustrated some interesting RDF ‘web of data’ mashups. The current focus is on increasing take-up.

As illustrated by Google Trends, levels of interest in ‘semantic web’ technologies (as reflected in search volumes) declined steadily from 2004 to 2010.

Google Trends: 'semantic web' search volumes (2004-2010)

By contrast, Google Trends indicate that search volume levels for ‘linked data’ are gradually rising.

Google Trends: 'semantic web' and 'linked data' search volumes (2004-2010)

At this point, is active interest in ‘linked data’ overtaking long-established interest in the ‘semantic web’?

If Drupal’s integration of RDF into its core functionality can help dramatically expand the number of websites engaging with linked data, this is good news for tribes on a long sojourn towards a promised land.

References

Michael Hausenblas, “Exploiting Linked Data to Build Web Applications,” IEEE Internet Computing, vol. 13, no. 4, pp. 68-73, July/Aug. 2009, doi:10.1109/MIC.2009.79.

Stéphane Corlosquet, Renaud Delbru, Tim Clark, Axel Polleres, Stefan Decker, “Produce and Consume Linked Data with Drupal!”, Proceedings of the 8th International Semantic Web Conference (ISWC 2009), Springer, 2009, doi: 10.1007/978-3-642-04930-9_48.

Posted in Technical Review

Aggregation and the Resource Discovery Taskforce vision

On Tuesday of this week, UKOLN convened a group of invited experts to discuss aggregation in the context of the Resource Discovery Taskforce’s vision. The Resource Discovery Taskforce (RDTF), a joint JISC / RLUK venture, has summed up its vision:

UK researchers and students will have easy, flexible and ongoing access to content and services through a collaborative, aggregated and integrated resource discovery and delivery framework which is comprehensive, open and sustainable

Given the limitations of time and resources, and with a firm intention to make a real contribution, the RDTF has decided to focus on aggregation of metadata as a means to progressing the vision. There was some debate at the meeting about the extent to which aggregation is something worth focussing on, and a general concern that this not become an end in itself, rather than a means to an end. We agreed to use the phrase ‘aggregation as a tactic’ as a way of characterising the proper relationship of this approach to the vision, and steered the remainder of the meeting to address aggregation from a mainly technical perspective. To get the ball rolling, I introduced a slide wherein I attempt to list possible reasons for aggregating data:

  • to address systems/network latency – a cache
  • for ‘Web Scale concentration’
    • ‘gaming’ Google – raising ‘visibility’ of content
    • network effects if user facing services also developed
  • to showcase (e.g. scale & nature of OER in UK)
  • to create middleman business opportunities
  • as infrastructure to support locally developed services
  • as an approach to preservation

This was discussed at some length, and we agreed that some other reasons could be added to this list:

  • for economic reasons – e.g. to achieve economies of scale through storing & managing metadata in one place, implying that the aggregation becomes the sole source of a given metadata record
  • to add value to the data through processes, especially around data quality, which are impractical or even impossible to contemplate when the metadata is distributed
  • to simplify licensing from the point of view of the consumer of the aggregated data

We noted that while the RDTF vision seems to concentrate on metadata describing resources and their provision, other types of metadata, such as user-generated annotations and user attention or activity data, are also of great potential interest and value and might be aggregated advantageously.

The importance of registries to help in the identification and discovery of relevant data was raised.

For the second part of the day we broke the meeting up into three smaller groups, each concentrating on an aspect of the preceding general discussions. Each of these groups, when they summarised their discussions for the whole meeting later, identified issues and made recommendations. Where these are generally applicable (which they mostly are), rather than outline them in the following descriptions of the breakout groups I have treated them together in two sections at the end of this post.

Breakout 1: APIs

This group looked at the role which Application Programming Interfaces (APIs) have to play in an environment of aggregated metadata and related services. It used a spectrum of technological interventions ranging from specific service development to meet a particular need, through to generic infrastructure provision to provide opportunities for others to develop services, and attempted to place classes of APIs on this spectrum:

[Figure: service-opportunity-spectrum.gif]

It was agreed that it was important to understand this distinction, and to be equipped to judge where to ‘draw the line’ between meeting specific requirements and investing in capacity for future innovation. There is clearly a tension between agility, which becomes more desirable as one moves along the spectrum towards servicing users’ requirements, and stability, which is necessary for infrastructure to be trusted. Part of the purpose of APIs is to help to manage this tension.

APIs are for developers, and so APIs on aggregations must be highly usable from the point of view of a developer. Focussing on the need for aggregations to expose APIs so that services can build upon them, this group made some recommendations (included in the general recommendations at the end of this post) about the sorts of general features an API should exhibit. In general, it was agreed that an API on an aggregation must be more convenient, from the point of view of a developer, than going directly to the individual sources. Leaving aside simple issues of network latency, in a possible Linked Data future where data is commonly openly available, the aggregation and its API must not become a barrier to building services and adding value to data.

This group also discussed the issue of federation of aggregations – where one aggregation feeds another. There are serious engineering issues with this kind of federation which require better understanding.

Breakout 2: Aggregation as tactic

This group decided to start by looking for “prior art”: examples of successful uses of aggregation as a tactic for improving resource discovery. With this approach, it was suggested, it would be possible to identify stakeholder groups which are already ‘bought into’ the idea of using aggregation as a tactic in this way, which ought to be easier than convincing people from scratch. The trick would seem to be to identify a shared service which could be developed upon an aggregation of metadata, and which they could recognise would be beneficial to them. Examples of successful aggregations were identified and included:

  • Copac (aggregated records from National, Academic, and Specialist Library Catalogues)
  • SUNCAT (a national serials union catalogue)
  • WorldCat (a global, aggregated library catalogue)

Echoing an earlier point, the group suggested that the value in aggregation as a tactic comes from the ability to normalise metadata into some sort of canonical form. This aspect of the aggregation adding value to the data it aggregates is crucial if the source record holders are to be persuaded to participate.
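As a rough illustration of what normalising into a canonical form might involve, here is a minimal Python sketch. The source names, the field names and the shape of the canonical record are all hypothetical, invented for this example; they do not describe how Copac, SUNCAT or WorldCat actually work.

```python
# Hypothetical field mappings: each source uses its own field names.
FIELD_MAPS = {
    "library_a": {"Title": "title", "Author": "creator", "Year": "date"},
    "library_b": {"dc_title": "title", "dc_creator": "creator", "issued": "date"},
}


def normalise(source, record):
    """Map a source-specific record onto the (assumed) canonical field names."""
    mapping = FIELD_MAPS[source]
    canonical = {"source": source}
    for field, value in record.items():
        if field in mapping:
            # Collapse stray whitespace so equivalent values compare equal.
            canonical[mapping[field]] = " ".join(str(value).split())
    return canonical


def aggregate(batches):
    """Normalise every record from every source into one flat list."""
    return [
        normalise(source, record)
        for source, records in batches.items()
        for record in records
    ]


records = aggregate({
    "library_a": [{"Title": "South ", "Author": "Shackleton, Ernest", "Year": 1919}],
    "library_b": [{"dc_title": "South", "dc_creator": "Shackleton, Ernest", "issued": "1919"}],
})
```

Once both records share the same field names and value conventions, tasks such as spotting duplicates across sources become far simpler than they would be against the distributed originals.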

The group suggested that JORUM’s role in supporting the national (and global) Open Educational Resources (OER) movement was very much in line with this thinking: that JORUM enhances discoverability of OERs created in UK institutions, while simultaneously offering the potential for long term archiving (preservation). Again, the group suggested, the importance of the registry becomes apparent, with JORUM likely to become important as a service providing identification and ‘provenance’ services.

The group discussed the idea of concentrating on one particular domain, such as geography, on the grounds that this could then be built out to an extent that other domains would become interested once they had seen what has been achieved. The counter to this argument was a suggestion that it might be better to consider a range of resource types including scholarly communications (bibliographic data), learning materials, repositories, spatial/geographical data and multi-media.

It was also noted that the ‘aggregation as a tactic’ argument might apply to self-archiving and Open Access, where arguments similar to those for JORUM and OERs apply.

It was suggested that this was leading to a set of tactics which would help content providers get over a ‘fear’ of aggregation, and encourage them to open up from a position of ‘data ownership’. It was also recognised that once this is achieved, aggregation as a tactic creates opportunities for ‘middle-men’ to add value through new services building on top of the aggregation.

Interestingly, this group suggested that aggregation as a tactic might be a short- or medium-term tactic, and that the ‘end game’ would be to dis-aggregate content back to source. At this point, the remaining infrastructure would be of the ‘registry’ type, helping to locate data at source.

Breakout 3: Build better websites!

The emphasis of this session was on advising & enabling those who hold source metadata to make it available in an appropriate form. The group identified a number of ‘steps’ that a content provider might take. These steps are ordered in a system of progressive desirability in a model influenced by Tim Berners-Lee’s Linked Data Note:

  1. make data available in an open form (even using the much-maligned CSV format if necessary)
  2. assign and expose HTTP URIs for everything, and expose useful content at those URIs
  3. publish as XML
  4. expose semantics

It was noted that these steps do not demand that a provider should work their way through them sequentially – it is perfectly acceptable and even desirable to jump in at step 4 – however this might represent a significant barrier to some, so steps 1-3 are there to give content providers a chance to engage comfortably.
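To make the four steps more concrete, here is a minimal Python sketch that takes one record through each of them. It is only an illustration under stated assumptions: the `example.org` base URI, the sample record and the element names are hypothetical, and the Dublin Core terms vocabulary is just one possible choice for step 4.

```python
import csv
import io
import xml.etree.ElementTree as ET

BASE_URI = "http://example.org/id/"    # hypothetical base for minted URIs
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace

# Step 1: data in an open form, here a small CSV export (sample data).
CSV_DATA = "id,title,creator\n42,South,Ernest Shackleton\n"
rows = list(csv.DictReader(io.StringIO(CSV_DATA)))


def uri_for(row):
    """Step 2: give each record its own HTTP URI."""
    return BASE_URI + row["id"]


def to_xml(row):
    """Step 3: publish the record as XML (element names are assumed)."""
    record = ET.Element("record", {"uri": uri_for(row)})
    for field in ("title", "creator"):
        ET.SubElement(record, field).text = row[field]
    return ET.tostring(record, encoding="unicode")


def to_ntriples(row):
    """Step 4: expose semantics by using shared vocabulary terms as properties.

    Literals are not escaped here; a real publisher would use an RDF library.
    """
    subject = uri_for(row)
    return [
        f'<{subject}> <{DCTERMS}{field}> "{row[field]}" .'
        for field in ("title", "creator")
    ]


xml_record = to_xml(rows[0])
triples = to_ntriples(rows[0])
```

Each function builds on the same CSV row, which reflects the point that a provider can stop at whichever step is comfortable and move further later.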

Barriers specific to this model being adopted successfully include the issue of securing vendor ‘buy-in’. For content providers to support this model, their software platforms need to enable it, which in most cases they may not do at present. Also, specific skills in Linked Data are not so widespread in these sectors (yet), and an appreciation of and support for Linked Data is not common among senior managers. It was recommended that JISC create some political momentum around this, perhaps devising a convincing argument for senior management. It was also suggested in this breakout group that RDTF should provide a central resource (guidance & possibly infrastructure) for hosting data, especially for smaller organisations.

This approach was summed up as a description of a potential glam.ac.uk where glam is galleries, libraries, archives and museums.

General Issues

  1. Lack of technical expertise in libraries, museums and archives. This applies most strongly in respect of the ‘build better websites’ model, but is also true more generally, especially when the long-tail of glams is considered.
  2. Business case, or possible lack thereof. The content providers need to see a clear benefit before committing to the cost involved in supporting the aggregation of their data.
  3. Content providers often show a reluctance to make data openly available on the grounds that they may expose poor quality which reflects badly on them.

Recommendations

The various discussions during the meeting gave rise to a number of suggested recommendations. It should be noted that these are based on a few short hours of discussion – however the experience of the group which made them is considerable, so I hope they might be considered seriously.

  1. The 4 step model for advising/supporting content providers in opening up their metadata
  2. The RDTF should fund aggregation projects that demonstrate value in these steps
    • e.g. “Tell me how my content is being used”
  3. Providers should provide a semantic sitemap leading to a data aggregation. This could be RDF or XML
  4. Providers should expose the schemas they use (whether their own schemas or links to established schemas)
  5. Aggregation services should provide guidance to content providers about schemas to be used (a registry of recommended schemas would be a useful component)
  6. Aggregators should not reject data on basis of schema used by the content provider – aggregators should be prepared to accept anything
  7. The RDTF should (in partnership with others) seek to engage with vendors of collections/content management systems in the various domains.
  8. Aggregations should have supported APIs which are attractive to and convenient for developers, offering developer-friendly output formats such as XML or JSON
  9. Aggregation should be considered, perhaps, as a temporary approach to aiding discoverability. More extremely, a ‘just in time’ approach to aggregation might be considered.
  10. A ‘cookbook’ of design patterns involving aggregation as a technical approach to resource discovery might be a useful thing to consider funding.
  11. A ‘2 tier’ model of metadata might be worth considering, where one tier is for common, basic description and identification, and the other tier is for more targeted uses.
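Recommendations 8 and 11 can be sketched together in a few lines of Python. This is a hypothetical illustration only: the `Record` class, its field names and the sample values are assumptions made for the example, not a schema proposed at the meeting.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Record:
    # Tier 1: common, basic description and identification (field names assumed).
    identifier: str
    title: str
    # Tier 2: more targeted metadata, kept separate from the core.
    extended: dict = field(default_factory=dict)

    def to_json(self, include_extended=False):
        """Serialise the core tier as JSON, optionally with the second tier."""
        data = {"identifier": self.identifier, "title": self.title}
        if include_extended:
            data["extended"] = self.extended
        return json.dumps(data, sort_keys=True)

    def to_xml(self):
        """Serialise the core tier only, as a small XML fragment."""
        root = ET.Element("record", {"identifier": self.identifier})
        ET.SubElement(root, "title").text = self.title
        return ET.tostring(root, encoding="unicode")


record = Record("http://example.org/id/42", "South", {"extent": "1 volume"})
```

A developer who only needs identification can call `record.to_json()` and ignore the second tier, while a more specialised service can ask for it with `include_extended=True`.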

Many thanks to those who attended and made the meeting a success.

And thanks to Adrian also for organising the meeting.

Posted in Infrastructure | 8 Comments

Technical Standards for the JISC IE (part 1)

One of the key conclusions emerging from our ongoing consultation with some of those who have been involved with the JISC Information Environment (JISC IE) since its early days is that the emphasis on interoperability through open standards was one of the key drivers which gave the programme direction and momentum. Giving focus to this emphasis on open standards was a web document, JISC Information Environment Technical Standards, which introduced itself thus:

This document provides a list of the key standards and protocols that make up the JISC IE technical architecture. This document is intended primarily for developers, in order to provide them with a single point of reference to the main technologies that they should be using when working in the context of the JISC IE.

These standards are intended to apply to all JISC IE service components listed in the JISC IE Glossary (portals, brokers, aggregators, content providers, subject gateways, authentication/authorisation services, service registries, user-preferences services, OpenURL resolvers, institutional profile services, metadata schema registries, terminology services or other shared infrastructure services).

It has been suggested by more than one of those with whom we have consulted that this document was the most important of the several documents developed by UKOLN (primarily by Andy Powell) to technically inform what was then designated the JISC Information Environment Architecture. It gave those developing services in the JISC IE a touchstone, allowing them to validate that their work was in accordance with one of its over-arching principles. During our consultations, we have heard more than once that this document was more important than the perhaps more widely recognisable Technical Architecture diagram.

The document borrows the Internet Engineering Task Force (IETF) convention of using the words must, should and may (in a bold typeface) in a particular way to convey a more precise indication of the strength of recommendation or requirement being articulated. Nevertheless it has, for some, been unclear whether or not this document was intended to mandate or to advise on the use of technical standards in the JISC IE. Although the IETF convention was not applied to the document’s introduction, it seems reasonable to take the line that when the author said:

This document is intended primarily for developers, in order to provide them with a single point of reference to the main technologies that they should be using when working in the context of the JISC IE.

we might reasonably interpret that use of the word ‘should’ in the IETF sense, to mean:

should – indicates that there may exist valid reasons not to treat this point of guidance as an absolute requirement, but the full implications should be understood and the case carefully weighed before it is disregarded

So, it is reasonable to argue that the JISC IE was, at a technical level, based on an identification of appropriate technical standards. We will assume that the provision & maintenance of such a document is still useful, which means that, in looking forward to the future, two questions present themselves:

  1. Should the document be ‘prescriptive’ or ‘descriptive’?
  2. Which standards are relevant to the JISC IE today?

For this post, we’ll address the first of these questions – a second post, which will appear in a few days, will address the other.

Prescriptive or descriptive?

Should our ongoing identification and documentation of the next iteration of what we are now calling the JISC IE Technical Foundations take a prescriptive (‘must’) or descriptive (‘may’) approach to its treatment of technical standards?

It has been suggested that the original, somewhat prescriptive approach had the effect of embedding a belief in the importance of shared, open standards for interoperability into the culture of those developing services for the JISC IE. But it has also been suggested that this cultural acceptance has now been achieved and that developers can be trusted to assume the need for interoperability and, consequently, be given freedom to innovate where appropriate.

Is the importance of open standards now so widely accepted that we can assume that developers will make sensible choices, balancing the need for interoperability with the desire to innovate?

Posted in Standards, Technical Review | 2 Comments

Provision, fusion, presentation and shared infrastructure

[Figure: jisc-ie-arch.gif]

The JISC IE introduced a characterisation of service types which categorised them into one of provision, fusion, presentation or shared infrastructure. It should be stressed that this characterisation was not meant to be strict so much as it was intended to be a device to aid high level thinking around the problem space with which the IE is concerned. The fact that the iconic diagram (above) has become so closely associated with the IE is a testament to the appeal of this approach. The architecture which this categorisation scheme implies is informed by the contemporary interest in the Service Oriented Architecture (SOA) as an approach to sector-wide service provision and usage, and underpins the expectation of an environment evolving out of strategic investment, machine-level interoperability through open standards, and a separation of concerns.

Largely unanticipated only a few years ago, the effect of the Web 2.0 phenomenon on how users interact with services and on how services inter-operate has been profound. The diagram above implies a flow of information towards the user: the reality today is that users expect to interact with services with richer interfaces than was the case 3-4 years ago. The idea of a distinct layer of presentation services is something which is surely challenged today, as is the positioning of provision services, somewhat remote from the user, with layers of intervening services.

Having said that, while the diagram with its divisions of provision, fusion or presentation services might appear now to be an over-simplification of how services in the IE could fit together, it was in truth never intended to be a blueprint or even an architecture. The introduction to the IE of some of the fundamentals of modern systems design, such as the separation of concerns and modularity of services encouraged by the SOA, has been valuable. However, it could be (and has been) argued that a Resource-Oriented-Architecture (ROA) is a better fit for the IE. We will examine the relevance of ROA to the JISC IE in detail in a subsequent post. In the meantime, we should take the opportunity to review the impact and continuing relevance of this approach.

Questions:

  • Is the characterisation of provision, fusion or presentation still useful? If not, is there a better categorisation we might adopt, or is this whole approach no longer useful?
  • How has the emergence of simple point-to-point services on the Web affected this picture?
  • Rather than a focus on services, there would seem to be an emerging emphasis on users and data. How should this inform the evolution of the JISC IE’s technical foundations?

Further Reading on this topic:

Mapping the JISC IE service landscape, Andy Powell, Ariadne, Issue 36

The JISC IE: Some lessons from Web 2.0, Paul Walk, presentation to the JISC IE Working Group

Lost in the JISC Information Environment, Tony Ross, Ariadne, Issue 56

Lost in the JISC Information Environment? a post by Andy Powell on the eFoundations blog

All models are wrong, but some are useful, a post (+ associated comments) on Paul Walk’s blog

Posted in Architecture, Technical Review | 2 Comments

Reviewing the technical foundations of the JISC Information Environment

The JISC Information Environment (IE) is one of JISC’s well established strategic ‘themes’. At a technical level, the IE is framed by some important documents: a ‘technical architecture’ and a set of technical standards, both previously developed at UKOLN.

Since this technical framework was created there have been significant changes in the wider technical environment such as the growth in Web 2.0 applications, cloud computing and the use of third party services. The time has come for a comprehensive technical review of these technical foundations which were established more than three years ago. Both the content of these foundation documents, and the approach of using such documents to frame something as broad and complex as the IE, need to be reviewed.

A great deal of technical development has taken place under the umbrella of IE Programmes in the last three years. Some of this has been pure R&D, while some has led to deployed and supported services. It is time to review the extent to which the technical architecture and standards of the IE have remained relevant to these developments.

Funded by the JISC, UKOLN is undertaking a technical review of the IE. Towards this end, we will be carrying out a process of consultation. This consultation will take the following form:

  1. A series of initial interviews with identified stakeholders in order to establish a broad sense of those issues in the IE which are relevant to a technical review. These interviews will be in the nature of an ‘intelligence gathering’ exercise – they will be neither transcribed nor quoted from.
  2. A series of posts on this blog, introducing issues, themes and questions as part of the ongoing review. These will be driven by the issues being raised in the initial interviews (in addition to others already identified) and will be posted with a view to generating discussion (either via direct comments on this blog, or elsewhere) from the community as a whole.
  3. As issues are raised, posted here and discussed, materials will be synthesised from these to inform a workshop to be held in February 2010. This workshop will be offered by invitation, and will be focussed around major themes or issues emerging from the consultation to that point.
  4. This blog will be continued indefinitely, as a component in a planned process of ‘continuous review’, intended to ensure that the IE remains current and relevant.
  5. The review will be formally written up and reported in 2010.

In order to integrate into this discussion any comments made elsewhere, please use the tag:

jiscietech

If you would like to comment on this review, or on the technical foundations of the IE in general, please feel free to do so – we are happy to receive any and all relevant comments.

We will be posting further, more specific discussion pieces on this blog, so if you are interested in being part of the discussion then please subscribe to the RSS feeds.

Some starter questions:

  • We intend to review the following technical documents:

    Are these documents useful? Is it worth developing these further?

  • Is the IE’s technical focus on interoperability through open standards still relevant? Is it enough?
  • Does the IE require an architecture?
  • How has Web 2.0 changed the wider context for the IE?
  • There has been significant technological change and development since the IE Technical Architecture was published (~2005). What implications do you think this has for any attempt to update the technical standards list and the IE architecture?
Posted in Technical Review | 1 Comment