Metadata strategy

October 6, 2010

The University of Southampton data management project has proposed a three-level metadata strategy; see their blog entry “Metadata strategy”:

  1. Project
  2. Discipline
  3. Core

Tardis is based on the Core Scientific Metadata model (CSMD) developed within the Science & Technology Facilities Council (STFC).  One metadata hierarchy they’ve adopted is (turned upside down to match Southampton’s):

  1. Science Specific
  2. Instrument Specific
  3. Core

(This reminds me of Robert Pirsig’s Intellectual Scalpel)

We’re extending Tardis for use within the Australian Synchrotron and ANSTO, where the STFC model is more appropriate.  However, institutional use of Tardis may also be project based.

Tardis supports configurable schemas (parameter sets) at the experiment, dataset and datafile level.  Appropriate use of the configurable schema should allow us to handle both models, or a combined model.
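
As a rough illustration of how one configurable schema mechanism could serve both hierarchies, here is a minimal Python sketch.  It is not Tardis code: the class names (Schema, ParameterSet, Record), the "layer" label and the example.org namespaces are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Schema:
    """Hypothetical schema descriptor (not the Tardis data model)."""
    namespace: str  # illustrative identifier for the schema
    layer: str      # e.g. "core", "instrument", "science", "discipline", "project"


@dataclass
class ParameterSet:
    """A set of name/value parameters that conforms to one schema."""
    schema: Schema
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Record:
    """An experiment, dataset or datafile carrying any number of parameter sets."""
    kind: str  # "experiment", "dataset" or "datafile"
    name: str
    parameter_sets: List[ParameterSet] = field(default_factory=list)

    def by_layer(self, layer: str) -> Dict[str, str]:
        """Merge the parameters of every attached set that belongs to `layer`."""
        merged: Dict[str, str] = {}
        for parameter_set in self.parameter_sets:
            if parameter_set.schema.layer == layer:
                merged.update(parameter_set.parameters)
        return merged


# A combined model: core and instrument metadata (STFC style) sit alongside
# project metadata (Southampton style) on the same dataset.
dataset = Record("dataset", "scan-001", [
    ParameterSet(Schema("http://example.org/schema/core", "core"),
                 {"title": "Scan 1"}),
    ParameterSet(Schema("http://example.org/schema/beamline", "instrument"),
                 {"energy_keV": "12.4"}),
    ParameterSet(Schema("http://example.org/schema/project-x", "project"),
                 {"grant": "X-42"}),
])
```

Because the layer is only a label on the schema, the same record can answer queries framed in either hierarchy, e.g. dataset.by_layer("instrument") or dataset.by_layer("project").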

ANDS Data Capture Briefing

September 13, 2010

ANDS held a Data Capture Briefing in Melbourne on 2 Sep 2010.  It was great to see some of the other projects in progress, and that Tardis is a potential platform for some of those projects.

There were also updates on RIF-CS, with version 1.2 coming soon.  The updated Service definitions should better meet the MeCAT project requirements.

One aspect that I think needs further thought in RIF-CS is Party lookup.  When making entries available for harvesting, Tardis should connect researchers with the collections they were involved with.  The lack of reliable automated Party lookup makes this difficult to guarantee.

Data Bites : Ifs, ANDS and buts

August 29, 2010

There is now a blog for projects funded by ANDS (the Australian National Data Service).  We may move across to the shared blog at a future date, but will stay here for now.  I’ve added the ANDS blog to the blogroll.

Access Controls

July 26, 2010

As has been highlighted in previous posts, most projects about data publication in the research field have come across the problem that while researchers believe in publishing / sharing data in principle, they have lots of reasons not to do it in practice.  This is a much larger problem than can be addressed in one project, so we’ve decided to work around the problem as much as possible by providing access controls within MeCAT that support publishing data immediately, restricting access either indefinitely or until criteria are met, or sharing on an individual basis.

The set of use cases that we’re using as the basis of the Access Control design is listed below.

The Data Owner has the ability to grant and remove access privileges to the data they own.  The Data Owner will typically be the Principal Investigator or a representative of the Institution.

  • Publicly Accessible
    The data is made publicly available immediately, e.g. data that will become part of a reference database.
  • Accessible by the Data Owner and assigned team members
    Team members may be assigned individually or as a group.
  • Access granted by the Data Owner
    E.g. as a result of direct contact by another researcher.
  • Accessible by anyone at a given physical location, typically the instrument
  • Publicly Accessible after an embargo period, e.g. 3 years
  • Publicly Accessible after a trigger, e.g. paper is published
  • Accessible by facility scientists.  Facility scientists typically have access to all data from the instrument they are responsible for.
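
To make the use cases above concrete, here is a minimal Python sketch of how they might combine into a single access check.  It is an assumption-laden illustration, not the MeCAT design: the AccessPolicy fields, the can_access function and the example names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class AccessPolicy:
    """Hypothetical access policy for one experiment (not the MeCAT design)."""
    owner: str
    public: bool = False                                # public immediately
    team: Set[str] = field(default_factory=set)         # assigned team members
    granted: Set[str] = field(default_factory=set)      # individual grants by the owner
    location: Optional[str] = None                      # e.g. the instrument's location
    embargo_ends: Optional[date] = None                 # public on or after this date
    trigger_fired: bool = False                         # e.g. the paper has been published
    facility_scientists: Set[str] = field(default_factory=set)


def can_access(policy: AccessPolicy,
               user: Optional[str] = None,
               user_location: Optional[str] = None,
               today: Optional[date] = None) -> bool:
    """Return True if any one of the use cases grants access."""
    today = today or date.today()
    # Public immediately, or public because a trigger has fired.
    if policy.public or policy.trigger_fired:
        return True
    # Public once the embargo period has ended.
    if policy.embargo_ends is not None and today >= policy.embargo_ends:
        return True
    # Owner, team members, individually granted users and facility scientists.
    if user is not None and (
        user == policy.owner
        or user in policy.team
        or user in policy.granted
        or user in policy.facility_scientists
    ):
        return True
    # Anyone at the given physical location, typically the instrument.
    if policy.location is not None and user_location == policy.location:
        return True
    return False


# Example: private data with a three-year embargo from the date of this post.
policy = AccessPolicy(owner="alice", team={"bob"},
                      location="beamline-1",
                      embargo_ends=date(2013, 7, 26))
```

In this sketch the rules are purely additive: any single rule is enough to grant access, and the Data Owner removes access by editing the policy rather than by adding a deny rule.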

I’ll cover the design we’re proposing to support these use cases in a subsequent entry, and am interested in any feedback on these use cases.

Using a Core Scientific Metadata Model in Large-Scale Facilities

July 22, 2010

Thanks to the UKOLN News Feed for pointing to the International Journal of Digital Curation, Vol. 5, No. 1.  It contains a paper titled Using a Core Scientific Metadata Model in Large-Scale Facilities.  The paper provides a good overview of the CSMD schema, which is “a model for the representation of scientific study metadata developed within the Science & Technology Facilities Council (STFC) to represent the data generated from scientific facilities”.

Clarion Project

July 19, 2010

Thanks to Lesley from the Incremental project for pointing me to the Clarion Project blog.

Clarion provides some great questions to ask scientists when trying to get agreement on publishing data in their Principal Investigators’ opinions on Open Data entry.

I also like their Design Principles and am looking forward to hearing more on the success of their electronic logbook project.

Incremental Project

July 15, 2010

The University of Cambridge and University of Glasgow have a joint project on data management named “Incremental”.  See their blog entry Scoping study and implementation plan released.

The issues they are looking to address are much the same as we are facing at the Australian Synchrotron and ANSTO with the MeCAT project, including:

  • Procedures for creating and organising data
  • Data storage and access
  • Data back-up
  • Preservation
  • Data sharing and re-use

One more issue comes immediately to mind:

  • Accurate and complete capture of metadata

While AS and ANSTO face all of the issues listed in the Incremental report to a greater or lesser degree, our project is focussed on their last issue listed above.  The Incremental report articulates the problem very clearly:

While many researchers are positive about sharing data in principle, they are almost universally reluctant in practice.  They have invested in collecting or processing data, and using these data to publish results before anyone else is the primary way of gaining prestige in nearly all disciplines.  In addition, researchers complain that data must be carefully prepared, annotated, and contextualised before they can make it public, which is all very time-consuming and funding is rarely set aside for this.

The report goes into more detail, providing examples of why researchers are reluctant to publish data, and under what conditions they are more likely to share data.

At the moment we’re taking a three-prong approach to this problem:

  1. Defer the problem by providing a suitably flexible access control system that allows data to be initially private and then published at a later date.
  2. Initially encourage researchers to make just the existence of the data public, with access granted only on an individual basis after discussion with the researcher.
  3. Focus on data that can be made public immediately, e.g. reference spectral data sets.

Cultural change will be required in the long term.

The report also notes that “resources must be simple, engaging and easy to access”.  Given our issues with metadata capture, I would emphasize the need for the systems to be engaging.