MIRIAM URN Scheme: Request for Community Feedback

Dear MIRIAM Resources user,

MIRIAM Resources continues to grow in size and number of users. We are in constant discussion with both our current user base and potential new groups and users, in order to improve the services we provide. Following recent discussions with various groups who are looking at adopting MIRIAM URNs, we would like to poll our existing and future users on some potential changes, in particular regarding the encoding of MIRIAM URNs.

Detailed below is the list of topics for which we would appreciate any input you may have.

Finally, please take a few minutes to fill in our very short online survey.

Thank you.

1. Introduction

MIRIAM Resources retains information for a plethora of identifier formats, registered against various data types. In some cases, this can lead to confusion about how they should be encoded within XML file formats. First, we would like to discuss whether there will ever be a need from the community to create XML elements which are named/identified by MIRIAM URNs (section 2.1). As a related topic, we would like to get some feedback on whether the current handling of identifiers containing a colon character (':') is appropriate (section 3) or whether the community would like to adopt a new policy (section 4). We would also like to discuss some other associated issues, such as identifiers which contain intrinsic redundancies (section 4.1), as exemplified by many of the OBO ontologies.

2. Specific usages of MIRIAM URNs in XML based formats

2.1. Usage of identifiers as XML tags

The Ontology for Biomedical Investigations incorporates terms from the Information Artifact Ontology as properties. In such cases, the IAO term identifier is used as a tag name. When implemented in RDF/XML, this results in snippets such as the following:

<rdf:RDF xmlns:obo="http://purl.obolibrary.org/obo/" [...]>
 [...]
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/OBI_1110114">
    <rdfs:label xml:lang="en">contact to pathogen carrying biological vector</rdfs:label>
    <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">a process in which a vector [...] </obo:IAO_0000115>
    <obo:IAO_0000117 xml:lang="en">IEDB</obo:IAO_0000117>
    [...]
  </owl:Class>
 [...]

In the example above, several IAO terms, such as IAO:0000117 (written as "IAO_0000117"), are used as XML tag names. In this instance, the colon character (':') is arbitrarily transformed into an underscore ('_'), and the resulting identifier can then legally be used (section 2.2) as a tag name. Moreover, the declaration of the "obo" namespace allows unambiguous identification of the underlying concept via the Persistent URL (PURL) http://purl.obolibrary.org/obo/IAO_0000117, which is a simple redirection towards an online location where information about the term can be obtained.
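As a rough illustration of how the namespace declaration and the tag name combine into the PURL, the following Python sketch parses a cut-down version of the snippet above with the standard library. The trimmed document and the helper name expanded_name are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the OBI snippet; only the parts needed here are kept.
SNIPPET = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:obo="http://purl.obolibrary.org/obo/">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/OBI_1110114">
    <obo:IAO_0000117 xml:lang="en">IEDB</obo:IAO_0000117>
  </owl:Class>
</rdf:RDF>
"""


def expanded_name(element):
    """Join namespace and local name (hypothetical helper).

    ElementTree stores a qualified tag as '{namespace}local'.
    """
    namespace, _, local = element.tag[1:].partition("}")
    return namespace + local


root = ET.fromstring(SNIPPET)
term = root[0][0]  # the <obo:IAO_0000117> element
print(expanded_name(term))
```

The printed value is the namespace URI followed directly by the local tag name, which is why the tag name itself has to be a legal XML name.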

2.2. Technical requirements for XML tags

In XML (and therefore in all XML-based formats such as RDF/XML), a tag name must satisfy a cascading list of requirements to be valid:

  1. The names for tags must be of the type QName
  2. A QName is composed of: [NCName ':'] NCName (where the optional preceding NCName constitutes the namespace prefix)
  3. NCName is defined as: Name - (Char* ':' Char*). Importantly this means that NCName is an XML Name which does not include the colon character (":").
  4. XML Name can be composed of:
    • NameStartChar (NameChar)*
    • NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [other Unicode ranges]
    • NameChar ::= NameStartChar | "-" | "." | [0-9] | [other Unicode ranges]

In summary, the local part of an XML tag name cannot start with a digit, cannot contain ':' (a colon may only separate the namespace prefix from the local part, i.e. the tag name proper) and cannot contain '%'.
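The rules above can be approximated in a few lines of Python. This is only a sketch: it covers the ASCII subset of the NCName production and ignores the additional Unicode ranges, and the names NCNAME_ASCII and is_ascii_ncname are ours.

```python
import re

# ASCII-only approximation of the NCName production; the real grammar
# also allows many non-ASCII Unicode characters.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_ascii_ncname(candidate):
    """Return True if candidate is usable as an (ASCII) local tag name."""
    return NCNAME_ASCII.fullmatch(candidate) is not None


for name in ["IAO_0000117", "CHEBI:12345", "CHEBI%3A12345", "12345"]:
    print(name, is_ascii_ncname(name))
```

Only the first candidate passes: the others fail because of the colon, the '%' character and the leading digit, respectively.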

2.3. Potential issues with using MIRIAM URNs as tag names

There are several potential issues if users want to adopt this type of tag name identification. First of all, identifiers that incorporate a colon character must be transformed in some way. Moreover, purely numerical identifiers cannot be used in this context without modification (the addition of a prefix is mandatory).

To enable MIRIAM URNs to be usable as legitimate tag names in XML-based formats, several changes would be required. For example, identifiers that incorporate a colon character must be transformed, and the "%3A" transformation itself is not appropriate. In addition, simply removing the colon and any preceding text to leave a bare numerical identifier is not a feasible option either. This issue is therefore particularly relevant for MIRIAM data types that incorporate a colon within their identifier syntax, and for those that are referenced purely by numerical identifiers.

If you remain unconvinced that the current URI scheme cannot be used in such cases, please look at the examples below (a-f). To convince yourself, feel free to use the online validator. Potential solutions are proposed in section 4, and 'colon' usage is discussed in section 3.

a) Using a MIRIAM URN simply as is (so, incorporating the '%3A' encoding for a colon):

   
   <test xmlns="Random_namespace">
 	<urn:miriam:obo.chebi:CHEBI%3A12345>...some term information...</urn:miriam:obo.chebi:CHEBI%3A12345>
   </test>

Invalid XML due to the use of '%' characters, and inability to determine a QName (too many colon characters).

b) Using a MIRIAM URN, without the '%3A' colon encoding:

   <test xmlns="Random_namespace">
   	<urn:miriam:obo.chebi:CHEBI:12345>...some term information...</urn:miriam:obo.chebi:CHEBI:12345>
   </test>

Invalid XML due to the inability to determine a QName; in addition, 'urn' is detected as a namespace prefix, but has not been declared. Both errors are due to too many colon characters.

c) Using a '%3A' stripped URN, but declaring 'urn' as a namespace prefix:

   <test xmlns="Random_namespace" xmlns:urn="MIRIAM_URNs_namespace">
   	<urn:miriam:obo.chebi:CHEBI:12345>...some term information...</urn:miriam:obo.chebi:CHEBI:12345>
   </test>

Invalid XML due to the inability to parse the QName. There are still too many colon characters, where there should be only one to separate the namespace prefix and the local part (or tag name).

d) Declaring a 'mir' namespace composed of the stem of the MIRIAM URN and not using the '%3A' encoding. Any other namespace can be used for this example, and would have identical problems.

   <test xmlns="Random_namespace" xmlns:mir="urn:miriam:obo.chebi">
   	<mir:CHEBI:12345>...some term information...</mir:CHEBI:12345>
   </test>

Invalid XML due to the inability to parse the QName. There are too many colon characters, where there should be only one to separate the namespace prefix and the local part (or tag name).

e) Declaring a 'mir' namespace composed of the stem of the MIRIAM URN:

   <test xmlns="Random_namespace" xmlns:mir="urn:miriam:obo.chebi">
   	<mir:CHEBI%3A12345>...some term information...</mir:CHEBI%3A12345>
   </test>

Invalid XML due to the presence of an illegal '%' character in the tag name.

f) Using the entire MIRIAM URN as a namespace, leaving only the digit identifier as tag name:

   <test xmlns="Random_namespace" xmlns:mir="urn:miriam:obo.chebi:CHEBI%3A">
   	<mir:12345>...some term information...</mir:12345>
   </test>

Invalid XML since a tag name cannot start with a digit. Note that this problem would show up with any data type using identifiers composed purely of digits, such as PubMed, Taxonomy, EC Code, ...
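As an alternative to an online validator, the six cases can be checked locally. The sketch below rebuilds examples a) to f) and feeds them to the namespace-aware parser of the Python standard library; the helper names build and is_well_formed are ours, and other parsers may word their error messages differently.

```python
import xml.etree.ElementTree as ET

BODY = "...some term information..."

# (label, extra namespace declarations, tag name) for cases a) to f)
CASES = [
    ("a", "", "urn:miriam:obo.chebi:CHEBI%3A12345"),
    ("b", "", "urn:miriam:obo.chebi:CHEBI:12345"),
    ("c", ' xmlns:urn="MIRIAM_URNs_namespace"', "urn:miriam:obo.chebi:CHEBI:12345"),
    ("d", ' xmlns:mir="urn:miriam:obo.chebi"', "mir:CHEBI:12345"),
    ("e", ' xmlns:mir="urn:miriam:obo.chebi"', "mir:CHEBI%3A12345"),
    ("f", ' xmlns:mir="urn:miriam:obo.chebi:CHEBI%3A"', "mir:12345"),
]


def build(declarations, tag):
    """Assemble a test document around the given tag name."""
    return (
        f'<test xmlns="Random_namespace"{declarations}>'
        f"<{tag}>{BODY}</{tag}></test>"
    )


def is_well_formed(document):
    """Return True if the namespace-aware parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


for label, declarations, tag in CASES:
    print(label, is_well_formed(build(declarations, tag)))

# Control: a tag name that follows the rules of section 2.2.
control = build(' xmlns:mir="urn:miriam:obo.go"', "mir:GO_0000188")
print("control", is_well_formed(control))
```

Each of the six cases is expected to be rejected, while the control document, which uses an underscore instead of the colon, should be accepted.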

3. Usage of ':' in identifiers of MIRIAM URNs

The colon character is present in various identifier schemes referenced by MIRIAM Resources, with the OBO ontologies forming a significant part of this subset. Although the prefix acts as a namespace, in most situations the prefix and the string of digits are used together as the identifier of a term. A computer program will therefore not recognise the term "0000188" as belonging to the Gene Ontology data type: the prefix is required, as in "GO:0000188", to specify the Gene Ontology namespace. As a consequence, a MIRIAM URN for a Gene Ontology term must currently contain the following parts:

  1. "urn" to specify that the string must be parsed as a Uniform Resource Name;
  2. "miriam" to specify that the URN must be understood as a MIRIAM URI;
  3. "obo.go" to specify that the identifier belongs to the subdomain GO of OBO, identifying the data type 'Gene Ontology';
  4. "GO:0000188" to specify the dataset identifier within the data type Gene Ontology.

The resulting URN is "urn:miriam:obo.go:GO%3A0000188". The replacement of ":" by "%3A" is discussed below.

':' is a reserved character, used as a separator in URNs. In order to handle the fact that the last ':' of those MIRIAM URNs is semantically different from the others, MIRIAM URNs use the percent-encoded form: '%' followed by the hexadecimal ASCII code of ':' (3A), hence '%3A'. This solution was chosen as it is the way described and required by the URI RFC. Therefore, the correct way of referring to "GO:0000188" or "CHEBI:15422" is currently "urn:miriam:obo.go:GO%3A0000188" and "urn:miriam:obo.chebi:CHEBI%3A15422", respectively.
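The construction and parsing of such URNs can be sketched with the percent-encoding functions of the Python standard library. The function names to_miriam_urn and from_miriam_urn are hypothetical and are not part of the MIRIAM Web Services.

```python
from urllib.parse import quote, unquote


def to_miriam_urn(data_type, identifier):
    """Build a MIRIAM URN, percent-encoding reserved characters such as ':'."""
    encoded = quote(identifier, safe="")
    return f"urn:miriam:{data_type}:{encoded}"


def from_miriam_urn(urn):
    """Split a MIRIAM URN into (data type namespace, decoded identifier)."""
    scheme, authority, data_type, encoded = urn.split(":", 3)
    if (scheme, authority) != ("urn", "miriam"):
        raise ValueError(f"not a MIRIAM URN: {urn}")
    return data_type, unquote(encoded)


print(to_miriam_urn("obo.go", "GO:0000188"))
print(from_miriam_urn("urn:miriam:obo.chebi:CHEBI%3A15422"))
```

Splitting on the first three colons only works because the colon inside the identifier has been encoded, which is precisely the role of the '%3A' form.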

The problems are:

  1. the confusion generated by the '%3A';
  2. the need to convert ':' into '%3A' and back when one generates or parses MIRIAM URNs (although Web Services provide this feature);
  3. the redundancy between the data type namespace and the OBO namespace (e.g. duplicated go/GO and chebi/CHEBI);
  4. the impracticality of using the generated URN as a tag name (cf. section 2).

4. Possible solutions

There are a number of possible solutions, or solution-enabling steps, to the issue of ':' inclusion within identifiers and using MIRIAM URNs as tag names within XML-based formats. These are illustrated below (sections 4.1 through 4.5).

4.1. Removal of the "idspace name" for OBO ontologies

One possibility is to get rid of the OBO "ID-Space" (the part before the ':' that becomes encoded as "%3A"), which is redundant with the MIRIAM namespace ("obo.go"), and only keep the digits (or OBO "Local-ID") in the identifier portion of MIRIAM URNs. For example "urn:miriam:obo.go:GO%3A0000188" would become "urn:miriam:obo.go:0000188".

As an aside, this solution would also remove the redundancy seen between the data type identifier, "obo.go", and in the dataset identifier "GO:0000188". This change could also be applied to all data types with an invariable prefix in their identifier scheme. Note that this is already what is done with the Enzyme Nomenclature. It should, however, also be borne in mind that some data types in MIRIAM do specify that the "ID-Space" is part of their official identifier. Something like "urn:miriam:obo.go:0000188" would therefore be MIRIAM-specific.

Important note: this solution does not solve the usage of URNs in XML tags, since the identifier to be used as a tag name would then be composed of bare digits, which cannot be used to start a tag name. This would have to be part of a larger potential solution.
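A minimal sketch of this rewriting, assuming that the ID-Space is everything before the first colon of the decoded identifier (the function name drop_id_space is ours):

```python
from urllib.parse import unquote


def drop_id_space(urn):
    """Rewrite 'urn:miriam:obo.go:GO%3A0000188' as 'urn:miriam:obo.go:0000188'."""
    stem, _, encoded = urn.rpartition(":")
    identifier = unquote(encoded)
    # Keep only the OBO Local-ID, i.e. the text after the first ':'.
    _id_space, separator, local_id = identifier.partition(":")
    if not separator:
        return urn  # identifier has no ID-Space; leave it unchanged
    return f"{stem}:{local_id}"


shortened = drop_id_space("urn:miriam:obo.go:GO%3A0000188")
print(shortened)
# The remaining identifier starts with a digit, so it still cannot be
# used directly as a tag name.
print(shortened.rpartition(":")[2][0].isdigit())
```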

4.2. Fusion of the "idspace name" with the MIRIAM data type namespace

One possibility similar to the previous one is to use the "ID-Space" as the data type MIRIAM namespace, and only keep the digits (or "Local-ID") in the identifier portion of MIRIAM URNs. For example "urn:miriam:obo.go:GO%3A0000188" would become "urn:miriam:obo.GO:0000188" or "urn:miriam:GO:0000188".

As an aside, this solution would also remove the redundancy seen between the data type identifier, "obo.go", and in the dataset identifier "GO:0000188". This change could also be applied to all data types with an invariable prefix in their identifier scheme.

Important note: this solution does not solve the usage of URNs in XML tags, since the identifier to be used as a tag name would then be composed of bare digits, which cannot be used to start a tag name. This would have to be part of a larger potential solution.
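The two variants mentioned above could be produced as follows. This is a sketch under the assumption that the data type is an OBO ontology; the function name fuse_id_space and its keep_obo_prefix parameter are hypothetical.

```python
from urllib.parse import unquote


def fuse_id_space(urn, keep_obo_prefix=True):
    """Move the OBO ID-Space from the identifier into the data type namespace."""
    _, _, encoded = urn.rpartition(":")
    id_space, separator, local_id = unquote(encoded).partition(":")
    if not separator:
        return urn  # identifier has no ID-Space; leave it unchanged
    data_type = f"obo.{id_space}" if keep_obo_prefix else id_space
    return f"urn:miriam:{data_type}:{local_id}"


print(fuse_id_space("urn:miriam:obo.go:GO%3A0000188"))
print(fuse_id_space("urn:miriam:obo.go:GO%3A0000188", keep_obo_prefix=False))
```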

4.3. Different encoding for ':'

We could follow the decision taken by various groups working with OWL, and arbitrarily replace the ":" with an "_". For example "urn:miriam:obo.go:GO%3A0006915" would become "urn:miriam:obo.go:GO_0006915".

If this action were taken, we could generate valid tag names:

   <test xmlns="Random_namespace" xmlns:mir="urn:miriam:obo.go">
   	<mir:GO_0000188>...some term information...</mir:GO_0000188>
   </test>

Users would still have to convert ":" into "_" and back, as they currently do with "%3A". However, whereas the correspondence between ':' and its '%3A' encoding is defined by a standard, the correspondence between ":" and "_" relies on an arbitrary decision that must be known and incorporated explicitly into the software tools handling MIRIAM URNs.
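The conversion in both directions might look like the sketch below. The convention that the first underscore of the identifier stands for the original colon is an assumption of this example, which is exactly the kind of rule a tool would need to know in advance; an identifier that already contained an underscore before its colon would not survive the round trip.

```python
from urllib.parse import quote, unquote


def percent_to_underscore(urn):
    """'urn:miriam:obo.go:GO%3A0006915' -> 'urn:miriam:obo.go:GO_0006915'."""
    stem, _, encoded = urn.rpartition(":")
    identifier = unquote(encoded).replace(":", "_", 1)
    return f"{stem}:{identifier}"


def underscore_to_percent(urn):
    """Reverse mapping; assumes the first '_' stands for the original ':'."""
    stem, _, identifier = urn.rpartition(":")
    encoded = quote(identifier.replace("_", ":", 1), safe="")
    return f"{stem}:{encoded}"


converted = percent_to_underscore("urn:miriam:obo.go:GO%3A0006915")
print(converted)
print(underscore_to_percent(converted))
```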

4.4. Drop URNs and use URLs

Initially, both URLs and URNs were used in MIRIAM Resources. In January 2008, the decision to use URNs was voted upon during the Super-hackathon about Standards and Ontologies for Systems Biology. The main reason (although not the only one) behind this choice was to explicitly distinguish the physical locations where data resides (URLs) from its identification (URNs). It was thought that having a URI of a form akin to that of a URL, but which is unresolvable, would be confusing. Therefore the current MIRIAM URN syntax was adopted.

Consequently, another potential solution would be to revert to, or allow optional use of, the parallel URL form, or indeed to move fully to the URL format. Such a transition would, for example, result in a URI of one of the following forms:

One should emphasise that using a URL form for MIRIAM URIs would *not* be the same as using a PURL. We would still associate several alternative resources or locations to the same dataset.

Important note: this would solve the usage in XML tags only if several rules are implemented to manage identifiers containing special characters, such as ":". For example, the first two example URLs would not solve the usage in XML tags.

4.5. Split across "%3A" encoding

Since tag names cannot begin with a digit, it is possible to split the "%3A" colon encoding between a namespace declaration, and a tag name, thus:

   <test xmlns="Random_namespace" xmlns:mir="urn:miriam:obo.go:GO%3">
   	<mir:A0006915>...some term information...</mir:A0006915>
   </test>

The advantage of this approach is that it produces valid tag names and requires much less work to implement. The disadvantages are that it is not an aesthetically pleasing solution and would probably be harder for a person to interpret. The fact that the OBO namespace ("GO") would be dropped from the tag name makes the result confusing: in the example above, one may assume the existence of an identifier "A0006915", where there is none.

Note: this solution still requires the "%3A" encoding.
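The split can be sketched and checked with the Python standard library parser, which does not inspect the namespace name itself. The function name split_across_encoding is ours.

```python
import xml.etree.ElementTree as ET


def split_across_encoding(urn):
    """Split '...GO%3A0006915' into ('...GO%3', 'A0006915')."""
    head, marker, tail = urn.partition("%3A")
    if not marker:
        raise ValueError(f"no '%3A' in {urn}")
    return head + "%3", "A" + tail


namespace, tag = split_across_encoding("urn:miriam:obo.go:GO%3A0006915")
document = (
    f'<test xmlns="Random_namespace" xmlns:mir="{namespace}">'
    f"<mir:{tag}>...some term information...</mir:{tag}>"
    "</test>"
)
element = ET.fromstring(document)[0]
# ElementTree reports the tag as '{namespace}local'; joining both parts
# gives back the original URN.
rebuilt = element.tag[1:].replace("}", "", 1)
print(rebuilt)
```

Concatenating the namespace and the tag name recovers the full URN, but only for software that knows about the split; a human reader sees just "A0006915".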

5. Summary / Discussion

We would be very happy to hear your opinion about these possible issues and suggested solutions. For this purpose, we have set up a very short survey. The results will be made public, and we will announce any further decisions.

If you have any related queries or concerns, feel free to contact us.