Last modified on 07/29/10 13:02:08

Intermediate Language (IML)

IML is a dataflow graph transformation language for manipulating RDF data. It is intended to let non-technical users who need access to biological and biomedical information sets (e.g. ontologies, thesauri, data) write view definitions. Existing view definition mechanisms (e.g. vSPARQL, NetworkedGraphs) use a declarative syntax that does not map easily onto the high-level transformations that users want to apply to information sets. IML consists of a small number of graph transformations that can be composed in a dataflow style to define a view over RDF-based information sets. The language's operations map closely to the steps users take when transforming RDF datasets in a visual editor.

Background: RDF, SPARQL

RDF (Resource Description Framework) is the model developed by the W3C for describing data on the semantic web. Data in RDF forms a directed, labeled graph of triples (subject, predicate, object), each describing a relationship between resources. Resources are URIs or literals (literals may appear only as objects), and edges are named links between resources.

An example RDF graph of the Brain in the Foundational Model of Anatomy (FMA) is: http://www.cs.washington.edu/homes/mar/rdf_pic.png

SPARQL is the query language developed by the W3C for querying RDF data. A query indicates the specific triple patterns to be found in an RDF graph using a combination of ground facts and variables (?v or $v). If the triple pattern is found in the RDF graph, the query is successful and a set of variable bindings corresponding to those instances can be returned.

Below is a sample SPARQL query over the FMA. The WHERE clause indicates the triple pattern that is to be found in the underlying RDF graph (specified via the FROM <http://sig.biostr.washington.edu/fma3.0> statement); for this query, all direct outgoing properties of the Brain are found. The SELECT statement indicates the set of variables (?b, ?c) that should be projected out as results of the query.

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX fma:<http://sig.biostr.washington.edu/fma3.0#>

SELECT ?b ?c 
FROM <http://sig.biostr.washington.edu/fma3.0>
WHERE {
      fma:Brain ?b ?c  .
}

A snippet of the results returned by this query is:

| b                          | c
==============================================================
| rdfs:subClassOf            | fma:Segment_of_neuraxis
| fma:constitutional_part    | fma:Neural_tissue_of_brain
| fma:constitutional_part    | fma:Vasculature_of_brain
| fma:definition             | "Segment of neuraxis that has as
| fma:regional_part          | fma:Forebrain
| fma:regional_part          | fma:Midbrain
| fma:regional_part          | fma:Hindbrain
| fma:regional_part_of       | fma:Neuraxis
| fma:bounded_by             | fma:Plane_of_foramen_magnum
| fma:constitutional_part_of | fma:Cranial_compartment
| fma:member_of              | fma:Nervous_system_of_head
...

The language contains a number of other features which are described at http://www.w3.org/TR/rdf-sparql-query/.
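To make the idea of triple patterns and variable bindings concrete, the following Python sketch matches a single triple pattern against a small in-memory graph. It is only an illustration: the graph is a hand-written excerpt of the results table above rather than the real FMA, and the function name match_pattern is hypothetical, not part of SPARQL or any library.

```python
# A graph is modeled as a set of (subject, predicate, object) tuples.
# Terms starting with "?" are variables; everything else is a ground term.

def match_pattern(graph, pattern):
    """Return one dict of variable bindings per triple matching the pattern."""
    results = []
    for triple in graph:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No mismatch was found, so this triple matches the pattern.
            results.append(bindings)
    return results

# Hand-written excerpt for illustration only.
graph = {
    ("fma:Brain", "fma:regional_part", "fma:Forebrain"),
    ("fma:Brain", "fma:regional_part", "fma:Midbrain"),
    ("fma:Brain", "fma:regional_part_of", "fma:Neuraxis"),
    ("fma:Forebrain", "rdfs:label", "Forebrain"),
}

# Equivalent in spirit to: SELECT ?b ?c WHERE { fma:Brain ?b ?c . }
rows = match_pattern(graph, ("fma:Brain", "?b", "?c"))
for row in sorted(rows, key=lambda r: r["?c"]):
    print(row["?b"], row["?c"])
```

Each result row corresponds to one line of the results table: a binding for ?b and a binding for ?c.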

vSPARQL: A view definition language for RDF

vSPARQL is a set of extensions to the SPARQL query language designed to allow users to create view definitions over RDF data. The language allows extraction, modification, and augmentation of RDF data.

Intermediate Language (IML)

IML is designed to remove some of the technical burden of creating view definitions over RDF data. The syntax of SPARQL (and thus of vSPARQL) is similar in style to SQL; while this is a benefit for users well-versed in SQL, it can be prohibitive for non-technical users. Additionally, there is a mismatch between the high-level operations users wish to use to transform an ontology (e.g. get the part hierarchy of the liver) and the declarative syntax of SPARQL. To reduce this mismatch, IML provides a set of high-level graph operations that can be combined in a dataflow language.

IML syntax

IML is a dataflow language allowing users to specify high-level graph operations for transforming RDF data. The figure below shows IML's structure at a high level. As with SPARQL queries, PREFIX statements can be used to define shorthand for URI namespaces.

http://www.cs.washington.edu/homes/mar/iml_overview.png

A sample IML view definition:

http://www.cs.washington.edu/homes/mar/iml_example_view.png

IML subquery blocks

An IML view definition contains a set of IML subquery blocks that each have a specified set of input graphs and a named output graph. The result of these subquery blocks can then be directed to other subquery blocks as input graphs. Within a given block, results flow between operations via a default graph; operations can be applied to the default graph or individual input graphs can be specified for querying.

A subquery block consists of:

INPUT [comma separated list of RDF graph URIs]
{
    [ IML Operations ]
} OUTPUT [ name of output graph ]

For example, the following two subquery blocks take the FMA as an input graph and produce local graphs that can be used by other subquery blocks. (We discuss the EXTRACT_TREE and EXTRACT_EDGES operations below.) The first subquery block extracts the entire Brain regional_part and constitutional_part hierarchy and creates a local graph called <brain_parts>; <brain_parts> is then used as input to the second subquery block, which extracts the subgraph consisting of regional_part edges and produces a local graph called <brain_regional_parts>.

INPUT <http://sig.biostr.washington.edu/fma3.0>
{
    EXTRACT_TREE { fma:Brain [ forward(fma:regional_part), forward(fma:constitutional_part) ] }
        GRAPH <http://sig.biostr.washington.edu/fma3.0>
} OUTPUT <brain_parts>

INPUT <brain_parts>
{
    EXTRACT_EDGES { ?a fma:regional_part ?c } WHERE { GRAPH <brain_parts> { ?a fma:regional_part ?c } }
} OUTPUT <brain_regional_parts>


IML Operations

Within an IML subquery block, a series of IML operations can be applied to input graphs to produce the desired output graph. The results of each operation within a block are passed to the next operation via an unnamed default graph. At the top of the IML subquery block, the default graph is empty; the first operation will begin to populate that graph. The output of the IML subquery block is the default graph after the last operation in the block.

Individual IML operations consist of:

  • the name of the operation
  • the specific output of the operation
  • an optional GRAPH statement indicating the GRAPH to which the operation should be applied
  • an optional WHERE clause that can be used to bind variables (using a SPARQL WHERE clause) to be used in the output

More details and examples are given for each of the individual operations below.
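The way results flow through a subquery block via the default graph can be sketched in Python as follows. This is an illustrative model only, not the IML implementation: run_block, take_all_from, and keep_property are hypothetical names, graphs are modeled as sets of (subject, predicate, object) tuples, and the two operations are simplified stand-ins for real IML operations.

```python
def run_block(inputs, operations):
    """Run operations in order. Each operation receives the current default
    graph and the named input graphs, and returns the next default graph."""
    default = set()  # the default graph is empty at the top of the block
    for operation in operations:
        default = operation(default, inputs)
    return default   # the block's OUTPUT is the final default graph

# Hypothetical stand-ins for IML operations, written as plain functions.
def take_all_from(name):
    """Operation applied to a named input graph (like using GRAPH <name>)."""
    return lambda default, inputs: set(inputs[name])

def keep_property(prop):
    """Operation applied to the incoming default graph."""
    return lambda default, inputs: {t for t in default if t[1] == prop}

fma = {
    ("fma:Brain", "fma:regional_part", "fma:Forebrain"),
    ("fma:Brain", "fma:constitutional_part", "fma:Neural_tissue_of_brain"),
}

# First block: INPUT <fma> { ... } OUTPUT <brain_parts>
brain_parts = run_block({"fma": fma}, [take_all_from("fma")])

# Second block consumes the first block's output as a named input graph.
brain_regional_parts = run_block(
    {"brain_parts": brain_parts},
    [take_all_from("brain_parts"), keep_property("fma:regional_part")],
)
print(brain_regional_parts)
```

The second block never sees the FMA directly; it only sees the named output of the first block, which is the dataflow style described above.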

IML Operations: Selection

IML provides a set of operations specifically for extracting information from an RDF graph. The operations include extract_edges, extract_tree, extract_reachable, extract_path, and extract_recursive.

EXTRACT_EDGES

EXTRACT_EDGES { <list of triple patterns> }
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

EXTRACT_EDGES allows a user to indicate the subset of RDF edges that should be placed in the default graph. If the graph pattern described in the WHERE clause is found in the specified input graph, the output triples are placed in the default graph. (Note that this operation completely overwrites the default graph coming into the operation.)

INPUT <http://sig.biostr.washington.edu/fma3.0>
{
    EXTRACT_EDGES { fma:Heart ?b ?c } 
        GRAPH <http://sig.biostr.washington.edu/fma3.0> 
        WHERE { fma:Heart ?b ?c }
    EXTRACT_EDGES { ?x fma:regional_part ?z } 
        WHERE { ?x fma:regional_part ?z }
}
OUTPUT <sample_extract_edges>

In this example, the first EXTRACT_EDGES specifies <http://sig.biostr.washington.edu/fma3.0> as the graph to which the operation should be applied. The WHERE clause indicates the graph pattern that must be found in the graph, which is that the fma:Heart has at least one outgoing edge; all of the identified outgoing edges of the fma:Heart are placed in the default graph.

The second EXTRACT_EDGES command is applied to the default graph (the results of the previous operation); the WHERE clause locates all fma:regional_part edges in the default graph and places them in the default graph. The output of the entire block (<sample_extract_edges>) will consist of all of the fma:Heart's fma:regional_part edges.
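The two-step example above can be mimicked with a short Python sketch. It assumes a single triple pattern in both the output template and the WHERE clause, models graphs as sets of tuples, and uses made-up object names (ex:part_1 and so on) instead of real FMA content; match and extract_edges are hypothetical helper names.

```python
def match(graph, pattern):
    """Yield variable bindings for each triple matching one triple pattern."""
    for triple in graph:
        b = {}
        if all((b.setdefault(t, v) == v) if t.startswith("?") else t == v
               for t, v in zip(pattern, triple)):
            yield b

def extract_edges(template, where, source):
    """Instantiate `template` once per WHERE match found in `source`.
    The returned set replaces the default graph; it is not merged into it."""
    return {tuple(b.get(t, t) for t in template) for b in match(source, where)}

# Illustrative data only; the ex: names are placeholders.
fma = {
    ("fma:Heart", "fma:regional_part", "ex:part_1"),
    ("fma:Heart", "ex:label", "heart"),
    ("fma:Liver", "fma:regional_part", "ex:part_2"),
}

# First operation: applied to the named input graph.
default = extract_edges(("fma:Heart", "?b", "?c"), ("fma:Heart", "?b", "?c"), fma)
# Second operation: no GRAPH given, so it reads the incoming default graph.
default = extract_edges(("?x", "fma:regional_part", "?z"),
                        ("?x", "fma:regional_part", "?z"), default)
print(default)
```

Because the second operation reads only the default graph, the fma:Liver edge never appears in the output even though it is a fma:regional_part edge in the input graph.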

EXTRACT_TREE

EXTRACT_TREE { <root node> [ <list of properties> ] }
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

EXTRACT_TREE allows a user to indicate a root node and a set of RDF edges that should be recursively followed to extract a subgraph of the input. Within the list of edges, users can specify an edge's direction using either "forward(prop)" or "backward(prop)". "Forward" indicates that an edge should be outgoing from a node; "backward" indicates that an edge should be incoming to a node; if no direction is specified, then it is assumed that the subgraph should be constructed by following edges in both directions.

INPUT 
<http://sig.biostr.washington.edu/fma3.0>
{
    EXTRACT_TREE { fma:Liver [forward(fma:regional_part)] } 
        GRAPH <http://sig.biostr.washington.edu/fma3.0>
} OUTPUT <liver_regional_parts>

In this example, the EXTRACT_TREE operation starts at the fma:Liver and recursively traverses the graph over outgoing fma:regional_part edges. A graph is produced containing the fma:Liver regional_part hierarchy.
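The traversal that EXTRACT_TREE performs can be sketched as a worklist algorithm in Python. This is a sketch under assumptions: the direction keywords are modeled as the strings "forward", "backward", and "both" (the last standing for "no direction specified"), the optional WHERE clause is ignored, and the part names below are placeholders rather than FMA content.

```python
def extract_tree(graph, root, props):
    """Collect every edge reached from `root` by following `props`.
    props is a list of (direction, property) pairs."""
    tree, seen, frontier = set(), {root}, [root]
    while frontier:
        node = frontier.pop()
        for s, p, o in graph:
            for direction, prop in props:
                if p != prop:
                    continue
                if direction in ("forward", "both") and s == node:
                    nxt = o   # follow an outgoing edge
                elif direction in ("backward", "both") and o == node:
                    nxt = s   # follow an incoming edge
                else:
                    continue
                tree.add((s, p, o))
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return tree

# Placeholder hierarchy for illustration.
graph = {
    ("fma:Liver", "fma:regional_part", "ex:Lobe"),
    ("ex:Lobe", "fma:regional_part", "ex:Segment"),
    ("fma:Liver", "ex:label", "liver"),
    ("ex:Other", "fma:regional_part", "ex:Thing"),
}

liver_parts = extract_tree(graph, "fma:Liver", [("forward", "fma:regional_part")])
print(sorted(liver_parts))
```

The `seen` set keeps the traversal from looping forever if the followed edges contain a cycle.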

EXTRACT_REACHABLE

EXTRACT_REACHABLE { <root node> [ <list of properties> ] }
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

EXTRACT_REACHABLE allows a user to indicate a root node and a set of RDF edges that should be recursively followed to identify the set of nodes that can be reached by traversing those edges. As with EXTRACT_TREE, direction can be specified for each of the edges. The operation produces an RDF object list with <root node> as the subject, <http://localhost/reaches> as the predicate, and a reachable node as the object.

INPUT 
<http://sig.biostr.washington.edu/fma3.0>
{
    EXTRACT_REACHABLE { fma:Liver [forward(fma:regional_part)] } 
       GRAPH <http://sig.biostr.washington.edu/fma3.0>
} OUTPUT <liver_reachable_regional_parts>

This example produces all of the parts that can be reached by starting at the fma:Liver and recursively traversing fma:regional_part links. A snippet is shown here:

fma:Liver
      <http://localhost/reaches>
                    fma:Quadrate_lobe_of_liver ;
      <http://localhost/reaches>
                    fma:Caudate_lobe_of_liver ;
      <http://localhost/reaches>
                    fma:Lateral_inferior_area_of_lateral_segment_of_left_lobe_of_liver ;
      <http://localhost/reaches>
                    fma:Right_segment_of_caudate_lobe_of_liver ;
      <http://localhost/reaches>
                    fma:Anterior_superior_area_of_anterior_segment_of_right_lobe_of_liver ;
...
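The difference from EXTRACT_TREE is that only the set of reached nodes is reported, each attached to the root by a reaches edge. A minimal Python sketch, assuming forward edges only and placeholder part names (extract_reachable is a hypothetical name):

```python
REACHES = "http://localhost/reaches"

def extract_reachable(graph, root, forward_props):
    """Return (root, REACHES, node) for every node reachable from `root`
    by following outgoing edges whose property is in `forward_props`."""
    reached, frontier = set(), [root]
    while frontier:
        node = frontier.pop()
        for s, p, o in graph:
            if s == node and p in forward_props and o not in reached:
                reached.add(o)
                frontier.append(o)
    return {(root, REACHES, node) for node in reached}

# Placeholder hierarchy for illustration.
graph = {
    ("fma:Liver", "fma:regional_part", "ex:Lobe"),
    ("ex:Lobe", "fma:regional_part", "ex:Segment"),
    ("fma:Liver", "ex:label", "liver"),
}

result = extract_reachable(graph, "fma:Liver", {"fma:regional_part"})
print(sorted(result))
```

Note that ex:Segment is reported with fma:Liver as its subject even though the two are not directly connected in the input.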

EXTRACT_PATH

EXTRACT_PATH { <source node> [ <list of properties> ] <sink node> }
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

EXTRACT_PATH allows a user to specify a <source node> and <sink node> and a list of properties; the operation returns the subgraph containing the path from source to sink by recursively traversing the list of properties.

INPUT <http://sig.biostr.washington.edu/fma3.0>
{
    EXTRACT_PATH { fma:Liver [forward(fma:regional_part)]  fma:Medial_inferior_area_of_medial_segment_of_left_lobe_of_liver } 
        GRAPH <http://sig.biostr.washington.edu/fma3.0>
} OUTPUT <sample_extract_path>

This example returns the subgraph that is the path taken by traversing fma:regional_part links from fma:Liver to fma:Medial_inferior_area_of_medial_segment_of_left_lobe_of_liver.
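One way to compute such a path subgraph is to keep exactly those edges that lie on some route from the source to the sink. The Python sketch below takes that approach; it is an assumption about the semantics (the text above does not say what happens when several paths exist), it handles forward edges only, and the node names other than fma:Liver are placeholders.

```python
def reach(start, edges, reverse=False):
    """Nodes reachable from `start` over (subject, object) pairs,
    following the pairs backwards when reverse=True."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for s, o in edges:
            a, b = (o, s) if reverse else (s, o)
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

def extract_path(graph, source, props, sink):
    """Keep edges whose subject is reachable from `source` and whose
    object can still reach `sink`, using only properties in `props`."""
    edges = {(s, o) for s, p, o in graph if p in props}
    from_source = reach(source, edges)
    to_sink = reach(sink, edges, reverse=True)
    return {(s, p, o) for s, p, o in graph
            if p in props and s in from_source and o in to_sink}

# Placeholder hierarchy for illustration.
graph = {
    ("fma:Liver", "fma:regional_part", "ex:Left_lobe"),
    ("ex:Left_lobe", "fma:regional_part", "ex:Target_area"),
    ("fma:Liver", "fma:regional_part", "ex:Right_lobe"),  # dead end
}

path = extract_path(graph, "fma:Liver", {"fma:regional_part"}, "ex:Target_area")
print(sorted(path))
```

The edge to ex:Right_lobe is dropped because ex:Right_lobe cannot reach the sink.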

EXTRACT_RECURSIVE

EXTRACT_RECURSIVE
{ { <list of triple patterns> } 
      [ GRAPH <uri> ]
      [ WHERE { <list of triple patterns> } ]
}
{ { <list of triple patterns> } 
      [ GRAPH <uri> ]
      [ WHERE { <list of triple patterns> } ]
}

EXTRACT_RECURSIVE is a general recursion mechanism allowing users to precisely specify the edges that they want to follow and the output that should be produced. (All of the other recursive extract operations can be expressed using this mechanism.) The operation consists of a list of EXTRACT_EDGES-style cases, each grouped by an outer "{ ... }". This list should be considered a set of base cases and recursive cases: a base case accesses only the incoming set of RDF graphs; a recursive case accesses both the incoming set of RDF graphs and the result set being produced, referenced by the graph name <recursive>. The operation first evaluates each case and adds the new results to the output <recursive> graph using set union; the recursive cases are then evaluated again and their results are added to <recursive>. This iteration continues until a stable state is reached, that is, until no new results are added to <recursive>. (Note: introducing dynamically generated URIs can result in non-termination.)

INPUT <http://sig.biostr.washington.edu/fma3.0>
{
     EXTRACT_RECURSIVE 
     { { fma:Liver fma:regional_part ?c } 
       GRAPH <http://sig.biostr.washington.edu/fma3.0>
       WHERE {  fma:Liver fma:regional_part ?c}  
     }
     { { ?c ?b ?e } 
       GRAPH <http://sig.biostr.washington.edu/fma3.0>
       WHERE { GRAPH <recursive> { ?a ?b ?c } . 
               ?c ?b ?e
             }
     }
} OUTPUT <sample_extract_recursive>

This example performs the same action as the EXTRACT_TREE example for finding the fma:regional_part hierarchy for fma:Liver. The first (base) case finds all of the direct fma:regional_parts of the fma:Liver; the second (recursive) case uses the results found in <recursive> and expands the graph one level deeper, continuing until the entire fma:Liver fma:regional_part hierarchy has been produced.
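The iterate-until-stable evaluation can be sketched as a fixpoint loop in Python. The sketch assumes one base case and one recursive case, each written as an ordinary function; extract_recursive, base, and step are hypothetical names, and the part names are placeholders.

```python
def extract_recursive(graph, base_case, recursive_case):
    """Evaluate the base case once, then re-evaluate the recursive case
    until it contributes no new triples to <recursive>."""
    recursive = set(base_case(graph))
    while True:
        new = recursive_case(graph, recursive) - recursive
        if not new:            # stable state: nothing new was added
            return recursive
        recursive |= new       # set union into <recursive>

def base(graph):
    # { fma:Liver fma:regional_part ?c }
    return {t for t in graph
            if t[0] == "fma:Liver" and t[1] == "fma:regional_part"}

def step(graph, recursive):
    # { ?c ?b ?e } WHERE { GRAPH <recursive> { ?a ?b ?c } . ?c ?b ?e }
    return {(s, p, o)
            for (_, b, c) in recursive
            for (s, p, o) in graph
            if s == c and p == b}

# Placeholder hierarchy for illustration.
graph = {
    ("fma:Liver", "fma:regional_part", "ex:Lobe"),
    ("ex:Lobe", "fma:regional_part", "ex:Segment"),
    ("ex:Lobe", "ex:label", "lobe"),
}

result = extract_recursive(graph, base, step)
print(sorted(result))
```

The loop terminates here because every result triple already exists in the finite input graph; a recursive case that minted new URIs on each pass could keep producing new triples forever, which is the non-termination risk noted above.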

IML Operations: Modification

IML provides a number of high-level operations for modifying an RDF graph. The operations include delete_edge, delete_node, delete_property, delete_tree, move_subject, move_object, replace_property, replace_node, replace_literal, and replace_edge_{subject, property, object, literal}.

DELETE_EDGE

DELETE_EDGE < <triple pattern> >
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]
    [ CLEANOPT ]

DELETE_EDGE allows a user to indicate the specific RDF triples that should be deleted from a graph. If the optional GRAPH clause is used to specify an input graph, then the edges are deleted from this input graph; otherwise, the edges are deleted from the incoming default graph. A single triple pattern is used to indicate the specific edges that should be deleted from the graph; bindings from the WHERE clause can be used to identify multiple edges to be deleted at once.

INPUT <http://sig.biostr.washington.edu/fma3.0>
{
      DELETE_EDGE < fma:Liver ?b ?c >  
         GRAPH <http://sig.biostr.washington.edu/fma3.0> 
         WHERE { fma:Liver ?b ?c } 
} OUTPUT <sample_delete_edge>

In this example, all of the direct outgoing properties of fma:Liver are deleted from the FMA. The output of this operation is the entire FMA except the deleted edges.
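In set terms, this is a filter that drops every triple matching the pattern. A minimal Python sketch, assuming the WHERE clause is the same single triple pattern as the one being deleted (as in the example above) and using placeholder data; delete_edge is a hypothetical name:

```python
def delete_edge(graph, pattern):
    """Return the graph without the triples that match `pattern`.
    Terms starting with "?" are variables."""
    def matches(triple):
        b = {}
        return all((b.setdefault(t, v) == v) if t.startswith("?") else t == v
                   for t, v in zip(pattern, triple))
    return {t for t in graph if not matches(t)}

# Placeholder data for illustration.
graph = {
    ("fma:Liver", "fma:regional_part", "ex:Lobe"),
    ("fma:Liver", "ex:label", "liver"),
    ("ex:Lobe", "fma:regional_part", "ex:Segment"),
}

remaining = delete_edge(graph, ("fma:Liver", "?b", "?c"))
print(remaining)
```

Only the outgoing edges of fma:Liver are removed; every other edge is passed through unchanged.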

DELETE_NODE

DELETE_NODE <node>
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]
    [ CLEANOPT ]

DELETE_NODE allows a user to indicate a specific node that should be deleted from the input graph. All edges that contain the node as either a subject or object are deleted from the graph. The user can specify either a URI or a variable (bound in the WHERE clause) to indicate the node to be deleted.

If the optional GRAPH clause is used to specify an input graph, then the edges are deleted from this input graph; otherwise, the edges are deleted from the incoming default graph.

DELETE_PROPERTY

DELETE_PROPERTY <property>
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]
    [ CLEANOPT ]

DELETE_PROPERTY allows a user to indicate a specific property that should be deleted from the input graph. All edges with the specified property are deleted from the graph. The user can specify either a URI or a variable (bound in the WHERE clause) to indicate the property to be deleted.

If the optional GRAPH clause is used to specify an input graph, then the edges are deleted from this input graph; otherwise, the edges are deleted from the incoming default graph.
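DELETE_NODE and DELETE_PROPERTY differ only in which triple position is tested. The Python sketch below shows both as set filters over a graph of (subject, predicate, object) tuples; it handles a ground URI only (no WHERE-bound variables, no CLEANOPT), and the function names and data are illustrative.

```python
def delete_node(graph, node):
    """Drop every edge that has `node` as its subject or its object."""
    return {(s, p, o) for s, p, o in graph if s != node and o != node}

def delete_property(graph, prop):
    """Drop every edge whose property is `prop`."""
    return {(s, p, o) for s, p, o in graph if p != prop}

# Placeholder data for illustration.
graph = {
    ("fma:Liver", "fma:regional_part", "ex:Lobe"),
    ("ex:Lobe", "fma:regional_part", "ex:Segment"),
    ("ex:Lobe", "ex:label", "lobe"),
}

print(delete_node(graph, "ex:Lobe"))
print(delete_property(graph, "fma:regional_part"))
```

Deleting ex:Lobe removes all three edges here, since each one touches that node, which can leave other nodes (such as ex:Segment) disconnected.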

DELETE_TREE

DELETE_TREE { <root node> [ <list of properties> ] }
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]
    [ CLEANOPT ]

DELETE_TREE allows a user to indicate a subgraph that should be deleted from the input graph. The <root node> and <list of properties> define a subgraph (as described in EXTRACT_TREE) that should be identified and deleted from the graph. The user can specify either a URI or a variable (bound in the WHERE clause) to indicate the data to be deleted.

If the optional GRAPH clause is used to specify an input graph, then the edges are deleted from this input graph; otherwise, the edges are deleted from the incoming default graph.

MOVE_SUBJECT

MOVE_SUBJECT < <triple pattern> > <replacement node>
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

MOVE_SUBJECT can be used to change the subject of a particular RDF triple. (This is effectively the same as REPLACE_EDGE_SUBJECT.) The subject of the specified triple pattern is changed to <replacement node> in the operation's output. All other edges in the input graph are unchanged and exist in the operation's output graph.

MOVE_OBJECT

MOVE_OBJECT < <triple pattern> > <replacement node>
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

MOVE_OBJECT can be used to change the object of a particular RDF triple. (This is effectively the same as REPLACE_EDGE_OBJECT.) The object of the specified triple pattern is changed to <replacement node> in the operation's output. All other edges in the input graph are unchanged and exist in the operation's output graph.
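As an illustration of the two MOVE operations, the following Python sketch rewrites one ground triple and leaves the rest of the graph untouched. It assumes the triple is given explicitly rather than through WHERE-bound variables; the function names and the ex: data are placeholders.

```python
def move_subject(graph, triple, replacement):
    """Re-attach `triple` to a new subject; all other edges are kept."""
    if triple not in graph:
        return set(graph)
    _, p, o = triple
    return (graph - {triple}) | {(replacement, p, o)}

def move_object(graph, triple, replacement):
    """Point `triple` at a new object; all other edges are kept."""
    if triple not in graph:
        return set(graph)
    s, p, _ = triple
    return (graph - {triple}) | {(s, p, replacement)}

# Placeholder data for illustration.
graph = {
    ("ex:A", "ex:part", "ex:X"),
    ("ex:A", "ex:part", "ex:Y"),
}

moved = move_subject(graph, ("ex:A", "ex:part", "ex:X"), "ex:B")
print(sorted(moved))
```

Only the named edge changes; the second ex:A edge still has ex:A as its subject.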

REPLACE_NODE

REPLACE_NODE <node to replace> <replacement node>
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

REPLACE_NODE can be used to change all instances of a resource URI <node to replace> with <replacement node> in the output graph. All other edges in the input graph are unchanged and exist in the operation's output graph.

REPLACE_PROPERTY

REPLACE_PROPERTY <property to replace> <replacement property>
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

REPLACE_PROPERTY can be used to change all instances of a property URI <property to replace> with <replacement property> in the output graph. All other edges in the input graph are unchanged and exist in the operation's output graph.
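Unlike the MOVE operations, REPLACE_NODE and REPLACE_PROPERTY rewrite every occurrence in the graph. A small Python sketch of that renaming, using hypothetical function names and placeholder data, and ignoring the optional GRAPH and WHERE clauses:

```python
def replace_node(graph, old, new):
    """Rename `old` to `new` wherever it occurs as a subject or object."""
    def swap(term):
        return new if term == old else term
    return {(swap(s), p, swap(o)) for s, p, o in graph}

def replace_property(graph, old, new):
    """Rename `old` to `new` wherever it occurs as a property."""
    return {(s, new if p == old else p, o) for s, p, o in graph}

# Placeholder data for illustration.
graph = {
    ("ex:A", "ex:part", "ex:B"),
    ("ex:B", "ex:part", "ex:C"),
    ("ex:B", "ex:label", "b"),
}

print(sorted(replace_node(graph, "ex:B", "ex:B2")))
print(sorted(replace_property(graph, "ex:part", "ex:has_part")))
```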

REPLACE_EDGE_XXX (REPLACE_EDGE_SUBJECT, REPLACE_EDGE_PROPERTY, REPLACE_EDGE_OBJECT, REPLACE_EDGE_LITERAL)

REPLACE_EDGE_XXX < <triple pattern> > <replacement>
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

REPLACE_EDGE_XXX can be used to change the value of a triple position in a specified RDF triple. For example, REPLACE_EDGE_SUBJECT can be used to replace the subject of an RDF triple with <replacement>. All other edges in the input graph are unchanged and exist in the operation's output graph.

REPLACE_EDGE_LITERAL ensures that a literal, which is only permitted in the object position of an RDF triple, is indeed replaced by a literal. This is necessary because replacing an object resource (URI) with an object literal (string) can produce invalid RDF.

IML Operations: Addition

IML provides a small set of operations that can be used to introduce new edges to an RDF graph. The operations are add_edge, merge_nodes, and split_node.

ADD_EDGE

ADD_EDGE < <triple pattern> >
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

ADD_EDGE can be used to create a new RDF edge that should be combined with the input graph. The edge can be hard-coded or can use the variables bound in the optional WHERE clause. All other edges (in either the default graph or the graph specified via GRAPH) are combined with the new edges in the output graph.

INPUT <http://sig.biostr.washington.edu/fma3.0>
{
    EXTRACT_EDGES { fma:Heart ?b ?c } 
        GRAPH <http://sig.biostr.washington.edu/fma3.0>
        WHERE { fma:Heart ?b ?c }
    ADD_EDGE < fma:Heart fma:part ?c> 
        WHERE { { fma:Heart fma:regional_part ?c } 
                UNION 
                { fma:Heart fma:constitutional_part ?c } 
        }
} OUTPUT <sample_add_edge>

This example selects all of the direct outgoing properties of the fma:Heart and places them in the default graph. Then, for every fma:regional_part and fma:constitutional_part of the fma:Heart, it creates a new edge with property fma:part.

MERGE_NODES

MERGE_NODES [ <list of nodes to merge> ]
    [ CREATE <uri or skolem function> ]
    [ RETAIN <node> [ <list of properties to retain from this node during merge> ] ]
    [ ELIMINATE <node> [ <list of properties to eliminate from this node during merge> ] ]
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

MERGE_NODES allows a user to create a new node by combining the edges (both incoming and outgoing) from a set of nodes. The newly created node is either a blank node or has the URI specified via the optional CREATE clause. By default, edges will be added to the output graph to ensure that the newly created node has all of the same incoming and outgoing edges as the nodes in the <list of nodes to merge>. The optional RETAIN and ELIMINATE clauses can be used to specify the edge properties that should and should not be inherited from specific nodes in the <list of nodes to merge>.

This operation is not destructive; it simply adds new edges to the input graph. If the user wishes to delete the nodes that were merged, a separate DELETE_NODE operation should be applied to this operation's results.
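The default (non-destructive) behavior can be sketched in Python as follows. The sketch assumes an explicit URI for the created node and leaves out RETAIN and ELIMINATE; merge_nodes and the ex: data are illustrative only.

```python
def merge_nodes(graph, nodes, created):
    """Add a copy of every edge touching one of `nodes`, redirected to
    `created`. The original edges are kept (the operation only adds)."""
    merged = set(graph)
    for s, p, o in graph:
        if s in nodes or o in nodes:
            merged.add((created if s in nodes else s,
                        p,
                        created if o in nodes else o))
    return merged

# Placeholder data for illustration.
graph = {
    ("ex:A", "ex:p", "ex:X"),
    ("ex:Y", "ex:q", "ex:B"),
    ("ex:C", "ex:r", "ex:D"),
}

merged = merge_nodes(graph, {"ex:A", "ex:B"}, "ex:AB")
print(sorted(merged))
```

ex:AB ends up with ex:A's outgoing edge and ex:B's incoming edge, while ex:A and ex:B themselves remain in the graph until a separate deletion removes them.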

SPLIT_NODE

SPLIT_NODE <node>
    CREATE <uri or skolem function>
    [ RETAIN <node> [ <list of properties for this node to retain during split> ] ]
    [ ELIMINATE <node> [ <list of properties for this node to eliminate during split> ] ]
    CREATE <uri or skolem function>
    [ RETAIN <node> [ <list of properties for this node to retain during split> ] ]
    [ ELIMINATE <node> [ <list of properties for this node to eliminate during split> ] ]
    [ GRAPH <uri> ]
    [ WHERE { <list of triple patterns> } ]

SPLIT_NODE allows a user to create two or more new nodes by replicating the edges (both incoming and outgoing) from a designated node. A list of nodes to create is specified via the CREATE feature. By default, edges will be added to the output graph to ensure that the newly created nodes have all of the same incoming and outgoing edges of the node being split. The optional RETAIN and ELIMINATE features can be used to indicate the specific edge properties that should and should not be inherited from the node being split.

This operation is not destructive; it simply adds new edges to the input graph. If the user wishes to delete the node that was split, a separate DELETE_NODE operation should be applied to this operation's results.
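A possible reading of SPLIT_NODE in Python is shown below. It is a sketch under assumptions: the created nodes are given as explicit URIs, ELIMINATE is interpreted as "do not copy edges with these properties to this new node" (the text above does not spell out the exact rule), and RETAIN is omitted. split_node and the ex: data are hypothetical.

```python
def split_node(graph, node, created, eliminate=None):
    """Copy the edges of `node` onto each node in `created`.
    eliminate maps a created node to the set of properties it must not
    inherit (assumed semantics). Original edges are kept."""
    eliminate = eliminate or {}
    result = set(graph)
    for new in created:
        skip = eliminate.get(new, set())
        for s, p, o in graph:
            if p in skip or node not in (s, o):
                continue
            result.add((new if s == node else s, p, new if o == node else o))
    return result

# Placeholder data for illustration.
graph = {
    ("ex:N", "ex:p", "ex:X"),
    ("ex:Y", "ex:q", "ex:N"),
}

split = split_node(graph, "ex:N", ["ex:N1", "ex:N2"],
                   eliminate={"ex:N2": {"ex:q"}})
print(sorted(split))
```

Here ex:N1 inherits both edges of ex:N, while ex:N2 inherits only the ex:p edge.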

IML Operations: Utility

IML provides a set of utility functions for manipulating graphs and combining the output of IML subquery blocks. These operations are union_graphs, copy_graph, and cleanup.

UNION_GRAPHS

UNION_GRAPHS <list of RDF graphs>

UNION_GRAPHS produces the result of combining all of the graphs in the <list of RDF graphs>. Duplicate edges in multiple RDF graphs will only be seen once in the result graph.

COPY_GRAPH

COPY_GRAPH <RDF graph>

COPY_GRAPH is unique in the set of IML operations in that it is the only operation that explicitly deals with blank node support. COPY_GRAPH takes an input graph and produces an output graph that is identical to the input EXCEPT for those edges containing blank nodes. If a blank node occurs in the input graph, a new corresponding blank node is produced in the output graph. THE BLANK NODE IN THE INPUT GRAPH AND THE BLANK NODE IN THE OUTPUT GRAPH ARE NOT THE SAME. This is traditional RDF blank node semantics.

In contrast, all other IML operations use a modified version of blank node semantics. Each graph produced by an IML subquery block is considered a "virtual graph" which, instead of containing newly generated blank nodes, uses pointers to the blank nodes in the original graph. This allows the blank nodes in subquery results to be directly compared with the blank nodes in the corresponding input graphs. This is not traditional RDF blank node semantics.
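The copying behavior of COPY_GRAPH can be illustrated with a small Python sketch. It assumes blank nodes are written as strings starting with "_:" (as in Turtle) and mints fresh labels with uuid; copy_graph is a hypothetical name and the data is a placeholder.

```python
import uuid

def copy_graph(graph):
    """Copy a graph, giving each blank node a fresh label. The same
    input blank node maps to the same new blank node throughout."""
    mapping = {}

    def copy_term(term):
        if not term.startswith("_:"):
            return term                      # URIs and literals are unchanged
        if term not in mapping:
            mapping[term] = "_:" + uuid.uuid4().hex
        return mapping[term]

    return {(copy_term(s), p, copy_term(o)) for s, p, o in graph}

# Placeholder data for illustration.
graph = {
    ("_:b0", "ex:p", "ex:A"),
    ("ex:B", "ex:q", "_:b0"),
}

copy = copy_graph(graph)
print(sorted(copy))
```

The copy has the same shape as the input, but its blank node cannot be compared for equality with _:b0 in the original graph.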

CLEANUP

CLEANUP [ <list of reference nodes> ] [ <list of properties or "forward" or "backward" or "all"> ]
    [ GRAPH <uri> ]

CLEANUP allows a user to perform garbage collection over an RDF graph. The various IML transformations may have caused a graph to become multiple unconnected subgraphs, and a user may only be interested in one of those subgraphs. CLEANUP allows the user to indicate a list of reference nodes that are in the relevant subgraph and the set of properties used to identify all of the connected nodes in the subgraph. At the end of this operation, the output graph will only contain the subgraph indicated to be relevant by the user.

When specifying the paths that should be traversed in identifying "live" nodes, a user can specify either a list of properties or simply "forward", "backward", or "all". "Forward" indicates that all outgoing edges from the list of reference nodes should be included in the live set; "backward" indicates that all incoming edges to the list of reference nodes should be included in the live set; "all" indicates that both incoming and outgoing edges should be included in the live set.

This operation can be useful for significantly reducing the size of the graph that needs to be manipulated by subsequent transformations.

INPUT <http://sig.biostr.washington.edu/fma3.0>
{
    CLEANUP [fma:Organ, fma:Cardinal_organ_part] [forward(rdfs:subClassOf)]
    GRAPH <http://sig.biostr.washington.edu/fma3.0>
} OUTPUT <example_cleanup>

This example identifies fma:Organ and fma:Cardinal_organ_part as the starting reference nodes in the garbage collection process. In growing the set of "live" edges, the operation should only follow forward(rdfs:subClassOf) links. (This produces the union of the subClass hierarchy of fma:Organ and fma:Cardinal_organ_part.) The edges in the identified "live" set are the only edges in the output graph.
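The garbage collection can be thought of as a mark-and-sweep pass: mark every edge reachable from the reference nodes, then keep only the marked edges. The Python sketch below assumes forward traversal over an explicit set of properties; cleanup is a hypothetical name and the class names are placeholders, not FMA content.

```python
def cleanup(graph, reference_nodes, forward_props):
    """Keep only edges reachable from `reference_nodes` by following
    outgoing edges whose property is in `forward_props`."""
    live_nodes = set(reference_nodes)
    frontier = list(reference_nodes)
    live_edges = set()
    while frontier:
        node = frontier.pop()
        for s, p, o in graph:
            if s == node and p in forward_props:
                live_edges.add((s, p, o))       # mark the edge as live
                if o not in live_nodes:
                    live_nodes.add(o)
                    frontier.append(o)
    return live_edges                           # everything else is swept

# Placeholder data for illustration.
graph = {
    ("ex:A", "rdfs:subClassOf", "ex:B"),
    ("ex:B", "rdfs:subClassOf", "ex:C"),
    ("ex:D", "rdfs:subClassOf", "ex:C"),
    ("ex:E", "rdfs:subClassOf", "ex:F"),   # unrelated subgraph
    ("ex:A", "ex:label", "a"),             # property not followed
}

live = cleanup(graph, ["ex:A", "ex:D"], {"rdfs:subClassOf"})
print(sorted(live))
```

Starting from two reference nodes yields the union of the two traversals; the unrelated ex:E edge and the ex:label edge are discarded.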