Short User Manual for the Concept Retrieval Interface

Author: Wolfgang Lipp
Contact: w dot lipp at bgbm dot org
Institution: Botanical Garden and Botanic Museum (BGBM), Berlin (http://bgbm.org)
Version: 0.0.1
Copyright: This document has been placed in the public domain.
Date: 2005-02-20

Table of Contents

About this document

This document describes how to accomplish the most important tasks around managing, updating and using the Concept Retrieval Interface (CRI) for the ABCD Configuration Assistant.

Symbolic References

In file locators, the following symbolic references are used:

$bc           the BioCASe home directory
$cfg          $bc/configuration
$persistence  $cfg/persistence
$u            $bc/lib/biocase/configtool/utilities

Furthermore, we splice file locators into object-oriented names using braces; for example: {$u/makeDatabases/makeGrove.py}.populate(ds) points to the object (a module-level function in this case) named populate that lives in file $u/makeDatabases/makeGrove.py.

Usage Recipes

Building the Grove Database

The central database that holds all data on graphs registered with the CRI, dsGrove, must be rebuilt every time a graph is to be added to or deleted from it. At present, there is no recommended way to prune or graft items from or into the grove. The script that performs the database reconstruction is $bc/tools/makeGrove.py (see below for details).

As detailed in the section on grove sourcefiles, all valid CMFs dropped into $cfg/templates/cmf will be added to the grove without further configuration or registration; an association of the namespace URL with a short graph name in {$cfg/.configuration}.namespacesByPrefix is optional. Graphs that adhere to the simple *.flow format used for the :bion: graph of conceptual terms must, however, still be added to ~.additionalGraphsourcesByPrefix and ~.additionalNamespacesByPrefix.

Updating the Graph of Conceptual Terms

It may sometimes become desirable to modify :bion:, the ontology of terms used in the topical search mode of the CRI. This entails changes at two places: (1) the ontology itself must be changed; (2) topical assessments made under the assumption of the previous version must be modified so the concept locators used there still point to valid concepts.

Updating :bion:

The structure of the *.flow notation is very straightforward -- entries are listed on separate lines, unguarded texts indicate conceptual terms, and indentation is used to symbolize parent/child relationships. Additionally, the following symbols for the secondary (supplementary, logical) structure are used:

Symbol  Meaning
<=>     is equivalent to
<=      is subsequent of
=>      is antecedent of

As an example, consider the following snippet:

collection
    botanical-garden
    herbarium
    zoological-garden
        <= /bion/organism-group/animals

This structure defines, inter alia, that zoological-garden and herbarium are both child nodes of collection, and that the concept identified as /​bion/​organism-group/​animals is a logical antecedent of zoological-garden. As becomes evident, when nodes are moved around in the notation, references to nodes -- which must at present quote absolute locators -- may have to be updated alongside.
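To make the notation concrete, here is a minimal, hypothetical parser for the *.flow format sketched above; it is an illustration only (the function name, the assumed 4-space indentation unit, and the returned data structures are not part of the actual CRI code):

```python
def parse_flow(text):
    """Parse *.flow text into (children, relations): children maps each
    parent locator to its child locators; relations collects the <=>, <=
    and => annotations as (symbol, source, target) triples."""
    children  = {}
    relations = []
    stack     = []                      # names of the current ancestor chain
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = (len(line) - len(line.lstrip())) // 4
        entry  = line.strip()
        del stack[indent:]              # drop deeper levels that just ended
        if entry.startswith(('<=>', '<=', '=>')):
            symbol, _, target = entry.partition(' ')
            source = '/' + '/'.join(stack)
            relations.append((symbol, source, target.strip()))
            continue
        locator = '/' + '/'.join(stack + [entry])
        parent  = '/' + '/'.join(stack) if stack else '/'
        children.setdefault(parent, []).append(locator)
        stack.append(entry)
    return children, relations
```

Fed the snippet above, this would report botanical-garden, herbarium, and zoological-garden as children of collection, and record the <= relation with /collection/zoological-garden as its source.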

Updating Assessments

When the graph of conceptual terms has been updated, it is necessary to check the assessment source files for inconsistencies. If a concept has been moved to a location that is semantically and topologically similar to the original one, then it should suffice to replace the old concept locators with new ones; if, however, the move is thought to entail a semantic change in the concept, or if the topological context of a term is significantly modified, it will also be necessary to modify the values given for the topic/focus pairs concerned. This may or may not entail the modification of assessments made for neighboring concepts.
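A simple consistency check along these lines can be sketched as follows; this is purely illustrative (the helper name and the line-based matching are assumptions, not the actual CRI tooling), assuming assessment files as described under Source Files for Assessments:

```python
def find_stale_locators(assessment_lines, valid_locators):
    """Yield (lineno, locator) for every :bion: topic locator that no
    longer names a concept in the updated ontology."""
    for lineno, line in enumerate(assessment_lines, start=1):
        line = line.strip()
        if not line.startswith(':bion:'):
            continue                    # skip focus lines and blanks
        locator = line[len(':bion:'):]
        if locator not in valid_locators:
            yield lineno, locator

valid = {'/bion/collection/organism-group/plants'}
lines = [
    ':bion:/bion/collection/organism-group/plants',
    '    :abcd12:/DataSets/DataSet/Units/Unit/Identifications#1',
    ':bion:/bion/collection/organism-group/fungi',
    ]
print(list(find_stale_locators(lines, valid)))
# [(3, '/bion/collection/organism-group/fungi')]
```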

Updating Documentation and Lexico-Statistical Data

As part of the functionality of the makeGrove.py script, documentation is also gathered from the XML source files and prepared for the lexico-statistical interface. It is possible, though not recommended, to update this part of the database alone by running $bc/tools/makeDocumentation.py, or by running $u/makeDatabases/makeGrove.py with the parameters suitably set.

Retrieving Graph and Node Details from dsGrove

This snippet shows how to (1) retrieve dsGrove, the object representing the grove database; (2) retrieve the table that holds summary data on each graph; and (3) iterate over all or over a selection of nodes from each graph. The example assumes that the script adheres to the remarks made about running scripts within the CRI context; under these conditions, g is a variable that provides direct access to modules without the need for import statements. The code is also available in $bc/tools/demoGroveUsage.py.

#===========================================================================================================
import                                  sys
import                                  imp
import os.path as                       _ospath
u                                       = g.configtool.utilities
_feds                                   = u.feds
inter                                   = u.kpyInterpol.inter

#===========================================================================================================
if __name__ == '__main__':

    #=======================================================================================================
    #   (1)     retrieve grove datasource
    #=======================================================================================================
    groveFed    = _feds.resolve( 'Grove' )
    dsGrove     = groveFed.dsGrove
    graphnames  = list( dsGrove.iterGraphnames() )
    graphnames.sort()
    graphnames  = '\n\t'.join( graphnames )
    print inter( ''
        +   "The grove contains graphs with the following names:\n\t$graphnames"
        )

    #=======================================================================================================
    #   (2)     iterate over graph facts
    #=======================================================================================================
    for graph in dsGrove.getTable( 'graph' ):
        print
        #   print graph
        print inter( ''
            +   "*\tGraph :$graph.name: represents\n"
            +   "\tnamespace '$graph.namespaceurl'\n"
            +   "\tand has $graph.length nodes. It has been constructed from\n"
            +   "\t'$graph.localgraphsource'."
            )
        #===================================================================================================
        #   (3)     iterate over node facts
        #===================================================================================================
        if graph.name == 'abcd12':
            nodename = '@PreferredFlag'
            nodes    = dsGrove.catchTable(
                node = dict(                # select * from node where
                    gauge   = graph.name,   #   graph = 'abcd12' and
                    name    = nodename,     #   name = '@PreferredFlag';
                    )
                )
            print inter( ''
                +   "\n\tThere are $len(nodes) nodes called '$nodename' in graph :$graph.name:;\n"
                +   "\tthey are located below the following nodes:"
                )
            nodetable = dsGrove.getTable( 'nodetable' )
            for node in nodes:
                termref = ( graph.name, node.parentTermnr )
                parentnode = dsGrove.nodefactForTermref( termref )
                print inter( '\t\t$parentnode.locator' )

Normalizing Namespaces and Graphnames

When working with graph-related data, it is important both to uniquely identify a graph when addressing it and to mark up data in a readable fashion. In both cases, it is sometimes advantageous to use short names, and sometimes it is preferable or necessary to use namespace URLs. While the web interface allows both kinds of identifiers to be mixed, much of the grove API does not; it then becomes necessary to perform conversions before using the API. The responsible method is {$u/feds/DatasourceClasses/Grove.py}.getGraphnameAndNamespace(), which accepts both namespace URLs and graphnames and will always try to return a (graphname, namespace) pair.

This piece of code demonstrates how to use this method to unify graph identification data (observe the remarks on running scripts):

dsGrove = g.configtool.utilities.feds.resolve( 'Grove' ).dsGrove

for either in (
    'bion',
    'http://www.biocase.org/schemas/metaprofile/1.2',
    'http://www.biocase.org/schemas/bion/0',
    'http://www.tdwg.org/schemas/abcd/1.2',
    'abcd12',
    ):

    graphname, namespace = dsGrove.getGraphnameAndNamespace( either )
    print 'Namespace URL:', namespace
    print 'Filed under  :', graphname

In order to prepare a list of namespace URLs for use with a method that expects graphnames, use an incantation along the lines of

namespaces = [
    'http://www.biocase.org/schemas/metaprofile/1.2',
    'http://www.biocase.org/schemas/bion/0',
    ]
graphnames = [
    dsGrove.getGraphnameAndNamespace( namespace )[ 0 ]
        for namespace in namespaces
        ]

Searching the Topical Database Using the Ratings API

The script $bc/tools/​searchTopicalRatings.py demonstrates how to use the Ratings API to conduct topical searches against the ratings for focus/topic pairs derived by $u/makeDatabases/​makeRatings.py (cf. section on Make* Scripts). The central lines are quite simple (see Running Scripts for the meaning of g):

_Ratingscalculator  = g.configtool.utilities.tempratingsweb.Ratingscalculator.Ratingscalculator
rc                  = _Ratingscalculator( query, topicGraphname, focusGraphname )
ratings             = rc.getRatings()
#---------------------------------------------------------------------------------------------------
for rating, focusTermnr in ratings:
    print rating, focusTermnr

You can use the Grove Database e.g. to convert between term numbers and schema node locators (this is demonstrated in $bc/tools/​searchTopicalRatings.py); read Normalizing Namespaces and Graphnames for how to convert between graph names and namespace URLs.

Searching the Lexico-Statistical Database Using the Lupy Wrapper API

The script $bc/tools/searchWithLupy.py demonstrates how to use the Lupy Wrapper API to conduct lexical searches against the documentation[1] that was extracted from the CMFs. The underlying functionality demonstrated in this script is formulated in {$u/lupywrapper/search.py}.search(phrase,*graphnames); this module-level function accepts a string containing whitespace-separated query terms and a possibly empty iterable of graphnames[2]; it returns a triplet (normalphrase,graphnames,resulttriplets), where normalphrase is a string containing the normalized terms of the search phrase, graphnames is a set of the names of the graphs that were searched, and resulttriplets is a reversely sorted list consisting itself of (rating,graphname,termnr) triplets. Sometimes life is easier using {$u/lupywrapper/search.py}.nodefactsForSearch() instead, which returns (normalphrase,graphnames,resultpairs) triplets where resultpairs is a list of (rating,nodefact) pairs.

As in the web interface, the mentioned API functions will honor some special syntax for the search phrase: A + (plus) prepended to a term will restrict results to those where the term actually appears; a - (minus) in front of a term excludes all results that show that term; a term encased in colons, like :abcd12:, specifies a graph to search. Multiple graphnames may be given, and in case no graphnames are given, the entire grove will be searched.
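The effect of that syntax can be sketched with a small, hypothetical tokenizer (not the actual wrapper code, which does considerably more):

```python
def dissect_phrase(phrase):
    """Split a search phrase into required terms (+), excluded terms (-),
    plain terms, and graphnames given as :name: tokens."""
    required, excluded, plain, graphnames = [], [], [], []
    for token in phrase.split():
        if token.startswith(':') and token.endswith(':') and len(token) > 2:
            graphnames.append(token.strip(':'))
        elif token.startswith('+'):
            required.append(token[1:])
        elif token.startswith('-'):
            excluded.append(token[1:])
        else:
            plain.append(token)
    return required, excluded, plain, graphnames

print(dissect_phrase('+plants -fungi garden :abcd12:'))
# (['plants'], ['fungi'], ['garden'], ['abcd12'])
```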

[1]in fact, not only is the documentation present for a particular node included in lexical searches, but also the documentation of ancestral nodes and the parsed locator of each element. For example, if a node /schema/geosciences/geographyAndDemography/europe is undocumented, but /schema/geosciences and /schema/geosciences/geographyAndDemography both do have annotations in their CMFs, then terms used in that ancestral documentation are added for the node in question, as are the terms from the node locator, i.e. schema, geoscience, geography, demography, and europe.
[2]read Normalizing Namespaces and Graphnames in order to learn how to convert namespace URLs to be used with this function
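The locator tokenization mentioned in footnote [1] can be approximated as follows; this sketch splits camelCase element names into words but does not reproduce any stemming (so 'geosciences' stays plural here, whereas the footnote lists 'geoscience'):

```python
import re

def locator_terms(locator):
    """Split a node locator into lowercase index terms, breaking
    camelCase steps apart and dropping the 'And' connective."""
    terms = []
    for step in locator.strip('/').split('/'):
        words = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])', step)
        terms.extend(w.lower() for w in words if w.lower() != 'and')
    return terms

print(locator_terms('/schema/geosciences/geographyAndDemography/europe'))
# ['schema', 'geosciences', 'geography', 'demography', 'europe']
```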

This code shows how to use the Lupy wrapper. Instead of calling search() and then dsGrove.nodefactForTermref(termref), we could also have chosen to call .nodefactsForSearch():

u                                       = g.configtool.utilities
_feds                                   = u.feds
_inter                                  = u.kpyInterpol.inter
_search                                 = u.lupywrapper.search.search
_wrap                                   = u.typo.wrap

#===========================================================================================================
def searchAndTell( phrase ):
    #-------------------------------------------------------------------------------------------------------
    groveFed    = _feds.resolve( 'Grove' )
    dsGrove     = groveFed.dsGrove
    #-------------------------------------------------------------------------------------------------------
    normalphrase, graphnames, results = _search( phrase )
    graphnames = list( graphnames )
    graphnames.sort()
    graphnames = ', '.join( graphnames )
    #-------------------------------------------------------------------------------------------------------
    print _inter( ''
        +   "You searched for '$normalphrase'\n"
        +   "in $graphnames. The query returned $len(results) results:"
        )
    for result in results:
        rating, graphname, termnr = result
        rating      = '%5.3f' % rating
        termref     = ( graphname, termnr )
        nodefact    = dsGrove.nodefactForTermref( termref )
        short, long = dsGrove.getDocumentationForTermref( termref )
        print
        print _inter( '($rating) $nodefact.ontonym' )
        if short:
            print _wrap( short )

Building CRI Query URLs

Using the CGI script $bc/www/​configtool/​ratingsweb/​index.cgi, it is possible to fetch results of topical and lexical searches via the CRI web interface. This script ultimately calls {$u/tempratingsweb}​.MyServlet.respond_default() and accepts a number of URL query string parameters; these are

param  name              default  purpose
q      Query             ''       the search phrase, e.g. 'zoo* or botanical-garden'
ts     Topic Schema      'bion'   the topic schema (from which the search terms are taken)
te     Target Element    ''       name of a JavaScript function in the opening window that may be called with the graphname and the termnr as arguments; this function is responsible for further processing of the result. See Addressing the Interface from a Web Page for more.
dl     Display Locators  'n'      whether ('y') or not ('n') to display concept locators
dd     Display Docu      'n'      display no ('n'), short ('y'), or extended ('x') documentation for results
st     Search Type       't'      whether to conduct a topical ('t') or a lexico-statistical ('l') search
rf     Result Format     'g'      whether to return graphnames ('g') or namespace URLs ('n')

Assuming the server is listening on port 8080 of localhost, this example shows how to construct a simple query URL (with the parameters displayed one per line):

http://127.0.0.1:8080/biocase/configtool/ratingsweb/index.cgi?
    fs=http://www.tdwg.org/schemas/abcd/1.2
    &ts=http://www.biocase.org/schemas/bion/0
    &q=plants and not fungi
    &st=t

Depending on the software that goes between the URL and the sender, it may or may not be necessary to escape characters outside the small subset of US-ASCII that is permissible in URLs.
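Such escaping is conveniently left to a library; this sketch assembles the example query with Python's standard urlencode (urllib.parse in current Python versions; the parameter values are those from the example above):

```python
from urllib.parse import urlencode

base   = 'http://127.0.0.1:8080/biocase/configtool/ratingsweb/index.cgi'
params = {
    'fs': 'http://www.tdwg.org/schemas/abcd/1.2',
    'ts': 'http://www.biocase.org/schemas/bion/0',
    'q':  'plants and not fungi',
    'st': 't',
    }
url = base + '?' + urlencode(params)    # percent-escapes values as needed
print(url)
```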

Since namespaces are automatically converted to graphnames as defined in {$cfg/.configuration}​.namespacesByPrefix, the above query is equivalent to the shorter

http://127.0.0.1:8080/biocase/configtool/ratingsweb/index.cgi?
    fs=abcd12
    &ts=bion
    &q=plants and not fungi
    &st=t

Addressing the Concept Retrieval Interface from a Web Page

As shown above, one of the parameters for the CRI web interface query URL, te, may be the name of a JavaScript callable that accepts two parameters, namely the name of the graph and the number of the selected node (the term number). The function will be called when a client window has opened the CRI as a popup window, the user has conducted a search, and then clicks on one of the search results. This is demonstrated in $bc/www/configtool/CriWebInterfaceDemo.html, a fairly minimal page that includes all the basic functionality. This page includes two JavaScript functions and an HTML form. On the form, users get a chance to enter a graphname or a namespace URL they want to search; the submit button will open the CRI interface. Selecting a search result there will add results to the form's <textarea/> element:

<form name='myform' action='javascript:search();' method='get'>
    Name or Namespace URL of schema to search:<br />
    <input type='text' size='80' name='graphname' value='abcd12' />
    <input type='submit' /><br />
    <br />
    Results:<br />
    <textarea name='mytextinput' rows='25' cols='80'></textarea><br />
    </form>

The JavaScript function that is responsible for receiving and processing results looks like this:

function receiveConcept(graphname,termnr){
    /*  Receive results from the CRI popup window, construct a string from them, and append
        them to the form element value.
    */
    document.myform.mytextinput.value +=
        ':' + graphname + ':' + termnr + ' ';
};

In case you prefer namespace URLs to be returned rather than graphnames, be sure to include the parameter rf=n in your query URL (see above for details). This is what the JavaScript function to open a suitable popup window for the CRI looks like in the sample page:

function search() {
    /*  Formulate an appropriate query URL and open the CRI popup.
    */
    graphname       = document.myform.graphname.value;
    graphname       = escape( graphname );
    popupWindowUrl  = './ratingsweb/index.cgi?te=receiveConcept&fs=' + graphname + '&st=l';
    popupWindowName = 'biocaseratingswebmain';
    popupWindowSpec = ''
        +   'dependent=no,'
        +   'innerWidth=800,'
        +   'innerHeight=500,'
        +   'width=800,'
        +   'height=500,'
        +   'top=10,'
        +   'left=10,'
        +   'scrollbars=yes,'
        +   'resizable=yes,'
        +   'status=no,'
        +   'toolbar=no';
    window.open( popupWindowUrl, popupWindowName, popupWindowSpec )
    }

References

The Make* Scripts

In $bc/tools, there are the following standalone scripts that may be run directly from the command line:

  • makeGrove.py, the principal script that should be used to rebuild the entire grove database. It is a wrapper around $u/makeDatabases/​makeGrove.py.
  • makeDocumentation.py, to rebuild the documentation only.
  • makeRatings.py, to rebuild data needed by the topical search mode of the CRI, and also shown on demand in search results.
  • makeLexicalIndices.py, to rebuild data needed by the lexico-statistical search mode of the CRI. This process depends on data added by makeDocumentation.py.

The behavior of $bc/tools/makeGrove.py may be tuned by setting some parameter definitions near the top of $u/makeDatabases/makeGrove.py (starting around line 40). The standard settings are:

#   TESTING                                 = True
POPULATE                                = True
ADDRATINGS                              = True
#   SHOWPROPERTIES                          = True
#   SHOWRATINGS                             = True
#   SHOWSOMEGROVEFACTS                      = True
PERSISTTRACES                           = True
ADDDOCUMENTATION                        = True
BUILDLEXICALINDICES                     = True
#   VERBOSE                                 = True

All settings that are commented out are considered False. Of these settings, the diagnostic PERSISTTRACES may safely be set to False to reduce the size of the database.

Sources used to discover graph structure and grove content

When {$u/makeDatabases/makeGrove.py}.populate(ds) is run to feed datasource ds with graphs and nodes, it calls {$u/PsfCmfIdioms.py}.CmfIdioms.iterPrefixLocatorAndNamespacesForAllSchemas() to discover sources for graphs. This method, in turn, uses ~.iterAllLikelyCmfLocators2(), which iterates over all file locators that look like CMF files in the directory defined in {$cfg/.configuration}.rawCmfTemplateLocator. In order to associate the namespace URLs defined in the CMFs with graph names, the associations defined in {$cfg/.configuration}.namespacesByPrefix are honored (see the section on restrictions on graphnames for what to observe when planning to extend this mapping). Additionally, the method will iterate over the files mentioned in ~.additionalGraphsourcesByPrefix. In case ratings for the topical search mode of the CRI are also being built, all files detailed in Source Files for Assessments will also be iterated over.

The method iterPrefixLocatorAndNamespacesForAllSchemas() has a number of options that define what to do in case an invalid XML file is encountered, and whether to accept CMFs that have no namespace defined or no graph name associated with the namespace. The default behavior is to skip invalid XML files but issue a message, to skip CMFs with no namespace defined, and to register graphs from a CMF that has a namespace but no defined name under a name constructed from the namespace (this has mainly been done to ensure every graph gets a unique name in the grove; usage of the namespace URL is preferred in those cases).

The net effect is that (1) the graph structures of all the CMFs living in, by default, $cfg/templates/cmf will be added to the grove database, under the names defined in the .configuration, and furthermore (2) the graph of conceptual terms defined in $cfg/configtool/bion.flow will also be added.

The Metakit Grove Database

The central datasource for all information concerning graph structure as well as semantic and lexico-statistical knowledge about graph nodes is kept in a Metakit database. It consists of a single file located at $persistence/grove/Grove.mkdb and may be accessed via the API defined in Grove.py and MkDatasource.py (both in $u/feds/DatasourceClasses). An outline of its structure follows:

Table of Graphs

graph
name              format   note
familyname        string   the name of the family of schemas
length            integer  the number of nodes in the graph
localgraphsource  string   local file providing the schema description
name              string   unique graph name
namespaceurl      string   the unique namespace URL
version           string   the version string of this schema

Table of Nodes

node
name                  format   note
gauge                 string   name of referenced graph
locator               string   locator of node; always starts with /
name                  string   name of node; XML attribute names start with @
level                 integer  level of node in graph == number of slashes in locator
termnr                integer  number of node when locators are sorted alphabetically; root is always term number zero
parentTermnr          integer  term number of parent node; root has -1 here
isLeaf                integer  true if node has no descendants or all descendants are attributes
lastDescendantTermnr  integer  term number of the last node that is a descendant of the present node, excluding self; -1 indicates no descendants
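To illustrate the numbering scheme, this hypothetical helper derives some of the columns above (termnr, parentTermnr, level, name) from a bare list of locators; it is a sketch, not the actual build code:

```python
def node_facts(locators):
    """Derive per-node facts from locators, numbering nodes in
    alphabetical order so the root is always term number zero."""
    rows     = []
    locators = sorted(locators)
    index    = {loc: nr for nr, loc in enumerate(locators)}
    for termnr, loc in enumerate(locators):
        parent = loc.rsplit('/', 1)[0]
        rows.append({
            'locator':      loc,
            'name':         loc.rsplit('/', 1)[1],
            'level':        loc.count('/'),
            'termnr':       termnr,
            'parentTermnr': index.get(parent, -1),   # root has no parent
            })
    return rows

for row in node_facts(['/schema', '/schema/units', '/schema/units/@id']):
    print(row['termnr'], row['parentTermnr'], row['locator'])
# 0 -1 /schema
# 1 0 /schema/units
# 2 1 /schema/units/@id
```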

Table of Node Properties

property
name     format  note
gauge    string  name of referenced graph
locator  string  term number or locator of referenced node
verb     string  name of relationship, e.g. 'ancestors'
value    string  pickle of a tuple or list of term numbers

Table of Documentation Strings

documentation
name       format   note
graphname  string   name of referenced graph
termnr     integer  number of referenced node
short      string   short documentation text
long       string   long documentation text

Table of Ratings

rating
name            format   note
topicGraphname  string   name of the describing graph (normally, 'bion')
topicTermnr     integer  number of describing term
focusGraphname  string   name of the graph described
value           string   pickle of a sparse vector (i.e. a dictionary) with the effective ratings, indexed by focus term numbers

Table of Rating Traces

ratingtrace
name            format   note
topicGraphname  string   as in table 'rating'
topicTermnr     integer  as in table 'rating'
focusGraphname  string   as in table 'rating'
focusTermnr     integer  as in table 'rating'
value           string   pickle of a list of term reference pairs identifying the nodes and ratings responsible for the resulting rating of the referenced node; the sum of all contributions is the value recorded in 'rating.value'
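The pickled sparse vectors mentioned in the two tables above can be pictured as plain dictionaries; a small sketch (with made-up numbers):

```python
import pickle

# A sparse vector of effective ratings, indexed by focus term numbers,
# as stored (pickled) in the 'rating.value' column.
ratings = {17: 1.5, 42: -0.75}
blob    = pickle.dumps(ratings)         # the bytes kept in the database
assert pickle.loads(blob) == ratings

# Per the 'ratingtrace' table, each effective rating is the sum of the
# individual contributions recorded in the corresponding trace:
contributions = [1.0, 0.5]              # hypothetical trace entries
assert sum(contributions) == pickle.loads(blob)[17]
```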

Source Files for Assessments

All source files for assessments with data for the grove database should have the file extension *.flow and be put into $persistence/​configtool/​assessments. The format of these files is fairly simple, as exemplified here with an abridged snippet from assessments.mdb.xls.txt.flow:

:bion:/bion/collection/organism-group/plants
    :abcd12:/DataSets/DataSet/Units/Unit/Identifications#1
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/MycologicalUnit#-2
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/ZoologicalUnit#-2

:bion:/bion/collection/organism-group/fungi
    :abcd12:/DataSets/DataSet/Units/Unit/Identifications#1
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/BotanicalGardenUnit#-2
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/MycologicalUnit#2
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/ZoologicalUnit#-2

These lines specify, for example, that for the topic :bion:/bion/collection/organism-group/plants, the focus :abcd12:/DataSets/DataSet/Units/Unit/Identifications is rated 1 on a scale ranging from -2 for 'irrelevant' over 0 for 'indifferent' to +2 for 'highly relevant', so we may conclude that ..Unit/Identifications is 'somewhat interesting' when talking about 'plants' in the sense of the topic ontology. Likewise, :abcd12:.../ZoologicalUnit has been strongly disfavored, while the neighbouring MycologicalUnit has been strongly boosted for :bion:.../fungi.
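A minimal reader for this format might look as follows (an illustration only, not the actual CRI parser):

```python
def parse_assessments(text):
    """Collect (topic, focus, rating) triples: unindented lines name the
    topic; indented lines carry a focus locator and a '#'-separated
    rating on the -2 .. +2 scale."""
    triples = []
    topic   = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[:1].isspace():
            topic = line.strip()
        else:
            focus, _, rating = line.strip().rpartition('#')
            triples.append((topic, focus, int(rating)))
    return triples

sample = """:bion:/bion/collection/organism-group/fungi
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/MycologicalUnit#2
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/ZoologicalUnit#-2
"""
for topic, focus, rating in parse_assessments(sample):
    print(rating, focus)
```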

Lupy, the Lexico-Statistical Database

Divmod Lupy is a Python implementation of the Lucene search engine. It has been integrated into the project by way of a thin wrapper living in $u/lupywrapper/search.py to facilitate searching, influence query building, and customize search results. The routines in this module are also used by the Lupy Web Interface.

The Lupy database itself is located in $persistence/lupyindex. This directory holds several subdirectories, each bearing the name of a graph in the grove (note that this design choice imposes a certain restriction on legal graphnames). The files in these directories are generated by executing either makeGrove.py or makeLexicalIndices.py in the $bc/tools directory; see the sections on building the grove database and the Make* scripts for details.

Restrictions on Graph Names For Namespace URLs in the Configuration

In $bc/configuration/​.configuration, the item namespacesByPrefix represents a mapping of graph prefixes to their namespaces. Since this mapping defines, among other things, the names of directories of the Lupy lexico-statistical database, it is important to choose names that are legal filenames in most OSs; it is safest to restrict oneself to letters, numbers, underscores, periods, and hyphens here.
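Such a restriction could be enforced with a check along these lines (a hypothetical helper, not part of the configuration code):

```python
import re

_SAFE_GRAPHNAME = re.compile(r'^[A-Za-z0-9_.-]+$')

def is_safe_graphname(name):
    """True if the name uses only letters, digits, underscores, periods,
    and hyphens, and is therefore a safe directory name on most OSs."""
    return bool(_SAFE_GRAPHNAME.match(name))

print(is_safe_graphname('abcd12'))                                # True
print(is_safe_graphname('http://www.biocase.org/schemas/bion/0')) # False
```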

How to Run Scripts in the CRI Context

Method A: This code shows how to prepare a Python source file to enable it for execution from the command line; it basically checks for the existence of the identifier g and executes the file $bc/lib/​biocase/​fundamentals.py (to which a relative path is calculated using a somewhat concocted, but OS-independent incantation); this file contains code that loads g:

#
# -*- coding: utf-8 -*-

#-----------------------------------------------------------------------------------------------------------
#   Enable globals:
try:
    g
except NameError:
    import os.path as _p; import inspect as _i
    pathToScript = _p.pardir, _p.pardir, 'biocase', 'lib', 'biocase', 'fundamentals.py'
    execfile( _p.join( _p.dirname( _i.getfile( lambda:None ) ), *pathToScript ) )

g.configtool.utilities.makeDatabases.makeDocumentation.make( verbose = False )

If you want to simplify your life and know what you are doing, simply put a line like execfile('c:/foo/bar/biocase/lib/biocase/fundamentals.py') into your file prior to using g.

Method B: Do not run your script directly with the Python interpreter, but pass the filename as an argument to $bc/tools/​environmenteliza2.py, like so:

python /biocase/tools/environmenteliza2.py /biocase/tools/myscript.py

Like fundamentals.py, environmenteliza2.py will make sure that g is made available to the script.