Author: | Wolfgang Lipp |
---|---|
Contact: | w dot lipp at bgbm dot org |
Institution: | Botanical Garden and Botanic Museum (BGBM), Berlin (http://bgbm.org) |
Version: | 0.0.1 |
Copyright: | This document has been placed in the public domain. |
Date: | 2005-02-20 |
This document describes how to accomplish the most important tasks around managing, updating and using the Concept Retrieval Interface (CRI) for the ABCD Configuration Assistant.
In file locators, the following symbolic references are used:
$bc | the BioCASe home directory |
$cfg | $bc/configuration |
$persistence | $cfg/persistence |
$u | $bc/lib/biocase/configtool/utilities |
Furthermore, we splice file locators into object-oriented names using braces; for example: {$u/makeDatabases/makeGrove.py}.populate(ds) points to the object (a module-level function in this case) named populate that lives in file $u/makeDatabases/makeGrove.py.
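As an illustration of this notation, the following minimal sketch expands a spliced name into a file path and an object name. The install path `/opt/biocase` and the helper `resolve` are assumptions made for the example; they are not part of BioCASe.

```python
import re

# Symbolic references as listed above; the value chosen for $bc is a
# hypothetical install path, used for illustration only.
SYMBOLS = {"bc": "/opt/biocase"}
SYMBOLS["cfg"] = SYMBOLS["bc"] + "/configuration"
SYMBOLS["persistence"] = SYMBOLS["cfg"] + "/persistence"
SYMBOLS["u"] = SYMBOLS["bc"] + "/lib/biocase/configtool/utilities"

# '{$symbol/rest/of/path}.objectname(...)'
_SPLICED = re.compile(r"^\{\$(\w+)(/[^}]*)?\}\.?(.*)$")

def resolve(spliced):
    """Split '{$sym/path}.name(...)' into (file path, object name)."""
    match = _SPLICED.match(spliced)
    if match is None:
        raise ValueError("not a spliced locator: %r" % spliced)
    symbol, rest, objectname = match.groups()
    path = SYMBOLS[symbol] + (rest or "")
    # Drop a trailing call signature such as '(ds)' from the object name.
    objectname = objectname.split("(", 1)[0]
    return path, objectname

print(resolve("{$u/makeDatabases/makeGrove.py}.populate(ds)"))
```

The tilde shorthand (`~.name`) used later in this document is not handled by this sketch.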
Tasks to be discussed in this section:
The central database that holds all data on graphs registered with the CRI, dsGrove, must be rebuilt every time a graph is to be added or deleted from it. At present, there is no recommended way to prune or graft items from or into the grove. The script that performs the database reconstruction process is $bc/tools/makeGrove.py (see below for details).
As detailed in the section on grove sourcefiles, all valid CMFs dropped into $cfg/templates/cmf will be added to the grove without further configuration or registration; an association of the namespace URL with a short graph name in {$cfg/.configuration}.namespacesByPrefix is optional. -- Graphs that adhere to the simple *.flow format used for the :bion: graph of conceptual terms must, however, still be added to ~.additionalGraphsourcesByPrefix and ~.additionalNamespacesByPrefix.
It may sometimes become desirable to modify :bion:, the ontology of terms used in the topical search mode of the CRI. This entails changes at two places: (1) the ontology itself must be changed; (2) topical assessments made under the assumption of the previous version must be modified so the concept locators used there still point to valid concepts.
The structure of the *.flow notation is very straightforward -- entries are listed on separate lines, unguarded texts indicate conceptual terms, and indentation is used to symbolize parent/child relationships. Additionally, the following symbols for the secondary (supplementary, logical) structure are used:
Symbol | Meaning |
---|---|
<=> | is equivalent to |
<= | is subsequent of |
=> | is antecedent of |
As an example, consider the following snippet:
    collection
        botanical-garden
        herbarium
        zoological-garden
            <= /bion/organism-group/animals
This structure defines, inter alia, that zoological-garden and herbarium are both child nodes of collection, and that the concept identified as /bion/organism-group/animals is a logical antecedent of zoological-garden. Because references to nodes must at present quote absolute locators, they may have to be updated whenever nodes are moved around in the notation.
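To make the relationship between indentation and locators concrete, here is a minimal sketch of a reader for such a notation. It is not the parser used by makeGrove.py. It assumes that child terms are indented with spaces below their parent, that a line starting with one of the three symbols qualifies the term listed just before it, and that all locators are rooted at `/bion`; the function name `parse_flow` is hypothetical.

```python
RELATIONS = ("<=>", "<=", "=>")  # longest symbol first, so '<=>' is not read as '<='

SAMPLE = """\
collection
    botanical-garden
    herbarium
    zoological-garden
        <= /bion/organism-group/animals
"""

def parse_flow(text, root="/bion"):
    """Return (locators, relations) for an indentation-structured term list."""
    locators = []
    relations = []   # (locator, symbol, referenced locator)
    stack = []       # (indent, locator) of the currently open ancestors
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        entry = raw.strip()
        symbol = next((s for s in RELATIONS if entry.startswith(s)), None)
        if symbol is not None:
            # Assumption: a relation line qualifies the most recently listed term.
            relations.append((locators[-1], symbol, entry[len(symbol):].strip()))
            continue
        # Close all ancestors that are not shallower than the current entry.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else root
        locator = parent + "/" + entry
        locators.append(locator)
        stack.append((indent, locator))
    return locators, relations

locators, relations = parse_flow(SAMPLE)
for locator in locators:
    print(locator)
print(relations)
```

Since every reference carries an absolute locator, moving `zoological-garden` elsewhere in such a file changes its own locator and invalidates any reference that quotes the old one.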
When the graph of conceptual terms has been updated, the assessment source files must be checked for inconsistencies. If a concept has been moved to a location that is semantically and topologically similar to the original one, then it should suffice to replace the old concept locators by new ones; if, however, the move is thought to entail a semantic change in the concept, or if the topological context of a term is significantly modified, it will also be necessary to modify the values given for the topic/focus pairs concerned. This may or may not entail the modification of assessments made for neighboring concepts.
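For the simple case, where only locators need replacing, a small helper along the following lines may be enough. This is a sketch under assumptions: it treats each non-blank line as `:graphname:locator` with an optional `#rating` suffix, and the function `relocate` is hypothetical, not part of the CRI tools.

```python
def relocate(lines, graphname, old, new):
    """Replace locator `old` (and everything below it) by `new` in one graph."""
    result = []
    for line in lines:
        entry = line.lstrip()
        if not entry:
            result.append(line)
            continue
        indent = line[:len(line) - len(entry)]
        head, sep, tail = entry.partition("#")    # tail holds the rating, if any
        graph, locator = head[1:].split(":", 1)   # ':bion:/bion/x' -> 'bion', '/bion/x'
        if graph == graphname and (locator == old or locator.startswith(old + "/")):
            locator = new + locator[len(old):]
        result.append("%s:%s:%s%s%s" % (indent, graph, locator, sep, tail))
    return result

lines = [
    ":bion:/bion/collection/organism-group/plants",
    "    :abcd12:/DataSets/DataSet/Units/Unit/Identifications#1",
]
for line in relocate(lines, "bion", "/bion/collection/organism-group", "/bion/organism-group"):
    print(line)
```

A mechanical replacement like this covers only moves that leave the meaning of a concept intact; ratings still have to be reviewed by hand when the semantics change.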
As part of the functionality of the makeGrove.py script, documentation is also gathered from the XML source files and prepared for the lexico-statistical interface. It is possible, though not recommended, to update this part of the database by running $bc/tools/makeDocumentation.py, or to run $u/makeDatabases/makeGrove.py with the appropriate parameters suitably set.
This snippet shows how to (1) retrieve dsGrove, the object representing the grove database; (2) retrieve the table that holds summary data on each graph and (3) iterate over all or over a selection of nodes from each graph. The example assumes that the script adheres to the remarks made about running scripts within the CRI context; under these conditions, g is a variable that provides direct access to modules without the need of import statements. The code is also available in $bc/tools/demoGroveUsage.py.
    #===========================================================================
    import sys
    import imp
    import os.path as _ospath
    u = g.configtool.utilities
    _feds = u.feds
    inter = u.kpyInterpol.inter

    #===========================================================================
    if __name__ == '__main__':

        #=======================================================================
        # (1) retrieve grove datasource
        #=======================================================================
        groveFed = _feds.resolve( 'Grove' )
        dsGrove = groveFed.dsGrove
        graphnames = list( dsGrove.iterGraphnames() )
        graphnames.sort()
        graphnames = '\n\t'.join( graphnames )
        print inter( ''
            + "The grove contains graphs with the following names:\n\t$graphnames" )

        #=======================================================================
        # (2) iterate over graph facts
        #=======================================================================
        for graph in dsGrove.getTable( 'graph' ):
            print
            # print graph
            print inter( ''
                + "*\tGraph :$graph.name: represents\n"
                + "\tnamespace '$graph.namespaceurl'\n"
                + "\tand has $graph.length nodes. It has been constructed from\n"
                + "\t'$graph.localgraphsource'." )

            #===================================================================
            # (3) iterate over node facts
            #===================================================================
            if graph.name == 'abcd12':
                nodename = '@PreferredFlag'
                nodes = dsGrove.catchTable(
                    node = dict(                # select * from node where
                        gauge = graph.name,     #   graph = 'abcd12' and
                        name = nodename,        #   name = '@PreferredFlag';
                        ) )
                print inter( ''
                    + "\n\tThere are $len(nodes) nodes called '$@PreferredFlag' in graph :$graph.name:;\n"
                    + "\tthey are located below the following nodes:" )
                nodetable = dsGrove.getTable( 'nodetable' )
                for node in nodes:
                    termref = ( graph.name, node.parentTermnr )
                    parentnode = dsGrove.nodefactForTermref( termref )
                    print inter( '\t\t$parentnode.locator' )
When working with graph-related data, it is important both to identify a graph uniquely when addressing it and to mark up data in a readable fashion. In both cases, it is sometimes advantageous to use short names, and sometimes preferable or necessary to use namespace URLs. While the web interface allows mixing both kinds of input, much of the grove API does not; it then becomes necessary to perform conversions before using the API. The responsible method is {$u/feds/DatasourceClasses/Grove.py}.getGraphnameAndNamespace(), which accepts both namespace URLs and graphnames and will always try to return a (graphname,namespace) pair.
This piece of code demonstrates how to use this method to unify graph identification data (observe the remarks on running scripts):
    dsGrove = g.configtool.utilities.feds.resolve( 'Grove' ).dsGrove
    for either in (
        'bion',
        'http://www.biocase.org/schemas/metaprofile/1.2',
        'http://www.biocase.org/schemas/bion/0',
        'http://www.tdwg.org/schemas/abcd/1.2',
        'abcd12', ):
        graphname, namespace = dsGrove.getGraphnameAndNamespace( either )
        print 'Namespace URL:', namespace
        print 'Filed under :', graphname
In order to prepare a list of namespace URLs for use with a method that expects graphnames, use an incantation along the lines of
    namespaces = [
        'http://www.biocase.org/schemas/metaprofile/1.2',
        'http://www.biocase.org/schemas/bion/0', ]
    graphnames = [
        dsGrove.getGraphnameAndNamespace( namespace )[ 0 ]
        for namespace in namespaces ]
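For readers without a BioCASe installation at hand, the idea behind this normalization can be sketched with a plain dictionary. This is a stand-in, not the implementation in Grove.py: the mapping below is a hypothetical excerpt (the pairings are those used in the query URL examples of this document), and `graphname_and_namespace` is an illustrative name.

```python
# Hypothetical excerpt of a graph name -> namespace URL mapping.
NAMESPACES_BY_PREFIX = {
    "bion": "http://www.biocase.org/schemas/bion/0",
    "abcd12": "http://www.tdwg.org/schemas/abcd/1.2",
}
PREFIXES_BY_NAMESPACE = {url: name for name, url in NAMESPACES_BY_PREFIX.items()}

def graphname_and_namespace(either):
    """Accept a graph name or a namespace URL; return (graphname, namespace)."""
    if either in NAMESPACES_BY_PREFIX:
        return either, NAMESPACES_BY_PREFIX[either]
    if either in PREFIXES_BY_NAMESPACE:
        return PREFIXES_BY_NAMESPACE[either], either
    raise KeyError("unknown graph name or namespace URL: %r" % either)

for either in ("bion", "http://www.tdwg.org/schemas/abcd/1.2"):
    print(graphname_and_namespace(either))
```

Keeping both directions of the mapping in memory makes each lookup a single dictionary access, whichever form the caller supplies.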
The script $bc/tools/searchTopicalRatings.py demonstrates how to use the Ratings API to conduct topical searches against the ratings for focus/topic pairs derived by $u/makeDatabases/makeRatings.py (cf. section on Make* Scripts). The central lines are quite simple (see Running Scripts for the meaning of g):
    _Ratingscalculator = g.configtool.utilities.tempratingsweb.Ratingscalculator.Ratingscalculator
    rc = _Ratingscalculator( query, topicGraphname, focusGraphname )
    ratings = rc.getRatings()
    #---------------------------------------------------------------------------
    for rating, focusTermnr in ratings:
        print rating, focusTermnr
You can use the Grove Database e.g. to convert between term numbers and schema node locators (this is demonstrated in $bc/tools/searchTopicalRatings.py); read Normalizing Namespaces and Graphnames for how to convert between graph names and namespace URLs.
The script $bc/tools/searchWithLupy.py demonstrates how to use the Lupy Wrapper API to conduct lexical searches against the documentation[1] that was extracted from the CMFs. The underlying functionality demonstrated in this script is formulated in {$u/lupywrapper/search.py}.search(phrase,*graphnames); this module-level function accepts a string containing whitespace-separated query terms and a possibly empty iterable of graphnames[2]; it returns a triplet (normalphrase,graphnames,resulttriplets) where normalphrase is a string containing the normalized terms of the search phrase, graphnames is a set of the names of the graphs that were searched, and resulttriplets is a list of (rating,graphname,termnr) triplets sorted in reverse order. Sometimes life is easier using {$u/lupywrapper/search.py}.nodefactsForSearch() instead, which returns (normalphrase,graphnames,resultpairs) triplets where resultpairs is a list of (rating,nodefact) pairs.
As in the web interface, the mentioned API functions will honor some special syntax for the search phrase: A + (plus) prepended to a term will restrict results to those where the term actually appears; a - (minus) in front of a term excludes all results that show that term; a term encased in colons, like :abcd12:, specifies a graph to search. Multiple graphnames may be given, and in case no graphnames are given, the entire grove will be searched.
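The special syntax can be illustrated with a small classifier for the terms of a search phrase. This is only a sketch of the rules as described here, not the query builder in search.py; the function `parse_phrase` and the names of the result keys are hypothetical.

```python
def parse_phrase(phrase):
    """Sort whitespace-separated terms into required, excluded, plain and graph names."""
    parsed = {"required": [], "excluded": [], "plain": [], "graphnames": []}
    for term in phrase.split():
        if len(term) > 2 and term.startswith(":") and term.endswith(":"):
            parsed["graphnames"].append(term[1:-1])   # ':abcd12:' selects a graph
        elif len(term) > 1 and term.startswith("+"):
            parsed["required"].append(term[1:])       # '+term' must appear
        elif len(term) > 1 and term.startswith("-"):
            parsed["excluded"].append(term[1:])       # '-term' must not appear
        else:
            parsed["plain"].append(term)
    return parsed

print(parse_phrase("+garden -zoo herbarium :abcd12:"))
```

An empty `graphnames` list would correspond to searching the entire grove.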
[1] | In fact, lexical searches cover not only the documentation present for a particular node, but also the documentation of ancestral nodes and the parsed locator of each element. For example, if a node /schema/geosciences/geographyAndDemography/europe is undocumented, but /schema/geosciences and /schema/geosciences/geographyAndDemography both do have annotations in their CMFs, then the terms used in the documentation of those ancestors are added for the node in question, as are the terms from the node locator, i.e. schema, geoscience, geography, demography, and europe. |
[2] | read Normalizing Namespaces and Graphnames in order to learn how to convert namespace URLs to be used with this function |
This code shows how to use the Lupy wrapper. Instead of combining search() with dsGrove.nodefactForTermref(termref), we could also have called .nodefactsForSearch():
    u = g.configtool.utilities
    _feds = u.feds
    _inter = u.kpyInterpol.inter
    _search = u.lupywrapper.search.search
    _wrap = u.typo.wrap

    #===========================================================================
    def searchAndTell( phrase ):
        #-----------------------------------------------------------------------
        groveFed = _feds.resolve( 'Grove' )
        dsGrove = groveFed.dsGrove
        #-----------------------------------------------------------------------
        normalphrase, graphnames, results = _search( phrase )
        graphnames = list( graphnames )
        graphnames.sort()
        graphnames = ', '.join( graphnames )
        #-----------------------------------------------------------------------
        print _inter( ''
            + "You searched for '$normalphrase'\n"
            + "in $graphnames. The query returned $len(results) results:" )
        for result in results:
            rating, graphname, termnr = result
            rating = '%5.3f' % rating
            termref = ( graphname, termnr )
            nodefact = dsGrove.nodefactForTermref( termref )
            short, long = dsGrove.getDocumentationForTermref( termref )
            print
            print _inter( '($rating) $nodefact.ontonym' )
            if short: print _wrap( short )
Using the CGI script $bc/www/configtool/ratingsweb/index.cgi, it is possible to fetch results of topical and lexical searches via the CRI web interface. This script ultimately calls {$u/tempratingsweb}.MyServlet.respond_default() and accepts a number of URL query string parameters; these are
param. | name | default | purpose |
---|---|---|---|
q | Query | '' | Specifies the search phrase, e.g. 'zoo* or botanical-garden' |
ts | Topic Schema | 'bion' | Specifies topic schema (from where the search terms are taken) |
te | Target Element | '' | Name of a JavaScript function in the opening window that may be called with the graphname and the termnr as arguments; this function is responsible for further processing of the result. See Addressing the Interface from a Web Page for more. |
dl | Display Locators | 'n' | Whether ('y') or not ('n') to display concept locators. |
dd | Display Docu | 'n' | Display no ('n') documentation, or short ('y') or extended ('x') documentation for results. |
st | Search Type | 't' | Whether to conduct a topical ('t') or a lexico-statistical ('l') search. |
rf | Result Format | 'g' | Whether to return graphnames ('g') or namespace URLs ('n'). |
Assuming the server is listening on port 8080 of localhost, this example shows how to construct a simple query URL (with parameters displayed one by one):
    http://127.0.0.1:8080/biocase/configtool/ratingsweb/index.cgi?
        fs=http://www.tdwg.org/schemas/abcd/1.2
        &ts=http://www.biocase.org/schemas/bion/0
        &q=plants and not fungi
        &st=t
Depending on the software that goes between the URL and the sender, it may or may not be necessary to escape characters outside the small subset of US-ASCII that is permissible in URLs.
Since namespaces are automatically converted to graphnames as defined in {$cfg/.configuration}.namespacesByPrefix, the above query is equivalent to the shorter
    http://127.0.0.1:8080/biocase/configtool/ratingsweb/index.cgi?
        fs=abcd12
        &ts=bion
        &q=plants and not fungi
        &st=t
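Where escaping is required, the query string can be assembled programmatically. The following sketch uses only the Python 3 standard library; the host, port and path are those of the example above, and the helper name `query_url` is an assumption.

```python
from urllib.parse import quote, urlencode

BASE = "http://127.0.0.1:8080/biocase/configtool/ratingsweb/index.cgi"

def query_url(**params):
    """Build a CRI query URL, percent-escaping every parameter value."""
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return BASE + "?" + urlencode(params, quote_via=quote)

url = query_url(fs="abcd12", ts="bion", q="plants and not fungi", st="t")
print(url)
```

Passing namespace URLs instead of graph names works the same way; the colons and slashes they contain are then escaped as well.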
As shown above, one of the parameters for the CRI web interface query URL, te, may be the name of a JavaScript callable that accepts two parameters, namely the name of the graph and the number of the selected node (the term number). The function will be called when a client window has opened the CRI as a popup window, the user has conducted a search, and then clicks on one of the search results. This is demonstrated in $bc/www/configtool/CriWebInterfaceDemo.html, a fairly minimal page that includes all the basic functionality. This page includes two JavaScript functions and an HTML form. On the form, users can enter a graphname or a namespace URL they want to search; the submit button will open the CRI interface. Selecting a search result there will add results to the form's <textarea/> element:
    <form name='myform' action='javascript:search();' method='get'>
      Name or Namespace URL of schema to search:<br />
      <input type='text' size='80' name='graphname' value='abcd12'></input>
      <input type='submit'></input><br />
      <br />
      Results:<br />
      <textarea name='mytextinput' rows='25' cols='80'></textarea><br />
    </form>
The JavaScript function that is responsible for receiving and processing results looks like this:
    function receiveConcept(graphname,termnr){
      /* Receive results from the CRI popup window, construct a string from them, and append them to the
      form element value. */
      document.myform.mytextinput.value += ':' + graphname + ':' + termnr + ' ';
      };
In case you prefer namespace URLs to be returned rather than graphnames, be sure to include the parameter rf=n in your query URL (see above for details). This is what the JavaScript function to open a suitable popup window for the CRI looks like in the sample page:
    function search() {
      /* Formulate an appropriate query URL and open the CRI popup. */
      graphname = document.myform.graphname.value;
      graphname = escape( graphname );
      popupWindowUrl = './ratingsweb/index.cgi?te=receiveConcept&fs=' + graphname + '&st=l';
      popupWindowName = 'biocaseratingswebmain';
      popupWindowSpec = ''
        + 'dependent=no,'
        + 'innerWidth=800,'
        + 'innerHeight=500,'
        + 'width=800,'
        + 'height=500,'
        + 'top=10,'
        + 'left=10,'
        + 'scrollbars=yes,'
        + 'resizable=yes,'
        + 'status=no,'
        + 'toolbar=no';
      window.open( popupWindowUrl, popupWindowName, popupWindowSpec )
      }
In $bc/tools, there are the following standalone scripts that may be run directly from the command line:
The behavior of $bc/tools/makeGrove.py may be tuned by setting some parameter definitions near the top of $u/makeDatabases/makeGrove.py (starting around line 40). The standard settings are:
    # TESTING = True
    POPULATE = True
    ADDRATINGS = True
    # SHOWPROPERTIES = True
    # SHOWRATINGS = True
    # SHOWSOMEGROVEFACTS = True
    PERSISTTRACES = True
    ADDDOCUMENTATION = True
    BUILDLEXICALINDICES = True
    # VERBOSE = True
All settings that are commented out are considered False. Of these settings, the diagnostic PERSISTTRACES may be safely set to False to reduce the size of the database.
When {$u/makeDatabases/makeGrove.py}.populate(ds) is run to feed datasource ds with graphs and nodes, it calls {$u/PsfCmfIdioms.py}.CmfIdioms.iterPrefixLocatorAndNamespacesForAllSchemas() to discover sources for graphs. This method, in turn, uses ~.iterAllLikelyCmfLocators2(), which iterates over all file locators that look like CMF files in the directory defined in {$cfg/.configuration}.rawCmfTemplateLocator. To associate the namespace URLs defined in the CMFs with graph names, the method honors the associations defined in {$cfg/.configuration}.namespacesByPrefix (read about restrictions on graphnames before extending this mapping). Additionally, the method will iterate over the files mentioned in ~.additionalGraphsourcesByPrefix. In case ratings for the topical search mode of the CRI are also being built, all files detailed in Source Files for Assessments will also be iterated over.
The method iterPrefixLocatorAndNamespacesForAllSchemas() has a number of options that define what to do in case an invalid XML file is encountered, and whether to accept CMFs that have no namespace defined or no graph name associated with the namespace. The default setting is to skip invalid XML files while issuing a message, to skip CMFs with no namespace defined, and to give graphs from a CMF that has a namespace but no associated name a name constructed from the namespace (this is mainly done to ensure that every graph gets a unique name in the grove; use of the namespace URL is preferred in those cases).
The net effect is that (1) the graph structures of all the CMFs living in, by default, $cfg/templates/cmf will be added to the grove database, under the names defined in the .configuration, and furthermore (2) the graph of conceptual terms defined in $cfg/configtool/bion.flow will also be added.
The central datasource for all information concerning graph structure as well as semantic and lexico-statistical knowledge about graph nodes is kept in a Metakit database. It consists of a single file located at $persistence/grove/Grove.mkdb and may be accessed via the API defined in Grove.py and MkDatasource.py (both in $u/feds/DatasourceClasses). An outline of its structure follows:
graph | ||
---|---|---|
name | format | note |
familyname | string | the name of the family of schemas |
length | integer | the number of nodes in the graph |
localgraphsource | string | local file providing the schema description |
name | string | unique graph name |
namespaceurl | string | the unique namespace URL |
version | string | the version string of this schema |
node | ||
---|---|---|
name | format | note |
gauge | string | name of referenced graph |
locator | string | locator of node, always starts with / |
name | string | name of node; XML attribute names start with @ |
level | integer | level of node in graph == number of slashes in locator |
termnr | integer | number of node when locators are sorted alphabetically; root is always term number zero. |
parentTermnr | integer | term number of parent node; root has -1 here |
isLeaf | integer | true if node has no descendants or all descendants are attributes |
lastDescendantTermnr | integer | term number of the last node that is a descendant of the present node, excluding self; -1 indicates no descendants |
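The relationships between the columns of the node table can be made concrete with a short sketch that derives them from a list of locators. This is not the code used by makeGrove.py. It assumes that locators are ordered by their path segments, so that each subtree occupies a contiguous range of term numbers; the function `node_facts` is hypothetical, and the isLeaf column is left out because it depends on how attributes are represented.

```python
def node_facts(locators):
    """Derive node-table columns from a list of locators (a sketch)."""
    # Sorting by path segments keeps every subtree contiguous.
    ordered = sorted(locators, key=lambda locator: locator.split("/"))
    termnr_by_locator = {locator: nr for nr, locator in enumerate(ordered)}
    facts = []
    for termnr, locator in enumerate(ordered):
        parent, name = locator.rsplit("/", 1)
        last = -1
        # Descendants follow directly after the node; stop at the first non-descendant.
        for other in range(termnr + 1, len(ordered)):
            if not ordered[other].startswith(locator + "/"):
                break
            last = other
        facts.append({
            "locator": locator,
            "name": name,
            "level": locator.count("/"),
            "termnr": termnr,
            "parentTermnr": termnr_by_locator.get(parent, -1),  # root has -1
            "lastDescendantTermnr": last,
        })
    return facts

LOCATORS = [
    "/DataSets/DataSet/Units/Unit/UnitCollectionDomain",
    "/DataSets/DataSet/Units",
    "/DataSets",
    "/DataSets/DataSet/Units/Unit/Identifications",
    "/DataSets/DataSet",
    "/DataSets/DataSet/Units/Unit",
]
for fact in node_facts(LOCATORS):
    print(fact["termnr"], fact["parentTermnr"], fact["lastDescendantTermnr"], fact["locator"])
```

With this layout, all descendants of a node are exactly the rows whose term numbers lie between termnr + 1 and lastDescendantTermnr.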
property | ||
---|---|---|
name | format | note |
gauge | string | name of referenced graph |
locator | string | term number or locator of referenced node |
verb | string | name of relationship, e.g. 'ancestors' |
value | string | pickle of a tuple or list of term numbers. |
documentation | ||
---|---|---|
name | format | note |
graphname | string | name of referenced graph |
termnr | integer | number of referenced node |
short | string | short documentation text |
long | string | long documentation text |
rating | ||
---|---|---|
name | format | note |
topicGraphname | string | name of the describing graph (normally, 'bion') |
topicTermnr | integer | number of describing term |
focusGraphname | string | name of the graph described |
value | string | pickle of a sparse vector (i.e. a dictionary) with the effective ratings, indexed by focus term numbers. |
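As a rough illustration of such a value, the following sketch pickles a dictionary of ratings and reads it back. The term numbers and ratings are invented, and treating absent entries as 0 is an assumption made for the example, not documented behavior of the grove API.

```python
import pickle

# Hypothetical effective ratings for one topic term, keyed by focus term number.
ratings = {17: 1.0, 42: -2.0, 108: 2.0}

stored = pickle.dumps(ratings)     # what a 'value' cell could hold
restored = pickle.loads(stored)

def rating_for(vector, focus_termnr, default=0.0):
    """Look up a rating in a sparse vector; absent entries count as `default`."""
    return vector.get(focus_termnr, default)

print(rating_for(restored, 42), rating_for(restored, 5))
```

Storing only the rated term numbers keeps each row small even when the focus graph has many nodes.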
ratingtrace | ||
---|---|---|
name | format | note |
topicGraphname | string | as in table 'rating' |
topicTermnr | integer | as in table 'rating' |
focusGraphname | string | as in table 'rating' |
focusTermnr | integer | as in table 'rating' |
value | string | pickle of a list containing term reference pairs identifying the nodes and ratings responsible for the resulting rating of the referenced node. The sum of all contributions is the value recorded in 'rating.value'. |
All source files for assessments with data for the grove database should have the file extension *.flow and be put into $persistence/configtool/assessments. The format of these files is fairly simple, as exemplified here with an abridged snippet from assessments.mdb.xls.txt.flow:
    :bion:/bion/collection/organism-group/plants
        :abcd12:/DataSets/DataSet/Units/Unit/Identifications#1
        :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/MycologicalUnit#-2
        :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/ZoologicalUnit#-2
    :bion:/bion/collection/organism-group/fungi
        :abcd12:/DataSets/DataSet/Units/Unit/Identifications#1
        :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/BotanicalGardenUnit#-2
        :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/MycologicalUnit#2
        :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/ZoologicalUnit#-2
These lines specify, for example, that for the topic :bion:/bion/collection/organism-group/plants, the focus :abcd12:/DataSets/DataSet/Units/Unit/Identifications is rated with 1 on a scale ranging from -2 for 'irrelevant' over 0 for 'indifferent' to +2 for 'highly relevant', so we may conclude that ..Unit/Identifications is 'somewhat interesting' when talking about 'plants' in the sense of the topic ontology. Likewise, for :bion:.../fungi, :abcd12:.../ZoologicalUnit has been rated as irrelevant, whereas the neighbouring MycologicalUnit has been strongly boosted.
Divmod Lupy is a Python implementation of the Lucene search engine. It has been integrated into the project by way of a thin wrapper living in $u/lupywrapper/search.py to facilitate searching, influence query building, and customize search results. The routines in this module are also used by the Lupy Web Interface.
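Reading such a file can be sketched in a few lines. The sketch below is not the reader used by makeRatings.py; it assumes that a line without `#` names a topic and that each following line ending in `#<integer>` rates a focus for that topic. The function `parse_assessments` is hypothetical.

```python
SAMPLE = """\
:bion:/bion/collection/organism-group/plants
    :abcd12:/DataSets/DataSet/Units/Unit/Identifications#1
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/ZoologicalUnit#-2
:bion:/bion/collection/organism-group/fungi
    :abcd12:/DataSets/DataSet/Units/Unit/UnitCollectionDomain/MycologicalUnit#2
"""

def parse_assessments(text):
    """Yield (topic, focus, rating) triples from assessment source text."""
    topic = None
    for raw in text.splitlines():
        entry = raw.strip()
        if not entry:
            continue
        if "#" in entry:
            # Focus line: everything before the last '#' is the focus locator.
            focus, rating = entry.rsplit("#", 1)
            yield topic, focus, int(rating)
        else:
            topic = entry

for triple in parse_assessments(SAMPLE):
    print(triple)
```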
The Lupy database itself is located in $persistence/lupyindex. This directory holds several subdirectories, each bearing the name of a graph in the grove (note that this design choice effects a certain restriction on legal graphnames). The files in the directories are generated by executing either makeGrove.py or makeLexicalIndices.py in the {$bc/tools} directory; see the sections on building the grove database and the makegrove script for details.
In $bc/configuration/.configuration, the item namespacesByPrefix represents a mapping of graph prefixes to their namespaces. Since this mapping defines, among other things, the names of directories of the Lupy lexico-statistical database, it is important to choose names that are legal filenames in most OSs; it is safest to restrict oneself to letters, numbers, underscores, periods, and hyphens here.
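A candidate graph name can be checked against this conservative character set before it is added to the mapping. The following is a minimal sketch; the function `is_safe_graphname` is hypothetical and not part of the configuration tool.

```python
import re

# Letters, digits, underscores, periods and hyphens only.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")

def is_safe_graphname(name):
    """Return True if `name` is safe to use as a directory name on most systems."""
    if name in (".", ".."):          # legal characters, but reserved directory names
        return False
    return _SAFE_NAME.fullmatch(name) is not None

for candidate in ("abcd12", "abcd 1.2", "http://www.tdwg.org/schemas/abcd/1.2"):
    print(candidate, is_safe_graphname(candidate))
```

Namespace URLs fail this check because of their colons and slashes, which is one reason short graph names are needed for the Lupy index directories.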
Method A: This code shows how to prepare a Python source file so that it can be executed from the command line; it checks for the existence of the identifier g and, if g is missing, executes the file $bc/lib/biocase/fundamentals.py (to which a relative path is calculated using a somewhat contrived, but OS-independent incantation); this file contains code that loads g:
    #
    # -*- coding: utf-8 -*-
    #---------------------------------------------------------------------------
    # Enable globals:
    try: g
    except NameError:
        import os.path as _p; import inspect as _i
        pathToScript = _p.pardir, _p.pardir, 'biocase', 'lib', 'biocase', 'fundamentals.py'
        execfile( _p.join( _p.dirname( _i.getfile( lambda:None ) ), *pathToScript ) )

    g.configtool.utilities.makeDatabases.makeDocumentation.make( verbose = False )
If you want to simplify your life and know what you are doing, simply put a line like execfile('c:/foo/bar/biocase/lib/biocase/fundamentals.py') into your file prior to using g.
Method B: Do not run your script directly with the Python interpreter, but pass the filename as an argument to $bc/tools/environmenteliza2.py, like so:
python /biocase/tools/environmenteliza2.py /biocase/tools/myscript.py
Like fundamentals.py, environmenteliza2.py will make sure that g is made available to the script.
Lucene: The Lucene Search Engine. http://lucene.sourceforge.net
Divmod Lupy: Lucene implemented in Python. http://www.divmod.org/Lupy