Mario Latendresse, Ph.D.
The BioVelo language is designed to let the user write precise queries against Pathway/Genome Databases created using the Pathway Tools software. By writing queries in BioVelo you can extract lists of objects (e.g., proteins) satisfying specific conditions.
BioVelo can be used as queries for BioCyc Web services. Note that according to the BioCyc Web services documentation, a BioVelo query used in a Web service request must create a list of objects (e.g., pathways, genes). In particular, a BioVelo query used in a BioCyc Web service request cannot return names, numbers, etc. from these objects. Consult the BioCyc Web services documentation for more details and examples.
The BioVelo language is based on a computer science concept called list comprehension, which is popular in functional languages such as Haskell, Miranda, and Python. BioVelo is rich enough to let the user specify complex queries without relying on procedural side effects. This property eases writing queries by starting with simple ones and composing them to create more complex ones. Functional languages like BioVelo offer this advantage over procedural languages.
For a very quick introduction on BioVelo, with short examples, please see Table 2.
The databases queried by BioVelo contain objects belonging to various classes: metabolic pathways, reactions, proteins, genes, and so on. Each class has a set of attributes associated with it. For example, the class Proteins has attributes that include pI (its isoelectric point), and Gene (the gene encoding the protein). That means that each protein object (instance of the class) has the attribute Gene, although in some objects, the attribute may have no value.
The list of classes, their attributes, and the type of each attribute are part of what is called the schema (or ontology) of the databases. All PGDBs share one schema, although that schema changes with new versions of Pathway Tools. The schema is not part of BioVelo. The graphical user interface provided with BioVelo gives some details of the schema and the list of accessible databases -- this list will change with time.
The Pathway Tools schema is described in several documents. The most comprehensive is the Pathway Tools User's Guide, which is available as part of the Pathway Tools software download package. See also several publications listed on the BioCyc publications page. The better you know the Pathway Tools schema, the more adept you will be at writing BioVelo queries, because you need to know what classes to base your queries on, and which attributes to filter in your queries.
By applying relations (e.g., is equal to) to attributes you can constrain your search to specific attribute values. Several relations on multiple attributes can be used in the same query (or subquery).
Each attribute has a type. For example, a type can be a string, a number, a Boolean, an enumerated list of values (aka ``a controlled vocabulary"), or some specific class. For each type, a specific list of relational operators is allowed. For example, addition can be done only on numbers, not strings. Knowing these types is important to write valid queries.
BioVelo is based on List Comprehension, a notation that was first created in mathematics as Set Comprehension. The syntax is similar, but BioVelo emphasizes lists over sets. Note: a list may contain duplicates of some elements, whereas a set does not contain duplicates. BioVelo can process sets, but using lists is usually more efficient.
Table 1 presents the formal syntax of a valid BioVelo expression -- that is the element called Expr can be provided as a query. The notation used to define this syntax is described in the caption of the table although informal details are presented in the next paragraphs.
Typically, a query starts and ends with brackets (that is [ ] or { }). From Table 1 a bracketed expression is either ListExpr or SetExpr. We introduce BioVelo with such queries.
A bracketed query has the form
[ e : q1,..., qn]
or
{ e : q1,..., qn}
where e -- called the head -- is an expression, and the qi are qualifiers that are either generators, binders, or filters. The result of a query is an ordered list (i.e., duplicates may exist) if square brackets [ ] are used; but it is a set (i.e., no duplicates exist) if curly braces { } are used. We will detail the syntax and semantics of such expressions in the following paragraphs.
The head expression e is either a single expression (a SingleExpr from Table 3) or a tuple of single expressions (e1,..., en). In the simplest and most common case, the e is a single identifier that is assigned (aka bound) by one of the generators or binders.
A generator has the form p<-X where p is a variable or a tuple of variables and X is an expression of type list. The scope of the variables in p extends to all qualifiers to the right of that generator -- that is all the variables in p can be referenced by qualifiers on the right of this generator. Variables are identifiers as defined in Table 2 and note that they are case sensitive (e.g., variable X is not the same as variable x). The variables are iteratively bound to every element of X -- for each iteration the right qualifiers are evaluated. The generators are similar to ``loops" in programming languages. If p is a tuple, the number of variables in it must be the same as the tuples that are in X.
Here is a simple example with a single variable (clicking the query will bring you to the Free Form Advanced Query Page (FFAQP) where you can submit the query and see the result as a table):
This gives the list of all reactions in the database E. coli.
The double caret ^^ operator, applied to a database and a class name of this database, as in the previous example, specifies the list of all objects and subclasses of that class for this database. The left operand of ^^ must be either the name of a database accessible from the server or the name of a bound variable to a database name accessible from the server. For example ecoli^^proteins gives the list of all protein objects and subclasses of the database E. coli. The underlying schema decides the exact capitalization of the class names. Although, most of the time, a class name does not have to be exactly capitalized, that is, you can most of the time enter the class name all in lower case. Indeed, we provide the following heuristic to help the user avoid entering the exact capitalization: we first try to find the class name in the schema as you enter it; if this fails, we capitalize the class name provided by capitalizing all the words separated by a dash (e.g., Super-Pathways). For example, if you enter proteins, it is first tried as is in the schema; since it does not exist in the current schema, the capitalization Proteins is then tried, which is correct in the current schema. Similarly with all-genes, it is capitalized to All-Genes. On the other hand, the class DNA-Binding-Sites must be written exactly that way since our heuristic capitalization would give Dna-Binding-Sites -- which is incorrect in the current schema.
The triple caret ^^^ operator is identical to the double caret ^^ operator, except that the result includes only the instances of its argument class, not the instances and subclasses.
The tilde ~ can refer to a single object of one database based on a frame-id. For example, the expression ecoli~rxn0-3151 refers to the object whith frame-id rxn0-3151 in database ecoli. If the frame-id contains characters that are not part of legal identifier (see Table 1 for the definition of identifier), then it must be enclosed in a pair of double-quotes. For example, ecoli~"GO:0009252". In this case, the colon character (:) is not a legal character for an identifier. It is not an error to use a pair of double-quotes when it is not required. In general, the left operand of ~ can be a variable bound to a database. For example, you can search the entire list of databases for one particular frame-id (e.g., eg10124) by using dbs like so: [(x,x~eg10124): x<-dbs, x~eg10124].
The single caret ^ operator references the value of an attribute of an object of a class. The left operand must be an expression evaluating to an object. The right operand must be the name of a queryable attribute. The name of an attribute is not case sensitive. For example, assume that identifier x is bound to a protein object; then x^dna-footprint-size gives the value of the attribute dna-footprint-size of that protein. In this case, this is an integer value. An attribute of any type can be referenced using ^. If the attribute does not belong to the class of the object bound to the given variable, its value will be the empty list.
The ^? operator is essentially the same as ^ but it always generates a string that represent an HTML link to the object on its left operand. This is typically used in the head of the result to get clickable links in a browser. For example,
[r^?name : r <- ecoli^^reactions]would give the list of reaction names with a clickable link to get more information for each reaction. Using the ^ operator would only give the name.
The special attribute FRAME-ID exists for all objects (it has the type String). Although it is not a real attribute of the schema, it is available to make it explicit that each object has a unique ID that can be referenced.
A filter is a Boolean expression formed by connectives, bound variables, comparison operators, arithmetic operators, and/or class predicates. The connective for logical "and" is &, for "or" it is |. The constants false and true represent the corresponding logical values. The comparison operators <, >, <=, >=, =, != pertain to numeric values. The priority (order of grouping) of these operators is given in Table 2.
The logical operators can be applied to any types. For example, assuming that e is a variable bound to a gene object, it is valid to write e^comment | e^citations which is true if there is a comment or some citations for e. That is, an attribute that has no value is considered false. So, that !(e^comment) (the exclamation point is the logical not operator) is true when e^comment has no value. This is useful to find out which attributes have values or no values.
For example, in the following we have a simple Boolean expression 3 > #r^left that is true if and only if the number of elements in the left attribute is smaller than 3:
[(r^?name, #r^left) : r <- ecoli^^reactions, 3 > #r^left ]
This expression "loops" through each reaction of E. coli, verifies for each one that the left attribute has less than 3 elements, and returns only these reactions along with their number of elements in left. Notice that in 3 > #r^left, the ^ has higher priority than #, and > has lower priority than #. So it is interpreted as: reference the attribute left from r, then take the length of left, then compare it with 3. The order is important, otherwise, in this case, these operations would not make sense because of invalid types (e.g., you cannot take the length of an object, but only of a list or set).
Several generators and filters can be specified as qualifiers in one bracketed query. The filters are simply evaluated from left to right as if a logical "and" operation was between them. Generators are also evaluated from left to right. They are nested "loops". For example, in
{r^?name : r <- ecoli^^reactions, product <- r^left, product isa proteins}
we have two generators. The first one loops through all the reactions of E. coli. For each reaction, the second generator loops through the products on the left attribute of this reaction. The filter verifies that product is a protein. Notice that we have used the curly braces so that the same reactions will not be returned twice. This will return the list of all reactions that have at least one protein on its left side. The similar query with square brackets:
[r^?name : r <- ecoli^^reactions, product <- r^left, product isa proteins]
would return a similar list of reactions but a reaction that has more than one protein in its left attribute would be repeated.
The following example contains three generators. This query generates a list of all pathways of E. coli with every possible pair of their reactions.
[(p^?name, r1^?name, r2^?name) : p <- ecoli^^pathways, r1 <- p^reaction-list, r2 <- p^reaction-list, r1 != r2]
The monadic operator # gives the number of elements of a list or set. For example, #ecoli^^proteins would give the number of protein objects in database ecoli. It can be applied to any expression that returns a list or set. For example the following expression gives the number of reactions that has only one object in its left attribute.
#[r : r <- ecoli^^reactions, 1 = #r^left ]
It is possible to bind a variable to any expression value. For example, the qualifier e := x - y binds the variable e to the value of the expression x - y. In general, a binder has the form p:=e where p is a pattern and e is an expression. The symbol := is used to bind one or several identifiers to the value of the expression e. If the pattern p is a tuple with n variables, then the expression e must generate a tuple of n values. If the pattern p is a single variable, the expression e can be of any type. This is useful to avoid recalculating a value (e.g., computation time is saved) or simply shortening a query. For example, in
[r^?name : r <- ecoli^^reactions, p := #r^left, 2 < p, p < 5]
we avoid doing a reference to #r^left twice by using the identifier p.
A tuple of variables can be used in some cases to bind multiple variables with multiple values. For example, in
[ (p^?name, e^?name) : p <- ecoli^^proteins, e := [(c1, c2) : (c1, c2) <- protein-to-components p, c2 > 1], #e > 4]
there is a tuple of variable (c1, c2) bound for each element returned by protein-to-components -- this function returns a tuple of two values for components of the protein p, the component c1 and the number of times c2 it occurs in p. This query would return all the proteins of E. coli that has more than four components that repeat at least once.
The list of possible databases is given by the special identifier dbs. For example, the generator db <- dbs would bind the variable db to each available database. A specific database is specified by using its name - for example, ecoli identifies the E. coli K-12 database, EcoCyc. The complete list of available databases is provided by the graphical user interface to BioVelo. It is also possible to have this list by entering the query dbs -- it will return the list of databases currently accessible from the server.
For any given object, the function database returns its database. It is not a string but a database object. The returned value can be used by any function that requires a database.
Arithmetic expressions can be formed by using arithmetic operators +, -, /, *, quotient (integer division), remainder, and abs. The subtraction (i.e., -) operation is dyadic as well as monadic (i.e., a minus sign). Moreover, lexically, it is necessary to have a delimiter between a - and an adjacent operator or identifier. For example, writing a-b does not mean a subtraction between identifier a and b, but rather the identifier a-b, since the dash can be used to create identifiers.
In the following query, a subtraction is done between the end and start positions of every gene of E. coli associated with an RNA. This query find all RNAs in E. coli that have less than 100 nucleotides.
[(r^?name, l^?name, r^?gene) : r <- ecoli^^RNA, g <- r^gene, l := abs(g^left-end-position - g^right-end-position), l < 100 ]
The form x isa c -- where x is an expression returning an object, and c a class -- is a Boolean expression, or filter, that tests the membership of the object bound to x in class c. The expression is true if and only if x is an object of class c. Notice that this includes the cases where the class of the object is a subclass of class c. For example, r isa binding-reactions is true if variable r is bound to an object of class binding-reactions or to one of its subclasses.
The class c is either a string, a symbol, the special keyword class or an expression using the operator ~.
The previous example could have been written in the following way using isa:
[(r^?name, l^?name, g^?name) : g <- ecoli^^genes, #(g^product) = 1, r <- g^product, r isa RNA, l := abs(g^left-end-position - g^right-end-position), l < 100 ]
The following query returns all the proteins of E. coli, with their DNA binding sites, if any; one binding site per line when displayed as a table.
[(p^?name, e^?name) : p <-ecoli^^proteins, c <- p^component-of, c isa Protein-DNA-Complexes, e := [c2 : c2 <- c^components, c2 isa DNA-Binding-Sites], #e > 0]
This last expression will actually repeat the same protein for each of its binding sites. To have one protein with all its binding sites "next to it" when displayed in a table, one could write
[(p^?name, e^?name) : p <-ecoli^^proteins, e := [c2 : c <- p^component-of, c isa Protein-DNA-Complexes, c2 <- c^components, c2 isa DNA-Binding-Sites], #e > 0]
The special keyword class can be used to test if an expression is a class. This keyword is useful to filter out non-instances of classes, for example, the pathway subclasses of class Pathways
[ p : p <-ecoli^^pathways, !(p isa class)]
The operator in, such as x in X, returns true if and only if element x is in list X. The type of x can be a number, string, or database object.
The Boolean expression s instring S is true if and only if the string s is a substring of string S; or if S is a list of strings, that s is a substring of these concatenated strings; or if S is a list of lists, each list is converted into a string, removing all spaces and parentheses, and s is a substring of these concatenated strings. This expression returns false otherwise. There is also a case insensitive version called instringci.
Regular expressions can be used to specify general string patterns to search. The operator ~= must be used to specify one regular expression. Its left operand must be of type string, and its right operand must be of type string whose value is a string that specifies a regular expression. The syntax of this regular expression follows the rules of the Perl language. Notice that there is a match if the left operand has a substring matching the regular expression. For example, to find all genes in E. coli having a name with a substring made of 't' followed by at least one 'a', you would write:
[g^?name : g <-ecoli^^genes, g^name ~= "t.*a" ]If we were looking for gene names that starts with 't' and ends with 'a', we would use the specifiers '^' and '$' for beginning and end string:
[g^?name : g <-ecoli^^genes, g^name ~= "^t.*a$" ]Notice that the pattern is case-sensitive. For example, the last query would not find gene names that end with 'A'.
The monadic operator set, applied to a list, returns the same list but without duplicated elements. Curly braces can be used instead of the square brackets -- for the whole query -- to get a set instead of a list.
The dyadic operator ++ concatenates two lists or sets (it does not remove duplicates). (The union of two sets can be obtained with ++ and set as in (set A ++ B)) As usual, it is used as an infix operator, as in A ++ B. The operator ** does an intersection between two lists or two sets; it does not remove duplicates.
When forming complex expressions, it is important to consider the priority of the operators used. For example, in the expression y+3*x > 1, the multiplication of 3 to x is done before the addition to y. Likewise, the addition is done before the comparison operator > is applied. That is, the priority of operator * is higher than + which is higher than >. Table 2 gives the relative priority of all operators: an operator higher up in the table has higher priority. As usual, parentheses can be used to apply the operators in a different order; for example in (y+3)*x > 1, the addition is done before the multiplication to x. Adding extra or redundant parentheses is not an error -- in many cases it is better to over-specify the order of evaluation so that the expression is clearer.
The associativity of an operator, or group of operators with the same priority, is either right or left. The operators * and / have the same priority and are left-associative; for example the expression x*y/z*w is interpreted as ((x*y)/z)*w. Note that an expression such as x/y/z is interpreted as (x/y)/z. Table 2 gives the associativity of all operators. In some cases, the associativity does not apply as the operator cannot be composed with itself. For example, an expression such as db^^c^^d is invalid since the second ^^ cannot be applied to a class -- that is, neither db^^(c^^d) nor (db^^c)^^d has a valid interpretation.
The expression [exp1 @ exp2] represents a list of numbers from exp1 up to at most exp2. For example, [1 @ 10] represents the list of integers from 1 to 10. The starting expression exp1 and ending expression exp2 must be of type numbers; they can be integer or non-integer expressions. The list of numbers generated starts at the value of exp1 and, by increasing by 1, goes to a maximum of exp2. If the ending value is smaller than the starting value, the generated list is empty.
Similarly, the expression [exp1 @ exp2, exp3] represents a list of numbers from exp1 up to at most exp2 in steps specified by exp3. For example, [1 @ 10, 2] represents the list of odd integers from 1 to 9. The starting expression exp1 and ending expression exp2 must be of type numbers; they can be integer or non-integer expressions. The list of numbers generated starts at the value of exp1 and, by increasing by exp3, goes to a maximum of exp2. If exp3 is zero, the list has only one element, the value of exp1. In the case that exp3 is negative, if the starting value exp1 is greater than the ending value exp2 than the list of values will be decreasing; otherwise the empty list will be generated.
List of numbers can also be specified by simply using a comma separated list of expressions between square brackets. For example, [1, -1, 100, 9] is a list of four numbers. General numerical expressions can be used as in [n, 2*n, m-1], assuming that the variables n and m are bound to numerical values.
Indexing can be done on list of any type by using the list syntax [...]. For example, [r : r <- ecoli^^reactions][0 @ 4] selects the first five reactions returned by the query on E. coli reactions. The indices do not have to be in numerical sequence, e.g. [3, 5, 7] is also valid. If the list of indices is only one element, the result will have the type of the extracted element, otherwise it is a list of the elements. Applying an indexing operation to a non-list generates an empty list. If some indices are out of range (e.g., greater than the length of the list or negative), no error occurs, but these indices are skipped. Indexing is 0-based, that is, L[0] returns the first element of variable L.
A BioVelo query can also be an expression that does not start with a bracket -- a query is in general an Expr according to the formal syntax of Table 1. For example, a tuple (#dbs, [c^?name:c<-ecoli^^genes]) is a legal query. This gives the number of accessible databases and the list of genes of E. coli. The expression #[p:p<-ecoli^^pathways] is also valid as a query. It gives the number of pathways in E. coli. The following would return two lists, each one interpreted as a table when displayed. The first table contains all E. coli reactions that have two objects on the right attribute; and the second table contains all E. coli reactions that have three.
([r?^name : r <- ecoli^^reactions, 2 = #r^right ], [r?^name : r <- ecoli^^reactions, 3 = #r^right ] )
So, in general, arithmetic expressions and tuples are also valid queries -- in the former case they return single values; in the latter case they return multiple values of possibly complex values (e.g., lists).
Here are some more details about some of these functions.
The special function find-objects searches a given database, i.e., its first operand, based on a given object as the second operand. The search is done using the common name, the synonyms and the frame-id of the given object. Typically, the given object is from a database different than the given database. A similar operation could be done using a complex BioVelo query based on the attributes of the object, but the find-objects function does the search in a more efficient way. For example, the following query will search for the gene trpA of ecoli in all databases available: (the gene trpA from ecoli is also included in the result)
[(db, l): g <-ecoli^^genes, g^name = "trpA", db <- dbs, l:=find-objects(db,g), #l > 0]
Note: the variable dbs is bound to the list of available databases.
The result of a query can be sorted using the following functions: sort, html-sort-descending, and html-sort-ascending.
The sort function always sort in ascending order. It is applied directly to the given list without preprocessing the elements. For exampe, sort [x^?name: x <- ecoli^^proteins] would return the ascending sorted list of proteins based on their name. If the result is a list of tuples (i.e., several columns), by default the sorting is done on the first element of the tuple (i.e., the first column). This can be changed by specifying two arguments to the function sort. The first argument is the query, the second is an integer specifying the index of the element to sort on. For example, sort ([(x^name,x^frame-id): x <- ecoli^^proteins],2) sorts on the second element, the frame-id. Note if the query is executed in the context of HTML output, all the elements are strings, including numbers, so that lexicographic order applies in this case.
The function html-sort-descending should only be used in the context of a query ran from one of the web interfaces: the SAQP or FFAQP. (Note: These interfaces are available at SAQP/FFAQP.) The output is not sorted by the server, but sorted by the JavaScript in the HTML page generated by the server. Naturally, as its name says, it sorts in descending order. As for sort, its second argument specifies which column to sort on. By default this is the first column.
The function html-sort-ascending is similar to html-sort-descending but sorts in ascending order.
The function html-table-headers controls the headers content of tables generated by a web server when generating HTML output. It should only be used in that context. Its first argument is a query that results in a list, its second argument is a tuple of strings. Each string will appear in the column headers of the HTML table to display the list. Note that the list is made of tuples of the same arity which is the number of columns of the displayed table. The number of strings specified in the second argument should be the same as the number of columns, although it is not an error to specify more or less than that number. The list will be either truncated, if too long, or supplied with some headers, if too short. Typically, the added headers are formulated by taking the head specification of the query, although this may result in odd looking text.
Example:
html-table-headers([(pathway-to-reactions(x), pathway-to-genes(x) ) : x <- ecoli^^pathways], ("My Reactions", "My Genes"))
The functions html-table-headers, html-sort-ascending, and html-sort-descending can be freely composed. That is, for example, after sorting using either html-sort-ascending or html-sort-descending you can apply html-table-headers.
|
|