Chase and Provenance
Prometheux natively supports explaining how each output of a reasoning task was derived: it materializes the chase graph during parallel and distributed evaluation.
The chase graph mode is activated with the @chase annotation, and the graph is materialized in a parallel and distributed fashion into datasources such as CSV files or Neo4j databases.
Configuring the Chase Graph
The @chase annotation configures how the chase graph is stored. Its syntax is as follows:
@chase("datasource", "filepath", "filename").
The chase graph's storage format varies based on the selected datasource.
Materializing the Chase Graph on CSV
For CSV datasources, the chase graph is stored as a CSV dataset with the columns "Fact", "ProvenanceLeft", "ProvenanceRight", and "Rule". These hold, respectively, the chase fact, the left body predicate, the right body predicate (empty if the chase fact is generated by a linear rule), and the rule that generated the chase fact. Consider this example:
arc(1,2).
arc(1,3).
path(X,Y) :- arc(X,Y).
@chase("csv", "disk/data", "chase.csv").
@output("path").
This @chase annotation instructs Vadalog to store the chase in the path disk/data/chase.csv. Because of the distributed evaluation, the directory may contain multiple CSV files, which together hold the following rows:
Fact,ProvenanceLeft,ProvenanceRight,Rule
"path(1,2)","arc(1,2)","","path(X,Y) :- arc(X,Y)"
"path(1,3)","arc(1,3)","","path(X,Y) :- arc(X,Y)"
Handling Aggregations
Producing the chase for programs with aggregations requires introducing artificial intermediate chase nodes that represent facts with their group-by values. Such an intermediate chase node guarantees connectivity between the fact resulting from an aggregation and the facts that contributed to it. It is identified by its predicate, which is the original predicate name prefixed with aggregated_explainability_.
As an example, the following program:
own(1,2,0.3).
own(1,2,0.4).
path_own(X,Y,Z) :- own(X,Y,Z), Z > 0.
path_own_agg(X,Y,C) :- path_own(X,Y,Z), C = msum(Z).
@chase("csv coalesce=true", "disk/data", "chase.csv").
@output("path_own_agg").
produces a CSV containing the following rows:
Fact,ProvenanceLeft,ProvenanceRight,Rule
"path_own(1,2,0.3)","own(1,2,0.3)","","path_own(X,Y,Z) :- own(X,Y,Z), Z>0."
"path_own(1,2,0.4)","own(1,2,0.4)","","path_own(X,Y,Z) :- own(X,Y,Z), Z>0."
"aggregated_explainability_path_own_agg(1,2)","path_own(1,2,0.3)","","path_own_agg(X,Y,C) :- path_own(X,Y,Z), C=msum(Z)."
"aggregated_explainability_path_own_agg(1,2)","path_own(1,2,0.4)","","path_own_agg(X,Y,C) :- path_own(X,Y,Z), C=msum(Z)."
"path_own_agg(1,2,0.7)","aggregated_explainability_path_own_agg(1,2)","","path_own_agg(X,Y,C) :- path_own(X,Y,Z), C=msum(Z)."
Materializing the Chase Graph on CSV using Neo4j Bulk Import
When materializing the chase in CSV format, the data can be structured for import into Neo4j with the bulk import tool. This is particularly useful for navigating very large chase graphs with Neo4j. To activate this functionality, set the parameter forNeo4jBulkImport=true (the default is false). This option materializes the chase into two CSV files: one for nodes and one for edges. The nodes CSV has the columns "id:ID", ":Label". The edges CSV has the columns ":START_ID", ":END_ID", "rule:string", ":TYPE". Consider this example:
arc(1,2).
arc(1,3).
path(X,Y) :- arc(X,Y).
@chase("csv forNeo4jBulkImport=true, compression=gzip", "neo4j-import", "chase").
@output("path").
The chase annotation instructs Vadalog to store the chase nodes in the path neo4j-import/chase/nodes and the chase edges in the path neo4j-import/chase/edges. Optionally, a compression mode can be added to reduce the size of the materialized chase.
The nodes directory may contain multiple CSVs due to the distributed evaluation, with the following entries:
"arc(1,2)", "CHASE_NODE";"ARC"
"arc(1,3)", "CHASE_NODE";"ARC"
"path(1,2)", "CHASE_NODE";"PATH"
"path(1,3)", "CHASE_NODE";"PATH"
The edges directory may contain multiple CSVs due to the distributed evaluation, with the following entries:
:START_ID, :END_ID, rule:string, :TYPE
"path(1,2)", "arc(1,2)", "path(X,Y) :- arc(X,Y)", "DERIVED_BY"
"path(1,3)", "arc(1,3)", "path(X,Y) :- arc(X,Y)", "DERIVED_BY"
Notice that "rule:string" is an attribute of the edge relationship having type "DERIVED_BY" connecting the start node with the end node.
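The relationship between the plain four-column chase rows and the node and edge layout can be sketched in Python. This is only an illustration, not how Prometheux produces the files: the function name `to_bulk_import` is hypothetical, and the rule that the second label is the upper-cased predicate name is an assumption inferred from the sample entries above.

```python
def to_bulk_import(rows):
    """Split (fact, left, right, rule) rows into node labels and edge tuples."""
    nodes, edges = {}, []
    for fact, left, right, rule in rows:
        for parent in (left, right):
            if not parent:
                continue  # empty ProvenanceRight for linear rules
            for f in (fact, parent):
                predicate = f.split("(", 1)[0]
                # Assumed labelling: CHASE_NODE plus the upper-cased predicate name.
                nodes[f] = f"CHASE_NODE;{predicate.upper()}"
            # Edge direction: derived fact -> fact it was derived from.
            edges.append((fact, parent, rule, "DERIVED_BY"))
    return nodes, edges

rows = [
    ("path(1,2)", "arc(1,2)", "", "path(X,Y) :- arc(X,Y)"),
    ("path(1,3)", "arc(1,3)", "", "path(X,Y) :- arc(X,Y)"),
]
nodes, edges = to_bulk_import(rows)
for node_id, labels in nodes.items():
    print(node_id, labels)
for edge in edges:
    print(edge)
```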
After materializing the chase in CSV files, the following script imports it into Neo4j using the bulk import command:
docker run --rm \
  --volume=${PWD}/neo4j-data:/var/lib/neo4j/data \
  --volume=${PWD}/neo4j-import:/var/lib/neo4j/import \
  neo4j:4.4.31 \
  neo4j-admin import --database=neo4j \
  --nodes=/var/lib/neo4j/import/chase/nodes/part-[a-zA-Z0-9\-]+.csv.gz \
  --relationships=/var/lib/neo4j/import/chase/edges/part-[a-zA-Z0-9\-]+.csv.gz
Materializing the Chase Graph on Neo4j with connectors
For Neo4j, the chase graph is represented as a graph with one node per chase node and one edge, named DERIVED_BY, per derivation. Each node is a Neo4j node with the label CHASE_NODE and a property fact holding the chase fact.
Consider this example:
arc(1,2).
arc(1,3).
path(X,Y) :- arc(X,Y).
@chase("neo4j", "", "").
@output("path").
The configuration for Neo4j is specified in the vada.properties file, so the second and third fields of the @chase annotation are not needed and can be left empty. The properties for the Neo4j configuration might look like this:
neo4j.chase.url=bolt://localhost:7687
neo4j.chase.username=neo4j
neo4j.chase.password=neo4j
neo4j.chase.database=neo4j
neo4j.chase.authenticationType=basic
neo4j.chase.partitions=1
This chase annotation instructs Vadalog to store the chase in the Neo4j database reachable at the URL bolt://localhost:7687. In this case, four CHASE_NODE nodes are created:
CHASE_NODE(fact: 'path(1,2)')
CHASE_NODE(fact: 'path(1,3)')
CHASE_NODE(fact: 'arc(1,2)')
CHASE_NODE(fact: 'arc(1,3)')
and two DERIVED_BY edges:
CHASE_NODE(fact: 'path(1,2)') -[DERIVED_BY(rule: 'path(X,Y) :- arc(X,Y)')]-> CHASE_NODE(fact: 'arc(1,2)')
CHASE_NODE(fact: 'path(1,3)') -[DERIVED_BY(rule: 'path(X,Y) :- arc(X,Y)')]-> CHASE_NODE(fact: 'arc(1,3)')
Retrieving the Chase from Neo4j
After storing the chase in Neo4j, it can be retrieved using distributed Neo4j Connectors with the following annotations:
@input("chase_neo4j").
@qbind("chase_neo4j", "neo4j", "\", "MATCH(n:CHASE_NODE) -[r:DERIVED_BY]->(m:CHASE_NODE) RETURN n.fact, m.fact, r.rule").
chase_edge(X,Y,R) :- chase_neo4j(X,Y,R).
@output("chase_edge").
This retrieves the entire chase graph and prints it to the console. Additionally, bind annotations can store the retrieved data in a different datasource.
Note that the Neo4j connector used by the above @qbind reads its standard connection properties from the vada.properties file:
neo4j.url=bolt://localhost:7687
neo4j.username=neo4j
neo4j.password=neo4j
neo4j.database=neo4j
neo4j.authenticationType=basic
neo4j.partitions=1
Optimize the Chase materialization and retrieval on Neo4j
To improve performance when materializing and retrieving the chase from Neo4j, you can index the fact property of CHASE_NODE nodes by running the following Cypher query in the Neo4j Browser or via Cypher Shell:
CREATE INDEX index_chase_node__fact IF NOT EXISTS FOR (chase_node:CHASE_NODE) ON (chase_node.fact)
Explanation of the output
The query below demonstrates how to retrieve specific chase data related to the fact 'a(1,2)':
@input("chase_neo4j").
@qbind("chase_neo4j", "neo4j", "\", "MATCH (root:CHASE_NODE {fact: 'a(1,2)' }) CALL apoc.path.subgraphNodes(root, {relationshipFilter: 'DERIVED_BY>', limit: 1000}) YIELD node MATCH (node)-[r]->(m) RETURN node.fact, m.fact, r.rule").
chase_edge(X,Y,R) :- chase_neo4j(X,Y,R).
@output("chase_edge").
This query requires the APOC library to be installed in Neo4j. To set up a Neo4j container instance with APOC, use the following Docker script:
#!/bin/bash
docker run --rm -d --name neo4j-vada \
  -p 7474:7474 \
  -p 7687:7687 \
  --volume=${PWD}/neo4j-data:/var/lib/neo4j/data \
  --volume=${PWD}/neo4j-import:/var/lib/neo4j/import \
  --volume=${PWD}/neo4j-plugins:/var/lib/neo4j/plugins \
  -e NEO4J_apoc_export_file_enabled=true \
  -e NEO4J_apoc_import_file_enabled=true \
  -e NEO4J_apoc_import_file_use__neo4j__config=true \
  -e NEO4JLABS_PLUGINS='["graph-data-science","apoc"]' \
  -e NEO4J_dbms_memory_heap_max__size=2G \
  -e NEO4J_dbms_security_procedures_unrestricted=gds.\\\* \
  neo4j:4.4.31
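For intuition about what the explanation query for 'a(1,2)' returns, the same traversal can be sketched in plain Python over a list of DERIVED_BY edges. This is only a rough approximation of the query's behaviour, not the APOC implementation; the function name `explain`, the facts other than a(1,2), and the rules are hypothetical sample data.

```python
from collections import deque

# Hypothetical (fact, parent, rule) edges; only a(1,2) appears in the query above.
EDGES = [
    ("a(1,2)", "b(1,2)", "a(X,Y) :- b(X,Y), c(Y)."),
    ("a(1,2)", "c(2)", "a(X,Y) :- b(X,Y), c(Y)."),
    ("b(1,2)", "d(1,2)", "b(X,Y) :- d(X,Y)."),
    ("a(5,6)", "b(5,6)", "a(X,Y) :- b(X,Y), c(Y)."),  # unrelated derivation
]

def explain(root, edges, limit=1000):
    """Return the derivation edges reachable from `root` via outgoing edges."""
    outgoing = {}
    for fact, parent, rule in edges:
        outgoing.setdefault(fact, []).append((parent, rule))
    visited, queue, result = {root}, deque([root]), []
    while queue:
        node = queue.popleft()
        for parent, rule in outgoing.get(node, []):
            result.append((node, parent, rule))
            # Stop expanding new nodes once `limit` nodes have been visited.
            if parent not in visited and len(visited) < limit:
                visited.add(parent)
                queue.append(parent)
    return result

explanation = explain("a(1,2)", EDGES)
for fact, parent, rule in explanation:
    print(f"{fact} derived from {parent} by {rule}")
```

The unrelated derivation of a(5,6) is not returned, because only edges reachable from the root fact are followed.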