Chase and Provenance

Prometheux natively explains how each output of a reasoning task was derived by materializing the chase graph during its parallel and distributed evaluation.

The chase graph mode is activated using the @chase annotation and is materialized in a parallel and distributed fashion into datasources such as CSV files or Neo4j databases.

Configuring the Chase Graph

The @chase annotation configures how the chase graph is stored. Its syntax is as follows:

@chase("datasource", "filepath", "filename").

The chase graph's storage format varies based on the selected datasource.

Materializing the Chase Graph on CSV

For CSV datasources, the chase graph is stored as a CSV dataset with four columns: "Fact" (the chase fact), "ProvenanceLeft" (the left body predicate), "ProvenanceRight" (the right body predicate, empty if the chase fact is generated by a linear rule), and "Rule" (the description of the rule that generated the chase fact). Consider this example:

arc(1,2).
arc(1,3).

path(X,Y) :- arc(X,Y).
@chase("csv", "disk/data", "chase.csv").
@output("path").

This @chase annotation instructs Vadalog to store the chase at the path disk/data/chase.csv. Because evaluation is distributed, the directory may contain multiple CSV files, which together hold the following rows:

Chase graph at: disk/data/chase.csv
Fact,ProvenanceLeft,ProvenanceRight,Rule
"path(1,2)","arc(1,2)","","path(X,Y) :- arc(X,Y)"
"path(1,3)","arc(1,3)","","path(X,Y) :- arc(X,Y)"

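To inspect a materialized chase outside Vadalog, the rows can be loaded into a simple lookup from each fact to its provenance. The following is a minimal Python sketch, assuming the four-column layout shown above; the helper name load_chase_rows and the inline sample text are illustrative and not part of Prometheux.

```python
import csv
import io

def load_chase_rows(text):
    """Parse chase CSV text into a list of dicts keyed by the header columns."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return list(reader)

# Sample content copied from the chase rows shown above.
SAMPLE = '''Fact,ProvenanceLeft,ProvenanceRight,Rule
"path(1,2)","arc(1,2)","","path(X,Y) :- arc(X,Y)"
"path(1,3)","arc(1,3)","","path(X,Y) :- arc(X,Y)"
'''

rows = load_chase_rows(SAMPLE)

# Map each derived fact to the non-empty body facts it was derived from.
provenance = {
    row["Fact"]: [p for p in (row["ProvenanceLeft"], row["ProvenanceRight"]) if p]
    for row in rows
}
print(provenance)
```

When the output directory holds several part files, the same function can be applied to each file and the resulting lists concatenated before building the lookup.
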
Handling Aggregations

Producing the chase for programs with aggregations requires introducing artificial intermediate chase nodes that represent facts with their group-by values. Each intermediate chase node guarantees connectivity between the fact resulting from an aggregation and the facts that contributed to it. Such an intermediate chase fact is identified by its predicate, which is the original predicate name prefixed with aggregated_explainability_.

As an example, the following program:

own(1,2,0.3).
own(1,2,0.4).
path_own(X,Y,Z) :- own(X,Y,Z), Z > 0.
path_own_agg(X,Y,C) :- path_own(X,Y,Z), C = msum(Z).
@chase("csv coalesce=true", "disk/data", "chase.csv").
@output("path_own_agg").

produces a CSV containing the following rows:

Chase graph at: disk/data/chase.csv
Fact,ProvenanceLeft,ProvenanceRight,Rule
"path_own(1,2,0.3)","own(1,2,0.3)","","path_own(X,Y,Z) :- own(X,Y,Z), Z>0."
"path_own(1,2,0.4)","own(1,2,0.4)","","path_own(X,Y,Z) :- own(X,Y,Z), Z>0."
"aggregated_explainability_path_own_agg(1,2)","path_own(1,2,0.3)","","path_own_agg(X,Y,C) :- path_own(X,Y,Z), C=msum(Z)."
"aggregated_explainability_path_own_agg(1,2)","path_own(1,2,0.4)","","path_own_agg(X,Y,C) :- path_own(X,Y,Z), C=msum(Z)."
"path_own_agg(1,2,0.7)","aggregated_explainability_path_own_agg(1,2)","","path_own_agg(X,Y,C) :- path_own(X,Y,Z), C=msum(Z)."

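The role of the intermediate node can be illustrated by tracing the aggregated fact back to the facts that contributed to it. The sketch below is a hypothetical Python helper (not a Prometheux API) that works on the rows listed above and looks through any node whose predicate starts with aggregated_explainability_.

```python
from collections import defaultdict

# (fact, provenance_left, provenance_right) triples copied from the rows above.
ROWS = [
    ("path_own(1,2,0.3)", "own(1,2,0.3)", ""),
    ("path_own(1,2,0.4)", "own(1,2,0.4)", ""),
    ("aggregated_explainability_path_own_agg(1,2)", "path_own(1,2,0.3)", ""),
    ("aggregated_explainability_path_own_agg(1,2)", "path_own(1,2,0.4)", ""),
    ("path_own_agg(1,2,0.7)", "aggregated_explainability_path_own_agg(1,2)", ""),
]

PREFIX = "aggregated_explainability_"

def build_parents(rows):
    """Map each fact to the set of facts it was directly derived from."""
    parents = defaultdict(set)
    for fact, left, right in rows:
        for parent in (left, right):
            if parent:
                parents[fact].add(parent)
    return parents

def contributors(fact, parents):
    """Return the facts feeding `fact`, skipping over intermediate aggregation nodes."""
    result = set()
    for parent in parents.get(fact, ()):
        if parent.startswith(PREFIX):
            result |= contributors(parent, parents)
        else:
            result.add(parent)
    return result

parents = build_parents(ROWS)
print(sorted(contributors("path_own_agg(1,2,0.7)", parents)))
```

The intermediate node appears once per contributing fact in the CSV, so grouping its rows recovers the full set of inputs to the aggregation.
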
Materializing the Chase Graph on CSV using Neo4j Bulk Import

When materializing the chase in CSV format, the data can be structured for import into Neo4j with the bulk import tool. This is particularly useful for navigating very large chase graphs in Neo4j. To activate this functionality, set the parameter forNeo4jBulkImport=true (the default is false). With this option, the chase is materialized into two sets of CSV files: one for nodes and one for edges. The node CSV has the columns "id:ID" and ":Label". The edge CSV has the columns ":START_ID", ":END_ID", "rule:string" and ":TYPE". Consider this example:

arc(1,2).
arc(1,3).

path(X,Y) :- arc(X,Y).
@chase("csv forNeo4jBulkImport=true, compression=gzip", "neo4j-import", "chase").
@output("path").

The @chase annotation instructs Vadalog to store the chase nodes in the path neo4j-import/chase/nodes and the chase edges in the path neo4j-import/chase/edges. The optional compression parameter reduces the disk space occupied by the materialized chase.

The nodes directory may contain multiple CSVs due to the distributed evaluation, with the following entries:

Chase graph at: neo4j-import/chase/nodes/part-0.csv
"arc(1,2)", "CHASE_NODE";"ARC"
"arc(1,3)", "CHASE_NODE";"ARC"
"path(1,2)", "CHASE_NODE";"PATH"
"path(1,3)", "CHASE_NODE";"PATH"

The edges directory may contain multiple CSVs due to the distributed evaluation, with the following entries:

Chase graph at: neo4j-import/chase/edges/part-0.csv
:START_ID, :END_ID, rule:string, :TYPE
"path(1,2)", "arc(1,2)", "path(X,Y) :- arc(X,Y)", "DERIVED_BY"
"path(1,3)", "arc(1,3)", "path(X,Y) :- arc(X,Y)", "DERIVED_BY"

Notice that "rule:string" is an attribute of the edge relationship having type "DERIVED_BY" connecting the start node with the end node.
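
The relationship between the flat four-column chase and the node and edge layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the second label is assumed to be the upper-cased predicate name, as the sample rows suggest.

```python
def predicate_label(fact):
    """Derive a label from the predicate name, e.g. 'arc(1,2)' -> 'ARC' (assumed convention)."""
    return fact.split("(", 1)[0].upper()

def to_bulk_import(rows):
    """Split flat (fact, left, right, rule) rows into node labels and edge rows."""
    nodes = {}
    edges = []
    for fact, left, right, rule in rows:
        for parent in (left, right):
            if not parent:
                continue  # linear rules leave the right provenance empty
            edges.append((fact, parent, rule, "DERIVED_BY"))
            nodes.setdefault(parent, predicate_label(parent))
        nodes.setdefault(fact, predicate_label(fact))
    return nodes, edges

# Flat chase rows from the arc/path example.
FLAT = [
    ("path(1,2)", "arc(1,2)", "", "path(X,Y) :- arc(X,Y)"),
    ("path(1,3)", "arc(1,3)", "", "path(X,Y) :- arc(X,Y)"),
]

nodes, edges = to_bulk_import(FLAT)
for node_id, label in nodes.items():
    print(node_id, "CHASE_NODE", label)
for edge in edges:
    print(edge)
```

Each fact becomes one node regardless of how many derivations mention it, while every non-empty provenance entry becomes one DERIVED_BY edge carrying the rule.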

After materializing the chase in CSV files one can run the following script to import the chase onto Neo4j using the bulk import command:

docker run --rm \
--volume=${PWD}/neo4j-data:/var/lib/neo4j/data \
--volume=${PWD}/neo4j-import:/var/lib/neo4j/import \
neo4j:4.4.31 \
neo4j-admin import --database=neo4j \
--nodes=/var/lib/neo4j/import/chase/nodes/part-[a-zA-Z0-9\-]+.csv.gz \
--relationships=/var/lib/neo4j/import/chase/edges/part-[a-zA-Z0-9\-]+.csv.gz

Materializing the Chase Graph on Neo4j with connectors

For Neo4j, the chase graph is represented as a graph with a node for each chase node and an edge of type DERIVED_BY for each derivation. Each node is a Neo4j node with the label CHASE_NODE and a property fact holding the chase fact. Consider this example:

arc(1,2).
arc(1,3).

path(X,Y) :- arc(X,Y).
@chase("neo4j", "", "").
@output("path").

The configuration for Neo4j is specified in the vada.properties file, so the second and third fields of the @chase annotation are not needed and can be left empty. The properties for Neo4j configuration might look like this:

neo4j.chase.url=bolt://localhost:7687
neo4j.chase.username=neo4j
neo4j.chase.password=neo4j
neo4j.chase.database=neo4j
neo4j.chase.authenticationType=basic
neo4j.chase.partitions=1

This @chase annotation instructs Vadalog to store the chase in the Neo4j database reachable at the URL bolt://localhost:7687. In this case, four CHASE_NODE nodes are created

CHASE_NODE(fact: 'path(1,2)')
CHASE_NODE(fact: 'path(1,3)')
CHASE_NODE(fact: 'arc(1,2)')
CHASE_NODE(fact: 'arc(1,3)')

and two DERIVED_BY edges

CHASE_NODE(fact: 'path(1,2)') -[DERIVED_BY(rule: 'path(X,Y) :- arc(X,Y)')]-> CHASE_NODE(fact: 'arc(1,2)')
CHASE_NODE(fact: 'path(1,3)') -[DERIVED_BY(rule: 'path(X,Y) :- arc(X,Y)')]-> CHASE_NODE(fact: 'arc(1,3)')

Retrieving the Chase from Neo4j

After storing the chase in Neo4j, it can be retrieved using distributed Neo4j Connectors with the following annotations:

@input("chase_neo4j").
@qbind("chase_neo4j", "neo4j", "\", "MATCH(n:CHASE_NODE) -[r:DERIVED_BY]->(m:CHASE_NODE) RETURN n.fact, m.fact, r.rule").
chase_edge(X,Y,R) :- chase_neo4j(X,Y,R).
@output("chase_edge").

This retrieves the entire chase graph and prints it to the console. Additionally, bind annotations can store the retrieved data in a different datasource.

Note that the Neo4j connector used by the @qbind above reads its standard connection properties from the vada.properties file:

neo4j.url=bolt://localhost:7687
neo4j.username=neo4j
neo4j.password=neo4j
neo4j.database=neo4j
neo4j.authenticationType=basic
neo4j.partitions=1

Optimize the Chase materialization and retrieval on Neo4j

To improve performance when materializing and retrieving the chase from Neo4j, you can index the fact property of CHASE_NODE with the following Cypher query, run in the Neo4j Browser or via Cypher Shell:

CREATE INDEX index_chase_node__fact IF NOT EXISTS FOR (chase_node:CHASE_NODE) ON (chase_node.fact)

Explanation of the output

The query below demonstrates how to retrieve specific chase data related to the fact 'a(1,2)':

@input("chase_neo4j").
@qbind("chase_neo4j", "neo4j", "\", "MATCH (root:CHASE_NODE {fact: 'a(1,2)' }) CALL apoc.path.subgraphNodes(root, {relationshipFilter: 'DERIVED_BY>', limit: 1000}) YIELD node MATCH (node)-[r]->(m) RETURN node.fact, m.fact, r.rule").

chase_edge(X,Y,R) :- chase_neo4j(X,Y,R).
@output("chase_edge").

This query requires the APOC library installed in Neo4j. To set up a Neo4j container instance with APOC, use the following Docker script:

#!/bin/bash
docker run --rm -d --name neo4j-vada \
-p 7474:7474 \
-p 7687:7687 \
--volume=${PWD}/neo4j-data:/var/lib/neo4j/data \
--volume=${PWD}/neo4j-import:/var/lib/neo4j/import \
--volume=${PWD}/neo4j-plugins:/var/lib/neo4j/plugins \
-e NEO4J_apoc_export_file_enabled=true \
-e NEO4J_apoc_import_file_enabled=true \
-e NEO4J_apoc_import_file_use__neo4j__config=true \
-e NEO4JLABS_PLUGINS='["graph-data-science","apoc"]' \
-e NEO4J_dbms_memory_heap_max__size=2G \
-e NEO4J_dbms_security_procedures_unrestricted=gds.\\\* neo4j:4.4.31
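
To make the intent of the explanation query above concrete, the following Python sketch approximates it on an in-memory edge list: starting from a root fact, it follows outgoing DERIVED_BY edges breadth-first, visits at most limit nodes, and returns the outgoing edges of every visited node. It is an illustration only and does not reproduce APOC semantics exactly; the function name explain and the sample edges (taken from the arc/path example rather than 'a(1,2)') are assumptions.

```python
from collections import deque

# (start fact, end fact, rule) triples mirroring DERIVED_BY edges.
EDGES = [
    ("path(1,2)", "arc(1,2)", "path(X,Y) :- arc(X,Y)"),
    ("path(1,3)", "arc(1,3)", "path(X,Y) :- arc(X,Y)"),
]

def explain(root, edges, limit=1000):
    """Collect the derivation edges reachable from `root`, visiting at most `limit` nodes."""
    outgoing = {}
    for start, end, rule in edges:
        outgoing.setdefault(start, []).append((end, rule))

    visited = [root]
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for end, _rule in outgoing.get(node, []):
            if end not in seen and len(visited) < limit:
                seen.add(end)
                visited.append(end)
                queue.append(end)

    # Equivalent in spirit to: MATCH (node)-[r]->(m) RETURN node.fact, m.fact, r.rule
    return [(n, end, rule) for n in visited for end, rule in outgoing.get(n, [])]

print(explain("path(1,2)", EDGES))
```

Only the derivation of the requested fact is returned; the unrelated edge for path(1,3) is left out because it is not reachable from the root.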