Configuring Prometheux

When you configure database properties such as credentials, the settings apply to all instances of that database type (e.g., PostgreSQL, Neo4j, MariaDB, MongoDB). However, settings defined in a `@bind` annotation override the global configuration and apply only to that specific binding.

This allows flexibility in managing multiple connections or setting up different configurations for various tasks while still maintaining a default global configuration for general usage.
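The precedence rule can be illustrated with a minimal Python sketch. It is not Prometheux code: the `effective_settings` helper, the `bind_overrides` argument, and the `reporting` username are hypothetical names chosen for illustration, and the global values are simply the documented defaults.

```python
# Hypothetical global defaults, copied from the documented default values.
GLOBAL_CONFIG = {
    "postgresql.url": "jdbc:postgresql://localhost:5432/postgres",
    "postgresql.username": "postgres",
    "postgresql.password": "postgres",
}


def effective_settings(db_type, global_config, bind_overrides=None):
    """Return the settings one binding would see for `db_type`.

    Global properties prefixed with `db_type.` form the baseline;
    any key present in `bind_overrides` replaces the global value.
    """
    prefix = db_type + "."
    settings = {
        key[len(prefix):]: value
        for key, value in global_config.items()
        if key.startswith(prefix)
    }
    settings.update(bind_overrides or {})
    return settings


# A binding with no overrides inherits every global value.
default_binding = effective_settings("postgresql", GLOBAL_CONFIG)

# A binding that overrides only the username keeps the global URL and password.
reporting_binding = effective_settings(
    "postgresql", GLOBAL_CONFIG, {"username": "reporting"}
)
```

Here `reporting_binding` differs from `default_binding` only in its username, which mirrors how a per-binding setting is meant to shadow the global one without affecting other bindings.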

The following properties must be configured correctly for Prometheux to work as expected.

Property Name	Default Value	Description
`postgresql.url`	`jdbc:postgresql://localhost:5432/postgres`	JDBC URL for connecting to the PostgreSQL database.
`postgresql.username`	`postgres`	PostgreSQL database username.
`postgresql.password`	`postgres`	PostgreSQL database password.
`sqlite.url`	`jdbc:sqlite:sqlite/testdb.db`	SQLite database URL for connection.
`decimal_digits`	`3`	Number of digits to use after the decimal point for decimal values.
`neo4j.url`	`bolt://localhost:7687`	URL for connecting to the Neo4j instance using Bolt protocol.
`neo4j.username`	`neo4j`	Username for authenticating to the Neo4j database.
`neo4j.password`	`neo4j`	Password for authenticating to the Neo4j database.
`neo4j.authentication.type`	`basic`	Type of authentication to use with Neo4j (options: none, basic, kerberos, custom, bearer).
`neo4j.relationshipNodesMap`	`false`	Flag to determine if relationship nodes should be mapped.
`neo4j.chase.url`	`bolt://localhost:7687`	URL for storing chase data in Neo4j.
`neo4j.chase.username`	`neo4j`	Username for authenticating to Neo4j for chase storage.
`neo4j.chase.password`	`neo4j`	Password for authenticating to Neo4j for chase storage.
`neo4j.chase.authenticationType`	`basic`	Authentication type for chase storage in Neo4j (same options as general Neo4j config).
`mariadb.platform`	`mariadb`	Platform identifier for MariaDB connection.
`mariadb.url`	`jdbc:mysql://localhost:3306/mariadb`	JDBC URL for connecting to the MariaDB instance.
`mariadb.username`	`mariadb`	MariaDB database username.
`mariadb.password`	`mariadb`	MariaDB database password.
`csv.withHeader`	`false`	Flag indicating whether the CSV file should include a header row.
`mongodb.url`	`mongodb://localhost:27017`	MongoDB connection URL.
`mongodb.username`	`mongo`	MongoDB database username.
`mongodb.password`	`mongo`	MongoDB database password.
`nullGenerationMode`	`UNIQUE_NULLS`	Determines how null values are generated: `UNIQUE_NULLS` generates unique null values, while `SAME_NULLS` generates identical null values.
`sparkConfFile`	`spark-defaults.conf`	Full path to the Spark configuration file.
`optimizationStrategy`	`default`	Optimization strategy for physical plan execution. Options include `default`, `snaJoin`, `sna`, and `noTermination`.
`computeAcceleratorPreference`	`cpu`	Preferred compute accelerator for execution. Options include `cpu` or `gpu` (only available in GPU-enabled environments).
`s3aAaccessKey`	`myAccess`	AWS S3 access key for connecting to S3-compatible storage.
`s3aSecretKey`	`mySecret`	AWS S3 secret key for connecting to S3-compatible storage.
`restService`	`off`	Set this property to `livy` to enable submitting Prometheux jobs via the Livy REST service.
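Several of the properties above accept only a fixed set of values. As a rough sketch, a configuration could be checked against those sets before it is used. The `find_invalid_values` helper is hypothetical, and the check assumes the options listed in the table are exhaustive, which the table does not guarantee for every property.

```python
# Allowed values as listed in the property table (assumed to be exhaustive).
ALLOWED_VALUES = {
    "neo4j.authentication.type": {"none", "basic", "kerberos", "custom", "bearer"},
    "nullGenerationMode": {"UNIQUE_NULLS", "SAME_NULLS"},
    "optimizationStrategy": {"default", "snaJoin", "sna", "noTermination"},
    "computeAcceleratorPreference": {"cpu", "gpu"},
}


def find_invalid_values(properties):
    """Return {property: value} for entries whose value is not a listed option."""
    return {
        name: value
        for name, value in properties.items()
        if name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]
    }


problems = find_invalid_values({
    "nullGenerationMode": "SAME_NULLS",
    "optimizationStrategy": "fast",      # not one of the listed options
    "postgresql.username": "postgres",   # free-form value, so it is not checked
})
```

Only `optimizationStrategy` is reported in `problems`; free-form properties such as credentials and URLs pass through untouched.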

Spark Configuration

For configuring Spark properties, refer to the Apache Spark documentation for detailed explanations. Below are common Spark properties used in Prometheux:

Property Name	Default Value	Description
`appName`	prometheux	Name of the Spark application.
`spark.master`	local[*]	Master URL for the cluster. Options include `local[*]`, `local[Number_of_cores]`, `spark://HOST:PORT`, `yarn`.
`spark.driver.memory`	4g	Maximum memory allocation for the driver.
`spark.driver.maxResultSize`	4g	Maximum total size of the serialized results returned to the driver for each Spark action (e.g., `collect`).
`spark.executor.memory`	4g	Maximum memory allocation per executor.
`spark.submit.deployMode`	client	Deployment mode for the application. Options include `cluster` and `client`.
`spark.executor.instances`	1	Number of executors to launch when dynamic allocation is disabled.
`spark.executor.cores`	4	Number of cores to allocate per executor.
`spark.dynamicAllocation.enabled`	false	Enable or disable dynamic allocation of executors and cores.
`spark.sql.adaptive.enabled`	true	Enable or disable adaptive execution for optimizing shuffle partitions.
`spark.sql.shuffle.partitions`	4	Number of partitions to use when shuffling data for joins or aggregations.
`spark.hadoop.defaultFS`	hdfs://localhost:9000	Hadoop FileSystem URL for distributed storage.
`spark.yarn.stagingDir`	hdfs://localhost:9000/user/	Staging directory used while submitting applications to Yarn.
`spark.hadoop.yarn.resourcemanager.hostname`	localhost	Hostname of the Yarn resource manager.
`spark.hadoop.yarn.resourcemanager.address`	localhost:8032	IP address and port of the Yarn resource manager.
`spark.hadoop.yarn.resourcemanager.scheduler.address`	localhost:8030	IP address and port of the Yarn resource manager scheduler.
`spark.serializer`	org.apache.spark.serializer.KryoSerializer	Spark serializer class for serializing data (Kryo serializer).
`spark.jars`	thirdparty/distributed-add-on/distributed-aggregations-add-on-1.13.5.jar	Path to JARs to distribute (used for Lambda functions).
`spark.local.dir`	tmp	Directory for local scratch space.
`spark.checkpoint.compress`	true	Enable or disable compression for RDD checkpoints.
`spark.shuffle.compress`	true	Enable or disable compression before shuffling data.
`spark.sql.autoBroadcastJoinThreshold`	-1	Maximum table size in bytes for broadcast joins. Set to -1 to disable broadcasting, so that SortMergeJoin is used instead of Broadcast Hash Join.
`spark.hadoop.fs.s3a.impl`	org.apache.hadoop.fs.s3a.S3AFileSystem	AWS S3 implementation for Hadoop FileSystem.
`spark.hadoop.fs.s3a.path.style.access`	true	Enable path-style access for AWS S3 buckets.
`spark.hadoop.fs.s3a.server-side-encryption-algorithm`	AES256	Server-side encryption algorithm for AWS S3.
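As an illustration, the sketch below renders a few overrides in the whitespace-separated `key value` layout used by Spark's `spark-defaults.conf`. It assumes that the file referenced by `sparkConfFile` follows that standard layout; the `render_spark_defaults` helper and the chosen override values are hypothetical.

```python
def render_spark_defaults(properties):
    """Render properties in spark-defaults.conf form: one 'key value' pair per line."""
    lines = []
    for key, value in properties.items():
        # A key containing whitespace would be split in the wrong place when read back.
        if any(ch.isspace() for ch in key):
            raise ValueError(f"property name contains whitespace: {key!r}")
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


# Example overrides for a larger local run (values are illustrative only).
overrides = {
    "spark.master": "local[4]",
    "spark.driver.memory": "8g",
    "spark.sql.shuffle.partitions": 16,
}
conf_text = render_spark_defaults(overrides)
```

The resulting `conf_text` can be written to a file whose full path is then given in `sparkConfFile`.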

Accelerating Processing with GPUs

To accelerate processing using GPUs, Prometheux supports the Spark-RAPIDS plugin, which allows leveraging GPU hardware for enhanced performance in Spark jobs. Configuring Spark-RAPIDS enables you to offload certain SQL operations to GPUs, improving speed and efficiency.

Below are some properties to enable Spark-RAPIDS for GPU acceleration:

Property Name	Default Value	Description
`spark.plugins`	com.nvidia.spark.SQLPlugin	Enables the Spark-RAPIDS SQL plugin to use GPUs.
`spark.kryo.registrator`	com.nvidia.spark.rapids.GpuKryoRegistrator	Enables Kryo serialization for GPU-accelerated data processing.
`spark.rapids.sql.enabled`	true	Enables GPU acceleration for SQL operations.
`spark.rapids.sql.concurrentGpuTasks`	2	Configures the number of concurrent tasks a GPU can run.
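Before enabling GPU execution, it can help to confirm that all four properties above are actually present in the Spark configuration. The following sketch does this for a configuration given as text. It assumes whitespace-separated `key value` lines with `#` comments, and the helper names are hypothetical.

```python
# Property names and values from the table above.
RAPIDS_PROPERTIES = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.kryo.registrator": "com.nvidia.spark.rapids.GpuKryoRegistrator",
    "spark.rapids.sql.enabled": "true",
    "spark.rapids.sql.concurrentGpuTasks": "2",
}


def parse_spark_conf(text):
    """Parse 'key value' lines, skipping blank lines and '#' comments."""
    parsed = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        parsed[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return parsed


def missing_rapids_properties(conf_text):
    """Return the RAPIDS property names from the table that the text does not set."""
    present = parse_spark_conf(conf_text)
    return sorted(name for name in RAPIDS_PROPERTIES if name not in present)


sample = """
# GPU settings
spark.plugins com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled true
"""
missing = missing_rapids_properties(sample)
```

For the sample text, `missing` lists `spark.kryo.registrator` and `spark.rapids.sql.concurrentGpuTasks`, the two properties that still need to be added.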

For more details on how Spark-RAPIDS accelerates Prometheux, see our blog post, Accelerating Neuro-Symbolic AI with RAPIDS and Vadalog Parallel.

Submitting Prometheux Jobs from Anywhere

Prometheux lets you submit jobs from any location, either through Apache Livy or by connecting directly to a Spark cluster. You can therefore manage, monitor, and run data processing tasks remotely or from different clients, whether the target is a local environment or a large-scale cluster.

Configuring Livy REST Service

Prometheux jobs can be submitted via Apache Livy, which acts as a REST interface for interacting with a Spark cluster. Livy allows you to run Spark jobs asynchronously, making it easier to submit jobs, manage sessions, and retrieve results without needing direct access to the cluster.

What is Livy?

Apache Livy is a service that enables easy interaction with Spark clusters over a REST API. It supports submitting Spark jobs from various programming languages like Python, Scala, or Java, and provides a way to monitor and manage Spark applications. Livy simplifies working with Spark for developers who need to interact with Spark clusters without setting up Spark locally on their machines.

For more details on configuring and using Livy, refer to the Livy documentation.
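To make the REST interaction concrete, here is a minimal sketch of a request to Livy's `POST /batches` endpoint, built with the Python standard library. It shows the general shape of a Livy batch submission, not how Prometheux submits jobs internally. The JAR path and class name are hypothetical placeholders, and the URI is the documented default of `livy.uri`.

```python
import json
import urllib.request

LIVY_URI = "http://localhost:8998"  # default value of livy.uri


def build_batch_request(livy_uri, jar_path, class_name, spark_conf):
    """Build (without sending) a POST /batches request for Livy."""
    payload = {"file": jar_path, "className": class_name, "conf": spark_conf}
    return urllib.request.Request(
        url=livy_uri.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_batch_request(
    LIVY_URI,
    "hdfs:///path/to/job.jar",        # hypothetical JAR location
    "com.example.Main",               # hypothetical entry-point class
    {"spark.executor.memory": "4g"},
)

# Sending is left out so the sketch runs without a Livy server.
# With a server available: urllib.request.urlopen(request)
```

Livy replies to a batch submission with a JSON document that includes the batch `id` and `state`, which a client can then poll via `GET /batches/{id}/state`.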

To enable Livy for submitting Prometheux jobs, set the following property: `restService=livy`.

Livy REST Service Configuration

Property Name	Default Value	Description
`livy.uri`	`http://localhost:8998`	URI for connecting to the Livy REST service.
`livy.hdfs.jar.files`		HDFS path for JAR files required by the job.
`livy.hdfs.jar.path`	/home/prometheux/livy	Path where the JAR files are stored, either in `SPARK_HOME/jars` or in HDFS.
`livy.java.security.auth.login.config`	jaas.conf	Java security login configuration for authentication (Kerberos).
`livy.java.security.krb5.conf`	none	Kerberos configuration file.
`livy.sun.security.krb5.debug`	true	Enables debugging for Kerberos authentication.
`livy.javax.security.auth.useSubjectCredsOnly`	true	Java security setting to use subject credentials only.
`livy.session.logSize`	0	The size of the logs to display when retrieving session information.
`livy.shutdownContext`	true	Determines whether to shut down the underlying Spark context after the job is completed.
