
Configuring Prometheux

When configuring database properties such as credentials, these settings are applied to all instances of the specific database type (e.g., PostgreSQL, Neo4j, MariaDB, MongoDB). However, if specific configurations are defined via the @bind annotation, those settings override the global configuration and are applied only to that specific binding.

This allows flexibility in managing multiple connections or setting up different configurations for various tasks while still maintaining a default global configuration for general usage.
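
For example, a program could read one source with the global settings and another with connection details supplied directly in the annotation. The sketch below is illustrative only; the basic @bind form follows the Vadalog convention, but the inline parameter syntax in the second annotation is an assumption, so consult the @bind documentation for the exact form:

```
% Uses the global postgresql.* properties for the connection.
@bind("employee", "postgresql", "company_db", "employees").

% Hypothetical per-bind override of the global connection settings.
@bind("contractor", "postgresql url='jdbc:postgresql://hr-host:5432/hr', username='hr_user', password='hr_pass'", "hr_db", "contractors").
```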

Below are the properties that must be configured for Prometheux to function correctly.

| Property Name | Default Value | Description |
| --- | --- | --- |
| postgresql.url | jdbc:postgresql://localhost:5432/postgres | JDBC URL for connecting to the PostgreSQL database. |
| postgresql.username | postgres | PostgreSQL database username. |
| postgresql.password | postgres | PostgreSQL database password. |
| sqlite.url | jdbc:sqlite:sqlite/testdb.db | SQLite database URL for connection. |
| decimal_digits | 3 | Number of digits to use after the decimal point for decimal values. |
| neo4j.url | bolt://localhost:7687 | URL for connecting to the Neo4j instance using the Bolt protocol. |
| neo4j.username | neo4j | Username for authenticating to the Neo4j database. |
| neo4j.password | neo4j | Password for authenticating to the Neo4j database. |
| neo4j.authentication.type | basic | Type of authentication to use with Neo4j (options: none, basic, kerberos, custom, bearer). |
| neo4j.relationshipNodesMap | false | Flag to determine if relationship nodes should be mapped. |
| neo4j.chase.url | bolt://localhost:7687 | URL for storing chase data in Neo4j. |
| neo4j.chase.username | neo4j | Username for authenticating to Neo4j for chase storage. |
| neo4j.chase.password | neo4j | Password for authenticating to Neo4j for chase storage. |
| neo4j.chase.authenticationType | basic | Authentication type for chase storage in Neo4j (same options as the general Neo4j configuration). |
| mariadb.platform | mariadb | Platform identifier for the MariaDB connection. |
| mariadb.url | jdbc:mysql://localhost:3306/mariadb | JDBC URL for connecting to the MariaDB instance. |
| mariadb.username | mariadb | MariaDB database username. |
| mariadb.password | mariadb | MariaDB database password. |
| csv.withHeader | false | Flag indicating whether the CSV file should include a header row. |
| mongodb.url | mongodb://localhost:27017 | MongoDB connection URL. |
| mongodb.username | mongo | MongoDB database username. |
| mongodb.password | mongo | MongoDB database password. |
| nullGenerationMode | UNIQUE_NULLS | Determines how null values are generated: UNIQUE_NULLS generates unique null values, while SAME_NULLS generates identical null values. |
| sparkConfFile | spark-defaults.conf | Full path to the Spark configuration file. |
| optimizationStrategy | default | Optimization strategy for physical plan execution. Options include default, snaJoin, sna, and noTermination. |
| computeAcceleratorPreference | cpu | Preferred compute accelerator for execution. Options include cpu or gpu (only available in GPU-enabled environments). |
| s3aAccessKey | myAccess | AWS S3 access key for connecting to S3-compatible storage. |
| s3aSecretKey | mySecret | AWS S3 secret key for connecting to S3-compatible storage. |
| restService | off | Set this property to livy to enable submitting Prometheux jobs via the Livy REST service. |
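
For example, a configuration that overrides a few of these defaults might look like the following (a sketch assuming a standard Java-style properties file; hostnames and credentials are placeholders):

```properties
postgresql.url=jdbc:postgresql://db.internal:5432/analytics
postgresql.username=analyst
postgresql.password=changeMe
neo4j.url=bolt://graph.internal:7687
decimal_digits=5
restService=off
```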

Spark Configuration

For configuring Spark properties, refer to the Apache Spark documentation for detailed explanations. Below are common Spark properties used in Prometheux:

| Property Name | Default Value | Description |
| --- | --- | --- |
| appName | prometheux | Name of the Spark application. |
| spark.master | local[*] | Master URL for the cluster. Options include local[*], local[Number_of_cores], spark://HOST:PORT, yarn. |
| spark.driver.memory | 4g | Maximum memory allocation for the driver. |
| spark.driver.maxResultSize | 4g | Maximum size of the result produced by the driver. |
| spark.executor.memory | 4g | Maximum memory allocation per executor. |
| spark.submit.deployMode | client | Deployment mode for the application. Options include cluster and client. |
| spark.executor.instances | 1 | Number of executors per worker instance. |
| spark.executor.cores | 4 | Number of cores to allocate per executor. |
| spark.dynamicAllocation.enabled | false | Enable or disable dynamic allocation of executors and cores. |
| spark.sql.adaptive.enabled | true | Enable or disable adaptive execution for optimizing shuffle partitions. |
| spark.sql.shuffle.partitions | 4 | Number of partitions to use when shuffling data for joins or aggregations. |
| spark.hadoop.defaultFS | hdfs://localhost:9000 | Hadoop FileSystem URL for distributed storage. |
| spark.yarn.stagingDir | hdfs://localhost:9000/user/ | Staging directory for storing checkpoints when using YARN. |
| spark.hadoop.yarn.resourcemanager.hostname | localhost | Hostname of the YARN resource manager. |
| spark.hadoop.yarn.resourcemanager.address | localhost:8032 | Address and port of the YARN resource manager. |
| spark.hadoop.yarn.resourcemanager.scheduler.address | localhost:8030 | Address and port of the YARN resource manager scheduler. |
| spark.serializer | org.apache.spark.serializer.KryoSerializer | Spark serializer class for serializing data (Kryo serializer). |
| spark.jars | thirdparty/distributed-add-on/distributed-aggregations-add-on-1.13.5.jar | Path to JARs to distribute (used for Lambda functions). |
| spark.local.dir | tmp | Directory for local scratch space. |
| spark.checkpoint.compress | true | Enable or disable compression for RDD checkpoints. |
| spark.shuffle.compress | true | Enable or disable compression before shuffling data. |
| spark.sql.autoBroadcastJoinThreshold | -1 | Set to -1 to use SortMergeJoin instead of Broadcast Hash Join. |
| spark.hadoop.fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem | AWS S3 implementation for the Hadoop FileSystem. |
| spark.hadoop.fs.s3a.path.style.access | true | Enable path-style access for AWS S3 buckets. |
| spark.hadoop.fs.s3a.server-side-encryption-algorithm | AES256 | Server-side encryption algorithm for AWS S3. |
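
For instance, a spark-defaults.conf tuned for a small standalone cluster might contain entries such as these (the master URL and memory sizes are illustrative, not recommendations):

```properties
spark.master                  spark://spark-master:7077
spark.submit.deployMode       client
spark.driver.memory           4g
spark.executor.memory         8g
spark.executor.cores          4
spark.sql.shuffle.partitions  16
spark.serializer              org.apache.spark.serializer.KryoSerializer
```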

Accelerating Processing with GPUs

To accelerate processing using GPUs, Prometheux supports the Spark-RAPIDS plugin, which allows leveraging GPU hardware for enhanced performance in Spark jobs. Configuring Spark-RAPIDS enables you to offload certain SQL operations to GPUs, improving speed and efficiency.

Below are some properties to enable Spark-RAPIDS for GPU acceleration:

| Property Name | Default Value | Description |
| --- | --- | --- |
| spark.plugins | com.nvidia.spark.SQLPlugin | Enables the Spark-RAPIDS SQL plugin to use GPUs. |
| spark.kryo.registrator | com.nvidia.spark.rapids.GpuKryoRegistrator | Enables Kryo serialization for GPU-accelerated data processing. |
| spark.rapids.sql.enabled | true | Enables GPU acceleration for SQL operations. |
| spark.rapids.sql.concurrentGpuTasks | 2 | Configures the number of concurrent tasks a GPU can run. |
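
In spark-defaults.conf form, enabling the plugin might look like this (a sketch that assumes the Spark-RAPIDS jar is already available on the cluster's classpath):

```properties
spark.plugins                        com.nvidia.spark.SQLPlugin
spark.kryo.registrator               com.nvidia.spark.rapids.GpuKryoRegistrator
spark.rapids.sql.enabled             true
spark.rapids.sql.concurrentGpuTasks  2
```

When running in a GPU-enabled environment, the Prometheux property computeAcceleratorPreference can also be set to gpu (see the table above).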

For more details on how Spark-RAPIDS accelerates Prometheux and improves performance, you can refer to our blog post on Accelerating Neuro-Symbolic AI with RAPIDS and Vadalog Parallel.

Submitting Prometheux Jobs from Anywhere

Prometheux allows you to submit jobs from any location, either through a distributed environment like Apache Livy or by connecting directly to a Spark cluster. This flexibility lets you manage, monitor, and execute data processing tasks remotely or from various clients, so Prometheux can be integrated into a wide range of workflows and applications. Whether you are running in a local environment or on a large-scale cluster, the configuration options ensure that jobs execute seamlessly.

Configuring Livy REST Service

Prometheux jobs can be submitted via Apache Livy, which acts as a REST interface for interacting with a Spark cluster. Livy allows you to run Spark jobs asynchronously, making it easier to submit jobs, manage sessions, and retrieve results without needing direct access to the cluster.

What is Livy?

Apache Livy is a service that enables easy interaction with Spark clusters over a REST API. It supports submitting Spark jobs from various programming languages like Python, Scala, or Java, and provides a way to monitor and manage Spark applications. Livy simplifies working with Spark for developers who need to interact with Spark clusters without setting up Spark locally on their machines.

For more details on configuring and using Livy, refer to the Livy documentation.
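
Prometheux performs the submission for you when restService=livy is set, but as background, a raw batch submission against Livy's REST API looks roughly like the following Python sketch (the jar path and class name are placeholders, not actual Prometheux artifacts):

```python
import requests

LIVY_URI = "http://localhost:8998"  # should match the livy.uri property

# Submit a batch job to the Spark cluster through Livy's REST API.
# The jar path and class name below are placeholders for illustration.
payload = {
    "file": "hdfs:///user/prometheux/jobs/my-job.jar",
    "className": "com.example.MyJob",
}
resp = requests.post(f"{LIVY_URI}/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll the batch state; Livy reports states such as "running" and "success".
state = requests.get(f"{LIVY_URI}/batches/{batch_id}/state").json()["state"]
print(f"Batch {batch_id} is {state}")
```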

To enable Livy for submitting Prometheux jobs, set the following property: restService=livy.

Livy REST Service Configuration

| Property Name | Default Value | Description |
| --- | --- | --- |
| livy.uri | http://localhost:8998 | URI for connecting to the Livy REST service. |
| livy.hdfs.jar.files | | HDFS path for JAR files required by the job. |
| livy.hdfs.jar.path | /home/prometheux/livy | Path where the JAR files are stored, either in SPARK_HOME/jars or in HDFS. |
| livy.java.security.auth.login.config | jaas.conf | Java security login configuration for authentication (Kerberos). |
| livy.java.security.krb5.conf | none | Kerberos configuration file. |
| livy.sun.security.krb5.debug | true | Enables debugging for Kerberos authentication. |
| livy.javax.security.auth.useSubjectCredsOnly | true | Java security setting to use subject credentials only. |
| livy.session.logSize | 0 | The size of the logs to display when retrieving session information. |
| livy.shutdownContext | true | Determines whether to shut down the underlying Spark context after the job is completed. |
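
Putting it together, a minimal Livy setup in the Prometheux properties might look like this (the URI and path are illustrative):

```properties
restService=livy
livy.uri=http://livy-server:8998
livy.hdfs.jar.path=/home/prometheux/livy
livy.shutdownContext=true
```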