Configuring Prometheux
When you configure database properties such as credentials, the settings apply to all instances of that database type (e.g., PostgreSQL, Neo4j, MariaDB, MongoDB). However, if specific configurations are defined via the @bind annotation, those settings override the global configuration and apply only to that specific binding.
This allows flexibility in managing multiple connections or setting up different configurations for various tasks while still maintaining a default global configuration for general usage.
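
The precedence rule described above can be sketched in a few lines of Python. This only illustrates the override semantics and is not how Prometheux implements them: the helper name `effective_settings` and the shape of the override dictionary are assumptions made for the example, and the global values are the PostgreSQL defaults listed in this section.

```python
# Global defaults for one database type, as listed in the property table.
GLOBAL_PROPS = {
    "postgresql.url": "jdbc:postgresql://localhost:5432/postgres",
    "postgresql.username": "postgres",
    "postgresql.password": "postgres",
}


def effective_settings(db_type, global_props, bind_overrides):
    """Return the settings for one binding (hypothetical helper).

    Global values for db_type form the baseline; any key also present in
    bind_overrides replaces the global value for this binding only.
    """
    prefix = db_type + "."
    merged = {
        key[len(prefix):]: value
        for key, value in global_props.items()
        if key.startswith(prefix)
    }
    merged.update(bind_overrides)  # binding-specific settings win
    return merged


# One binding overrides the username; everything else falls back to the globals.
reporting = effective_settings("postgresql", GLOBAL_PROPS, {"username": "analyst"})
```

Because the merge builds a new dictionary, the global configuration stays unchanged and remains the default for every other binding.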
Below is a list of properties that need to be properly configured to ensure correct functionality of Prometheux.
Property Name | Default Value | Description |
---|---|---|
postgresql.url | jdbc:postgresql://localhost:5432/postgres | JDBC URL for connecting to the PostgreSQL database. |
postgresql.username | postgres | PostgreSQL database username. |
postgresql.password | postgres | PostgreSQL database password. |
sqlite.url | jdbc:sqlite:sqlite/testdb.db | SQLite database URL for connection. |
decimal_digits | 3 | Number of digits to use after the decimal point for decimal values. |
neo4j.url | bolt://localhost:7687 | URL for connecting to the Neo4j instance using Bolt protocol. |
neo4j.username | neo4j | Username for authenticating to the Neo4j database. |
neo4j.password | neo4j | Password for authenticating to the Neo4j database. |
neo4j.authentication.type | basic | Type of authentication to use with Neo4j (options: none, basic, kerberos, custom, bearer). |
neo4j.relationshipNodesMap | false | Flag to determine if relationship nodes should be mapped. |
neo4j.chase.url | bolt://localhost:7687 | URL for storing chase data in Neo4j. |
neo4j.chase.username | neo4j | Username for authenticating to Neo4j for chase storage. |
neo4j.chase.password | neo4j | Password for authenticating to Neo4j for chase storage. |
neo4j.chase.authenticationType | basic | Authentication type for chase storage in Neo4j (same options as general Neo4j config). |
mariadb.platform | mariadb | Platform identifier for MariaDB connection. |
mariadb.url | jdbc:mysql://localhost:3306/mariadb | JDBC URL for connecting to the MariaDB instance. |
mariadb.username | mariadb | MariaDB database username. |
mariadb.password | mariadb | MariaDB database password. |
csv.withHeader | false | Flag indicating whether the CSV file should include a header row. |
mongodb.url | mongodb://localhost:27017 | MongoDB connection URL. |
mongodb.username | mongo | MongoDB database username. |
mongodb.password | mongo | MongoDB database password. |
nullGenerationMode | UNIQUE_NULLS | Determines how null values are generated: UNIQUE_NULLS generates unique null values, while SAME_NULLS generates identical null values. |
sparkConfFile | spark-defaults.conf | Full path to the Spark configuration file. |
optimizationStrategy | default | Optimization strategy for physical plan execution. Options include default, snaJoin, sna, and noTermination. |
computeAcceleratorPreference | cpu | Preferred compute accelerator for execution. Options include cpu or gpu (only available in GPU-enabled environments). |
s3aAaccessKey | myAccess | AWS S3 access key for connecting to S3-compatible storage. |
s3aSecretKey | mySecret | AWS S3 secret key for connecting to S3-compatible storage. |
restService | off | Set this property to livy to enable submitting Prometheux jobs via the Livy REST service. |
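
Several of the properties above accept only a fixed set of values, so it can help to check a configuration before starting a job. The sketch below assumes the properties are supplied as `key=value` lines; the file format, the helper names `parse_properties` and `invalid_options`, and the sample input are assumptions for illustration, while the allowed values are taken from the table.

```python
# Allowed values copied from the property table above.
ALLOWED_VALUES = {
    "nullGenerationMode": {"UNIQUE_NULLS", "SAME_NULLS"},
    "optimizationStrategy": {"default", "snaJoin", "sna", "noTermination"},
    "computeAcceleratorPreference": {"cpu", "gpu"},
    "neo4j.authentication.type": {"none", "basic", "kerberos", "custom", "bearer"},
}


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def invalid_options(props):
    """Return the entries whose value is outside the documented options."""
    return {
        key: value
        for key, value in props.items()
        if key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]
    }


sample = """
# hypothetical configuration
postgresql.username=analyst
nullGenerationMode=SAME_NULLS
optimizationStrategy=fastest
"""
props = parse_properties(sample)
problems = invalid_options(props)
```

Here `problems` contains only `optimizationStrategy`, because `fastest` is not one of the documented options.
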
Spark Configuration
For configuring Spark properties, refer to the Apache Spark documentation for detailed explanations. Below are common Spark properties used in Prometheux:
Property Name | Default Value | Description |
---|---|---|
appName | prometheux | Name of the Spark application. |
spark.master | local[*] | Master URL for the cluster. Options include local[*], local[Number_of_cores], spark://HOST:PORT, and yarn. |
spark.driver.memory | 4g | Maximum memory allocation for the driver. |
spark.driver.maxResultSize | 4g | Maximum size of the result produced by the driver. |
spark.executor.memory | 4g | Maximum memory allocation per executor. |
spark.submit.deployMode | client | Deployment mode for the application. Options include cluster and client. |
spark.executor.instances | 1 | Number of executors per worker instance. |
spark.executor.cores | 4 | Number of cores to allocate per executor. |
spark.dynamicAllocation.enabled | false | Enable or disable dynamic allocation of executors and cores. |
spark.sql.adaptive.enabled | true | Enable or disable adaptive execution for optimizing shuffle partitions. |
spark.sql.shuffle.partitions | 4 | Number of partitions to use when shuffling data for joins or aggregations. |
spark.hadoop.defaultFS | hdfs://localhost:9000 | Hadoop FileSystem URL for distributed storage. |
spark.yarn.stagingDir | hdfs://localhost:9000/user/ | Staging directory for storing checkpoints when using Yarn. |
spark.hadoop.yarn.resourcemanager.hostname | localhost | Hostname of the Yarn resource manager. |
spark.hadoop.yarn.resourcemanager.address | localhost:8032 | IP address and port of the Yarn resource manager. |
spark.hadoop.yarn.resourcemanager.scheduler.address | localhost:8030 | IP address and port of the Yarn resource manager scheduler. |
spark.serializer | org.apache.spark.serializer.KryoSerializer | Spark serializer class for serializing data (Kryo serializer). |
spark.jars | thirdparty/distributed-add-on/distributed-aggregations-add-on-1.13.5.jar | Path to JARs to distribute (used for Lambda functions). |
spark.local.dir | tmp | Directory for local scratch space. |
spark.checkpoint.compress | true | Enable or disable compression for RDD checkpoints. |
spark.shuffle.compress | true | Enable or disable compression before shuffling data. |
spark.sql.autoBroadcastJoinThreshold | -1 | Maximum table size for broadcasting in joins. Set to -1 to disable broadcasting, so that SortMergeJoin is used instead of Broadcast Hash Join. |
spark.hadoop.fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem | AWS S3 implementation for Hadoop FileSystem. |
spark.hadoop.fs.s3a.path.style.access | true | Enable path-style access for AWS S3 buckets. |
spark.hadoop.fs.s3a.server-side-encryption-algorithm | AES256 | Server-side encryption algorithm for AWS S3. |
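
As a minimal sketch, overrides for the properties above can be written out in the standard `spark-defaults.conf` layout, where each line holds a property name and its value separated by whitespace. It assumes the file referenced by `sparkConfFile` follows that standard Spark layout; the helper name `render_spark_defaults` and the override values are illustrative only.

```python
def render_spark_defaults(conf):
    """Render a dict as spark-defaults.conf text: one 'key value' pair per line."""
    width = max(len(key) for key in conf)
    lines = [f"{key.ljust(width)}  {value}" for key, value in conf.items()]
    return "\n".join(lines) + "\n"


# Example overrides for a YARN cluster (values chosen for illustration).
overrides = {
    "spark.master": "yarn",
    "spark.submit.deployMode": "cluster",
    "spark.executor.memory": "8g",
    "spark.executor.instances": "4",
}
text = render_spark_defaults(overrides)

# To persist the result (the path is up to you):
# with open("spark-defaults.conf", "w", encoding="utf-8") as fh:
#     fh.write(text)
```

Properties that are not overridden keep the defaults shown in the table.
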
Accelerating Processing with GPUs
To accelerate processing using GPUs, Prometheux supports the Spark-RAPIDS plugin, which allows leveraging GPU hardware for enhanced performance in Spark jobs. Configuring Spark-RAPIDS enables you to offload certain SQL operations to GPUs, improving speed and efficiency.
Below are some properties to enable Spark-RAPIDS for GPU acceleration:
Property Name | Default Value | Description |
---|---|---|
spark.plugins | com.nvidia.spark.SQLPlugin | Enables the Spark-RAPIDS SQL plugin to use GPUs. |
spark.kryo.registrator | com.nvidia.spark.rapids.GpuKryoRegistrator | Enables Kryo serialization for GPU-accelerated data processing. |
spark.rapids.sql.enabled | true | Enables GPU acceleration for SQL operations. |
spark.rapids.sql.concurrentGpuTasks | 2 | Configures the number of concurrent tasks a GPU can run. |
For more details on how Spark-RAPIDS accelerates Prometheux and improves performance, you can refer to our blog post on Accelerating Neuro-Symbolic AI with RAPIDS and Vadalog Parallel.
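
The GPU properties above can be layered on top of an existing Spark configuration, as in the sketch below. Tying them to the `computeAcceleratorPreference` value is an assumption made for this example (Prometheux may apply these settings differently), and `with_accelerator` is a hypothetical helper name.

```python
# Spark-RAPIDS properties and values from the table above.
RAPIDS_PROPS = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.kryo.registrator": "com.nvidia.spark.rapids.GpuKryoRegistrator",
    "spark.rapids.sql.enabled": "true",
    "spark.rapids.sql.concurrentGpuTasks": "2",
}


def with_accelerator(spark_conf, preference):
    """Return a copy of spark_conf, adding the RAPIDS properties for 'gpu'."""
    if preference not in ("cpu", "gpu"):
        raise ValueError(f"unsupported accelerator preference: {preference!r}")
    conf = dict(spark_conf)  # leave the caller's configuration untouched
    if preference == "gpu":
        conf.update(RAPIDS_PROPS)
    return conf


base = {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}
gpu_conf = with_accelerator(base, "gpu")
cpu_conf = with_accelerator(base, "cpu")
```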
Submitting Prometheux Jobs from Anywhere
Prometheux allows you to submit jobs from any location by using a distributed environment like Apache Livy or connecting directly to a Spark cluster. This flexibility ensures that you can manage, monitor, and execute your data processing tasks remotely or from various clients, allowing Prometheux to be integrated into a wide range of workflows and applications. Whether you're running on a local environment or a large-scale cluster, the configuration options ensure that jobs can be executed seamlessly.
Configuring Livy REST Service
Prometheux jobs can be submitted via Apache Livy, which acts as a REST interface for interacting with a Spark cluster. Livy allows you to run Spark jobs asynchronously, making it easier to submit jobs, manage sessions, and retrieve results without needing direct access to the cluster.
What is Livy?
Apache Livy is a service that enables easy interaction with Spark clusters over a REST API. It supports submitting Spark jobs from various programming languages like Python, Scala, or Java, and provides a way to monitor and manage Spark applications. Livy simplifies working with Spark for developers who need to interact with Spark clusters without setting up Spark locally on their machines.
For more details on configuring and using Livy, refer to the Livy documentation.
To enable Livy for submitting Prometheux jobs, set the following property: restService=livy
Livy REST Service Configuration
Property Name | Default Value | Description |
---|---|---|
livy.uri | http://localhost:8998 | URI for connecting to the Livy REST service. |
livy.hdfs.jar.files | | HDFS path for JAR files required by the job. |
livy.hdfs.jar.path | /home/prometheux/livy | Path where the JAR files are stored, either in SPARK_HOME/jars or in HDFS. |
livy.java.security.auth.login.config | jaas.conf | Java security login configuration for authentication (Kerberos). |
livy.java.security.krb5.conf | none | Kerberos configuration file. |
livy.sun.security.krb5.debug | true | Enables debugging for Kerberos authentication. |
livy.javax.security.auth.useSubjectCredsOnly | true | Java security setting to use subject credentials only. |
livy.session.logSize | 0 | The size of the logs to display when retrieving session information. |
livy.shutdownContext | true | Determines whether to shut down the underlying Spark context after the job is completed. |
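
To show what a submission through Livy looks like at the HTTP level, the sketch below builds a request for Livy's batch endpoint (`POST /batches`) without sending it. It does not describe how Prometheux itself calls Livy: the JAR file name, the main class, and the job argument are hypothetical placeholders, and `build_batch_request` is a helper invented for the example. Only the endpoint and the `file`, `className`, `args`, and `conf` fields come from the Livy REST API.

```python
import json
import urllib.request


def build_batch_request(livy_uri, jar_path, main_class, args, spark_conf):
    """Build (but do not send) a POST /batches request for Livy."""
    payload = {
        "file": jar_path,          # application JAR, visible to the cluster
        "className": main_class,   # entry point inside the JAR
        "args": list(args),        # command-line arguments for the job
        "conf": dict(spark_conf),  # Spark properties for this batch
    }
    return urllib.request.Request(
        url=livy_uri.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_batch_request(
    livy_uri="http://localhost:8998",                      # default livy.uri
    jar_path="/home/prometheux/livy/example-job.jar",      # hypothetical JAR name
    main_class="com.example.Main",                         # hypothetical class
    args=["program.vada"],                                 # hypothetical argument
    spark_conf={"spark.executor.memory": "4g"},
)

# Sending it requires a reachable Livy server:
# with urllib.request.urlopen(request) as response:
#     batch = json.load(response)
```

The response to a real submission includes a batch id, which can then be polled with `GET /batches/{id}/state`.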