Skip to main content

Connecting to Databases and External Datasources

With connectors available for databases and support for data sources, Prometheux can be used to seamlessly integrate and migrate data across various platforms. Moreover, it supports cloud and distributed file systems like S3 and HDFS, providing the flexibility needed for modern data lake and data migration scenarios.

@bind options

The bind command allows for the configuration of reading from and writing to database and datasources. The syntax is as follows:

@bind("concept_name",
"datasource_type option_1 = 'value_1', option_2 = 'value_2', …, option_n = 'value_n'",
"database_name",
"table_name").
Syntax Rules

Critical syntax requirements for @bind annotations:

  1. Options must be comma-separated: option1='value1', option2='value2'
  2. Values must be quoted: host='localhost' not host=localhost
  3. Use username= not user= for database authentication
  4. Must end with a dot: @bind(...).

✅ Correct PostgreSQL example:

@bind("employees", "postgresql host='localhost', port=5432, username='postgres', password='mypass'", "company_db", "employees").

❌ Common mistakes:

@bind("employees", "postgresql host=localhost port=5432 user=postgres password=mypass", "company_db", "employees"). // Missing commas and quotes
@bind("employees", "postgresql", "host='localhost', port=5432", "employees") // Wrong parameter placement

where datasource_type should be one of:

  • csv for CSV files
  • parquet for Parquet files
  • excel for Excel files
  • json for JSON files
  • cobol for legacy COBOL / EBCDIC data files (with copybook)
  • postgresql for PostgreSQL databases
  • neo4j for Neo4j databases
  • db2 for DB2 databases
  • mariadb for MariaDB databases
  • oracle for Oracle databases
  • sqlite for SQLite databases
  • mysql for MySQL databases
  • sqlserver for SQL Server databases
  • h2 for H2 databases
  • sybase for Sybase databases
  • teradata for Teradata databases
  • redshift for Redshift databases
  • bigquery for Google BigQuery
  • hive for Hive
  • presto for Presto
  • snowflake for Snowflake
  • databricks for Databricks
  • dynamodb for Amazon DynamoDB
  • qdrant for Qdrant vector database
  • api for consuming data via API
  • text for consuming data from plain text files
  • binary for consuming data from binary files (PDF, images, etc)

And the available configuration options are:

  • url: URL to use for the database connection (e.g. jdbc:postgresql://localhost:5432/prometheux)
  • protocol: Protocol to use for the database connection (e.g. jdbc, odbc, jdbc-odbc, bolt)
  • host: Database host (e.g. localhost)
  • port: Database port (e.g., 5432 for postgres, 7678 for neo4j)
  • database: Database name (e.g. prometheux)
  • username: Username to login with.
  • password: Password to login with.

Configuring credential access

Sensitive credentials such as database connection details (e.g., username, password) or AWS credentials (e.g., accessKey, secretKey) can be specified directly as options in the @bind annotations. Example:

@bind("concept_name",
"datasource_type option_1 = 'value_1', option_2 = 'value_2', …, option_n = 'value_n'",
"database_name",
"table_name").

This method allows for streamlined integration within the same code, ensuring that each datasource has its necessary credentials attached during the binding process.

However, for better security and flexibility, these credentials can also be stored in external configurations:

  • Credentials can be stored the px.properties file to centralize sensitive information and allow reusability without hardcoding values within the @bind annotations.
  • REST APIs for dynamic configuration management, where you can set individual credentials or update multiple settings at once through API endpoints.

CSV Datasource

Prometheux supports CSV files as data source, both for reading and writing.

The default CSV binding ("csv") is thus suitable for processing big CSV files. It does not make a guess about the input schema. Therefore, if no schema (@mapping) is provided, all fields are treated as strings. Values \N are treated as null values and interpreted as Labelled Nulls while reading the CSV file.

@bind options

The bind command allows for the configuration of reading and writing csv files. The syntax is as follows:

@bind("relation",
"csv option_1 = 'value_1', option_2 = 'value_2', …, option_n = 'value_n'",
"filepath",
"filename").

The options that are available are

  • useHeaders: Values can be true or false, depending on whether a header is available/output.
  • delimiter: Specifies the character that is used to separate single entries
  • recordSeparator: Specifies how the record are seperated
  • quoteMode: Defines quoting behavior. Possible values are:
    • all: Quotes all fields.
    • minimal: Quotes fields which contain special characters such as a the field delimiter, quote character or any of the characters in the line separator string.
    • non_numeric: Quotes all non-numeric fields.
    • none: Never quotes fields.
  • nullString: Value can be any string, which replaces null values in the csv file.
  • multiline: Handles multi-line fields. When multiline is set to true it handles fields that span multiple lines correctly
  • coalesce: Unifies the output in one partition. The output will be a single CSV file instead of partitioned CSV files. Supported only in standalone environments.

When specifying a configuaration, any subset of these options can be set at the same time. Each value has be surrounded by single quotation marks 'value'. An example for a csv bind command with configuration would be the following:

@bind("relation",
"csv useHeaders=true, delimiter='\t', recordSeparator='\n'",
"filepath",
"filename").

Examples

In the examples below we use a sample CSV file with the following content:

a,b,c
d,e,f

Simply reading a csv file into a relation:

@bind("myCsv", "csv", "/path_to_csv/folder", "csv_name.csv").
myAtom(X,Y,Z) <- myCsv(X,Y,Z).
@output("myAtom").

We can also map the columns:

@bind("myCsv", "csv useHeaders=true", "/path_to_csv/folder", "csv_name.csv").
@mapping("myCsv", 0, "var1", "string").
@mapping("myCsv", 1, "var2", "string").
@mapping("myCsv", 2, "var3", "string").

myAtom(X,Y,Z) <- myCsv(X,Y,Z).
@output("myAtom").

Selecting columns and filtering with SQL

To select specific columns or filter rows, bind the full CSV with useHeaders=true and use a rule with a SQL body over the bound predicate. The SQL query can reference column names directly from the CSV header:

@bind("users", "csv useHeaders=true", "/path_to_csv/folder", "users.csv").

% Select only Name, Surname and Age where Age > 10
filtered_users() <- SELECT Name, Surname, Age FROM users WHERE Age > 10.
@output("filtered_users").

To store results into another CSV file, bind the output predicate:

@bind("users", "csv useHeaders=true", "/path_to_csv/folder", "users.csv").

% Filter users using SQL
young_users() <- SELECT Name, Surname, Age FROM users WHERE Age < 25.

@output("young_users").
@bind("young_users", "csv", "/another_csv_path/folder/output", "young_users").

Parquet Datasource

Parquet is a columnar storage format optimized for both storage efficiency and query performance, making it well-suited for large-scale data lake scenarios. In this section, we'll explore how Parquet files can be integrated into Vadalog workflows, using the provided example:

% Bind the 'shipping_parquet' concept to a Parquet file located in the specified directory
% The data source type is 'parquet', and the file is 'shipping.parquet' in the 'disk/data/input_files' folder
@bind("shipping_parquet", "parquet", "disk/data/input_files", "shipping.parquet").

% Define a rule that extracts the 'OrderId' and 'ShippingDate' from the 'shipping_parquet' data
% The 'shipping_parquet_test' concept is created from the data in the 'shipping_parquet' input
shipping_parquet_test(OrderId, ShippingDate) <- shipping_parquet(OrderId, ShippingDate).

% Declare the output concept 'shipping_parquet_test', making the processed data available for output
@output("shipping_parquet_test").

Excel Datasource

Excel files are widely used for tabular data storage and exchange in business and analytics workflows. Prometheux can easily read from and write to Excel files, integrating them seamlessly into its data processing workflows.

In this example, we will read data from a CSV file and populate an Excel file with the extracted data.

% Bind the 'shipping_excel_csv' concept to the CSV file 'shipping_data_excel.csv' located in the 'disk/data/generated_data' directory
% Use 'useHeaders=true' to indicate that the first row contains column headers
@bind("shipping_excel_csv", "csv useHeaders=true", "disk/data/generated_data", "shipping_data_excel.csv").

% Model definition for the 'shipping_excel' concept, specifying the data types for OrderId (int) and ShippingDate (date)
@model("shipping_excel", "['OrderId:int','ShippingDate:date']").

% Rule to populate the 'shipping_excel' concept by extracting OrderId and ShippingDate from 'shipping_excel_csv'
shipping_excel(OrderId, ShippingDate) <- shipping_excel_csv(OrderId, ShippingDate).

% Bind the 'shipping_excel' concept to an Excel file 'shipping.xls' in the 'disk/data/input_files' directory
% Use 'useHeaders=true' to indicate that column headers should be included in the Excel file
@bind("shipping_excel", "excel useHeaders=true", "disk/data/input_files", "shipping.xls").

% Declare the output concept 'shipping_excel' for writing to the Excel file
@output("shipping_excel").

In this example, we will read data from the previously populated Excel file.

% Bind the 'shipping_excel' concept to the Excel file 'shipping.xls' located in 'disk/data/input_files'
% Use 'useHeaders=true' to indicate that the first row contains column headers
@bind("shipping_excel", "excel useHeaders=true", "disk/data/input_files", "shipping.xls").

% Define a rule that extracts OrderId and ShippingDate from 'shipping_excel'
shipping_excel_test(OrderId, ShippingDate) <- shipping_excel(OrderId, ShippingDate).

% Declare the output concept 'shipping_excel_test' for making the processed data available
@output("shipping_excel_test").

In this example, we will read data from a specific sheet of an Excel file.

% Bind the 'shipping_excel_sheet' concept to the Excel file 'shipping.xls' located in 'disk/data/input_files'
% Use 'useHeaders=true' to indicate that the first row contains column headers
@bind("shipping_excel_sheet", "excel useHeaders=true, dataAddress=''Sheet1'!A1'", "disk/data/input_files", "shipping.xls").

% Define a rule that extracts OrderId and ShippingDate from 'shipping_excel_sheet'
shipping_excel_sheet_test(OrderId, ShippingDate) <- shipping_excel_sheet(OrderId, ShippingDate).

% Declare the output concept 'shipping_excel_sheet_test' for making the processed data available
@output("shipping_excel_sheet_test").

JSON Datasource

JSON (JavaScript Object Notation) is a lightweight data interchange format that is widely used for APIs, configuration files, and structured data storage. Prometheux supports reading JSON files and querying their nested structures using two powerful approaches: SQL queries in rule bodies and the struct:get function for accessing nested fields.

When Prometheux reads JSON files, nested objects are automatically inferred as struct types, allowing you to access nested fields using dot notation in SQL queries.

Example JSON Structure

Throughout this section, we'll use examples based on an e-commerce orders JSON file with the following structure:

[
{
"order_id": "ORD-2025-001",
"order_date": "2025-01-15",
"customer": {
"customer_id": "CUST-123",
"name": "Alice Johnson",
"email": "alice@example.com"
},
"shipping_address": {
"street": "123 Main St",
"city": "Boston",
"state": "MA",
"zip": "02101"
},
"items": [
{"product": "Laptop", "quantity": 1, "price": 1299.99},
{"product": "Mouse", "quantity": 2, "price": 29.99}
],
"total_amount": 1359.97,
"status": "shipped"
}
]
Understanding JSON Array Handling

Root-level arrays vs Nested arrays:

  • Root-level JSON array (like the example above): Each array element becomes a separate row automatically. No collections:explode needed.
  • Nested array field (e.g., {"data": [item1, item2, ...]}): The array becomes a single column value. Use collections:explode to convert array elements into multiple rows.

Example of nested array requiring explode:

{
"items": [
{"product": "Laptop", "price": 1299.99},
{"product": "Mouse", "price": 29.99}
]
}

This would give you 1 row with items as an array column. To get multiple rows (one per product), use collections:explode.

Simple Example: Reading JSON Files

This example demonstrates reading a JSON file without any query:

% Bind the 'orders' concept to a JSON file
@bind("orders", "json", "data/orders", "orders.json").

% Access the data in Vadalog rules
all_orders(OrderId, CustomerName, Total) <- orders(OrderId, _, CustomerName, _, _, Total, _).

@output("all_orders").

Accessing Nested Fields with SQL in Rule Bodies

The recommended approach for querying JSON data is to use SQL directly in rule bodies. This allows you to process nested structures using dot notation:

% Bind to a JSON file containing order data
@bind("orders", "json", "data/orders", "orders.json").

% Use SQL in the rule body to query nested fields
% Access nested 'customer' struct fields using dot notation
customer_orders() <- SELECT order_id,
customer.name AS customer_name,
customer.email AS email,
total_amount
FROM orders
WHERE status = 'shipped'.

@output("customer_orders").

SQL Queries with Filtering and Aggregation

JSON datasources support full SQL capabilities including WHERE clauses, aggregations, and grouping:

% Example 1: Filter by specific customer
@bind("orders", "json", "data/orders", "orders.json").

customer_order_history() <- SELECT order_id, order_date, total_amount
FROM orders
WHERE customer.customer_id = 'CUST-123'.

@output("customer_order_history").
% Example 2: Aggregation with GROUP BY
@bind("orders", "json", "data/orders", "orders.json").

orders_by_city() <- SELECT shipping_address.city AS city,
COUNT(*) AS order_count,
SUM(total_amount) AS total_revenue
FROM orders
GROUP BY shipping_address.city.

@output("orders_by_city").
% Example 3: Complex WHERE conditions with multiple nested fields
@bind("orders", "json", "data/orders", "orders.json").

high_value_orders() <- SELECT order_id,
customer.name AS customer_name,
shipping_address.city AS city,
total_amount
FROM orders
WHERE total_amount > 1000
AND status = 'shipped'
AND shipping_address.state = 'MA'.

@output("high_value_orders").

SQL Queries with Ordering and Distinct

% Example: ORDER BY date
@bind("orders", "json", "data/orders", "orders.json").

recent_orders() <- SELECT order_id, customer.name AS customer_name, order_date, total_amount
FROM orders
ORDER BY order_date DESC.

@output("recent_orders").
% Example: DISTINCT values to find unique cities
@bind("orders", "json", "data/orders", "orders.json").

shipping_cities() <- SELECT DISTINCT shipping_address.city AS city,
shipping_address.state AS state
FROM orders.

@output("shipping_cities").

Using Parameters in SQL Queries

You can use @param to parameterize your JSON queries:

% Define parameters for filtering
@param("min_amount", "500").
@param("target_state", "CA").

@bind("orders", "json", "data/orders", "orders.json").

% Use parameters in the WHERE clause
filtered_orders() <- SELECT order_id, customer.name AS customer_name, total_amount
FROM orders
WHERE total_amount > ${min_amount}
AND shipping_address.state = '${target_state}'.

@output("filtered_orders").

Alternative: Using struct:get Function

For accessing individual nested fields in Vadalog rules (without SQL), you can use the struct:get function. Note that the variable must be on the left side of the assignment:

@bind("orders", "json", "data/orders", "orders.json").

% Extract nested fields using struct:get
% Syntax: Variable = struct:get("fieldName", StructField)
order_customers(OrderId, CustomerName, Email) <-
orders(OrderId, _, Customer, _, _, _, _),
CustomerName = struct:get("name", Customer),
Email = struct:get("email", Customer).

@output("order_customers").

Example: Filtering with struct:get

@bind("orders", "json", "data/orders", "orders.json").

% Filter orders by status using struct:get
shipped_orders(OrderId, CustomerName) <-
orders(OrderId, _, Customer, _, _, _, Status),
OrderId = struct:get("order_id", orders),
CustomerName = struct:get("name", Customer),
Status = "shipped".

@output("shipped_orders").

Example: Accessing Multiple Nested Structs

@bind("orders", "json", "data/orders", "orders.json").

% Access both 'customer' and 'shipping_address' structs
order_shipping_info(OrderId, CustomerName, City, State) <-
orders(_, _, Customer, ShippingAddress, _, _, _),
OrderId = struct:get("order_id", orders),
CustomerName = struct:get("name", Customer),
City = struct:get("city", ShippingAddress),
State = struct:get("state", ShippingAddress).

@output("order_shipping_info").

Using the query Option in @bind

Alternatively, you can specify SQL queries directly in the @bind annotation using the query option:

% Query specified in the bind annotation
@bind("high_value_orders",
"json query='SELECT order_id, customer.name AS customer_name, total_amount FROM high_value_orders WHERE total_amount > 1000'",
"data/orders",
"orders.json").

result(OrderId, CustomerName, Amount) <- high_value_orders(OrderId, CustomerName, Amount).

@output("result").

Real-World Example: E-Commerce Analytics

This example demonstrates a complete analytics workflow for processing order data from JSON files:

% Bind to order data from JSON
@bind("orders", "json", "data/orders", "orders.json").

% Extract high-value customers with detailed order information
high_value_customers() <-
SELECT
customer.customer_id AS customer_id,
customer.name AS customer_name,
customer.email AS email,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_spent,
AVG(total_amount) AS avg_order_value
FROM orders
WHERE status = 'shipped'
GROUP BY customer.customer_id, customer.name, customer.email
HAVING SUM(total_amount) > 5000
ORDER BY total_spent DESC.

@output("high_value_customers").

% Analyze orders by geographic region
regional_sales_summary() <-
SELECT
shipping_address.state AS state,
shipping_address.city AS city,
COUNT(*) AS order_count,
SUM(total_amount) AS revenue,
AVG(total_amount) AS avg_order_value
FROM orders
WHERE status IN ('shipped', 'delivered')
GROUP BY shipping_address.state, shipping_address.city
HAVING COUNT(*) > 10.

@output("regional_sales_summary").

% Identify pending orders that need attention
pending_orders_alert() <-
SELECT
order_id,
customer.name AS customer_name,
customer.email AS contact_email,
order_date,
total_amount
FROM orders
WHERE status = 'pending'
AND order_date < '2025-01-01'.

@output("pending_orders_alert").

COBOL Datasource

Prometheux can read legacy COBOL / EBCDIC data files alongside the other file‑based datasources, so mainframe extracts (VSAM dumps, flat record files, variable‑length RDW/BDW streams) can be joined and transformed with the same Vadalog rules you already use for CSV, Parquet or JSON data. The connector is powered by the AbsaOSS Cobrix Spark data source, so it fits naturally into Spark‑native Prometheux deployments.

A COBOL binding always needs two inputs:

  1. a data file (binary, addressed by <filepath> + <filename>, exactly as for CSV or Parquet),
  2. a copybook describing the record layout — passed via the copybook option as a path to a .cpy file on the same file system (local, HDFS, or S3 via s3a://).
Why the copybook is always a path

The engine‑side @bind tokenizer strips \n and \t characters from the option block and does not support escaped single quotes, which makes multi‑line copybook bodies and VALUE 'A' clauses structurally impossible to embed inline. Keep the copybook as a file next to (or accessible from) the data file and reference it by path.

@bind options

@bind("relation",
"cobol option_1='value_1', option_2='value_2', …, option_n='value_n'",
"filepath",
"filename").

Supported options:

  • copybook: required, path to the COBOL copybook file describing the record layout (e.g. /data/layouts/customer_master.cpy).
  • cobolPreset: shortcut for common mainframe framings. One of:
    • flat-fixed-ebcdic (default) — fixed‑length records, EBCDIC encoding (IBM037 / IBM1140). Matches most VSAM ESDS and flat dump exports.
    • flat-fixed-ascii — fixed‑length records, ASCII encoding. Used when a vendor has already transcoded the extract.
    • mainframe-v — IBM variable‑length records with a 4‑byte big‑endian RDW header.
    • mainframe-vb — IBM variable‑blocked: BDW‑prefixed blocks of RDW‑prefixed records.
    • custom — no preset; every Cobrix flag is supplied via cobrix.* options.
  • encoding: override the encoding inferred by the preset. Typical values are ebcdic (with ebcdic_code_page='common', 'cp037', 'cp1140', …) or ascii.
  • record_format: Cobrix record format. F for fixed, V for variable, VB for variable‑blocked.
  • cobolFlattenPolicy: how nested COBOL groups (e.g. CUST-ADDRESS in the example below) are exposed in the Vadalog schema.
    • dotted (default) — nested groups are flattened and leaf columns are prefixed with their parent group name (CUST_ADDRESS_CUST_CITY).
    • keepNested — nested structs are preserved and exposed as map values in the Vadalog predicate, consumed via struct:get.
  • cobolFileExtensions: optional comma‑separated extension filter when the filepath points to a directory (e.g. .dat,.bin).
  • cobrix.<flag>: passthrough escape hatch — any option starting with cobrix. is forwarded to the underlying Cobrix reader verbatim (e.g. cobrix.is_rdw_big_endian='true'). Useful when a preset does not cover a particular vendor quirk.

Example: reading customer_master.dat

The copybook describes a 119‑byte fixed‑length record with a nested address group, COMP‑3 packed decimals, a COMP binary counter, and zoned‑decimal identifiers — a realistic shape for a customer master extract:

01 CUSTOMER-REC.
05 CUST-ID PIC 9(8).
05 CUST-NAME PIC X(25).
05 CUST-STATUS PIC X.
88 STATUS-ACTIVE VALUE 'A'.
88 STATUS-INACTIVE VALUE 'I'.
05 CUST-ADDRESS.
10 CUST-STREET PIC X(30).
10 CUST-CITY PIC X(20).
10 CUST-POSTCODE PIC X(10).
10 CUST-COUNTRY PIC X(3).
05 CUST-BALANCE PIC S9(9)V99 COMP-3.
05 CUST-CREDIT-LIMIT PIC S9(9)V99 COMP-3.
05 CUST-REG-DATE PIC 9(8).
05 CUST-ORDER-COUNT PIC 9(4) COMP.

Given the copybook customer_master.cpy and the binary extract customer_master.dat sitting in /data/cobol/, the binding below reads the records, flattens the CUST-ADDRESS group into top‑level columns, and projects the customers registered in Italy:

% Bind to the EBCDIC extract. The preset covers encoding, record_format=F,
% and schema_retention_policy so the copybook's 01-level name is retained
% as a prefix on the flattened columns.
@bind("customer_master",
"cobol copybook='/data/cobol/customer_master.cpy', cobolPreset='flat-fixed-ebcdic'",
"/data/cobol",
"customer_master.dat").

% The nested CUST-ADDRESS group is flattened with dotted naming, so every
% leaf column (CUST_ADDRESS_CUST_CITY, CUST_ADDRESS_CUST_COUNTRY, …) is
% addressable as a plain Vadalog variable.
italian_customers(Id, Name, Status, City, Balance) <-
SELECT CUST_REC_CUST_ID AS id,
CUST_REC_CUST_NAME AS name,
CUST_REC_CUST_STATUS AS status,
CUST_REC_CUST_ADDRESS_CUST_CITY AS city,
CUST_REC_CUST_BALANCE AS balance
FROM customer_master
WHERE CUST_REC_CUST_ADDRESS_CUST_COUNTRY = 'ITA'
AND CUST_REC_CUST_STATUS = 'A'.

@output("italian_customers").

Example: variable‑length mainframe file

For extracts shipped with IBM's variable‑length framing (RDW‑prefixed records), switch the preset:

@bind("transactions",
"cobol copybook='/data/cobol/transactions.cpy', cobolPreset='mainframe-v'",
"/data/cobol",
"transactions.bin").

high_value_tx(Id, Amount) <-
transactions(Id, _, _, Amount, _),
Amount > 10000.

@output("high_value_tx").

Example: advanced Cobrix overrides

When the file uses a non‑standard framing (e.g. RDW whose length field is not part of the record length, or a big‑endian BDW with non‑zero adjustment), drop down to the custom preset and forward the Cobrix flags unchanged via the cobrix.* passthrough:

@bind("legacy_stream",
"cobol copybook='/data/cobol/legacy.cpy',
cobolPreset='custom',
encoding='ebcdic',
record_format='VB',
cobrix.is_rdw_big_endian='true',
cobrix.rdw_adjustment='-4',
cobrix.bdw_adjustment='-4',
cobrix.is_rdw_part_of_record_length='false'",
"s3a://my-legacy-bucket/cobol",
"legacy.bin").

result(Id, Code) <- legacy_stream(Id, Code, _).
@output("result").

Keeping nested groups as structs

If you prefer to query nested groups with struct:get (as you already do for JSON), flip the flatten policy:

@bind("customer_master",
"cobol copybook='/data/cobol/customer_master.cpy',
cobolPreset='flat-fixed-ebcdic',
cobolFlattenPolicy='keepNested'",
"/data/cobol",
"customer_master.dat").

% CUST-ADDRESS is now a struct column; reach into it with struct:get.
customer_city(Id, City) <-
customer_master(Id, _, _, Address, _, _, _, _),
City = struct:get("CUST_CITY", Address).

@output("customer_city").

Querying COBOL data with SQL

Like any file-based datasource, COBOL binds support SQL queries via the <- operator. The COBOL connector reads the binary file into a Spark Dataset, and Spark SQL executes the query in memory. This is useful for previewing, paginating, counting, or filtering mainframe data without writing COBOL-specific logic:

@bind("customer_master",
"cobol copybook='/data/cobol/customer_master.cpy', cobolPreset='flat-fixed-ebcdic'",
"/data/cobol",
"customer_master.dat").

% Preview the first 10 records
customer_preview() <- SELECT * FROM customer_master LIMIT 10 OFFSET 0.

% Count total records in the file
customer_count(Total) <- SELECT COUNT(*) AS total FROM customer_master.

@output("customer_preview").
@output("customer_count").
End-to-end example

For a complete example showing COBOL data preview, cross-source JOINs with PostgreSQL, and compliance rules, see From Mainframe to Modern: Migrating Card Clearing.

Tips

  • PIC S9(n)V99 COMP-3 (packed decimal) is decoded as a decimal on the Spark side and exposed as double in the Vadalog model — use double in @model/@mapping if you declare the schema explicitly.
  • PIC 9(n) COMP binary counters are exposed as integers (int/long depending on width).
  • OCCURS n TIMES groups become Vadalog list columns and can be expanded with collections:explode, just like JSON arrays.
  • Point filepath to a directory (not a single file) to read every candidate extract in one go; combine with cobolFileExtensions='.dat,.bin' to narrow the selection.

PostgreSQL Database

PostgreSQL is a robust open-source relational database that supports a wide range of data types and advanced querying capabilities. In this section, we will explore how to integrate PostgreSQL with Vadalog by first populating a customer table from a CSV file and then reading data from it using two approaches: full table read and a custom query.

In this example, we read data from a CSV file and populate the customer table in a PostgreSQL database.

% Bind the 'customer_postgres_csv' concept to the CSV file located in 'disk/data/generated_data/customer_postgres.csv'
% The option 'useHeaders=true' indicates the CSV file contains headers
@bind("customer_postgres_csv", "csv useHeaders=true", "disk/data/generated_data", "customer_postgres.csv").

% Define a rule that extracts CustomerID, Name, Surname, and Email from the CSV and assigns them to the 'customer_postgres' concept
customer_postgres(CustomerID, Name, Surname, Email) <-
customer_postgres_csv(CustomerID, Name, Surname, Email).

% Define the data model for the 'customer_postgres' concept (mapping column names to types)
@model("customer_postgres", "['customer_id:int', 'name:string', 'surname:string', 'email:string']").

% Declare the 'customer_postgres' concept as the output, which will be written to the PostgreSQL database
@output("customer_postgres").

% Bind the 'customer_postgres' concept to a PostgreSQL table 'customer' in the 'prometheux' database
% Specify the database connection details (host, port, username, and password)
@bind("customer_postgres", "postgresql host='postgres-host', port=5432, username='prometheux', password='myPassw'",
"prometheux", "customer").

This example demonstrates reading the full customer table from PostgreSQL.

% Bind the 'customer_postgres' concept to the PostgreSQL table 'customer' in the 'prometheux' database
@bind("customer_postgres", "postgresql host='postgres-host', port=5432, username='prometheux', password='myPassw'",
"prometheux", "customer").

% Define a rule to extract CustomerID, Name, Surname, and Email from the 'customer' table in PostgreSQL
customer_postgres_test(CustomerID, Name, Surname, Email) <-
customer_postgres(CustomerID, Name, Surname, Email).

% Declare the output concept 'customer_postgres_test' to make the processed data available
@output("customer_postgres_test").

In this example, we read specific columns and filter data using a SQL query.

% Bind the 'customer_postgres' concept to PostgreSQL using a SQL query
% The query filters for CustomerID > 0 and selects CustomerID and Email
@qbind("customer_postgres", "postgresql host='postgres-host', port=5432, username='prometheux', password='myPassw', database='prometheux'",
"", "select CustomerID, Email from customer where CustomerID > 0").

% Define a rule to filter the emails that end with 'prometheux.ai'
customer_postgres_test(CustomerID, Email) <-
customer_postgres(CustomerID, Email), OnlyPx = ends_with(Email, "prometheux.ai"), OnlyPx = #T.

% Declare the output concept 'customer_postgres_test' to make the filtered data available
@output("customer_postgres_test").

PostgreSQL with Supabase

Supabase is an open-source Firebase alternative that provides a hosted PostgreSQL database. To connect Prometheux to your Supabase database, you can use the Transaction Pooler connection method with a JDBC URL.

How to Retrieve Your Supabase Connection String

  1. Log in to your Supabase Dashboard.
  2. Select your project.
  3. Navigate to Project SettingsDatabase.
  4. Under Connection string, select the JDBC tab.
  5. Choose Transaction Pooler as the connection mode (recommended for serverless and short-lived connections).
  6. Copy the JDBC connection string.

The connection string will be in the following format:

jdbc:postgresql://aws-1-eu-west-1.pooler.supabase.com:6543/postgres?user=postgres.[YOUR_PROJECT_ID]&password=[YOUR_PASSWORD]

Example: Connecting to Supabase with JDBC URL

This example demonstrates how to read data from a Supabase PostgreSQL table using the Transaction Pooler and JDBC connection.

% Bind the 'owns' concept to the Supabase PostgreSQL database using JDBC URL
% Replace [YOUR_PROJECT_ID] with your Supabase project ID
% Replace [YOUR_PASSWORD] with your actual Supabase database password
@bind("owns", "postgresql url='jdbc:postgresql://aws-1-eu-west-1.pooler.supabase.com:6543/postgres?user=postgres.[YOUR_PROJECT_ID]&password=[YOUR_PASSWORD]'",
"postgres", "owns").

% Define a rule to extract data from the 'owns' table
out(X, Y, Z) <- owns(X, Y, Z).

% Declare the output concept 'out' for making the processed data available
@output("out").

Alternatively, instead of using the url parameter, you can specify the connection details individually:

% Bind the 'owns' concept to the Supabase PostgreSQL database using individual connection parameters
% Replace [YOUR_PROJECT_ID] with your Supabase project ID
% Replace [YOUR_PASSWORD] with your actual Supabase database password
@bind("owns", "postgresql host='aws-1-eu-west-1.pooler.supabase.com', port='6543', username='postgres.[YOUR_PROJECT_ID]', password='[YOUR_PASSWORD]'",
"postgres", "owns").

% Define a rule to extract data from the 'owns' table
out(X, Y, Z) <- owns(X, Y, Z).

% Declare the output concept 'out' for making the processed data available
@output("out").
Connection Modes

Supabase offers different connection modes:

  • Transaction Pooler (port 6543): Best for serverless functions and short-lived connections. Uses PgBouncer in transaction mode.
  • Session Pooler (port 5432): For long-lived connections that need session-level features.
  • Direct Connection (port 5432): Direct connection to the database without pooling.

For most Prometheux use cases, the Transaction Pooler is recommended as it efficiently manages connection pooling.

For enhanced security, instead of using your main database password, you can create a dedicated read-only user with access limited to specific tables. This follows the principle of least privilege and minimizes security risks.

Steps to Create a Read-Only User in Supabase
  1. Log in to your Supabase Dashboard.
  2. Select your project.
  3. Navigate to SQL Editor and execute the following SQL commands:
-- Create read-only user with a secure password
CREATE USER vadalog_reader WITH PASSWORD '[YOUR_SECURE_PASSWORD]';

-- Grant connect permission to the database
GRANT CONNECT ON DATABASE postgres TO vadalog_reader;

-- Grant usage on the public schema
GRANT USAGE ON SCHEMA public TO vadalog_reader;

-- ============================================
-- Grant SELECT only on specific tables
-- ============================================

-- Example: Grant read access to the 'ownerships' table
GRANT SELECT ON public.ownerships TO vadalog_reader;

-- Add more tables as needed
-- GRANT SELECT ON public.your_other_table TO vadalog_reader;

-- ============================================
-- Configure Row Level Security (RLS) policies
-- ============================================

-- Enable RLS on the table (if not already enabled)
ALTER TABLE public.ownerships ENABLE ROW LEVEL SECURITY;

-- Create policy to allow vadalog_reader to read all rows
CREATE POLICY "Allow vadalog_reader to read all ownerships"
ON public.ownerships
FOR SELECT
TO vadalog_reader
USING (true);
Setup Time

After creating the user and granting permissions, it may take a few minutes for the changes to propagate and become active. If you encounter connection issues immediately after setup, wait 5 minutes and try again.

Once the read-only user is created, use it in your connection string:

% Connect using the read-only user
@bind("owns", "postgresql host='aws-1-eu-west-1.pooler.supabase.com', port='6543', username='vadalog_reader', password='[YOUR_SECURE_PASSWORD]'",
"postgres", "ownerships").

out(X, Y, Z) <- owns(X, Y, Z).
@output("out").

Alternative: Connecting via Supabase REST API

Recommendation

The JDBC connection method (shown above) is heavily recommended for production use. It provides full PostgreSQL capabilities, better performance, and more reliable connections. The REST API method below is primarily suitable for one-time access to simple tables or quick prototyping scenarios.

Supabase also exposes a REST API (powered by PostgREST) that allows you to access your database tables directly via HTTP. This approach uses your Supabase API keys for authentication.

How to Retrieve Your Supabase API Credentials
  1. Log in to your Supabase Dashboard.
  2. Select your project.
  3. Navigate to Project SettingsAPI.
  4. Copy your Project URL (e.g., https://yourprojectid.supabase.co).
  5. Copy your service_role key (secret) or anon key depending on your security requirements.

API keys: the publishable key / anon (legacy) respects Row Level Security (RLS) policies if enabled for your tables, while the secret key bypasses RLS policies.

Example: Connecting to Supabase via REST API

This example demonstrates how to read data from a simple Supabase table using the REST API with bearer token authentication. This method is best suited for quick, one-time data access or prototyping scenarios.

% Bind the 'owns' concept to the Supabase REST API
% Replace [YOUR_PROJECT_ID] with your Supabase project ID
% Replace [YOUR_KEY] with your publishable or secret key
@bind("owns", "api
authType='bearer',
headers='apikey:[YOUR_KEY]'",
"https://[YOUR_PROJECT_ID].supabase.co/rest/v1/", "owns").

% Define a rule to extract data from the 'owns' table
out(X, Y, Z) <- owns(X, Y, Z).

% Declare the output concept 'out' for making the processed data available
@output("out").

MariaDB Database

MariaDB is a popular open-source relational database, highly compatible with MySQL. It supports various SQL features and is commonly used in web applications and data platforms. In this example, we will explore how to interact with a MariaDB database in Prometheux, focusing on reading data from the order_customer table to test if the data has been populated correctly.

% Bind the 'order_mariadb' concept to the 'order_customer' table in MariaDB
% The connection details (host, port, username, and password) are specified
@bind("order_mariadb", "mariadb host='mariadb-host', port=3306, username='prometheux', password='myPassw'",
"prometheux", "order_customer").

% Define a rule that extracts OrderId, CustomerId, and Cost from the 'order_customer' table
order_mariadb_test(OrderId, CustomerId, Cost) <-
order_mariadb(OrderId, CustomerId, Cost).

% Declare the output concept 'order_mariadb_test', making the processed data available
@output("order_mariadb_test").

Neo4j Database

Neo4j is a graph database designed for efficiently storing and querying highly connected data. In this example, we'll explore how to populate Neo4j with nodes representing Person and Order entities, as well as relationships between them. We'll also cover how to query this data using a Cypher query.

This example shows how to read data from a CSV file, populate Neo4j with Person and Order nodes, and create a relationship between them.

% Bind the 'persons_order_neo4j_csv' concept to the CSV file located in 'disk/data/generated_data/persons_order_neo4j.csv'
@bind("persons_order_neo4j_csv", "csv useHeaders=true", "disk/data/generated_data", "persons_order_neo4j.csv").

% Define the 'person_neo4j' concept by extracting customer details from the CSV file
person_neo4j(CustomerId, Name, Surname, Email) <-
persons_order_neo4j_csv(CustomerId, Name, Surname, Email, OrderId, Cost).

% Bind the 'person_neo4j' concept to a Neo4j node with label 'Person'
@output("person_neo4j").
@bind("person_neo4j", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)").

% Map the 'person_neo4j' concept's fields to the Neo4j 'Person' node properties
@model("person_neo4j", "['customerId(ID):int', 'name:string', 'surname:string', 'email:string']").

% Define the 'order' concept by extracting order details from the CSV file
order(OrderId, Cost) <-
persons_order_neo4j_csv(CustomerId, Name, Surname, Email, OrderId, Cost).

% Bind the 'order' concept to a Neo4j node with label 'Order'
@output("order").
@bind("order", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Order)").

% Map the 'order' concept's fields to the Neo4j 'Order' node properties
@mapping("order", 0, "orderId(ID)", "int").
@mapping("order", 1, "cost", "string").

% Define the 'order_person_rel_neo4j' concept for creating a relationship between Order and Person
order_person_rel_neo4j(OrderId, CustomerId) <-
persons_order_neo4j_csv(CustomerId, Name, Surname, Email, OrderId, Cost).

% Bind the 'order_person_rel_neo4j' concept to create a relationship between the 'Order' and 'Person' nodes in Neo4j
@output("order_person_rel_neo4j").
@bind("order_person_rel_neo4j", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Order)-[IS_RELATED_TO]->(:Person)").

% Map the 'order_person_rel_neo4j' concept's fields to the relationship between Order and Person
@mapping("order_person_rel_neo4j", 0, "orderId:orderId(sID)", "int").
@mapping("order_person_rel_neo4j", 1, "customerId:customerId(tID)", "int").

In this example, we query Neo4j to retrieve the relationship between Person and Order nodes.

% Use @qbind to execute a Cypher query that retrieves OrderId and CustomerId from related Order and Person nodes
@qbind("persons_order_neo4j", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "",
"MATCH (o:Order)-[r:IS_RELATED_TO]->(p:Person) RETURN o.orderId, p.customerId").

% Define a rule to store the result of the Cypher query into the 'persons_order_neo4j_test' concept
persons_order_neo4j_test(CustomerId, OrderId) <-
persons_order_neo4j(CustomerId, OrderId).

% Declare the output concept 'persons_order_neo4j_test' to make the query result available
@output("persons_order_neo4j_test").

Querying Neo4j with Cypher in rule bodies

Instead of placing the Cypher query inside a @qbind annotation, you can write it directly in a rule body. When the body starts with a Cypher keyword (MATCH, OPTIONAL, UNWIND, CALL, CREATE, MERGE), it is interpreted as a Cypher query and pushed down to Neo4j automatically.

You still need a @bind annotation to provide the Neo4j connection details — Prometheux uses it to know where to execute the query.

@bind("person_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)").

% Cypher query directly in the rule body
order_persons(OrderId, CustomerId) <-
MATCH (o:Order)-[r:IS_RELATED_TO]->(p:Person)
RETURN o.orderId, p.customerId.

@output("order_persons").

This is equivalent to writing a separate @qbind, but keeps the query inline with the rule — useful when you want the Cypher logic visible alongside the rest of the program.

Filtering and aggregating with Cypher

Any valid Cypher query can be used in the body:

@bind("person_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)").

% Find persons with more than 5 friends
popular_persons(Name, FriendCount) <-
MATCH (p:Person)-[:FRIEND_OF]->()
WITH p, count(*) AS cnt
WHERE cnt > 5
RETURN p.name, cnt.

@output("popular_persons").
@bind("person_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)").

% Shortest path between two persons
path_result(StartName, EndName, Length) <-
MATCH path = shortestPath((a:Person {name:'Davide'})-[:FRIEND_OF*]-(b:Person {name:'Matteo'}))
RETURN a.name, b.name, length(path).

@output("path_result").
When to use Cypher vs SQL
  • Cypher in rule body — when you need Neo4j-specific features like graph traversals, shortestPath, variable-length patterns, or APOC procedures.
  • SQL in rule body — when you want familiar SQL syntax for previewing, counting, filtering, or paginating Neo4j data. Prometheux translates it to Cypher for you.
  • Both approaches push the query down to the Neo4j server — no data is loaded into memory unnecessarily.

Querying Neo4j with SQL

You can also use SQL queries directly against Neo4j-bound predicates. Prometheux automatically translates the SQL into optimized Cypher and pushes it down to the Neo4j server — no data is loaded into memory unnecessarily.

This is useful when you want to preview, paginate, count, or filter Neo4j data using familiar SQL syntax.

Reading all nodes

Bind a Neo4j node label and read all its data:

@bind("person_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)").

person_result(Id, Name, Surname, Email) <- person_db(Id, Name, Surname, Email).

@output("person_result").

Preview with LIMIT and OFFSET

Paginate through Neo4j data using SQL LIMIT and OFFSET. These are translated to Cypher SKIP/LIMIT and pushed down to the Neo4j server:

@bind("person_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)").

% Get the first 10 persons
person_preview() <- SELECT * FROM person_db LIMIT 10.

@output("person_preview").
@bind("person_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)").

% Get persons 11-20 (page 2 with page size 10)
person_page_2() <- SELECT * FROM person_db LIMIT 10 OFFSET 10.

@output("person_page_2").

Counting records

Use SQL COUNT(*) to count nodes or relationships. The count is computed entirely by the Neo4j server:

@bind("person_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)").

person_count(Total) <- SELECT COUNT(*) AS total FROM person_db.

@output("person_count").

Querying relationships with SQL

SQL works on relationship patterns as well:

@bind("friend_of_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)-[:FRIEND_OF]->(:Person)").

% Preview the first 5 friendship relationships
friendship_preview() <- SELECT * FROM friend_of_db LIMIT 5.

@output("friendship_preview").
@bind("friend_of_db", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Person)-[:FRIEND_OF]->(:Person)").

% Count the total number of FRIEND_OF relationships
friendship_count(Total) <- SELECT COUNT(*) AS total FROM friend_of_db.

@output("friendship_count").

Combining preview and count

A common pattern is to run both a paginated preview and a count in the same program:

@bind("order_rel", "neo4j username='neo4j', password='myPassw', host='neo4j-host', port=7680", "neo4j", "(:Order)-[IS_RELATED_TO]->(:Person)").

% Preview: first 10 results
order_preview() <- SELECT * FROM order_rel LIMIT 10 OFFSET 0.

% Count: total number of relationships
order_count(Total) <- SELECT COUNT(*) AS total FROM order_rel.

@output("order_preview").
@output("order_count").
How SQL-to-Neo4j translation works

When you write SQL against a Neo4j @bind predicate, Prometheux transparently converts the query:

  • SELECT * with LIMIT/OFFSET → The @bind pattern is used to generate optimized Cypher with WITH ... SKIP n LIMIT n RETURN ..., fully pushed down to Neo4j.
  • SELECT COUNT(*) and other aggregations → Converted to a Cypher @qbind query (e.g., MATCH (n:Person) RETURN count(n) AS total) and executed server-side.
  • No data is loaded into memory for filtering or pagination — everything is handled by the Neo4j server.

Amazon DynamoDB Database

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. Prometheux supports both reading from and writing to DynamoDB tables, with automatic table creation capabilities and support for PartiQL queries.

@bind options for DynamoDB

The DynamoDB connector supports the following configuration options:

  • region: AWS region for the DynamoDB instance (e.g., us-east-1, eu-west-1)
  • username: AWS Access Key ID for authentication
  • password: AWS Secret Access Key for authentication
  • sessionToken: AWS Session Token for temporary credentials (optional)
  • endpointOverride: Custom endpoint URL (useful for DynamoDB Local testing)
  • partitionKey: The partition key attribute name for table creation
  • sortKey: The sort key attribute name for table creation (optional)
  • billingMode: Either PAY_PER_REQUEST (default) or PROVISIONED
  • readCapacity: Read capacity units for provisioned billing mode (default: 5)
  • writeCapacity: Write capacity units for provisioned billing mode (default: 5)
  • writeBatchSize: Number of items to write per batch (1-25, default: 25)
  • readPageSize: Page size for read operations (default: 100)
  • totalSegments: Number of segments for parallel scanning (default: 8)
  • inferSampleLimit: Number of items to sample for schema inference (default: 64)
  • secondaryIndexName: Name of Global Secondary Index to create (optional)
  • secondaryIndexPartitionKey: Partition key for the GSI (optional)
  • secondaryIndexSortKey: Sort key for the GSI (optional)

Example 1: Writing Data to DynamoDB

This example shows how to read data from a CSV file and write it to a DynamoDB table with automatic table creation.

% Bind the 'users_csv' concept to the CSV file containing user data
@bind("users_csv", "csv useHeaders=true", "disk/data/input", "users.csv").

% Define the data model for the 'users_dynamodb' concept
@model("users_dynamodb", "['id:string', 'name:string', 'email:string', 'age:int']").

% Define a rule that maps CSV data to the DynamoDB concept
users_dynamodb(Id, Name, Email, Age) <-
users_csv(Id, Name, Email, Age).

% Declare the output concept for writing to DynamoDB
@output("users_dynamodb").

% Bind the 'users_dynamodb' concept to a DynamoDB table
% The table will be automatically created with 'id' as partition key
% Region can be set as an option (has priority) or as third argument of the bind.
@bind("users_dynamodb",
"dynamodb username='AKIAIOSFODNN7EXAMPLE', password='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY', partitionKey='id', billingMode='PAY_PER_REQUEST'",
"us-east-1", "users").

Example 2: Reading Data from DynamoDB

This example demonstrates reading data from an existing DynamoDB table.

% Bind the 'users_dynamodb' concept to the DynamoDB table
% Region us-east-1 overrides us-east-2
@bind("users_dynamodb",
"dynamodb region='us-east-1', username='AKIAIOSFODNN7EXAMPLE', password='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'",
"us-east-2", "users").

% Define a rule to filter users by age
young_users(Id, Name, Email) <-
users_dynamodb(Id, Name, Email, Age), Age < 30.

% Declare the output concept for the filtered results
@output("young_users").

Example 3: Using PartiQL Queries

This example shows how to use PartiQL (SQL-compatible query language) to query DynamoDB data.

% Use @qbind to execute a PartiQL query against DynamoDB
@qbind("user_orders",
"dynamodb username='AKIAIOSFODNN7EXAMPLE', password='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'",
"us-east-1",
"SELECT user_id, order_date, total_amount FROM orders WHERE user_id = 'u123' AND begins_with(order_date, '2025')").

% Define a rule to process the query results
recent_orders(UserId, OrderDate, Amount) <-
user_orders(UserId, OrderDate, Amount).

% Declare the output concept
@output("recent_orders").

Example 4: Advanced Configuration with Sort Key and GSI

This example demonstrates creating a table with both partition and sort keys, plus a Global Secondary Index.

% Bind to CSV file containing order data
@bind("orders_csv", "csv useHeaders=true", "disk/data/input", "orders.csv").

% Define the data model for orders
@model("orders_dynamodb", "['customer_id:string', 'order_date:string', 'order_id:string', 'total_amount:double', 'status:string']").

% Define a rule to map CSV data to the DynamoDB concept
orders_dynamodb(CustomerId, OrderDate, OrderId, TotalAmount, Status) <-
orders_csv(CustomerId, OrderDate, OrderId, TotalAmount, Status).

% Declare the output concept
@output("orders_dynamodb").

% Bind with advanced configuration: sort key and Global Secondary Index
@bind("orders_dynamodb",
"dynamodb region='us-east-1', username='AKIAIOSFODNN7EXAMPLE', password='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY', partitionKey='customer_id', sortKey='order_date', billingMode='PROVISIONED', readCapacity=10, writeCapacity=5, secondaryIndexName='StatusIndex', secondaryIndexPartitionKey='status'",
"", "orders").

Example 5: Using Session Credentials

This example shows how to use temporary AWS credentials with session tokens.

% Bind to CSV file
@bind("products_csv", "csv useHeaders=true", "disk/data/input", "products.csv").

% Define the product concept
products_dynamodb(ProductId, Name, Price, Category) <-
products_csv(ProductId, Name, Price, Category).

% Declare output
@output("products_dynamodb").

% Bind with session credentials (useful for role-based access)
@bind("products_dynamodb",
"dynamodb region='us-east-1', username='AKIAIOSFODNN7EXAMPLE', password='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY', sessionToken='AQoEXAMPLEH4aoAH0gNCAPyJxz4BlCFFxWNE1OPTgk5TthT+FvwqnKwRcOIfrRh3c/LTo6UDdyJwOOvEVPvLXCrrrUtdnniCEXAMPLE/IvU1dYUg2RVAJBanLiHb4IgRmpRV3zrkuWJOgQs8IZZaIv2BXIa2R4OlgkBN9bkUDNCJiBeb/AXlzBBko7b15fjrBs=', partitionKey='product_id', billingMode='PAY_PER_REQUEST'",
"", "products").

Example 6: Using DynamoDB Local for Development

This example shows how to connect to DynamoDB Local for development and testing.

% Bind to local CSV file
@bind("test_data_csv", "csv useHeaders=true", "test/data", "sample_data.csv").

% Define test data concept
test_data(Id, Name, Value) <-
test_data_csv(Id, Name, Value).

% Declare output
@output("test_data").

% Bind to DynamoDB Local (running on localhost:8000)
@bind("test_data",
"dynamodb region='us-east-1', username='test', password='test', endpointOverride='http://localhost:8000', partitionKey='id', billingMode='PAY_PER_REQUEST'",
"testdb", "test_table").

Advanced PartiQL Features

DynamoDB supports powerful PartiQL functions that can be used in queries:

% Example using begins_with function
@qbind("recent_logs",
"dynamodb region='us-east-1', username='AKIAIOSFODNN7EXAMPLE', password='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'",
"",
"SELECT log_id, message, timestamp FROM application_logs WHERE begins_with(timestamp, '2025-01') AND contains(message, 'ERROR')").

% Example using attribute_exists function
@qbind("complete_profiles",
"dynamodb username='AKIAIOSFODNN7EXAMPLE', password='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'",
"us-east-1",
"SELECT user_id, name, email FROM user_profiles WHERE attribute_exists(phone_number) AND attribute_exists(address)").

Configuration Best Practices

  1. Credentials Management: Store sensitive credentials in pmtx.properties or environment variables rather than hardcoding them in bind annotations.

  2. Billing Mode: Use PAY_PER_REQUEST for variable workloads and PROVISIONED for predictable traffic patterns.

  3. Batch Sizes: Adjust writeBatchSize based on item size - use smaller batches for larger items.

  4. Parallel Scanning: Increase totalSegments for faster reads on large tables, but be mindful of consumed read capacity.

Qdrant Vector Database

Qdrant is a high-performance vector similarity search engine designed for storing, indexing, and querying dense vector embeddings. Prometheux integrates with Qdrant via gRPC, supporting concept-aware embedding generation, raw vector search, payload filtering, and structured result decomposition — all from within Vadalog rules.

Qdrant collections store points, each consisting of:

  • A vector (dense float array) used for similarity search
  • A payload (key-value metadata) attached to the point
  • A point ID (integer or UUID)

Prometheux supports three read modes and a concept-aware write mode, making it suitable for both text-based semantic search and domain-specific embeddings (e.g., molecular fingerprints, image features).

Vadalog + Qdrant = Ontology-grounded RAG

Most "RAG over a vector database" pipelines work in one direction only: you embed your text, you embed your query, you hope cosine similarity returns something useful, and an LLM stitches an answer together. There is no ground truth about what each row means, no way to express domain rules, and no audit trail.

Vadalog + Qdrant changes that. The @model annotation lets you attach a small ontology to every Qdrant-backed predicate: a description template that says, in your domain's language, what a row represents. The same ontology is then used everywhere:

PhaseWhat the ontology does
WriteThe description template is resolved per row (e.g. "Aspirin is a pharmaceutical drug indicated for pain relief") and that sentence is what gets embedded — so the stored vector is grounded in domain semantics, not raw payload fragments.
RetrieveWhen you query, the same template is appended to your question (e.g. "cardiac treatment options. Context: [name] is a drug indicated for [indication]"), so the query vector lives in the same semantic space as the data. This is what concept=true does, by default.
ReasonRetrieved scored points come back as a regular Vadalog predicate. You can join them with structured data, apply filters, chain rules, and derive new facts — the part classical RAG can't do.

The result is ontology-grounded RAG with reasoning — a pipeline where vector search and rule-based inference share the same domain model. Concept-aware retrieval gives you semantic recall; rules give you precision, explainability, and the ability to combine retrieved knowledge with everything else in your program.

If you want to see this end-to-end, jump to:

  • Example 11 — the minimal write → semantic search → reason loop in ~20 lines.
  • Example 12 — a richer case study unifying PDF + TXT + CSV into a pharma knowledge base, comparing concept=true vs concept=false, and applying reasoning rules to the retrieved drugs.

Minimal Configuration

All connection and embedding options have sensible defaults provided by the Prometheux platform configuration. This means the simplest possible @bind for a Qdrant write with concept-aware embeddings is:

@bind("my_predicate", "qdrant", "", "my_collection").

And for a semantic search read:

@bind("my_predicate", "qdrant query='search text'", "", "my_collection").

You only need to specify options explicitly when you want to override the defaults.

@bind options for Qdrant

The Qdrant connector supports the following configuration options:

Connection:

  • host: Qdrant server hostname (default: platform-configured)
  • port: gRPC port (default: 6334)
  • useTls: Enable TLS for the connection (default: false)
  • password: API key for Qdrant authentication (also available as apiKey)

Read options:

  • query: Natural language search query — the text is embedded via Azure OpenAI and used for KNN search
  • queryVector: Pre-computed vector for direct similarity search, bypassing text embedding (e.g., queryVector=[0.1 0.2 0.3])
  • limit: Maximum number of results to return (default: 10)
  • scoreThreshold: Minimum similarity score for results (optional)
  • includeVector: Whether to include the vector in the result (default: false)
  • filter: Payload filter expression (see Payload Filtering)
  • scrollPageSize: Page size for bulk scroll reads (default: 100)
  • concept: Enable concept-aware query enrichment (default: true; set to false to disable)

Write options:

  • vectorDimension: Dimension of the vectors in the collection. Automatically derived from the embedding model when not specified (e.g., 3072 for text-embedding-3-large, 1536 for text-embedding-3-small). Only required when using pre-computed embeddings with a custom dimension.
  • distance: Distance metric — Cosine, Euclid, Dot, or Manhattan (default: Cosine)
  • toEmbedField: Name of the column whose raw text value should be embedded. Used when concept=false and you want to embed a single column rather than generating a concept description from all fields (e.g., toEmbedField='description'). The platform embeds this column's value via Azure OpenAI.
  • toEmbedFieldPos: 0-based positional index of the column to embed, as an alternative to toEmbedField when no @model annotation is present (e.g., toEmbedFieldPos=2 means the 3rd column is the text to embed)
  • embeddingField: Name of the DataFrame column containing a pre-computed vector (a list of floats). Use this when the embeddings are already generated externally — the platform skips embedding generation and writes the vector directly (e.g., embeddingField='my_vector').
  • embeddingFieldPos: 0-based positional index of the pre-computed vector column, as an alternative to embeddingField when no @model annotation is present (e.g., embeddingFieldPos=3 means the 4th column is the vector)
  • batchSize: Number of points per write batch (default: 64)
  • shardNumber: Number of shards for the collection (default: 1)
  • replicationFactor: Replication factor (default: 1)
  • embeddingModel: Azure OpenAI embedding model name (default: text-embedding-3-large)
Controlling what gets embedded

There are three ways to control what gets embedded when writing to Qdrant:

OptionInput typeWhat happens
@model with description templateOne or more fieldsThe template (e.g., "[Name] is a person described as [Description]") is resolved per row and the resulting sentence is embedded. Only the fields referenced in the template contribute to the embedding, while all fields are stored as payload. This is the most flexible and semantically rich option — you can embed one field, multiple fields, or all fields simply by choosing which [placeholders] to include in the template.
toEmbedField='description' or toEmbedFieldPos=2 with concept=falseSingle columnOnly the specified column's raw text value is embedded via Azure OpenAI. All fields are stored as payload. Use toEmbedFieldPos when no @model annotation is present.
embeddingField='my_vector' or embeddingFieldPos=3Pre-computed vectorThe vector column is written directly — no embedding generation. The vector column is excluded from the payload. Use embeddingFieldPos when no @model annotation is present.

When none of these is specified and concept=true (default), the system auto-generates a concept description from all column names and values.

Tip: @model with a description template already covers the toEmbedField use case and more. For example, @model("pred", "['name:string','desc:string']", "[desc]") embeds only the desc field — equivalent to toEmbedField='desc' — but with the added benefit that you can later enrich it (e.g., "[name] is described as [desc]") without changing the @bind.

Example 1: Writing Data with Concept-Aware Embeddings

This example writes city data to Qdrant. The @model annotation provides a description template that is resolved per row and used to generate embeddings. Each row produces a natural language sentence that is then embedded.

% Source data
city("Paris", "France", "Europe", 2161000).
city("Tokyo", "Japan", "Asia", 13960000).
city("Rome", "Italy", "Europe", 2873000).
city("New York", "United States", "Americas", 8336000).

% Define the write predicate
city_out(Name, Country, Continent, Population) <- city(Name, Country, Continent, Population).
@output("city_out").

% Bind to Qdrant with concept-aware embedding
@bind("city_out",
"qdrant host='localhost', port=6334, vectorDimension=3072",
"", "cities").

% The @model description defines how each row is converted to text for embedding.
% Placeholders [field_name] are replaced with actual values before embedding.
@model("city_out",
"['name:string', 'country:string', 'continent:string', 'population:int']",
"[name] is a city in [country], located in [continent], with a population of [population]").

After execution, the cities collection contains 4 points, each with:

  • A 3072-dimensional embedding generated from the resolved description (e.g., "Paris is a city in France, located in Europe, with a population of 2161000")
  • Payload fields: name, country, continent, population, plus a __qdrant_text__ field containing the description text

Example 2: Semantic Search with Natural Language Query

Read from Qdrant by providing a natural language query. The query text is embedded and used for KNN similarity search.

% Bind to the Qdrant collection
@bind("city_in",
"qdrant host='localhost', port=6334, query='European capital city', limit=3",
"", "cities").

% @model is optional on a read predicate; here we supply a description
% so the query embedding gets concept-aware enrichment.
@model("city_in",
"['name:string', 'country:string', 'continent:string', 'population:long']",
"[name] is a city in [country], located in [continent], with a population of [population]").

% Body atom binds positionally to Qdrant's canonical column ordering for
% this collection: __id__, continent, country, name, population,
% __qdrant_text__, __score__.
result(Name, Country, Score) <- city_in(Id, Continent, Country, Name, Pop, Text, Score).
@output("result").
Project only the columns you need

You can also write the body as a SQL projection and skip the full positional binding:

result(Name, Country, Score) <- SELECT name, country, __score__ FROM city_in.

The reader pushes the projection (and any LIMIT/WHERE/OFFSET) down to Qdrant, so only the requested payload fields are deserialized server-side.

System columns

Qdrant results always include the following system columns, in a fixed canonical position:

  • __id__ (string) — the point ID (integer or UUID, returned as string)
  • __qdrant_text__ (string) — the text used for embedding (only present if the data was written with concept-aware embedding)
  • __vector__ (array of floats) — the raw vector. Returned only when includeVector=true is set on @bind, e.g. "qdrant includeVector=true, query='…'".
  • __score__ (double) — the similarity score (only present in search mode, not in scroll mode)

Example 3: Semantic Search in Rule Bodies (SELECT "..." / ASK)

The SELECT "..." FROM <pred> syntax (and its equivalent ASK "..." FROM <pred>) provides a concise way to write Qdrant semantic-search queries directly in rule bodies, without manually configuring @bind options. The rule is automatically rewritten into a synthetic @bind with the appropriate query parameters.

The Qdrant reader always returns columns in a fixed canonical order:

__id__, payload fields (alphabetical), __qdrant_text__ if present, __vector__ if includeVector=true, __score__ (search mode only).

The head atom binds positionally to this layout — the i-th head variable receives whatever Qdrant put at column position i. The variable names are not significant. If the head has fewer variables than Qdrant returns, only the first N columns are projected.

% Suppose 'product_catalog' has payload fields: category, name, price
% Canonical column order on a search read is:
% pos 0 -> __id__
% pos 1 -> category (alphabetical)
% pos 2 -> name
% pos 3 -> price
% pos 4 -> __score__
@bind("products",
"qdrant host='localhost', port=6334",
"", "product_catalog").

% Head arity = 5, so all 5 canonical columns are projected.
result(Id, Category, Name, Price, Score) <-
SELECT "wireless noise cancelling headphones" FROM products LIMIT 5.
@output("result").

ASK is accepted as a fully equivalent alias — both syntaxes go through the same rewriter, so pick whichever reads better in context:

result(Id, Category, Name, Price, Score) <-
ASK "wireless noise cancelling headphones" FROM products LIMIT 5.

If the head has fewer variables, you get the leading columns. For example, to capture just __id__ and __score__ of the top match, take the first column plus skip ahead positionally — typically you would still take the full canonical projection and ignore what you don't need in the head atom of the consuming rule.

@model on a Qdrant read predicate

@model does not declare the schema of the read result and does not reorder columns. On a Qdrant read predicate, @model is used only to enrich the embedding (concept description and context). The column layout returned by the reader is always the canonical Qdrant ordering.

If you don't need concept enrichment, you can omit @model entirely on a Qdrant read.

Returning the embedding vector

To fetch the raw vector along with the search result, set includeVector=true on the @bind. The vector is inserted in the canonical layout right before __score__ (for searches) or as the trailing column (for scroll reads):

@bind("products",
"qdrant host='localhost', port=6334, includeVector=true",
"", "product_catalog").

% Canonical search ordering with includeVector=true:
% pos 0 -> __id__
% pos 1 -> category
% pos 2 -> name
% pos 3 -> price
% pos 4 -> __vector__ (array of floats)
% pos 5 -> __score__
result(Id, Category, Name, Price, Vec, Score) <-
SELECT "wireless headphones" FROM products LIMIT 3.
@output("result").
Vector column type for writes

When writing pre-computed embeddings, declare the vector column in @model as __vector__:list (an array of floats) — the writer uses @model to name the payload fields on the write side. See Example 10 and the embeddingField option.

Optional WHERE / LIMIT / SCORE_THRESHOLD clauses

Both SELECT "..." and ASK accept optional payload filters and limits, which are pushed down to Qdrant alongside the search:

% With LIMIT, SCORE_THRESHOLD, and WHERE filter
result(Id, Category, Name, Price, Score) <-
SELECT "affordable electronics" FROM products
WHERE category = "Electronics" AND price != 999
LIMIT 10
SCORE_THRESHOLD 0.7.

Example 4: Concept-Aware Query Enrichment

When concept=true (the default), the query text is automatically enriched with semantic context derived from the @model description (or, in its absence, from the column names declared by @model). This biases the embedding toward the predicate's structure, improving search relevance.

@bind("game_db",
"qdrant host='localhost', port=6334, query='adventure games set in space'",
"", "games").

% The @model description provides concept context.
% The query becomes: "adventure games set in space. Context: [title] is a game
% in [genre], developed by [developer], rated [rating]"
@model("game_db",
"['title:string', 'genre:string', 'developer:string', 'rating:double']",
"[title] is a game in [genre], developed by [developer], rated [rating]").

result(Title, Genre, Rating) <- game_db(Id, Title, Genre, Developer, Rating, Score).
@output("result").

To disable concept enrichment:

@bind("game_db",
"qdrant host='localhost', port=6334, query='adventure games', concept=false",
"", "games").

When your data uses pre-computed embeddings (e.g., molecular fingerprints, image features, or custom domain embeddings), you can bypass the text-to-embedding step and search directly with a raw vector.

% Search using a pre-computed 4-dimensional vector
@bind("item",
"qdrant host='localhost', port=6334, queryVector=[0.12 0.45 0.78 0.33], limit=5",
"", "my_collection").

% Canonical search ordering for this collection:
% pos 0 -> __id__
% pos 1 -> category (alphabetical payload)
% pos 2 -> name
% pos 3 -> __score__
result(Name, Category, Score) <- item(Id, Category, Name, Score).
@output("result").
Vector value separator

Inside queryVector=[...], use spaces to separate values, not commas. Commas are reserved as the @bind option delimiter.

Correct: queryVector=[0.1 0.2 0.3 0.4]
Incorrect: queryVector=[0.1,0.2,0.3,0.4] — commas will break the option parsing

Example 6: Bulk Scroll Read

When no query or queryVector is provided, Qdrant reads all points from the collection using the scroll API. This is useful for bulk data extraction.

@bind("all_products",
"qdrant host='localhost', port=6334",
"", "product_catalog").

% Bulk scroll canonical ordering (no __score__):
% pos 0 -> __id__
% pos 1 -> category
% pos 2 -> name
% pos 3 -> price
result(Name, Category, Price) <- all_products(Id, Category, Name, Price).
@output("result").

Example 7: Payload Filtering

Payload filters can be applied to both search and scroll operations. Filters restrict results based on payload field values without affecting the vector similarity scoring.

Equality and inequality:

% Only return Electronics products
@bind("products",
"qdrant host='localhost', port=6334, query='best laptop', filter=category = \"Electronics\"",
"", "product_catalog").

Multiple conditions with AND:

@bind("products",
"qdrant host='localhost', port=6334, query='premium items', filter=category = \"Electronics\" AND price != 999",
"", "product_catalog").

IN operator (match any value from a list):

When using IN with multiple values, wrap the filter in single quotes to protect the commas from the @bind option parser:

% Match products in multiple categories
@bind("products",
"qdrant host='localhost', port=6334, filter='category IN (\"Electronics\", \"Lighting\", \"Furniture\")'",
"", "product_catalog").

% Canonical scroll ordering: __id__, category, name, price
result(Name, Category, Price) <- products(Id, Category, Name, Price).
@output("result").

NOT IN operator:

% Exclude specific categories
@bind("products",
"qdrant host='localhost', port=6334, filter='category NOT IN (\"Discontinued\", \"Archived\")'",
"", "product_catalog").

Range filters (numeric comparisons):

% Find molecules with molecular weight between 180 and 500
@bind("molecules",
"qdrant filter='weight >= 180 AND weight < 500'",
"", "molecule_db").
% Combine range with equality filter
@bind("products",
"qdrant query='premium items', filter='price > 100 AND category = \"Electronics\"'",
"", "product_catalog").

Filter with scroll (no query):

% Bulk read only active items
@bind("active_items",
"qdrant filter=status = \"active\"",
"", "inventory").

% Canonical scroll ordering: __id__, name, status (alphabetical payload).
result(Name) <- active_items(Id, Name, Status).
@output("result").
Supported filter syntax
SyntaxDescriptionExample
field = "value"String equalityname = "Alice"
field = 123Integer equalityage = 30
field != "value"String inequalitystatus != "deleted"
field != 123Integer inequalitypriority != 0
field > 100Greater thanprice > 100
field >= 100Greater than or equalweight >= 180.5
field < 500Less thantemperature < 37.5
field <= 500Less than or equalscore <= 0.95
field IN ("v1", "v2")Match any valuecategory IN ("A", "B")
field IN (1, 2, 3)Match any integerlevel IN (1, 2, 3)
field NOT IN ("v1")Exclude valuestype NOT IN ("test")
cond1 AND cond2Multiple conditionsstatus = "active" AND price > 50

Example 8: Using the ask() Function

The ask() function is a Spark UDF that performs Qdrant vector search per-row, returning results as a JSON array. This is useful for enriching existing data with similarity search results, or for using runtime vectors from other data sources (e.g., a parquet file).

Syntax:

ask(prompt_or_vector, "option1=value1,option2=value2")

The second argument is a single options string containing comma-separated key-value pairs. Available options:

OptionDescriptionDefault
collectionQdrant collection name to searchFrom qdrant.collection in pmtx.properties, or "default"
limitMaximum number of resultsFrom qdrant.limit in pmtx.properties, or 3
hostQdrant server hostnameFrom qdrant.host in pmtx.properties, or "localhost"
portQdrant gRPC portFrom qdrant.port in pmtx.properties, or 6334
useTlsEnable TLSFrom qdrant.useTls in pmtx.properties, or false
apiKeyQdrant API keyFrom qdrant.apiKey in pmtx.properties, or ""
embeddingModelAzure OpenAI model for text embeddingFrom qdrant.embeddingModel in pmtx.properties, or "text-embedding-3-large"
Connection defaults

The ask() function resolves connection properties using a three-level fallback: inline option → pmtx.properties → hardcoded default. If your platform is already configured with Qdrant settings in pmtx.properties, the only option you typically need is collection:

ask("search query", "collection=my_collection,limit=5")

To target a different Qdrant instance, override the connection inline:

ask("search query", "host=52.8.114.148,port=6334,apiKey=my-key,collection=my_collection")

Text mode — search by natural language query:

% For each product, find similar products
product("laptop", "Electronics").
product("desk lamp", "Lighting").

similar(Name, Answer) <- product(Name, Category),
Answer = ask("products similar to " + Name,
"collection=product_catalog,limit=3").
@output("similar").

The Answer variable contains a JSON array:

[{"__id__":"42","name":"Gaming Laptop","category":"Electronics","__score__":0.94},
{"__id__":"15","name":"Ultrabook Pro","category":"Electronics","__score__":0.91}]

Using ${Variable} interpolation in the prompt:

When the prompt contains ${Variable} placeholders, the system automatically binds them to the corresponding Vadalog variables at runtime:

person("Alice", "expert in machine learning").
person("Bob", "cardiologist specialized in heart surgery").

similar_persons(Name, Answer) <- person(Name, Description),
Answer = ask("Find persons similar to ${Description}",
"collection=persons,limit=3").
@output("similar_persons").

For each row, ${Description} is replaced with the actual value of the Description variable before embedding the prompt.

Vector mode — search by pre-computed vector:

% Read vectors from a parquet file
@bind("molecules", "parquet", "/data/embeddings", "molecules.parquet").
@model("molecules", "['molecule_id:string', 'embedding:list']").

% For each molecule, find the most similar ones in Qdrant
similar_molecules(MolId, Answer) <- molecules(MolId, Vec),
Answer = ask(Vec, "collection=compound_library,limit=10").
@output("similar_molecules").

Example 9: Decomposing ask() Results with as_list, as_struct, struct:get

JSON structure returned by ask()

The ask() function returns a JSON array string where each element is a JSON object representing a Qdrant scored point. The objects always contain:

  • __id__ (string) — the point ID
  • All payload fields from the collection (e.g., name, formula, activity, __qdrant_text__ if concept-aware embeddings were used)
  • __score__ (double) — the similarity score

Example output:

[
{"__id__":"abc-123","name":"Aspirin","activity":"anti-inflammatory","__qdrant_text__":"...","__score__":0.94},
{"__id__":"def-456","name":"Ibuprofen","activity":"anti-inflammatory","__qdrant_text__":"...","__score__":0.91}
]

as_list — parse a JSON array into a typed array of structs

Syntax: as_list(json_string, "field1:type1, field2:type2, ...")

Parses a JSON array string into a Spark array of structs. The schema is a comma-separated list of field_name:type pairs. Supported types: string, double, integer, long, boolean.

Key rules:

  • Fields are matched by name, not by position. The order of fields in the schema does not need to match the order in the JSON. Spark's from_json finds each field by its key.
  • You only need to include the fields you want to extract. Omitted fields are simply ignored — you don't need to list every payload field.
  • Include __score__:double if you need the similarity score.
  • Include __id__:string if you need the point ID.
  • __qdrant_text__ is present only when the collection was written with concept-aware or text embeddings. Include it in the schema only if you need it.
% Source data with vectors stored in parquet
@bind("compounds", "parquet", "/data", "compound_vectors.parquet").
@model("compounds", "['compound_id:string', 'embedding:list']").

% For each compound, find the top similar match and extract its ID and score
top_match(CompoundId, SimilarId, Score) <- compounds(CompoundId, Vec),
Answer = ask(Vec, "collection=compound_library,limit=1"),
L = as_list(Answer, "compound_id:string, name:string, __score__:double"),
First = collections:get(L, 1),
SimilarId = struct:get("compound_id", First),
Score = struct:get("__score__", First).
@output("top_match").

In this example, the schema "compound_id:string, name:string, __score__:double" extracts only 3 fields from the JSON — even if each object contains many more payload fields. The rest are discarded.

collections:get — access elements by index

Syntax: collections:get(array, index)

Returns the element at the given position. Indexing is 1-based (1 = first element, 2 = second, etc.).

struct:get — extract a field from a struct

Syntax: struct:get("field_name", struct)

Extracts a single named field from a struct. The field name must match one of the fields declared in the as_list or as_struct schema.

Extracting multiple results:

% Get the top 3 similar cities and extract each one
city_query("European capital").

top3(Query, Name1, Name2, Name3) <- city_query(Query),
Answer = ask(Query, "collection=cities,limit=3"),
L = as_list(Answer, "name:string, __score__:double"),
First = collections:get(L, 1),
Second = collections:get(L, 2),
Third = collections:get(L, 3),
Name1 = struct:get("name", First),
Name2 = struct:get("name", Second),
Name3 = struct:get("name", Third).
@output("top3").

Here the schema only declares name and __score__ — the country, __id__, and any other payload fields are ignored because they aren't needed.

as_struct — parse a single JSON object

Syntax: as_struct(json_string, "field1:type1, field2:type2, ...")

Same as as_list but for a single JSON object (not an array). Returns a struct directly, without wrapping it in an array.

% If you have a single JSON object string (not an array)
data("{""name"":""Alice"",""age"":30}").

result(Name, Age) <- data(J),
S = as_struct(J, "name:string, age:integer"),
Name = struct:get("name", S),
Age = struct:get("age", S).
@output("result").
Schema cheat sheet for Qdrant ask() results

When writing the as_list schema for ask() results, you only need to declare the fields you actually use. Here are the available fields:

FieldTypeAlways present?Description
__id__stringYesPoint ID (UUID or integer)
(payload fields)variesYesAll payload fields from the collection (e.g., name, formula, weight)
__qdrant_text__stringOnly with concept/text embeddingsThe text that was embedded
__score__doubleYesCosine similarity score

Example — extracting only name and score from a molecules collection:

as_list(Answer, "name:string, __score__:double")

Example — extracting everything including the embedded text:

as_list(Answer, "__id__:string, name:string, formula:string, weight:double, activity:string, __qdrant_text__:string, __score__:double")

The field order in the schema does not matter — fields are matched by name.

Example 10: Writing Pre-Computed Embeddings

When your data already contains embedding vectors (e.g., from a custom model or external pipeline), you can write them directly to Qdrant without using the Azure OpenAI embedding service.

% Read data with pre-computed embeddings from parquet
@bind("vectors_parquet", "parquet", "/data/embeddings", "items_with_vectors.parquet").
@model("vectors_parquet",
"['item_id:string', 'name:string', 'category:string', 'embedding:list']").

% Write to Qdrant with the embedding column specified in the @model
items_qdrant(ItemId, Name, Category, Embedding) <-
vectors_parquet(ItemId, Name, Category, Embedding).
@output("items_qdrant").

@bind("items_qdrant",
"qdrant host='localhost', port=6334, vectorDimension=1024, distance='Cosine'",
"", "item_embeddings").
@model("items_qdrant", "['item_id:string', 'name:string', 'category:string', 'embedding:list']").

Example 11: End-to-End Pipeline — Write, Query, and Reason

This example demonstrates a complete workflow: writing data to Qdrant, querying it with semantic search, and applying Vadalog reasoning over the results.

% Step 1: Write product data with concept-aware embeddings
product("Wireless Headphones", "Electronics", 79.99).
product("Desk Lamp LED", "Lighting", 34.99).
product("Mechanical Keyboard", "Electronics", 129.99).
product("Standing Desk", "Furniture", 499.99).

product_out(Name, Category, Price) <- product(Name, Category, Price).
@output("product_out").

@bind("product_out",
"qdrant host='localhost', port=6334, vectorDimension=3072",
"", "products").
@model("product_out",
"['name:string', 'category:string', 'price:double']",
"[name] is a [category] product priced at [price] dollars").
% Step 2: Query with natural language and apply reasoning
@bind("product_search",
"qdrant host='localhost', port=6334, query='affordable computer accessories', limit=10",
"", "products").

% @model is optional on reads; provide a description to enrich the
% query embedding with concept context.
@model("product_search",
"['name:string', 'category:string', 'price:double']",
"[name] is a [category] product priced at [price] dollars").

% Body atom binds positionally to the canonical Qdrant ordering:
% __id__, category, name, price, __score__.
affordable_match(Name, Category, Price, Score) <-
product_search(Id, Category, Name, Price, Score),
Score > 0.7,
Price < 100.
@output("affordable_match").

The Role of @model on Reads

On a Qdrant read predicate, @model does not declare the schema and does not reorder columns. The read result is always shaped by the Qdrant reader itself, in the canonical column ordering documented in Deterministic Column Ordering:

__id__, payload fields (alphabetical), __qdrant_text__ if present, __vector__ if includeVector=true, __score__ (search mode only).

What @model does on a read predicate is purely embedding enrichment: it provides the concept description (and the field list used to auto-generate one) that the @bind query is enriched with when concept=true. See Example 4: Concept-Aware Query Enrichment.

% On reads, @model contributes only the concept context to the query.
% The column ordering of the result is fixed by the Qdrant reader.
@bind("mol_in", "qdrant query='pain relief',limit=3", "", "molecules").
@model("mol_in",
"['name:string', 'formula:string', 'weight:double', 'activity:string']",
"[name] is a molecule with formula [formula], weighing [weight], with activity [activity]").

% Body atom binds positionally to the canonical Qdrant ordering:
% pos 0 -> __id__
% pos 1 -> activity (alphabetical payload)
% pos 2 -> formula
% pos 3 -> name
% pos 4 -> weight
% pos 5 -> __qdrant_text__ (if present)
% pos 6 -> __score__ (search mode)
result(Name, Score) :- mol_in(Id, Activity, Formula, Name, Weight, Text, Score).
@output("result").

If you only need a few fields, prefer a SQL projection in the rule body — the projection (and any LIMIT/OFFSET/WHERE) is pushed down to Qdrant so only the requested payload fields are deserialised:

result(Name, Score) :- SELECT name, __score__ FROM mol_in.
tip

On a Qdrant read, @model is entirely optional — you only need it if you want concept enrichment (and the description it contributes). Without it, the canonical column ordering is still produced and rule bodies can still bind positionally.

@model is required on the write side: there it names the payload fields written to Qdrant and provides the description template used for concept-aware embedding.

Example 12: Ontology-grounded RAG with reasoning — a pharma knowledge base

This is the full ontology-grounded RAG pipeline: build a Qdrant knowledge base from three heterogeneous sources (a PDF, a text file, a CSV), retrieve over it with concept-aware semantic search, and then reason on the retrieved scored points with Vadalog rules.

Three layers of the pipeline:

LayerImplemented byWhat it contributes
Ontology@model description template on the write predicateResolves to a domain sentence per row (e.g. "Aspirin is a pharmaceutical drug indicated for pain relief") — this is what gets embedded, so vectors live in a domain-grounded semantic space rather than as raw text fragments.
RAGThe same @model on the read predicate + concept=true (default)At query time the description is appended to the user's question. The query vector now lives in the same semantic space as the stored data, so semantic recall is dramatically higher than embedding the raw question alone.
ReasoningPlain Vadalog rules over the retrieved predicateScored points come back as a regular predicate. Join them with structured facts (approvals, allergies, contraindications), filter by score thresholds, derive new facts. This is the layer classical RAG can't do.

The scenario: A pharmaceutical company maintains drug information across different formats. The pipeline below builds a unified knowledge base from those sources, retrieves drugs relevant to a cardiology query, and then applies clinical reasoning rules (e.g. "suggest cardiac drugs with high semantic relevance that are not contraindicated for the patient").

Step 1: Extract data from a PDF (drug package inserts)

% Extract drug information from PDF package inserts
@bind("drug_pdf", "binaryfile", "/data/pharma", "drug_compendium.pdf").

% The PDF extractor returns drug name and description text
drug_extracted(Name, Description) <- drug_pdf(Name, Description).

Step 2: Extract data from a text file (clinical notes)

% Extract clinical observations from plain text notes
@bind("clinical_notes", "text", "/data/pharma", "clinical_observations.txt").
@model("clinical_notes", "['drug_name:string', 'observation:string']").

notes_extracted(Drug, Observation) <- clinical_notes(Drug, Observation).

Step 3: Read structured data from CSV (drug registry)

% Read the official drug registry
@bind("drug_registry", "csv useHeaders=true", "/data/pharma", "drug_registry.csv").

registry_extracted(DrugId, Name, Category, Indication, Manufacturer) <-
drug_registry(DrugId, Name, Category, Indication, Manufacturer).

Step 4: Unify and write to Qdrant WITH concept enrichment

All three sources are unified into a single Qdrant collection. The @model description template tells Qdrant what each row means — this context is embedded alongside the data.

% Unify all sources into a single knowledge base predicate
pharma_kb(Name, Category, Indication, Source) <-
drug_extracted(Name, Desc), Category = "PDF Extract",
Indication = Desc, Source = "package_insert".
pharma_kb(Drug, Category, Observation, Source) <-
notes_extracted(Drug, Observation), Category = "Clinical Note",
Source = "clinical_notes".
pharma_kb(Name, Category, Indication, Source) <-
registry_extracted(Id, Name, Category, Indication, Manufacturer),
Source = "drug_registry".

@output("pharma_kb").

% Write to Qdrant with concept-aware embedding
@bind("pharma_kb",
"qdrant host='localhost', port=6334, vectorDimension=3072",
"", "pharma_knowledge").

% This description template is the key: it tells the embedding model
% that each row represents a pharmaceutical record with specific semantics.
@model("pharma_kb",
"['name:string', 'category:string', 'indication:string', 'source:string']",
"[name] is a pharmaceutical drug in category [category] indicated for [indication]").

When this runs, a row like ("Aspirin", "Analgesic", "pain relief and anti-inflammatory", "drug_registry") is embedded as:

"Aspirin is a pharmaceutical drug in category Analgesic indicated for pain relief and anti-inflammatory"

This sentence is what gets converted into a 3072-dimensional vector — it carries the full semantic meaning.

Step 5a: Query WITH concept enrichment (default: concept=true)

% Query the knowledge base — concept=true is the default
@bind("pharma_search",
"qdrant host='localhost', port=6334, query='cardiac treatment options', limit=5",
"", "pharma_knowledge").

% @model contributes only the concept description for query enrichment.
% The query "cardiac treatment options" is enriched to:
% "cardiac treatment options. Context: [name] is a pharmaceutical drug
% in category [category] indicated for [indication]"
@model("pharma_search",
"['name:string', 'category:string', 'indication:string', 'source:string']",
"[name] is a pharmaceutical drug in category [category] indicated for [indication]").

cardiac_drugs(Name, Category, Indication, Score) <-
pharma_search(Id, Category, Indication, Name, Source, Score).
@output("cardiac_drugs").

Result with concept=true:

NameCategoryIndicationScore
MetoprololBeta-Blockerhypertension and cardiac arrhythmia0.91
LisinoprilACE Inhibitorheart failure and hypertension0.89
WarfarinAnticoagulantprevention of cardiac thrombosis0.85

All results are relevant cardiac drugs because the concept-enriched query embedding was grounded in the pharmaceutical domain.

Step 5b: Query WITHOUT concept enrichment

% Same query but with concept=false — no semantic grounding
@bind("pharma_search_raw",
"qdrant host='localhost', port=6334, query='cardiac treatment options', concept=false, limit=5",
"", "pharma_knowledge").

% With concept=false, no concept enrichment is applied — the query is
% embedded as-is. @model isn't needed on the read side here at all; we
% omit it. The body atom still binds positionally to the canonical
% Qdrant order: __id__, category, indication, name, source, __score__.
raw_results(Name, Category, Indication, Score) <-
pharma_search_raw(Id, Category, Indication, Name, Source, Score).
@output("raw_results").

Result with concept=false:

NameCategoryIndicationScore
Cardiac Rehab GuidePDF Extractrehabilitation program overview0.72
Treatment Protocol v2Clinical Notegeneral treatment workflow0.68
MetoprololBeta-Blockerhypertension and cardiac arrhythmia0.65

Without concept enrichment, the query "cardiac treatment options" is embedded as a generic phrase. It matches documents that happen to contain the words "cardiac" or "treatment" anywhere, rather than finding drugs specifically indicated for cardiac conditions. The top results are irrelevant PDF fragments and clinical notes that mention the words but are not actual cardiac drugs.

Why concept enrichment works

The difference is in what gets embedded:

  • Without concept (concept=false): The query "cardiac treatment options" is embedded as-is — a generic 5-word phrase. The embedding model has no idea this is about pharmaceutical drugs.

  • With concept (concept=true): The query becomes "cardiac treatment options. Context: [name] is a pharmaceutical drug in category [category] indicated for [indication]". The embedding model now understands this query is looking for drugs, in categories, with specific indications. The resulting vector is pulled toward the same semantic region as the stored drug descriptions.

This is why concept enrichment is enabled by default — it bridges the gap between a user's natural language question and the structured domain knowledge stored in Qdrant.

Step 6: Reason on top of the retrieved drugs

Up to this point we have done a high-quality semantic retrieval — but that is still only retrieval. The third layer of the pipeline is reasoning: the retrieved drugs are now a regular Vadalog predicate, so we can join them with structured clinical facts and derive new ones with logical rules.

We add two small structured sources of clinical context — a patient allergy list and a recall list — and then declaratively express the policy "for this patient, suggest cardiac drugs that are highly relevant and safe to prescribe":

% Structured clinical context (could equally come from a database)
patient_allergy("p001", "ACE Inhibitor").
patient_allergy("p001", "Anticoagulant").

drug_recall("Warfarin", "2025-03-15", "manufacturing defect").

% A drug is unsafe for the current patient if it is in the patient's
% allergy list OR is on the active recall list.
unsafe_for_patient("p001", DrugName) <-
cardiac_drugs(DrugName, Category, _, _),
patient_allergy("p001", Category).

unsafe_for_patient("p001", DrugName) <-
cardiac_drugs(DrugName, _, _, _),
drug_recall(DrugName, _, _).

% Final recommendation: high semantic relevance AND clinically safe.
recommended_drug(PatientId, DrugName, Category, Indication, Score) <-
cardiac_drugs(DrugName, Category, Indication, Score),
Score > 0.8,
PatientId = "p001",
~unsafe_for_patient(PatientId, DrugName).
@output("recommended_drug").

For patient p001, who is allergic to ACE Inhibitors and Anticoagulants, and given the recall on Warfarin, only the beta-blocker survives all three filters — semantic relevance, allergy safety, recall safety:

PatientIdDrugNameCategoryIndicationScore
p001MetoprololBeta-Blockerhypertension and cardiac arrhythmia0.91

Lisinopril (ACE Inhibitor) is filtered out by the allergy rule; Warfarin (Anticoagulant and on the recall list) is filtered out by both rules.

What just happened

This is what ontology-grounded RAG with reasoning looks like in practice:

  1. The @model description acted as a lightweight ontology, grounding every embedded row in domain semantics.
  2. concept=true reused that same ontology on the query side, so the semantic search returned drugs that are genuinely about cardiac treatment — not just documents containing the words cardiac or treatment.
  3. The retrieved drugs entered the reasoning engine as the predicate cardiac_drugs, where they were joined with structured allergy and recall facts and filtered by a declarative safety policy.

Each layer is auditable: you can inspect the description that was embedded, the enriched query that was sent to the embedding model, the scored points returned by Qdrant, and the chain of rules that produced the final recommendation. There is no opaque LLM in the loop deciding what is "safe" — the decision is a derivation.

Walkthrough: Molecular Database Demo

This walkthrough demonstrates a complete Qdrant workflow — writing molecules, querying them with semantic search, applying filters, and decomposing results — using a minimal self-contained example.

How embeddings work behind the scenes

When your dataset does not contain a pre-computed embedding vector (i.e., no embeddingField or embeddingFieldPos is specified), Prometheux automatically generates embeddings using its default embedding system (Azure OpenAI text-embedding-3-large).

On write, the system constructs a natural language description for each row and embeds it into a 3072-dimensional vector:

  • If @model is provided with a description template (e.g., "[name] is a drug indicated for [indication]"), the template is resolved per row and embedded.
  • If @model is provided without a description template (or no annotation at all), the system auto-generates a description from the column names (e.g., "mol_out: name is Aspirin, formula is C9H8O4, weight is 180.16, activity is anti-inflammatory").
  • The generated text is stored in the __qdrant_text__ payload field alongside the vector.

On read, the same embedding system is used to convert your search query into a vector:

  • A query='pain relief medication' is embedded via the same model, producing a 3072-dimensional vector that is compared against the stored vectors using cosine similarity.
  • When concept=true (the default), the query is enriched with the concept context provided by @model (description template or auto-generated from field names) before embedding, biasing the vector toward the same semantic space as the stored data. This is the only role @model plays on a Qdrant read predicate — it does not shape or reorder the read result.

This means you can go from raw tabular data to semantic similarity search without writing any embedding code — the platform handles the entire vector lifecycle transparently.

Step 1: Write Molecules with Concept-Aware Embeddings

Write three molecules to Qdrant. @model names the payload fields and declares their types. Since no description template is provided, the system auto-generates concept text from the column names for embedding.

molecule("Aspirin", "C9H8O4", 180.16, "anti-inflammatory").
molecule("Ibuprofen", "C13H18O2", 206.28, "anti-inflammatory").
molecule("Caffeine", "C8H10N4O2", 194.19, "stimulant").

mol_out(Name, Formula, Weight, Activity) :- molecule(Name, Formula, Weight, Activity).

@output("mol_out").
@bind("mol_out", "qdrant", "", "molecules").
@model("mol_out",
"['name:string', 'formula:string', 'weight:double', 'activity:string']").

The minimal @bind("mol_out", "qdrant", "", "molecules") uses all platform defaults: host, port, distance (Cosine), and vectorDimension (3072 for the default text-embedding-3-large model).

Step 2: Semantic Search with Natural Language

Search the collection with a natural language query. On read, the column layout is fixed by the canonical Qdrant ordering — @model is optional and only contributes a description used to enrich the query embedding.

@bind("mol_in", "qdrant query='pain relief medication',limit=3", "", "molecules").
% @model optional — describes the concept to enrich the query embedding.
@model("mol_in",
"['name:string', 'formula:string', 'weight:double', 'activity:string']",
"[name] is a molecule with formula [formula], weighing [weight], with activity [activity]").

% Body atom binds positionally to the canonical Qdrant order:
% __id__, activity, formula, name, weight, __qdrant_text__, __score__.
result(Name, Formula, Activity, Score) :- mol_in(Id, Activity, Formula, Name, Weight, Text, Score).
@output("result").

Expected result:

NameFormulaActivityScore
IbuprofenC13H18O2anti-inflammatory0.670
AspirinC9H8O4anti-inflammatory0.667
CaffeineC8H10N4O2stimulant0.518

The anti-inflammatory drugs rank highest for "pain relief medication", while Caffeine (a stimulant) scores lowest.

Step 3: Range Filter on Molecular Weight

Filter molecules by numeric payload fields using range operators. This uses bulk scroll (no query=) with a range filter.

@bind("mol_in", "qdrant filter='weight >= 180 AND weight <= 200'", "", "molecules").

% Canonical scroll ordering: __id__, activity, formula, name, weight.
result(Name, Weight, Activity) :- mol_in(Id, Activity, Formula, Name, Weight).
@output("result").

Expected result:

NameWeightActivity
Aspirin180.16anti-inflammatory
Caffeine194.19stimulant

Ibuprofen (206.28) is excluded because it falls outside the 180–200 range.

Step 4: Combined Semantic Search with Equality Filter

Combine a semantic search with a payload filter to search for "pain relief" only among anti-inflammatory drugs.

@bind("mol_in", "qdrant query='pain relief',limit=3,filter='activity = ""anti-inflammatory""'", "", "molecules").

% Canonical search ordering:
% __id__, activity, formula, name, weight, __qdrant_text__, __score__
result(Name, Score) :- mol_in(Id, Activity, Formula, Name, Weight, Text, Score).
@output("result").

Expected result:

NameScore
Ibuprofen0.670
Aspirin0.667

Caffeine is excluded by the equality filter despite being in the collection.

Step 5: Semantic Search in the Rule Body (SELECT "..." / ASK)

The SELECT "..." FROM <pred> syntax (and its ASK "..." alias) performs semantic search directly in the rule body. The head atom binds positionally to the canonical Qdrant column ordering — the first N columns are projected (where N is the head's arity).

@bind("mol_in", "qdrant", "", "molecules").

% Canonical search ordering for this collection:
% pos 0 -> __id__
% pos 1 -> activity
% pos 2 -> formula
% pos 3 -> name
% pos 4 -> weight
% pos 5 -> __qdrant_text__
% pos 6 -> __score__
% Head arity = 2 -> binds to pos 0 (__id__) and pos 1 (activity).
result(Id, Activity) :- SELECT "energy boosting substance" FROM mol_in LIMIT 2.
@output("result").

To also fetch the raw embedding, set includeVector=true on the @bind. The vector slides into the canonical layout just before __score__:

@bind("mol_in", "qdrant includeVector=true", "", "molecules").

% Canonical search ordering with includeVector=true:
% pos 0 -> __id__
% pos 1 -> activity
% pos 2 -> formula
% pos 3 -> name
% pos 4 -> weight
% pos 5 -> __qdrant_text__
% pos 6 -> __vector__
% pos 7 -> __score__
result(Id, Activity, Formula, Name, Weight, Text, Vec, Score) :-
SELECT "energy boosting substance" FROM mol_in LIMIT 2.
@output("result").

Expected result:

NameActivity
Caffeinestimulant
Aspirinanti-inflammatory

Caffeine ranks first because "energy boosting substance" is semantically closest to a stimulant.

Step 6: ask() UDF with Variable Substitution

The ask() function searches Qdrant per-row using the ${Variable} pattern (same as llm:generate). It returns a JSON array of results that can be decomposed with as_list, collections:get, and struct:get.

@bind("mol_in", "qdrant", "", "molecules").

% Canonical scroll ordering: __id__, activity, formula, name, weight, __qdrant_text__.
% QueryName binds to position 3 (name).
result(QueryName, MatchName, Score) :-
mol_in(Id, Activity, Formula, QueryName, Weight, Text),
J = ask("find similar to ${QueryName}", "collection=molecules,limit=2"),
L = as_list(J, "name:string, activity:string, __score__:double"),
E = collections:get(L, 2),
MatchName = struct:get("name", E),
Score = struct:get("__score__", E).
@output("result").

For each molecule, ask() searches Qdrant with "find similar to [name]" and returns JSON results. The decomposition pipeline extracts the second match (collections:get(L, 2) — 1-based indexing) to find each molecule's nearest neighbor.

Expected result:

QueryNameMatchNameScore
AspirinIbuprofen0.394
IbuprofenAspirin0.417
CaffeineAspirin0.223

The anti-inflammatory drugs (Aspirin, Ibuprofen) are each other's nearest neighbors with higher similarity, while Caffeine's nearest match has a lower score — confirming the semantic clustering works as expected.

Deterministic Column Ordering

Qdrant stores payload fields internally using a protobuf map, which does not guarantee a consistent iteration order across reads. To ensure predictable schemas, Prometheux enforces a fixed column layout on every Qdrant read:

#Column(s)OrderingPresent when
1__id__Always firstAlways
2Payload fields (e.g. activity, formula, name, weight)Alphabetically sorted by field nameWhen payload is included (default)
3__qdrant_text__After all other payload fieldsWhen the data was written with concept-aware embedding
4__vector__After the payload groupWhen includeVector=true
5__score__Always lastSearch mode only (not in scroll reads)

This ordering is deterministic and consistent across runs, regardless of how Qdrant's protobuf layer iterates the payload map internally. The layout is fixed by the connector@mapping / @model annotations on a Qdrant read predicate do not reorder, rename, or project this layout. Rule bodies bind to the result by position.

tip

When the Prometheux data-manager generates Vadalog annotations for a Qdrant collection, the field list in the @model follows this same ordering convention. This means the schema you see in the data-manager UI matches exactly what the reader produces at runtime.

SQL in Rule Bodies

You can write SQL queries directly in rule bodies over Qdrant-bound predicates, just like you would with JDBC or Neo4j data sources. The system transparently converts the SQL into the appropriate Qdrant operations.

SELECT * with LIMIT and OFFSET

Read a page of results from a Qdrant collection:

@bind("mol_in", "qdrant", "", "molecules").

mol_out_result() <- SELECT * FROM mol_in LIMIT 10 OFFSET 0.
@output("mol_out_result").

The LIMIT and OFFSET clauses are extracted at optimization time and pushed to the Qdrant reader, which scrolls the collection and applies them efficiently. The result is shaped by the canonical Qdrant ordering — no @model needed.

Column projection with SELECT [columns]

You can project specific payload fields by listing them in the SELECT clause:

@bind("mol_in", "qdrant", "", "molecules").

mol_names(Name) <- SELECT name FROM mol_in LIMIT 3.
@output("mol_names").

When the SQL body lists explicit columns (instead of SELECT *), the reader pushes a payload allowlist down to Qdrant: only the requested payload fields are fetched and deserialised. The LIMIT/OFFSET/WHERE clauses are also pushed down, exactly like with SELECT *.

This works for any single-table SELECT over a Qdrant predicate (no JOINs, no aggregates) — column-projected and SELECT * bodies share the same fast path.

SELECT * with WHERE

Filter payload fields using SQL WHERE syntax:

@bind("mol_in", "qdrant", "", "molecules").

anti_inflammatory() <- SELECT * FROM mol_in WHERE activity = 'anti-inflammatory'.
@output("anti_inflammatory").

The WHERE clause is converted into a Qdrant payload filter and applied server-side before results are returned.

Aggregations: COUNT, SUM, AVG, etc.

Aggregate queries are executed via Spark SQL over the Qdrant DataFrame:

@bind("mol_in", "qdrant", "", "molecules").

count_result(Total) <- SELECT COUNT(*) as total FROM mol_in.
@output("count_result").

Since Qdrant does not have a native SQL engine, aggregate queries (COUNT, SUM, AVG, DISTINCT, …) are handled by first loading the collection data into a Spark DataFrame and then executing the SQL query over it using Spark SQL.

Combining Both in a Single Program

You can mix SELECT * and aggregate queries over the same Qdrant predicate:

@bind("mol_in", "qdrant", "", "molecules").

% Read a page of results
mol_page() <- SELECT * FROM mol_in LIMIT 10 OFFSET 0.
@output("mol_page").

% Count all rows
mol_count(Total) <- SELECT COUNT(*) as total FROM mol_in.
@output("mol_count").

% Average weight
mol_avg(AvgWeight) <- SELECT AVG(weight) as avg_weight FROM mol_in.
@output("mol_avg").
How SQL-in-body works for Qdrant

The optimization pipeline handles Qdrant SQL rules in two ways:

  • Single-table SELECT (with or without explicit columns) with optional LIMIT/OFFSET/WHERE: The clauses are extracted and stored as parameters on the @bind annotation. The Qdrant reader applies LIMIT and OFFSET at scroll time, and WHERE is converted into a Qdrant payload filter applied server-side. If the SELECT lists explicit columns, the reader also pushes a payload allowlist down to Qdrant, so only the requested fields are fetched.

  • Aggregates / JOINs (COUNT, SUM, AVG, DISTINCT, multi-table joins): The Qdrant collection is first loaded as a Spark DataFrame via the regular @bind mechanism, then the SQL query is executed over it using Spark SQL. All standard SQL aggregation features are available.

Configuration Best Practices

  1. Use Platform Defaults: Connection settings (host, port, API key, Azure OpenAI credentials) are managed by the Prometheux platform configuration. You only need to specify options in @bind when overriding the defaults.

  2. Vector Dimensions: When using concept-aware or text embedding, vectorDimension is automatically derived from the embedding model (e.g., 3072 for text-embedding-3-large). You only need to set it explicitly when using pre-computed embeddings with a custom dimension.

  3. Concept Enrichment: Leave concept=true (the default) when querying collections written with @model descriptions. This ensures the query embedding is biased toward the same semantic space as the stored embeddings.

  4. Raw Vectors vs. Text Queries: Use queryVector= when working with domain-specific embeddings that were not generated from text (e.g., molecular fingerprints, audio features). Use query= for natural language semantic search.

  5. Filter Performance: Payload filters are applied server-side by Qdrant before results are returned. Use filters to reduce result sets rather than filtering in Vadalog rules for better performance.

  6. Pre-Computed Embeddings: Use embeddingField (column name) when you have an @model that names the vector column, or embeddingFieldPos (0-based positional index) when no schema annotation is present.

S3 Storage

Amazon S3 (Simple Storage Service) is a widely used cloud storage service that allows for scalable object storage. In Vadalog, you can both read from and write to S3 buckets by treating the S3 storage as a file system. This example demonstrates how to interact with S3, specifically by writing a CSV file to S3 and then reading from it.

In this example, we first read data from a CSV file stored locally and then write it to an S3 bucket.

% Bind the 'user_csv' concept to the local CSV file 'users.csv' located in 'disk/datasources'
@bind("user_csv", "csv useHeaders=true", "disk/datasources", "users.csv").

% Define a rule to map the data from 'user_csv' to the 'user' concept
user(X) <- user_csv(X).

% Declare the output concept 'user' for writing the data to an S3 bucket
@output("user").

% Bind the 'user' concept to a CSV file in the specified S3 bucket
@bind("user", "csv", "s3a://your-s3-bucket/", "user.csv").

In this example, we demonstrate how to read data from a CSV file stored in an S3 bucket.

% Bind the 'user_s3' concept to the CSV file 'user.csv' located in the specified S3 bucket
@bind("user_s3", "csv", "s3a://your-s3-bucket/", "user.csv").

% Define a rule to map the data from 'user_s3' to the 'user' concept
user(X) <- user_s3(X).

% Declare the output concept 'user' for making the data available after reading from S3
@output("user").

Example: S3 Storage Integration

You can use a similar approach to bind predicates to CSV files stored in Amazon S3:

% Bind the 'user_s3' concept to the CSV file 'user.csv' located in the specified S3 bucket
@bind("user_s3", "csv", "s3a://your-s3-bucket/", "user.csv").

% Map the data as needed
user(X) <- user_s3(X).

% Declare the output concept
@output("user").

Consuming Data via API

Vadalog supports consuming data from REST APIs with flexible authentication and SQL query capabilities. The API reader supports JSON (default), CSV, and XML response formats, with full SQL integration for querying nested data structures.

Popular Use Cases:

  • 📈 Cryptocurrency Analytics: Bitcoin, Ethereum, real-time price monitoring
  • Sports Analytics: Football leagues, team statistics, match predictions
  • 🌍 Geospatial Data: Country demographics, weather patterns, mapping
  • 📊 Monitoring Systems: Prometheus metrics, Kubernetes clusters, Alertmanager
  • 🔗 Data Integration: Combine multiple APIs for comprehensive analytics

Quick Start Examples

Get Bitcoin Price (works immediately, no auth):

@bind("btc", "api", "https://api.coingecko.com/api/v3/", "coins/bitcoin").
result() <- SELECT id, symbol, market_data.current_price.usd AS price FROM btc.
@output("result").

Get Football Team Data:

@bind("teams", "api", "https://www.thesportsdb.com/api/v1/json/3/", "searchteams.php?t=Arsenal").
result() <- SELECT SIZE(teams) AS team_count FROM teams.
@output("result").

Query Prometheus Metrics:

@bind("metrics", "api", "http://prometheus-server:9090/api/v1/", "query?query=up").
result(Status, ResultType) <- metrics(Data, Status),
ResultType = struct:get("resultType", Data).
@output("result").

API Configuration Options

The API datasource supports the following configuration options:

  • responseFormat: Response format - json (default), csv, or xml. If not specified, JSON is assumed.
  • authType: Authentication method - basic, bearer, or apikey
  • username: Username for basic authentication
  • password: Password for basic authentication
  • token: Token for bearer authentication
  • apikey: API key for API key authentication
  • headers: Custom headers in format header1:value1,header2:value2
  • delimiter: CSV delimiter character (default: ,)
Free Public APIs

Many popular APIs don't require authentication and can be used immediately:

  • CoinGecko - Cryptocurrency prices and market data (rate limits apply on free tier)
  • TheSportsDB - Football/sports statistics and results
  • REST Countries - Country demographics and geospatial data
  • JSONPlaceholder - Test data for prototyping

Perfect for getting started with Vadalog API integration!

API Rate Limits

Free-tier APIs often have rate limits (e.g., CoinGecko: ~10-30 requests/minute). For production use:

  • Space out requests or use caching
  • Consider upgrading to paid API tiers for higher limits
  • Handle HTTP 429 (Rate Limit Exceeded) errors gracefully

Example 1: Simple API Call (JSON - Default Format)

Since JSON is the default response format, you don't need to specify responseFormat=json:

% Bind to a public API that returns JSON (no authentication needed)
% responseFormat defaults to JSON
@bind("users_api", "api", "https://jsonplaceholder.typicode.com/", "users").

% Use SQL in rule body to query nested fields
result() <- SELECT name, email, address.city AS city
FROM users_api
WHERE id <= 5.

@output("result").

Example 2: Querying Nested JSON with SQL

The recommended approach for querying API data is to use SQL directly in rule bodies:

% Bind to GitHub API with authentication
@bind("github_issues",
"api authType=bearer, token=${GITHUB_TOKEN}",
"https://api.github.com/repos/", "owner/repo/issues").

% Query nested JSON structures with SQL
open_bugs() <- SELECT title,
user.login AS author,
labels[0].name AS primary_label,
created_at
FROM github_issues
WHERE state = 'open'
AND labels[0].name = 'bug'
ORDER BY created_at DESC.

@output("open_bugs").
Accessing Nested Structures in API Responses

The following Prometheus examples demonstrate how to work with nested JSON structures using Vadalog collection functions:

  • struct:get("field", struct): Access a field within a struct
  • collections:explode(array): Convert an array into multiple rows (one row per element)
  • collections:get(array, index): Access a specific element in an array (0-based index)
  • SIZE(array): Get the number of elements in an array

These functions allow you to navigate complex API responses without SQL array indexing.

Example 3: Prometheus Metrics API

Prometheus is a popular monitoring system that exposes metrics via REST API. Here's how to query Prometheus instant queries and extract metric values:

% Query Prometheus instant query endpoint for 'up' metric
@bind("up_metric", "api", "http://prometheus-server:9090/api/v1/", "query?query=up").

% Explode the results array to process individual metrics
result_linear(Result) <-
up_metric(Data, Status),
Status = "success",
Results = struct:get("result", Data),
Result = collections:explode(Results).

% Extract metric labels and values
result(Job, Instance, Timestamp, Value) <-
result_linear(Result),
Metric = struct:get("metric", Result),
Job = struct:get("job", Metric),
Instance = struct:get("instance", Metric),
ValueArray = struct:get("value", Result),
Timestamp = collections:get(ValueArray, 0),
Value = collections:get(ValueArray, 1).

@output("result").

Example 4: Prometheus Targets Endpoint

Query Prometheus targets to get detailed information about scrape targets:

% Get information about scrape targets
@bind("targets", "api", "http://prometheus-server:9090/api/v1/", "targets").

% Explode active targets array
active_targets_linear(Target) <-
targets(Data, Status),
Status = "success",
ActiveTargets = struct:get("activeTargets", Data),
Target = collections:explode(ActiveTargets).

% Extract target details
result(Job, Address, Health, ScrapeUrl) <-
active_targets_linear(Target),
DiscoveredLabels = struct:get("discoveredLabels", Target),
Job = struct:get("job", DiscoveredLabels),
Address = struct:get("__address__", DiscoveredLabels),
Health = struct:get("health", Target),
ScrapeUrl = struct:get("scrapeUrl", Target).

@output("result").

Example 5: Prometheus Goroutines Metric

Query and extract goroutine counts from Prometheus:

% Query go_goroutines metric
@bind("goroutines", "api", "http://prometheus-server:9090/api/v1/", "query?query=go_goroutines").

% Explode results to process each metric
result_linear(Result) <-
goroutines(Data, Status),
Status = "success",
Results = struct:get("result", Data),
Result = collections:explode(Results).

% Extract goroutine count for each instance
result(Job, Instance, GoroutineCount) <-
result_linear(Result),
Metric = struct:get("metric", Result),
Job = struct:get("job", Metric),
Instance = struct:get("instance", Metric),
ValueArray = struct:get("value", Result),
GoroutineCount = collections:get(ValueArray, 1).

@output("result").

Example 6: Prometheus Alert Rules

Query Prometheus alert rule definitions (works even when no alerts are firing):

% Prometheus alert rules API
@bind("rules_api", "api", "http://prometheus-server:9090/api/v1/", "rules").

% Extract alert rule groups
groups_linear(Group) <-
rules_api(Data, Status),
Status = "success",
GroupsArray = struct:get("groups", Data),
Group = collections:explode(GroupsArray).

% Explode rules array within each group
rules_linear(Rule) <-
groups_linear(Group),
Rules = struct:get("rules", Group),
Rule = collections:explode(Rules).

% Extract alert details
result(AlertName, State, Query, Severity, Description) <-
rules_linear(Rule),
AlertName = struct:get("name", Rule),
State = struct:get("state", Rule),
Query = struct:get("query", Rule),
Labels = struct:get("labels", Rule),
Severity = struct:get("severity", Labels),
Annotations = struct:get("annotations", Rule),
Description = struct:get("summary", Annotations).

@output("result").

Example 7: Prometheus - List All Metric Names

Query Prometheus to get all available metric names and filter them:

% Get all available metric names
@bind("metric_names", "api", "http://prometheus-server:9090/api/v1/", "label/__name__/values").

% Explode the data array to get individual metric names
metrics_linear(MetricName) <-
metric_names(Data, Status),
Status = "success",
MetricName = collections:explode(Data).

% Filter for metrics containing "prometheus"
result(MetricName) <-
metrics_linear(MetricName),
contains(MetricName, "prometheus").

@output("result").

You can also count the total number of metrics:

% Get all metric names
@bind("all_metrics", "api", "http://prometheus-server:9090/api/v1/", "label/__name__/values").

% Count total metrics
result(TotalCount) <-
all_metrics(Data, Status),
Status = "success",
TotalCount = SIZE(Data).

@output("result").

Example 8: Prometheus - Query Build Information

Query Prometheus build information to get version details:

% Get Prometheus build information
@bind("build_info", "api", "http://prometheus-server:9090/api/v1/", "query?query=prometheus_build_info").

% Explode results to access build info
result_linear(Result) <-
build_info(Data, Status),
Status = "success",
Results = struct:get("result", Data),
Result = collections:explode(Results).

% Extract version information
result(Version, GoVersion, Instance) <-
result_linear(Result),
Metric = struct:get("metric", Result),
Version = struct:get("version", Metric),
GoVersion = struct:get("goversion", Metric),
Instance = struct:get("instance", Metric).

@output("result").

Example 9: Prometheus - Range Query

Query metric values over a time range:

% Range query for 'up' metric with 60s step
% Note: Use epoch timestamps for start/end parameters
% Example timestamps: Feb 2026 (update to current time for your use)
@bind("up_range",
"api authType=bearer, token=${PROMETHEUS_TOKEN}",
"https://prometheus.example.com/api/v1/",
"query_range?query=up&start=1770700000&end=1770800000&step=60").

% Analyze the range query response
range_summary() <- SELECT status,
data.resultType AS result_type,
SIZE(data.result) AS series_count
FROM up_range
WHERE status = 'success'.

@output("range_summary").

Using Parameters for Dynamic Time Ranges:

% Define time range parameters (epoch timestamps)
% Example: Large time range from 2026-2028 for long-term validity
@param("start_time", "1770700000"). % Feb 10, 2026
@param("end_time", "1770800000"). % Feb 11, 2026
@param("step", "60").

% Range query with parameterized time range
@bind("metric_range",
"api authType=bearer, token=${PROMETHEUS_TOKEN}",
"https://prometheus.example.com/api/v1/",
"query_range?query=up&start=${start_time}&end=${end_time}&step=${step}").

% Process the time series data
time_series_data() <- SELECT status,
data.resultType AS result_type,
SIZE(data.result) AS total_series
FROM metric_range
WHERE status = 'success'.

@output("time_series_data").
Dynamic Timestamps

For current data, you can use environment variables or update the parameter values to recent epoch timestamps. Prometheus typically retains metrics for 15 days by default, so use timestamps within your retention period.

Working Example Without Authentication:

If your Prometheus instance doesn't require authentication, you can omit the auth parameters:

% Public Prometheus instance (no authentication needed)
% Define time range parameters
@param("start_time", "1770700000").
@param("end_time", "1770800000").
@param("step", "60").

% Query the 'up' metric over time range
@bind("prom_up",
"api",
"http://databases.prometheux.ai:9090/api/v1/",
"query_range?query=up&start=${start_time}&end=${end_time}&step=${step}").

% Extract time series information
up_metrics() <- SELECT status,
data.resultType AS result_type,
SIZE(data.result) AS series_count
FROM prom_up
WHERE status = 'success'.

@output("up_metrics").

Example 10: Prometheus Alertmanager API

Alertmanager is Optional

Alertmanager is a separate service from Prometheus that handles alert routing and notification. It typically runs on port 9093. If you only have Prometheus running (port 9090), you won't have this endpoint available. This example is for advanced setups with Alertmanager deployed.

Query active alerts from Prometheus Alertmanager using the v2 API:

% Get active alerts from Alertmanager (requires Alertmanager service running)
% Note: Use API v2 (v1 was deprecated and removed in version 0.27.0)
@bind("alerts_api",
"api authType=bearer, token=${ALERTMANAGER_TOKEN}",
"https://alertmanager.example.com/api/v2/", "alerts").

% Count active alerts
alert_summary() <- SELECT SIZE(alerts_api) AS total_alerts.

@output("alert_summary").

Working Example Without Authentication:

% Public Alertmanager instance (no authentication needed)
@bind("alerts",
"api",
"http://databases.prometheux.ai:9093/api/v2/", "alerts").

% Extract alert information using collections:explode for array processing
alerts_linear(Alert) <- alerts(AlertsArray),
Alert = collections:explode(AlertsArray).

% Extract fields from each alert using struct:get
critical_alerts(AlertName, Status, StartsAt) <-
alerts_linear(Alert),
Labels = struct:get("labels", Alert),
AlertName = struct:get("alertname", Labels),
Severity = struct:get("severity", Labels),
StatusData = struct:get("status", Alert),
Status = struct:get("state", StatusData),
StartsAt = struct:get("startsAt", Alert),
Severity = "critical".

@output("critical_alerts").

Example 11: Kubernetes API

Query Kubernetes resources via the API server:

% Get pod information from Kubernetes API
@bind("pods",
"api authType=bearer, token=${K8S_TOKEN}",
"https://kubernetes.default.svc/api/v1/", "pods").

% Query pod metadata and status (top-level fields work reliably)
pod_info() <- SELECT metadata.name AS pod_name,
metadata.namespace AS namespace,
status.phase AS phase
FROM pods
WHERE status.phase != 'Running'.

@output("pod_info").
Working with Nested Arrays

When dealing with deeply nested arrays (like status.containerStatuses[0]), consider using the struct:get function in Vadalog rules instead of direct array indexing in SQL. This provides more reliable access to complex nested structures.

Example 12: CSV Format API (Weather Data)

For APIs that return CSV format, specify responseFormat=csv:

% Bind to weather API that returns CSV
@bind("meteo",
"api delimiter=';', username='my_username', password='my_password', responseFormat=csv",
"https://api.meteomatics.com/",
"2025-06-09T00:00:00Z--2025-06-12T00:00:00Z:PT3H/t_2m:C,relative_humidity_2m:p/47.423336,9.377225/csv").

% Query CSV data with SQL
hot_days() <- SELECT Valid_Date, T_2m_C
FROM meteo
WHERE T_2m_C > 25
ORDER BY T_2m_C DESC.

@output("hot_days").

Example 13: Joining Multiple API Sources

You can join data from multiple API endpoints using SQL in rule bodies:

% Bind to JSONPlaceholder users API
@bind("users_api", "api", "https://jsonplaceholder.typicode.com/", "users").

% Bind to JSONPlaceholder posts API
@bind("posts_api", "api", "https://jsonplaceholder.typicode.com/", "posts").

% Join users with their posts using renamed column format (users_api_3 = id)
user_posts() <- SELECT u.users_api_4 AS author_name,
u.users_api_2 AS email,
p.posts_api_2 AS post_title
FROM users_api u
JOIN posts_api p ON u.users_api_3 = p.posts_api_3
WHERE u.users_api_3 <= 5
LIMIT 20.

@output("user_posts").

Example 14: Common Table Expressions (CTEs) with API Data

Use CTEs for complex multi-step transformations:

@bind("orders_api",
"api authType=apikey, apikey=${API_KEY}",
"https://api.shop.com/v1/", "orders").

% Use CTE to calculate customer lifetime value
customer_value() <- WITH customer_orders AS (
SELECT customer.customer_id AS customer_id,
customer.name AS customer_name,
order_id,
total_amount
FROM orders_api
WHERE status = 'completed'
),
customer_totals AS (
SELECT customer_id,
customer_name,
COUNT(*) AS order_count,
SUM(total_amount) AS lifetime_value
FROM customer_orders
GROUP BY customer_id, customer_name
)
SELECT customer_name, lifetime_value, order_count
FROM customer_totals
WHERE lifetime_value > 5000
ORDER BY lifetime_value DESC.

@output("customer_value").

Example 15: Using Parameters in API Queries

Combine @param with API queries for dynamic filtering:

% Define parameters
@param("target_status", "success").
@param("max_results", "100").

@bind("metrics_api", "api authType=bearer, token=${TOKEN}", "https://prometheus.example.com/api/v1/", "query?query=up").

% Use parameters in WHERE clause
filtered_metrics() <- SELECT status,
data.resultType AS result_type
FROM metrics_api
WHERE status = '${target_status}'
LIMIT ${max_results}.

@output("filtered_metrics").

Example 16: Bitcoin & Cryptocurrency Price Analytics

Track and analyze cryptocurrency prices in real-time using the CoinGecko API (no authentication required):

Basic Bitcoin Data:

% Get Bitcoin basic information from CoinGecko
@bind("bitcoin", "api", "https://api.coingecko.com/api/v3/", "coins/bitcoin").

% Extract basic Bitcoin info
btc_info() <- SELECT id, symbol, name FROM bitcoin.

@output("btc_info").

Bitcoin Price with Nested Field Access:

% Get Bitcoin market data
@bind("bitcoin", "api", "https://api.coingecko.com/api/v3/", "coins/bitcoin").

% Access nested price data
btc_price() <- SELECT id,
symbol,
market_data.current_price.usd AS price_usd
FROM bitcoin.

@output("btc_price").

Real-Time Bitcoin Price Monitoring with Multiple Currencies:

% Monitor Bitcoin price in multiple currencies (faster endpoint)
@bind("btc_price", "api", "https://api.coingecko.com/api/v3/", "simple/price?ids=bitcoin&vs_currencies=usd,eur&include_24hr_change=true").

% Extract multi-currency prices
price_data() <- SELECT bitcoin.usd AS price_usd,
bitcoin.eur AS price_eur
FROM btc_price.

@output("price_data").

Advanced: Bitcoin Price Alerts with Market Signals:

% Monitor Bitcoin with market signal using CTE and CASE
@bind("btc_price", "api", "https://api.coingecko.com/api/v3/", "simple/price?ids=bitcoin&vs_currencies=usd&include_24hr_change=true").

% Analyze price changes and create market signals
price_analysis() <- WITH btc_data AS (
SELECT bitcoin.usd AS price_usd,
bitcoin.usd_24h_change AS change_pct
FROM btc_price
)
SELECT price_usd,
change_pct,
CASE
WHEN change_pct > 5 THEN 'SURGE'
WHEN change_pct < -5 THEN 'DROP'
ELSE 'STABLE'
END AS market_signal
FROM btc_data.

@output("price_analysis").

Compare Multiple Cryptocurrencies:

% Get list of all cryptocurrencies
@bind("crypto_list", "api", "https://api.coingecko.com/api/v3/", "coins/list").

% Filter for major cryptocurrencies
major_cryptos() <- SELECT id, symbol, name
FROM crypto_list
WHERE symbol IN ('btc', 'eth', 'usdt')
LIMIT 10.

@output("major_cryptos").

Example 17: Football (Soccer) Analytics

Analyze football match data, team statistics, and player performance using TheSportsDB API (no authentication required):

TheSportsDB Response Format

TheSportsDB wraps response data in arrays (e.g., table, teams, results). Use SIZE() to count items or access top-level response structure.

Premier League Standings - Count Teams:

% Get Premier League standings from TheSportsDB
@bind("premier_league", "api", "https://www.thesportsdb.com/api/v1/json/3/", "lookuptable.php?l=4328&s=2023-2024").

% Count teams in the league
team_count() <- SELECT SIZE(table) AS team_count
FROM premier_league
WHERE SIZE(table) > 0.

@output("team_count").

Search for Specific Team:

% Search for Arsenal team information
@bind("team_info", "api", "https://www.thesportsdb.com/api/v1/json/3/", "searchteams.php?t=Arsenal").

% Check how many teams match the search
team_results() <- SELECT SIZE(teams) AS team_count
FROM team_info
WHERE SIZE(teams) > 0.

@output("team_results").

Get Recent Matches:

% Get last matches for Arsenal (team ID 133604)
@bind("recent_matches", "api", "https://www.thesportsdb.com/api/v1/json/3/", "eventslast.php?id=133604").

% Count recent matches
match_count() <- SELECT SIZE(results) AS match_count
FROM recent_matches
WHERE SIZE(results) > 0.

@output("match_count").

Get All Available Leagues:

% Get all leagues from TheSportsDB
@bind("all_leagues", "api", "https://www.thesportsdb.com/api/v1/json/3/", "all_leagues.php").

% Count available leagues
league_data() <- SELECT SIZE(leagues) AS league_count
FROM all_leagues
WHERE SIZE(leagues) > 0.

@output("league_data").

Get All Countries with Sports:

% Get all countries with sports leagues
@bind("all_countries", "api", "https://www.thesportsdb.com/api/v1/json/3/", "all_countries.php").

% Count countries
country_data() <- SELECT SIZE(countries) AS country_count
FROM all_countries
WHERE SIZE(countries) > 0.

@output("country_data").

Example 18: Working with Arrays Using Vadalog Collection Functions

Instead of using SQL queries, you can process arrays directly in Vadalog using collection built-in functions. This is particularly useful when you want to work with arrays in a more functional style.

When Do You Need collections:explode?

API responses have two common structures:

  1. Root-level array - Returns [{item1}, {item2}, ...]

    • Each element becomes a separate row automatically
    • No explode needed
  2. Nested array field - Returns {"leagues": [{item1}, {item2}, ...]}

    • You get 1 row with the entire array as a single column
    • Use collections:explode to create multiple rows

Most public APIs (TheSportsDB, CoinGecko, etc.) return objects with nested array fields, so you'll typically need collections:explode to process individual array elements.

Using collections:explode to Convert Arrays to Rows:

% Get all football leagues from TheSportsDB
@bind("leagues_api", "api", "https://www.thesportsdb.com/api/v1/json/3/", "all_leagues.php").

% Explode the leagues array into individual rows
leagues_linear(League) <- leagues_api(LeaguesArray),
League = collections:explode(LeaguesArray).

% Extract fields from each league struct using struct:get
result(LeagueId, LeagueName) <- leagues_linear(League),
LeagueId = struct:get("idLeague", League),
LeagueName = struct:get("strLeague", League).

@output("result").

Using collections:transform with Lambda Expressions:

Transform array elements using lambda expressions before exploding:

% Get all football leagues
@bind("leagues_api", "api", "https://www.thesportsdb.com/api/v1/json/3/", "all_leagues.php").

% Transform each league struct into a simpler array [id, name]
leagues_transformed(TransformedArray) <-
leagues_api(LeaguesArray),
TransformedArray = collections:transform(LeaguesArray, "x -> array(x.idLeague, x.strLeague)").

% Explode the transformed array into rows
leagues_linear(LeagueData) <-
leagues_transformed(TransformedArray),
LeagueData = collections:explode(TransformedArray).

% Access array elements using collections:get (1-indexed)
result(LeagueId, LeagueName) <-
leagues_linear(LeagueData),
LeagueId = collections:get(LeagueData, 1),
LeagueName = collections:get(LeagueData, 2).

@output("result").

Cryptocurrency List with Collection Functions:

% Get list of all cryptocurrencies
@bind("crypto_list", "api", "https://api.coingecko.com/api/v3/", "coins/list").

% Explode cryptocurrency array
crypto_linear(Crypto) <- crypto_list(CryptoArray),
Crypto = collections:explode(CryptoArray).

% Extract crypto information
result(CryptoId, Symbol, Name) <-
crypto_linear(Crypto),
CryptoId = struct:get("id", Crypto),
Symbol = struct:get("symbol", Crypto),
Name = struct:get("name", Crypto).

@output("result").
Collection Functions vs SQL

When to use Collection Functions:

  • You prefer functional programming style
  • Working with simple array transformations
  • Need to combine struct:get with array processing
  • Want to leverage lambda expressions for transformations

When to use SQL:

  • Complex filtering and aggregations (WHERE, GROUP BY, HAVING)
  • Joining multiple data sources
  • Using SQL-specific functions (COUNT, SUM, AVG)
  • Need DISTINCT, ORDER BY, or LIMIT clauses

Both approaches are valid - choose based on your use case and coding style preference!

Example 19: Cross-Domain Analytics - Combining Multiple APIs

Combine data from multiple API sources for comprehensive analysis:

Bitcoin Price + Multiple Cryptocurrencies:

% Get Bitcoin price
@bind("btc_price", "api", "https://api.coingecko.com/api/v3/", "simple/price?ids=bitcoin&vs_currencies=usd&include_24hr_change=true").

% Get list of cryptocurrencies
@bind("crypto_list", "api", "https://api.coingecko.com/api/v3/", "coins/list").

% Analyze Bitcoin price context
btc_context() <- WITH btc_data AS (
SELECT bitcoin.usd AS price_usd,
bitcoin.usd_24h_change AS volatility
FROM btc_price
),
crypto_count AS (
SELECT COUNT(*) AS total_cryptos
FROM crypto_list
WHERE symbol IN ('btc', 'eth', 'usdt')
)
SELECT b.price_usd,
b.volatility,
c.total_cryptos,
CASE
WHEN ABS(b.volatility) > 5 THEN 'HIGH'
WHEN ABS(b.volatility) > 2 THEN 'MEDIUM'
ELSE 'LOW'
END AS volatility_level
FROM btc_data b
CROSS JOIN crypto_count c.

@output("btc_context").

Football Leagues + Teams Analysis:

% Get all available leagues
@bind("leagues", "api", "https://www.thesportsdb.com/api/v1/json/3/", "all_leagues.php").

% Get Premier League standings
@bind("premier_league", "api", "https://www.thesportsdb.com/api/v1/json/3/", "lookuptable.php?l=4328&s=2023-2024").

% Combine league and team data
sports_overview() <- WITH league_data AS (
SELECT SIZE(leagues) AS total_leagues
FROM leagues
WHERE SIZE(leagues) > 0
),
team_data AS (
SELECT SIZE(table) AS epl_teams
FROM premier_league
WHERE SIZE(table) > 0
)
SELECT l.total_leagues,
t.epl_teams
FROM league_data l
CROSS JOIN team_data t.

@output("sports_overview").

Example 20: REST Countries API - Public Data

Query public APIs without authentication:

% Query REST Countries API (no authentication required)
% Note: responseFormat defaults to JSON, so we don't need to specify it
@bind("countries", "api", "https://restcountries.com/v3.1/", "all?fields=name,region,population").

% Aggregate countries by region
regional_stats() <- SELECT region,
COUNT(*) AS country_count,
SUM(population) AS total_population,
AVG(population) AS avg_population
FROM countries
GROUP BY region
ORDER BY total_population DESC.

@output("regional_stats").

API Authentication Methods Summary

Auth TypeParametersExample
Bearer TokenauthType=bearer, token=<token>GitHub, Kubernetes, Prometheus
Basic AuthauthType=basic, username=<user>, password=<pass>Private APIs
API KeyauthType=apikey, apikey=<key>REST APIs with key-based auth
Custom Headersheaders=<key1>:<value1>,<key2>:<value2>APIs requiring custom headers
No AuthNo authentication parametersPublic APIs

Best Practices

  1. Default Format: JSON is the default responseFormat, so you can omit it for JSON APIs
  2. Use Environment Variables: Store sensitive tokens in environment variables: token=${API_TOKEN}
  3. SQL in Rule Bodies: Preferred approach for querying nested API data
  4. Filter Early: Apply WHERE clauses to reduce data at the source
  5. Handle Nested Data: Use dot notation (market_data.current_price.usd) for nested JSON fields
  6. Array Access: Use SIZE() function for reliable array length checks
  7. Aggregations: Use GROUP BY for summarizing API data
  8. JOINs: When joining multiple API sources, use renamed column format (predicate_name_i)
  9. Real-Time Data: Combine CTEs with CASE statements for market signals and alerts
  10. Rate Limits: Be mindful of API rate limits; use aggregations to reduce query frequency

Financial & Crypto Analytics:

  • ✅ Track Bitcoin prices in real-time (tested with CoinGecko API)
  • ✅ Monitor price changes across multiple currencies (USD, EUR)
  • ✅ Set up price alerts using CASE statements and CTEs
  • ✅ Compare multiple cryptocurrencies by symbol
  • ✅ Access nested market data (prices, volumes)

Sports Analytics:

  • ✅ Count teams in league standings (tested with TheSportsDB)
  • ✅ Search for specific teams and get team counts
  • ✅ Track recent match history
  • ✅ Query available leagues and countries
  • ✅ Use SIZE() function for array-wrapped responses

Monitoring & DevOps:

  • ✅ Query Prometheus for system metrics
  • ✅ Access Prometheus response structure (status, resultType)
  • ✅ Monitor active targets with SIZE()
  • ✅ Use CTEs for multi-step metric processing
  • ✅ Default JSON format handling

Recommended Patterns:

  • Use SIZE(array_field) to count array elements
  • Access nested fields with dot notation: field.subfield.value
  • Use CTEs for complex multi-step transformations
  • Use CASE statements for conditional logic and alerts
  • Apply WHERE clauses with SIZE() to filter valid responses

Troubleshooting

Issue: API request failed with status code: 401
Solution: Verify authentication credentials and token validity

Issue: responseFormat not recognized
Solution: Ensure format is one of: json, csv, or xml

Issue: Nested field not found
Solution: Check the API response structure and use correct dot notation

Issue: Column renamed to predicate_0, predicate_1
Solution: For JOINs between multiple sources, use renamed column references

Issue: Error accessing deeply nested arrays (e.g., data.result[0].metric.instance)
Solution: Array access within deeply nested JSON structures can be unreliable depending on the response format. Use one of these alternatives:

  1. Access top-level fields directly:

    % Instead of: data.result[0].value[1]
    % Use: status, data.resultType
    result() <- SELECT status, data.resultType FROM api_data.
  2. Use struct:get in Vadalog rules:

    % Access nested fields after binding
    result(MetricName) <-
    api_data(..., MetricStruct, ...),
    MetricName = struct:get("instance", MetricStruct).
  3. Use SIZE() for array lengths:

    % Instead of accessing individual array elements
    % Count array sizes
    result() <- SELECT SIZE(data.activeTargets) AS target_count FROM api_data.

Issue: Prometheus metrics with complex nested structures
Solution: For Prometheus API responses, focus on accessing:

  • Top-level response fields: status, data.resultType
  • Struct sizes: SIZE(data.activeTargets), SIZE(data.result)
  • Use CTEs and aggregations on response-level data rather than individual metric arrays

For accessing individual metric labels and values within Prometheus results, consider using the struct:get function in Vadalog rules after binding the data.


Text Files

Text files can be processed directly in Vadalog to extract concepts and relationships from textual content. The text datasource supports various text formats and can extract structured information from unstructured text.

Example: Reading from Text File

This example demonstrates how to read from a text file and extract structured information.

% Define the data model for the 'location' concept
@model("location","['Name:string', 'Description:string']").

% Bind the 'location' concept to the text file
@bind("location","text","/path/to/file","hansel_gretel_excerpt.txt").

% Define a rule to extract Name and Description from the text file
location_head(Name,Description) <- location(Name,Description).

% Declare the output concept 'location_head' for making the processed data available
@output("location_head").

Binary Files

Binary files support various formats including PDF, JPG, PNG, and other formats. Vadalog can extract concepts and relationships from binary files, making it suitable for processing documents and images.

Example: Reading from PDF File

This example demonstrates how to read from a PDF file and extract structured information.

% Bind the 'person' concept to the PDF file
@bind("person","binaryfile","/path/to/file","hansel_gretel_excerpt.pdf").

% Define a rule to extract Name and Role from the PDF file
person_head(Name,Role) <- person(Name,Role).

% Declare the output concept 'person_head' for making the processed data available
@output("person_head").

Example: Reading from Business Document (Invoice)

Binary files also support structured business documents such as ID documents, receipts, tax forms, mortgage documents, and other standardized business documents. The API supports various document types including:

  • Financial Documents: check.us, bankStatement.us, payStub.us, creditCard, invoice
  • ID Documents: idDocument.driverLicense, idDocument.passport, idDocument.nationalIdentityCard, idDocument.residencePermit, idDocument.usSocialSecurityCard
  • Receipts: receipt.retailMeal, receipt.creditCard, receipt.gas, receipt.parking, receipt.hotel
  • Tax Documents: tax.us.1040.2023, tax.us.w2, tax.us.w4, tax.us.1095A, tax.us.1098, tax.us.1099 (various forms)
  • Mortgage Documents: mortgage.us.1003 (URLA), mortgage.us.1004 (URAR), mortgage.us.closingDisclosure
  • Other Documents: contract, healthInsuranceCard.us, marriageCertificate.us

This example demonstrates how to read from an invoice PDF and extract structured business information.

% Bind the 'iNV_pdf' concept to the invoice PDF file with document type specification
@bind("iNV_pdf","binaryfile documentType='invoice'","/path/to/invoice","INV.pdf").

% Define a rule to extract comprehensive invoice information from the PDF file
iNV__pdf_head(CustomerName,CustomerId,PurchaseOrder,InvoiceId,InvoiceDate,DueDate,VendorName,VendorAddress,VendorAddressRecipient,CustomerAddress,CustomerAddressRecipient,BillingAddress,BillingAddressRecipient,ShippingAddress,ShippingAddressRecipient,SubTotal,TotalDiscount,TotalTax,InvoiceTotal,AmountDue,PreviousUnpaidBalance,RemittanceAddress,RemittanceAddressRecipient,ServiceAddress,ServiceAddressRecipient,ServiceStartDate,ServiceEndDate,VendorTaxId,CustomerTaxId,PaymentTerm,KVKNumber,PaymentUrl,PaymentDetails,TaxDetails,PaidInFourInstallements,Items) <- iNV_pdf(CustomerName,CustomerId,PurchaseOrder,InvoiceId,InvoiceDate,DueDate,VendorName,VendorAddress,VendorAddressRecipient,CustomerAddress,CustomerAddressRecipient,BillingAddress,BillingAddressRecipient,ShippingAddress,ShippingAddressRecipient,SubTotal,TotalDiscount,TotalTax,InvoiceTotal,AmountDue,PreviousUnpaidBalance,RemittanceAddress,RemittanceAddressRecipient,ServiceAddress,ServiceAddressRecipient,ServiceStartDate,ServiceEndDate,VendorTaxId,CustomerTaxId,PaymentTerm,KVKNumber,PaymentUrl,PaymentDetails,TaxDetails,PaidInFourInstallements,Items).

% Declare the output concept 'iNV_2026_002_000521_pdf_head' for making the processed data available
@output("iNV_2026_002_000521_pdf_head").

Example: PDF Sections to Qdrant — Semantic Search over Documents

This example demonstrates a complete workflow for turning unstructured PDF documents into a searchable Qdrant knowledge base and querying it with natural language.

The documentType='sections' mode extracts each document section as a row with its heading, content, role (e.g., title, section heading, body), and page number. These sections are then written to Qdrant with concept-aware embeddings, enabling semantic search that understands the document structure.

Step 1: Extract sections from PDF documents

% Extract sections from all PDF files in a directory.
% documentType='sections' uses Azure Document Intelligence to segment
% the document into paragraphs grouped under their headings.
% Schema returned: (filePath, sectionHeading, content, role, pageNumber)
@bind("doc_sections",
"binaryfile documentType='sections'",
"/data/reports", "annual_report_2025.pdf").

% Each row is one section from the PDF
section(FilePath, Heading, Content, Role, Page) <-
doc_sections(FilePath, Heading, Content, Role, Page).
@output("section").

Running this on a company annual report might produce rows like:

FilePathHeadingContentRolePage
.../annual_report.pdfFinancial OverviewRevenue grew 23% year-over-year to $4.2B...sectionHeading3
.../annual_report.pdfRisk FactorsThe company faces risks related to...sectionHeading12
.../annual_report.pdfESG InitiativesOur carbon emissions decreased by 15%...sectionHeading18

Step 2: Write sections to Qdrant with concept-aware embedding

% Extract sections from the PDF
@bind("doc_sections",
"binaryfile documentType='sections'",
"/data/reports", "annual_report_2025.pdf").

% Prepare sections for writing to Qdrant
report_kb(Heading, Content, Page) <-
doc_sections(FilePath, Heading, Content, Role, Page).
@output("report_kb").

% Write to Qdrant with concept-aware embedding.
% The @model description template tells the embedding model what each section
% represents — this is the key to accurate semantic search later.
@bind("report_kb",
"qdrant host='localhost', port=6334, vectorDimension=3072",
"", "annual_report_sections").
@model("report_kb",
"['heading:string', 'content:string', 'page:int']",
"Document section titled [heading] on page [page] of an annual report: [content]").

Each section is embedded as a sentence like:

"Document section titled Financial Overview on page 3 of an annual report: Revenue grew 23% year-over-year to $4.2B..."

This contextual embedding captures not just the text, but its role and position within the document.

Step 3: Query with natural language

% Semantic search over the report sections
@bind("report_search",
"qdrant host='localhost', port=6334, query='environmental sustainability initiatives', limit=3",
"", "annual_report_sections").

% @model provides both schema and concept enrichment for the query.
% The query is enriched with the same structure used during writing.
@model("report_search",
"['heading:string', 'content:string', 'page:int']",
"Document section titled [heading] on page [page] of an annual report: [content]").

result(Heading, Content, Page, Score) <-
report_search(Id, Heading, Content, Page, Score).
@output("result").

Result:

HeadingContentPageScore
ESG InitiativesOur carbon emissions decreased by 15%...180.92
Sustainability GoalsThe company targets net-zero by 2030...190.88
Environmental ComplianceAll facilities meet ISO 14001 standards...220.83

The query "environmental sustainability initiatives" correctly returns the ESG and sustainability sections — not financial or risk sections that might coincidentally mention the word "environment".

Step 4: Combine with structured reasoning

The power of Vadalog is that search results can feed into further reasoning rules:

% Find sections about sustainability that mention specific targets
@bind("report_search",
"qdrant host='localhost', port=6334, query='sustainability targets and goals', limit=10",
"", "annual_report_sections").
@model("report_search",
"['heading:string', 'content:string', 'page:int']",
"Document section titled [heading] on page [page] of an annual report: [content]").

% Only keep sections that mention quantitative targets
quantitative_target(Heading, Content, Page) <-
report_search(Id, Heading, Content, Page, Score),
Score > 0.75,
string:contains(Content, "%").

@output("quantitative_target").

This returns only sections about sustainability that contain percentage figures — combining semantic vector search with symbolic reasoning in a single Vadalog program.

HDFS File system

HDFS (Hadoop Distributed File System) is designed for distributed storage and large-scale data processing. Vadalog can integrate with HDFS by reading from and writing to files stored in HDFS clusters. This example shows how to read a CSV file from an HDFS location and process it within a Prometheux workflow.

% Bind the 'user_csv' concept to the CSV file located in HDFS ('users.csv')
% The file is located in the HDFS directory: hdfs://hdfs-host:9000/user
% The 'useHeaders=true' option indicates that the first row contains column headers
@bind("user_csv", "csv useHeaders=true", "hdfs://hdfs-host:9000/user", "users.csv").

% Define a rule to map the data from 'user_csv' to the 'user' concept
user(X) <- user_csv(X).

% Declare the output concept 'user' for making the processed data available
@output("user").

Sybase Database

Sybase (now SAP ASE) is a relational database management system used for online transaction processing. This example shows how to read data from a Sybase database.

% Bind the 'order_sybase' concept to the Sybase database using the JDBC connection details
@bind("order_sybase", "sybase host='sybase-host', port=5000, username='myUser', password='myPassw'",
"myDatabase", "orders").

% Define a rule that extracts OrderId, CustomerId, and Amount from the 'orders' table in Sybase
order_sybase_test(OrderId, CustomerId, Amount) <-
order_sybase(OrderId, CustomerId, Amount).

% Declare the output concept 'order_sybase_test' to make the processed data available
@output("order_sybase_test").

Teradata Database

Teradata is a highly scalable relational database often used in enterprise data warehousing. This example shows how to read data from a Teradata database.

% Bind the 'sales_teradata' concept to the Teradata database using the JDBC connection details
@bind("sales_teradata", "teradata host='teradata-host', port=1025, username='myUser', password='myPassw'",
"myDatabase", "sales").

% Define a rule to extract SaleId, ProductId, and SaleAmount from the 'sales' table in Teradata
sales_teradata_test(SaleId, ProductId, SaleAmount) <-
sales_teradata(SaleId, ProductId, SaleAmount).

% Declare the output concept 'sales_teradata_test' for making the processed data available
@output("sales_teradata_test").

Amazon Redshift

Amazon Redshift is a fully managed data warehouse service designed for large-scale data analytics. This example shows how to read data from a Redshift table.

% Bind the 'analytics_redshift' concept to the Redshift database using the JDBC connection details
@bind("analytics_redshift", "redshift host='redshift-cluster.amazonaws.com', port=5439, username='myUser', password='myPassword'",
"analyticsDB", "analytics").

% Define a rule to extract data from the 'analytics' table in Redshift
analytics_redshift_test(AnalysisId, Metric, Value) <-
analytics_redshift(AnalysisId, Metric, Value).

% Declare the output concept 'analytics_redshift_test' for making the processed data available
@output("analytics_redshift_test").

Google BigQuery

Google BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse. This example demonstrates configuring and querying data from a BigQuery dataset.

Setting up Google Cloud Access

This guide shows how to:

  1. Enable the required Google Cloud APIs
  2. Create a service account for Prometheux jobs
  3. Grant the minimum IAM roles (data access & Storage API)
  4. Allow a human user to impersonate the service account and obtain short‑lived access tokens
  5. Create a JSON key file
  6. Generate a one‑hour access token

Project ID example: project-example-358816
Service account name example: example-sa
Human user example: example@gmail.com


Open a Google Cloud Shell within your Google Project and execute the following commands:

1 Enable required APIs

gcloud services enable compute.googleapis.com bigquery.googleapis.com bigquerystorage.googleapis.com iamcredentials.googleapis.com --project=project-example-358816

2 Create the service account

PROJECT_ID=project-example-358816
SA_NAME=example-sa

gcloud iam service-accounts create "$SA_NAME" --project="$PROJECT_ID" --display-name="Spark BigQuery SA"

Resulting e‑mail:
example-sa@project-example-358816.iam.gserviceaccount.com


3 Grant minimum BigQuery roles to the service account

gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/bigquery.dataViewer"

gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/bigquery.jobUser"

gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/bigquery.readSessionUser"

4 Allow your user to impersonate the service account

gcloud iam service-accounts add-iam-policy-binding "$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" --member="user:example@gmail.com" --role="roles/iam.serviceAccountTokenCreator"

Verify:

gcloud iam service-accounts get-iam-policy "$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" --filter="bindings.role:roles/iam.serviceAccountTokenCreator"

5 Create a JSON key file

If you prefer file‑based creds:

gcloud iam service-accounts keys create ~/gcp-credentials.json --iam-account="$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com"

Read the content

cat ~/gcp-credentials.json

Copy and paste it into a new file in your laptop or environment in your /path/to/gcp-credentials.json

This authMode is the default one.

Set the ENV var (in Docker or via EXPORT) GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-credentials.json, or set the bigquery.credentialsFile=/path/to/gcp-credentials.json in the px.properties configuration file or declare the path via credentialsFile=/path/to/gcp-credentials.json as an option in the bind annotation.


6 Generate a one‑hour access token (impersonation)

If you prefer token‑based creds:

gcloud auth print-access-token --impersonate-service-account="$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com"

Enable token-based authMode by setting set the bigquery.authMode in the px.propertiesconfiguration file or declare it as an option in the bind annotation authMode=gcpAccessToken and set the ENV var (in Docker or via EXPORT) GCP_ACCESS_TOKEN=my-token, or set the bigquery.gcpAccessToken=my-token config property or pass it via gcpAccessToken=my-token as option in the bind annotation.

Example using credentials file

% Bind the BigQuery table with credentials file
@bind("table",
"bigquery credentialsFile=/path/to/gcp-credentials.json",
"project-example-358816",
"bigquery-public-data.thelook_ecommerce.order_items").

% Rule to project the first three columns from the BigQuery table
bigquery_table(X,Y,Z) <- table(X,Y,Z).

% Post-process the BigQuery table to limit the results to 10
@post("bigquery_table","limit(10)").

% Define the output for the BigQuery table
@output("bigquery_table").

Example using token

% Bind the BigQuery table with token
@bind("table",
"bigquery authMode=gcpAccessToken, gcpAccessToken='my-gcp-access-token'",
"project-example-358816",
"bigquery-public-data.thelook_ecommerce.order_items").

% Rule to project the first three columns from the BigQuery table
bigquery_table(X,Y,Z) <- table(X,Y,Z).

% Define the output for the BigQuery table
@output("bigquery_table").
% Post-process the BigQuery table to limit the results to 10
@post("bigquery_table","limit(10)").

Example via query

% Bind a BigQuery query to retrieve product revenue data
@qbind("query",
"bigquery credentialsFile=src/test/resources/bigquery-credentials.json",
"quantum-feat-358816",
"SELECT product_id, SUM(sale_price) AS revenue FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE sale_price > 100 GROUP BY product_id ORDER BY revenue DESC LIMIT 10").

% Define a rule to map the query results to the bigquery_query predicate
bigquery_query(X,Y) <- query(X,Y).

% Declare the output for the bigquery_query predicate
@output("bigquery_query").

Snowflake

Snowflake is a cloud-based data warehousing service that allows for data storage, processing, and analytics.

How to Retrieve Your Snowflake Connection Info for reading or writing tables

To obtain the connection details for your Snowflake account:

  1. Click the user icon in the bottom-left corner of the Snowflake UI.
  2. Select "Connect a tool to Snowflake".
  3. Go to the "Connectors / Drivers" section.
  4. Choose "JDBC" as the connection method.
  5. Select your warehouse, database, and set "Password" as the authentication method.

This will generate a JDBC connection string in the following format:

jdbc:snowflake://A778xxx-IVxxxx.snowflakecomputing.com/?user=PROMETHEUX&warehouse=COMPUTE_WH&db=TEST&schema=PUBLIC&password=my_password

From this string, you can extract the following values for your bind configuration:

  • url = 'A778xxx-IVxxxx.snowflakecomputing.com'
  • username = 'PROMETHEUX'
  • password = 'my_password'
  • warehouse = 'COMPUTE_WH'
  • database = 'TEST' (note: database names are usually uppercase)

Using Programmatic Access Tokens (PAT)

Snowflake supports Programmatic Access Tokens (PAT) as an alternative to password authentication. This is particularly useful for:

  • Avoiding MFA prompts during automated workflows
  • Enhanced security with token rotation
  • Integration with CI/CD pipelines

To use PAT instead of password authentication, simply replace the password parameter with the PAT token value in your bind configuration.

Setting Up PAT in Snowflake

First, identify your Snowflake user:

-- Show list of users
SHOW USERS;

Then execute the following script to set up PAT authentication:

-- ==================================================
-- Snowflake PAT Setup - Complete Script
-- ==================================================

-- Step 1: Switch to ACCOUNTADMIN OR SECURITYADMIN role
USE ROLE ACCOUNTADMIN;

-- Step 2: Set database context
USE DATABASE MY_DB;
USE WAREHOUSE MY_WH;

-- Step 3: Create authentication policy
CREATE AUTHENTICATION POLICY IF NOT EXISTS allow_pat_auth
AUTHENTICATION_METHODS = ('PASSWORD', 'PROGRAMMATIC_ACCESS_TOKEN')
PAT_POLICY = (
DEFAULT_EXPIRY_IN_DAYS = 90,
MAX_EXPIRY_IN_DAYS = 365,
NETWORK_POLICY_EVALUATION = ENFORCED_NOT_REQUIRED
);

-- Step 4: Apply policy to user (replace 'prometheux' with your username)
ALTER USER prometheux SET AUTHENTICATION POLICY allow_pat_auth;

-- Step 5: Generate PAT token
ALTER USER prometheux
ADD PROGRAMMATIC ACCESS TOKEN jdbc_access_token
DAYS_TO_EXPIRY = 90
COMMENT = 'Token for JDBC authentication';

-- ⚠️ COPY THE TOKEN VALUE FROM OUTPUT IMMEDIATELY!
-- The token will only be displayed once and cannot be retrieved later.

-- Step 6: Verify setup
SHOW USER PROGRAMMATIC ACCESS TOKENS FOR USER prometheux;

-- ==================================================
-- Setup Complete!
-- ==================================================

PAT Token Management Commands

-- Generate a new token
ALTER USER username ADD PROGRAMMATIC ACCESS TOKEN token_name DAYS_TO_EXPIRY = 90;

-- View all tokens for a user
SHOW USER PROGRAMMATIC ACCESS TOKENS FOR USER username;

-- Disable a token temporarily
ALTER USER username MODIFY PROGRAMMATIC ACCESS TOKEN token_name SET DISABLED = TRUE;

-- Re-enable a disabled token
ALTER USER username MODIFY PROGRAMMATIC ACCESS TOKEN token_name SET DISABLED = FALSE;

-- Rotate a token (create new and expire old)
ALTER USER username ROTATE PROGRAMMATIC ACCESS TOKEN token_name EXPIRE_ROTATED_TOKEN_AFTER_HOURS = 0;

-- Delete a token permanently
ALTER USER username REMOVE PROGRAMMATIC ACCESS TOKEN token_name;

Example: Reading from Snowflake with Password

This example demonstrates reading data from a Snowflake table using password authentication.

% Bind the 'transactions_snowflake' concept to the Snowflake database using the JDBC connection details
@bind("transactions_snowflake", "snowflake url='A778xxx-IVxxxx.snowflakecomputing.com', username='PROMETHEUX', password='myPassword', warehouse='COMPUTE_WH'",
"TEST", "transaction_data").

% Define a rule to extract TransactionId, CustomerId, and Amount from the 'transactions' table in Snowflake
transactions_snowflake_test(TransactionId, CustomerId, Amount) <-
transactions_snowflake(TransactionId, CustomerId, Amount).

% Declare the output concept 'transactions_snowflake_test' for making the processed data available
@output("transactions_snowflake_test").

Example: Reading from Snowflake with PAT

This example demonstrates reading data from a Snowflake table using Programmatic Access Token (PAT) authentication instead of password. This method avoids MFA prompts during execution.

% Bind the 'transactions_snowflake' concept to the Snowflake database using PAT authentication
% Replace 'password' with your PAT token value obtained from the ALTER USER ADD PROGRAMMATIC ACCESS TOKEN command
@bind("transactions_snowflake", "snowflake url='A778xxx-IVxxxx.snowflakecomputing.com', username='PROMETHEUX', password='your_pat_token_here', warehouse='COMPUTE_WH'",
"TEST", "transaction_data").

% Define a rule to extract TransactionId, CustomerId, and Amount from the 'transactions' table in Snowflake
transactions_snowflake_test(TransactionId, CustomerId, Amount) <-
transactions_snowflake(TransactionId, CustomerId, Amount).

% Declare the output concept 'transactions_snowflake_test' for making the processed data available
@output("transactions_snowflake_test").

Note: When using PAT, the syntax remains identical to password authentication - simply replace the password value with your PAT token. The token can be stored securely in px.properties configuration file or environment variables for better security practices.

Databricks

Databricks is a cloud-based platform for data engineering and data science.

This example demonstrates writing data to a Databricks table.

% Bind the 'sales_postgres' concept to the Postgres database using the JDBC connection details
@qbind("sales_postgres","postgresql host='postgres-host', port=5432, username='prometheux', password='myPassw'",
"postgres", "select sale_id, product_id, sale_amount from sales").

% Define a rule to extract SaleId, ProductId, and SaleAmount from the 'sales' table in Postgres
sales_databricks(SaleId, ProductId, SaleAmount) <-
sales_postgres(SaleId, ProductId, SaleAmount).

% Declare the output concept 'sales_databricks' to write data to the Databricks table
@output("sales_databricks").
@model("sales_databricks", "['sale_id:int', 'productId:int', 'sale_amount:int']").

% Bind the 'sales_databricks' concept to the Databricks cluster using the JDBC connection details
@bind("sales_databricks","databricks batchSize=5, OAuth2ClientId='your_client_id', OAuth2Secret='dosexxxx', host='dbc-xxxx-02fe.cloud.databricks.com'",
"/sql/1.0/warehouses/3283xxxx", "sales").

This example demonstrates reading data from a Databricks table.

% Bind the 'sales_databricks' concept to the Databricks cluster using the JDBC connection details
@qbind("sales_databricks","databricks fetchSize=5, authMode='PAT', token='dapixxxx', host='dbc-xxxx-02fe.cloud.databricks.com'",
"/sql/1.0/warehouses/3283xxxx", "select sale_id, productId from sales").

% Define a rule to extract ProductId from the 'sales' table in Databricks
sales(Product) <-
sales_databricks(Sale, Product).

% Declare the output concept 'sales' to return the processed data
@output("sales").