Skip to content

Kamae¤

CI PyPI - Version

Kamae bridges the gap between offline data processing and online model serving. Build preprocessing pipelines in Spark for big data workloads, then export them as Keras 3 models for low-latency inference. Multi-backend support allows numeric operations to run on TensorFlow, JAX, or PyTorch backends, while string and datetime operations require TensorFlow.

Why Kamae?¤

Training and serving often happen on different platforms. Spark for batch processing at scale, Keras for low-latency inference. Manually reimplementing preprocessing logic in both places creates: - Training/serving skew: Subtle bugs from inconsistent implementations - Development overhead: Writing and maintaining duplicate code - Deployment friction: Changes require updates in multiple systems

Kamae solves this by generating the inference model directly from your Spark pipeline, guaranteeing consistency between training and serving.

Installation¤

pip install kamae

Quick Start¤

from pyspark.sql import SparkSession
from kamae.spark.estimators import StandardScaleEstimator, StringIndexEstimator
from kamae.spark.pipeline import KamaeSparkPipeline
from kamae.spark.transformers import LogTransformer, ArrayConcatenateTransformer

# Define preprocessing in Spark
spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(1, 2, "a"), (4, 5, "b"), (7, 8, "c")],
    ["col1", "col2", "category"]
)

pipeline = KamaeSparkPipeline(stages=[
    LogTransformer(inputCol="col1", outputCol="log_col1", alpha=1, inputDtype="float"),
    ArrayConcatenateTransformer(inputCols=["log_col1", "col2"], outputCol="features", inputDtype="float"),
    StandardScaleEstimator(inputCol="features", outputCol="scaled_features"),
    StringIndexEstimator(inputCol="category", outputCol="category_indexed"),
])

fitted_pipeline = pipeline.fit(data)
fitted_pipeline.transform(data).show()  # Use in Spark

# Export for TensorFlow Serving
tf_input_schema = [
    {"name": "col1", "dtype": "int32", "shape": (None, 1)},
    {"name": "col2", "dtype": "int32", "shape": (None, 1)},
    {"name": "category", "dtype": "string", "shape": (None, 1)},
]
keras_model = fitted_pipeline.build_keras_model(tf_input_schema=tf_input_schema)
keras_model.save("./preprocessing_model.keras")

Usage¤

Spark Pipeline (Recommended): Build preprocessing pipelines using Spark transformers and estimators, fit on DataFrames, then export as Keras models. See examples for common patterns.

Direct Keras Layers: Import and compose Keras layers directly for non-tabular data or custom workflows. Browse available layers in the transformation table below.

Backend Selection: Set KERAS_BACKEND environment variable before importing keras:

import os
os.environ['KERAS_BACKEND'] = 'tensorflow'  # or 'jax' or 'torch'

Multi-backend layers (numeric operations) work on all backends. TensorFlow-only layers (string/datetime operations) require TensorFlow backend. See the Backend column in the transformation table below, or use the discovery API:

import kamae
# Get layers/transformers compatible with current backend
layers = kamae.get_compatible_layers()
transformers = kamae.get_compatible_transformers()

# Get layers/transformers compatible with specific backend
jax_layers = kamae.get_compatible_layers('jax')
torch_transformers = kamae.get_compatible_transformers('torch')

Note: TensorFlow is a required dependency for Kamae, as the package includes TensorFlow-only layers. JAX and PyTorch backends provide an alternative execution path for numeric operations only.

Documentation¤

  • Examples: Full working examples for common use cases
  • Chaining models: Use Kamae preprocessing models as inputs to trainable models
  • Type parity: Ensuring consistent dtypes between Spark and Keras
  • Shape parity: Ensuring consistent shapes between Spark and Keras
  • Testing inference: Validate model outputs with TensorFlow Serving
  • Adding transformers: Contributing new transformations
  • v3.0.0 Migration Guide: Upgrading from Kamae 2.x to 3.0.0 (Keras 3 multi-backend, sklearn removal, deprecated layer removals, hash indexer changes)

Supported Preprocessing Layers¤

Transformation Description Keras Layer Backend Spark Transformer
AbsoluteValue Applies the abs(x) transform. Link Multi-backend Link
ArrayConcatenate Assembles multiple features into a single array. Link Multi-backend Link
ArrayCrop Crops or pads a feature array to a consistent size. Link Multi-backend Link
ArrayReduceMax Reduces the last dimension of a tensor by taking the maximum. Link Multi-backend Link
ArraySplit Splits a feature array into multiple features. Link Multi-backend Link
ArraySubtractMinimum Subtracts the minimum element in an array from therest to compute a timestamp difference. Ignores padded values. Link Multi-backend Link
BearingAngle Compute the bearing angle (https://en.wikipedia.org/wiki/Bearing_(navigation)) between two pairs of lat/long. Link Multi-backend Link
Bin Bins a numerical column into string categorical bins. Users can specify the bin values, labels and a default label. Link Multi-backend Link
BloomEncode Hash encodes a string feature multiple times to create an array of indices. Useful for compressing input dimensions for embeddings. Paper: https://arxiv.org/pdf/1706.03993.pdf Link TensorFlow-only Link
Bucketize Buckets a numerical column into integer bins. Link TensorFlow-only Link
ConditionalStandardScale Normalises by the mean and standard deviation, with ability to: apply a mask on another column, not scale the zeros, and apply a non standard scaling function. Link Multi-backend Link
CosineSimilarity Computes the cosine similarity between two array features. Link Multi-backend Link
CurrentDate Returns the current date for use in other transformers. Link TensorFlow-only Link
CurrentDateTime Returns the current date time in the format yyyy-MM-dd HH🇲🇲ss.SSS for use in other transformers. Link TensorFlow-only Link
CurrentUnixTimestamp Returns the current unix timestamp in either seconds or milliseconds for use in other transformers. Link TensorFlow-only Link
DateAdd Adds a static or dynamic number of days to a date feature. NOTE: Destroys any time component of the datetime if present. Link TensorFlow-only Link
DateDiff Computes the number of days between two date features. Link TensorFlow-only Link
DateParse Parses a string date of format YYYY-MM-DD to extract a given date part. E.g. day of year. Link TensorFlow-only Link
DateTimeToUnixTimestamp Converts a UTC datetime string to unix timestamp. Link TensorFlow-only Link
Divide Divides a single feature by a constant or divides multiple features against each other. Link Multi-backend Link
Exp Applies the exp(x) operation to the feature. Link Multi-backend Link
Exponent Applies the x^exponent to a single feature or x^y for multiple features. Link Multi-backend Link
HashIndex Transforms strings to indices via a hash table of predeterminded size. Link TensorFlow-only Link
HaversineDistance Computes the haversine distance between latitude and longitude pairs. Link Multi-backend Link
Identity Applies the identity operation, leaving the input the same. Link Multi-backend Link
IfStatement Computes a simple if statement on a set of columns/tensors and/or constants. Link TensorFlow-only Link
Impute Performs imputation of either mean or median value of the data over a specified mask. Link Multi-backend Link
LambdaFunction Transforms an input (or multiple inputs) to an output (or multiple outputs) with a user provided tensorflow function. Link TensorFlow-only Link
ListMax Computes the listwise max of a feature, optionally calculated only on the top items based on another given feature. Link TensorFlow-only Link
ListMean Computes the listwise mean of a feature, optionally calculated only on the top items based on another given feature. Link TensorFlow-only Link
ListMedian Computes the listwise median of a feature, optionally calculated only on the top items based on another given feature. Link TensorFlow-only Link
ListMin Computes the listwise min of a feature, optionally calculated only on the top items based on another given feature. Link TensorFlow-only Link
ListRank Computes the listwise rank (ordering) of a feature. Link TensorFlow-only Link
ListStdDev Computes the listwise standard deviation of a feature, optionally calculated only on the top items based on another given feature. Link TensorFlow-only Link
Log Applies the natural logarithm log(alpha + x) transform . Link Multi-backend Link
LogicalAnd Performs an and(x, y) operation on multiple boolean features. Link Multi-backend Link
LogicalNot Performs a not(x) operation on a single boolean feature. Link Multi-backend Link
LogicalOr Performs an or(x, y) operation on multiple boolean features. Link Multi-backend Link
Max Computes the maximum of a feature with a constant or multiple other features. Link Multi-backend Link
Mean Computes the mean of a feature with a constant or multiple other features. Link Multi-backend Link
Min Computes the minimum of a feature with a constant or multiple other features. Link Multi-backend Link
MinHashIndex Creates an integer bit array from a set of strings using the MinHash algorithm. Link TensorFlow-only Link
MinMaxScale Scales the input feature by the min/max resulting in a feature in [0, 1]. Link Multi-backend Link
Modulo Computes the modulo of a feature with the mod divisor being a constant or another feature. Link Multi-backend Link
Multiply Multiplies a single feature by a constant or multiples multiple features together. Link Multi-backend Link
NumericalIfStatement Performs a simple if else statement witha given operator. Value to check, result if true or false can be constants or features. Link Multi-backend Link
OneHotEncode Transforms a string to a one-hot array. Link TensorFlow-only Link
OrdinalArrayEncode Encodes strings in an array according to the order in which they appear. Only for 2D tensors. Link TensorFlow-only Link
PairwiseCosineSimilarity Computes the cosine similarity between an embedding and a list of candidate embeddings. Link Multi-backend Link
Round Rounds a floating feature to the nearest integer using ceil, floor or a standard round op. Link Multi-backend Link
RoundToDecimal Rounds a floating feature to the nearest decimal precision. Link Multi-backend Link
SharedOneHotEncode Transforms a string to a one-hot array, using labels across multiple inputs to determine the one-hot size. Link TensorFlow-only Link
SharedStringIndex Transforms strings to indices via a vocabulary lookup, sharing the vocabulary across multiple inputs. Link TensorFlow-only Link
SingleFeatureArrayStandardScale Normalises by the mean and standard deviation calculated over all elements of all inputs, with ability to mask a specified value. Link TensorFlow-only Link
StandardScale Normalises by the mean and standard deviation, with ability to mask a specified value. Link Multi-backend Link
StringAffix Prefixes and suffixes a string with provided constants. Link TensorFlow-only Link
StringArrayConstant Inserts provided string array constant into a column. Link TensorFlow-only Link
StringCase Applies an upper or lower casing operation to the feature. Link TensorFlow-only Link
StringConcatenate Joins string columns using the provided separator. Link TensorFlow-only Link
StringContains Checks for the existence of a constant or tensor-element substring within a feature. Link TensorFlow-only Link
StringContainsList Checks for the existence of any string from a list of string constants within a feature. Link TensorFlow-only Link
StringEqualsIfStatement Performs a simple if else statement on string equality. Value to check, result if true or false can be constants or features. Link TensorFlow-only Link
StringIndex Transforms strings to indices via a vocabulary lookup Link TensorFlow-only Link
StringListToString Concatenates a list of strings to a single string with a given delimiter. Link TensorFlow-only Link
StringMap Maps a list of string values to a list of other string values with a standard CASE WHEN statement. Can provide a default value for ELSE. Link TensorFlow-only Link
StringIsInList Checks if the feature is equal to at least one of the strings provided. Link TensorFlow-only Link
StringReplace Performs a regex replace operation on a feature with constant params or between multiple features Link TensorFlow-only Link
StringToStringList Splits a string by a separator, returning a list of parametrised length (with a default value for missing inputs). Link TensorFlow-only Link
SubStringDelimAtIndex Splits a string column using the provided delimiter, and returns the value at the index given. If the index is out of bounds, returns a given default value Link TensorFlow-only Link
Subtract Subtracts a constant from a single feature or subtracts multiple features from each other. Link Multi-backend Link
Sum Adds a constant to a single feature or sums multiple features together. Link Multi-backend Link
UnixTimestampToDateTime Converts a unix timestamp to a UTC datetime string. Link TensorFlow-only Link

Development¤

Setup¤

Requirements: Python 3.10 (for development), pipx (installation instructions)

make setup      # Install dependencies and pre-commit hooks
make all        # Run tests, formatting, and linting
make help       # See all available commands

The package supports Python 3.8-3.12 in production.

Common Commands¤

make run-example        # Run example pipeline
make test-tf-serving    # Test TensorFlow Serving inference
make test-end-to-end    # Run example + test serving

Contributing¤

Create a branch from main and open a pull request. Follow the adding transformers guide for new transformers.

Code quality: Pre-commit hooks enforce formatting and linting. Install with uv run pre-commit install. PRs must pass all tests in tests/.

Versioning: Automated via semantic-release. Use conventional commit prefixes in PR titles: fix: (patch), feat: (minor), BREAKING CHANGE: (major).

Contact: Questions? Reach out to the Kamae team.