Furiosa Executable Bundles (FXB)#
A Furiosa Executable Bundle (FXB) is Furiosa-LLM’s compiled-artifact format. An .fxb
file is a single archive holding the compiled binaries — generated for a specific model
architecture — together with the metadata needed to run the model on the Furiosa NPU. It
is designed to be easy to distribute, share, and reuse: once a model is compiled into an
.fxb, you can serve it directly without recompiling, copy it to another machine, or
publish it to the Hugging Face Hub for others to reuse.
A bundle is not tied to a single model. Models that compile to the same kernels share an
architecture fingerprint, and any model with a matching fingerprint can reuse the same
.fxb — so a fine-tuned model, a model whose weights were updated, or any other model of
the same architecture can be served from one bundle without recompiling. See
Fingerprint and Compatibility for how compatibility is determined.
Note
Coming from GPUs? On a GPU, the kernels are software: they ship with the framework,
usually as pre-compiled CUDA kernel libraries, and the same binaries run for every model.
RNGD works differently — the compiler generates code that is optimized for a specific
model, and that code is not part of the Furiosa-LLM software but a separate artifact you
build, distribute, and load. FXB is both the format for that artifact and the
tool (the fxb command) for building, sharing, and inspecting it.
Tip
FuriosaAI publishes pre-compiled FXB bundles for popular models under the FuriosaAI organization on the Hugging Face Hub 🤗. You can download one of these and serve it for any compatible model with the workflow below.
Installation#
The fxb command ships with the Furiosa-LLM package — there is nothing extra to install.
If you have not installed Furiosa-LLM yet, follow Installing Furiosa-LLM:
pip install --upgrade pip setuptools wheel uv
uv pip install --upgrade --torch-backend=auto furiosa-llm
Verify that the command is available:
fxb --help
The fxb command groups the whole FXB lifecycle — building, downloading, caching, compatibility
checking, and inspection. The cache and inspection subcommands also read the local cache that
furiosa-llm serve consults at serving time.
Note
FXB is the compiled-artifact format used by Furiosa-LLM. For the broader model-conversion workflow and quantization, see Model Preparation.
Serving a model with a compatible cached FXB#
A model’s own Hugging Face repository may not ship an .fxb of its own. When that happens,
Furiosa-LLM can still serve it by reusing a fingerprint-compatible bundle from your local
cache (see Fingerprint and Compatibility for how compatibility is determined). This is especially useful
when the model weights are updated, or when you serve a fine-tuned model or a variation of a
supported model: as long as the architecture fingerprint is unchanged, you can reuse an existing
FXB instead of compiling a new one for every weight update or variant. The workflow is:
1. Get a compatible FXB into the cache: either download one from the Hub, or add one you already have on disk.
2. Check that the cached bundle is compatible with the model you want to serve.
3. Serve the model. Furiosa-LLM finds and uses the cached bundle automatically.
The example below serves Qwen/Qwen3-8B-FP8 (which does not ship an FXB) by reusing the
furiosa-ai/Qwen3-8B-FP8 bundle, which has the same fingerprint.
Step 1 — Get a compatible FXB into the cache#
Download a published bundle from the Hub:
fxb download furiosa-ai/Qwen3-8B-FP8
This fetches every .fxb in the repository into the cache and prints where each was stored.
Restrict the download to a single file with --file, pick a branch/tag/commit with
--revision, and re-fetch a cached bundle with --force.
Alternatively, register an .fxb you already have on disk — for example one you just built (see
Building an FXB for a model) or copied from another machine:
fxb add ./qwen3-8b-fp8.fxb
fxb add copies the file into the cache and standardizes its name from the bundle’s manifest, so a
generically named file such as out.fxb is stored under a name consistent with downloaded bundles.
Step 2 — Check compatibility#
First, confirm the bundle landed in the cache with fxb cache ls:
fxb cache ls
REPO ID                  ARCH               SIZE     FuriosaIR           FURIOSA-LLM         FXB FILE
-----------------------  -----------------  -------  ------------------  ------------------  -----------------------------------
furiosa-ai/Qwen3-8B-FP8  Qwen3ForCausalLM   4.1 GiB  2026.3.0 (a1b2c3)   2026.3.0 (d4e5f6)   Qwen3-8B-FP8-<npu>-<rt>-<ts>.fxb
1 file(s), 4.1 GiB
A bundle added with fxb add appears here under the local repo id. Now confirm it is
compatible with the target model. check reads the target repository’s model configuration and
lists every cached FXB whose fingerprint matches:
fxb check Qwen/Qwen3-8B-FP8
Target: Qwen/Qwen3-8B-FP8 (Qwen3ForCausalLM)
hidden_size=4096 intermediate_size=12288 num_attention_heads=32 vocab_size=151936 quant_method=fp8
Compatible cached FXB (1):
     REPO ID                  SIZE       FuriosaIR                  FXB FILE
---  -----------------------  ---------  -------------------------  -----------------------------------------------
✔    furiosa-ai/Qwen3-8B-FP8  4.1 GiB    2026.3.0 (a1b2c3) — match  Qwen3-8B-FP8-<npu>-<rt>-<ts>.fxb
Recommended: /root/.cache/furiosa/llm/fxb/models--furiosa-ai--Qwen3-8B-FP8/snapshots/<sha>/Qwen3-8B-FP8-<npu>-<rt>-<ts>.fxb
The bundle marked ✔ is the recommended one. The FuriosaIR column is annotated with the
build-match status:
"— match": the bundle’s FuriosaIR (compiler) revision matches your running build. This bundle is served automatically.
"— stale": the bundle is fingerprint-compatible but was built with a different FuriosaIR revision. It is not served automatically (see the next step).
If no compatible bundle is cached, check prints a message and exits with a non-zero status.
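Because fxb check signals "nothing compatible" through its exit status, a deployment script can gate serving on it. The sketch below is a minimal illustration, not part of Furiosa-LLM: the helper name has_compatible_bundle and the injectable run parameter are hypothetical, and it assumes the command exits with status 0 whenever at least one compatible bundle is listed.

```python
import subprocess
from typing import Callable, Sequence


def _run_and_get_status(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def has_compatible_bundle(
    repo_id: str,
    run: Callable[[Sequence[str]], int] = _run_and_get_status,
) -> bool:
    """Return True when `fxb check <repo_id>` exits with status 0.

    `run` is a hypothetical injection point so the logic can be exercised
    without the fxb command installed; it receives the argv list and
    returns the exit status.
    """
    return run(["fxb", "check", repo_id]) == 0
```

A script might call has_compatible_bundle("Qwen/Qwen3-8B-FP8") and only start furiosa-llm serve when it returns True, falling back to fxb download or fxb build otherwise.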
Step 3 — Serve the model#
Serve the target model as usual. Furiosa-LLM resolves the FXB to run in this order:
1. an explicit --fxb <path> (used as-is);
2. an .fxb shipped inside the model’s own repository;
3. the local cache: the recommended compatible bundle, but only if its FuriosaIR revision matches the running build.
furiosa-llm serve Qwen/Qwen3-8B-FP8
When the cache fallback is used, Furiosa-LLM logs an INFO line naming the cached bundle it
picked. The same resolution applies to the Python LLM(...) API, since both go through the same
code path.
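The resolution order can be sketched as a small function. This is an illustration of the documented behavior only, not Furiosa-LLM's actual code: the names CachedBundle, StaleBundleError, and resolve_fxb are hypothetical, and the cached argument is assumed to hold the fingerprint-compatible bundles already sorted with the recommended one first.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CachedBundle:
    """A fingerprint-compatible bundle found in the local cache (hypothetical type)."""
    path: str
    ir_revision: str  # FuriosaIR revision the bundle was built with


class StaleBundleError(RuntimeError):
    """Raised when the cache holds only revision-mismatched bundles."""


def resolve_fxb(
    explicit: Optional[str],
    repo_bundled: Optional[str],
    cached: Sequence[CachedBundle],
    running_ir_revision: str,
) -> Optional[str]:
    # 1. An explicit --fxb path is used as-is.
    if explicit is not None:
        return explicit
    # 2. An .fxb shipped inside the model's own repository.
    if repo_bundled is not None:
        return repo_bundled
    # 3. The local cache, but only a bundle whose FuriosaIR revision matches.
    for bundle in cached:
        if bundle.ir_revision == running_ir_revision:
            return bundle.path
    if cached:
        # Compatible bundles exist but all are stale: fail instead of
        # silently loading one built for a different compiler revision.
        raise StaleBundleError("only stale bundles cached; pass --fxb explicitly")
    return None  # nothing to reuse
```

Raising on a stale-only cache mirrors the rule that a stale bundle is never picked automatically and has to be chosen with --fxb.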
Warning
If the cache holds only stale (FuriosaIR-revision-mismatched) compatible bundles, serving fails rather than silently loading a bundle built for a different compiler revision. The error tells you to choose one explicitly:
furiosa-llm serve Qwen/Qwen3-8B-FP8 --fxb /path/to/stale-bundle.fxb
Note
The cache lives at ~/.cache/furiosa/llm/fxb by default
($XDG_CACHE_HOME/furiosa/llm/fxb when XDG_CACHE_HOME is set). Every fxb cache
subcommand accepts --cache-dir to use a different location. The cache is only consulted for a
Hugging Face repo id — a local model path bypasses it.
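For scripts that need to locate the cache themselves, the default location described above can be reproduced as follows. This is a sketch under assumptions: the function name default_fxb_cache_dir is hypothetical, and an empty XDG_CACHE_HOME is treated as unset, which follows the XDG convention but is not confirmed for Furiosa-LLM.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def default_fxb_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the default FXB cache directory for the given environment."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME")
    # Assumption: an empty value is treated the same as an unset variable.
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "furiosa" / "llm" / "fxb"
```

A --cache-dir value passed on the command line would take precedence over this default.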
Tip
Some repositories are gated. Authorize the Hugging Face Hub before downloading or checking them, the same way as in Authorizing Hugging Face Hub (Optional):
hf auth login --token $HF_TOKEN
Building an FXB for a model#
Use fxb build to compile a model into an .fxb bundle:
fxb build Qwen/Qwen3-8B-FP8 qwen3-8b-fp8
The first argument is a Hugging Face model id or a local path (a path must start with . or
/). The second is the output path; the .fxb extension is appended automatically when the
path has none, so the command above writes qwen3-8b-fp8.fxb.
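The two argument rules above can be illustrated with a short sketch. Both helper names are hypothetical, and the second function assumes that "has none" means the path has no file extension at all; the actual behavior of fxb build for paths with a different extension is not specified here.

```python
from pathlib import Path


def is_local_model_path(model: str) -> bool:
    """A model argument is a local path only if it starts with '.' or '/'."""
    return model.startswith((".", "/"))


def resolve_output_path(output_path: str) -> Path:
    """Append .fxb when the output path carries no extension (assumed rule)."""
    path = Path(output_path)
    if path.suffix:
        return path
    return path.with_name(path.name + ".fxb")
```

Under these rules Qwen/Qwen3-8B-FP8 is read as a Hub model id, while ./Qwen3-8B-FP8 is read as a local directory.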
Preview the resolved configuration and bucket plan without compiling using --dry-run:
fxb build Qwen/Qwen3-8B-FP8 qwen3-8b-fp8 --dry-run
Common options:
-tp/--tensor-parallel-size: number of PEs per tensor-parallel group (default: the model preset).
-pp/--pipeline-parallel-size: pipeline-parallel size (default: 1).
--max-model-len: override the model’s maximum context length for bucket selection.
-O/--optim-level: bucket-filtering optimization level: O0 (minimal), O1 (half), O2 (quarter), O3 (full, the default).
--convert: task override for models with an ambiguous architecture: embed rewrites *ForCausalLM → *Model; score/classify → *ForSequenceClassification (default: auto, i.e. keep the architecture from config.json).
After building, register the bundle in the cache so fxb check and furiosa-llm serve can
find it:
fxb add ./qwen3-8b-fp8.fxb
Note
fxb build is the FXB-only build path. The legacy artifact build path remains available under
furiosa-llm build for backward compatibility.
Command reference#
Every subcommand that touches the cache accepts --cache-dir (default:
~/.cache/furiosa/llm/fxb).
fxb build#
Build an FXB artifact from a model.
fxb build <model> <output_path> [options]
model: a Hugging Face model id or a local path (a path starts with . or /).
output_path: where to write the .fxb (.fxb appended when missing).
-tp/--tensor-parallel-size N: PEs per tensor-parallel group (default: model preset).
-pp/--pipeline-parallel-size N: pipeline-parallel size (default: 1).
--max-model-len N: maximum context length used for bucket filtering.
-O/--optim-level {O0,O1,O2,O3}: bucket-filtering level (default: O3).
--convert {auto,embed,score,classify}: task override (default: auto).
--dry-run: resolve config and buckets and print the build summary without compiling.
--build-report: print a per-kernel compilation timing report after the build.
--concurrency N: maximum kernel compilations to run in parallel (default: 1).
On success (non-dry-run) it prints Artifact Build Completed.
fxb download#
Download an FXB bundle from a Hugging Face repository into the cache.
fxb download <repo_id> [--file F] [--revision R] [--force] [--cache-dir D]
repo_id: the Hugging Face repository id.
--file F: restrict the download to a single .fxb by filename.
--revision R: repository revision (branch, tag, or commit).
--force: re-download even if the bundle is already cached.
fxb add#
Add local .fxb files to the cache. Like download, it fills the cache — from files on disk
rather than from the Hub. The stored filename is standardized from each bundle’s manifest, and the
files are copied (the originals are left in place). Multiple paths are accepted, so a shell glob such
as fxb add *.fxb adds every matching bundle; an invalid bundle is reported without aborting the
rest, and the command exits non-zero if any failed.
fxb add <path>... [--cache-dir D]
fxb check#
Find cached FXBs compatible with a Hugging Face repository’s model config (see Fingerprint and Compatibility for how compatibility is determined).
fxb check <repo_id> [--cache-dir D]
Prints the target fingerprint, a table of compatible cached bundles (recommended one marked ✔,
with each bundle’s FuriosaIR revision annotated — match or — stale), and the recommended
path. Exits with a non-zero status when no compatible bundle is cached.
fxb cache#
Inspect and prune what is already cached. Adding to the cache lives at the top level
(fxb download / fxb add); fxb cache is purely the inventory-management group, with the
subcommands ls, rm, and prune.
fxb cache ls#
List cached FXB bundles.
fxb cache ls [-q/--quiet] [--cache-dir D]
-q/--quiet: print only the .fxb paths (useful for scripting).
REPO ID                  ARCH               SIZE     FuriosaIR           FURIOSA-LLM         FXB FILE
-----------------------  -----------------  -------  ------------------  ------------------  -----------------------------------
furiosa-ai/Qwen3-8B-FP8  Qwen3ForCausalLM   4.1 GiB  2026.3.0 (a1b2c3)   2026.3.0 (d4e5f6)   Qwen3-8B-FP8-<npu>-<rt>-<ts>.fxb
1 file(s), 4.1 GiB
fxb cache rm#
Remove cached FXB bundles. Requires at least one selector (a repo id or .fxb filename) or
--all. Without -y/--yes it shows what will be removed and prompts for confirmation.
fxb cache rm [<selector>...] [--all] [--dry-run] [-y/--yes] [--cache-dir D]
selectors: repo ids or .fxb filenames to remove.
--all: remove everything in the cache.
--dry-run: show what would be removed without deleting anything.
-y/--yes: do not prompt for confirmation (required when stdin is not interactive).
fxb cache prune#
Remove cached bundles by FuriosaIR version. By default it removes bundles that are stale for the
running build — those whose FuriosaIR (compiler) revision does not match the running build, or whose
revision is unknown — so they would never be served automatically anyway. With --older-than it
instead removes bundles whose FuriosaIR version is below a given semantic version. Like rm, it
previews and prompts unless -y/--yes is given.
fxb cache prune [--older-than IR_VERSION] [--dry-run] [-y/--yes] [--cache-dir D]
--older-than IR_VERSION: remove bundles whose FuriosaIR version is older than IR_VERSION, e.g. 2026.3.0, <2026.3.0, or <=2026.2.0 (a bare version means <). Bundles with no recorded FuriosaIR version are kept. When omitted, prune targets the running build’s stale bundles.
--dry-run: show what would be removed without deleting anything.
-y/--yes: do not prompt for confirmation (required when stdin is not interactive).
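The --older-than semantics can be made concrete with a short sketch. It is an illustration rather than the tool's implementation: the function names are hypothetical, versions are compared as dot-separated integers, and pre-release suffixes such as -dev are not handled.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, ...]

# An optional "<" or "<=" operator followed by a dotted numeric version.
_SPEC = re.compile(r"^(<=|<)?\s*(\d+(?:\.\d+)*)$")


def parse_version(text: str) -> Version:
    """Turn '2026.3.0' into (2026, 3, 0) so parts compare numerically."""
    return tuple(int(part) for part in text.split("."))


def parse_older_than(spec: str) -> Tuple[str, Version]:
    """Split a spec into (operator, version); a bare version means '<'."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"unsupported --older-than value: {spec!r}")
    return (match.group(1) or "<", parse_version(match.group(2)))


def should_prune(bundle_version: Optional[str], spec: str) -> bool:
    """Decide whether a bundle falls under the given --older-than spec."""
    if bundle_version is None:
        return False  # bundles with no recorded FuriosaIR version are kept
    operator, limit = parse_older_than(spec)
    version = parse_version(bundle_version)
    return version <= limit if operator == "<=" else version < limit
```

Comparing integer tuples rather than strings keeps 2026.10.0 newer than 2026.9.0, which a plain string comparison would get wrong.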
fxb show#
Show bundle metadata: general info (format version, UUID, created_at, tool versions), the model
fingerprint, the parallel configuration, and the kernel/bucket summary.
fxb show <path>
path may be an .fxb file or an extracted artifact directory.
──────────────────────────────────────────────────────────────────────────
FXB Bundle — qwen3-8b-fp8.fxb
──────────────────────────────────────────────────────────────────────────
── General ──────────────────────────────────────────────────────────────
- format_version 2
- uuid fd0348f5-6361-4586-8013-4d4ba4f70171
- created_at 2026-06-16T19:49:52.720726644+00:00
- furiosa_llm 2026.3.0-dev (7a9d13150)
- furiosa_compiler 0.11.0-dev (795da8b53a)
── Model ────────────────────────────────────────────────────────────────
- architecture Qwen3ForCausalLM
- hub_repo_id Qwen/Qwen3-8B-FP8
- hidden_size 4096
- intermediate_size 12288
- num_attention_heads 32
- vocab_size 151936
- num_key_value_heads 8
- head_dim 128
- quant_method fp8
── Parallelism ──────────────────────────────────────────────────────────
- tensor_parallel_size 8
- pipeline_parallel_size 1
── Kernels (151 entries, 4 kernels) ─────────────────────────────────────
- mid_tokenwise
- first_tokenwise
- last_tokenwise_with_lm_head
- full_attention
── Buckets (133) ────────────────────────────────────────────────────────
- tokenwise [1, 4, 8, 16, 32, 64, 128, 256, 1024]
- prefill (7):
- batch=1 attn=128 input_ids=128
...
fxb inspect#
Inspect the per-kernel input/output signatures recorded in the bundle. For each kernel and bucket, it prints the size, shape, and dtype of every input and output tensor.
fxb inspect <path>
path may be an .fxb file or an extracted artifact directory. The output lists one block per
kernel/bucket; the excerpt below shows the first entry:
first_tokenwise (tw1):
Inputs:
[0] size=4 shape=[Broadcast=1]|[0_1=1:1] dtype=raw_i32
[1] size=4 shape=[Broadcast=1]|[0_1=1:1] dtype=raw_i32
[2] size=16777216 shape=[Broadcast=1]|[0_1=8192:1024, 1_1=1:1024, 2_1=8:128, 3_1=128:1] dtype=bf16
[3] size=16777216 shape=[Broadcast=1]|[0_1=8192:1024, 1_1=1:1024, 2_1=8:128, 3_1=128:1] dtype=bf16
[4] size=4 shape=[Broadcast=1]|[0_1=1:1, 1_1=1:1] dtype=raw_i32
[5] size=1244659712 shape=[Broadcast=1]|[0_1=151936:4096, 1_1=4096:1] dtype=bf16
...
Outputs:
[0] size=8192 shape=[Broadcast=1]|[0_1=1:4096, 1_1=4096:1] dtype=bf16
[1] size=8192 shape=[Broadcast=1]|[0_1=1:4096, 1_1=4096:1] dtype=bf16
Fingerprint and Compatibility#
The key property of an FXB is its architecture fingerprint — the value Furiosa-LLM uses to
decide whether a bundle is compatible with a given model. The manifest records the model’s
architecture and the configuration fields that determine the compiled kernels, and a bundle is
considered compatible with a model when their fingerprints match. A single FXB is therefore
reusable across any Hugging Face model that shares the same fingerprint, not just the one it was
built from — which is what makes it possible to serve a model whose own repository ships no .fxb
by reusing a compatible bundle from your local cache.
The fingerprint is built from the model architecture plus the config.json fields that affect
kernel generation — the dimensions, attention and mixture-of-experts settings, and quantization
that furiosa-kernels reads at build time. Representative examples are hidden_size,
num_attention_heads, sliding_window, the expert counts for MoE models, and the quantization
format. Two repositories that differ in any of these compile to different kernels and must not
share an FXB; fields used only at load time are not part of the fingerprint.
Matching is strict: two models are compatible only if their architecture and all
fingerprint fields are equal. Use fxb check to verify a match before reusing a bundle.
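Strict matching can be pictured as comparing two tuples built from the architecture and a fixed list of config fields. The sketch below is only an illustration: the field list is a hypothetical subset taken from the examples in this page, the config is assumed to be a flat mapping, and the real fingerprint computation in Furiosa-LLM may differ.

```python
from typing import Any, Mapping, Tuple

# Hypothetical subset of fingerprint fields, for illustration only.
FINGERPRINT_FIELDS = (
    "hidden_size",
    "intermediate_size",
    "num_attention_heads",
    "num_key_value_heads",
    "head_dim",
    "vocab_size",
    "sliding_window",
    "quant_method",
)


def fingerprint(architecture: str, config: Mapping[str, Any]) -> Tuple[Any, ...]:
    """Build a comparable fingerprint; fields outside the list are ignored."""
    fields = tuple((name, config.get(name)) for name in FINGERPRINT_FIELDS)
    return (architecture,) + fields


def is_compatible(a: Tuple[Any, ...], b: Tuple[Any, ...]) -> bool:
    """Strict match: the architecture and every fingerprint field must be equal."""
    return a == b
```

With this model, a fine-tuned checkpoint that changes only weights or load-time fields produces the same tuple as its base model, while a change to hidden_size or the quantization format produces a different one.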
Note
The fingerprint-based compatibility matching is experimental. The exact set of fields that
make up the fingerprint, and how they are compared, may change in future releases. Always verify
a match with fxb check before relying on a cached bundle for a different model.
Best Practices#
fxb build trades build time against the runtime coverage and performance of the resulting
bundle. The right options depend on what you are building for. The scenarios below cover the common
cases; FuriosaAI’s own per-model production configurations live in the build matrix at
.github/fxb-artifacts.yaml.
Quick test build#
When you just want a runnable bundle as fast as possible — bring-up, a smoke test, or checking that a model compiles at all — minimize the number of kernels that get compiled and parallelize the build:
-O O0: the minimal bucket set, so far fewer kernels are compiled. This is the single biggest lever on build time.
--max-model-len: cap the context length to something small so fewer and smaller buckets are generated.
--concurrency: raise above the default of 1 to compile kernels in parallel and use the available cores on the build host.
--dry-run: resolve the config and bucket plan and print the build summary without compiling, so you can confirm the plan before spending any compile time.
# Preview the plan first, then build a minimal bundle quickly
fxb build Qwen/Qwen3-8B-FP8 qwen3-8b-test.fxb --max-model-len 4096 --dry-run
fxb build Qwen/Qwen3-8B-FP8 qwen3-8b-test.fxb -O O0 --max-model-len 4096 --concurrency 8
A bundle built with -O O0 runs, but it only covers a minimal set of buckets; expect reduced
performance and coverage compared with a full build. Do not serve it in production.
Production build#
For the bundle you actually serve, favor full bucket coverage and a configuration that matches the deployment, accepting a longer build:
-O O3: the full bucket set (this is the default; set it explicitly to make the intent clear).
-tp/--tensor-parallel-size: match the serving topology (8 PEs per card, e.g. 8 for a single card, 32 for four).
--max-model-len: set to the maximum context length you actually serve, so buckets are sized for the real workload rather than over-built.
--concurrency: set high to saturate the build host and shorten the (longer) full build.
--build-report: print per-kernel compilation timing to spot unexpectedly slow kernels.
fxb build openai/gpt-oss-120b gpt-oss-120b.fxb \
-O O3 -tp 32 --max-model-len 32768 --concurrency 24 --build-report
Publishing and distribution#
An FXB is meant to be reused — from the local cache, copied to another machine, or published to the
Hugging Face Hub. Once you have a .fxb file, there are two ways to serve with it:
Register it in the cache with fxb add so it is discovered automatically. After adding, fxb check confirms a target model is compatible, and furiosa-llm serve finds the bundle by fingerprint without any extra flag (see Serving a model with a compatible cached FXB above for how compatibility is matched). This is the right choice when you serve the model regularly or share one cache across several models.
fxb build Qwen/Qwen3-8B-FP8 qwen3-8b-fp8.fxb -O O3 -tp 8
fxb add ./qwen3-8b-fp8.fxb
fxb check Qwen/Qwen3-8B-FP8
furiosa-llm serve Qwen/Qwen3-8B-FP8
Point at the file directly with furiosa-llm serve --fxb <path>, which uses the given bundle as-is and skips cache lookup. This is convenient for a one-off run, a freshly built bundle you have not registered, or pinning a specific file.
furiosa-llm serve Qwen/Qwen3-8B-FP8 --fxb ./qwen3-8b-fp8.fxb
Tensor parallelism is fixed at build time, while pipeline and data parallelism are set when
you serve. A bundle’s -tp/--tensor-parallel-size shapes the compiled kernels (each kernel is
sharded for that tensor-parallel degree), so it is part of the FXB and cannot be changed
afterward — build with the -tp you intend to deploy. Pipeline and data parallelism, by
contrast, replicate and stage the already-compiled bundle across more PEs without recompiling, so
they are chosen at furiosa-llm serve time via -pp/--pipeline-parallel-size and
-dp/--data-parallel-size:
# Built once for tensor-parallel size 8; deployed with pipeline and data parallelism
furiosa-llm serve Qwen/Qwen3-8B-FP8 --fxb ./qwen3-8b-fp8.fxb -pp 2 -dp 4
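When planning a deployment it helps to estimate how many PEs and cards a given combination needs. The sketch below assumes the total PE count is the product of the three parallel degrees, which is the usual convention but is not stated explicitly in this page; the function names are hypothetical, and the figure of 8 PEs per card comes from the production build guidance above.

```python
import math

PES_PER_CARD = 8  # from this page: 8 PEs per card


def required_pes(tp: int, pp: int = 1, dp: int = 1) -> int:
    """Total PEs, assuming each of the pp * dp replicas/stages uses tp PEs."""
    if min(tp, pp, dp) < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tp * pp * dp


def required_cards(tp: int, pp: int = 1, dp: int = 1) -> int:
    """Number of cards needed to host the required PEs, rounded up."""
    return math.ceil(required_pes(tp, pp, dp) / PES_PER_CARD)
```

Under that assumption, the command above (a bundle built with -tp 8, served with -pp 2 -dp 4) would occupy 64 PEs, or eight cards.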
The simplest way to distribute a model is to ship the .fxb inside a standard model
directory. Take a directory saved by Transformers’ save_pretrained() — the usual
config.json, tokenizer.json, weights, and so on — drop a single .fxb file into it, and
furiosa-llm serve runs that directory directly: it loads the weights and tokenizer as usual and
picks up the bundled .fxb (the second step of the resolution order above), with no cache setup
or --fxb flag needed.
# A model directory that also contains an .fxb
my-model/
config.json
tokenizer.json
model-00001-of-00002.safetensors
...
qwen3-8b-fp8.fxb # the bundle, added alongside the weights
furiosa-llm serve ./my-model
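Before publishing such a directory, it can be worth checking that it contains exactly one bundle. The helper below is a hypothetical pre-flight check, not part of Furiosa-LLM; how furiosa-llm serve behaves when a directory holds several .fxb files is not described here, so the sketch simply refuses that case.

```python
from pathlib import Path
from typing import Optional, Union


def find_bundled_fxb(model_dir: Union[str, Path]) -> Optional[Path]:
    """Return the single .fxb in a model directory, or None if there is none."""
    candidates = sorted(Path(model_dir).glob("*.fxb"))
    if not candidates:
        return None
    if len(candidates) > 1:
        # Assumption: the directory is expected to hold a single bundle.
        names = ", ".join(path.name for path in candidates)
        raise ValueError(f"expected one .fxb in {model_dir}, found: {names}")
    return candidates[0]
```

Running find_bundled_fxb("./my-model") on the layout shown above would return the path to qwen3-8b-fp8.fxb.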
The same directory can then be published or distributed however you like — pushed to the Hugging
Face Hub, copied to another machine, or packaged for deployment — and it serves on RNGD out of the
box for anyone who pulls it. The pre-compiled models under the
furiosa-ai organization on the Hugging Face Hub (for example
furiosa-ai/Qwen3-8B-FP8) are published exactly this way: each repository carries the model
files together with a matching .fxb.