Building a Model Artifact#
A compiled model artifact is required to run the LLM Engine using the LLM class or furiosa-llm serve.
Note
If you are not familiar with what a model artifact is, please refer to Model Preparation Workflow.
You can compile a model with the ArtifactBuilder API or via the furiosa-llm build command.
Warning
This document is based on Furiosa SDK 2025.1.0 (beta0), and the features and APIs described here may change in the future.
Prerequisites#
To use the ArtifactBuilder API or the furiosa-llm build command, you need the following prerequisites:
ArtifactBuilder API#
from furiosa_llm.artifact.builder import ArtifactBuilder
from furiosa_llm.models.config_types import SchedulerConfig

builder = ArtifactBuilder(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # Hugging Face pretrained id
    "npu:0",                                  # devices to build for
    "mlperf-llama3-1-8b-fp8",                 # name of the artifact
    tensor_parallel_size=4,
    pipeline_parallel_size=1,
    # Prefill buckets as (batch_size, context_length) pairs.
    prefill_buckets=[
        (1, 512),
        (1, 1024),
    ],
    # Decode buckets as (batch_size, context_length) pairs.
    decode_buckets=[
        (1, 2048),
        (8, 2048),
        (16, 2048),
        (64, 2048),
        (128, 2048),
    ],
    # Path to quantization artifacts generated by the Furiosa Model Compressor.
    quantize_artifact_path="/path_to/quantized_artifacts",
    calculate_logit_only_for_last_token=True,
    # Maximum number of blocks each k/v cache storage layer can store.
    paged_attention_num_blocks=512_000,
    default_scheduler_config=SchedulerConfig(
        max_processing_samples=24576,
        spare_blocks_ratio=0.6,
        npu_queue_limit=2,
    ),
)

builder.build(
    "/path_to/artifact",
    num_pipeline_builder_workers=4,
    num_compile_workers=4,
)
There are more options available for the ArtifactBuilder API. You can find the full list of options and arguments in the ArtifactBuilder reference.
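Once builder.build() completes, the exported artifact can be loaded back with the LLM class mentioned at the top of this page. The snippet below is a minimal sketch, assuming the LLM.load_artifact and SamplingParams APIs of the same SDK release and a vLLM-style response object; see the LLM class reference for the exact signatures.

from furiosa_llm import LLM, SamplingParams

# Load the artifact exported by ArtifactBuilder.build() above.
llm = LLM.load_artifact("/path_to/artifact")

# Illustrative sampling values, not recommendations.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
responses = llm.generate(["What is the capital of France?"], sampling_params)

# Assumes a vLLM-style RequestOutput; adjust to the actual return type.
print(responses[0].outputs[0].text)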
furiosa-llm build command#
The following is the list of options and arguments for the build command:
usage: furiosa-llm build [-h] --model-id MODEL_ID [--name NAME] [--devices DEVICES]
                         [-tp TENSOR_PARALLEL_SIZE] [-pp PIPELINE_PARALLEL_SIZE]
                         [-dp DATA_PARALLEL_SIZE] [-pb PREFILL_BUCKETS] [-db DECODE_BUCKETS]
                         [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]
                         [--additional-model-config ADDITIONAL_MODEL_CONFIG]
                         [--quantization-artifact-path QUANTIZATION_ARTIFACT_PATH]
                         [--kv-cache-sharing-across-beams-config KV_CACHE_SHARING_ACROSS_BEAMS_CONFIG]
                         [--paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS]
                         [--num-pipeline-builder-workers NUM_PIPELINE_BUILDER_WORKERS]
                         [--num-compile-workers NUM_COMPILE_WORKERS]
                         output_path

positional arguments:
  output_path           The path to export the artifacts.

options:
  -h, --help            show this help message and exit
  --model-id MODEL_ID   The Hugging Face pretrained id (e.g., "meta-llama/Meta-Llama-3.1-8B-Instruct").
  --name NAME           The name of the artifact to build.
  --devices DEVICES     Devices to use (e.g., "npu:0,npu:1"). If not specified, the artifact
                        will be built using only one device.
  -tp TENSOR_PARALLEL_SIZE, --tensor-parallel-size TENSOR_PARALLEL_SIZE
                        The number of PEs for each tensor parallelism group. (default: 4)
  -pp PIPELINE_PARALLEL_SIZE, --pipeline-parallel-size PIPELINE_PARALLEL_SIZE
                        The number of stages for pipeline parallelism. (default: 1)
  -dp DATA_PARALLEL_SIZE, --data-parallel-size DATA_PARALLEL_SIZE
                        The size of the data parallelism group. If not specified, it will be
                        inferred based on the total available PEs and other parallelism
                        configurations.
  -pb PREFILL_BUCKETS, --prefill-buckets PREFILL_BUCKETS
                        Specify the bucket size for prefill in the format
                        batch_size,context_length. Multiple entries are allowed
                        (e.g., `--pb 1,128 --pb 1,256`).
  -db DECODE_BUCKETS, --decode-buckets DECODE_BUCKETS
                        Specify the bucket size for decode in the format
                        batch_size,context_length. Multiple entries are allowed
                        (e.g., `--db 4,2048 --db 16,2048`).
  --max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE
                        The maximum sequence length supported by the LLM engine. Sequences
                        exceeding this length will not be handled.
  --additional-model-config ADDITIONAL_MODEL_CONFIG
                        Specify compilation settings or optimization settings to apply to
                        your model. You can specify multiple items in the form `key=value`.
  --quantization-artifact-path QUANTIZATION_ARTIFACT_PATH
                        The path where quantization artifacts generated by the Furiosa Model
                        Compressor are saved.
  --kv-cache-sharing-across-beams-config KV_CACHE_SHARING_ACROSS_BEAMS_CONFIG
                        Configuration for sharing k/v caches across beams. Required if the
                        model supports kv cache sharing. Format: beam_width,max_new_token
                        (e.g., `4,128`).
  --paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS
                        The maximum number of blocks each k/v storage layer can store.
                        Required if the model uses paged attention.
  --num-pipeline-builder-workers NUM_PIPELINE_BUILDER_WORKERS
                        The number of workers for building pipelines (excluding compilation).
                        Defaults to 1 (no parallelism). Higher values reduce build time for
                        large models but require more memory.
  --num-compile-workers NUM_COMPILE_WORKERS
                        The number of workers used for compilation.
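For example, the following command builds an artifact equivalent to the ArtifactBuilder example above: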
furiosa-llm build /path/to/artifacts \
--model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
--devices "npu:0" \
--name mlperf-llama3-1-8b-fp8 \
-tp 4 -pp 1 \
-pb 1,512 -pb 1,1024 \
-db 1,2048 -db 8,2048 -db 16,2048 -db 64,2048 -db 128,2048 \
--quantization-artifact-path /path/to/quantized_artifacts \
--paged-attention-num-blocks 512000 \
--additional-model-config calculate_logit_only_for_last_token=True \
--num-pipeline-builder-workers 4 \
--num-compile-workers 4
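Once the build finishes, the resulting artifact directory can be used to run the LLM Engine with furiosa-llm serve, as noted at the top of this page. A minimal sketch, assuming the serve subcommand accepts the artifact path as its positional argument:

furiosa-llm serve /path/to/artifacts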