LLMEngine class#
Overview#
The LLMEngine class provides an interface for text generation and supports configuration through command-line arguments defined by EngineArgs.
Example Usage#
import argparse
from typing import List, Tuple

from furiosa_llm import EngineArgs, LLMEngine, RequestOutput, SamplingParams


def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
    """Create a list of test prompts with their sampling parameters."""
    return [
        ("A robot may not injure a human being",
         SamplingParams(temperature=0.0)),
        ("To be or not to be,",
         SamplingParams(temperature=0.8, top_k=5)),
        ("What is the meaning of life?",
         SamplingParams(n=1,
                        best_of=5,
                        temperature=0.8,
                        top_p=0.95)),
    ]


def process_requests(engine: LLMEngine,
                     test_prompts: List[Tuple[str, SamplingParams]]):
    """Continuously process a list of prompts and handle the outputs."""
    request_id = 0

    while test_prompts or engine.has_unfinished_requests():
        if test_prompts:
            prompt, sampling_params = test_prompts.pop(0)
            engine.add_request(str(request_id), prompt, sampling_params)
            request_id += 1

        request_outputs: List[RequestOutput] = engine.step()

        for request_output in request_outputs:
            if request_output.finished:
                print(request_output)


def initialize_engine(args: argparse.Namespace) -> LLMEngine:
    """Initialize the LLMEngine from the command line arguments."""
    engine_args = EngineArgs.from_cli_args(args)
    return LLMEngine.from_engine_args(engine_args)


def main(args: argparse.Namespace):
    """Main function that sets up and runs the prompt processing."""
    engine = initialize_engine(args)
    test_prompts = create_test_prompts()
    process_requests(engine, test_prompts)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    main(args)
The script can be executed with various arguments defined in EngineArgs, as shown in the following example:
python llm_engine.py --model /path/to/model --devices npu:0
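The engine can also be configured programmatically rather than from the command line. The following is a minimal sketch that assumes EngineArgs accepts these flag names as constructor keyword arguments, in the style of vLLM's dataclass-based EngineArgs; this constructor usage is an assumption, not something documented on this page:

from furiosa_llm import EngineArgs, LLMEngine

# Assumption: EngineArgs takes the CLI flag names as keyword arguments,
# as vLLM's dataclass-style EngineArgs does. The model path is a placeholder.
engine_args = EngineArgs(model="/path/to/model", devices="npu:0")
engine = LLMEngine.from_engine_args(engine_args)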
For a comprehensive list of available arguments for EngineArgs, please refer to the section below.
Arguments supported by LLMEngine#
usage: llm_engine.py [-h] --model MODEL [--revision REVISION] [--tokenizer TOKENIZER]
                     [--tokenizer-mode TOKENIZER_MODE] [--seed SEED]
                     [--devices DEVICES]
                     [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                     [--data-parallel-size DATA_PARALLEL_SIZE]
                     [--cache-dir CACHE_DIR]
                     [--paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS]
                     [--npu-queue-limit NPU_QUEUE_LIMIT]
                     [--max-processing-samples MAX_PROCESSING_SAMPLES]
                     [--spare-blocks-ratio SPARE_BLOCKS_RATIO]

options:
  -h, --help            show this help message and exit
  --model MODEL         The Hugging Face model id, or the path to a Furiosa model
                        artifact. Currently, only one model per server is supported.
  --revision REVISION   The specific model revision on Hugging Face Hub if the model
                        is given as a Hugging Face model id. It can be a branch name,
                        a tag name, or a commit id. Its default value is "main".
                        However, if the given model belongs to the furiosa-ai
                        organization, the model will use the release model tag by
                        default.
  --tokenizer TOKENIZER
                        The name or path of a HuggingFace Transformers tokenizer.
  --tokenizer-mode TOKENIZER_MODE
                        The tokenizer mode. "auto" will use the fast tokenizer if
                        available, and "slow" will always use the slow tokenizer.
  --seed SEED           The seed to initialize the random number generator for
                        sampling.
  --devices DEVICES     The devices on which to run the model. It can be a single
                        device or a comma-separated list of devices. Each device can
                        be either "npu:X" or "npu:X:Y", where X is a device index and
                        Y is an NPU core range notation (e.g. "npu:0" for the whole
                        NPU 0, "npu:0:0" for core 0 of NPU 0, and "npu:0:0-3" for
                        fused cores 0-3 of NPU 0). If not given, all available
                        unoccupied devices will be used.
  --pipeline-parallel-size PIPELINE_PARALLEL_SIZE
                        The size of the pipeline parallelism group. If not given, the
                        default pipeline parallelism (pp) value of the artifact will
                        be used.
  --data-parallel-size DATA_PARALLEL_SIZE
                        The size of the data parallelism group. If not given, it will
                        be inferred from the total available PEs and the other
                        parallelism degrees.
  --cache-dir CACHE_DIR
                        The cache directory for temporarily generated files for this
                        LLM instance. When its value is ``None``, caching is
                        disabled. The default is "$HOME/.cache/furiosa/llm".
  --paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS
                        The maximum number of blocks that each k/v storage per layer
                        can store. This argument must be given if the model uses
                        paged attention.
  --npu-queue-limit NPU_QUEUE_LIMIT
                        The NPU queue limit of the scheduler config.
  --max-processing-samples MAX_PROCESSING_SAMPLES
                        The maximum number of samples to process. Used as a hint for
                        the scheduler.
  --spare-blocks-ratio SPARE_BLOCKS_RATIO
                        The spare blocks ratio. Used as a hint for the scheduler.
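For instance, the --devices option accepts the core range notation described above. The invocations below are illustrative; the model path and device indices are placeholders:

# Run on all cores of NPU 0 and NPU 1:
python llm_engine.py --model /path/to/model --devices npu:0,npu:1

# Run on fused cores 0-3 of NPU 0, with a fixed sampling seed:
python llm_engine.py --model /path/to/model --devices npu:0:0-3 --seed 42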
API Reference#
- class furiosa_llm.LLMEngine(native_engine: NativeLLMEngine, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, prompt_max_seq_len: int, max_seq_len_to_capture: int)[source]#
LLMEngine receives requests and generates texts. It implements an API compatible with vLLM's LLMEngine, but is based on furiosa-runtime and the FuriosaAI NPU.
The request scheduling approach of this engine differs from vLLM's. While vLLM provides fine-grained control over decoding via the step method, this engine begins text generation in the background as soon as a request is submitted via add_request(), continuing asynchronously until completion. The generated results are placed in a queue that clients can retrieve by calling step(). The Furiosa native engine handles scheduling and batching internally, so clients can retrieve results via step() calls without needing to manage the decoding schedule.
- add_request(request_id: str, prompt: str | TextPrompt | TokensPrompt, sampling_params: SamplingParams) → None[source]#
Adds a new request to the engine. The decoding iteration starts immediately after adding the request.
- Parameters:
request_id – The unique id of the request.
prompt – The prompt to the LLM.
sampling_params – The sampling parameters of the request.
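Because decoding starts as soon as add_request() is called, a minimal client only needs to submit requests and then poll step() until the engine drains. The sketch below assumes engine is an already-initialized LLMEngine (e.g. obtained via LLMEngine.from_engine_args()); the prompt text and request id are placeholders:

from furiosa_llm import SamplingParams

# Submit one request; background decoding begins immediately.
engine.add_request(
    request_id="0",  # any unique string chosen by the client
    prompt="A robot may not injure a human being",
    sampling_params=SamplingParams(temperature=0.0),
)

# step() drains whatever the background decoding loop has produced so far;
# poll until the engine reports no unfinished requests.
while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            print(request_output)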