LLMEngine class#
Overview#
The LLMEngine provides an interface for text generation, supporting configuration through command-line arguments.
Example Usage#
import argparse
from typing import List, Tuple

from furiosa_llm import EngineArgs, LLMEngine, RequestOutput, SamplingParams


def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
    """Create a list of test prompts with their sampling parameters."""
    return [
        ("A robot may not injure a human being",
         SamplingParams(temperature=0.0)),
        ("To be or not to be,",
         SamplingParams(temperature=0.8, top_k=5)),
        ("What is the meaning of life?",
         SamplingParams(n=1,
                        best_of=5,
                        temperature=0.8,
                        top_p=0.95)),
    ]


def process_requests(engine: LLMEngine,
                     test_prompts: List[Tuple[str, SamplingParams]]):
    """Continuously process a list of prompts and handle the outputs."""
    request_id = 0

    while test_prompts or engine.has_unfinished_requests():
        if test_prompts:
            prompt, sampling_params = test_prompts.pop(0)
            engine.add_request(str(request_id), prompt, sampling_params)
            request_id += 1

        request_outputs: List[RequestOutput] = engine.step()

        for request_output in request_outputs:
            if request_output.finished:
                print(request_output)


def initialize_engine(args: argparse.Namespace) -> LLMEngine:
    """Initialize the LLMEngine from the command line arguments."""
    engine_args = EngineArgs.from_cli_args(args)
    return LLMEngine.from_engine_args(engine_args)


def main(args: argparse.Namespace):
    """Main function that sets up and runs the prompt processing."""
    engine = initialize_engine(args)
    test_prompts = create_test_prompts()
    process_requests(engine, test_prompts)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Demo on using the LLMEngine class directly')
    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    main(args)
The script can be executed with various arguments defined in EngineArgs, as shown in the following example:
python llm_engine.py --model /path/to/model --devices npu:0
For a comprehensive list of available arguments for EngineArgs, please refer to the section below.
Arguments supported by LLMEngine#
usage: llm_engine.py [-h] --model MODEL [--tokenizer TOKENIZER] [--tokenizer-mode TOKENIZER_MODE] [--seed SEED] [--devices DEVICES] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
[--data-parallel-size DATA_PARALLEL_SIZE] [--cache-dir CACHE_DIR] [--paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS] [--npu-queue-limit NPU_QUEUE_LIMIT]
[--max-processing-samples MAX_PROCESSING_SAMPLES] [--spare-blocks-ratio SPARE_BLOCKS_RATIO] [--is-offline IS_OFFLINE]
Demo on using the LLMEngine class directly
options:
-h, --help show this help message and exit
--model MODEL Path to the LLM engine artifact (pretrained model id will be supported in future releases).
--tokenizer TOKENIZER
The name or path of a HuggingFace Transformers tokenizer.
--tokenizer-mode TOKENIZER_MODE
The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.
--seed SEED The seed to initialize the random number generator for sampling.
--devices DEVICES The devices to run the model. It can be a single device or a list of devices. Each device can be either "npu:X" or "npu:X:*" where X is a specific device index.
If not given, available devices will be used.
--pipeline-parallel-size PIPELINE_PARALLEL_SIZE
The size of the pipeline parallelism group. If not given, the artifact's default pipeline parallelism degree will be used.
--data-parallel-size DATA_PARALLEL_SIZE
The size of the data parallelism group. If not given, it will be inferred from the total available PEs and the other parallelism degrees.
--cache-dir CACHE_DIR
The cache directory for temporarily generated files for this LLM instance. When its value is ``None``, caching is disabled. The default is
"$HOME/.cache/furiosa/llm".
--paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS
The maximum number of blocks that each k/v storage per layer can store. This argument must be given if the model uses paged attention.
--npu-queue-limit NPU_QUEUE_LIMIT
The NPU queue limit of the scheduler config.
--max-processing-samples MAX_PROCESSING_SAMPLES
The maximum number of processing samples. Used as a hint for the scheduler.
--spare-blocks-ratio SPARE_BLOCKS_RATIO
The spare blocks ratio. Used as a hint for the scheduler.
--is-offline IS_OFFLINE
If True, the scheduler will assume the workload is an offline scenario.
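For quick experiments, the same arguments can also be supplied programmatically rather than on a live command line. The sketch below reuses EngineArgs.add_cli_args and EngineArgs.from_cli_args from the example above and passes an explicit argument list to parse_args; the model path and device name are placeholders.

import argparse

from furiosa_llm import EngineArgs, LLMEngine

# Build a parser exposing the arguments documented above.
parser = argparse.ArgumentParser()
parser = EngineArgs.add_cli_args(parser)

# Parse an explicit argument list instead of sys.argv.
# "/path/to/model" and "npu:0" are placeholders.
args = parser.parse_args([
    "--model", "/path/to/model",
    "--devices", "npu:0",
])

engine = LLMEngine.from_engine_args(EngineArgs.from_cli_args(args))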
API Reference#
- class furiosa_llm.LLMEngine(native_engine: NativeLLMEngine, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, prompt_max_seq_len: int, max_seq_len_to_capture: int)[source]#
LLMEngine receives requests and generates text. It implements an API compatible with vLLM's LLMEngine, but is based on furiosa-runtime and the FuriosaAI NPU.
The request scheduling approach of this engine is different from that of vLLM's. While vLLM provides fine-grained control over decoding via the step method, this engine immediately begins text generation in the background as soon as a request is submitted via add_request(), continuing asynchronously until completion. The generated results are placed in a queue that clients can retrieve by calling step(). The Furiosa native engine handles scheduling and batching internally, allowing clients to retrieve results via step() calls without needing to manage the decoding schedule.
Please note that cancelling a request using abort_request is not supported for now.
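As a minimal sketch of this behavior (assuming an engine already initialized as in the example above), a client can submit a request and simply drain results with step(); generation proceeds in the background between calls.

from furiosa_llm import SamplingParams

# `engine` is an LLMEngine created elsewhere, e.g. LLMEngine.from_engine_args(...).
engine.add_request("0", "A robot may not injure a human being",
                   SamplingParams(temperature=0.0))

# Generation is already running in the background; step() only retrieves
# the results that the native engine has queued so far.
while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            print(request_output)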
- add_request(request_id: str, prompt: str | TextPrompt | TokensPrompt, sampling_params: SamplingParams) → None[source]#
Adds a new request to the engine. The decoding iteration starts immediately after adding the request.
- Parameters:
request_id – The unique id of the request.
prompt – The prompt to the LLM.
sampling_params – The sampling parameters of the request.
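For example, assuming an initialized engine as above, a request with a plain-string prompt can be added with explicit keyword arguments (the request id here is an arbitrary unique string chosen by the caller):

engine.add_request(
    request_id="42",  # any unique string chosen by the caller
    prompt="What is the meaning of life?",
    sampling_params=SamplingParams(temperature=0.8, top_p=0.95),
)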