LLMEngine class#

Overview#

The LLMEngine class provides an interface for text generation and supports configuration through command-line arguments via EngineArgs.

Example Usage#

import argparse
from typing import List, Tuple

from furiosa_llm import EngineArgs, LLMEngine, RequestOutput, SamplingParams


def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
    """Create a list of test prompts with their sampling parameters."""
    return [
        ("A robot may not injure a human being",
         SamplingParams(temperature=0.0)),
        ("To be or not to be,",
         SamplingParams(temperature=0.8, top_k=5)),
        ("What is the meaning of life?",
         SamplingParams(n=1,
                        best_of=5,
                        temperature=0.8,
                        top_p=0.95)),
    ]


def process_requests(engine: LLMEngine,
                     test_prompts: List[Tuple[str, SamplingParams]]):
    """Continuously process a list of prompts and handle the outputs."""
    request_id = 0

    while test_prompts or engine.has_unfinished_requests():
        if test_prompts:
            prompt, sampling_params = test_prompts.pop(0)
            engine.add_request(str(request_id), prompt, sampling_params)
            request_id += 1

        request_outputs: List[RequestOutput] = engine.step()

        for request_output in request_outputs:
            if request_output.finished:
                print(request_output)


def initialize_engine(args: argparse.Namespace) -> LLMEngine:
    """Initialize the LLMEngine from the command line arguments."""
    engine_args = EngineArgs.from_cli_args(args)
    return LLMEngine.from_engine_args(engine_args)


def main(args: argparse.Namespace):
    """Main function that sets up and runs the prompt processing."""
    engine = initialize_engine(args)
    test_prompts = create_test_prompts()
    process_requests(engine, test_prompts)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Demo on using the LLMEngine class directly')
    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    main(args)

The script can be executed with various arguments defined in EngineArgs, as shown in the following example:

python llm_engine.py --model /path/to/model --devices npu:0

For a comprehensive list of available arguments for EngineArgs, please refer to the section below.

Arguments supported by LLMEngine#

usage: llm_engine.py [-h] --model MODEL [--tokenizer TOKENIZER] [--tokenizer-mode TOKENIZER_MODE] [--seed SEED] [--devices DEVICES] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                     [--data-parallel-size DATA_PARALLEL_SIZE] [--cache-dir CACHE_DIR] [--paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS] [--npu-queue-limit NPU_QUEUE_LIMIT]
                     [--max-processing-samples MAX_PROCESSING_SAMPLES] [--spare-blocks-ratio SPARE_BLOCKS_RATIO] [--is-offline IS_OFFLINE]

Demo on using the LLMEngine class directly

options:
  -h, --help            show this help message and exit
  --model MODEL         Path to the LLM engine artifact (pretrained model id will be supported in future releases).
  --tokenizer TOKENIZER
                        The name or path of a HuggingFace Transformers tokenizer.
  --tokenizer-mode TOKENIZER_MODE
                        The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.
  --seed SEED           The seed to initialize the random number generator for sampling.
  --devices DEVICES     The devices to run the model. It can be a single device or a list of devices. Each device can be either "npu:X" or "npu:X:*" where X is a specific device index.
                        If not given, available devices will be used.
  --pipeline-parallel-size PIPELINE_PARALLEL_SIZE
                        The size of the pipeline parallelism group. If not given, it will use the default pp value of the artifact.
  --data-parallel-size DATA_PARALLEL_SIZE
                        The size of the data parallelism group. If not given, it will be inferred from the total available PEs and other parallelism degrees.
  --cache-dir CACHE_DIR
                        The cache directory for temporarily generated files for this LLM instance. When its value is ``None``, caching is disabled. The default is
                        "$HOME/.cache/furiosa/llm".
  --paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS
                        The maximum number of blocks that each k/v storage per layer can store. This argument must be given if the model uses paged attention.
  --npu-queue-limit NPU_QUEUE_LIMIT
                        The NPU queue limit of the scheduler config.
  --max-processing-samples MAX_PROCESSING_SAMPLES
                        The maximum number of processing samples. Used as a hint for the scheduler.
  --spare-blocks-ratio SPARE_BLOCKS_RATIO
                        The spare blocks ratio. Used as a hint for the scheduler.
  --is-offline IS_OFFLINE
                        If True, the scheduler will assume an offline workload scenario.

API Reference#

class furiosa_llm.LLMEngine(native_engine: NativeLLMEngine, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, prompt_max_seq_len: int, max_seq_len_to_capture: int)[source]#

LLMEngine receives requests and generates texts. It implements an API compatible with vLLM’s LLMEngine, but is built on furiosa-runtime and the FuriosaAI NPU.

The request scheduling approach of this engine differs from vLLM’s. While vLLM provides fine-grained control over decoding via the step method, this engine begins text generation in the background as soon as a request is submitted via add_request(), and continues asynchronously until completion. The generated results are placed in a queue, from which clients retrieve them by calling step().

The Furiosa native engine handles scheduling and batching internally, allowing clients to retrieve results via step() calls without needing to manage the decoding schedule.
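The following minimal sketch illustrates this submit-then-poll pattern with the methods documented below, given an already initialized engine (the request id, prompt, and sampling values are illustrative):

engine.add_request("0", "A robot may not injure a human being",
                   SamplingParams(temperature=0.0))

# Generation already runs in the background; step() only drains the result queue.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output)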

Please note that cancelling a request via abort_request is not currently supported.

add_request(request_id: str, prompt: str | TextPrompt | TokensPrompt, sampling_params: SamplingParams) None[source]#

Adds a new request to the engine. The decoding iteration starts immediately after adding the request.

Parameters:
  • request_id – The unique id of the request.

  • prompt – The prompt to the LLM.

  • sampling_params – The sampling parameters of the request.
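For example, a minimal call might look as follows (a sketch; the request id, prompt, and sampling values are illustrative):

engine.add_request(
    request_id="0",
    prompt="What is the meaning of life?",
    sampling_params=SamplingParams(temperature=0.8, top_p=0.95),
)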

classmethod from_engine_args(args: EngineArgs) LLMEngine[source]#

Creates an LLMEngine from EngineArgs.
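For example, assuming EngineArgs can also be constructed directly with keyword arguments mirroring its CLI flags (a sketch; the model path is a placeholder):

engine_args = EngineArgs(model="/path/to/model", devices="npu:0")
engine = LLMEngine.from_engine_args(engine_args)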

has_unfinished_requests() bool[source]#

Returns True if there are unfinished requests.

step() List[RequestOutput][source]#

Returns newly generated results of one decoding iteration from the queue.