AsyncLLMEngine class#

Overview#

The AsyncLLMEngine class provides an asynchronous interface for text generation and can be configured through command-line arguments via AsyncEngineArgs.

Example Usage#

import argparse
import asyncio

from furiosa_llm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams

async def main():
    # Build an argument parser populated with the AsyncEngineArgs options.
    parser = argparse.ArgumentParser()
    parser = AsyncEngineArgs.add_cli_args(parser)
    args = parser.parse_args()

    # Construct the engine from the parsed command-line arguments.
    engine_args = AsyncEngineArgs.from_cli_args(args)
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    example_input = {
        "prompt": "What is LLM?",
        "temperature": 0.0,
        "request_id": "request-123",
    }

    # generate() returns an async generator that yields RequestOutput objects.
    results_generator = engine.generate(
        example_input["prompt"],
        SamplingParams(temperature=example_input["temperature"]),
        example_input["request_id"],
    )

    # Drain the generator; the last yielded item holds the complete generation.
    final_output = None
    async for request_output in results_generator:
        final_output = request_output

    print(final_output)

if __name__ == "__main__":
    asyncio.run(main())

The script accepts the command-line arguments defined in AsyncEngineArgs, for example:

python async_llm_engine.py --model /path/to/model --devices npu:0

For a comprehensive list of available arguments for AsyncEngineArgs, please refer to the section below.

Arguments supported by AsyncLLMEngine#

The arguments are identical to those specified in Arguments supported by LLMEngine.

usage: async_llm_engine.py [-h] --model MODEL [--tokenizer TOKENIZER] [--tokenizer-mode TOKENIZER_MODE] [--seed SEED] [--devices DEVICES] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                           [--data-parallel-size DATA_PARALLEL_SIZE] [--cache-dir CACHE_DIR] [--paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS] [--npu-queue-limit NPU_QUEUE_LIMIT]
                           [--max-processing-samples MAX_PROCESSING_SAMPLES] [--spare-blocks-ratio SPARE_BLOCKS_RATIO] [--is-offline IS_OFFLINE]

options:
  -h, --help            show this help message and exit
  --model MODEL         Path to the LLM engine artifact (pretrained model id will be supported in future releases).
  --tokenizer TOKENIZER
                        The name or path of a HuggingFace Transformers tokenizer.
  --tokenizer-mode TOKENIZER_MODE
                        The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.
  --seed SEED           The seed to initialize the random number generator for sampling.
  --devices DEVICES     The devices to run the model. It can be a single device or a list of devices. Each device can be either "npu:X" or "npu:X:*" where X is a specific device index. If not
                        given, available devices will be used.
  --pipeline-parallel-size PIPELINE_PARALLEL_SIZE
                        The size of the pipeline parallelism group. If not given, it will use the default pp value of the artifact.
  --data-parallel-size DATA_PARALLEL_SIZE
                        The size of the data parallelism group. If not given, it will be inferred from the total available PEs and other parallelism degrees.
  --cache-dir CACHE_DIR
                        The cache directory for temporarily generated files for this LLM instance. When its value is ``None``, caching is disabled. The default is "$HOME/.cache/furiosa/llm".
  --paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS
                        The maximum number of blocks that each k/v storage per layer can store. This argument must be given if the model uses paged attention.
  --npu-queue-limit NPU_QUEUE_LIMIT
                        The NPU queue limit of the scheduler config.
  --max-processing-samples MAX_PROCESSING_SAMPLES
                        The maximum number of processing samples. Used as a hint for the scheduler.
  --spare-blocks-ratio SPARE_BLOCKS_RATIO
                        The spare blocks ratio. Used as a hint for the scheduler.
  --is-offline IS_OFFLINE
                        If True, the scheduler will assume an offline workload scenario.
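For instance, a hypothetical invocation that pins a single NPU, fixes the sampling seed, and passes a scheduler hint might look like the following (the artifact path and values are placeholders):

python async_llm_engine.py \
    --model /path/to/artifact \
    --devices npu:0 \
    --seed 42 \
    --max-processing-samples 64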

API Reference#

class furiosa_llm.AsyncLLMEngine(native_engine: NativeLLMEngine, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, prompt_max_seq_len: int, max_seq_len_to_capture: int)[source]#

AsyncLLMEngine receives requests and generates texts asynchronously. It implements an API compatible with vLLM’s AsyncLLMEngine, but is built on furiosa-runtime and the FuriosaAI NPU.

classmethod from_engine_args(args: AsyncEngineArgs) → AsyncLLMEngine[source]#

Creates an AsyncLLMEngine from AsyncEngineArgs.
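When the arguments do not come from the command line, AsyncEngineArgs can also be built directly in Python and passed to from_engine_args. The sketch below assumes AsyncEngineArgs accepts keyword arguments mirroring the CLI flags above (model, devices); verify the actual fields against the AsyncEngineArgs definition.

from furiosa_llm import AsyncEngineArgs, AsyncLLMEngine

# A minimal sketch, assuming AsyncEngineArgs exposes fields matching the CLI flags.
engine_args = AsyncEngineArgs(
    model="/path/to/artifact",  # path to the LLM engine artifact (placeholder)
    devices="npu:0",            # illustrative device name
)
engine = AsyncLLMEngine.from_engine_args(engine_args)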

async generate(prompt: str | TextPrompt | TokensPrompt, sampling_params: SamplingParams, request_id: str) → AsyncGenerator[RequestOutput, None][source]#

Generates text completions for a given prompt.

Parameters:
  • prompt – The prompt to the LLM. See PromptType for more details about the format of each input.

  • sampling_params – The sampling parameters of the request.

  • request_id – The unique id of the request.
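Because generate returns an async generator, each yielded RequestOutput reflects the generation so far, so partial results can be streamed as they arrive. The sketch below assumes the vLLM-compatible RequestOutput layout (an outputs list whose entries carry the generated text) and a max_tokens field on SamplingParams; treat those names as assumptions.

from furiosa_llm import SamplingParams

async def stream_completion(engine, prompt: str, request_id: str) -> None:
    # Minimal streaming sketch; max_tokens and outputs[0].text follow the
    # vLLM-style interface and are assumptions, not confirmed signatures.
    params = SamplingParams(temperature=0.0, max_tokens=64)
    async for request_output in engine.generate(prompt, params, request_id):
        # Each RequestOutput contains the text generated so far.
        print(request_output.outputs[0].text)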