AsyncLLMEngine class#
Overview#
The AsyncLLMEngine class provides an asynchronous interface for text generation and supports configuration through command-line arguments via AsyncEngineArgs.
Example Usage#
import argparse
import asyncio

from furiosa_llm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams


async def main():
    # Build engine arguments from the command line.
    parser = argparse.ArgumentParser()
    parser = AsyncEngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    engine_args = AsyncEngineArgs.from_cli_args(args)
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    example_input = {
        "prompt": "What is LLM?",
        "temperature": 0.0,
        "request_id": "request-123",
    }

    # generate() returns an async generator that streams RequestOutputs.
    results_generator = engine.generate(
        example_input["prompt"],
        SamplingParams(temperature=example_input["temperature"]),
        example_input["request_id"],
    )

    # Consume the stream and keep only the final output.
    final_output = None
    async for request_output in results_generator:
        final_output = request_output

    print(final_output)


if __name__ == "__main__":
    asyncio.run(main())
The script can be executed with various arguments defined in AsyncEngineArgs, as shown in the following example:
python async_llm_engine.py --model /path/to/model --devices npu:0
For a comprehensive list of available arguments for AsyncEngineArgs, please refer to the section below.
Arguments supported by AsyncLLMEngine#
The arguments are identical to those specified in Arguments supported by LLMEngine.
usage: async_llm_engine.py [-h] --model MODEL [--tokenizer TOKENIZER] [--tokenizer-mode TOKENIZER_MODE] [--seed SEED] [--devices DEVICES] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
[--data-parallel-size DATA_PARALLEL_SIZE] [--cache-dir CACHE_DIR] [--paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS] [--npu-queue-limit NPU_QUEUE_LIMIT]
[--max-processing-samples MAX_PROCESSING_SAMPLES] [--spare-blocks-ratio SPARE_BLOCKS_RATIO] [--is-offline IS_OFFLINE]
options:
-h, --help show this help message and exit
--model MODEL Path to the LLM engine artifact (pretrained model ids will be supported in future releases).
--tokenizer TOKENIZER
The name or path of a HuggingFace Transformers tokenizer.
--tokenizer-mode TOKENIZER_MODE
The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.
--seed SEED The seed to initialize the random number generator for sampling.
--devices DEVICES The devices on which to run the model. It can be a single device or a list of devices. Each device can be either "npu:X" or "npu:X:*", where X is a specific device index. If not
given, available devices will be used.
--pipeline-parallel-size PIPELINE_PARALLEL_SIZE
The size of the pipeline parallelism group. If not given, it will use the default pipeline parallelism (pp) value of the artifact.
--data-parallel-size DATA_PARALLEL_SIZE
The size of the data parallelism group. If not given, it will be inferred from the total available PEs and the other parallelism degrees.
--cache-dir CACHE_DIR
The cache directory for temporarily generated files for this LLM instance. When its value is ``None``, caching is disabled. The default is "$HOME/.cache/furiosa/llm".
--paged-attention-num-blocks PAGED_ATTENTION_NUM_BLOCKS
The maximum number of blocks that each k/v storage per layer can store. This argument must be given if the model uses paged attention.
--npu-queue-limit NPU_QUEUE_LIMIT
The NPU queue limit of the scheduler config.
--max-processing-samples MAX_PROCESSING_SAMPLES
The maximum number of processing samples. Used as a hint for the scheduler.
--spare-blocks-ratio SPARE_BLOCKS_RATIO
The spare blocks ratio. Used as a hint for the scheduler.
--is-offline IS_OFFLINE
If True, the scheduler will assume an offline workload scenario.
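For example, an invocation that pins the engine to specific devices and passes scheduler hints could look like the following. The flag values are illustrative, and the comma-separated device list format is an assumption:

python async_llm_engine.py \
    --model /path/to/model \
    --devices npu:0,npu:1 \
    --data-parallel-size 2 \
    --paged-attention-num-blocks 1024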
API Reference#
- class furiosa_llm.AsyncLLMEngine(native_engine: NativeLLMEngine, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, prompt_max_seq_len: int, max_seq_len_to_capture: int)[source]#
AsyncLLMEngine receives requests and generates texts asynchronously. It implements an API compatible with vLLM's AsyncLLMEngine, but is based on furiosa-runtime and the FuriosaAI NPU.
- classmethod from_engine_args(args: AsyncEngineArgs) → AsyncLLMEngine [source]#
Creates an AsyncLLMEngine from AsyncEngineArgs.
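When the arguments are already known at construction time, the argparse step can be skipped. The sketch below assumes that AsyncEngineArgs accepts its CLI options as keyword arguments (e.g., model and devices); the model path is a placeholder:

from furiosa_llm import AsyncLLMEngine, AsyncEngineArgs

# Assumption: AsyncEngineArgs takes its CLI options as keyword arguments.
engine_args = AsyncEngineArgs(model="/path/to/model", devices="npu:0")
engine = AsyncLLMEngine.from_engine_args(engine_args)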
- async generate(prompt: str | TextPrompt | TokensPrompt, sampling_params: SamplingParams, request_id: str) → AsyncGenerator[RequestOutput, None] [source]#
Generates text completions for a given prompt.
- Parameters:
prompt – The prompt to the LLM. See PromptType for more details about the format of each input.
sampling_params – The sampling parameters of the request.
request_id – The unique id of the request.
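The following is a minimal streaming sketch. It assumes that each yielded RequestOutput carries the generation so far in an outputs list whose entries expose a text field, mirroring vLLM's RequestOutput; the request id is arbitrary:

from furiosa_llm import SamplingParams

async def stream_completion(engine):
    # Each iteration yields an updated RequestOutput for this request.
    async for request_output in engine.generate(
        "What is LLM?",
        SamplingParams(temperature=0.0),
        "request-456",
    ):
        # Assumption: outputs[0].text holds the text generated so far,
        # as in vLLM's RequestOutput.
        print(request_output.outputs[0].text)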