LLM class#

class furiosa_llm.LLM(model_id_or_path: str | PathLike, *, revision: str | None = None, devices: str | Sequence[Device] | None = None, data_parallel_size: int | None = None, pipeline_parallel_size: int | None = None, num_blocks_per_pp_stage: Sequence[int] | None = None, max_io_memory_mb: int = 2048, max_model_len: int | None = None, scheduler_config: SchedulerConfig | None = None, guided_decoding_backend: Literal['auto', 'guidance', 'xgrammar'] = 'auto', tokenizer: str | PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, tokenizer_mode: Literal['auto', 'slow'] = 'auto', seed: int | None = None, cache_dir: PathLike = PosixPath('/root/.cache/furiosa/llm'), skip_engine: bool = False, enable_jit_wiring: bool = False, **kwargs)[source]#

Bases: object

An LLM for generating texts from given prompts and sampling parameters.
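
For illustration, a minimal construction sketch follows; the model path is a placeholder for a model id or a local artifact directory, not a value shipped with the library.

from furiosa_llm import LLM

# Placeholder path: point this at a compiled model artifact or a model id.
llm = LLM("path/to/model")

# ... call generate(), chat(), embed(), etc. ...
llm.shutdown()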

chat(messages: List[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam] | List[List[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam]], sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, guided_decoding=None, output_kind=RequestOutputKind.CUMULATIVE, stop_token_ids=None, ignore_eos=False), chat_template: str | None = None, chat_template_content_format: Literal['string'] = 'string', add_generation_prompt: bool = True, continue_final_message: bool = False, tools: List[Dict[str, Any]] | None = None, chat_template_kwargs: Dict[str, Any] | None = None) List[RequestOutput][source]#

Generate responses for a chat conversation.

The chat conversation is converted into a text prompt using the tokenizer, and the generate() method is then called to produce the responses.

Parameters:
  • messages

    A list of conversations or a single conversation.

    • Each conversation is represented as a list of messages.

    • Each message is a dictionary with ‘role’ and ‘content’ keys.

  • sampling_params – The sampling parameters for text generation.

  • chat_template – The template to use for structuring the chat. If not provided, the model’s default chat template will be used.

  • chat_template_content_format – The format to render message content. Currently only “string” is supported.

  • add_generation_prompt – If True, adds a generation template to each message.

  • continue_final_message – If True, continues the final message in the conversation instead of starting a new one. Cannot be True if add_generation_prompt is also True.

  • tools – Optional list of tools to use in the chat.

  • chat_template_kwargs – Additional keyword arguments to pass to the chat template rendering function.

Returns:

A list of RequestOutput objects containing the generated responses in the same order as the input messages.
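
A minimal single-conversation sketch; the model path is a placeholder, and the output field access assumes the standard RequestOutput layout.

from furiosa_llm import LLM, SamplingParams

llm = LLM("path/to/chat/model")  # placeholder path

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
params = SamplingParams(max_tokens=64)

outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)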

embed(prompts: str | TextPrompt | TokensPrompt | Sequence[str | TextPrompt | TokensPrompt], pooling_params: PoolingParams | Sequence[PoolingParams] | None = None) List[EmbeddingRequestOutput][source]#

Generate an embedding vector for each prompt. Only applicable to embedding models.

Parameters:
  • prompts – The prompts to the LLM. You may pass a sequence of prompts for batch embedding.

  • pooling_params – The pooling parameters to use.

Returns:

A list of EmbeddingRequestOutput objects containing the embedding vectors in the same order as the input prompts.

encode(prompts: str | TextPrompt | TokensPrompt | Sequence[str | TextPrompt | TokensPrompt], pooling_params: PoolingParams | Sequence[PoolingParams] | None = None, *, pooling_task: Literal['embed', 'classify', 'score', 'token_embed', 'token_classify', 'plugin'] | None = None) List[PoolingRequestOutput][source]#

Apply pooling to the hidden states corresponding to the input prompts.

Parameters:
  • prompts – The prompts to the LLM. You may pass a sequence of prompts for batch inference.

  • pooling_params – The pooling parameters to use.

  • pooling_task – Override the pooling task to use.

Returns:

A list of PoolingRequestOutput objects containing the pooled hidden states in the same order as the input prompts.
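
A minimal sketch, assuming a pooling-capable model; the model path is a placeholder and the printed field is illustrative.

from furiosa_llm import LLM

llm = LLM("path/to/embedding/model")  # placeholder path

# Pool hidden states for two prompts, explicitly requesting the "embed" task.
outputs = llm.encode(["First text", "Second text"], pooling_task="embed")
for output in outputs:
    print(output.outputs)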

generate(prompts: str | List[str], sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, guided_decoding=None, output_kind=RequestOutputKind.CUMULATIVE, stop_token_ids=None, ignore_eos=False), prompt_token_ids: BatchEncoding | None = None, tokenizer_kwargs: Dict[str, Any] | None = None) RequestOutput | List[RequestOutput][source]#

Generate texts from given prompts and sampling parameters.

Parameters:
  • prompts – The prompts to generate texts from.

  • sampling_params – The sampling parameters for generating texts.

  • prompt_token_ids – Pre-tokenized prompt input as a BatchEncoding object. If not provided, the prompt will be tokenized internally using the tokenizer.

  • tokenizer_kwargs – Additional keyword arguments passed to the tokenizer’s encode method, such as {“add_special_tokens”: True}.

Returns:

A list of RequestOutput objects containing the generated completions in the same order as the input prompts.
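
A minimal batch-generation sketch; the model path is a placeholder.

from furiosa_llm import LLM, SamplingParams

llm = LLM("path/to/model")  # placeholder path

prompts = ["The capital of France is", "The largest planet in the solar system is"]
params = SamplingParams(max_tokens=32, temperature=0.0)

outputs = llm.generate(prompts, sampling_params=params)
for output in outputs:
    print(output.outputs[0].text)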

classmethod load_artifact(model_id_or_path: str | PathLike, **kwargs) LLM[source]#

Deprecated: Use LLM() constructor directly.

This method is kept for backward compatibility and will be removed in a future release.

score(data_1: str | TextPrompt | TokensPrompt | Sequence[str | TextPrompt | TokensPrompt], data_2: str | TextPrompt | TokensPrompt | Sequence[str | TextPrompt | TokensPrompt], /, *, truncate_prompt_tokens: int | None = None, pooling_params: PoolingParams | None = None, chat_template: str | None = None) list[ScoringRequestOutput][source]#

Generate similarity scores for all pairs <text,text_pair>.

The inputs can be 1 -> 1, 1 -> N or N -> N. In the 1 -> N case, the data_1 input will be replicated N times to pair with the data_2 inputs.

Parameters:
  • data_1 – Can be a single prompt or a list of prompts. When a list, it must have the same length as the data_2 list.

  • data_2 – The data to pair with the query to form the input to the LLM.

  • truncate_prompt_tokens – The number of tokens to truncate the prompt to.

  • pooling_params – The pooling parameters to use. If None, the default pooling parameters are used.

  • chat_template – The chat template to use for scoring. If None, the model’s default chat template is used.

Returns:

A list of ScoringRequestOutput objects containing the generated scores in the same order as the input prompts.

shutdown()[source]#

Shut down the LLM engine gracefully.

async stream_generate(prompt: str, sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, guided_decoding=None, output_kind=RequestOutputKind.CUMULATIVE, stop_token_ids=None, ignore_eos=False), prompt_token_ids: BatchEncoding | None = None, tokenizer_kwargs: Dict[str, Any] | None = None, is_demo: bool = False) AsyncGenerator[str, None][source]#

Generate texts from the given prompt and sampling parameters.

Parameters:
  • prompt – The prompt to generate texts from. Note that unlike generate, this API supports only a single prompt.

  • sampling_params – The sampling parameters for generating texts.

  • prompt_token_ids – Pre-tokenized prompt input as a BatchEncoding object. If not provided, the prompt will be tokenized internally using the tokenizer.

  • tokenizer_kwargs – Additional keyword arguments passed to the tokenizer’s encode method, such as {“add_special_tokens”: True}.

Returns:

A stream of generated output tokens.
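
A minimal async streaming sketch; the model path is a placeholder and the loop assumes stream_generate yields text chunks as an async generator.

import asyncio
from furiosa_llm import LLM, SamplingParams

llm = LLM("path/to/model")  # placeholder path

async def main():
    params = SamplingParams(max_tokens=32)
    async for chunk in llm.stream_generate("Once upon a time", sampling_params=params):
        print(chunk, end="", flush=True)

asyncio.run(main())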

Key Methods#

generate()#

Generate text completions for the given prompts using sampling parameters. This is the primary method for text generation tasks.

See the chat example for usage.

embed()#

Generate embedding vectors for the given prompts. This method is only applicable to embedding models.

Parameters:

  • prompts (PromptType | Sequence[PromptType]): The prompts to encode. Can be a single prompt or a sequence for batch processing.

  • pooling_params (PoolingParams | Sequence[PoolingParams] | None): The pooling parameters. If None, default parameters are used.

Returns:

  • List[EmbeddingRequestOutput]: A list of embedding outputs containing the embedding vectors in the same order as the input prompts.

Example:

from furiosa_llm import LLM, PoolingParams

llm = LLM("path/to/embedding/model")

# Single embedding
outputs = llm.embed("Hello, world!")
embedding = outputs[0].outputs.embedding

# Batch embedding with normalization disabled
params = PoolingParams(normalize=False)
outputs = llm.embed(["First text", "Second text"], pooling_params=params)

See the embedding example for more details.

score()#

Generate similarity scores for text pairs. This method is only supported for binary classification models, such as Qwen3-Reranker models or models converted using as_binary_seq_cls_model.

Parameters:

  • data_1 (PromptType | Sequence[PromptType]): The first input text(s). Can be a single prompt or a list.

  • data_2 (PromptType | Sequence[PromptType]): The second input text(s) to pair with the first.

  • truncate_prompt_tokens (int | None): Maximum number of tokens to truncate the prompt to. If None, no truncation is applied.

  • pooling_params (PoolingParams | None): The pooling parameters. If None, default parameters are used.

  • chat_template (str | None): Custom chat template for scoring. If None, the model’s default template is used.

Input Patterns:

  • 1-to-1: Single text paired with single text

  • 1-to-N: Single text paired with multiple texts (data_1 is replicated N times)

  • N-to-N: Multiple texts paired element-wise (both lists must have the same length)

Returns:

  • List[ScoringRequestOutput]: A list of scoring outputs containing similarity scores in the same order as the input pairs.

Example:

from furiosa_llm import LLM, PoolingParams

llm = LLM("path/to/reranker/model")

# 1-to-N scoring: one query against multiple documents
query = "What is machine learning?"
documents = [
    "Machine learning is a subset of AI",
    "Python is a programming language",
    "Deep learning uses neural networks"
]

outputs = llm.score(query, documents)
for i, output in enumerate(outputs):
    print(f"Document {i}: score = {output.outputs.score}")

See the score example for more details.