LLM class#
- class furiosa_llm.LLM(model_id_or_path: str | PathLike, *, revision: str | None = None, devices: str | Sequence[Device] | None = None, data_parallel_size: int | None = None, pipeline_parallel_size: int | None = None, num_blocks_per_pp_stage: Sequence[int] | None = None, max_io_memory_mb: int = 2048, max_model_len: int | None = None, scheduler_config: SchedulerConfig | None = None, guided_decoding_backend: Literal['auto', 'guidance', 'xgrammar'] = 'auto', tokenizer: str | PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, tokenizer_mode: Literal['auto', 'slow'] = 'auto', seed: int | None = None, cache_dir: PathLike = PosixPath('/root/.cache/furiosa/llm'), skip_engine: bool = False, enable_jit_wiring: bool = False, **kwargs)[source]#
Bases: object
An LLM for generating texts from given prompts and sampling parameters.
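Example (an illustrative construction sketch; the artifact path below is a hypothetical placeholder, not a value shipped with furiosa_llm):
from furiosa_llm import LLM
# Build an engine from a local model artifact (placeholder path for illustration only).
llm = LLM("path/to/model/artifact", max_model_len=4096)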
- chat(messages: List[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam] | List[List[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam]], sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, guided_decoding=None, output_kind=RequestOutputKind.CUMULATIVE, stop_token_ids=None, ignore_eos=False), chat_template: str | None = None, chat_template_content_format: Literal['string'] = 'string', add_generation_prompt: bool = True, continue_final_message: bool = False, tools: List[Dict[str, Any]] | None = None, chat_template_kwargs: Dict[str, Any] | None = None) List[RequestOutput][source]#
Generate responses for a chat conversation.
The chat conversation is converted into a text prompt using the tokenizer, and the
generate() method is then called to generate the responses.
- Parameters:
messages –
A list of conversations or a single conversation.
Each conversation is represented as a list of messages.
Each message is a dictionary with ‘role’ and ‘content’ keys.
sampling_params – The sampling parameters for text generation.
chat_template – The template to use for structuring the chat. If not provided, the model’s default chat template will be used.
chat_template_content_format – The format to render message content. Currently only “string” is supported.
add_generation_prompt – If True, adds a generation template to each message.
continue_final_message – If True, continues the final message in the conversation instead of starting a new one. Cannot be True if add_generation_prompt is also True.
tools – Optional list of tools to use in the chat.
chat_template_kwargs – Additional keyword arguments to pass to the chat template rendering function.
- Returns:
A list of RequestOutput objects containing the generated responses in the same order as the input messages.
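Example (a minimal sketch of a chat call; the artifact path is a placeholder and the messages are illustrative only):
from furiosa_llm import LLM, SamplingParams
llm = LLM("path/to/model/artifact")  # placeholder path
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)  # first completion of the first (and only) conversation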
- embed(prompts: str | TextPrompt | TokensPrompt | Sequence[str | TextPrompt | TokensPrompt], pooling_params: PoolingParams | Sequence[PoolingParams] | None = None) List[EmbeddingRequestOutput][source]#
Generate an embedding vector for each prompt. Only applicable to embedding models.
- Parameters:
prompts – The prompts to the LLM. You may pass a sequence of prompts for batch embedding.
pooling_params – The pooling parameters for pooling.
- Returns:
A list of EmbeddingRequestOutput objects containing the embedding vectors in the same order as the input prompts.
- encode(prompts: str | TextPrompt | TokensPrompt | Sequence[str | TextPrompt | TokensPrompt], pooling_params: PoolingParams | Sequence[PoolingParams] | None = None, *, pooling_task: Literal['embed', 'classify', 'score', 'token_embed', 'token_classify', 'plugin'] | None = None) List[PoolingRequestOutput][source]#
Apply pooling to the hidden states corresponding to the input prompts.
- Parameters:
prompts – The prompts to the LLM. You may pass a sequence of prompts for batch inference.
pooling_params – The pooling parameters for pooling.
pooling_task – Override the pooling task to use.
- Returns:
A list of PoolingRequestOutput objects containing the pooled hidden states in the same order as the input prompts.
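Example (an illustrative sketch; the artifact path is a placeholder and the pooling_task value is one of the literals listed in the signature above):
from furiosa_llm import LLM
llm = LLM("path/to/embedding/model")  # placeholder path
outputs = llm.encode(["First text", "Second text"], pooling_task="embed")
for out in outputs:
    print(out.outputs)  # pooled hidden states for each prompt, in input order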
- generate(prompts: str | List[str], sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, guided_decoding=None, output_kind=RequestOutputKind.CUMULATIVE, stop_token_ids=None, ignore_eos=False), prompt_token_ids: BatchEncoding | None = None, tokenizer_kwargs: Dict[str, Any] | None = None) RequestOutput | List[RequestOutput][source]#
Generate texts from given prompts and sampling parameters.
- Parameters:
prompts – The prompts to generate texts from.
sampling_params – The sampling parameters for generating texts.
prompt_token_ids – Pre-tokenized prompt input as a BatchEncoding object. If not provided, the prompt will be tokenized internally using the tokenizer.
tokenizer_kwargs – Additional keyword arguments passed to the tokenizer’s encode method, such as {“use_special_tokens”: True}.
- Returns:
A list of RequestOutput objects containing the generated completions in the same order as the input prompts.
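Example (a minimal sketch; the artifact path and prompt are placeholders):
from furiosa_llm import LLM, SamplingParams
llm = LLM("path/to/model/artifact")  # placeholder path
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)
outputs = llm.generate(["The future of AI is"], sampling_params=params)
for output in outputs:
    print(output.outputs[0].text)  # best completion for each prompt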
- classmethod load_artifact(model_id_or_path: str | PathLike, **kwargs) LLM[source]#
Deprecated: Use LLM() constructor directly.
This method is kept for backward compatibility and will be removed in a future release.
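Migration sketch (placeholder path; per the deprecation note above, both calls load the same artifact):
from furiosa_llm import LLM
llm = LLM.load_artifact("path/to/model/artifact")  # deprecated
llm = LLM("path/to/model/artifact")                # preferred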
- score(data_1: str | TextPrompt | TokensPrompt | Sequence[str | TextPrompt | TokensPrompt], data_2: str | TextPrompt | TokensPrompt | Sequence[str | TextPrompt | TokensPrompt], /, *, truncate_prompt_tokens: int | None = None, pooling_params: PoolingParams | None = None, chat_template: str | None = None) list[ScoringRequestOutput][source]#
Generate similarity scores for all pairs <text,text_pair>.
The inputs can be 1 -> 1, 1 -> N or N -> N. In the 1 -> N case, the data_1 input will be replicated N times to pair with the data_2 inputs.
- Parameters:
data_1 – Can be a single prompt or a list of prompts. When a list, it must have the same length as the data_2 list.
data_2 – The data to pair with the query to form the input to the LLM.
truncate_prompt_tokens – The number of tokens to truncate the prompt to.
pooling_params – The pooling parameters for pooling. If None, we use the default pooling parameters.
chat_template – The chat template to use for the scoring. If None, we use the model’s default chat template.
- Returns:
A list of ScoringRequestOutput objects containing the generated scores in the same order as the input prompts.
- async stream_generate(prompt: str, sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, guided_decoding=None, output_kind=RequestOutputKind.CUMULATIVE, stop_token_ids=None, ignore_eos=False), prompt_token_ids: BatchEncoding | None = None, tokenizer_kwargs: Dict[str, Any] | None = None, is_demo: bool = False) AsyncGenerator[str, None][source]#
Generate texts from given prompt and sampling parameters.
- Parameters:
prompt – The prompt to generate texts from. Note that unlike generate(), this API supports only a single prompt.
sampling_params – The sampling parameters for generating texts.
prompt_token_ids – Pre-tokenized prompt input as a BatchEncoding object. If not provided, the prompt will be tokenized internally using the tokenizer.
tokenizer_kwargs – Additional keyword arguments passed to the tokenizer’s encode method, such as {“use_special_tokens”: True}.
- Returns:
A stream of generated output tokens.
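Example (an asynchronous sketch; the artifact path and prompt are placeholders, and the stream is assumed to yield text fragments as declared in the return type above):
import asyncio
from furiosa_llm import LLM, SamplingParams
llm = LLM("path/to/model/artifact")  # placeholder path
async def main():
    params = SamplingParams(max_tokens=64)
    # Iterate over the async generator of output text fragments.
    async for text in llm.stream_generate("Tell me a short story.", params):
        print(text, end="", flush=True)
asyncio.run(main())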
Key Methods#
generate()#
Generate text completions for the given prompts using sampling parameters. This is the primary method for text generation tasks.
See the chat example for usage.
embed()#
Generate embedding vectors for the given prompts. This method is only applicable to embedding models.
Parameters:
prompts (PromptType | Sequence[PromptType]): The prompts to encode. Can be a single prompt or a sequence for batch processing.
pooling_params (PoolingParams | Sequence[PoolingParams] | None): The pooling parameters. If None, default parameters are used.
Returns:
List[EmbeddingRequestOutput]: A list of embedding outputs containing the embedding vectors in the same order as the input prompts.
Example:
from furiosa_llm import LLM, PoolingParams
llm = LLM("path/to/embedding/model")
# Single embedding
outputs = llm.embed("Hello, world!")
embedding = outputs[0].outputs.embedding
# Batch embedding with normalization disabled
params = PoolingParams(normalize=False)
outputs = llm.embed(["First text", "Second text"], pooling_params=params)
See the embedding example for more details.
score()#
Generate similarity scores for text pairs. This method is only supported for binary classification models,
such as Qwen3-Reranker models or models converted using as_binary_seq_cls_model.
Parameters:
data_1 (PromptType | Sequence[PromptType]): The first input text(s). Can be a single prompt or a list.
data_2 (PromptType | Sequence[PromptType]): The second input text(s) to pair with the first.
truncate_prompt_tokens (int | None): Maximum number of tokens to truncate the prompt to. If None, no truncation is applied.
pooling_params (PoolingParams | None): The pooling parameters. If None, default parameters are used.
chat_template (str | None): Custom chat template for scoring. If None, the model’s default template is used.
Input Patterns:
1-to-1: Single text paired with single text
1-to-N: Single text paired with multiple texts (data_1 is replicated N times)
N-to-N: Multiple texts paired element-wise (both lists must have the same length)
Returns:
List[ScoringRequestOutput]: A list of scoring outputs containing similarity scores in the same order as the input pairs.
Example:
from furiosa_llm import LLM, PoolingParams
llm = LLM("path/to/reranker/model")
# 1-to-N scoring: one query against multiple documents
query = "What is machine learning?"
documents = [
"Machine learning is a subset of AI",
"Python is a programming language",
"Deep learning uses neural networks"
]
outputs = llm.score(query, documents)
for i, output in enumerate(outputs):
print(f"Document {i}: score = {output.outputs.score}")
See the score example for more details.