LLM class

LLM class#

class furiosa_llm.LLM(model: str, task_type: str | None = None, llm_config: LLMConfig | None = None, auto_bfloat16_cast: bool | None = None, qformat_path: PathLike | None = None, qparam_path: PathLike | None = None, quant_ckpt_file_path: PathLike | None = None, hf_overrides: Dict[str, Any] = {}, bucket_config: BucketConfig | None = None, speculative_model: str | LLM | None = None, speculative_model_llm_config: LLMConfig | None = None, speculative_model_qformat_path: PathLike | None = None, speculative_model_qparam_path: PathLike | None = None, speculative_model_quant_ckpt_file_path: PathLike | None = None, speculative_model_config: Dict[str, Any] = {}, speculative_model_bucket_config: BucketConfig | None = None, speculative_draft_tensor_parallel_size: int | None = None, speculative_draft_pipeline_parallel_size: int | None = None, speculative_draft_data_parallel_size: int | None = None, speculative_draft_num_blocks_per_pp_stage: Sequence[int] | None = None, num_speculative_tokens: int | None = None, max_seq_len_to_capture: int = 2048, max_prompt_len: int | None = None, tensor_parallel_size: int = 4, pipeline_parallel_size: int = 1, data_parallel_size: int | None = None, device_mesh: Sequence[Sequence[Sequence[str]]] | None = None, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, tokenizer_mode: Literal['auto', 'slow'] = 'auto', trust_remote_code: bool = False, seed: int | None = None, devices: str | Sequence[Device] | None = None, param_file_path: PathLike | None = None, param_saved_format: Literal['safetensors', 'pt'] = 'safetensors', param_file_max_shard_size: str | int | None = '5GB', do_decompositions_for_model_rewrite: bool = False, comp_supertask_kind: Literal['edf', 'dfg', 'fx'] | None = None, cache_dir: PathLike | None = PosixPath('/root/.cache/furiosa/llm'), backend: LLMBackend | None = None, use_blockwise_compile: bool = True, num_blocks_per_supertask: int | Callable[[Bucket], int] = 1, num_blocks_per_pp_stage: Sequence[int] | None = None, embed_all_constants_into_graph: bool = False, paged_attention_block_size: int = 1, kv_cache_sharing_across_beams_config: KvCacheSharingAcrossBeamsConfig | None = None, prefill_chunk_size: int | None = None, scheduler_config: SchedulerConfig = SchedulerConfig(scheduler_kind=None, npu_queue_limit=2, max_processing_samples=65536, spare_blocks_ratio=0.0, is_offline=False, prefill_chunk_size=None, estimation_time_limit_ms=None, enable_prefix_caching=False), packing_type: Literal['IDENTITY'] = 'IDENTITY', compiler_config_overrides: Mapping | None = None, use_random_weight: bool = False, num_pipeline_builder_workers: int = 1, num_compile_workers: int = 1, skip_engine: bool = False, *, _cleanup: bool = True, _pipelines_with_metadata: Sequence[PipelineWithMetadata] | None = None, _custom_buckets: Sequence[Bucket | BucketWithOutputLogitsSize] = [], _optimize_logit_shape: bool = True, _model_metadata: ModelMetadata | None = None, _unpadded_vocab_size: int | None = None, _embedding_layer_as_single_block: bool = False, _artifact_id: str = 'NO_ARTIFACT_ID', _use_pipelines_as_is: bool = False, _enable_bf16_partial_sum_for_split: bool = True, _use_2d_attention_masks: bool = False, _merge_kv_cache_indices: bool = False, **kwargs)[source]#

Bases: object

An LLM for generating texts from given prompts and sampling parameters.

Parameters:

model – The name of the pretrained model. This corresponds to pretrained_model_name_or_path in HuggingFace Transformers.
task_type – The type of the task. This corresponds to task in HuggingFace Transformers. See https://huggingface.co/docs/transformers/main/en/quicktour#pipeline for more details.
llm_config – The configuration for the LLM. This includes quantization and optimization configurations.
auto_bfloat16_cast – Whether to cast the model to bfloat16 automatically. This option is required when neither the model is trained with bfloat16 nor quantized.
qformat_path – The path to the quantization format file.
qparam_path – The path to the quantization parameter file.
quant_ckpt_file_path – The path to the quantized parameters checkpoint file.
hf_overrides – Additional HuggingFace Transformers model configuration. This is a dictionary that includes the configuration for the model.
bucket_config – Config for bucket generating policy. If not given, the model will use single one batch, max_seq_len_to_capture attention size bucket per each phase.
speculative_model – Speculative model for speculative decoding.
speculative_model_llm_config – The configuration for the speculative model. This includes quantization and optimization configurations.
speculative_model_qformat_path – The path to the quantization format file for the speculative model.
speculative_model_qparam_path – The path to the quantization parameter file for the speculative model.
speculative_model_quant_ckpt_file_path – The path to the quantized parameters checkpoint file for the speculative model.
speculative_model_config – Additional HuggingFace Transformers model configuration for the speculative model. This is a dictionary that includes the configuration for the model.
speculative_model_bucket_config – Config for bucket generating policy. If not given, the model will use single one batch, max_seq_len_to_capture attention size bucket per each phase.
speculative_draft_tensor_parallel_size – The number of PEs for each tensor parallelism group in speculative model. If not given, it will follow the value of the target model. This value will be ignored if speculative_model is given as LLM instance.
speculative_draft_data_parallel_size – The size of the data parallelism for running speculative model. If not given, it will follow the value of the target model. This value will be ignored if speculative_model is given as LLM instance.
speculative_draft_pipeline_parallel_size – The size of the pipeline parallelism for running speculative model. The argument is valid only for artifacts that use blockwise compilation. If not given, it will follow the value of the target model. This value will be ignored if speculative_model is given as LLM instance.
speculative_draft_num_blocks_per_pp_stage – The number of transformer blocks per each pipeline parallelism stage for running speculative model. The argument is valid only for artifacts that use blockwise compilation. If not given, it will follow the value of the target model. In anyway if only speculative_draft_pipeline_parallel_size is provided, transformer blocks of speculative model will be distributed equally. This value will be ignored if speculative_model is given as LLM instance.
num_speculative_tokens – The number of tokens that specualtive model will generate speculatively during each iteration of the decoding process
max_seq_len_to_capture – Maximum sequence length covered by LLM engine. Sequence with larger context than this will not be covered. The default is 2048.
max_prompt_len – Maximum prompt sequence length covered by LLM engine. Prompt larger than this cannot be handled. If not given, will be obtained from bucket and other configs.
tensor_parallel_size – The number of PEs for each tensor parallelism group. The default is 4.
pipeline_parallel_size – The number of pipeline stages for pipeline parallelism. The default is 1, which means no pipeline parallelism.
data_parallel_size – The size of the data parallelism group. If not given, it will be inferred from total available PEs and other parallelism degrees.
trust_remote_code – Trust remote code when downloading the model and tokenizer from HuggingFace.
tokenizer – The name or path of a HuggingFace Transformers tokenizer.
tokenizer_mode – The tokenizer mode. “auto” will use the fast tokenizer if available, and “slow” will always use the slow tokenizer.
seed – The seed to initialize the random number generator for sampling.
devices – The devices to run the model. It can be a single device or a list of devices. Each device can be either “npu:X” or “npu:X:*” where X is a specific device index. If not given, available devices will be used.
param_file_path – The path to the parameter file to use for pipeline generation. If not specified, the parameters will be saved in a temporary file which will be deleted when LLM is destroyed.
param_saved_format – The format of the parameter file. Only possible value is “safetensors” now. The default is “safetensors”.
param_file_max_shard_size – The maximum size of a parameter file shard referenced by pipeline. The default is “5GB”.
do_decompositions_for_model_rewrite – Whether to decompose some ops to describe various parallelism strategies with mppp config. When the value is True, mppp config that matches with the decomposed FX graph should be given.
comp_supertask_kind – The format that pipeline’s supertasks will be represented as. Possible values are “fx”,”dfg”, and “edf”, and the default is “edf”.
cache_dir – The cache directory for all generated files for this LLM instance. When its value is None, caching is disabled. The default is “$HOME/.cache/furiosa/llm”.
backend – The backend implementation to run forward() of a model for the LLM. If not specified, the backend will be chosen based on the device kind.
use_blockwise_compile – If True, each task will be compiled in the unit of transformer block, and compilation result for transformer block is generated once and reused. The default is True.
num_blocks_per_supertask – The number of transformer blocks that will be merged into one supertask. This option is valid only when use_blockwise_compile=True. The default is 1.
num_blocks_per_pp_stage – The number of transformers blocks per each pipeline parallelism stage. If not given, transformer blocks will be distributed equally.
embed_all_constants_into_graph – Whether to embed constant tensors into graph or make them as input of the graph and save them as separate files. The default is False.
paged_attention_block_size – The maximum number of tokens that can be stored in a single paged attention block. This argument must be given if model uses paged attention.
kv_cache_sharing_across_beams_config – Configuration for sharing kv cache across beams. This argument must be given if and only if the model is optimized to share kv cache across beams. If this argument is given, decode phase buckets with batch size of batch_size * kv_cache_sharing_across_beams_config.beam_width will be created.
prefill_chunk_size – Chunk size used for chunked prefill. If the value is None, chunked prefill is not used.
scheduler_config – Configuration for the scheduler, allowing to maximum number of tasks which can be queued to HW, maximum number of samples that can be processed by the scheduler, and ratio of spare blocks that are reserved by scheduler.
packing_type – Packing algorithm. Possible values are “IDENTITY” only for now
compiler_config_overrides – Overrides for the compiler config. This is a dictionary that includes the configuration for the compiler.
use_random_weight – If True, the model will be initialized with random weights.
num_pipeline_builder_workers – number of workers used for building pipelines (except for compilation). The default is 1 (no parallelism). Setting this value larger than 1 reduces pipeline building time, especially for large models, but requires much more memory.
num_compile_workers – number of workers used for compilation. The default is 1 (no parallelism).
skip_engine – If True, the native runtime engine will not be initialized. This is useful when you need the pipelines for other purposes than running them with the engine.

chat(messages: List[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam] | List[List[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam]], sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, logprobs=None, output_kind=RequestOutputKind.CUMULATIVE), chat_template: str | None = None, chat_template_content_format: Literal['string'] = 'string', add_generation_prompt: bool = True, continue_final_message: bool = False, tools: List[Dict[str, Any]] | None = None) → List[RequestOutput][source]#

Generate responses for a chat conversation.

The chat conversation is converted into a text prompt using the tokenizer and calls the generate() method to generate the responses.

Parameters:

messages –
A list of conversations or a single conversation.
- Each conversation is represented as a list of messages.
- Each message is a dictionary with ‘role’ and ‘content’ keys.
sampling_params – The sampling parameters for text generation.
chat_template – The template to use for structuring the chat. If not provided, the model’s default chat template will be used.
chat_template_content_format – The format to render message content. Currently only “string” is supported.
add_generation_prompt – If True, adds a generation template to each message.
continue_final_message – If True, continues the final message in the conversation instead of starting a new one. Cannot be True if add_generation_prompt is also True.

Returns:

A list of RequestOutput objects containing the generated responses in the same order as the input messages.

generate(prompts: str | List[str], sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, logprobs=None, output_kind=RequestOutputKind.CUMULATIVE), prompt_token_ids: BatchEncoding | None = None, tokenizer_kwargs: Dict[str, Any] | None = None) → RequestOutput | List[RequestOutput][source]#

Generate texts from given prompts and sampling parameters.

Parameters:

prompts – The prompts to generate texts.
sampling_params – The sampling parameters for generating texts.
prompt_token_ids – Pre-tokenized prompt input as a BatchEncoding object. If not provided, the prompt will be tokenized internally using the tokenizer.
tokenizer_kwargs – Additional keyword arguments passed to the tokenizer’s encode method, such as {“use_special_tokens”: True}.

Returns:

A list of RequestOutput objects containing the generated completions in the same order as the input prompts.

classmethod load_artifact(model_id_or_path: str | PathLike, *, revision: str | None = None, devices: str | Sequence[Device] | Sequence[Sequence[Sequence[str]]] | None = None, data_parallel_size: int | None = None, pipeline_parallel_size: int | None = None, num_blocks_per_pp_stage: Sequence[int] | None = None, device_mesh: Sequence[Sequence[Sequence[str]]] | None = None, prefill_buckets: List[Tuple[int, int]] | None = None, decode_buckets: List[Tuple[int, int]] | None = None, max_prompt_len: int | None = None, max_model_len: int | None = None, max_batch_size: int | None = None, min_batch_size: int | None = None, scheduler_config: SchedulerConfig | None = None, speculative_model: str | PathLike | LLM | None = None, speculative_draft_data_parallel_size: int | None = None, speculative_draft_pipeline_parallel_size: int | None = None, speculative_draft_num_blocks_per_pp_stage: Sequence[int] | None = None, skip_speculative_model_load: bool = False, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, tokenizer_mode: Literal['auto', 'slow'] = 'auto', seed: int | None = None, cache_dir: PathLike = PosixPath('/root/.cache/furiosa/llm'), skip_engine: bool = False, **kwargs) → LLM[source]#

Instantiate LLM from saved artifacts without quantization and compilation.

Parameters:

model_id_or_path – A path to furiosa llm engine artifact or a HuggingFace model id.
revision – The revision of the model, if model_id_or_path is a HuggingFace model id.
devices – The devices to run the model. It can be a single device or a list of devices. Each device can be either “npu:X” or “npu:X:*” where X is a specific device index. If not given, all available devices will be used.
data_parallel_size – The size of the data parallelism group. If not given, it will be inferred from total available PEs and other parallelism degrees.
pipeline_parallel_size – The size of the pipeline parallelism. The argument is valid only for artifacts that use blockwise compilation.
num_blocks_per_pp_stage – The number of transformer blocks per each pipeline parallelism stage. The argument is valid only for artifacts that use blockwise compilation. If only pipeline_parallel_size is provided, transformer blocks will be distributed equally.
device_mesh – 3D Matrix of device ids that defines the model parallelism strategy. In this matrix, three dimensions determine the grouping of devices for data, pipeline, and tensor parallelism respectively. This is for advanced users who want to use parallelism strategy that cannot be represented with tensor_parallel_size, pipeline_parallel_size and data_parallel_size. So if this argument is provided, all other parallelism options should not be provided.
prefill_buckets – Prefill buckets to use. Specified buckets must exist in the compiled artifact. If not given, all prefill buckets in the artifact will be used.
decode_buckets – Decode buckets to use. Specified buckets must exist in the compiled artifact. If not given, all decode buckets in the artifact will be used.
max_prompt_len – The maximum prompt length to use. If given, prefill buckets with attention size larger than this value will be ignored.
max_model_len – The maximum context length to use. If given, decode buckets with attention size larger than this value will be ignored.
max_batch_size – The maximum number of batched samples to use.
min_batch_size – The minimum number of batched samples to use.
scheduler_config – Configuration for the scheduler, allowing to maximum number of tasks which can be queued to HW, maximum number of samples that can be processed by the scheduler, and ratio of spare blocks that are reserved by scheduler. If this is not given, scheduler config saved in the artifacts will be used.
speculative_model – Speculative model for speculative decoding. Should be provided either as an artifact path or as an LLM instance. Note that speculative decoding is an experimental feature and may lead to unstable behavior.
speculative_draft_data_parallel_size – The size of the data parallelism for running speculative model. If not given, it will follow the value of the target model. This value will be ignored if speculative_model is given as LLM instance. Note that speculative decoding is an experimental feature and may lead to unstable behavior.
speculative_draft_pipeline_parallel_size – The size of the pipeline parallelism for running speculative model. The argument is valid only for artifacts that use blockwise compilation. If not given, it will follow the value of the target model. This value will be ignored if speculative_model is given as LLM instance. Note that speculative decoding is an experimental feature and may lead to unstable behavior.
speculative_draft_num_blocks_per_pp_stage – The number of transformer blocks per each pipeline parallelism stage for running speculative model. The argument is valid only for artifacts that use blockwise compilation. If not given, it will follow the value of the target model. In anyway if only speculative_draft_pipeline_parallel_size is provided, transformer blocks of speculative model will be distributed equally. This value will be ignored if speculative_model is given as LLM instance. Note that speculative decoding is an experimental feature and may lead to unstable behavior.
skip_speculative_model_load – If True, artifact will be loaded without speculative decoding.
tokenizer – The name or path of a HuggingFace Transformers tokenizer.
tokenizer_mode – The tokenizer mode. “auto” will use the fast tokenizer if available, and “slow” will always use the slow tokenizer.
seed – The seed to initialize the random number generator for sampling.
cache_dir – The cache directory for all generated files for this LLM instance. When its value is None, caching is disabled. The default is “$HOME/.cache/furiosa/llm”.
skip_engine – If True, the native runtime engine will not be initialized. This is useful when you need the pipelines for other purposes than running them with the engine.

classmethod load_artifacts(path: str | PathLike, **kwargs) → LLM[source]#

Instantiate LLM from saved artifacts without quantization and compilation.

Please note that this method is being deprecated. Use load_artifact instead.

Parameters:

path – A path to artifacts to load.
devices – The devices to run the model. It can be a single device or a list of devices. Each device can be either “npu:X” or “npu:X:*” where X is a specific device index. If not given, all available devices will be used.
data_parallel_size – The size of the data parallelism group. If not given, it will be inferred from total available PEs and other parallelism degrees.
pipeline_parallel_size – The size of the pipeline parallelism. The argument is valid only for artifacts that use blockwise compilation.
num_blocks_per_pp_stage – The number of transformer blocks per each pipeline parallelism stage. The argument is valid only for artifacts that use blockwise compilation. If only pipeline_parallel_size is provided, transformer blocks will be distributed equally.
device_mesh – 3D Matrix of device ids that defines the model parallelism strategy. In this matrix, three dimensions determine the grouping of devices for data, pipeline, and tensor parallelism respectively. This is for advanced users who want to use parallelism strategy that cannot be represented with tensor_parallel_size, pipeline_parallel_size and data_parallel_size. So if this argument is provided, all other parallelism options should not be provided.
prefill_buckets – Prefill buckets to use. Specified buckets must exist in the compiled artifact. If not given, all prefill buckets in the artifact will be used.
decode_buckets – Decode buckets to use. Specified buckets must exist in the compiled artifact. If not given, all decode buckets in the artifact will be used.
max_prompt_len – The maximum prompt length to use. If given, prefill buckets with attention size larger than this value will be ignored.
max_model_len – The maximum context length to use. If given, decode buckets with attention size larger than this value will be ignored.
max_batch_size – The maximum number of batched samples to use.
min_batch_size – The minimum number of batched samples to use.
scheduler_config – Configuration for the scheduler, allowing to maximum number of tasks which can be queued to HW, maximum number of samples that can be processed by the scheduler, and ratio of spare blocks that are reserved by scheduler. If this is not given, scheduler config saved in the artifacts will be used.
speculative_model – Speculative model for speculative decoding. Should be provided either as an artifact path or as an LLM instance. Note that speculative decoding is an experimental feature and may lead to unstable behavior.
speculative_draft_data_parallel_size – The size of the data parallelism for running speculative model. If not given, it will follow the value of the target model. This value will be ignored if speculative_model is given as LLM instance. Note that speculative decoding is an experimental feature and may lead to unstable behavior.
speculative_draft_pipeline_parallel_size – The size of the pipeline parallelism for running speculative model. The argument is valid only for artifacts that use blockwise compilation. If not given, it will follow the value of the target model. This value will be ignored if speculative_model is given as LLM instance. Note that speculative decoding is an experimental feature and may lead to unstable behavior.
speculative_draft_num_blocks_per_pp_stage – The number of transformer blocks per each pipeline parallelism stage for running speculative model. The argument is valid only for artifacts that use blockwise compilation. If not given, it will follow the value of the target model. In anyway if only speculative_draft_pipeline_parallel_size is provided, transformer blocks of speculative model will be distributed equally. This value will be ignored if speculative_model is given as LLM instance. Note that speculative decoding is an experimental feature and may lead to unstable behavior.
skip_speculative_model_load – If True, artifact will be loaded without speculative decoding.
tokenizer – The name or path of a HuggingFace Transformers tokenizer.
tokenizer_mode – The tokenizer mode. “auto” will use the fast tokenizer if available, and “slow” will always use the slow tokenizer.
seed – The seed to initialize the random number generator for sampling.
cache_dir – The cache directory for all generated files for this LLM instance. When its value is None, caching is disabled. The default is “$HOME/.cache/furiosa/llm”.
skip_engine – If True, the native runtime engine will not be initialized. This is useful when you need the pipelines for other purposes than running them with the engine.

async stream_generate(prompt: str, sampling_params: SamplingParams = SamplingParams(n=1, best_of=1, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, max_tokens=16, min_tokens=0, logprobs=None, output_kind=RequestOutputKind.CUMULATIVE), prompt_token_ids: BatchEncoding | None = None, tokenizer_kwargs: Dict[str, Any] | None = None, is_demo: bool = False) → AsyncGenerator[str, None][source]#

Generate texts from given prompt and sampling parameters.

Parameters:

prompt – The prompt to generate texts. Note that unlike generate, this API supports only a single prompt.
sampling_params – The sampling parameters for generating texts.
prompt_token_ids – Pre-tokenized prompt input as a BatchEncoding object. If not provided, the prompt will be tokenized internally using the tokenizer.
tokenizer_kwargs – Additional keyword arguments passed to the tokenizer’s encode method, such as {“use_special_tokens”: True}.

Returns:

A stream of generated output tokens.