ArtifactBuilder#
- class furiosa_llm.artifact.ArtifactBuilder(model_id_or_path: str, name: str = '', *, tensor_parallel_size: int = 4, pipeline_parallel_size: int = 1, data_parallel_size: int | None = None, prefill_buckets: Sequence[Tuple[int, int]] = [], decode_buckets: Sequence[Tuple[int, int]] = [], max_seq_len_to_capture: int = 2048, max_prompt_len: int | None = None, num_hidden_layers: int | None = None, seed_for_random_weight: int | None = None, calculate_logit_only_for_last_token: bool | None = True, quantize_artifact_path: PathLike | None = None, compiler_config_overrides: Mapping | None = None, do_decompositions_for_model_rewrite: bool = False, use_blockwise_compile: bool = True, num_blocks_per_supertask: int = 1, num_blocks_per_pp_stage: Sequence[int] | None = None, embed_all_constants_into_graph: bool = False, optimize_logit_shape: bool = True, kv_cache_sharing_across_beams_config: KvCacheSharingAcrossBeamsConfig | None = None, paged_attention_block_size: int = 1, default_scheduler_config: SchedulerConfig = SchedulerConfig(npu_queue_limit=2, max_processing_samples=65536, spare_blocks_ratio=0.2, is_offline=False, prefill_chunk_size=None), trust_remote_code: bool | None = None, **kwargs)[source]#
Bases: object
The artifact builder to use in Furiosa LLM.
- Parameters:
model_id_or_path – The HuggingFace model id or a local path. This corresponds to pretrained_model_name_or_path in HuggingFace Transformers.
name – The name of the artifact to build.
tensor_parallel_size – The number of PEs for each tensor parallelism group. The default is 4.
pipeline_parallel_size – The number of pipeline stages for pipeline parallelism. The default is 1. This param configures the default pipeline parallelism degree for the artifact. pipeline_parallel_size can be overridden when the artifact is loaded.
data_parallel_size – The size of the data parallelism group. If not given, it will be derived from the total available PEs and the other parallelism degrees.
prefill_buckets – The bucket sizes to use for the prefill phase.
decode_buckets – The bucket sizes to use for the decode phase.
max_seq_len_to_capture – Maximum sequence length covered by the LLM engine. Sequences with a longer context than this will not be covered. If no bucket is explicitly specified, a single-batch bucket with a context length of this value is created.
max_prompt_len – Maximum prompt length covered by the LLM engine. Prompts longer than this cannot be handled. If not given, it will be derived from the buckets and other configs.
num_hidden_layers – Number of hidden layers in the Transformer encoder.
seed_for_random_weight – The seed for the random number generator used to create random weights.
calculate_logit_only_for_last_token – Whether the model has the last-block-slice optimization applied.
quantize_artifact_path – Specifies the path where quantization artifacts generated by the furiosa-model-compressor are saved.
compiler_config_overrides – Overrides for the compiler config, given as a mapping of compiler configuration entries.
do_decompositions_for_model_rewrite – Whether to decompose some ops in order to express various parallelism strategies with an mppp config. When True, an mppp config that matches the decomposed FX graph must be given.
use_blockwise_compile – If True, each task will be compiled in units of transformer blocks, and the compilation result for a transformer block is generated once and reused. The default is True.
num_blocks_per_supertask – The number of transformer blocks merged into one supertask. This option is valid only when use_blockwise_compile=True. The default is 1.
num_blocks_per_pp_stage – The number of transformer blocks in each pipeline parallelism stage. If not given, transformer blocks will be distributed equally across stages.
embed_all_constants_into_graph – Whether to embed constant tensors into the graph, or to make them inputs of the graph and save them as separate files. The default is False.
optimize_logit_shape – Whether to add a logit slicing or removal operation to the graph for better performance.
kv_cache_sharing_across_beams_config – Configuration for sharing the KV cache across beams. This argument must be given if and only if the model is optimized to share the KV cache across beams. If given, decode-phase buckets with a batch size of batch_size * kv_cache_sharing_across_beams_config.beam_width will be created.
paged_attention_block_size – The maximum number of tokens that can be stored in a single paged attention block. This argument must be given if the model uses paged attention.
default_scheduler_config – Default configuration for the scheduler, specifying the maximum number of tasks that can be queued to the hardware, the maximum number of samples that can be processed by the scheduler, and the ratio of spare blocks reserved by the scheduler.
trust_remote_code – Trust remote code when downloading the model and tokenizer from HuggingFace.
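Example (a minimal sketch): constructing a builder for a HuggingFace model. The model id, artifact name, and bucket shapes below are illustrative assumptions, not library defaults; the buckets are written as (batch size, context length) pairs, which is also an assumption about the tuple layout:

    from furiosa_llm.artifact import ArtifactBuilder

    # All concrete values below are hypothetical, for illustration only.
    builder = ArtifactBuilder(
        "meta-llama/Meta-Llama-3.1-8B-Instruct",  # HuggingFace model id or local path
        name="llama3.1-8b-instruct",
        tensor_parallel_size=4,                   # PEs per tensor parallelism group
        pipeline_parallel_size=1,
        prefill_buckets=[(1, 512), (1, 1024)],    # assumed (batch size, context length)
        decode_buckets=[(4, 2048)],
        max_seq_len_to_capture=2048,
    )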
- build(save_dir: str | PathLike, *, num_pipeline_builder_workers: int = 1, num_compile_workers: int = 1, cache_dir: PathLike | None = PosixPath('/root/.cache/furiosa/llm'), param_file_path: PathLike | None = None, param_saved_format: Literal['safetensors', 'pt'] = 'safetensors', _cleanup: bool = True, **kwargs)[source]#
Build the artifacts for the given model configurations.
- Parameters:
save_dir – The path to save the artifacts to. With the artifacts, you can create an LLM instance without quantizing or compiling the model again.
num_pipeline_builder_workers – The number of workers used for building pipelines (excluding compilation). The default is 1 (no parallelism). Setting this value larger than 1 reduces pipeline building time, especially for large models, but requires much more memory.
num_compile_workers – The number of workers used for compilation. The default is 1 (no parallelism).
cache_dir – The cache directory for all files generated for this LLM instance. When None, caching is disabled. The default is “$HOME/.cache/furiosa/llm”.
param_file_path – The path to the parameter file to use for pipeline generation. If not specified, the parameters will be saved in a temporary file, which will be deleted when the LLM is destroyed.
param_saved_format – The format of the parameter file. The only supported value is currently “safetensors”. The default is “safetensors”.
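Example (continuing the sketch above): building and saving the artifacts. The output directory and worker counts are illustrative:

    # Raising the worker counts trades memory for shorter build time.
    builder.build(
        "./artifacts/llama3.1-8b",
        num_pipeline_builder_workers=2,
        num_compile_workers=2,
    )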
Artifact#
- class furiosa_llm.artifact.Artifact(*, metadata: ArtifactMetadata, devices: str, generator_config: GeneratorConfig, hf_config: Dict[str, Any], model_metadata: ModelMetadata, model_rewriting_config: ModelRewritingConfig, parallel_config: ParallelConfig, pipelines: List[Dict[str, Any]] = [], pipeline_metadata_list: List[PipelineMetadata] | None = None, max_prompt_len: int | None = None)[source]#
Bases: BaseModel
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'devices': FieldInfo(annotation=str, required=True), 'generator_config': FieldInfo(annotation=GeneratorConfig, required=True), 'hf_config': FieldInfo(annotation=Dict[str, Any], required=True), 'max_prompt_len': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'metadata': FieldInfo(annotation=ArtifactMetadata, required=True), 'model_metadata': FieldInfo(annotation=ModelMetadata, required=True), 'model_rewriting_config': FieldInfo(annotation=ModelRewritingConfig, required=True), 'parallel_config': FieldInfo(annotation=ParallelConfig, required=True), 'pipeline_metadata_list': FieldInfo(annotation=Union[List[PipelineMetadata], NoneType], required=False, default=None), 'pipelines': FieldInfo(annotation=List[Dict[str, Any]], required=False, default=[])}#
Metadata about the fields defined on the model, mapping field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
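Since Artifact is a Pydantic model, it can be inspected and round-tripped with standard Pydantic V2 APIs. A sketch, assuming a serialized artifact exists at the hypothetical path “artifact.json”:

    from furiosa_llm.artifact import Artifact

    # Deserialize with the standard Pydantic V2 API; the file path is hypothetical.
    with open("artifact.json") as f:
        artifact = Artifact.model_validate_json(f.read())

    print(artifact.metadata.name)      # artifact name
    print(artifact.parallel_config)    # parallelism degrees used at build time
    print(artifact.model_dump_json())  # serialize back to JSON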
ArtifactMetadata#
- class furiosa_llm.artifact.ArtifactMetadata(*, artifact_id: str, name: str, timestamp: int, version: ArtifactVersion)[source]#
Bases: BaseModel
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'artifact_id': FieldInfo(annotation=str, required=True), 'name': FieldInfo(annotation=str, required=True), 'timestamp': FieldInfo(annotation=int, required=True), 'version': FieldInfo(annotation=ArtifactVersion, required=True)}#
Metadata about the fields defined on the model, mapping field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
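A sketch of constructing the metadata by hand, together with its nested ArtifactVersion (documented below); every field value here is made up for illustration:

    import time
    import uuid

    from furiosa_llm.artifact import ArtifactMetadata, ArtifactVersion

    metadata = ArtifactMetadata(
        artifact_id=str(uuid.uuid4()),  # illustrative unique id
        name="my-artifact",
        timestamp=int(time.time()),     # Unix timestamp, as an int
        version=ArtifactVersion(furiosa_llm="0.0.0", furiosa_compiler="0.0.0"),
    )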
ArtifactVersion#
- class furiosa_llm.artifact.ArtifactVersion(*, furiosa_llm: str, furiosa_compiler: str)[source]#
Bases: BaseModel
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'furiosa_compiler': FieldInfo(annotation=str, required=True), 'furiosa_llm': FieldInfo(annotation=str, required=True)}#
Metadata about the fields defined on the model, mapping field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.