SamplingParams class#

class furiosa_llm.SamplingParams(*, n: int = 1, best_of: int | None = None, repetition_penalty: float = 1.0, temperature: float = 1.0, top_p: float = 1.0, top_k: int = -1, min_p: float = 0.0, use_beam_search: bool = False, length_penalty: float = 1.0, early_stopping: bool | str = False, max_tokens: int | None = 16, min_tokens: int = 0, output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE)[source]#

Bases: object

Sampling parameters for text generation.

Parameters:
  • n – Number of output sequences to return for the given prompt.

  • best_of – Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n.

  • repetition_penalty – Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.

  • temperature – Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.

  • top_p – Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

  • top_k – Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.

  • min_p – Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

  • use_beam_search – Whether to use beam search instead of sampling.

  • length_penalty – Float that penalizes sequences based on their length. Used in beam search.

  • early_stopping – Controls the stopping condition for beam search. It accepts the following values: True, where generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and generation stops when it is very unlikely that better candidates will be found; “never”, where the beam search procedure stops only when there cannot be better candidates (the canonical beam search algorithm).

  • max_tokens – Maximum number of tokens to generate per output sequence. If None, it is capped at the model’s maximum sequence length.

  • min_tokens – Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated.
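
As a quick orientation before the detailed examples below, the following sketch constructs two parameter sets and passes one to a generation call. The LLM constructor, the model identifier, and the shape of the returned objects are assumptions made for illustration (they follow the vLLM-style convention this API resembles); only SamplingParams itself is documented on this page.

from furiosa_llm import LLM, SamplingParams

# Deterministic decoding: temperature=0 means greedy sampling.
greedy = SamplingParams(temperature=0.0, max_tokens=64)

# Moderately random sampling with nucleus (top_p) filtering.
varied = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM("path-or-id-of-a-prepared-model")        # hypothetical model handle
outputs = llm.generate(["The capital of France is"], greedy)
print(outputs[0].outputs[0].text)                  # assumed output shape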

Examples#

This section provides examples of how to use the token generation methods available in the SDK.


1. Random Sampling with top_p / top_k Parameters#

SamplingParams(min_tokens=10, max_tokens=100, top_p=0.3, top_k=100)

This method generates tokens by random sampling from a filtered distribution, allowing for diverse outputs; a sketch of the filtering step follows the lists below.

  • Parameters:

    • min_tokens: Minimum number of tokens to generate.

    • max_tokens: Maximum number of tokens to generate.

    • top_p: Cumulative probability for nucleus sampling.

    • top_k: Number of highest probability tokens to consider.

  • Behavior:

    • Each generation may yield different results, even with the same input text and parameters, because tokens are drawn at random.

    • Generation may terminate before reaching max_tokens if an End Of Sequence (EOS) token is generated.

    • The EOS token will not be generated before reaching the specified min_tokens.
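
To make the interaction of top_k and top_p concrete, here is an illustrative NumPy sketch of the filtering step. It is not the library’s implementation; it only mirrors the semantics documented above: keep the top_k most likely tokens, then keep the smallest prefix of them whose cumulative probability reaches top_p.

import numpy as np

def filter_distribution(logits: np.ndarray, top_k: int = -1, top_p: float = 1.0) -> np.ndarray:
    """Return a renormalized token distribution after top_k/top_p filtering."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                 # token ids, most likely first
    if top_k > 0:
        order = order[:top_k]                       # top_k = -1 considers all tokens
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1   # smallest prefix reaching top_p
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

rng = np.random.default_rng()
logits = rng.normal(size=32)                        # toy vocabulary of 32 tokens
dist = filter_distribution(logits, top_k=100, top_p=0.3)
token = rng.choice(len(dist), p=dist)               # repeated runs may pick different tokens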


2. Beam Search with best_of Beams#

SamplingParams(min_tokens=10, max_tokens=100, use_beam_search=True, best_of=4)

Beam search explores multiple candidate sequences simultaneously and returns the highest-scoring ones; a construction sketch follows the lists below.

  • Parameters:

    • min_tokens: Minimum number of tokens to generate.

    • max_tokens: Maximum number of tokens to generate.

    • use_beam_search: Must be set to True to enable beam search.

    • best_of: Number of beams to consider for generating the best output.

  • Behavior:

    • The generation process explores multiple possible sequences to determine the best output.

    • Generation may terminate before reaching max_tokens once best_of beams have each produced an End Of Sequence (EOS) token.

    • The EOS token will not be generated before reaching the specified min_tokens.
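
A construction sketch for this configuration follows. The commented-out generate() call reuses the same assumed entry point as the earlier sketch and is not part of this class’s documented API.

from furiosa_llm import SamplingParams

# Beam search over 4 beams; the top n finished candidates are returned.
# best_of acts as the beam width here and must be >= n (n defaults to 1).
params = SamplingParams(
    n=2,
    best_of=4,
    use_beam_search=True,
    early_stopping=True,    # stop once best_of complete candidates exist
    length_penalty=1.0,     # length penalty applied when ranking beams
    min_tokens=10,
    max_tokens=100,
)
# outputs = llm.generate("Explain beam search in one sentence.", params)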