concepts.benchmark.common.vocab.gen_vocab

gen_vocab(dataset, keys=None, extra_words=None, cls=None, single_word=False)

Generate a Vocabulary instance from a dataset.

By default, this function retrieves each item by calling the dataset's get_metainfo function. If the dataset does not define get_metainfo, it falls back to dataset[i].

When keys is specified, get_metainfo should return a dictionary that contains those keys. This function splits the string stored under each key and adds the resulting tokens to the vocabulary. If keys is not specified, this function assumes that get_metainfo returns a string.
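
The retrieval and splitting rules above can be sketched in plain Python. This is an illustration only, not the library's implementation: iter_tokens and ToyDataset are hypothetical names, and calling get_metainfo with an integer index is an assumption that the text does not state.

```python
from typing import Any, Iterable, Iterator, Optional, Sequence


def iter_tokens(dataset: Sequence[Any], keys: Optional[Iterable[str]] = None) -> Iterator[str]:
    """Hypothetical helper: yield the tokens that would be added to the vocabulary."""
    key_list = None if keys is None else list(keys)
    for i in range(len(dataset)):
        # Prefer get_metainfo when the dataset defines it (the index argument is an assumption).
        if hasattr(dataset, "get_metainfo"):
            item = dataset.get_metainfo(i)
        else:
            item = dataset[i]
        # Without keys the item is a string; with keys it is a dictionary of strings.
        texts = [item] if key_list is None else [item[k] for k in key_list]
        for text in texts:
            yield from text.split()


class ToyDataset:
    """Hypothetical dataset whose get_metainfo returns a dictionary."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        return self.records[i]

    def get_metainfo(self, i):
        return self.records[i]


# A plain list has no get_metainfo, so items are read with dataset[i].
plain_tokens = list(iter_tokens(["a red cube", "a blue ball"]))

# With keys, only the strings stored under those keys are split.
keyed_tokens = list(iter_tokens(
    ToyDataset([{"question": "is it red", "answer": "yes"}]),
    keys=["question", "answer"],
))
```

In this sketch, a list of strings and a dataset object go through the same loop; only the way each item is fetched differs.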

By default, this function adds four special tokens: EBD_PAD, EBD_BOS, EBD_EOS, and EBD_UNK. Users can add further tokens with the extra_words argument.
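
As a rough picture of how the special tokens and extra_words end up next to the dataset tokens, the sketch below builds a plain word-to-index dictionary. The placeholder strings, the function name build_word2idx, and the ordering of entries are all assumptions made for illustration; the actual values of EBD_PAD, EBD_BOS, EBD_EOS, and EBD_UNK and the layout of the Vocabulary class are defined by the library.

```python
from typing import Dict, Iterable, Optional

# Placeholder values for illustration; the library defines the real constants.
EBD_PAD, EBD_BOS, EBD_EOS, EBD_UNK = "<pad>", "<bos>", "<eos>", "<unk>"


def build_word2idx(tokens: Iterable[str], extra_words: Optional[Iterable[str]] = None) -> Dict[str, int]:
    """Hypothetical helper: index special tokens, then dataset tokens, then extra words."""
    word2idx: Dict[str, int] = {}
    # The ordering below is an assumption, not the library's documented behavior.
    ordered = [EBD_PAD, EBD_BOS, EBD_EOS, EBD_UNK, *sorted(set(tokens)), *(extra_words or [])]
    for word in ordered:
        # setdefault keeps the first index assigned to a repeated word.
        word2idx.setdefault(word, len(word2idx))
    return word2idx


word2idx = build_word2idx(["red", "cube", "red"], extra_words=["<mask>"])
```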

Parameters:
  • dataset (Sequence) – the dataset to generate the vocabulary from. It can be a list of strings or a dataset instance.

  • keys (Iterable[str] | None) – the keys to retrieve from the dataset items. If not specified, the dataset is assumed to be a list of strings.

  • extra_words (Iterable[str] | None) – additional words to add to the vocabulary.

  • cls (type) – the class of the Vocabulary instance to generate.

  • single_word (bool) – whether to treat each entry in the dataset as a single word. Defaults to False. When set to False, each entry should be either a list of strings or a single string (in which case it is split by spaces).
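
The effect of single_word on one dataset entry can be sketched as follows. The helper entry_to_tokens is hypothetical and only mirrors the parameter description above; it is not part of the library.

```python
from typing import List, Sequence, Union


def entry_to_tokens(entry: Union[str, Sequence[str]], single_word: bool = False) -> List[str]:
    """Hypothetical helper mirroring the single_word description."""
    if single_word:
        # The whole entry counts as one vocabulary word, spaces included.
        return [entry]
    if isinstance(entry, str):
        # A single string is split by spaces.
        return entry.split(" ")
    # Otherwise the entry is already a list of strings.
    return list(entry)


split_tokens = entry_to_tokens("traffic light")
whole_token = entry_to_tokens("traffic light", single_word=True)
list_tokens = entry_to_tokens(["traffic", "light"])
```

Under these assumptions, single_word=True is the setting to reach for when entries are labels such as "traffic light" that should stay intact rather than become two vocabulary words.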