concepts.benchmark.common.vocab.Vocab

class Vocab

Bases: object

A simple vocabulary class.

Methods

add(word)

Add a word to the vocabulary.

add_word(word)

Add a word to the vocabulary.

check_json_consistency(json_file)

Check whether the vocabulary is consistent with a json file.

dump_json(json_file)

Dump the vocabulary to a json file.

from_dataset(dataset, keys[, extra_words, ...])

Generate a vocabulary from a dataset.

from_json(json_file)

Load a vocabulary from a json file.

from_list(dataset[, extra_words, single_word])

Generate a vocabulary from a list of strings.

invmap_sequence(sequence[, proc_be])

Map a sequence of indices to a sequence of words.

map(word)

Map a word to its index.

map_fields(feed_dict, fields)

Map the content in a specified set of fields in a dictionary to indices.

map_sequence(sequence[, add_be])

Map a sequence of words to a sequence of indices.

words()

Return the words in the vocabulary.

Attributes

idx2word

A dictionary mapping indices to words.

__init__(word2idx=None)

Initialize the vocabulary.

Parameters:

word2idx – a dictionary mapping words to indices. If not specified, the vocabulary will be empty.

__iter__()

Return an iterator over the words in the vocabulary.

Return type:

Iterable[str]

__len__()

Return the size of the vocabulary.

Return type:

int

__new__(**kwargs)

add(word)

Add a word to the vocabulary. Alias of add_word().

Parameters:

word (str)

add_word(word)

Add a word to the vocabulary.

Parameters:

word (str)

check_json_consistency(json_file)

Check whether the vocabulary is consistent with a json file.

Parameters:

json_file (str)

Return type:

bool

dump_json(json_file)

Dump the vocabulary to a json file.

Parameters:

json_file (str)

classmethod from_dataset(dataset, keys, extra_words=None, single_word=False)

Generate a vocabulary from a dataset.

Parameters:
  • dataset – the dataset to generate the vocabulary from.

  • keys (Sequence[str]) – the keys to retrieve from the dataset items.

  • extra_words (Sequence[str] | None) – additional words to add to the vocabulary.

  • single_word (bool) – whether to treat the values of the keys as single words.

Return type:

Vocab
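As a rough illustration of what building a vocabulary from a dataset involves, here is a standalone sketch that does not use the library. It assumes the dataset is a list of dictionaries with string values and that multi-word values are split on whitespace. The helper `build_word2idx` and the token string `"<UNK>"` are hypothetical names chosen for the example, and the index order the real class assigns may differ.

```python
# Standalone sketch; build_word2idx is a hypothetical helper, not the library API.
def build_word2idx(dataset, keys, extra_words=None, single_word=False):
    """Collect the words found under `keys` and assign each one an index."""
    words = set()
    for item in dataset:
        for key in keys:
            value = item[key]
            if single_word:
                # Treat the whole value as one vocabulary entry.
                words.add(value)
            else:
                # Assumption: multi-word values are whitespace-separated.
                words.update(value.split())
    # Sort for a deterministic index assignment.
    word2idx = {w: i for i, w in enumerate(sorted(words))}
    for w in extra_words or []:
        if w not in word2idx:
            word2idx[w] = len(word2idx)
    return word2idx


dataset = [
    {"question": "is the cube red", "answer": "yes"},
    {"question": "is the sphere blue", "answer": "no"},
]

question_vocab = build_word2idx(dataset, ["question"], extra_words=["<UNK>"])
answer_vocab = build_word2idx(dataset, ["answer"], single_word=True)
```

In this sketch, `single_word=True` suits label-like fields such as `answer`, where each value should stay a single entry even if it contains spaces.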

classmethod from_json(json_file)

Load a vocabulary from a json file.

Parameters:

json_file (str)

Return type:

Vocab
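The round trip between dump_json() and from_json() can be pictured with the standard `json` module. This sketch assumes the file stores the word-to-index mapping as a flat JSON object. The on-disk layout used by the class is not specified here, so treat the format as an assumption.

```python
import json
import os
import tempfile

# Hypothetical vocabulary contents; "<UNK>" is an illustrative token string.
word2idx = {"<UNK>": 0, "red": 1, "cube": 2}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "vocab.json")

    # Write the mapping to disk (the role dump_json plays).
    with open(path, "w") as f:
        json.dump(word2idx, f)

    # Read it back (the role from_json plays).
    with open(path) as f:
        restored = json.load(f)
```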

classmethod from_list(dataset, extra_words=None, single_word=False)

Generate a vocabulary from a list of strings.

Parameters:
  • dataset (list) – the list of strings to generate the vocabulary from.

  • extra_words (Sequence[str] | None) – additional words to add to the vocabulary.

  • single_word (bool) – whether to treat each string in the list as a single word.

Return type:

Vocab

invmap_sequence(sequence, proc_be=False)

Map a sequence of indices to a sequence of words. If proc_be is True, the begin-of-sentence and end-of-sentence tokens are removed from the result.

Parameters:
  • sequence (Sequence[int] | Tensor) – the sequence of indices to map.

  • proc_be (bool) – whether to remove the begin-of-sentence and end-of-sentence tokens from the sequence.

Returns:

a list of words.

Return type:

List[str]
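The effect of proc_be can be sketched without the library. The function below is a hypothetical stand-in: the token strings `"<BOS>"` and `"<EOS>"` are assumptions, and it only strips a leading begin token and a trailing end token, which may differ from how the class handles tokens that appear elsewhere in the sequence.

```python
# Standalone sketch; token strings and function are illustrative assumptions.
BOS, EOS = "<BOS>", "<EOS>"

idx2word = {0: BOS, 1: EOS, 2: "the", 3: "red", 4: "cube"}


def invmap_sequence(idx2word, sequence, proc_be=False):
    """Map indices back to words, optionally dropping the BOS/EOS markers."""
    words = [idx2word[int(i)] for i in sequence]
    if proc_be:
        if words and words[0] == BOS:
            words = words[1:]
        if words and words[-1] == EOS:
            words = words[:-1]
    return words


raw = invmap_sequence(idx2word, [0, 2, 3, 4, 1])
clean = invmap_sequence(idx2word, [0, 2, 3, 4, 1], proc_be=True)
```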

map(word)

Map a word to its index. If the word is not in the vocabulary, return the index of the unknown token.

Parameters:

word (str)

Return type:

int
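The fallback to the unknown token can be sketched with a plain dictionary. This is a standalone illustration rather than the library's implementation: the function `map_word` and the token string `"<UNK>"` are assumptions made for the example.

```python
# Standalone sketch; UNK and map_word are hypothetical names.
UNK = "<UNK>"

word2idx = {UNK: 0, "red": 1, "cube": 2}


def map_word(word2idx, word):
    """Return the index of `word`, or the index of UNK if it is missing."""
    return word2idx.get(word, word2idx[UNK])


known = map_word(word2idx, "cube")      # in-vocabulary word
unknown = map_word(word2idx, "sphere")  # out-of-vocabulary word
```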

map_fields(feed_dict, fields)

Map the contents of the specified fields of a dictionary to indices. The argument fields lists the keys of the dictionary to map. The dictionary is modified in place.

Parameters:
  • feed_dict (dict) – the dictionary of fields to map.

  • fields (Sequence[str]) – the list of keys to map.

Returns:

a dictionary of mapped fields.

Return type:

dict
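A minimal sketch of in-place field mapping, written without the library, might look like the following. It assumes each mapped field holds a list of words and that unknown words fall back to a hypothetical `"<UNK>"` token. The value types the real method accepts are not stated here.

```python
# Standalone sketch; map_fields here is a hypothetical free function.
UNK = "<UNK>"

word2idx = {UNK: 0, "is": 1, "the": 2, "cube": 3, "red": 4, "yes": 5}


def map_fields(word2idx, feed_dict, fields):
    """Replace the word lists under `fields` with index lists, in place."""
    for key in fields:
        feed_dict[key] = [word2idx.get(w, word2idx[UNK]) for w in feed_dict[key]]
    return feed_dict


feed_dict = {
    "question": ["is", "the", "cube", "red"],
    "answer": ["yes"],
    "image_id": 17,  # not listed in `fields`, so left untouched
}
result = map_fields(word2idx, feed_dict, ["question", "answer"])
```

Because the dictionary is mutated, callers that still need the original words should copy the dictionary before mapping it.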

map_sequence(sequence, add_be=False)

Map a sequence of words to a sequence of indices. If add_be is True, the begin-of-sentence and end-of-sentence tokens are added to the result.

Parameters:
  • sequence (Sequence[str]) – the sequence of words to map.

  • add_be (bool) – whether to add the begin-of-sentence and end-of-sentence tokens to the sequence.

Returns:

a list of indices.

Return type:

List[int]
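The effect of add_be can be illustrated with a small standalone function. The token strings `"<BOS>"`, `"<EOS>"`, and `"<UNK>"` are assumptions for the example, since the special tokens the class uses are not named here.

```python
# Standalone sketch; token strings are hypothetical.
BOS, EOS, UNK = "<BOS>", "<EOS>", "<UNK>"

word2idx = {BOS: 0, EOS: 1, UNK: 2, "the": 3, "red": 4, "cube": 5}


def map_sequence(word2idx, sequence, add_be=False):
    """Map words to indices, optionally wrapping the result in BOS/EOS."""
    indices = [word2idx.get(w, word2idx[UNK]) for w in sequence]
    if add_be:
        indices = [word2idx[BOS]] + indices + [word2idx[EOS]]
    return indices


plain = map_sequence(word2idx, ["the", "red", "cube"])
wrapped = map_sequence(word2idx, ["the", "red", "cube"], add_be=True)
```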

words()

Return the words in the vocabulary.

Return type:

Iterable[str]

property idx2word: dict

A dictionary mapping indices to words. This property is computed lazily and is recomputed automatically whenever the size of the vocabulary changes.
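One way to picture a length-checked lazy property is the sketch below. The class `LazyInverse` is a hypothetical stand-in written for illustration, and the caching details of the real class may differ.

```python
# Standalone sketch; LazyInverse is a hypothetical stand-in for the real class.
class LazyInverse:
    """Holds a word-to-index mapping and a cached inverse mapping."""

    def __init__(self, word2idx=None):
        self.word2idx = dict(word2idx) if word2idx is not None else {}
        self._idx2word = None

    @property
    def idx2word(self):
        # Rebuild the cache only when it is missing or its size is stale.
        if self._idx2word is None or len(self._idx2word) != len(self.word2idx):
            self._idx2word = {i: w for w, i in self.word2idx.items()}
        return self._idx2word


vocab = LazyInverse({"red": 0})
first = vocab.idx2word
second = vocab.idx2word        # same cached object, nothing recomputed
vocab.word2idx["cube"] = 1     # the size changes
third = vocab.idx2word         # the cache is rebuilt
```

Under a scheme like this, an edit that leaves the size unchanged (for example, reassigning the index of an existing word) would not invalidate the cache, so such edits are best avoided.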