Go to the first, previous, next, last section, table of contents.


3.3 Statistics

LIBPA also provides data structures to efficiently accumulate the statistics of arbitrary symbols and strings: unigram_t and suffix_array_t, respectively.

The unigram_t module provides a space-efficient representation for the frequencies of symbols drawn from an arbitrary alphabet. Our unigram_t implementation uses dynamic counter resizing to minimize the amount of storage required to store the symbol frequencies. Thus, it may be used to efficiently store the state transition frequencies in a very large Markov model.

The suffix_array_t module provides a compact representation for the frequencies of all fixed-length substrings of a given string. Strings are represented using only sizeof(symbol_t) + sizeof(unsigned) bytes per symbol in the strings. The data structure may be constructed in O(n log n) time and subsequent lookups require O(log n) time using a binary search.


Go to the first, previous, next, last section, table of contents.