LIBPA also provides data structures to efficiently accumulate the
statistics of arbitrary symbols and strings: unigram_t
and
suffix_array_t
, respectively.
The unigram_t
module provides a space-efficient representation
for the frequencies of symbols drawn from an arbitrary alphabet. Our
unigram_t
implementation uses dynamic counter resizing to
minimize the amount of storage required to store the symbol frequencies.
Thus, it may be used to efficiently store the state transition
frequencies in a very large Markov model.
The suffix_array_t
module provides a compact representation for
the frequencies of all fixed-length substrings of a given string.
Strings are represented using only sizeof(symbol_t)
+
sizeof(unsigned)
bytes per symbol in the strings. The data
structure may be constructed in O(n log n) time and subsequent lookups
require O(log n) time using a binary search.
Go to the first, previous, next, last section, table of contents.