Copyright (C) 1998 by Eric Sven Ristad and Peter N. Yianilos. All Rights Reserved.
Permission to use, copy, and distribute this software and its documentation without fee for not-for-profit academic or research purposes is hereby granted, provided that the above copyright notice appears in all copies, and that both the copyright notice and this permission notice and warranty disclaimer appear in supporting documentation, and that use of this software for any purpose is explicitly acknowledged and cited in all relevant reports and publications via the following citation form:
Eric Sven Ristad and Peter N. Yianilos. Library of practical abstractions, release 1.3. http://www.mnemonic.com/software/libpa February 1998.
Permission to modify the software and documentation is hereby granted, but only to the extent necessary to cause the software to function in a hardware or software environment not explicitly supported as part of the distribution, and subject to the condition that the original authors retain all legal rights to any modifications or improvements of this software.
The authors disclaim all warranties with regard to this software, including all implied warranties of merchantability and fitness. In no event shall the authors or any entities or institutions to which they are related be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.
The library of practical abstractions (LIBPA) provides efficient implementations of conceptually simple abstractions, in the C programming language. We believe that the best library code is conceptually simple so that it will be easily understood by the application programmer; parameterized by type so that it enjoys wide applicability; and at least as efficient as a straightforward special-purpose implementation. You will find that our software satisfies the highest standards of software design, implementation, testing, and benchmarking.
The current LIBPA release is a source code distribution only. It consists of modules for portable memory management, one dimensional arrays of arbitrary types, compact symbol tables, hash tables for arbitrary types, a trie module for length-delimited strings over arbitrary alphabets, single precision floating point numbers with extended exponents, and logarithmic representations of probability values using either fixed or floating point numbers. Additional modules are provided to efficiently store the frequencies of symbols and fixed-length substrings.
We have used LIBPA to implement a wide range of statistical models for both continuous and discrete domains. The time and space efficiency of LIBPA has allowed us to build larger statistical models than previously possible, and to investigate more computationally-intensive techniques. We have found LIBPA to be indispensable in our own research, and hope that you will find it useful in yours. If you find LIBPA useful, please let us know!
LIBPA is not guaranteed to be backward compatible. We strive to provide the best libraries for each release, and do not hesitate to completely redesign a module to improve clarity or performance. As a result, upgrading your LIBPA installation to the latest release may require you to rewrite portions of your code. We will not change the semantics of a function without also changing the prototype for that function. This policy should help you quickly upgrade your software using the latest LIBPA release.
Every module has its own license, written by the module's authors. Before you install or use a LIBPA module, you must agree to the terms of that module's license. Typically, authors will ask that you use LIBPA for not-for-profit research purposes only, and that you follow standard academic practice in acknowledging your use of LIBPA in any relevant reports or publications.
The source code may be perused online from http://www.mnemonic.com/software/libpa/src and the entire distribution may be obtained from http://www.mnemonic.com/software/libpa. Please use GNU tar -zxf to unpack the `.tar.gz' file.
LIBPA releases are numbered according to the following convention. Each release has a number x.y.z. A change in the first number x indicates a major redesign of LIBPA and its central memory management modules. A change in the second number y indicates a change in functionality, either a module revision or the addition of new modules. A change in the third number z indicates support for a new operating environment, changes to the installation routines, or minor bug fixes. Once you have successfully installed LIBPA, you need only download a new release when the first or second number changes.
Once you have ftped and unpacked the distribution, you must initialize the following environment variables.
LIBPA_GROUP
LIBPA_ARCH
SCC
SLD
UCC
ULD
RANLIB
MAKE
The LIBPA_GROUP environment variable should be set to the name of the group which will have read/write permission in the LIBPA installation. We source the `libpa-1.3.0/etc/cshrc' or `libpa-1.3.0/etc/bashrc' scripts on our system to set these environment variables properly. If you want to source these scripts on your system, you will need to edit them first. At the very least, you will need to change the LIBPA_GROUP value. To verify that the LIBPA environment variables are set, you may execute the `libpa-1.3.0/bin/checkenv.sh' shell script.
Unless it is the only compiler available, we do not recommend using gcc to compile the modules because gcc has a proprietary implementation of the assert() macro. Once you compile a module using gcc, all executables based on that module must link `libgcc.a'. This is particularly annoying if you are trying to track down a compiler bug.
After you have set the LIBPA environment variables, run make install from `libpa-1.3.0/src' to install the libraries, and then run make tests from `libpa-1.3.0/src' to verify our implementations in your operating environment. When you are satisfied with the outcome of these tests, run make clean in the same directory to delete extraneous object code files and archives in the source directories. You must run make clean if you want to install LIBPA on multiple architectures.
As explained below, we provide safe and unsafe versions of all object code. The safe versions contain unoptimized object code with symbol tables for debugging purposes. The unsafe versions contain optimized object code without symbol tables. We recommend using safe object code on a routine basis, and only using unsafe object code when performance matters. The `libpa.a' and `libpa_u.a' archives contain the safe and unsafe versions of all object code, respectively.
All modules are supported on the following operating environments:
Most modules will work in any UNIX operating environment, and should work in any environment with an ANSI-C compiler.
We encourage you to subscribe to libpa-announce@cs.princeton.edu, to learn about upgrades and new modules as they are published. Send mail to majordomo@cs.princeton.edu with the single line:
subscribe libpa-announce
If you are having trouble installing or using LIBPA, or have other questions about the library of practical abstractions, send mail to libpa-help@cs.princeton.edu. If you are a skilled user of LIBPA, then we encourage you to subscribe to this majordomo mailing list and to help field the questions that are posted.
We are unable to track down bugs in your code or in your operating environment (compiler, operating system). We have spent a significant amount of time and care writing the LIBPA modules, and have published the source code and test suites. So if you think you have found a bug in our code, please help us by preparing a thoroughly tested bug fix. When you have finished testing your bug fix, we would be grateful if you would send it to libpa-bugs@cs.princeton.edu along with a detailed explanation of the bug.
Eric Sven Ristad ristad@mnemonic.com is with Mnemonic Technology, Inc.
Peter N. Yianilos pny@research.nj.nec.com is with the NEC Research Institute, 4 Independence Way, Princeton, NJ 08540.
We are grateful to David Hanson drh@microsoft.com for his feedback, and for providing us with so many well-taught students while he was with us at Princeton. Robert G. Thomas rgt@cs.princeton.edu provided invaluable input in the design and redesign and reredesign of the LIBPA modules. Kirk Kanzelberger kirk@research.nj.nec.com assisted with the initial implementation of the vector_t module in Summer 1993.
Documentation was created in GNU Texinfo format, from which the Info and PostScript formats were derived directly. The HTML format was derived using texi2html by Lionel Cons.
Currently we publish modules for basic memory management, tables of arbitrary types, and extremal numerical values:
memory_t
vector_t
table_t
trie_t
hash_table_t
symbol_table_t
string_table_t
unigram_t
balanced_t
L_t
logpr_t
pr_t
The memory_t module is the foundation of our library. All memory management is performed in a portable manner using the memory_t module, which helps identify memory leaks without using expensive and nonportable third party applications, such as purify. The memory_create() function replaces malloc(), while the memory_destroy() function replaces free(). Other functions are provided to replace the standard C memory functions.
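As a rough illustration, here is a minimal usage sketch. The header name and the exact prototypes of memory_create() and memory_destroy() are assumptions (we assume malloc()-style and free()-style signatures); consult the module interface for the actual declarations.

  #include <assert.h>
  #include "memory.h"   /* assumed interface file for the memory_t module */

  void sketch(void)
  {
    double *samples;

    /* Assumption: memory_create() takes a byte count like malloc(), and
       memory_destroy() takes the block address like free().  Check the
       actual prototypes in the module interface before copying this. */
    samples = (double *) memory_create(100 * sizeof(double));

    /* ... use samples ... */

    memory_destroy(samples);

    /* The leak check required at the end of every LIBPA test suite. */
    assert(memory_total_bytes() == 0);
    assert(memory_total_blocks() == 0);
  }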
The vector_t module is also widely used, and is equally indispensable. A vector_t is a length-delimited sequence of objects of arbitrary type. The objects are parameterized by the size of their representations, via the sizeof() primitive. The vector_t module supports dynamic resizing, concatenation, insertion, sorting, and portable input/output. Indeed, we often use vector_t where LISP programmers would use lists. Since vector_t objects aren't boxed and are arranged linearly in memory, vector_t operations are at least twice as fast as list operations and require significantly less storage than lists (eg., up to 16 times less storage for one byte objects on a machine with 8 byte pointers).
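The storage figure can be checked with a few lines of standard C; the list_cell layout below is only an illustrative stand-in for a LISP-style cons cell, not anything in LIBPA.

  #include <stdio.h>

  struct list_cell {
    void *object;            /* boxed one-byte element */
    struct list_cell *next;  /* link to the next cell  */
  };

  int main(void)
  {
    /* A packed vector stores one-byte objects contiguously: 1 byte each.
       A list needs an object pointer plus a next pointer per element,
       which is 16 bytes per element with 8 byte pointers. */
    printf("vector: %lu byte(s) per element\n", (unsigned long) sizeof(char));
    printf("list:   %lu bytes per element\n",
           (unsigned long) sizeof(struct list_cell));
    return 0;
  }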
LIBPA provides three ways to store arbitrary information in tables: table_t, trie_t, and hash_table_t.
The table_t module is a compact table for arbitrary <symbol,datum> pairs. Lookups are performed in O(log n) time using binary search, while insertions and deletions are O(n) for a table containing n entries. The table_t module is typically used as a building block for other modules, such as the trie_t module, where memory usage is the single most important performance desideratum.
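The lookup strategy can be illustrated with standard C; the pair layout and lookup() function below are not the table_t interface, only a sketch of binary search over a sorted array of <symbol,datum> pairs.

  #include <stdlib.h>

  struct pair { int symbol; void *datum; };

  static int compare_pairs(const void *a, const void *b)
  {
    const struct pair *pa = (const struct pair *) a;
    const struct pair *pb = (const struct pair *) b;

    return (pa->symbol > pb->symbol) - (pa->symbol < pb->symbol);
  }

  /* O(log n) lookup over an array of pairs kept sorted by symbol. */
  void *lookup(const struct pair *pairs, size_t count, int symbol)
  {
    struct pair key;
    struct pair *found;

    key.symbol = symbol;
    found = (struct pair *) bsearch(&key, pairs, count,
                                    sizeof(struct pair), compare_pairs);
    return (found != NULL) ? found->datum : NULL;
  }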
The trie_t module maps length-delimited strings over arbitrary alphabets to consecutive unsigned integers. It also provides the inverse map, from indices to length-delimited strings.
The hash_table_t module supports the easy construction of compact hash tables for arbitrary <key,datum> pairs with open addressing, double hashing, and dynamic resizing. We used the hash_table_t module to implement two useful modules: the symbol_hash_t for symbols of arbitrary size and the string_hash_t for NULL-terminated character strings. Again, we strove to minimize the space requirements of the implementation so that many hash_table_t objects can be used in even the most memory-intensive computations.
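The probing scheme is easy to sketch in a few lines of standard C; this is an illustration of open addressing with double hashing in general, not the hash_table_t interface.

  /* Slot visited on the given attempt when probing for a key.  The probe
     sequence visits every slot when table_size is prime, because the
     second hash is then nonzero and relatively prime to the table size. */
  unsigned probe_slot(unsigned key, unsigned attempt, unsigned table_size)
  {
    unsigned h1;
    unsigned h2;

    h1 = key % table_size;
    h2 = 1 + (key % (table_size - 2));   /* assumes table_size > 2 */
    return (h1 + attempt * h2) % table_size;
  }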
LIBPA also provides data structures to efficiently accumulate the statistics of arbitrary symbols and strings: unigram_t and suffix_array_t, respectively.
The unigram_t module provides a space-efficient representation for the frequencies of symbols drawn from an arbitrary alphabet. Our unigram_t implementation uses dynamic counter resizing to minimize the amount of storage required to store the symbol frequencies. Thus, it may be used to efficiently store the state transition frequencies in a very large Markov model.
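The idea behind dynamic counter resizing can be sketched as follows. This is an illustration only, not the unigram_t implementation, and real LIBPA code would allocate through memory_create() rather than malloc().

  #include <limits.h>
  #include <stdlib.h>

  struct counts {
    unsigned char *narrow;   /* 8-bit counters, used until one overflows */
    unsigned long *wide;     /* wide counters, allocated only on demand  */
    size_t alphabet_size;
  };

  /* Increment the count of one symbol, widening all counters the first
     time an 8-bit counter would overflow.  Error handling is omitted. */
  void count_symbol(struct counts *c, size_t symbol)
  {
    size_t i;

    if (c->wide == NULL && c->narrow[symbol] == UCHAR_MAX) {
      c->wide = (unsigned long *)
        malloc(c->alphabet_size * sizeof(unsigned long));
      for (i = 0; i < c->alphabet_size; i++) {
        c->wide[i] = c->narrow[i];
      }
      free(c->narrow);
      c->narrow = NULL;
    }
    if (c->wide != NULL) {
      c->wide[symbol]++;
    } else {
      c->narrow[symbol]++;
    }
  }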
The suffix_array_t module provides a compact representation for the frequencies of all fixed-length substrings of a given string. Strings are represented using only sizeof(symbol_t) + sizeof(unsigned) bytes per symbol in the strings. The data structure may be constructed in O(n log n) time and subsequent lookups require O(log n) time using a binary search.
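The underlying technique can be sketched in standard C: sort the starting positions of all length-k windows of the text, after which the occurrences of any fixed-length substring form a contiguous, binary-searchable run. This is only an illustration, not the suffix_array_t interface; the library stores an unsigned index per symbol, whereas the sketch uses a size_t.

  #include <stdlib.h>
  #include <string.h>

  static const char *Text;   /* text being indexed */
  static size_t Window;      /* fixed substring length k */

  static int compare_windows(const void *a, const void *b)
  {
    size_t i;
    size_t j;

    i = *(const size_t *) a;
    j = *(const size_t *) b;
    return memcmp(Text + i, Text + j, Window);
  }

  /* Sort the n-k+1 window start positions in O(n log n) comparisons,
     assuming k <= n.  After sorting, the frequency of any length-k
     substring is the size of its run in index[], which two binary
     searches can locate in O(log n) time. */
  void build_window_index(const char *text, size_t n, size_t k, size_t *index)
  {
    size_t i;

    Text = text;
    Window = k;
    for (i = 0; i + k <= n; i++) {
      index[i] = i;
    }
    qsort(index, n - k + 1, sizeof(size_t), compare_windows);
  }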
LIBPA provides three ways to represent extremely small numerical values, such as are commonly encountered in statistical modeling applications: balanced_t, L_t, and logpr_t. The pr_t module provides a generic macro interface to all three modules. The three modules offer a wide range of performance/accuracy tradeoffs. L_t is the fastest and requires the least memory. logpr_t is the most accurate but the slowest. balanced_t is nearly as accurate as logpr_t and can be significantly faster on systems with a slow `libm.a' math library, such as SunOS and HP-UX.
The balanced_t type represents single precision floating point numbers with extended 32-bit exponents. As a result, the balanced_t module supports floating point arithmetic for extremely small and extremely large numerical values, values which would cause overflow or underflow in double precision floating point arithmetic. Depending on your machine, balanced_t arithmetic is from 10 to 20 times slower than double precision floating point arithmetic and is only slightly less accurate than single precision floating point arithmetic. Unlike the other LIBPA numerical types, the balanced_t type can represent a large range of numbers, including very large positive numbers and negative numbers.
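The representation can be illustrated in a few lines of standard C; the struct layout and extended_multiply() below are only a sketch of the mantissa-plus-wide-exponent idea, not the balanced_t implementation.

  #include <math.h>

  struct extended {
    float mantissa;   /* kept in [0.5, 1) (or 0) by renormalization */
    long exponent;    /* separate wide (at least 32-bit) power of two */
  };

  /* Multiply two extended numbers and renormalize with frexp(), so the
     exponent accumulates in the long rather than overflowing or
     underflowing the float.  Special zero handling is omitted. */
  struct extended extended_multiply(struct extended a, struct extended b)
  {
    struct extended product;
    int shift;

    product.mantissa = (float) frexp((double) a.mantissa * (double) b.mantissa,
                                     &shift);
    product.exponent = a.exponent + b.exponent + shift;
    return product;
  }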
The L_t type represents probability values using fixed point numbers (char, short, int, and long), where larger types are capable of representing ever-smaller probability values. Depending on your machine, L_t arithmetic is from 1.2 to 1.8 times slower than double precision floating point arithmetic. It is also considerably less accurate. balanced_t arithmetic is more accurate than L_t arithmetic, but is more than five times slower and typically requires at least twice as much memory to represent each probability value.
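One common way to realize such a fixed point representation is to store a scaled negative logarithm of the probability in an integer, so that multiplying probabilities reduces to integer addition. The sketch below illustrates that idea only; the scale factor, type name, and functions are hypothetical and are not the L_t interface.

  #include <math.h>

  #define LOG_SCALE 256.0   /* hypothetical resolution of the encoding */

  typedef unsigned short fixed_pr;   /* wider integer types reach smaller p */

  /* Encode a probability 0 < p <= 1 as round(-LOG_SCALE * log(p)). */
  fixed_pr fixed_pr_from_double(double p)
  {
    return (fixed_pr) (-LOG_SCALE * log(p) + 0.5);
  }

  double fixed_pr_to_double(fixed_pr x)
  {
    return exp(-(double) x / LOG_SCALE);
  }

  /* Multiplying probabilities is integer addition of the codes.  A real
     implementation must saturate instead of letting the sum wrap. */
  fixed_pr fixed_pr_multiply(fixed_pr a, fixed_pr b)
  {
    return (fixed_pr) (a + b);
  }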
Finally, we also provide a standard logpr_t type for the logarithmic representation of probability values using double precision floating point numbers. Although this is the most accurate representation of probability values, it is also the slowest. Depending on the math libraries provided by your operating system, logpr_t arithmetic can be 13 to 60 times slower than double precision floating point arithmetic.
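The arithmetic behind a logarithmic representation, and the reason it leans so heavily on `libm.a', can be seen in a short sketch. The function names below are illustrative and are not the logpr_t interface.

  #include <math.h>

  /* Probabilities are stored as natural logarithms, so multiplication
     is a single addition... */
  double log_space_multiply(double log_a, double log_b)
  {
    return log_a + log_b;
  }

  /* ...but addition needs exp() and log(), which is why logarithmic
     arithmetic is so much slower than ordinary double arithmetic.
     Working relative to the larger operand keeps the result stable. */
  double log_space_add(double log_a, double log_b)
  {
    double hi;
    double lo;

    hi = (log_a > log_b) ? log_a : log_b;
    lo = (log_a > log_b) ? log_b : log_a;
    return hi + log(1.0 + exp(lo - hi));
  }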
The following table reports timing results of the `pr.bench' benchmark on five different machines. All numbers are user times in seconds on an unloaded machine, as reported by /usr/bin/time or /usr/bin/timex (lower is better).

                         double    logpr_t   balanced_t      L_t
                         ------    -------   ----------    -----
  AlphaStation 500/500    121.8     1619.8       1292.7    207.2
  Sun Ultra2 1200         301.2    10166.4       5037.2    541.9
  SGI O2 180mhz           371.4     8570.8       4830.1    546.6
  HP 9000/755             369.2    19643.6       7497.1    638.0
  Dell GXM 5166           805.5    12010.6       5118.6    933.3
The library of practical abstractions is organized as follows:
This allows you to compile with -I`libpa-1.3.0/include' and link with -L`libpa-1.3.0/lib.ARCH'. You may also want to put `libpa-1.3.0/bin' and `libpa-1.3.0/bin/ARCH' on your PATH environment variable.
A LIBPA module named module is implemented in the following files:
The implementation may include other files, named according to a similar naming convention, such as:
Special module functions may be given a separate interface. For example, the memory_t library includes a `memory.spartan.h' header file for the special-purpose spartan memory_t functions. As explained below, the object code for module is placed in an archive named `libmodule.a' in the `libpa-1.3.0/lib.ARCH' directory.
Comprehensive test suites are provided for all code. A test suite is a certificate of correctness. It must convince an aggressive skeptic of the correctness of the implementation. The best way to establish correctness is to verify the behavior of the implementation against an independent implementation. The independent implementation should be so simple that it is plausibly correct. Even the weakest test suite must exercise all functions for a wide range of inputs.
A test suite should run to completion in a reasonable amount of time. It should describe the tests being performed as they are executed, but it should not display too much information, rarely more than a page. It should end by announcing the successful completion of the test, or by dumping core via an abort() call or an assert() failure.
A benchmark is a comparative performance analysis. The goal of a benchmark is to reveal the performance tradeoffs made in your implementation. Accordingly, the ideal benchmark compares the actual time/space requirements of your implementation with a range of alternative implementations in a realistic application. In practice, a benchmark should compare your implementation to at least one alternative implementation in a realistic sequence of function calls.
The library of practical abstractions provides two versions of all object code: safe and unsafe. Safe object code is intended for everyday use, and we recommend using it routinely. Safe object code has been compiled with no optimizations, with all assert() macros, and with full symbol tables for debugging. Safe libraries are named `libmodule.a'. The `libpa.a' archive includes all safe object code in the current release.
Unsafe object code is intended for more demanding applications. Unsafe object code has been compiled with maximum optimizations, the NDEBUG macro defined, and without any symbol tables. We discourage the routine use of unsafe object code, which is why we gave it such a frightful name. Under no conditions should unsafe object code be used during debugging, because optimizations are a major source of compiler errors and erroneous debugger information as well. Unsafe libraries are named `libmodule_u.a'. The `libpa_u.a' archive includes all unsafe object code in the current release.
Our Makefiles include the targets "install", "uninstall", "tests", and "clean". Use `libpa-1.3.0/src/Makefile' to perform these operations on all modules.
The central goal of module implementation is correctness. Performance is important, but secondary to correctness. By correctness, we mean that all specified functions are implemented strictly according to the module specification, without any memory leaks or segmentation faults. Aside from trivial cases, there will be no proof of correctness, only degrees of belief in correctness. And so the astute reader has already realized that "correctness" really means "belief in correctness".
Our belief in the correctness of an implementation is principally influenced by clarity of the implementation. It is much easier to develop confidence in code that is easy to understand than in code that is difficult to understand. Clarity is determined by degree of conformance to a style guide, appropriate use of abstractions, appropriate symbol names (that obey reasonable naming conventions), useful but sparing comments (when essential to understanding), and consistent formatting.
As we hope you will see, our software places an unusual premium on clarity, both directly and indirectly. Our belief in the correctness of an implementation is also affected by the rigor of the programmer's test suite and the abundance of assertions. Accordingly, our software always includes a comprehensive test suite as well as aggressive use of assertions.
We evaluate our code based on the three criteria of correctness, clarity, and performance.
assert() for correctness and for clarity
The following keywords are used to specify the semantics of a procedure. They typically appear as comments immediately below the procedure's prototype in the module's .h file. The absence of a keyword indicates that the procedure's behavior in that respect conforms to widely-held expectations.
Clarity is of the utmost importance in naming your identifiers. Your code should read like a great novel. Replace generic names, such as count, with maximally specific names, such as vertex_count or edge_count. Refrain from using meaningless names, such as temp, tmp, or variants of the dreaded x, xx, xxx series. Single character names are reserved for integer iteration variables.
Vowels are obligatory! Even standard abbreviations (eg., defn, dir) should not be used unless they improve clarity. If your typing skills are inadequate to type long names quickly, then take a typing class, but don't burden others with your sloth.
Constants created using #define should be all upper case. Macros created using #define should be either initial letter upper case or all lower case (to allow replacement by procedures) when they take arguments. Global variables should be first character upper case. Procedures and all local variables should be all lower case. Compound names are formed with underscore separators, eg., CONSTANT, nifty_macro(), Global_Variable, local_variable.
Don't reuse variable names. A single variable name should have a single meaning in the widest possible scope (minimally within the body of a procedure, preferably within an entire library). Iteration variables should always iterate over the same domain.
When it improves clarity, variables that are pointers to a single object should be named with the _p suffix. Similarly, a pointer to a pointer to an object (also called a "handle") should be named with the _pp suffix when it improves clarity.
type_t *name_p, **name_pp;
This convention might be used, for example, to distinguish a one dimensional array of pointers (void **name_p or void *name_p[]) from a two dimensional array of objects (void **name or void name[][]), or when an object is passed directly (ie., by value) and indirectly (ie., by reference) in the same piece of code.
When an object is always passed by reference, adding the _p suffix does not improve clarity. Note that arrays are NOT conceptually considered pointers, and therefore they should not be named according to the _p or _pp suffix convention (unless the arrays are arrays of pointers or arrays of pointers to pointers, respectively).
When possible, procedure names begin with their module name and their prototypes obey the following naming conventions (see module design).
typedef ... type_t;
    Defines the object type type_t, with structure tag type_tag.

type_t *type_create(...);
    Creates an object of type type_t and returns its address.

void type_destroy(type_t *object);
    Destroys the given object of type type_t; all subsequent operations
    on that object are undefined.

type_t *type_copy(const type_t *object);
    Creates a copy of the given object of type type_t and returns its
    address.

void type_mirror(const type_t *source, type_t *target);
    Copies the contents of the source object into the existing target
    object.

void type_initialize(type_t *address, ...);
    Given the address of sizeof(type_t) bytes, initializes that memory
    location to contain a valid object of type type_t.

void type_finalize(type_t *object);
    Given a valid object of type type_t, finalizes that object so that
    the subsequent freeing of sizeof(type_t) bytes at the given address
    will completely destroy the given object.

boolean_t type_is_predicate(const type_t *object);
    Returns TRUE if and only if the Boolean predicate holds for the
    given object of type type_t.

value type_attribute(const type_t *object);
    Returns the value of the given attribute for the given object of
    type type_t, or the address of that value.

void type_set_attribute(type_t *object, ...);
    Modifies the given object of type type_t so that the desired
    attribute now holds for that object.

unsigned type_bytes(const type_t *object);
    Returns the number of bytes that would be released by
    type_destroy(object), as measured by memory_total_bytes().

void *type_workspace_p(const type_t *object);
    Returns a pointer to workspace inside the given object of type
    type_t, which the caller is allowed to modify.

typeB_t *typeA2typeB(const typeA_t *object);
    Given an object of type typeA_t, creates an equivalent object of
    type typeB_t and returns it.

boolean_t type_write(const type_t *object, FILE *stream);
    Writes the given object of type type_t to the binary file stream in
    self-delimiting format and returns TRUE if and only if the write
    succeeds.

type_t *type_read(FILE *stream);
    Reads an object of type type_t from the binary file stream in
    self-delimiting format and returns it.

boolean_t type_fprintf(const type_t *object, FILE *stream);
    Prints the given object of type type_t on the ASCII file stream in
    self-delimiting, human-readable format.

type_t *type_fscanf(FILE *stream);
    Reads an object of type type_t from the ASCII file stream in
    self-delimiting, human-readable format and returns it.

boolean_t type_backup(const type_t *object, FILE *stream);
    Writes the given object of type type_t to the binary file stream in
    self-delimiting format and returns TRUE if and only if the write
    succeeds. This is used when

boolean_t type_restore(type_t *object, FILE *stream);
    Reads the state of an object of type type_t from the binary file
    stream in self-delimiting format and returns TRUE if and only if
    the read succeeds and the object's properties are updated from the
    file stream.
The first argument to a function should always be an object of the type defined in the function's module, eg.,
void type_write(const type_t *, FILE *);
type_t *type_read(FILE *);
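As a concrete illustration of these conventions, here is a hypothetical interface for an imaginary point_t module; point_t is not part of LIBPA, and only the naming pattern is meant to carry over.

  #include <stdio.h>
  /* boolean_t and TRUE are assumed to come from a LIBPA support header. */

  typedef struct point_tag {
    double x;
    double y;
  } point_t;

  point_t  *point_create(double x, double y);
  void      point_destroy(point_t *object);
  point_t  *point_copy(const point_t *object);
  boolean_t point_is_origin(const point_t *object);
  double    point_x(const point_t *object);
  void      point_set_x(point_t *object, double x);
  unsigned  point_bytes(const point_t *object);
  boolean_t point_write(const point_t *object, FILE *stream);
  point_t  *point_read(FILE *stream);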
Avoid macros! Macros are a major source of bugs. Use procedures instead. Once your code is working, and you find yourself with extra time on your hands, then you can replace some procedures with macros for minor performance gains, although we still do not recommend wasting your time this way.
All source code should make abundant use of the assert() macro from the `<assert.h>' library for both clarity and correctness. Correctness includes checking that procedures are passed the proper arguments and that they return proper values. Clarity includes asserting all known invariants and assumptions. As a rough guide, around 50% of your code should be asserts, as high as 75% when performing pointer arithmetic. No kidding. The source code must work properly even if NDEBUG is defined and all the assert() macros vanish. That means no side-effects may occur in the expressions passed to assert().
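For example (stack_pop() is a hypothetical procedure with a side effect):

  /* Wrong: the pop disappears when NDEBUG is defined. */
  assert(stack_pop(stack) == expected);

  /* Right: the side effect survives with or without assertions. */
  top = stack_pop(stack);
  assert(top == expected);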
Replace compound asserts with primitive asserts in order to make debugging easier. For example, the compound assertion
assert(exp1 && exp2);
should be replaced with the two primitive assertions
assert(exp1); assert(exp2);
Do not mutate a variable inside a complex expression or when it appears as an argument to a procedure call. If this restriction makes you whine about typing effort, then it's time to take another typing class. Thus, f(i++); should be replaced with f(i); i++;.
All blocks must be explicitly delimited by curly brackets, especially if the blocks contain exactly one statement, eg.,
if (predicate) { statement; }
Liberal use of parentheses is encouraged, particularly within macro definitions. Programming is not a test to see whether you know all the most compact legal expressions of C. Rather, it is a test to see if you can write correct code of breathtaking clarity.
Cutting and pasting code is likely to introduce errors. It is also an early warning sign of a bad modularization. So don't do it! If you are unable to break your cut-and-paste habit, then you must review newly pasted code with extraordinary diligence.
Whenever possible, arguments passed by reference (ie., pointers) should also be declared const. Not only does this improve the clarity and safety of your code, but it allows other programmers to use your code. A procedure argument that is not declared const can never be const in any code that uses that procedure.
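For example, a hypothetical procedure that only inspects its input declares the argument const:

  /* The values array is only read, never modified. */
  unsigned count_zeros(const int *values, unsigned length);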
Do not mix declarations and initializations. That is, do not combine the type declaration of automatic variables with their initialization in the body of a procedure. For example, replace int i = 0; with the declaration int i; in the preamble, followed by the actual initialization i = 0; in the body of the procedure.
Declare variables in the narrowest possible scope. If you only use a variable inside one of the branches of a conditional statement, then it should be declared only in that branch. This reduces the risk of improperly initialized variables and inadvertent name conflicts. It also makes the code easier to understand.
In general, typedefs should be to the object itself, rather than to a pointer to the object. This makes memory management easier in most cases, and improves clarity because the caller better understands the semantics of procedure arguments.
Objects requiring more space than two machine words should typically be passed by reference for the purposes of efficiency. However, objects that are used solely to return multiple values from a procedure call should typically be returned by value in order to simplify memory management.
If at all possible, modules should be parameterized by type. Type parameterization allows you to design efficient, reusable modules. In C, types are parameterized by the size of their objects (in bytes) along with a minimal set of procedures necessary to operate on those objects.
No operating system calls, such as memory allocation or file i/o, should occur in time-critical inner loops. This may require design changes.
Minimize the number of pointers in your data structures. Each pointer reference requires a potentially-nonlocal memory reference, resulting in a cache miss, and can take up to 8 bytes of storage on some machines.
Even a small memory leak is unacceptable.
All LIBPA code must use the memory_create() and memory_destroy() procedures from the memory_t library instead of the malloc() and free() provided by the standard stdlib.
Every test suite must conclude with a memory leak check.
assert(memory_total_bytes() == 0);
assert(memory_total_blocks() == 0);
Use GNU emacs C mode. In most cases, curly braces should be on their own line, in order to more clearly delineate the scope of the block.
This document was generated on 9 February 1998 using the texi2html translator version 1.52.