Evaluating Source Code Analysis Tools
(CSE 6324)
Ken Garlington
15 Dec 2000
Keywords: software faults, static analysis, tool evaluations.
Abstract
Static analysis tools are capable of detecting potential faults in C source
code prior to execution-based testing. Based on preliminary results of
an evaluation of selected static analysis tools, at least one (LDRA Testbed)
is capable of finding a significant number of faults, although there are
concerns about the associated false alarm rate.
Introduction
"Static analysis" is defined as any analysis of a program carried out without
executing the program [BS 7925-1]. Several sources claim that static analysis
provides significant benefits to the programmer [Plessel] [Mayrhauser]
[QuEST] [Thomsen], although it will probably not provide a complete
prediction of program behavior [Venema].
Static analysis can be done for any number of purposes, including:
-
Generation of software metrics [comp.software-eng]
for general software process and product improvement activities. There
are a number of tools that support this type of analysis, including TestWorks/Advisor
METRIC [Software
Research] and Cantata [IPL].
-
Capturing specific data to support use of software cost estimation models
[Mitretek]
-
Detecting ways to improve execution performance [Intel]
or estimate use of computing resources [Ueda]
-
Assisting in the reading and comprehension of a program, using tools such
as xVUE [Telcordia]
or CodeSurfer [GrammaTech]
-
Enforcing coding standards that relate to a specific program quality factor.
For example, tools such as CodeCheck [Abraxas]
can be used to detect the use of potentially non-portable constructs.
-
Test case generation (e.g., using C Test Bed [Testwell]
or VectorCAST [Vector]).
-
Software maintenance [Mayrhauser],
including porting between environments [Classen].
However, for the purposes of this study, one relatively narrow form of
analysis was considered: the detection of potential errors that might affect
program functionality. Scott and Lawrence, in "Testing Existing Software
for Safety-Related Applications" [Intrepid]
identify some of the specific classes of errors that can be detected:
-
Undeclared or improperly declared variables (e.g., variable typing discrepancies)
-
Reference anomalies (e.g., uninitialized or initialized but unused variables)
-
Complex or error-prone constructs
-
Expression faults (e.g., division by zero)
-
Argument checking on module invocations (number of arguments, mismatched
types, uninitialized inputs, etc.)
-
Inconsistent handling of global data
Approaches to implementing static code analysis can be categorized as follows:
Manual techniques.
Manual techniques such as code inspections and desk checking can be effective
in detecting faults [Intrepid].
One controlled experiment (Basili and Selby, 1987) found that code reading
detected more software faults and had a higher fault detection rate than
did functional or structural testing. However, such manual techniques can
also be labor intensive, and automated support is typically encouraged
[Intrepid].
Automated "lint"-class analysis techniques.
Tools of this type are primarily interested in detecting typical coding
standards violations. Although these tools may do some detection of unused
variables, etc., they rarely include extensive implementations of more
complex techniques such as data flow or boundary value analyses. (See the
National Institute of Standards and Technology (NIST) Special Publication
500-209 on Software Error Analysis [NIST]
for a description of these techniques.) Primarily, these tools perform
syntactic or "pattern-matching" operations on the code, looking for specific
lexical constructs that may indicate an error.
This paper refers to this category as "lint"-class tools, based on a
canonical example -- the "lint" utility in most UNIX implementations [OGI].
Lint was originally designed to supplement the weak type checking, etc.
of the C compiler, so that the compiler could run as fast as possible [Johnson].
In addition to lint, there are several other C/C++ tools that fall into
this class, such as:
There are also variations on lint for other languages, such as Fortranlint
for FORTRAN [Cleanscape],
CodeReview for Visual Basic [NuMega],
Jtest for Java [Parasoft]
and AdaSTAT for Ada [DCSIP].
One particular subset of this category is of special interest to developers
of safety-related software. The Motor Industry Software Reliability Association
(MISRA) has developed a standard, "Guidelines for the Use of the C Language
in Vehicle Based Software" [MISRA]
that focuses on C constructs that may be inappropriate for critical software.
Tools that check for violations of this standard include the MISRA C Checker,
part of the LDRA Testbed toolset [LDRA],
as well as several others from Hitex UK, Oakwood Computing, etc. [MISRA].
Automated "deep" analyses.
Compared to "lint"-class tools, tools that perform more complex analyses
such as control flow, data flow, and information flow analyses are rare.
This is because such tools tend to be more difficult to implement [Intrepid].
Tools in this category include the SPARK Examiner for Ada [Praxis]
and the PolySpace C Verifier [PolySpace].
The PolySpace tool is particularly interesting, given the environment
in which these tools are expected to be used, since the Ada version of
this tool detected
(after the fact, unfortunately) one of the most famous faults in the recent
history of aerospace software: the overflow condition in the Ariane
5 first flight, which caused the complete loss of the spacecraft. Similar
tools have been used to detect critical errors in NATO software [Cullyer].
Compilers.
C compilers are capable of finding certain syntactic and semantic errors
as part of the analysis required to generate executable code [NIST].
Any faults found by a compiler are found through static analysis [ISEB].
Regardless of which approach is used, static analysis has the potential
to be beneficial for early detection of errors. This is why the use of
static analysis tools and techniques is recommended by standards for safety-related
software, such as the British Ministry of Defence standard 00-55 [MoD].
To support the use of static code analysis by organizations producing critical
software, this study compared tools representing the different types of
automation described above.
Development Environment
A balanced error detection program will depend on many factors [NIST],
including:
-
the consequences of failure caused by an undetected error,
-
the complexity of the software system,
-
the types of errors likely to be committed in developing specific software,
-
the effort needed to apply a technique,
-
the automated support available,
-
and the experience of the development and assurance staff.
This study assumes the following for each factor:
Consequences of Failure
The users of the selected tool develop applications with very high consequences
of failures, including the potential loss of human life or millions of
dollars in damage. Failures in these systems can have other long term effects,
such as damaging the developing company's reputation and potentially causing
it to lose future work and close its doors. Therefore, the users are first
and foremost interested in detecting faults that will have an impact on
the system's safety and reliability. Other considerations, such as portability
and maintainability, are also important but are not the first priority.
System Complexity
The tool is expected to be used mostly for real-time embedded software
for high performance aircraft control. The application software size historically
has been between 50,000 and 150,000 source lines of code.
Types of Faults
Due to the nature of the application software, the types of faults
the tool may encounter include the following:
-
Computational faults. The mathematically-oriented nature of the
software invites faults such as undefined operations (e.g. divide by zero),
as well as arithmetic overflows and underflows. Invalid (e.g. type mismatch)
or missing references to variables used in the computations are also possible.
-
Logic faults. Particularly in areas such as diagnostics, extensive
logic combinations may be used. Faults such as unreachable code or a missing
response to a valid combination of conditions might be detected by a static
analysis tool.
-
Interface faults. Although the external interfaces for applications
of this type tend to be simple (basic numeric values with no direct user
involvement), there are a large number of interfaces to sensors, actuators,
and other automated systems. Typical faults in this area that might reasonably
be detected by a static analysis tool include invalid data typing, missing
interfaces, or extraneous (unused) interfaces.
-
Reference/bounds faults. Pointers and arrays are used extensively
in these systems to construct static databases of reference information
(e.g. look-up tables of aerodynamic data). Therefore, "out-of-bounds" or
other types of reference errors are possible.
Because of the nature of the application domain, the following types of
faults are not typically encountered:
-
Memory allocation faults. Other than the allocation of simple and
well-bounded data on the stack, very little dynamic allocation occurs during
program execution. Therefore, memory "leaks" and other faults of this class
are rare.
-
Concurrency/timing faults. There tend to be a small number of statically
defined tasks, with simple interactions, in systems of this type. As a
result, faults of this type do not usually occur.
Difficulty of Use
As with any other development effort, the efficiency of the processes and
associated tools is important. However, this study assumes that the development
team is willing to accept some inefficiencies, such as false alarms or
a non-intuitive tool interface, in exchange for uncovering potentially
serious faults.
Automated Support
Obviously, this study expects that some tool is available to support this
type of static analysis. However, as noted under Difficulty
of Use, it would be acceptable for some level of manual work to be
done in conjunction with the tool operation (e.g. tracing an error message
to the specific area of code containing the fault).
Staff Experience
The development team should have an average of five or more years of programming
experience in this application domain, with a mix of new and experienced
engineers. However, their average experience in the target language (“C”)
is likely to be less than 2 years. Therefore, tools that detect improper
application of the C language, even for faults that are usually only made
by those new to the language, are expected to be beneficial to the team.
In addition, the users may not have any prior experience with static analysis
tools other than compilers.
Other Considerations
The users have identified additional factors of interest:
-
The team has to maintain legacy software written in the Ada language. Therefore,
they would like to use the same (or similar) tool for both Ada and C source
code.
-
Similarly, the users want a tool that could be applied to future systems
written in languages other than C, such as Java or C++.
-
Currently, the team uses a Windows NT-based development environment, and
so the tool should ideally execute on that host. (However, a UNIX or VAX/VMS-hosted
alternative could also be used.)
Tool Evaluation Criteria
Table 1 defines the criteria used for this study.
These criteria were selected based on the author's understanding of the
user environment. (Ideally, users would be directly involved in the specification
and potential modifications of these criteria.) Wherever possible, objective
and measurable values were defined.
In order to simplify the analysis of the results, each criterion was
specified in terms of four value ranges:
-
Values in the BLUE range indicate performance that is likely to exceed
user expectations -- to go beyond the minimal requirements.
-
Values in the GREEN range indicate generally acceptable behavior.
-
Values in the YELLOW range indicate behavior that does not disqualify the
use of the tool per se, but is likely to cause the user some concern. These
could be described as "annoyance"-type problems.
-
Values in the RED range are expected to be unacceptable to the user. A
significant risk might result from the use of a tool with one or more RED
scores.
Furthermore, the criteria were broken into two groups: primary
and secondary criteria. The expectation
was that a tool would only be selected if it received satisfactory scores
for all primary criteria. Secondary criteria would be used to (a) select
a single tool, where multiple tools satisfied the primary criteria and
(b) identify potential risks in a selected tool that would need to be communicated
to, and worked through with, the users.
Table 1: Tool Evaluation Criteria
| Evaluation Area | Blue | Green | Yellow | Red |
|---|---|---|---|---|
| Primary Criteria: | | | | |
| Faults Found (per KSLOC) | > 10 | > 5 | > 0 | 0 |
| False Alarms | < 10% | < 50% | <= 75% | > 75% |
| Secondary Criteria: | | | | |
| Cost / year | <= $5,000 | <= $50,000 | <= $100,000 | > $100,000 |
| First use | < 1 day | < 1 week | <= 2 weeks | > 2 weeks |
| Stability (crashes / KSLOC) | 0 | < 0.05 | <= 0.1 | > 0.1 |
| Turnaround (KSLOC / min) | > 1 | > 0.25 | > 0.05 | <= 0.05 |
| Languages | Multiple | C only | Not applicable | Not applicable |
Each criterion is defined as follows:
Primary Criteria
The study defined two key factors for evaluating each tool:
-
Faults Found (per KSLOC). This indicates the number of legitimate
faults detected during the evaluation, per thousand source lines of code
(KSLOC) processed. A "legitimate" fault is one that could have led to a
system failure under conditions that are feasible during operation. Since
the value of finding such faults is high, the threshold for an acceptable
detection rate was kept fairly low.
-
False Alarms. Error messages that do not represent legitimate faults
are also expected for these types of tools. For example, the programmers
of LCLint say that "they will give a prize to anyone who writes a real
program that receives no warning when using LCLint in [the most extensive
error checking] mode" [LinuxJournal].
Therefore, the threshold for false alarms was set fairly high: as many as
three out of four error messages could be false alarms
for a marginally acceptable (Yellow) rating.
Secondary Criteria
The following additional factors were also defined:
-
Cost / year. This cost reflects the average price per year of each
tool for a single seat license, taking into account the original purchase
price plus any yearly maintenance fees. This price would not include any
unique support, development, or training services provided by the vendor.
-
First use. This factor measures the time between when the software
was first installed on the system, and the point where the evaluator was
able to generate the first usable output. It is intended to help gauge
the difficulty of tool use.
-
Stability (crashes / KSLOC). This element identifies the number
of times the tool failed completely (i.e. "crashed") for every KSLOC processed.
It provides a gross measure of potential difficulties in using the
tool, as well as an indirect measure of the expected user confidence
in the tool output.
-
Turnaround (KSLOC / min). "Turnaround" represents the average volume
of source code that can be processed for each minute of clock time. This
is also a typical reflection of the difficulty of tool use.
-
Languages. As noted above, the
users are interested in the ability to use the tool (or a similar tool
from the same vendor) to process languages other than C. This factor identifies
such cases found during the evaluation.
Tool Candidate Selection
As noted in the Introduction, there are a large
number of tools that could have been included in this study -- more than
could actually be considered given the available time and other resources.
Therefore, a manageable subset of the available tools had to be selected.
Three considerations were identified to guide this selection:
-
There must be a vendor willing to provide commercial support for the tool.
This does not exclude open source software, so long as maintenance can
be purchased (e.g. RedHat support for Linux).
-
There must be an existing commercial base for the toolset. This was used
as a way to roughly estimate the risk of the product -- if no one else
is using it, there may be a reason!
-
Finally, it was decided to include one candidate from each of the three
major types of automated static analysis described in the Introduction.
This permitted the study to evaluate not only individual tools, but also
different approaches to automating static analysis.
The three tools selected as candidates were:
All of the current company projects in the domain of interest use the Texas
Instruments (TI)
CodeComposer for the TMS320C67 target. Therefore, it was chosen as the
example compiler for this study. Since this tool must be used to generate
code, it was treated as a "baseline" for the study. In other words, the
other tools evaluated in the study only processed code that had already
been successfully compiled by the TI toolset. Therefore, any faults reported
would be in addition to the faults found (and fixed) during compilation.
(Note that all evaluation factors listed for the TI compiler in the results
are given as "Not Applicable," since it is the baseline.)
As noted earlier, a web search can find a number
of tools that use "pattern matching" techniques to analyze C code. However,
when the desire to incorporate checks against the MISRA
C standard is included, only a handful of tools remain. The LDRA
Testbed MISRA code standards checker was selected, based on its large user
base.
This category was the easiest to select. Only one commercial tool in this
class was identified that would work with C code: the PolySpace
C Verifier.
Evaluation Methodology
Prior to this assignment, the author had done some preliminary work in
this area to identify potential static analysis tools for use in the organization.
This preliminary work is described above. Another
group in the same company has decided to perform an evaluation using these
tools, although their approach is somewhat less formal than described in
this paper. For example, they did not explicitly define evaluation criteria
as described in Table 1 above. However, their work
to date is incorporated into this paper. (To avoid confusion, this paper
does not further distinguish between this study and the other study.)
The study used three sources of data to evaluate the selected candidates:
Web Searches
Each vendor site was accessed to provide inputs to the study. Of course,
general productivity claims, etc. by the vendors or by trade magazines
[SDTimes] were not
taken at face value. However, they were of some use in determining support
for multiple languages. In addition, an attempt was made to locate prior
studies that directly compared two or more of these tools. While
at least one study comparing C++ tools was located [Meyers],
and several generic studies were found (examples include [Wisconsin]
and [DAEDALUS]),
no study comparing the specific tools selected here was found.
Direct Vendor Contacts
Each of the vendors was contacted to get information on costs. In addition,
they were asked to provide evaluation copies of their tools. LDRA provided
an evaluation copy, while PolySpace chose to take inputs from the evaluation
team and run them at the vendor's site. Copies of the TI compiler purchased
prior to the study were made available to the evaluation team.
Benchmarks
Most of the data to be gathered in this study was expected to come from
applying each tool to some set of C source code -- the study benchmark
code. Several options were available:
-
Use standard C source code benchmarks, downloaded from the web or accessed
from some other source,
-
Build a synthetic benchmark with faults "seeded" into the code, as was
done in the C++ tool study [Meyers]
-
Use code extracted from a C application developed by the user organization.
The third option was chosen. The specific application used was a prototype
embedded operating system called VMX (Vehicle Management eXecutive). This
code had several desirable attributes, such as:
-
It contained representative implementations of the same types of algorithms,
etc. expected to be found in other applications in this domain,
-
It was developed using the same development process expected to be applied
to future applications,
-
The VMX programmers met the expected profile with respect to experience,
etc.
-
Although the code had been compiled, minimal VMX testing had been performed.
As a result, it was reasonable to conclude that this code would have the
same types and density of faults as would
be found in other user applications from this domain. Note, however, that
only about 800 source lines of code were included in the study. Although
this limits the results, it was necessary based on the limited resources
available to the evaluation. This subset was selected to be representative
of the content of the total application.
Results to Date
Table 2 summarizes the results at this stage of the
study:
Table 2: Results to Date
| Evaluation Area | TI | LDRA | PolySpace |
|---|---|---|---|
| Primary Criteria: | | | |
| Faults Found (per KSLOC) | Not applicable | 12.5 (10 / 0.8 KSLOC) | In work |
| False Alarms | Not applicable | 90% (100 / 110) | In work |
| Secondary Criteria: | | | |
| Cost / year | Not applicable | $15K | $20K |
| First use | Not applicable | 2 days | In work |
| Stability (crashes / KSLOC) | Not applicable | 0 | In work |
| Turnaround (KSLOC / min) | Not applicable | 0.4 | In work |
| Languages | Not applicable | C only | Ada and C |
As of this writing, the LDRA evaluation has been completed. Although
the fault detection rate (12.5 faults per KSLOC) falls in the Blue range,
the false alarm rate (roughly 90%) falls in the Red range and is much higher
than originally expected. Discussions with the vendor are underway to understand
this effect further, and to see if there are ways to reduce the rate. Additional
discussions with the users may also be needed to see if the acceptance
threshold for this criterion needs to be adjusted.
No concrete data has been received to date from PolySpace. On November
7, they requested additional time to complete the evaluation, apparently
in part due to compiler-specific constructs in the code that need to be
addressed. Nothing has been received since that date. Since the PolySpace
tool is less mature than the LDRA tool, and since (as noted earlier)
the technology is more complex, this may or may not be a cause for concern.
Since the vendor has reported excellent results for this type of analysis
when applied to Ada applications such as the Ariane spacecraft, it is desirable
to keep this class of tool in the evaluation for as long as possible.
Conclusions and Additional Work
Clearly, the LDRA tool shows promise as a potential addition to the current
tool suite. Except for the high false alarm rate, the tool did very well
in all evaluation categories. Follow-on plans include trial usage on a
larger subset of VMX, as well as possibly deploying some copies to users
for further "hands-on" evaluations.
No specific conclusions can be drawn at this time for the PolySpace
tool. If no additional information is received by early next year, it
will probably be dropped from the evaluation. If it is dropped, it
may be re-evaluated later if there is sufficient interest.