Evaluating Source Code Analysis Tools

(CSE 6324)

Ken Garlington
15 Dec 2000


Keywords: software faults, static analysis, tool evaluations.

Abstract

Static analysis tools are capable of detecting potential faults in C source code prior to execution-based testing. Based on preliminary results of an evaluation of selected static analysis tools, at least one (LDRA Testbed) is capable of finding a significant number of faults, although there are concerns about the associated false alarm rate.

Introduction

"Static analysis" is defined as any analysis of a program carried out without executing the program [BS 7925-1]. Several sources claim that static analysis provide significant benefits to the programmer [Plessel] [Mayrhauser] [QuEST] [Thomsen], although it will probably not provide a complete prediction of program behavior [Venema].

Static analysis can be done for any number of purposes, including:

However, for the purposes of this study, one relatively narrow form of analysis was considered: the detection of potential errors that might affect program functionality. Scott and Lawrence, in "Testing Existing Software for Safety-Related Applications" [Intrepid], identify some of the specific classes of errors that can be detected:

Approaches to implementing static code analysis can be categorized as follows:

Manual techniques.

Manual techniques such as code inspections and desk checking can be effective in detecting faults [Intrepid]. One controlled experiment (Basili and Selby, 1987) found that code reading detected more software faults and had a higher fault detection rate than did functional or structural testing. However, such manual techniques can also be labor intensive, and automated support is typically encouraged [Intrepid].

Automated "lint"-class analysis techniques.

Tools of this type are primarily interested in detecting typical coding standards violations. Although these tools may do some detection of unused variables, etc., they rarely include extensive implementations of more complex techniques such as data flow or boundary value analyses. (See the National Institute of Standards and Technology (NIST) Special Publication 500-209 on Software Error Analysis [NIST] for a description of these techniques.) Primarily, these tools perform syntactic or "pattern-matching" operations on the code, looking for specific lexical constructs that may indicate an error.
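The purely lexical character of such checks can be illustrated with a minimal Python sketch. It is not taken from any of the tools discussed here: the rule (flagging a lone "=" inside an "if" condition, where "==" was probably intended), the function name, and the sample C fragment are all assumptions made for illustration.

```python
import re

# Hypothetical rule: a lone '=' inside an if-condition is suspicious in C.
# The check is lexical only; it does not parse or type-check the C code.
IF_CONDITION = re.compile(r"\bif\s*\((.*)\)")
LONE_ASSIGN = re.compile(r"(?<![=!<>])=(?!=)")


def find_assignments_in_conditions(c_source):
    """Return (line_number, text) pairs for suspicious if-conditions."""
    findings = []
    for number, line in enumerate(c_source.splitlines(), start=1):
        match = IF_CONDITION.search(line)
        if match and LONE_ASSIGN.search(match.group(1)):
            findings.append((number, line.strip()))
    return findings


# Made-up C fragment used only to exercise the rule.
SAMPLE = """\
int check(int mode, int limit) {
    if (mode = 1) {
        return limit;
    }
    if (mode == 2) {
        return 0;
    }
    if (limit <= 0) {
        return -1;
    }
    return 1;
}
"""

print(find_assignments_in_conditions(SAMPLE))
```

Because the rule looks only at the text of a single line, it knows nothing about types, control flow, or conditions that span several lines, which is one reason tools of this class can both miss faults and raise false alarms.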

This paper refers to this category as "lint"-class tools, based on a canonical example -- the "lint" utility in most UNIX implementations [OGI]. Lint was originally designed to supplement the weak type checking, etc. of the C compiler, so that the compiler could run as fast as possible [Johnson]. In addition to lint, there are several other C/C++ tools that fall into this class, such as:

There are also variations on lint for other languages, such as Fortranlint for FORTRAN [Cleanscape], CodeReview for Visual Basic [NuMega], Jtest for Java [Parasoft] and AdaSTAT for Ada [DCSIP].

One particular subset of this category is of special interest to developers of safety-related software. The Motor Industry Software Reliability Association (MISRA) has developed a standard, "Guidelines for the Use of the C Language in Vehicle Based Software" [MISRA], that focuses on C constructs that may be inappropriate for critical software. Tools that check for violations of this standard include the MISRA C Checker, part of the LDRA Testbed toolset [LDRA], as well as several others from Hitex UK, Oakwood Computing, etc. [MISRA].

Automated "deep" analyses.

Compared to "lint"-class tools, tools that perform more complex analyses such as control flow, data flow, and information flow analyses are rare, because such tools tend to be more difficult to implement [Intrepid]. Tools in this category include the SPARK Examiner for Ada [Praxis] and the PolySpace C Verifier [PolySpace]. The PolySpace tool is particularly interesting given the environment in which these tools are expected to be used. The Ada version of this tool detected (after the fact, unfortunately) one of the most famous faults in the recent history of aerospace software: the overflow condition in the Ariane 5 first flight, which caused the complete loss of the spacecraft. Similar tools have been used to detect critical errors in NATO software [Cullyer].
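As a rough illustration of the kind of reasoning a "deep" analysis performs, the Python sketch below propagates a range of possible values through a scaling operation and asks whether the result could fall outside a 16-bit signed integer. It is a toy example only and does not represent the algorithm of PolySpace or any other tool; the function names, the sensor range, and the scale factors are hypothetical.

```python
# Limits of a 16-bit signed integer.
INT16_MIN, INT16_MAX = -32768, 32767


def scale_range(low, high, factor):
    """Return the range of x * factor for every x in [low, high]."""
    a, b = low * factor, high * factor
    return (min(a, b), max(a, b))


def may_overflow_int16(low, high):
    """True if some value in [low, high] cannot be stored in 16 bits."""
    return low < INT16_MIN or high > INT16_MAX


# Hypothetical input known to lie in [-500.0, 500.0], scaled by 100
# before being converted to a 16-bit integer.
scaled = scale_range(-500.0, 500.0, 100)
print(scaled, may_overflow_int16(*scaled))
```

Unlike a lexical check, this reasoning depends on what values can reach a statement rather than on how the statement is written, which suggests why such analyses are harder to implement.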

Compilers.

C compilers are capable of finding certain syntactic and semantic errors as part of the analysis required to generate executable code [NIST]. Any faults found by a compiler are found through static analysis [ISEB].

Regardless of which approach is used, static analysis has the potential to be beneficial for early detection of errors. This is why the use of static analysis tools and techniques is recommended by standards for safety-related software, such as the British Ministry of Defence standard 00-55 [MoD]. To support the use of static code analysis by organizations producing critical software, this study compared tools representing the different types of automation described above.

Development Environment

A balanced error detection program will depend on many factors [NIST], including:

This study assumes the following for each factor:

Consequences of Failure

The users of the selected tool develop applications with very high consequences of failures, including the potential loss of human life or millions of dollars in damage. Failures in these systems can have other long term effects, such as damaging the developing company's reputation and potentially causing it to lose future work and close its doors. Therefore, the users are first and foremost interested in detecting faults that will have an impact on the system's safety and reliability. Other considerations, such as portability and maintainability, are also important but are not the first priority.

System Complexity

The tool is expected to be used mostly for real-time embedded software for high performance aircraft control. The application software size historically has been between 50,000 and 150,000 source lines of code.

Types of Faults

Due to the nature of the application software, the types of faults the tool may encounter include the following:

Because of the nature of the application domain, the following types of faults are not typically encountered:

Difficulty of Use

As with any other development effort, the efficiency of the processes and associated tools is important. However, this study assumes that the development team is willing to accept some inefficiencies, such as false alarms or a non-intuitive tool interface, in exchange for uncovering potentially serious faults.

Automated Support

Obviously, this study expects that some tool is available to support this type of static analysis. However, as noted under Difficulty of Use, it would be acceptable for some level of manual work to be done in conjunction with the tool operation (e.g. tracing an error message to the specific area of code containing the fault).

Staff Experience

The development team should have an average of five or more years of programming experience in this application domain, with a mix of new and experienced engineers. However, their average experience in the target language (“C”) is likely to be less than two years. Therefore, tools that detect improper application of the C language, even for faults that are usually only made by those new to the language, are expected to be beneficial to the team. In addition, the users may not have any prior experience with static analysis tools other than compilers.

Other Considerations

The users have identified additional factors of interest:

Tool Evaluation Criteria

Table 1 defines the criteria used for this study. These criteria were selected based on the author's understanding of the user environment. (Ideally, users would be directly involved in the specification and potential modifications of these criteria.) Wherever possible, objective and measurable values were defined.

In order to simplify the analysis of the results, each criterion was specified in terms of four value ranges:

Furthermore, the criteria were broken into two groups: primary and secondary criteria. The expectation was that a tool would only be selected if it received satisfactory scores for all primary criteria. Secondary criteria would be used to (a) select a single tool, where multiple tools satisfied the primary criteria and (b) identify potential risks in a selected tool that would need to be communicated and worked with the users.
 
Table 1: Tool Evaluation Criteria

Evaluation Area                 Rating Criteria
                                Blue        Green        Yellow          Red
Primary Criteria:
  Faults Found (per KSLOC)      > 10        > 5          > 0             0
  False Alarms                  < 10%       < 50%        <= 75%          > 75%
Secondary Criteria:
  Cost / year                   <= $5,000   <= $50,000   <= $100,000     > $100,000
  First use                     < 1 day     < 1 week     <= 2 weeks      > 2 weeks
  Stability (crashes / KSLOC)   0           < 0.05       <= 0.1          > 0.1
  Turnaround (KSLOC / min)      > 1         > 0.25       > 0.05          <= 0.05
  Languages                     Multiple    C only       Not applicable  Not applicable
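The rating ranges for the two primary criteria in Table 1 can be read as simple threshold checks. The following Python sketch is illustrative only: the function names are hypothetical, and the sample values passed at the end are made up rather than taken from the study.

```python
def rate_faults_found(faults_per_ksloc):
    """Map a fault density (faults per KSLOC) to a Table 1 rating."""
    if faults_per_ksloc > 10:
        return "Blue"
    if faults_per_ksloc > 5:
        return "Green"
    if faults_per_ksloc > 0:
        return "Yellow"
    return "Red"


def rate_false_alarms(percent):
    """Map a false alarm percentage to a Table 1 rating."""
    if percent < 10:
        return "Blue"
    if percent < 50:
        return "Green"
    if percent <= 75:
        return "Yellow"
    return "Red"


# Made-up sample values, used only to exercise the thresholds.
print(rate_faults_found(7.0), rate_false_alarms(30.0))
```

Writing the ranges this way makes the boundary cases explicit: a density of exactly 10 faults per KSLOC rates Green rather than Blue, and a false alarm rate of exactly 75% still rates Yellow.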

Each criterion is defined as follows:

Primary Criteria

The study defined two key factors for evaluating each tool:

Secondary Criteria

The following additional factors were also defined:

Tool Candidate Selection

As noted in the Introduction, there are a large number of tools that could have been included in this study -- more than could actually be considered given the available time and other resources. Therefore, a manageable subset of the available tools had to be selected. Three considerations were identified to guide this selection:

The three tools selected as candidates were:

Compiler toolset

All of the current company projects in the domain of interest use the Texas Instruments (TI) CodeComposer for the TMS320C67 target. Therefore, it was chosen as the example compiler for this study. Since this tool must be used to generate code, it was treated as a "baseline" for the study. In other words, the other tools evaluated in the study only processed code that had already been successfully compiled by the TI toolset. Therefore, any faults reported would be in addition to the faults found (and fixed) during compilation. (Note that all evaluation factors listed for the TI compiler in the results are given as "Not Applicable," since it is the baseline.)

"Lint-class" tool

As noted earlier, a web search can find a number of tools that use "pattern matching" techniques to analyze C code. However, when checks against the MISRA C standard are also required, only a handful of tools remain. The LDRA Testbed MISRA code standards checker was selected, based on its large user base.

"Deep analysis" tool

This category was the easiest to select. Only one commercial tool in this class was identified that would work with C code: the PolySpace C Verifier.

Evaluation Methodology

Prior to this assignment, the author had done some preliminary work in this area to identify potential static analysis tools for use in the organization. This preliminary work is described above. Another group in the same company has decided to perform an evaluation using these tools, although their approach is somewhat less formal than that described in this paper. For example, they did not explicitly define evaluation criteria as described in Table 1 above. However, their work to date is incorporated into this paper. (To avoid confusion, this paper does not further distinguish between this study and the other study.)

The study used three sources of data to evaluate the selected candidates:

Web Searches

Each vendor site was accessed to provide inputs to the study. Of course, general productivity claims, etc. by the vendors or by trade magazines [SDTimes] were not taken at face value. However, they were of some use in determining support for multiple languages. In addition, an attempt was made to locate prior studies doing specific comparisons of two or more of these tools. While at least one study comparing C++ tools was located [Meyers], and several generic studies were found (examples include [Wisconsin] and [DAEDALUS]), no studies specifically comparing the selected tools were found in this manner.

Direct Vendor Contacts

Each of the vendors was contacted to get information on costs. In addition, they were asked to provide evaluation copies of their tools. LDRA provided an evaluation copy, while PolySpace chose to take inputs from the evaluation team and run them at the vendor's site. Copies of the TI compiler purchased prior to the study were made available to the evaluation team.

Benchmarks

Most of the data to be gathered in this study was expected to come from applying each tool to some set of C source code -- the study benchmark code. Several options were available:

The third option was chosen. The specific application used was a prototype embedded operating system called VMX (Vehicle Management eXecutive). This code had several desirable attributes, such as:

As a result, it was reasonable to conclude that this code would have the same types and density of faults as would be found in other user applications from this domain. Note, however, that only about 800 source lines of code were included in the study. Although this limits the results, it was necessary based on the limited resources available to the evaluation. This subset was selected to be representative of the content of the total application.

Results to Date

Table 2 summarizes the results at this stage of the study:
 
Table 2: Results to Date

Evaluation Area                 Candidates
                                TI              LDRA                    PolySpace
Primary Criteria:
  Faults Found (per KSLOC)      Not applicable  12.5 (10 / 0.8 KSLOC)   In work
  False Alarms                  Not applicable  90% (100 / 110)         In work
Secondary Criteria:
  Cost / year                   Not applicable  $15K                    $20K
  First use                     Not applicable  2 days                  In work
  Stability (crashes / KSLOC)   Not applicable  0                       In work
  Turnaround (KSLOC / min)      Not applicable  0.4                     In work
  Languages                     Not applicable  C only                  Ada and C

As of this writing, the LDRA evaluation has been completed. Although the fault detection rate is quite good, the false alarm rate is much higher than originally expected. Discussions with the vendor are underway to understand this effect further, and to see if there are ways to reduce the rate. Additional discussions with the users may also be needed to see if the acceptance threshold for this criterion needs to be adjusted.
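The two primary LDRA figures in Table 2 follow directly from the raw counts given there (10 faults and 100 false alarms among 110 reports, on about 800 source lines). The short Python sketch below reproduces the arithmetic; the function names are hypothetical and are used only for illustration.

```python
def faults_per_ksloc(faults, source_lines):
    """Fault density expressed per thousand source lines of code."""
    return faults * 1000 / source_lines


def false_alarm_percent(false_alarms, total_reports):
    """Share of tool reports that did not correspond to a real fault."""
    return 100 * false_alarms / total_reports


# Raw counts as given in Table 2 for the LDRA evaluation.
density = faults_per_ksloc(10, 800)
alarm_rate = false_alarm_percent(100, 110)
print(density, round(alarm_rate, 1))
```

The density works out to 12.5 faults per KSLOC, within the Blue range of Table 1. The false alarm rate works out to roughly 90.9%, shown as 90% in Table 2; either figure is above the 75% boundary and so falls in the Red range.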

No concrete data has been received to date from PolySpace. On November 7, they requested additional time to complete the evaluation, apparently in part due to compiler-specific constructs in the code that need to be addressed. Nothing has been received since that date. Since the PolySpace tool is less mature than the LDRA tool, and since (as noted earlier) the technology is more complex, this may or may not be a cause for concern. Since the vendor has reported excellent results for this type of analysis when applied to Ada applications such as the Ariane spacecraft, it is desirable to keep this class of tool in the evaluation for as long as possible.

Conclusions and Additional Work

Clearly, the LDRA tool shows promise as a potential addition to the current tool suite. Except for the high false alarm rate, the tool did very well in all evaluation categories. Follow-on plans include trial usage on a larger subset of VMX, as well as possibly deploying some copies to users for further "hands-on" evaluations.

No specific conclusions can be drawn at this time for the PolySpace tool. If no additional information is received by early next year, the tool will probably be dropped from the evaluation. If it is dropped, it may be re-evaluated later if there is sufficient interest.