
Abstract

Published at: Empir Software Eng 29, 57 (2024). https://doi.org/10.1007/s10664-024-10440-0

As robotic systems such as autonomous cars and delivery drones assume greater roles and responsibilities within society, the likelihood and impact of catastrophic software failure within those systems increase. To aid researchers in the development of new methods to measure and assure the safety and quality of robotics software, we systematically curated a dataset of 221 bugs across 7 popular and diverse software systems implemented via the Robot Operating System (ROS). We produce historically accurate recreations of each of the 221 defective software versions in the form of Docker images, and use a grounded theory approach to examine and categorize their corresponding faults, failures, and fixes. Finally, we reflect on the implications of our findings and outline future research directions for the community.

Methods

Subject Bugs and Data Gathering

To identify historical bugs in each subject system, we examined its issue tracker, pull requests, and commit history. For an initial screening, we prioritized issues labeled as a Bug (or similar) and commit messages including keywords such as fix. Issues clearly unrelated to bugs, judging by their title or labels, were discarded. All non-obvious issues and pull requests were inspected in consensus meetings to determine whether they describe additional bugs. In these meetings, we asked whether the problem discussed resulted from a deliberate prior design decision, or from an omission, a mistake, a change in another system, and so on. In any case, whenever developers used the term bug, error, or mistake in the discussion, we assumed that a bug was being discussed. Bad smells and style issues were classified as not-bugs. In total, we identified 221 issues and pull requests that qualified as bugs across the subject systems.
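As an illustration of this screening step, the sketch below filters GitHub issues by a bug-like label and screens commit messages for fix-related keywords. This is a minimal sketch, not the tooling used in the study; the repository name, the label value, and the keyword list are assumptions, and only the first page of results is fetched.

```python
# Minimal screening sketch using the GitHub REST API (unauthenticated).
import requests

ISSUES_URL = "https://api.github.com/repos/{repo}/issues"

def candidate_bug_issues(repo: str, label: str = "bug") -> list[dict]:
    """Fetch issues carrying a bug-like label (first page only)."""
    resp = requests.get(ISSUES_URL.format(repo=repo),
                        params={"labels": label, "state": "all", "per_page": 100})
    resp.raise_for_status()
    # The issues endpoint also returns pull requests; keep plain issues here.
    return [i for i in resp.json() if "pull_request" not in i]

def looks_like_fix(commit_message: str) -> bool:
    """Keyword screen for commit messages, mirroring the 'fix' heuristic."""
    keywords = ("fix", "bug", "error", "mistake")
    return any(k in commit_message.lower() for k in keywords)
```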

Figure 2: Issue 331 in Kobuki,
https://github.com/yujinrobot/kobuki/issues/331

Figure 2 exemplifies the data available about the bugs. The issue creation date (1) determines the versions of ROS and other dependencies that might have been used by the reporter. The community status of the reporter (2) distinguishes between issues found internally and those reported by downstream users. The problem description (3) is the key source for deciding whether an issue is a bug and what its nature is. The labels (4) provide a diversity of information; here the issue has been labelled as software-related, which makes it a potential software bug report. The existence of commits (5, 8) referencing the issue shows that it is either fixed or being worked on. Inspecting the commits (6, 8), we can understand the bug from the perspective of its fix. The referencing pull requests (7) provide similar context. If the bug is fixed, we note down the closing date (9) of the issue, which allows us to estimate how long it took to resolve the problem (14 days here).
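For example, the resolution time noted above can be estimated from the created_at and closed_at timestamps that the GitHub API returns for an issue. The helper below is a small illustrative sketch; the timestamp format assumption matches the API's ISO 8601 strings.

```python
# Sketch: estimating time-to-resolution from issue timestamps (cf. Fig. 2).
from datetime import datetime

def days_to_resolve(created_at: str, closed_at: str) -> int:
    """Both arguments are ISO 8601 strings such as '2013-07-31T09:12:00Z'."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    opened = datetime.strptime(created_at, fmt)
    closed = datetime.strptime(closed_at, fmt)
    return (closed - opened).days

# e.g. days_to_resolve(issue["created_at"], issue["closed_at"])
# would yield 14 for the issue shown in Fig. 2.
```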

Figure 3: Commit 0e2ea0c4 of
https://github.com/mavlink/mavros

Inspecting commits and pull requests that do not reference any issue requires additional work, as they tend to describe what changes are introduced rather than the problem addressed. Despite this, from the commit in Fig. 3 we can still harvest four relevant items. The commit message (1) includes the keyword “fix”, implying there was an underlying issue; in this case, the commit fixes warnings from the cppcheck code analyzer. The first release (2) of the repository that includes the commit, together with the commit date (3), determines the versions of the involved software. The parent commit (4) is the last version of the code that still contains the bug being fixed.
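The sketch below shows how these items can be recovered mechanically for a fix commit, assuming a local clone of the repository. The helper names are ours, and the earliest-tag query only approximates the first release containing the commit.

```python
# Sketch: recovering the context of a fix commit (e.g., mavros 0e2ea0c4).
import subprocess

def git(repo_dir: str, *args: str) -> str:
    """Run a git command in `repo_dir` and return its trimmed stdout."""
    return subprocess.run(["git", "-C", repo_dir, *args],
                          capture_output=True, text=True, check=True).stdout.strip()

def fix_context(repo_dir: str, fix_sha: str) -> dict:
    tags = git(repo_dir, "tag", "--sort=creatordate", "--contains", fix_sha).splitlines()
    return {
        "message": git(repo_dir, "show", "-s", "--format=%s", fix_sha),    # (1) commit message
        "first_release": tags[0] if tags else None,                        # (2) earliest tag containing the fix
        "date": git(repo_dir, "show", "-s", "--format=%cI", fix_sha),      # (3) commit date
        "last_buggy_version": git(repo_dir, "rev-parse", f"{fix_sha}^"),   # (4) parent commit
    }
```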

Data Analysis

For each of the bugs in the dataset, we produced a forensic description by manually analyzing the available information. Each description follows a common schema. The initial list of attributes in the schema was identified in a discussion among the authors, based on their expertise in bug studies and in robotics software engineering. The list remained stable for most of the data-collection period, but several fields were added in an exploratory fashion. These were usually derived either from the initial fields (using thematic coding) or by automatically querying GitHub repositories. Each description was initially written by a team member familiar with the associated subject system, before being discussed extensively and cross-checked by multiple members of the research team. We include all these descriptions as YAML documents, along with the schema, in the ROBUST repository on GitHub.

Figure 4: An example report for a bug (kobuki:e964bbb)

Figure 4 shows an example of a description for a bug in the Kobuki project. It opens with a unique identifier, a prefix of the hash of its fixing commit in a Git repository (e.g., e964bbb). The title summarizes the bug in general terms, and the description elaborates on the bug itself, the software components affected, and the context in which the bug occurred. We wrote the descriptions aiming to be as accessible as possible, without presupposing deep training in robotics. The keywords aid the search and retrieval of relevant bug reports. (Unlike the codes discussed below, the keywords are not derived systematically.) The system field records the name of the project in which the bug was found.
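As a sketch of how these descriptions can be consumed, the snippet below loads one YAML document with PyYAML and reads the top-level fields named above. The file name is hypothetical, and the authoritative key names are defined by the schema in the ROBUST repository rather than by this example.

```python
# Sketch: reading one bug description from the dataset.
import yaml  # PyYAML

def load_bug(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)

bug = load_bug("kobuki/e964bbb.yml")  # hypothetical file name
print(bug["title"], bug["system"], bug["keywords"])
```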

We initially attempted to classify bugs using the Common Weakness Enumeration (CWE), an established taxonomy of software weaknesses developed independently of this study. However, as CWE is predominantly concerned with security, we were unable to adequately classify most of the dataset. Motivated by this inadequacy, rather than re-using an existing taxonomy (e.g., IEEE 1044–2009; Seaman et al. 2008; Thung et al. 2012; Garcia et al. 2020; Wang et al. 2021; Zampetti et al. 2022), we elected to use open coding and grounded theory building as an established mechanism for structuring qualitative data when no prior taxonomy is presupposed. This allows us to better represent and fully describe the nature of software bugs in ROS without being constrained by an existing categorization. Moreover, by allowing the taxonomy to emerge from the data, our study provides a conceptual replication of prior work.

We systematically analyzed all bugs through a process of thematic coding, establishing codes in two groups: failure descriptions and fault descriptions. We define a failure as the inability of the software to perform its function, with a special focus on the observable manifestation of this inability (sometimes also referred to as an error). The fault is the cause of, or the reason for, the failure within the software (so if the fault is repaired, the failure is eliminated). The results of this analysis are stored under failure-codes and fault-codes, respectively.

The thematic coding was split randomly among the five coauthors, with two coders per bug description. They performed the initial coding independently, introducing new codes as necessary. After the initial coding was obtained, we held a consistency meeting that produced a unified codebook. Afterwards, all bug descriptions were recoded according to the codebook. Finally, for all code assignments on which the two coders disagreed, we held a series of consensus meetings with all five coders; agreement was reached through joint discussion, analysis of the source material, and any necessary context information about ROS. Two of the coders involved had extensive robotics engineering experience, and three had extensive software quality engineering experience.
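A hedged sketch of the bookkeeping behind such consensus meetings: given two hypothetical mappings from bug identifier to the set of codes each coder assigned, the helper below lists, per bug, the codes on which the coders disagree. It illustrates the idea only; it is not the tooling used in the study.

```python
# Sketch: flagging bugs whose code assignments need a consensus meeting.
def disagreements(coder_a: dict[str, set[str]],
                  coder_b: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per bug, the codes assigned by exactly one of the two coders."""
    return {bug_id: coder_a[bug_id] ^ coder_b[bug_id]
            for bug_id in coder_a.keys() & coder_b.keys()
            if coder_a[bug_id] != coder_b[bug_id]}
```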

The rest of the record is broadly split into two sections: the bug description, which elaborates on the fault and failure, and the fix description, which collects information on how the bug was fixed. The bug description specifies: the stage at which the failure occurs (e.g., build, deployment, runtime); the relationship of the person who reported the bug to the affected system (e.g., guest user, contributor, maintainer, automatic, unreported); the URL of the associated GitHub issue and the time at which the issue was reported; the task of the robot that is directly affected by the bug (e.g., perception, localization, planning); a determination of how the bug was detected (e.g., build system, static analysis, assertions, runtime detection, test failure, developer); and whether the failure occurred in the application or in the ROS/ROSIn platform itself (architectural-location). The fix description provides: a list of commits that constitute the bug fix; the URL of the associated pull request; the date and time at which the bug was fixed; and the files that were changed as part of the fix, together with the languages of those files. We only include the subset of changed files that relate to the bug fix itself; we do not include coincidental changes (e.g., refactorings).
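Once the YAML descriptions are loaded (see the earlier sketch), simple aggregations over these fields yield descriptive statistics of the kind reported in the next section. The snippet below is illustrative only; key names such as stage are assumptions based on the prose above, and the authoritative names are given by the dataset schema.

```python
# Sketch: descriptive queries over loaded bug records.
from collections import Counter

def count_by(bugs: list[dict], key: str) -> Counter:
    """Tally bugs by a top-level string attribute, e.g. the stage at which they fail."""
    return Counter(b.get(key, "unknown") for b in bugs)

# e.g. count_by(all_bugs, "stage") -> Counter({"runtime": ..., "build": ...})
```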

In Section 4, we give clickable links to bugs in the repository. These lists of links are not exhaustive: they show a small number of examples rather than every example of the given category in the dataset.

Descriptive Statistics of the Obtained Dataset

Table 2: Descriptive statistics for the subject systems
Subject           # Bugs   # Issues   Category      C++      C        Python   XML
Kobuki                57        325   application   23,555   18,073    4,207    2,325
TurtleBot             11        170   application      799       42    4,438    1,129
Care-O-bot            11        182   application   31,084    9,430    9,248   23,814
Universal Robot       25        158   driver         1,071      331    1,741      738
Motoman               22         78   driver         4,129    5,337        0    1,272
MavRos                40        623   middleware    12,807    1,611    1,013      330
geometry2             42        264   library        6,267    4,311    1,074      273
Total                208      1,800                 79,712   39,135   21,721   29,881

Table 2 lists how many bugs we collected for each of the subject systems, and out of how many issues they were selected (the remaining issues did not report bugs, so these statistics paint a valid picture of the bug population for these systems at collection time).

Figure 5: The languages and file formats involved in bug fixes

Figure 5 breaks down the bugs that have been fixed by the languages used in the fixed files. Over half of the bug fixes involve C++ (112 of 219). The remaining 107 fixes use a diversity of languages, many of which are domain-specific (e.g., Package XML, Launch XML, URScript) and typically lack associated analysis tools. Figure 6 shows the number of languages involved in each fix. We find that 200 fixes (91%) are limited to a single language. That is, while failures may span components written in different languages, fixes are usually restricted to a single language.

Figure 6: The number of languages involved in bug fixes
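A sketch of how such a per-fix language breakdown (as in Figs. 5 and 6) can be derived from the changed file paths. The extension-to-language mapping here is illustrative and ours, not the one used to produce the figures.

```python
# Sketch: mapping changed files to languages and counting single-language fixes.
from pathlib import PurePosixPath

EXT_TO_LANG = {".cpp": "C++", ".cc": "C++", ".h": "C++", ".hpp": "C++",
               ".c": "C", ".py": "Python", ".launch": "Launch XML",
               ".xml": "XML", ".yaml": "YAML", ".cmake": "CMake"}

def languages_of_fix(changed_files: list[str]) -> set[str]:
    """The set of languages touched by one bug fix."""
    return {EXT_TO_LANG.get(PurePosixPath(f).suffix, "other") for f in changed_files}

def single_language_fixes(all_fixes: list[list[str]]) -> int:
    """How many fixes are confined to a single language (cf. Fig. 6)."""
    return sum(1 for files in all_fixes if len(languages_of_fix(files)) == 1)
```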

In the dataset, 118 failures occur at run-time and 38 at start-up time, so in total 156 bugs that have been fixed also have execution-time failures. Only 15 of these are accompanied by a test case. (We automatically identified bug fixes that add or modify tests by checking the paths of the changed files. We consider a fix commit to be accompanied by a test case if it adds or changes a file that contains the word test in its path, e.g., test/foo.py or test_foo.py. We manually inspected the remaining fixes to confirm that they did not include a test case.)
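The path-based heuristic described in the parenthesis can be expressed directly; the sketch below is a straightforward rendering of it (the example file names are ours).

```python
# Sketch: a fix is considered accompanied by a test case if any changed file
# has "test" in its path.
def adds_or_changes_test(changed_files: list[str]) -> bool:
    return any("test" in path.lower() for path in changed_files)

# e.g. adds_or_changes_test(["src/driver.cpp", "test/foo.py"])  -> True
#      adds_or_changes_test(["src/driver.cpp"])                 -> False
```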

Figure 7: The number of files involved in a bug fix

Figure 7 shows an overview of the number of files that were fixed for each bug. The number is based on a manual removal of unrelated changes from bug-fixing commits. Almost two thirds of bug fixes affect only a single file. Specifically, 64% of bug fixes are confined to a single file (141 of 219), 19% span two files (41 of 219), and 17% (37 of 219) change three or more files.

Figure 8: The diff size of a bugfix; we truncated 11 bugs with size above 200 lines

In order to approximate the size of each bug fix, we measure the number of lines in the change differences across their fixing commits; see Fig. 8. As fixing commits may contain unrelated changes (e.g., opportunistic refactoring), the size of the change difference is greater than or equal to the size of the bug fix, and therefore represents a conservative upper bound. Four bug fixes consist solely of file renamings and have an associated diff size of zero. We find that more than 50% of bug fixes have diffs that consist of 12 lines or fewer, and 75% have a change difference that is 50 lines or smaller.
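The diff sizes can be reproduced from the repositories with git itself; the sketch below sums the added and deleted lines that `git diff --numstat` reports between a fixing commit and its parent (for multi-commit fixes, the per-commit sizes would be summed). The function name is ours.

```python
# Sketch: an upper bound on fix size from `git diff --numstat`.
import subprocess

def diff_size(repo_dir: str, fix_sha: str) -> int:
    out = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--numstat", f"{fix_sha}^", fix_sha],
        capture_output=True, text=True, check=True).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        # Binary files are reported as "-"; pure renames contribute zero lines.
        if added != "-":
            total += int(added) + int(deleted)
    return total
```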

Limitations

The Credibility (Shenton 2004; Sikolia et al. 2013) of this study has been ensured by careful selection of the systems to be analyzed (Section 2), by depending on qualified researchers for bug selection and analysis, by the use of established methods for both archival and coding of bugs (Section 3), by employing peer scrutiny, and by grounding the study in existing work (Section 7). All authors were involved in bug gathering, selection, and analysis. One author is a domain expert on ROS with considerable experience in this FOSS community, the others have prior experience with ROS and with the kind of studies presented in this paper. All bugs were analyzed by at least two authors and the results were cross-referenced and discussed among all. Corrections were made by consensus. The coding was done according to established practices (Saldaña 2015; Linneberg and Korsgaard 2019) and by all authors in parallel, in multiple sessions. The final classification of bugs and the resulting codebook have been extensively discussed and checked for consistency by all authors. Finally, the preliminary results of the study were presented at ROSCon, the main conference of the ROS community (Timperley and Wasowski 2019). Received comments were taken into account while further refining the study.

We used purposive sampling, so some negative impact on the Transferability is expected. The subject repositories were selected based on the role the packages have in ROS-based products. Care was taken to select a qualitatively diverse set. We do not claim that this set of packages is representative of the entire ROS landscape, or of the wider robotics software. The quantitative results are not directly generalizable to these wider contexts; they are presented as descriptive statistics of the dataset, not as general conclusions. Even with these restrictions, however, the set of bugs described in ROBUST is diverse enough to be representative of the types of systems that were analyzed. Conclusions can be made qualitatively about the presence of particular kinds of bugs in other ROS packages. Furthermore, while the selection of repositories was purposeful, the identification of bugs and fixes was not: bugs and fixes were always reported and contributed by developers, maintainers, or users of the systems under analysis, not by the authors. The developers were treated as historical oracles, and their assessments of whether something is a bug, a fix, or neither were taken at face value.

To increase Dependability, we detail how we gathered the dataset (Sections 2 and 3), how the historical images were built (Section 6), and how bug reports are structured, analyzed (Section 3), and stored (Section 6). All bug reports link back to the source data for each bug: the affected source code repositories, the original issues, the code contributions fixing the bug, and the state of the involved repositories before and after merging the fix. Bug reports also include timestamps for these events, whenever they were present and identifiable, and the full analysis as performed by the authors. All of the source material is made available open-source, online. Such traceability increases dependability, facilitating evaluation of the research methods and results (Shenton 2004).

To warrant Confirmability, the dataset was built from issue reports written by developers, maintainers, and users. Some of these reports are unambiguous and leave no margin for researcher bias or interpretation, mostly because of the detail of the report and the language used by the reporter. For instance, bugs related to the build phase of the software can hardly be mistaken for runtime issues. Others contain little description of the fault or the manifested failures. To minimize bias, we involved all authors in the analysis of such cases, relying on the buggy source code and the fix, until a consensus was reached, often over several meetings. The commit history of the ROBUST repository can be used to reconstruct how our bug reports changed over time as a result of repeated analysis and discussion. Other aspects of our method can also be audited, as it is fully explained and all source code is available online. Despite this, we acknowledge that a small number of bugs could be classified differently by another party, for two reasons. First, we did not interview issue reporters to confirm whether our judgement matches theirs. Second, relying on the source code, especially on the changes introduced by a bug fix, might not tell a clear story, because some commits might affect code that is not pertinent to fixing a particular bug. After completing our analysis, we estimated the accuracy of our labels by sampling 30 bugs from the dataset and recoding them according to our final taxonomy. Comparing the differences between the labels, we find that the original taxonomy had missing labels for three bugs and incorrect labels for a further three bugs, yielding a bug-level accuracy of 80% across our sample. Note that only one label was incorrect or missing for each of those bugs (out of an average of 3.47 labels), giving a label-level accuracy of 97%.

Conclusion

In this paper, we presented ROBUST, a study and accompanying dataset of 221 bugs in Robot Operating System software. We systematically collected, documented, and analyzed bugs across 7 popular ROS projects, and produced Docker images that allow researchers building QA tools and techniques for robotics to interact with those bugs. We classified faults and failures within a taxonomy we constructed for this purpose, based on a qualitative analysis of our dataset, highlighted findings of particular interest to the software engineering research community, and discussed the ramifications of our results.
