Behzad Soleimani Neysiani* and Seyed Morteza Babamir
Received: November 22, 2019; Published: December 17, 2019
*Corresponding author: Department of Software Engineering, Faculty of Computer & Electrical Engineering, University of Kashan, Kashan, Esfahan, Iran
Duplicate bug report detection (DBRD) is one of the significant problems of software triage systems, which receive end-user
bug reports. DBRD needs automation using artificial intelligence techniques like information retrieval, natural language processing,
text and data mining, and machine learning. There are two models of duplicate detection as follows: The first model uses machine
learning techniques to learn the features of duplication between pairs of bug reports.
The second model called IR-based that use a similarity metric like REP or BM25F to rank top-k bug reports that are similar to a target bug report. The IR-based approach has identical behavior like the k-nearest neighborhood algorithm of machine learning. This study reviews a decade of duplicate detection techniques and their pros and cons. Besides, the metrics of their validationperformance will be studied.
Keywords: Duplicate Detection Model; Machine Learning; Bug Report
Nowadays, software triage systems (STS) like Bugzilla are
an impartible tool for huge projects -especially open source- like
Open Office, Mozilla Firefox, Eclipse, Android, and so on. The main
task of STS is to help the development team for the maintenance
phase and get end-user requests like bug reports and suggestions
and deal with them. There exist many important tasks for software
triage systems like prioritizing bug reports, detecting duplicates,
assigning bug reports to developers, track the status of bug reports
until they can be fixed . Every bug report consists of various data
fields (DF) which can be categorized as follows:
I. Identical DFs like unique identity of bug report, identity of its master bug report which this bug report is duplicate and similar to that one, identity of developer which is responsible to deal with this bug report.
II. Categorical DFs such as company, product, component, and status of bug report which are grouping the bug report in specific categories.
III. Textual DFs contain the main end-user request which is described as a text message in short or long description, e.g., title or description.
IV. Temporal DFs show the Date Time of reporting, assigning, solving and other events about the bug report. Since there are about 30%-60% duplicate bug reports in a STS [2,3], automatic duplicate bug report detection (ADBRD) is one of major problems of STSs. ADBRD needs artificial intelligence techniques like information retrieval, natural language processing, machine learning, text, and data mining. This study focuses on methods of ADBRD and review its methodologies, compare them, and suggest their potential usage .
There are two major methodologies for automatic duplicate bug report detection (ADBRD):
The first methodology called the information retrieval-based approach, which its procedure is shown in Figure 1. In the first box, the raw dataset of bug reports exists which should be preprocessed in box 2 till deal with null values, unify the data type of some fields like version and priority and preferably change them to numerical, remove stop words from textual fields, stemming textual fields, correcting the typos in textual DFs [5,8], and make them ready for comparison as box 3. Then in box 4, every bug report can be selected as a target bug report, which its duplicates should be found. Usually, the target bug report is a new bug report which is created newly. Then target bug report of box 5 should be compared with other bug reports. Almost all data fields of bug reports cannot be simply compared with an equal operator, especially textual data fields, so, some feature extraction methods are required to calculate their similarity. There are many feature extraction methods based on various data fields like equality operator for nominal categorical DFs, difference or subtract operator for temporal and numerical categorical DFs, information retrieval-based operators like term frequency and inversed document frequency of each term for textual DFs, contextual features which show the similarity of bug report to a special context , and so on [1,4]. The feature extraction phase of box 6 returns a numerical vector consist of many similarity metrics as box 7. Now, the similarity of vectors should be calculated using cosine, Manhattan, Jaccard, BLEU, dice, and same formulas to find top similar bug reports as box 8. There are many heuristic similarity metrics and semi-heuristic learning algorithms in this phase, like REP , MSWGW , and Topic Sim . We know the real duplication status of bug reports based on the merge identity data field of bug reports, and now, we can check our prediction, which was true or not in box 9. Then the evaluation of the methodology can be done, and validation performance metrics can be measured in box 10, which its results show the validity performance metrics in box 13. Four modes can be held based on the real status of a bug report, and our prediction, which are shown in Table 1. The validation performance metrics are calculated based on these four modes. The true-positive rate (TPR), true-negative rate (TNR), false-positive rate (FPR), and false-negative rate (FNR) are calculated based on the four modes of Table 1 as (1). In addition, another famous validation metric is accuracy as (2) that show ration of true prediction based on total pairs of bug reports. The Precision metric as (3) is the ratio of true duplicate predicted on the total duplicate predicted. The recall ratio (4) is the fraction of true duplicate predicted based on total actual duplicates. The F1- measure as (5) is a harmonic average of Precision and recall.
The process of ML-based ADBRD is shown in Figure 2. The boxes 1 to 3 are similar to (Figure 1), but in this case, after building the ready dataset, some pairs of bug reports are selected in box 4, and the selected pairs in box 5 are used to extract various features in box 6. Every pair consist of numerical comparison features and a label with two modes: duplicate or non-duplicate in box 7. Now, the features vectors are divided into two separated sets called train and test. The train set is used to learn a machine learner like decision tree, neural network, deep learner, Naïve Bayes, linear regression, and so on in box 8. The built ML is a duplicate finder which is learned the features of duplication. Then the duplicate finder model of box 9 can be used to predict the test set in box 10, and the predicted status of box 11 can be used as (Table 1) to evaluate the validation performance metric as Eqs. (1) to (5). Finally, there are some hybrid approaches that used both IR-based and ML-based methods. The review of related works in state-of-the-art, which used these two types of ADBRD are tabulated in (Table 2).
This study reviews the methodologies of automatic duplicate bug report detection (ADBRD), including information retrieval (IR)-based approach and machine learning (ML)-based approach. The IR-based approach is mostly used for online ADBRD and MLbased in used for offline application, even though both of them can be used for online and offline applications. Also, IR-based approach behavior is similar to k-nearest neighbor (k-NN) algorithm of machine learning which makes this approach be a special case in ML-based approach, but, the most analysis of IR-based approach on the details of parameter K in k-NN make this approach famous and isolate, especially it seems many authors were not familiar with k-NN algorithm, so, they insist on implementing its detail their selves with custom modification in selecting bug reports for comparison or changing the similarity metric and introduce new heuristic similarity formulas. Some studies use a combination of both approaches. Because the parameters of experiments in stateof- the-art are different, it is difficult to judge which approach is more useful and accurate. Finally, future work is to find an accurate and fast ADBRD.
Bio chemistryUniversity of Texas Medical Branch, USA
Department of Criminal JusticeLiberty University, USA
Department of PsychiatryUniversity of Kentucky, USA
Department of MedicineGally International Biomedical Research & Consulting LLC, USA
Department of Urbanisation and AgriculturalMontreal university, USA
Oral & Maxillofacial PathologyNew York University, USA
Gastroenterology and HepatologyUniversity of Alabama, UK
Department of MedicineUniversities of Bradford, UK
OncologyCirculogene Theranostics, England
Radiation ChemistryNational University of Mexico, USA
Analytical ChemistryWentworth Institute of Technology, USA
Minimally Invasive SurgeryMercer University school of Medicine, USA
Pediatric DentistryUniversity of Athens , Greece
The annual scholar awards from Lupine Publishers honor a selected number Read More...