Duplicate Detection Models for Bug Reports of Software Triage Systems: A Survey

Behzad Soleimani Neysiani; Seyed Morteza Babamir

ISSN: 2643-6744

Current Trends in Computer Sciences & Applications

Review Article, Open Access

Volume 1 - Issue 5

Behzad Soleimani Neysiani* and Seyed Morteza Babamir

Department of Software Engineering, University of Kashan, Iran

Received: November 22, 2019; Published: December 17, 2019

*Corresponding author: Department of Software Engineering, Faculty of Computer & Electrical Engineering, University of Kashan, Kashan, Esfahan, Iran

DOI: 10.32474/CTCSA.2019.01.000123

Abstract

Duplicate bug report detection (DBRD) is one of the significant problems of software triage systems, which receive end-user bug reports. DBRD needs to be automated using artificial intelligence techniques like information retrieval, natural language processing, text and data mining, and machine learning. There are two models of duplicate detection. The first model uses machine learning techniques to learn the features of duplication between pairs of bug reports. The second model, called IR-based, uses a similarity metric like REP or BM25F to rank the top-k bug reports that are most similar to a target bug report. The IR-based approach behaves like the k-nearest neighbors algorithm of machine learning. This study reviews a decade of duplicate detection techniques and their pros and cons. In addition, the metrics used to validate their performance are studied.

Keywords: Duplicate Detection Model; Machine Learning; Bug Report

Introduction

Nowadays, software triage systems (STS) like Bugzilla are an indispensable tool for huge projects, especially open-source ones such as Open Office, Mozilla Firefox, Eclipse, and Android. The main task of an STS is to support the development team during the maintenance phase by collecting end-user requests, such as bug reports and suggestions, and handling them. An STS performs many important tasks, such as prioritizing bug reports, detecting duplicates, assigning bug reports to developers, and tracking the status of bug reports until they are fixed [1]. Every bug report consists of various data fields (DFs), which can be categorized as follows:
I. Identity DFs, such as the unique identifier of the bug report, the identifier of its master bug report (the report that this one duplicates), and the identifier of the developer responsible for handling it.
II. Categorical DFs, such as company, product, component, and status, which group the bug report into specific categories.
III. Textual DFs, which contain the main end-user request described as a short or long text message, e.g., title or description.
IV. Temporal DFs, which record the date and time of reporting, assigning, solving, and other events related to the bug report.

Since about 30%-60% of the bug reports in an STS are duplicates [2,3], automatic duplicate bug report detection (ADBRD) is one of the major problems of STSs. ADBRD needs artificial intelligence techniques like information retrieval, natural language processing, machine learning, and text and data mining. This study focuses on ADBRD methods: it reviews their methodologies, compares them, and suggests their potential usage [4].
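As an illustration of the four data-field categories listed above, a bug report could be represented roughly as follows. This is only a sketch: the class name `BugReport` and all field names are hypothetical and do not correspond to the schema of Bugzilla or any other triage system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class BugReport:
    # Identity DFs (field names are illustrative assumptions)
    bug_id: int
    master_id: Optional[int]    # identifier of the master report, if this one is a duplicate
    assignee_id: Optional[int]  # identifier of the responsible developer, if assigned
    # Categorical DFs
    product: str
    component: str
    status: str
    # Textual DFs
    title: str
    description: str
    # Temporal DFs
    reported_at: datetime

    def is_duplicate(self) -> bool:
        """A report counts as a duplicate when it points to a master report."""
        return self.master_id is not None


report = BugReport(
    bug_id=102, master_id=57, assignee_id=None,
    product="Firefox", component="Bookmarks", status="RESOLVED",
    title="Crash when saving bookmark",
    description="The browser crashes after I press Save in the bookmark dialog.",
    reported_at=datetime(2019, 11, 22, 9, 30),
)
print(report.is_duplicate())
```

Keeping the master identifier on the report itself mirrors the way the real duplication status is later read back when predictions are checked.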

Methodologies of Automatic Duplicate Bug Report Detection

There are two major methodologies for automatic duplicate bug report detection (ADBRD):

Information retrieval (IR)-based methodology of automatic duplicate bug report detection (ADBRD)

The first methodology is the information retrieval-based approach, whose procedure is shown in Figure 1. Box 1 holds the raw dataset of bug reports, which is preprocessed in box 2 to handle null values, unify the data types of fields like version and priority (preferably converting them to numerical values), remove stop words from textual fields, stem textual fields, and correct typos in textual DFs [5,8], making the reports ready for comparison in box 3. Then, in box 4, any bug report can be selected as the target bug report whose duplicates should be found. Usually, the target is a newly created bug report. The target bug report of box 5 is then compared with the other bug reports. Most data fields of bug reports, especially textual ones, cannot be compared with a simple equality operator, so feature extraction methods are required to calculate their similarity. There are many feature extraction methods for the various data fields, such as the equality operator for nominal categorical DFs, the difference (subtraction) operator for temporal and numerical categorical DFs, information retrieval-based operators like the term frequency and inverse document frequency of each term for textual DFs, contextual features that show the similarity of a bug report to a specific context [9], and so on [1,4]. The feature extraction phase of box 6 returns a numerical vector consisting of many similarity metrics in box 7. Next, the similarity of the vectors is calculated using cosine, Manhattan, Jaccard, BLEU, Dice, and similar formulas to find the most similar bug reports in box 8. Many heuristic similarity metrics and semi-heuristic learning algorithms have been proposed for this phase, like REP [10], MSWGW [11], and Topic Sim [12]. The real duplication status of each bug report is known from its merge identity data field, so the prediction can be checked as true or false in box 9.
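The retrieval steps described above can be sketched in a few lines. The example below is a simplified illustration, not REP, BM25F, or any surveyed method: it uses plain TF-IDF weights with cosine similarity on titles only, a tiny hand-picked stop-word list, and no stemming or typo correction. All function names and the sample titles are assumptions made for the illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "in", "on", "when", "is", "to", "of", "after"}


def tokenize(text):
    """Lowercase, keep runs of letters, and drop stop words (stand-in for box 2)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency, so every term gets a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_similar(target_index, texts, k=2):
    """Rank all other reports by similarity to the target and keep the top k."""
    vectors = tfidf_vectors([tokenize(t) for t in texts])
    target = vectors[target_index]
    scored = [(cosine(target, vec), i) for i, vec in enumerate(vectors) if i != target_index]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by index
    return [(i, round(score, 3)) for score, i in scored[:k]]


reports = [
    "Crash when saving bookmark",
    "Toolbar icons are blurry on high DPI screens",
    "Browser crashes after saving a bookmark",
    "Cannot print page in landscape mode",
]
ranking = top_k_similar(2, reports, k=2)  # report 2 plays the role of the new target
print(ranking)
```

Because no stemming is applied here, "crash" and "crashes" do not match, which hints at why the preprocessing steps of box 2 matter for the final ranking.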
Then the methodology can be evaluated, and the validation performance metrics can be measured in box 10, whose results are the validation performance metrics in box 13. Four modes arise from the real status of a bug report and our prediction, as shown in Table 1. The validation performance metrics are calculated from these four modes. The true-positive rate (TPR), true-negative rate (TNR), false-positive rate (FPR), and false-negative rate (FNR) are calculated from the four modes of Table 1 as in (1). Another well-known validation metric is accuracy (2), which is the ratio of correct predictions to the total number of bug report pairs. Precision (3) is the ratio of correctly predicted duplicates to all predicted duplicates. Recall (4) is the fraction of correctly predicted duplicates among all actual duplicates. The F1-measure (5) is the harmonic mean of precision and recall.
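The metrics described above can be computed directly from the four modes. The sketch below uses the standard textbook definitions, which are assumed to match Eqs. (1) to (5); the function names and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count the four modes: true positive, false negative, false positive, true negative."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fn, fp, tn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def validation_metrics(actual, predicted):
    tp, fn, fp, tn = confusion_counts(actual, predicted)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "TPR": recall,                       # TP / (TP + FN)
        "TNR": safe_div(tn, tn + fp),        # TN / (TN + FP)
        "FPR": safe_div(fp, fp + tn),        # FP / (FP + TN)
        "FNR": safe_div(fn, fn + tp),        # FN / (FN + TP)
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,              # TP / (TP + FP)
        "recall": recall,
        "F1": safe_div(2 * precision * recall, precision + recall),
    }


# True means "duplicate"; each position is one compared pair of bug reports.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
metrics = validation_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy run there are 2 true positives, 1 false negative, 1 false positive, and 4 true negatives, so accuracy is 6/8 while precision, recall, and F1 are all 2/3.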

Table 1: Modes of duplicate detection.

Machine learning (ML)-based methodology of automatic duplicate bug report detection (ADBRD)

The process of ML-based ADBRD is shown in Figure 2. Boxes 1 to 3 are the same as in Figure 1, but in this case, after the ready dataset is built, some pairs of bug reports are selected in box 4, and the selected pairs in box 5 are used to extract various features in box 6. Every pair consists of numerical comparison features and a label with two modes, duplicate or non-duplicate, in box 7. The feature vectors are then divided into two separate sets called train and test. The train set is used to train a machine learner like a decision tree, neural network, deep learner, Naïve Bayes, linear regression, and so on in box 8. The resulting model is a duplicate finder that has learned the features of duplication. The duplicate finder model of box 9 can then be used to predict the test set in box 10, and the predicted status of box 11 can be compared with the real status, as in Table 1, to evaluate the validation performance metrics of Eqs. (1) to (5). Finally, some hybrid approaches use both IR-based and ML-based methods. Related state-of-the-art works that use these two types of ADBRD are tabulated in Table 2.
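The pair-based process described above can be illustrated with a toy example. The sketch below is only one possible instance: it assumes two hand-made features (Jaccard overlap of titles and component equality), a fixed train/test split, and a small logistic regression trained by stochastic gradient descent as the learner. The reports, pairs, and function names are invented for the illustration and do not come from any surveyed dataset or method.

```python
import math


def jaccard(a, b):
    """Word-level Jaccard similarity between two titles."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def pair_features(r1, r2):
    """Box 6: turn a pair of reports into a numerical comparison vector."""
    return [
        jaccard(r1["title"], r2["title"]),                   # textual similarity
        1.0 if r1["component"] == r2["component"] else 0.0,  # categorical equality
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=3000, lr=0.5):
    """Box 8: fit a logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            error = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias) - label
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(model, features):
    """Boxes 9 to 11: label a pair as duplicate when the probability is at least 0.5."""
    weights, bias = model
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias) >= 0.5


reports = {
    1: {"title": "crash when saving bookmark", "component": "bookmarks"},
    2: {"title": "browser crash saving bookmark", "component": "bookmarks"},
    3: {"title": "print preview is blank", "component": "printing"},
    4: {"title": "blank print preview page", "component": "printing"},
    5: {"title": "toolbar icons blurry", "component": "ui"},
    6: {"title": "blurry toolbar icons on startup", "component": "ui"},
    7: {"title": "bookmark folder cannot be renamed", "component": "bookmarks"},
    8: {"title": "print dialog ignores page range", "component": "printing"},
}
# (first report, second report, label) with label 1 = duplicate, 0 = non-duplicate.
train_pairs = [(1, 2, 1), (3, 4, 1), (1, 3, 0), (1, 7, 0), (4, 6, 0)]
test_pairs = [(5, 6, 1), (2, 5, 0), (3, 8, 0)]

X_train = [pair_features(reports[a], reports[b]) for a, b, _ in train_pairs]
y_train = [label for _, _, label in train_pairs]
model = train_logistic(X_train, y_train)

predictions = [predict(model, pair_features(reports[a], reports[b])) for a, b, _ in test_pairs]
print(predictions)
```

The predicted labels for the test pairs can then be compared with the real labels to fill the four modes of Table 1. A real study would use many more pairs and features, and typically a library implementation of the learner.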

Figure 1: The methodology of duplicate bug report detection using information retrieval techniques.

Figure 2: The methodology of duplicate bug report detection using machine learning techniques.

Table 2: Review of duplicate detection models and their metrics.

Conclusion

This study reviews the methodologies of automatic duplicate bug report detection (ADBRD), including the information retrieval (IR)-based approach and the machine learning (ML)-based approach. The IR-based approach is mostly used for online ADBRD and the ML-based approach for offline applications, even though both can be used in either setting. Also, the behavior of the IR-based approach is similar to the k-nearest neighbor (k-NN) algorithm of machine learning, which makes it a special case of the ML-based approach. However, most analyses of the IR-based approach concentrate on the details of the parameter k in k-NN, which has made it well known as a separate approach. It seems that many authors were not familiar with the k-NN algorithm, so they implemented its details themselves, with custom modifications in selecting bug reports for comparison or in the similarity metric, and introduced new heuristic similarity formulas. Some studies use a combination of both approaches. Because the experimental parameters in the state of the art differ, it is difficult to judge which approach is more useful and accurate. Finally, future work is to find an accurate and fast ADBRD method.
