Duplicate bug report detection (DBRD) is one of the significant problems of software triage systems, which receive end-user
bug reports. DBRD needs automation using artificial intelligence techniques like information retrieval, natural language processing,
text and data mining, and machine learning. There are two models of duplicate detection. The first model uses machine
learning techniques to learn the features of duplication between pairs of bug reports.
The second model, called IR-based, uses a similarity metric such as REP or BM25F to rank the top-k bug reports that are most similar to
a target bug report. The IR-based approach behaves like the k-nearest neighbor algorithm of machine learning.
This study reviews a decade of duplicate detection techniques and their pros and cons. In addition, the validation performance metrics
used to evaluate them are studied.
Nowadays, software triage systems (STS) such as Bugzilla are
an indispensable tool for large projects, especially open-source
ones such as Open Office, Mozilla Firefox, Eclipse, and Android.
The main task of an STS is to help the development team during the
maintenance phase by collecting end-user requests, such as bug
reports and suggestions, and handling them. An STS has many
important tasks, such as prioritizing bug reports, detecting
duplicates, assigning bug reports to developers, and tracking the
status of bug reports until they are fixed [1]. Every bug report
consists of various data fields (DF), which can be categorized as
follows:
I. Identity DFs, such as the unique identifier of the bug report,
the identifier of its master bug report (the report that this
one duplicates), and the identifier of the developer who is
responsible for handling the bug report.
II. Categorical DFs, such as the company, product, component,
and status of the bug report, which group the bug report into
specific categories.
III. Textual DFs, which contain the main end-user request
described as a text message in short or long form, e.g.,
the title or description.
IV. Temporal DFs, which record the date and time of reporting,
assigning, solving, and other events related to the bug report.
Since about 30%-60% of the bug reports in an STS are duplicates
[2,3], automatic duplicate bug report detection (ADBRD) is one of
the major problems of STSs. ADBRD needs artificial intelligence
techniques such as information retrieval, natural language
processing, machine learning, and text and data mining. This study
focuses on ADBRD methods: it reviews their methodologies,
compares them, and suggests their potential usage [4].
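The four categories of data fields listed above can be pictured as a single record type. The following is only a minimal sketch: the class name, the field names, and the helper `same_master` are hypothetical and do not come from Bugzilla or from any reviewed study.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class BugReport:
    # I. Identity DFs
    bug_id: int
    master_id: Optional[int]    # report this one duplicates; None if it is not a duplicate
    assignee_id: Optional[int]  # developer responsible for the report, if assigned
    # II. Categorical DFs
    product: str
    component: str
    status: str
    # III. Textual DFs
    title: str
    description: str
    # IV. Temporal DFs
    reported_at: datetime

    def is_duplicate(self) -> bool:
        """A report is a known duplicate when it points to a master report."""
        return self.master_id is not None


def same_master(a: BugReport, b: BugReport) -> bool:
    """Ground-truth label for a pair: both reports resolve to the same master."""
    root_a = a.master_id if a.master_id is not None else a.bug_id
    root_b = b.master_id if b.master_id is not None else b.bug_id
    return root_a == root_b
```

Under this assumed layout, the master identifier is what later supplies the real duplicate or non-duplicate label for any pair of reports.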
There are two major methodologies for automatic duplicate
bug report detection (ADBRD):
Information retrieval (IR)-based methodology of
automatic duplicate bug report detection (ADBRD)
The first methodology is the information retrieval-based
approach, whose procedure is shown in Figure 1. Box 1 holds the
raw dataset of bug reports, which is preprocessed in box 2 to
handle null values, unify the data types of some fields such as
version and priority (preferably converting them to numerical
values), remove stop words from textual fields, stem textual
fields, and correct typos in textual DFs [5,8]. The result is the
comparison-ready dataset of box 3. Then, in box 4, any bug report
can be selected as the target bug report whose duplicates should
be found. Usually, the target is a newly created bug report. The
target bug report of box 5 is then compared with the other bug
reports. Most data fields of bug reports, especially textual data
fields, cannot be compared with a simple equality operator, so
feature extraction methods are required to calculate their
similarity. There are many feature extraction methods for the
various data fields, such as the equality operator for nominal
categorical DFs, the difference (subtraction) operator for
temporal and numerical categorical DFs, information retrieval-based
operators such as the term frequency and inverse document
frequency of each term for textual DFs, contextual features that
show the similarity of a bug report to a specific context [9],
and so on [1,4]. The feature extraction phase of box 6 returns a
numerical vector consisting of many similarity metrics, shown as
box 7. Next, the similarity of the vectors is calculated using
cosine, Manhattan, Jaccard, BLEU, Dice, and similar formulas to
find the most similar bug reports, shown as box 8. There are many
heuristic similarity metrics and semi-heuristic learning algorithms
for this phase, such as REP [10], MSWGW [11], and Topic Sim [12].
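The ranking step described above can be illustrated with a small sketch. It uses plain TF-IDF weights with cosine similarity on a single textual field as a stand-in; it is not REP, BM25F, or any other metric from the reviewed studies. The stop-word list, the function names, and the three sample reports are assumptions made for illustration, and no stemming or typo correction is applied.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "on", "in", "when", "is", "to", "of"}  # tiny illustrative list


def tokenize(text):
    """Lower-case, split on whitespace, strip punctuation, and drop stop words."""
    words = (w.strip(".,:;!?()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_similar(target_text, repository, k=3):
    """Rank repository reports (dict id -> text) by similarity to the target report."""
    ids = list(repository)
    docs = [tokenize(repository[i]) for i in ids] + [tokenize(target_text)]
    vectors = tfidf_vectors(docs)
    target_vec = vectors[-1]
    scored = [(cosine(target_vec, vec), i) for i, vec in zip(ids, vectors[:-1])]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by id
    return [(i, round(score, 3)) for score, i in scored[:k]]


repository = {
    101: "Browser crashes when opening a PDF file",
    102: "Toolbar icons are blurry on high DPI screens",
    103: "Crash on opening PDF attachment in browser",
}
print(top_k_similar("Browser crash when I open a PDF", repository, k=2))
```

In this toy repository, report 103 ranks above report 101 because it shares one more term with the target. Without stemming, "crash" and "crashes" count as different terms, which is one reason the preprocessing of box 2 matters.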
The real duplication status of each bug report is known from
its merge identity data field, so in box 9 we can check whether
our prediction was correct. The methodology can then be evaluated
and the validation performance metrics measured in box 10, whose
results are shown in box 13. Four outcomes are possible, depending
on the real status of a bug report and our prediction, as shown
in Table 1. The validation performance metrics are calculated from
these four outcomes. The true-positive rate (TPR), true-negative
rate (TNR), false-positive rate (FPR), and false-negative rate (FNR)
are calculated from the four outcomes of Table 1 as in (1). Another
well-known validation metric is accuracy (2), which is the ratio
of correct predictions to the total number of bug report pairs.
Precision (3) is the ratio of correctly predicted duplicates to
all predicted duplicates. Recall (4) is the ratio of correctly
predicted duplicates to all actual duplicates. The F1-measure (5)
is the harmonic mean of precision and recall.
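As a worked illustration of these metrics, the sketch below computes them from the four outcome counts. It assumes the standard confusion-matrix definitions for Eqs. (1) to (5), for example TPR = TP / (TP + FN). The function names and the toy labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count the four outcomes for paired boolean labels (True = duplicate)."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def validation_metrics(tp, tn, fp, fn):
    """Rates, accuracy, precision, recall, and F1 from the four outcome counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "TPR": recall,                      # true-positive rate equals recall
        "TNR": safe_div(tn, tn + fp),
        "FPR": safe_div(fp, fp + tn),
        "FNR": safe_div(fn, fn + tp),
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "F1": safe_div(2 * precision * recall, precision + recall),
    }


# Toy example: 8 pairs, 3 of which are real duplicates
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
counts = confusion_counts(actual, predicted)
for name, value in validation_metrics(*counts).items():
    print(f"{name}: {value:.3f}")
```

With these toy labels the counts are TP = 2, TN = 4, FP = 1, and FN = 1, so accuracy is 6/8 = 0.75 while precision, recall, and F1 are all 2/3. Because non-duplicate pairs usually dominate, accuracy alone can look better than the duplicate-specific metrics.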
The second methodology is the machine learning (ML)-based
approach, whose process is shown in Figure 2. Boxes 1 to 3 are
the same as in Figure 1, but here, after the ready dataset is
built, some pairs of bug reports are selected in box 4, and the
selected pairs of box 5 are used to extract various features in
box 6. Every pair consists of numerical comparison features and a
label with two values, duplicate or non-duplicate, as in box 7.
The feature vectors are then divided into two separate sets,
called train and test. The train set is used in box 8 to train a
machine learner such as a decision tree, neural network, deep
learner, Naïve Bayes, linear regression, and so on. The trained
model is a duplicate finder that has learned the features of
duplication. The duplicate finder model of box 9 can then be used
to predict the test set in box 10, and the predicted status of
box 11 can be compared with the real status, as in Table 1, to
compute the validation performance metrics of Eqs. (1) to (5).
Finally, there are some hybrid approaches that use both IR-based
and ML-based methods. Related state-of-the-art works that use
these two types of ADBRD are reviewed in Table 2.
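The ML-based process can be sketched end to end on toy data. The learner below is a decision stump (a depth-1 decision tree), chosen only because it is short enough to read in full; the reviewed studies use richer learners and many more features. The two features, the six sample reports, and the every-third-pair split are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap between two texts (a simple textual similarity feature)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pair_features(r1, r2):
    """Comparison features for one pair: title similarity and component equality."""
    return [jaccard(r1["title"], r2["title"]),
            1.0 if r1["component"] == r2["component"] else 0.0]


def is_duplicate_pair(r1, r2):
    """Real label: both reports resolve to the same master report."""
    root1 = r1["master"] if r1["master"] is not None else r1["id"]
    root2 = r2["master"] if r2["master"] is not None else r2["id"]
    return root1 == root2


def train_stump(X, y):
    """Pick the (feature, threshold) rule 'x[feature] >= threshold' with fewest errors."""
    best = None
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        for t in ((a + b) / 2 for a, b in zip(values, values[1:])):
            errors = sum((x[j] >= t) != label for x, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]


def predict(model, x):
    feature, threshold = model
    return x[feature] >= threshold


reports = [
    {"id": 1, "master": None, "component": "auth", "title": "app crash on login"},
    {"id": 2, "master": 1, "component": "auth", "title": "crash on login screen"},
    {"id": 3, "master": 1, "component": "auth", "title": "login crash app"},
    {"id": 4, "master": None, "component": "ui", "title": "dark theme colors wrong"},
    {"id": 5, "master": 4, "component": "ui", "title": "wrong colors in dark theme"},
    {"id": 6, "master": None, "component": "ui", "title": "export to csv fails"},
]

# Select pairs, extract features, and attach the real labels
pairs = list(combinations(reports, 2))
X = [pair_features(a, b) for a, b in pairs]
y = [is_duplicate_pair(a, b) for a, b in pairs]

# Deterministic split: every third pair is held out as the test set
test_idx = [i for i in range(len(pairs)) if i % 3 == 0]
train_idx = [i for i in range(len(pairs)) if i % 3 != 0]

model = train_stump([X[i] for i in train_idx], [y[i] for i in train_idx])
predictions = [predict(model, X[i]) for i in test_idx]
correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
print(f"model={model}, test accuracy={correct / len(test_idx):.2f}")
```

On this toy data the stump learns to split on title similarity rather than on component equality, because report 6 shares a component with reports 4 and 5 without duplicating them. The predictions can then be scored against the real labels in the same way as for the IR-based approach.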
Figure 1: The methodology of duplicate bug report detection using information retrieval techniques.
Figure 2: The methodology of duplicate bug report detection using machine learning techniques.
Table 2: Review of duplicate detection models and their metrics.
This study reviews the methodologies of automatic duplicate
bug report detection (ADBRD), including the information retrieval
(IR)-based approach and the machine learning (ML)-based approach.
The IR-based approach is mostly used for online ADBRD and the
ML-based approach for offline applications, even though both can
be used in either setting. The behavior of the IR-based approach
is also similar to the k-nearest neighbor (k-NN) algorithm of
machine learning, which makes it a special case of the ML-based
approach. However, most analyses of the IR-based approach
concentrate on the details of the parameter k in k-NN, which has
made this approach well known and kept it separate. It also seems
that many authors were not familiar with the k-NN algorithm, so
they implemented its details themselves, with custom modifications
in selecting the bug reports for comparison or in the similarity
metric, and introduced new heuristic similarity formulas. Some
studies use a combination of both approaches. Because the
experimental parameters in the state of the art differ, it is
difficult to judge which approach is more useful and accurate.
Finally, future work is to find an accurate and fast ADBRD method.
Liu K, Tan HBK, Chandramohan M (2012) Has this bug been reported? In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering pp. 28.
Alipour A (2013) A Contextual Approach Towards More Accurate Duplicate Bug Report Detection. Master of Science, Department of Computing Science, University of Alberta, Faculty of Graduate Studies and Research.
Soleimani Neysiani, Babamir SM (2019) Effect of Typos Correction on the Validation Performance of Duplicate Bug Reports Detection. Presented at the 10th International Conference on Information and Knowledge Technology (IKT), Tehran, Iran, 2020: 1-2.