ISSN: 2637-4676
Mehmet Metin Ozguven*, Gungor Yilmaz, Kemal Adem and Cemil Kozkurt
Received: January 12, 2019; Published: January 22, 2019
Corresponding author: Mehmet Metin Özgüven, Tokat Gaziosmanpaşa University Department of Biosystems Engineering, Tokat, Turkey
DOI: 10.32474/CIACR.2019.06.000229
This study applied two data mining methods, the Multilayer Perceptron Neural Network (MLPNN) and the Support Vector Machine (SVM), to early generation selection in a potato variety breeding program that began with hybrid combinations. The data set was built from selection criteria based on macroscopic observations and measurements, which are used to identify ineligible clones and eliminate them through negative selection. The data come from clones grown in 2016 as part of project no. TUBITAK-TOVAG 113O928. A total of 703 potato clones from 12 hybrid combinations were used. To identify the clones to be selected, two models were created using three attributes of each clone (tuber yield, number of tubers and average tuber weight). To identify the clones to be eliminated, two further models were created using two attributes of each clone (eyes depth and eyes pit depth). The data set was given as input to the MLPNN and SVM classifiers, and the sensitivity, specificity and accuracy of each model were compared. The highest success was achieved in the 2-class models, where the MLPNN classifier was more successful. The study shows that data mining methods can be used for early generation selection in cultivar improvement studies.
Keywords: Potato; Clonal Selection; Breeding; Agricultural Information Technologies; Multilayer Perceptron Neural Network; Support Vector Machine
Today, intelligent machines and the production systems that control them have begun to replace traditional production methods. Agricultural production has already passed through mechanization, automation, control and information technology; data mining methods are now an important research area in agricultural science and a promising tool for problems in the agricultural sector that need solutions and improvement [1-12]. Data mining has been used in agriculture to identify plant, weed, disease and soil pattern classes, and numerous studies have addressed fertility, biomass, chlorophyll, breeding environment and agricultural drought forecasts; various analyses, modeling and simulations; mineral elements, toxic elements, and organic and inorganic pollution; and the evaluation of biological markers. Summary data for some of these studies are presented in Table 1.
The potato is one of the plants that spread naturally in the Andes of Peru [13]. It was brought to Turkey through Russia and the Caucasus, and it was first cultivated in Turkey in the highland climate of the eastern Black Sea and Erzurum regions [14]. Today, potatoes can be grown in almost every region of Turkey. Annual production is 376 million tons on 19 million hectares worldwide [15] and 4.8 million tons on 143 thousand hectares in Turkey [16]. The potato is widely used in human nutrition and industry because of its carbohydrate, protein, mineral and vitamin content. Consequently, countries with intensive potato farming have 100-400 commercially registered potato varieties. Although Turkey is among the most important potato producing countries, it has very few native varieties and therefore depends on foreign sources for basic seed varieties [17,18]. Various improvement efforts are under way to solve this problem. In potato, variety improvement usually starts with crossing and continues with clonal selection [19,20].
The main purpose of potato improvement is to increase yield and quality. In addition, new varieties must suit market demands, tolerate temperature and drought stress, resist various diseases and pests, and store well [21]. Potato improvement programs have four fundamental stages: crossing, early generation selection, late generation selection, and advanced generation selection based on regional experiments. The whole process can take about 10-12 years [19,22]. This study aims to contribute to early generation selection by applying data mining methods to selection criteria based on macroscopic observations and measurements of clones from a variation set created through crossing studies for the development of native potato varieties. The aim was to propose a new method, as an alternative to classical early generation selection, by building models with the Support Vector Machine (SVM) and Artificial Neural Networks (ANN) from the data sets and using these models to classify clones in accordance with the selection criteria.
The study material was a subset of the clones grown in 2016 under polycarbonate greenhouse conditions at the Gaziosmanpaşa University Faculty of Agriculture, Department of Field Crops, within the scope of the TUBITAK-TOVAG 113O928 project. A total of 703 potato clones from 12 hybrid combinations were used. The clones were grown in 26-liter pots containing a mixture of peat (2/3) and perlite (1/3). The ambient temperature was set to 22°C during the day and 15-16°C at night. Necessary maintenance procedures such as fertilization, irrigation and insecticide application were carried out without interruption.
After harvest, tuber yield per plant (g/plant), average tuber weight (g), number of tubers per plant, and the depths of the eyes and eyes tips of the harvested tubers were determined. At this stage of the selection process, clones with the desired marketing and yield-related characteristics were retained by positive selection, and clones that did not meet these criteria were eliminated by negative selection. Descriptive statistics of the potato clone data set are shown in Table 2.
In this research, experiments were carried out to classify the favorable clones and the clones to be eliminated by negative selection among the clones obtained from the crossing studies. To identify the selected clones, two models were created using three attributes of each clone: tuber yield (TY), tuber count (TC) and average tuber weight (ATW). Model-1 has 2 classes: selected (1) and other (0). Model-2 has 4 classes: average tuber weight (3), tuber count (2), tuber yield (1) and other (0). To identify the clones to be eliminated, two models were created using two attributes of each clone: eyes depth (ED) and eyes pit depth (EPD). Model-3 has 2 classes: to be eliminated (1) and other (0). Model-4 has 4 classes: three elimination classes, namely tuber shape (TS) (3), stem scar depth (2) and eyes depth (1), plus the other (0) class. The generated data set was given as input to the Multilayer Perceptron Neural Network (MLPNN) and SVM classifiers, and the sensitivity, specificity and accuracy of each model were compared. The MLPNN model used 10 neurons in the hidden layer, the sigmoid function for activation, and the backpropagation algorithm for training. The data were partitioned at random so that the model could be tested without bias: the data sets were divided into 40% training, 20% validation and 40% test data, and the average success rate was determined. A Gaussian function was used as the kernel function in SVM classification.
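The 40/20/40 partition described above can be sketched as follows. This is only an illustration: the function name `split_indices`, the fixed seed and the rounding rule (fractions truncated, remainder assigned to the test set) are assumptions, since the text does not say how the random partition was implemented.

```python
import random

def split_indices(n, train=0.4, val=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut them into training,
    validation and test lists (the test set takes the remainder)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seed is an assumption, for repeatability
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 703 is the number of clones reported in the study
train_idx, val_idx, test_idx = split_indices(703)
print(len(train_idx), len(val_idx), len(test_idx))  # 281 140 282
```

Repeating the split with different seeds and averaging the resulting scores would be one way to obtain an average success rate.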
Classification is the most basic operation in the prediction part of data mining, so many of the problems encountered arise at this step. SVM is one of the algorithms used to classify data sets accurately. First proposed in 1995 [23], SVM is a supervised learning model for classifying binary or multi-class data sets that are linearly separable or linearly non-separable. Because SVM treats classification as a quadratic optimization problem, it reduces the number of operations in training and provides a considerable speed advantage over other algorithms [24]. SVM is therefore successful on high-volume data sets as well as on high-dimensional problems with few data [25]. For linearly separable data, a separating hyperplane is constructed from the support vectors; for linearly non-separable data, kernel functions are used together with the support vectors. These are usually polynomial and Gaussian kernels [26]. SVM is used in many classification areas, such as image processing, financial forecasting, biological species detection and medical examination [27-31]. Figure 1 shows the hyperplane determined by the decision function estimated by the SVM method for linearly separable data.
The equations defining the support vectors in SVM for a linearly separable binary classification problem are given in Equation 1 and Equation 2.
Here y is the class label, w is the weight vector, and b is the bias term. Maximizing the margin of the optimal hyperplane requires minimizing the norm of w, as given in Equation 3 [23].
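As an illustration of the two kernel families named above, the sketch below implements a Gaussian and a polynomial kernel in plain Python. The parameter names and default values (`gamma`, `degree`, `coef0`) are assumptions chosen for the example; the study does not report its kernel parameters.

```python
import math

def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree

# Identical points give the maximum Gaussian similarity of 1.0
print(gaussian_kernel([1.0, 2.0], [1.0, 2.0]))    # 1.0
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))  # 144.0
```

A kernel replaces the plain dot product, which is what lets the classifier draw a non-linear boundary in the original attribute space.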
Equation 3 gives the following:
The solution of Equation 4 with Lagrange equations gives Equation 5.
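The numbered equations are not reproduced in this copy of the text. For orientation, the standard hard-margin SVM formulation for linearly separable data is sketched below, using the symbols y, w and b defined above. The sample index i, the sample count N, the input vectors x_i and the Lagrange multipliers α_i are notation assumed here, and the exact correspondence with the original Equations 1-5 is not confirmed by the source.

```latex
% Margin conditions for the two classes
\mathbf{w}\cdot\mathbf{x}_i + b \ge +1 \quad \text{for } y_i = +1
\mathbf{w}\cdot\mathbf{x}_i + b \le -1 \quad \text{for } y_i = -1

% Margin maximization written as a minimization of the norm of w
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}

% Both margin conditions combined into one constraint
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \ge 0, \qquad i = 1,\dots,N

% Lagrangian of the constrained problem, with multipliers \alpha_i \ge 0
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  - \sum_{i=1}^{N} \alpha_i \left[ y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \right]
```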
The decision function of the support vector machine for a two-class problem is given in Equation 6 [32].
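A two-class SVM decision function is commonly written as the sign of a kernel-weighted sum over the support vectors plus the bias. The sketch below assumes that textbook form with a Gaussian kernel; the support vectors, multipliers and bias are hypothetical toy values, not results from the study.

```python
import math

def rbf(x, z, gamma=0.5):
    """Gaussian kernel between two points (gamma is an assumed value)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf):
    """Return +1 or -1 from sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1

# Hypothetical toy model: one support vector per class
svs = [(0.0, 0.0), (2.0, 2.0)]
ys = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0

print(svm_decision((1.8, 1.9), svs, ys, alphas, bias))  # 1
print(svm_decision((0.1, 0.2), svs, ys, alphas, bias))  # -1
```

Only the support vectors enter the sum, so prediction cost depends on their number rather than on the size of the full training set.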
Another classification model with strong performance in data mining is the MLPNN. The MLPNN model consists of an input layer, a hidden layer and an output layer. The neurons in the hidden layer contain a non-linear activation function. Figure 2 shows a sample neuron used in MLPNN. The components i_1, i_2, ..., i_(n-1), i_n of each data point are multiplied by the weights w_1, w_2, ..., w_(n-1), w_n, respectively, before reaching the neuron, and the products are first summed in the linear processing unit. The output of the linear processing unit is passed to the output layer through the activation function in the non-linear processing unit [33]. The operations of the linear and non-linear processing units in the hidden layer are given in Equation 7 and Equation 8 [33].
The function f given in Equation 8 is the activation function of the non-linear processing unit. Sigmoid, hyperbolic tangent and step functions are frequently used activation functions [34].
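The two processing units of a neuron can be sketched as a weighted sum followed by an activation function. The bias argument and the function names below are assumptions added for the example; the text itself only describes inputs, weights and the activation function f.

```python
import math

def sigmoid(v):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-v))

def step(v):
    """Step activation: 1.0 for non-negative input, otherwise 0.0."""
    return 1.0 if v >= 0 else 0.0

def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Linear unit (weighted sum) followed by the non-linear unit."""
    net = sum(w * i for w, i in zip(weights, inputs)) + bias
    return activation(net)

print(neuron([1.0, 2.0], [0.0, 0.0]))                         # 0.5
print(neuron([1.0, 2.0], [0.5, -0.25], activation=math.tanh)) # 0.0
print(neuron([1.0], [-1.0], activation=step))                 # 0.0
```

Swapping the `activation` argument is enough to compare the sigmoid, hyperbolic tangent and step functions on the same weighted sum.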
Different variants of the backpropagation algorithm are used to determine the weights between the MLPNN neurons [35]. The backpropagation equations are given in Equations 9, 10 and 11.
The error between the output value y’ and the actual value y is used to update the weights. The update usually relies on the gradient descent method, which converges toward the target [36]. The update of each weight (Δw) is found by distributing the calculated error energy, inversely proportional to the present weights w, over all the weights entering the corresponding neuron [37]. The process is repeated for every data point, and the solutions found can be averaged through the parameter η.
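As a minimal sketch of an error-driven weight update, the code below trains a single sigmoid neuron by gradient descent on the squared error (the textbook delta rule). It is not necessarily the backpropagation variant used in the study: the learning rate η, the number of epochs, the bias term and the toy data set (logical OR) are all assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def mean_squared_error(samples, w, b):
    return sum((y - predict(x, w, b)) ** 2 for x, y in samples) / len(samples)

def train_neuron(samples, eta=0.5, epochs=200):
    """Update the weights after every sample: w <- w + eta * delta * x."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            y_hat = predict(x, w, b)
            # error times the derivative of the sigmoid at the current output
            delta = (y - y_hat) * y_hat * (1.0 - y_hat)
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
            b += eta * delta
    return w, b

# Toy data (logical OR), purely illustrative
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
error_before = mean_squared_error(data, [0.0, 0.0], 0.0)
w, b = train_neuron(data)
error_after = mean_squared_error(data, w, b)
print(error_before, error_after)
```

In a multilayer network the same idea is applied layer by layer, with the output error propagated backward to the hidden-layer weights.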
Four models were created to classify the selected clones and the clones to be eliminated, using the attribute data set obtained from the potato clones. The confusion matrices obtained with the MLPNN and SVM classifiers for Model-1, Model-2, Model-3 and Model-4 are given in Tables 3 & 4. In the Model-1 experiments on the classification of selected potato clones, 110 of the 669 clones belonged to the selected (1) class and the remaining 559 to the other (0) class. The MLPNN confusion matrix in Table 3 shows 39 misclassified clones, and the SVM confusion matrix in Table 4 shows 46. MLPNN was therefore more successful than SVM for Model-1. In the Model-2 experiments on the classification of selected potato clones, 72 of the 696 clones were in class 3 (average tuber weight), 34 in class 2 (tuber count), 31 in class 1 (tuber yield), and 559 in the other class. According to the MLPNN confusion matrix in Table 3, 22 of the 72 clones selected for average tuber weight, 13 of the 34 selected for tuber count, 21 of the 31 selected for tuber yield, and 39 of the 559 in the other class were misclassified (95 misclassified clones in total). According to the SVM confusion matrix in Table 4, 18 of the 72 clones selected for average tuber weight, 8 of the 34 selected for tuber count, 16 of the 31 selected for tuber yield, and 38 of the 559 in the other class were misclassified (80 misclassified clones in total).
SVM was therefore more successful than MLPNN for Model-2. In the Model-3 experiments on the selection of clones to be eliminated, 82 of the 703 clones were in class (1), which must be eliminated through negative selection, and 621 were in the other (0) class, which is the positive selection class. The MLPNN confusion matrix in Table 3 shows 22 misclassified clones, and the SVM confusion matrix in Table 4 shows 25. MLPNN was therefore more successful than SVM for Model-3. In the Model-4 experiments on clones to be eliminated by negative selection, 21 of the 703 clones were in class 3 (tuber shape), 39 in class 2 (stem scar depth) and 22 in class 1 (eyes depth), all of which are to be eliminated, and 621 were in the other class, which is the positive selection class. According to the MLPNN confusion matrix in Table 3, 12 of the 21 clones to be eliminated for tuber shape, 29 of the 39 to be eliminated for eyes pit depth, 12 of the 22 to be eliminated for eyes depth, and 20 of the 621 clones actually in the other class were misclassified (73 misclassified clones in total). According to the SVM confusion matrix in Table 4, 8 of the 21 clones to be eliminated for tuber shape, 23 of the 39 to be eliminated for eyes pit depth, 7 of the 22 to be eliminated for eyes depth, and 24 of the 621 clones actually in the other class were misclassified (62 misclassified clones in total).
As a result of the experimental studies, it was observed that SVM classifier was more successful for Model-4. The sensitivity, specificity, and accuracy values obtained with the classifiers used for 4 different models are given in Table 5.
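The three evaluation measures can be computed from confusion-matrix counts as sketched below. The clone totals and misclassification counts in `reported` are the ones stated in the text for each model and classifier; the binary counts passed to `sensitivity` and `specificity` at the end are hypothetical toy numbers, since the per-cell values of Tables 3 and 4 are not reproduced here.

```python
def sensitivity(tp, fn):
    """Share of actual positives that were detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that were correctly rejected."""
    return tn / (tn + fp)

def accuracy_from_errors(total, misclassified):
    """Share of all samples that were classified correctly."""
    return (total - misclassified) / total

# (total clones, misclassified clones) as stated in the text
reported = {
    ("Model-1", "MLPNN"): (669, 39), ("Model-1", "SVM"): (669, 46),
    ("Model-2", "MLPNN"): (696, 95), ("Model-2", "SVM"): (696, 80),
    ("Model-3", "MLPNN"): (703, 22), ("Model-3", "SVM"): (703, 25),
    ("Model-4", "MLPNN"): (703, 73), ("Model-4", "SVM"): (703, 62),
}
accuracy_pct = {key: round(100 * accuracy_from_errors(*counts), 1)
                for key, counts in reported.items()}
for (model, clf), pct in accuracy_pct.items():
    print(model, clf, pct)

# Hypothetical binary counts, only to show the two remaining measures
print(sensitivity(tp=8, fn=2), specificity(tn=85, fp=5))
```

For the better classifier of each model this gives 94.2% (Model-1, MLPNN), 88.5% (Model-2, SVM), 96.9% (Model-3, MLPNN) and 91.2% (Model-4, SVM).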
As Table 5 shows, MLPNN was more successful in Model-1 and Model-3, which have only 2 classes, whereas SVM was more successful in Model-2 and Model-4, which have 4 classes. The correct classification rates in the experiments on clones to be eliminated were higher than those in the experiments on clone selection, which suggests that the attributes chosen for the elimination models are more informative for classification. Sensitivity was lower than specificity in all models, which is believed to result from the small number of positive/negative selection clones relative to all clones. Model choice should therefore take into account the number of clones to be evaluated. With the 703 clones examined in this study, the MLPNN method was more successful in the 2-class models, whereas the SVM method was better in the 4-class models. In variety improvement studies, selection is a classical method of improvement. Selecting the genotypes that meet the specified criteria is called “positive selection”, and eliminating the unsuitable ones is called “negative selection”. In this research, conducted with 703 potato clones from an improvement program based on selection after crossing, clones were selected and eliminated according to marketable tuber features, such as tuber shape, skin smoothness, and eyes and eyes pit depth, in addition to features such as tuber count per plant and tuber weight.
This study proposes a new early generation selection method based on models built by applying the SVM and ANN data mining methods to the selection criteria data sets created from crossing studies for the improvement of native potato varieties. The results obtained by classifying the clone data sets with the four models (Model-1, Model-2, Model-3 and Model-4) were very similar to the results obtained by applying classical methods to the same clones and data sets. The MLPNN method was more successful in the 2-class models with positive and negative selections (Model-1 and Model-3), with success rates of 94.2% and 96.9%, respectively. The SVM method was more successful in the 4-class models (Model-2 and Model-4), with success rates of 88.5% and 91.2%, respectively. Based on these results, it was concluded that data mining approaches can be used alongside classical methods when deciding on positive or negative selection of clones in the early selection stages of potato improvement. Future studies will continue with the selected clones in the next generation, taking other criteria into account. It is also anticipated that success rates can be increased by considering different data mining methods, a larger number of clones and different attributes.
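The effect of class imbalance on these measures can be illustrated with a trivial baseline that assigns every clone to the majority class. The sketch below uses the class sizes stated in the text for Model-1 (110 selected, 559 other) and Model-3 (82 to be eliminated, 621 other); the baseline itself is an illustration added here, not an experiment from the study.

```python
def majority_baseline(n_positive, n_negative):
    """Scores of a rule that always predicts the larger class."""
    total = n_positive + n_negative
    if n_negative >= n_positive:
        return {"accuracy": n_negative / total,
                "sensitivity": 0.0, "specificity": 1.0}
    return {"accuracy": n_positive / total,
            "sensitivity": 1.0, "specificity": 0.0}

model_1 = majority_baseline(n_positive=110, n_negative=559)
model_3 = majority_baseline(n_positive=82, n_negative=621)
print(round(100 * model_1["accuracy"], 1))  # 83.6
print(round(100 * model_3["accuracy"], 1))  # 88.3
```

Such a rule reaches high accuracy and perfect specificity while detecting none of the positive clones, which is why sensitivity is worth reporting next to accuracy when one class is small.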
In this research, some of the data from the TUBITAK-TOVAG 113O928 project were used.