INTRODUCTION
In the present days, multiple diseases originated from livestock were reported, and some of them are directly and indirectly connected to human food, which includes fatal zoonoses such as bovine spongiform encephalopathy and avian influenza. Also, a number of food-borne diseases have been reported due to poor sanitation and contamination of toxic microbiome in the process [
1]. Besides the food-related diseases, younger generation tends to lack trust on the market, and tries to be aware of what they eat and where the products came from [
1]. Therefore, consumers’ demand for the food safety has been increasing recently, accompanied by the suspicion on food quality. Today, consumers demand well-established distributions as well as reliable food sources. To meet the demand, the tracking system of livestock product has gained attention since the early 2000s [
1].
Traceability simply refers to the tracking system to identify animals from the manufacturer to the retailer, which provides a credibility of food origin [
2]. Even though tracking of the animal product has been improving for years (i.e. an animal tag), it still depends on the uncertainty of information provided by the manufacturers. Genetic traceability is one of the various types of traceability determined based on DNA-level data, such as microsatellite and single nucleotide polymorphism (SNP) [
1]. DNA molecules are constant and robust, and these traits make DNA remain unaffected regardless of the process from manufacture to distribution.
Based on the fact that genetic mutations occur and some of those mutations are fixed through generations, reflecting the evolutionary history, SNPs can be variable and represent from individual to population. For years, many SNP markers have been discovered and shared. Using the enormous DNA reference databases, it is possible to define genetic variants and to distinguish species and breeds. Furthermore, a set of SNPs called haplotype can be used to individualize samples and detect genetic diseases.
Selection is one of the driving forces of evolution, caused by extrinsic factors. In the meaning of evolution, different selection pressures can generate several different subsets of the parent population, called population differentiation. Artificial population differentiation within domesticated species such as pig is defined as breeding. As domesticated pig (
Sus scrofa) breeds are strongly inbred to maximize economic efficiency, thus within-breed variation shows relatively transient signal than between-breed variation does [
3–
6]; this makes genetic traceability challenging under the existence of a variety of breeds. Therefore, researches about effective applications of genetic traceability account for a big part of livestock science, along with decreasing price of SNP microarrays and next-generation sequencing.
In the swine industry, breeding farms use purebreds such as Landrace, Duroc, and Yorkshire. Since the purebreds are highly inbred to be selectively developed in economical traits, their genetic diversities within breeds are lower than that of wild boar [
4]. That is, the genome of each breed has distinct patterns of inbreeding signal, generally called a selective sweep [
6–
8]. It also implies that between-breed genetic differences in commercial pigs are relatively higher in some genomic regions related to the economic traits than conserved regions [
4,
5]. Considering the effect of between-breed variation, multi-breed prediction of the farm is harder than single-breed prediction. Nevertheless, the origin of pigs can be distinguished by SNP markers because the breeding farms use different selection indexes to orient their genomic selection, suggesting that various purebreds from one breeding farm can share farm-specific signals on their genomes [
9,
10]. It is necessary to build the prediction model based on SNPs which are shared by the farm, rather than the breed.
Feature selection is an important step of model construction to achieve the maximum parsimony [
11]. It reduces the number of attributes that are less relevant to the model, leading to a lower risk of overfitting and better prediction through the model, which may uncover the underlying insights of data [
11]. Least absolute shrinkage and selection operator (LASSO) is one of the regularization methods to optimize coefficients using a penalty [
12]. While LASSO builds and optimizes the model, it performs feature selection by setting irrelevant coefficients to zero [
12]. Since LASSO was reported to be effective in biomarker selection in the two-class problem and can be applied to logistic regression models, it has been extended to the multi-categorical variables such as farm tracing, which could be complicated in machine learning-based classification [
12,
13]. With that said, we performed LASSO to select powerful features and optimize the coefficients of a model for farm prediction.
In this study, multi-breed SNP chip data of 6,018 pigs over four breeding farms was used to demonstrate the farm tracing. Under the multi-breed conditions, we attempted to select features which offset the between-breed differences and distinguish their breeding farms. The main objective of the study is to maximize the performance of farm prediction using feature selection strategy. Therefore, we i) generated the additional set of breed-offsetting SNPs (boSNPs), ii) selected the best features through LASSO multinomial logistic regression, and iii) evaluated the highest-scored features using multiclass classifiers in the present study.
RESULTS AND DISCUSSION
Farm traceability refers to a process of predictive model construction to track the origin among various pigs. In this study, we attempted to achieve the farm tracing of breeding farm in 6,018 pigs of three breeds (Yorkshire, Landrace, and Duroc) using SNP chip data from four Korean local breeding farms: SJ, SCB, KB, and DB (
Table 1).
First, we filtered out genotyping errors from SNP chip data through pre-screening. After pre-screening, we pruned off SNPs with high collinearities using LD and attain 3,899 SNPs (null SNPs). Additionally, another set of 179 SNPs with reduced breed differences (boSNPs) were extracted to theoretically revise the breed variety of data (
Table 2). Next, we simulated multinomial regression models to predict the farm information, and we selected the most relevant features upon constructed model using LASSO penalty. During simulations, we compared the performances of two SNP sets and combined them into integrated SNPs. Finally, we used integrated SNPs to obtain the accuracies of models and feature scores of selected SNPs. The top 100 of highest-scored SNPs were selected as representative features, and two multiclass classifiers were employed to evaluate these features. The workflow is described as a diagram in
Figure 1.
The data used in this study was produced by each farm individually. Before genotype pruning, the merged data has genotyping rates of 0.83. After filtering process to minimize the missing genotypes, 14,670 SNPs remained. Since our data was composed of three purebreds, it is required to select features presenting less between-breed difference; genome of purebred pig contains variants with strong breed-specificity because of inbreeding [
3,
6]. We assumed that the effect of breeding might obscure model construction because it is relatively bigger than that of breeding farms [
5]. To solve the problem, we suggested extracting an additional set of SNPs with less between-breed gaps. PCA using 179 boSNPs showed reduced breed-specificity compared to the result using nullSNPs as shown in
Figure 2A, suggesting that the breed difference was neutralized in boSNPs. This observation was also concordant with ancestry population differentiation analyses with K value of 4 (
Supplementary Figure S1). Otherwise, farms were not distinguishable in both SNP sets, showing the necessity of further feature selection (
Figure 2B).
Therefore, we simulated feature selection using LASSO multinomial logistic regression with boSNPs and nullSNPs, respectively. LASSO is a statistics-based feature selection method using penalization of regression coefficients, to simplify the model. As an embedded feature selection method, LASSO simultaneously constructs a model and selects features used in the model [
12]. Because every single run of LASSO multinomial logistic regression possibly results in one distinctive set of features as a local optimum, we performed 100 simulations of 10-fold CV. The regularization parameter of LASSO, the lambda (λ), must be defined by the user from the list of iteratively generated lambdas. Hence, we used cv.glmnet, an internal function of glmnet which empirically suggests the best lambda value (lambda.1se) and the corresponding number of features, and the results for both SNP sets were depicted in
Figure 3. Using the forward stepwise method, the test showed that model of boSNPs reached lower convergent accuracy (96%) than nullSNPs (99%). Notably, boSNPs showed higher accuracy below the feature count of 53. The CVM of both SNP sets also showed a similar pattern of accuracy. This similarity suggests that boSNPs may contain powerful features to distinguish farm, but the small number of boSNPs limits accurate prediction.
The result implied that both nullSNPs and boSNPs contain farm-segregating features with smaller between-breed differences (
Figure 3). When there is a large number of relevant features that can be distinguished by the response variable, selecting the same number of features from subsets would not affect the convergent accuracies [
21]. In other words, feature selection in subsets including different numbers of features may result in different numbers of relevant features to fit the model. Considering the significant difference in the sizes of subsets (3,899 and 179), different convergent values of accuracy and CVM between boSNPs and nullSNPs indicates that boSNPs was not sufficient to complete the farm prediction, which is largely due to the lack of feature variety. In addition, the density of feature counts corresponding to optimal lambda values (lambda.1se) differed between two SNP sets for 90 and 110, which also supports the insufficiency problem of boSNPs. On the other hand, the superior of the accuracy of boSNPs at a smaller number of features represents the stronger relevance of individual SNPs; this supports the efficacy of boSNPs in farm classification.
In order to improve the model, we decided to integrate boSNPs into nullSNPs. We then simulated LASSO on 4,002 integrated SNP set using “forward” method, and we measured the accuracy of prediction along with the number of features involved (
Figure 4A). The model using integrated SNPs showed slightly better prediction compared to that of nullSNPs, and was similarly accurate where 100 features were included (98.5%). To understand the improvement in integrated SNP set, we divided the feature score results using integrated SNPs into three types of constituent: bo-exclusive SNP, null-exclusive SNP, and common SNP. Then we calculated the feature score on all models using 60 to 120 features and displayed them in a density plot (
Figure 4B). A total of 1,341 features has non-zero scores ranging from 0.01 to 557 in 100 simulations (
Supplementary Table S1), and the score distribution of features is positively skewed. The mean of features score is 37.60, and 75% of the features have feature scores less than 15. Scores in top 25% quartile (Q4) ranged from 15 to 557, showing the extreme bias of feature score. We also calculated densities of three types of the constituent for each quartile (
Figure 4C,
Table 3). The percent of bo-exclusive SNPs in Q4 quartile (29%) was larger than the expected ratio of 25% or that of null-exclusive SNPs (25%). Extreme bias of feature scores suggests the necessity of a second selection in terms of removal of low-impact SNPs and reaching the global optimum of LASSO regression. Considering the number of features corresponding to the optimal lambdas, a total of 100 highest-scored features were selected for further evaluation. Of the top 100 highest-scored SNPs, bo-exclusive SNPs showed higher scores than other two constituents (
Figure 4B), supporting that the boSNP set contains powerful individual features.
In order to assure the prediction performance, the highest-scored SNPs were evaluated by two machine-learning based multiclass classifiers, SVM and KNN. Since the recall, precision, and F1-measure had been considered useful in estimating performances of feature selection [
20], we used those measures to evaluate the result. First, a weighted sum of measures by farm were estimated between two classifiers. KNN showed better performance than SVM, but evaluations using all 100 features showed values greater than 0.99 for all three measures in both classifiers (
Figure 5A). Performance using all of 4,002 SNPs were measured to be compared; In the evaluation using SVM, values of measures using all SNPs are 0.78, 0.83, and 0.69 respectively for recall, precision, and F1-measure, whereas, KNN provides approximately 1 for all three measures. Considering that KNN with k = 1 could lead to overfitting, top 100 highest-scored SNPs showed substantial performance with less danger of overfitting. It is practical in the reduction of features, which simplifies the model, and by reducing time and cost in the perspective of industry and research. Performance measures were also characterized by farm using SVM and KNN (
Figure 5B, 5C). Approximately 75% of base accuracy (zero feature included) can be explained by the unequal sample size of each farm, especially because large portion of samples was from SJ. Despite the imbalance between sample sizes, farm prediction was successful when performed as more features were included, which is also supported by PCA plot of top 100 features (
Supplementary Figure S2).
As consumers’ demand for food safety increases, traceability of the origin of the animal product becomes more and more crucial. DNA markers actualized more reliable tracking concept called genetic traceability. To employ DNA markers for the farm prediction, variety of the population must be considered before selecting relevant features. In this study, we illustrated that LASSO regression greatly reduces the number of features down to 100 and demonstrates successful prediction under the existence of breed diversity. A successful application of LASSO in farm prediction can complement the existing food traceability, as it reduces the cost of traditional traceability system and delivers more accurate information to the consumers. In conclusion, our method presenting the maintenance of farm traceability with a smaller number of SNP markers provides valuable information to the swine industry and the researches with regard to the genetic traceability.