Lung Cancer Risk Prediction using Current Human
Databases: Strengths, Limitations and Future Directions

LOJ Medical Sciences

Research Article (ISSN: 2641-1725)

Volume 6, Issue 5

Andrew Xing¹, Zhixin Tang² and Zhiguang Huo²*

¹Buchholz High School, Gainesville, FL 32606
²Department of Biostatistics, Colleges of Public Health & Health Professions and Medicine, University of Florida, Gainesville, FL 32610

Received: June 17, 2024; Published: June 27, 2024

*Corresponding author: Zhiguang Huo, Department of Biostatistics, Colleges of Public Health & Health Professions and Medicine, University of Florida, Gainesville, Florida 32610, United States.

DOI: 10.32474/LOJMS.2024.06.000249

Abstract

Lung cancer is the leading cause of cancer-related death in the United States of America and worldwide. Lung cancer treatment not only has rather limited success but also imposes a tremendous financial burden. Alternative strategies are therefore urgently needed to manage lung cancer in a more patient-friendly manner. It is extremely important for the general public to understand the chronic nature of lung cancer and to be aware of the numerous risk factors that contribute to it, so that preventive approaches can be implemented across the general population to complement the current treatment paradigm and make lung cancer management more effective. A better understanding of the different risk factors will also help identify individuals at high risk of lung cancer for early preventive intervention, which may be more effective, more patient-friendly, and less costly. This manuscript discusses the various risk factors for lung cancer, including well-known risk factors, potential ones, and, importantly, emerging risk factors that are likely to have a greater influence on the younger generation. It also discusses the complex nature of lung cancer risk factors, the use of various population-based databases to identify them, and the limitations of these databases. Lastly, it outlines potential future directions for lung cancer risk factor evaluation and the need to integrate risk factors to identify individuals at higher risk of lung cancer.

Keywords: Lung cancer risk factor; Chronic nature; Tobacco smoke; Risk prediction; Human databases

Abbreviations: COPD: Chronic Obstructive Pulmonary Disease; TERT: Telomerase Reverse Transcriptase; CYP: Cytochrome P450 enzyme; UGT: Uridine 5'-diphospho-glucuronosyltransferase; SNP: Single Nucleotide Polymorphism; PLCO: Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial; NCI: National Cancer Institute; EAGLE: Environment And Genetics in Lung cancer Etiology study; CPS-II: Cancer Prevention Study II; CNV: Copy Number Variation; MRI: Magnetic Resonance Imaging; CRP: C-reactive protein; nAChR: nicotinic acetylcholine receptor

Introduction

Lung cancer is the most common cancer in men and the second most common cancer in women worldwide, with more than 2.2 million new cases in 2020 [1]. Besides being so common, lung cancer has a very low five-year overall survival rate in comparison with the other major types of malignancy, barely reaching 23% in the United States of America in 2023 [2], and such rates are even lower in many other countries and regions [3]. The rather poor clinical outcomes of patients with lung cancer are largely due to late diagnosis (the majority of lung cancers are at an advanced clinical stage upon diagnosis), the limited efficacy and significant toxicity of current therapeutic treatments, the rapid acquisition of drug resistance, and the associated high rates of disease recurrence and progression. As a result, lung cancer alone caused nearly 1.8 million deaths worldwide in 2020 [1] and has been the leading cause of cancer-related death among all malignancies for decades. In addition, current clinical management of lung cancer is associated with an intimidating financial burden. For instance, the cost of most targeted therapies for lung cancer is $100,000 or higher in the United States of America, while the cost of the recently developed immunotherapies can exceed $400,000 [4]. In addition to the disease itself, many patients with lung cancer and their families thus suffer from the associated financial toxicity, with significant out-of-pocket expenses and poorer financial well-being. This financial burden also contributes significantly to compromised quality of life and reduced treatment adherence [5], both of which are associated with the rather poor outcomes of lung cancer management.
Therefore, besides the continuous efforts to search for and develop more effective therapeutic treatments for lung cancer, which have been the central theme for decades and have involved significant investment and prohibitive financial burdens, paradigm-shifting strategies need to be developed and implemented to make lung cancer management more patient-friendly and cost-effective.

It is therefore very important to emphasize and disseminate the knowledge that lung cancer is a chronic disease: it typically takes several decades for lung cancer to evolve from initiation to a clinically detectable stage, during which minimal, if any, intervention is implemented under the current lung cancer management paradigm. Its chronic nature offers a great opportunity for early detection and early preventive intervention, which would be much more patient-friendly and, in the long run, more cost-effective, analogous to the successful management of many other chronic diseases, such as diabetes and cardiovascular conditions. While several early diagnostic tools have been developed and implemented to some extent in the clinic for lung cancer detection, they are mostly designed to detect lungs that already carry precancerous lesions or even early-stage cancers, which should arguably be classified as late-stage disease considering the decades-long evolution of lung cancer. With present practice and diagnostic tools, we thus still miss significant opportunities for truly early detection of lung cancer and preventive intervention. Achieving more efficient and accurate early detection requires a comprehensive and quantitative understanding of the potential risk factors for lung cancer: what the risk factors are, what their underlying mechanisms are, what potential surrogate readouts exist, how the risk factors interact and relate to one another, and ideally what their quantitative contributions to lung cancer development are from both a population and an individual point of view.
Answering these questions would require the prospective collection of longitudinal data from a large number of participants, given the chronic nature of lung cancer development, the intrinsic complexity of lung carcinogenesis, the high level of heterogeneity among human individuals, and the intrinsically random nature of genetic mutations, at least based on our current knowledge. Appropriate modeling, likely with artificial intelligence-driven approaches given the complexity of the data, is expected to be essential for efficiently analyzing and interpreting these longitudinal data, with the ultimate goal of integrating risk factors for more accurate lung cancer risk prediction. Such a risk prediction model could potentially be embedded in the current annual check-up and integrated with annual health information. It would be expected to identify, or at least enrich for, individuals at high risk of lung cancer early on, who could then receive patient-friendly and more cost-effective early preventive interventions.

To help achieve this goal, this work first reviews current knowledge about the different risk factors for human lung cancer and summarizes their application to lung cancer risk prediction in several human prospective cohorts. We then discuss the strengths and limitations of current approaches and outline future research directions, with the ultimate goal of achieving early detection and prevention of lung cancer, which is essential for improving lung cancer management.

Known Lung Cancer Risk Factors

Many risk factors have been proposed for lung cancer (Figure 1).

Figure 1: Different lung cancer risk factors and the need for their integration in lung cancer risk prediction.

Tobacco smoke is well accepted as the major risk factor for lung cancer and may have contributed to 80-90% of lung cancer cases. Over several decades, ample and compelling evidence has accumulated demonstrating a strong causal relationship between tobacco smoke exposure and lung cancer risk. First of all, before the mass production of tobacco products started in the late 19th century, lung cancer was a rare cancer [10].

Besides tobacco smoke, various genetic factors have been investigated for their potential contribution to lung cancer risk, including TERT [20], CYPs [21], UGTs [22], SNPs [23], rare germline variants [24], germline homozygosity [25], and copy number variations (CNVs) [26], to name a few. The results generally showed that these genetic factors alone account for only a very small portion of cancer risk heritability [24]; this holds even for polygenic risk models that evaluate and integrate multiple genetic risk factors [27]. These outcomes further substantiate the complex nature of lung cancer observed in the clinic, where lung cancer cases are not all driven by the same genetic defects. In addition, most of these genetic analyses, if not all, have not been rigorously validated in large population-based studies [27]. Similarly, many other risk factors have been evaluated, and in general their individual contributions to lung cancer risk appear to be limited or nonsignificant, suggesting that the integration of multiple risk factors may be essential for more effective lung cancer risk prediction, as discussed later.

Human Cohort Databases and Their Applications in Lung Cancer Risk Factor Evaluation and Risk Prediction

Because lung cancer risk prediction is complex and requires longitudinal and comprehensive data collected from a large number of participants, several population databases have been established. The data collected, particularly the longitudinal data, have the potential to help identify lung cancer risk factors, which can be used to enrich for individuals at high risk of lung cancer. In this section we briefly describe a few human cohort databases, analyze their strengths, identify potential limitations, and summarize their applications in human lung cancer risk prediction (Figure 2).

Figure 2: Representative human databases that have been used to identify and evaluate lung cancer risk factors.

The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial was a large randomized controlled trial designed and sponsored by the National Cancer Institute (NCI). The goal was to determine the effects of different screening methods on cancer-related mortality and secondary endpoints in men and women aged between 55 and 74. This trial enrolled approximately 155,000 participants between November 1993 and July 2001. Participants were individually randomized to the control arm or the intervention arm in equal proportions. Participants assigned to the control arm received usual care, whereas participants assigned to the intervention arm were invited to receive screening exams for prostate, lung, colorectal and ovarian cancers. Data were collected on cancer diagnoses through 2009 (median follow-up 11.3 years) and on mortality through 2018 (median follow-up 19.2 years). All participants were asked to complete a baseline questionnaire covering information such as demographics and medical history. Intervention arm participants were also asked to complete the Dietary Questionnaire at baseline. A second dietary questionnaire was introduced in December 1998. Blood samples and buccal cell samples were also collected from certain participants for research. Around 110,000 PLCO participants were genotyped as well for genetic analyses. After more than two decades of follow-up, the PLCO cohort has yielded lung cancer information on 1390 cases [20]. The collected information can be used to investigate lung cancer risk factors. For instance, the data have been used to develop protein-based risk biomarkers [28,29] and to evaluate the potential contributions of a low-fat diet and supplements [30-33]. Orloff et al. used the PLCO data to link extended germline homozygosity with lung cancer risk [25].
These analyses, however, did not examine the potential contributions to different lung cancer subtypes, which is likely a major weakness, since different subtypes of lung cancer may have different risk factors or different contributions from the same risk factor. Indeed, Sivakumar et al. compared the mutation patterns of two lung cancer subtypes using the PLCO data and identified completely different mutation landscapes [34], indicating the need to separate lung cancer subtypes in risk prediction and prevention.

The Environment And Genetics in Lung cancer Etiology (EAGLE) study is a population-based case-control study of lung cancer that includes 2100 primary lung cancer cases and 2120 healthy controls enrolled in Italy between 2002 and 2005 [35]. Its goal is to explore the full spectrum of lung cancer etiology, from smoking addiction to lung cancer outcomes, through the examination of epidemiological, molecular and clinical data. In addition to smoking data, a number of behavioral rating scales were administered, covering tobacco dependence, withdrawal, depression, anxiety, and alcohol dependence. These data have been explored to identify and evaluate different lung cancer risk factors, including gender, hormonal factors, certain gene copy numbers and microRNAs, family history, COPD, and even outdoor particulate matter [36-42]. The major limitations of this database are that data were collected at a single time point for each enrolled participant and that the sample size is rather small. Thus, some results based on this database are not consistent with other studies, and all of the results from this database remain to be validated in future studies.

The Cancer Prevention Study II (CPS-II), which began in 1982, is a prospective mortality study of approximately 1.2 million American men and women in all 50 states, the District of Columbia, and Puerto Rico. Each participant completed a four-page, confidential questionnaire. Baseline questions covered personal identifiers, height, weight, demographic characteristics, personal and family history of cancer and other diseases, use of medicines and vitamins, menstrual and reproductive history (women), occupational exposures, dietary habits, alcohol and tobacco use, and various questions regarding exercise and behavior. Within this cohort, a CPS-II Nutrition Survey cohort was established to obtain detailed information on dietary exposures, to update exposure information, and to conduct prospective cancer incidence follow-up in addition to mortality follow-up. Follow-up questionnaires were sent to the CPS-II Nutrition Survey cohort in 1997, 1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, and 2015. Ongoing cancer incidence follow-up for the CPS-II Nutrition Survey cohort is conducted by validating self-reported incident cancers using medical records or linkage with state cancer registries. Nearly 30,000 incident cancers were reported in the interval 1992 to 2005, which should include over 3,000 lung cancer cases. Given their longitudinal nature, these data could be powerful for examining the association of many surveyed factors (e.g., diet, lifestyle, and environment) with lung cancer incidence and thus for helping to identify risk factors. Judging by the number of peer-reviewed publications, however, their application has been limited; the reason remains unknown.
Given the limitations of each individual database, particularly the limited number of incident lung cancer cases, attempts have also been made to integrate data from multiple databases, on the assumption that the major risk factors for lung cancer are similar, if not the same, among the different cohorts. For instance, Landi et al. analyzed 14 databases, including PLCO, EAGLE, and CPS-II, for associations between lung cancer risk and different SNPs [20] but failed to identify any promising candidates. Similarly, Li et al. analyzed CNVs in EAGLE and PLCO in relation to lung cancer risk without much success [26]. The negative results from these studies could be due to complications that arise when integrating multiple databases. Specifically, it has been estimated that genetic factors contribute only ~30% of lung cancer risk, while environment is the major contributor. Given the potential interactions between environment and genes, it may not be appropriate to combine data from different environments, including data collected in different countries or in the same country during different periods of time, which are the typical differences among the cohorts in different databases. Thus, the validity of integrating data from different databases remains to be determined, particularly for populations from different environments or different time periods.
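One conventional way to probe whether estimates from different cohorts are compatible before pooling them is a heterogeneity check such as Cochran's Q with the I² statistic. The sketch below is an illustration only and is not the method used in the studies cited above. The per-cohort log odds ratios and standard errors are hypothetical values, and the function name `cochran_q` is an assumption introduced here.

```python
def cochran_q(estimates, std_errors):
    """Inverse-variance pooled estimate, Cochran's Q, and I^2 for per-cohort estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    # Q sums the weighted squared deviations of each cohort from the pooled estimate.
    q = sum(w * (est - pooled) ** 2 for w, est in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2 is the share of total variation attributed to between-cohort differences.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


# Hypothetical log odds ratios for one SNP in three cohorts (invented for illustration).
log_odds_ratios = [0.10, 0.45, -0.05]
standard_errors = [0.10, 0.12, 0.15]

pooled, q, i_squared = cochran_q(log_odds_ratios, standard_errors)
print(f"pooled log OR = {pooled:.3f}, Q = {q:.2f}, I^2 = {i_squared:.0%}")
```

A large I² would suggest that the cohorts disagree by more than sampling error alone, in which case a single pooled estimate may be misleading.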

The UK Biobank is a comprehensive database that collects a wide range of data from a longitudinal cohort of the general population in the United Kingdom. It contains demographic information, biological samples (blood, saliva and urine), cognitive function, verbal interview, eye measurements, genotyped SNPs, brain MRI, cognitive function summary, mental health, work environment, local environment, diet and alcohol summary, early life experience, education and employment, genomics, geographical and location data, heart MRI, linked health outcomes, physical measurement summary, self-reported medical conditions and many other factors that may be related to human health. Within each category, various parameters have been collected. Taking early life experience as an example, the following items have been collected: birth weight, breastfeeding status, comparative body size and height at age 10, maternal smoking around birth, being part of a multiple birth, and adoption status. This prospective cohort has enrolled ~400,000 participants, with periodic follow-ups to collect longitudinal data and disease outcomes. With such comprehensive data collection, numerous analyses have evaluated the predictive power of a wide range of potential lung cancer risk factors, including tobacco smoke, stressful life experience, inflammation, lung function, sleep, circadian rhythm, and many other risk factor candidates, with some positive indications [43-46]. For instance, CRP, an inflammation biomarker, has been associated with increased lung cancer risk across all lung cancer subtypes in the Biobank samples [47]. The level of bilirubin in the blood appeared to be associated with reduced lung cancer risk, although the subtypes of lung cancer were not differentiated [48]. Polygenic prediction has been employed for lung cancer risk prediction as well [27], even in the context of smoking [49].
Similarly, the interaction between genetics and smoking in lung cancer risk prediction has been explored via an unbiased approach [50]. Sleep [51, 52] and neurological function/stress [53, 54] have been explored as lung cancer risk factors with interesting results. Specifically, psychological stress increased lung cancer risk among non-smokers, light smokers and heavy smokers by 43.0%, 46.8%, and 31.8%, respectively [54], and a causal relationship was demonstrated in the same cohort as well [53]. This is consistent with human epidemiological data: a meta-analysis of 165 longitudinal studies found that individuals with mental health issues are at higher risk of lung cancer [55]. Integrating multiple risk factors, including stress, smoking and genetic status, appears to result in better lung cancer risk prediction [54]. However, systematic investigations with varied combinations of risk factor candidates have not been done; these are needed because some risk factors, such as stress and sleep, may be redundant or interact. Additional risk factors evaluated using the Biobank data include walking [56], green tea [57], beta-blockers [58], diabetic status [59], asthma [60], polygenic risk factors [61], plasma protein markers [62], telomere length [63] and many others [64-67]. Recently, Krishna et al. used the Biobank and reported an association of HLA-II heterozygosity with reduced risk of lung cancer, implying that genetic variation in immune surveillance, together with environmental exposures, is a key feature of cancer susceptibility [68].
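To make the idea of integrating several risk factors concrete, the following is a minimal sketch of one common form of integrated risk score, a logistic model. It is an assumption chosen for illustration, not the model used in [54]. The feature names, coefficients and intercept are all hypothetical and are not estimates from any cohort.

```python
import math

# Hypothetical log-odds coefficients, invented for illustration only.
COEFFICIENTS = {
    "pack_years_per_10": 0.30,   # smoking exposure, per 10 pack-years
    "high_stress": 0.35,         # 1 if high psychological stress, else 0
    "polygenic_score_sd": 0.20,  # polygenic score in standard deviations
}
INTERCEPT = -6.0  # hypothetical baseline log-odds


def predicted_risk(features):
    """Combine several risk factors into one probability with a logistic model."""
    linear = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))


low = predicted_risk({"pack_years_per_10": 0, "high_stress": 0, "polygenic_score_sd": 0})
high = predicted_risk({"pack_years_per_10": 4, "high_stress": 1, "polygenic_score_sd": 1.5})
print(f"low-exposure profile:  {low:.4f}")
print(f"high-exposure profile: {high:.4f}")
```

An additive model like this treats each factor as independent on the log-odds scale. Redundant or interacting factors, such as stress and sleep, would need interaction terms or a more flexible model.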

Interestingly, depression and anxiety have been found to contribute to increased risk of lung cancer but not of other cancers in this cohort [69]. Pettit et al. also analyzed a range of heritable traits as lung cancer risk factors [70]. Once again, a wide range of factors have been evaluated for their potential in lung cancer risk prediction, with many of them showing limited predictive power, demonstrating the challenges and complexity of lung cancer risk prediction. The limited predictive power of each individual risk factor also suggests that multiple factors need to be integrated, while factors associated with tobacco smoke may make greater contributions, such as daily tobacco exposure level, genetic factors associated with tobacco use and tobacco toxicant metabolism (for example, CYP2A6), and genetic factors related to nAChRs. At the same time, different risk factors may have interactions, redundancy, causal relationships, and other more complicated associations. Thus, none of these factors alone is sufficient or powerful enough to predict lung cancer risk, and their integration is likely necessary. The data in the Biobank offer a great opportunity to explore this potential, particularly given its longitudinal nature: the data will become more and more comprehensive, with more lung cancer cases, and hopefully more powerful to support such discoveries.

Similar to the UK Biobank, All of Us is another prospective population cohort, based in the USA. Its application in lung cancer risk exploration has been limited so far, potentially because of the short period of longitudinal data collection to date, since this cohort was established later than the UK Biobank. However, it will offer opportunities similar to those of the Biobank as more longitudinal data are collected. Besides the targeted risk factor analyses in these databases, unbiased omics techniques have also been employed to identify potential risk factors, with limited success.
Although the unbiased approach has many unique strengths, the sensitivity of these methods remains to be determined. It is also possible that no single parameter, such as a SNP, is powerful enough as an independent risk factor, as is the case for many of the targeted risk candidates evaluated. In addition, some unbiased profiling may need to be interrogated in the context of specific environmental conditions, such as smoking status: some SNPs involved in nicotine addiction and tobacco toxicant metabolism may be risk factors only in the context of tobacco smoke exposure, so such analyses would be valid only among the participants who smoke, not the whole population in the database.
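The point about smoking-specific SNP effects can be illustrated with a small worked example. The sketch below computes odds ratios within smoking strata and in the pooled population. All counts are hypothetical, invented so that the variant matters only among smokers, and the helper names `odds_ratio` and `pooled_counts` are assumptions introduced here.

```python
def odds_ratio(case_carrier, case_noncarrier, control_carrier, control_noncarrier):
    """Odds ratio of a 2x2 table relating carrier status to case status."""
    return (case_carrier * control_noncarrier) / (case_noncarrier * control_carrier)


# Hypothetical counts, invented for illustration only.
# Order: carrier cases, non-carrier cases, carrier controls, non-carrier controls.
strata = {
    "smokers": (120, 180, 200, 600),
    "non_smokers": (60, 180, 500, 1500),
}


def pooled_counts(strata):
    """Add the strata together cell by cell, ignoring smoking status."""
    return tuple(sum(column) for column in zip(*strata.values()))


for name, counts in strata.items():
    print(name, odds_ratio(*counts))
print("whole population", odds_ratio(*pooled_counts(strata)))
```

With these invented counts the odds ratio is 2.0 among smokers, 1.0 among non-smokers, and 1.5 in the pooled population, so an analysis of the whole database would understate the effect in the only group where the variant is relevant.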

Future Directions

The current databases also have certain limitations. Some risk factors are not well documented, such as radon exposure, tobacco smoke exposure (no information about the tobacco products used by individuals, the limitations of survey-based qualitative information on tobacco exposure, and the lack of biological quantification), second-hand smoke, and environmental, occupational and domestic pollution. There are also emerging risk factors arising from lifestyle changes, such as the increased prevalence of electronics, the reduction in physical activity, changes in diet and sleeping patterns, and many others. Integrating different databases also carries intrinsic risks, because the causes of lung cancer are evolving and potentially differ across regions and time periods. For example, the causes of lung cancer in the USA now could be substantially different from what they were two or three decades ago, owing to changes in tobacco use, tobacco products, use of electronics, physical activity, and many other aspects of lifestyle. The causes of different subtypes of lung cancer can differ too: although tobacco smoke is the main cause of lung cancer, other factors may be involved in the different subtypes. Thus, if possible, different subtypes of lung cancer should be studied separately. The causes of lung cancer could also differ among populations, although the differences could be subtle: ample data suggest disparities in lung cancer risk with respect to race, gender, and other factors. Some of these may be driven by genetic factors and some by environmental factors.

In summary, with the continuous growth of large prospective cohort databases such as the UK Biobank and All of Us, their longitudinal data collection, more comprehensive data on different risk factors, and the integration of multiple risk factors, these databases are expected to become more powerful in identifying individual lung cancer risk factors, quantifying their potential contributions, and, more importantly, developing an integrated risk index for more accurate lung cancer risk prediction. Given the complexity of lung cancer risk, artificial intelligence may be essential to help analyze the different risk factors, explore their potential interactions, and holistically integrate them for better risk prediction.

Acknowledgement

None.

Conflict of Interest

No conflict of interest.

References
