Construction of a Risk Prediction Model for Lung Cancer Based on Lifestyle Behaviors in the UK Biobank Large-Scale Population Cohort
Abstract:
Objective To identify lifestyle-related risk factors for incident lung cancer and to build a lung cancer risk prediction model that identifies high-risk individuals in the population and supports early lung cancer screening.
Methods Data were obtained from the UK Biobank, which collected information from 502389 participants between March 2006 and October 2010. Criteria for identifying the high-risk population were determined with reference to domestic and international lung cancer screening guidelines and high-quality studies of lung cancer risk factors. Univariate Cox regression and stepwise regression were used to screen for lung cancer risk factors, and a multivariable lung cancer risk prediction model was constructed using Cox proportional hazards regression. Based on a comparison of Akaike information criterion values and Schoenfeld residual test results, the best-fitting model that satisfied the proportional hazards assumption was selected.
The multivariable Cox proportional hazards regression took survival time into account. The population was randomly divided into a training set and a validation set at a ratio of 7:3; the model was built on the training set and internally validated on the validation set. The area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the discriminative performance of the model. The population was categorized into low-risk, moderate-risk, and high-risk groups according to a probability of occurrence of 0% to <25%, 25% to <75%, and 75% to 100%, and the proportion of incident cases falling in each risk group was calculated.
Results The study finally included 453558 individuals. During a cumulative follow-up of 5505402 person-years, 2330 cases of lung cancer were diagnosed. Cox proportional hazards regression selected 10 independent variables for the model: age, body mass index (BMI), education, income, physical activity, smoking status, alcohol consumption frequency, fresh fruit intake, family history of cancer, and tobacco exposure. Internal validation showed that 8 of these variables (all except BMI and fresh fruit intake) were significant influencing factors for lung cancer (P<0.05). In the training set, the AUCs for predicting lung cancer occurrence at one, five, and ten years were 0.825, 0.785, and 0.777, respectively; in the validation set, they were 0.857, 0.782, and 0.765, respectively. Screening the high-risk population could identify 68.38% of the individuals who went on to develop lung cancer.
Conclusion This study established a lung cancer risk prediction model based on lifestyle behaviors in a large population cohort. The model shows good discriminatory ability and can serve as a tool for developing standardized lung cancer screening strategies.
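The 7:3 split and AUC evaluation described in the Methods can be illustrated with a minimal standard-library sketch. It is not the authors' code: the function names `split_70_30` and `auc` are hypothetical, and the AUC here is a plain rank-based (Mann-Whitney) estimate at a single fixed horizon that ignores censoring, whereas the abstract reports AUCs at one, five, and ten years from a Cox model.

```python
import random


def split_70_30(ids, seed=0):
    """Randomly split participant IDs into training (70%) and validation (30%) sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def auc(scores, events):
    """Rank-based AUC: probability that a random case scores higher than a
    random non-case, with ties counted as 0.5 (assumes no censoring)."""
    cases = [s for s, e in zip(scores, events) if e]
    controls = [s for s, e in zip(scores, events) if not e]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy data (invented for illustration, not from the study).
train_ids, valid_ids = split_70_30(range(10))
print(len(train_ids), len(valid_ids))
print(auc([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0]))
```

In practice the risk scores passed to `auc` would come from the model fitted on the training set and be evaluated on the validation set only.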
Key words:
- Lung cancer
- Risk prediction
- Prediction model
- Risk factor
Figure 2. ROC curve analysis results of the training set (left) and the validation set (right)
Table 1. Statistical characteristics and Cox regression analysis
| Variable | Lung cancer cases (n=2330) | No lung cancer (n=451228) | Incidence (‰) | Univariable Cox HR (95% CI) | Univariable P | Multivariable Cox HR (95% CI) | Multivariable P |
|---|---|---|---|---|---|---|---|
| Age (years) | | | | | | | |
| 40-49 | 91 | 111462 | 0.82 | Reference | | Reference | |
| 50-64 | 1347 | 259088 | 5.17 | 6.55 (5.06-8.46) | <0.001 | 5.43 (3.39-8.69) | <0.001 |
| ≥65 | 892 | 80678 | 10.94 | 14.46 (11.14-18.76) | <0.001 | 11.36 (6.95-18.56) | <0.001 |
| Sex | | | | | | | |
| Female | 1045 | 240686 | 4.32 | Reference | | | |
| Male | 1285 | 210542 | 6.07 | 1.43 (1.30-1.58) | <0.001 | | |
| Body mass index (kg/m²) | | | | | | | |
| <18.5 | 52 | 4537 | 11.33 | Reference | | Reference | |
| 18.5-24.9 | 724 | 145402 | 4.95 | 0.47 (0.33-0.66) | <0.001 | 3.46 (0.48-24.77) | 0.22 |
| 25.0-29.9 | 971 | 191225 | 5.05 | 0.48 (0.34-0.67) | <0.001 | 3.11 (0.44-22.23) | 0.26 |
| >29.9 | 583 | 110064 | 5.27 | 0.48 (0.34-0.68) | <0.001 | 2.61 (0.36-18.73) | 0.34 |
| Qualifications | | | | | | | |
| College or university degree | 362 | 147542 | 2.45 | Reference | | Reference | |
| A levels/AS levels or equivalent | 181 | 50323 | 3.58 | 1.62 (1.31-1.99) | <0.001 | 1.66 (1.23-2.23) | <0.001 |
| O levels/GCSEs or equivalent | 424 | 94697 | 4.46 | 1.87 (1.58-2.21) | <0.001 | 1.57 (1.22-2.03) | <0.001 |
| CSEs or equivalent | 77 | 24866 | 3.09 | 1.28 (0.95-1.72) | 0.105 | 1.11 (0.65-1.88) | 0.71 |
| NVQ, or HND, or HNC, or the equivalent | 207 | 29622 | 6.94 | 2.95 (2.41-3.62) | <0.001 | 1.71 (1.22-2.38) | <0.001 |
| Other professional qualifications | 117 | 22960 | 5.07 | 2.06 (2.41-2.65) | <0.001 | 1.55 (1.06-2.25) | 0.02 |
| Missing data | 962 | 81218 | 11.71 | | | | |
| Income (£/year) | | | | | | | |
| <18000 | 826 | 85352 | 9.58 | Reference | | Reference | |
| 18000-30999 | 564 | 96737 | 5.80 | 0.62 (0.54-0.70) | <0.001 | 1.02 (0.79-1.32) | 0.90 |
| 31000-51999 | 299 | 101496 | 2.94 | 0.32 (0.27-0.37) | <0.001 | 0.87 (0.66-1.16) | 0.35 |
| 52000-100000 | 146 | 80277 | 1.82 | 0.19 (0.16-0.24) | <0.001 | 0.63 (0.44-0.89) | <0.001 |
| >100000 | 39 | 21455 | 1.81 | 0.17 (0.11-0.26) | <0.001 | 0.83 (0.49-1.41) | 0.50 |
| Missing data | 456 | 65911 | 6.87 | | | | |
| Ethnicity | | | | | | | |
| White | 2250 | 423934 | 5.28 | Reference | | | |
| Not white | 70 | 25677 | 2.72 | 0.46 (0.34-0.62) | <0.001 | | |
| Missing data | 10 | 1617 | 6.15 | | | | |
| International Physical Activity Questionnaires activity group | | | | | | | |
| Low | 417 | 68539 | 6.05 | Reference | | Reference | |
| Moderate | 687 | 148113 | 4.62 | 0.78 (0.67-0.90) | 0.001 | 0.75 (0.58-0.95) | 0.02 |
| High | 639 | 147235 | 4.32 | 0.73 (0.63-0.85) | <0.001 | 0.61 (0.47-0.79) | <0.001 |
| Missing data | 587 | 87341 | 6.68 | | | | |
| Smoking status | | | | | | | |
| Never | 308 | 249020 | 1.24 | Reference | | Reference | |
| Previous | 1029 | 153457 | 6.66 | 5.70 (4.89-6.64) | <0.001 | 3.74 (2.98-4.68) | <0.001 |
| Current | 975 | 47071 | 20.29 | 16.93 (14.51-19.76) | <0.001 | 3.23 (1.98-5.29) | <0.001 |
| Missing data | 18 | 1680 | 10.60 | | | | |
| Cooked vegetables intake (tablespoon/d) | | | | | | | |
| <3 | 1177 | 230290 | 5.08 | Reference | | | |
| 3 | 604 | 120535 | 4.99 | 0.98 (0.87-1.10) | 0.691 | | |
| >3 | 506 | 94436 | 5.33 | 1.08 (0.96-1.23) | 0.195 | | |
| Missing data | 43 | 5967 | 7.15 | | | | |
| Salad/raw vegetables intake (tablespoon/d) | | | | | | | |
| <3 | 1657 | 305853 | 5.39 | Reference | | | |
| 3 | 270 | 62888 | 4.27 | 0.80 (0.69-0.93) | 0.004 | | |
| >3 | 345 | 76357 | 4.50 | 0.83 (0.72-0.95) | 0.007 | | |
| Missing data | 58 | 6130 | 9.37 | | | | |
| Fresh fruit intake (pieces/d) | | | | | | | |
| <3 | 1651 | 286932 | 5.72 | Reference | | Reference | |
| 3 | 352 | 89765 | 3.91 | 0.66 (0.57-0.76) | <0.001 | 0.80 (0.61-1.03) | 0.08 |
| >3 | 298 | 72558 | 4.09 | 0.75 (0.65-0.87) | <0.001 | 1.05 (0.82-1.36) | 0.69 |
| Missing data | 29 | 1973 | 14.49 | | | | |
| Meat intake (tablespoon/d) | | | | | | | |
| <1 | 725 | 178138 | 4.05 | Reference | | | |
| 1 | 677 | 131150 | 5.14 | 1.21 (1.06-1.37) | 0.003 | | |
| >1 | 923 | 140841 | 6.51 | 1.59 (1.41-1.78) | <0.001 | | |
| Missing data | 5 | 1099 | 4.53 | | | | |
| Cheese intake (tablespoon/d) | | | | | | | |
| <1 | 509 | 88452 | 5.72 | Reference | | | |
| 1 | 543 | 94305 | 5.72 | 1.04 (0.90-1.20) | 0.635 | | |
| >1 | 1198 | 257268 | 4.64 | 0.85 (0.75-0.96) | 0.011 | | |
| Missing data | 80 | 11203 | 7.09 | | | | |
| Alcohol drinker status | | | | | | | |
| Never | 78 | 20245 | 3.84 | Reference | | | |
| Previous | 186 | 15949 | 11.53 | 3.26 (2.34-4.54) | <0.001 | | |
| Current | 2063 | 414475 | 4.95 | 1.45 (1.09-1.93) | 0.010 | | |
| Missing data | 3 | 559 | 5.34 | | | | |
| Alcohol drinking status (day/week) | | | | | | | |
| >4 | 577 | 91413 | 6.27 | Reference | | Reference | |
| 1-4 | 974 | 221206 | 4.38 | 0.69 (0.61-0.78) | <0.001 | 0.79 (0.63-1.00) | 0.04 |
| <1 | 776 | 138186 | 5.58 | 0.88 (0.77-1.00) | 0.052 | 1.06 (0.81-1.38) | 0.67 |
| Missing data | 3 | 423 | 7.04 | | | | |
| Worries/anxious feelings | | | | | | | |
| No | 1023 | 191389 | 5.32 | Reference | | | |
| Yes | 1229 | 247351 | 4.94 | 0.89 (0.81-0.98) | 0.021 | | |
| Missing data | 78 | 12488 | 6.21 | | | | |
| Apolipoprotein A | | | | | | | |
| Low | 9 | 663 | 13.39 | Reference | | | |
| Moderate | 1974 | 375222 | 5.23 | 0.38 (0.17-0.85) | 0.019 | | |
| High | 42 | 12017 | 3.48 | 0.26 (0.11-0.62) | 0.003 | | |
| Missing data | 305 | 63326 | 4.79 | | | | |
| Apolipoprotein B | | | | | | | |
| Low | 2 | 255 | 7.78 | Reference | | | |
| Moderate | 1697 | 325484 | 5.19 | 0.90 (0.13-6.41) | 0.918 | | |
| High | 472 | 96954 | 4.84 | 0.83 (0.12-5.94) | 0.857 | | |
| Missing data | 159 | 559 | 221.45 | | | | |
| High-density lipoprotein | | | | | | | |
| Low | 500 | 70339 | 7.06 | Reference | | | |
| Moderate | 971 | 184553 | 5.23 | 0.73 (0.65-0.84) | <0.001 | | |
| High | 562 | 135067 | 4.14 | 0.60 (0.52-0.69) | <0.001 | | |
| Missing data | 297 | 61269 | 4.82 | | | | |
| Low-density lipoprotein | | | | | | | |
| Low | 155 | 15239 | 10.07 | Reference | | | |
| Moderate | 710 | 117309 | 6.02 | 0.65 (0.52-0.80) | <0.001 | | |
| High | 1313 | 291491 | 4.48 | 0.48 (0.39-0.59) | <0.001 | | |
| Missing data | 152 | 27189 | 5.56 | | | | |
| Total cholesterol | | | | | | | |
| Low | 13 | 1181 | 10.89 | Reference | | | |
| Moderate | 916 | 145186 | 6.27 | 0.56 (0.29-1.09) | 0.088 | | |
| High | 1253 | 278430 | 4.48 | 0.40 (0.21-0.78) | 0.007 | | |
| Missing data | 148 | 26431 | 5.57 | | | | |
| Triacylglycerol | | | | | | | |
| Low | 21 | 6735 | 3.11 | Reference | | | |
| Moderate | 1123 | 248222 | 4.50 | 1.37 (0.83-2.24) | 0.215 | | |
| High | 1036 | 169510 | 6.07 | 1.75 (1.07-2.88) | 0.026 | | |
| Missing data | 150 | 26761 | 5.57 | | | | |
| Family history of cancer | | | | | | | |
| No | 1862 | 392298 | 4.72 | Reference | | Reference | |
| Yes | 452 | 56907 | 7.88 | 1.75 (1.55-1.97) | <0.001 | 1.65 (1.30-2.09) | <0.001 |
| Missing data | 16 | 2023 | 7.85 | | | | |
| Tobacco exposure | | | | | | | |
| No | 1033 | 322997 | 3.19 | Reference | | Reference | |
| Yes | 396 | 87133 | 4.52 | 1.47 (1.28-1.68) | <0.001 | 1.33 (1.07-1.65) | 0.01 |
| Missing data | 901 | 41098 | 21.45 | | | | |

HR: hazard ratio; CI: confidence interval; A levels: Advanced Level qualifications; AS levels: Advanced Subsidiary levels; O levels: General Certificate of Education Ordinary Level; GCSEs: General Certificate of Secondary Education; CSEs: Certificate of Secondary Education; NVQ: National Vocational Qualification; HND: Higher National Diploma; HNC: Higher National Certificate.
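The Incidence (‰) column in Table 1 appears to be the number of lung cancer cases divided by the total number of participants in the stratum, expressed per thousand. A minimal check under that assumption, using the age rows of the table (the function name `incidence_per_mille` is illustrative):

```python
def incidence_per_mille(cases: int, non_cases: int) -> float:
    """Lung cancer cases per 1000 participants in a stratum (assumed definition)."""
    return 1000 * cases / (cases + non_cases)


# (cases, non-cases) copied from the Age rows of Table 1.
age_rows = {"40-49": (91, 111462), "50-64": (1347, 259088), ">=65": (892, 80678)}

for label, (cases, non_cases) in age_rows.items():
    print(label, round(incidence_per_mille(cases, non_cases), 2))
```

Under this assumption the three computed values round to 0.82, 5.17, and 10.94, matching the tabulated figures for the age strata.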
Table 2. Results of population risk assessment

| Risk situation | Number of cases | Proportion of cases |
|---|---|---|
| High-risk population | 291 | 68.38% |
| Moderate-risk population | 123 | 28.58% |
| Low-risk population | 13 | 3.04% |
| Total | 427 | 100% |
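The grouping behind Table 2 can be sketched as follows. This is an illustration only: it assumes the 0% to <25%, 25% to <75%, and 75% to 100% cut-offs are percentile ranks of each person's predicted risk, which the text does not state explicitly, and the names `risk_groups` and `case_share` as well as the toy data are hypothetical.

```python
from bisect import bisect_left


def risk_groups(risks):
    """Label each predicted risk by percentile rank:
    <25% low, 25% to <75% moderate, >=75% high (assumed interpretation)."""
    ordered = sorted(risks)
    n = len(ordered)
    labels = []
    for r in risks:
        pct = bisect_left(ordered, r) / n  # share of people with strictly lower risk
        labels.append("low" if pct < 0.25 else "moderate" if pct < 0.75 else "high")
    return labels


def case_share(labels, events):
    """Fraction of all incident cases that falls in each risk group."""
    counts = {"low": 0, "moderate": 0, "high": 0}
    for group, event in zip(labels, events):
        counts[group] += event
    total = sum(events)
    return {group: count / total for group, count in counts.items()}


# Toy predicted risks and outcomes (1 = developed lung cancer); not study data.
risks = [i / 100 for i in range(1, 9)]
events = [0, 0, 0, 1, 0, 1, 1, 1]
print(case_share(risk_groups(risks), events))
```

The share reported for the "high" group corresponds to the kind of figure shown in the first row of Table 2.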
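Open access This article is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), which permits third parties to freely share (copy and redistribute in any medium or format) and adapt (modify, transform, or build upon) articles published in this journal, provided that appropriate credit is given, a link to the license is provided, and any changes to the original are indicated; the article may not be used for commercial purposes. For details of the CC BY-NC 4.0 license, see //creativecommons.org/licenses/by-nc/4.0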