International Journal of Preventive Medicine

Year : 2021  |  Volume : 12  |  Issue : 1  |  Page : 180

Age at natural menopause; A data mining approach (Data from the National Health and Nutrition Examination Survey 2013-2014)

Tahereh Alinia1, Soheila Khodakarim2, Fahimeh Ramezani Tehrani3, Siamak Sabour4
1 Student Research Committee, School of Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
2 Department of Epidemiology, School of Allied Medical Sciences, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
3 Reproductive Endocrinology Research Center, Research Institute for Endocrine Sciences, Tehran, Iran
4 Department of Clinical Epidemiology, School of Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, I.R. Iran

Correspondence Address:
Siamak Sabour
Department of Clinical Epidemiology, School of Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran
I.R. Iran


Background: The age at which menopause occurs varies among female populations. This variation is attributed to genetic and environmental factors. This study aims to investigate the determinants of early and late-onset menopause. Methods: We used data from the National Health and Nutrition Examination Survey 2013-2014 for 762 naturally menopausal women. Data on sociodemographic, lifestyle, examination, and laboratory characteristics were examined. We used random forest (RF), support vector machine (SVM), and logistic regression (LR) to identify important determinants of early and late-onset menopause. We compared the performance of the models using sensitivity, specificity, Brier score, and area under the receiver operating characteristic curve (AUROC). The top determinants were identified within the best-performing model using the mean decrease in Gini. Results: Random forest outperformed LR and SVM, with an overall AUROC of 99% for identifying factors related to early and late-onset menopause (Brier score: 0.051 for early and 0.005 for late-onset menopause). Vitamin B12 and age at menarche were strongly related to early menopause. Methylmalonic acid (MMA), vitamin D, and body mass index (BMI) were also among the top-ranked factors contributing to early menopause. Age at menarche, MMA, sex hormone-binding globulin (SHBG), BMI, and vitamin B12 were the most important covariates for late-onset menopause. Conclusions: Menarche age and BMI are among the important contributors to early and late-onset menopause. More research on the association between vitamin D, vitamin B12, SHBG, and menopause timing is required and would produce valuable information for better prediction of menopause timing.

How to cite this article:
Alinia T, Khodakarim S, Tehrani FR, Sabour S. Age at natural menopause; A data mining approach (Data from the National Health and Nutrition Examination Survey 2013-2014). Int J Prev Med 2021;12:180-180



With chronological aging, both the number and quality of the oocytes in the ovaries decrease, the ovaries stop producing estrogen and progesterone, and consequently the menstrual periods stop permanently.[1] Natural menopause is defined as twelve consecutive months of amenorrhea for which there is no obvious pathological or physiological cause other than the loss of ovarian follicular activity.[2]

The mechanism underlying the age at natural menopause (ANM) is not completely understood.[3] Different genetic, social, and environmental factors are likely to be associated with variability in ANM, but considerable controversy exists and no established risk factor has been documented.[4]

Considerable long-term adverse health implications have been reported for early menopause. Early menopause is independently associated with increased odds of rheumatoid arthritis.[5] The risk of death is higher among women with early menopause.[6],[7] Late-onset menopause also carries health risks. It is a proxy for prolonged exposure to estrogen and a large number of ovulations, which consequently puts women at higher risk of ovarian, breast, and endometrial cancer.[8]

Data mining uses statistical methods for data classification. These techniques have been frequently applied to epidemiologic data to classify determinants of health and have shown higher accuracy than classical methods.[9],[10],[11] Support vector machine (SVM), random forest (RF), and logistic regression (LR) have been broadly used in this area. They are among the most frequently used supervised learning methods for analyzing complex survey data.

Several studies have sought to identify significant risk factors for early and late-onset menopause. To the best of our knowledge, however, studies of ANM determinants using data mining methods are lacking. We hypothesized that applying flexible, optimized data mining approaches to a large data set with many features would yield accurate classification and generate valuable new hypotheses. In this study, we aim to identify significant determinants of early and late-onset menopause using data mining algorithms and data from the National Health and Nutrition Examination Survey (NHANES) 2013-2014.


Data source

Health information on naturally menopausal women was obtained from NHANES 2013-2014. NHANES is a cross-sectional survey conducted by the National Center for Health Statistics (NCHS) to assess the health and nutritional status of adults and children in the United States of America. It uses a complex, four-stage sampling scheme and combines interviews, physical examinations, and laboratory tests for approximately 5000 members of the non-institutionalized civilian resident population of the United States annually.

Menopause status was determined by asking women the question “Have you had regular periods in the past 12 months?” If the answer was no, the next question was “What is the reason that you have not had regular periods in the past 12 months?”. Answer choices were pregnancy, breastfeeding, hysterectomy, menopause, and other. Women were classified as naturally menopausal if the stated cause of the lack of menstruation was menopause. The study included 762 women who were naturally menopausal and reported their ANM. Women were categorized into three strata by their ANM (early, timely, and late-onset menopause). Early menopause was defined as amenorrhea before the age of 45.[12] Timely menopause was defined as menopause between the ages of 45 and 55, and late-onset menopause as menopause that had not begun by the age of 55.[13]
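The three-way grouping described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R code; the function name `classify_anm` and the sample ages are hypothetical, and the handling of the boundary ages (45 counted as timely, 55 counted as late-onset) is an assumption, because the text does not state how those ages were assigned.

```python
def classify_anm(age_at_menopause):
    """Return the ANM stratum for a self-reported age at natural menopause (years)."""
    # Assumption: age 45 falls in "timely" and age 55 in "late";
    # the source text leaves the treatment of boundary ages unspecified.
    if age_at_menopause < 45:
        return "early"
    if age_at_menopause < 55:
        return "timely"
    return "late"


# Hypothetical ages, for illustration only.
ages = [38, 45, 51, 55, 58]
categories = [classify_anm(a) for a in ages]
print(categories)
```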

Input features

Several hundred variables are available in the NHANES data sets. Variables were selected based on existing research on the determinants of ANM; variables with only a possible relation to ANM were also considered. A variable was dropped from the dataset if it was available only for a subsample rather than the whole NHANES sample, or if more than 35% of its values were missing. A total of 38 variables were included in the models. The variables considered cover socio-demographic (e.g., education level, race, marital status, ratio of family income to poverty), lifestyle (e.g., drinking), reproductive (e.g., history of prior pregnancy), examination (e.g., anthropometrics), and laboratory (e.g., vitamin D level) characteristics [Table 1] and [Table 2]. The family income-to-poverty ratio is an index of socioeconomic standing and expresses family income relative to the poverty level.{Table 1}{Table 2}
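As a rough illustration of the missingness rule (drop a variable when more than 35% of its values are missing), the following Python sketch filters the columns of a small table. The column names and values are hypothetical, and representing missing values as `None` is an assumption made for this example only.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_columns(table, threshold=0.35):
    """Keep only the columns whose missing fraction does not exceed the threshold."""
    return {
        name: values
        for name, values in table.items()
        if missing_fraction(values) <= threshold
    }


# Hypothetical columns, for illustration only.
table = {
    "bmi": [27.1, None, 31.4, 24.9, 29.0],        # 20% missing: kept
    "age_menarche": [12, 13, 11, 14, 12],          # 0% missing: kept
    "sparse_lab": [None, None, 1.2, None, 0.8],    # 60% missing: dropped
}
kept = drop_sparse_columns(table)
print(sorted(kept))
```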

Data imputation

Handling missing values is a prerequisite for applying data mining algorithms. The proportion of missing data ranged from 0% to 15% for all features except “the number of alcohol drinks over the past 12 months”, “history of Cocaine/heroin/methamphetamine use”, “age at first live birth”, “history of vaginal, anal, or oral sex”, and “age at first sex”, for which 25% to 35% of the data were missing. Missing values in the covariates were imputed by multiple imputation using the MICE package in R, with 5 imputations and 50 iterations.

Class imbalance

The data sets in this study were class-imbalanced. Because timely menopausal women make up the large majority of the sample, the unbalanced class distribution can degrade model performance. One approach to this problem is oversampling, in which samples from the minority class are duplicated. We rebalanced the data by oversampling the minority class (early or late-onset menopause) and then trained the classification model on the balanced data.
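A minimal sketch of random oversampling, written in Python for illustration (the authors worked in R and do not name the routine they used). The function name `oversample_minority`, the record layout, and the choice to duplicate minority records until the classes are equal in size are all assumptions.

```python
import random


def oversample_minority(rows, label_key, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical records: 8 timely and 2 early menopause cases.
rows = [{"id": i, "anm": "timely"} for i in range(8)]
rows += [{"id": 100 + i, "anm": "early"} for i in range(2)]
balanced = oversample_minority(rows, "anm")
counts = {label: sum(r["anm"] == label for r in balanced) for label in ("timely", "early")}
print(counts)
```

A common precaution is to oversample only the training portion after the data have been split, so that duplicated records cannot appear in both the training and the validation sets; the text does not state which order was used here.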

Once the data had been imputed and balanced, the data mining algorithms were run on each balanced imputed dataset, and the estimates from each dataset were then combined to obtain the final result.
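The text does not say how the per-dataset estimates were combined. One simple possibility, shown here purely as an assumption, is to average each performance metric across the five imputed datasets. The metric values below are hypothetical and are not taken from the study.

```python
from statistics import mean


def pool_estimates(per_imputation):
    """Average each metric across imputed datasets (an assumed pooling rule)."""
    metrics = per_imputation[0].keys()
    return {m: mean(d[m] for d in per_imputation) for m in metrics}


# Hypothetical per-imputation results, one dict per imputed dataset.
per_imputation = [
    {"sensitivity": 0.90, "specificity": 0.80},
    {"sensitivity": 0.92, "specificity": 0.82},
    {"sensitivity": 0.91, "specificity": 0.81},
    {"sensitivity": 0.93, "specificity": 0.79},
    {"sensitivity": 0.89, "specificity": 0.83},
]
pooled = pool_estimates(per_imputation)
print(pooled)
```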

Data mining methods

We used three supervised learning methods, random forest (RF), support vector machine (SVM), and logistic regression (LR), to classify naturally menopausal women into early or late-onset menopause. Separate models were built for early vs. timely menopause and for late-onset vs. timely menopause.

The RF is a supervised ensemble classification model that grows many classification trees, each built from a bootstrap sample and a random subset of features. The RF aggregates the predictions of the individual trees through voting. After parameter tuning, the most important RF parameters in our study were ntree = 100, the number of trees in the forest, and mtry = 6 (the square root of the total number of variables), the number of features randomly selected as candidates at each split.
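Two details of this description lend themselves to a short illustration: how mtry = 6 follows from 38 input variables, and how votes from individual trees are aggregated. The Python sketch below is illustrative only; rounding the square root down is an assumption, and the vote counts are hypothetical.

```python
import math
from collections import Counter


def default_mtry(n_features):
    """Square root of the number of features, rounded down (assumed rounding)."""
    return math.floor(math.sqrt(n_features))


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


print(default_mtry(38))  # sqrt(38) is about 6.16, so this prints 6

# Hypothetical votes from a forest of ntree = 100 trees for one woman.
votes = ["early"] * 58 + ["timely"] * 42
print(majority_vote(votes))  # prints "early"
```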

SVM is a supervised data mining model for two-group classification problems. It maps the data into a feature space through a kernel function (a linear kernel in this study) and chooses the maximum-margin hyperplane that separates the classes. Thus, the goal of the SVM is to improve accuracy by optimizing the separation between the classes. The SVM model was trained on the characteristics associated with the ANM groups. The regularization parameter was set to 10.
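For readers unfamiliar with the role of the regularization parameter, the sketch below evaluates the standard soft-margin objective of a linear SVM, 0.5·||w||² + C·Σ max(0, 1 − y·(w·x + b)). It is a Python illustration rather than the authors' R code; interpreting the reported value of 10 as the cost parameter C is an assumption, and the weights and data points are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C=10.0):
    """Standard linear soft-margin SVM objective with labels coded as +1 / -1."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_loss = sum(
        max(0.0, 1.0 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)
    )
    return margin_term + C * hinge_loss


# Hypothetical weights and three two-feature observations.
w, b = [1.0, 0.0], -1.0
X = [[3.0, 0.0], [-1.0, 0.0], [1.5, 0.0]]
y = [1, -1, 1]
objective = soft_margin_objective(w, b, X, y)
print(objective)
```

A larger C penalizes points that fall inside the margin more heavily, which trades a narrower margin for fewer training violations.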

LR is a supervised data mining model used to model a binary dependent variable. It uses the logit link function to predict class probabilities. The prediction from a logistic regression model can be interpreted as the probability that the label is 1. Predicting class probabilities is generally preferable to predicting classes directly.
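The probability interpretation can be made concrete with the inverse of the logit function. In this Python sketch the intercept, coefficients, and feature values are hypothetical and are not estimates from the study.

```python
import math


def predict_probability(intercept, coefs, x):
    """Probability that the label is 1, via the inverse logit of the linear predictor."""
    z = intercept + sum(c * v for c, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two standardized features.
intercept, coefs = -1.0, [0.8, -0.5]
p_low = predict_probability(intercept, coefs, [0.0, 0.0])
p_high = predict_probability(intercept, coefs, [2.0, 0.0])
print(round(p_low, 3), round(p_high, 3))
```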

Performance evaluation

The dataset was divided into a training set (80% of total samples) used to develop the classification models and a validation (test) set (20% of total samples) used to assess the classification accuracy of each model. Each model was evaluated on the validation set in terms of sensitivity, specificity, Brier score, and area under the receiver operating characteristic curve (AUROC). The Brier score measures the accuracy of a model's probabilistic predictions. It is calculated as the mean squared difference between the observed outcome and the predicted probability. It ranges between 0 and 1, and a lower Brier score indicates higher accuracy. AUROC is a global measure of classifier performance that summarizes the trade-off between sensitivity and the false-positive rate (1 - specificity) across classification thresholds. AUROC shows how well early and late-onset menopause can be classified by the algorithm.
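The four metrics can be computed directly from observed labels and predicted probabilities. The Python sketch below is illustrative only: the 0.5 classification threshold is an assumption (the text does not report one), AUROC is computed through its pairwise-ranking interpretation, and the labels and probabilities are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Sensitivity and specificity at an assumed probability threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = early menopause) and predicted probabilities.
y_true = [1, 1, 0, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.2, 0.1]
sens, spec = sensitivity_specificity(y_true, y_prob)
print(brier_score(y_true, y_prob), sens, spec, auroc(y_true, y_prob))
```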

The best data mining model was selected based on the performance metrics. Within the best-performing model, variable importance measures rank the variables by their relevance for classification. Importance was assessed using the Gini impurity criterion. The mean decrease in Gini is the average of a variable's total decrease in node impurity, weighted by the proportion of samples reaching that node in each decision tree in the random forest. A larger mean decrease in Gini (i.e., lower node impurity after splitting on the variable) means that the variable plays a greater role in classification and is more relevant to menopause timing. The five variables with the largest mean decrease in Gini in each model were selected as the most important. All analyses were run using R release 4.0.3.
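To make the impurity criterion concrete, the Python sketch below computes the Gini impurity of a node and the decrease achieved by a single split; a variable's importance is then built up from such decreases over all splits that use it, averaged over the trees. The node contents are hypothetical, and this is an illustration of the general idea rather than the exact computation performed by the R software the authors used.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical node with 4 early and 4 timely cases, split on some variable.
parent = ["early"] * 4 + ["timely"] * 4
left = ["early"] * 3 + ["timely"] * 1
right = ["early"] * 1 + ["timely"] * 3
print(gini_impurity(parent), gini_decrease(parent, left, right))
```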


[Table 1] presents the socio-demographic characteristics of the study population stratified by ANM category. Of the 762 naturally menopausal women, early menopause was reported by 132 (17.3%), timely menopause by 575 (75.4%), and late-onset menopause by 55 (7.2%). The average age of the women at the time of the study was 62.8 ± 10.6 years for early menopause, 64.1 ± 9.1 for timely, and 67.4 ± 6.3 for late-onset menopause (p-value = 0.009). The number of years since the final menstrual period (FMP) declined from the early to the late-onset menopause group. The average years since FMP at the time of the study among women with early, timely, and late-onset menopause were 23.1 ± 11.5, 14.1 ± 9.2, and 9.4 ± 5.9, respectively. The ratio of family income to poverty was lowest among women with early menopause (p-value = 0.019). The input variables of the data mining models are described in [Table 2].

[Table 3] describes the comparative performance of the three data mining models (RF, SVM, and LR) for the classification of menopause timing. The full set of 38 variables was used to train the classification models (number of features = 38). Five classification performance estimates were produced (one per imputed dataset), separately for early and late-onset menopause, and one combined estimate was then provided for each ANM group.{Table 3}

For early vs. timely menopause, RF outperformed LR and SVM. In the RF model, AUROC ranged from 98.1% to 98.4% across the imputed datasets, and the combined AUROC was 99%. The combined Brier score for RF was the lowest among the models (0.051 for RF versus 0.200 for LR and 0.211 for SVM). The worst-performing model was SVM, with a combined AUROC of 68.6%.

In the classification of late-onset vs. timely menopause, RF was the best discriminating classifier, with an AUROC of 99% for all imputed datasets and for the combined estimate, followed by LR and SVM, with combined AUROCs of 84.1% and 78.9%, respectively. The lowest Brier score belonged to RF on the balanced data (0.005 for the combined data) and the highest to SVM on the balanced data (0.200 for the combined data).

[Figure 1] and [Figure 2] show the relative importance of the variables in the RF model. Because RF was the top-performing model, the mean decrease in Gini was used to compare the importance of the variables within the model. The five variables with the largest mean decrease in Gini in each model were selected as the most important.{Figure 1}{Figure 2}

Analysis of the features for early menopause showed that vitamin B12 and age at menarche were the most important features, contributing substantially to the classification by the RF model. Methylmalonic acid (MMA), vitamin D (25OHD2 + 25OHD3), and body mass index (BMI) were also among the top-ranked variables contributing to classification into early menopause.

For late-onset menopause, the analysis identified age at menarche, MMA, sex hormone-binding globulin (SHBG), BMI, and vitamin B12 as the most important variables.


The main goal of the present study was to identify significant determinants of early and late-onset menopause using data mining algorithms. Such models may help to screen women who have a higher probability of early or late-onset menopause. This is significant because unusual timing of menopause may indicate not only the loss of fertility but also an increased risk of various mid-life diseases and problems. Many of these diseases can be prevented by timely intervention through lifestyle modification. The important contribution of the present work is that we searched NHANES, a large population-based survey, for determinants of menopause timing using a data mining approach. Menarche age and BMI are among the important contributors to early and late-onset menopause. Models trained using RF outperformed LR and SVM for ANM classification. Data mining has generated the hypotheses that MMA, vitamin B12, SHBG, and vitamin D are possibly correlated with menopause timing.

The RF models developed in this study outperformed LR and SVM. A large body of literature suggests that RF outperforms SVM; however, the opposite has also been reported.[14] We are aware of no studies that have classified early or late-onset menopause using data mining approaches. Therefore, it is not possible to compare the current study with similar studies. The consistency of the performance metrics across the imputed datasets suggests that imputation produced nearly similar data.

Our findings are consistent with previous studies that have established an association between age at menarche and ANM.[15] It is not completely clear whether early menarche causes early or late-onset menopause, and the overall evidence is mixed. Some studies reported no correlation between the age at menarche and the age at menopause.[16],[17],[18] A pooled analysis of nearly 50,000 postmenopausal women from nine observational studies in the UK, Scandinavia, Australia, and Japan concluded that the risk of premature and early menopause increased by 80% for women with early menarche.[15]

In the present study, we found that MMA and vitamin B12 contribute strongly to the classification of early and late-onset menopause. MMA is a metabolite that accumulates when vitamin B12, which is necessary for human metabolism and energy production, is deficient; its level is therefore a biomarker of vitamin B12 deficiency.[19] Serum vitamin B12 concentrations are frequently low in the elderly.[20],[21] Previous studies reported that lack of estrogen (menopause) affects the requirements for the B vitamins, including B12, for maintaining low blood homocysteine concentrations;[22] however, no study has explored the effect of vitamin B12 on menopause timing. Therefore, there is no clear explanation for the observed relationship. Future research is essential to examine the possibility of an association between MMA, vitamin B12, and ANM.

SHBG was also found to be an important contributor to the unusual timing of menopause. SHBG binds three sex hormones (testosterone, dihydrotestosterone, and estradiol) and regulates their levels in the body. Evidence on the link between SHBG and ANM is lacking. The age-related trend of SHBG levels among women is not clear and can be affected by many factors, such as BMI and fasting insulin.[23] A meta-analysis of data from nine studies that investigated serum androgen profiles in women with premature ovarian failure found no statistically significant difference in SHBG levels between these women and fertile women.[24]

The impact of vitamin D on female reproduction and related diseases has been thoroughly researched; however, the evidence on the link between vitamin D and ANM is scant. A prospective study reported that a higher intake of vitamin D decreases the risk of early menopause.[25] Our work identified vitamin D as an important determinant of early menopause. This association may be biologically explainable. Active metabolites of vitamin D regulate genes involved in estrogen synthesis.[26] Also, follicle-stimulating hormone (FSH) was reported to be inversely associated with vitamin D, and FSH is a biomarker of ovarian reserve that rises across the late reproductive lifespan.[27]

The present study identified BMI as an important variable in the classification of both early and late-onset menopause. This finding is in line with existing evidence. Several previous studies documented an association between BMI and ANM: the higher the BMI, the later the age at menopause. BMI is a major determinant of endogenous estrogen level; therefore, women with lower BMI are at risk of early menopause,[28],[29],[30] and those with higher BMI are thought to have higher levels of estradiol and estrone in the body and, consequently, later ANM.[31]

It is important to be aware of the limitations of cross-sectional studies. Menopause is a condition with extensive physiological and psychological changes, which occur in combination with advancing age. One limitation of the cross-sectional design is that, because exposure and outcome are assessed simultaneously, there is generally no evidence of a temporal relationship between the explored variables and ANM. The women were already menopausal when they participated in the survey, and some of their health information relates to their current situation. Some of the factors examined at that age, including the serum levels of biological markers, may not reflect the status of these factors over the whole life course or in the years before and around menopause. Future research should employ longitudinal designs to validate the cross-sectional findings of the present study and to ascertain the temporal trends for predicting ANM.

This paper aimed to find factors correlated with early and late-onset menopause using three popular data mining approaches. The RF models were consistently better classifiers than the other models. Age at menarche and BMI were important contributors to menopause timing. Future research focusing on the effect of vitamin D, vitamin B12, and SHBG levels on menopause timing is proposed and would produce valuable information for better prediction of the age at which menopause starts.

Financial support and sponsorship


Conflicts of interest

There are no conflicts of interest.


1. Amanvermez R, Tosun M. An update on ovarian aging and ovarian reserve tests. Int J Fertil Steril 2016;9:411-5.
2. Research on the menopause in the 1990s. Report of a WHO Scientific Group. World Health Organization technical report series. 1996;866:1-107.
3. O'Connor KA, Holman DJ, Wood JW. Menstrual cycle variability and the perimenopause. Am J Hum Biol 2001;13:465-78.
4. Cagnacci A, Pansini FS, Bacchi-Modena A, Giulini N, Mollica G, De Aloysio D, et al. Season of birth influences the timing of menopause. Hum Reprod 2005;20:2190-3.
5. Pikwer M, Bergstrom U, Nilsson JA, Jacobsson L, Turesson C. Early menopause is an independent predictor of rheumatoid arthritis. Ann Rheum Dis 2012;71:378-81.
6. Hong JS, Yi SW, Kang HC, Jee SH, Kang HG, Bayasgalan G, et al. Age at menopause and cause-specific mortality in South Korean women: Kangwha Cohort Study. Maturitas 2007;56:411-9.
7. Jacobsen BK, Heuch I, Kvåle G. Age at natural menopause and all-cause mortality: A 37-year follow-up of 19,731 Norwegian women. Am J Epidemiol 2003;157:923-9.
8. Ruth KS, Murray A. Lessons from genome-wide association studies in reproductive medicine: Menopause. Semin Reprod Med 2016;34:215-23.
9. Ahmed K, Jahan P, Nadia I, Ahmed F, Abdullah Al E. Assessment of menopausal symptoms among early and late menopausal midlife Bangladeshi women and their impact on the quality of life. J Menopausal Med 2016;22:39-46.
10. Kelsey TW, Anderson RA, Wright P, Nelson SM, Wallace WH. Data-driven assessment of the human ovarian reserve. Mol Hum Reprod 2012;18:79-87.
11. Malinowski J, Farber-Eger E, Crawford DC. Development of a data-mining algorithm to identify ages at reproductive milestones in electronic medical records. Pac Symp Biocomput 2014:376-87.
12. Faubion SS, Kuhle CL, Shuster LT, Rocca WA. Long-term health consequences of premature or early menopause and considerations for management. Climacteric 2015;18:483-91.
13. Canonico M, Plu-Bureau G, O'Sullivan MJ, Stefanick ML, Cochrane B, Scarabin PY, et al. Age at menopause, reproductive history and venous thromboembolism risk among postmenopausal women: The women's health initiative hormone therapy clinical trials. Menopause 2014;21:214-20.
14. Statnikov A, Aliferis CF. Are random forests better than support vector machines for microarray-based cancer classification? AMIA Annu Symp Proc 2007;2007:686-90.
15. Mishra GD, Pandeya N, Dobson AJ, Chung HF, Anderson D, Kuh D, et al. Early menarche, nulliparity and the risk for premature and early natural menopause. Hum Reprod 2017;32:679-86.
16. Rizvanovic M, Balic D, Begic Z, Babovic A, Bogadanovic G, Kameric L. Parity and menarche as risk factors of time of menopause occurrence. Med Arch 2013;67:336-8.
17. Zsakai A, Mascie-Taylor N, Bodzsar EB. Relationship between some indicators of reproductive history, body fatness and the menopausal transition in Hungarian women. J Physiol Anthropol 2015;34:35.
18. Bjelland EK, Hofvind S, Byberg L, Eskild A. The relation of age at menarche with age at natural menopause: A population study of 336 788 women in Norway. Hum Reprod 2018;33:1149-57.
19. Hannibal L, Lysne V, Bjorke-Monsen AL, Behringer S, Grunert SC, Spiekerkoetter U, et al. Biomarkers and algorithms for the diagnosis of vitamin B12 deficiency. Front Mol Biosci 2016;3:27.
20. Allen LH. Vitamin B-12. Adv Nutr 2012;3:54-5.
21. Carmel R, Howard JM, Green R, Jacobsen DW, Azen C. Hormone replacement therapy and cobalamin status in elderly women. Am J Clin Nutr 1996;64:856-9.
22. Allen LH, Miller JW, de Groot L, Rosenberg IH, Smith AD, Refsum H, et al. Biomarkers of Nutrition for Development (BOND): Vitamin B-12 review. J Nutr 2018;148:1995S-2027S.
23. Maggio M, Lauretani F, Basaria S, Ceda GP, Bandinelli S, Metter EJ, et al. Sex hormone binding globulin levels across the adult lifespan in women--The role of body mass index and fasting insulin. J Endocrinol Invest 2008;31:597-601.
24. Soman M, Huang LC, Cai WH, Xu JB, Chen JY, He RK, et al. Serum androgen profiles in women with premature ovarian insufficiency: A systematic review and meta-analysis. Menopause 2019;26:78-93.
25. Purdue-Smithe AC, Whitcomb BW, Szegda KL, Boutot ME, Manson JE, Hankinson SE, et al. Vitamin D and calcium intake and risk of early menopause. Am J Clin Nutr 2017;105:1493-501.
26. Hong SH, Lee JE, An SM, Shin YY, Hwang DY, Yang SY, et al. Effect of vitamin D3 on biosynthesis of estrogen in porcine granulosa cells via modulation of steroidogenic enzymes. Toxicol Res 2017;33:49-54.
27. Jukic AMZ, Steiner AZ, Baird DD. Association between serum 25-hydroxyvitamin D and ovarian reserve in premenopausal women. Menopause (New York, NY) 2015;22:312-6.
28. Ahuja M. Age of menopause and determinants of menopause age: A PAN India survey by IMS. J Midlife Health 2016;7:126-31.
29. Akahoshi M, Soda M, Nakashima E, Tominaga T, Ichimaru S, Seto S, et al. The effects of body mass index on age at menopause. Int J Obes Relat Metab Disord 2002;26:961-8.
30. Tao X, Jiang A, Yin L, Li Y, Tao F, Hu H. Body mass index and age at natural menopause: A meta-analysis. Menopause 2015;22:469-74.
31. McTiernan A, Wu L, Chen C, Chlebowski R, Mossavar-Rahmani Y, Modugno F, et al. Relation of BMI and physical activity to sex hormones in postmenopausal women. Obesity (Silver Spring) 2006;14:1662-77.