The temporal reliability of serum estrogens, progesterone, gonadotropins, SHBG and urinary estrogen and progesterone metabolites in premenopausal women
© Williams et al; licensee BioMed Central Ltd. 2002
Received: 23 August 2002
Accepted: 23 December 2002
Published: 23 December 2002
There is little existing research to guide researchers in estimating the minimum number of measurement occasions required to obtain reliable estimates of serum estrogens, progesterone, gonadotropins, sex hormone-binding globulin (SHBG), and urinary estrogen and progesterone metabolites in premenopausal women.
Using data from a longitudinal study of 34 women with a mean age of 42.3 years (SD = 2.6), we calculated the minimum number of measurement occasions required to obtain reliable estimates of 12 analytes (8 in blood, 4 in urine). Five samples were obtained over 1 year: at baseline, and after 1, 3, 6, and 12 months. We also calculated the percent of true variance accounted for by a single measurement and intraclass correlation coefficients (ICC) between measurement occasions.
Only 2 of the 12 analytes we examined, SHBG and estrone sulfate (E1S), could be adequately estimated by a single measurement using a minimum reliability standard of having the potential to account for 64% of true variance. Other analytes required from 2 to 12 occasions to account for 81% of the true variance, and 2 to 5 occasions to account for 64% of true variance. ICCs ranged from 0.33 for estradiol (E2) to 0.88 for SHBG. Percent of true variance accounted for by single measurements ranged from 29% for luteinizing hormone (LH) to 92% for SHBG.
Experimental designs that take the natural variability of these analytes into account by obtaining measurements on a sufficient number of occasions will be rewarded with increased power and accuracy.
Several active research programs are investigating the risk associated with serum estrogens, gonadotropins and urinary sex hormone metabolites for a variety of diseases including breast cancer, endometrial cancer, and osteoporosis. The results of the few published studies suggest that the natural temporal variability (true variation over time, not variation due to storage or other factors) of some serum estrogens, gonadotropins and urinary sex hormone metabolites is sufficiently great that a single measurement occasion may be inadequate to ensure a reliable estimate [4–6]. Published intraclass correlation coefficients (ICC) vary between 0.06 and 0.62 for estradiol (E2) and between 0.52 and 0.69 for estrone (E1). Only the percent of free E2 and of SHBG-bound E2 have been found to be sufficiently reliable to account for as much as 50% of the variance in the true mean (ICC > 0.7).
The term reliability can refer either to the consistency of a measuring procedure or to the temporal stability of the target of measurement. The definition of temporal reliability used in this study includes both those dimensions, but emphasizes the latter. While researchers can control error due to insufficient repeated measures by increasing the number of measurement occasions, obtaining measurements is expensive. It is therefore useful to have evidence-based guidelines for estimating the minimum number of occasions required to obtain a given degree of reliability for a particular analyte.
The relation of a measurement to the object being measured can be represented as $\sigma_O = \sigma_T + \sigma_E$, where $\sigma_O$ is the variance in the observed measurement of the target, $\sigma_T$ is the variance in the true value of the target, and $\sigma_E$ is random variance, or error. If the true value of the target is invariant across measurements, i.e., if $\sigma_T = 0$, then $\sigma_O = \sigma_E$ and the observed variance will be purely a function of the unreliability of the measuring instrument. Conversely, if perfectly error-free measurement of the target could be assumed, i.e., if $\sigma_E = 0$, then $\sigma_O = \sigma_T$ and the observed variance would be purely a function of the temporal stability of the target. If $\sigma_E \neq 0$ and $\sigma_T \neq 0$, the observed variance will be a function of both the temporal stability of the target and of the unreliability of the measuring instrument.
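The additive decomposition above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true values and the errors are drawn independently from normal distributions, and the standard deviations, sample size, and seed are arbitrary choices that do not come from the study.

```python
import random
import statistics


def simulate_observed_variance(n=20000, sd_true=1.0, sd_error=0.5, seed=0):
    """Simulate observed = true + error and return the three sample variances.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, sd_true) for _ in range(n)]
    errors = [rng.gauss(0.0, sd_error) for _ in range(n)]
    observed = [t + e for t, e in zip(true_values, errors)]
    return (
        statistics.pvariance(observed),
        statistics.pvariance(true_values),
        statistics.pvariance(errors),
    )


var_o, var_t, var_e = simulate_observed_variance()
# With independent error, the observed variance is close to the sum of the
# true-value variance and the error variance (up to sampling noise).
print(round(var_o, 3), round(var_t + var_e, 3))
```

Setting `sd_error` to 0 reproduces the error-free case, in which the observed variance equals the variance of the true values.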
Measurement error can result from a variety of factors, including true variance not captured by a particular measurement strategy, which may complicate the interpretation of temporal reliability estimates. These other factors include variance due to: fluctuations across cycle phases within each woman's menstrual cycle; duration of sample storage prior to analysis; limitations of the assay; multiple analysis batches; multiple types of assays; and multiple laboratories. Ideally, estimates of as many sources of error as possible should be included when considering the impact of temporal reliability on measurement strategy. The objective of this study was to determine the following for various serum estrogens, gonadotropins, and urinary sex hormone metabolites: the minimum number of repeated measurements required for reliable estimates; the ICCs; and the amount of true variance accounted for by single measurements.
The data for this study come from a randomized double-blind study investigating the effects of a 100 mg/day soy isoflavone regimen on estrogen levels in 34 premenopausal women. A detailed description of the study design and the results of the intervention were reported by Maskarinec et al. (2002). The Committee on Human Studies at the University of Hawaii approved the study protocol. Written informed consent was obtained from each subject prior to participation. Each study group consisted of 17 premenopausal women. Four women left the study before the end of the year and another was able to give only four blood draws for health reasons. Eligibility criteria included: an age range of 35–46 years; an average intake of less than 7 servings of soy foods per week; no prior cancer diagnosis (except basal cell skin carcinoma); no use of oral contraceptives or hormone preparations within the past three months; no intention of becoming pregnant within the next year; an intact uterus and ovaries; self-defined regular menstrual periods; and no serious medical condition. Subjects had a mean age of 42.3 years (SD = 2.6) and a mean weight of 65.6 kg (SD = 12.8). Subjects were ethnically diverse: 18 were Caucasian; 6 were Chinese; 5 were Japanese; 5 were Hawaiian.
Subjects were asked to donate 5 urine and blood samples, one at baseline and one after 1, 3, 6, and 12 months of participation. All samples were collected approximately 5 days after ovulation (approximately day 19 in a 28-day cycle). Subjects used ovulation kits (Ovuquick test kits from Quidel, La Jolla, CA) to determine the time of ovulation. This kit detects the mid-cycle rise of LH using morning urine with a sensitivity of 35 mIU/mL of LH, and its predictive validity with respect to ovulation has been estimated as 93%. Although the use of a minimum progesterone value to exclude data from anovulatory cycles from the analyses helped ensure acquisition of mid-luteal phase samples, only 52% of samples were obtained on exactly the 5th day from ovulation. Ninety-one percent were obtained between the 4th and the 6th day from ovulation. Blood samples were drawn at a commercial laboratory in the morning, between 7 and 9 o'clock, to control for circadian rhythm in hormone levels. Serum and urine samples were stored at -80°C after separation and aliquoting.
Table: Coefficients of variation for all analytes (columns include the mean of the QC value for Batch 1 and for Batch 2).
Urine samples were analyzed for estrone-3-glucuronide (E1-G), pregnanediol-3-glucuronide (PDG), 16α-hydroxyestrone (16α-OHE1) and 2-hydroxyestrone (2-OHE1). E1-G and PDG were measured directly in urine by enzyme immunoassay. Commercially available enzyme-linked immunosorbent assay kits (Estramet: Immuna Care Corporation, Bethlehem, PA) were used to determine levels of 16α-OHE1 and 2-OHE1 in urine. All results are relative to creatinine excretion.
The SAS statistical software package version 8.2 (SAS Institute Inc., Cary, NC, 1999–2001) was used to perform the statistical analyses. All statistics were computed using logged values when raw values were not normally distributed. To ensure that all measurements in the analysis were from the same time in the menstrual cycle, observations were included only if the concurrent progesterone value was at least 5 ng/mL, the minimum value taken to indicate that ovulation had occurred. Because analyses for 8 of 12 analytes were conducted in two batches, we included consideration of error due to between-batch variance in our analysis of the temporal stability of these analytes. Therefore, estimates of temporal stability for the 8 analytes were calculated for the total number of samples and for the first and second batches separately.
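The inclusion rule and the log transformation described above can be sketched in a few lines. The study itself used SAS; the Python below is only an illustration. The record layout, field names, and values are hypothetical, and the natural logarithm is an assumption because the text does not state the base.

```python
import math

PROGESTERONE_CUTOFF_NG_ML = 5.0  # inclusion threshold stated in the text


def select_luteal_observations(records, cutoff=PROGESTERONE_CUTOFF_NG_ML):
    """Keep only records whose concurrent progesterone is at least the cutoff."""
    return [rec for rec in records if rec["progesterone"] >= cutoff]


def log_transform(values):
    """Natural-log transform for strictly positive, skewed values (base assumed)."""
    return [math.log(v) for v in values]


# Hypothetical records; field names and numbers are made up for illustration.
records = [
    {"subject": "A", "occasion": 1, "progesterone": 12.4, "e2": 150.0},
    {"subject": "A", "occasion": 2, "progesterone": 3.1, "e2": 90.0},
    {"subject": "B", "occasion": 1, "progesterone": 5.0, "e2": 120.0},
]

kept = select_luteal_observations(records)
logged_e2 = log_transform([rec["e2"] for rec in kept])
print(len(kept), [round(x, 3) for x in logged_e2])
```

The second record is dropped because its progesterone value falls below the cutoff, which is how a presumed anovulatory cycle would be excluded.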
Two types of estimates of the number of measurement occasions (O) necessary to obtain an adequately reliable estimate were computed. The first, the relative type ($O_R$), includes the between-subject variance. $O_R$ was computed using the formula proposed by Nelson et al.: $O_R = \frac{r^2}{1 - r^2} \times \frac{s_W^2}{s_B^2}$, where $r$ is the correlation between the observed and the true mean analyte values for an individual over a year, $s_W^2$ is the within-subject variance, and $s_B^2$ is the between-subject variance. Setting $r$ to 0.9 results in a calculation of the number of measurement occasions required to obtain an estimate that would account for $0.9^2 = 0.81$, or 81%, of the true variance in the target. Ninety-five percent confidence intervals (95% CI) for $O_R$ were computed using a published method.
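As a rough illustration of the relative estimate, the sketch below assumes the formula takes the commonly cited Nelson et al. form, $O_R = \frac{r^2}{1 - r^2} \times \frac{s_W^2}{s_B^2}$. The function name and the variance components in the example are hypothetical and are not values from this study.

```python
def occasions_relative(r, within_var, between_var):
    """Occasions needed so the observed mean correlates r with the true mean.

    Assumes O_R = r^2 / (1 - r^2) * (within_var / between_var).
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    if within_var < 0.0 or between_var <= 0.0:
        raise ValueError("within_var must be >= 0 and between_var must be > 0")
    return (r ** 2 / (1.0 - r ** 2)) * (within_var / between_var)


# Hypothetical case: within-subject variance equal to between-subject variance.
for r in (0.9, 0.8):
    share = r ** 2  # proportion of true variance accounted for
    needed = occasions_relative(r, within_var=1.0, between_var=1.0)
    print(f"r={r}: {share:.0%} of true variance, about {needed:.2f} occasions")
```

Under this form, relaxing the standard from $r = 0.9$ (81% of true variance) to $r = 0.8$ (64%) lowers the multiplier on the variance ratio from $0.81/0.19$ to $0.64/0.36$, which is why fewer occasions are needed at the lower standard.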
The second estimate of the number of measurement occasions necessary to obtain an adequately reliable estimate, the absolute type ($O_A$), includes only within-subject variance. $O_A$ was calculated as [equation missing from the source text], where $\sigma_w$ is the within-subject variance. By adjusting the denominator, this method allows for the desired approximation to the true mean to be specified as a percentage. Setting the denominator to 0.2 results in a calculation of the number of occasions required to obtain an estimate that is within 20% of the true mean. A SAS macro using Proc Varcomp and Proc Means to produce estimates of $O_R$, $O_A$, and related statistics is available from the authors.
ICCs measure the proportion of variance attributable to targets of measurement as a ratio of between-subject variance to total variance and are suitable to compare variables of the same measurement class. We computed two types of ICCs using the notation developed by Shrout and Fleiss: ICC(2,1) was computed for each analyte using all 5 measurement occasions to estimate the temporal reliability of the analyte; ICC(2,k) was computed between batches to estimate the contribution of between-batch variance to the temporal reliability estimate. ICC(2,1) was computed as $\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k - 1)\,EMS + k\,(OMS - EMS)/n}$, where BMS is the between-subjects mean square, EMS is the error mean square, k is the number of observations, OMS is the observations mean square, and n is the number of subjects. ICC(2,k) was computed as $\mathrm{ICC}(2,k) = \frac{BMS - EMS}{BMS + (OMS - EMS)/n}$. We applied the formulas by Shrout and Fleiss to obtain 95% CIs.
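The two coefficients can be sketched from the mean squares of a two-way layout, assuming the standard Shrout and Fleiss forms ICC(2,1) = (BMS − EMS) / (BMS + (k − 1)EMS + k(OMS − EMS)/n) and ICC(2,k) = (BMS − EMS) / (BMS + (OMS − EMS)/n). The function below is a minimal illustration for a complete subjects-by-occasions table; the example table is illustrative only and is not data from this study.

```python
def icc_two_way(data):
    """Return (ICC(2,1), ICC(2,k)) for a complete subjects x occasions table."""
    n = len(data)      # number of subjects (rows)
    k = len(data[0])   # number of occasions (columns)
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need a complete table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_occasions = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_subjects - ss_occasions

    bms = ss_subjects / (n - 1)            # between-subjects mean square
    oms = ss_occasions / (k - 1)           # observations (occasions) mean square
    ems = ss_error / ((n - 1) * (k - 1))   # error mean square

    icc_single = (bms - ems) / (bms + (k - 1) * ems + k * (oms - ems) / n)
    icc_average = (bms - ems) / (bms + (oms - ems) / n)
    return icc_single, icc_average


# Illustrative table: 6 subjects (rows) measured on 4 occasions (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_21, icc_2k = icc_two_way(example)
print(round(icc_21, 2), round(icc_2k, 2))
```

The averaged coefficient is higher than the single-measurement coefficient for the same table, which matches the point that an ICC based on a mean of several values typically exceeds one based on single values.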
To estimate the percentage of true variance accounted for by a single measurement, we assumed that the best available estimate of the true variance was the total variance for all occasions.
Table: Basic descriptive data for all measurement occasions of all analytes (columns include the means for each measurement occasion: 2 (1 mo), 3 (3 mo), 4 (6 mo), 5 (1 yr); rows include Free Estradiol (pg/mL)).
Table: Minimum occasions required to obtain a reliable estimate, intraclass correlation coefficients, and percent of true variance accounted for by single measurements (columns include: measurement occasions required to account for 81% of true variance, relative method (95% CI); measurement occasions required to be within 20% of true mean; ICC(2,1) (95% CI); % of true variance accounted for by a single occasion (range)).
In the case of SHBG (Figure 2), within-subject variance is small relative to between-subject variance, resulting in small $O_R$ and $O_A$ estimates (0.48 and 1.78 respectively). The PDG values (Figure 3) illustrate the case in which within-subject variation is high and the subjects' values overlap one another considerably, resulting in relatively large $O_R$ and $O_A$ estimates (5.17 and 10.27 respectively). Finally, Figure 4 depicts the case in which within-subject variance is small, but so is the variance between subjects. In this case, the small within-subject variance results in a small $O_A$ estimate (0.34), but because the within-subject variance is not small relative to the between-subject variance, the $O_R$ is relatively large (8.26).
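To make the contrast concrete, the relative estimates quoted above can be converted back into the within-to-between variance ratios they would imply. This is a sketch under two assumptions that the text does not confirm directly: that the relative estimate follows the form $O_R = \frac{r^2}{1 - r^2} \times \frac{s_W^2}{s_B^2}$, and that the quoted values were computed at $r = 0.9$. The function name is hypothetical.

```python
def implied_variance_ratio(o_r, r=0.9):
    """Within/between variance ratio implied by a reported O_R.

    Inverts the assumed form O_R = r^2 / (1 - r^2) * (s_W^2 / s_B^2).
    """
    return o_r * (1.0 - r ** 2) / r ** 2


# O_R values quoted in the text for the three illustrated cases.
for label, o_r in (("SHBG", 0.48), ("PDG", 5.17), ("Figure 4 analyte", 8.26)):
    print(label, round(implied_variance_ratio(o_r), 2))
```

Under these assumptions the implied ratio is well below 1 for SHBG and above 1 for the other two cases, consistent with the description that $O_R$ depends on within-subject variance relative to between-subject variance rather than on its absolute size.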
Table: Intraclass correlation coefficients between batches for analytes analyzed in 2 batches.
We have provided estimates of the minimum number of measurement occasions required to ensure adequate reliability for two types of experimental aims. Analyses in epidemiologic studies involve calculations in which between-subject as well as within-subject variance is important. Therefore, $O_R$ will usually be the appropriate index of the minimum number of occasions needed to obtain a reliable estimate. Estimates of $O_R$ based on our sample suggest that only SHBG and E1S had sufficient temporal stability to be adequately reliable with a single measurement when the desired amount of variance to account for was set as low as 64%. A single measurement of any of the other analytes would be unlikely to account for even 50% of the true variance. For cases in which the within-subject variance is the only variance of interest, e.g., when the measured value of an analyte will be compared with a fixed standard, $O_A$ will be the appropriate index. The omission of between-subject variance from the formula for calculating this statistic produces very different results from $O_R$. Several of the analytes that were adequately reliable with a single measurement or very few measurements when between-subject variance was a factor required higher numbers of measures when only within-subject variance was involved, and vice versa.
This study confirms previous findings that SHBG may be reliably measured in premenopausal women using a single occasion. It also indicates that E1S may be reliably measured using one sample only. More importantly, our results suggest that none of the other analytes examined meet minimal reliability requirements that would permit confidence in single measures. These results are in agreement with the wide range of ICCs reported in previous studies [4–6]. Our conclusions are limited to the collection of samples at midluteal phase, however, and may not generalize to other phases of the menstrual cycle.
The use of ICCs to estimate the agreement between analysis batches differs from their use as an index of temporal reliability. The appropriate type of ICC for this purpose uses a mean of several values rather than single values and is typically higher than that calculated using single values. Though the ICCs between batches were higher than those estimating temporal reliability, they were relatively low, demonstrating the importance of measuring all samples in one batch when possible. As was previously noted, error due to time in storage will affect estimates of temporal reliability. Analyzing in multiple batches is one means of decreasing this source of error, but runs the risk of increasing error due to multiple batches. Until better estimates of the impact of storage time on each of these analytes are available, however, it will be difficult to draw conclusions about whether error due to multiple analysis batches or error due to storage time has the more detrimental effect on temporal reliability.
Several sources of error are effectively beyond researchers' capacity to control. For example, the validity and reliability of the best assay available for measuring a given analyte cannot be increased through improving study design. Other sources of error, however, can be dramatically reduced through the use of appropriate designs. These strategies may include increasing the sample size to reduce the impact of random error, analyzing all samples in one batch, and using a sufficient number of repeated measures to obtain an adequately reliable estimate. It is also possible, though not uncontroversial, to control error statistically by correcting for attenuation using validation data.
Several improvements, in addition to a larger sample and more repeated measures, would have increased confidence in the results of our study. First, if the effects of storage time on the analytes were known, we could have taken into account the contributions of this source of variance to our temporal reliability estimates and distinguished its impact from that due to assay reliability. Second, obtaining blood and urine samples on day 5 following ovulation was most appropriate for the measurement of progesterone and near-optimal for SHBG, but may not have been the best day to obtain estimates of the other analytes. Third, though our data were drawn from an intervention study in which no results approached significance, a more clearly homogeneous sample would have been preferable. Fourth, variation in menstrual cycle length and variance due to pulsatility of excretion were additional sources of error.
Finally, our estimates were based on targets that changed across measurements, and we could not assume error-free measurements. Consequently, we were not able to precisely distinguish between the contributions of assay reliability and the contributions of each analyte's natural variability to our estimates of temporal reliability. However, despite some limitations, this study provided significant new insights into the variability of sex hormones, gonadotropins, and urinary hormone metabolites in premenopausal women during a one-year period. Our estimates of temporal reliability reflect both the consistency of a measure across repeated measurements and the temporal fluctuations in the target of measurement.
Given the relatively large sample size for this analysis and the strictly controlled protocol to collect samples on the same day of the menstrual cycle, our results will be useful for designing future research projects exploring the role of sex hormones in the etiology of cancer and other diseases.
The authors gratefully acknowledge the valuable assistance, advice, and guidance provided by Lynne R. Wilkens, Dr PH, and Ian Pagano, MA, both of the Cancer Research Center of Hawaii. We are grateful to the women who donated their time and effort to participate in this study. The project was funded by a contract from the Pharmavite Corporation in San Fernando, California and by a Developmental Funds award from the Cancer Center Support grant to the Cancer Research Center of Hawaii (P30CA071789).
1. Muti P, Bradlow HL, Micheli A, Krogh V, Freudenheim JL, Schunemann HJ, et al: Estrogen metabolism and risk of breast cancer: a prospective study of the 2:16alpha-hydroxyestrone ratio in premenopausal and postmenopausal women. Epidemiology. 2000, 11: 635-640. 10.1097/00001648-200011000-00004.
2. Parslov M, Lidegaard O, Klintorp S, Pedersen B, Jonsson L, Eriksen PS, et al: Risk factors among young women with endometrial cancer: a Danish case-control study. Am J Obstet Gynecol. 2000, 182: 23-29.
3. Moreira Kulak CA, Schussheim DH, McMahon DJ, Kurland E, Silverberg SJ, Siris ES, et al: Osteoporosis and low bone mass in premenopausal and perimenopausal women. Endocr Pract. 2000, 6: 296-304.
4. Michaud DS, Manson JE, Spiegelman D, Barbieri RL, Sepkovic DW, Bradlow HL, et al: Reproducibility of plasma and urinary sex hormone levels in premenopausal women over a one-year period. Cancer Epidemiol Biomarkers Prev. 1999, 8: 1059-1064.
5. Muti P, Trevisan M, Micheli A, Krogh V, Bolelli G, Sciajno R, et al: Reliability of serum hormones in premenopausal and postmenopausal women over a one-year period. Cancer Epidemiol Biomarkers Prev. 1996, 5: 917-922.
6. Toniolo P, Koenig KL, Pasternack BS, Banerjee S, Rosenberg C, Shore RE, et al: Reliability of measurements of total, protein-bound, and unbound estradiol in serum. Cancer Epidemiol Biomarkers Prev. 1994, 3: 47-50.
7. Nunnally JC, Bernstein IH: Psychometric Theory. 1994, New York: McGraw-Hill, Inc, 3rd edition.
8. Greenland S: Basic methods for sensitivity analysis of biases. Int J Epidemiol. 1996, 25: 1107-1116.
9. Wong MY, Day NE, Wareham NJ: Measurement error in epidemiology: the design of validation studies II: bivariate situation. Stat Med. 1999, 18: 2831-2845. 10.1002/(SICI)1097-0258(19991115)18:21<2831::AID-SIM282>3.3.CO;2-V.
10. Gail MH, Fears TR, Hoover RN, Chandler DW, Donaldson JL, Hyer MB, et al: Reproducibility studies and interlaboratory concordance for assays of serum hormone levels: estrone, estradiol, estrone sulfate, and progesterone. Cancer Epidemiol Biomarkers Prev. 1996, 5: 835-844.
11. Bolelli G, Muti P, Micheli A, Sciajno R, Franceschetti F, Krogh V, et al: Validity for epidemiological studies of long-term cryoconservation of steroid and protein hormones in serum and plasma. Cancer Epidemiol Biomarkers Prev. 1995, 4: 509-513.
12. Falk RT, Gail MH, Fears TR, Rossi SC, Stanczyk F, Adlercreutz H, et al: Reproducibility and validity of radioimmunoassays for urinary hormones and metabolites in pre- and postmenopausal women. Cancer Epidemiol Biomarkers Prev. 1999, 8: 567-577.
13. Maskarinec G, Williams A, Inouye J, Stanczyk F, Franke A: A randomized isoflavone intervention among premenopausal women. Cancer Epidemiol Biomarkers Prev. 2002, 11: 195-201.
14. Rudy EB, Estok P: Professional and lay interrater reliability of urinary luteinizing hormone surges measured by OvuQuick test. J Obstet Gynecol Neonatal Nurs. 1992, 21: 407-411.
15. Goebelsmann U, Bernstein GS, Gale JA, Kletzky OA, Nakamura RM, Coulson AH, et al: Serum gonadotropin testosterone estradiol and estrone levels prior to and following bilateral vasectomy. In Vasectomy: Immunologic and Pathophysiologic Effects In Animals And Man. Edited by: Lepow IH, Crozier R. 1979, New York: Academic Press, 165.
16. Sodergard R, Backstrom T, Shanbag V, Carstensen H: Calculation of free and bound fractions of testosterone and estradiol-17β to human plasma proteins at body temperature. Steroid Biochem Mol Biol. 1982, 16: 801-810.
17. Munro CJ, Stabenfeldt GH, Cragun JR, Addlego LA, Overstreet JW, Lasley BL: Relationship of serum estradiol and progesterone concentrations to the excretion profiles of their major urinary metabolites as measured by enzyme immunoassay and radioimmunoassay. Clin Chem. 1991, 37: 638-644.
18. Falk RT, Rossi SC, Fears TR, Sepkovic DW, Migella A, Adlercreutz H, et al: A new ELISA kit for measuring urinary 2-hydroxyestrone, 16alpha-hydroxyestrone, and their ratio: reproducibility, validity, and assay performance after freeze-thaw cycling and preservation by boric acid. Cancer Epidemiol Biomarkers Prev. 2000, 9: 81-87.
19. Nelson M, Black AE, Morris JA, Cole TJ: Between- and within-subject variation in nutrient intake from infancy to old age: estimating the number of days required to rank dietary intakes with desired precision. American Journal of Clinical Nutrition. 1989, 50: 155-167.
20. Wilkens LR, Le Marchand L, Harwood P, Cooney RV: Use of Breath Hydrogen and Methane as Markers of Colonic Fermentation In Epidemiological Studies: Variability in Excretion. Cancer Epidemiol Biomarkers Prev. 1994, 3: 149-153.
21. Beaton GH, Milner J, Corey P, McGuire V, Cousins M, Stewart E, et al: Sources of variance in 24-hour dietary recall data: Implications for nutrition study design and interpretation. American Journal of Clinical Nutrition. 1979, 32: 2546-2549.
22. Shrout PE, Fleiss JL: Intraclass Correlations: Uses in Assessing Rater Reliability. Psychological Bulletin. 1979, 86: 420-428. 10.1037//0033-2909.86.2.420.
23. McGraw KO, Wong SP: Forming Inferences About Some Intraclass Correlation Coefficients. Psychological Methods. 1996, 1: 30-46. 10.1037//1082-989X.1.1.30.
24. Wong MY, Day NE, Bashir SA, Duffy SW: Measurement error in epidemiology: the design of validation studies I: univariate situation. Stat Med. 1999, 18: 2815-2829. 10.1002/(SICI)1097-0258(19991115)18:21<2815::AID-SIM280>3.3.CO;2-R.
25. Ahmad N, Pollard M, Unwin N: The optimal timing of blood collection during the menstrual cycle for the assessment of endogenous sex hormones: can interindividual differences in levels over the whole cycle be assessed on a single day?. Cancer Epidemiol Biomarkers Prev. 2002, 11: 147-151.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6874/2/13/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.