Identification of genes and pathways associated with menopausal status in breast cancer patients using two algorithms

Background Menopausal status has a known relationship with the levels of estrogen, progesterone, and other sex hormones, potentially influencing the activity of ER, PR, and many other signaling pathways involved in the initiation and progression of breast cancer. However, the differences between premenopausal and postmenopausal breast cancer patients at the molecular level are unclear. Methods We retrieved eight datasets from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) associated with menopausal status in breast cancer patients were identified using the MAMA and LIMMA methods. Based on these validated DEGs, we performed Gene Ontology (GO) functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses. Protein–protein interaction (PPI) networks were constructed. We used DrugBank data to investigate which of these validated DEGs are targetable. Survival analysis was performed to explore the influence of these genes on breast cancer patient prognosis. Results We identified 762 DEGs associated with menopausal status in breast cancer patients. PPI network analysis indicated that these genes are primarily involved in pathways such as the cell cycle, oocyte meiosis and progesterone-mediated oocyte maturation pathways. Notably, several genes played roles in multiple signaling pathways and were associated with patient survival. These genes were also observed to be targetable according to the DrugBank database. Conclusion We identified DEGs associated with menopausal status in breast cancer patients. The association of these genes with several key pathways may promote understanding of the complex characterizations of breast cancer. Our findings offer valuable insights for developing new therapeutic strategies tailored to the menopausal status of breast cancer patients. Supplementary Information The online version contains supplementary material available at 10.1186/s12905-023-02846-7.


Introduction
As the leading cancer diagnosis in women, breast cancer accounted for approximately 2,261,000 new cases and 684,000 fatalities in 2020 [1].This hormone-dependent malignancy primarily affects the mammary gland in females.Accurate identification of menopausal status is vital for effective prevention, detection, and treatment [2,3].A population-based study investigating the impact of premenopausal and postmenopausal breast cancer revealed that the mortality rate of patients with postmenopausal breast cancer in 2018 was 3.7 times greater than that in patients with premenopausal breast cancer [4].Given the unique molecular characteristics of these two conditions, personalized strategies are required to manage breast cancer based on menopausal status.For instance, endocrine therapy, which reduces estrogen or progesterone levels, is recommended for postmenopausal patients with estrogen receptor (ER) or progesterone receptor (PR) positivity but is unsuitable for premenopausal patients [5,6].It has been widely recognized that menopausal status is associated with estrogen, progesterone, and other sex hormone levels, potentially influencing the activity of ER, PR and many other signaling pathways participating in the initiation and progression of breast cancer.However, the intricate molecular distinctions between premenopausal and postmenopausal breast cancer remain opaque.This gap in understanding impedes the full realization of precision medicine tailored to menopausal status.Therefore, enhancing our understanding of the unique molecular mechanisms of breast cancer through gene expression profile analyses is essential to improve early detection, diagnosis, and treatment strategies.
With the significant advancements in high-throughput technologies for genome-wide profiling of methylation events and gene expression levels, including methods such as methylation microarrays, MeDip-seq, and RNAseq, and the availability of public datasets, we can now analyze data collected worldwide.Leveraging bioinformatic methods, we have the tools to identify potential biomarkers and pathways linked to menopausal status.However, numerous challenges arise in the integration and analysis of datasets from different sources.Fortunately, improvement in the differential expression analysis method enables us to perform cross-study analysis.In recent years, various differential expression analysis methods have been proposed, providing a variety of tools to ensure the robustness of our research findings.
To date, large-scale bioinformatic studies focusing on the differentially expressed genes (DEGs) associated with menopause in breast cancer patients have been scarce.The primary objective of our study is to illuminate the molecular distinctions between premenopausal and postmenopausal breast cancer patients.In our study, we attempted to collect more datasets to increase the sample size.In an integrated large cohort, we performed differential expression analyses to identify DEGs using two different algorithms.Additionally, Gene Ontology functional enrichment and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses of the DEGs were performed.In addition, protein-protein interaction (PPI) networks were constructed to further elucidate the direct and indirect associations between the DEGs.In doing so, we hope to pinpoint key menopause-related biomarkers that could prove instrumental in future breast cancer research.Furthermore, understanding these biomarkers will undoubtedly shed light on the disease's pathogenesis, offering new avenues for clinical drug development and therapeutic interventions.

Microarray data for differentially expressed gene (DEG) analysis
We conducted an extensive search for breast cancer microarray datasets with a sample size of more than 20 in the Gene Expression Omnibus (GEO) database from the National Center for Biotechnology Information (NCBI) website (https:// www.ncbi.nlm.nih.gov/ geo/).From the 1058 breast cancer datasets identified, we specifically selected datasets based on the Affymetrix Human Genome U133 Plus 2.0 Array (platform GPL570, n of probes = 54,675) that included primary breast cancer tissues (excluding cell lines or animal tissues) and had information on menopausal status.Raw intensity files in CEL format from these 8 selected datasets were obtained from the GEO database.The R package "affy" was employed to convert the raw intensity files to gene expression profiles using the robust multiarray average (RMA) method [7].All data processing and statistical analyses were conducted in the R environment (https:// www.rproj ect.org).The study process is graphically represented in Fig. 1.

Differential expression analyses based on two algorithms
To identify DEGs, we utilized two different algorithms.First, we employed the R package "MAMA" to generate combined p values and combined effect sizes for the expression of each probe across all selected datasets [8].The cutoff criteria for this method were set as a combined p value of less than 0.01 and a combined z score of greater than 2 or less than − 2.Then, all the samples from the 8 selected datasets were integrated into a large cohort.The R package "limma" was then used to calculate the adjusted p value and fold change for the expression of each probe in this integrated cohort [9].The cutoff criteria for this method were an adjusted p value of less than 0.01 and a fold change (log 2) of greater than 2 or less than − 2. Probes demonstrating differential expression with consistent trends in both methods were selected for further analyses, and the genes mapped by these probes were identified as DEGs.

Visualization of the expression of DEGs
To present the top 50 differentially expressed probes, we utilized the R package "pheatmap" to generate heatmaps.Given that the sample sizes varied across datasets, we randomly selected a subset of 50 samples for plotting.The heatmap employed unsupervised hierarchical clustering using the Ward method with Manhattan distance to visualize the clustering patterns of either the samples or probes.At the top of the heatmaps, the category of each selected sample (premenopausal or postmenopausal) was marked.We have uploaded the pipeline of differential expression analyses and visualization to GitHub.It can be accessed at https:// github.com/ minzh angch eng/ BRCA_ menop ause.

Enrichment analysis of GO terms and KEGG pathways
We utilized the WEB-based Gene SeT AnaLysis Toolkit (WebGestalt) for the enrichment analysis of Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [10,11].The overrepresentation analysis (ORA) method was employed, with a significance threshold of p values less than 0.05, to identify the critical biological implications of the DEGs.
To further illustrate the direct and indirect associations among the DEGs, protein-protein interaction (PPI) networks were constructed and visualized using the Search Tool for the Retrieval of Interacting Genes/Proteins Fig. 1 Process of screening genes and pathways associated with menopausal status in breast cancer patients using two algorithms (STRING) database [12].The PPI networks were subsequently clustered using Cytoscope software [13] with the MCODE [14] plugin.Additionally, the setsApp [15] plugin was employed to color-code the network, highlighting gene sets associated with various clusters.

Exploration of targeted compounds for and patient survival related to DEGs
Drugs and their targets were downloaded from the Drug-Bank database (https:// go.drugb ank.com/) to investigate the potential targetability of these validated DEGs [16].Moreover, we utilized the TCGA breast cancer dataset to determine whether these genes influenced the overall survival (OS) of breast cancer patients.

DEGs associated with menopause in breast cancer patients
We used two distinct methods to screen for DEGs.In the first method, we treated the 8 datasets as separate cohorts.Within each dataset, we calculated p values and effect sizes and then generated combined p values and effect sizes.Applying a threshold of a combined p value less than 0.01 and a combined z score above 2, we identified 6286 differentially expressed probes (Table S1).In the second method, the samples from the 8 datasets were merged into a single large cohort.Using a cutoff of an adjusted p value less than 0.01 and a fold change (log2) greater than 2, 6620 probes were identified as differentially expressed probes (Table S1).The 1099 probes identified as differentially expressed by both methods were selected for further analyses and were found to map to 762 genes (Fig. 1, top 100 probes in Table 2, full list in Table S1).Additionally, heatmaps displaying the expression levels of the top 50 validated probes are presented in Fig. 2.

KEGG pathway enrichment and GO functional enrichment analyses
We conducted KEGG pathway enrichment analysis and gene ontology (GO) analysis using WebGestalt to ascertain the significant biological roles and molecular functions associated with the identified DEGs.As a result, we observed several enriched biological processes, including the p53 signaling pathway, extracellular matrix (ECM) structural constituents, cell cycle, and antifolate resistance (Fig. 3, full list in Table S2).Among the significantly enriched biological processes, the top overrepresented groups were related to the regulation of cell differentiation, proliferation, migration, and the cell cycle (Fig. 3, full list in Table S4).

Characterization of proteins encoded by the DEGs according to PPI network analysis
To gain further insight into the biological characteristics of the proteins encoded by the identified DEGs, we performed a protein-protein interaction (PPI) network analysis using STRING.The PPI network revealed a complex network of interactions among the DEGs (Fig. 4A).We simplified the network and identified highly interconnected regions by clustering the network using the MCODE algorithm.We present the top 10 subnetworks generated from this analysis in Fig. 4B and Table 3.We also performed further KEGG pathway enrichment analyses within these subnetworks (Table 4).These enrichment results showed that the first cluster was significantly associated with the cell cycle pathway.It is also involved in oocyte meiosis and progesterone-mediated oocyte maturation-all of which are menopause-related pathways.The second cluster was significantly associated with several tumor signaling pathways, such as the PI3K-Akt signaling pathway, EGFR tyrosine kinase inhibitor resistance pathway, Ras signaling pathway, and MAPK signaling pathway.

Targeted compounds and clinical significance of DEGs
Of the 762 validated genes, 89 genes were found to have targeted compounds (Table 4; full list for DEGs in Table

S5
).Among these genes, 15 were derived from Cluster 1, and 7 were derived from Cluster 2. Furthermore, 73 genes with targeted compounds were related to the OS of breast cancer patients (Table 4; full list for DEGs in Table S5).

Discussion
In this study, we investigated the differential gene expression between premenopausal and postmenopausal breast cancer patients by analyzing eight breast cancer datasets comprising 693 samples.We aimed to enhance the reliability of our analysis results by employing two different algorithms.As a result, we identified 762 DEGs that exhibited significant differences between the two groups.Among these, multiple genes have been well clarified to be associated with tumour initiation and progression.These include Matrix Metallopeptidase 7 (MMP7),   been found to be associated with breast cancer metastasis in our previous research [25].Among the top enriched pathways, the p53 signaling pathway and Hippo pathway are particularly remarkable, because they are involved in various intracellular regulations, including cellular senescence, energy metabolism regulating and blocking metastasis.The p53 signaling pathway, crucial in tumorigenesis [26], is frequently mutated in various human tumors, leading to a loss of its inhibitory effect on tumor growth.In this report, CDKN2A, gene within the p53 pathway, is involved in p53-dependent cellular senescence, proliferation, and apoptosis, while it may be a pioneering prognostic predictor for breast cancer [27,28].Furthermore, Cyclin D1 phosphorylates Rb by binding to cyclin-dependent kinase (CDK) 4/6, resulting in activation of E2F transcription and cell cycle transition from her G1 phase to S phase.The tumor tumor-suppressive role of SERPINB5 in breast cancer is also supported by experimental evidence [29].On the other hand, the Hippo pathway, originally discovered in Drosophila melanogaster as a crucial regulator of tissue development, is involved in tumorigenesis by regulating cell proliferation and apoptosis.For example, aberrations in the Hippo pathway and YAP/TAZ-TEAD activity are closely related to various human cancers, while targeting the Hippo pathway for treatment remains a compelling challenge [30].
Raltitrexed and pemetrexed selectively inhibit glycinamide ribonucleotide transformylase (GART) and thymidylate synthase (TYMS), which are crucial for the de novo biosynthesis of purine and thymidine nucleotides, respectively.These antifolates have been introduced for the treatment of malignant tumors.ATP-binding cassette sub-family C member 3 (ABCC3, also known as MRP3), a member of the ATP-driven multidrug resistance (MDR) transporters, mediates the efflux of folates and hydrophilic antifolates.Gamma-glutamyl hydrolase (GGH) catalyzes the removal of gamma-linked polyglutamates from (anti)folylpolygamma-glutamates.Additionally, a recent study has shown that the expression level of GGH is associated with poor prognosis and unfavorable clinical outcomes in invasive breast cancer [36].We believe that the association between these four genes and antifolates represents one of multiple pathways that could potentially act in both premenopausal and postmenopausal breast cancer.
Further KEGG pathway enrichment analysis based on the PPI subnetwork provided additional information.The first cluster was significantly associated with several important pathways, including the cell cycle, oocyte meiosis, and progesterone-mediated oocyte maturation pathways.The cell cycle is fundamental to the growth and development of all organisms and plays a significant role in cancer development and progression.For example, dysregulation of the cell cycle is a hallmark of cancer, and many chemotherapeutic drugs exert their effects by targeting the cell cycle machinery [37].We identified several DEGs involved in the cell cycle, including CDK1, CHEK1, CDC25C, BUB1, CDC20, and TTK, which not only are related to breast cancer patient survival but also have existing targeted drugs.However, none have been reported in association with menopause.How these genes affect premenopausal and postmenopausal breast cancer has not yet been fully Further study of these genes related to the cell cycle pathway will help us understand the mechanism of breast cancer for different menopausal statuses and strengthen the potential utility of these genes as therapeutic targets.In addition, CDK1 and CHEK1 are involved in the p53 signaling pathway, indicating the potential effect of menopausal status on the activity of p53 signaling.Consistent with the key role of menopause in our study, we observed that DEGs involved in oocyte meiosis and progesterone-mediated oocyte maturation, two pathways closely associated with reproductive aging and cessation, also emerged as significant in our analysis.It is widely accepted that women's hormonal milieu undergoes significant changes during menopause, with potential implications for breast cancer biology [38].Previous studies have reported the association of these pathways with breast cancer [39][40][41].In addition to CDC25C, BUB1, and CDK1 mentioned above, AURKA, which plays a role in both pathways, is linked to survival and has targeted drugs.Importantly, AURKA has been found to be associated with an increased risk of invasive breast cancer among postmenopausal women [42].
The second cluster of DEGs, including FGFR2, KDR2 and MET, indicates the importance of key cancer-related pathways, including the PI3K-Akt signaling pathway, EGFR tyrosine kinase inhibitor resistance pathway, Rap1 signaling pathway, Ras signaling pathway, and MAPK signaling pathway.A few studies have reported associations of these pathways with breast cancer.In addition, drugs targeting these signaling pathways are available.For the first time, our study reveals a connection between these signaling pathways and menopausal status, laying the groundwork for future clinical development of breast cancer treatment strategies that cater to women with different menopausal statuses.Among these DEGs, KDR and MET are linked to survival and have available targeted drugs.Therapies targeting these key genes may be effective in improving patient outcomes.Additionally, one GWAS presented solid evidence of a strong association between the FGFR2 locus and ER status in breast cancer patients [43].Another study found that menopause has a greater impact on ER-than ER+ breast cancer incidence [44].These findings, along with ours, hint Interestingly, the Cluster 6 genes involved in PPAR signaling and adipose metabolism showed different expression between premenopausal and postmenopausal breast cancer patients.It has been well established that after menopause, lower levels of estrogen can lead to the accumulation of fat around the waist instead of the hips and thighs.For postmenopausal women, abdominal fat makes up 15 to 20% of their total body weight, compared to 5 to 8% in premenopausal women [45].This also validates the reliability of our differential expression analysis results.Notably, adiposity is a risk factor for developing breast cancer in postmenopausal women, as breast fat has a major role in the genesis and progression of breast cancer.Rose et al. argued that obese postmenopausal women have an increased breast cancer risk, the principal mechanism for which is elevated estrogen production by adipose tissue [46].Our analysis showed that DEGs (CD36, FABP4, SLC27A6, PPARA) enriched in the PPAR signaling pathway were all strongly associated with patient survival.However, whether menopauseassociated obesity affects the initiation and progression of breast cancer remains an open question.
Additionally, many chemokines or cytokines, such as CCL20, CXCL5, and CXCL13 (Cluster 9), had significantly different expression levels between the two populations, which indicates differences in the tumor microenvironment.This difference could lead to a change in the infiltration of immune cells in tumor tissues and affect the efficacy of immune treatment.Locally produced and systemic cytokines are likely to affect breast cancer growth and behavior [47].
Compared with previous studies, our research benefits from a larger sample size and the use of two different algorithms to enhance the robustness of the results.In addition, the MAMA algorithm allows us to analyze data from different geographic regions.The studies included in our analysis encompass samples not just from the United States but also from Germany, France, and Belgium.This geographical diversity ensures a more global representation.However, this study has several limitations.First, some subsets lacked crucial clinical information, preventing us from analyzing the effect of clinical factors on gene expression across the entire cohort, even though we understand that some clinical factors, such as age and race, might menopausal status or gene expression.Second, despite using two algorithms to bolster the robustness of our results, it was challenging to determine whether we overlooked an essential gene due to algorithm differences.Third, it would be preferable to have an independent validation set.Therefore, we are attempting to collect our own clinical samples and pay more attention to these points mentioned above in our future studies.Other databases, such as TCGA, are also valuable resources for cancer research [48][49][50], but we did not use them in this study because they did not meet the requirements of the MAMA algorithm.
In conclusion, we utilized two differential expression analysis methods to identify several DEGs associated with menopausal status in a large integrated cohort.The interactions of the DEGs were depicted through PPI networks.Furthermore, we identified several key pathways.Most of our results related to menopausal status are reported for the first time; thus, these findings could provide a valuable reference for treating patients with premenopausal and postmenopausal breast cancer.Understanding the DEGs between premenopausal and postmenopausal breast cancer and elucidating their roles in the development and progression of the disease can offer valuable insights into its underlying mechanisms.Further studies are needed to comprehensively investigate this relationship and uncover the specific mechanisms involved.Continued research in this area will help improve our understanding of breast cancer and potentially lead to the development of more effective treatments tailored to the specific needs of premenopausal and postmenopausal patients.

Fig. 2
Fig.2The expression profiles are presented in the heatmap of the top 50 DEGs in the integrated cohort.The expression levels of the genes are represented by different colors.Red, upregulated; Blue, downregulated.Each row represents a probe, and each column represents a sample

Fig. 3 Fig. 4
Fig. 3 KEGG pathways and GO terms of DEGs.A KEGG pathways, B molecular function category, C GO biological process The color of each circle indicates the significance of the enrichment, with colors to representing smaller p-values.The size of each circle corresponds to the number of DEGs enriched in term, with larger circles indicating a higher number of DEGs

Table 1
Datasets involved in this studya These datasets contain normal, unannotated, and samples that are not primary breast cancer.After excluding these samples, 693 samples that met our criteria were used for subsequent analyses

Table 2
Top 100 different expression probes

Table 3
Subnetworks in PPI network

Table 4
KEGG enrichment of subnetworksCluster 3 and 8 are not involved in any KEGG pathways.Red genes are associated with OS; Blue genes have target compounds; Purple genes are associated with OS and have target compounds at the relationship between breast cancer, menopausal status, and ER status.