Genetic Association between HERV-H LTR Associating 2 (HHLA2) Protein and MAGE-B5 Variant in Viral Related Diseases
Genetic Association between the gamma retrovirus HERV-H Long Terminal Associating (HHLA1, HHLA2 and HHLA3) proteins and its association with melanoma associated antigen of the B class proteins (MAGEB5) has not been documented in various viral diseases affecting humans. MAGE proteins of class A, B and C are known to be differentially expressed in related cancer diseases likewise HHLA2 but the reason for the expression of the latter has not been clearly defined. This study focused on identifying the genetic association of HHLA2 and MAGEB5 in viral related oncogenesis through 4 objectives, which included: i) identifying viral related samples from Genome Expression Omnibus (GEO) profiles database expressing both HHLA2 and MAGEB5, ii) Identify the cluster patterns of the various identified samples, iii) Identify relative expression patterns of HHLA2 and MAGEB5 transcripts in various samples and iv) genomic interaction between the most differentially expressed proteins in the samples and their molecular functions. Ten viral related samples were obtained, and gene related data per sample was captured in a Table 1. Using R software (R.3.3.1), the samples showed 3 cluster patterns: 1 & 2 (vector related viruses, VRV); 3, 4, 5, 6, 7 & 8 (airway related viruses, ARV) and 9 & 10 (immune related viruses, IRV). Four main proteins showed more than two-fold expression across all samples and they included: HHLA1, HHLA2, HHLA3 and MAGEB5 meanwhile, only two HLA immune proteins [HLA-C (HLA class I) and HLA-DRB5 (HLA class II)] proteins showed a greater than one-fold differential expression pattern across samples. Five top Gene Ontology (GO) molecular functions associated with the most expressed genes using GeneMANIA software included: ER to Golgi transport vesicle membrane (7.75x10-22, GO:0012507), luminal side of membrane (1.58x10-21, GO:0098576), luminal side of ER membrane (1.58x10-21, GO:0098553), integral component of luminal side of ER membrane (1.58x10-21, GO:0071556) and ER to Golgi transport vesicle (3.24x10-21, GO:0030134), involving 3 main biological processes which were: MAGE Homology Domain (MHD_dom) (score = 51.9, IPR002190), Melanoma Associated Antigen N-terminal(MAGE_N) (score = 21.6, IPR021072), Immunoglobulin C1-set (Ig_C1-set) (score = 9.75, IPR003597). The identified biological processes suggested that the action of HHLA2 and related proteins inhibited HLA immune variant expressions hence favoring the expression of MAGEB protein variant in viral infections with possible oncogenic activity. This study suggests that HHLA2 and MAGEB2 proteins could be involved in biological processes favoring viral-oncogenic related diseases hence a potential therapeutic target.
Keywords:MAGE; HERV-H Long Terminal Associating (HHLA); Genome Expression Omnibus; HLA; Cancer; MAGE Homology Domain (MHD); Gene Ontology (GO) Pathways; Immune; Genemania
Pathogens including viruses are responsible for 20% of all cancers affecting humans worldwide [1]. The role of viruses in cancers has become so evident that it is also shaping medication design [2]. The International Agency for Research on Cancer (IARC) has placed several viruses as type 1carcinogenic agents and they include: Epstein-Barr virus, Kaposi sarcoma-associated herpesvirus (KSHV), human papillomaviruses (HPV),hepatitis B and C viruses (HBV, HCV), Merkel cell polyomavirus (MCPV) and Human T-cell Lymphotropic virus of type, 1 (HTLV1) (Moore and Chang, 2010) [3]. With increase in genetic identification technology the chances of identifying new viral species related to cancer is very likely. The carcinogenic process is common for oncogenic viruses producing little or no virions in chronic infections due to mechanisms favoring viral latency and/or persistency and this is because they are escaping cell death common with lytic infections hence shielding the pathogen from the immune system [4]. Changes in cell genetic landscape may be a biological accident and not a viral evolutionary advantage and that’s why it is a rare occurrence given that cancer leads to host death and manifestation of viral persistence through its oncogenes leading to end of the virus [5]. Viral oncogenes are viral encoded proteins that are responsible for carcinogenesis like E6 and E7 in HPV, E1A and E1B in Adenovirus and HPV16 and HPV18 are the principal causes of cervical cancer [6].
The interaction of viral oncogenes with the host cell results in a cascade of genetic events leading to activation of different genetic biological processes resulting in the activation of immune cells. Viruses enter the cell using different cell receptors like dendritic cells (DC-SIGN) for CD209 and macrophages. One common method for viral entry into the cell is the use of the immune escape mechanism mimicking the immune checkpoint mechanism. Programmed Death ligands (PDL1 and PDL2) are members of the B7 family and are known as immune check point inhibitors, expressed on antigen processing cells (APC) with purpose to bind their receptors, PD1 and PD2, respectively, on CD4 T-cells to inhibit and regulate Immune function [7]. Cancer cells like Osteosarcoma have also been shown to be targets for the immune check point therapy currently used to treat some cancer types like non-small cell lung cancer (NSCLC) [8]. HHLA2 protein variant happens to be the newly identified member of the B7 family and also shown to be a co-inhibitory and co-stimulatory molecule. HHLA2 suppresses the function of CD4 and CD8 T cells when T cell signaling is present and also reduces the production of cytokines by T cells [9]. HHLA2 has little or no expression in normal tissue but widely differentially expressed in human cancers [10]. Antibodies targeting the PD1 and PD2 receptors to block tumor suppressor genes are efficiently used as drugs to prevent cancer cell proliferation so it can be destroyed by immune cells [11].
The association between HHLA2 and MAGEB5 in viral cancers have not been documented and though MAGE expression has been shown in some viral cancers none has been established as tumor or cancer driver. Like HHLA2, MAGE protein variants are rarely expressed in normal tissues of adults but highly expressed in several cancer types which include breast cancer, lung adenocarcinoma, gastric cancer and prostate cancer to name a few [12]. The MAGE human family proteins are classified based on their chromosomal location and expression pattern. They all possess a MAGE homology domain (MHD) whose function in cancer related biological pathways has not been clearly identified [13]. The relevant cancer-testis antigens (CTA) are type-I belonging to MAGE class A, B and C and highly associated with cancers [14].
This study aimed at understanding the genetic associations involving MAGEB5 and HHLA2 with viral disease biological processes leading to the development of tumors and definitely cancer. No MAGE protein variant is currently a known target for drugs against cancer development given that focus has been mainly on the MAGE class a family known to be the most differentially expressed in cancers. Given that both HHLA2 and MAGEB5 are expressed only in cancers other than their normal tissue, there is a possibility of both genes coevolving in viral related cancers. Viral related cancer samples expressing MAGE protein variants were obtained and corresponding genes across samples were subjected through heatmap analysis, protein interaction pathway analysis and principal component analysis to understand genetic associations between the proteins of interest and how they can be used as potent targets for drug development and treatment of viral related cancers.
Data was obtained from GEO (RRID: SCR_005012), which has three divisions: a) GEO Database (www.ncbi.nlm.nih.gov/geo/), a public repository with necessary tools to search microarray datasets on functional genomics, download experiments and obtain curated gene expression profiles which have been documented; b) GEO Datasets (www.ncbi.nlm.nih.gov/gds), a site used to store datasets on molecular abundance and curated gene expression obtained from the GEO repository; and c) GEO Profiles (www.ncbi.nlm.nih.gov/geoprofiles/), a site designed to store molecular abundance of individual gene expression profiles obtained from the GEO repository. We based our study on GEO Profiles and using the terms “melanoma associated antigen (MAGE)” and “virus” to obtain data related to experiments done on MAGE and viruses, which was downloaded for further analysis.
Ten samples were identified based on selection criteria (Dataset 1) and downloaded data per sample had the following components: sample identity (Id), title of the study, relative standard deviation (RSTD), relative mean (RMEAN), gene name, Genebank accession number (ACCN), taxon and GEO dataset type (Dataset 2). In the dataset the samples were represented as BC_Sample (1-10) which referred to Biological collection (BC) of Samples (1-10). The data was further filtered for humans as species of interest and gene related information was used to extract per sample data from the dataset to be used for gene cluster and differential expression analysis. From our dataset, all MAGE gene variants were extracted into a data table based on RMEAN values only (Dataset 3), from which all HLA class I and II gene variants, 69 genes in total, were extracted as a separate dataset (MAGE-HLA) for heatmap analysis (Dataset 4). Given that viral immune escape has been associated with cell receptors [CD209 and LCK], both genes were added to the MAGE-HLA dataset for cluster pattern analysis [15,16]. The top 200 most differentially variable genes were extracted from Dataset 3 into another table (Dataset 5) and used for gene interaction analysis based on their Gene Ontology (GO) molecular function used as basis to identify functional molecular clusters across the sample data. The total count for each HLA class I & II immune variant across samples were extracted and plotted to observe their abundance in the data.
Dataset 1: MAGE-virus samples identified (n = 10) on GEO profile database:This table relates to each viral sample data retrieved from GEO profile database. 10 samples matched our search criteria (virus, cancer and MAGE). Column 1 = The web link to each dataset on the database, column 2 = Sample names used for analysis, column 3 = The sample title as represented on the database.
Dataset 2: MAGE-virus gene values per sample retrieved from GEO profile database:This table relates to the genetic information for each sample retrieved from GEO profile database. Genomic information was specific to humans. Column 1 = sample identity (Id), column 2 = title of the study, column 3 = relative standard deviation (RSTD), column 4 = relative mean (RMEAN), column 5 = gene name, column 6 = Genebank accession number (ACCN), column 7 = taxon and column 8 = GEO dataset type (gds).
Dataset 3: MAGE-virus normalized gene RMEAN values (n = 19688) across samples:This table represents the normalized RMEAN values per gene across all samples. This data table was used for further analysis in R software. BC_Sample (1 – 10) = Biological collection of samples 1 to 10.
Dataset 4: MAGE-virus immune-related genes (n = 69) across samples:This table represents the normalized RMEAN values per gene across all samples. This data table was an extract of genes specific for MAGE and Immune function and was further analyzed in R software. BC_Sample (1 – 10) = Biological collection of samples 1 to 10.
Dataset 5: MAGE-virus most variable genes (n = 200) across samples:This table represents the normalized RMEAN values per gene across all samples. This data table was an extract of the most variable genes most expressed in our data and was further analyzed in R software. BC_Sample (1 – 10) = Biological collection of samples 1 to 10.
Dataset 6: MAGE-virus Gene Ontology protein interaction biological process data:This table shows all the identified biological pathways for the combined protein interaction analysis for the most variable genes (n = 200) and MAGE-immune genes (n = 69). Column 1 = Gene Ontology Id, Colum 2 = identified pathway, column 3 = q-value for each pathway and column 4 = number of genes in our data that matched the total number of genes known for that pathway.
The top genes (n = 200) with the most variation between clusters observed in the data were uploaded into GeneMania a user-friendly and flexible software used for hypotheses generation on gene related functions, gene lists analysis and gene prioritization analysis for functional assays [17]. GeneMania is an integrated plugin in Cytoscape, version 3.6.1 a software platform designed to integrate interaction networks of biological and molecular importance having high throughput related expression data in combination with other molecular states [18]. The Gene Ontology molecular functions relative to our data were used for predicting the network and also perform genomic functional classification within the network.
Gene related RMEAN values per gene were obtained for 19,688 genes and normalized for across samples by summing and dividing by the number of samples, which were extracted into a Table 1 (Dataset 3) for further analysis. Genes with RMEAN > 1 for any sample [CD37 (Leukocyte antigen) and ZZZ3 (Zing Finger ZZ-Type Containing 3)] were removed from the dataset. The data table (Dataset 4) was analyzed using R version 3.3.1 (R Development Core Team, 2008; RRID, SCR_001905) and several statistical parameters were taken into consideration. The cluster patterns observed from analysis of mean from the data table were subjected to a One-way ANOVA statistical analysis taking into consideration F value and the P value at alpha = 0.01. Each value per gene (GV, gene value) was summed across samples (GS, gene sum) and GS = 0 per gene was exempted from the total data. For data normalization and transformation, a minimum of 0.001 was added to every gene value to eliminate all gene datasets with zeros. The final normalization step per gene was; Log2 [(GV + 0.001)/GS]. Hierarchical clustering (R Core Team, 2013; RRID: SCR_009154) on the sample data was performed using gplots version 2.1.0 and RColorBrewer version 1.1-2 and heatmap.2 was used for plotting the heatmap [19,20]. The samples on the heatmap were represented as EB_GEO_Sample (1-10) which represented “Extracted Biological” (EB) Gene Ontology (GO) sample (1-10). Principal component analysis (PCA) using ggbiplot library version 0.55 was performed on the data [21]. The biplot uses a covariance scaled to one, which is approximately the inner product between variables, while the distances between points is approximately the Mahalanobis distance [22].
The total data from GEO profile database was composed of n = 19,688 genes (Dataset 3) across 10 samples (Dataset 1). The Log2 normalized sample mean distribution across samples in the data showed a 3-cluster differentiation pattern with the following groups: Samples 1 & 2 [MAGE & Virus group 1 (MAV_GP1)], samples 3-8 [MAGE & Virus group 2 (MAV_GP2)] and samples 9 &10 [MAGE & Virus group 3 (MAV_GP3]) (Figure 1). The statistical analysis between clusters was significant hence validating the groups we observed. Data obtained showed the following: MAV_GP1 vs MAV_GP2; F = 228.53 and P = 2x10-16 and for MAV_GP1 vs MAV_GP2; F = 32.46 and P = 1.24x10-08 and for MAV_GP1 vs MAV_GP2; F = 1068 and P = 2x10-16. Relative mean values per sample were observed to be evenly distributed.
The total data from GEO profile database was composed of n = 19,688 genes (Dataset 3) across 10 samples (Dataset 1). The Log2 normalized sample mean distribution across samples in the data showed a 3-cluster differentiation pattern with the following groups: Samples 1 & 2 [MAGE & Virus group 1 (MAV_GP1)], samples 3-8 [MAGE & Virus group 2 (MAV_GP2)] and samples 9 &10 [MAGE & Virus group 3 (MAV_GP3]) (Figure 1). The statistical analysis between clusters was significant hence validating the groups we observed. Data obtained showed the following: MAV_GP1 vs MAV_GP2; F = 228.53 and P = 2x10-16 and for MAV_GP1 vs MAV_GP2; F = 32.46 and P = 1.24x10-08 and for MAV_GP1 vs MAV_GP2; F = 1068 and P = 2x10-16. Relative mean values per sample were observed to be evenly distributed.
The MAGE-HLA gene dataset was used to plot a hierarchical clustering heatmap to see if the immune genomic platform will have the same distribution pattern for the general dataset (Figure 2). There was a change in cluster patterns as samples in MAV_GP1 split into MAV_GP3 (EB_GEO_sample 1) while EB_GEO_samples 2 & 3 formed the new cluster. Four genes showed a strong expression in all samples and they were: HHLA1, HHLA2, HHLA3 and MAGEB5. The grey cluster (Samples 2 & 3) showed an expression for almost every gene for all the 3 identified color sidebars for related genes. Almost all genes showed an expression for the grey cluster and no gene expression for blue and pink cluster (light green side bar), low gene expression for blue and pink cluster relative to grey cluster (purple side bar) and gene expression for grey and pink cluster (yellow side bar). The majority of the HLA class I and II genes were expressed in the group 3 gene cluster alongside CD209 (DC-SIGN) gene which codes for the c-type lectin, a cell transmembrane receptor used by viruses to get into the cell.
The PCA analysis plot (Figure 3) was considered only for the MAGE-Immune group of genes (n = 69) could explain a 22% gene variation in the data as seen on the y-axis and a 43.1% variation in the samples as seen on the x-axis. Two main cluster patterns were observed as: samples 5,6,7 & 8 and samples 1,3,4,9,10 while sample 2 separated out on its own. MAGEB5, HHLA1, HHLA2 and HHLA3 were out of the circle of unit radius 1 alongside HLA-A, HLA-G, HLA-B and HLA-DPB1 as genes of interest in this data analysis.
The top 200 most variant genes and 69 MAGE-Immune related genes were analyzed for the most common biological processes involved in our data and a network of nodes (n = 299) and edges (n = 3170). A total of 92 Gene Ontology (GO) biological processes were identified (Dataset 6) and they were validated based on the q-value which is an adjusted false discovery rate (FDR) of the p-value. The first ten GO biological processes were of particular interest because of their involvement in processes like: vesicular formation at cell surfaces and movement to Golgi apparatus, MHC class II functions, antigen processing and presentation and interferon-gamma signaling. Majority of genomic interactions identified were from Integrative Protein Signature (InterPro) Database and these genes-associated processes were: MAGE Homology Domain (MHD_dom) (score = 51.9, IPR002190), Melanoma Associated Antigen N-terminal (MAGE_N) (score = 21.6, IPR021072), Immunoglobulin C1-set (Ig_C1-set) (score = 9.75, IPR003597).
The main focus of this study is understanding how viral infections and their cell targets affect the role of MAGE related protein variants especially MAGEB5 and HHLA related protein variants especially HHLA2. HHLA gene derives several variants and the functional role of some have not been clearly identified (HHLA1 and HHLA3). Data obtained from GEO profile database gives a gene by gene expression profile for the experimental data. The mean distribution per sample for all genes showed 3 clusters which could be categorized as: vector related viruses (VRV) from MAV_GP1, Airway-like related viruses (ARV) from MAV_GP2 and Immune-regulatory viruses (IRV) from MAV_GP3. The genetic variation was significant, suggesting an increased cellular response to genes in the VRV cluster relative to the two other clusters. This may also be a dual cellular response based on mechanical injury from the mosquito bite and response to a viral transmission to the human host [23]. The high F-value observed for the ANOVA analysis suggest that the gene values within group were quite different hence supporting the use of RMEAN values for these analyses. The heatmaps for the dataset followed a hierarchical distribution pattern similar to the mean cluster variation pattern observed in Figure 1. The VRV group showed a higher gene expression pattern which was relatively lower in other groups. The 200 top most genes were extracted and plotted as representative of the total dataset and the same cluster distribution pattern as of Figure 1 & Figure 2 were observed. This demonstrated that the top most gene variations in this data analysis were responsible for the distribution pattern observed. The role of most variable genes in gene expression and variation data have been associated with drivers of the particular genetic situation being observed. When evaluating the differential pattern of expressed genes in disease data, these most variable genes are considered disease drivers and further evaluated for therapeutic targets, which is a consideration for further molecular analysis [24]. The expectation of viral infected cells is for the immune response to contain the infection and restore health, which led us to investigate all MAGE, HHLA and immune related gene variants identified in this dataset. Two genes of interest LCK and CD209 were added to this dataset because of their role with immune activation and viral entry into the cell. LCK is a proto-oncogene of the Src family of Tyrosine Kinases and a key signaling molecule in the development and maturation of T-cells through its interaction with the T-cell receptor (TCR). The gene interaction of PD1 and TCR based on stimulation of the latter breaks the LCK-TCR functional pathway [25]. CD209 is a dendritic cell with MHC class II properties but most at times becomes a route to the entry of several viruses like HIV and Ebola [26,27]. The heatmap associated with this data subset was plotted and a different cluster pattern from Figure 2 & Figure 3, was observed which shaped the gene cluster patterns of interest. HHLA2 and MAGEB5 proteins were expressed across all sample types (Figure 5). MAGEB5 is a melanoma associated antigen (MAGE), a class of protein variants whose natural site of expression is only the testis and if identified in any other tissue then it is tumor-associated [12]. The genetic and functional relationship between MAGE and HHLA protein variants have not been documented which makes their co-expression for viral infection very important for further analysis. The differential expression pattern of LCK (Figure 5 group 1) and CD209 (Figure 5 group 3) proteins suggested the differential use of viral routes into the host. CD209 was expressed in samples associated with ARV cluster (pink) and VRV (grey). Specific HLA class II protein variants were also expressed in this cluster and no HLA-class I variants were observed. This expression pattern suggested that viral entry into its host will be preferably use dendritic cells (CD209) as point of entry for VRV and ARV but may use an immune escape mechanism where it inhibits T cell expression as seen with the lack of HLA class I antigens. This further supports the expression of HHLA2 as an immune suppressor and which could be involved in the biological process that favors immune escape [28]. It has been shown that, the well-developed pattern recognition receptors in dendritic cells can bind infective virions (HIV, Ebola, HCV), and the expression of these surface peptide loaded MHC I molecules are controlled by the CD3/TCR complexes on CD8+ T lymphocytes [29]. Interfering with peptide loading receptors and their trafficking process on CD4+ T lymphocytes on the cell surface is an effective strategy for immune evasion used by several viruses [28]. Viral-tumor genetic association could result in a down-regulation of MAGE and Immune protein variants (IRV) with exception of HHLA and MAGEB5. MAGE type I protein variants (A, B and C) are considered tumor associated antigens and are expressed in several cancer types but a succinct genetic profile of these MAGE type I antigens with viral disease and associated cancers has not been shown [13]. MAGEB5 and HHLA show an onset with viral disease and remains present in cases of tumorigenesis and carcinogenesis. The PCA analysis also showed that HHLA2 alongside other variants and only MAGEB5 variant were out of the circle. HLA class I variants (A, B & G) and HLA class II variant (HLA-DPB1) were also identified out of the circle and the latter has been associated with Beryllium lung disease caused by the metal Beryllium and those who come in contact with it in factories [30]. The ARV cluster related to those group of viruses that use airway and lung epithelial surfaces to penetrate their human host and there could be a possibility that, the biological process associated with Beryllium disease due to HLA-DPB1 mutations could also be used by these viruses. Genes located out of the circle are considered to have different functional patterns from those within the circle and should considered critical when identifying critical gene variants associated with a disease of interest [21]. Functional biological processes identified with the most variable genes in our data showed that, most of the genes were involved with vesicular formation and movement to Golgi apparatus which are methods used by viruses to enter the host cell. Viruses use endocytic movement at cell surfaces as vehicles of transport deep into the cell cytoplasm and this also helps to shield their glycoprotein membrane from host proteases and immune cells. Once released from these vesicles, these pathogens begin their function of cellular integration and replication [31]. Genes from this data identified with InterPro functional groups and domains like: MAGE Homology domain (MHD) common to all MAGE protein variants and their role has not been clearly characterized, melanoma associated antigen N-terminal (MAGE_N) wherein the terminal protein of about 82 to 96 amino acids found on all melanoma associated antigens is recognized by HLA-A1 for tumor rejection and Immunoglobulins (Ig_C1-set) whose C1 domains are mostly associated with T-cell function. Other functions of interest included: Asthma and viral myocarditis of the Consolidated-Pathways-2013 database and Leishmaniasis (KEGG pathway, H00359). Asthma is the condition wherein the airway becomes inflamed and it is also considered a polygenic disease and viral myocarditis is an inflammation of the heart muscle due to entry of virus in myocytes. Leishmaniasis is a vector-borne disease transmitted to humans by Phlebotomine sandflies. These three biological processes correlate with the role of viruses in our samples and their route through airways causing possible lung inflammations [32,33]. Genes associated with these biological processes correlated with MAGE (B5) and HHLA (2) functions.
This study was centered on the understanding the genetic interaction between viral related cells and the viral pathogen when it gets into the cell. Viruses are the cause of several cancer types like hepatitis and reproductive-organ related. The genetic biological processes involving MAGE (B5) and HHLA (2) and their role in viral pathogenesis and cancer has not been documented and this is the first study that puts all of these areas of study together. This study clearly shows that MAGEB5 and HHLA2 are involved in the onset viral disease and based on their interaction with the immune cells, the disease could gradually progress into a tumor and then to cancer. Genes involved with viral entry exploit the immune checkpoint pathway through HHLA2 favoring expression of MAGEB5 whose role may be facilitating viral evolution within the target cell. The genetic and functional pathway involving MAGEB5 and HHLA2 are of great interest for further analysis as biomarkers of viral disease onset and progression.