We request that any use of data obtained from the Global Biobank Engine be cited in publications using the following format:
We also ask that the developers of the engine be acknowledged as follows:
Summary statistics of the data presented in GBE can be downloaded from the Rivas Lab GitHub page within the summary stats section.
Additionally, the Neale Lab has made the summary statistics from their heritability analysis available for download here.
Data presented in GBE are from the UK Biobank dataset release version 2. To minimize the impact of confounders and unreliable observations, we used a subset of individuals who satisfied all of the following criteria: (1) self-reported white British ancestry, (2) used to compute principal components, (3) not marked as outliers for heterozygosity and missing rates, (4) do not show putative sex chromosome aneuploidy, and (5) have at most 10 putative third-degree relatives. These criteria are reported by the UK Biobank in the file “ukb_sqc_v2.txt” in the following columns, respectively: (1) “in_white_British_ancestry_subset”, (2) “used_in_pca_calculation”, (3) “het_missing_outliers”, (4) “putative_sex_chromosome_aneuploidy”, and (5) “excess_relatives”. We removed 151,169 individuals who did not meet these criteria. Similar criteria were applied to the exome sequencing data from UK Biobank.
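The five sample-level criteria can be sketched as a simple row filter. This is a minimal illustration, not the pipeline used for GBE: it assumes the QC file has been given a header row with the column names listed above, that flags are encoded as 0/1, and that a hypothetical "sample_id" column identifies individuals. The toy rows are invented.

```python
import csv
import io

# Column names come from the text; the 0/1 encoding is an assumption.
MUST_BE_ONE = ["in_white_British_ancestry_subset", "used_in_pca_calculation"]
MUST_BE_ZERO = [
    "het_missing_outliers",
    "putative_sex_chromosome_aneuploidy",
    "excess_relatives",
]


def passes_qc(row):
    """Return True if a sample satisfies all five criteria."""
    return all(row[c] == "1" for c in MUST_BE_ONE) and all(
        row[c] == "0" for c in MUST_BE_ZERO
    )


# Invented toy data; "sample_id" is a hypothetical column.
toy = io.StringIO(
    "sample_id in_white_British_ancestry_subset used_in_pca_calculation "
    "het_missing_outliers putative_sex_chromosome_aneuploidy excess_relatives\n"
    "A 1 1 0 0 0\n"
    "B 1 0 0 0 0\n"
    "C 1 1 1 0 0\n"
)

kept = [r["sample_id"] for r in csv.DictReader(toy, delimiter=" ") if passes_qc(r)]
print(kept)
```

Only sample A survives here: B was not used in the principal component calculation and C is flagged as a heterozygosity or missingness outlier.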
We processed summary statistics from Biobank Japan.
We processed summary statistics from the United States' Million Veterans Program.
Genome-wide association analysis was performed with Firth-fallback using PLINK v2.00a (17 July 2017). We used the following covariates in our analysis: age, sex, array type, and the first four principal components, where array type is a binary variable that represents whether an individual was genotyped with UK Biobank Axiom Array or UK BiLEVE Axiom Array. For variants that were specific to one array, we did not use array as a covariate.
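The covariate rule for array-specific variants can be written down as a small helper. This is only a sketch of the rule as stated above; the covariate labels ("PC1" to "PC4", "array") and the function name are hypothetical and do not correspond to any file or option in the actual analysis.

```python
# Hypothetical covariate labels; the text names age, sex, array type,
# and the first four principal components.
BASE_COVARIATES = ["age", "sex", "PC1", "PC2", "PC3", "PC4"]


def covariates_for_variant(on_both_arrays):
    """Return the covariate list for one variant.

    Array type is included only when the variant is present on both
    the UK Biobank Axiom Array and the UK BiLEVE Axiom Array.
    """
    if on_both_arrays:
        return BASE_COVARIATES + ["array"]
    return list(BASE_COVARIATES)


print(covariates_for_variant(True))
print(covariates_for_variant(False))
```

Dropping the array covariate for array-specific variants avoids a covariate that is constant across every genotyped individual for that variant.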
Current best practice for determining the significance of associations from p-values in genetic association studies is to adjust the significance threshold for the number of associations tested, a method known as the Bonferroni correction. For GWAS, 820,897 tests are performed, one for each variant on the array. For PheWAS, 1,766 tests are performed for each variant, one for each phenotype tested. Thus, the appropriate p-value cutoffs are 6.0 × 10^-8 for GWAS and 2.8 × 10^-5 for PheWAS.
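The cutoffs follow from dividing a family-wise error rate by the number of tests. The sketch below assumes an error rate of 0.05, which the text does not state explicitly but which approximately reproduces both thresholds.

```python
ALPHA = 0.05  # assumed family-wise error rate; not stated in the text


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test p-value cutoff under the Bonferroni correction."""
    return alpha / n_tests


gwas_cutoff = bonferroni_threshold(820_897)  # one test per array variant
phewas_cutoff = bonferroni_threshold(1_766)  # one test per phenotype

print(f"GWAS cutoff:   {gwas_cutoff:.2e}")
print(f"PheWAS cutoff: {phewas_cutoff:.2e}")
```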
The method used for aggregate analysis shown on GBE is described in detail in our manuscript, “Bayesian Model Comparison for rare variant association studies of multiple phenotypes”. Briefly, we run a model called MRP, which considers correlation, scale, and location of genetic effects across a group of genetic variants, phenotypes, and studies. By sharing information across rare variants and phenotypes, we improve our ability to identify rare variants associated with disease compared to considering a single rare variant and a single phenotype.
Variants were filtered using the variant filter table.tsv file available on GitHub (commit 6f9f726) to filter variants on the UK Biobank array for use with MRP. We first chose variants with minor allele frequency less than 1%. We then filtered out all variants with all filters less than one. This removes variants with missingness greater than 1% (calculated on an array-specific basis for array-specific variants) or Hardy-Weinberg equilibrium p < 10^-7. This also removes some PTVs for which manual inspection revealed irregular cluster plots. We LD-pruned the variants by only using variants with ld equal to one. We included missense variants and PTVs indicated by the following annotations: missense variant, stop gained, frameshift variant, splice acceptor variant, splice donor variant, splice region variant, start lost, stop lost.
The Bayes Factor (BF) is a scoring method used to convey confidence of one hypothesis over another, i.e. the alternative hypothesis over the null hypothesis. We present a log BF as a measure of support for results of the rare variant aggregate analyses. In practice, there is no threshold that indicates significance for Bayes Factors, unlike p-values. However, a log BF greater than 3 indicates moderate evidence for the alternative hypothesis. See Kass & Raftery (1995) for a thorough discussion on Bayes Factors.
The purpose of the Genetic Correlation App is to display genetic correlation estimates from the multivariate polygenic mixture model (MVPMM). Users can select phenotypes that are available in GBE from the search box at the bottom of the page.
The following is a description of each of the relevant variables within the application.
Users can filter by z-score, pi2, genetic correlation, and phenotype category.
For a video walkthrough of the application, please see this YouTube video.
We combined cancer diagnoses from the UK Cancer Register with self-reported diagnoses from the UK Biobank questionnaire to define cases and controls for cancer GWAS. Individual-level ICD-10 codes from the UK Cancer Register (Data-Field 40006) and the National Health Service (Data-Field 41202) in the UK Biobank were mapped to the self-reported cancer codes (Data-Field 20001). The mapping was performed via manual curation of ICD-10 codes for each of the self-reported cancer codes. UKB field codes for self-reported cancer were created with a tree structure such that more specific cancer subtypes (e.g., “malignant melanoma”) are nested under more general categories (e.g., “skin cancer”). This tree structure was preserved in the field code to ICD-10 mapping. For example, the self-reported phenotype of “lip cancer” was mapped to its field code, 1010, and the ICD-10 codes for “malignant neoplasm of lip”, C00 and C000-C009. After this mapping, individuals with an affirmative entry in one or more of the phenotype collections (self-reported cancer, cancer registry, and the NHS) were included in the case cohort for the GWAS. No secondary neoplasms were included in the cancer phenotype mappings.
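The case definition amounts to taking the union of three sources. The sketch below uses only the lip cancer example from the text; the dictionary, the function name, and the per-individual sets are hypothetical stand-ins for the curated mapping and the UK Biobank fields.

```python
# Hand-written excerpt of a field-code to ICD-10 mapping, using the
# lip cancer example from the text: C00 plus C000 through C009.
CANCER_ICD10 = {
    1010: {"C00"} | {f"C00{d}" for d in range(10)},
}


def is_case(field_code, self_reported, registry_icd10, hospital_icd10):
    """True if any of the three sources reports the phenotype.

    self_reported  : set of self-reported cancer codes (Data-Field 20001)
    registry_icd10 : set of cancer registry ICD-10 codes (Data-Field 40006)
    hospital_icd10 : set of hospital ICD-10 codes (Data-Field 41202)
    """
    icd10 = CANCER_ICD10[field_code]
    return (
        field_code in self_reported
        or bool(icd10 & registry_icd10)
        or bool(icd10 & hospital_icd10)
    )


# Invented individuals for illustration.
print(is_case(1010, {1010}, set(), set()))    # self-report only
print(is_case(1010, set(), {"C001"}, set()))  # registry only
print(is_case(1010, set(), set(), {"I10"}))   # no supporting entry
```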
We combined disease diagnoses from the UK National Health Service Hospital Episode Statistics with self-reported diagnoses from the UK Biobank questionnaire to define cases and controls for non-cancer phenotypes (referred to as “high confidence” phenotypes). ICD-10 codes (Data-Field 41202) were grouped with closely related self-reported non-cancer illness codes (Data-Field 20002). This was done by first creating a computationally generated candidate list of closely related ICD-10 codes and self-reported non-cancer illness codes, then manually curating the matches. The computational mapping was performed by calculating the token set ratio between the ICD-10 code description and the self-reported illness code description using the FuzzyWuzzy Python package. The high-scoring ICD-10 matches for each self-reported illness were then manually curated to ensure high-confidence mappings. Manual curation was required to validate the matches because fuzzy string matching may return words that are similar in spelling but not in meaning. For example, to create a hypertension cohort, the code description from Data-Field 20002 (“Hypertension”) was compared against all ICD-10 code descriptions, and all closely related codes were returned (“I10: Essential (primary) hypertension” and “I95: Hypotension”). After manual curation, code I10 would be kept and code I95 would be discarded.
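The hypertension example shows why manual curation is needed. The sketch below does not use FuzzyWuzzy: it swaps in a simplified word-level score built on the standard-library difflib module, so the numbers will differ from a true token set ratio. The threshold of 80 and the extra asthma entry are assumptions added for illustration.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase alphanumeric words of a description."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(query, description):
    """Best 0-100 match between the query and any single word of the
    description (a simplified stand-in for a token set ratio)."""
    q = " ".join(tokens(query))
    return max(
        (round(100 * SequenceMatcher(None, q, t).ratio()) for t in tokens(description)),
        default=0,
    )


ICD10 = {
    "I10": "Essential (primary) hypertension",
    "I95": "Hypotension",
    "J45": "Asthma",  # extra entry added for illustration
}
THRESHOLD = 80  # hypothetical cutoff for the candidate list


def candidate_matches(self_report, icd10=ICD10, threshold=THRESHOLD):
    """Score every ICD-10 description and keep the high-scoring ones."""
    scored = {code: similarity(self_report, desc) for code, desc in icd10.items()}
    return {code: s for code, s in scored.items() if s >= threshold}


candidates = candidate_matches("Hypertension")
print(candidates)
```

Both I10 and I95 clear the cutoff because “hypertension” and “hypotension” differ by only a few characters, so a curator still has to discard I95 by hand.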
We used data from Category 100034 (Family history–Touchscreen–UK Biobank Assessment Centre) to define cases and controls for family history phenotypes. This category contains data from the touchscreen questionnaire on questions related to family size, sibling order, family medical history (of parents and siblings), and age of parents (or age at death, if deceased). We focused on Data-Field 20107 (Illness of father) and Data-Field 20110 (Illness of mother).