
Has anybody tested if paralogous genes are over-represented among the genes identified by genome-wide association studies (GWAS)?

For example, if a GWAS finds 200 genes associated with a disease or trait, and X of those genes can be classified as belonging to Y different gene families, is there a test to see whether X and Y are larger than expected, given the total number of genes and paralogous gene families in the genome? Here I am talking about long-established copies within a species, not CNVs between individuals of the same species.

I think there is an interesting question behind this: if a gene has duplicated during the evolution of a genome, and the copies have taken on specialized yet related roles, an unbiased analysis like GWAS should be able to find cases where different paralogous copies are associated with different subdiseases or subtraits within the same overall disease or trait.

WYSIWYG
yahoo301503
  • Hello. Could you help by clarifying your question? Do you want to know whether paralogous genes/regions are over-represented in GWAS hits? It is an interesting hypothesis - have you any ideas/references why this may be the case? It's always best to give as full and clear a question as possible. Thanks – Luke Jul 03 '12 at 14:25
  • That is an interesting question (thanks for the expansion). If the paralogous genes were still functionally related, there is likely to be redundancy between them, so a SNP in one of the copies may not manifest at all. A SNP that affects a sub-trait specific to one of the copies would need to have a massive effect (or the study would require a phenomenal sample size) to be found (unless it was a targeted study only on paralogous genes known to be disease related?). I don't know of any studies, but will be interested in others' answers! – Luke Jul 03 '12 at 15:19
  • Have you considered Fisher's exact test (an exact alternative to the chi-squared test that does not depend on large expected counts)? – Atticus29 Feb 05 '13 at 08:41
  • I actually don't understand what the hypothesis is. First, there are tons of confounding factors to control for -- for instance, regulatory gene families may be more likely to expand than structural gene families, so you'd need to account for that (and a lot more). Your final question regarding subfunctionalization seems to have nothing to do with the original question of over-representation. – adam.r Dec 01 '13 at 19:41
  • Aren't paralogs supposed to have divergent function? I would expect them to be under-represented. – Superbest Apr 03 '14 at 02:43
  • You could also test it by counting paralogs among the hits only, then picking the same number of random genes (e.g. 200, depending on how many hits your GWAS caught) and counting paralogs in those. – Superbest Apr 03 '14 at 02:45
  • I think it depends on just how related the paralogs are. If they diverged a long time ago, then we can assume the SNPs that show up in your hypothetical GWAS are indeed independently associated with the disease. However, it's not strictly possible for the same SNP to occur in another gene; it is, more or less by definition, a different SNP. As GWAS methods generate correlations, not causations, if those same-but-different SNPs are associated with the disease, that's the end of the story. Deeper inspection is required to learn more. – Richard Rymer Jun 17 '14 at 21:42
  • This is one paper where they separated out-paralogs from orthologs and studied conservation of protein-protein interactions across species. The methods they used could be helpful for similar studies. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3447968/ – Raghavakrishna Sep 25 '14 at 10:11

1 Answer


I am not aware of any report in the literature on this. However, I did a cursory check on a neuroblastoma GWAS.

  • Selected SNPs with p-value < 0.05
  • Converted p-values to a score: -log10(p-value)
  • Mapped the SNPs to genes and calculated a cumulative score for each gene
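The three steps above can be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes the filter keeps SNPs with a p-value below 0.05, that "cumulative" means a plain sum, and that the inputs are two dictionaries (both hypothetical formats). The SNP ids, gene names and p-values are toy values.

```python
import math
from collections import defaultdict

def gene_scores(snp_pvalues, snp_to_gene, alpha=0.05):
    """Sum -log10(p) over the SNPs mapped to each gene.

    snp_pvalues: dict of SNP id -> p-value (hypothetical input format)
    snp_to_gene: dict of SNP id -> gene symbol (hypothetical mapping)
    alpha: assumed cutoff; only SNPs with 0 < p < alpha are kept
    """
    scores = defaultdict(float)
    for snp, p in snp_pvalues.items():
        gene = snp_to_gene.get(snp)
        # Skip unmapped SNPs, SNPs above the cutoff, and invalid p-values.
        if gene is None or not (0 < p < alpha):
            continue
        scores[gene] += -math.log10(p)
    return dict(scores)

# Toy data, not real neuroblastoma results.
snp_pvalues = {"rs1": 1e-4, "rs2": 1e-2, "rs3": 0.5, "rs4": 1e-3}
snp_to_gene = {"rs1": "GENE_A1", "rs2": "GENE_A1", "rs3": "GENE_B", "rs4": "GENE_A2"}
scores = gene_scores(snp_pvalues, snp_to_gene)
```

Note that a plain sum favours long genes with many SNPs, so some normalisation by SNP count may be worth considering.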

I just sorted the genes by name, assuming that many paralogs have similar names. I know this is not the right way to go about it. However, I found many groups of similarly named genes in the list.
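A slightly more systematic version of the name-sorting shortcut is to strip the trailing number from each symbol and group by what is left. The stripping rule below is my own assumption, and it will both miss paralogs with unrelated names and lump together unrelated genes that share a stem, so it is only a quick look. The scores are made-up values.

```python
import re
from collections import defaultdict

def name_stem(symbol):
    """Strip a trailing number plus optional letter, e.g. HOXA9 -> HOXA."""
    stem = re.sub(r"\d+[A-Z]?$", "", symbol.upper())
    return stem or symbol.upper()

def group_by_stem(gene_scores):
    """Group gene -> score entries by name stem, keeping stems shared by 2+ genes."""
    groups = defaultdict(dict)
    for gene, score in gene_scores.items():
        groups[name_stem(gene)][gene] = score
    return {stem: genes for stem, genes in groups.items() if len(genes) > 1}

# Toy scores, not real results.
toy_scores = {"HOXA9": 3.0, "HOXA10": 2.0, "MYCN": 8.0, "WNT5A": 1.0}
name_groups = group_by_stem(toy_scores)
```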

The next step is to find the actual paralogs and to calculate a cumulative score for each paralog group. This is a small task:

  • Get sequences of the genes
  • Run a BLAST search to find paralogs
  • Assign genes to groups and find scores
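The grouping step could look something like this. It is a sketch under several assumptions: an all-against-all BLAST of the gene products has already been run with tabular output (`-outfmt 6`, where columns 1, 2, 3 and 11 are query, subject, percent identity and e-value), the thresholds are arbitrary placeholders, and any two genes linked by a passing hit end up in the same group (single linkage). The gene ids and scores are toy values.

```python
from collections import defaultdict

def paralog_groups(blast_lines, max_evalue=1e-10, min_identity=30.0):
    """Single-linkage grouping of genes connected by BLAST hits (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for line in blast_lines:
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        identity, evalue = float(cols[2]), float(cols[10])
        # Ignore self-hits and hits that fail the (assumed) thresholds.
        if query == subject or evalue > max_evalue or identity < min_identity:
            continue
        parent[find(query)] = find(subject)

    groups = defaultdict(set)
    for gene in parent:
        groups[find(gene)].add(gene)
    return [members for members in groups.values() if len(members) > 1]

def group_scores(groups, gene_scores):
    """Sum per-gene scores within each group, highest-scoring group first."""
    totals = [(sorted(g), sum(gene_scores.get(x, 0.0) for x in g)) for g in groups]
    return sorted(totals, key=lambda item: item[1], reverse=True)

# Toy BLAST lines and scores, not real data.
blast_lines = [
    "G1\tG2\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-50\t400",
    "G2\tG3\t60.0\t280\t112\t2\t1\t280\t5\t284\t1e-30\t250",
    "G4\tG4\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t600",
    "G5\tG6\t35.0\t50\t32\t1\t1\t50\t1\t50\t1.0\t20",
]
ranked = group_scores(paralog_groups(blast_lines), {"G1": 1.0, "G2": 2.0, "G3": 3.0, "G5": 10.0})
```

Single linkage can chain distant genes into one large group, so a proper gene family resource or a clustering tool would probably be a better choice for a real analysis.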

I can share the file with genes and scores. I would, however, continue only if someone is actually interested in pursuing this; it could become a research paper.
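As for the test asked about in the question, one simple option is the random-draw comparison suggested in the comments: draw the same number of genes at random from the genome many times and see how often X and Y are at least as large as observed. The sketch below takes X as the number of hit genes that belong to any gene family and Y as the number of distinct families they fall into, which is my reading of the question. The function names, the gene-to-family mapping and the toy genome are all hypothetical.

```python
import random

def family_counts(genes, gene_to_family):
    """Return (X, Y): genes that belong to a family, and distinct families hit."""
    families = [gene_to_family[g] for g in genes if g in gene_to_family]
    return len(families), len(set(families))

def permutation_test(hits, all_genes, gene_to_family, n_draws=2000, seed=0):
    """One-sided empirical p-values for X and Y against uniform random gene sets."""
    rng = random.Random(seed)
    population = sorted(all_genes)
    obs_x, obs_y = family_counts(hits, gene_to_family)
    ge_x = ge_y = 0
    for _ in range(n_draws):
        draw = rng.sample(population, len(hits))
        x, y = family_counts(draw, gene_to_family)
        ge_x += x >= obs_x
        ge_y += y >= obs_y
    # Add-one correction so an estimated p-value is never exactly zero.
    return obs_x, obs_y, (ge_x + 1) / (n_draws + 1), (ge_y + 1) / (n_draws + 1)

# Toy genome: 1000 genes, the first 100 split into 20 families of 5.
all_genes = [f"g{i}" for i in range(1000)]
gene_to_family = {f"g{i}": f"fam{i // 5}" for i in range(100)}
hits = [f"g{i}" for i in range(20)]  # 20 hits, all inside families
obs_x, obs_y, p_x, p_y = permutation_test(hits, all_genes, gene_to_family)
```

Uniform random draws ignore the confounders raised in the comments, such as gene length and SNP coverage, so a real test would need to match the random genes to the hits on those properties.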

PS: If you want the file, just leave your email address in a comment. Some IT admin idiot has blocked RapidShare, 4shared, etc.

WYSIWYG