Where can I find some datasets of aligned nucleotide sequences? And what should I assume about the accuracy of the alignments there?

(I would like to use such datasets to train the alignment model I am working on, in particular to estimate parameters such as the frequency of single-nucleotide indels at some locations.)

Anas Elghafari
  • What do you want to do with these aligned sequences? Test algorithms? Actually, your question is a bit unclear. – Chris Aug 06 '14 at 11:35
  • I would like to use such a dataset for training, i.e. to help me infer some parameters for the alignment tool I am working on. Thank you for your comment; I will edit my question to clarify. – Anas Elghafari Aug 06 '14 at 11:45
  • I agree with @Chris! Do you mean whether the alignment itself can be trusted, given the algorithm used, or whether the sequences themselves can be trusted? Are you worried about semi-conserved sequences being aligned differently depending on the algorithm used? For different alignment algorithms, please see this post (http://biology.stackexchange.com/questions/20075/what-is-the-state-of-the-art-algorithm-for-multiple-sequence-alignment). I could be totally wrong, but multiple alignment of semi-conserved sequences is mostly an issue for amino acids (AA), not nucleotides, since nucleotides either match or they don't. – Behzad Rowshanravan Aug 06 '14 at 11:46
  • Thank you for your comment. Amino Acid alignment wouldn't work for my purpose (I think), because I am trying to infer the probability of a single nt INDEL at a given location. – Anas Elghafari Aug 06 '14 at 11:51
  • So the alignment has to be trusted in the sense that the nucleotide INDELs predicted by the alignment are correct. – Anas Elghafari Aug 06 '14 at 11:57
  • Well, indels are read by your sequencer. There are machine errors and sample prep errors. You have to set up controls in your machine and train your set. I haven't really understood your question, however. – WYSIWYG Aug 06 '14 at 12:12
  • I haven't really understood your comment but that is probably because I'm relatively new to this world. What I'm trying to do: A pair of aligned nt sequences can have regions that are conserved, it can also have insertions/deletions. Those insertions/deletions can be of full codons, but there can also be insertions/deletions of single nucleotides (Am I right so far?). My purpose from a dataset of aligned nt sequences is to study those insertions/deletions (and infer parameters based on them for the alignment tool I'm working on). – Anas Elghafari Aug 06 '14 at 12:36
  • @AnasElghafari.. I would suggest that you use some diagram to make your question clear. – WYSIWYG Aug 06 '14 at 16:22
  • Okay, let's forget about the "goldstandard" and "absolutely correct" business. I edited the question so now I am only asking for the datasets of aligned nt sequences. – Anas Elghafari Aug 08 '14 at 20:34
  • Hey guys, I edited my question into something which -I hope- is clearer. Can you please remove the hold? – Anas Elghafari Aug 09 '14 at 13:04
  • what do you mean by aligned nucleotide sequences: pairwise alignment or MSA? – WYSIWYG Aug 13 '14 at 10:59
  • pairwise would be enough for my purpose. – Anas Elghafari Aug 14 '14 at 00:13
  • There are some manually curated HIV alignments at LANL. HIV is quite variable in length, so you will find many indels in these alignments. – rmccloskey Sep 20 '14 at 01:51

1 Answer

You can find the 46-way multiz alignment in the UCSC Genome Browser. It is listed in the comparative genomics section, labelled "cons 46-way", and is a genome alignment of 46 vertebrate species. You can use the data in the genome browser on the site, or get download information here.
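If you download the multiz data, it will most likely be in MAF (Multiple Alignment Format), which is the format I would expect UCSC to use for these files; check the download page to confirm. Assuming that, here is a minimal sketch of reading alignment blocks with plain Python. The function name `parse_maf` and the example records are made up for illustration, and the sketch only reads the "s" (sequence) lines of each block.

```python
def parse_maf(lines):
    """Yield alignment blocks as lists of (source, start, aligned_text) tuples.

    Assumes the usual MAF layout: each block begins with an "a" line,
    followed by "s" lines of the form
        s <source> <start> <size> <strand> <srcSize> <text>
    and blocks are separated by blank lines. Other line types are ignored.
    """
    block = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "a":
            # A blank line or a new "a" line closes the current block.
            if block:
                yield block
            block = []
        elif fields[0] == "s":
            block.append((fields[1], int(fields[2]), fields[6]))
    if block:
        yield block


# Made-up example data, not real coordinates or sequences.
EXAMPLE_MAF = """##maf version=1
a score=23262.0
s hg19.chr1    1000 8 + 249250621 ACGT-ACGT
s panTro2.chr1 2000 9 + 229974691 ACGTTACGT

a score=100.0
s hg19.chr1    1010 4 + 249250621 GG-CC
s mm9.chr4     5000 5 + 155630120 GGACC
"""

blocks = list(parse_maf(EXAMPLE_MAF.splitlines()))
for block in blocks:
    for source, start, text in block:
        print(source, start, text)
```

Any two rows of a block can be treated as a pairwise alignment, although columns where both rows have a gap should be dropped first.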

If you are interested in pairwise alignments, I don't know of any pairwise alignment database, but you don't actually need one. You can search the NCBI Nucleotide database for sequences and align them using BLAST on the NCBI website. BLAST is perhaps the most common tool for pairwise alignment and for database searches, where a single query sequence is searched for matches against a database of sequences. If you want to run a large number of alignments, you can download BLAST and run it on your own computer, which is faster.
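Once you have pairwise alignments as two gapped strings of equal length, estimating how often indels are a single nucleotide long is mostly bookkeeping. Here is a rough sketch, assuming gaps are written as "-" and that each uninterrupted run of gaps in either sequence counts as one indel event, which is a simplification. The function names and the example sequences are made up for illustration.

```python
from collections import Counter
from itertools import groupby


def gap_run_lengths(aligned_a, aligned_b):
    """Count indel events by length in a pairwise alignment.

    Each maximal run of "-" in either aligned sequence is treated as
    one indel event (an assumption, not a biological certainty).
    Returns a Counter mapping run length -> number of events.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    runs = Counter()
    for seq in (aligned_a, aligned_b):
        for is_gap, group in groupby(seq, key=lambda c: c == "-"):
            if is_gap:
                runs[sum(1 for _ in group)] += 1
    return runs


def single_nt_indel_fraction(runs):
    """Fraction of indel events that are exactly one nucleotide long."""
    total = sum(runs.values())
    return runs[1] / total if total else 0.0


# Made-up toy alignment: two 1-nt gaps and one 3-nt gap.
seq_a = "ACGT-ACGTAC---GT"
seq_b = "ACGTTAC-TACGGGGT"

runs = gap_run_lengths(seq_a, seq_b)
print(dict(runs))
print(single_nt_indel_fraction(runs))
```

Keep in mind that the counts reflect the aligner's gap penalties as much as the biology, so parameters estimated this way inherit the scoring scheme of whatever tool produced the alignments.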

Macond