Viral protein sequence analysis and the origins of the WNV outbreak in the United States
Data Set 1 (most suitable for students on the following programmes: Biochemistry, Biological and Medical Sciences, Genetics, Microbiology, and Molecular Biology).
Prepare a report on the origins of the 1999 West Nile virus (WNV) outbreak in New York City. Your report should also contain:
1. An alignment of the five partial amino acid sequences, complete with legend.
2. A phylogram of the five partial amino acid sequences, complete with legend.
3. A clear statement of the most likely country of origin of the virus that first appeared in New York City, and a review of the different mechanisms that may account for its original appearance there.
4. A statement clearly indicating what you consider to be the most likely mechanism of introduction and spread of the virus, and the reasons for your choice.
Hypothesis: WNV found its way into the United States as a consequence of increased opportunities for intercontinental travel.
You will test this hypothesis by an analysis of the relatedness of viral strains found in the United States and elsewhere in the world. You are provided with partial amino acid sequence information for a viral envelope glycoprotein from isolates obtained from different geographical locations. You should use this information to deduce the most likely origin of the virus that first appeared in New York City in 1999. In order to do this, you will need to adopt a bioinformatics-based approach for your analysis. You should prepare an alignment of the protein sequences and a phylogram using the Clustal Omega multiple alignment software available here:
http://www.ebi.ac.uk/Tools/msa/clustalo/
The sequence information you need is in the e-protein.txt file, which can be found in the LIFE223 module area on VITAL in the following location: Module Content > Communication and Study Skills > Scientific Report – Datasets. A copy of this information is also shown below:
>gi|14132790|gb|AAK52303.1| envelope glycoprotein, partial
PTTVESHGNYSTQVGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKTFLVHREWFMDLNLP
>gi|7141350|gb|AAF37302.1| envelope glycoprotein, partial
PTTVESHGNYSTQMGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKTFLVHREWFMDLNLP
>gi|357529490|gb|AET80926.1| envelope glycoprotein, partial
PTTVESHGNYSTQVGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKTFLVHREWFMDLNLP
>gi|4883993|gb|AAD31720.1|AF146082_1 envelope glycoprotein, partial
PTTVESHGNYSTQIGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKT
FLVHREWFMDLNLP
>gi|7239391|gb|AAF43216.1| envelope glycoprotein, partial
PTTVESHGNYFTQIGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKTFLVHR
EWFMDLNLP
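Before submitting the sequences to Clustal Omega, it can help to see where they differ. The following Python sketch is illustrative only and is not part of the required method: `parse_fasta` and `variable_columns` are hypothetical helper names, and the code assumes that each header sits on its own line and that all sequences have the same length, so that no gaps are needed. The demo uses only the first 18 residues of three of the records above, with one sequence wrapped over two lines.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so join them back together.
            records[header] += line
    return records


def variable_columns(records):
    """List (1-based position, residues) for columns that are not identical."""
    seqs = list(records.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences differ in length; align them first")
    return [
        (i + 1, "".join(column))
        for i, column in enumerate(zip(*seqs))
        if len(set(column)) > 1
    ]


# Excerpt only: first 18 residues of three of the records in the handout.
DEMO = """\
>AAK52303.1
PTTVESHGNYSTQVGATQ
>AAF37302.1
PTTVESHGNYSTQMGATQ
>AAF43216.1
PTTVESHGNY
FTQIGATQ
"""

print(variable_columns(parse_fasta(DEMO)))
# [(11, 'SSF'), (14, 'VMI')]
```

A simple column comparison like this only shows which positions vary. It does not replace the alignment and phylogram, which you should still produce with Clustal Omega.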
Your first step should be to investigate the original records from which this file was prepared. In particular, you should record the geographical location from which each isolate was taken (clearly, one will be from New York City). To do this, search the ‘Protein Database’ at the NCBI website (http://www.ncbi.nlm.nih.gov/) using the search form at the top of the home page and the accession numbers of the sample proteins, i.e. AAK52303.1, AAF37302.1, AET80926.1, AAD31720.1 and AAF43216.1. Search for only one protein at a time.
Once you have identified the country of origin for each sequence, you should modify the header lines (i.e. the lines starting with the > character) for each of the sequences in the following way:
>country1
PTTVESHGNYSTQVGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKTFLVHREWFMDLNLP
>country2
PTTVESHGNYSTQMGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKTFLVHREWFMDLNLP
>country3
PTTVESHGNYSTQVGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKTFLVHREWFMDLNLP
>country4
PTTVESHGNYSTQIGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKT
FLVHREWFMDLNLP
>country5
PTTVESHGNYFTQIGATQAGRFSITPAAPSYTLKLGEYGEVTVDCEPRSGIDTNAYYVMTVGTKTFLVHR
EWFMDLNLP
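The header lines can be edited by hand in any text editor. If you prefer to script the change, the following Python sketch shows one possible approach. It is an illustration under stated assumptions, not a required step: `relabel` and `COUNTRY_BY_ACCESSION` are hypothetical names, the labels `country1` to `country5` are placeholders that you must replace with the locations you recorded from the NCBI records, and the code assumes that every header is on its own line and contains the accession in the form `|gb|ACCESSION|`.

```python
import re

# Placeholder labels only: replace each value with the country you
# recorded from the corresponding NCBI protein record.
COUNTRY_BY_ACCESSION = {
    "AAK52303.1": "country1",
    "AAF37302.1": "country2",
    "AET80926.1": "country3",
    "AAD31720.1": "country4",
    "AAF43216.1": "country5",
}

# Matches the accession between "|gb|" and the next "|" in a header line.
ACCESSION = re.compile(r"\|gb\|([A-Z]+\d+\.\d+)\|")


def relabel(fasta_text, labels):
    """Replace each FASTA header with the label for its accession number."""
    out = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            match = ACCESSION.search(line)
            if match is None or match.group(1) not in labels:
                raise ValueError(f"no label for header: {line}")
            out.append(">" + labels[match.group(1)])
        elif line:
            # Sequence lines are copied through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


demo = (
    ">gi|7239391|gb|AAF43216.1|envelope glycoprotein, partial\n"
    "PTTVESHGNY\n"
    "FTQIGATQ\n"
)
print(relabel(demo, COUNTRY_BY_ACCESSION), end="")
```

Keying the labels on the accession number rather than on the order of the records means the result does not depend on how the sequences are arranged in the file.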