Data transfer: Parsing Amino Acid data file

Problem

Consider this data file. It contains information about the amino acids in a protein called 1A2T. Each amino acid in the protein is labeled by a single letter. There are 20 types of amino acids in nature, and each has a maximum total surface area (in units of Angstroms squared) given by the following table:

'A': 129.0
'R': 274.0
'N': 195.0
'D': 193.0
'C': 167.0
'Q': 225.0
'E': 223.0
'G': 104.0
'H': 224.0
'I': 197.0
'L': 201.0
'K': 236.0
'M': 224.0
'F': 240.0
'P': 159.0
'S': 155.0
'T': 172.0
'W': 285.0
'Y': 263.0
'V': 174.0
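
In Python, the table above maps naturally onto a dictionary. The following is a minimal sketch; the name MAX_SURFACE_AREA is a hypothetical choice, not one prescribed by the problem.

```python
# Maximum total surface area of each amino acid, in square Angstroms,
# copied from the table above. Keys are the one-letter amino acid names.
# The variable name MAX_SURFACE_AREA is an illustrative assumption.
MAX_SURFACE_AREA = {
    "A": 129.0, "R": 274.0, "N": 195.0, "D": 193.0, "C": 167.0,
    "Q": 225.0, "E": 223.0, "G": 104.0, "H": 224.0, "I": 197.0,
    "L": 201.0, "K": 236.0, "M": 224.0, "F": 240.0, "P": 159.0,
    "S": 155.0, "T": 172.0, "W": 285.0, "Y": 263.0, "V": 174.0,
}
```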

However, when these amino acids sit next to each other to form a protein chain, they cover parts of each other, so that only part of each surface is exposed, while the rest is hidden from the outside world by neighboring amino acids. Therefore, one would expect an amino acid at the core of a spherical protein to have almost zero exposed surface area.

Now, given the above information, write a program that takes two command-line arguments, or a function in a language of your choice that takes two string arguments:

a string containing the path to the above input file 1A2T_A.dssp, which contains the partially exposed surface area of each amino acid in protein 1A2T, and
a second string containing the path to the file that will hold the output of the code (e.g., it could be named ./readDSSP.out).

Then, the code does the following tasks,

reads the content of the input file and,
extracts the names of the amino acids in this protein from the data column with the header AA (see line 25 of the input data file; the column below AA contains the one-letter names of the amino acids in this protein) and,
also extracts the partially exposed surface area information for each of these amino acids which appear in the column with the header ACC and,
then uses the above table of maximum surface area values to calculate the fractional exposed surface area of each amino acid in this protein (i.e., for each amino acid, fraction_of_exposed_surface = ACC / maximum_surface_area_from_table) and,
finally for each amino acid in this protein, it prints the one-letter name of the amino acid, the corresponding partially exposed surface area (ACC from the input file), and its corresponding fractional exposed surface area (name it RSA) to the output file given by the user on the command line.
In the first column of each line of the output file, the code should also write the name of the protein (which is the name of the input file, 1A2T_A). Note that your code should extract the protein name from the input filename (by removing the file extension and any other unnecessary information, such as the directory path, from the input command-line string). Here is an example output of the code.
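
As a sketch of the protein-name extraction step only, one option is to use os.path from the Python standard library. The function name protein_name_from_path is hypothetical.

```python
import os

def protein_name_from_path(input_path):
    """Return the file name without its directory and extension.

    The function name is an illustrative assumption.
    """
    base = os.path.basename(input_path)        # drops any leading directories
    name, _extension = os.path.splitext(base)  # drops the last extension, e.g. ".dssp"
    return name
```

For the input string ./1A2T_A.dssp, this returns 1A2T_A.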

Write your code so that it checks for the existence of the output file. If the file already exists, the code should not erase its content but should instead append the new data to it. Otherwise, if the file does not exist, the code should create a new output file as requested by the user.
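
The append-or-create behavior could be sketched as follows. The function name append_or_create and its arguments are hypothetical. Python's "a" mode by itself already creates a missing file, so the explicit check is shown only because the task asks for it.

```python
import os

def append_or_create(output_path, lines):
    """Append lines to output_path if it exists, otherwise create it.

    The function name and signature are illustrative assumptions.
    """
    if os.path.isfile(output_path):
        mode = "a"   # file exists: keep its content and append
    else:
        mode = "w"   # file does not exist: create a new one
    with open(output_path, mode) as output_file:
        output_file.writelines(lines)
```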

Warning: In some rows there is a ! in place of a one-letter amino acid name. Your code should detect this abnormality and skip the row, because such a row does not contain amino acid information.
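
The reading, skipping, and RSA steps might look like the sketch below. It rests on assumptions that should be checked against the actual file: that the table begins right after the header line containing AA and ACC, that the AA letter is the 14th character of each row, and that ACC occupies characters 35 to 38 (the usual fixed-width DSSP layout). The function name parse_dssp_lines is hypothetical, and skipping any letter missing from the table is an extra choice made here, not a requirement of the problem.

```python
def parse_dssp_lines(lines, max_surface_area):
    """Return a list of (amino_acid, acc, rsa) tuples.

    lines: the lines of a DSSP file.
    max_surface_area: dictionary mapping one-letter names to maximum areas.
    Column positions below are assumptions about the DSSP layout.
    """
    records = []
    in_table = False
    for line in lines:
        if not in_table:
            # The data table starts after the header line holding AA and ACC.
            if line.lstrip().startswith("#") and " AA " in line and " ACC" in line:
                in_table = True
            continue
        if len(line) < 38:
            continue                      # too short to hold the ACC column
        amino_acid = line[13]             # assumed position of the AA column
        if amino_acid == "!":
            continue                      # abnormal row: no amino acid information
        if amino_acid not in max_surface_area:
            continue                      # assumption: skip letters not in the table
        acc = float(line[34:38])          # assumed position of the ACC column
        rsa = acc / max_surface_area[amino_acid]
        records.append((amino_acid, acc, rsa))
    return records
```

Passing the table dictionary in as an argument keeps the function independent of any global variable, which makes it easier to test on a few hand-written rows.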

Write your Python script so that it takes the input and output file names as two command-line arguments at the time of calling the script. Your script should also handle the error resulting from fewer or more than two input arguments. For example, if the number of command-line arguments is anything other than two, it should print the following message on the screen and stop.

$ ./readDSSP.py ./1A2T_A.dssp

Usage:
      ./readDSSP.py <input dssp file> <output summary file>

Program aborted.

or,

$ ./readDSSP.py ./1A2T_A.dssp ./readDSSP.out amir

Usage:
      ./readDSSP.py <input dssp file> <output summary file>

Program aborted.
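
One way to sketch this argument check is shown below. The function name get_file_arguments and the exit status 1 are assumptions; the usage text is copied from the messages above.

```python
import sys

USAGE = """
Usage:
      ./readDSSP.py <input dssp file> <output summary file>

Program aborted.
"""

def get_file_arguments(argv):
    """Return (input_path, output_path), or print the usage text and exit.

    The function name and the exit status are illustrative assumptions.
    """
    # argv[0] is the script name, so two user arguments give len(argv) == 3.
    if len(argv) != 3:
        print(USAGE)
        sys.exit(1)
    return argv[1], argv[2]
```

Inside the script, this would typically be called as get_file_arguments(sys.argv).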

To achieve the above goal, you will have to create a dictionary from the above table, with the amino acid names as the keys and the maximum surface areas as the corresponding values. Name your code readDSSP.py and submit it to your repository. To check for the existence of the output file, you will need to use the os.path.isfile function from Python's os module.

Write your code in MATLAB as a function that takes two input string arguments corresponding to the input and output file names.

Tip: Unlike in Python, dictionaries are not first-class citizens in MATLAB. However, one can use structures in MATLAB to effectively mimic the behavior of dictionaries in Python. In this case, the fields of a MATLAB structure are equivalent to the keys of a Python dictionary, and the corresponding field values are equivalent to the dictionary's values.

Amir Shahmoradi
