Problem

Consider this data file. It contains information about the amino-acids in a protein called 1A2T. Each amino-acid in the protein is labeled by a single letter. There are 20 amin acid molecules in nature, and each has a total surface area (in units of Angstroms squared) that is given by the following table,

'A': 129.0
'R': 274.0
'N': 195.0
'D': 193.0
'C': 167.0
'Q': 225.0
'E': 223.0
'G': 104.0
'H': 224.0
'I': 197.0
'L': 201.0
'K': 236.0
'M': 224.0
'F': 240.0
'P': 159.0
'S': 155.0
'T': 172.0
'W': 285.0
'Y': 263.0
'V': 174.0

However, when these amino acids sit next to each other to form a chain protein, they cover parts of each other, such that only parts of their surfaces are exposed, while the rest is hidden from the outside world by other neighboring amino acids. Therefore, one would expect an amino acid that is at the core of a spherical protein would have almost zero exposed surface area.

Now given the above information, write a program that takes in two command-line input arguments or a function in a language of your choice that takes two input string arguments:

  1. a string containing the path to the above input file 1A2T_A.dssp which contains the partially exposed surface areas of amino acids in protein 1A2T for each of its amino acids and,
  2. a second input string which is the path to the file that will contain the output of the code (e.g., it could be named ./readDSSP.out).

Then, the code does the following tasks,

  1. reads the content of the input file and,

  2. extracts the names of the amino acids in this protein from the data column inside the file which has the header AA (look at the line number 25 inside the input data file, below AA is the column containing the one-letter names of amino acids in this protein) and,

  3. also extracts the partially exposed surface area information for each of these amino acids which appear in the column with the header ACC and,

  4. then uses the above table of maximum surface area values to calculate the fractional exposed surface area of each amino acid in this protein (i.e., for each amino acid, fraction_of_exposed_surface = ACC / maximum_surface_area_from_table) and,

  5. finally for each amino acid in this protein, it prints the one-letter name of the amino acid, the corresponding partially exposed surface area (ACC from the input file), and its corresponding fractional exposed surface area (name it RSA) to the output file given by the user on the command line.

  6. On the first column of the output file, the code should also write the name of the protein (which is the name of the input file 1A2T_A) on each line of the output file. Note that your code should extract the protein name from the input filename (by removing the file extension and other unnecessary information from the input command line string). Here is an example output of the code.

Write your code in such a way that it checks for the existence of the output file. If it already exists, then it does not remove the content of the file, whereas, it appends new data to the existing file. Otherwise, if the file does not exist, then it creates a new output file as requested by the user.

Python

Write your Python script in such a way that your code takes the input and output file names as two input command-line arguments to your script at the time of calling the script. Your Python script should also be able to handle an error resulting from less or more than 2 input arguments to your code. For example, if the number of input command-line arguments is something other than two, then it should print the following message on the screen and stop.

$ ./readDSSP.py ./1A2T_A.dssp


Usage:
      ./readDSSP.py <input dssp file> <output summary file>

Program aborted.

or,

$ ./readDSSP.py ./1A2T_A.dssp ./readDSSP.out amir


Usage:
      ./readDSSP.py <input dssp file> <output summary file>

Program aborted.

To achieve the above goal, you will have to create a dictionary from the above table, with amino acid names as the keys, and the maximum surface areas as the corresponding values. Name your code readDSSP.py and submit it to your repository. To check for the existence of the output file, you will need to use os.path.isfile function from module os in Python.

MATLAB

Write your code in MATLAB as a function that takes two input string arguments corresponding to the input and output file names.

Solution

Python

An example implementation can be downloaded from here.

#!/usr/bin/env python
import sys, os

residue_max_acc = {'A': 129.0, 'R': 274.0, 'N': 195.0, 'D': 193.0, \
                   'C': 167.0, 'Q': 225.0, 'E': 223.0, 'G': 104.0, \
                   'H': 224.0, 'I': 197.0, 'L': 201.0, 'K': 236.0, \
                   'M': 224.0, 'F': 240.0, 'P': 159.0, 'S': 155.0, \
                   'T': 172.0, 'W': 285.0, 'Y': 263.0, 'V': 174.0}
    
def main():
    if len( sys.argv ) != 3:
        print('''
 
Usage:''')
        print("     ", sys.argv[0], "<input dssp file>", "<output summary file>", '\n')
        sys.exit('Program aborted.\n')
    else:
        dssp_in = sys.argv[1]  # path for the output dssp file to be read
        sum_out = sys.argv[2]   # summary file containing all residue DSSP-property data

    # Now check if the output summary file already exists. If so, then append data for the new pdb to the data already in the file.
    if os.path.isfile(sum_out):
       sum_out_file = open(sum_out,'a')
    else:
       sum_out_file = open(sum_out,'w')
       sum_out_file.write('pdb' + '\t' + 'name' + '\t' + 'ACC' + '\t' + 'RSA' + '\n')
    
    input = open(dssp_in, 'r')
    fileContents = input.readlines()   # This is a list containing each line of the input file as an element.
    pdb_name = dssp_in[-11:-5]
    
    resnam = []     # A list containing all Residue Names
    acc = []        # A list containing all normalized residue acc values in the pdb file
    rsa = []        # A list containing all normalized residue acc values in the pdb file
    counter = 0
    for record in fileContents[25:len(fileContents)]:
        AA = record[13]
        if AA != ('!' or '*'):
            counter += 1
            resnam.append(AA)
            acc.append(record[35:39])
            rsa.append(float(record[35:39])/residue_max_acc[AA])
            
    input.close()
    
    # Now write out (or append to) the ouput file
    for (i,j) in enumerate(rsa):
        sum_out_file.write(pdb_name + '\t' + resnam[i] + '\t' + acc[i] + '\t' + str(rsa[i]) + '\n')
    
    return

if __name__ == "__main__":
    main()
MATLAB

Here is an implementation of the function in MATLAB,

function readDSSP( inputFileName, outputFileName )

    residue_max_acc.A = 129.0;
    residue_max_acc.R = 274.0;
    residue_max_acc.N = 195.0;
    residue_max_acc.D = 193.0;
    residue_max_acc.C = 167.0;
    residue_max_acc.Q = 225.0;
    residue_max_acc.E = 223.0;
    residue_max_acc.G = 104.0;
    residue_max_acc.H = 224.0;
    residue_max_acc.I = 197.0;
    residue_max_acc.L = 201.0;
    residue_max_acc.K = 236.0;
    residue_max_acc.M = 224.0;
    residue_max_acc.F = 240.0;
    residue_max_acc.P = 159.0;
    residue_max_acc.S = 155.0;
    residue_max_acc.T = 172.0;
    residue_max_acc.W = 285.0;
    residue_max_acc.Y = 263.0;
    residue_max_acc.V = 174.0;

    File.Input.name = inputFileName; % '1A2T_A.dssp'
    File.Output.name = outputFileName; % 'readDSSP_output_MATLAB.txt'

    % get the PDB name from the input file's name
    pdb_name = File.Input.name(1:6);

    % read all of the input file contents as a single string variable
    File.Input.contents = fileread( File.Input.name );

    % split the single string variable to a cell array each element of which
    % corresponds to a line in the input file
    File.Input.Line = strsplit( File.Input.contents, '\n' );

    % first allocate all the arrays to the maximum possible size to save
    % computational time by avoiding repeated allocation / deallocation of the
    % same variable
    File.Output.Result.AA = strings(length(File.Input.Line),1); % array containing the Amino Acid names
    File.Output.Result.ACC = zeros(length(File.Input.Line),1); % array containing the ACCessible surface areas
    File.Output.Result.ACC = zeros(length(File.Input.Line),1); % array containing the ACCessible surface areas

    % iterate over the lines of the input file
    counter = 0;
    for rowCell = File.Input.Line(26:length(File.Input.Line))
        row = rowCell{1};
        if ~isempty(row)
            if ~strcmp(row(14),'!')
                counter = counter + 1;
                File.Output.Result.AA(counter) = row(14);
                File.Output.Result.ACC(counter) = str2double(row(36:39));
                % compute the fractional/relative solvent accessible area
                File.Output.Result.RSA(counter) = File.Output.Result.ACC(counter) / residue_max_acc.(File.Output.Result.AA{counter});
            end
        end
    end

    % Now resize all the arrays to remove empty array elements
    File.Output.Result.AA = File.Output.Result.AA(1:counter);
    File.Output.Result.ACC = File.Output.Result.ACC(1:counter);
    File.Output.Result.RSA = File.Output.Result.RSA(1:counter);

    % now open the output file and check if the file already exists
    if exist(File.Output.name,'file')
        % the file already exists, so open it to append new data
        File.Output.id = fopen( File.Output.name, 'a' );
    else
        % the file does not exist, so open it to write new data
        File.Output.id = fopen( File.Output.name, 'w' );
    end

    % write the output file as a CSV (comma-separated-variable) file
    % write the output file's header
    fprintf( File.Output.id, "pdb,AA,ACC,RSA\n"); % \n stands for the new line character
    for i = 1:length(File.Output.Result.AA)
        fprintf ( File.Output.id ...
                , "%s,%s,%.0f,%.4f\n" ...
                , pdb_name ...
                , File.Output.Result.AA(i) ...
                , File.Output.Result.ACC(i) ...
                , File.Output.Result.RSA(i) ...
                );
    end
    fclose(File.Output.id);

end

Calling this function readDSSP.m on the MATLAB command-line with the following input/output filenames,

readDSSP( '1A2T_A.dssp' , 'readDSSP_output_MATLAB.txt' )

will generate this output file readDSSP_output_MATLAB.txt

Comments