Filtering STRING PPI dumps by taxonomy

Recently I had to filter a STRING protein links database dump (e.g. protein.links.full.v9.05.txt.gz) by taxonomy ID. The original dataset was far too large to work with directly: it contains more than 670 million records.

To filter with constant memory (the full STRING dump is 47 GB, after all), I wrote this script, which reads the input line by line. It can keep only binary PPIs in which both proteins belong to the given organism (NCBI taxonomy ID), or alternatively binary PPIs in which at least one interacting protein belongs to that organism. For STRING, the two modes usually do not differ much.
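The two filter modes come down to a small predicate on the taxonomy prefix of each STRING ID. Here is a minimal sketch of that logic; the helper names `taxon_of` and `passes_filter` are hypothetical and do not appear in the script, and the protein IDs are only illustrative.

```python
def taxon_of(protein_id):
    """Return the taxonomy prefix of a STRING ID such as '4932.YAL001C'."""
    return protein_id.partition(".")[0]


def passes_filter(protein_a, protein_b, organism, match_both=True):
    """Decide whether one binary PPI record should be kept.

    match_both=True  -> both proteins must belong to `organism`
    match_both=False -> at least one protein must belong to `organism`
    """
    hits = (taxon_of(protein_a) == organism, taxon_of(protein_b) == organism)
    return all(hits) if match_both else any(hits)


# Illustrative IDs (assumed, not taken from a real dump)
print(passes_filter("4932.YAL001C", "4932.YAL002W", "4932"))
print(passes_filter("4932.YAL001C", "9606.ENSP00000000233", "4932"))
print(passes_filter("4932.YAL001C", "9606.ENSP00000000233", "4932", match_both=False))
```

Because each record is judged on its own, nothing needs to be kept in memory between lines, which is what makes constant-memory filtering possible.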

The input and output files can each be either plain text or gzip-compressed. Compression and decompression are handled transparently, based on whether the filename ends in .gz.
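As a rough sketch of how such transparent handling can look in Python 3, the opener can be picked from the filename suffix and the file opened in text mode, so the caller always gets `str` lines. The helper name `open_maybe_gzip` and the temporary file names are assumptions made for this example.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(filename, mode="rt"):
    """Open a plain or gzip-compressed file, chosen by the '.gz' suffix.

    Text modes ("rt"/"wt") make gzip.open behave like the builtin open.
    """
    opener = gzip.open if filename.endswith(".gz") else open
    return opener(filename, mode)


# Round trip through both variants in a temporary directory
tmpdir = tempfile.mkdtemp()
for name in ("sample.txt", "sample.txt.gz"):
    path = os.path.join(tmpdir, name)
    with open_maybe_gzip(path, "wt") as outfile:
        outfile.write("protein1 protein2 combined_score\n")
    with open_maybe_gzip(path) as infile:
        print(name, repr(infile.readline()))
```

The text mode matters on Python 3: without it, `gzip.open` returns bytes, and string comparisons such as `line.startswith("protein1")` would fail.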

Example:

filter_protein_links_example.sh

# Keep only PPIs where both interacting proteins are from Saccharomyces cerevisiae
./filter-protein-links.py protein.links.full.v9.05.txt.gz output.txt.gz --filter-organism=4932 --match-both

# Keep PPIs where at least one interacting protein is from Saccharomyces cerevisiae
./filter-protein-links.py protein.links.full.v9.05.txt.gz output.txt.gz --filter-organism=4932
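To sanity-check a filtered file, one option is to count how often each pair of taxonomy IDs occurs. The following is only a sketch: `count_taxon_pairs` is a hypothetical helper, and the sample records are made up for illustration rather than taken from STRING.

```python
import io
from collections import Counter


def count_taxon_pairs(lines):
    """Count (taxonA, taxonB) pairs over an iterable of record lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("protein1"):
            continue  # header line
        parts = line.split()
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        pair = (parts[0].partition(".")[0], parts[1].partition(".")[0])
        counts[pair] += 1
    return counts


# Made-up sample in the same whitespace-separated layout
sample = io.StringIO(
    "protein1 protein2 combined_score\n"
    "4932.YAL001C 4932.YAL002W 900\n"
    "4932.YAL001C 9606.ENSP00000000233 150\n"
)
print(count_taxon_pairs(sample))
```

For a real output file, the same function can be fed a file object opened with `gzip.open("output.txt.gz", "rt")`. With --match-both, only the (4932, 4932) pair should show up.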

Source code:

filter_protein_links.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filter-protein-links.py

Filters STRING protein link dumps by taxonomy
"""
import gzip

__author__ = "Uli Koehler"
__license__ = "Apache v2.0"
__copyright__ = "Copyright 2013, Uli Koehler"

def filterSTRINGFile(infilename, outfilename, organismFilter, mustMatchBothOrganisms=True):
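    # Pick the opener by file extension; text mode ("rt"/"wt") ensures
    # that both plain and gzip files yield str lines on Python 3
    inOpenFunc = gzip.open if infilename.endswith(".gz") else open
    outOpenFunc = gzip.open if outfilename.endswith(".gz") else open
    allRecordsCounter = 0
    passCounter = 0
    with inOpenFunc(infilename, "rt") as infile, outOpenFunc(outfilename, "wt") as outfile: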
        for line in infile:
            if line.startswith("protein1"):
                outfile.write(line)
                continue  # Header line is copied through and not counted
            allRecordsCounter += 1
            if allRecordsCounter % 1000000 == 0:
                print ("Processed {} input records, {} records passed test...".format(
                        allRecordsCounter, passCounter))
            parts = line.split()
            #Check organism
            organismA = parts[0].partition(".")[0]
            organismB = parts[1].partition(".")[0]
            if mustMatchBothOrganisms and (organismA != organismFilter or organismB != organismFilter):
                continue
            if (not mustMatchBothOrganisms) and (organismA != organismFilter and organismB != organismFilter):
                continue
            #All tests passed, write line
            passCounter += 1
            outfile.write(line)
    #Print final statistics
    print ("Processed {} input records, {} records passed test".format(allRecordsCounter, passCounter))

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--filter-organism", type=int, required=True, dest="filterOrganism", help="NCBI taxonomy ID of the organism to filter for")
    parser.add_argument("--match-both", action="store_true", dest="matchBoth", help="Require both interacting proteins to belong to the organism (default: at least one)")
    parser.add_argument("infile", help="The input file (.gz supported)")
    parser.add_argument("outfile", help="The output file (.gz supported)")
    args = parser.parse_args()
    filterSTRINGFile(args.infile, args.outfile, str(args.filterOrganism), args.matchBoth)
