Remove lines between header lines in a text file using awk or sed


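I am wondering whether there is a sed or awk command that deletes all lines between the "Query_" headers in column 1 when fewer than 5 lines lie between one header and the next. Below is an excerpt from a large file of about 1 GB. I have tried various approaches, but all of them failed.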

Query_10      26   KMGKWYPTEDAPAKKRKTQSWRQNKSKLRGGIVPGQVLIILAGKHKGKRVVYLTQLSTGE  205
XP_010718494  131  KMPRYYPTEDVPRKSHGKKPFSQHKRRLRASITPGTVLILLTGRHRGKRVVFLKQLGTGL  192
NP_001291831  111  KMPRYYPTEDVPRKSHGKKPFSQHVRKLRASITPGTILIILTGRHRGKRVVFLKQLSSGL  172
Query_10      206  IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK  385
XP_010718494  193  LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT  255
NP_001291831  173  LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_012359817  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_009246541  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_003225150  155  LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT  217
Query_13      31    MEEQKEKGLSNPEVV*KYRQCSEIVNQVLSTVVSSCVPGADVASICTNGDFLIEDGLRNI  210
XP_002947167  7     IQGEQEPNLSVPEVVTKYKAAADICNRALQAVIDGCKDGSKIVDLCRTGDNFITKECGNI  66
XP_004993505  1     MELDRQSKVVDADALSKYRAAAAIANDCVQQLVANCIAGADVYTLAVEADTYIEQKLKEL  60
XP_006961234  1     MSETKEYSLNNPDTLTKYKTAAQISEKVLAAVSDLCVPGAKIVDICQQGDKLIEEELAKV  62
XP_008089018  1     MSEETDYTLNNPDTLTKYKTAAQISEKVLAAVAELVVPGEKIVTICEKGDKLIEEELAKV  60
Query_13      211   EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI  390
XP_004029906  65    YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV  131
XP_004031065  64    FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV  130
XP_003223249  65    KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF  125
XP_002947167  67    YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV  126
XP_003880798  73    ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV  135
XP_004348044  69    KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI  129
XP_003284133  69    HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI  128
NP_001241588  65    YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH  126
XP_009039553  76    YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM  141

The desired result is:

Query_10      206  IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK  385
XP_010718494  193  LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT  255
NP_001291831  173  LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_012359817  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_009246541  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_003225150  155  LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT  217
Query_13      211   EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI  390
XP_004029906  65    YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV  131
XP_004031065  64    FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV  130
XP_003223249  65    KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF  125
XP_002947167  67    YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV  126
XP_003880798  73    ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV  135
XP_004348044  69    KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI  129
XP_003284133  69    HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI  128
NP_001241588  65    YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH  126
XP_009039553  76    YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM  141

The Python script I tried:

lines = [line.rstrip() for line in open('infile.txt')]
for line in lines: 
    data = line.split()
    sequence = data[2]
    if data[0].startswith("Query_"):
        hits = [i for i,c in enumerate(sequence) if c == <50]
        continue
    else:
        print(list(sequence[plus50] for plus50 in hits))

Best Answer 1

With sed:

sed '
    /^Query_/{                #start a loop when a Query_ header is met
        :a
        $!{
            N
            /\nQuery_/!ba     #until the next header is met
        }
        /\(\n.*\)\{6,\}/{     #check that the buffer holds at least 6 newlines
            $b                #at end of file: print the buffer and finish
            s/\nQuery_/\n&/   #mark the block for printing (empty line before next header)
        }
    }
    /\n\n/P                   #print the first line of a marked block
    D                         #delete the first line of the buffer, go to start
    '

An alternative script using awk:

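awk '
    /^Query/{c=0;lines=$0;next}       #new header: reset counter, start buffer
    ++c<5{lines=lines "\n" $0;next}   #buffer the first 4 lines after a header
    c==5{print lines}                 #5th line reached: flush header + 4 lines
    1                                 #short for {print}
    '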
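
Since the question started from a Python attempt, the same buffering idea as the awk script could be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name keep_long_blocks, its parameters min_hits and header_prefix, and the command-line handling are made up for illustration and are not part of the answer above. It reads the file line by line, so it never holds more than a header plus four lines in memory, which matters for a file of about 1 GB.

```python
import sys
from typing import Iterable, Iterator


def keep_long_blocks(lines: Iterable[str], min_hits: int = 5,
                     header_prefix: str = "Query_") -> Iterator[str]:
    """Yield only blocks whose header is followed by at least min_hits lines.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    buffer = []   # header plus the lines held back so far
    hits = 0      # number of lines seen since the current header
    for line in lines:
        if line.startswith(header_prefix):
            # New header: discard whatever was buffered for a short block.
            buffer = [line]
            hits = 0
            continue
        hits += 1
        if hits < min_hits:
            # Not yet known whether the block is long enough, so hold back.
            buffer.append(line)
        else:
            if hits == min_hits:
                # Block just reached the threshold: flush header and buffer.
                yield from buffer
            yield line


# Only runs when a file name is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(keep_long_blocks(handle))
```

Saved under a name of your choice (say filter_blocks.py, also an assumption), it would be called as python filter_blocks.py infile.txt > outfile.txt. Unlike the awk script, which matches /^Query/, this sketch looks for the prefix "Query_"; adjust header_prefix if your headers differ.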
