Grep not working - searching files with variables

Question

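The script above greps the file MKD_nsi_lib1_R1_001.fq 48 times. Unless that file is small, the script will be very slow.

It also runs cat and sed 48 times on Barcodes.txt. That isn't fast either, but it is nowhere near as expensive as reading the .fq file 48 times.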

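Instead of running grep many times over the same input file, write an awk or perl script that does everything you need in a single pass. The larger barcodes.txt and MKD_nsi_lib1_R1_001.fq are, the more this helps.

Something like this: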

#!/usr/bin/perl

use strict;
# %patterns is a hash where the keys are fixed-text
# strings, and the values are file-handles to 
# files opened for append.
my %patterns;

# First open the barcodes.txt file and read it into
# the %patterns hash
my $barcodes;
open($barcodes,'<','barcodes.txt') || 
  die "Couldn't open 'barcodes.txt' for read: $!\n";

while(<$barcodes>) {
  chomp; # strip the newline at the end of each line
  my $outfile = "GrepBarcode_$_.txt";
  open($patterns{$_}, ">>", $outfile) ||
    die "Couldn't open '$outfile' for append: $!\n";
};
close($barcodes);

# Now process the .fq file(s) listed on the command line.
# also works with stdin.
while(<>) {
  # this assumes that the keyword is at the start
  # of the line and is followed by whitespace. This
  # is only a guess on my part, since you didn't describe
  # or provide a sample of your file.  If there's a different
  # delimiter in the input file, adjust the regex in the split
  # function.
  my ($p,undef) = split /\s+/, $_, 2;

  if (defined($patterns{$p})) {
    print { $patterns{$p} } $_;
  };
};

To run it, save it to a file (for example, split-fq.pl), make it executable with chmod +x split-fq.pl, and run it like this:

./split-fq.pl MKD_nsi_lib1_R1_001.fq

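This is written to use fixed strings when matching against the lines of MKD_nsi_lib1_R1_001.fq. It extracts the first "word" from each input line and checks whether that word is a key in the %patterns hash. If it is, the current line is written to the associated file handle.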
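The lookup step can be tried on its own without any files. This is only a minimal sketch: the barcodes, the input lines and the sub name bucket_by_first_word are made up for illustration, and it collects matching lines in arrays instead of writing them to file handles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical barcodes and input lines, for illustration only.
my @barcodes = qw(AAAGA CCTGA);
my @lines = (
  "AAAGA first record\n",
  "GGGGG no matching barcode\n",
  "CCTGA second record\n",
  "AAAGA third record\n",
);

# Returns a hash ref mapping each barcode to an array ref of the
# lines whose first whitespace-delimited word equals that barcode.
sub bucket_by_first_word {
  my ($barcodes, $lines) = @_;
  my %buckets = map { $_ => [] } @$barcodes;

  for my $line (@$lines) {
    # Same split as in the script: first "word" only.
    my ($p) = split /\s+/, $line, 2;

    # One hash lookup per line, however many barcodes there are.
    push @{ $buckets{$p} }, $line
      if defined $p && exists $buckets{$p};
  }
  return \%buckets;
}

my $buckets = bucket_by_first_word(\@barcodes, \@lines);
for my $code (sort keys %$buckets) {
  printf "%s: %d line(s)\n", $code, scalar @{ $buckets->{$code} };
}
```

With the made-up data above this prints "AAAGA: 2 line(s)" and "CCTGA: 1 line(s)"; the GGGGG line is skipped because it is not a key in the hash.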

It could use regular expressions instead, but that would be slower:

#!/usr/bin/perl

use strict;

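# %patterns is a hash where the keys are regular expressions
# anchored to the start of line (^). Perl hash keys are always
# strings, so each qr// object is stored in its stringified form
# and is compiled again when interpolated into the match below.
# The values are handles to files opened for append.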
my %patterns;

my $barcodes;
open($barcodes,'<','barcodes.txt') ||
    die "Couldn't open 'barcodes.txt' for read: $!\n";

while(<$barcodes>) {
  chomp;
  my $outfile = "GrepBarcode_$_.txt";
  open($patterns{qr/^$_/}, ">>", $outfile) ||
    die "Couldn't open '$outfile' for append: $!\n";
};

close($barcodes);

while(<>) {
  MATCH: foreach my $re (keys %patterns) {
    if (m/$re/) {
      print { $patterns{$re} } $_;
      last MATCH; # no need to test any more patterns against current line
    };
  };
};

This is slower than the fixed-text version above, but it is still much faster than running grep 48 times in a shell for loop, because it reads the .fq file only once instead of 48 times.
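If you do need regular expressions, one possible variation (not what the script above does, and not something I have benchmarked) is to join all the barcodes into a single alternation and capture which one matched, so each line is tested once instead of once per pattern. A minimal sketch, assuming the barcodes are literal strings at the start of the line; build_matcher and match_barcode are hypothetical names and the sample data is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one regex that matches any of the given barcodes at the
# start of a line and captures the barcode that matched.
sub build_matcher {
  my @barcodes = @_;
  die "no barcodes given\n" unless @barcodes;

  # Longest first, so a barcode that is a prefix of another one
  # cannot shadow it. quotemeta() keeps the barcodes literal.
  my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } @barcodes;

  return qr/^($alt)/;
}

# Returns the barcode found at the start of $line, or undef.
sub match_barcode {
  my ($re, $line) = @_;
  return $line =~ $re ? $1 : undef;
}

my $re = build_matcher(qw(AAAGA CCTGA));
for my $line ("AAAGAGAAATG\n", "TTTTTGAAATG\n") {
  my $code = match_barcode($re, $line);
  print defined $code ? "$code\n" : "no match\n";
}
```

The captured barcode could then be used as the key into a hash of file handles keyed by the plain barcode text, as in the fixed-string version.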

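Note: these are only examples of how to do this kind of thing. I don't know what is in your files, so I can't tell whether they process your data correctly. You haven't provided a sample of either barcodes.txt or the .fq file. You will almost certainly need to modify the scripts to suit your actual data.

Also, better tools for splitting fastq files may well already exist. In fact, there is a huge library of bioinformatics scripts and tools written in Perl: https://bioperl.org/

If you prefer Python, see: https://biopython.org/

And, of course, there is a Stack Exchange site for bioinformatics questions: https://bioinformatics.stackexchange.com/

The following version works with the sample data you provided.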

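It works much like the first fixed-string version (and should run at about the same speed), but it splits each input line of the .fq file into two fields (the variables $num and $data), using a colon (:) as the field delimiter.

It then uses Perl's substr() function to extract the first 5 characters of $data into another variable called $start.

If a key with the value of $start exists in the %patterns hash, the current line ($_) is written to the associated output file (via the file handle in $patterns{$start}).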
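The split-and-substr step can be checked in isolation. This is just a sketch: the sub name barcode_of and its optional length argument are made up for illustration, and the sample line is a shortened copy of one line from the sample output. Unlike the script, it returns nothing for a line that has no colon.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split "NUM:DATA" on the first colon and return NUM together with
# the first $len characters of DATA (5 by default). Returns an
# empty list if the line contains no colon.
sub barcode_of {
  my ($line, $len) = @_;
  $len //= 5;
  chomp $line;    # works on a copy; the caller's line is untouched

  my ($num, $data) = split /:/, $line, 2;
  return unless defined $data;

  return ($num, substr($data, 0, $len));
}

my ($num, $start) = barcode_of("6:AAAGAGAAATGTAATTTATACATACAGTACATATATA\n");
print "num=$num start=$start\n";    # prints: num=6 start=AAAGA
```

In the full script, $start is what gets looked up in %patterns.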

#!/usr/bin/perl

use strict;
my %patterns;
my $barcodes;

open($barcodes,'<','barcodes.txt') || 
  die "couldn't open 'barcodes.txt' for read: $!\n";
while(<$barcodes>) {
  chomp;
  my $outfile = "GrepBarcode_$_.txt";
  open($patterns{$_},">>","$outfile") ||
    die "couldn't open '$outfile' for append: $!\n";
};
close($barcodes);

while(<>) {
  my ($num,$data) = split /:/, $_, 2;
  my $start = substr($data,0,5);

  if (defined($patterns{$start})) {
    print { $patterns{$start} } $_;
  };
};

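When I ran it as a test, it created empty GrepBarcode_?????.txt output files, because examplefile.txt contains no lines that match any of the 5-letter codes in Barcodes.txt. If I add AAAGA to Barcodes.txt, it produces a GrepBarcode_AAAGA.txt file with the following contents: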

$ cat GrepBarcode_AAAGA.txt 
6:AAAGAGAAATGTAATTTATACATACAGTACATATATATATGGCAGCTGTCTCCCCAAATCCTGCTCTACTGCGTCATTGTTGTGGGAATTATTCCTGGGAGGGATGCGTGAAAAATGCAAGGATATGTGCCAAGAGTACTGCAGCACTA
10:AAAGACACTGCAGATAAACCCTGTGTAATAAATACATAAAATATGTTCCAACCATTTTTATAAATTTTCTGAGTAATCTGTGTTGGATTTTCAGAGTAAGCAAATGAGAAATTAGAGTATTTGATTCCCTGTTGCTTATCCAGGACTTT

Answer 1