Extract start and end coordinates based on a defined length of non-fixed intervals

Question

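Loosely speaking, the problem is one of merging lines. A line can be "merged" with the previous line if its start coordinate equals the end coordinate of the previous line.

The lines may correspond to genomic features. The goal is to merge adjacent features within a genomic sequence.

An awk script that does this: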

$2 == end {
    # This line merges with the previous line.
    # Update end and continue with next line.

    end = $3;
    next;
}

{
    # This is an unmergeable line (start doesn't correspond to end on
    # previous line).

    # If we've processed at least the header line, print the data collected.
    # The if statement avoids printing an empty output line at the 
    # start of the output.

    if (NR > 1) {
        print chr, start, end, score, len;
    }

    # Get data from this line.

    chr = $1;
    start = $2;
    end = $3;
    score = $4;
    len = $5;
}

END {
    # At the end of input, print the data as above to output last line.
    print chr, start, end, score, len;
}

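The script assumes that the input is sorted and that every start coordinate is strictly less than its end coordinate (i.e., that all features are on the positive strand).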
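If it is unclear whether a file meets these assumptions, it can be checked before merging. The following is only a sketch: the function name check_intervals is hypothetical, and it assumes the same whitespace-separated five-column layout with a single header line. It reports lines where the start is not less than the end, or where a start is smaller than the previous start on the same chromosome.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script.
# Exits non-zero if any data line breaks the assumptions stated above.
check_intervals() {
    awk '
    NR == 1 { next }                      # skip the header line
    $2 + 0 >= $3 + 0 {                    # start must be strictly less than end
        printf "line %d: start %s is not less than end %s\n", NR, $2, $3
        bad = 1
    }
    $1 == prev_chr && $2 + 0 < prev_start {   # starts must not decrease within a chromosome
        printf "line %d: start %s precedes previous start %s\n", NR, $2, prev_start
        bad = 1
    }
    { prev_chr = $1; prev_start = $2 + 0 }
    END { exit bad + 0 }
    ' "$@"
}
```

A possible use is `check_intervals data && awk -f script.awk data`, so that the merge only runs when the check passes.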

Testing it:

$ awk -f script.awk data
chr start end score length
chr1 237592 237912 176 320
chr1 521409 521729 150 320
chr1 714026 714346 83 320
chr1 805100 805440 323 340
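The run above does not show its input, so the merging itself is not visible. As an illustration, here is the same logic in compact form applied to a small made-up input (the coordinates and scores are hypothetical, and merge_adjacent is just a wrapper name chosen for this example):

```shell
#!/bin/sh
# Same merge logic as the script above, wrapped in a shell function.
merge_adjacent() {
    awk '
    $2 == end { end = $3; next }          # start equals previous end: extend the run
    {
        if (NR > 1) print chr, start, end, score, len
        chr = $1; start = $2; end = $3; score = $4; len = $5
    }
    END { print chr, start, end, score, len }
    ' "$@"
}

# Hypothetical data: the first three features are contiguous, the fourth is not.
merge_adjacent <<'EOF'
chr start end score length
chr1 100 200 5 100
chr1 200 300 7 100
chr1 300 450 9 150
chr1 600 700 4 100
EOF
# Prints:
# chr start end score length
# chr1 100 450 5 100
# chr1 600 700 4 100
```

Note that only the end coordinate is updated on a merge. The score and length columns are carried over from the first line of each run, so the merged feature above still reports a length of 100 although it spans 350. If the merged length matters, it would have to be recomputed, for example as end - start.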
