How do I join files on a required column in Linux?

Answer 1

Use GNU awk. I put the command into a bash script, which makes it more convenient to run.

Usage: ./join_files.sh, or, for pretty-printed output, ./join_files.sh | column -t:

#!/bin/bash

gawk '
NR == 1 {
    PROCINFO["sorted_in"] = "@ind_num_asc";
    header = $1;
}

FNR == 1 {
    file = gensub(/.*\/([^.]*)\..*/, "\\1", "g", FILENAME); 
    header = header OFS file;   
}

FNR > 1 {
    arr[$1] = arr[$1] OFS $5;
}

END {
    print header;

    for(i in arr) {
        print i arr[i];
    }
}' results/*.genes.results

Output (for testing, I created three files with the same content):

$ ./join_files.sh | column -t
gene_id          TB1      TB2      TB3
ENSG00000000003  1.00     1.00     1.00
ENSG00000000005  0.00     0.00     0.00
ENSG00000000419  1865.00  1865.00  1865.00
ENSG00000000457  1521.00  1521.00  1521.00
ENSG00000000460  1860.00  1860.00  1860.00
ENSG00000000938  6846.00  6846.00  6846.00
ENSG00000000971  0.00     0.00     0.00
ENSG00000001036  1358.00  1358.00  1358.00
ENSG00000001084  1178.00  1178.00  1178.00
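
To reproduce a test like this one, the following is a minimal sketch of building three identical input files. Only the gene_id first column and the expected_count fifth column come from the answer; the other column names, the temporary directory, and the two sample rows are assumptions made up for illustration. It uses plain awk for a quick check of columns 1 and 5 before the join script is run.

```shell
#!/bin/bash
# Hypothetical test fixture: every column name except gene_id and
# expected_count is assumed, as are the sample rows.
workdir=$(mktemp -d)
mkdir -p "$workdir/results"

# Write the same header and rows into three sample files.
for sample in TB1 TB2 TB3; do
    {
        printf 'gene_id\ttranscript_ids\tlength\teffective_length\texpected_count\n'
        printf 'ENSG00000000003\tT1\t100.00\t90.00\t1.00\n'
        printf 'ENSG00000000419\tT2\t200.00\t190.00\t1865.00\n'
    } > "$workdir/results/$sample.genes.results"
done

# Sanity check: print the key column and the value column of one file,
# skipping its header line.
check=$(awk 'FNR > 1 { print $1, $5 }' "$workdir/results/TB1.genes.results")
echo "$check"
```

Because the script reads the relative path results/*.genes.results, it would need to be started from inside "$workdir" to pick these files up.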

Explanation: the same code with comments added. See also man gawk.

gawk '
# NR is the total number of input records read so far, across all files.
# NR == 1 matches only the very first line of the first file.

NR == 1 {
    # If the "sorted_in" element exists in PROCINFO, its value controls
    # the order in which array elements are traversed in a for (i in arr) loop;
    # otherwise the order is undefined.

    PROCINFO["sorted_in"] = "@ind_num_asc";

    # Each field of the input record can be referenced by its position: $1, $2, and so on.
    # $1 is the first field (the first column).
    # On the first line, the first field is the word "gene_id";
    # store it in the header variable.

    header = $1;
}

# FNR is the record number within the current input file.
# NR counts lines across all files; FNR restarts at 1 for each file.
# FNR == 1 matches the first line of each file.

FNR == 1 {
    # Strip the unneeded parts of the file name with the gensub function:
    # before: results/TB1.genes.results
    # after:  TB1

    file = gensub(/.*\/([^.]*)\..*/, "\\1", "g", FILENAME); 

    # Append it to the header variable, separated from the
    # previous content of the header by OFS.
    # OFS is the output field separator, a single space by default.

    header = header OFS file;   
}

# The trick: accumulate one string of values per gene.
# $1 - the first column value ("gene_id")
# $5 - the fifth column value ("expected_count")
FNR > 1 {
    # Build an array indexed by gene_id: arr["ENSG00000000003"], arr["ENSG00000000419"], and so on.
    # Each time a line with a given gene_id is read, its $5 value is appended
    # to that array element, preceded by OFS.

    # Example (every value carries a leading OFS, which is why the END block
    # prints i and arr[i] with no separator between them):
    # arr["ENSG00000000003"] = " 1.00"
    # arr["ENSG00000000003"] = " 1.00 2.00"
    # arr["ENSG00000000003"] = " 1.00 2.00 3.00"

    arr[$1] = arr[$1] OFS $5;
}

END {
    print header;

    for(i in arr) {
        print i arr[i];
    }
}' results/*.genes.results
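
The accumulation step can be seen in isolation on two tiny files with different values. This is a sketch rather than part of the original answer: the file names a.txt and b.txt, their contents, and the use of column 2 instead of column 5 are made up for illustration, and it relies on plain awk piped through sort instead of PROCINFO["sorted_in"].

```shell
#!/bin/bash
# Hypothetical two-file demo of the arr[$1] = arr[$1] OFS value pattern.
demo=$(mktemp -d)
printf 'id val\ng1 1.00\ng2 5.00\n' > "$demo/a.txt"
printf 'id val\ng1 2.00\ng2 6.00\n' > "$demo/b.txt"

joined=$(awk '
# Skip each header line; append column 2 to the string stored for the key.
FNR > 1 { arr[$1] = arr[$1] OFS $2 }

# Each stored string starts with OFS, so no separator is needed after the key.
END { for (i in arr) print i arr[i] }
' "$demo/a.txt" "$demo/b.txt" | sort)

echo "$joined"
# g1 1.00 2.00
# g2 5.00 6.00
```

Values are appended in file order, so the columns line up only when every file contains every key; a key missing from one file would shift its later values one column to the left.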
