How can I parse lines where the number after a specific string is greater than a threshold?

I have the following file (list_20.txt):

[{"d_prime":"0.475425","variation1":"rs909776","r2":"0.057940","variation2":"rs16991816","population_name":"1000GENOMES:phase_3:KHV"}]
[{"r2":"0.057940","variation1":"rs909776","d_prime":"0.475425","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs16991819"}]
[{"variation1":"rs909776","r2":"0.078476","d_prime":"0.546491","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269","r2":"0.073418","variation1":"rs6130034","d_prime":"0.528588"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs1201686","r2":"0.060239","variation1":"rs3746539","d_prime":"0.271891"}]
[{"variation2":"rs1201686","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.280262","r2":"0.058212","variation1":"rs2144011"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs10485662","r2":"0.058826","variation1":"rs844808","d_prime":"0.423639"}]
[{"variation2":"rs6065565","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.638509","r2":"0.110749","variation1":"rs6139746"}]
[{"r2":"0.110749","variation1":"rs6139746","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072936"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6065562","variation1":"rs6139746","r2":"0.091021","d_prime":"0.606214"}]
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
...

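I want to extract only the lines where the value after "r2":" is greater than 0.7 and less than or equal to 1.

The expected output for this example is the following line: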

[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]

I tried this:

awk '$NF >= 0.8 && $NF <1 {print $0}' list_20.txt  > 20.out

But I get an empty file. Moreover, this command is not restricted to the string of interest, "r2".

Best Answer 1

That looks like JSON, so use a command-line JSON parser:

$ jq '.[] | select((.r2|tonumber) > 0.7 and (.r2|tonumber) <= 1)' file
{
  "variation1": "rs6139746",
  "r2": "0.910749",
  "d_prime": "0.638509",
  "population_name": "1000GENOMES:phase_3:KHV",
  "variation2": "rs6072937"
}

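You need tonumber to convert the value of the r2 key from a string to a proper number, but otherwise this is a simple select() filter.

It can be shortened slightly, or at least made to avoid converting each number twice: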
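To see why the conversion matters, here is a small check (not part of the original answer, and assuming jq is installed). jq orders values by type before comparing them, and strings sort after numbers, so without tonumber the test `.r2 > 0.7` holds for every record and `.r2 <= 1` for none:

```shell
# jq compares by type first: any string ranks above any number,
# regardless of the digits the string contains.
jq -n '"0.057940" > 0.7'                # string vs number: prints true
jq -n '("0.057940" | tonumber) > 0.7'   # number vs number: prints false
```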

jq '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1)' file

If you want the result in the same format as the input:

$ jq -c '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1) | [.]' file
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]

That is, use -c to request "compact output", and use [.] to wrap each result that passes the select() filter in an array.
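If jq is not available, something along these lines might work with awk alone. This is only a sketch, not the approach recommended above: it treats each line as plain text rather than JSON, and it assumes the key always appears exactly as "r2":"<digits>" with no spaces. The function name filter_r2 is a hypothetical helper introduced for illustration. The attempt in the question most likely fails because the lines contain no whitespace, so $NF is the entire line rather than a number.

```shell
# Hypothetical helper: print lines whose "r2" value lies in (0.7, 1].
filter_r2() {
    awk 'match($0, /"r2":"[0-9.]+"/) {
        # Skip the 6 characters of "r2":" and drop the closing quote;
        # adding 0 forces a numeric rather than a string comparison.
        r2 = substr($0, RSTART + 6, RLENGTH - 7) + 0
        if (r2 > 0.7 && r2 <= 1) print
    }' "$@"
}

# Two records taken from the sample input; only the second is printed.
printf '%s\n' \
  '[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6065562","variation1":"rs6139746","r2":"0.091021","d_prime":"0.606214"}]' \
  '[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]' |
  filter_r2
```

With a file, the call would be `filter_r2 list_20.txt > 20.out`. Because the match is purely textual, the jq filter remains the safer choice whenever the input is real JSON.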
