Removing defects from text converted from a PDF with OCR

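I am using an OCR PDF reader to convert a PDF file. The original text is an image inside the PDF, and Foxit PDF converted it to text using OCR. The problem after conversion is that the text is not laid out correctly: all the words and lines appear to have shifted. Sample text: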

  biochemistry can be divided in three fields; molecular genetics, protein science and metabolism. Over the last decades 
of the 20th century, biochem
istry has through these three disciplines becom
e successful at explaining living processes. Almost all areas o
f the life sciences are being uncovered and developed by biochemical methodology and research.[2] Biochemistry focuses on unde
rstanding how biolog
ical molecules give 
rise to the processes that occur within living cells and
 between cells,[3] which
 in turn relates greatly to the study and understanding of 
, organs, and organism structure and function[4]

Biochemistry is closely related to mol
ecular biology, the study of the molecular mechanisms by which geneti
c information encoded in DNA is able to result in the processes of life.[5]

Much of biochemistry deals with the structu
res, 
 an
d interactions of biological macromolecules, such as proteins, nucleic acids, carbohydrates and lipids, which provide the structure of cells and perform many of the functions associated with life.[6] The chemistry of the cell also depends on the 
 of smaller molecules and ions. Th
ese can be inorganic, for example water and metal ions, or organic, for example the amino acids, which are used to synthesi
ze proteins.[7]
 The mechanisms by which cells harness energy from their environment via chemical reactions are known as metabolism. The findings of biochemistry are applied primarily in medicine, nutrition, and agriculture. In medicine, b
iochemists investigate the causes and cures of diseases.[8] In nutrition, they study how to maintain health wellness and study the effects of nutritional deficiencies.[9] In agriculture, biochemists investigate soil and fertilizers, and try to discover ways to improve crop cultivation, crop storage and pest control.

The problem is that some words are cut in half. Is there anything I can do to make the text more readable?

Best answer 1

There may be room for improvement, but here is a start:

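perl -0777 -ne 's/([^ ])\n/$1/g; s/\n/ /g; print' < input | fmt

This uses Perl to join the lines. The -0777 option makes Perl read the whole input as a single string. Where a line ends in a non-space character, the newline is removed outright, so the two halves of a split word are rejoined. Every remaining newline (one that follows a trailing space, or a blank line) is replaced with a space. The output is then piped to fmt, which re-wraps the resulting long line. Note that blank lines are joined as well, so paragraph breaks are lost, and two whole words are fused if a line ends on a complete word with no trailing space and the next line does not start with one.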
