How to extract common/significant phrases from a series of text entries

Answer 1

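I suspect you don't just want the most common phrases, but rather the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.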

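To do this, you essentially want to extract n-grams from your data and then find the ones with the highest pointwise mutual information (PMI). That is, you want to find the words that co-occur much more often than you would expect by chance.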
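To make "more often than expected by chance" concrete, here is a minimal pure-Python sketch of bigram PMI. The helper name `bigram_pmi` and the toy sentence are made up for illustration, and probabilities are simply estimated as count divided by the number of tokens, which is a simplifying assumption.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens (illustrative helper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), joint in bigrams.items():
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), each P estimated as count / n
        scores[(w1, w2)] = math.log2(joint * n / (unigrams[w1] * unigrams[w2]))
    return scores


# toy data, invented for this example
tokens = "new york is big and new york is busy and the park is big".split()
scores = bigram_pmi(tokens)

# print pairs from highest to lowest PMI
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Note that a pair seen only once, such as ("the", "park"), can outscore a genuinely recurring pair like ("new", "york"), because both of its words are rare. This is why a minimum-frequency filter is usually applied before ranking by PMI.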

The NLTK collocations how-to covers how to do this in about 7 lines of code, e.g.:

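import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder, TrigramAssocMeasures

# association measures used to score candidate n-grams
# (the sample corpus below needs a one-time nltk.download('genesis'))
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()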

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# print the 10 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))
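To apply the same idea to your own text entries rather than a bundled corpus, something like the following sketch may work. The `entries` list and the regex tokenizer are placeholders invented for illustration; `from_documents` is used so that bigrams are not formed across entry boundaries.

```python
import re

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# hypothetical text entries; replace with your own data
entries = [
    "The machine learning model overfits the training data",
    "We retrained the machine learning model on new data",
    "A machine learning pipeline needs clean training data",
    "The training data for the model was noisy",
]

# naive tokenizer (an assumption): lowercase, keep runs of letters and apostrophes
documents = [re.findall(r"[a-z']+", entry.lower()) for entry in entries]

# one token list per entry, so no bigram spans two entries
finder = BigramCollocationFinder.from_documents(documents)

# drop bigrams seen fewer than 2 times; raise this threshold for larger data sets
finder.apply_freq_filter(2)

top = finder.nbest(BigramAssocMeasures().pmi, 3)
print(top)
```

For three-word phrases, `TrigramCollocationFinder` and `TrigramAssocMeasures` follow the same pattern.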
