引言
大家好,我叫 Austin。
今天我要介紹在自然語言處理中一種名為隨機同義詞替換的數據增強方法。
在人類對話或書寫時,我們常使用不同的單詞表達同一事物。
因此,這種方法是在模擬人類日常對話或寫作時,用不同的詞來表達同一事物。
在這個方法中,關鍵是用同義詞代替隨機選擇的詞來防止神經網路過擬合。
好!讓我們來寫程式吧。
步驟
此方法有 3 個步驟。
第一步,我們需要隨機選擇一個單詞,並設置一個相似度的閾值,以防止取得不相似的同義詞。
第二步,根據選定的單詞找出前 10 個相似的同義詞並利用閾值去除閾值以下的相似同義詞。
第三步,從上一步的結果中隨機選擇同義詞替換選定的單詞。
要求
請按照以下列表安裝套件。
pip install --upgrade gensim numpy
程式碼
import
#import
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
import numpy as np
import random
import string
class
# class
class RandomSynonymReplacement:
def __init__(self, corpus: str, similarity_threshold: float) -> None:
self.model = Word2Vec(api.load(corpus)) # create the model of Word2Vec
self.similarity_threshold = similarity_threshold # set the threshold
def __call__(self, text: str) -> str:
# Split the input text with spaces to get each word
# and check if the last character is a punctuation mark
if text[-1] in string.punctuation:
words = text[:-1].split(' ')
else:
words = text.split(' ')
# randomly select a word and replace it with a synonym
for word_index in random.sample(range(len(words)), len(words)):
word = words[word_index]
# turn the selected word to lower case
# and check it whether exist in the vocabulary of the Word2Vec model
if word.lower() in self.model.wv.key_to_index:
# get similarity word by the model of Word2Vec
# and put it to numpy array
similarity_word = np.array(
self.model.wv.most_similar(word.lower()))
# get the similarity from similarity_word
similarity = similarity_word[:, 1].astype(np.float)
# get the index with similarity above the threshold
similarity_index = np.where(
similarity >= self.similarity_threshold)[0]
# check the length of similarity_index
if len(similarity_index):
# randomly select the synonym
words[words.index(word)] = random.sample(
list(similarity_word[similarity_index, 0]), 1)[0]
# check if the last character is a punctuation mark
if text[-1] in string.punctuation:
return ' '.join(words)+text[-1]
else:
return ' '.join(words)
return text
call
if __name__ == '__main__':
# create a class of RandomSynonymReplacement
random_synonym_replacement = RandomSynonymReplacement(
corpus='text8', similarity_threshold=0.5)
# define a string
text = 'Hello, World!'
# check the result
print(text)
print(random_synonym_replacement(text=text))
result
Hello, World!
Hello, europe!
完整版本
完整版的程式碼在這裡: https://github.com/fastyangmh/toolkit/blob/main/Python/RandomSynonymReplacement.py
結論
如果您有任何問題,請隨時通過電子郵件與我聯繫。
參考連結
What is Gensim?
NumPy
Data Augmentation in Natural Language Processing
NLP Data Augmentation 常見方法