Data Augmentation in Natural Language Processing: Random Synonym Replacement


Introduction

Hi everyone, my name is Austin.

Today I will introduce a data augmentation method for natural language processing called random synonym replacement.

When we speak or write, we often use different words to express the same thing.
(Figure: a synonym example)

This method therefore mimics how people, in everyday conversation or writing, use different words to express the same thing.

The key idea of this method is to replace a randomly selected word with a synonym, which helps prevent the neural network from overfitting.

All right, let's write the code!

Steps

This method has three steps.

First, randomly select a word and set a similarity threshold so that dissimilar synonyms are not picked up.

Second, find the top 10 most similar synonyms of the selected word and use the threshold to drop any synonym whose similarity falls below it.

Third, randomly choose a synonym from the remaining candidates and replace the selected word with it.
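To make these steps concrete, here is a minimal, self-contained sketch of steps 2 and 3, assuming the text8 corpus and gensim's most_similar API; the word 'world' and the 0.5 threshold are only illustrative values, not part of the method itself.

import random

import gensim.downloader as api
from gensim.models.word2vec import Word2Vec

# train a Word2Vec model on the text8 corpus downloaded by gensim
model = Word2Vec(api.load('text8'))
threshold = 0.5  # similarity threshold (step 1)

# step 2: top-10 synonym candidates as (word, similarity) pairs
candidates = model.wv.most_similar('world')  # topn defaults to 10
kept = [word for word, score in candidates if score >= threshold]

# step 3: randomly pick one of the remaining synonyms
if kept:
    print(random.choice(kept))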

Requirements

Install the packages listed below.

pip install --upgrade gensim numpy
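If you want to confirm which corpora gensim can download, the small sketch below uses gensim.downloader's info() call to list them, so you can check that text8 (used later in this post) is available.

# list the corpora that gensim.downloader can fetch
import gensim.downloader as api

print(list(api.info()['corpora'].keys()))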

Code

import

#import
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
import numpy as np
import random
import string

class

# class


class RandomSynonymReplacement:
    def __init__(self, corpus: str, similarity_threshold: float) -> None:
        self.model = Word2Vec(api.load(corpus))  # train a Word2Vec model on the downloaded corpus
        self.similarity_threshold = similarity_threshold    # set the threshold

    def __call__(self, text: str) -> str:
        # Split the input text with spaces to get each word
        # and check if the last character is a punctuation mark
        if text[-1] in string.punctuation:
            words = text[:-1].split(' ')
        else:
            words = text.split(' ')

        # randomly select a word and replace it with a synonym
        for word_index in random.sample(range(len(words)), len(words)):
            word = words[word_index]
            # turn the selected word to lower case
            # and check whether it exists in the vocabulary of the Word2Vec model
            if word.lower() in self.model.wv.key_to_index:
                # get the most similar words from the Word2Vec model
                # and store them in a numpy array
                similarity_word = np.array(
                    self.model.wv.most_similar(word.lower()))
                # get the similarity from similarity_word
                similarity = similarity_word[:, 1].astype(float)
                # get the index with similarity above the threshold
                similarity_index = np.where(
                    similarity >= self.similarity_threshold)[0]
                # check the length of similarity_index
                if len(similarity_index):
                    # randomly select the synonym
                    words[word_index] = random.sample(
                        list(similarity_word[similarity_index, 0]), 1)[0]
                    # check if the last character is a punctuation mark
                    if text[-1] in string.punctuation:
                        return ' '.join(words)+text[-1]
                    else:
                        return ' '.join(words)
        return text

call

if __name__ == '__main__':
    # create a class of RandomSynonymReplacement
    random_synonym_replacement = RandomSynonymReplacement(
        corpus='text8', similarity_threshold=0.5)

    # define a string
    text = 'Hello, World!'

    # check the result
    print(text)
    print(random_synonym_replacement(text=text))

result

Hello, World!
Hello, europe!

Full version

The full version of the code is available here: https://github.com/fastyangmh/toolkit/blob/main/Python/RandomSynonymReplacement.py

Conclusion

If you have any questions, please feel free to contact me by email.

Reference links

What is Gensim?
NumPy
Data Augmentation in Natural Language Processing
Common NLP Data Augmentation Methods