Python Machine Learning NLP: Modeling Movie Reviews with Word2vec
Overview
Starting today, we embark on a journey into natural language processing (NLP). NLP enables machines to process, understand, and make use of human language, serving as a bridge between machine language and human language.
Word Vectors
Let's first talk about what a word vector actually is. When we hand text to an algorithm, the computer cannot understand the raw text directly, and word vectors were created to solve this. Simply put, a word vector turns a word into a vector of numbers.
When we describe a person, we use indicators such as height and weight, and these indicators can be treated as a vector. Once we have vectors, we can measure similarity in a number of ways; a small cosine-similarity sketch follows.
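For example, here is a minimal sketch of measuring similarity as the cosine of the angle between two vectors (the two "person" vectors are made-up illustrations):

import numpy as np

# Two made-up "person" vectors: [height_cm, weight_kg]
a = np.array([175.0, 70.0])
b = np.array([180.0, 75.0])

# Cosine similarity: 1.0 means the vectors point the same way, 0 means orthogonal
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # close to 1.0, so the two descriptions are very similar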
那我們?nèi)绾蝸?lái)描述語(yǔ)言的特征呢? 我們把語(yǔ)言分割成一個(gè)個(gè)詞, 然后在詞的層面上構(gòu)建特征.
Word Vector Dimensionality
The higher the dimensionality of a word vector, the more information it can carry, and the more trustworthy the resulting computations tend to be (at the cost of more training data and compute).
A 50-dimensional word vector:
Rendered as a heatmap:
As the figure above shows, similar words have similar feature representations, which is evidence that these word features are meaningful.
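To draw a heatmap like this yourself, here is a minimal matplotlib sketch (the vectors below are random stand-ins, not real trained embeddings):

import numpy as np
from matplotlib import pyplot as plt

words = ["king", "queen", "man", "woman"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 50))  # stand-in 50-dimensional vectors

plt.figure(figsize=(10, 2))
plt.imshow(vectors, cmap="RdBu", aspect="auto")
plt.yticks(range(len(words)), words)
plt.xlabel("dimension")
plt.colorbar()
plt.show()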
Code Implementation
Preprocessing
import numpy as np
import pandas as pd
import itertools
import re
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import nltk

# Stop words, one per line
stop_words = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3,
                         sep="\n", names=["stop_words"])
stop_words = [word.strip() for word in stop_words["stop_words"].values]


def load_train_data():
    """Load the labeled training data."""
    data = pd.read_csv("data/labeledTrainData.tsv", sep="\t", escapechar="\\")
    print(data[:5])
    print("Number of training reviews:", len(data))  # 25,000
    return data


def load_test_data():
    """Load the unlabeled data."""
    data = pd.read_csv("data/unlabeledTrainData.tsv", sep="\t", escapechar="\\")
    print("Number of test reviews:", len(data))  # 50,000
    return data


def pre_process(text):
    # Strip HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # Remove everything except letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    words = text.lower().split()
    # Remove stop words
    words = [w for w in words if w not in stop_words]
    return " ".join(words)


def split_train_data():
    # Load the preprocessed training data
    data = pd.read_csv("data/train.csv")
    print(data.head())

    # Extract bag-of-words features
    vec = CountVectorizer(max_features=5000)
    # Fit the vocabulary
    vec.fit(data["review"])
    # Transform each review into a count vector
    train_data_features = vec.transform(data["review"]).toarray()
    print(train_data_features.shape)
    # The bag-of-words vocabulary
    print(vec.get_feature_names())

    # Split into train / test sets
    X_train, X_test, y_train, y_test = train_test_split(
        train_data_features, data["sentiment"], test_size=0.2, random_state=0)
    return X_train, X_test, y_train, y_test


def test():
    # Load the preprocessed test data
    data = pd.read_csv("data/test.csv")
    print(data.head())

    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

    # Split a review into sentences
    def split_sentences(review):
        raw_sentences = tokenizer.tokenize(review.strip())
        return raw_sentences

    sentences = sum(data["review"][:10].apply(split_sentences), [])


def visualize(cm, classes, title="Confusion matrix", cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Color threshold: white text on dark cells, black on light cells
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()


if __name__ == '__main__':
    # # Preprocess the training data
    # train_data = load_train_data()
    # train_data["review"] = train_data["review"].apply(pre_process)
    # print(train_data.head())
    #
    # # Save
    # train_data.to_csv("data/train.csv")

    # # Preprocess the unlabeled data
    # test_data = load_test_data()
    # test_data["review"] = test_data["review"].apply(pre_process)
    # print(test_data.head())
    #
    # # Save
    # test_data.to_csv("data/test.csv")

    split_train_data()
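Note that split_train_data only returns the split; no classifier is fit in the code above. Here is a minimal sketch of plugging its output into a logistic regression and the visualize helper (assuming data/train.csv has already been generated by the commented-out steps in __main__):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

X_train, X_test, y_train, y_test = split_train_data()

# Fit a simple linear classifier on the bag-of-words counts
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# Plot the confusion matrix with the helper defined above
cm = confusion_matrix(y_test, y_pred)
visualize(cm, classes=["negative", "positive"])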
Main Program
import pandas as pd
import nltk
from gensim.models.word2vec import Word2Vec


def pre_process():
    """Tokenize the preprocessed reviews."""
    # Load the preprocessed data
    data = pd.read_csv("data/test.csv")
    print(data.head())

    # Collect the tokenized reviews
    result = []
    for line in data["review"]:
        result.append(nltk.word_tokenize(line))
    return result


def main():
    # Get the tokenized corpus
    word_list = pre_process()

    # Word2Vec training parameters
    num_features = 300    # Word vector dimensionality
    min_word_count = 40   # Minimum word count
    num_workers = 4       # Number of threads to run in parallel
    context = 10          # Context window size
    model_name = '{}features_{}minwords_{}context.model'.format(
        num_features, min_word_count, context)

    # Train the Word2Vec model (gensim 4.x API)
    model = Word2Vec(sentences=word_list, workers=num_workers,
                     vector_size=num_features, min_count=min_word_count,
                     window=context)

    # Save the model
    model.save(model_name)


def test():
    # Load the trained model (run main() first to create it)
    model = Word2Vec.load("300features_40minwords_10context.model")

    # Which word does not belong with the others?
    match = model.wv.doesnt_match(['man', 'woman', 'child', 'kitchen'])
    print(match)

    # Most similar words
    print(model.wv.most_similar("boy"))
    print(model.wv.most_similar("bad"))


if __name__ == '__main__':
    test()
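To turn the trained word vectors back into review-level features for a classifier, one common approach, not shown in the original code, is to average the vectors of all in-vocabulary words in a review. A minimal sketch under that assumption:

import numpy as np
from gensim.models.word2vec import Word2Vec

model = Word2Vec.load("300features_40minwords_10context.model")

def review_vector(words, model, num_features=300):
    """Average the Word2Vec vectors of the in-vocabulary words in a review."""
    vec = np.zeros(num_features, dtype="float32")
    count = 0
    for w in words:
        if w in model.wv:  # words below min_count were pruned from the vocabulary
            vec += model.wv[w]
            count += 1
    return vec / max(count, 1)

# Example usage on one tokenized review
print(review_vector(["good", "movie", "bad"], model)[:5])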
Output (abridged):
   Unnamed: 0      id  sentiment                                             review
0           0  5814_8          1  stuff moment mj ve started listening music wat...
1           1  2381_9          1  classic war worlds timothy hines entertaining ...
2           2  7759_3          0  film starts manager nicholas bell investors ro...
3           3  3630_4          0  assumed praised film filmed opera didn read do...
4           4  9495_8          1  superbly trashy wondrously unpretentious explo...

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 200)         14684800
_________________________________________________________________
lstm (LSTM)                  (None, 200)               320800
_________________________________________________________________
dropout (Dropout)            (None, 200)               0
_________________________________________________________________
dense (Dense)                (None, 64)                12864
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130
=================================================================
Total params: 15,018,594
Trainable params: 15,018,594
Non-trainable params: 0
_________________________________________________________________
Epoch 1/2
313/313 [==============================] - 101s 315ms/step - loss: 0.5581 - accuracy: 0.7229 - val_loss: 0.3703 - val_accuracy: 0.8486
Epoch 2/2
313/313 [==============================] - 98s 312ms/step - loss: 0.2174 - accuracy: 0.9195 - val_loss: 0.3016 - val_accuracy: 0.8822
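The model summary in this log corresponds to a Keras sequential network. Below is a hypothetical reconstruction that reproduces the layer shapes and parameter counts above (the vocabulary size of 73,424 is inferred from the embedding parameter count 73,424 × 200 = 14,684,800, and the dropout rate is an assumption; the original Keras code is not shown here):

from tensorflow.keras import layers, models

vocab_size = 73424  # assumption: inferred from 14,684,800 / 200
embed_dim = 200

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),   # 73,424 * 200 = 14,684,800 params
    layers.LSTM(200),                          # 4 * ((200 + 200 + 1) * 200) = 320,800 params
    layers.Dropout(0.2),                       # rate is an assumption, not recoverable from the log
    layers.Dense(64, activation="relu"),       # 200 * 64 + 64 = 12,864 params
    layers.Dense(2, activation="softmax"),     # 64 * 2 + 2 = 130 params
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, None))
model.summary()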
That concludes this detailed walkthrough of Python machine learning NLP with Word2vec for movie review modeling. For more material on natural language processing, please see the other related articles on this site!