
Malicious PDF detection method based on document graph structure

Information Technology and Network Security, Issue 11

Yu Yuanzhe, Wang Jinshuang, Zou Xia

(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China)

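Abstract: Current machine-learning-based methods for detecting malicious PDF documents rely on expert experience to select features, which cannot fully reflect document attributes. Moreover, detector performance degrades markedly in the face of adversarial samples. To address these problems, a malicious PDF document detection method based on document graph structure and a convolutional neural network (CNN) is proposed. The method parses the document structure and builds a directed graph from the reference relationships among the objects in the document. The TF-IDF algorithm is then used to compute each node's contribution to classification, and the graph structure is pruned accordingly. Finally, the adjacency matrix and degree matrix of the pruned graph are computed to obtain the graph's Laplacian matrix, which is fed as the feature into a CNN classification model for training. Adversarial samples are also added so that the model is adversarially trained. Experimental evaluation shows that, with a 9:1 ratio of training to test samples and after repeated tuning of the network structure and parameters, the method reaches an accuracy of 99.71%, outperforming KNN and SVM classification models. On adversarial samples, the method achieves higher detection performance than the 67 antivirus engines on the well-known online scanning site VirusTotal.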

Keywords: malicious PDF document; document graph structure; convolutional neural network; adversarial sample

CLC number: TP309
Document code: A
DOI: 10.19358/j.issn.2096-5133.2021.11.003
Citation format: Yu Yuanzhe, Wang Jinshuang, Zou Xia. Malicious PDF detection method based on document graph structure[J]. Information Technology and Network Security, 2021, 40(11): 16-23.

Malicious PDF detection method based on document graph structure

Yu Yuanzhe, Wang Jinshuang, Zou Xia

(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, China)

Abstract: Malicious PDF detection methods based on machine learning rely on expert knowledge for feature selection and therefore cannot fully reflect document attributes. Moreover, the performance of such detectors is easily degraded by adversarial samples. To overcome these limitations, a malicious PDF detection method based on PDF document graph structure and a Convolutional Neural Network (CNN) is proposed. First, a directed graph is constructed from the document structure and the reference relationships between document objects. Second, the contribution of each node is calculated using the TF-IDF algorithm, and the graph structure is simplified accordingly. Third, the adjacency and degree matrices of the simplified graph are calculated to obtain the Laplacian matrix of the graph, which is used as the feature and fed to the CNN classification model for training. Adversarial samples are also added to train the model. Evaluation shows that the method achieves an accuracy of 99.71%, better than KNN and SVM classification models. Compared with the 67 antivirus engines on VirusTotal, it achieves higher detection performance on adversarial samples.

Key words: malicious PDF document; document graph structure; CNN; adversarial sample

0 Introduction

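PDF (Portable Document Format) documents are very widely used. As the format has evolved across versions, PDF documents have gained a wide variety of features, but some little-known ones (such as file embedding, JavaScript code execution, and dynamic forms) are increasingly exploited by attackers to carry out malicious network attacks [1]. APT (Advanced Persistent Threat) attacks [2] often craft cleverly disguised malicious PDF documents and use phishing emails and similar means to lure victims into downloading them, thereby intruding into or damaging computer systems. Compared with traditional malicious executables, malicious documents are more deceptive.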

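Detection methods based on machine learning are widely used by researchers and can be divided mainly into static detection, dynamic detection, and hybrid static-dynamic detection [3]. However, most existing feature selection methods for malicious documents are driven by expert knowledge: feature sets are chosen from observations made during manual analysis of malicious documents (such as the number of call-type objects, the page count, or the version number), or the features are refined through statistical analysis (such as the proportion of a given object type among all objects). Because the range of candidate features is very large, choosing only a subset based on experience loses part of the document's information and cannot fully express the document's characteristics.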

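Because the PDF format is complex, its logical structure carries a large amount of document semantics. Reference [4] argues that a comprehensive analysis of structural properties can explain the significant structural differences between malicious and benign PDF documents. This paper therefore analyzes the logical structure of a document as a whole and uses the document's structure graph, rather than independent structural paths, as the feature for detection. Even if an attacker knows which objects are key to successful detection and modifies a specific path accordingly, doing so breaks the overall structure of the document, so the cost of evading detection is high.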
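To make the graph-based feature concrete, the following is a minimal sketch of how a Laplacian matrix could be derived from an object reference graph. It is not the authors' implementation: the function name `build_laplacian`, the toy document `toy_refs`, and its object labels are hypothetical, and the choice of in-degree plus out-degree for the degree matrix is an assumption, since the excerpt does not say how degree is defined for the directed graph. Parsing a real PDF and the TF-IDF pruning step are omitted.

```python
from typing import Dict, List


def build_laplacian(refs: Dict[str, List[str]]):
    """Return (nodes, A, D, L) for a directed object-reference graph."""
    # Collect every node that appears as a source or a target, in sorted order.
    nodes = sorted(set(refs) | {t for targets in refs.values() for t in targets})
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)

    # Adjacency matrix: A[i][j] = 1 if object i references object j.
    A = [[0] * n for _ in range(n)]
    for src, targets in refs.items():
        for dst in targets:
            A[index[src]][index[dst]] = 1

    # Degree matrix (assumption: in-degree plus out-degree on the diagonal).
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        out_deg = sum(A[i])
        in_deg = sum(A[r][i] for r in range(n))
        D[i][i] = out_deg + in_deg

    # Graph Laplacian L = D - A.
    L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    return nodes, A, D, L


# Hypothetical toy document: the labels below are illustrative only.
toy_refs = {
    "Catalog": ["Pages", "OpenAction"],
    "Pages": ["Page"],
    "OpenAction": ["JavaScript"],
}
nodes, A, D, L = build_laplacian(toy_refs)
for name, row in zip(nodes, L):
    print(f"{name:<11}{row}")
```

In a pipeline like the one described, a matrix of this kind would be padded or resized to a fixed shape before being passed to a CNN; how the paper handles that step is not stated in this excerpt.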

For the full text of this article, please download: http://theprogrammingfactory.com/resource/share/2000003843
