《電子技術應用》
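Malicious PDF Document Detection Method Based on Document Graph Structure
Information Technology and Network Security, Issue 11
Yu Yuanzhe, Wang Jinshuang, Zou Xia
(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China)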
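Abstract: Current machine-learning-based methods for detecting malicious PDF documents rely on expert experience to select features, which cannot fully reflect document attributes. Moreover, detector performance degrades markedly when facing adversarial samples. To address these problems, a malicious PDF document detection method based on document graph structure and a convolutional neural network (CNN) is proposed. The method parses the document structure and builds a directed graph from the reference relationships among the objects in the document. It then uses the TF-IDF algorithm to compute each node's contribution to classification and simplifies the graph structure accordingly. Finally, it computes the adjacency matrix and degree matrix of the simplified graph, derives the graph's Laplacian matrix, and feeds this matrix as the feature into a CNN classification model for training. Adversarial samples are also added so that the model undergoes adversarial training. Experimental evaluation shows that, with a training-to-test sample ratio of 9:1 and after iterative tuning of the network structure and parameters, the method reaches an accuracy of 99.71%, outperforming KNN and SVM classification models. In detecting adversarial samples, the method achieves higher detection performance than the 67 antivirus engines on the well-known online scanning site VirusTotal.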
CLC number: TP309
Document code: A
DOI: 10.19358/j.issn.2096-5133.2021.11.003
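Citation format: Yu Yuanzhe, Wang Jinshuang, Zou Xia. Malicious PDF detection method based on document graph structure[J]. Information Technology and Network Security, 2021, 40(11): 16-23.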
Malicious PDF detection method based on document graph structure
Yu Yuanzhe, Wang Jinshuang, Zou Xia
(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210007, China)
Abstract: Existing machine-learning-based methods for detecting malicious PDF documents rely on expert knowledge to select features, which cannot fully reflect document attributes. Moreover, detector performance degrades markedly on adversarial samples. To overcome these limitations, a malicious PDF detection method based on the PDF document graph structure and a convolutional neural network (CNN) is proposed. First, a directed graph is constructed from the document structure and the reference relationships between document objects. Second, the contribution of each node is calculated using the TF-IDF algorithm, and the graph structure is simplified accordingly. Third, the adjacency and degree matrices of the simplified graph are calculated to obtain the graph's Laplacian matrix, which is fed as the feature to the CNN classification model for training. Adversarial samples are also added to the training set for adversarial training. Evaluation shows that the method achieves an accuracy of 99.71%, outperforming KNN and SVM classification models. On adversarial samples, it achieves higher detection performance than the 67 antivirus engines on VirusTotal.
Key words: malicious PDF document; document graph structure; CNN; adversarial sample
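
The matrix step named in the abstract (adjacency matrix, degree matrix, Laplacian matrix) can be illustrated with a minimal sketch. The abstract does not say how the degree of a node in the directed graph is defined, so the code below assumes degree = in-degree + out-degree and uses the common form L = D - A. The function name `laplacian_from_edges` and the four-node example graph are hypothetical, chosen only for illustration.

```python
import numpy as np


def laplacian_from_edges(num_nodes, edges):
    """Return (A, D, L) for a directed graph given as (src, dst) index pairs."""
    A = np.zeros((num_nodes, num_nodes), dtype=np.int64)
    for src, dst in edges:
        A[src, dst] = 1  # edge src -> dst: object src references object dst
    # Assumption (not stated in the source): degree = out-degree + in-degree.
    degree = A.sum(axis=1) + A.sum(axis=0)
    D = np.diag(degree)
    L = D - A  # common (unnormalized) Laplacian form
    return A, D, L


# Hypothetical graph: 0 = Catalog, 1 = Pages, 2 = Page, 3 = JavaScript action
edges = [(0, 1), (1, 2), (0, 3)]
A, D, L = laplacian_from_edges(4, edges)
print(L)
```

Because the adjacency matrix of a directed graph is not symmetric, the resulting matrix is not symmetric either; a method that needs a symmetric Laplacian would have to symmetrize the graph first, which is a design choice the abstract leaves open.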

0 Introduction

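PDF (Portable Document Format) documents are very widely used. As the format has evolved through successive versions, the features a PDF document can contain have become increasingly varied, but some little-known features (such as file embedding, JavaScript code execution, and dynamic forms) are increasingly exploited by attackers to carry out malicious network attacks [1]. APT (Advanced Persistent Threat) attacks [2] often construct cleverly disguised malicious PDF documents and trick victims into downloading them through phishing emails and similar means, thereby intruding into or damaging computer systems. Compared with traditional malicious executables, malicious documents are more deceptive.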

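Machine-learning-based detection methods are widely used by researchers and can be broadly divided into static detection, dynamic detection, and hybrid static-dynamic detection [3]. However, most existing feature selection methods for malicious documents are driven by expert knowledge: feature sets are chosen from observations made during manual analysis of malicious documents (such as the number of call-type objects, the number of pages, or the version number), or features are refined through statistical analysis (such as the proportion of a given object type among all objects). Because the range of candidate features is very large, selecting only a subset on the basis of experience loses part of the document's information and cannot fully express the document's characteristics.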

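Because the PDF document format is complex, its logical structure carries a large amount of document semantics. Reference [4] argues that a comprehensive analysis of structural properties can explain the significant structural differences between malicious and benign PDF documents. This paper therefore analyzes the logical structure of the document as a whole and uses the document's structure graph, rather than independent structural paths, as the feature for detection. Even if an attacker knows which objects are key to successful detection and modifies a particular path accordingly, doing so would damage the overall structure of the document, so the cost of evading detection is high.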
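
To make the idea of a structure graph concrete, the following is a rough sketch of how reference edges between PDF objects might be collected. It is not the parser used in the paper: it assumes an uncompressed, unencrypted file in which objects appear as plain `N G obj ... endobj` blocks and indirect references as `N G R`, and it ignores object streams, cross-reference tables, and incremental updates. The function name `reference_edges` and the sample bytes are hypothetical.

```python
import re

# Simplified patterns (assumption: plain-text, uncompressed objects only).
OBJ_RE = re.compile(rb"(\d+)\s+\d+\s+obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+\d+\s+R\b")


def reference_edges(pdf_bytes):
    """Return sorted (source_object, referenced_object) pairs found in pdf_bytes."""
    edges = set()
    for match in OBJ_RE.finditer(pdf_bytes):
        src = int(match.group(1))
        # Every indirect reference inside the object body becomes one edge.
        for ref in REF_RE.finditer(match.group(2)):
            edges.add((src, int(ref.group(1))))
    return sorted(edges)


# Hypothetical four-object document body, for illustration only.
sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"4 0 obj\n<< /S /JavaScript /JS (app.alert\\(1\\);) >>\nendobj\n"
)
edges = reference_edges(sample)
print(edges)
```

In this toy document the catalog points both to the page tree and to a JavaScript action, so removing or rerouting that one edge changes the shape of the whole graph rather than a single isolated path, which is the property the paragraph above relies on.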




For the full text of this article, please download: http://theprogrammingfactory.com/resource/share/2000003843





