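Web spam detection based on single-page semantic features

Application of Electronic Technique

Chen Musheng1,2, Gao Fei1, Wu Junhua1

(1. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang, Jiangxi 330013, China; 2. Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang, Jiangxi 330013, China)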

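Abstract: To address the difficulty of feature extraction and the heavy computational cost in web spam detection, this paper proposes a method that extracts semantic features solely from the HTML script of the current page. First, a memoized search algorithm that combines depth-first search with dynamic programming segments the domain name into words; latent Dirichlet allocation (LDA) extracts topic words; and three single-page semantic similarity features are computed from Word2Vec word vectors and the word mover's distance. These single-page semantic similarity features are then fused with single-page statistical features, and classification algorithms such as random forest are used to build classification models for web spam detection. Experimental results show that classification using semantic features extracted from single-page content, fused with single-page statistical features, reaches an AUC of 88.0%, about 4% higher than the control method.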

Keywords: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; word vectors

CLC number: TP391.6
Document code: A
DOI: 10.16157/j.issn.0258-7998.223376
Chinese citation format: 陳木生，高斐，吳俊華. 基于單頁語義特征的垃圾網頁檢測[J]. 電子技術應用，2023，49(6)：24-29.
English citation format: Chen Musheng, Gao Fei, Wu Junhua. Web spam detection based on semantic features from current page[J]. Application of Electronic Technique, 2023, 49(6): 24-29.

Web spam detection based on semantic features from current page

Chen Musheng1,2, Gao Fei1, Wu Junhua1

(1. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, China; 2. Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang 330013, China)

Abstract: To address the difficulty and heavy computation of feature extraction in web spam detection, a method is proposed that extracts semantic features based only on the HTML script of the current page. First, the domain name is segmented into words by a memoized search algorithm combining depth-first search and dynamic programming. Second, latent Dirichlet allocation is used to extract the topic words of the web page. Finally, three single-page semantic similarity features are calculated based on Word2Vec and the word mover's distance. The single-page semantic similarity features are then combined with single-page statistical features, and classification algorithms such as random forest are used to build classification models for web spam detection. The experimental results show that classification based on semantic and statistical features extracted from single-page content reaches an AUC of 88.0%, about 4% higher than the control method.

Key words: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; Word2Vec; word mover's distance; random forest
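
The domain-name segmentation step named in the abstract can be illustrated with a small sketch. The following is not the authors' implementation: the toy vocabulary `VOCAB`, the function name `segment`, and the choice to prefer the split with the fewest words are all assumptions made for illustration. It only shows how a depth-first search over split points becomes a dynamic program once each suffix's result is memoized.

```python
from functools import lru_cache

# Hypothetical toy vocabulary (an assumption for this sketch); a real system
# would load a large word list, possibly with corpus frequencies.
VOCAB = {"cheap", "flights", "flight", "online", "on", "line", "buy", "now"}


def segment(domain, vocab=frozenset(VOCAB)):
    """Split a domain label into dictionary words.

    Returns the split with the fewest words (an assumed objective),
    or None when no full split into dictionary words exists.
    """

    @lru_cache(maxsize=None)
    def best(start):
        # Depth-first search over split points beginning at `start`.
        # lru_cache memoizes the answer for each suffix, so every suffix
        # is solved once, which is the dynamic-programming part.
        if start == len(domain):
            return ()
        candidates = []
        for end in range(start + 1, len(domain) + 1):
            word = domain[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        if not candidates:
            return None
        return min(candidates, key=len)

    result = best(0)
    return list(result) if result is not None else None


print(segment("cheapflights"))
print(segment("buyonline"))
print(segment("xyz"))
```

Without memoization the search can revisit the same suffix exponentially many times; with it, the work is bounded by the number of (start, end) pairs, which is quadratic in the label length.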

0 Introduction

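Today, with the rapid growth of information on the Internet, search engines are regarded as the key tool for reaching websites, and their users account for more than 80% of all Internet users [1]. However, studies show that about 60% of users look only at the first five results on the first page [2]. It follows that pages ranked near the top of the search results attract more visitors and therefore more revenue. Because raising a page's rank by legitimate means is very difficult, some websites resort to illegitimate methods and techniques that deceive search engines into ranking their pages higher; such pages are called web spam [3]. Web spam lowers the quality of search results, wastes users' time, and encroaches on the legitimate interests of search engine companies and other content websites [4]. Although search engine companies have deployed a variety of countermeasures, web spam detection remains a hard problem that search engines still need to solve and a frontier topic in academic research. Detecting web spam efficiently and accurately is therefore of great importance.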

For the full text of this paper, please download: http://theprogrammingfactory.com/resource/share/2000005343

Original content statement: This content is original to the AET website; reproduction without authorization is prohibited.
