Process:

1) First, split the article into individual phrases on commas.
2) Then count the number of characters in each phrase.
3) Take the first two phrases longer than 10 characters, search each one on Baidu, and count how many of the search results contain that phrase verbatim.

If an article has been widely reposted by other sites, then almost any phrase pulled out of it will turn up exact duplicates of the article in a Baidu search.

Conversely, if we search two consecutive phrases and Baidu returns very few exact-match results, it suggests, to a degree, that the article has probably not been widely reposted by other sites, i.e. its originality is relatively high.
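The phrase-extraction part of steps 1–3 can be sketched as follows (a minimal sketch; the comma splitting, 10-character cutoff, and "first two" rule come from the steps above, while the function name and sample text are my own):

```python
# -*- coding: utf-8 -*-

def pick_phrases(article, min_len=10, count=2):
    """Split an article on commas and return the first `count`
    phrases longer than `min_len` characters (steps 1-3 above)."""
    # Normalize full-width Chinese commas to ASCII commas before splitting.
    phrases = [p.strip() for p in article.replace('\uff0c', ',').split(',')]
    return [p for p in phrases if len(p) > min_len][:count]

sample = "short,this phrase has more than ten chars,no,another phrase exceeding ten characters here"
print(pick_phrases(sample))  # -> the two phrases longer than 10 characters
```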
Here is a script that performs the 3 steps above (it is Python 2 and uses requests and MySQLdb):

The left column of its output is the article ID; the right column is the number of Baidu search results in which the article's two phrases appear verbatim. The larger the count, the higher the duplication. Where to draw the line is up to you; I usually treat >=30% as high duplication, i.e. searching the 2 phrases returns 20 results in total, and >=6 of them contain the phrase verbatim.
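The 30% rule above can be expressed as a tiny helper (a sketch; the 20-result sample size and the >=6 cutoff come from the text, the function name is mine):

```python
def is_duplicate(exact_hits, total_results=20, threshold=0.3):
    """Return True when the share of search results containing the
    phrase verbatim reaches the threshold (>=6 of 20 by default)."""
    return exact_hits / float(total_results) >= threshold

print(is_duplicate(6))  # 6/20 = 30% -> True, high duplication
print(is_duplicate(3))  # 3/20 = 15% -> False, likely original
```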
#coding:utf-8
import requests, re, time, sys, json, datetime
import multiprocessing
import MySQLdb as mdb

reload(sys)
sys.setdefaultencoding('utf-8')

current_date = time.strftime('%Y-%m-%d', time.localtime(time.time()))

def search(req, html):
    # Return the first capture group of `req` in `html`, or 'no' if absent.
    text = re.search(req, html)
    if text:
        data = text.group(1)
    else:
        data = 'no'
    return data

def date(timeStamp):
    # Convert a Unix timestamp to a "YYYY-MM-DD HH:MM:SS" string.
    timeArray = time.localtime(timeStamp)
    otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
    return otherStyleTime

def getHTml(url):
    host = search('^([^/]*?)/', re.sub(r'(https|http)://', '', url))
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        # "Cookie": "",
        "Host": host,
        "Pragma": "no-cache",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    }
    # Proxy server
    proxyHost = "proxy.abuyun.com"
    proxyPort = "9010"
    # Proxy tunnel credentials
    proxyUser = "XXXX"
    proxyPass = "XXXX"
    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }
    html = requests.get(url, headers=headers, timeout=30)
    # html = requests.get(url, headers=headers, timeout=30, proxies=proxies)
    return html.content

def getContent(word):
    # Query Baidu's JSON result endpoint for the phrase.
    pcurl = 'http://www.baidu.com/s?q=&tn=json&ct=2097152&si=&ie=utf-8&cl=3&wd=%s&rn=10' % word
    html = getHTml(pcurl)
    a = 0
    html_dict = json.loads(html)
    for tag in html_dict['feed']['entry']:
        if tag.has_key('title'):
            abstract = tag['abs']
            # Count results whose abstract contains the phrase verbatim.
            if word in abstract:
                a += 1
    return a

con = mdb.connect('127.0.0.1', 'root', '', 'wddis', charset='utf8',
                  unix_socket='/tmp/mysql.sock')
cur = con.cursor()
with con:
    cur.execute("select aid,content from pre_portal_article_content limit 10")
    numrows = int(cur.rowcount)
    for i in range(numrows):
        row = cur.fetchone()
        aid = row[0]
        content = row[1]
        # Strip HTML tags, then take the first two phrases longer than 10 chars.
        content_format = re.sub('<[^>]*?>', '', content)
        a = 0
        for z in [x for x in content_format.split(',') if len(x) > 10][:2]:
            a += getContent(z)
        print "%s --> %s" % (aid, a)

# Optional: crawl phrases in parallel with a process pool.
# words = open(wordfile).readlines()
# pool = multiprocessing.Pool(processes=10)
# for word in words:
#     word = word.strip()
#     pool.apply_async(getContent, (word, client))
# pool.close()
# pool.join()
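The core of getContent, counting how many result abstracts contain the phrase verbatim, can be exercised offline like this (a sketch with made-up abstracts; the real script reads the 'abs' field of Baidu's JSON results over the network):

```python
def count_verbatim(phrase, abstracts):
    """Count how many search-result abstracts contain the phrase verbatim,
    mirroring the `if word in abstract` check in getContent()."""
    return sum(1 for a in abstracts if phrase in a)

abstracts = [
    "...the quick brown fox jumps over the lazy dog...",
    "...an unrelated snippet of text...",
    "...again the quick brown fox jumps over...",
]
print(count_verbatim("quick brown fox", abstracts))  # 2 of 3 abstracts match
```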