
BeautifulSoup: problems with a web crawler

  1. How do I correctly output all the links on this news website as a list?

  2. Once I have the list, how can I return a random selection from it (3 to 5 links at a time)?

Note: the markup I need starts at around line 739 of the page source (the exact line shifts a little because the page is refreshed every day):

<div class="abdominis rlby clearmen">

I need every link inside it that looks like this:

<a href="https://tw.news.appledaily.com/life/realtime/20180308/1310910/">

Thanks! The code is below:

from bs4 import BeautifulSoup
from flask import Flask, request, abort
import requests
import re
import random
import types    
target_url = 'http://www.appledaily.com.tw/realtimenews/section/new/'
print('Start parsing appleNews....')
rs = requests.session()
res = rs.get(target_url, verify=False)
soup = BeautifulSoup(res.text, 'html.parser')

# Outputs every link, but as whole <a> tags with markup I don't need
contents = soup.select("div[class='abdominis rlby clearmen']")[0].find_all('a')
print(contents)

# Outputs a single link as a string, not a list of all links
#contents = soup.select("div[class='abdominis rlby clearmen']")[0].find('a').get('href')
#print(contents)

Chevady Ju asked Nov 21 '25 20:11


2 Answers

Here is a solution that appends each link to a list if it is inside the specified div:

from bs4 import BeautifulSoup
from flask import Flask, request, abort
import requests
import re
import random
import types    
target_url = 'http://www.appledaily.com.tw/realtimenews/section/new/'
print('Start parsing appleNews....')
rs = requests.session()
res = rs.get(target_url, verify=False)
soup = BeautifulSoup(res.text, 'html.parser')

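list_links = []  # collect every href found inside the target div

# select() returns a list of matching divs; take the first one and
# find every descendant tag that has an href attribute
for a in soup.select("div[class='abdominis rlby clearmen']")[0].find_all(href=True):
    list_links.append(a['href'])  # store the URL
    print(a['href'])  # first check: print each link as it is found

for link in list_links:  # second check: print the finished list
    print(link)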
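
Because the live page changes every day, it can help to try the same extraction on a fixed snippet first. The following is a minimal sketch, assuming markup shaped like the div in the question. The `SAMPLE_HTML` string, the `example.com` URLs, and the helper name `extract_links` are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; only the div class and the first URL
# come from the question.
SAMPLE_HTML = """
<div class="abdominis rlby clearmen">
  <ul>
    <li><a href="https://tw.news.appledaily.com/life/realtime/20180308/1310910/">Story one</a></li>
    <li><a href="https://example.com/story-two/">Story two</a></li>
    <li><a name="no-href">Anchor without a link</a></li>
  </ul>
</div>
<div class="other"><a href="https://example.com/outside/">Outside</a></div>
"""


def extract_links(html):
    """Return the href of every <a> inside the first matching div."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.select("div[class='abdominis rlby clearmen']")
    if not divs:  # the class names may change, so avoid an IndexError
        return []
    # href=True skips <a> tags that have no href attribute
    return [a["href"] for a in divs[0].find_all("a", href=True)]


links = extract_links(SAMPLE_HTML)
print(links)
```

Checking for an empty `select()` result is a small design choice here: if the site renames the class, the function returns an empty list instead of raising.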

To pick links at random:

import random  # import the random module

random_list = []  # holds the randomly chosen links, if needed
random.shuffle(list_links)  # shuffle the list in place

for i in range(5):  # try to take 5 items
    try:
        # randrange excludes len(list_links), so the index is always valid
        link = list_links.pop(random.randrange(len(list_links)))
        print(link)  # print to screen
        random_list.append(link)  # or append to random_list
    except ValueError:  # randrange raises ValueError once the list is empty
        break

One last edit, since you asked for the links to be returned.

Here it is as a function that returns a list of up to num random links:

def return_random_link(list_, num):
    """Remove up to num random items from list_ and return them as a list."""
    random.shuffle(list_)

    random_list = []

    for i in range(num):
        try:  # pop a random item and append it to the result
            # randrange excludes len(list_), so the index is always valid
            r = list_.pop(random.randrange(len(list_)))
            random_list.append(r)
        except ValueError:  # list_ is empty, so there is nothing left to pick
            return random_list  # return what was collected so far

    return random_list

random_list = return_random_link(list_links, 5)

for i in random_list:
    print(i)  
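
As an alternative sketch, the standard library's `random.sample` picks items from distinct positions without modifying the original list, so the popping loop is not needed. The helper name `pick_random_links` and the example URLs are hypothetical.

```python
import random


def pick_random_links(links, num):
    """Return up to num randomly chosen items; links itself is left unchanged."""
    count = min(num, len(links))  # sample() raises ValueError if asked for too many
    return random.sample(links, count)


# Made-up URLs standing in for the scraped list
example_links = [f"https://example.com/news/{n}/" for n in range(8)]
chosen = pick_random_links(example_links, 5)
print(chosen)
```

This suits the case where the full list should stay available for later draws. The popping version fits better when a link should never be handed out twice.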

johnashu answered Nov 23 '25 09:11


If you want the link tag without its descendants, you can clear them:

for elm in contents:
    elm.clear()

I imagine you'd be more interested in extracting just the links, though:

contents = [a['href'] for a in contents]

To get results in a random order, try using random.shuffle() and grabbing however many elements from the reshuffled list at a time you need.
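
A rough sketch of that idea, assuming the links are already in a plain list: shuffle a copy once, then hand it out in batches of 3 to 5. The generator name `random_batches`, its `low` and `high` parameters, and the example URLs are hypothetical.

```python
import random


def random_batches(links, low=3, high=5):
    """Yield batches of low..high links taken from a shuffled copy of links."""
    pool = list(links)    # copy so the caller's list keeps its order
    random.shuffle(pool)  # shuffle the copy in place
    start = 0
    while start < len(pool):
        size = random.randint(low, high)  # randint includes both endpoints
        yield pool[start:start + size]    # the final batch may be shorter than low
        start += size


# Made-up URLs standing in for the scraped list
example_links = [f"https://example.com/news/{n}/" for n in range(12)]
batches = list(random_batches(example_links))
for batch in batches:
    print(batch)
```

Since each link appears in exactly one batch, no link repeats until the list is shuffled again.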

Martin Sand Christensen answered Nov 23 '25 10:11