k1lib.selen module

Website analyzer built on Selenium. Whenever I used Selenium in the past, I always had to work out manually how to reach specific buttons, which paragraphs to capture, and so on, which really limited how much I used it. Recently I made the kapi module, which can do a bunch of DL stuff, so this feels like the right time to build something on top of Selenium that can automatically pull out the main article, body, header, and left/right bars. After that, I can grab the main text and put it into an embedding db for use further downstream. Example:

from k1lib.imports import *

browser = selen.getBrowser() # fires up the browser
browser.get("https://en.wikipedia.org/wiki/Cheese")

page = selen.Page.analyze(browser) # analyze everything. For a typical page, should take 8-14s total
page.draw()             # run this inside a notebook to view all stages
page.mainContents()     # return List[Element] of potential big pieces of the page, like header, footer, left/right bar, etc.

elem = page.mainContent # grabs what it thinks is most likely the meat of the page, ignoring everything else
elem.obj                # accesses the internal Selenium Element
elem.obj.text           # grabs element's text content
elem.x                  # grabs element's x location (property of Element, not of the Selenium object). Works with .y too
elem.w                  # grabs element's width. Works with .h too

k1lib.selen.getBrowser() → Browser: Launches a new browser and returns an object to manage it

class k1lib.selen.Page(stages: List[List[Meta]])

Bases: object

static analyze(browser, bounded=True)

Analyze whatever is on the browser at this moment.

Parameters: bounded – if True, adjusts all internal bounding boxes so that they don’t overflow their parents. If False, children can be bigger than their parent

draw()[source]: Quickly views each analysis stages, see bounding boxes, their paths and whatnot

mainContents()[source]: Returns some candidate Elements that seems to be the main content. You’d have to write some minimum code to determine what you’d like to use, but the bulk of the work by this point is done

property mainContent: Really tries to extract out the main content, so that this can be automated. It might not be good, but at least it’s automated.

bound() → Page[source]: Some children is detached from the parent outside of it, and can grow bigger. But that messes up with the ranking technique that I have right now. So purpose of this is to limit the children to the parent’s size

class k1lib.selen.Element(page, data)

Bases: object

property parent: Grab this element’s parent element. If not found, then return None

property obj

property x

property y

property w

property h

property wh

property path

property children: list[Element]

property pArea: Percentage area of this element vs the entire page