k1lib.selen module
A website analyzer built on Selenium. Whenever I used Selenium
in the past, I always had to work out manually how to reach specific
buttons, which paragraphs to capture, and so on. That really limited
how much I used it. Recently, I've made the kapi module, which is
capable of doing a bunch of DL stuff, and it feels like the right
time to build something on top of Selenium that can pull out the
main article, body, header and right/left bars automatically. After that, I
can grab the main text and put it into an embedding db for use further
downstream. Example:
from k1lib.imports import *
browser = selen.getBrowser() # fires up the browser
browser.get("https://en.wikipedia.org/wiki/Cheese")
page = selen.Page.analyze(browser) # analyze everything. For a typical page, should take 8-14s total
page.draw() # run this inside a notebook to view all stages
page.mainContents() # returns List[Element] of potential big pieces of the page, like header, footer, left/right bar, etc.
elem = page.mainContent # grabs what it thinks is most likely the main content of the page, ignoring everything else
elem.obj # accesses the internal Selenium Element
elem.obj.text # grabs element's text content
elem.x # grabs element's x location. Works with .y too
elem.w # grabs element's width. Works with .h too
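Before the main text goes into an embedding db, it usually has to be split into smaller pieces. The following is only a minimal sketch of one way to do that, assuming plain word-based chunks with a small overlap. The helper name chunk_words and its default sizes are hypothetical and not part of k1lib.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `size` words; neighbours share `overlap` words.

    Hypothetical helper for illustration only, not part of k1lib.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny demonstration with a short string instead of a real page
demo = chunk_words("a b c d e f g", size=3, overlap=1)
```

With a real page, the input would presumably be the string from elem.obj.text, as in the example above.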
- class k1lib.selen.Page(stages: List[List[Meta]])[source]
Bases: object
- static analyze(browser, bounded=True)[source]
Analyze whatever is on the browser at this moment.
- Parameters
bounded – if True, adjusts all internal bounding boxes so that they don’t overflow their parents. If False, a child’s box can be bigger than its parent’s
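To illustrate what bounding a box can mean, here is a minimal sketch that clips a child rectangle to its parent. It assumes boxes are described by a top-left corner plus width and height. The Box class and clamp_to_parent function are hypothetical stand-ins, and the actual adjustment done by analyze() may differ.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned box: top-left corner (x, y), width w, height h."""
    x: float
    y: float
    w: float
    h: float


def clamp_to_parent(child: Box, parent: Box) -> Box:
    """Return the part of `child` that lies inside `parent` (zero-sized if disjoint)."""
    left = max(child.x, parent.x)
    top = max(child.y, parent.y)
    right = min(child.x + child.w, parent.x + parent.w)
    bottom = min(child.y + child.h, parent.y + parent.h)
    return Box(left, top, max(0.0, right - left), max(0.0, bottom - top))


parent = Box(0, 0, 100, 50)
child = Box(80, 10, 40, 60)  # overflows the parent on the right and at the bottom
clamped = clamp_to_parent(child, parent)
```

Here the clamped box keeps the child's top-left corner but is cut down to the 20 by 40 region that still fits inside the parent.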
- mainContents()[source]
Returns some candidate Elements that seem to be the main content. You’d have to write a small amount of code to decide which one you’d like to use, but the bulk of the work is done by this point
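The small amount of selection code could look something like the sketch below. It assumes that a reasonable heuristic is to keep the candidate with the most text, read through .obj.text as in the module example. The function pick_longest_text is hypothetical, and the stand-in objects only mimic that one attribute so the sketch runs without a browser.

```python
from types import SimpleNamespace


def pick_longest_text(candidates):
    """Return the candidate whose .obj.text is longest, or None for an empty list."""
    return max(candidates, key=lambda e: len(e.obj.text), default=None)


def fake_element(text):
    """Stand-in for an Element: exposes only .obj.text (assumption for this demo)."""
    return SimpleNamespace(obj=SimpleNamespace(text=text))


candidates = [
    fake_element("Home | About | Contact"),
    fake_element("Cheese is a dairy product made from milk in many flavors and textures."),
    fake_element("Footer links"),
]
best = pick_longest_text(candidates)
```

With a real page, the call would presumably be pick_longest_text(page.mainContents()).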
- property mainContent
Tries hard to extract the main content, so that the whole process can be automated. The result might not be accurate, but at least it’s automated.
- class k1lib.selen.Element(page, data)[source]
Bases: object
- property parent
Grabs this element’s parent element. Returns None if not found
- property obj
- property x
- property y
- property w
- property h
- property wh
- property path
- property children: list[k1lib.selen.Element]
- property pArea
Area of this element as a percentage of the entire page’s area
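As a rough illustration of the quantity involved, the sketch below computes an element's share of the page area from widths and heights. The function percent_area is hypothetical. Whether the real property returns a 0 to 100 value or a 0 to 1 fraction is not stated here, so the scaling by 100 is an assumption.

```python
def percent_area(w: float, h: float, page_w: float, page_h: float) -> float:
    """Area of a w-by-h element as a percentage of a page_w-by-page_h page.

    Hypothetical helper; the 0-100 scale is an assumption, not confirmed by k1lib.
    """
    if page_w <= 0 or page_h <= 0:
        raise ValueError("page dimensions must be positive")
    return 100.0 * (w * h) / (page_w * page_h)


# An element half as wide and half as tall as the page covers a quarter of it
share = percent_area(800, 600, 1600, 1200)
```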