LTE: Extracting relevant text and features from HTML

If you have lots of HTML files that you collected for a project, chances are you can’t really use those files as is. Usually, you are looking to extract some information from these files, for example, an article (like in my project), a product description, or user reviews. Depending on the data you want to extract, the algorithm will be different, since different sections of the HTML file will be relevant.

In my project, I had HTML files with letters to the editor. Here is an example letter that I scraped and labeled. As you can see, the letter itself takes up less space than the rest of the content: there are links to social media, related articles, other sections on the website and advertisements. We can use BeautifulSoup to convert HTML to text:

from bs4 import BeautifulSoup

with open(FILENAME, 'r') as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")
text = soup.get_text()

This is the result of the conversion (I removed some spurious JavaScript it included). As you can see, much of the converted text is not the text of the letter; the extra text is mostly links to other pages. The result is clearly unsatisfying, so it is easier to cut out the unnecessary sections in the HTML file itself, where we still have access to the structure. I explain the logic of my HTML-to-text conversion below. The code is available on GitHub.

First, parse the HTML using the lxml package, remove lingering JavaScript and style tags, and extract the title and the body:

from lxml import etree
import lxml.html

tree = lxml.html.parse(file)
page = tree.getroot()
etree.strip_elements(page, 'script')  # Remove JavaScript
etree.strip_elements(page, 'style')   # Remove style tags
title = page.find(".//title").text
body_list = page.cssselect('body')

After that, find the div in the body that contains the text of the letter. Some sites wrap their letter content in <article></article> HTML tags. This is the easiest case, so just return the HTML content inside these tags.

Otherwise, look for the div that contains the most text, but in a more sophisticated way than just comparing the lengths of the textual content of all the divs. A div might contain many links with text, and thus lots of textual content, without being the div we want. The div we want has a high text-to-all-content ratio, where all content includes the HTML tags; this favors divs with lots of text and few HTML tags:

# pattern and b_pattern are precompiled regexes (defined in the full code on
# GitHub) that strip characters that should not count toward the ratio,
# from the text and from the serialized bytes respectively
div_tuples = []
for div in divs:  # divs: the <div> elements found in the body
    text_len = len(pattern.sub("", div.text_content()))
    html_len = len(b_pattern.sub(b"", etree.tostring(div)))
    div_tuples.append((div, text_len / html_len))
sorted_divs = sorted(div_tuples, key=lambda t: t[1], reverse=True)  # Highest ratio first

The simplest solution would now be to pick the top div in the sorted_divs list. However, some article pages contain divs with reader comments, which also have high text-to-all-content ratios. Thus, we take the top n divs from the sorted_divs list and pick the one that contains the most words:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
max_words = 0
winner = None
for div in candidate_divs:  # The divs from the top n entries of sorted_divs
    num_tokens = len(tokenizer.tokenize(div.text_content()))
    if num_tokens > max_words:
        max_words = num_tokens
        winner = div

This will work in most cases: usually, comments are shorter than the letter itself. However, there are cases where the letter is a short one, and the comments are pretty long. In these cases, the longest comment will be selected as the text of the letter. There are very few such cases, so they won’t affect the results significantly.

At this point we could be done and just use the text of the winning div. It would be nice, though, to preserve the HTML markup information, which can later serve as a feature in a machine learning algorithm. I wrote a small recursive function to achieve this. In writing it, I assumed that tags change only at the sentence level. For example, consider this letter. Its general structure is as follows: the title of the letter is in an <h1> tag, and each paragraph and the signature is in its own <p> tag. Thus, the tags remain the same for each sentence. Of course, there are cases where a certain word inside a sentence is in bold or italics; the function can easily be modified to save the same tag information at the word level, at the cost of some performance. In my case, I didn't need this information, since I only used sentence-level tags in the ML algorithms later on. Here is the function that saves the tag information as a list in a tuple with the sentence:

from bs4 import NavigableString

def tag_text(html, text, cur_tags, ti):
    if isinstance(html, NavigableString):  # This is text
        text.append((str(html), cur_tags[:]))  # Save a copy of the current tag list
        return (text, cur_tags)
    else:
        for child in html.contents:
            # Add the current tag to the list
            if (not isinstance(child, NavigableString)  # This is an HTML tag
                    and child.name in ti):
                cur_tags.append(child.name)
            (text, cur_tags) = tag_text(child, text, cur_tags, ti)
        if len(cur_tags) > 0:
            cur_tags.pop()
    return (text, cur_tags)
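For example, running the function on a small snippet produces one tuple per text node (the function body is repeated below so the example runs standalone; ti is the list of tag names to track):

```python
# Standalone demo of the sentence-tagging idea; the function mirrors tag_text
# from the post so this snippet runs on its own.
from bs4 import BeautifulSoup, NavigableString

def tag_text(html, text, cur_tags, ti):
    if isinstance(html, NavigableString):  # Reached actual text
        text.append((str(html), cur_tags[:]))  # Save a copy of the tag list
        return (text, cur_tags)
    for child in html.contents:
        if not isinstance(child, NavigableString) and child.name in ti:
            cur_tags.append(child.name)  # Track this tag for its descendants
        (text, cur_tags) = tag_text(child, text, cur_tags, ti)
    if len(cur_tags) > 0:
        cur_tags.pop()
    return (text, cur_tags)

snippet = "<div><h1>A letter title</h1><p>First paragraph.</p></div>"
soup = BeautifulSoup(snippet, "html.parser")
tagged, _ = tag_text(soup.div, [], [], ['h1', 'p'])
print(tagged)  # [('A letter title', ['h1']), ('First paragraph.', ['p'])]
```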

The next post will be about extracting specific information from the text using ML techniques. Stay tuned!
