Stefano Tommesani

Web page scraping, the easy way

There are many ways to extract data elements from web pages, almost all of them prettier and cooler than the method proposed here, but as we are in a hurry, let's get that data quickly, ok? Suppose we have to extract the profile names from a well-known site:

[Screenshot: the profile page, with the profile name shown at the top]

The first step is identifying the element in the HTML page that contains the profile name. In Chrome, select the name of the profile, right-click on it and choose Inspect in the popup menu, and we will get here:

[Screenshot: Chrome DevTools showing the h1 element that holds the profile name]

We now have all the bits of information we need: the profile name is hosted in an h1 element whose class attribute contains the pv-top-card-section__name class. The troubling point is that the class attribute contains more than one class, and we are not interested in specifying the other ones when searching for the right element in the page. Time to write some C# code. Assuming that the HTML code of the page has already been parsed by the HtmlAgilityPack package, the following function searches for the correct element inside the page and returns its inner text:

private string GetItemText(HtmlDocument htmlDoc, string itemType, string classValue)
{
    if (htmlDoc.DocumentNode != null)
    {
        // Find all elements of the given type whose class attribute contains the given class
        var findclasses = htmlDoc.DocumentNode
            .Descendants(itemType)
            .Where(d =>
                d.Attributes.Contains("class")
                &&
                d.Attributes["class"].Value.Contains(classValue)
            );
        var itemList = findclasses.ToList();
        if (itemList.Any())
        {
            // Return the inner text of the first match, cleaned of line breaks and extra whitespace
            return CleanUpItem(itemList.First().InnerText);
        }
    }
    return String.Empty;
}

The function searches for a specific element type (in this case, h1) whose class attribute contains the given value (in this case, pv-top-card-section__name), so this call returns the name of the profile:

parsedProfile.Name = GetItemText(htmlDoc, "h1", "pv-top-card-section__name");
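
Where does htmlDoc come from? The code above assumes it already exists; as a minimal sketch, HtmlAgilityPack itself can download and parse a page, as long as the page does not require a login or client-side rendering (the URL below is only a placeholder):

// Requires: using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument htmlDoc = web.Load("https://example.com/some-profile-page");

// Or, if the HTML was captured by other means (e.g. an embedded browser), parse it directly:
// HtmlDocument htmlDoc = new HtmlDocument();
// htmlDoc.LoadHtml(capturedHtml);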

What does the CleanUpItem call do, you say? Just some cleaning of the inner text of the element:

private string CleanUpItem(string item)
{
    // Split the inner text on every kind of line break (including literal "\n" sequences)
    string[] lines = item.Split(
        new[] { "\r\n", "\r", "\n", "\\n" },
        StringSplitOptions.None
    );
    StringBuilder sb = new StringBuilder();
    foreach (var line in lines)
    {
        // Skip blank lines and trim the surrounding whitespace from the rest
        if (!String.IsNullOrWhiteSpace(line))
            sb.Append(line.Trim());
    }

    return sb.ToString();
}
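
To see the two helpers working together, here is a minimal, self-contained sketch; the HTML fragment is made up for illustration, as the real profile markup is far larger and its class names change over time:

// Requires: using System; using HtmlAgilityPack;
// Assumes GetItemText and CleanUpItem from above are in scope.
string html = @"
<html><body>
  <h1 class='pv-top-card-section__name t-24 t-black'>
    John Doe
  </h1>
</body></html>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

// The class attribute holds several classes, but the Contains() check only needs one of them.
// Prints "John Doe" - CleanUpItem strips the surrounding line breaks and indentation.
Console.WriteLine(GetItemText(doc, "h1", "pv-top-card-section__name"));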

So now, with a quick investigation in Chrome and a small C# code fragment, we can easily scrape information from web pages. Job done!

Last Updated on Sunday, 07 January 2018 01:12  
