Scraping dynamic content from websites has become an important task in data collection and analysis. Many sites now build much of their content with JavaScript after the initial page load, so traditional scraping techniques that only download the raw HTML miss it. In this blog post, we will explore how to scrape dynamic content from web pages using C# and CefSharp.
What is CefSharp?
Chromium Embedded Framework (CEF) is an open-source framework for embedding the Chromium browser in an application. CefSharp is a .NET wrapper around CEF that provides a browser control for rendering web pages inside a .NET application. With CefSharp, you can automate browser actions and scrape dynamic content from web pages.
One of the most common roadblocks when scraping websites is getting the full contents of the page, including JS-generated elements. When using CefSharp to scrape a site, reading the content of the page with:
string pageSource = await _browser.GetSourceAsync();
will not return the JS-generated parts of the page. The following code fragment will, because it runs inside the page and reads the live DOM:
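var jsResponse = await _browser.EvaluateScriptAsync(@"document.getElementsByTagName('html')[0].innerHTML");
if (jsResponse.Success && jsResponse.Result != null)
{
    // Result holds the markup of the DOM as it exists after scripts have run.
    string pageSource = jsResponse.Result.ToString();
}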
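The fragment above only helps once the page has actually finished loading and its scripts have run. As a rough sketch of how the pieces might fit together, the helper below creates an off-screen browser, waits for the initial load, and then reads the rendered DOM. It assumes the CefSharp.OffScreen package and a recent CefSharp release that provides WaitForInitialLoadAsync. The class name, the method name, and the fixed two-second pause are illustrative choices, not part of CefSharp, and a real scraper would likely replace the pause with a site-specific readiness check.

```csharp
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

// Hypothetical helper for illustration; not part of CefSharp.
public static class RenderedPageReader
{
    // Returns the markup of the rendered DOM, or null if loading or the script fails.
    public static async Task<string> GetRenderedHtmlAsync(string url)
    {
        // Assumption: CEF has been initialized for this process
        // (for example via Cef.Initialize at application startup).
        using (var browser = new ChromiumWebBrowser(url))
        {
            // Wait for the initial navigation to complete before touching the DOM.
            var load = await browser.WaitForInitialLoadAsync();
            if (!load.Success)
            {
                return null;
            }

            // Assumption: a fixed pause gives late-running scripts time to finish.
            // Pages that fetch data asynchronously may need longer or a different check.
            await Task.Delay(TimeSpan.FromSeconds(2));

            // outerHTML on the root element includes the <html> tag itself.
            var response = await browser.EvaluateScriptAsync("document.documentElement.outerHTML");
            return response.Success ? response.Result as string : null;
        }
    }
}
```

A caller would use it as `string html = await RenderedPageReader.GetRenderedHtmlAsync(url);` and check for null before parsing.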
Scraping dynamic content from web pages can be challenging, but CefSharp makes it a lot easier. By using the EvaluateScriptAsync method of the ChromiumWebBrowser control, we can execute JavaScript on the page and read content that only exists after the page's scripts have run. With this technique, you can extract data from JavaScript-heavy pages that a plain HTML download would miss.