Stefano Tommesani

Web page scraping, the easy way

There are many ways to extract data elements from web pages, almost all of them prettier and cooler than the method proposed here, but as we are in a hurry, let's get that data quickly, OK? Suppose we have to extract the profile names from a well-known site:

Scrape1

The first step is to identify the element in the HTML page that contains the profile name. In Chrome, select the name of the profile, right-click on it and select Inspect in the popup menu, and we will get here:

Last Updated on Sunday, 07 January 2018 01:12 Read more...
 

Scraping dynamic page content

One of the most common roadblocks when scraping web sites is getting the full contents of the page, including the JS-generated data elements (probably the ones you are looking for). So, when using CefSharp to scrape a site, reading the content of the page with:

string pageSource = await _browser.GetSourceAsync(); 

will not return the JS-generated parts of the page. But the following code fragment will:

var jsResponse = await _browser.EvaluateScriptAsync(@"document.getElementsByTagName('html')[0].innerHTML");
if (jsResponse.Success)
{
    // Result holds the current DOM serialized as HTML, JS-generated elements included
    string pageSource = jsResponse.Result.ToString();
}
Last Updated on Sunday, 07 January 2018 00:38
 

Unit-testing file I/O

Two pieces of good news: file I/O is unit-testable, and it is surprisingly easy to do. Let's see how it works!

A piece of software no one asked for

First, we need a piece of software that deals with files and that has to be unit-tested. The TestableIO project does the following:

  • given a directory, enumerate all the lossless audio files (e.g. FLAC ones) and all the lossy audio files (e.g. MP3 ones),
  • if an audio file with the same name is present in both lossless and lossy format, delete the lossy file,
  • repeat for the subfolders of the given directory
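
The three steps above could be sketched roughly as follows. This is only an illustration, not the code of the TestableIO project: the class name LossyDuplicateCleaner, its method names and the lists of extensions are assumptions made for this example. The decision logic lives in a pure method that works on file names only, so it can be tested without touching the disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the TestableIO project.
public static class LossyDuplicateCleaner
{
    // Assumed extension lists; adjust them to the formats you care about.
    private static readonly HashSet<string> Lossless =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".flac", ".ape", ".wav" };
    private static readonly HashSet<string> Lossy =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".mp3", ".ogg", ".aac" };

    // Pure function: given the files of ONE directory, returns the lossy files
    // whose base name is also present in a lossless format.
    public static List<string> FindRedundantLossyFiles(IEnumerable<string> files)
    {
        var fileList = files.ToList();
        var losslessNames = new HashSet<string>(
            fileList.Where(f => Lossless.Contains(Path.GetExtension(f)))
                    .Select(f => Path.GetFileNameWithoutExtension(f)),
            StringComparer.OrdinalIgnoreCase);

        return fileList
            .Where(f => Lossy.Contains(Path.GetExtension(f))
                        && losslessNames.Contains(Path.GetFileNameWithoutExtension(f)))
            .ToList();
    }

    // Applies the rule to a directory, then repeats for its subfolders.
    public static void Clean(string directory)
    {
        foreach (string file in FindRedundantLossyFiles(Directory.EnumerateFiles(directory)))
            File.Delete(file);

        foreach (string subfolder in Directory.EnumerateDirectories(directory))
            Clean(subfolder);
    }
}
```

Only Clean touches the file system; a unit test can feed FindRedundantLossyFiles a plain list of names such as "a.flac", "a.mp3" and "b.mp3" and check that only "a.mp3" is returned.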
Last Updated on Sunday, 26 November 2017 14:34 Read more...
 

Fixing Git pull errors in SourceTree

If you encounter the following error when pulling a repository in SourceTree:

VirtualAlloc pointer is null, Win32 error 487

it is because the Cygwin system fails to allocate a 5 MB chunk of memory for its heap at the fixed address 0x68570000, where only a 2.5 MB hole was apparently available (see the whole thread on Stack Overflow). Unfortunately, rebasing MSYS and other tricks do not work, so a full reboot of the PC is required to complete pulls properly.

To resolve this problem once and for all, upgrade the Git used by SourceTree from version 1.x to version 2. Click on Tools | Options, select the Git tab and check the version; in this case it is "System Git version 1.8.4".

GitUpdate1

Last Updated on Friday, 28 April 2017 13:12 Read more...
 

Castle on the hill of crappy audio quality

As the yearly Dynamic Range Day is close (March 31st), let's have a look at one of the biggest audio massacres of the year, Ed Sheeran's "Castle on the hill". The first time I heard the song, I thought my headphones had just broken; it really is that bad. So let's measure the Dynamic Range (DR) of the track:

CastleHillDR

Here is how the DR value is computed:

In order to determine the official DR value, a song or entire album (16 bit, 44.1 kHz wave format) is scanned. A histogram (loudness distribution diagram) is created with a resolution of 0.01 dB. The RMS – an established loudness measurement standard – is determined by gathering approximately 10,000 pieces of loudness information within a time span of 3 seconds (dB/RMS). From this result, only the loudest 20% is used for determining the average loudness of the loud passages. At the same time, the loudest peak is determined. The DR Value is the difference between the peak and the top 20 average RMS measurements (top 20 RMS minus Peak = DR). (from TT Dynamic Range Meter documentation)
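
To make the quoted procedure more concrete, here is a simplified sketch of the same idea. It is not the TT Dynamic Range Meter implementation: the class and method names are hypothetical, it handles a single channel of samples normalized to [-1, 1], it skips the histogram step, and it averages the energy of the loudest blocks directly. The official meter may differ in these details.

```csharp
using System;
using System.Linq;

// Hypothetical helper illustrating the quoted description; not the TT DR Meter code.
public static class DynamicRangeSketch
{
    // samples: mono audio normalized to [-1, 1]; sampleRate in Hz (e.g. 44100).
    public static double EstimateDr(double[] samples, int sampleRate)
    {
        if (samples == null || samples.Length == 0)
            throw new ArgumentException("No samples to analyze.", nameof(samples));
        if (sampleRate <= 0)
            throw new ArgumentOutOfRangeException(nameof(sampleRate));

        int blockSize = 3 * sampleRate;  // 3-second blocks
        int blockCount = (samples.Length + blockSize - 1) / blockSize;

        // Mean square (RMS squared) of each block.
        var meanSquares = new double[blockCount];
        for (int b = 0; b < blockCount; b++)
        {
            int start = b * blockSize;
            int length = Math.Min(blockSize, samples.Length - start);
            double sum = 0;
            for (int i = start; i < start + length; i++)
                sum += samples[i] * samples[i];
            meanSquares[b] = sum / length;
        }

        // Keep only the loudest 20% of the blocks (at least one block).
        int topCount = Math.Max(1, (int)Math.Round(blockCount * 0.2));
        double topRms = Math.Sqrt(
            meanSquares.OrderByDescending(v => v).Take(topCount).Average());

        double peak = samples.Max(s => Math.Abs(s));
        if (peak == 0 || topRms == 0)
            return 0;  // digital silence: no meaningful value

        // DR in dB: peak level minus the RMS level of the loud passages.
        return 20 * Math.Log10(peak / topRms);
    }
}
```

With this sketch a full-scale sine wave comes out at about 3 dB, since its peak is 1 and its RMS is about 0.707, while a heavily limited track whose RMS sits close to its peak approaches 0 dB.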

Read more...
 

Necessary evil: testing private methods

Some might say that testing private methods should be avoided, because it means testing not the contract (that is, the interface implemented by the class) but the internal implementation of the class itself. Still, not all classes were designed with testability in mind, so real-life compromises sometimes demand such a trick.

When writing unit tests in C# with MSTest, the PrivateObject class lets you easily call private methods:

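  [TestMethod]
  public void TestLPRead()
  {
    // FakeItEasy fakes stand in for the real dependencies
    var Logger = A.Fake<ILogger>();
    var Telemetry = A.Fake<ITelemetry>();
    DefaultDataModel DM = new DefaultDataModel(Logger, Telemetry);
    // Wrap the instance under test so that its private members can be reached
    PrivateObject obj = new PrivateObject(DM);
    // Invoke the private method GetReads by name and cast the result
    List<LPRead> ReadsList = (List<LPRead>)obj.Invoke("GetReads");
    // ... assertions on ReadsList go here
  }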

In the code above, a PrivateObject instance is created by passing it an instance of the class to be tested:

  PrivateObject obj = new PrivateObject(DM);

Then the invocation of the private method, which would be

  List<LPRead> ReadsList = DM.GetReads();

if the method were public, becomes

  List<LPRead> ReadsList = (List<LPRead>)obj.Invoke("GetReads");
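
In projects where PrivateObject is not available, a similar call can be made with plain reflection. The sketch below is an assumption-laden illustration, not a drop-in replacement: PrivateInvoker and Counter are hypothetical names invented for this example, and the helper handles neither overloaded methods (GetMethod throws AmbiguousMatchException for them) nor private methods declared in a base class.

```csharp
using System;
using System.Reflection;

// Hypothetical helper that calls a private instance method by name.
public static class PrivateInvoker
{
    public static object Invoke(object target, string methodName, params object[] args)
    {
        if (target == null) throw new ArgumentNullException(nameof(target));

        // Look the method up among the non-public instance members of the type.
        MethodInfo method = target.GetType().GetMethod(
            methodName, BindingFlags.Instance | BindingFlags.NonPublic);
        if (method == null)
            throw new MissingMethodException(target.GetType().FullName, methodName);

        return method.Invoke(target, args);
    }
}

// Stand-in for a class under test, with a private method to call.
public class Counter
{
    private int Add(int a, int b) { return a + b; }
}
```

A test would then write int sum = (int)PrivateInvoker.Invoke(new Counter(), "Add", 2, 3); and assert that sum is 5. As with PrivateObject, the method name is a string, so renaming the private method breaks the test only at run time.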
Last Updated on Sunday, 29 January 2017 22:39
 

Latest Articles

A software to stand out 27 January 2018, 14.35 Web
Standing out of the pack starts by being visible, and being noticed by the right group of professionals. No matter how good your profile is, it is lost in a sea of similar profiles, so you need to show up and start attracting
Web page scraping, the easy way 07 January 2018, 00.46 Web
Scraping dynamic page content 06 January 2018, 23.57 Web
Unit-testing file I/O 26 November 2017, 12.09 Testing
Fixing Git pull errors in SourceTree 10 April 2017, 01.44 Software