# Debug article extraction

This article describes two methods to debug Boilerpipe execution, the process that retrieves HTML content of an article and Marfeelizes it.

To debug article extraction, you can either debug Gutenberg's extraction directly using IntelliJ or use the GenerateTestFixtures test suite. The first option lets you see the rendered version of the article, while the fixtures are faster when you only need to check the HTML output.

This guide can be used to:

  • Go through all methods executed during the Boilerpipe process and better understand if there are any issues.
  • Validate whether a modification in Boilerpipe produces the expected output.
  • Validate whether a configuration flag has the expected behavior.
  • Make sure that a change the tenant should make will solve the issue. For example, adding a class or removing a malformed element.

MarfeelPress

To debug MarfeelPress Extractor API, follow this dedicated guide.

# Debug Gutenberg execution

Check this video to learn how to debug Gutenberg using IntelliJ:

IntelliJ Remote debug parameters:

  • Name: remoteDebug
  • Host: localhost
  • Port: 49285
  • Command line arguments for remote JVM: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=49285
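If IntelliJ cannot attach, it can help to confirm that the target JVM was actually started with the JDWP agent. The following is a minimal sketch, not part of Gutenberg: the class name `JdwpCheck` and its helper methods are hypothetical, and it relies only on the standard `RuntimeMXBean.getInputArguments()` call to read the arguments the current JVM was launched with.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper (not part of Gutenberg): inspects JVM arguments for the JDWP agent.
public class JdwpCheck {

    // Returns true if any JVM argument loads the JDWP debug agent.
    public static boolean isJdwpEnabled(List<String> jvmArgs) {
        return jvmArgs.stream().anyMatch(arg ->
                arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp"));
    }

    // Returns the value of the "address" option of the -agentlib:jdwp argument,
    // or null if the agent or the option is absent.
    public static String jdwpAddress(List<String> jvmArgs) {
        String prefix = "-agentlib:jdwp=";
        for (String arg : jvmArgs) {
            if (!arg.startsWith(prefix)) {
                continue;
            }
            for (String option : arg.substring(prefix.length()).split(",")) {
                if (option.startsWith("address=")) {
                    return option.substring("address=".length());
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM running this class (not program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JDWP enabled: " + isJdwpEnabled(jvmArgs));
        System.out.println("JDWP address: " + jdwpAddress(jvmArgs));
    }
}
```

Calling these helpers from inside the process you want to debug should report the address 49285 when the arguments above were applied. If it reports that JDWP is disabled, the remote debug configuration has nothing to attach to.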

# Debug using test fixtures

The GenerateTestFixtures class executes Boilerpipe on the HTML of an article and outputs the resulting HTML.

TIP

A test fixture is an environment used to consistently test some items.

Follow this guide to successfully debug the HTML processing on any locally modified HTML file using GenerateTestFixtures.java.

  1. Set up a local server to serve the desired HTML from a local URL. Do so by installing the Live Server extension in Visual Studio.

  2. Create a new HTML file in the Visual Studio editor and fill it with the source code of the tenant's target article. Modify the HTML according to your needs. For example, add a new class to an image or remove an element that may be breaking the extraction.

  3. With the new HTML file open, click the Go Live button. This button appears at the bottom right corner of Visual Studio after the Live Server plugin is installed. Restart Visual Studio if it does not appear.

Go Live button

  4. This will open the HTML file in a browser, served from the local server set up by the extension. Keep it open for now; you'll need the URL in a later step.

  5. Using IntelliJ, open the Gutenberg project. Then, open the GenerateTestFixtures.java file.

Shortcut

To find the file, use the cmd+o shortcut to open the search console in IntelliJ. There, type the class name GenerateTestFixtures.

  6. Assign the URL generated in step 4 to the URL variable of the GenerateTestFixtures.java file.

  7. Set the tenant's extraction configuration flags in the getOptions() private method.

  8. Run the GenerateTestFixtures class in GenerateTestFixtures.java to obtain Boilerpipe's HTML output. Select the run option to launch the test, or the debug option if you want the execution to stop at the breakpoints.

Run the tests

  9. Find the output file in the MarfeelGutenberg/MarfeelBoilerpipe/src/test/resources/ folder. Open it with your preferred IDE to see its content.
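When validating a change, it can be useful to keep a copy of the output generated before modifying the HTML and compare it with the new output. The sketch below is one way to do that, under stated assumptions: `FixtureDiff` is a hypothetical helper that is not part of Gutenberg, the two file paths are supplied by you on the command line, and the comparison is a naive line-by-line positional check rather than a real diff algorithm.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Gutenberg): compares two saved HTML outputs.
public class FixtureDiff {

    // Reports every line position where the two outputs differ.
    // A missing line on either side is treated as an empty string.
    public static List<String> changedLines(List<String> before, List<String> after) {
        List<String> changes = new ArrayList<>();
        int max = Math.max(before.size(), after.size());
        for (int i = 0; i < max; i++) {
            String oldLine = i < before.size() ? before.get(i) : "";
            String newLine = i < after.size() ? after.get(i) : "";
            if (!oldLine.equals(newLine)) {
                changes.add("line " + (i + 1) + ": \"" + oldLine + "\" -> \"" + newLine + "\"");
            }
        }
        return changes;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: output saved before the change; args[1]: output after the change.
        List<String> before = Files.readAllLines(Paths.get(args[0]));
        List<String> after = Files.readAllLines(Paths.get(args[1]));
        changedLines(before, after).forEach(System.out::println);
    }
}
```

An empty result means the modification did not alter the extracted HTML. Because the check is positional, a single inserted line makes every following line show up as changed, so a dedicated diff tool is a better fit for large outputs.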

TIP

When debugging the test execution, there are two main operations to validate.

  • Boilerpipe HTML process, which you can debug by adding a breakpoint to the HTMLDocumentProcessor(fetcher).process(args) function. Once the debugger is stopped, you can go inside the Boilerpipe process and isolate the part of it you are interested in.

  • Structured Data process, which you can debug by adding a breakpoint to the line where fStructuredData is created.

Test

Keep in mind this is a test.

Some parts might behave differently than in production. For example, the article URL is not the same.