Scraping Web Pages with Retrofit – jspoon Library

Have you ever worked with JSON converters like GSON or Moshi? They are extremely useful when working with data from the internet. With jspoon, I try to provide a similar mechanism for web scraping and take HTML parsing to the next level.

Story

Before I started working at Droids On Roids, I was a freelancer, often creating Android versions of popular Polish sites. As these apps were unofficial, I had no API, so everything was based on web scraping. I started by working on raw HTML text: I searched for specific strings and added offsets to reach the desired value.

Then I found jsoup, which made HTML parsing much more comfortable. Recently, I worked on commercial projects with APIs in JSON format. With Retrofit and the GSON/Moshi converters, it was effortless to create POJOs from internet content. I thought “it would be great to have a similar mechanism for HTML”, so here it is:

jspoon with Retrofit converter!

jspoon is a library which uses annotations with CSS selectors to create Java POJOs. It uses jsoup as an HTML parser and caches reflection results for better performance. It is also Java 7 compatible, so it works on Android too.

It can be used when you don’t have access to the API, for example, if it isn’t ready yet. Another possible case is when the web page is yours, but you don’t have full access to the database (or you are just lazy): you can skip the API and just scrape the page. Moreover, when you are dealing with third-party web pages in your app and you need some data, like meta tags, this library is for you.

In this post, I’m going to parse the Droids On Roids /blog page using jspoon, Retrofit, and RxJava2.

Installation

We will need the following dependencies (using Gradle):
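As a rough sketch, the dependency block could look like the one below. The artifact coordinates are the ones I believe jspoon, its Retrofit converter, Retrofit, and the RxJava2 call adapter are published under. `jspoonVersion` and `retrofitVersion` are hypothetical properties you would define yourself, and the snippet assumes the jspoon core and its converter share one version number.

```groovy
dependencies {
    // jspoonVersion and retrofitVersion are placeholders (assumptions);
    // check each project's releases for the current numbers.
    implementation "pl.droidsonroids:jspoon:$jspoonVersion"
    implementation "pl.droidsonroids.retrofit2:converter-jspoon:$jspoonVersion"
    implementation "com.squareup.retrofit2:retrofit:$retrofitVersion"
    implementation "com.squareup.retrofit2:adapter-rxjava2:$retrofitVersion"
}
```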

Setting up

First of all, our page needs mapping from HTML to the POJO Java class. This is what jspoon does. We create the BlogPage class with the list of Posts:
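A minimal sketch of such a mapping is shown below. The CSS selectors (`.post`, `.post-title`, `.excerpt`) are assumptions made for illustration and are not taken from the real blog markup, and `BlogPageDemo` is a hypothetical helper that runs the mapping on an HTML string without any network call.

```java
import java.util.List;
import pl.droidsonroids.jspoon.Jspoon;
import pl.droidsonroids.jspoon.annotation.Selector;

// One blog entry. The selectors are hypothetical; adjust them to the real markup.
class Post {
    @Selector(".post-title") String title;
    @Selector(".excerpt") String excerpt;
}

// The whole page: every element matching ".post" is mapped to one Post.
class BlogPage {
    @Selector(".post") List<Post> posts;
}

// Hypothetical helper for trying the mapping on an in-memory HTML string.
final class BlogPageDemo {
    static BlogPage parse(String html) {
        return Jspoon.create().adapter(BlogPage.class).fromHtml(html);
    }
}
```

Calling `BlogPageDemo.parse(html)` on a document with two `.post` elements should give a `BlogPage` whose `posts` list has two entries, which makes it easy to tune selectors before wiring up Retrofit.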

Then, we need to configure Retrofit, following the Retrofit web page. We create an API interface:
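The interface might look roughly like this. The method name `getBlogPage` and the relative path `"blog"` are assumptions based on the /blog page mentioned earlier, and the `BlogPage` declared here is only a stand-in so that the snippet compiles on its own. In a real project it would be the jspoon-annotated class.

```java
import io.reactivex.Single;
import java.util.List;
import retrofit2.http.GET;

// Stand-in for the mapped page class, declared only to keep this sketch self-contained.
class BlogPage {
    List<String> titles;
}

interface BlogService {
    // Path is relative to the Retrofit base URL; "blog" is an assumption.
    @GET("blog")
    Single<BlogPage> getBlogPage();
}
```

Returning a `Single` fits a page download well: the request either produces exactly one parsed page or fails with an error.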

We also write methods for building the BlogService instance for API calls. At this point, we add the jspoon converter and the RxJava2 call adapter:
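A possible shape for that setup is sketched below. `BlogApi` and `buildRetrofit` are hypothetical names, and the base URL is my assumption about where the blog lives. Note that Retrofit requires the base URL to end with a slash.

```java
import pl.droidsonroids.retrofit2.JspoonConverterFactory;
import retrofit2.Retrofit;
import retrofit2.adapter.rxjava2.RxJava2CallAdapterFactory;

final class BlogApi {
    // Assumed base URL; it must end with "/" for Retrofit to accept it.
    static final String BASE_URL = "https://www.thedroidsonroids.com/";

    static Retrofit buildRetrofit() {
        return new Retrofit.Builder()
                .baseUrl(BASE_URL)
                // Converts HTML response bodies into jspoon-annotated classes.
                .addConverterFactory(JspoonConverterFactory.create())
                // Lets service methods return RxJava2 types such as Single.
                .addCallAdapterFactory(RxJava2CallAdapterFactory.create())
                .build();
    }
}
```

With this in place, `BlogApi.buildRetrofit().create(BlogService.class)` would produce the service instance for the API interface.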

Let’s scrape!

That’s all! Everything is set up and we are ready to start scraping:
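The call itself can be sketched as follows. To keep the snippet self-contained, `Post`, `BlogPage`, and `BlogService` are declared here as plain stand-ins without jspoon or Retrofit annotations, and `printTitles` is a hypothetical helper. In the real app the service would come from Retrofit.

```java
import io.reactivex.Single;
import io.reactivex.disposables.Disposable;
import java.util.List;

final class ScrapeExample {
    // Plain stand-ins for the mapped classes and the API interface (assumptions).
    static class Post { String title; }
    static class BlogPage { List<Post> posts; }
    interface BlogService { Single<BlogPage> getBlogPage(); }

    // Subscribes to the request and prints each post title, or the error on failure.
    static Disposable printTitles(BlogService service) {
        return service.getBlogPage().subscribe(
                page -> {
                    for (Post post : page.posts) {
                        System.out.println(post.title);
                    }
                },
                error -> System.err.println("Scraping failed: " + error));
    }
}
```

On Android you would typically also choose schedulers (for example with `subscribeOn`) so the download does not run on the main thread, and dispose of the returned `Disposable` when the screen goes away.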

Et voilà! We get the posts in our console:

You can check the full source of this example in Java and Kotlin here.

Conclusion

Scraping HTML will never beat a professional JSON API, but I think that jspoon can make it much simpler and closer to modern JSON parsing. If you find any bugs or missing functionality, feel free to contribute on GitHub.