Web Crawler Implementation - Java Code

Keet Malin Sugathadasa
Feb 10, 2017
12 min read

This blog contains information related to implementing a web crawler using a simple Java code. More on how to implement a web crawler using Python code, will be published soon.

What is a Web Crawler?

A web crawler is a software bot (internet bot) that will surf through the world wide web in a systematic manner and collects required information in a meaningful way. Typically it's purpose is to do web indexing (web spidering).

To know more about Web Crawlers, their architecture and their policies, read the following blog.

Implementation - JAVA Code

Before we begin things, there is always no hard and fast rules to implement a Web Crawler. You can always use your very own techniques and implement a simple piece of code that will browse the web. Similarly, you can read this blog, and get a rough idea on the crawling algorithm and the policies, and use that information to further enhance your own piece of code.

But, today we will be using a Java Library called jsoup, and will show you a step by step instruction on how to retrieve the information you need. Once the web crawling and information retrieval is done, you can surely twist the code lines a little bit and make it work according to your needs.

Go through the following steps to implement a basic web crawler

1) Setting Up the JAVA Project with Jsoup Library

1.1 Download & Install NetBeans

Go to the official NetBeans Downloads page.

Select any bundle that includes Java SE, and click on the download button

Install NetBeans. (Make sure that the Java Runtime Environment is being set on the computer)

1.2 Open Netbeans and create a new Java Project

File -> NewProject

Select "Java" Category and "Java Application" project.

Specify a project name, and project location and hit "Finish"

1.3 Download and add the JSoup Library to the project

Go to official JSoup download page.
Download the latest JSoup Core Library. (Not the javadoc or sources)
Copy the downloaded jar file into the project root directory.
Go to Netbeans, and right click on the "Libraries" folder (available on the projects pane) and click, "Add Jar/Folder".
Select you JSoup jar file from the project root folder (where you saved it before) and click open.
(If you are using a maven project, the details on how to add the dependency is given in the official jsoup site.)

2) Input HTML codes, URLs, Files into the project and convert to Document Objects

2.1 Get the HTML Code of a Web Page

In the main method of the Java Project, first specify a URL that you want to crawl. In this example let's take the wikipedia website.

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

Next, let's connect to the URL and retrieve its document. For this you need to handle the IOException being thrown. And need to make the following imports as well.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document;

//connect to url and get document Document document = Jsoup.connect(url).get();

Next to retrieve the entire HTML code, do the following

//get the outerhtml of the document String outerHtml = document.outerHtml();

To get the "head" element of the document, you need to do the following import of Element. This Element Class represents HTML elements in the html document. A HTML element consists of a tag name, attributes, and child nodes (including text nodes and other elements).

import org.jsoup.nodes.Element;

//get the head element of the document

Element headElement = document.head();

String head = headElement.toString();

To get the "body" element of the HTML document, you need to do the following import of Element. This Element Class represents HTML elements in the html document. A HTML element consists of a tag name, attributes, and child nodes (including text nodes and other elements).

import org.jsoup.nodes.Element;

//get the body element of the document

Element bodyElement = document.body();

String body= bodyElement.toString();

There are more manupulations that can be carried out to the Document object. For example, you can set the charset, retrieve the charset, get the title of the document, get the location of the document and so on. For more information on all the methods available for the Document class, visit this JSoup API Docs page.

The entire code for the above examples is in given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;

/** * * @author ASUS-PC */ public class GetHtmlCode {

public static void main(String[] args) throws IOException {

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

//get the outerhtml of the document String outerHtml = document.outerHtml();

//get the head element of the document Element headElement = document.head(); String head = headElement.toString();

//get the body element of the document Element bodyElement = document.body(); String body = bodyElement.toString(); } }

2.2 Get the Document from a given HTML Code

In section 2.1, we explained on how to get the HTML code of a website when the URL is given. What if you already have the HTML code at hand. There is no website, but you have the HTML code, may be in a text file or some other document. Then you can create a Document object easily and do all the manipulations shown above. You will have to use the following code snippet.

//the HTML of the webpage of interest String html = "<html><head><title>First parse</title></head>" + "<body><p>Parsed HTML into a doc.</p></body></html>";

//get document of given html Document document = Jsoup.parse(html);

The entire code segment is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;

/** * * @author ASUS-PC */ public class GetDocumentFromHTML {

public static void main(String[] args) throws IOException {

//the HTML of the webpage of interest String html = "<html><head><title>First parse</title></head>" + "<body><p>Parsed HTML into a doc.</p></body></html>";

//get document of given html Document document = Jsoup.parse(html);

//get the outerhtml of the document String outerHtml = document.outerHtml();

//get the head element of the document Element headElement = document.head(); String head = headElement.toString();

//get the body element of the document Element bodyElement = document.body(); String body = bodyElement.toString(); }

}

2.3 Get the Document from a given HTML Body Fragment

Section 2.2 only works, if you have the entire HTML Code at hand. But what if you only have a small fragment of a HTML body? Then you can still use the Jsoup's Parse method, but in using a different method. Check out the following code snippet.

//the HTML of the webpage of interest String html = "<div><p>Lorem ipsum.</p>";

//get document of given html Document document = Jsoup.parseBodyFragment(html);

The entire code segment is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;

/** * * @author ASUS-PC */ public class GetDocumentFromHTMLFragment {

public static void main(String[] args) throws IOException {

//the HTML Fragment of the webpage of interest String html = "<div><p>Lorem ipsum.</p>";

//get document of given html fragment Document document = Jsoup.parseBodyFragment(html);

//get the outerhtml of the document String outerHtml = document.outerHtml();

//get the body element of the document Element bodyElement = document.body(); String body = bodyElement.toString(); }

}

2.4 Get the Document from a given HTML File

If you have a raw file on your computer, or hosted somewhere else, you might need to know, how to access the HTML content in the file and then create a document for further processing. Use the following code snippet. You need to specify a charset name as an additional parameter in the JSoup's Parse Method.

//the HTML File of the webpage of interest File input = new File("/keet/index.html");

//get document of given html file Document document = Jsoup.parse(input , "UTF-8");

The entire code segment is given below.

import java.io.File; import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;

/** * * @author ASUS-PC */ public class GetDocumentFromHTMLFile {

public static void main(String[] args) throws IOException {

//the HTML File of the webpage of interest File input = new File("/keet/index.html");

//get document of given html file Document document = Jsoup.parse(input , "UTF-8");

//get the outerhtml of the document String outerHtml = document.outerHtml();

//get the head element of the document Element headElement = document.head(); String head = headElement.toString();

//get the body element of the document Element bodyElement = document.body(); String body = bodyElement.toString(); } }

Now we are aware of how to get the HTML code and the web Document from the following four options.

Webpage URL
Raw HTML Code
HTML Body Code Fragment
HTML File

Now let's look into, accessing elements inside a given HTML code or Document. For the ease of understanding, Let's use the Document object and then retrieve each of the elements, If you have any other source, check the above sections, and convert it to a Document object.

3) Extract Elements from HTML documents

Now we are aware of getting a Document object and the HTML code, from a given URL or a raw HTML text or a HTML file. Now let's try to understand on how to access each element from a given document. For example, if we want to access all the links given in a webpage, we need to refer to the HTML element called "a". Following code syntax is what we will be using to access elements in a web document. Make sure to convert all HTML files, Webpage URLs and HTML Files into a Document Object (Using any of the methods in Section 2). This section only explains you on extracting the element from a document. More on extracting information from elements is given in section 4.

3.1 Extract Elements using DOM Methods

The Jsoup library provides the capability to access HTML Elements using DOM methods on a Document. (DOM stands for Document Object Model). The following DOM methods and more, can be used on Documents to extract elements needed.

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) etc

getElementById (String id)

In any HTML document, the "id" attribute is a unique value. Therefore whenever we try to retrieve an element using "getElementById (String id)", we will only get one element. We need to have a Document object ready as given below. Make sure to handle the IOException thrown by the connect method.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document;

//the URL of the webpage of interest

String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

Next, we need to identify an "ID" of an element in the HTML document and extract it. If we want to extract the element with the "id" named "content", we can use the following code snippet.

import org.jsoup.nodes.Element;

//get elements by ID Element content = document.getElementById("content");

The entire code segment is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;

/** * * @author ASUS-PC */ public class GetElementUsingId {

public static void main(String[] args) throws IOException {

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

//get elements by ID Element content = document.getElementById("content"); } }

getElementsByTag(String tag)

In HTML documents, there are many different tags available. Unlike "id", there are multiple elements with the same tag name. An example for a tag is <head>, <a> , <input> etc. Using the following method, you will getting an ArrayList called "Elements" because there are multiple elements with the same tag name. In the below code, it shows how to get each Element from an Elements Object. Processing these Element objects will be explained in section 4.

import org.jsoup.nodes.Element; import org.jsoup.select.Elements;

//get elements by Tag Elements tagList = document.getElementsByTag("a");

//This will iterate through the Elements arraylist for each Element for (Element element : tagList) {

//the element variable represents an Element

}

The entire code segment is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements;

/** * * @author ASUS-PC */ public class GetElementsUsingTag {

public static void main(String[] args) throws IOException {

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

//get elements by Tag Elements tagList = document.getElementsByTag("a");

//This will itarete through the Elements arraylist for each Element for (Element element : tagList) {

//the element variable represents an Element } } }

getElementsByClass(String className)

In HTML documents, there are many different classes available. Unlike "id", there are multiple elements with the same class name. Using the following method, you will getting an ArrayList called "Elements" because there are multiple elements with the same class name. In the below code, it shows how to get each Element from an Elements Object. Processing these Element objects will be explained in section 4.

import org.jsoup.nodes.Element; import org.jsoup.select.Elements;

//get elements by Class Elements classList = document.getElementsByClass("btn-default");

//This will iterate through the Elements arraylist for each Element for (Element element : classList) {

//the element variable represents an Element

}

The entire code segment is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements;

/** * * @author ASUS-PC */ public class GetElementsUsingClass {

public static void main(String[] args) throws IOException {

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

//get elements by Class Elements classList = document.getElementsByClass("btn-default");

//This will itarete through the Elements arraylist for each Element for (Element element : classList) {

//the element variable represents an Element } } }

getElementsByAttribute(String key)

In HTML documents, there are many different attributes available. Unlike "id", there are multiple elements with the same attribute name. Using the following method, you will getting an ArrayList called "Elements" because there are multiple elements with the same attribute name. In the below code, it shows how to get each Element from an Elements Object. Processing these Element objects will be explained in section 4.

import org.jsoup.nodes.Element; import org.jsoup.select.Elements;

//get elements by Attribute Elements attributeList = document.getElementsByAttribute("type");

//This will iterate through the Elements arraylist for each Element for (Element element : attributeList) {

//the element variable represents an Element

}

The entire code segment is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements;

/** * * @author ASUS-PC */ public class GetElementsUsingClass {

public static void main(String[] args) throws IOException {

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

//get elements by Attribute Elements attributeList = document.getElementsByAttribute("type");

//This will itarete through the Elements arraylist for each Element for (Element element : attributeList) {

//the element variable represents an Element } } }

3.2 Extract Methods using Select Method

The Select method allows you to manipulate or find elements using CSS or jquery like selector syntax. For this also, you will be needing a document object to use the "Select" method. To use this method, there is a specific set of syntax that you need to adhere to. For more on this syntax visit this page on Jsoup CookBook.

import org.jsoup.nodes.Element; import org.jsoup.select.Elements;

//get links using select Elements links = document.select("a[href]"); //This will itarete through the Elements arraylist for each Element for (Element element : links) {

//the element variable represents an Element } //get img with src, and only .png Elements pngs = document.select("img[src$=.png]"); //This will itarete through the Elements arraylist for each Element for (Element element : pngs) {

//the element variable represents an Element }

The entire code for this is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements;

/** * * @author ASUS-PC */ public class getElementsUsingSelect {

public static void main(String[] args) throws IOException {

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

//get links using select Elements links = document.select("a[href]"); //This will itarete through the Elements arraylist for each Element for (Element element : links) {

//the element variable represents an Element } } }

4) Extract Information from Elements

To go ahead with this section, we will be needing an Element Object. If you want to know, how to get Element Objects, please read section 3 on this blog.

4.1 Get Attribute Details from Elements

Using the above methods we have extracted the elements we need. Now let's have a look at the manipulations we can do on an element. To extract the HTML attributes of an element, we can use the "attr(String attributeKey)" method according to the following code snippet.

//get attribute "name" from the element content String name = element.attr("name");

To get all the attributes of an Element, we can use the "attributes()" method, and this will return an Iterable object called Attributes. Check the following code snippet. This code snippet will print the html code, key and value of all the attributes in this Attributes Object.

import org.jsoup.nodes.Attribute; import org.jsoup.nodes.Attributes;

//get all attributes of element content Attributes attributes = element.attributes(); //this will iterate the Attributes object for (Attribute attribute: attributes){ //print the HTML of each attribute System.out.println(attribute.html()); //print the Key of each attribute System.out.println(attribute.getKey()); //print the value of each attibute System.out.println(attribute.getValue());

}

The entire code is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Attribute; import org.jsoup.nodes.Attributes; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;

/** * * @author ASUS-PC */ public class GetIAttributesFromElement {

public static void main(String[] args) throws IOException {

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

//get elements by ID Element element = document.getElementById("content");

//get attribute "name" from the element String name = element.attr("name");

//get all attributes of element Attributes attributes = element.attributes();

//this will iterate the Attributes object for (Attribute attribute : attributes) {

//print the HTML of each attribute System.out.println(attribute.html());

//print the Key of each attribute System.out.println(attribute.getKey());

//print the value of each attibute System.out.println(attribute.getValue()); } } }

4.2 Get HTML from Element

The entire code for extracting the HTML is given below.

import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;

/** * * @author ASUS-PC */ public class GetIHTMLFromElement { public static void main(String[] args) throws IOException {

//the URL of the webpage of interest String url = "http://en.wikipedia.org/";

//connect to url and get document Document document = Jsoup.connect(url).get();

//get elements by ID Element element = document.getElementById("content");

//get innerHTML of the element String name = element.html();

//get outerHTML of the element String attributes = element.outerHtml(); } }

For more on how to manipulate Elements, please visit the official JSoup CookBook page.

The entire project explained above, is available on GitHub named as Web_Crawler.

References

[1] https://www.youtube.com/playlist?list=PL2npUq4XqJkyN3sr7MHuAiaZFB-74aqqB

[2] https://jsoup.org/apidocs/org/jsoup/nodes/Document.html

[3] https://jsoup.org/cookbook/

[4] https://jsoup.org/apidocs/org/jsoup/nodes/Element.html

#WhatisaWebCrawler #ImplementationJAVACode #Jsoup #jsouplibrary #java #selectmethod #htmlcode #crawling #webcrawling #parsing #DOM #getelementbyid #getelementsbytag #getelementsbyClass #ElementObject #DocumentObject

KEET MALIN SUGATHADASA

Web Crawler Implementation - Java Code

What is a Web Crawler?

Implementation - JAVA Code

1) Setting Up the JAVA Project with Jsoup Library

1.1 Download & Install NetBeans

1.2 Open Netbeans and create a new Java Project

1.3 Download and add the JSoup Library to the project

2) Input HTML codes, URLs, Files into the project and convert to Document Objects

2.1 Get the HTML Code of a Web Page

2.2 Get the Document from a given HTML Code

2.3 Get the Document from a given HTML Body Fragment

2.4 Get the Document from a given HTML File

3) Extract Elements from HTML documents

3.1 Extract Elements using DOM Methods

3.2 Extract Methods using Select Method

4) Extract Information from Elements

4.1 Get Attribute Details from Elements

4.2 Get HTML from Element

References

Recent Posts

Comments