Extract PDF text And Verify Text Present in PDF using WebDriver

Most applications have 'Print PDF' functionality. How do we verify it in automation? First decide whether it really needs to be automated; if the answer is yes, read on to see how it can be done. In an earlier tutorial we saw how to validate whether a file was downloaded after clicking the download button. In this tutorial we will see how to validate the Print PDF functionality.

There are two ways of doing this:

1. A very simple way that uses no third-party libraries.
2. Extract the text from the PDF and check whether the text you are looking for is present in the document. Use this ONLY when the content itself must be validated.

Choose between them based on your requirement.

The first way is shown below:

/**
 * To verify that the URL opened by the link points to a PDF
 */
@Test
public void testVerifyPDFInURL() {
	WebDriver driver = new FirefoxDriver();
	try {
		driver.get("http://www.princexml.com/samples/");
		driver.findElement(By.linkText("PDF flyer")).click();
		String getURL = driver.getCurrentUrl();
		Assert.assertTrue(getURL.contains(".pdf"));
	} finally {
		// Close the browser even when the assertion fails
		driver.quit();
	}
}
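
The URL check only proves that the address contains ".pdf". Still without any third-party library, a slightly stronger check is to read the first bytes of the response and compare them with the "%PDF-" signature that well-formed PDF files begin with. The sketch below illustrates that idea; the class and method names (PdfSignature, looksLikePdf, startsWithPdfSignature) are hypothetical and are not part of Selenium or of the example in this tutorial.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: detects the "%PDF-" signature at the start of a file. */
final class PdfSignature {

	private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

	/** Returns true when the given bytes begin with "%PDF-". */
	static boolean looksLikePdf(byte[] head) {
		if (head == null || head.length < MAGIC.length) {
			return false;
		}
		for (int i = 0; i < MAGIC.length; i++) {
			if (head[i] != MAGIC[i]) {
				return false;
			}
		}
		return true;
	}

	/** Reads up to five bytes from the stream and checks them for the signature. */
	static boolean startsWithPdfSignature(InputStream in) throws IOException {
		byte[] head = new byte[MAGIC.length];
		int read = 0;
		while (read < head.length) {
			int n = in.read(head, read, head.length - read);
			if (n < 0) {
				break; // stream ended before five bytes were available
			}
			read += n;
		}
		return read == head.length && looksLikePdf(head);
	}
}
```

In a test, the stream could come from new URL(driver.getCurrentUrl()).openStream(). Note that this request is made by Java rather than by the browser, so it assumes the PDF is reachable without the browser's session.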

The second way uses a third-party library. In this example we will see how to use the 'Apache PDFBox' library.

Text extraction is one of the main features of Apache PDFBox, and it can extract text from a variety of PDF documents. The text extraction functionality is encapsulated in 'org.apache.pdfbox.util.PDFTextStripper' (this is the PDFBox 1.x package; in PDFBox 2.x the class is 'org.apache.pdfbox.text.PDFTextStripper').

It also lets us limit the text that is extracted by specifying the range of pages to read. For example, if the PDF has 100 pages, we can restrict the range to the first and second pages and validate the text there.

The code snippet below sets the range so that only the first and second pages of the PDF are read. If the text you want to verify is somewhere in the middle of the PDF, set the range to those pages instead.

PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(2);

NOTE: The startPage and endPage properties of PDFTextStripper are 1-based and inclusive.

Below is an example program that covers both of the approaches discussed above.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.testng.Assert;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

public class ReadPDF {
	
	WebDriver driver;
	
	@BeforeClass
	public void setUp() {
		driver = new FirefoxDriver();
	}
	
	/**
	 * To verify PDF content in the pdf document
	 */
	@Test
	public void testVerifyPDFTextInBrowser() {
		
		driver.get("http://www.princexml.com/samples/");
		driver.findElement(By.linkText("PDF flyer")).click();
		Assert.assertTrue(verifyPDFContent(driver.getCurrentUrl(), "Prince Cascading"));
	}

	/**
	 * To verify pdf in the URL 
	 */
	@Test
	public void testVerifyPDFInURL() {
		driver.get("http://www.princexml.com/samples/");
		driver.findElement(By.linkText("PDF flyer")).click();
		String getURL = driver.getCurrentUrl();
		Assert.assertTrue(getURL.contains(".pdf"));
	}

	
	/**
	 * Reads the PDF at strURL, extracts the text of the first page and
	 * reports whether it contains reqTextInPDF.
	 */
	public boolean verifyPDFContent(String strURL, String reqTextInPDF) {

		PDDocument pdDoc = null;
		COSDocument cosDoc = null;
		String parsedText = null;

		try {
			URL url = new URL(strURL);
			BufferedInputStream file = new BufferedInputStream(url.openStream());
			PDFParser parser = new PDFParser(file);

			parser.parse();
			cosDoc = parser.getDocument();
			PDFTextStripper pdfStripper = new PDFTextStripper();
			// Page numbers are 1-based and inclusive: read only the first page
			pdfStripper.setStartPage(1);
			pdfStripper.setEndPage(1);

			pdDoc = new PDDocument(cosDoc);
			parsedText = pdfStripper.getText(pdDoc);
		} catch (MalformedURLException e) {
			System.err.println("URL string could not be parsed. " + e.getMessage());
		} catch (IOException e) {
			System.err.println("Unable to read the PDF. " + e.getMessage());
		} finally {
			// Always release the document, whether or not parsing succeeded.
			// Closing the PDDocument also closes its underlying COSDocument.
			try {
				if (pdDoc != null) {
					pdDoc.close();
				} else if (cosDoc != null) {
					cosDoc.close();
				}
			} catch (IOException e) {
				e.printStackTrace();
			}
		}

		System.out.println("+++++++++++++++++");
		System.out.println(parsedText);
		System.out.println("+++++++++++++++++");

		// parsedText stays null when the PDF could not be read
		return parsedText != null && parsedText.contains(reqTextInPDF);
	}
	
	@AfterClass
	public void tearDown() {
		driver.quit();
	}
}
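
One thing to watch for with the contains check in verifyPDFContent: the extracted text keeps the line breaks of the PDF layout, so a phrase that wraps across two lines in the document will not match a search string written with a single space. A minimal sketch of a whitespace-tolerant comparison is shown below, assuming that ignoring differences in whitespace is acceptable for your validation. PdfTextMatcher and its methods are hypothetical names, not part of PDFBox.

```java
/** Hypothetical helper: compares extracted PDF text while ignoring whitespace differences. */
final class PdfTextMatcher {

	/** Collapses every run of whitespace (spaces, tabs, line breaks) into one space. */
	static String normalize(String text) {
		return text == null ? "" : text.replaceAll("\\s+", " ").trim();
	}

	/** Returns true when expected occurs in parsedText after both are normalized. */
	static boolean containsIgnoringWhitespace(String parsedText, String expected) {
		String needle = normalize(expected);
		if (needle.isEmpty()) {
			return false; // nothing meaningful to search for
		}
		return normalize(parsedText).contains(needle);
	}
}
```

With this helper, the last line of verifyPDFContent could return PdfTextMatcher.containsIgnoringWhitespace(parsedText, reqTextInPDF), which also treats a null parsedText as "not found".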

The above case works fine when the PDF file is opened in the browser after clicking on the Print button. In some cases, however, clicking on Print downloads the PDF file instead.

In these cases the PDF has to be read from disk, so we need to change the code below

URL url = new URL(strURL);
BufferedInputStream file = new BufferedInputStream(url.openStream());
PDFParser parser = new PDFParser(file);

to the following:

File file = new File("D:/Paynetsbicardbill.pdf");
PDFParser parser = new PDFParser(new FileInputStream(file));

We should pass the path where the document is downloaded. This version also needs the imports java.io.File and java.io.FileInputStream.
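
When the PDF is downloaded, the file may not be on disk yet at the moment the click returns, so it can help to wait for it before parsing. The sketch below polls until the file exists and its size has stopped changing between two checks. DownloadWaiter, waitForFile and the 200 ms polling interval are assumptions made for illustration, not Selenium APIs.

```java
import java.io.File;

/** Hypothetical helper: waits for a downloaded file to appear and stop growing. */
final class DownloadWaiter {

	/**
	 * Polls until the file exists, is non-empty and has the same size on two
	 * consecutive checks, or until timeoutMillis has elapsed.
	 */
	static boolean waitForFile(File file, long timeoutMillis) {
		long deadline = System.currentTimeMillis() + timeoutMillis;
		long lastSize = -1;
		while (System.currentTimeMillis() < deadline) {
			if (file.isFile() && file.length() > 0) {
				long size = file.length();
				if (size == lastSize) {
					return true; // size unchanged since the previous poll
				}
				lastSize = size;
			}
			try {
				Thread.sleep(200);
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt();
				return false;
			}
		}
		return false;
	}
}
```

A test could call DownloadWaiter.waitForFile(file, 10000) after clicking on Print and fail early if it returns false, before the file is handed to PDFParser.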

Comments

Permalink Submitted by Viewer on Mon, 04/20/2015 - 15:55

good to reading

Permalink Submitted by Neena on Tue, 07/21/2015 - 03:22

This code is not working for me.
FAILED: testVerifyPDFTextInBrowser
java.net.UnknownHostException: www.princexml.com
