
Extract PDF Text and Verify Text Present in PDF Using WebDriver

Many applications offer a 'Print PDF' feature. How do we cover it in automation? First decide whether it really needs to be automated; if the answer is yes, read on to see how it can be done. In an earlier tutorial we saw how to validate that a file was downloaded after clicking a download button. In this tutorial we validate the Print PDF functionality in the two ways below.

1. A simple check that needs no third-party libraries: confirm that the URL of the opened document points to a PDF.
2. Extract the text from the PDF and check whether the text you are looking for is present in the document. Use this ONLY when the content itself must be validated.

Choose between the two based on your requirements.

The first approach looks like this:

	/**
	 * To verify pdf in the URL
	 */
	@Test
	public void testVerifyPDFInURL() {
		WebDriver driver = new FirefoxDriver();
		try {
			driver.get("http://www.princexml.com/samples/");
			driver.findElement(By.linkText("PDF flyer")).click();
			String getURL = driver.getCurrentUrl();
			Assert.assertTrue(getURL.contains(".pdf"));
		} finally {
			// Close the browser even when the assertion fails
			driver.quit();
		}
	}
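The contains(".pdf") check passes for any URL that mentions ".pdf" anywhere, including in a query string. A slightly stricter variant can be sketched as below. The class and method names (PdfUrlCheck, pathEndsWithPdf, startsWithPdfHeader) are made up for this illustration and are not part of Selenium or PDFBox, and the example.com URLs are placeholders. The second helper relies on the fact that a PDF file normally begins with the marker "%PDF-".

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper for this tutorial; not part of Selenium or PDFBox. */
class PdfUrlCheck {

    /** True when the path part of the URL ends with ".pdf", ignoring case and any query string. */
    static boolean pathEndsWithPdf(String url) {
        // URI.create throws IllegalArgumentException for strings that are not valid URIs
        String path = URI.create(url).getPath();
        return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** True when the given bytes begin with the "%PDF-" marker. */
    static boolean startsWithPdfHeader(byte[] firstBytes) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (firstBytes == null || firstBytes.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (firstBytes[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Placeholder URLs, used only to show the difference from contains(".pdf")
        System.out.println(pathEndsWithPdf("http://example.com/docs/report.PDF?download=1")); // true
        System.out.println(pathEndsWithPdf("http://example.com/view?file=a.pdf"));            // false
        System.out.println(startsWithPdfHeader("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII))); // true
    }
}
```

In a test, the value of driver.getCurrentUrl() could be passed to pathEndsWithPdf in place of the contains(".pdf") call. The header check needs the first few bytes of the response, which would have to be read separately.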

The second approach uses a third-party library. In this example we will see how to use the 'Apache PDFBox' library.

To extract text from a PDF we can use the Apache PDFBox library; text extraction is one of its main features, and it works on a wide variety of PDF documents. The functionality is encapsulated in 'org.apache.pdfbox.util.PDFTextStripper' (this is the PDFBox 1.x package; PDFBox 2.x moved the class to 'org.apache.pdfbox.text').

It also lets us limit the text that is extracted by specifying the range of pages to read. For example, if the PDF has 100 pages, we can restrict the range to the first and second pages and validate only the text found there.

The snippet below sets the range so that only the first and second pages of the PDF are read. To verify text somewhere in the middle of the PDF, set the range to those pages instead.

PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(2);

NOTE: The startPage and endPage properties of PDFTextStripper are 1 based and inclusive.
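To make the "1 based and inclusive" convention concrete, here is a small stand-alone sketch. PageRange is a hypothetical class written for this illustration, not a PDFBox type; it only models how many pages a start/end pair covers under that convention.

```java
/** Hypothetical illustration of a 1-based, inclusive page range; not a PDFBox class. */
class PageRange {
    final int startPage;
    final int endPage;

    PageRange(int startPage, int endPage) {
        if (startPage < 1 || endPage < startPage) {
            throw new IllegalArgumentException("Expected 1 <= startPage <= endPage");
        }
        this.startPage = startPage;
        this.endPage = endPage;
    }

    /** Number of pages covered; both ends are included. */
    int pageCount() {
        return endPage - startPage + 1;
    }

    /** True when the given 1-based page number falls inside the range. */
    boolean contains(int page) {
        return page >= startPage && page <= endPage;
    }

    public static void main(String[] args) {
        PageRange firstTwo = new PageRange(1, 2);
        System.out.println(firstTwo.pageCount());  // 2
        System.out.println(firstTwo.contains(2));  // true
        System.out.println(firstTwo.contains(3));  // false
    }
}
```

So a start page of 1 with an end page of 2 reads two pages, and a start and end page of 1 reads exactly one.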

Below is an example program that covers both of the approaches discussed above.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.testng.Assert;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

public class ReadPDF {
	
	WebDriver driver;
	
	@BeforeClass
	public void setUp() {
		driver = new FirefoxDriver();
	}
	
	/**
	 * To verify PDF content in the pdf document
	 */
	@Test
	public void testVerifyPDFTextInBrowser() {
		
		driver.get("http://www.princexml.com/samples/");
		driver.findElement(By.linkText("PDF flyer")).click();
		Assert.assertTrue(verifyPDFContent(driver.getCurrentUrl(), "Prince Cascading"));
	}

	/**
	 * To verify pdf in the URL 
	 */
	@Test
	public void testVerifyPDFInURL() {
		driver.get("http://www.princexml.com/samples/");
		driver.findElement(By.linkText("PDF flyer")).click();
		String getURL = driver.getCurrentUrl();
		Assert.assertTrue(getURL.contains(".pdf"));
	}

	
	/**
	 * Reads the PDF at strURL and checks whether its first page contains reqTextInPDF
	 */
	public boolean verifyPDFContent(String strURL, String reqTextInPDF) {
		
		PDDocument pdDoc = null;
		COSDocument cosDoc = null;
		String parsedText = null;

		try {
			URL url = new URL(strURL);
			BufferedInputStream file = new BufferedInputStream(url.openStream());
			PDFParser parser = new PDFParser(file);
			
			parser.parse();
			cosDoc = parser.getDocument();
			PDFTextStripper pdfStripper = new PDFTextStripper();
			// Page numbers are 1 based and inclusive: read the first page only
			pdfStripper.setStartPage(1);
			pdfStripper.setEndPage(1);
			
			pdDoc = new PDDocument(cosDoc);
			parsedText = pdfStripper.getText(pdDoc);
		} catch (MalformedURLException e2) {
			System.err.println("URL string could not be parsed " + e2.getMessage());
		} catch (IOException e) {
			System.err.println("Unable to open PDF Parser. " + e.getMessage());
		} finally {
			// Release the document whether or not parsing succeeded
			try {
				if (pdDoc != null)
					pdDoc.close();
				else if (cosDoc != null)
					cosDoc.close();
			} catch (IOException e1) {
				e1.printStackTrace();
			}
		}
		
		System.out.println("+++++++++++++++++");
		System.out.println(parsedText);
		System.out.println("+++++++++++++++++");

		// parsedText stays null when the PDF could not be read
		return parsedText != null && parsedText.contains(reqTextInPDF);
	}
	
	@AfterClass
	public void tearDown() {
		driver.quit();
	}
}
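Text extracted from a PDF often contains line breaks and runs of spaces where the page layout wraps, so a plain contains() on the raw text can miss a phrase that is visibly present. One possible way around this, sketched below, is to collapse whitespace on both sides before comparing. PdfTextMatch and its methods are hypothetical names for this illustration, and the sample strings are invented rather than taken from a real PDF.

```java
/** Hypothetical helper for comparing extracted PDF text; not part of PDFBox. */
class PdfTextMatch {

    /** Collapses every run of whitespace (spaces, tabs, line breaks) into a single space. */
    static String normalize(String text) {
        return text == null ? "" : text.replaceAll("\\s+", " ").trim();
    }

    /** True when the expected phrase occurs in the extracted text after normalization. */
    static boolean containsNormalized(String extracted, String expected) {
        String needle = normalize(expected);
        // An empty phrase would match everything, so treat it as "not found"
        return !needle.isEmpty() && normalize(extracted).contains(needle);
    }

    public static void main(String[] args) {
        // Invented sample text that mimics a phrase split across two lines
        String extracted = "Prince\nCascading   Style Sheets";
        System.out.println(extracted.contains("Prince Cascading"));            // false
        System.out.println(containsNormalized(extracted, "Prince Cascading")); // true
    }
}
```

In the ReadPDF class, the final comparison could call containsNormalized(parsedText, reqTextInPDF) if this behaviour is wanted. Note that the default \s pattern does not cover every Unicode space character.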

The ReadPDF example above works when the PDF opens in the browser after clicking the Print button. In some cases, however, clicking Print downloads the PDF file instead.

In those cases, replace the following code:

URL url = new URL(strURL);
BufferedInputStream file = new BufferedInputStream(url.openStream());
PDFParser parser = new PDFParser(file);

with the code below:

File file = new File("D:/Paynetsbicardbill.pdf");
PDFParser parser = new PDFParser(new FileInputStream(file));

Pass the path where the document was downloaded, and add imports for java.io.File and java.io.FileInputStream.
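When the PDF is downloaded, the file may not be on disk yet at the moment the test tries to parse it. A minimal sketch of one way to handle that is to poll for the file with a timeout before opening it. DownloadWait and waitForFile are hypothetical names for this illustration, and the timeout and polling values are arbitrary.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that waits for a downloaded file to appear; not part of Selenium. */
class DownloadWait {

    /**
     * Polls until the file exists and is non-empty, or until the timeout expires.
     * Returns true when the file was found in time, false otherwise.
     */
    static boolean waitForFile(Path file, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (isNonEmptyFile(file)) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollMillis);
        }
    }

    private static boolean isNonEmptyFile(Path file) {
        try {
            return Files.isRegularFile(file) && Files.size(file) > 0;
        } catch (IOException e) {
            // The file may disappear between the two checks; treat that as "not ready"
            return false;
        }
    }
}
```

A test could call, for example, DownloadWait.waitForFile(Paths.get("D:/Paynetsbicardbill.pdf"), 10000, 250) and fail early if it returns false. A non-empty file is not proof that the download has finished, since some browsers write the data in stages, so this check is only a first line of defence.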


Comments

Good read.

This code is not working for me.
FAILED: testVerifyPDFTextInBrowser
java.net.UnknownHostException: www.princexml.com
