Extract Data From PDF as CSV

Hein Htet Zaw
Oct 3, 2020

In this tutorial, we will build a Spring Boot application that extracts data from PDF files (tabular and non-tabular) into a CSV file, for data analysis or for storing in a database. To achieve this, we will modify some of the logic of the TrapRange method, which only works properly for PDFs in tabular format. The application works well for retrieving data from Google, AWS and OCBC billing statements.

Implementation Steps

Step 1: Develop a Spring Boot Application

Step 2: Modify TrapRange Method

Step 3: Test the Application

Step 1: Develop a Spring Boot Application

Go to the Spring Initializr website, generate a new project and download it.

Dependencies we are going to use:

1. Apache PDFBox

PDFBox can be used to create, render, print and split PDF files, among other things.

2. Spring Boot Starter Web

spring-boot-starter-web pulls in the packages and classes needed to build a web project (here, a REST API).

3. Org.Json

org.json can be used to build JSON content, such as the responses of our REST API.

4. Google Guava

Guava is a suite of core and extended libraries from Google that includes utility classes, collections, I/O classes and much more. We are going to use it for managing text positions.

Finally… let’s start coding…

Open the Spring Boot project you downloaded in VS Code. This tutorial uses VS Code as the IDE, so make sure you have Spring Boot support installed in VS Code; if not, please follow this Link.

1. First, we need to add the dependencies above to our pom.xml file:

2. Now we will create a new controller class that will handle our REST requests:

In this class, we create a POST route, /api/pdf/convertCsv, that accepts a PDF file in a parameter named file. We then pass the uploaded file as an input stream to a service class called PDFToCsvExtractor, which we will implement in the next steps. Here we can also call the exceptPage and exceptLine methods to skip pages and lines.

Step 2: Modify TrapRange Method

The PDFToCsvExtractor class is essentially the code from the TrapRange method, with some modifications.

In this method, we need to make the following modifications. We extract the height range of each line and store it, keyed by page number, in the pageIdNLineRangesMap variable. We then loop over each line to collect its list of characters, which goes into pageIdNTextsMap. After that, we loop over the resulting list to get the width ranges of the words in each line and store them in pageIdNColumnRangesMap.
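The range-building idea can be illustrated with a minimal sketch. It is not the project's code: the class name RangeMerger is hypothetical, and plain int arrays stand in for the Guava types the project uses, so that the example has no dependencies. Each character contributes a [top, bottom] interval on the vertical axis, and overlapping intervals are merged into one line range. The same merge applied to [left, right] intervals would give column ranges.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only.
class RangeMerger {

    /** Merges overlapping or touching [start, end] intervals; result is sorted by start. */
    static List<int[]> merge(List<int[]> ranges) {
        // Sort a copy so the caller's list is left untouched.
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt(r -> r[0]));

        List<int[]> merged = new ArrayList<>();
        for (int[] r : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last[1]) {
                // Overlaps the previous range: extend it if needed.
                last[1] = Math.max(last[1], r[1]);
            } else {
                // Starts after the previous range ends: open a new one.
                merged.add(new int[] {r[0], r[1]});
            }
        }
        return merged;
    }
}
```

For example, characters spanning 100 to 112 and 101 to 110 fall into one line range, while a character spanning 130 to 142 starts a second line.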

In this method, we remove the comma ( , ) from numbers such as 1,234.56. Our target format is CSV, which uses the comma as its separator, so a comma inside a number would otherwise split it across two columns.
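One way to do this, shown here as a sketch rather than the project's actual implementation, is a regular expression that removes only commas sitting between a digit and a group of exactly three digits. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class NumberCleaner {

    // A comma preceded by a digit and followed by exactly three digits.
    private static final Pattern THOUSANDS_COMMA =
            Pattern.compile("(?<=\\d),(?=\\d{3}(?!\\d))");

    /** Removes thousands separators from numbers, leaving other commas alone. */
    static String stripThousandsSeparators(String text) {
        return THOUSANDS_COMMA.matcher(text).replaceAll("");
    }
}
```

With this pattern, stripThousandsSeparators("1,234.56") returns "1234.56", while a comma in ordinary text such as "Total, 12" is kept.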

In this method, we keep text that contains a line break in the same column, as in Fig 1.0. We then generate a CSV string for each line and append it to a string builder, which holds the resulting CSV string.
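The sketch below illustrates this step under simplifying assumptions: the cells of each row are already known, and a line break inside a cell is replaced by a single space. The class CsvRowWriter and its methods are hypothetical names, not the project's API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
class CsvRowWriter {

    /** Collapses any line break (and the whitespace around it) into one space. */
    static String normalizeCell(String cell) {
        return cell.replaceAll("\\s*\\R\\s*", " ").trim();
    }

    /** Builds one CSV line from the cells of a row. */
    static String toCsvLine(List<String> cells) {
        return cells.stream()
                .map(CsvRowWriter::normalizeCell)
                .collect(Collectors.joining(","));
    }

    /** Appends every row to a StringBuilder, one CSV line per row. */
    static String toCsv(List<List<String>> rows) {
        StringBuilder csv = new StringBuilder();
        for (List<String> row : rows) {
            csv.append(toCsvLine(row)).append('\n');
        }
        return csv.toString();
    }
}
```

A cell that still contains a comma after cleaning would need quoting, which this sketch leaves out.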

Fig 1.0 Text containing a line break

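By default, Spring Boot limits each uploaded file to 1MB and each multipart request to 10MB. If you need more, raise the spring.servlet.multipart.max-file-size and spring.servlet.multipart.max-request-size properties in application.properties.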

Step 3: Test the application

You can test the application with a POST call to the following endpoint uploading a PDF file:

http://localhost:8080/api/pdf/convertCsv

You can use Postman, SoapUI or any REST client you prefer.
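If you would rather test from code, the sketch below builds the multipart request by hand with the HTTP client that ships with Java 11. It assumes the endpoint and the file parameter name described earlier; the class PdfUploadClient is hypothetical, and the response is returned as a plain string because its exact format depends on the controller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical test client, for illustration only (requires Java 11+).
class PdfUploadClient {

    /** Builds a multipart/form-data body with a single part named "file". */
    static byte[] multipartBody(String boundary, String fileName, byte[] pdfBytes) {
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(header.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(pdfBytes);
        // Closing delimiter that ends the multipart body.
        out.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Posts the PDF to the given URL and returns the response body as text. */
    static String upload(String url, Path pdf) throws IOException, InterruptedException {
        String boundary = "----pdf2csv" + System.nanoTime();
        byte[] body = multipartBody(boundary, pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the application running locally, a call such as PdfUploadClient.upload("http://localhost:8080/api/pdf/convertCsv", Path.of("statement.pdf")) would send the file, where statement.pdf is a placeholder for your own PDF.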

That’s all folks!

Full code on GitHub

You can find this project on my GitHub:
