Liberating tabular data from the clutches of PDFs

By: Jayant Pahuja & Spongebob
Data Expert @OpenBudgetsIndia
Core Team Member @DataKindBlR

Budgets

  • Why are budgets important?
  • What kind of data is published?
  • How are they published?

150+ Budget Source Websites

In [246]:
Image('govt_budget_portals.png')
Out[246]:

Problems with Budget Documents

  • Unstructured PDF documents
  • Inconsistent Formats
  • Budgets published in local languages
In [247]:
Image('whaat.jpg')
Out[247]:

Karnataka Budget

In [23]:
Image('kn_sample_image.png')
Out[23]:

West Bengal Budget

In [170]:
Image('west_bengal_sample.png')
Out[170]:

Gujarat Budget

In [171]:
Image('gj_sample_image.png')
Out[171]:

Categories of Budget Documents

Documents with Table Boundaries

  • Karnataka Budget Documents

Documents without Table Boundaries

  • West Bengal
  • Gujarat

Extracting Tables From Documents with Table Boundaries

Concept: Detect rows and columns from the lines found on the page, then parse the information to extract tables in CSV format.

  • Detect Lines
    • Vertical
    • Horizontal
  • Detect Table Boundaries
  • Parse tables to extract information in CSV format.

Converting PDF Page to Image - I

In [239]:
import cv2
import subprocess

# Converting a pdf page into an Image for processing
def get_page_image_from_pdf(page_num, image_file_name, pdf_file_path):
    '''
    Convert a pdf page into an image using imagemagick's convert utility 
    '''
    command = "convert -density 300 '%s'[%s] '%s'" % (pdf_file_path,
                                                      page_num,
                                                      image_file_name)
    subprocess.check_output(command, shell=True)
    return cv2.imread(image_file_name, 0)

Converting PDF Page to Image - II

In [99]:
kn_page_image = get_page_image_from_pdf(11, 'kn_sample_image.png', kn_pdf_file_path)
plt.figure(figsize=(30,20))
plt.imshow(kn_page_image, cmap='gray')
Out[99]:
<matplotlib.image.AxesImage at 0x7f101ad531d0>

Detecting Lines

To detect lines we use OpenCV's Hough Transform implementation:

cv2.HoughLines(), cv2.HoughLinesP()
In [248]:
Image('lines.jpg')
Out[248]:

Extracting Edges

In [179]:
edges = cv2.Canny(kn_page_image, 120, 150, apertureSize=3)
plot_page(edges)

Detecting Lines From Edges - Hough Transform

In [187]:
lines = cv2.HoughLines(edges,1,np.pi/180,200)
lines[:5]
Out[187]:
array([[[  4.42000000e+02,   1.57079637e+00]],

       [[  2.26000000e+03,   1.57079637e+00]],

       [[  5.35000000e+02,   1.57079637e+00]],

       [[  1.76000000e+03,   1.57079637e+00]],

       [[  1.83500000e+03,   1.57079637e+00]]], dtype=float32)
In [184]:
HTML('<img src="http://opencv-python-tutroals.readthedocs.io/en/latest/_images/houghlinesdemo.gif">')
Out[184]:

Detecting Straight Lines

In [189]:
def get_straight_lines(img, aperture_size=3):
    '''Extract long straight lines using the Probabilistic Hough Transform
    '''
    edges = cv2.Canny(img, 120, 150, apertureSize=aperture_size)
    min_line_length = 100
    max_line_gap = 100
    # Pass the constraints as keyword arguments: positionally, the fifth
    # argument of cv2.HoughLinesP is `lines`, not minLineLength.
    lines = cv2.HoughLinesP(edges, 1, np.pi/180, 80,
                            minLineLength=min_line_length,
                            maxLineGap=max_line_gap)
    return lines


lines = get_straight_lines(kn_page_image)
plot_lines_on_image(lines, kn_page_image)

Extending Lines To Table Boundaries

In [192]:
table_limits = get_table_limits(kn_page_image, lines, False)
plot_page(extend_lines_for_table(kn_page_image, lines, False, table_limits))

Parsing PDF

In [193]:
print("tabula --pages 4 --area top,left,bottom,right --columns c1,c2,c3,c4 path/to/pdf")
Image("spongebob.jpg")
tabula --pages 4 --area top,left,bottom,right --columns c1,c2,c3,c4 path/to/pdf
Out[193]:

Extracting Tables Without Boundaries

In [218]:
Image('west_bengal_sample.png')
Out[218]:

Exercise: Sample 1

In [194]:
Image('subject2.jpg')
Out[194]:

Exercise: Sample 2

In [195]:
Image('subject1.jpg')
Out[195]:

Exercise: Sample 3

In [196]:
Image('subject3.jpg')
Out[196]:

Extracting Geometrical Features

  • Extract Blocks from Page Image.
  • Extract geometrical features of each block.

Extract Blocks From Page Image

We use a modified version of the Run Length Smoothing Algorithm (RLSA).

RLSA converts a binary sequence `x` into `y` based on two rules:

1. 0's in x are changed to 1's if there is a 1 within the next C adjacent pixels.
2. 1's remain unchanged.

C here is a threshold.


source: http://crblpocr.blogspot.in/2007/06/run-length-smoothing-algorithm-rlsa.html
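The two rules above can be sketched in one dimension. This is a simplified illustration, not the talk's code; the `rlsa_2d` used in the next cell is assumed to apply a pass like this along rows and columns with the two thresholds in its tuple argument:

```python
import numpy as np

def rlsa_1d(row, c):
    '''Fill runs of 0's shorter than c that sit between 1's,
    linking nearby black pixels into one run (1-D RLSA sketch).'''
    out = row.copy()
    run_start = None
    for i in range(len(row)):
        if row[i] == 0:
            if run_start is None:
                run_start = i
        else:
            # close a run of 0's bounded by 1's on both sides
            if run_start is not None and run_start > 0 and (i - run_start) < c:
                out[run_start:i] = 1
            run_start = None
    return out
```

With `c = 3`, the short gap is bridged while the long one survives: `rlsa_1d(np.array([1,0,0,1,0,0,0,0,1]), 3)` gives `[1,1,1,1,0,0,0,0,1]`.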

RLSA Image

In [199]:
threshold = (25, 20)
wb_image_rlsa = rlsa_2d(wb_page_image, threshold)
plot_page(wb_image_rlsa)

Can we do better?

In [201]:
def generate_blocks_dilation(img):
    '''Merge nearby text into blocks: invert-threshold the page, then dilate'''
    kernel = np.ones((5, 10), np.uint8)
    ret, thresh1 = cv2.threshold(img, 0, 1, cv2.THRESH_BINARY_INV)
    return cv2.dilate(thresh1, kernel, iterations=5)

plot_page(generate_blocks_dilation(wb_page_image))

Extracting Features Using OpenCV

Use cv2.connectedComponentsWithStats()
In [225]:
import pandas as pd

n_comp, labels, stats, centroids = cv2.connectedComponentsWithStats(wb_image_rlsa)

block_stats = get_block_stats(stats, centroids)
block_stats['right'] = block_stats.left + block_stats.width
block_stats['bottom'] = block_stats.top + block_stats.height
block_stats.head(10)
Out[225]:
left top width height area centroid_x centroid_y right bottom
1 0 0 25 20 500 12.000000 9.500000 25 20
2 130 0 2188 20 43760 1223.500000 9.500000 2318 20
3 0 105 25 54 1350 12.000000 131.500000 25 159
4 906 105 643 54 34128 1226.833304 131.129249 1549 159
5 0 169 25 49 1225 12.000000 193.000000 25 218
6 833 169 789 49 35613 1227.420240 191.262853 1622 218
7 0 242 25 22 550 12.000000 252.500000 25 264
8 143 242 2168 22 47696 1226.500000 252.500000 2311 264
9 0 282 25 54 1350 12.000000 308.500000 25 336
10 1579 282 133 54 5931 1644.686562 307.158827 1712 336
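The `get_block_stats` helper is not defined in the slides. A plausible reconstruction (an assumption, not the talk's actual code), wrapping the arrays returned by `cv2.connectedComponentsWithStats` into a DataFrame with the columns shown above:

```python
import numpy as np
import pandas as pd

def get_block_stats(stats, centroids):
    '''Hypothetical reconstruction: turn the per-component stats and
    centroids from cv2.connectedComponentsWithStats into a DataFrame,
    dropping the background component (label 0).'''
    centroids = np.asarray(centroids)
    # OpenCV's stats columns are LEFT, TOP, WIDTH, HEIGHT, AREA in that order
    block_stats = pd.DataFrame(
        stats, columns=['left', 'top', 'width', 'height', 'area'])
    block_stats['centroid_x'] = centroids[:, 0]
    block_stats['centroid_y'] = centroids[:, 1]
    return block_stats.iloc[1:]  # label 0 is the whole-page background
```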

Layout Analysis

Identifying the layout of the table, where the elements of the table are:

  • Header
  • Values
  • Empty Cells

Reading Text From PDF

We use Poppler's version of `pdftotext`.
In [226]:
def get_text(page_num, pdf_file_name, x, y, w, h):
    command_text = 'pdftotext -enc UTF-8  -f {0} -l {0} -x {1} -y {2} -W {3} -H {4} "{5}" -'
    command = command_text.format(page_num + 1, int(x), 
                                  int(y), int(w),
                                  int(h), pdf_file_name)
    return subprocess.check_output(command, shell=True)

Extracting Text based Features

In [230]:
block_stats = block_stats.apply(get_text_data, axis=1, args=[33, wb_pdf_file_path, horizontal_ratio, vertical_ratio])
block_stats[block_stats.text_length > 0][text_columns].sample(5)
Out[230]:
text text_length comma_separated_numbers_present is_text number
61 1,98,00,000 11 False False 19800000.0
83 ... 3 False False NaN
91 4,96,66,000 11 False False 49666000.0
159 ... 3 False False NaN
105 4,55,65,000 11 False False 45565000.0
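The `number` feature above parses figures written with Indian digit grouping (e.g. `1,98,00,000` → `19800000.0`). A minimal sketch of such a parser; the regex and function name are assumptions, not the talk's code:

```python
import re

# Indian grouping: a leading group of 1-3 digits, then 2- or 3-digit groups
COMMA_NUMBER = re.compile(r'^\d{1,3}(?:,\d{2,3})+$')

def parse_comma_number(text):
    '''Return the numeric value of a comma-grouped figure, else None.'''
    text = text.strip()
    if COMMA_NUMBER.match(text):
        return float(text.replace(',', ''))
    return None
```

Placeholders such as `...` fail the match and come back as `None`, which matches the `NaN` entries in the `number` column above.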

Rule Based Marking

  • Mark Numbers
  • Mark Headers
  • Mark Titles
  • Mark Groupings
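The marking rules themselves are not shown in the slides. One such rule might look like the following sketch, driven by the geometric and text features computed above; the thresholds and labels here are illustrative assumptions, not the talk's actual rules:

```python
def mark_block(row):
    '''Illustrative rule (assumed, not the talk's code): blocks whose text
    is a comma-grouped figure or an ellipsis placeholder are value cells;
    wide blocks near the top of the page are titles.'''
    text = row['text'].strip()
    if row.get('comma_separated_numbers_present') or text == '...':
        return 'cell_values'
    if row['top'] < 250 and row['width'] > 600:
        return 'title'
    return None
```

Applied with `block_stats['label'] = block_stats.apply(mark_block, axis=1)`, this reproduces labellings like the sample above, where `66,33,527` becomes `cell_values`.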
In [233]:
block_stats[['text', 'label']].sample(10)
Out[233]:
text label
137 02-W.B.H.S. 2008 cell_grouping
59 Total - 2013-00-800-NP-001-13 NaN
103 Voted NaN
24 Rs. NaN
52 66,33,527 cell_values
68 65,000 cell_values
134 ... cell_values
82 13,08,000 cell_values
145 70-Deduct Recoveries NaN
167 001-Sumptuary and Other Allowances [CL] NaN
In [244]:
Image('rules.jpg')
Out[244]:

Moving from Rules to Machine Learning

Generating Meta-Data via Logging

  • Raw Images
  • Blocked Images
  • Features Generated
  • Output
In [236]:
block_stats.head()
Out[236]:
area bottom centroid_x centroid_y comma_separated_numbers_present header_index height is_text label left number pos possible_row_merger right text text_length top width
4 34128.0 159.0 1226.833304 131.129249 0.0 NaN 54.0 1.0 title 906.0 NaN 4.0 0.0 1549.0 REVENUE EXPENDITURE 19.0 105.0 643.0
6 35613.0 218.0 1227.420240 191.262853 0.0 NaN 49.0 1.0 title 833.0 NaN 6.0 0.0 1622.0 DETAILED ACCOUNT - MAJOR HEAD 2013 34.0 169.0 789.0
10 5931.0 336.0 1644.686562 307.158827 0.0 NaN 54.0 1.0 NaN 1579.0 NaN 10.0 0.0 1712.0 Budget 6.0 282.0 133.0
11 6180.0 328.0 1927.556472 305.665210 0.0 NaN 46.0 1.0 NaN 1856.0 NaN 11.0 0.0 2001.0 Revised 7.0 282.0 145.0
12 5931.0 336.0 2222.686562 307.158827 0.0 NaN 54.0 1.0 NaN 2157.0 NaN 12.0 0.0 2290.0 Budget 6.0 282.0 133.0

Opportunities

  • Bayesian Models to predict labels for each block.
  • Recurrent Neural Nets to mark blocks.
In [243]:
print('Source: https://twitter.com/fatiichi/status/466398802476548096')
Image('googly_eyes.jpg')
Source: https://twitter.com/fatiichi/status/466398802476548096
Out[243]:

Open By Default - Open Research

  • Reproducible
  • Understandable
  • Collaborative

Contribute

Help us to:

  • Generate more Open Budget Data
  • Improve our Algorithms & Codebase
  • Cover Budget Data in your Geography
  • Refine our Designs & much more!

We are open to new ideas, suggestions, and feedback.

Open For Questions

In [238]:
Image('static_qr_code_without_logo.jpg')
Out[238]: