Voters list of India: data extraction, standardization, transliteration

Apriori Data approached Nannostomus to collect comprehensive data on registered voters from the Indian electoral authorities across all states and administrative divisions. They also needed the data cleaned and normalized, then cross-checked against and enriched with data from India Post. The final goal was to aggregate everything into a single, easily accessible file and update it yearly.

At a glance.

Duration: 6 months + annual updates

Team: 5 developers

Number of processed records: 1 billion+

Languages: 22

Location: India

The task at hand was to:

1. Gather data on all registered voters aged 18 and older from 29 Indian states and 7 union territories.

2. Clean and standardize data fields. Verify data and add new fields from India Post.

3. Provide English transliteration for data available in 22 official languages.

4. Conduct yearly updates.

The key challenges of the India voters list project

Large data volume.

Processing 1 billion records required immense computational power and storage capacity. Additionally, managing the speed and efficiency of data retrieval, updating, and backup operations posed technical difficulties.

Validation.

We cross-referenced data from Indian electoral authorities with information from India Post. In particular, we had to verify addresses and names to deliver an accurate dataset. This was difficult because the two sources used different data formats and structures.

Standardization.

The raw source data came in various formats, layouts, and languages. Most records were in PDFs, but there were also photos of handwritten voter forms in languages not supported by common OCR solutions. All this required extensive normalization and a creative approach to data recognition.

Transliteration.

The data in the India voters list was available in native languages. In total, we processed 22 languages. Punjabi was the trickiest one, as none of the OCR tools we tried could extract its text from images. We preserved the original language and converted non-English data into Roman characters, which demanded advanced linguistic expertise and machine transliteration algorithms.
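With 22 languages in play, a useful first step before OCR post-processing or transliteration is to work out which script a piece of extracted text is written in, so it can be routed to the right language-specific logic. The snippet below is a minimal Python sketch of that idea, not the project's actual C# code. The function name `detect_script` is hypothetical, and the approach (reading the script from Unicode character names) is an assumption chosen for illustration.

```python
import unicodedata
from collections import Counter


def detect_script(text: str) -> str:
    """Return a rough label for the dominant script among the letters in `text`.

    Hypothetical helper: the label is the first word of each letter's Unicode
    name, e.g. 'GURMUKHI' for Punjabi text or 'DEVANAGARI' for Hindi text.
    For scripts with multi-word names the label is only the first word.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            # Skips digits, spaces, punctuation, and combining vowel signs.
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


if __name__ == "__main__":
    for sample in ["ਪੰਜਾਬ", "राम कुमार", "Ram Kumar"]:
        print(sample, "->", detect_script(sample))
```

A label such as `GURMUKHI` could then select the matching OCR model or transliteration table, while `LATIN` records would skip transliteration altogether.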


What we did to create the India voters list

Data collection

Created modules that we run at Nannostomus to collect PDFs and images with voter data from electoral authorities across India.

Machine training

Hired linguistic experts to develop algorithms for teaching virtual machines to recognize 22 Indian languages.

Data extraction

Wrote code to extract data from PDFs and applied OCR to read images that contained voter forms.

Data standardization

Cleaned, normalized, and transliterated the extracted data to ensure consistency within a unified India voter database.

Data validation

Cross-referenced the standardized data with India Post addresses to verify accuracy and completeness.

Annual updates

Monitored data release schedules and updated the database annually to include the latest information from Electoral Authorities and India Post.
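The standardization and validation steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the C#/.NET code used in the project: the helpers `normalize`, `extract_pin`, and `validate_address` are hypothetical, the `POSTAL_REFERENCE` dictionary is a tiny sample standing in for an India Post lookup, and fuzzy token matching from the standard library's `difflib` is just one possible way to tolerate spelling variants introduced by OCR or transliteration.

```python
import re
from difflib import get_close_matches

# Indian PIN codes have six digits and never start with zero.
PIN_RE = re.compile(r"\b[1-9]\d{5}\b")

# Hypothetical sample standing in for an India Post reference table:
# PIN code -> post office or locality names served by that PIN.
POSTAL_REFERENCE = {
    "110001": ["Connaught Place", "Parliament Street"],
}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def extract_pin(address: str):
    """Return the first six-digit PIN found in the address, or None."""
    match = PIN_RE.search(address)
    return match.group(0) if match else None


def validate_address(address: str, reference: dict) -> dict:
    """Classify an address as verified, pin_only, unknown_pin, or missing_pin."""
    pin = extract_pin(address)
    if pin is None:
        return {"pin": None, "status": "missing_pin", "office": None}
    offices = reference.get(pin)
    if not offices:
        return {"pin": pin, "status": "unknown_pin", "office": None}
    tokens = normalize(address).split()
    for office in offices:
        # Every word of the office name must closely match some address token.
        office_tokens = normalize(office).split()
        if all(get_close_matches(t, tokens, n=1, cutoff=0.8) for t in office_tokens):
            return {"pin": pin, "status": "verified", "office": office}
    return {"pin": pin, "status": "pin_only", "office": None}


if __name__ == "__main__":
    print(validate_address("12, Conaught Place, New Delhi - 110001", POSTAL_REFERENCE))
```

In this sketch a misspelled locality such as "Conaught" still matches its reference entry, while records whose PIN is missing or unknown are flagged for review rather than silently dropped.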

Technologies

Nannostomus
Tesseract OCR
C#
.NET
AWS

Deliverables

Our client received a centralized digital India voter ID database with over one billion records aggregated from 36 sources. Data is available in native languages and transliterated into English.

We scraped public sources to collect PDF files and images containing information about Indian voters. Here are the data points you'll find in our India voters list (each is available in the native language and in English transliteration):

  • Voter name
  • Relation's name
  • EPIC number
  • Address
  • Age
  • Sex
  • Year of birth
  • Year of electoral roll revision
  • Polling station name

Nannostomus delivered a CSV file with a standardized and enhanced India voters list. The file contained 63 data fields, fully verified and normalized.

  • Standardized and enhanced voter data with full history, including past addresses and name changes.
  • Data available in 22 languages with machine transliteration applied.
  • Accurate and validated data through cross-referencing with India Post.