OCR-Tesseract in Spring Boot 2024
1. Introduction to OCR
What is OCR?
- OCR stands for Optical Character Recognition.
- It is a technology used to convert different types of documents like scanned paper, PDFs, or images into editable and searchable data.
Common Use Cases:
- Converting scanned documents into editable text.
- Extracting text from images.
- Digitizing printed information (e.g., invoices, IDs).
2. Introduction to Tesseract and PDFBox
What is Tesseract?
- Tesseract is an open-source OCR engine originally developed at Hewlett-Packard and later sponsored by Google.
- It can recognize more than 100 languages.
Why Use Tesseract with Spring Boot?
- Integration with Spring Boot allows automation of text extraction from images for web applications.
- It is highly flexible and customizable for different projects.
What is PDFBox and why should we use it?
- Apache PDFBox is an open-source Java library for working with PDF documents.
- It can convert PDF pages into images, which can then be processed by Tesseract.
3. OCR in Spring Boot
Step 1: Add Tesseract Dependency
- Add Tesseract OCR library to your Spring Boot project.
- Use the following dependency in pom.xml for Maven projects:
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>4.5.5</version>
</dependency>
Step 2: Add PDFBox Dependency
- Add PDFBox library to your Spring Boot project.
- Use the following dependency in pom.xml for Maven projects:
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.27</version>
</dependency>
Step 3: Install Tesseract
- Install Tesseract OCR software on your machine.
- Set up Tesseract in the system path so that it can be accessed from your Spring Boot application.
- Download language data files if working with non-English text.
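Before wiring Tesseract into the application, it can help to confirm which language data files are actually installed. The following is a minimal sketch using only the JDK; the class name `TessdataCheck` and its method are hypothetical, and it assumes the usual layout in which each language is a file named `<code>.traineddata` inside the tessdata directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Tess4J: inspects a tessdata directory.
final class TessdataCheck {

    private static final String SUFFIX = ".traineddata";

    /** Returns the sorted language codes (e.g. "eng", "hin") found in the directory. */
    static List<String> installedLanguages(Path tessdataDir) throws IOException {
        if (!Files.isDirectory(tessdataDir)) {
            throw new IOException("tessdata directory not found: " + tessdataDir);
        }
        // Files.list returns a lazily populated stream that must be closed
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Calling `TessdataCheck.installedLanguages(Paths.get(...))` with your own tessdata path at startup gives an early, readable failure instead of an OCR error on the first request.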
Step 4: Implement OCR Service
- Create a service in Spring Boot to handle OCR processing.
- Example code for OCR service:
package com.example.HotelBooking.Service;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

@Service
public class OCRService {

    public String extractTextFromImage(MultipartFile file) throws IOException, TesseractException {
        Tesseract tesseract = new Tesseract();
        // Set the path to your tessdata directory (adjust for your installation)
        tesseract.setDatapath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");
        // Tell Tesseract the image resolution so it does not have to guess it
        tesseract.setTessVariable("user_defined_dpi", "300");

        // Save the uploaded file to a temporary location
        Path tempFile = Files.createTempFile("tempImage", file.getOriginalFilename());
        try {
            file.transferTo(tempFile.toFile());
            // Perform OCR on the image
            return tesseract.doOCR(tempFile.toFile());
        } finally {
            // Delete the temporary file even if OCR fails
            Files.deleteIfExists(tempFile);
        }
    }
}
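The service above passes the uploaded file's original name as the temp-file suffix. A name that contains path separators may be rejected by `Files.createTempFile`, so one option is to keep only a short, safe extension. This is a sketch under that assumption; `TempSuffix` is a hypothetical helper and the five-character extension limit is an arbitrary choice.

```java
import java.util.Locale;

// Hypothetical helper: derives a safe temp-file suffix from an uploaded file name.
final class TempSuffix {

    static String from(String originalFilename) {
        if (originalFilename == null) {
            return ".tmp";
        }
        // Drop any directory part, whichever separator the client used
        String name = originalFilename.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);

        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return ".tmp"; // no extension present
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Accept only short alphanumeric extensions; fall back otherwise
        return ext.matches("[a-z0-9]{1,5}") ? "." + ext : ".tmp";
    }
}
```

With this helper, the call would become `Files.createTempFile("tempImage", TempSuffix.from(file.getOriginalFilename()))`.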
Step 5: Implement PDFBox Service
- Create a PDFBox service in Spring Boot that reads a PDF's text layer and renders its pages to images for OCR.
- Example code for the PDF service:
package com.example.HotelBooking.Service;

import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.*;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

@Service
public class PDFService {

    // Extracts the embedded text layer; the result is blank for scanned (image-only) PDFs
    public String extractTextFromPDF(MultipartFile file) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(file.getInputStream())) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            return pdfStripper.getText(document);
        }
    }
    // Renders each page to an image and runs OCR on it; the caller supplies the OCR service
    public List<String> extractTextFromPDFImages(MultipartFile file, OCRService ocrService) throws IOException, TesseractException {
        // try-with-resources closes the document even if rendering or OCR fails
        try (PDDocument document = PDDocument.load(file.getInputStream())) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            List<String> ocrResults = new ArrayList<>();
            for (int page = 0; page < document.getNumberOfPages(); page++) {
                // 300 DPI for better quality
                BufferedImage image = pdfRenderer.renderImageWithDPI(page, 300);
                MultipartFile imageFile = createMultipartFileFromImage(image, "page-" + page + ".jpg");
                // Use OCR service to extract text from image
                String ocrText = ocrService.extractTextFromImage(imageFile);
                ocrResults.add(ocrText);
            }
            return ocrResults;
        }
    }
    // Wraps a rendered page image in an in-memory MultipartFile so OCRService can consume it
    public MultipartFile createMultipartFileFromImage(BufferedImage image, String fileName) throws IOException {
        // Convert BufferedImage to byte array
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(image, "jpg", baos);
        byte[] imageBytes = baos.toByteArray();
        // Create and return MultipartFile
        return new MultipartFile() {
            @Override
            public String getName() {
                return fileName;
            }

            @Override
            public String getOriginalFilename() {
                return fileName;
            }

            @Override
            public String getContentType() {
                return "image/jpeg";
            }

            @Override
            public boolean isEmpty() {
                return imageBytes == null || imageBytes.length == 0;
            }

            @Override
            public long getSize() {
                return imageBytes.length;
            }

            @Override
            public byte[] getBytes() throws IOException {
                return imageBytes;
            }

            @Override
            public InputStream getInputStream() throws IOException {
                return new ByteArrayInputStream(imageBytes);
            }

            @Override
            public void transferTo(File dest) throws IOException {
                Files.write(dest.toPath(), imageBytes);
            }
        };
    }
}
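Rendering at 300 DPI has a memory cost worth estimating before processing large PDFs. PDF page sizes are expressed in points (72 per inch), so the pixel size grows linearly with the DPI. The sketch below works through that arithmetic; `RenderSizeEstimate` is a hypothetical helper, and the 4 bytes per pixel figure is an assumption that fits an int-packed RGB image, not a value read from PDFBox.

```java
// Hypothetical helper: estimates the size of a rendered PDF page.
final class RenderSizeEstimate {

    static final double POINTS_PER_INCH = 72.0;

    /** Converts a page dimension in points to pixels at the given DPI. */
    static int pixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Approximate uncompressed size of the rendered image in bytes. */
    static long approxBytes(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }
}
```

For an A4 page (about 595.28 x 841.89 points), `pixels(595.28, 300)` gives 2480 and `pixels(841.89, 300)` gives 3508. At the assumed 4 bytes per pixel that is 34,799,360 bytes, roughly 33 MiB per page, before the JPEG encoding step.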
Step 6: Implement a Common Service to Check Whether the Document Is a PDF or an Image
- Create a service method that checks whether the stored document is a PDF or an image and routes it to the matching extractor.
- Example code:
// Method of the service class that the controller calls as userService;
// userRepository, pdfService and ocrService are fields of that class
public String getAadhaarCardData(Long id) throws IOException, TesseractException {
    User user = userRepository.findById(id)
            .orElseThrow(() -> new ResourceNotFoundException("User not found"));
    String aadhaarData = null;
    if (user.getAadhaarCardPhoto() != null) {
        // Decode the Base64 encoded string to bytes
        byte[] photoBytes = Base64.getDecoder().decode(user.getAadhaarCardPhoto());
        if (isPDF(photoBytes)) {
            // Create a MultipartFile for the PDF
            MultipartFile pdfFile = createMultipartFile(photoBytes, "aadhaar.pdf");
            // Try the embedded text layer first
            aadhaarData = pdfService.extractTextFromPDF(pdfFile);
            if (aadhaarData.trim().isEmpty()) {
                // No text layer (scanned PDF): fall back to OCR on the rendered pages
                List<String> ocrResults = pdfService.extractTextFromPDFImages(pdfFile, ocrService);
                aadhaarData = String.join("\n", ocrResults);
            }
        } else {
            MultipartFile imageFile = createMultipartFile(photoBytes, "aadhaar.jpg");
            aadhaarData = ocrService.extractTextFromImage(imageFile);
        }
    }
    return aadhaarData;
}
private boolean isPDF(byte[] fileBytes) {
    // Simple check for PDF by its magic number: the file starts with the bytes "%PDF"
    return fileBytes.length >= 4
            && fileBytes[0] == 0x25   // %
            && fileBytes[1] == 0x50   // P
            && fileBytes[2] == 0x44   // D
            && fileBytes[3] == 0x46;  // F
}
private MultipartFile createMultipartFile(byte[] fileBytes, String fileName) {
    return new MultipartFile() {
        @Override
        public String getName() {
            return fileName;
        }

        @Override
        public String getOriginalFilename() {
            return fileName;
        }

        @Override
        public String getContentType() {
            return null;
        }

        @Override
        public boolean isEmpty() {
            return fileBytes == null || fileBytes.length == 0;
        }

        @Override
        public long getSize() {
            return fileBytes.length;
        }

        @Override
        public byte[] getBytes() throws IOException {
            return fileBytes;
        }

        @Override
        public InputStream getInputStream() throws IOException {
            return new ByteArrayInputStream(fileBytes);
        }

        @Override
        public void transferTo(File dest) throws IOException, IllegalStateException {
            Files.write(dest.toPath(), fileBytes);
        }
    };
}
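The method above returns the raw OCR text. If only the card number is needed, a regular expression can pull it out of that text. The sketch below assumes the number is printed as 12 digits, optionally grouped in fours with spaces or hyphens. `IdNumberExtractor` is a hypothetical helper and it does not validate the number in any way.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds a 12-digit number in OCR output.
final class IdNumberExtractor {

    // Three groups of four digits, optionally separated by a space or hyphen,
    // and not embedded in a longer run of digits
    private static final Pattern TWELVE_DIGITS =
            Pattern.compile("(?<!\\d)(\\d{4})[ -]?(\\d{4})[ -]?(\\d{4})(?!\\d)");

    /** Returns the first 12-digit number found, with separators removed. */
    static Optional<String> firstTwelveDigitNumber(String ocrText) {
        if (ocrText == null) {
            return Optional.empty();
        }
        Matcher m = TWELVE_DIGITS.matcher(ocrText);
        return m.find()
                ? Optional.of(m.group(1) + m.group(2) + m.group(3))
                : Optional.empty();
    }
}
```

OCR output often confuses similar glyphs (for example `O` and `0`), so a missing match does not prove the number is absent from the image.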
Step 7: Create Controller
- Set up a REST controller endpoint that returns the text extracted from a user's stored image or PDF.
- Example code for OCR controller:
@GetMapping("/userAadhaarData/{id}")
public ResponseEntity<String> getAadhaarCardData(@PathVariable Long id) {
    try {
        String aadhaarData = userService.getAadhaarCardData(id);
        if (aadhaarData != null) {
            return ResponseEntity.ok("Extracted Aadhaar Data: " + aadhaarData);
        } else {
            return ResponseEntity.status(404).body("Aadhaar data could not be extracted or found.");
        }
    } catch (IOException | TesseractException e) {
        return ResponseEntity.status(500).body("Error while processing Aadhaar data: " + e.getMessage());
    }
}
4. Handling Different Image Types and Formats
- Supported Formats: Tesseract supports various image formats like PNG, JPEG, and TIFF.
- Ensure the uploaded file is validated for type before processing.
- Handling PDFs: If the input is a PDF, you can use an external library (e.g., PDFBox) to convert the PDF pages into images and then apply OCR.
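Type validation is more reliable when it inspects the file's leading bytes rather than trusting the extension or the client-supplied content type. The following is a minimal sketch of that idea for the formats listed above; `FileTypeSniffer` and its `Kind` enum are hypothetical names, and the check covers only the well-known signatures of these four formats.

```java
// Hypothetical helper: identifies a file by its leading "magic" bytes.
final class FileTypeSniffer {

    enum Kind { PDF, PNG, JPEG, TIFF, UNKNOWN }

    static Kind detect(byte[] data) {
        if (startsWith(data, 0x25, 0x50, 0x44, 0x46)) {                          // "%PDF"
            return Kind.PDF;
        }
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) {  // PNG signature
            return Kind.PNG;
        }
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {                                // JPEG start-of-image
            return Kind.JPEG;
        }
        if (startsWith(data, 0x49, 0x49, 0x2A, 0x00)                             // TIFF, little-endian
                || startsWith(data, 0x4D, 0x4D, 0x00, 0x2A)) {                   // TIFF, big-endian
            return Kind.TIFF;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int... signature) {
        if (data == null || data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could reject `Kind.UNKNOWN` up front and send `Kind.PDF` to the PDF path and the image kinds to the OCR path.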
5. Challenges & Solutions
● Accuracy Issues:
- OCR accuracy depends on image quality, font size, and clarity.
- Solution: Preprocess the image (e.g., converting to grayscale, increasing contrast).
● Language Support:
- Solution: Use Tesseract language data packs (e.g., for languages like Hindi, Chinese).
● Image Size and Performance:
- Solution: Limit file size and optimize images before processing.
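The preprocessing mentioned above (grayscale conversion and a contrast boost) can be sketched with the JDK's own imaging classes. `OcrPreprocessor` is a hypothetical helper, and the scale and offset values a caller passes are tuning assumptions that depend on the input images.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.RescaleOp;

// Hypothetical helper: simple image clean-up before OCR.
final class OcrPreprocessor {

    /** Draws the source into a new 8-bit grayscale image of the same size. */
    static BufferedImage toGrayscale(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /**
     * Applies newValue = oldValue * scale + offset to every pixel, clamped to 0..255.
     * A scale above 1 with a negative offset stretches the contrast.
     */
    static BufferedImage adjustContrast(BufferedImage gray, float scale, float offset) {
        RescaleOp op = new RescaleOp(scale, offset, null);
        return op.filter(gray, null);
    }
}
```

A typical call chain would be `adjustContrast(toGrayscale(image), 1.5f, -40f)`, with the result written to a file or passed to Tesseract. Those two numbers are illustrative only and should be checked against real samples.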
6. Testing and Output
● Testing OCR:
- Use Postman or a frontend to upload images and test the OCR process.
- Verify that the extracted text matches the content in the image.
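Checking the match by eye does not scale past a few samples. One common way to quantify it is a character error rate: the edit distance between the expected and extracted text divided by the expected length. The sketch below is one way to compute it; `OcrAccuracy` is a hypothetical helper, and collapsing whitespace before comparing is an assumption about what counts as a meaningful difference.

```java
// Hypothetical helper: scores OCR output against a known reference text.
final class OcrAccuracy {

    /** Levenshtein distance: minimum single-character inserts, deletes and substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Trims and collapses runs of whitespace so layout differences are ignored. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /** 0.0 means identical after normalization; higher values mean more errors. */
    static double characterErrorRate(String expected, String actual) {
        String e = normalize(expected);
        String a = normalize(actual);
        if (e.isEmpty()) {
            return a.isEmpty() ? 0.0 : 1.0;
        }
        return (double) editDistance(e, a) / e.length();
    }
}
```

In a test, the expected text for each sample image can be stored next to it and the test can fail when the rate rises above a threshold you choose.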
7. Conclusion
● Summary:
- OCR is a powerful tool for extracting text from images and can be easily integrated into Spring Boot using Tesseract.
- This solution can automate text extraction and improve efficiency in many applications.
● Future Enhancements:
- Explore using deep learning models for more accurate OCR.
- Add more features like document classification, text correction, etc.