Using Tesseract 4 with C#

Recently I built a small tool to read the text of thousands of images.

Introduction

A common technique to extract text from images is known as OCR (optical character recognition), and the best implementation that I know of is called Tesseract.

When I started to build the tool, I used the most popular Tesseract wrapper for .NET.

Although the wrapper worked very well, I was curious whether there was a way to get better performance. After a little searching, I noticed that the .NET wrapper still used Tesseract 3, while version 4 was already available with a lot of performance improvements:

If you are running Tesseract 4, you can use the “fast” integer models.

Tesseract 4 also uses up to four CPU threads while processing a page, so it will be faster than Tesseract 3

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-increase-speed-of-ocr


So I decided to try Tesseract 4 to see how it would affect the performance of my tool. As there was no .NET wrapper for it at the time, I removed the old wrapper and called Tesseract 4 directly as a process.

Using Tesseract 4 cut the time to read the images almost in half.

TesseractService

I ended up developing the class below to call the Tesseract 4 command-line tool (tesseract.exe) directly from C# code.

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

namespace Ocr
{
    /// <summary>
    /// Service to read text from images through the Tesseract OCR engine.
    /// </summary>
    public class TesseractService
    {
        private readonly string _tesseractExePath;
        private readonly string _language;

        /// <summary>
        /// Initializes a new instance of the <see cref="TesseractService"/> class.
        /// </summary>
        /// <param name="tesseractDir">The path of the Tesseract 4 installation folder (C:\Program Files\Tesseract-OCR).</param>
        /// <param name="language">The language used to extract text from images (eng, por, etc).</param>
        /// <param name="dataDir">The folder with the trained models (tessdata). Download the models from https://github.com/tesseract-ocr/tessdata_fast</param>
        public TesseractService(string tesseractDir, string language = "eng", string dataDir = null)
        {
            // Tesseract configs.
            _tesseractExePath = Path.Combine(tesseractDir, "tesseract.exe");
            _language = language;

            if (string.IsNullOrEmpty(dataDir))
                dataDir = Path.Combine(tesseractDir, "tessdata");

            // Note: this sets the variable for the whole current process.
            Environment.SetEnvironmentVariable("TESSDATA_PREFIX", dataDir);
        }

        /// <summary>
        /// Reads the text from the image streams.
        /// </summary>
        /// <param name="images">The image streams. They are closed after being read.</param>
        /// <returns>The text of the images.</returns>
        public string GetText(params Stream[] images)
        {
            var output = string.Empty;

            if (images.Any())
            {
                var tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
                Directory.CreateDirectory(tempPath);
                var tempInputFile = NewTempFileName(tempPath);
                var tempOutputFile = NewTempFileName(tempPath);

                try
                {
                    WriteInputFiles(images, tempPath, tempInputFile);

                    var info = new ProcessStartInfo
                    {
                        FileName = _tesseractExePath,
                        // Quote the paths: the temp folder may contain spaces.
                        Arguments = $"\"{tempInputFile}\" \"{tempOutputFile}\" -l {_language}",
                        RedirectStandardError = true,
                        RedirectStandardOutput = true,
                        CreateNoWindow = true,
                        UseShellExecute = false
                    };

                    using (var ps = Process.Start(info))
                    {
                        // Drain both redirected pipes before waiting; otherwise a full
                        // pipe buffer can block Tesseract and WaitForExit never returns.
                        var stdoutTask = ps.StandardOutput.ReadToEndAsync();
                        var stderr = ps.StandardError.ReadToEnd();
                        ps.WaitForExit();

                        if (ps.ExitCode == 0)
                        {
                            // Tesseract appends ".txt" to the output base name.
                            output = File.ReadAllText(tempOutputFile + ".txt");
                        }
                        else
                        {
                            throw new InvalidOperationException(stderr);
                        }
                    }
                }
                finally
                {
                    Directory.Delete(tempPath, true);
                }
            }

            return output;
        }

        private static void WriteInputFiles(Stream[] inputStreams, string tempPath, string tempInputFile)
        {
            // If there is more than one image, save each one to its own file and
            // use a list file (one image path per line) as the input file.
            if (inputStreams.Length > 1)
            {
                var imagesListFileContent = new StringBuilder();

                foreach (var inputStream in inputStreams)
                {
                    var imageFile = NewTempFileName(tempPath);

                    using (var tempStream = File.OpenWrite(imageFile))
                    {
                        CopyStream(inputStream, tempStream);
                    }

                    imagesListFileContent.AppendLine(imageFile);
                }

                File.WriteAllText(tempInputFile, imagesListFileContent.ToString());
            }
            else
            {
                // If there is only one image, then use the image file itself as the input file.
                using (var tempStream = File.OpenWrite(tempInputFile))
                {
                    CopyStream(inputStreams.First(), tempStream);
                }
            }
        }

        private static void CopyStream(Stream input, Stream output)
        {
            // Rewind when possible so the whole image is copied.
            if (input.CanSeek)
                input.Seek(0, SeekOrigin.Begin);

            input.CopyTo(output);
            input.Close();
        }

        private static string NewTempFileName(string tempPath)
        {
            return Path.Combine(tempPath, Guid.NewGuid().ToString());
        }
    }
}
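The command line that the class sends to tesseract.exe has the form "input file, output base name, -l language". As a rough sketch, the same argument string could be built by a small helper that quotes both paths (so temp folders with spaces survive) and optionally appends Tesseract's --psm page segmentation option. The names TesseractArguments, Build and pageSegMode are my own for illustration and are not part of the class above.

```csharp
using System;

namespace Ocr
{
    // Hypothetical helper (not part of TesseractService): builds the
    // argument string for "tesseract <input> <outputbase> -l <lang>".
    public static class TesseractArguments
    {
        public static string Build(string inputFile, string outputBase, string language, int? pageSegMode = null)
        {
            var args = $"{Quote(inputFile)} {Quote(outputBase)} -l {language}";

            // --psm is optional; when omitted, Tesseract uses its default mode.
            if (pageSegMode.HasValue)
                args += $" --psm {pageSegMode.Value}";

            return args;
        }

        // Wraps a path in double quotes so that spaces do not split it
        // into several arguments.
        private static string Quote(string path)
        {
            if (path == null)
                throw new ArgumentNullException(nameof(path));

            // A quote inside the path, or a trailing backslash that would
            // escape the closing quote, cannot be handled by plain quoting.
            if (path.Contains("\"") || path.EndsWith("\\", StringComparison.Ordinal))
                throw new ArgumentException("Path cannot be quoted safely.", nameof(path));

            return "\"" + path + "\"";
        }
    }
}
```

With this in place, the Arguments property could be set with something like TesseractArguments.Build(tempInputFile, tempOutputFile, _language).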

Setup
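Judging by what the constructor expects, the setup boils down to two things: a Tesseract 4 installation folder that contains tesseract.exe, and a tessdata folder that contains the trained model for the chosen language (the models can be downloaded from https://github.com/tesseract-ocr/tessdata_fast). Below is a minimal sketch of a sanity check for that setup, assuming the usual "language code + .traineddata" file naming. TesseractSetup and FindProblems are hypothetical names, not part of the class above.

```csharp
using System.Collections.Generic;
using System.IO;

namespace Ocr
{
    // Hypothetical helper (not part of TesseractService): reports what
    // is missing from a Tesseract installation before OCR is attempted.
    public static class TesseractSetup
    {
        public static IList<string> FindProblems(string tesseractDir, string language, string dataDir = null)
        {
            var problems = new List<string>();

            var exePath = Path.Combine(tesseractDir, "tesseract.exe");
            if (!File.Exists(exePath))
                problems.Add("Executable not found: " + exePath);

            // Same default as the TesseractService constructor.
            if (string.IsNullOrEmpty(dataDir))
                dataDir = Path.Combine(tesseractDir, "tessdata");

            // Tesseract accepts several languages joined with '+', e.g. "eng+por".
            // Assumption: each one needs its own <code>.traineddata file.
            foreach (var code in language.Split('+'))
            {
                var modelPath = Path.Combine(dataDir, code + ".traineddata");
                if (!File.Exists(modelPath))
                    problems.Add("Trained model not found: " + modelPath);
            }

            return problems;
        }
    }
}
```

An empty list means both pieces were found. Otherwise each entry names the missing file, which is usually easier to act on than an error coming back from the tesseract.exe process.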

Usage

var service = new TesseractService(@"C:\Program Files\Tesseract-OCR", "eng", @"C:\Program Files\Tesseract-OCR\tessdata");

// The image can come from any readable stream, for example:
// var stream = File.OpenRead(path);
// var stream = WebRequest.Create(url).GetResponse().GetResponseStream();
// var stream = new MemoryStream(buffer);
var text = service.GetText(stream);
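Since the tool had to read thousands of images, the single call above would normally sit inside a loop. Here is a minimal sketch of such a loop, assuming that one unreadable image should not abort the whole batch. BatchReader, ReadAll and onError are hypothetical names of mine, and the OCR call is passed in as a delegate so the loop does not depend on Tesseract being installed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

namespace Ocr
{
    // Hypothetical helper (not part of TesseractService): reads many
    // image files one by one and collects the text of each.
    public static class BatchReader
    {
        // readText is expected to be something like: s => service.GetText(s)
        public static IDictionary<string, string> ReadAll(
            IEnumerable<string> imagePaths,
            Func<Stream, string> readText,
            Action<string, Exception> onError = null)
        {
            var results = new Dictionary<string, string>();

            foreach (var path in imagePaths)
            {
                try
                {
                    using (var stream = File.OpenRead(path))
                    {
                        results[path] = readText(stream);
                    }
                }
                // GetText throws InvalidOperationException when tesseract.exe
                // fails; IOException covers missing or locked files.
                catch (Exception ex) when (ex is InvalidOperationException || ex is IOException)
                {
                    if (onError != null)
                        onError(path, ex);
                }
            }

            return results;
        }
    }
}
```

A call could look like BatchReader.ReadAll(Directory.EnumerateFiles(folder, "*.png"), s => service.GetText(s)), where folder is a placeholder for wherever the images live. The loop is deliberately sequential: according to the FAQ quoted earlier, Tesseract 4 already uses up to four CPU threads per page, so running many processes in parallel may not help.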

If you try to read an image like this one:

post image


You will get this result after calling the TesseractService.GetText method:

The (quick) [brown] {fox} jumps! Over the $43,456.78 <lazy> #90 dog & duck/goose, as 12.5% of E-mail from aspammer@website.com is spam. Der ,.schnelle" braune Fuchs springt iiber den faulen Hund. Le renard brun «rapide» saute par-dessus le chien paresseux. La volpe marrone rapida salta sopra il cane pigro. El zorro marron rapido salta sobre el perro perezoso. A raposa marrom rapida salta sobre o céo preguicoso.
