Using Tesseract 4 with C#

Recently I built a small tool to read the text of thousands of images.
Introduction
A common technique for extracting text from images is known as OCR (optical character recognition), and the best implementation that I know of is Tesseract.
When I started to build the tool, I used the most popular Tesseract wrapper for .NET.
Although the wrapper worked very well, I was curious whether there was a way to get better performance. After a little searching, I noticed that the .NET wrapper still used Tesseract 3, while version 4 was already available with a lot of performance improvements:
If you are running Tesseract 4, you can use the “fast” integer models.
Tesseract 4 also uses up to four CPU threads while processing a page, so it will be faster than Tesseract 3
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-increase-speed-of-ocr
So I decided to try Tesseract 4 to see how it would affect the performance of my tool. Since there was no .NET wrapper for it at the time, I removed the old wrapper and called Tesseract 4 directly as a process.
Using Tesseract 4 cut the time needed to read the images almost in half.
TesseractService
I ended up developing the class below to call the Tesseract 4 command-line tool (tesseract.exe) directly from C# code.
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

namespace Ocr
{
    /// <summary>
    /// Service to read text from images through the Tesseract OCR engine.
    /// </summary>
    public class TesseractService
    {
        private readonly string _tesseractExePath;
        private readonly string _language;

        /// <summary>
        /// Initializes a new instance of the <see cref="TesseractService"/> class.
        /// </summary>
        /// <param name="tesseractDir">The path of the Tesseract 4 installation folder (C:\Program Files\Tesseract-OCR).</param>
        /// <param name="language">The language used to extract text from images (eng, por, etc.).</param>
        /// <param name="dataDir">The folder with the trained models (tessdata). Download the models from https://github.com/tesseract-ocr/tessdata_fast</param>
        public TesseractService(string tesseractDir, string language = "eng", string dataDir = null)
        {
            // Tesseract configs.
            _tesseractExePath = Path.Combine(tesseractDir, "tesseract.exe");
            _language = language;

            if (String.IsNullOrEmpty(dataDir))
                dataDir = Path.Combine(tesseractDir, "tessdata");

            // tesseract.exe inherits this variable and uses it to locate the trained models.
            Environment.SetEnvironmentVariable("TESSDATA_PREFIX", dataDir);
        }

        /// <summary>
        /// Reads text from the image streams. The streams are closed after being read.
        /// </summary>
        /// <param name="images">The image streams.</param>
        /// <returns>The text of the images.</returns>
        public string GetText(params Stream[] images)
        {
            var output = string.Empty;

            if (images.Any())
            {
                // Work in a unique temp folder so concurrent calls do not collide.
                var tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
                Directory.CreateDirectory(tempPath);
                var tempInputFile = NewTempFileName(tempPath);
                var tempOutputFile = NewTempFileName(tempPath);

                try
                {
                    WriteInputFiles(images, tempPath, tempInputFile);

                    var info = new ProcessStartInfo
                    {
                        FileName = _tesseractExePath,
                        // Quote the paths: the temp folder may contain spaces.
                        Arguments = $"\"{tempInputFile}\" \"{tempOutputFile}\" -l {_language}",
                        RedirectStandardError = true,
                        RedirectStandardOutput = true,
                        CreateNoWindow = true,
                        UseShellExecute = false
                    };

                    using (var ps = Process.Start(info))
                    {
                        // Drain both redirected pipes before waiting; otherwise a full
                        // pipe buffer can block tesseract.exe and WaitForExit never returns.
                        var stdoutTask = ps.StandardOutput.ReadToEndAsync();
                        var stderr = ps.StandardError.ReadToEnd();
                        stdoutTask.Wait();
                        ps.WaitForExit();

                        if (ps.ExitCode == 0)
                        {
                            // Tesseract appends ".txt" to the output base name.
                            output = File.ReadAllText(tempOutputFile + ".txt");
                        }
                        else
                        {
                            throw new InvalidOperationException(stderr);
                        }
                    }
                }
                finally
                {
                    Directory.Delete(tempPath, true);
                }
            }

            return output;
        }

        private static void WriteInputFiles(Stream[] inputStreams, string tempPath, string tempInputFile)
        {
            // If there is more than one image, write each one to its own file and
            // use a list file (one image path per line) as the input file.
            if (inputStreams.Length > 1)
            {
                var imagesListFileContent = new StringBuilder();

                foreach (var inputStream in inputStreams)
                {
                    var imageFile = NewTempFileName(tempPath);

                    using (var tempStream = File.OpenWrite(imageFile))
                    {
                        CopyStream(inputStream, tempStream);
                    }

                    imagesListFileContent.AppendLine(imageFile);
                }

                File.WriteAllText(tempInputFile, imagesListFileContent.ToString());
            }
            else
            {
                // If there is only one image, then use the image file itself as the input file.
                using (var tempStream = File.OpenWrite(tempInputFile))
                {
                    CopyStream(inputStreams.First(), tempStream);
                }
            }
        }

        private static void CopyStream(Stream input, Stream output)
        {
            // Rewind when possible so the whole image is copied, then close the input.
            if (input.CanSeek)
                input.Seek(0, SeekOrigin.Begin);

            input.CopyTo(output);
            input.Close();
        }

        private static string NewTempFileName(string tempPath)
        {
            return Path.Combine(tempPath, Guid.NewGuid().ToString());
        }
    }
}
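In effect, GetText runs a command of the form tesseract.exe <input> <outputbase> -l <language> and then reads <outputbase>.txt. The sketch below shows one way to build that argument string with quoted paths, so that a temp folder containing spaces does not split the arguments. TesseractArguments and Build are hypothetical names used only for illustration; they are not part of the gist.

```csharp
public static class TesseractArguments
{
    // Builds the argument string for: tesseract.exe <input> <outputbase> -l <language>
    // Hypothetical helper. Assumes the paths do not end with a backslash,
    // which would escape the closing quote on Windows.
    public static string Build(string inputFile, string outputBase, string language)
    {
        return $"\"{inputFile}\" \"{outputBase}\" -l {language}";
    }
}
```

The resulting string can be assigned to ProcessStartInfo.Arguments.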
Setup
- Just download the gist above and add it to your .NET project.
- Install Tesseract 4
  - Linux and OSX: https://github.com/tesseract-ocr/tesseract/wiki
  - Windows: https://github.com/UB-Mannheim/tesseract/wiki
- Download the trained data model for the language you need to read the images
  - More accurate, but slower: https://github.com/tesseract-ocr/tessdata_best
  - Faster, but less accurate: https://github.com/tesseract-ocr/tessdata_fast
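Most setup problems come down to a missing executable or a missing model file, so it can help to check both before running any OCR. The sketch below is one possible check, assuming the Windows layout described above (tesseract.exe in the installation folder, models stored as <language>.traineddata in the data folder). TesseractSetupCheck and FindProblems are hypothetical names, not part of the gist.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TesseractSetupCheck
{
    // Returns a list of human-readable problems; an empty list means the
    // expected files were found. Hypothetical helper for illustration.
    public static List<string> FindProblems(string tesseractDir, string language, string dataDir = null)
    {
        var problems = new List<string>();

        var exePath = Path.Combine(tesseractDir, "tesseract.exe");
        if (!File.Exists(exePath))
            problems.Add("Missing executable: " + exePath);

        // Same default as the TesseractService constructor.
        if (string.IsNullOrEmpty(dataDir))
            dataDir = Path.Combine(tesseractDir, "tessdata");

        // Trained models are stored as <language>.traineddata, e.g. eng.traineddata.
        var modelPath = Path.Combine(dataDir, language + ".traineddata");
        if (!File.Exists(modelPath))
            problems.Add("Missing trained data: " + modelPath);

        return problems;
    }
}
```

Calling TesseractSetupCheck.FindProblems(@"C:\Program Files\Tesseract-OCR", "eng") and printing the returned messages should point at whichever file still needs to be installed or downloaded.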
Usage
var service = new TesseractService(@"C:\Program Files\Tesseract-OCR", "eng", @"C:\Program Files\Tesseract-OCR\tessdata");
// Any readable stream works as input, for example:
// var stream = File.OpenRead(path);
// var stream = WebRequest.Create(url).GetResponse().GetResponseStream();
// var stream = new MemoryStream(buffer);
var text = service.GetText(stream);
If you try to read an image like this one:

You will get this result after calling the TesseractService.GetText method:
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,.schnelle" braune Fuchs springt
iiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marron rapido salta sobre el perro
perezoso. A raposa marrom rapida
salta sobre o céo preguicoso.
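For the thousands-of-images case mentioned at the start, the same call can be applied to every file in a folder. The sketch below is one way to do that. It takes the OCR call as a delegate so it stays independent of the class above. OcrBatch, ReadFolder and the "ERROR: " prefix are hypothetical choices for illustration, and the only assumption about GetText is the one visible in the class: it throws InvalidOperationException when tesseract.exe fails.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OcrBatch
{
    // Runs `readText` on every file in `folder` that matches `searchPattern`
    // and returns the recognized text keyed by file path.
    // `readText` is a stand-in for a call such as s => service.GetText(s).
    public static Dictionary<string, string> ReadFolder(
        string folder, string searchPattern, Func<Stream, string> readText)
    {
        var results = new Dictionary<string, string>();

        foreach (var file in Directory.GetFiles(folder, searchPattern))
        {
            using (var stream = File.OpenRead(file))
            {
                try
                {
                    results[file] = readText(stream);
                }
                catch (InvalidOperationException ex)
                {
                    // Record the failure and continue with the next image
                    // instead of aborting the whole batch.
                    results[file] = "ERROR: " + ex.Message;
                }
            }
        }

        return results;
    }
}
```

With the service from the Usage section, a call could look like OcrBatch.ReadFolder(@"C:\scans", "*.png", s => service.GetText(s)), where C:\scans is a placeholder folder.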