Tesseract (https://github.com/charlesw/tesseract) is a .NET wrapper for the open source Tesseract OCR engine. I have been using it for several months to extract text from screenshots taken from an app I use for my gig work. This has let me easily create orders in my database and add tips to those orders. Anyway, this blog is about how to use Tesseract, not how I've used it. So let's get started.
Firstly, Tesseract isn't the only OCR engine. I like it because it is open source and works fairly well. Probably the best known is IronOCR (https://ironsoftware.com/csharp/ocr). IronOCR requires a license (https://ironsoftware.com/csharp/ocr/licensing/). For enterprise applications with a decent budget I would recommend IronOCR, but for small projects where you don't have the budget you should be OK using Tesseract.
So, let's dive right in. I am going to demonstrate this using a .NET 9 Console App in Visual Studio. Once you have your console app, you need to include some NuGet packages.
You will also need to download the language data from https://github.com/tesseract-ocr/tessdata/. The easiest way is to download the repository as a zip file, extract it, and copy the contents of the extracted tessdata-main folder into a new directory called tessdata in your project directory. Make sure the directory is included in your project and that its files have Copy to Output Directory set to Copy always, so they end up next to the built executable.
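Before going further, it can be worth confirming that the language file really did get copied next to the executable. The following is only a minimal sketch: TessDataCheck and its two methods are hypothetical helpers I made up for illustration, and it assumes the English data file is named eng.traineddata and sits directly inside the tessdata folder in the build output.

```csharp
using System;
using System.IO;

public static class TessDataCheck
{
    // Builds the expected path of a language file inside a tessdata directory,
    // e.g. <dir>/eng.traineddata for the language code "eng".
    public static string LanguageFilePath(string tessDataDir, string language)
    {
        return Path.Combine(tessDataDir, language + ".traineddata");
    }

    // True when the language file exists at the expected path.
    public static bool HasLanguage(string tessDataDir, string language)
    {
        return File.Exists(LanguageFilePath(tessDataDir, language));
    }

    public static void Main()
    {
        // Assumption: tessdata was copied to the build output directory.
        var tessDataPath = Path.Combine(AppContext.BaseDirectory, "tessdata");
        var expectedFile = LanguageFilePath(tessDataPath, "eng");

        if (HasLanguage(tessDataPath, "eng"))
        {
            Console.WriteLine("Found " + expectedFile);
        }
        else
        {
            Console.WriteLine("Missing " + expectedFile + ", check the copy settings.");
        }
    }
}
```

If the file is reported missing, the copy setting on the tessdata files is the first thing I would look at.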
Once you have all that done, we are ready to start extracting data. I have a sample image file that I created...
As you can see, it has a few lines of text. So let's get started with extracting it...
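using System;
using System.IO;
using Tesseract;

namespace TesseractTest
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // The tessdata folder is copied next to the executable at build time.
            var tessDataPath = Path.Combine(AppContext.BaseDirectory, "tessdata");

            // Create the OCR engine for English; dispose it when we are done.
            using (TesseractEngine engine = new TesseractEngine(tessDataPath, "eng", EngineMode.Default))
            {
                // Load the image to be read.
                using (Pix pix = Pix.LoadFromFile(@"C:\temp\TesseractTest\mydata.PNG"))
                {
                    // Run OCR on the image to get a page of results.
                    using (Page page = engine.Process(pix))
                    {
                        // Extract the recognised text from the page.
                        var theText = page.GetText();
                        Console.WriteLine("Text: " + theText);
                    }
                }
            }
        }
    }
}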
Let's take a look at the code... First we import the Tesseract namespace, then we create an instance of the Tesseract OCR engine, which requires you to tell it where the tessdata directory is and what language it should use. I've wrapped it in a using so that it is disposed of once we are done. Inside that, we load the image into a Pix and pass it to engine.Process to get the page. Again, I wrap each of these in a using. The next line, page.GetText(), is the important one: it is responsible for extracting the text from the image.
As you can see, the variable theText now contains the contents of the image. Once you have the data from the image, you can convert it into a string array, iterate through it, and use it however you see fit.
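Here is a rough sketch of that last step. OcrLines.ToLines is a hypothetical helper name of my own, the sample string stands in for whatever page.GetText() returned, and I am assuming the lines are separated by "\n" or "\r\n". It also relies on StringSplitOptions.TrimEntries, which needs .NET 5 or later.

```csharp
using System;

public static class OcrLines
{
    // Splits OCR output into lines, trimming whitespace and dropping blank lines.
    public static string[] ToLines(string ocrText)
    {
        return ocrText.Split(
            new[] { "\r\n", "\n", "\r" },
            StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
    }

    public static void Main()
    {
        // Made-up sample text, standing in for the result of page.GetText().
        var theText = "First line\n\n  Second line  \nThird line\n";

        string[] lines = ToLines(theText);

        // Iterate through the lines and do whatever you need with each one.
        for (int i = 0; i < lines.Length; i++)
        {
            Console.WriteLine((i + 1) + ": " + lines[i]);
        }
    }
}
```

From here each line can be matched against whatever pattern your own data follows, for example picking out the line that holds an order number.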
Hopefully this quick example gives you enough to use it for your own applications.