tfind is a simple OCR indexer that supports both Tesseract and PaddleOCR-VL-1.5 backends. Give it a collection of directories, and it will run every OCR operation you specify on them and let you search for files based on the results. While running, it watches for new files and indexes them automatically. You interact with tfind through a modern web interface.
tfind is sigma software; if you use it, two wolves may spawn inside of you and you will die.
Installation
tfind requires a recent LTS release of Node.js. Once Node.js is installed, clone this repo and run `npm i` inside it to install dependencies.
Copy `example-config.json` to `config.json` and edit the values as needed. The config format is as follows:
| Key | Description | Example value |
|---|---|---|
| port | The port that tfind will open its web interface on. | 17450 |
| shards | The number of worker processes to use. I recommend at most half your core count. More shards make searches and Tesseract-based indexing operations faster, at the cost of a higher memory and CPU footprint. | 4 |
| datadir | The directory to store the index database in. | "./storage" |
| paths | An array of all directories to index. | ["/home/ikagi/Pictures"] |
| file_formats | An array of all file extensions to allow indexing. | ["png", "jpg", "jpeg", "webp"] |
| operations | An array of operations. | (see below) |
| operations[].name | The name of this operation. | "Tess_PSM11" |
| operations[].enabled | Whether this operation is enabled. Disabled operations are still used for searches, but new indexing operations won't be run. | true |
| operations[].type | The type of operation. | Either "tesseract" or "paddleocr" |
| operations[].tesseract_location | For tesseract operations, the location of the tesseract binary. | "/usr/bin/tesseract" |
| operations[].opts | For tesseract operations, the CLI arguments to pass to tesseract. | "--tessdata-dir /home/ikagi/.local/share/tessdata --psm 11 -l eng+Japanese" |
| operations[].maxsize | For paddleocr operations, the maximum area in pixels of an image. Larger images will be resized down. Lower or raise depending on your VRAM. | 800000 |
| operations[].endpoint | For paddleocr operations, the endpoint of your PaddleX instance. Use null if you wish to use tfind's PaddleX manager. | null |
| paddlex | Options for the PaddleX manager. | (see below) |
| paddlex.enabled | Whether the PaddleX manager is enabled. | true |
| paddlex.paddlex_path | The location of your paddlex binary. | "/home/ikagi/Applications/paddleocr/bin/paddlex" |
| paddlex.port | The port that paddlex will listen on. | 18475 |
| paddlex.timeout | The maximum number of milliseconds a PaddleX operation may take. | 180000 |
| paddlex.sleep_after | Shut down the PaddleX server (and release VRAM) after this many milliseconds of inactivity. | 300000 |
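
To make the table concrete, here is a rough sketch that assembles one possible config from the example values above and prints it as JSON. The nesting is inferred from the key names (`operations[].name`, `paddlex.enabled`), the operation name `"Paddle_VL"` is made up for illustration, and the shipped `example-config.json` may be laid out differently, so treat it as a starting point only.

```javascript
// Sketch only: builds a config object from the example values in the table
// and prints it as JSON that could be saved as config.json.
const config = {
  port: 17450,
  shards: 4,
  datadir: "./storage",
  paths: ["/home/ikagi/Pictures"],
  file_formats: ["png", "jpg", "jpeg", "webp"],
  operations: [
    {
      // A Tesseract operation: uses tesseract_location and opts.
      name: "Tess_PSM11",
      enabled: true,
      type: "tesseract",
      tesseract_location: "/usr/bin/tesseract",
      opts: "--tessdata-dir /home/ikagi/.local/share/tessdata --psm 11 -l eng+Japanese",
    },
    {
      // A PaddleOCR operation: uses maxsize and endpoint.
      name: "Paddle_VL", // hypothetical name, not from the README
      enabled: true,
      type: "paddleocr",
      maxsize: 800000,
      endpoint: null, // null = let tfind's PaddleX manager handle it
    },
  ],
  paddlex: {
    enabled: true,
    paddlex_path: "/home/ikagi/Applications/paddleocr/bin/paddlex",
    port: 18475,
    timeout: 180000, // 3 minutes per PaddleX operation
    sleep_after: 300000, // shut down after 5 minutes of inactivity
  },
};

const json = JSON.stringify(config, null, 2);
console.log(json);
```

Redirecting the output to `config.json` and then adjusting paths for your own machine should give a usable file, assuming the inferred layout matches.
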
Note that if you change the shard count, the database must be migrated, which may take some time depending on your storage speed.
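
This README does not describe how tfind assigns files to shards. As a purely illustrative sketch, assume a common scheme where a hash of the file path modulo the shard count picks the shard. Under that assumption, changing the count changes where many existing entries belong, which is one plausible reason a migration would be needed. The functions `hashPath`, `shardFor`, and `countMoved` are hypothetical and not taken from tfind.

```javascript
// Hypothetical placement rule, NOT tfind's actual implementation:
// hash the file path, then take the hash modulo the shard count.

// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function hashPath(path) {
  let h = 0x811c9dc5;
  for (let i = 0; i < path.length; i++) {
    h ^= path.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit int
  }
  return h;
}

// Shard index in the range [0, shardCount).
function shardFor(path, shardCount) {
  return hashPath(path) % shardCount;
}

// Count how many paths would land in a different shard after a resize.
function countMoved(paths, oldCount, newCount) {
  return paths.filter((p) => shardFor(p, oldCount) !== shardFor(p, newCount)).length;
}

const paths = Array.from({ length: 1000 }, (_, i) => `/home/ikagi/Pictures/${i}.png`);
const moved = countMoved(paths, 4, 6);
console.log(`${moved} of ${paths.length} entries would need to move from 4 to 6 shards`);
```
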
Start tfind with `node ./index.mjs`. The server will listen on the configured port, and indexing will start immediately.