tfind is a simple OCR indexer that supports both Tesseract and PaddleOCR-VL-1.5 backends. Give it a collection of directories, and it will run the OCR operations you specify on them and let you search for files based on the results. While running, it watches for new files and indexes them automatically. You interact with tfind through a modern web interface that lets you monitor indexing status and search the index.
tfind is sigma software; if you use it, two wolves may spawn inside of you and you will die.
Installation
tfind requires a recent LTS release of Node.js. Once Node.js is installed, clone this repo and run npm i inside it to install dependencies.
You will need to copy example-config.json over to config.json, and edit any values as you desire. The config format is as follows:
| Key | Description | Example value |
|---|---|---|
| port | The port that tfind will open its web interface on. | 17450 |
| datadir | The directory to store the index database in. | "./storage" |
| paths | An array of all directories to index. | ["/home/ikagi/Pictures"] |
| file_formats | An array of all file extensions to allow indexing. | ["png", "jpg", "jpeg", "webp"] |
| operations | An array of operations. | (see below) |
| operations[].name | The name of this operation. | "Tess_PSM11" |
| operations[].enabled | Whether this operation is enabled. Disabled operations are still used for searches, but new indexing operations won't be run. | true |
| operations[].type | The type of operation. | Either "tesseract" or "paddleocr" |
| operations[].tesseract_location | For tesseract operations, the location of the tesseract binary. | "/usr/bin/tesseract" |
| operations[].opts | For tesseract operations, the CLI arguments to pass to tesseract. | "--tessdata-dir /home/ikagi/.local/share/tessdata --psm 11 -l eng+Japanese" |
| operations[].maxsize | For paddleocr operations, the maximum area in pixels of an image. Larger images will be resized down. Lower or raise depending on your VRAM. | 800000 |
| operations[].endpoint | For paddleocr operations, the endpoint of your PaddleX instance. Use null if you wish to use tfind's PaddleX manager. | null |
| operations[].threads | The number of simultaneous operations allowed. | 4 |
| operations[].schedule | An array of schedule definitions, which allow changing the number of threads on a schedule (e.g. to only run some operations while you're asleep). | (see below) |
| operations[].schedule[].at | The time at which this schedule will run, in Crontab syntax. | * 20 * * * |
| operations[].schedule[].threads | The number of simultaneous operations allowed from this time onward. 0 will pause the queue. | 0 |
| paddlex | Options for the PaddleX manager. | (see below) |
| paddlex.enabled | Whether the PaddleX manager is enabled. | true |
| paddlex.paddlex_path | The location of your paddlex binary. | "/home/ikagi/Applications/paddleocr/bin/paddlex" |
| paddlex.port | The port that paddlex will listen on. | 18475 |
| paddlex.timeout | The maximum number of milliseconds a PaddleX operation may take. | 180000 |
| paddlex.sleep_after | Shut down the PaddleX server (and release VRAM) after this many milliseconds of inactivity. | 300000 |
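
For orientation, here is a sketch of what a complete config.json might look like, assembled from the example values in the table above. The nesting is inferred from the key names, so treat the bundled example-config.json as the authoritative reference. The second operation's name ("Paddle_VL"), its threads value, and its empty schedule are made up for illustration. The single schedule entry is copied from the table and only pauses the queue; a real schedule would presumably also need an entry that raises threads again.

```json
{
  "port": 17450,
  "datadir": "./storage",
  "paths": ["/home/ikagi/Pictures"],
  "file_formats": ["png", "jpg", "jpeg", "webp"],
  "operations": [
    {
      "name": "Tess_PSM11",
      "enabled": true,
      "type": "tesseract",
      "tesseract_location": "/usr/bin/tesseract",
      "opts": "--tessdata-dir /home/ikagi/.local/share/tessdata --psm 11 -l eng+Japanese",
      "threads": 4,
      "schedule": [
        { "at": "* 20 * * *", "threads": 0 }
      ]
    },
    {
      "name": "Paddle_VL",
      "enabled": true,
      "type": "paddleocr",
      "maxsize": 800000,
      "endpoint": null,
      "threads": 1,
      "schedule": []
    }
  ],
  "paddlex": {
    "enabled": true,
    "paddlex_path": "/home/ikagi/Applications/paddleocr/bin/paddlex",
    "port": 18475,
    "timeout": 180000,
    "sleep_after": 300000
  }
}
```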
Start tfind with node ./index.mjs. The server will run on your given port, and indexing will start immediately.
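
When picking a value for operations[].maxsize, it can help to see what an area limit means for a concrete image. The sketch below is not tfind's actual resizing code; fitWithinArea is a hypothetical helper that shows one common way to scale an image down to a maximum pixel area while keeping its aspect ratio.

```javascript
// Hypothetical helper, NOT taken from tfind's source: computes the
// dimensions an image would have after being scaled down to fit within
// a maximum pixel area, preserving the aspect ratio.
function fitWithinArea(width, height, maxArea) {
  const area = width * height;

  // Images already within the limit are left untouched.
  if (area <= maxArea) {
    return { width, height };
  }

  // Scaling both sides by sqrt(maxArea / area) scales the area by
  // maxArea / area, which brings it down to the limit.
  const scale = Math.sqrt(maxArea / area);

  // Round down so the result never exceeds maxArea; keep at least 1px.
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}

// Example: a 2000x1000 image against the example maxsize of 800000.
console.log(fitWithinArea(2000, 1000, 800000));
```

Under this scheme, halving maxsize shrinks each side by a factor of about 0.71, not 0.5, because the limit applies to area and not to the individual dimensions.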