zettelkasten/OneNoteExport/Technik/Dokumentation/Anleitungen/48_PDF24 Creator .md
2023-08-17 19:32:37 +02:00

PDF24 Creator

Wednesday, 28 September 2022

20:38

 

Setting up PDC OCR

Hello,

Place a trainDataList.txt file for local mode in the tesseract folder of the installation directory. Unlike the online list, the local list contains no download URLs. Then download the individual language files named in the list into the tessdata folder (INSTALL_DIR/tesseract/tessdata). That is all there is to it. Please make sure you use language files that are compatible with Tesseract 4.0. The language files can be downloaded, for example, from the tesseract project on GitHub.

The list for the local variant can be downloaded from the URL given below. Copy this file into the folder described above, and remember to rename it from local-list.txt to trainDataList.txt. The list of downloadable language files is also given below; it contains the direct links to the language files.

You can shorten the list so that it names only the language files that should be shown in the software.
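
Shortening the list can be scripted. The following is a minimal sketch, not part of PDF24 itself. It assumes that each line of the list consists of whitespace-separated columns (language code, display name and, in the online list, a URL) and that a tab between code and name is acceptable in the local list; neither is confirmed by the forum answer. The function name `build_local_list` is hypothetical, and keeping the `osd` entry unconditionally is a choice made for this example.

```python
def build_local_list(list_text, wanted_codes):
    """Return the trimmed list as text, without any URL column.

    Assumed line format: <code> <name> [<url>], separated by whitespace.
    """
    keep = set(wanted_codes) | {"osd"}  # assumption: always keep the OSD entry
    lines = []
    for raw in list_text.splitlines():
        parts = raw.split()
        # Drop a trailing URL column if one is present.
        if parts and parts[-1].startswith(("http://", "https://")):
            parts = parts[:-1]
        # Skip blank or incomplete lines and languages that were not requested.
        if len(parts) < 2 or parts[0] not in keep:
            continue
        code, name = parts[0], " ".join(parts[1:])
        lines.append(f"{code}\t{name}\n")
    return "".join(lines)
```

The returned text could then be saved as trainDataList.txt in the tesseract folder of the installation directory.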

List for the local language files:

https://creator.pdf24.org/tesseract/4.0/traindata/local-list.txt

List of the downloadable language files:

https://creator.pdf24.org/tesseract/4.0/traindata/list.txt
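
Downloading the language files can be automated as well. This is a sketch under assumptions: it expects the text of the online list with whitespace-separated columns and the URL in the last column, and it stores each file as `<code>.traineddata`, matching the file names in the URLs. The function name `download_languages` is hypothetical, and writing into the installation directory may require administrator rights.

```python
import shutil
import urllib.request
from pathlib import Path


def download_languages(list_text, wanted_codes, tessdata_dir):
    """Download the .traineddata files for the wanted codes into tessdata_dir.

    Assumed line format: <code> <name> <url>, separated by whitespace.
    Returns the paths of the files that were written.
    """
    target = Path(tessdata_dir)
    target.mkdir(parents=True, exist_ok=True)
    wanted = set(wanted_codes)
    written = []
    for raw in list_text.splitlines():
        parts = raw.split()
        if len(parts) < 3 or parts[0] not in wanted:
            continue
        code, url = parts[0], parts[-1]
        dest = target / f"{code}.traineddata"
        # Stream the response to disk instead of holding it in memory.
        with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
        written.append(dest)
    return written
```

A call would pass the text of list.txt, the desired codes such as `["deu", "eng", "osd"]`, and the path INSTALL_DIR/tesseract/tessdata, where INSTALL_DIR stands for the actual installation directory.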

Stefan Ziegler Posted new comment 2019-07-11

 

From <https://help.pdf24.org/de/fragen/frage/ocr-konfiguration-um-nur-installierte-sprachdateien-zu-verwenden/>

Always strictly required: the language file itself and the OSD file (last entry in the list).
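
A quick check that the required files are in place might look like the sketch below. It assumes the files are stored as `<code>.traineddata` in the tessdata folder, as the file names in the download URLs suggest; the function name `missing_traineddata` is hypothetical.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, language_codes):
    """Return the required .traineddata file names that are not in tessdata_dir."""
    # The OSD file is checked in addition to the requested languages.
    required = list(dict.fromkeys([*language_codes, "osd"]))
    folder = Path(tessdata_dir)
    return [
        f"{code}.traineddata"
        for code in required
        if not (folder / f"{code}.traineddata").is_file()
    ]
```

An empty result means that every requested language file and the OSD file were found.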

afr                                Afrikaans                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/afr.traineddata
amh                                Amharic                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/amh.traineddata
ara                                Arabic                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ara.traineddata
asm                                Assamese                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/asm.traineddata
aze                                Azerbaijani                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/aze.traineddata
aze_cyrl                Azerbaijani-Cyrilic                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/aze_cyrl.traineddata
bel                                Belarusian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/bel.traineddata
ben                                Bengali                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ben.traineddata
bod                                Tibetan                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/bod.traineddata
bos                                Bosnian                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/bos.traineddata
bre                                Breton                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/bre.traineddata
bul                                Bulgarian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/bul.traineddata
cat                                Catalan                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/cat.traineddata
ceb                                Cebuano                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ceb.traineddata
ces                                Czech                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ces.traineddata
chi_sim                        Chinese-Simpl.                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/chi_sim.traineddata
chi_tra                        Chinese-Trad.                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/chi_tra.traineddata
chr                                Cherokee                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/chr.traineddata
cym                                Welsh                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/cym.traineddata
dan                                Danish                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/dan.traineddata
deu                                German                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/deu.traineddata
dzo                                Dzongkha                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/dzo.traineddata
ell                                Greek                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ell.traineddata
eng                                English                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/eng.traineddata
epo                                Esperanto                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/epo.traineddata
est                                Estonian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/est.traineddata
eus                                Basque                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/eus.traineddata
fas                                Persian                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/fas.traineddata
fin                                Finnish                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/fin.traineddata
fra                                French                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/fra.traineddata
frk                                Frankish                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/frk.traineddata
frm                                French                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/frm.traineddata
gle                                Irish                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/gle.traineddata
glg                                Galician                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/glg.traineddata
grc                                Greek                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/grc.traineddata
guj                                Gujarati                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/guj.traineddata
hat                                Haitian                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/hat.traineddata
heb                                Hebrew                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/heb.traineddata
hin                                Hindi                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/hin.traineddata
hrv                                Croatian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/hrv.traineddata
hun                                Hungarian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/hun.traineddata
iku                                Inuktitut                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/iku.traineddata
ind                                Indonesian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ind.traineddata
isl                                Icelandic                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/isl.traineddata
ita                                Italian                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ita.traineddata
ita_old                        Italian-Old                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ita_old.traineddata
jav                                Javanese                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/jav.traineddata
jpn                                Japanese                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/jpn.traineddata
kan                                Kannada                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/kan.traineddata
kat                                Georgian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/kat.traineddata
kat_old                        Georgian-Old                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/kat_old.traineddata
kaz                                Kazakh                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/kaz.traineddata
khm                                Khmer                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/khm.traineddata
kir                                Kirghiz                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/kir.traineddata
kor                                Korean                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/kor.traineddata
kor_vert                Korean-vertical                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/kor_vert.traineddata
lao                                Lao                                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/lao.traineddata
lat                                Latin                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/lat.traineddata
lav                                Latvian                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/lav.traineddata
lit                                Lithuanian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/lit.traineddata
ltz                                Luxembourgish                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ltz.traineddata
mal                                Malayalam                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/mal.traineddata
mar                                Marathi                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/mar.traineddata
mkd                                Macedonian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/mkd.traineddata
mlt                                Maltese                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/mlt.traineddata
mon                                Mongolian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/mon.traineddata
mri                                Maori                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/mri.traineddata
msa                                Malay                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/msa.traineddata
mya                                Burmese                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/mya.traineddata
nep                                Nepali                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/nep.traineddata
nld                                Dutch                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/nld.traineddata
nor                                Norwegian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/nor.traineddata
oci                                Occitan                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/oci.traineddata
ori                                Oriya                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ori.traineddata
pan                                Panjabi                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/pan.traineddata
pol                                Polish                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/pol.traineddata
por                                Portuguese                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/por.traineddata
pus                                Pushto                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/pus.traineddata
que                                Quechua                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/que.traineddata
ron                                Romanian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ron.traineddata
rus                                Russian                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/rus.traineddata
san                                Sanskrit                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/san.traineddata
sin                                Sinhala                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/sin.traineddata
slk                                Slovak                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/slk.traineddata
slv                                Slovenian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/slv.traineddata
snd                                Sindhi                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/snd.traineddata
spa                                Spanish                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/spa.traineddata
spa_old                        Spanish-Old                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/spa_old.traineddata
sqi                                Albanian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/sqi.traineddata
srp                                Serbian                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/srp.traineddata
srp_latn                Serbian-Latin                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/srp_latn.traineddata
sun                                Sundanese                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/sun.traineddata
swa                                Swahili                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/swa.traineddata
swe                                Swedish                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/swe.traineddata
syr                                Syriac                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/syr.traineddata
tam                                Tamil                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/tam.traineddata
tat                                Tatar                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/tat.traineddata
tel                                Telugu                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/tel.traineddata
tgk                                Tajik                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/tgk.traineddata
tha                                Thai                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/tha.traineddata
tir                                Tigrinya                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/tir.traineddata
ton                                Tonga                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ton.traineddata
tur                                Turkish                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/tur.traineddata
uig                                Uighur                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/uig.traineddata
ukr                                Ukrainian                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/ukr.traineddata
urd                                Urdu                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/urd.traineddata
uzb                                Uzbek                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/uzb.traineddata
uzb_cyrl                Uzbek-Cyrilic                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/uzb_cyrl.traineddata
vie                                Vietnamese                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/vie.traineddata
yid                                Yiddish                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/yid.traineddata
yor                                Yoruba                                        https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/yor.traineddata
osd                                OSD                                https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0/osd.traineddata

 

From <https://creator.pdf24.org/tesseract/4.0/traindata/list.txt>

//EventLog.CreateEventSource("PDF24-OCR-Service", "Application");