Large Oil and Gas industry text dataset from Norwegian, UK and Dutch public oil and gas documents
Author: | |
---|---|
Document type: | other |
Publication date: | 2024 |
Publisher/Editor: | FORCE Hackathon 2023 |
Keywords: | oil / gas / text dataset / embeddings model / diskos / ukoa / nlod / norway / force organisation |
Language: | English, Norwegian, Dutch |
Permalink: | https://search.fid-benelux.de/Record/base-28639853 |
Data source: | BASE; original catalog |
Powered by: | BASE |
Link(s): | https://doi.org/10.5281/zenodo.10775273 |
This is a large dataset of text extracted from public oil and gas documents, prepared in the run-up to the FORCE 2023 Large Language Model Hackathon in Stavanger, Norway. The dataset is unique in that it contains the largest public collection of extracted text from OCR'ed oil and gas documents currently available. It was created with the aim of getting more knowledge from oil and gas documents embedded in language models. In addition, each extracted page has been classified as either real text or mostly gibberish. Personally identifiable information has been removed as far as possible. A file with 1500 hand-classified pages is included in the upload so that text classifiers can be trained further.