Large Oil and Gas industry text dataset from Norwegian, UK and Dutch public oil and gas documents

Author: FORCE NETWORK Group
Document type: other
Publication date: 2024
Publisher/Ed.: FORCE Hackathon 2023
Keywords: oil / gas / text dataset / embeddings model / diskos / ukoa / nlod / norway / force organisation
Language: English
Norwegian
Dutch
Permalink: https://search.fid-benelux.de/Record/base-28639853
Data source: BASE; original catalog
Powered By: BASE
Link(s): https://doi.org/10.5281/zenodo.10775273

This is a large dataset of text extracted from public oil and gas documents, prepared in the run-up to the FORCE 2023 Large Language Model Hackathon in Stavanger, Norway. The dataset is unique in that it contains the largest public collection of extracted text from OCR'ed oil and gas documents currently available. It was created with the aim of getting more knowledge from oil and gas documents embedded in language models. Additionally, each extracted page has been classified as either real text or mostly gibberish. Personally identifiable information has been removed as far as possible. A file with 1500 hand-classified pages is part of the upload and can be used to further train text classifiers.