
AI-based iiRDS tagging of technical documents

Summary of the IUNTC meeting on November 14 by Gerald Adam Zwettler.

The IUNTC presentation by Gerald Zwettler explained the use and challenges of large language models such as ChatGPT and similar applications, particularly for the automated indexing of technical documents.

Large language models, such as ChatGPT, are used by many people for various tasks, including translation, summarizing texts, improving texts or writing short letters. Dr Zwettler introduced the Text-IT project as a case study of using AI for iiRDS tagging of technical documentation. Findings from this case indicate the challenges of using such tools in a professional, deterministic environment, because it is difficult to achieve consistent and reliable results with a stochastic system, as is required for the indexing of technical documentation.

Gerald Adam Zwettler - University of Applied Sciences Upper Austria

Gerald Adam Zwettler completed his diploma and master's degree at the University of Applied Sciences Upper Austria, Hagenberg Campus and received his doctorate from the University of Vienna with a thesis on feature-based generic classification. Since then, he has been teaching and researching at the Hagenberg Campus of the University of Applied Sciences Upper Austria in the fields of signal and image processing, computer vision, project development and machine learning. Since 2018, he has headed the AIST (advanced information systems and technology) research group, which conducts application-oriented research in the fields of data science, machine learning and computer vision.

November 2024 - written by Yvonne Cleary & Daniela Straub
 


Training of large language models (LLMs)

Large language models (LLMs) are trained on a broad base of text data. Their learning is based on recognizing patterns in the order of letters and words: in an unsupervised process they learn how letters and words follow each other in languages such as German or English, which makes it possible to identify larger linguistic patterns. There are, however, areas where large language models show weaknesses, and their performance can be improved through additional methods such as reinforcement learning and other learning approaches.

One particularly effective approach is self-supervised learning. Thanks to the vast amount of text on the internet - from books to articles to websites - language models can be trained efficiently. The concept of cloze texts is often used here: parts of a text are removed and the model is trained to insert the missing words or phrases correctly. This method is extremely flexible, because texts from almost any field - whether law, history or fiction - can be used, and it enables an efficient learning process based on an almost unlimited amount of training data. This approach allows models to be continuously improved and further developed, and it is the core of how large language models are trained and why they are so powerful.
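To make the cloze principle tangible, the following minimal sketch (in Python, with an invented example sentence and an invented helper name) shows how training pairs can be generated by blanking out words that the model then learns to predict. It illustrates the idea only; real training pipelines work on tokens and at a vastly larger scale.

import random

def make_cloze_pair(text, mask_rate=0.15, mask_token="[MASK]"):
    """Turn a plain sentence into a (masked input, target words) training pair.

    This mirrors the cloze idea: remove some words and ask the model
    to fill the gaps with the most plausible tokens.
    """
    words = text.split()
    masked, targets = [], []
    for word in words:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(word)   # the model must reconstruct this word
        else:
            masked.append(word)
    return " ".join(masked), targets

# Any text source works: books, articles, websites, legal or technical prose.
sample = "Large language models learn how letters and words follow each other."
masked_input, missing_words = make_cloze_pair(sample)
print(masked_input)    # e.g. "Large language models learn how [MASK] and words follow each other."
print(missing_words)   # e.g. ["letters"]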

Improvement of prompt engineering and prompt tuning

To achieve near-optimal or even deterministic results with LLMs, an important question is how to improve prompt engineering and prompt tuning in a standardized and professional way. The concept of so-called "mega-prompts" involves dividing each task that a large language model is to perform into clearly defined sections. These sections should be formulated as precisely as possible.

The first step is to assign a clear role to the language model. For example, we could say: "You are an expert in technical documentation. Please help us." The model should always know in which role it should act - be it as an expert, teacher or advisor. This is particularly important in educational contexts: students should learn to make specific requests, e.g.: "I am a student, I am having problems with this task, and I need specific support." Clearly assigning roles in this way significantly improves results.

The task should be described precisely, including the individual work steps. Large language models benefit when complex tasks are broken down into smaller steps that can be processed iteratively. An example:

  1. Analyze the text.
  2. Think about the content.
  3. Structure the text.
  4. Create a summary in bullet points according to certain specifications.

By specifying such steps and instructions, the model can work more efficiently and in a more targeted manner.

A language model can only work with the information that is provided to it. Without context - such as geographical, temporal or situational background - the model will not be able to provide optimal answers. An example: "We are in Central Europe, it is winter. Should I wear a jacket today?" This type of context must be specified explicitly, as the model has no situational awareness of its own.

The desired output format should also be clearly defined, especially if there are specific requirements, such as compliance with certain standards or structural specifications. Structured formatted texts are much more useful than unstructured output.

An effective mega-prompt could look like this:

  • Role: "You are a research expert."
  • Task: "Formulate a precise summary."
  • Steps: "1. Analyze the source. 2. Structure the content. 3. Create a summary according to a list of points."
  • Secondary conditions: "Follow specific guidelines and provide the response in XML format."

Such detailed prompts lead to much better results than simply entering a text and hoping for a useful outcome.
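As a concrete illustration, the following sketch assembles a mega-prompt from the four sections named above and sends it to a chat-style model. The section texts and the model name are invented for this example, and the OpenAI Python client is only one possible choice; any comparable LLM API would work similarly.

from openai import OpenAI  # assumption: any chat-style LLM client could be used instead

ROLE = "You are a research expert."
TASK = "Formulate a precise summary of the provided source text."
STEPS = (
    "1. Analyze the source. "
    "2. Structure the content. "
    "3. Create a summary as a list of points."
)
CONSTRAINTS = "Follow the given guidelines and provide the response in XML format."

def build_mega_prompt(source_text: str) -> str:
    """Combine role, task, steps and secondary conditions into one prompt."""
    return (
        f"Role: {ROLE}\n"
        f"Task: {TASK}\n"
        f"Steps: {STEPS}\n"
        f"Constraints: {CONSTRAINTS}\n\n"
        f"Source text:\n{source_text}"
    )

client = OpenAI()  # reads the API key from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",                       # hypothetical model choice
    messages=[{"role": "user", "content": build_mega_prompt("...")}],
    temperature=0.1,                           # low temperature for neutral, repeatable output
)
print(response.choices[0].message.content)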

Language models can also be enriched with specific commands, e.g. for multiple-choice tasks, XML exports or other automated processes. In the education and e-learning sector, exams or content could be created efficiently in this way. By introducing parameters, models can even be controlled programmatically in order to meet specific requirements - such as legal issues or content-related topics.

A particularly interesting aspect of prompt optimization is that the language model itself often knows best how to handle its functions. If a prompt needs to be improved, AI itself can help. For example, if you are unsure how an optimal prompt should be designed, you can ask the language model directly: "You are now a prompt generator. Please define the best prompt for this task." This is a new and effective approach where the model itself suggests how it can best be used. In an earlier project, the language model was even able to act as a tokenizer: It suggested splitting texts into smaller units to accomplish the task more efficiently.

The following procedure is recommended to achieve good output results:

  1. Divide tasks into smaller steps: Large language models deliver better results when tasks are broken down step by step and into manageable units. In this way, the model can work more efficiently and avoid errors.
  2. Validate results: There is always a certain risk of so-called "hallucinations" (false or invented information). Repeat the same task several times and compare the results to ensure consistency; several runs can help to identify optimal solutions (see the sketch after this list).
  3. Provide examples: Show the model concrete examples of the expected results or solutions. The more examples provided, the better the model can learn to fulfill the task.
  4. Iterative improvement: If the results are not satisfactory, work iteratively and adjust prompts or framework conditions.
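A minimal sketch of the validation idea from step 2: the same prompt is sent several times and the answers are compared, keeping the most frequent one. The ask_model function is a placeholder for any LLM call and is an assumption for this example.

from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. via an API client)."""
    raise NotImplementedError

def majority_answer(prompt: str, runs: int = 5) -> str:
    """Repeat the same task several times and keep the most frequent answer.

    Identical answers across runs increase confidence that the result is not
    a one-off hallucination; diverging answers flag the need for human review.
    """
    answers = [ask_model(prompt) for _ in range(runs)]
    most_common, count = Counter(answers).most_common(1)[0]
    if count < runs:
        print(f"Warning: only {count}/{runs} runs agree - manual check advised.")
    return most_common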

AI and LLMs can be used for almost any task - be it for optimizing prompts or structuring complex processes - making stochastic language models more deterministic and reliable in their use.

Automatic indexing of documents

The objective of the "Text-IT" project was to implement automatic or semi-automatic tagging of documents, whereby the tags had to comply with the iiRDS standard. There were restrictions regarding the input format. Image data was excluded as it was not part of the project scope, although good progress is now being made in image analysis. Large language models were used to achieve the project goals.

Some instructions were extracted from the document text and made available to the model via prompts. This required passing not only the commands but also the entire text, which slightly increased the workload. The model was expected to respond with a standard-compliant list of tags in textual form. A structured output, ideally in XML or JSON format, makes further processing much easier; a pure text output, on the other hand, would have been less useful from an IT perspective. An example of an expected output could look like this:

  • Language: German
  • Information topic: Generic collection
  • Conformity: Generic conformity

Of course, it also required post-processing to put everything in the right context.
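For illustration, a hypothetical structured rendering of the same output; the field names are invented for this example and do not reproduce the project's actual schema.

import json

# Hypothetical structured equivalent of the textual output above.
tag_result = {
    "language": "German",
    "information_topic": "generic collection",
    "conformity": "generic conformity",
}
print(json.dumps(tag_result, indent=2))  # JSON or XML is much easier to post-process than free text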

Different variants of interaction with the language models were also examined. A distinction was made between tagging (assignment of predefined tags) and labeling (user-defined assignment).

Tagging uses predefined options (e.g. subject areas) from which the language model selects the appropriate tags. This approach is easier to handle and enables consistent results.

In the second type of interaction, labeling, user-defined tags were used, and training was somewhat more difficult. While automated tagging quickly reaches its limits, especially when adapting to new subject areas, labeling offers a little more flexibility but is more complex.
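The difference between the two modes can be made tangible in code: with closed tagging, every returned tag can be checked against the predefined options, whereas free labels have no such reference set. A minimal sketch with an invented tag vocabulary (in the project, the options came from iiRDS):

# Invented example vocabulary for illustration only.
ALLOWED_TAGS = {"safety", "maintenance", "operation", "installation"}

def validate_closed_tags(model_tags: list[str]) -> tuple[list[str], list[str]]:
    """Split model output into tags from the predefined set and everything else.

    Closed tagging is easy to check: anything outside ALLOWED_TAGS is rejected.
    Free labeling has no reference set, which makes it more flexible but
    harder to validate automatically.
    """
    accepted = [t for t in model_tags if t in ALLOWED_TAGS]
    rejected = [t for t in model_tags if t not in ALLOWED_TAGS]
    return accepted, rejected

accepted, rejected = validate_closed_tags(["maintenance", "lubrication plan"])
print(accepted)  # ['maintenance']
print(rejected)  # ['lubrication plan'] - a free label that needs human review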

Another aspect of the tagging process is the number of calls. A single call to complete the entire task and obtain a result would only work to a limited extent, as the language model would be overloaded. A better strategy was to split the document text and divide the task into several calls. With this so-called multicall strategy, the results were much more precise because smaller, targeted tasks could be set.
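A minimal sketch of the multicall idea: the document text is split into smaller chunks, each chunk is tagged in its own call, and the partial results are merged. The tag_chunk function stands in for a single LLM call and is an assumption for this example.

def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Naive paragraph-based splitting so no single call overloads the model."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current)
            current = ""
        current += paragraph + "\n\n"
    if current:
        chunks.append(current)
    return chunks

def tag_chunk(chunk: str) -> set[str]:
    """Placeholder for one LLM call that returns tags for a single chunk."""
    raise NotImplementedError

def tag_document(text: str) -> set[str]:
    """Multicall strategy: one smaller, targeted call per chunk, results merged."""
    tags: set[str] = set()
    for chunk in split_into_chunks(text):
        tags |= tag_chunk(chunk)
    return tags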

The evaluation compared the different strategies and their impact on the outcomes. We specified context, purpose, topics, rules, location, aspects and boundaries, among other things, in order to formulate the prompts precisely. But the most important factor is how prompts are defined: precisely defined prompts are crucial for achieving the best results.

The model configuration was deliberately generic so that parameters such as the model, the input prompt and the temperature could be easily adapted via configuration files. This means that when new or cheaper models become available, a simple line change can be made without modifying the code.

The "temperature" in this context influences how creatively or neutrally the model responds. For technical documentation, strictly neutral responses were required, so the temperature was set correspondingly low. The model also needed to be provided with the iiRDS statistics and associated commands.
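A sketch of the kind of configuration described here; the file name, keys and values are invented for the example. Keeping model, prompt and temperature in a configuration file means that switching to a newer or cheaper model is a one-line edit rather than a code change.

import json

# Hypothetical contents of a file such as tagging_config.json:
# {
#   "model": "gpt-4o-mini",
#   "temperature": 0.0,
#   "prompt_file": "iirds_tagging_prompt.txt"
# }

def load_config(path: str = "tagging_config.json") -> dict:
    """Read model name, prompt location and temperature from a configuration file.

    A low temperature keeps answers neutral and repeatable, which is what
    strictly factual technical documentation requires.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)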

The prototype of the user interface was very similar to the final goal: the AI automatically assigns texts to tags, with the user having the option to make subsequent adjustments if errors or inconsistencies occur. When the text is extracted from a PDF, the automatic assignment takes place and the user can correct it as required.

When evaluating the accuracy of the tagging, it was found that the results for closed tagging were very good, especially for predefined texts. However, problems occurred with the number of tags, which is why an evaluation metric was developed to correctly account for missing tags in the error statistics. In the free-text analysis, it was more difficult to make an exact assignment, as the similarity of the texts had to be compared. A similarity matrix was used to help identify and correct errors, although this remains a challenge.
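A minimal sketch of the similarity idea for the free-text case: each predicted label is compared against the reference tags with a simple string-similarity measure, and matches below a threshold are flagged for review. The measure and the threshold are assumptions for this example; the project may well have used a different similarity calculation.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_free_labels(predicted: list[str], reference: list[str], threshold: float = 0.8):
    """Build a similarity matrix and map each predicted label to its closest reference tag."""
    matches = []
    for p in predicted:
        scores = [(r, similarity(p, r)) for r in reference]
        best_tag, best_score = max(scores, key=lambda x: x[1])
        matches.append((p, best_tag, best_score, best_score >= threshold))
    return matches

for pred, ref, score, ok in match_free_labels(["warning notice"], ["warning", "caution", "note"]):
    print(f"{pred!r} -> {ref!r} (score {score:.2f}, {'accepted' if ok else 'needs review'})")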

The results were also compared with those of human experts in the field of technical documentation. The agreement was good, but there were discrepancies, especially in the interpretation of tags such as "caution", "warning", "danger" or "note". These terms lie in a gray area where different interpretations are possible, even if their definitions are relatively clear. This subjectivity was an important learning point in the project.

In terms of cost, the token calculations with a good model were relatively inexpensive, which kept the need for manual review by experts to a minimum.

Findings from this case suggest that if the human effort required to check and correct the results is less than the effort of processing the document entirely manually, AI costs become insignificant. This points to a business model in which AI provides support while manual control of the results is retained.

In summary, it can be said that large language models can certainly be used for technical documentation, but some critical aspects require fine-tuning.