Although English predominates in scientific communication, there is a vast production of knowledge in other languages that often remains out of global reach. The scarcity of tools to access scientific results in different languages limits the dissemination of knowledge and discourages publication in local languages. To address this challenge, the University of Zaragoza is participating in an international research project aimed at facilitating interoperable and multilingual access to scientific and technological data.
The Clasik project (Multilingual Access to Scientific Knowledge) is an innovative response to the language barriers that fragment global research. Its goal is for anyone, from specialists to the general public, to be able to search, read, and interact with complex scientific documents in their own language.
For example, a researcher in Estonia studying the impact of droughts in the Mediterranean often finds that key information, such as technical reports from Aemet (Spain's State Meteorological Agency), is available only in Spanish, which makes it hard to access. Clasik will allow this researcher to run advanced searches in Estonian and receive accurate results in her own language, making scientific knowledge more inclusive and enabling local research to be leveraged globally to tackle common challenges like climate change.
To achieve this, the team employs neurosymbolic artificial intelligence, which combines language models with knowledge graphs, creating bridges between languages and translating information accurately. This technology will initially be tested in the field of climatology, facilitating the exchange and understanding of critical data on extreme phenomena such as droughts, floods, or heatwaves, regardless of the original language of the reports.
Neurosymbolic AI integrates deep learning (neural networks) with symbolic reasoning (based on logic and structured knowledge). In Clasik, large language models (LLMs), like those powering ChatGPT or Gemini, are combined with knowledge graphs and ontologies validated by experts. The LLMs provide the ability to process and generate natural language, while symbolic systems ensure accuracy and veracity in the data.
This combination is key to overcoming limitations of current models, such as generating false information or lacking transparency. Through techniques like Graph RAG, the system produces responses grounded in scientific sources and makes it possible to trace the origin of each piece of data, merging language intuition with logical rigor.
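The grounding idea behind Graph RAG can be sketched in a few lines of Python. This is only an illustration, not Clasik's implementation: the `Fact` class, the `retrieve` and `build_prompt` helpers, the `ex:` identifiers, and the report names are all hypothetical, and the call to a language model is left out. The sketch shows facts that carry their source being pulled from a graph and handed to the model as numbered, citable context.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fact:
    """One edge of a knowledge graph plus the document it was taken from."""
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: placeholder document name, invented for this sketch


def retrieve(facts: List[Fact], concept: str) -> List[Fact]:
    """Return every fact in which the concept appears as subject or object."""
    return [f for f in facts if concept in (f.subject, f.obj)]


def build_prompt(question: str, facts: List[Fact]) -> str:
    """Build a prompt that restricts the model to numbered, sourced facts."""
    lines = [
        f"[{i}] {f.subject} {f.predicate} {f.obj} (source: {f.source})"
        for i, f in enumerate(facts, start=1)
    ]
    context = "\n".join(lines) if lines else "(no facts found)"
    return (
        "Answer using only the numbered facts and cite them by number.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}"
    )


# Illustrative placeholder data, not real findings or real reports.
GRAPH = [
    Fact("ex:Drought", "ex:affects", "ex:MediterraneanBasin", "report-A"),
    Fact("ex:Drought", "ex:measuredBy", "ex:PrecipitationIndex", "report-B"),
    Fact("ex:Heatwave", "ex:affects", "ex:MediterraneanBasin", "report-C"),
]

if __name__ == "__main__":
    hits = retrieve(GRAPH, "ex:Drought")
    print(build_prompt("What does drought affect?", hits))
```

Because each fact keeps its `source` field, any statement in the generated answer can be followed back to the document it came from, which is the traceability the paragraph above describes.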
The multilingual knowledge graph is a structure that organizes information through nodes and relationships, connecting concepts in different languages. It uses web standards like RDF and OWL to ensure interoperability, preventing knowledge from being isolated in "monolingual islands." In Clasik, this graph links equivalent terms in different languages (such as sequía, drought, or sécheresse) to the same concept, enabling semantic searches and access to scientific documents regardless of the original language. Additionally, it can connect with external resources like Wikidata or BabelNet to enrich the information.
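A minimal Python sketch can make the "one concept, many labels" idea concrete. It is a plain in-memory stand-in for what an RDF graph with language-tagged labels would provide, not the project's actual data model: the `Concept` and `MultilingualGraph` classes and the `ex:Drought` identifier are assumptions made for illustration. The Spanish, English, and French labels come from the example above, and the Estonian label is added here for the researcher scenario.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Concept:
    """A language-independent node with one label per language code."""
    iri: str  # hypothetical identifier, not a real Clasik IRI
    labels: Dict[str, str] = field(default_factory=dict)


class MultilingualGraph:
    """Maps language-tagged terms to shared concept nodes."""

    def __init__(self) -> None:
        self._concepts: Dict[str, Concept] = {}
        # (language code, case-folded label) -> concept identifier
        self._index: Dict[Tuple[str, str], str] = {}

    def add_label(self, iri: str, lang: str, label: str) -> None:
        """Attach a label in the given language to the concept."""
        concept = self._concepts.setdefault(iri, Concept(iri))
        concept.labels[lang] = label
        self._index[(lang, label.casefold())] = iri

    def find(self, lang: str, term: str) -> Optional[Concept]:
        """Look up the concept that a term denotes in one language."""
        iri = self._index.get((lang, term.casefold()))
        return self._concepts[iri] if iri is not None else None

    def translate(self, term: str, source: str, target: str) -> Optional[str]:
        """Go from a term to its concept, then to the target-language label."""
        concept = self.find(source, term)
        return concept.labels.get(target) if concept else None


graph = MultilingualGraph()
for lang, label in [("es", "sequía"), ("en", "drought"),
                    ("fr", "sécheresse"), ("et", "põud")]:
    graph.add_label("ex:Drought", lang, label)

if __name__ == "__main__":
    print(graph.find("et", "põud").iri)
    print(graph.translate("põud", "et", "es"))
```

A search in Estonian resolves to the same node as a search in Spanish, so documents indexed under that node are reachable from either language without translating the documents themselves first.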
Climatology, especially the study of extreme events, is an ideal field for Clasik, as much of the information is published in local languages by national agencies like Aemet. These languages provide unique perspectives and essential concepts for understanding natural phenomena, but they also hinder global access to knowledge. Integrating this regional knowledge is crucial for advancing climate research and developing effective adaptation strategies.
Jorge Gracia del Río coordinates the Clasik project from the I3A at the University of Zaragoza, in collaboration with the Scientific Culture Unit of the same university.
Source: heraldo.es