BSC: ALIA, Europe's first public, open and multilingual AI infrastructure

23 Jan 2025 | Network Updates | Update from Barcelona Supercomputing Center
These updates are republished press releases and communications from members of the Science|Business Network

ALIA-40B, the most advanced public multilingual foundational model in Europe, trained on the MareNostrum 5 supercomputer, emerges in this context

The initiative is 100% publicly funded to provide a service of public interest and democratise access to AI for citizens, public administrations, universities and companies

The President of the Spanish Government, Pedro Sánchez, has announced the launch of the ALIA project, the first European public, open and multilingual infrastructure which, thanks to the unique supercomputing capabilities of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS), reinforces the technological sovereignty of Spain and Europe in the development of a transparent, responsible artificial intelligence at the service of people.

ALIA is a pioneering initiative in the European Union to provide a public infrastructure of AI resources and innovative technological services, such as open language models, and to promote Spanish and the co-official languages (Catalan and Valencian, Basque and Galician) in the development and deployment of AI worldwide. The project is coordinated by BSC, with the promotion and leadership of the Secretary of State for Digitalisation and Artificial Intelligence (SEDIA) and the support of the Generalitat de Catalunya. It is also part of the Spanish Government's Artificial Intelligence Strategy 2024.

Public supercomputing to advance AI

ALIA is an open project distinguished by transparency and openness, intended to drive innovation and technology adoption while ensuring technological reliability and social and economic inclusion. The ALIA family of models is verified by the Spanish Artificial Intelligence Supervisory Agency (AESIA) and is aligned with the transparency standards established by the European Union's AI Regulation (the AI Act).

This pioneering initiative is 100% publicly funded to serve the public interest and democratise access to AI for citizens, public administration, universities and companies.

‘The ALIA project represents an extraordinary effort to provide us with our own data, language models and resources in the competitive environment of artificial intelligence. At its core, ALIA works with texts in more than 35 European languages, ensuring a representation of 20% for the languages of Spain, which makes it the AI system that best reflects our linguistic and cultural reality’, said Mateo Valero, Director of BSC.

According to the Catalan Minister for Research and Universities, Núria Montserrat, ‘with ALIA, we are taking a decisive step towards Europe's technological sovereignty. This model not only reinforces Catalonia's and Spain's leadership in artificial intelligence, but also provides us with multilingual and specialised resources that will be fundamental for the development of AI in key sectors for the society and economy of the future’.

A large language model trained on MareNostrum 5

Training and deploying generative AI requires enormous computational power. Training the ALIA family of models involves processing several billion words, which takes thousands of hours on MareNostrum 5, one of the most powerful supercomputers in the world, hosted and managed by BSC.

In this context, the President of the Spanish Government also announced the release of ALIA-40B, the most advanced public multilingual foundational model in Europe, with 40 billion parameters (billion in the American, short-scale sense: 40,000 million, or 40 x 10⁹). It has been trained for more than 8 months on MareNostrum 5 with 6.9 ‘European’ (long-scale) billion tokens, that is, 6.9 x 10¹², or 6.9 trillion in American usage, in 35 European languages. Tokens are the words or fragments of words that these systems process. The final version will be trained with up to 9.2 long-scale billion tokens (9.2 x 10¹²).
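The two senses of ‘billion’ used for the parameter and token counts are easy to mix up, so the following minimal Python sketch converts both to plain integers. The constant and function names are illustrative only, and the tokens-per-parameter ratio is a derived figure that the announcement itself does not state.

```python
# Short scale (American usage): 1 billion = 10**9
# Long scale (traditional European usage): 1 billion = 10**12
SHORT_BILLION = 10**9
LONG_BILLION = 10**12

# Figures quoted in the article, kept as integers to avoid rounding.
parameters = 40 * SHORT_BILLION              # 40 short-scale billion
tokens_current = 69 * LONG_BILLION // 10     # 6.9 long-scale billion
tokens_final = 92 * LONG_BILLION // 10       # 9.2 long-scale billion


def in_short_scale_billions(n: int) -> float:
    """Express a count in short-scale billions (units of 10**9)."""
    return n / SHORT_BILLION


# Derived figure (not from the article): training tokens per parameter.
tokens_per_parameter = tokens_current / parameters

print(f"parameters:      {parameters:,}")
print(f"tokens so far:   {tokens_current:,}")
print(f"tokens (final):  {tokens_final:,}")
print(f"tokens so far, short-scale billions: {in_short_scale_billions(tokens_current):,.0f}")
print(f"tokens per parameter so far: {tokens_per_parameter:.1f}")
```

On these figures, the current training run works out to roughly 172 tokens per parameter, rising to 230 if the final version reaches 9.2 x 10¹² tokens.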

‘The ALIA-40B model, with 40 billion parameters, represents a qualitative leap compared to its predecessor and is the first sovereign and public model of this magnitude developed in Europe, capable of generating specialised resources in areas of social and economic interest,’ added Valero.

The model's training corpus occupies 33 terabytes of storage, equivalent to 17 million books, 4.5 million high-resolution photos, or 6.6 million songs. These figures represent an important qualitative leap compared to its predecessor, the 7B model with 7 billion parameters, which was a milestone as the first model developed from scratch in Spain.
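As a rough sanity check on these comparisons, the sketch below works out the average file size each one implies. It assumes decimal units (1 terabyte = 10¹² bytes, 1 megabyte = 10⁶ bytes), which the announcement does not specify, and the names are illustrative.

```python
# Assumption: decimal units, since the article does not say which it uses.
TERABYTE = 10**12
MEGABYTE = 10**6

corpus_bytes = 33 * TERABYTE

# Equivalents quoted in the article: item -> number of items.
equivalents = {
    "book": 17_000_000,
    "high-resolution photo": 4_500_000,
    "song": 6_600_000,
}


def implied_size_mb(total_bytes: int, count: int) -> float:
    """Average size in megabytes of one item if `count` items fill `total_bytes`."""
    return total_bytes / count / MEGABYTE


implied = {
    item: implied_size_mb(corpus_bytes, count)
    for item, count in equivalents.items()
}

for item, megabytes in implied.items():
    print(f"{item}: about {megabytes:.2f} MB each")
```

Under that assumption the comparisons imply about 1.9 MB per book, 7.3 MB per photo and 5 MB per song.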

AINA and ILENIA, the precedents

The ALIA project began with the Language Technologies Plan in 2019. Projects such as AINA, promoted by the Government of Catalonia, and ILENIA, promoted by SEDIA, have laid the foundations for the construction of this public AI infrastructure. In the Spanish National Artificial Intelligence Strategy 2024, the implementation of the ALIA project is one of the key pillars for the creation of this public AI infrastructure in Spanish and co-official languages. Furthermore, ALIA is aligned with the European Union's Digital Decade programme, which guides Europe's digital transformation and its technological sovereignty.

This article was first published on 21 January by Barcelona Supercomputing Center.
