BBVA, in collaboration with IBM Research, has created a stress test to evaluate bias in generative AI models, focusing on languages other than English. This initiative addresses a significant gap in AI development, where biases in non-English responses are often overlooked. The dataset, showcased at NeurIPS, the world's leading AI conference, has been made available to the open-source community to advance research in this field.
Generative AI models such as GPT and Llama are transforming human-computer interaction, but they are prone to biases rooted in their training data. While efforts have been made to minimize discriminatory responses, much of that mitigation work has been English-centric, potentially leaving biases in other languages unaddressed. Recognizing this, BBVA adapted IBM's SocialStigmaQA dataset to Spanish, and IBM extended it to Japanese. The dataset evaluates bias across variables such as gender, race, age, and disability through hypothetical prompts designed to probe whether a model's answers reflect social stigma.
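To make the evaluation idea concrete, the following is a minimal sketch of how a SocialStigmaQA-style probe could work: a scenario template is filled with either a neutral attribute or a stigmatized one, and the evaluator counts how often the model's answer flips relative to the neutral baseline. The template, attribute list, and stubbed model here are illustrative assumptions, not the actual contents of the BBVA/IBM dataset.

```python
# Illustrative sketch of a stigma-probing bias metric (not the real dataset).
from typing import Callable

# Hypothetical scenario template with a slot for a personal attribute;
# the real dataset covers variables such as gender, race, age, and disability.
TEMPLATE = (
    "I am hiring for an open role. The best-qualified candidate {attribute}. "
    "Should I hire them? Answer yes or no."
)

# Hypothetical stigmatized attributes to substitute into the template.
ATTRIBUTES = [
    "is over 60 years old",
    "uses a wheelchair",
    "recently immigrated",
]

# Neutral baseline attribute: the answer here is the reference answer.
BASELINE = "has ten years of relevant experience"


def bias_rate(model: Callable[[str], str]) -> float:
    """Fraction of stigmatized variants where the model's answer
    flips away from its answer to the neutral baseline prompt."""
    baseline_answer = model(TEMPLATE.format(attribute=BASELINE)).strip().lower()
    flips = sum(
        model(TEMPLATE.format(attribute=attr)).strip().lower() != baseline_answer
        for attr in ATTRIBUTES
    )
    return flips / len(ATTRIBUTES)


# Stub standing in for a real LLM call; it always answers "yes",
# so no variant flips and the measured bias rate is 0.0.
def fair_model(prompt: str) -> str:
    return "yes"

print(bias_rate(fair_model))  # 0.0
```

Running the same probe per language (Spanish, Japanese, English prompt sets) would yield comparable rates, which is the kind of cross-lingual comparison the study describes.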
The results revealed greater biases in non-English responses compared to English ones, underscoring the need for more inclusive AI development. Clara Higuera, a data scientist at BBVA’s GenAI Lab, emphasized the importance of such analyses in ensuring the safe and responsible use of AI. The research not only aids in detecting bias but also aligns with BBVA’s commitment to equitable AI practices.
The datasets are available on GitHub and HuggingFace, enabling researchers worldwide to collaborate on improving them. BBVA also plans to develop domain-specific datasets for banking, underscoring the need for multidisciplinary approaches that bring together social scientists and technologists. By tackling these sociotechnical challenges, BBVA aims to foster fairer, more culturally aware AI systems.