Submitted by coconautico t3_11c1hzc in MachineLearning
Hey Reddit,
tl;dr: To democratize the technology behind virtual assistants, we can play a Q&A game to build a collaborative dataset that will enable the creation of culturally and politically unbiased virtual assistants.
As AI becomes more ubiquitous in our lives, we need to democratize it and ensure that the next generation of virtual assistants, such as ChatGPT or Bing Chat, is not controlled solely by one company, group, or country. Such control would make it easier to skew our reality by deploying politically and culturally biased assistants at large scale, as we have seen with OpenAI.
While one could argue that over time companies and startups will emerge and create their own alternatives, there may be only a few of them. Creating such virtual assistants is not only a matter of massive raw data and computation; it also requires very specific datasets (many of them created by experts from multiple fields) for "fine-tuning" Large Language Models (LLMs) into virtual assistants.
Because of this, there is an international collaborative effort to create a public, multilingual, high-quality dataset through a Q&A game that will enable the creation of other virtual assistants outside the control of these companies.
Right now, we already have more data than OpenAI had when it launched its first version of ChatGPT. However, the current dataset is strongly biased towards Spanish and English speakers, as they are the only ones who have contributed to it so far. Therefore, we need to encourage people from other countries and cultures to play this Q&A game in order to create a truly multilingual dataset with expert knowledge of all kinds, from all over the world. (This would even allow the virtual assistant to answer questions that have not yet been answered in the user's own language.)
For Spanish and English, this is already a reality. Let's make it a reality for other languages too by writing a few questions and answers in the OpenAssistant game!
visarga t1_ja2r2fe wrote
Wouldn't it be better if people could donate their interactions with ChatGPT, Bing Chat and other models? Make a scraping extension that collects chat logs and anonymises them. Then you'd get a diverse distribution of real-life tasks.
I suspect this is the reason OpenAI and Bing offered their models for free to the public - to find the real distribution of tasks people want to solve with AI bots.