PIAF: a contributory project to build "question and answer" datasets to train French-speaking AI applications
Developing search engines, conversational agents or knowledge bases based on artificial intelligence, or improving their performance, requires training algorithms. Training these algorithms in turn requires high-quality datasets.
To date, there are no open question-answer datasets in French to train French-speaking AI applications.
Developing a large question-answer dataset, built from the start ("natively") in French rather than by automatically translating English-language datasets, is the challenge of the PIAF project: Pour des intelligences artificielles francophones ("For French-speaking artificial intelligences").
PIAF is partly based on a scientific question: does native French training data bring real added value compared with automatically translated data?
To answer this question, a scientific protocol has been developed, inspired by the "SQuAD" project conducted at Stanford University. Existing AI models will be trained on native French data (PIAF) and on data automatically translated from English, and their performance will then be compared.
As in many AI projects, a manual annotation phase is necessary to enable supervised learning.
To build the native French dataset, excerpts from French Wikipedia articles will be "annotated" on a platform. Annotation here consists of formulating a question whose answer is in the displayed paragraph, and locating that answer in the text.
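The annotation task described above can be pictured as a small data record. The sketch below is only an illustration: the field names (context, question, answer_text, answer_start) are modeled on the publicly documented SQuAD layout, and both the `Annotation` class and the `annotate` helper are hypothetical names, not the PIAF platform's actual format. The sample paragraph is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One question-answer pair tied to a paragraph (hypothetical structure)."""
    context: str        # paragraph shown to the annotator
    question: str       # question written by the annotator
    answer_text: str    # answer as it appears in the paragraph
    answer_start: int   # character offset of the answer in the paragraph

    def is_consistent(self) -> bool:
        # The stored offset must point at the exact answer text.
        end = self.answer_start + len(self.answer_text)
        return self.context[self.answer_start:end] == self.answer_text


def annotate(context: str, question: str, answer_text: str) -> Annotation:
    """Locate the answer in the paragraph and build the record."""
    start = context.find(answer_text)  # first occurrence only
    if start == -1:
        raise ValueError("the answer must appear verbatim in the paragraph")
    return Annotation(context, question, answer_text, start)


# Illustrative paragraph and question, not taken from the PIAF data.
example = annotate(
    context="La tour Eiffel est située à Paris.",
    question="Où se trouve la tour Eiffel ?",
    answer_text="Paris",
)
print(example.answer_start, example.is_consistent())
```

Storing the character offset alongside the answer text is what lets a model be trained to point at a span of the paragraph instead of generating free text.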
An initial database of 20,000 questions and answers
The first annotation phase aims to build a database of 20,000 question-answer pairs that will validate or invalidate the scientific hypothesis: "Do question-answer AIs perform better when trained on native French data?" Based on these results, a larger annotation phase will be opened to reach 100,000 question-answer pairs and structure an open French database.
A contributory and learning approach
Inspired by participatory science initiatives and contributory projects such as Common Voice, PIAF relies on voluntary contributions.
The first annotation stage, aimed at producing the 20,000 evaluation questions and answers, is carried out through "annotathons" (annotation sprints) open to all.
A first test annotathon was held on October 18, 2019. A second will be held on November 19.