New project makes Wikipedia data more accessible to AI

On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia’s wealth of knowledge more accessible to AI models.

Called the Wikidata Embedding Project, the system applies vector-based semantic search, a technique that helps computers understand the meaning of words and the relationships between them, to the existing data on Wikipedia and its sister platforms, which consists of nearly 120 million entries.
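
To give a rough sense of what vector-based semantic search means in practice, here is a minimal sketch in Python. It is not the project's code: the three entries, their three-dimensional vectors, and the query vector are invented for illustration, and the function names are hypothetical. Real systems use model-generated embeddings with hundreds of dimensions and a dedicated vector database.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real embeddings come from a trained model and are much larger.
ENTRIES = {
    "Marie Curie, physicist and chemist": [0.9, 0.1, 0.0],
    "Albert Einstein, theoretical physicist": [0.8, 0.2, 0.1],
    "Berlin, capital of Germany": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, entries, top_k=2):
    """Rank entries by similarity to the query vector and keep the top_k labels."""
    scored = [
        (cosine_similarity(query_vector, vector), label)
        for label, vector in entries.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:top_k]]


# Hypothetical vector standing in for the query "famous scientists".
results = semantic_search([0.85, 0.15, 0.05], ENTRIES)
print(results)
```

The query shares no keywords with any entry, yet the two scientists rank above the city because their vectors point in a similar direction. That is the core idea: closeness in vector space stands in for closeness in meaning.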

Combined with new support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources, the project makes the data more accessible to natural language queries from LLMs.
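
For readers curious what an MCP exchange looks like, the sketch below builds the kind of JSON-RPC 2.0 message an MCP client sends when it asks a server to run a tool. The `tools/call` method and the `name`/`arguments` parameters come from the MCP specification; the tool name `search_wikidata` and its `query` argument are assumptions made for illustration and are not taken from the project.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP-style JSON-RPC 2.0 request that asks a server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "search_wikidata" and "query" are hypothetical names, used only as an example.
request = build_tool_call(
    1, "search_wikidata", {"query": "physicists who won a Nobel Prize"}
)
print(json.dumps(request, indent=2))
```

Because the request is plain structured data with a natural-language query inside it, an LLM can produce one without knowing anything about how the underlying database is organized.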

The project was undertaken by Wikimedia’s German branch in collaboration with the neural search company Jina.AI and DataStax, a real-time training-data company owned by IBM.

Wikidata has offered machine-readable data from Wikimedia properties for years, but the preexisting tools only allowed for keyword searches and queries written in SPARQL, a specialized query language. The new system will work better with retrieval-augmented generation (RAG) systems that allow AI models to pull in external information, giving developers a chance to ground their models in knowledge verified by Wikipedia editors.
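
The grounding step in a RAG system can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the project's implementation: the function name is hypothetical, the single fact is a placeholder for whatever a retrieval step would return, and the prompt wording is one arbitrary choice among many.

```python
def build_grounded_prompt(question, retrieved_facts):
    """Combine retrieved facts and a user question into one prompt for a model."""
    context = "\n".join(f"- {fact}" for fact in retrieved_facts)
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}\n"
    )


# Placeholder for the output of a retrieval step such as a semantic search.
facts = ["Marie Curie was awarded the Nobel Prize in Physics in 1903."]

prompt = build_grounded_prompt(
    "When did Marie Curie win her first Nobel Prize?", facts
)
print(prompt)
```

The model then answers from the supplied facts rather than from whatever it memorized during training, which is why the quality of the retrieved data matters so much.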

New project makes Wikipedia data more accessible to AI | TechCrunch
