On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia’s wealth of knowledge more accessible to AI models.
Called the Wikidata Embedding Project, the system applies vector-based semantic search (a technique that helps computers understand the meaning of words and the relationships between them) to the existing data on Wikipedia and its sister platforms, which comprises nearly 120 million entries.
Combined with new support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources, the project makes the data more accessible to natural language queries from LLMs.
The project was undertaken by Wikimedia’s German branch in collaboration with the neural search company Jina.AI and DataStax, a real-time training-data company owned by IBM.
Wikidata has offered machine-readable data from Wikimedia properties for years, but the preexisting tools only allowed for keyword searches and queries written in SPARQL, a specialized query language. The new system will work better with retrieval-augmented generation (RAG) systems that allow AI models to pull in external information, giving developers a chance to ground their models in knowledge verified by Wikipedia editors.
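
To make the difference between keyword search and vector-based semantic search concrete, here is a minimal Python sketch. It is not the project's code: the three entries, their three-dimensional vectors, and the query vector are invented for illustration, and a real system would get its vectors from an embedding model and store them in a vector database.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# In a real system these vectors would come from an embedding model.
ENTRIES = {
    "Marie Curie, physicist and chemist": [0.9, 0.1, 0.0],
    "Albert Einstein, theoretical physicist": [0.8, 0.2, 0.1],
    "Eiffel Tower, landmark in Paris": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of the query "famous scientists".
QUERY_VECTOR = [0.85, 0.15, 0.05]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_search(query, entries):
    """Return the labels that literally contain any word of the query."""
    terms = query.lower().split()
    return [label for label in entries
            if any(term in label.lower() for term in terms)]


def semantic_search(query_vector, entries, top_k=2):
    """Return the top_k labels whose vectors are closest to the query vector."""
    ranked = sorted(
        entries,
        key=lambda label: cosine_similarity(query_vector, entries[label]),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    print(keyword_search("famous scientists", ENTRIES))
    print(semantic_search(QUERY_VECTOR, ENTRIES))
```

In this toy setup, the keyword search finds nothing because no entry contains the words "famous" or "scientists", while the vector comparison ranks the two physicists ahead of the landmark. A RAG pipeline would pass such top-ranked entries to the model as grounding context.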







