DeepTech — from PDF file to ecosystem graph

Structure hundreds of application files into queryable entities and relationships to drive research-based innovation — with traceability back to the source document.

Context

Bpifrance's DeepTech department supports research-driven innovation and needs to understand how startups emerge: laboratories of origin, intellectual property transfers, financing, collaborations and sectoral trajectories. These signals feed support schemes, activity reports and public policy decisions.

The corpus already exists (several years of application files written in natural language), but it remains largely locked in PDFs that are difficult to cross-reference, aggregate or search again in a reproducible way.

Challenge

The links between startups, research organizations, patents, schools and funding did not exist in any structured form: it was impossible to continuously quantify the share of projects attached to a large national organization, or to measure the weight of IP transfers in emergence pathways.

Each strategic question involved manual digging through a limited sample; cohort analyses took weeks, were not reproducible and did not provide an overall view of the French deep tech ecosystem.

Approach

Co-construction of a domain reference model (typologies of nodes and relationships: laboratories, companies, people, patents, financing, collaborations, etc.), iterated with the DeepTech team and captured in an AI Knowledge Vault reusable on other corpora.
Extraction by generative AI on all files: identification of entities and links, quality arbitration between models, then relational storage (nodes / edges) usable in SQL, dashboards and interactive graphs.
Evidence Panel: each relationship or attribute displayed is linked to the source sentences of the original file, so an answer given in committee is no longer an intuition but a justifiable claim.
Human-in-the-loop (HITL) review: experts validate, correct or delete questionable nodes and relationships; the graph is refined over successive reviews rather than remaining a black box.
Decision-oriented exploration: queries on cohorts, actor networks, serial entrepreneurs and sectoral correlations; generation of visualizations from natural language to speed up reporting.
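The steps above can be sketched as a small nodes / edges store in which every relationship carries its source sentence and a review status. This is a minimal illustration only: the table names, columns, relation label (`spun_out_of`), status values and sample rows are all hypothetical and do not describe the actual repository or pipeline.

```python
import sqlite3

# In-memory database for illustration; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,   -- e.g. startup, laboratory, patent
    name TEXT NOT NULL
);
CREATE TABLE edges (
    id INTEGER PRIMARY KEY,
    src INTEGER NOT NULL REFERENCES nodes(id),
    dst INTEGER NOT NULL REFERENCES nodes(id),
    relation TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'  -- pending | validated | rejected
);
CREATE TABLE evidence (
    edge_id INTEGER NOT NULL REFERENCES edges(id),
    document TEXT NOT NULL,
    sentence TEXT NOT NULL
);
""")

conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    (1, "startup", "Startup A"),
    (2, "startup", "Startup B"),
    (3, "laboratory", "Lab X"),
])
# Edge 11 was rejected by an expert during review.
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?, ?)", [
    (10, 1, 3, "spun_out_of", "validated"),
    (11, 2, 3, "spun_out_of", "rejected"),
])
conn.executemany("INSERT INTO evidence VALUES (?, ?, ?)", [
    (10, "file_A.pdf", "Startup A was founded by researchers from Lab X."),
    (11, "file_B.pdf", "Startup B met Lab X at a trade fair."),
])


def startups_linked_to_lab(conn):
    """Return (startup, lab, document, sentence) rows for non-rejected links."""
    return conn.execute("""
        SELECT s.name, l.name, ev.document, ev.sentence
        FROM edges e
        JOIN nodes s ON s.id = e.src AND s.kind = 'startup'
        JOIN nodes l ON l.id = e.dst AND l.kind = 'laboratory'
        JOIN evidence ev ON ev.edge_id = e.id
        WHERE e.relation = 'spun_out_of' AND e.status != 'rejected'
        ORDER BY s.name
    """).fetchall()


rows = startups_linked_to_lab(conn)
for startup, lab, document, sentence in rows:
    print(f"{startup} -> {lab} [{document}]: {sentence}")
```

Keeping evidence in its own table means a single relationship can cite several passages, and filtering on the review status lets expert corrections take effect in every downstream query without deleting the original extraction.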

Outcomes

Questions once inaccessible without weeks of manual reading become queryable across the entire corpus — with citation of source passages for each insight.
Emergence of quantitative indicators on the ecosystem (connections to large research organizations, distribution of IP links, weight of higher education in pathways) where only qualitative material existed.
Industrializable cohort and network analyses: same method, same scope, reproducible from one reporting exercise to another.
Extensible base: repository and pipeline reusable on other document sets (national programs, greentech, industry) and aligned with the internal data roadmap (“company” mapping).
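As an illustration of what a reproducible indicator can look like once relationships are structured, the sketch below computes the share of a cohort that has at least one relationship of a given type. The `Edge` class, the relation labels and the sample cohort are hypothetical; the real indicators and their definitions are not published on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str       # source entity, here a startup name
    dst: str       # target entity (organization, school, ...)
    relation: str  # hypothetical relation label


def share_with_relation(startups, edges, relation):
    """Fraction of startups having at least one edge of the given relation."""
    if not startups:
        return 0.0
    linked = {e.src for e in edges if e.relation == relation}
    return len(linked & set(startups)) / len(startups)


# Hypothetical cohort and extracted relationships.
startups = ["Startup A", "Startup B", "Startup C", "Startup D"]
edges = [
    Edge("Startup A", "Org X", "ip_transfer"),
    Edge("Startup B", "Org X", "ip_transfer"),
    Edge("Startup B", "Org Y", "ip_transfer"),
    Edge("Startup C", "School Z", "alumni_of"),
]

ip_share = share_with_relation(startups, edges, "ip_transfer")
print(f"Share of cohort with an IP transfer: {ip_share:.0%}")
```

Because the cohort and the relation type are explicit arguments, the same function can be rerun with the same scope from one reporting exercise to the next, and a startup with several matching relationships is counted only once.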

Public scope

Exact volumes of nodes and relationships, detailed repository typologies, precision metrics and carbon footprint are not published on this showcase page. Sample-based checks by business experts remain necessary to monitor drift at scale.

Case shared with client approval. Operational details, data and business parameters are not disclosed on this page.
