Search Results for “synthetic data” – Smals Research

Software testing

Jean-Bernard Demorcy — Tue, 09 Sep 2025 07:07:12 +0000

Test planning & management

The process of planning, estimating, monitoring, and controlling test activities, documented in a (risk‑based) test plan, strategy or policy, to achieve defined quality objectives within the project’s constraints of scope, time, and resources.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Non-Existent	There is no AI assistance, automation, or data integration of any kind. Corporate guidelines or governance for AI‑enabled workflows and tool usage are absent. Test policy, strategy, plan creation, estimation, progress tracking and reporting are fully manual; no AI evaluates readability, maintainability, or explainability of artefacts.	None	None
1	One-Off Assist	Test managers/coordinators occasionally ask an LLM for draft strategy text, workload estimates, or risk heat-maps and paste the results into documents. Nothing is version-controlled; results vary widely between individuals and are difficult to reproduce or scale.	Natural-language draft generation with off-the-shelf LLM & prompt engineering	LLM chatbots / answer engines: M365 Copilot, ChatGPT (OpenAI), Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Perplexity (Perplexity AI), …
2	Integrated Assist	AI is embedded in the QA tooling and processes, providing suggestions for test-policy clauses, strategy sections, resource/timeline forecasts, and risk heat-maps. Artefacts are version-controlled with project deliverables. AI flags readability, maintainability, and explainability issues.	AI agents / autonomous agents + LLMOps (prompt / template management, deployment, guardrails)	User-customised AI agents / personalised AI tools: custom GPTs on ChatGPT, Gems on Google Gemini, Artifacts on Claude, Spaces on Perplexity, …
3	AI-Human Collaboration	AI agents act as junior test managers, digesting code, requirements, and trends to suggest strategies, scope, and team updates. Every recommendation is traceable, explainable, and subject to human review.	Agentic frameworks, advanced RAG orchestration, LLMOps (prompt / template management, deployment, guardrails), deep research	Agentic frameworks: LangGraph as orchestrator, vector DBs as knowledge, prompts and a model for interaction; Deep research: Deep Research on ChatGPT/Gemini/Claude using a reasoning model, Perplexity Labs, Manus, …
4	Full Autonomy	Autonomous agents/AI systems create and update test policy, strategies and plans from live data. Projects are managed dynamically, and AI handles scope, milestones, and KPIs. Human involvement is confined to strategic governance. On demand, the autonomous AI must supply a transparent, traceable explanation of its actions, input data, and decision rationale.	Autonomous agents, causal-inference models, continual learning, LLMOps pipelines	End-to-end QA orchestrators (the author knows of no tools that operate at this level)

AI Maturity Level: Indicates the level the technology vendors claim to have reached in deploying AI solutions that actually work in real-world applications.

Test analysis & design

The process of analysing the test basis and transforming it into test conditions, test cases, and test data using appropriate test design techniques to achieve required coverage and mitigate quality risks.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Non-Existent	All analysis and design tasks are fully manual. No AI, automation, or review for quality attributes like readability or explainability.	None	None
1	One-Off Assist	Test engineers use LLMs ad hoc to draft test cases or choose techniques. Prompts vary by user with no standards, reuse, or traceability. Results are inconsistent and unscalable.	Off-the-shelf LLM & prompt engineering	LLM chatbots / answer engines: M365 Copilot, ChatGPT (OpenAI), Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Perplexity (Perplexity AI), …
2	Integrated Assist	AI assists with static oracle checks, structured case generation, and artefact review for quality. Work aligns with prompt standards and a feedback loop. Task support is nearly complete, enabling near end-to-end coverage with minimal manual effort. The AI system can review human-created test artefacts for correctness, completeness, readability, maintainability, and explainability, flagging gaps or duplicates before peer review.	AI agents / autonomous agents, LLMOps (prompt / template management, deployment, guardrails)	Agentic frameworks: LangGraph as orchestrator, vector DBs as knowledge, prompts and a model for interaction; User-customised AI agents / personalised AI tools: custom GPTs on ChatGPT, Gems on Google Gemini, Artifacts on Claude, Spaces on Perplexity, …
3	AI-Human Collaboration	AI acts as a junior test analyst, analysing multimodal input and past defects to refine test oracles, recommend techniques, and generate test artefacts, including a transparent explanation of its reasoning, while a human overseer guides and refines its output.	Agentic frameworks, advanced RAG orchestration, LLMOps (prompt / template management, deployment, guardrails), code & UI embeddings	User-customised AI agents / personalised AI tools: custom GPTs on ChatGPT, Gems on Google Gemini, Artifacts on Claude, Spaces on Perplexity, …; Off-the-shelf AI tools***: Testcraft, AskUI, Kusho.AI, …
4	Full Autonomy	Autonomous AI designs and maintains test suites, detects oracle issues, and regenerates impacted assets when input changes. Human involvement is confined to strategic governance, and all actions are explainable.	Autonomous agents, model-based testing, continual learning pipelines	End-to-end test-design orchestrators (the author knows of no tools that operate at this level)

*** Although many tools claim to operate at AI Maturity Level 3, these claims are often exaggerated. Such tools typically require significant manual effort, lack true context awareness, and rely heavily on marketing buzzwords like “self-healing tests,” “autonomous agents,” “AI-driven quality,” “zero-touch automation,” “intelligent test orchestration,” and “continuous risk-based optimization.” In practice, most of these tools work more like Level 2: they help people but do not really work alongside them, and they still need detailed prompts, human guidance, and corrections to get good results. That said, some tools are starting to explore real Level 3 features, and early versions show potential. Progress is slow but steady, driven by better context awareness and more independence.


Test implementation, automation & test data generation

The phases of finalising testware by developing, maintaining, and automating executable test scripts, harnesses, and representative test data to enable efficient, repeatable, and scalable test execution.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Non-Existent	All test scripts and data are created and maintained manually; no AI assistance is used.	None	None
1	One-Off Assist	Engineers prompt an LLM to generate a skeleton script, a SQL dataset, or a simple page-object and then refine it manually. There are no corporate guidelines, shared prompt libraries, or optimisation practices, so results vary widely between individuals and are difficult to reproduce or scale.	Off-the-shelf LLM & prompt engineering, code completers in IDEs	LLM chatbots / answer engines: M365 Copilot, ChatGPT (OpenAI), Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Perplexity (Perplexity AI), …; IDEs with AI capabilities: Cursor, Junie, GitHub Copilot, …; CLI code assistants: Codex (OpenAI), Gemini CLI (Google), Claude Code (Anthropic), …
2	Integrated Assist	AI in IDEs/frameworks generates maintainable code and test data, and converts test cases to scripts based on human input. Integrated prompt standards and feedback loops ensure consistent, scalable results.	LLM + code embeddings, test-data synthesis libs, MCP, AI agents / autonomous agents, LLMOps (prompt / template management, deployment, guardrails)	User-customised AI agents / personalised AI tools: ChatGPT (OpenAI), Gemini / Google AI Studio (Google), Claude, Perplexity, …; IDEs with AI capabilities: Cursor, Junie, GitHub Copilot, …; CLI code assistants: Codex (OpenAI), Gemini CLI (Google), Claude Code (Anthropic), …; MCPs: Playwright, …; Off-the-shelf tools: Testcraft
3	AI-Human Collaboration	AI agent(s)/system(s) acts as an entry-level test automator. The AI system is fully context-aware of the project: it implements tests for new requirements, refactors and optimises the automation suite, synthesises sophisticated test data (synthetic or privacy-masked), and flags redundant scripts, always with human experts supervising and validating its output.	Agentic frameworks, advanced RAG orchestration, LLMOps (prompt / template management, deployment, guardrails), MCP	Agentic frameworks: LangGraph as orchestrator, vector DBs as knowledge, prompts and a model for interaction, Microsoft Copilot Studio, OpenAI Agent Builder, Claude Skills (Anthropic), Claude Cowork (Anthropic), …; IDEs with AI capabilities: Cursor, Windsurf, Junie, GitHub Copilot, …; CLI code assistants: Codex (OpenAI), Gemini CLI (Google), Claude Code (Anthropic), …; MCPs: Playwright, …; Off-the-shelf tools***: Cypress cy.prompt, coTestPilot for testers, coTestPilot for developers, TestZeus-Hercules, Magic Inspector, Wopee, Katalon, Applitools, UiPath, Testers.ai, Playwright agents (Microsoft)
4	Full Autonomy	Autonomous AI maintains scripts and data, migrates frameworks, manages test infrastructure, generates mocks/stubs, and runs test sets unsupervised. Human involvement is confined to strategic governance. On demand, the autonomous AI must supply a transparent, traceable explanation of its actions, input data, and decision rationale.	Autonomous agents, self-healing AI, continual learning pipelines	End-to-end automation orchestrators (the author knows of no tools that operate at this level)


Test execution

The activity of running test suites (groups/folders/sets of test cases/scenarios/scripts), comparing actual and expected outcomes, logging incidents, and collecting metrics in the designated environment.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Non-Existent	All tests are executed manually or with conventional automation, without AI support or AI-enhanced automation; results are logged by hand.	None	None
1	One-Off Assist	Testers occasionally use an LLM to auto-generate a command line or interpret a log snippet to speed up manual execution. Results vary widely between individuals and are difficult to reproduce or scale.	Off-the-shelf LLM & prompt engineering, MCPs on off-the-shelf LLMs	LLM chatbots / answer engines: M365 Copilot, ChatGPT (OpenAI), Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Perplexity (Perplexity AI), …; IDEs with AI capabilities: Cursor, Junie, GitHub Copilot, …
2	Integrated Assist	AI is built into execution frameworks or processes to schedule suites, classify failures in real-time dashboards, and execute tests based on high-level natural-language descriptions.	LLM + code embeddings, test-data synthesis libs, MCP, AI agents / autonomous agents, LLMOps (prompt / template management, deployment, guardrails)	User-customised AI agents / personalised AI tools: ChatGPT, Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Perplexity (Perplexity AI), …; IDEs with AI capabilities: Cursor, Junie, GitHub Copilot, Antigravity (Google), …; CLI code assistants: Codex (OpenAI), Gemini CLI (Google), Claude Code (Anthropic), …; MCPs: Playwright MCP, Selenium MCP, Appium gestures MCP, …
3	AI-Human Collaboration	AI agent(s)/system(s) acts as an entry-level tester that starts and monitors live runs, predicts remaining duration, suggests selective re-runs, applies self-healing, and surfaces likely root causes for failed steps, while a human overseer guides and refines its output. It can also execute exploratory test flows based on high-level natural-language quality requests, delivering summarised findings for human validation.	Agentic frameworks, orchestration, LLM + code embeddings, test-data synthesis libs, MCP, AI agents / autonomous agents, LLMOps (prompt / template management, deployment, guardrails)	Agentic frameworks: LangGraph as orchestrator, vector DBs as knowledge, prompts and a model for interaction, Microsoft Copilot Studio, OpenAI Agent Builder, Claude Skills (Anthropic), Claude Cowork (Anthropic), …; IDEs with AI capabilities: Cursor, Junie, GitHub Copilot, Antigravity (Google), …; CLI code assistants: Codex (OpenAI), Gemini CLI (Google), Claude Code (Anthropic), …; MCPs: Playwright, …; Off-the-shelf tools***: coTestPilot for testers, coTestPilot for developers, TestZeus-Hercules, Magic Inspector, Wopee, Katalon, Applitools, UiPath, Testim, testers.ai, Playwright agents (Microsoft), …
4	Full Autonomy	Autonomous execution agents provision environments, orchestrate parallelisation, self-heal UI/API/… tests, and run canary, chaos and other unsupervised experiments, continually optimising coverage, cost, and risk without hands-on support. They also execute tests on the items under test, and can autonomously detect the need for and execute functional and non-functional exploratory test flows from high-level natural-language quality objectives. Human interaction is limited to high-level goal setting and periodic governance reviews, though the system remains available for on-demand unsupervised natural-language test runs. On demand, the autonomous AI must supply a transparent, traceable explanation of its actions, input data, and decision rationale.	Autonomous agents, reinforcement scheduling, chaos-engineering AI	End-to-end execution orchestrators (the author knows of no tools that operate at this level)


Evaluating exit criteria & reporting

The activity of comparing actual test results and coverage to predefined exit criteria and producing concise, meaningful reports for stakeholders on product quality and residual risk.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Non-Existent	Exit criteria are evaluated manually. Reports are crafted by hand with no AI support. No quality analysis assistance is present.	None	None
1	One-Off Assist	Test analysts prompt an LLM with natural-language summary generation capabilities to summarise results into prose or create a simple chart for a one-time release note. Prompts are improvised, with no standards or reusability.	Off-the-shelf LLM & prompt engineering, MCPs on off-the-shelf LLMs, code completers in IDEs	LLM chatbots / answer engines: M365 Copilot, ChatGPT (OpenAI), Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Perplexity (Perplexity AI), …; IDEs with AI capabilities: Cursor, Junie, GitHub Copilot, …
2	Integrated Assist	Standardised AI tasks are available and AI is integrated into tools to assist with KPI aggregation, script documentation, case-to-script mapping, and readiness scoring. Outputs are consistent and versioned.	LLM + code embeddings, test-data synthesis libs, MCP, AI agents / autonomous agents, LLMOps (prompt / template management, deployment, guardrails), BI tooling, vector DB	User-customised AI agents / personalised AI tools: ChatGPT, Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Perplexity (Perplexity AI), …; IDEs with AI capabilities: Cursor, Junie, GitHub Copilot, …; CLI code assistants: Codex (OpenAI), Gemini CLI (Google), Claude Code (Anthropic), …
3	AI-Human Collaboration	AI agent(s)/system(s) acts as an entry-level functional reviewer, performing context-aware end-to-end technical or functional reviews of an item under test while a human overseer guides and refines its output. It can also generate stakeholder-specific narrative reports and offer interactive Q&A on QA.	Agentic dashboards, causal analytics, agentic frameworks, orchestration, LLM + code embeddings, test-data synthesis libs, MCP, AI agents / autonomous agents, LLMOps (prompt / template management, deployment, guardrails)	Agentic frameworks: LangGraph as orchestrator, vector DBs as knowledge, prompts and a model for interaction; IDEs with AI capabilities: Cursor, Junie, GitHub Copilot, …; CLI code assistants: Codex (OpenAI), Gemini CLI (Google), Claude Code (Anthropic), …; Off-the-shelf tools***: Katalon, Applitools, UiPath, Testim, testers.ai, Playwright agents (Microsoft), …
4	Intelligent Automation	An autonomous quality-governance agent continuously evaluates exit criteria and reports via live data, launching extra tests when needed. Human involvement is confined to strategic governance. On demand, the autonomous AI must supply a transparent, traceable explanation of its actions, input data, and decision rationale.	Autonomous agents, MLOps/CD integration, real-time data streams	End-to-end quality governance platforms (the author knows of no tools that operate at this level)


Test control

The activity (in ISTQB terms) of comparing actual progress with planned progress, analysing variances, and taking corrective actions to meet test objectives.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Non-Existent	All test control is manual with no AI or automation. Variances are tracked by hand, with no predictive insights or decision support.	None	None
1	One-Off Assist	Test coordinators sporadically prompt an LLM to estimate remaining effort and surface process gaps or test debt.	Off-the-shelf LLM & prompt engineering, MCPs on off-the-shelf LLMs, code completers in IDEs	LLM chatbots / answer engines: M365 Copilot, ChatGPT (OpenAI), Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Perplexity (Perplexity AI), …
2	Integrated Assist	Specifically for this subdomain, an AI is embedded in the test process to predict schedule slippage and KPI drift, and to recommend minor scope changes or scope re-balancing. Tasks are standardised and outputs are consistent.	LLM + code embeddings, test-data synthesis libs, MCP, AI agents / autonomous agents, LLMOps (prompt / template management, deployment, guardrails), vector DBs, BI tooling	User-customised AI agents / personalised AI tools: ChatGPT (OpenAI), Gemini / Google AI Studio (Google), Claude Desktop (Anthropic), Spaces on Perplexity, …, NotebookLM
3	AI-Human Collaboration	AI agent(s)/system(s) acts as an entry-level test coordinator who performs continuous what-if analysis, correlates business impact, and proposes corrective test actions with explanations. It answers complex test control queries. Humans oversee and validate.	Causal inference models, simulation, agentic planners, LLM + code embeddings, test-data synthesis libs, LLMOps (prompt / template management, deployment, guardrails), MCP, vector DBs, BI tooling	Agentic frameworks: LangGraph as orchestrator, vector DBs as knowledge, prompts and a model for interaction, …, Microsoft Copilot Studio, OpenAI Agent Builder, Claude Skills (Anthropic), Claude Cowork (Anthropic), …; Off-the-shelf tools***: Playwright agents (Microsoft)
4	Full Autonomy	An autonomous agent simulates, forecasts, adjusts scope/staffing, and answers analytic queries. Human input is strategic only, and outputs are explainable.	Autonomous agents, reinforcement scheduling, closed-loop governance	End-to-end adaptive QA orchestrators (the author knows of no tools that operate at this level)


Communication / Marketing

Jean-Bernard Demorcy — Tue, 09 Sep 2025 06:55:30 +0000

AI copywriting

AI copywriting generates written marketing and online content, such as news, alerts, blog posts, UX microcopy and social media updates.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Absent	Entirely manual writing. The human is responsible for every word, from research to proofreading.	None	Word processors (Word, Pages, LibreOffice Writer)
1	Writing assistance	The tool offers basic suggestions: grammar and spelling correction and synonyms. It does not generate content but improves existing text.	Rule engines, basic language models	Antidote, Grammarly, LanguageTool
2	Short-form content generator	The AI generates paragraphs or short texts (UX microcopy, descriptions, slogans) from a prompt. The tone is generic and the content often requires human review.	Standard LLMs, simple prompt engineering	Basic versions of Jasper, Copy.ai, Rytr, Frontitude, ChatGPT, Anthropic Claude, Google Gemini, Microsoft Copilot, Mistral
3	Specialized writer	The AI can generate long-form content (blog posts, marketing emails, news) by adopting a specific tone and style (brand voice). It can integrate external information to improve relevance (RAG) and optimize content for a given channel.	Advanced LLMs (GPT-4, Claude 3), RAG, Fine-tuning, advanced templates	Jasper (Brand Voice mode), Copy.ai (workflows), TextCortex, Writesonic, Perplexity AI, ChatGPT, Anthropic Claude, Google Gemini, Microsoft Copilot, Mistral (*)
4	Content orchestration	Multiple specialized AI agents (e.g. strategy, writing, SEO) coordinate to execute complex content workflows end-to-end. They can plan an editorial calendar, identify relevant topics by analyzing trends, generate the content, optimize it (SEO), publish it. The system acts autonomously, requiring human oversight only at strategic checkpoints.	AI agent frameworks, multi-agent systems, real-time data integrations	NoimosAI, Copy.ai (Circuits), Jasper (optimization agents), Gumloop.

(*) 100% accuracy is not guaranteed

AI Maturity Level: Indicates the level the technology vendors claim to have reached in deploying AI solutions that actually work in real-world applications.

AI translation assistants

AI Translation assistants help users to translate text and speech, offering features like real-time translation, contextual suggestions and quality checks.

Maturity levels

Level	Name	Description	Technology	Example Tools
0	Absent	Entirely human translation.	None	Paper dictionaries
1	Statistical translation	Word-for-word or phrase-based translation based on statistical models. Grammar and context are often incorrect.	SMT (Statistical Machine Translation)	Largely obsolete technology, the ancestor of Google Translate
2	Neural translation	The AI translates complete sentences, taking the immediate context into account. The translation is fluent and grammatically correct in most cases.	NMT (Neural Machine Translation)	DeepL (basic edition), Google Translate, Microsoft Translator, Reverso, Weglot
3	Contextual and specialized translation	The AI adapts the translation to the domain (legal, medical), tone (formal/informal) and style. It can integrate custom glossaries and preserve document formatting.	NMT with context, fine-tuning on specific domains, adaptive models	DeepL, ModernMT, Reverso Context, Linguee, eTranslation (provided by the European Commission), Tilde MT, specialised versions of Google/Microsoft/AWS Translate
4	Universal interpreter	The AI agent provides real-time, bidirectional voice translation with highly expressive audio. It seamlessly adapts to conversational flow, interruptions and emotional tone.	Multimodal native models (audio-to-audio processing without text bottleneck), cross-cultural reasoning	OpenAI Advanced Voice Mode, Gemini Live, Meta SeamlessM4T.


AI video generator

An AI video generator automatically creates videos from various inputs such as text, images or audio.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Absent	Entirely manual video production (filming, editing, special effects).	None	Adobe Premiere Pro, Final Cut Pro
1	Assisted editing	The tool automates simple tasks like creating animated slideshows from images, transcribing subtitles or basic template edits.	Pre-defined templates, Automatic speech recognition (ASR)	Canva, Animoto, CapCut AutoCut
2	Avatar or clip-based generation	The AI generates simple videos, often by animating an avatar that speaks a given text, or by assembling stock video clips based on a script. The quality is often synthetic.	Lip-sync, Text-to-speech (TTS), clip assembly	Synthesia, HeyGen, Pictory, Deepbrain AI, Colossyan, D-ID, InVideo AI, Lumen5
3	Cinematic generation (text-to-video)	The AI generates photorealistic scenes from simple prompts or uploaded reference images. Models now handle complex physical motion, lighting and realistic human rendering.	Advanced video diffusion models, implicit 3D modeling	OpenAI Sora, Pika 2.0, Runway Gen-3, Luma AI, Google Veo 2, Lights Camera Action (LCA), Kling AI 1.6
4	Autonomous director	The AI handles entire multi-shot productions. It ensures character consistency across scenes, generates cohesive dialogue and soundtracks, and handles narrative pacing.	Unified multimodal generation (video, sound, dialogue), hierarchical AI agents	LTX Studio, Showrunner, Crreo AI (transitioning from R&D to early commercial access)


AI image generator

An AI image generator is software that creates new images from various inputs, such as text descriptions, existing photos or other data.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Absent	Manual creation of images (drawing, photography, graphic design).	None	Adobe Photoshop, Illustrator, Cameras
1	Filters and simple edits	The AI applies stylized filters (e.g., “Van Gogh style”) or performs simple edits (red-eye removal) on an existing image.	Convolutional neural networks (CNNs), style transfer models	Instagram filters, Prisma, basic “Magic eraser” tools
2	Simple generation (text-to-image)	The tool generates a complete image from a simple prompt. The image is coherent, but control over details (e.g., number of fingers) is weak.	Early diffusion models, Generative adversarial networks (GANs)	DALL-E 2, older Stable Diffusion versions
3	Advanced and controlled generation	The AI creates highly detailed, photorealistic images from complex prompts. It offers fine control over composition, accurate hand/anatomy rendering, and highly accurate typography generation directly within the image.	Advanced diffusion models, character consistency models, Conditional ControlNets	Midjourney, DALL-E 3, Stable Diffusion XL, Adobe Firefly, Ideogram 2.0, Flux.1, Leonardo.ai
4	Autonomous art-director	The AI agent designs a complete visual identity for a brand or campaign from a brief. It generates all necessary visuals (logos, banners, illustrations) in a consistent style and adapts them to different formats autonomously.	AI agent frameworks, real-time fine-tuning, reasoning on abstract concepts	Canva Magic Studio (Agentic features), custom enterprise R&D.


AI SEO assistants

AI SEO assistants help webmasters/copywriters to improve their search engine rankings by automating tasks, analyzing data and providing actionable insights for content and technical optimisation.

Maturity levels

Level	Name	Description	Technology	Example tools
0	Absent	Entirely manual SEO analysis and optimization (keyword research, competitor analysis, writing).	None	Google Search, spreadsheets (Excel)
1	Keyword analysis	The tool suggests relevant keywords based on a starting topic and provides simple metrics (search volume, difficulty).	Statistical analysis, keyword databases	Google Keyword Planner, Ubersuggest
2	Content optimizer	The AI analyses the content of top-ranking pages and provides a real-time optimisation score. It recommends keywords to include, a heading structure and text length, including basic LLM optimizations.	Natural Language Processing (NLP), TF-IDF analysis, SERP analysis	SurferSEO, MarketMuse, Clearscope
3	Integrated strategic assistant	The tool combines keyword research, competitive analysis, content brief generation, AI-assisted writing and on-page optimisation into a unified workflow, integrating advanced LLM optimization. It can suggest internal linking strategies.	LLMs, RAG on SERP data, Topic clustering	Ahrefs AI Content Helper & Brand radar, SurferSEO (with AI writing), GrowthBar, Frase.io, Semrush Copilot, ContentShake AI, SEOpital, Jasper (*)
4	Autonomous SEO strategist	A single intelligent system where agentic capabilities are embedded at every stage. It autonomously identifies gaps, researches topics, executes optimizations and monitors performance.	Multi-agent frameworks, deep CMS integration, automated monitoring	NoimosAI, Averi AI, custom agents built via Lindy

(*) 100% accuracy is not guaranteed


PII Filtering – door *** uit *

Joachim Ganseman — Mon, 28 Oct 2024 15:37:54 +0000

This article is also available in French.

The popularity of AI applications with a chat interface brings an old sore point back to the surface: how do we protect personal data that users share, often unsuspectingly, with an automated system via chat? By extension, this question arises for every application in which personal data has to be shared with third parties. The external dependencies of an application can, however, form a complicated tangle. Nor is it always possible (or economically feasible) to avoid the major players in cloud and AI services, at least not if you want to keep up with the latest capabilities in a cost-efficient way.

One possible solution is known as PII filtering. PII is the English acronym for Personal(ly) Identifiable/Identifying Information, i.e. the information by which someone can be identified. The idea is simple enough: we place an extra filter in front of the application, which filters the personal data out of the input before that input is passed on to the application. If that works well, then in principle it no longer matters what the application does with the data behind the scenes.
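To make the "filter in front of the application" idea concrete, here is a minimal sketch in Python. All names in it (`with_pii_filter`, `toy_redact`, `toy_chatbot`) are hypothetical and exist only for illustration; a real redactor would be far more thorough than the placeholder used here.

```python
from typing import Callable

Redactor = Callable[[str], str]
Application = Callable[[str], str]


def with_pii_filter(application: Application, redact: Redactor) -> Application:
    """Wrap an application so that it only ever receives redacted input."""

    def filtered(user_input: str) -> str:
        # The filter runs first; the raw input never reaches the application.
        return application(redact(user_input))

    return filtered


def toy_redact(text: str) -> str:
    # Placeholder redactor (assumption): masks one hard-coded name.
    return text.replace("Alice", "<NAME>")


def toy_chatbot(prompt: str) -> str:
    # Placeholder for an external chat or cloud service (assumption).
    return f"received: {prompt}"


safe_chatbot = with_pii_filter(toy_chatbot, toy_redact)
print(safe_chatbot("Hello, I am Alice."))
```

Because the wrapper only depends on two callables, the redaction logic can be swapped out without touching the application behind it.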

PII vs. Personal Data

First of all, it is crucial to understand that “PII” cannot be equated with “Personal Data” as defined by the GDPR and other European legislation. PII is a concept rooted in American legislation. It usually refers to a finite set of identifiers that can be used to distinguish or trace an individual's identity, such as national identification numbers, addresses and telephone numbers. American regulation is often prescriptive in this respect: HIPAA (privacy legislation on health data), for example, contains a list of 18 identifiers that are defined as PII. The big advantage is that this is relatively easy to implement: once every item on the list can be ticked off, little legal discussion remains.

By contrast, the European GDPR (AVG in Dutch) takes a principle-based approach: it defines a broader concept of Personal Data, which covers “any information relating to an identified or identifiable natural person”. This means that even seemingly innocuous information, such as the colour “red”, can count as personal data if, for example, it refers to someone's favourite colour. This context-dependent definition, however, also makes it almost impossible to develop generic, universally deployable detectors or filters for personal data. Whether something counts as personal data has to be assessed case by case. Not only are developers confronted with more custom work than they would like; lawyers, DPOs and data protection authorities in every EU country also have their hands full with such assessments.

PII filtering solutions that are adequate for use in the US therefore always run the risk of only partially meeting the requirements in the EU. However, because the term PII seems to have become generally accepted in the global market, we use only the term PII in the rest of this article. Do keep in mind at all times that Personal Data must always be the starting point in an EU context.

PII Detection and Filtering

To filter textual input, we typically use pattern-recognition techniques and Natural Language Processing (NLP) models. These scan unstructured data for patterns such as e-mail formats or numeric sequences resembling national register numbers or telephone numbers, so that they can subsequently be redacted or anonymised. In addition, custom regex patterns are often added to recognise forms of sensitive information that are specific to the application in question.
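As a minimal sketch of the pattern-based part of such a filter, the Python snippet below replaces matches with a label. The three regular expressions and the names `PATTERNS` and `redact` are hypothetical and chosen for illustration only; they are far too narrow for real-world use and say nothing about how the tools discussed in this article work internally.

```python
import re

# Hypothetical, deliberately simple patterns; real filters need broader, tested rules.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Belgian international format: +32 followed by 8 or 9 digits, optionally separated.
    "PHONE_BE": re.compile(r"\+32[ ]?\d(?:[ ./]?\d){7,8}"),
    # Strings shaped like a national register number, e.g. 83.06.21-123-62.
    "NATIONAL_NUMBER": re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}[.-]\d{2}\b"),
}


def redact(text):
    """Replace every match of every pattern with its label in angle brackets."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


sample = "Mail jan@example.org or call +32 470 12 34 56, ID 83.06.21-123-62."
print(redact(sample))
```

Names, professions or addresses cannot be caught this way, which is where NER models come in.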

PII filtering based on NER in Dutch. Source: pii-filter library (c) “HabaneroCake”, MIT license

Effective PII filtering relies heavily on Named Entity Recognition (NER), an NLP method that identifies entities such as names, dates and locations in a text. We have published on this in more detail before (see these articles on NLP and NER). The rise of generative AI has not yet changed much about how NER techniques are set up. Even today, many PII filtering tools rely under the hood on mature NLP toolkits such as NLTK, spaCy or Flair.

However, PII can also show up in images: scans of documents, photos of faces or licence plates, and so on. Filtering it out requires a more advanced approach, because the sensitive data can appear in many forms, from handwritten notes to reflections in photos. Optical Character Recognition (OCR) is used to extract text from images and convert it into a format that can be analysed in the same way as textual data. Once extracted, the text goes through the same PII filtering process using NLP techniques. Where the image itself contains sensitive visual elements (such as faces or personal documents), image-recognition algorithms are used to detect such content.

Once PII has been identified, you must decide what to do with it. Options include:

Replacement / substitution by another value. This value can be generated with a synthetic data tool, so that the original is replaced by a realistic-looking alternative.
Masking / obfuscation: replace with a character or bar. This can be done partially so as not to lose useful, more general information: we can still see that +32********* is a Belgian telephone number.
Removal
Hashing (preferably with a salt to prevent brute-force attacks)
Encryption, possibly format-preserving
…
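Two of these options, partial masking and salted hashing, can be sketched with the Python standard library alone. The function names `mask` and `salted_hash` are hypothetical, and plain SHA-256 over salt plus value is only one possible construction; this sketch also assumes the salt is kept secret, since identifiers such as phone numbers have so few possible values that a known salt would not stop an exhaustive search.

```python
import hashlib
import secrets


def mask(value, keep_prefix=3, mask_char="*"):
    """Keep the first `keep_prefix` characters and mask the rest."""
    hidden = max(0, len(value) - keep_prefix)
    return value[:keep_prefix] + mask_char * hidden


def salted_hash(value, salt):
    """Hex SHA-256 digest of salt + value; the same salt gives the same output."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


salt = secrets.token_bytes(16)      # assumption: stored securely elsewhere
print(mask("+32470123456"))         # country prefix stays visible
print(salted_hash("+32470123456", salt))
```

Masking keeps some general information readable, while hashing yields a stable token that still allows records about the same person to be linked.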

Dutch-language text (left) anonymised by masking (middle) or substitution (right) using the EU NLP Service. Note that the anonymised text still mentions unique career elements from which the hidden identity can be inferred. Source text: Stad Kortrijk, press release 23/07/2023

For images, other operations are possible, including:

Blurring or other filters. Bear in mind that some filters are reversible.
Covering or overwriting, for example with a black rectangle.
…

Replacing a value with an alternative of the same kind can, however, sometimes produce strange effects, because the entity is not always classified correctly or because too little or no account can be taken of the context. Some tools, for instance, ignore gender when picking a random name to replace a real one, even though that may be needed to keep the text grammatically or semantically consistent. We also sometimes see place names such as Sint-Niklaas anonymised as, say, Sint-Kevin, because Niklaas was taken to be a first name. The language models used for NER are therefore certainly not flawless.

In theory it should be possible to achieve better results by using recent LLMs such as GPT-4 with carefully constructed prompts. Steps in that direction will probably be taken soon, but today the required computing power, energy consumption and cost are still too high, and the response time too slow, to make this scalable.

The same Dutch-language text (left) anonymised by ChatGPT 4o (right). The prompt asked to “replace all PII and personal data, including professions, employers, cities, dates and ages.” Note that the resulting text was also rewritten. Preventing that would require further extension and refinement of the prompt.

Tools of the trade

Anyone looking for large-scale PII filtering systems that can scan entire databases, networks or filesystems will end up with Data Loss Prevention tools. These are meant to prevent PII from leaving the company without the necessary authorisation. For a market overview we refer to Gartner. The internet giants also offer solutions, such as Amazon Macie, Google SDP or IBM Guardium. The techniques involved are somewhat related to those used in forensic investigation, so-called eDiscovery, which we have also written about before.

Application builders are probably more interested in tools in the form of libraries, SDKs or APIs. Interesting projects include:

For text:
- Microsoft Presidio (demo) (also available as Docker containers), or the PII detection service on Azure
- Amazon Comprehend (demo)
- The EU Language Services for NLP (login required): for anonymisation of documents in EU languages, based on the MAPA-EU project, which can also be used via Docker Compose.
- PIICatcher (for databases and filesystems)
For images:
- Google Magritte (for faces)
- Meta Research EgoBlur (for faces and licence plates)
- OctoPII (detection only, no redaction; for documents and filesystems, with Tesseract as OCR engine)

Face anonymisation with Meta EgoBlur. Source: Nikhil Raina et al., “EgoBlur: Responsible Innovation in Aria”, with photos from the public CCV2 dataset.

Research continues in academia as well. PII-Codex, for instance, is the result of a university project and has an interesting feature: under the hood it uses Presidio or Comprehend, but it adds its own risk score, intended to indicate to what extent leaving the detected PII unredacted could pose a (privacy) risk. In addition, most tools allow you to plug in other or custom models. You may have fine-tuned these yourself to detect custom entities, provided you have the necessary training data.

If we rely on NER or image recognition for PII detection, we can be sure that some PII will go undetected and, conversely, that non-PII may be wrongly flagged as PII. None of these technologies guarantees 100% accuracy. The success rate also varies with the language and the entity type being detected. Complete replacement or removal of every entity in a document can never be guaranteed. Where that is crucial, the result should therefore still be checked afterwards.

Conclusion

In a European context, PII filtering solutions can certainly contribute to the protection of personal data. The technique is easy to understand and easy to deploy. However, there is never a guarantee that all personal data will be detected accurately, so in most cases these solutions will have to be part of a broader set of measures to support compliance with the GDPR and other legislation.

The underlying technology is “classic”, in the sense that NER and image recognition have existed for a long time and are by now well developed. Today they benefit from the attention given to artificial intelligence, and a variety of benchmarks make it possible to follow the state of the art. In practice, we do notice that the resulting anonymised text can sometimes come across as somewhat odd, because a few equally classic problems that NER typically struggles with have still not been fully resolved.

______________________

This is a contribution submitted by Joachim Ganseman, IT consultant at Smals Research. This article was written in a personal capacity and does not take a position on behalf of Smals.

Approaches Radar 2024

Koen Vanderkimpen — Fri, 26 Jan 2024 15:47:53 +0000

Methodology, Approaches & Architectural Styles

Legend

AI / Machine Learning	=	AI is the broader concept of machines acting in a way that we would consider “smart”. Machine Learning is a form of AI based on giving machines access to data and letting them learn for themselves. Includes neural networks, deep learning, language processing. A possible application is fraud detection.
AI Augmented Development	=	Use of AI and NLP in the development environment: debugging, testing (mutation, fuzzing), generation of code/documentation, augmented coding, recommendations for refactoring, …
EDA	=	An Event Driven Architecture (EDA) can offer many advantages over more traditional approaches. Events and asynchronous communication can make a system much more responsive and efficient. Moreover, the event model often better resembles the actual business data coming in.
Generative AI	=	Generative AI is the technology to create new content by utilising existing text, audio files, or images. With generative AI, computers detect the underlying patterns related to the input and produce similar content.
Crypto-agility	+	Crypto-agility allows an information security system to switch to alternative cryptographic primitives and algorithms without making significant changes to the system’s infrastructure. Crypto-agility facilitates system upgrades and evolution.
AI for Security	+	Non-traditional methods for improving analysis methods in the security technology of systems and applications (e.g., user behaviour analytics, improved detection of potential attacks from system logs).
Causal AI	+	Causal AI techniques make it possible to understand the causes of a prediction outcome. They encompass methods such as causal Bayesian networks, causal rules, combinations of symbolic and neural AI, etc.
Verifiable Credentials	N	Verifiable credentials (VC) can represent all of the same information that physical credentials represent. Additional technologies, such as digital signatures, make VC more tamper-evident and more trustworthy than their physical counterparts. VC are typically stored in digital wallets.
Confidential Computing	–	Confidential computing allows an entity to do computations on data without having access to the data itself. This can be realised in a centralised way with homomorphic encryption or a trusted execution environment (TEE) or in a decentralised way with secure multiparty computation.
NLP	–	Natural Language Processing (NLP), part of AI, includes techniques to distil information from unstructured textual data, with the aim of using that information inside analytics algorithms. Used for text mining, sentiment analysis, entity recognition, Natural Language Generation (NLG).
Graph Analytics	=	Graph Analytics is the process of investigating relational structures (i.e., relations between entities such as people, companies, addresses, …) by the use of network and graph theory. When entities include people, we talk about SNA (Social Network Analytics).
AI/ML Engineering	=	In machine learning ‘by hand’, a lot of time is lost between training a model and putting it in production, to then wait for feedback for potential retraining. CD4ML (continuous delivery for ML) attempts to automate this process, working towards Adaptive AI.
API Economy	=	API’s, to connect services within and across multiple systems, or even to 3rd parties, are becoming prevalent and push a new business model, centred around the integration of readily available data and services. They also help with loose coupling between components.
Augmented Data Quality	=	By adding AI, machine learning, knowledge graphs, NLP, … to data quality tools, results could become more efficient for the business.
SuperApps	=	Some mobile apps, like WeChat and AliPay, become entire ecosystems of pluggable mini-apps. Users can greatly customise their experience within the superapp, and integration between mini-apps is much tighter than that of normal smartphone apps. Popular now in China, but may be coming here soon.
Synthetic Data	=	Synthetic Data is concerned with creating a fictitious dataset that mimics a real one in format, looks and statistical properties. Can be used to further minimise the need to share sensitive or protected data.
Rules as Code	+	Rules as Code is a LegalTech concept in which the goal is to semi-automate the link between (suitably formalised) regulations on one hand, and the derived code (implementations, verification or compliance processes) on the other hand.
Human Augmentation	+	Human Augmentation is the enhancement of human capabilities, such as senses, actions, or cognition, using technology and science. It includes medical advancements, wearables (e.g. intelligent glasses), genetic engineering, and brain-computer interfaces.
Data centric AI	N	A machine learning approach consisting in systematically applying data engineering best practices with a strong focus on data quality in order to improve the quality of a model.
prompt engineering	N	A prompt is a natural language description defining the context in which a Large Language Model operates and outputs text. Small changes in the prompt can cause great changes in behaviour. Prompt engineering tries to find optimal prompts to achieve desired LLM behaviour.
Zero Trust Architecture	–	The main concept behind zero trust is “never trust, always verify,” which means that devices should not be trusted by default, even if they are connected to a managed corporate network such as the corporate LAN and even if they were previously verified. Also known as “perimeterless security.”
Back Tracking Anomalies	–	Method to detect causes of data quality problems in data flows between information systems and to improve them structurally. ROI is important and facilitates a win-win approach between institutions. To monitor the anomalies and transactions, an extension to the existing DBMS has to be built.
Big Data Processing	–	Big data analytics solutions require architecture, which 1) has the calculations executed where data is stored, 2) spreads data and calculations over several nodes, and 3) uses a data warehouse architecture that makes all types of data available for analytical tools in a transparent way.
Data Virtualisation	–	Methods and tools to access databases with heterogeneous models and to facilitate access for users using a virtual logical view.
Knowledge Graphs	–	Knowledge Graphs relate entities in a meaningful graph structure to facilitate various processes from information retrieval to business analytics. Knowledge graphs typically integrate data from heterogeneous sources such as databases, documents, and even human input.
Microservices	–	Independently maintainable and deployable services, which are kept very small (hence, ‘micro-‘), make an application, or even large groups of related systems, much more flexibly scalable, and provide functional agility, which allows a system to rapidly support new business opportunities.
Reactive Computing	–	The flow of (incoming) data, and not an application’s (or CPU’s) regular control flow, govern its architecture. This is a new paradigm, sometimes even driven by new hardware, and opposes the traditional way of working with fluxes. Also known as Dataflow Architecture and related to EDA.
Data-Centric Security	=	Approach to protect sensitive data uniquely and centrally, regardless of format or location (using e.g. data anonymization or tokenisation technologies in conjunction with centralised policies and governance).
Multimedia Data Protection	=	Protection of multimedia data has gained importance with social media, remote-working, but also with the development of powerful AI models. Detecting falsification is critical. For instance one should be able to detect forgery of images (e.g., faces used for biometrics).
Augmented Data Science	=	Augmented data science and machine learning (augmented DSML) uses artificial intelligence to help automate and assist key aspects of a DSML process. These aspects include data access and preparation, feature engineering, as well as model operationalisation, model tuning and management.
Data Observability	=	Monitoring and management of performance and “system incidents”, combined with real-time monitoring of data errors and lineage, in order to automatically resolve the cause (only bugs and formal causes) in the software components of the various interconnected information systems.
Edge Computing	=	Information processing and content collection and delivery are placed closer to the endpoints to fix high WAN costs and unacceptable latency of the cloud. Also in context of AI solutions, edge computing becomes more relevant (ref. tinyML)
GitOps	=	Best practices coming from DevOps, applied to Operations. This, for instance, means, that all configuration is specified in files that can be maintained using version control and that are machine readable by tools to automate as many things as possible.
Privacy by Design	=	Privacy by design calls for privacy to be taken into account throughout the whole engineering process. The European GDPR regulation incorporates privacy by design. An example of an existing methodology is LINDDUN.
Process Mining	=	Includes automated process discovery (extracting process models from an event log from an information system), and offers also possibilities to monitor, check and improve processes. Often used in preparation of RPA and other business process initiatives (context digital transformation).
Visual Analytics	=	Methodology and enabling tools allowing to combine data visualisation and analytics. Allows rapidly exploring, analysing, and forecasting data. This helps modelling in advanced analytics, and to make modern, interactive, self-service BI applications.
Voice of the Citizen Applications	=	Contains a number of approaches to capture and analyse explicit or non-explicit feedback from users, in order to improve the systems and remove frictions.
Remote Identity Verification	–	Remote identity verification comprises the processes and tools to remotely verify someone’s identity, without the need for the person to physically present themselves to an authority.
Composable Applications	–	Applications composed of business-oriented building blocks, where these modular reusable blocks are independent one of another and can be configured by Business and IT into a solution. Main advantage is the support for agility of the business to changes while resilience should be maintained.
Collaborative MDM	–	In Master Data Management, collaborative and organised management of anomalies stemming from distributed authentic sources, by their official owners.
Living Documentation	–	Living documentation actively co-evolves with code, making it constantly up-to-date without requiring separate maintenance. It lives with the code and can be automatically used by tools to generate publishable specifications. An example can be found in some forms of annotations.
Hexagonal Architecture	=	Also ‘Onion Architecture’: a set of architectural principles making the domain model code central to everything and dependent on no other code or framework. Other aspects of the program code can be dependent on the domain code. Gained a lot of popularity in the community recently.
Web3	=	Web3 is an idea for a new iteration of the World Wide Web which incorporates concepts such as decentralisation, blockchain technologies, and token-based economics. It promises to give back control to citizens over their assets, as well as over their identity.

Data protection through structure-preserving pseudonymisation of national register numbers

Kristof Verslype — Wed, 07 Jun 2023 11:50:56 +0000


More and more sensitive personal data is stored in digital form, while cyberattacks are
becoming increasingly advanced. Improving the protection of personal data therefore
receives constant attention.

A valuable complementary measure is to store personal data not under a national register
number but under a pseudonym. For existing applications that do not yet work this way, in
production as well as in test and development environments, it can be useful or even
necessary for these pseudonyms to have the same structure as national register numbers,
so that they can be processed by the existing application and database.

Hence the need for a technique that converts national register numbers into pseudonyms
with the same structure, and vice versa. Classical encryption cannot do this, but data
tokenization and format-preserving encryption can.

Data tokenization, in its simplest form, involves maintaining a table of pairs of the form
(national register number, pseudonym), which raises infrastructure problems, notably
regarding backup, synchronisation and securing the table.

Rather than maintaining an ever-growing table with potentially millions of records, a
simpler and more secure solution would be a single, immutable symmetric key of at most
32 bytes. This is exactly what format-preserving encryption (FPE) provides. The technique
was first presented in 2001 and has been standardised by NIST. Following the discovery of
weaknesses, the standards were revised in 2019.

The FPE standards are mainly aimed at the financial sector where, for example, credit card
numbers are replaced by pseudonyms with the same structure. The Smals Research team
wondered whether this technique could also be applied to national register numbers. This
article presents our analysis and our experiments.

How it works

In essence, FPE is a permutation, i.e. a reordering, as illustrated in the figure below,
where the digits 1 to 5 are reordered. The permutation is determined by the FPE key and the
tweak. The key is secret; the tweak is a freely chosen value (a byte array) that may be
publicly known and that simplifies key management [1]. How can we use this to convert
national register numbers into pseudonyms that have the structure of a national register
number?
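To make the idea of a keyed permutation concrete, here is a toy sketch in Python. It is explicitly not FF1 or FF3-1 and must not be used to protect real data: it combines a small Feistel network (with HMAC-SHA256 as round function) and cycle-walking, and the names `permute`, `ROUNDS` and the sample key and tweak are all assumptions made for illustration.

```python
import hashlib
import hmac

ROUNDS = 8  # arbitrary choice for this toy example


def _round_value(key, tweak, round_no, half, bits):
    """Pseudo-random round function built from HMAC-SHA256, truncated to `bits` bits."""
    message = tweak + bytes([round_no]) + half.to_bytes(16, "big")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:16], "big") & ((1 << bits) - 1)


def _feistel(key, tweak, value, bits, decrypt):
    """Balanced Feistel network: a permutation of all integers below 2**(2*bits)."""
    left, right = value >> bits, value & ((1 << bits) - 1)
    rounds = range(ROUNDS - 1, -1, -1) if decrypt else range(ROUNDS)
    for i in rounds:
        if decrypt:
            left, right = right ^ _round_value(key, tweak, i, left, bits), left
        else:
            left, right = right, left ^ _round_value(key, tweak, i, right, bits)
    return (left << bits) | right


def permute(key, tweak, value, domain_size, decrypt=False):
    """Map a value in [0, domain_size) to another value in the same range."""
    if not 0 <= value < domain_size:
        raise ValueError("value outside domain")
    bits = max(1, ((domain_size - 1).bit_length() + 1) // 2)
    result = _feistel(key, tweak, value, bits, decrypt)
    while result >= domain_size:  # cycle-walking: repeat until back inside the domain
        result = _feistel(key, tweak, result, bits, decrypt)
    return result


key = b"demo key, not a real secret"
# Reorder the digits 1..5; a different tweak generally gives a different order.
print([permute(key, b"tweak-A", i, 5) + 1 for i in range(5)])
print([permute(key, b"tweak-B", i, 5) + 1 for i in range(5)])
```

Calling `permute` with `decrypt=True`, the same key and the same tweak maps each output back to its input, which is what distinguishes a permutation from a hash.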

The string 83.06.21-123-62 has the structure of a national register number, i.e. it has
the form YY.MM.DD-III-CC, where YY.MM.DD represents the date of birth, III is a per-day
counter in which sex is also encoded, and CC is a check number computed from all preceding
elements and the century of birth. Your author is (alas/fortunately) unable to verify
whether the number 83.06.21-123-62 has actually been assigned to a citizen and therefore
only knows that it is a string with the structure of a national register number.
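The check number can be illustrated with the commonly documented modulo-97 rule, which is not spelled out in this article and is therefore an assumption here: CC equals 97 minus the remainder of the first nine digits divided by 97, with a leading 2 prepended to those digits for births from 2000 onwards. The function names are hypothetical, and the sketch checks only the check number, not whether the date itself exists.

```python
def check_number(first_nine, born_2000_or_later=False):
    """Modulo-97 check number over the nine digits YYMMDDIII."""
    prefix = "2" if born_2000_or_later else ""
    return 97 - int(prefix + first_nine) % 97


def has_valid_check_number(candidate):
    """True if CC matches under either century rule; the date is not validated."""
    digits = "".join(ch for ch in candidate if ch.isdigit())
    if len(digits) != 11:
        return False
    body, cc = digits[:9], int(digits[9:])
    return cc in (check_number(body), check_number(body, born_2000_or_later=True))


print(check_number("830621123"))                   # check number for 83.06.21-123
print(has_valid_check_number("83.06.21-123-62"))
```

Under this rule, the example string from the text does carry a consistent check number.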

Starting from a freely chosen start date, for example 01/01/1911, we assign each
well-formed string a unique index that starts at 0 and increases from there, as shown in
the figure below. We can stop, for example, at 31/12/2022. In that case we can be certain
that the national register numbers of everyone registered in the National Register who was
alive at the end of 2022 can be converted to and from a number, since nobody in this
country is older than 112.
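One simple way to realise such an index, given as a sketch under assumptions: days are numbered from the start date, and within each day the counter III runs over the 998 values 001 to 998 (the figure of 998 is taken from later in this article). The real conversion has to handle additional special cases that are ignored here, and the names `to_index` and `from_index` are hypothetical.

```python
from datetime import date, timedelta

START = date(1911, 1, 1)   # freely chosen start date, as in the text
COUNTERS_PER_DAY = 998     # assumption: III ranges over 001..998


def to_index(birth_date, counter):
    """Index of the well-formed string with this birth date and day counter."""
    return (birth_date - START).days * COUNTERS_PER_DAY + (counter - 1)


def from_index(index):
    """Inverse of to_index: returns (birth_date, counter)."""
    days, offset = divmod(index, COUNTERS_PER_DAY)
    return START + timedelta(days=days), offset + 1


index = to_index(date(1983, 6, 21), 123)
print(index, from_index(index))

# Number of well-formed strings between 01/01/1911 and 31/12/2022 in this model:
print(to_index(date(2022, 12, 31), COUNTERS_PER_DAY) + 1)
```

In this model, an FPE step would operate on the integer index, after which `from_index` turns the permuted index back into a birth date and counter for which a check number can be recomputed.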

The conversion of a national register number into a structure-preserving pseudonym is
illustrated in the figure below. The national register number is first converted into a
number, as described above. This number is permuted (= encrypted) by FPE into another
number, which is then converted back into the corresponding structure-preserving string.
That string is the final pseudonym.

[1] With a single secret key and different tweaks, we thus obtain different permutations
(encryptions). The tweak can be regarded as the non-secret part of the key.

In practice

To use FPE to convert national register numbers into structure-preserving pseudonyms, we
therefore need both an FPE encryption algorithm (with its decryption counterpart) and a
conversion method.

For the FPE encryption we used the well-known cryptographic library BouncyCastle, which
supports both NIST standards, FF1 and FF3-1. Under the hood, FPE always relies on an
existing symmetric block cipher. The logical choice was therefore AES. As a result, FPE
keys are simply AES keys.

The Smals Research team implemented the conversion itself in Java, taking into account all
the complexities of national register numbers (see, for example, the Royal Decrees of
3 April 1984 and 25 November 1997). If there is concrete interest, this research code can
evolve into something usable in production.

Crucial constraints must nevertheless be taken into account when choosing the domain size.
FPE was first presented in 2001, in a paper entitled Ciphers with arbitrary finite domains. As the title indicates, the domain size can be chosen arbitrarily. That is also what we did in our earlier example.

However, the NIST standards deviate from this and stipulate that the domain size must have
the form radix^len, i.e. the base radix raised to the power len, where radix and len can be
chosen freely as long as radix does not exceed 2¹⁶ = 65 536.
This approach works well for credit card numbers, for example. These consist of 16 decimal
digits, so we choose radix = 10 and len = 16. Consequently, if we follow the NIST
standards, which I strongly recommend, we can no longer choose the domain size arbitrarily.

In addition, the minimum domain size, which was still 100 in the 2016 NIST publication, was
raised to 1 000 000 in the 2019 revision for security reasons. In other words,
radix^len ≥ 1 000 000 is required. One consequence of this requirement is that it is no
longer possible to preserve the year of birth in the pseudonym of a national register
number: there are only some 365 000 well-formed strings per year (365 or 366 days per year
× 998 possibilities for the day counter III).
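The two constraints mentioned above can be checked with a few lines of Python. This sketch covers only those two rules; the NIST standards contain further requirements that are not modelled here, and the function name is hypothetical.

```python
MAX_RADIX = 2 ** 16          # radix may not exceed 65 536
MIN_DOMAIN_SIZE = 1_000_000  # minimum domain size in the 2019 revision


def meets_domain_rules(radix, length):
    """Check only the two constraints discussed in the text."""
    return 2 <= radix <= MAX_RADIX and radix ** length >= MIN_DOMAIN_SIZE


print(meets_domain_rules(10, 16))   # credit card numbers: 16 decimal digits
print(meets_domain_rules(2, 26))    # a domain of 2**26 values

# Well-formed strings within a single (leap) year: too few to keep the year fixed.
strings_per_leap_year = 366 * 998
print(strings_per_leap_year, strings_per_leap_year >= MIN_DOMAIN_SIZE)
```

A domain restricted to one birth year stays well below the minimum, which is why the year cannot be preserved in the pseudonym.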

Let us return to our experiments. How do we determine the domain (and hence its size)?
In our earlier example, the domain consisted of all strings with the structure of a
national register number for people born between 1911 and 2022, i.e. more than
40.8 million strings. The system is obviously meant to be used for several years, so it
makes sense for the domain to be larger: new national register numbers are issued
continuously, and the old ones must not be forgotten.

For our tests, we chose 1 January 1912 as the start date and 2²⁶ = 67 108 864 as the size
of our domain. Together, the start date and the domain size also determine the end date,
which is 7 February 2096 in our case.
As mentioned earlier, FPE is at its core a permutation over the entire domain, so the
national register number of a living person may be converted into a structure-preserving
pseudonym with a birth date several decades in the future. It is also possible that, ten
years from now, the national register number of a person alive at that time is converted
into a pseudonym with a birth date too far in the past to belong to anyone alive at that
moment.
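The end date follows from simple arithmetic, sketched below under the assumption that every day contributes 998 well-formed strings and that indices are assigned day by day from the start date. The variable names are hypothetical.

```python
from datetime import date, timedelta

START = date(1912, 1, 1)
DOMAIN_SIZE = 2 ** 26       # 67 108 864
COUNTERS_PER_DAY = 998      # assumption: III ranges over 001..998

# The last index is DOMAIN_SIZE - 1; integer division gives the day it falls on.
last_day_offset = (DOMAIN_SIZE - 1) // COUNTERS_PER_DAY
end_date = START + timedelta(days=last_day_offset)

# The final day is only partly covered by the domain.
counters_on_last_day = DOMAIN_SIZE - last_day_offset * COUNTERS_PER_DAY

print(end_date, counters_on_last_day)
```

Because the domain size is a power of two rather than a multiple of 998, the last day of the domain contains only part of the possible counter values.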

In summary, FPE can be used to convert national register numbers into pseudonyms with the
same structure, but all information contained in the national register number is lost in
the process. Checks on the date of birth and on sex (encoded in the 9th digit) therefore
become impossible. This may affect applications that perform such checks anyway.

A word of caution is in order here, however. We should not assume that a national register
number contains this information by definition. There are exceptions in which the exact
date of birth is not contained in the national number (see the Royal Decrees mentioned
above). The best practice is therefore to use the national register number as an identifier only and to request from the National Register the
personal data that the application needs. In such a context, FPE for structure-preserving
pseudonyms can be a valuable security measure.

Privacy membrane

The privacy membrane is a joint concept (there is no code yet) of Smals' Information Security department and the Smals Research team. The idea is that an
environment, for example an application in acceptance, is surrounded by a virtual
membrane, the privacy membrane. All incoming national register numbers are converted into
structure-preserving pseudonyms as they cross the privacy membrane, and all outgoing
structure-preserving pseudonyms are converted back into the original national register
numbers as they cross it. Inside the membrane, only the pseudonym is therefore known. This
approach is transparent both for the application or applications inside the membrane and
for the applications and services they communicate with.

The privacy membrane could in fact be a proxy server through which all incoming and
outgoing traffic passes. This proxy server could possibly be hosted by a third party.

Unlike the other advanced pseudonymisation techniques designed by the Smals Research team,
this third party inevitably sees both the national register number and the pseudonym. It
is therefore impossible to offer a blind pseudonymisation service based on FPE, so a
higher degree of trust in this third party is required.

Conclusion

FPE offers an elegant approach to converting national register numbers into pseudonyms
with the same structure. This approach can improve the protection of personal data without
having to adapt the application or the underlying database. On the other hand, the
information contained in the national register number, in particular the date of birth and
biological sex, is lost. This should not be a problem, however, if best practices are
applied and the information is retrieved from the authentic source, namely the National
Register.

The same technique can be applied to other types of numeric identifiers, such as BCE
numbers (Belgian enterprise numbers), telephone numbers and bank account numbers.
Today, the Smals Research team's research code already supports, in addition to
national register numbers, BIS numbers, i.e. unique identification numbers for people
who are not registered in the National Register but who do have a relationship with the
Belgian authorities. National register numbers and BIS numbers together make up the
NISS numbers, the social security identification numbers.

The introduction mentions that FPE is a complementary protection measure. If, for
example, the national register number in a database record is replaced by a pseudonym
while the name and address remain in plaintext in the database, identifying the citizen
remains fairly trivial. In that case, either additional protection measures are needed,
or this personal data is no longer stored locally but is systematically retrieved from
the authentic source (here, the National Register).

In December 2021, a poll at the end of my webinar on privacy-enhancing technologies
asked the following question: which privacy-enhancing technologies do you think have
the most potential and therefore deserve more attention? The winner was FPE (followed
by Oblivious Join and Synthetic data). This result led us to pay more attention to this
technology. Since then, the Smals Research team has carried out its first successful
experiments with FPE.

If you would like to apply FPE, possibly in the form of a privacy membrane, or to
convert identifiers into pseudonyms, do not hesitate to contact us.

This contribution was submitted by Kristof Verslype, cryptographer at Smals Research.
It was written in his own name and does not take a position on behalf of Smals.

Source of featured image: Pixabay

Data protection using structure-preserving pseudonymisation of national register numbers

Kristof Verslype — Tue, 16 May 2023 05:00:00 +0000


More and more sensitive personal data is stored digitally, while cyberattacks are becoming increasingly sophisticated. Improving the protection of personal data therefore receives permanent attention.

A valuable additional measure is to store personal data not under a national register number but under a pseudonym. For existing applications that do not yet do this, in production as well as in test and development environments, it can be useful and even necessary for these pseudonyms to have the same structure as national register numbers. After all, that is what the existing application and database expect and can handle.

Hence the need for a technique that converts national register numbers into pseudonyms with the same structure, and back. This is impossible with classical encryption, but it becomes possible with either data tokenization or format-preserving encryption.

In its simplest form, data tokenization maintains a table of pairs of the form (national register number, pseudonym). This brings infrastructural challenges, including backup, synchronisation and secure storage of the table.

It would be simpler and more secure if, instead of an ever-growing table with potentially millions of records, we only had to keep a single, immutable symmetric key of at most 32 bytes. This is exactly what format-preserving encryption (FPE) offers. The technique was first proposed in 2001 and was standardised by NIST in 2016. After weaknesses were discovered, the standards were revised in 2019.

The FPE standards primarily target the financial sector, where, for example, credit card numbers are replaced by pseudonyms with the same structure. At Smals Research we wondered whether this technique can also be applied to national register numbers. This article discusses our analysis and experiences.

How it works

In essence, FPE is a permutation, i.e. a reordering, as illustrated in the figure below, in which the numbers 1 to 5 are reordered. The permutation is determined by the FPE key and the tweak. The key is secret; the tweak is a freely chosen number (byte array) that may be publicly known and that simplifies key management [1]. How can we use this to convert national register numbers into pseudonyms with the structure of a national register number?

The string 83.06.21-123-62 has the structure of a national register number: it is of the form YY.MM.DD-III-CC, where YY.MM.DD indicates the date of birth, III is a day counter that also encodes the sex, and CC is a check number computed from everything that precedes it as well as the century of birth. Your author (unfortunately/fortunately) has no means of checking whether 83.06.21-123-62 has actually been assigned to a citizen and therefore only knows that this is a string with the correct structure of a national register number.
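As an illustration of how the check number CC depends on both the preceding digits and the century of birth, here is a minimal sketch. It assumes the commonly documented modulo-97 rule (97 minus the remainder of the first nine digits divided by 97, with a 2 written in front of those digits for births from 2000 onwards). The function name `check_number` is hypothetical, and the exceptional cases covered by the royal decrees are ignored.

```python
def check_number(first_nine_digits: str, born_2000_or_later: bool = False) -> str:
    """Return the two-digit check number CC for the digits YYMMDDIII."""
    value = int(first_nine_digits)
    if born_2000_or_later:
        # Same effect as writing a 2 in front of the nine digits.
        value += 2_000_000_000
    return f"{97 - value % 97:02d}"


# The example string from the text: 83.06.21-123-62
print(check_number("830621123"))  # prints 62
```

Because the two variants of the rule never give the same result for the same nine digits, the check number also tells a parser which century a two-digit year belongs to.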

Starting from a freely chosen start date, for example 01/01/1911, we assign a unique index to every correctly formed string, starting at 0 and counting upwards, as shown in the figure below. We can stop at, for example, 31/12/2022. In that case we are certain that the national register numbers of all persons registered in the National Register who were alive at the end of 2022 can be converted to and from a number, since nobody in this country is older than 112.

The conversion of a national register number into a structure-preserving pseudonym is illustrated in the figure below. The national register number is first converted into a number, as just described. FPE then permutes (i.e. encrypts) that number into another number, which is converted back into the corresponding structure-preserving string. This string is the final pseudonym.

[1] With a single secret key and different tweaks you therefore obtain different permutations (encryptions). The tweak can be seen as the non-secret part of the key.
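The three steps described above (string to index, keyed permutation, index back to string) can be sketched as follows. This is only an illustration under stated assumptions: the permutation is a toy Feistel network built on HMAC-SHA256, used here as a stand-in and NOT the NIST FF1 or FF3-1 algorithm, and it should not be considered secure. All function names are hypothetical, the start date and domain size are taken from the tests described in this article, and exceptional national register numbers are ignored.

```python
import hashlib
import hmac
import re
from datetime import date, timedelta

START = date(1912, 1, 1)   # freely chosen start date
COUNTERS_PER_DAY = 998     # day counter III runs from 001 to 998
HALF_BITS = 13             # two 13-bit halves give 2**26 indices
MASK = (1 << HALF_BITS) - 1
DOMAIN = 1 << (2 * HALF_BITS)
ROUNDS = 10                # arbitrary choice for this toy cipher


def check_number(body: int, born_2000_or_later: bool) -> int:
    """Modulo-97 check number (assumed rule, see lead-in)."""
    return 97 - (body + 2_000_000_000 * born_2000_or_later) % 97


def rank(nrn: str) -> int:
    """Map a well-formed string YY.MM.DD-III-CC to its index."""
    digits = re.sub(r"\D", "", nrn)
    if len(digits) != 11:
        raise ValueError("expected 11 digits")
    body, cc = int(digits[:9]), int(digits[9:])
    yy, mm, dd, iii = (int(digits[a:b]) for a, b in ((0, 2), (2, 4), (4, 6), (6, 9)))
    for century, flag in ((1900, False), (2000, True)):
        if check_number(body, flag) == cc:  # the check number reveals the century
            days = (date(century + yy, mm, dd) - START).days
            index = days * COUNTERS_PER_DAY + iii - 1
            if 1 <= iii <= COUNTERS_PER_DAY and 0 <= index < DOMAIN:
                return index
    raise ValueError("not a well-formed number inside the domain")


def unrank(index: int) -> str:
    """Map an index back to the corresponding well-formed string."""
    days, counter = divmod(index, COUNTERS_PER_DAY)
    birth = START + timedelta(days=days)
    body = int(f"{birth:%y%m%d}{counter + 1:03d}")
    cc = check_number(body, birth.year >= 2000)
    return f"{birth:%y.%m.%d}-{counter + 1:03d}-{cc:02d}"


def _round(key: bytes, tweak: bytes, i: int, half: int) -> int:
    msg = tweak + bytes([i]) + half.to_bytes(2, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big") & MASK


def permute(key: bytes, tweak: bytes, x: int, decrypt: bool = False) -> int:
    """Toy keyed permutation of range(DOMAIN); a stand-in for real FPE."""
    left, right = x >> HALF_BITS, x & MASK
    if decrypt:
        for i in reversed(range(ROUNDS)):
            left, right = right ^ _round(key, tweak, i, left), left
    else:
        for i in range(ROUNDS):
            left, right = right, left ^ _round(key, tweak, i, right)
    return (left << HALF_BITS) | right


def pseudonymise(key: bytes, tweak: bytes, nrn: str) -> str:
    return unrank(permute(key, tweak, rank(nrn)))


def depseudonymise(key: bytes, tweak: bytes, pseudonym: str) -> str:
    return unrank(permute(key, tweak, rank(pseudonym), decrypt=True))
```

A call such as `pseudonymise(key, tweak, "83.06.21-123-62")` returns another string of the form YY.MM.DD-III-CC with a valid check number, and `depseudonymise` with the same key and tweak recovers the original. In a real deployment the `permute` step would be replaced by a standardised FPE cipher from a vetted library.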

In practice

To use FPE for converting national register numbers into structure-preserving pseudonyms, we therefore need both an FPE cipher (encryption and decryption algorithm) and a conversion method.

For the FPE cipher we relied on the well-known crypto library BouncyCastle, which supports both NIST standards, FF1 and FF3-1. Under the hood, FPE always uses an existing symmetric block cipher algorithm. The logical choice was therefore AES. Consequently, FPE keys are simply AES keys.

Smals Research implemented the conversion itself in Java, taking into account all the complexities surrounding national register numbers (see, for example, the royal decrees of 3 April 1984 and 25 November 1997). If there is concrete interest, this research code can evolve towards something that is also usable in production.

However, crucial restrictions must be taken into account when choosing the domain size. FPE was first proposed in 2001, in a paper entitled Ciphers with arbitrary finite domains. As the title suggests, the domain size could be chosen arbitrarily. This is also what we did in our earlier example.

The NIST standards deviate from this, however, and require the domain size to have the form radix^len, i.e. the base radix raised to the power len, where radix and len can be chosen freely as long as radix does not exceed 2¹⁶ = 65 536. This approach works well for credit card numbers, for example. Such numbers consist of 16 decimal digits, so we choose radix = 10 and len = 16. If we follow the NIST standards, which I strongly recommend, we can therefore no longer choose the domain size arbitrarily.

Moreover, the minimum domain size, which was still 100 in the 2016 NIST publication, was raised to 1 000 000 in the 2019 revision for security reasons. In other words, radix^len ≥ 1 000 000 is required. One implication is that preserving the year of birth in the pseudonym of a national register number is no longer an option: per year there are only about 365 000 correctly formed strings (365 or 366 days per year × 998 possibilities for the day counter III).
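The arithmetic behind these constraints can be checked with a short sketch. The bounds are the ones quoted in the text. The function name `domain_allowed` and the lower bound of 2 on radix are assumptions made for illustration.

```python
MAX_RADIX = 2**16            # upper bound on radix quoted in the text
MIN_DOMAIN_2016 = 100        # minimum domain size in the 2016 publication
MIN_DOMAIN_2019 = 1_000_000  # minimum domain size in the 2019 revision


def domain_allowed(radix: int, length: int, minimum: int = MIN_DOMAIN_2019) -> bool:
    """Check whether a domain of size radix**length satisfies the quoted bounds."""
    return 2 <= radix <= MAX_RADIX and radix**length >= minimum


# Credit card numbers: 16 decimal digits.
print(domain_allowed(10, 16))                      # True

# Well-formed strings within a single (leap) year of birth.
strings_per_leap_year = 366 * 998
print(strings_per_leap_year)                       # 365268
print(strings_per_leap_year >= MIN_DOMAIN_2019)    # False
print(strings_per_leap_year >= MIN_DOMAIN_2016)    # True
```

Even a leap year stays well below one million strings, so a per-year domain only met the older minimum.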

Back to our experiments. How do we determine the domain (and hence the domain size)? In our earlier example the domain consisted of all strings with the structure of a national register number for persons born between 1911 and 2022, which together amounts to just over 40.8 million strings. The intention is of course to use the system for many years, so it is wise to choose a larger domain: new national register numbers are issued all the time, and the old ones cannot simply be forgotten.

For our tests we chose 1 January 1912 as the start date and 2²⁶ = 67 108 864 as the size of our domain. Together, the start date and the domain size also determine the end date, which in this case is 7 February 2096. As said earlier, FPE is fundamentally a permutation over the entire domain, which implies that the national register number of a living person can be converted into a structure-preserving pseudonym with a date of birth decades in the future. It is equally possible that, ten years from now, the national register number of a person alive at that time is converted into a pseudonym with a date of birth that is in any case too far in the past to belong to a person still alive then.
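The domain sizes and the end date quoted in this section can be reproduced with a few lines of date arithmetic. This is a sketch that assumes, as in the text, 998 day-counter values per calendar day. The helper names are hypothetical.

```python
from datetime import date, timedelta

COUNTERS_PER_DAY = 998  # possibilities for the day counter III


def domain_size(first: date, last: date) -> int:
    """Number of well-formed strings for birth dates from first to last inclusive."""
    return ((last - first).days + 1) * COUNTERS_PER_DAY


def last_date(start: date, size: int) -> date:
    """Birth date of the last index in a domain of the given size."""
    return start + timedelta(days=(size - 1) // COUNTERS_PER_DAY)


print(domain_size(date(1911, 1, 1), date(2022, 12, 31)))  # 40826184
print(2**26)                                              # 67108864
print(last_date(date(1912, 1, 1), 2**26))                 # 2096-02-07
```

Since 2²⁶ is not a multiple of 998, the final day is only partially covered: the last index corresponds to day counter 350 on 7 February 2096.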

In summary, FPE can be used to convert national register numbers into pseudonyms with the same structure, but all information contained in the national register number is lost in the process. Checks on date of birth and sex (which is encoded in the 9th digit) therefore become impossible. This may have consequences for applications that nevertheless perform such checks.

One caveat is in order here. We cannot assume that a national register number always contains this information. There are indeed exceptions in which the exact date of birth is not contained in the national register number (see the royal decrees mentioned earlier). It is therefore in any case a best practice to use the national register number only as an identifier and to request the personal data the application needs from the National Register. In such a context, FPE for structure-preserving pseudonyms can be a valuable security measure.

Privacy membrane

The privacy membrane is a joint concept (there is no code yet) of Smals' information security department and research department. The idea is that an environment, for example an application in acceptance, is surrounded by a virtual shell, the privacy membrane. All national register numbers entering the privacy membrane are converted into structure-preserving pseudonyms. All structure-preserving pseudonyms leaving the membrane are converted back into the original national register number as they pass through it. Inside the membrane, only the pseudonym is therefore known. Such an approach is transparent both to the application(s) inside the membrane and to the applications/services with which they communicate.

In reality the privacy membrane could be a proxy server through which all incoming and outgoing traffic passes. That proxy server could optionally be hosted by a third party.

Unlike other advanced pseudonymisation techniques devised by Smals Research, this party inevitably sees both the national register number and the pseudonym. A blind pseudonymisation service is therefore impossible with FPE, and consequently a higher degree of trust in this party is required.
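Since the text states that no code exists yet for the privacy membrane, the following is purely a hypothetical sketch of the rewriting step such a proxy might perform on a text payload. The function `cross_membrane`, the regular expression and the lookup tables are illustrative assumptions. The tables merely stand in for FPE encryption and decryption, and the pseudonym shown is an arbitrary string with the right structure.

```python
import re
from typing import Callable

# Strings with the structure YY.MM.DD-III-CC (structure only, no validation).
NRN_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}-\d{2}\b")


def cross_membrane(payload: str, convert: Callable[[str], str]) -> str:
    """Rewrite every identifier-shaped string in the payload with convert()."""
    return NRN_PATTERN.sub(lambda match: convert(match.group(0)), payload)


# Stand-in for FPE: a fixed two-way lookup, for illustration only.
to_pseudonym = {"83.06.21-123-62": "47.11.02-456-63"}
to_original = {pseudonym: nrn for nrn, pseudonym in to_pseudonym.items()}

inbound = cross_membrane('{"nrn": "83.06.21-123-62"}', to_pseudonym.__getitem__)
print(inbound)    # {"nrn": "47.11.02-456-63"}

outbound = cross_membrane(inbound, to_original.__getitem__)
print(outbound)   # {"nrn": "83.06.21-123-62"}
```

The sketch also makes the trust issue visible: whoever runs `cross_membrane` handles both the original number and the pseudonym.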

Conclusion

FPE allows an elegant approach to converting national register numbers into pseudonyms with the same structure. This can improve the protection of personal data without the underlying application or database having to be adapted. The information contained in the national register number, in particular the date of birth and the biological sex, is lost in the process. Still, this should not be a problem if best practices are followed and the information is requested from the authentic source, namely the National Register.

The same technique can also be applied to other types of numeric identifiers, such as KBO numbers (Belgian enterprise numbers), telephone numbers and bank account numbers. In addition to national register numbers, Smals Research's research code already supports BIS numbers, which are unique identification numbers for persons who are not registered in the National Register but who do have a relationship with the Belgian authorities. National register numbers and BIS numbers together make up the INSZ numbers, the social security identification numbers.

The introduction mentioned that FPE is an additional protection measure. If, for example, the national register number in a database record is replaced by a pseudonym while the name and address simply remain in plaintext in the database, identifying the citizen remains fairly trivial. In that case, either additional protection measures are needed, or this personal data is no longer stored locally but is systematically requested from the authentic source (in this case the National Register).

In December 2021, a poll at the end of my webinar on privacy-enhancing technologies asked the following question: which privacy-enhancing technologies do you think have the most potential and therefore deserve more attention? The winner was FPE (followed by Oblivious Join and Synthetic data). For us this was a signal to give this technology more attention. In the meantime, Smals Research has completed its first successful experiments with FPE.

If you are interested in applying FPE, possibly in the form of a privacy membrane, or in converting identifiers into pseudonyms, we would be happy to discuss it with you.

This is a contribution submitted by Kristof Verslype, cryptographer at Smals Research. It was written in his own name and does not take a position on behalf of Smals.

Source of featured image: Pixabay

Approaches Radar 2023

Smals Research — Thu, 26 Jan 2023 10:01:55 +0000

Methodology, Approaches & Architectural Styles

Legend

AI / Machine Learning	AI is the broader concept of machines acting in a way that we would consider “smart”. Machine Learning is a form of AI based on giving machines access to data and let them learn for themselves. Includes neural networks, deep learning, language processing. A possible application is fraud detection.
AI Augmented Development	Use of AI and NLP in the development environment: debugging, testing (mutation, fuzzing), generation of code/documentation, augmented coding, recommendations for refactoring, …
Confidential Computing	Confidential computing allows an entity to do computations on data without having access to the data itself. This can be realised in a centralised way with homomorphic encryption or a trusted execution environment (TEE) or in a decentralised way with secure multiparty computation.
EDA	An Event Driven Architecture (EDA) can offer many advantages over more traditional approaches. Events and asynchronous communication can make a system much more responsive and efficient. Moreover, the event model often better resembles the actual business data coming in.
NLP	Natural Language Processing (NLP), part of AI, includes techniques to distil information from unstructured textual data, with the aim of using that information inside analytics algorithms. Used for text mining, sentiment analysis, entity recognition, Natural Language Generation (NLG).
Graph Analytics	Graph Analytics is the process of investigating relational structures (i.e., relations between entities such as people, companies, addresses, …) by the use of network and graph theory. When entities include people, we talk about SNA (Social Network Analytics).
AI for Security	Non-traditional methods for improving analysis methods in the security technology of systems and applications (e.g., user behaviour analytics).
AI/ML Engineering	In machine learning ‘by hand’, a lot of time is lost between training a model and putting it in production, and then waiting for feedback for potential retraining. CD4ML (continuous delivery for ML) attempts to automate this process, working towards Adaptive AI.
Analytics Engineering	Analytics engineers provide clean data sets to end users, modelling data in a way that empowers end users to answer their own questions. Focus on transforming, testing, deploying, and documenting data. Tools: dbt, snowflake, stitch, fivetran, looker, mode, redash, columnar DBs
API Economy	APIs, which connect services within and across multiple systems, or even to third parties, are becoming prevalent and push a new business model, centred around the integration of readily available data and services. They also help with loose coupling between components.
Augmented Data Quality	By adding AI, machine learning, knowledge graphs, NLP, … to data quality tools and technologies, results could become more efficient for the business.
Back Tracking Anomalies	Method to detect causes of data quality problems in data flows between information systems and to improve them structurally. ROI is important and facilitates a win-win approach between institutions. To monitor the anomalies and transactions, an extension to the existing DBMS has to be built.
Big Data Processing	Big data analytics solutions require an architecture that 1) has the calculations executed where data is stored, 2) spreads data and calculations over several nodes, and 3) uses a data warehouse architecture that makes all types of data available for analytical tools in a transparent way.
Causal AI	Causal AI techniques make it possible to understand the causes of a prediction outcome. They encompass methods such as causal Bayesian networks, causal rules, combinations of symbolic and neural AI, etc.
Composable Applications	Applications composed of business-oriented building blocks, where these modular reusable blocks are independent one of another and can be configured by Business and IT into a solution. Main advantage is the support for agility of the business to changes while resilience should be maintained.
Crypto-agility	Crypto-agility allows an information security system to switch to alternative cryptographic primitives and algorithms without making significant changes to the system’s infrastructure. Crypto-agility facilitates system upgrades and evolution.
Data Virtualisation	Methods and tools to access databases with heterogeneous models and to facilitate access for users using a virtual logical view.
Knowledge Graphs	Knowledge Graphs relate entities in a meaningful graph structure to facilitate various processes from information retrieval to business analytics. Knowledge graphs typically integrate data from heterogeneous sources such as databases, documents, and even human input. Makes part of AI.
Microservices	Independently maintainable and deployable services, which are kept very small (hence, ‘micro-‘), make an application, or even large groups of related systems, much more flexibly scalable, and provide functional agility, which allows a system to rapidly support new business opportunities.
Platform Engineering	The discipline of designing and building toolchains and workflows that enable self-service capabilities, by providing an integrated product most often referred to as an “Internal Developer Platform” covering the operational necessities of the entire lifecycle of an application.
Reactive Computing	The flow of (incoming) data, and not an application’s (or CPU’s) regular control flow, governs its architecture. This is a new paradigm, sometimes even driven by new hardware, and opposes the traditional way of working with fluxes. Also known as Dataflow Architecture and related to EDA.
Remote Identity Verification	Remote identity verification comprises the processes and tools to remotely verify someone’s identity, without the need for the person to physically present themselves to an authority.
SuperApps	Some mobile apps, like WeChat and AliPay, become entire ecosystems of pluggable mini-apps. Users can greatly customise their experience within the superapp, and integration between mini-apps is much tighter than that of normal smartphone apps. Popular now in China, but may be coming here soon.
Synthetic Data	Synthetic Data is concerned with creating a fictitious dataset that mimics a real one in format, looks and statistical properties. Can be used to further minimise the need to share sensitive or protected data.
Zero Trust Architecture	The main concept behind zero trust is “never trust, always verify,” which means that devices should not be trusted by default, even if they are connected to a managed corporate network such as the corporate LAN and even if they were previously verified. Also known as “perimeterless security.”
Augmented Data Science	Augmented data science and machine learning (augmented DSML) uses artificial intelligence to help automate and assist key aspects of a DSML process. These aspects include data access and preparation, feature engineering, as well as model operationalization, model tuning and management.
Collaborative MDM	In Master Data Management, collaborative and organised management of anomalies stemming from distributed authentic sources, by their official owners.
Compliance Automation & Rules as Code	The (semi-)automation of compliance and compliance verification processes which currently rely on manual input. This requires one to formalise, to the extent possible, regulation and policies that trigger actions. Tightly coupled to the LegalTech concept of Rules as Code.
Data Observability	Monitoring and management of performance and “system incidents”, combined with real-time monitoring of data errors and lineage, to automatically resolve the cause (only bugs and formal causes) in the software components of the various interlinked information systems.
Data-Centric Security	Approach to protect sensitive data uniquely and centrally, regardless of format or location (using e.g. data anonymization or tokenisation technologies in conjunction with centralised policies and governance).
Cyber Immune System	A cyber immune system combines processes and technologies to increase the robustness of computer systems against any kind of failure. It builds on technologies such as AI-augmented testing, auto remediation and processes such as software supply chain security or reliability engineering.
Edge Computing	Information processing and content collection and delivery are placed closer to the endpoints to fix high WAN costs and unacceptable latency of the cloud. In the context of AI solutions, edge computing also becomes more relevant (ref. tinyML).
Eventual Consistency	A general way to evolve systems away from too restrictive ACID principles. Using this, and pushing it through on a business level, are the only way to keep systems evolving towards a more distributed, scalable, flexible, and maintainable lifecycle.
GitOps	Best practices coming from DevOps, applied to Operations. This, for instance, means, that all configuration is specified in files that can be maintained using version control and that are machine readable by tools to automate as many things as possible.
Human Augmentation	Enhancement of human capabilities using technology and science. Can be very futuristic (e.g. brain implants) but intelligent glasses could be a realistic physical augmentation. Cognitive augmentation (a human’s ability to think and make better decisions) will be made possible thanks to AI.
Living Documentation	Living documentation actively co-evolves with code, making it constantly up-to-date without requiring separate maintenance. It lives with the code and can be automatically used by tools to generate publishable specifications. An example can be found in some forms of annotations.
Mobile Development	Set of techniques, tools and platforms to develop web based and platform-specific mobile applications.
Multimedia Data Protection	Protection of multimedia data has gained importance with social media, remote-working, but also with the development of powerful AI models. Detecting falsification is critical. For instance one should be able to detect forgery of images (e.g., faces used for biometrics).
Observability-Driven Development	By designing systems to be observable from the start, it becomes easier to detect and fix unexpected problems as early in the development life cycle as possible, making it cheaper to deal with them.
Privacy by Design	Privacy by design calls for privacy to be taken into account throughout the whole engineering process. The European GDPR regulation incorporates privacy by design. An example of an existing methodology is LINDDUN.
Process Mining	Includes automated process discovery (extracting process models from an event log from an information system), and offers also possibilities to monitor, check and improve processes. Often used in preparation of RPA and other business process initiatives (context digital transformation).
Self-Integrating Applications	A new way to integrate, to minimise manual work, based on having applications discover services, extracting metadata from various sources, automating the definition of processes, and automatically mapping dependencies.
Visual Analytics	Methodology and enabling tools allowing to combine data visualisation and analytics. Allows rapidly exploring, analysing, and forecasting data. This helps modelling in advanced analytics, and to make modern, interactive, self-service BI applications.
Voice of the Citizen Applications	Contains a number of approaches to capture and analyse explicit or non-explicit feedback from users, in order to improve the systems and remove frictions.
Onion Architecture	Also ‘Hexagonal Architecture’: a set of architectural principles making the domain model code central to everything and dependent on no other code or framework. Other aspects of the program code can be dependent on the domain code. Gained a lot of popularity in the community recently.
Web3 – Citizen Control	Web3 is an idea for a new iteration of the World Wide Web which incorporates concepts such as decentralisation, blockchain technologies, and token-based economics. It promises to give back control to citizens over their assets, as well as over their identity.

« Synthetic Data » – Webinar by Smals Research (December 1, 2022)

Smals Research — Fri, 02 Dec 2022 09:39:58 +0000

“Fake it till you make it”: an introduction to synthetic data

A synthetic dataset is a fictitious dataset that mimics the characteristics of a real dataset as closely as possible. Because it consists of purely fictitious data, a correctly constructed synthetic dataset can be freely shared, reused or published. Access to the real, sensitive data can thus be kept to a minimum. But to what extent is such a fictitious dataset still representative of the real data? And what can you do with it?

In this webinar we take a closer look at the concept of synthetic data and at the practical concerns involved in creating it. We focus on tabular data as found in most classical databases. Possible application areas for government are presented. We will see that there is no one-click solution and that it is often necessary to impose various additional constraints, depending on the type of data being handled and the purpose for which it will be used.

Based on an experiment with open source components and an open dataset, we give recommendations for systematically improving the creation of a synthetic dataset. We discuss the trade-offs involved and examine to what extent analyses on synthetic data are still representative of the underlying real data. Finally, we briefly look at the commercial market, which is evolving very rapidly under the influence of developments in artificial intelligence.

Slides and recording

The slides and recording of the webinar are now available:

Smals Research webinars are free of charge and intended for staff of Smals and the public sector. The aim is to present the results of Smals Research's work on the use of new and recent technologies in the public sector. You can subscribe to the Smals Research Newsletter & Webinars mailing list via website.smalsrech.be.

Fake it till you make it – an introduction to synthetic data

Joachim Ganseman — Thu, 01 Dec 2022 14:58:48 +0000

A synthetic dataset is a fictitious dataset that mimics the characteristics of a real dataset as closely as possible. Because it consists of purely fictitious data, a correctly constructed synthetic dataset can be shared, reused or published without difficulty. Access to the real, sensitive data can thus be kept to a minimum. But to what extent is such a fictitious dataset still representative of the real data? And what can you do with it?

We will present possible application areas for government. We will see that there is no silver-bullet solution and that it is often necessary to impose various additional constraints, depending on the type of data being handled and the use we want to make of it.

Recording

Presentation

20221201-infosessie-synthdata-Final Download

Webinar DEVOXX – Fake it till you make it: an introduction to synthetic data

Joachim Ganseman — Thu, 13 Oct 2022 14:07:42 +0000

Slides from the webinar for Devoxx on 12/10/2022

Using ‘real’ data may be tempting, yet under the GDPR it is not a good idea when personal information is involved. Unfortunately, testing or debugging software can be harder without full access to all underlying data. A synthetic dataset can be a good solution: fictitious replacement data that mimics the structure and distribution of the original data. Joachim Ganseman from Smals Research talks about how synthetic data can be generated, and especially about the practical concerns and limitations. How do we deal with rarely occurring values, correlations or dependencies? What about the balance between maximum privacy protection and retaining enough functional usability? Can we do reliable analytics on a synthetic dataset? He will share some practical examples using open source software in Python.
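To make the question about correlations and dependencies concrete, here is a minimal sketch in plain Python. It is not the method or the software used in the talk. The toy table (`age`, `income`) and the function names `make_toy_real_data`, `synthesize_independent` and `pearson` are hypothetical and exist only for this illustration. The naive generator samples every column on its own, so each column keeps its distribution while the relationship between columns is ignored.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_toy_real_data(n, rng):
    """Hypothetical stand-in for a 'real' table: income depends on age."""
    rows = []
    for _ in range(n):
        age = rng.randint(18, 65)
        income = 1000 + 40 * age + rng.gauss(0, 200)
        rows.append((age, income))
    return rows


def synthesize_independent(rows, n, rng):
    """Naive generator: draw each column from its own observed values.

    Per-column distributions are approximately preserved, but any
    dependency between columns is ignored by construction.
    """
    columns = list(zip(*rows))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    real = make_toy_real_data(2000, rng)
    synth = synthesize_independent(real, 2000, rng)

    real_age, real_income = zip(*real)
    syn_age, syn_income = zip(*synth)

    print("mean age     real vs synthetic:", mean(real_age), mean(syn_age))
    print("correlation  real vs synthetic:",
          pearson(real_age, real_income), pearson(syn_age, syn_income))
```

On this toy table the per-column averages stay close to the original, while the strong age/income correlation largely disappears in the synthetic rows. An analysis that relies on that dependency would therefore give misleading results, which is one reason a generator usually needs extra constraints or a model of the joint distribution.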

Recording

Presentation

20220217-devoxx-syntheticdata Download
