Four principles of AI-ready data architecture

Blog by Sanjay Sajeevan, 04 Aug 25

AI – whether it’s machine learning, generative, or agentic – can solve real business challenges. But to get any of these solutions working, a lot needs to happen behind the scenes. It’s not just about the model. You need the right data in the right format, properly transformed and modelled, so that the outputs of your AI systems are accurate, reliable, and trustworthy. This is where data engineering and governance play a crucial role.

Everything is data

The definition of data is rapidly evolving. Today, it includes formats like images, audio, PDFs, logs, and transcripts, each requiring specialized methods to extract, structure, and convert them into a form that’s usable.

For example, one of our customers was struggling to make use of the large bank of dashboards they had built up over many years. Our solution was to build a web app where a user can input their question and, within a few seconds, the app brings up a list of recommended dashboards that provide the answer. These are ranked by relevancy and come with a description of what each dashboard can do.

To make the solution work, we used REST APIs behind the scenes to automate the extraction of report screenshots, dataset info, filters, and user annotations. We then passed this data to a large language model to generate a searchable index of the questions each report could answer. Here are four key observations about what makes this kind of project successful.
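As a rough sketch of that flow in Python – the endpoint, authentication scheme, and ask_llm callable below are placeholders for illustration, not the actual APIs we used:

```python
import requests

# Hypothetical report-platform endpoint; the real API, auth scheme, and LLM
# client used in the project are not shown here.
REPORT_API = "https://example.com/api/reports"

def fetch_report_metadata(report_id: str, token: str) -> dict:
    """Pull dataset info, filters, and annotations for one report via REST."""
    resp = requests.get(
        f"{REPORT_API}/{report_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def build_index_entry(report: dict, ask_llm) -> dict:
    """Ask an LLM which business questions this report can answer, for the search index."""
    prompt = (
        "Given this report's datasets, filters, and annotations, list the "
        f"business questions it can answer, with a one-line description:\n{report}"
    )
    return {"report_id": report.get("id"), "questions": ask_llm(prompt)}
```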

1. Scalability shouldn’t be an afterthought

Projects often start with a pilot for a single region or use case and, once the value has been proven, are quickly rolled out across multiple regions and business functions. The results can be messy if scalability isn’t considered from the beginning. We’ve seen customers build dozens of pipelines, one per market, when a single pipeline with a parameterized, modular design would have done the job.
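Here’s a hedged sketch of what that parameterized design can look like; the market names, paths, and table names are made up for illustration:

```python
from dataclasses import dataclass

# Illustrative markets, paths, and target tables; the real configuration is an assumption.
@dataclass
class MarketConfig:
    market: str
    source_path: str
    target_table: str

MARKETS = [
    MarketConfig("uk", "landing/uk/sales/", "curated.sales_uk"),
    MarketConfig("de", "landing/de/sales/", "curated.sales_de"),
]

def run_pipeline(cfg: MarketConfig) -> None:
    """One modular pipeline, parameterized per market, instead of a copy per region."""
    print(f"Extracting from {cfg.source_path}, loading into {cfg.target_table}")
    # extract(cfg.source_path) -> transform() -> load(cfg.target_table)

for cfg in MARKETS:
    run_pipeline(cfg)
```

Rolling out to a new region then means adding an entry to MARKETS, not building and maintaining another pipeline.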

2. Metadata makes reuse possible

By separating logic from configuration, you can drive pipelines with metadata stored in a file, such as an Excel workbook, that defines inputs, locations, and patterns. If you want to add a new data source or region, you don’t touch the pipeline; you just update the metadata. We used this approach to move over 13,500 files between layers for one customer, all handled through a single reusable pipeline.
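A minimal sketch of the idea, assuming a workbook with source_path, file_pattern, and target_path columns (the names are illustrative, not the customer’s actual metadata):

```python
import glob
import os
import shutil

import pandas as pd

def copy_files(source: str, pattern: str, target: str) -> None:
    """Move matching files from one layer to the next."""
    os.makedirs(target, exist_ok=True)
    for path in glob.glob(os.path.join(source, pattern)):
        shutil.copy2(path, target)

# Workbook and column names are assumptions for illustration.
metadata = pd.read_excel("pipeline_metadata.xlsx")

for row in metadata.itertuples():
    # Adding a source or region means adding a row to the workbook,
    # not editing or duplicating the pipeline.
    copy_files(row.source_path, row.file_pattern, row.target_path)
```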

3. Build pipelines that bend instead of break

Data is always changing. Columns are added, formats shift, and new sources of data are identified. Your pipeline needs to be flexible enough to cope with this constant change, extracting only the required information and continuing to work even as inputs evolve. Build in auditing, logging, and quality checks from the start.
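One simple way to express that tolerance in code: select only the columns you need, log anything new, and fail loudly when something required disappears. The required column names below are assumptions for illustration:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

# Illustrative schema; the real required columns are an assumption.
REQUIRED_COLUMNS = ["order_id", "order_date", "amount"]

def extract_required(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns we need, so new upstream columns don't break the load."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # Fail loudly with an auditable message rather than silently loading bad data.
        raise ValueError(f"Missing required columns: {missing}")
    extra = sorted(set(df.columns) - set(REQUIRED_COLUMNS))
    if extra:
        logging.info("Ignoring new columns: %s", extra)
    return df[REQUIRED_COLUMNS]
```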

4. Avoid vendor lock-in

The AI ecosystem is evolving fast. New models appear regularly, each promising better capabilities. To keep up, your architecture needs flexibility. That means avoiding lock-in to a single tool or provider. One way to do this is by using open formats like Parquet, Delta, or Iceberg that make your data portable. This gives you the flexibility to switch platforms without costly migrations or compatibility issues.
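For instance, a minimal sketch using Parquet via pandas (the data and file name are illustrative, and pyarrow or fastparquet is assumed to be installed):

```python
import pandas as pd

# Write to an open columnar format rather than a proprietary one.
df = pd.DataFrame({"customer_id": [1, 2], "spend": [120.0, 75.5]})
df.to_parquet("customers.parquet", index=False)

# The same file can later be read by Spark, DuckDB, or another engine
# without a proprietary runtime, which keeps migration options open.
restored = pd.read_parquet("customers.parquet")
```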

Be conscious of where you’re using proprietary formats and choose open standards wherever possible. Flexibility will pay off as your AI strategy matures and your tooling evolves.

It starts with the data

AI success isn’t just about the model; it’s about the data. When we combine good data engineering with robust data governance, we are setting ourselves up for true AI success, where the right people get the right data and everyone can trust the results.
