From unstructured to AI-ready data

Written by Andreas

October 29, 2025

Act 1 – The challenges: AI-ready data from unstructured formats

Unstructured data remains a significant bottleneck in unlocking the full potential of AI solutions. How many enterprises have truly managed to overcome this challenge? Organizations are still drowning in documents: invoices, contracts, compliance policies, onboarding forms, legacy PDFs, scanned images, audio files, and many more. In this context, Dave Donahue from Unstructured emphasizes: “Unstructured data is not inherently more valuable than structured data, but generally speaking, large organizations produce four times as much unstructured data as structured data” (Davison, 2025).

This raises the question of how to handle data of this volume and variety. Key issues are that unstructured data often remains trapped in silos, poorly cataloged, and not easily discoverable (Muscolino et al., 2023, p. 11). The same study also emphasizes that “… even though the volume and variety of unstructured data is vastly greater than that of structured data, spending on technology to utilize unstructured data is lower than that for structured data” (ibid., p. 9), highlighting the ongoing challenges of unifying and managing those data structures. This gap is also reflected in how the data is analyzed: 40% of the study participants (n=414 business and IT decision makers; ibid., pp. 19-20) reported relying on manual processing, while 34% used partially automated approaches (ibid., p. 10). Even though these findings were published two years ago, they are still highly relevant today, as I can confirm from my own experience. For decades, converting documents back into a consistent, machine-readable format has been a major hurdle – mainly because they come in so many different formats, aren’t well standardized, and were designed for print, which often causes the loss of structure and metadata (Livathinos et al., 2025, p. 1). This long-standing challenge has become more pressing: as Livathinos et al. (2025) emphasize, “with the advent of LLMs and popular application patterns such as retrieval-augmented generation (RAG), leveraging the rich content embedded in PDFs, Office documents, and scanned document images has become ever more relevant.” (p. 1).

Most enterprises are aware of these issues when it comes to integrating their data into AI solutions. But the question now is how to tackle this challenge – and where to start experimenting on their own unstructured data without immediately investing in commercial software.

One way to do so is to use a document-processing technology like Docling. Livathinos et al. (2025) describe Docling as “… a cost-effective, accurate and transparent open-source library with a permissive license, offering a reliable and flexible solution for document conversion” (p. 2) which “… runs efficiently on commodity hardware in a small resource budget” (p. 1). That sounds like a good starting point for local AI document-processing experiments, even when budgets are tight and projects that utilize unstructured data face obstacles to funding (Muscolino et al., 2023, pp. 10-11).

So let’s take a closer look at this library.

Act 2 – Meet Docling: Your document processing ally

What is Docling and where does it come from? The roots of this project trace back to the “AI for Knowledge” group at IBM Research Zurich, a team of researchers that has been exploring the intersection of artificial intelligence and information extraction for years. Led by Peter Staar, the team developed the Deep Search platform, from which the Docling project later emerged.

The latest features are highlighted on the GitHub project page. Docling can handle a wide range of document formats – from PDFs and Word files to spreadsheets, presentations, audio, and images. It understands complex PDF structures like layouts, tables, code, and formulas, and outputs everything in a unified DoclingDocument format. You can run it locally for secure or offline use, and it integrates with Langflow, LangChain, LlamaIndex, Crew AI, and Haystack for agentic AI workflows. It also includes OCR for scanned files, supports visual language models like GraniteDocling, processes audio through automatic speech recognition (ASR), connects to agents via MCP, and offers a simple command-line interface. One limitation of Docling, at the time this post was written, is that it does not yet support handwritten documents.
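For readers who prefer code to a UI, the conversion step described above can be sketched in a few lines of Python. This is a minimal sketch, assuming Docling is installed (for example via `pip install docling`) and using its documented `DocumentConverter` quickstart pattern. The helper names `convert_to_markdown` and `save_markdown` are my own and not part of the library.

```python
from pathlib import Path


def convert_to_markdown(source: str) -> str:
    """Convert one document (local path or URL) to Markdown with Docling."""
    # Imported lazily so this file still loads where Docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()      # default conversion pipeline
    result = converter.convert(source)   # result.document is a DoclingDocument
    return result.document.export_to_markdown()


def save_markdown(markdown: str, target: Path) -> Path:
    """Write the converted text to disk so it can be reviewed by hand."""
    target.write_text(markdown, encoding="utf-8")
    return target
```

Calling `convert_to_markdown("https://arxiv.org/pdf/2501.17887")` would convert the Docling paper cited in this post. The first run may take a while, since the layout and table models are downloaded on demand.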

To run your own “smell test”, I recommend using this demo environment on Hugging Face, which runs the latest visual language model (VLM), Granite-Docling-258M. You can either upload your own documents or choose from one of the samples provided (see screenshot below). For a deeper dive into GraniteDocling, check out this link.

Screenshot of the Hugging Face – IBM Granite-Docling-258M demo captured by the author. October 27, 2025. Used for illustrative purposes only.

Another very convenient way to test Docling is via its Langflow plugin: “Langflow is an open-source, Python-based, customizable framework for building AI applications” (Langflow, n.d., “About Langflow”). To install and run Langflow, you can begin here. To get started with the Docling plugin, you’ll need to be running Langflow version 1.6 or later. Earlier versions don’t include it, so double-check that your setup is up to date before you dive in. Once you’re ready, I highly recommend watching the video tutorial “Docling + Langflow: Visual Document Processing for AI Workflows”. It’s a good resource that walks you through how to combine Docling and Langflow for seamless document processing within AI workflows. The following screenshots show a Langflow flow using the Docling plugin to extract all text from the Docling paper referenced above.

Screenshot captured by the author, demonstrating a Langflow flow using the Docling plugin. October 27, 2025. Used for illustrative purposes only.

Screenshot captured by the author, displaying the extracted text of the document processed in the playground. October 27, 2025. Used for illustrative purposes only.

To take things a step further, you can create a RAG (Retrieval-Augmented Generation) scenario. Begin with one or more documents, extract their content, split it into chunks, store the chunks in a vector database (for example, Astra DB), and then use them for retrieval-based tasks. The first screenshot below illustrates the Docling ingestion pipeline in Langflow, while the second shows how the chunked data appears once loaded into Astra DB. This scenario is also covered in the video tutorial “Docling + Langflow: Visual Document Processing for AI Workflows” mentioned above.

Screenshot captured by the author, showing an ingestion pipeline using Docling and Astra DB. October 27, 2025. Used for illustrative purposes only.

Screenshot created by the author, visualizing how the chunked data appears once loaded into Astra DB. October 27, 2025. Used for illustrative purposes only.
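To make the ingestion pipeline less of a black box, here is a rough, dependency-free sketch of what happens conceptually: text is split into chunks, each chunk is turned into a vector, and a query is matched against the stored vectors. Everything in it is a stand-in of my own: `chunk_text` is a naive word-window chunker, `embed` uses word counts instead of a learned embedding model, and `InMemoryVectorStore` only imitates what a vector database such as Astra DB does. It is not how Langflow, Docling, or Astra DB implement these steps.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str    # document the chunk came from
    position: int  # index of the chunk within that document
    text: str


def chunk_text(text: str, source: str, max_words: int = 40) -> list[Chunk]:
    """Split text into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [
        Chunk(source, i // max_words, " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Stand-in for a vector database; keeps everything in a Python list."""

    def __init__(self) -> None:
        self._rows: list[tuple[Counter, Chunk]] = []

    def add(self, chunks: list[Chunk]) -> None:
        self._rows.extend((embed(c.text), c) for c in chunks)

    def search(self, query: str, k: int = 3) -> list[Chunk]:
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = InMemoryVectorStore()
store.add(chunk_text("Docling converts PDF documents into a unified representation.", "docling.md"))
store.add(chunk_text("Invoices list line items, totals and payment terms.", "invoice.md"))
print(store.search("How are PDF documents converted?", k=1)[0].source)  # prints docling.md
```

In a real setup the chunk size, the embedding model, and the similarity metric are the knobs worth experimenting with, because they largely determine what the retrieval step can find later.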

This is good for testing, but a question quickly arises: how can unstructured data ingestion and the resulting data assets actually be managed and used for enterprise use cases?

Act 3 – Scaling up: Integrating AI-ready data into enterprise workflows

Document processing is one thing, but the output needs to be consumable for business users. Once documents are transformed and stored, businesses want to act on them. To achieve this, the scenario from Act 2 is extended: documents stored as chunks in a vector database only become truly valuable when they are made accessible through a business application. For this purpose, watsonx Orchestrate can be used – an enterprise solution that orchestrates AI agents and workflows. The screenshot below shows Astra DB integrated with watsonx Orchestrate as a knowledge source. When a business user submits a query through the chat – such as “What is the DocLayNet dataset about?” – the orchestration agent automatically routes the request to the connected knowledge base. It retrieves the most relevant chunks from Astra DB based on vector similarity, reformulates the content for readability, and returns the response to the user. The agent also provides a reference to the exact source snippet from which the answer was derived, ensuring transparency and traceability. This is a typical RAG use case.

Screenshot captured by the author, showing IBM watsonx Orchestrate Agent Builder using Astra DB as knowledge source. October 27, 2025. Used for illustrative purposes only.
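The traceability aspect is easy to illustrate in isolation. The following sketch is not the watsonx Orchestrate API. It is a hypothetical, self-contained illustration of the pattern described above: find the best-matching snippet for a query and return the answer together with a reference to its source. The names `Snippet`, `Answer`, `retrieve`, and `answer`, the file names, and the two sample snippets are all made up for this example, and simple word overlap replaces the vector search a real agent would use.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snippet:
    source: str  # hypothetical document identifier
    text: str


@dataclass
class Answer:
    text: str
    source: str  # empty when nothing relevant was found


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, knowledge_base: list[Snippet]) -> Optional[Snippet]:
    """Return the snippet sharing the most words with the query, if any."""
    q = tokens(query)
    best = max(knowledge_base, key=lambda s: len(q & tokens(s.text)), default=None)
    if best is None or not q & tokens(best.text):
        return None
    return best


def answer(query: str, knowledge_base: list[Snippet]) -> Answer:
    hit = retrieve(query, knowledge_base)
    if hit is None:
        return Answer("No matching passage found in the knowledge base.", "")
    # A real agent would let an LLM rephrase the snippet; here it is passed through.
    return Answer(hit.text, hit.source)


KNOWLEDGE_BASE = [
    Snippet("doclaynet.md", "DocLayNet is a human-annotated dataset for document layout analysis."),
    Snippet("astra.md", "Astra DB stores the chunked text together with its vector embeddings."),
]

result = answer("What is the DocLayNet dataset about?", KNOWLEDGE_BASE)
print(f"{result.text} [source: {result.source}]")
```

Keeping the source identifier attached to every snippet from ingestion onward is what makes the final citation possible. If it is dropped at any stage, the answer can no longer be traced back.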

Conclusion

AI-ready data is often not available where it is needed, yet it is often a crucial prerequisite for AI-driven use cases and business processes. Tools like Docling cannot solve all challenges associated with unstructured data, but they provide a strong starting point. By experimenting with your own documents, you can begin to transform raw text into structured, usable information, integrating it as a key component in a document processing or a broader data ingestion pipeline.

For enterprises aiming to scale, additional considerations become essential. Processed documents and extracted data must be stored in a reliable, organized way to prevent fragmentation and silos. Access needs to be fast, secure, and compliant with data privacy regulations. Moreover, the quality of the extracted data is critical, as document processing can be error-prone due to poor document or scan quality. To address this, a robust data governance and intelligence layer is required on top of high-quality extraction to ensure the data is trustworthy and actionable.
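As a small illustration of the quality point, a simple gate can flag suspicious extraction results before they reach the vector database. This is only a sketch: the function name `quality_issues`, the three checks, and the thresholds are assumptions of mine chosen for readability, not an established standard, and a real governance layer would go well beyond such heuristics.

```python
def quality_issues(text: str, min_chars: int = 40, min_alnum_ratio: float = 0.6) -> list[str]:
    """Return a list of heuristic warnings for one extracted chunk."""
    issues = []
    stripped = text.strip()

    # Very short chunks are often headers, page numbers, or OCR fragments.
    if len(stripped) < min_chars:
        issues.append("too_short")

    # A low share of letters and digits hints at OCR noise or table debris.
    visible = [c for c in stripped if not c.isspace()]
    if visible:
        ratio = sum(c.isalnum() for c in visible) / len(visible)
        if ratio < min_alnum_ratio:
            issues.append("low_alphanumeric_ratio")

    # U+FFFD marks characters that could not be decoded.
    if "\ufffd" in stripped:
        issues.append("replacement_characters")

    return issues


print(quality_issues("The invoice total is 1,250.00 EUR and payment is due within 30 days."))
print(quality_issues("~~ |_| ## .."))
```

The first call returns an empty list, while the second is flagged as both too short and mostly non-alphanumeric. Chunks with warnings could be routed to manual review instead of being silently ingested.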

Once properly stored and governed, this AI-ready data can be seamlessly integrated with AI agents, as demonstrated in Act 3, enabling automated workflows and intelligent business processes. By combining Docling’s extraction capabilities with enterprise-grade storage, governance, and AI integration, organizations can build a foundation for truly AI-driven operations using their unstructured data.

References

Davison, A. (2025). AI and the future of unstructured data. https://www.ibm.com/think/insights/unstructured-data-trends

Langflow. (n.d.). What is Langflow? Langflow Documentation. Retrieved October 27, 2025, from https://docs.langflow.org/

Livathinos, N., Auer, C., Lysak, M., Nassar, A., Dolfi, M., Vagenas, P., Berrospi Ramis, C., Omenetti, M., Dinkla, K., Kim, Y., Gupta, S., Teixeira de Lima, R., Weber, V., Morin, L., Meijer, I., Kuropiatnyk, V., & Staar, P. W. J. (2025). Docling: An efficient open-source toolkit for AI-driven document conversion. arXiv. https://arxiv.org/abs/2501.17887

Muscolino, H., Machado, A., Rydning, J., & Vesset, D. (2023). Untapped value: What every executive needs to know about unstructured data
