Unlocking the Arabic Archive: CoreTechX's Hybrid Transformer Architecture and the End of the "Black Box"

The GCC is witnessing a fundamental shift in document intelligence, moving from theoretical AI ethics toward quantifiable ROI and secure, localized intelligence. While standard OCR merely identifies characters, decision-makers today require systems that are searchable, auditable, and capable of unlocking the meaning within millions of "locked" handwritten government and historical records.

As Fahad Faisal Fahad AlSaud, co-founder of CoreTechX, notes, "Intelligence from Arabic documents should not be treated as a long-term ambition. It is a necessity at the present time. We are enabling leaders to turn decades of silent archives into active strategic assets."

The E2E Philosophy: A Modular Technical Deep Dive

To bridge this strategic gap, CoreTechX developed Raqmn.ai, a turnkey solution designed to make digitization and analysis more efficient and secure. Unlike fragmented toolchains, Raqmn.ai integrates the entire recognition process, from image processing and text detection to output generation, into a unified, high-speed framework.

"We did not rely on off-the-shelf OCR," explains Fahad Durukan, co-founder of CoreTechX. "We built our own end-to-end pipeline tailored specifically to Arabic handwriting, including a hybrid CNN–Transformer architecture optimized for both character and line-level recognition. This ensures that the nuances of Arabic script are preserved, not lost in translation."

The pipeline operates through several sophisticated stages:

  • Preprocessing and Noise Reduction: Cleans the scanned input and enhances readability, compensating for physical degradation such as ink bleed or faded strokes.
  • Line Segmentation and Sorting: Splits the document into individual line-level inputs and sorts them so that the original reading order is preserved.
  • Core OCR Engine: A Transformer-based recognition engine optimized for the contextual dependencies and ligatures found in cursive scripts.
  • Structure Builder: Organizes the recognized text into an ordered JSON structure that preserves logical formatting, marginal comments, and annotations.
  • The "LLM Fix": A lightweight language model applies a final refinement stage to improve contextual accuracy and semantic meaning.
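The stages above can be pictured as a simple chain of functions. The following is a minimal sketch only, not Raqmn.ai's actual code or API: the names `Line`, `sort_lines`, `build_structure`, and `run_pipeline` are hypothetical, and the recognizer and the language-model refinement step are passed in as plain callables so that any stand-in can be used.

```python
import json
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass
class Line:
    """A detected text line (hypothetical type): box origin, text, and role."""
    x: int
    y: int
    text: str = ""
    role: str = "body"  # "body" or "margin"


def sort_lines(lines: List[Line]) -> List[Line]:
    """Order lines top-to-bottom; lines at the same height go right-to-left."""
    return sorted(lines, key=lambda ln: (ln.y, -ln.x))


def build_structure(lines: List[Line]) -> str:
    """Emit ordered JSON, keeping marginal notes separate from body text."""
    doc = {
        "body": [ln.text for ln in lines if ln.role == "body"],
        "margin_notes": [ln.text for ln in lines if ln.role == "margin"],
    }
    return json.dumps(doc, ensure_ascii=False)


def run_pipeline(
    lines: List[Line],
    recognize: Callable[[Line], str],  # stand-in for the OCR engine
    refine: Callable[[str], str],      # stand-in for the language-model pass
) -> str:
    """Sort lines, recognize and refine each one, then build the JSON output."""
    ordered = sort_lines(lines)
    recognized = [replace(ln, text=refine(recognize(ln))) for ln in ordered]
    return build_structure(recognized)
```

Passing the recognizer and refiner as arguments keeps each stage replaceable, which is one plausible reading of the "modular" design described here; the real system may be organized differently.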

Architectural Innovation: The Transformer Advantage

CoreTechX has moved beyond traditional CNN-RNN-CTC models toward a hybrid CNN–Transformer architecture. This approach is better at capturing long-range dependencies and global context, which is essential given that Arabic letters change shape depending on their position within a word.
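The position-dependent shapes mentioned above are visible in Unicode itself, which encodes separate presentation forms for a single Arabic letter. The short illustration below uses Python's standard `unicodedata` module; the letter BEH is chosen only as an example, and the snippet says nothing about how the CoreTechX model represents characters internally.

```python
import unicodedata

# Arabic Presentation Forms-B code points for the letter BEH:
# isolated, final, initial, and medial shapes of the same letter.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe(forms):
    """Return (Unicode name, base letter after NFKC normalization) per glyph."""
    return [
        (unicodedata.name(ch), unicodedata.normalize("NFKC", ch))
        for ch in forms
    ]


for name, base in describe(BEH_FORMS):
    # Every positional form normalizes back to the same base letter.
    print(f"{name} -> U+{ord(base):04X}")
```

Four visually distinct glyphs map to one underlying letter, which is why a recognizer benefits from seeing the neighbouring context rather than each shape in isolation.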

To solve the region's "data scarcity" problem, the system utilizes Synthetic Pre-Training with custom-generated images that mimic real-world noise. This is followed by Multi-Domain Fine-Tuning across diverse datasets, such as KHATT and Muharaf. By modifying "cross-attention layers," the model avoids "over-forgetting" historical orthography while gaining contemporary precision.
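As a rough illustration of what "custom-generated images that mimic real-world noise" can mean, the sketch below degrades a tiny binary image by randomly fading ink pixels and adding stray specks. The function name `degrade`, its parameters, and the noise model are assumptions made for this example; the article does not describe the actual generation method.

```python
import random
from typing import List

Image = List[List[int]]  # 1 = ink, 0 = background


def degrade(image: Image, fade_prob: float, speckle_prob: float,
            seed: int = 0) -> Image:
    """Return a noisy copy: ink may fade out, background may gain specks."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    noisy = []
    for row in image:
        new_row = []
        for px in row:
            if px == 1 and rng.random() < fade_prob:
                new_row.append(0)  # faded stroke
            elif px == 0 and rng.random() < speckle_prob:
                new_row.append(1)  # stray speck or bleed-through
            else:
                new_row.append(px)
        noisy.append(new_row)
    return noisy
```

Applying such a function with varying probabilities to clean rendered text lines would yield many degraded variants of each sample, which is the general idea behind synthetic pre-training data.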

Benchmarking Success vs. Global Models

The latest technical results represent a new State-of-the-Art (SOTA) for the Arabic language. In comprehensive benchmarks comparing models like Azure, Google-Vision, Claude, and Gemini, the Raqmn.ai engine, referred to as CTX, emerged as a leader in consistency and character accuracy:

  • Superior Character Accuracy (Muharaf): On the historical manuscript dataset, CTX achieved the lowest overall Character Error Rate (CER) at 6.3%, compared with 24.6% for Azure, and finished ahead of the other systems tested.
  • Handwriting Excellence (KHATT): While competitors struggled with the nuances of modern handwriting, CTX performed best with a CER of 3.6%, demonstrating its reliability for practical document reading.
  • Reliability Gap: While generalist models record significantly higher error rates, such as ChatGPT reaching a 74.3% CER on Muharaf and a 37.3% CER on KHATT, Raqmn.ai maintains high precision across both public and private data.
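For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between the system output and the reference transcription, divided by the length of the reference. The sketch below follows that common definition; the article does not state which normalization or text preprocessing the benchmark used, so treat the details as assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One missing letter in a four-letter word gives a CER of 0.25.
print(cer("كتاب", "كتب"))
```

Under this definition, a CER of 6.3% corresponds to about 63 character edits per 1,000 reference characters.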

Deployment: Sovereignty and Scalability

Because GCC government and historical institutions demand strict data sovereignty, CoreTechX avoids external API risks by offering a fully on-premise system. This ensures sensitive data remains entirely within the client's infrastructure.

By layering generative AI and vectorized retrieval on top of the structured data, archives become interactive knowledge systems. "Our goal is to structure this vast unstructured corpus and make it accessible to everyone: governments, researchers, businesses, and the public," says AlSaud. As the platform evolves into RAQMN, it is set to become the backbone of structured Arabic knowledge, increasing employee productivity at least threefold.
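The retrieval layer described above can be illustrated with a toy example: each structured record is turned into a vector, and a query is matched to the closest record by cosine similarity. This sketch uses simple word counts in place of learned embeddings, and the names `embed`, `cosine`, and `search` as well as the sample records are hypothetical; it is not a description of the RAQMN platform.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, records: List[Dict], top_k: int = 1) -> List[Dict]:
    """Return the top_k records whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, embed(r["text"])),
                    reverse=True)
    return ranked[:top_k]
```

For instance, given records with `"text"` fields such as "land deed for a farm" and "court ruling on water rights", a query like "water rights ruling" would rank the second record first.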

ⓒ 2026 TECHTIMES.com All rights reserved. Do not reproduce without permission.
