ISSN: 2643-6744
Christian Mancas*1, Diana Christina Mancas2
Received: March 29, 2026; Published: April 08, 2026
*Corresponding author: Christian Mancas, Ovidius University, Bd. Mamaia 124, Constanta, CT, Romania
DOI: 10.32474/CTCSA.2026.03.000171
This research paper describes how to reverse engineer relational database schemas into (Elementary) Mathematical Data Model ones. Four tools are used to simplify this job, using the same example of a medium complex MS Access database: MatBase, ChatGPT, Gemini AI, and Claude AI, each one with its advantages and disadvantages.
Keywords: (Elementary) Mathematical Data Model; Reverse Database and Software Engineering; MatBase; ChatGPT; Gemini AI; Claude AI
Be it in production, design, scientific research, or education, forward database (db) and software engineering (se) is almost the sole direction in these fields, pushed even harder in recent years by the race in Artificial Intelligence (AI) and, especially, by the explosion of Large Language Models (LLMs).
However, there is, and always will be, a need for reverse database and software engineering, which should not forever remain the Cinderella of Computer Science (CS) and Information Technology (IT) [1]. The IT state of the art is still dominated by many legacy db software applications (apps) that are poorly documented (if at all) and work almost fine, but need extensions from time to time, or even refactoring or complete replacement with newer-technology ones, without losing any of their useful functionality.
Even for extending them, not to mention refactoring (e.g., switching to a safe web interface) or complete replacement, the database and software “surgeons” need deep, full, and precise knowledge of some, if not all, of their conceptual and technological details, not only to fulfill their tasks, but also to avoid tampering with the functioning of the rest of these apps. Of course, before understanding the apps’ code, you must first understand the structure of the underlying dbs.
Up to now, reverse db and se required highly skilled db and software architects and developers, with extensive knowledge of the corresponding legacy programming languages and technologies. Even the reverse engineering of legacy dbs alone was often a big challenge, especially when the target is not a verbose plain-English narrative, but a formal, concise mathematical schema. This paper reports on our latest research results in reverse db engineering towards concise and accurate mathematical schemas, with the help of AI, which considerably alleviates these prerequisites.
The goal of this research was to compare the capabilities of our long-standing tool MatBase [2,3] with those of three of the top current AI tools, namely (in chronological order of their public availability) ChatGPT [4], Gemini [5], and Claude [6], in reverse db engineering, the common target formalism being our (Elementary) Mathematical Data Model ((E)MDM) schemas [7]. (E)MDM is based on the semi-naïve theory of sets, relations, and functions (SNTSRF) [8], the temporal first-order predicate logic with equality (TFOPL) [9,10], and Datalog¬ [11,12].
MatBase is our intelligent data and knowledge base management system prototype, mainly based on (E)MDM, but also on Datalog, the Relational (RDM) [11,13,14] and Entity-Relationship (E-R) [14-16] Data Models.
MatBase was designed to accept (E)MDM schemas (which include Datalog¬ programs) and translate them into relational databases (rdbs) and automatically generated code for the corresponding database (db) applications (apps), to accept and import rdbs and translate them into (E)MDM schemas, to accept E-R diagrams and translate them into (E)MDM schemas, as well as, again dually, to translate (E)MDM schemas into E-R diagrams [14-16]. The main goal was always to reach modeling as programming [17] and, especially, mathematical data modeling. Currently, MatBase has two versions, one developed in MS Access and VBA, for small dbs, and one in MS C# and SQL Server, for large dbs.
ChatGPT, Gemini, and Claude are AI agent tools.
The next Section mentions related work. The third one is dedicated to the materials and methods used. The fourth one presents and discusses the results obtained. The paper ends with conclusions and a list of references.
The FOPL component of MatBase was described in [18]. The relationship between ChatGPT and mathematics is the topic of several published articles, e.g. [19,20]. Similarly, for Gemini AI see, e.g., [21,22], and for Claude AI, e.g., [23,24]. A comparison between mathematical capabilities of ChatGPT 5, Claude 4.1 Opus, Gemini 2.5 Pro, and Grok 4 can be found in [25].
We used our two Toshiba Satellite Intel CORE i7 running MS Windows 10, Google Chrome browser’s current version (146.0.7680.81), MS Access 365 (v. 2603), MS SQL Server 2025, MS SQL Server Management Studio v. 21.4.12, MS SQL Server Migration Assistant for Access (SSMAA) [26] v. 17, MatBase 5.2 Access, ChatGPT Plus 5.3, Gemini 3, and Claude Sonnet 4.6.
We started to refactor a legacy MS Access Geography app managed by MatBase into a MS Razor C#ASP web one, over MS SQL Server. The architecture of this app is fine: a Geography.mdb pure VBA code one (i.e., containing only forms, VBA code for enforcing non-relational constraints, and a menu) uses links to the tables of two pure data dbs: a GeographyDB.mdb storing the fundamental data, and a GeographyTmp.mdb storing the temporary tables. The GeographyTmp.mdb has only 3 empty tables and was immediately and correctly imported into a SQL Server db by SSMAA.
The GeographyDB.mdb has 389 objects (62 tables, each with its own surrogate primary key, other 106 unique keys, 11 non-unique indexes, and 148 foreign keys) storing 3.75MB+ of data. The MS Access Database Documenter generated for this db a .pdf file of 449 A4 pages of documentation (taking 1.35MB+).
SSMAA took almost half an hour to import it and succeeded only partially: it wrote 510 statistics.html files taking 4.55MB, plus another 4 .html, 1 .xml, 12 .js, 7 .css, 1 .ttf, and 81 .gif files taking, in total, another 4.2MB, only for displaying the import statistics, plus a 60-page A4 .txt file with the table import list. Unfortunately, the instances of 10 tables were lost. The T-SQL script of the imported db schema generated by SQL Server has 5,650 lines.
MatBase exports the dbs it manages in XML, HTML, PDF, or DOCX format, by simply clicking on its submenu option /Manage Databases/Export Database; moreover, after any successful execution triggered by its submenu option Other Databases/Import Relational Database, MatBase also generates an HTML file with the (E)MDM schema of the newly imported db.
With the three AI tools we considered for this research, we started the dialog by submitting the GeographyDB.mdb file and asking for the corresponding formalization of its structure using SNTSRF and FOPL.
Here are the results obtained with each one of these 4 tools.
MatBase 5.2 Access
MatBase wrote in some 2 minutes a .pdf file of 20 pages, taking almost 300KB. Figures 1 to 4 show a fragment of it (the scheme of 9 tables out of 62, having a total of 51 columns, i.e., mathematical functions).
As detailed in [7], (E)MDM uses the following abbreviations
and conventions:
• Entity-type sets are written in bold and italic.
• Relationship-type sets (e.g., GALAXIES_NEIGHBORHOOD from
the bottom of Figure 4) are similarly written but followed by
parentheses with their canonical projections.
• Attributes (e.g., x), i.e., the functions taking values from data
types or their subsets, are written under the name of their corresponding
domain sets, indented, without explicitly mentioning
them.
• Structural functions (e.g., GalacticSuperCluster), i.e., functions
taking values from object sets (i.e., corresponding to foreign
keys), are written without any abbreviation.
• Explicit constraints are prefixed by the letter C having as subscript
the corresponding unique identification value from the MatBase
CONSTRAINTS metacatalog table [3], followed, in parentheses,
by their name, unique within the db.
• obid is the abbreviation for object identifier; auton. is the abbreviation
for autonumbering.
• The double arrow is used for injective (one-to-one) functions.
• NAT(n) stands for the subset of naturals having at most n digits.
• ASCII(n) stands for the subset of strings over the ASCII alphabet
having maximum length n.
• The total constraint means that the corresponding function is
totally defined (i.e., the corresponding table column has a NOT
NULL constraint).
• Computed sets (e.g., *STARS from Figure 3) and functions (e.g.,
*OrbitCenterName from Figure 2) have their names prefixed by
‘*’.
• ° is the function composition operator, while • is the function
product one.
• ¬| − f • g is the notation for the non-existence constraint
[27], meaning that, for no element x of their domain set, may
both f(x) and g(x) be defined (i.e., not null).
• key (see the constraint C38 on the last line of Figure 4) is the
abbreviation for minimal injective.
• The parentheses following set names, function and constraint
definitions are comments (stored in MS Access dbs in the optional
Description column).
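For readers less familiar with these conventions, the three constraint types used most often above can be spelled out in first-order form. This is only a sketch in standard textbook notation, not MatBase output: f, g, and D are generic placeholder symbols for two functions and their common domain set, and "defined" stands for "not null".

```latex
\begin{align*}
\text{total:} \quad
  & \forall x \in D,\ f(x) \text{ is defined} \\
\text{injective:} \quad
  & \forall x, y \in D,\
    \big(f(x), f(y) \text{ defined} \wedge f(x) = f(y)\big) \Rightarrow x = y \\
\text{non-existence:} \quad
  & \neg \exists x \in D,\
    \big(f(x) \text{ is defined} \wedge g(x) \text{ is defined}\big)
\end{align*}
```

The compact (E)MDM keywords (total, the double arrow, and the non-existence notation) each replace one such quantified formula, which is why the schemas in Figures 1 to 4 stay short.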
As MatBase also manages the corresponding app stored in Geography.mdb (which mainly enforces the non-relational constraints), this (E)MDM schema contains explicit constraints as well. Were MatBase only importing GeographyDB.mdb, no such constraints would be present in this schema, except for C38.
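The relational part of such a translation can be illustrated with a minimal sketch. It is not MatBase's algorithm: it assumes SQLite (through Python's standard sqlite3 module) as a stand-in for MS Access, a hypothetical two-table fragment (GALAXIES and STARS, with made-up columns), and a plain-text output format of our own choosing. Each column becomes a function on its table's set, NOT NULL becomes total, a single-column key becomes injective, and a foreign key becomes a structural function whose codomain is the referenced set.

```python
import sqlite3

def describe_schema(conn):
    """Return one 'function : SET -> CODOMAIN (properties)' line per column."""
    lines = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # Foreign keys become structural functions: column -> referenced table.
        targets = {fk[3]: fk[2] for fk in
                   conn.execute(f'PRAGMA foreign_key_list("{table}")')}
        # Columns that are injective on their own: a single-column primary
        # key or a single-column unique index.
        pk_cols = [col[1] for col in columns if col[5]]
        injective = set(pk_cols) if len(pk_cols) == 1 else set()
        for idx in conn.execute(f'PRAGMA index_list("{table}")').fetchall():
            if idx[2]:  # third field is the "unique" flag
                cols = [c[2] for c in
                        conn.execute(f'PRAGMA index_info("{idx[1]}")')]
                if len(cols) == 1:
                    injective.add(cols[0])
        for _cid, name, ctype, notnull, _default, pk in columns:
            codomain = targets.get(name, ctype or "ANY")
            props = []
            if name in injective:
                props.append("injective")
            if notnull or pk:
                props.append("total")
            suffix = f" ({', '.join(props)})" if props else ""
            lines.append(f"{name} : {table} -> {codomain}{suffix}")
    return lines

# Hypothetical fragment, loosely inspired by the astronomy tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GALAXIES (x INTEGER PRIMARY KEY, Galaxy TEXT NOT NULL UNIQUE);
    CREATE TABLE STARS (
        x INTEGER PRIMARY KEY,
        Star TEXT NOT NULL,
        Galaxy INTEGER REFERENCES GALAXIES(x)
    );
""")
lines = describe_schema(conn)
for line in lines:
    print(line)
```

As in the import-only case discussed above, such a catalog-driven sketch can recover only relational constraints; the non-relational ones enforced in application code are invisible to it.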
ChatGPT Plus 5.3
ChatGPT does not accept binary files like the .mdb ones. When we provided the .pdf one exported by the MS Access Database Documenter, it rejected it too, as being too long. In the end, it accepted the .sql script generated by the SQL Server after the SSMAA import. From the options it offered, we chose the maximum rigor ones (i.e., both functions and constraints, full 3-valued logic, including null values, and the logic textbook style). Figures 5 to 10 show the schema fragment corresponding to the one in Figures 1 to 4. Figure 11 shows almost everything we could then obtain when asking again for the formalizations of constraints as well. Please note the following:
• The answer is only about 6 of the 9 tables from the astronomy
section of the db. Generally, even after insisting several times,
we could not obtain the whole formalized schema, without getting
any No to our requests: ChatGPT always tries to deflect
your queries by providing other options to choose from.
• ChatGPT sometimes used the name of our db schema tables
and columns, sometimes abbreviated, and sometimes even
changed them (e.g. from x to id_abbrev.SetName).
• Function codomains do not take into consideration corresponding
check constraints: they are only specifying the mathematical
sets corresponding to the data types (e.g., the naturals,
the integers, the reals, etc.).
• Primary key constraints (see Figure 8) are not compactly written
as, at least, injective, or one-to-one, or key, but by using the explicit
definition of one-to-oneness (unfortunately, using their id notation
instead of x). Even worse (see Figure 10), the injectivity of
Continent, which stores continent names, is misidentified
as primary as well, although the corresponding CONTINENTS
table also has a surrogate, AutoNumber primary key x. Generally,
of course, primary does not make sense in SNTSRF.
• Both totality and not totality are also verbosely described (see
Figure 8).
• Similar verbosity is used for foreign keys (see Figure 9). Even
worse, all foreign keys are described as being totally defined,
which, for their majority, is not the case.
• Generally, ChatGPT is extremely verbose, not rigorous, and
mainly uses plain English spiced with logic quantifiers and
symbols (e.g., ≠, ⇒, ¬, ∈), rather than the “logic textbook
style” and “maximum rigor” advertised. Unfortunately, when
you copy its answers from the browser, they are verbosely
written using LaTeX conventions: if you do not know them and
do not want to learn them either, you must take screenshots
and manually replace the LaTeX commands with the corresponding
math symbols.
• Only one positive remark: although the T-SQL schema does
not include our non-existence constraint ¬| − River • Lake
• Sea • Ocean • GeographicUnit (rivers may flow into only one
of these: another river, a lake, a sea, an ocean, or a geographic unit,
e.g., desert, cave, etc.), at the end of Figure 10 ChatGPT added
the non-existence constraint Sea ¬| − Ocean, but not its dual,
and using a rather Prolog-like notation.
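The meaning of this non-existence constraint can be made concrete with a small check. The sketch below is only an illustration under our own assumptions: rows are plain Python dictionaries, None stands for a null value, and the three sample rivers and their values are hypothetical, not taken from GeographyDB.mdb.

```python
# The five functions of which at most one may be defined for any river.
FLOWS_INTO = ["River", "Lake", "Sea", "Ocean", "GeographicUnit"]

def violates_non_existence(row, functions):
    """True if two or more of the named functions are defined (not None)."""
    defined = [f for f in functions if row.get(f) is not None]
    return len(defined) > 1

# Hypothetical rows of a RIVERS table; x is the surrogate key.
rivers = [
    {"x": 1, "River": None, "Lake": None, "Sea": 7, "Ocean": None,
     "GeographicUnit": None},   # flows into a sea only: fine
    {"x": 2, "River": 1, "Lake": None, "Sea": None, "Ocean": None,
     "GeographicUnit": None},   # flows into another river only: fine
    {"x": 3, "River": None, "Lake": None, "Sea": 7, "Ocean": 2,
     "GeographicUnit": None},   # both a sea and an ocean: violation
]

violations = [r["x"] for r in rivers if violates_non_existence(r, FLOWS_INTO)]
print(violations)
```

A pairwise constraint such as the one ChatGPT proposed for Sea and Ocean covers only one of the ten pairs that the five-function product rules out.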
Gemini 3
Gemini is much more verbose than ChatGPT and does not provide you with the mathematical schema it generates either, not even parts of it, although we asked for it several times during the long dialogue we had (85 A4 pages): it tells you that it has stored it internally, “intended for research, simulation, and high-fidelity geographical modeling”. The dialogue is full of metaphors; here are some examples:
• The DISTANCES table is the most complex part of your “Metabolism.”
• This is the Bio-Logic Actuator. If a user attempts to enter 72
or 11023, the “encapsulated bacteria” in the database (the
Validation Text) produces a limestone wall (Error Message) to
heal the rupture.
• Unlike terrestrial geography, which is fixed by soil, the Celestial
Map is a Relational Projection. The “Fortress” ensures that a
star cannot belong to two constellations simultaneously (The
Monogamy of Entanglement).
• These formulas represent the “Fortress” rules. If a data entry
violates these, the mathematical symmetry of your universe is
ruptured.
From the very few formal fragments we obtained, we noticed that Gemini’s style is identical to the ChatGPT one, e.g., set and function names are usually abbreviated or completely replaced (e.g., id for x), primary and foreign keys use the same verbose syntax, which extends to the relational domain constraints, e.g., for our Altitude : PEAKS → [1000, 8848], Gemini wrote the constraint “Domain: ∀ m ∈ PEAKS : 0 ≤ H(m) ≤ 8848” (unfortunately, even wrongly replacing 1000, the minimum altitude for a peak to be considered a mountain one, with 0: only “poets” like it can consider that kid-built sand heaps by the seaside are mountain peaks as well).
Moreover, surprisingly for us, Gemini also used an RDM-style functional dependency notation, e.g., for CELESTIAL_BODIES it wrote: fCELESTIAL_BODIES: IDCC → Name × TypeID × IDGal × Orbit Center × Mass.
Figure 12 shows what Gemini understands by first-order logic constraints. On a positive note, although the 449 A4 .pdf pages that it took as input do not present the non-relational constraints enforced in Geography.mdb, it suggested that we not forget to add to the DISTANCES table (between cities) the geometrical triangle inequality constraint ( ∀ City1, City2, City3 ∈ CITIES : DISTANCES (City1, City3) ≤ DISTANCES (City1, City2) + DISTANCES (City2, City3)).
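The triangle inequality constraint quoted above is easy to check mechanically. The following sketch is our own illustration, not code produced by Gemini or MatBase: it assumes distances are symmetric and stored per unordered pair of cities, and it uses three placeholder cities A, B, and C with made-up distances.

```python
from itertools import permutations

def triangle_violations(distances):
    """Return triples (a, b, c) with d(a, c) > d(a, b) + d(b, c).

    `distances` maps frozenset({city1, city2}) to a distance. Only triples
    whose three pairwise distances are all stored are checked, and each
    violation is reported once (with a < c).
    """
    cities = sorted({city for pair in distances for city in pair})
    bad = []
    for a, b, c in permutations(cities, 3):
        if not a < c:
            continue  # (a, b, c) and (c, b, a) describe the same violation
        ab = distances.get(frozenset((a, b)))
        bc = distances.get(frozenset((b, c)))
        ac = distances.get(frozenset((a, c)))
        if None not in (ab, bc, ac) and ac > ab + bc:
            bad.append((a, b, c))
    return bad

# Made-up data: the direct A-C distance exceeds the detour through B.
sample = {
    frozenset(("A", "B")): 100,
    frozenset(("B", "C")): 50,
    frozenset(("A", "C")): 200,
}
bad_triples = triangle_violations(sample)
print(bad_triples)
```

The check is cubic in the number of cities, which is acceptable for a one-off validation but would need restricting to the affected triples if run on every insert or update.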
Instead of mathematical formalizations, Gemini is much more
interested in coding and pushing ahead usage of PostgreSQL and
Python:
• Without being asked for and even without asking for our permission,
Gemini translated our .mdb db into a PostgreSQL one.
• Then, with our permission, Gemini built on top of it a FastAPI
(Python) with Pydantic validation models layer, plus an analysis
NetworkX (for graph topology) and AstroPy (for celestial
metrics) module, as well as a visualization Plotly (for connectivity
heatmaps) and Graphviz (for relational mapping) one.
• Finally, it added to this app a pathfinding Dijkstra algorithm, a
connectivity density map (that calculates which counties are
the most connected hubs, measured by how many roads pass
through their cities), a closeness centrality one (that identifies
which cities are closest to all others, acting as the natural
hubs of countries’ spines), a semantic mapping interface layer
embedded in the app’s interface for translating English/Romanian
questions into the complex JOIN logic required by the 62
db tables (natural language query tool called by Gemini “The
Oracle”), and even a geographical quiz engine!
Figure 13 shows Gemini’s summary of this effort (where you can note once more its appetite for metaphors).
Claude Sonnet 4.6
Claude was the only one that could read the Geography.mdb file. However, many columns and constraints were not captured, so we provided instead the corresponding T-SQL script: in only a couple of minutes, it offered for download an .md text file with its corresponding mathematical formalization of the db schema. The fact that the file is a .md one means that Claude stored it as an axioms file for our further interactions on this project.
Unfortunately, this second answer was only a “sketch”, including neither all sets nor all columns, with no domain (check) constraints, and containing some incorrect formalizations. A much better formalization was delivered for download after our reactions. Unfortunately, it still had some already flagged issues, as well as new ones: Figure 14 shows our corresponding message. In a couple of minutes, Claude replied with the message shown in Figures 15 and 16, plus a new Geography.md file. These corrections were not enough either: to our message from Figure 17, Claude answered as shown in Figure 18.
Although issue 3 was not solved and there were still erroneous expressions like x.Min Visible Latitude and a UNICODE (255) +, we were happy with this fourth version of the formal schema, as we reached the quota for daily free interactions. Figures 19 to 22 show the math schema fragment equivalent to MatBase’s one from Figures 1 to 4.
Remarkably, Claude AI added irreflexivity constraints for all 9 dyadic relations (DISTANCES and 8 neighbor-type ones), as well as for all 8 self-maps (e.g., Galactic Supercluster and Orbit Center); of course, the self-maps are, in fact, acyclic, not only irreflexive, but at least irreflexivity is common sense. The whole file has 17 A4 .pdf pages.
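The difference between irreflexivity and acyclicity of a self-map can be seen on a small example. The sketch below rests on our own assumptions: a self-map is represented as a Python dictionary from each element to its image (None where undefined), and the sample orbit-center data are illustrative only.

```python
def is_irreflexive(f):
    """No element is mapped to itself."""
    return all(x != y for x, y in f.items())

def is_acyclic(f):
    """Following f repeatedly from any element never revisits an element."""
    for start in f:
        seen = {start}
        x = f.get(start)
        while x is not None:
            if x in seen:
                return False
            seen.add(x)
            x = f.get(x)
    return True

# Illustrative self-map: each body points to the body it orbits.
orbit_center = {"Moon": "Earth", "Earth": "Sun", "Sun": None}

# A two-element cycle: irreflexive, yet not acyclic.
cyclic = {"A": "B", "B": "A"}

print(is_irreflexive(orbit_center), is_acyclic(orbit_center))
print(is_irreflexive(cyclic), is_acyclic(cyclic))
```

The second map passes the irreflexivity check while still letting two bodies orbit each other, which is exactly what the stronger acyclicity constraint rules out.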
As expected, MatBase is still the best of these four tools, in both accuracy and speed. Its great disadvantage, however, is that it can reverse engineer only MS Access and SQL Server dbs. The second best is, by far, Claude AI, for both its speed and its almost perfect accuracy. Moreover, it is not limited to MS Access and SQL Server dbs, very probably just like ChatGPT and Gemini. Unfortunately, after almost every answer, you must wait some 10 hours before you may ask another question free of charge. The third best is ChatGPT. Unfortunately, it takes extremely long to get a full schema, which, moreover, is not that accurate. Fortunately, it also offers the possibility of formalizing db schemas using Category Theory, but this is beyond the scope of this research: it will be the topic of our next work. Finally, Gemini is of almost no use for this endeavor but is very interesting for developing data intelligence apps. Future work will evaluate Claude AI, ChatGPT, and Gemini for legacy database reverse engineering.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This research was not sponsored by anybody and nobody other than its authors contributed to it. The corresponding author always recalls with pleasure the contributions made by some of his outstanding former students: Lavinia Crasovschi for the (E)MDM, Adrian Mocanu and Sabina-Maria Motoc for MatBase.
https://www.anthropic.com/news/anthropic-education-report-how-university-students-use-claude
