Preprint Neuro-Symbolic Multi-Modal

Thinking with Tables

Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning — a code-based framework that lets VLMs write and execute programs to solve table QA and prediction tasks.

How It Works

Inference Pipeline

TWT uses multi-turn neuro-symbolic reasoning: the model generates code, a sandbox executes it, and the result feeds back into context — looping until the final answer emerges.

🖼️Image + Question
table snippet & task
🧠VLM Reasoning
<analy> + <code>
⚙️Sandbox Exec
pandas, sklearn…
📊Code Result
<code_result>
Final Answer
<answer>
↻ Loop until <answer> is produced — multi-turn interaction
Framework

Key Features

A two-stage training recipe — Task-Oriented SFT followed by Adaptive Loss-Scaled GRPO — that teaches VLMs to reason with tables through code.

📋

Multi-Modal Table QA

Given only a table header image and file path, the model writes pandas code in a sandbox to query, aggregate, and reason over the full table.

📈

Multi-Modal Table Prediction

Combines tabular feature engineering (via code) with ResNet50 image predictions, fusing both modalities for classification and regression tasks.

🎯

Task-Oriented SFT

Supervised fine-tuning with masked code results — the model learns the <analy> / <code> / <answer> protocol without memorizing execution outputs.

🔁

Adaptive Loss-Scaled GRPO

Reinforcement learning with multi-turn rollouts. Only successfully executed code contributes to gradient updates, improving code reliability.

🛡️

Sandboxed Execution

All generated code runs in a controlled sandbox with timeout and persistent environment, ensuring safe and reproducible inference.

🔌

Plug-and-Play Tools

External tools like ResNet50Predict are injected into the sandbox, letting the model call vision models from generated code seamlessly.

Interactive Examples

TWT in Action

Watch the multi-turn reasoning unfold step by step. Click Play to animate the conversation.

Table snippet for QA example
Table header image (WikiTQ example)
Query Who served as the Minister of Foreign Affairs for Second Labour Government of New Zealand, and what dates did he serve?

Table file path: your_path/xx.csv
Analysis — Turn 1 The sample table shows ministries like Deputy Prime Minister, Attorney-General, etc., but does not include "Minister of Foreign Affairs". I need to read the full CSV file and search for the relevant ministry entry. I'll load the CSV and check column names and data.
Code — Turn 1 import pandas as pd df = pd.read_csv('xx.csv') print(df.head().to_markdown(index=False))
Code Result (Table head returned — Ministry, Minister, Term of office columns displayed)
Analysis — Turn 2 The first five rows don't mention "Foreign Affairs." Historically it was called "External Affairs" in New Zealand. I'll search the 'Ministry' column for both terms (case-insensitive).
Code — Turn 2 foreign = df[df['Ministry'].str.contains( 'External Affairs|Foreign Affairs', case=False, na=False )] print(foreign[['Ministry','Minister', 'Term of office','Term of office.1' ]].to_markdown(index=False))
Code Result Minister of External Affairs — Walter Nash — 12 Dec 1957 to 12 Dec 1960
Analysis — Turn 3 Found: "Minister of External Affairs" (historical name). The minister was Walter Nash, serving the full term of the Second Labour Government.
Answer Walter Nash served as the Minister of External Affairs (the historical title for Foreign Affairs in New Zealand) from 12 December 1957 to 12 December 1960.
Pet image for Pawpularity prediction
Pet image (Pawpularity task)
Query Predict the pawpularity of the pet shown in the image. Combine tabular features with the ResNet50 image model prediction.

Image path: your_path/…jpg  |  Tabular path: your_path/train.csv
Analysis — Visual Feature Extraction From the image I can extract binary features: {'Eyes': 1, 'Face': 1, 'Near': 1, 'Action': 0, 'Accessory': 0, 'Group': 0, 'Collage': 0, 'Human': 0, 'Occlusion': 0, 'Info': 0, 'Blur': 0}. Next: load training CSV, train a tabular model, and also call ResNet50Predict.
Code — Turn 1 import pandas as pd train_df = pd.read_csv("your_path/train.csv") train_df.head()
Code Result (DataFrame head: Id, Subject Focus, Eyes, Face, Near, …, Pawpularity)
Code — Turn 2: Train + Predict from sklearn.ensemble import RandomForestRegressor import numpy as np feature_cols = [c for c in train_df.columns if c not in ['Id','Pawpularity']] X = train_df[feature_cols] y = train_df['Pawpularity'] model = RandomForestRegressor(n_estimators=100, random_state=42) model.fit(X, y) sample = {'Eyes':1,'Face':1,'Near':1,'Action':0, 'Accessory':0,'Group':0,'Collage':0, 'Human':0,'Occlusion':0,'Info':0, 'Blur':0,'Subject Focus':0} tab_pred = model.predict( pd.DataFrame([sample])[feature_cols])[0] img_pred = ResNet50Predict("pawpularity", "your_path/image.jpg") (tab_pred, img_pred)
Code Result (38.02, 94.54)
Analysis — Fusion Tabular prediction: 38.02, Image prediction: 94.54. Since Pawpularity is primarily visual, I'll weight the image prediction more heavily: 0.9 × image + 0.1 × tabular.
Answer 88.89
At a Glance

Task Overview

TWT covers two families of Tabular-Vision Multi-Modal Understanding tasks.

Task TypeInputSandbox RequirementsAnswer
Table QA Table header image + question + CSV path pandas, standard Python Text / number
Table Prediction Sample image + task desc + CSV & image paths pandas, sklearn, ResNet50Predict Class label / regression value
Training

Two-Stage Recipe

Task-Oriented SFT teaches the output protocol; Adaptive Loss-Scaled GRPO refines reasoning through RL with multi-turn rollouts.

Stage 1 — TO-SFT

Supervised fine-tuning on ~1.5K Table QA + 1.2K Table Prediction samples. Code execution results are masked during loss computation to prevent memorization. The model learns the <analy> / <code> / <answer> protocol.

Stage 2 — AL-GRPO

GRPO reinforcement learning with a multi-turn scheduler. The TWT plugin extracts code, executes in sandbox, and appends results to context. Only successfully executed code contributes to gradient updates (adaptive loss scaling).

Cite

Citation

If you use TWT in your research, please cite our paper.

@article{twt2026, title = {Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning}, author = {Yu, Kun-Yang and Zhou, Zhi and Tian, Shi-Yu and others}, year = {2026} }