# SlideCoder: Layout-Aware RAG-Enhanced Hierarchical Slide Generation
Converting Reference Images into Editable PowerPoint Scripts with Precision

Wenxin Tang, Jingyu Xiao, Wenxuan Jiang, Xi Xiao, Yuhang Wang, Xuxin Tang, Qing Li, Yuehe Ma, Junliang Liu, Shisong Tang, Michael R. Lyu.
Tsinghua University | CUHK | Northeastern University | Kuaishou Technology | Peng Cheng Laboratory
# The Challenge: Automated Slide Creation Fails at Fidelity
Current approaches built on multimodal LLMs (MLLMs), such as AutoPresent, attempt to automate slide creation from natural language. However, when tasked with replicating a visual Reference Image (RI), they fail to capture the structural integrity of the design.
# Visual Evidence
[Figure: baseline outputs for a sample slide, annotated with the three error types: miss, incorrect, disorder.]
1. Miss: Complete omission of visual or textual elements.
2. Incorrect: Deviations in style (font weight, color, shape attributes).
3. Disorder: Significant errors in spatial arrangement and alignment.
# Why MLLMs Struggle with Slide Generation
# Visual Blindness

# The Complexity Gap

Natural language prompts cannot capture pixel-perfect layout nuances or the high density of elements in professional slides.
# Library Ignorance
General-purpose MLLMs hallucinate python-pptx code, often using invalid syntax or non-existent APIs such as `slide.shapes.add_magic_box`.
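One way to make this failure mode concrete is to scan generated code for calls on `slide.shapes` that python-pptx does not provide. The sketch below is illustrative only and is not part of SlideCoder: the helper name `unknown_shape_calls` is hypothetical, and the allow-list is limited to the shape factory methods that python-pptx documents on its shapes collection.

```python
import ast

# Factory methods documented on python-pptx's slide.shapes collection.
KNOWN_SHAPE_METHODS = {
    "add_chart", "add_connector", "add_group_shape", "add_movie",
    "add_ole_object", "add_picture", "add_shape", "add_table",
    "add_textbox", "build_freeform",
}


def unknown_shape_calls(source: str) -> list[str]:
    """Return names of methods called on a `.shapes` attribute that are not in the allow-list.

    Hypothetical helper: it only inspects calls of the form `<obj>.shapes.<method>(...)`.
    """
    unknown = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Attribute)
            and node.func.value.attr == "shapes"
            and node.func.attr not in KNOWN_SHAPE_METHODS
        ):
            unknown.append(node.func.attr)
    return unknown


if __name__ == "__main__":
    print(unknown_shape_calls("slide.shapes.add_magic_box(1, 2)"))
    print(unknown_shape_calls("slide.shapes.add_textbox(0, 0, 10, 10)"))
```

A static check like this catches only misspelled or invented method names; wrong arguments or invalid syntax still need to be caught by executing the generated script.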
Current methods rely on simplified APIs (SlidesLib) that support only basic operations, failing to capture complex human design intentions.
# Introducing SlideCoder: A Unified End-to-End Framework
Divide and Conquer via Hierarchical Generation

# Contribution 1: The Slide2Code Benchmark
New Task: Reference Image (RI) to Slide Generation.
Goal: Generate Python code that replicates a provided image.
Data Source: 300 high-quality samples curated from ~32,000 candidates.

Measuring Difficulty: The Slide Complexity Metric (SCM)
$z_i = \alpha \cdot c_i + \beta \cdot e_i + \gamma \cdot u_i$
The variables $c_i$, $e_i$, and $u_i$ measure three properties of slide $i$:
· Visual Density: total number of visual elements on the canvas.
· Count of distinct element types (textboxes, charts, shapes).
· Element Coverage: ratio calculated via activated color grids.
Validation: SCM correlates strongly with human judgment ($r = 0.873$).
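The SCM weighted sum can be written as a few lines of Python. This is a minimal sketch: the slide does not give the values of $\alpha$, $\beta$, $\gamma$ or say how the three inputs are normalized, so the equal default weights and the assumption that inputs are already scaled to a comparable range are placeholders, and the function name `slide_complexity` is hypothetical.

```python
def slide_complexity(c: float, e: float, u: float,
                     alpha: float = 1 / 3,
                     beta: float = 1 / 3,
                     gamma: float = 1 / 3) -> float:
    """Weighted sum z = alpha*c + beta*e + gamma*u.

    c, e, u are the three per-slide measurements from the SCM formula.
    The default weights are placeholders, not values from the source.
    """
    return alpha * c + beta * e + gamma * u


if __name__ == "__main__":
    # Illustrative inputs only, assumed to be pre-normalized to [0, 1].
    sparse_slide = slide_complexity(0.1, 0.2, 0.1)
    dense_slide = slide_complexity(0.9, 0.7, 0.8)
    print(sparse_slide < dense_slide)
```

With any positive weights the score grows with each input, so a slide that is denser on all three measurements always ranks as more complex.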
# Methodology I: Color Gradient-based Segmentation (CGSeg)
Recursive decomposition based on visual density.
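The slide gives no algorithmic detail beyond "recursive decomposition based on visual density", so the following is only an illustrative stand-in, not the authors' CGSeg. It uses a quadtree-style split driven by a simple neighbour-difference density on a grayscale grid. The function names, both thresholds, and the density measure are assumptions.

```python
def edge_density(img, x, y, w, h, diff_threshold=30):
    """Fraction of pixels in the region whose right or lower neighbour
    (inside the same region) differs by more than diff_threshold."""
    active = 0
    for r in range(y, y + h):
        for c in range(x, x + w):
            right = abs(img[r][c] - img[r][c + 1]) if c + 1 < x + w else 0
            down = abs(img[r][c] - img[r + 1][c]) if r + 1 < y + h else 0
            if max(right, down) > diff_threshold:
                active += 1
    return active / (w * h)


def segment(img, x, y, w, h, density_threshold=0.1, min_size=4):
    """Recursively split a region into quadrants while it is visually dense.

    Returns a list of (x, y, w, h) blocks; uniform regions are dropped.
    """
    density = edge_density(img, x, y, w, h)
    if density == 0:
        return []  # uniform background, nothing to describe
    if density <= density_threshold or w <= min_size or h <= min_size:
        return [(x, y, w, h)]  # sparse or small enough to handle as one block
    hw, hh = w // 2, h // 2
    quadrants = [
        (x, y, hw, hh), (x + hw, y, w - hw, hh),
        (x, y + hh, hw, h - hh), (x + hw, y + hh, w - hw, h - hh),
    ]
    blocks = []
    for qx, qy, qw, qh in quadrants:
        blocks.extend(segment(img, qx, qy, qw, qh, density_threshold, min_size))
    return blocks


if __name__ == "__main__":
    # 8x8 blank canvas with one bright 2x2 element near the top-left corner.
    canvas = [[0] * 8 for _ in range(8)]
    for r in (1, 2):
        for c in (1, 2):
            canvas[r][c] = 255
    print(segment(canvas, 0, 0, 8, 8))
```

Each returned block carries its own `(x, y, w, h)`, which is the kind of positional information that a later stage can pass along with the cropped region.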

# Methodology II: Hierarchical RAG (H-RAG)
Dual-layer knowledge injection for syntax correctness.
Concept:
<html><body><table><tr><td></td><td>Level 1: Shape Type KB</td><td>Level 2: Operation Function KB</td></tr><tr><td>Target Agent</td><td>Describer</td><td>Coder</td></tr><tr><td>Function</td><td>Standardizes terminology (e.g., 'Placeholder' vs 'Box').</td><td>Provides exact API syntax, parameters, and return values.</td></tr><tr><td>Example entry</td><td>{'Auto Shape': 'Predefined shape with approx 180 variations...'}</td><td>pptx.Presentation(pptx=None) -> Presentation object</td></tr></table></body></html>

Implementation (prompt excerpt): "You are a python-pptx expert... The following is an introduction to shape types in PPT: {<Grammar>}"
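The two-level lookup can be sketched as two small dictionaries feeding two prompts. This is an assumption-laden illustration: the knowledge-base entries are the two examples shown on the slide, but the keyword-matching retriever, the function names, and the prompt wording beyond the quoted excerpt are hypothetical. The slide does not say how retrieval is implemented.

```python
# Level 1: shape-type knowledge base (entry text taken from the slide).
SHAPE_TYPE_KB = {
    "Auto Shape": "Predefined shape with approx 180 variations...",
}

# Level 2: operation-function knowledge base (signature taken from the slide).
OPERATION_KB = {
    "Presentation": "pptx.Presentation(pptx=None) -> Presentation object",
}


def retrieve(kb, query):
    """Hypothetical retriever: return entries whose key occurs in the query text."""
    q = query.lower()
    return [f"{key}: {text}" for key, text in kb.items() if key.lower() in q]


def build_describer_prompt(block_hint):
    """Inject Level 1 (shape type) knowledge into the Describer's prompt."""
    grammar = "\n".join(retrieve(SHAPE_TYPE_KB, block_hint))
    return (
        "You are a python-pptx expert. The following is an introduction "
        f"to shape types in PPT:\n{grammar}\nDescribe this block: {block_hint}"
    )


def build_coder_prompt(description):
    """Inject Level 2 (operation function) knowledge into the Coder's prompt."""
    syntax = "\n".join(retrieve(OPERATION_KB, description))
    return f"Relevant python-pptx API entries:\n{syntax}\nWrite code for: {description}"


if __name__ == "__main__":
    print(build_describer_prompt("a rounded auto shape with a title"))
    print(build_coder_prompt("create a Presentation and add one slide"))
```

Keeping the two knowledge bases separate mirrors the table: the Describer only sees terminology, while the Coder only sees exact signatures.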
# Methodology III: The Multi-Agent Workflow

# Methodology IV: Layout-Aware Prompting
# The Challenge:
Local code snippets don't know where they belong on the global canvas.
# The Fix:
Injecting relative coordinates $(x, y, w, h)$ derived from CGSeg directly into the prompt.
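As a rough illustration of the fix, a block's pixel bounding box can be normalized by the canvas size and written into the prompt text. The sketch assumes "relative" means fractions of the slide width and height. The function names and the prompt wording are hypothetical.

```python
def to_relative(box, canvas_w, canvas_h):
    """Convert a pixel box (x, y, w, h) to fractions of the canvas size."""
    x, y, w, h = box
    return (
        round(x / canvas_w, 3),
        round(y / canvas_h, 3),
        round(w / canvas_w, 3),
        round(h / canvas_h, 3),
    )


def layout_aware_prompt(block_description, box, canvas_w, canvas_h):
    """Append the block's relative position to its description (hypothetical wording)."""
    rx, ry, rw, rh = to_relative(box, canvas_w, canvas_h)
    return (
        f"{block_description}\n"
        f"This block sits at relative position (x={rx}, y={ry}, w={rw}, h={rh}) "
        "on the full slide. Place the generated shapes accordingly."
    )


if __name__ == "__main__":
    # A 640x360 block whose top-left corner is at (320, 180) on a 1280x720 canvas.
    print(layout_aware_prompt("A title textbox", (320, 180, 640, 360), 1280, 720))
```

Relative values keep the prompt independent of the reference image's resolution, so the same block description works whatever pixel size the input was rendered at.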

# SlideMaster: The Open-Source Contender
The Secret Sauce: Custom PPTX Reverse-Engineering Tool
Base Model: Qwen2.5-VL-7B-Instruct
<html><body><table><tr><td>Feature</td><td>AutoPresent (Baseline)</td><td>SlideMaster (Ours)</td></tr><tr><td>Object Types</td><td>5</td><td>10</td></tr><tr><td>Styles</td><td>16</td><td>44</td></tr><tr><td>Complexity</td><td>Basic Shapes</td><td></td></tr></table></body></html>

Training: Fine-tuned on high-quality (Image, Code) pairs.
# Quantitative Results: Dominating the Leaderboard
<html><body><table><tr><td>Framework</td><td>Difficulty</td><td>Execution</td><td>Overall</td></tr><tr><td>AutoPresent (GPT-4o)</td><td>Simple</td><td>61.0%</td><td>80.3</td></tr><tr><td>AutoPresent (GPT-4o)</td><td>Complex</td><td>41.4%</td><td>53.0</td></tr><tr><td>SlideCoder (GPT-4o)</td><td>Simple</td><td>99.0%</td><td>91.8</td></tr><tr><td>SlideCoder (GPT-4o)</td><td>Complex</td><td>85.5%</td><td>82.2</td></tr></table></body></html>

On Complex tasks, SlideCoder outperforms the baseline by 40.5 points.
Execution Rate approaches 100%, compared to ~40-60% for baselines.
# Qualitative Comparison: Visual Fidelity
[Figure: side-by-side comparison of the Reference Image (Input), the Baseline (AutoPresent) output, and the SlideCoder (Ours) output for a sample slide titled "Everyday Objects and Their Roles".]

SlideCoder preserves grid layout, icon placement, and text hierarchy.
# Ablation Study: Why Every Component Matters
Impact of Removing Components (Overall Score on Simple Tasks)
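<html><body><table><tr><td>Configuration</td><td>Overall Score</td></tr><tr><td>Full Model</td><td>89.9</td></tr><tr><td>w/o H-RAG</td><td>80.4</td></tr><tr><td>w/o Layout</td><td>81.2</td></tr><tr><td>w/o CGSeg</td><td>55.4</td></tr></table></body></html>

Removing Segmentation (CGSeg) causes a massive performance drop. The model gets overwhelmed by the full slide image.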
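The size of each drop follows directly from the four scores above. The short calculation below just subtracts each ablated score from the full model's score. The variable names are arbitrary.

```python
# Overall scores on Simple tasks, as reported in the ablation above.
scores = {
    "Full Model": 89.9,
    "w/o H-RAG": 80.4,
    "w/o Layout": 81.2,
    "w/o CGSeg": 55.4,
}

full = scores["Full Model"]

# Points lost when each component is removed, rounded to one decimal place.
drops = {
    name: round(full - score, 1)
    for name, score in scores.items()
    if name != "Full Model"
}

if __name__ == "__main__":
    for name, drop in drops.items():
        print(f"{name}: -{drop} points")
    # w/o H-RAG: -9.5 points
    # w/o Layout: -8.7 points
    # w/o CGSeg: -34.5 points
```

Removing CGSeg costs 34.5 points, roughly four times the loss from removing either H-RAG (9.5) or layout-aware prompting (8.7).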
# Conclusion & Future Directions
· Slide2Code: Established a rigorous new benchmark for complex slide generation.
· SlideCoder: Achieved SOTA by integrating Segmentation (CGSeg), Retrieval (H-RAG), and Layout-Aware Prompting.
· SlideMaster: Released an open-source 7B model that rivals proprietary giants like GPT-4o.
Future Work:
· Multi-slide generation capabilities.
· Handling full-slide background images without separate assets.