# SlideCoder: Layout-Aware RAG-Enhanced Hierarchical Slide Generation

Converting Reference Images into Editable PowerPoint Scripts with Precision

![](images/image_0de4e880e26ab381828ce0a712e9c856ddb57462395ab3ac53279562d759047c.jpg)

Wenxin Tang, Jingyu Xiao, Wenxuan Jiang, Xi Xiao, Yuhang Wang, Xuxin Tang, Qing Li, Yuehe Ma, Junliang Liu, Shisong Tang, Michael R. Lyu.

Tsinghua University | CUHK | Northeastern University | Kuaishou Technology | Peng Cheng Laboratory

# The Challenge: Automated Slide Creation Fails at Fidelity

Current Multimodal LLMs (MLLMs) like AutoPresent attempt to automate slide creation from natural language. However, when tasked with replicating a visual Reference Image (RI), they fail to capture the structural integrity of the design.

Figure: generated slides annotated with three error types: miss, incorrect, disorder.

# Visual Evidence

1. Miss: Complete omission of visual or textual elements.
2. Incorrect: Deviations in style (font weight, color, shape attributes).
3. Disorder: Significant errors in spatial arrangement and alignment.

# Why MLLMs Struggle with Slide Generation

# Visual Blindness

![](images/image_6664acc856e9a1509f7403c777a8e3b9834c4405e90e67aaf4a0d03d34ff950c.jpg)

# The Complexity Gap

![](images/image_d703c6fad00f991f90757bb8c2f9ec17880434d0bf5669927f3b60f06d84d173.jpg)

Natural language prompts cannot capture pixel-perfect layout nuances or the high density of elements in professional slides.

# Library Ignorance

General-purpose MLLMs hallucinate python-pptx code, often using invalid syntax or non-existent APIs, with calls such as `slide.shapes.add_magic_box(...)`.

Current methods rely on simplified APIs (SlidesLib) that support only basic operations, failing to capture complex human design intentions.
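One way to catch the kind of hallucinated call described above before running generated code is a static check. The sketch below is not part of SlideCoder. It assumes generated scripts reach shape factories as `<expr>.shapes.<method>(...)`, and the helper name `find_unknown_shape_calls` is hypothetical. The allow-list holds the factory methods that python-pptx documents on a slide's shape collection.

```python
import ast

# Shape-factory methods documented on python-pptx's slide shape collection.
KNOWN_SHAPE_METHODS = {
    "add_chart", "add_connector", "add_group_shape", "add_movie",
    "add_ole_object", "add_picture", "add_shape", "add_table",
    "add_textbox", "build_freeform",
}


def find_unknown_shape_calls(source: str) -> list[str]:
    """Return method names called on `<expr>.shapes` that are not in the allow-list.

    Hypothetical helper for illustration; it only parses the code and never runs it.
    """
    unknown = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match calls of the form <expr>.shapes.<method>(...)
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Attribute)
            and func.value.attr == "shapes"
            and func.attr not in KNOWN_SHAPE_METHODS
        ):
            unknown.append(func.attr)
    return unknown


generated = """
box = slide.shapes.add_magic_box(left, top, width, height)
txt = slide.shapes.add_textbox(left, top, width, height)
"""
print(find_unknown_shape_calls(generated))  # ['add_magic_box']
```

A check like this only flags unknown method names; it does not validate arguments or syntax that `ast.parse` already rejects.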

# Introducing SlideCoder: A Unified End-to-End Framework

Divide and Conquer via Hierarchical Generation

![](images/image_da629d086d8750941660b8be49d02d3662c9c7eb34931d489dc6abdbc13d37e4.jpg)

# Contribution 1: The Slide2Code Benchmark

New Task: Reference Image (RI) to Slide Generation.

Goal: Generate Python code that replicates a provided image.

Data Source: Curated from \~32,000 candidates to select 300 high-quality samples.

![](images/image_d4f2b353e5a1dee4f16debfeae10dc0de3c6aa047e583c2645d78976c5c845a1.jpg)

Measuring Difficulty: The Slide Complexity Metric (SCM)

$z_i = \alpha \cdot c_i + \beta \cdot e_i + \gamma \cdot u_i$

Variable $c_i$ (Visual Density): Total number of visual elements on the canvas.

Variable $e_i$: Count of distinct element types (textboxes, charts, shapes).

Variable $u_i$ (Element Coverage): Ratio calculated via activated color grids.

Validation: SCM correlates strongly with human judgment ($r = 0.873$).
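The weighted sum behind the SCM can be written out in a few lines. This is a minimal sketch: the slide does not give the weight values or say whether the inputs are normalized first, so the default weights of 1.0 and the names `SlideStats` and `complexity_score` are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SlideStats:
    element_count: float  # c_i: total number of visual elements on the canvas
    type_count: float     # e_i: number of distinct element types
    coverage: float       # u_i: element coverage ratio


def complexity_score(stats: SlideStats, alpha: float = 1.0,
                     beta: float = 1.0, gamma: float = 1.0) -> float:
    """Compute z_i = alpha * c_i + beta * e_i + gamma * u_i.

    The weights are placeholders (assumption); the source does not state them.
    """
    return (alpha * stats.element_count
            + beta * stats.type_count
            + gamma * stats.coverage)


# A slide with 12 elements of 3 types covering half of the canvas.
print(complexity_score(SlideStats(element_count=12, type_count=3, coverage=0.5)))  # 15.5
```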

# Methodology I: Color Gradient-based Segmentation (CGSeg)

Recursive decomposition based on visual density.

![](images/image_2bc4a44a90c6593ecdb1b7fe83b7e407a85c73b96be9d01920625762829257a5.jpg)
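To make the idea of recursive, gradient-driven decomposition concrete, here is a simplified stand-in. It is a quadtree-style subdivision over a grayscale grid, not the CGSeg algorithm itself: the threshold, minimum block size, and the function names `has_edge` and `segment` are all assumptions. A region with no strong intensity change is treated as background; a region with one is split until it is small enough to emit.

```python
def has_edge(img, top, left, height, width, thr):
    """True if any pixel in the region differs from its right or lower
    neighbour by more than thr (neighbours may lie just outside the region)."""
    rows, cols = len(img), len(img[0])
    for y in range(top, top + height):
        for x in range(left, left + width):
            if x + 1 < cols and abs(img[y][x] - img[y][x + 1]) > thr:
                return True
            if y + 1 < rows and abs(img[y][x] - img[y + 1][x]) > thr:
                return True
    return False


def segment(img, top, left, height, width, thr=30, min_size=2):
    """Recursively split a region into quadrants and return the leaf
    blocks (x, y, w, h) that still contain an intensity edge."""
    if not has_edge(img, top, left, height, width, thr):
        return []  # uniform region: treat as background
    if height <= min_size or width <= min_size:
        return [(left, top, width, height)]  # small enough: emit the block
    h2, w2 = height // 2, width // 2
    boxes = []
    for t, l, h, w in [
        (top, left, h2, w2),
        (top, left + w2, h2, width - w2),
        (top + h2, left, height - h2, w2),
        (top + h2, left + w2, height - h2, width - w2),
    ]:
        boxes += segment(img, t, l, h, w, thr, min_size)
    return boxes


# 8x8 canvas with a bright 4x4 square in the top-left corner.
img = [[255 if x < 4 and y < 4 else 0 for x in range(8)] for y in range(8)]
print(segment(img, 0, 0, 8, 8))
# [(2, 0, 2, 2), (0, 2, 2, 2), (2, 2, 2, 2)]
```

The emitted blocks trace the boundary of the square, which is where the color transitions are; uniform interior and background regions are dropped.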

# Methodology II: Hierarchical RAG (H-RAG)

Dual-layer knowledge injection for syntax correctness.

<html><body><table><tr><td>Concept</td><td>Level 1: Shape Type KB</td><td>Level 2: Operation Function KB</td></tr><tr><td>Target Agent</td><td>Describer</td><td>Coder</td></tr><tr><td>Function</td><td>Standardizes terminology (e.g., 'Placeholder' vs 'Box').</td><td>Provides exact API syntax, parameters, and return values.</td></tr><tr><td>Example entry</td><td>{'Auto Shape': 'Predefined shape with approx 180 variations...'}</td><td>pptx.Presentation(pptx=None) -&gt; Presentation object</td></tr></table></body></html>

Implementation (prompt excerpt): You are a python-pptx expert... The following is an introduction to shape types in PPT: {<Grammar>}
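The two knowledge bases can be pictured as two lookup tables, each feeding a different agent's prompt. The sketch below reuses the two example entries shown on the slide; everything else is an assumption: the function names, the prompt wording, and the naive keyword match that stands in for whatever retrieval the authors use.

```python
# Level 1: shape-type knowledge, intended for the Describer agent.
SHAPE_TYPE_KB = {
    "Auto Shape": "Predefined shape with approx 180 variations...",
}

# Level 2: operation-function knowledge, intended for the Coder agent.
OPERATION_KB = {
    "pptx.Presentation": "pptx.Presentation(pptx=None) -> Presentation object",
}


def describer_context(shape_types):
    """Build the terminology section for the Describer prompt (hypothetical format)."""
    return "\n".join(
        f"{name}: {SHAPE_TYPE_KB[name]}"
        for name in shape_types
        if name in SHAPE_TYPE_KB
    )


def coder_context(description):
    """Return API entries whose function name appears in the description.

    Naive keyword matching is a stand-in (assumption), not the paper's retriever.
    """
    text = description.lower()
    return [
        signature
        for key, signature in OPERATION_KB.items()
        if key.split(".")[-1].lower() in text
    ]


print(describer_context(["Auto Shape", "Unknown Type"]))
print(coder_context("Create a presentation and add one auto shape"))
```

Keeping the two tables separate means the Describer only sees vocabulary and the Coder only sees signatures, which mirrors the split shown in the table.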

# Methodology III: The Multi-Agent Workflow

![](images/image_9bade18d2e43cf7f61d109752f21e8ef66d91619e9888ee2fc26439eed282479.jpg)

# Methodology IV: Layout-Aware Prompting

# The Challenge:

Local code snippets don't know where they belong on the global canvas.

# The Fix:

Injecting relative coordinates $( x , y , w , h )$ derived from CGSeg directly into the prompt.

![](images/image_d4b7e30b73db06d6b07667a71fe8e763be9332756c07525da19208317d49a8c0.jpg)
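A rough illustration of injecting layout into a prompt follows. It assumes that "relative" means normalized by the canvas size and that segmentation yields pixel boxes; the prompt wording and the names `to_relative` and `layout_prompt` are hypothetical.

```python
def to_relative(box, canvas_w, canvas_h):
    """Convert a pixel box (x, y, w, h) into fractions of the canvas size."""
    x, y, w, h = box
    return (x / canvas_w, y / canvas_h, w / canvas_w, h / canvas_h)


def layout_prompt(block_description, box, canvas_w, canvas_h):
    """Append the block's relative position to its description (hypothetical wording)."""
    rx, ry, rw, rh = to_relative(box, canvas_w, canvas_h)
    return (
        f"{block_description}\n"
        f"Position on the slide (relative x, y, w, h): "
        f"({rx:.3f}, {ry:.3f}, {rw:.3f}, {rh:.3f})"
    )


# A block occupying the centre of a 1920x1080 canvas.
print(layout_prompt("Title text block", (480, 270, 960, 540), 1920, 1080))
# Title text block
# Position on the slide (relative x, y, w, h): (0.250, 0.250, 0.500, 0.500)
```

Relative coordinates keep the prompt independent of the reference image's resolution, so the same values can be scaled to the slide's dimensions when code is generated.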

# SlideMaster: The Open-Source Contender

The Secret Sauce: Custom PPTX Reverse-Engineering Tool

Base Model: Qwen2.5-VL-7B-Instruct

<html><body><table><tr><td>Feature</td><td>AutoPresent (Baseline)</td><td>SlideMaster (Ours)</td></tr><tr><td>Object Types</td><td>5</td><td>10</td></tr><tr><td>Styles</td><td>16</td><td>44</td></tr><tr><td>Complexity</td><td>Basic Shapes</td><td></td></tr></table></body></html>

![](images/table_84f86f430bfc47d1d9f24083d03511c8ab8edf667164b39d95fcf9d89cb6bfae.jpg)

Training: Fine-tuned on high-quality (Image, Code) pairs.

# Quantitative Results: Dominating the Leaderboard

<html><body><table><tr><td>Framework</td><td>Difficulty</td><td>Execution</td><td>Overall</td></tr><tr><td>AutoPresent (GPT-4o)</td><td>Simple</td><td>61.0%</td><td>80.3</td></tr><tr><td>AutoPresent (GPT-4o)</td><td>Complex</td><td>41.4%</td><td>53.0</td></tr><tr><td>SlideCoder (GPT-4o)</td><td>Simple</td><td>99.0%</td><td>91.8</td></tr><tr><td>SlideCoder (GPT-4o)</td><td>Complex</td><td>85.5%</td><td>82.2</td></tr></table></body></html>

![](images/table_b397694b91dd341ebcdbc167e711164e123260d150370f9172f9f375354056d0.jpg)

On Complex tasks, SlideCoder outperforms the baseline by 40.5 points.

Execution Rate approaches 100%, compared to \~40-60% for baselines.
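The slide does not define Execution Rate. If it is read as the share of generated scripts that run without raising an error (an assumption), it could be measured roughly as below. The function name `execution_rate` is hypothetical, and the toy scripts stand in for generated python-pptx code.

```python
def execution_rate(scripts):
    """Fraction of code strings that compile and run without raising.

    Assumed definition for illustration; the source does not spell it out.
    """
    if not scripts:
        return 0.0
    succeeded = 0
    for code in scripts:
        try:
            # Each script gets a fresh, empty namespace.
            exec(compile(code, "<generated>", "exec"), {})
            succeeded += 1
        except Exception:
            # Syntax errors, missing names, and bad API calls all count as failures.
            pass
    return succeeded / len(scripts)


samples = [
    "x = 1 + 1",          # runs
    "undefined_call()",   # NameError
    "y = 2 +",            # SyntaxError
    "z = [i for i in range(3)]",  # runs
]
print(execution_rate(samples))  # 0.5
```

In practice, model-generated code should be run in an isolated process or sandbox rather than with a bare `exec`.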

# Qualitative Comparison: Visual Fidelity

Panels: Reference Image (Input), Baseline (AutoPresent), SlideCoder (Ours). Example slide title: "Everyday Objects and Their Roles".

![](images/image_d7ebad1b8aa18ce5055ffd1010887c639e71d03388b004b2b747ec65583502ee.jpg)

Preserves grid layout, icon placement, and text hierarchy.

# Ablation Study: Why Every Component Matters

Impact of Removing Components (Overall Score on Simple Tasks)

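<html><body><table><tr><td>Configuration</td><td>Overall Score</td></tr><tr><td>Full Model</td><td>89.9</td></tr><tr><td>w/o H-RAG</td><td>80.4</td></tr><tr><td>w/o Layout</td><td>81.2</td></tr><tr><td>w/o CGSeg</td><td>55.4</td></tr></table></body></html>

Removing Segmentation (CGSeg) causes a massive performance drop, from 89.9 to 55.4 (34.5 points). The model gets overwhelmed by the full slide image.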

# Conclusion & Future Directions

· Slide2Code: Established a rigorous new benchmark for complex slide generation.

· SlideCoder: Achieved SOTA by integrating Segmentation (CGSeg), Retrieval (H-RAG), and Layout-Aware Prompting.

· SlideMaster: Released an open-source 7B model that rivals proprietary giants like GPT-4o.

Future Work:

· Multi-slide generation capabilities.
· Handling full-slide background images without separate assets.