Brick-Composer: MLLMs Construct Everything from Building Blocks

Jiateng Liu¹, Bingxuan Li¹, Zhenhailong Wang¹, Rushi Wang¹, Kaiwen Hong¹, Cheng Qian¹, Jiayu Liu¹, Denghui Zhang², Katherine Driggs-Campbell¹, Manling Li³, Heng Ji¹

¹UIUC    ²Stevens Institute of Technology    ³Northwestern University

Figure: Overview of the BC-Bench task setting. Left: Brick selection. The model selects the required brick from a candidate grid using manual images. Right: Brick pose estimation. Given the manual context, current assembly state, and selected brick, the model predicts the brick's target pose as a translation vector and rotation matrix.

Abstract

We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed.
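The sequential formulation above can be sketched as a simple rollout loop. This is only an illustrative sketch: the names `Pose`, `Placement`, `AssemblyState`, `SelectFn`, `PoseFn`, and `run_episode` are hypothetical and do not come from the paper, the brick IDs are placeholder strings, and the manual images a real model would receive are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class Pose:
    translation: Vec3  # where the brick is placed
    rotation: Mat3     # how it is oriented (3x3 rotation matrix, row-major)


@dataclass(frozen=True)
class Placement:
    brick_id: str
    pose: Pose


@dataclass
class AssemblyState:
    placed: List[Placement] = field(default_factory=list)


# Hypothetical policy signatures; a real model would also see manual images.
SelectFn = Callable[[AssemblyState, Sequence[str]], str]
PoseFn = Callable[[AssemblyState, str], Pose]


def run_episode(candidates_per_step: Sequence[Sequence[str]],
                select: SelectFn,
                estimate_pose: PoseFn) -> AssemblyState:
    """Roll out one assembly: at each step, select a brick, then place it."""
    state = AssemblyState()
    for candidates in candidates_per_step:
        brick_id = select(state, candidates)    # subtask 1: brick selection
        pose = estimate_pose(state, brick_id)   # subtask 2: brick pose estimation
        state.placed.append(Placement(brick_id, pose))
    return state


# Toy policies for illustration: pick the first candidate, stack along z.
IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
final = run_episode(
    [["3001", "3003"], ["3003", "3020"]],
    select=lambda state, cands: cands[0],
    estimate_pose=lambda state, brick: Pose(
        (0.0, 0.0, float(len(state.placed))), IDENTITY),
)
print([p.brick_id for p in final.placed])
```

Because each step conditions on the evolving state, an early mistake in either subtask changes the input to every later step.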

To support this study, we introduce BC-Bench (Brick Construction Benchmark), a benchmark for evaluating MLLMs on assembly with diverse bricks. We further propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through Human Design Sparks, World Feedback, and Synthetic Experience. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%.

BC-Bench: Assembly with Diverse Bricks

BC-Bench evaluates whether MLLMs can follow step-wise assembly manuals with diverse LEGO-style parts. Each step requires the model to interpret multi-view manual images, identify the next brick from visually similar candidates, and infer the target 3D pose in an evolving structure.

Figure: Example assembly trajectories for two targets, a lion and a little bonsai, with top, front, right, and bottom manual views. Red boxes highlight the brick to add at each step, while multi-view manuals expose geometric and affordance cues.

Figure: Diverse part geometry and object-level trajectories in BC-Bench, shown as multi-view LDraw part examples and assembly target objects. Models must recognize fine-grained components and infer valid 3D placements across a wide variety of brick types and target objects.

Brick-Composer Learning Framework

Brick-Composer combines three complementary sources of supervision: affordance-rich human-designed assemblies, simulator-based world feedback for error recovery, and procedurally generated synthetic experiences for scalable spatial learning.

Figure: Overview of the Brick-Composer learning framework. We improve assembly reasoning through three complementary signals: human design supervision (Human Design Sparks), world feedback for error recovery (World Feedback), and scalable synthetic objects for experience expansion (Synthetic Experience). Together, they substantially enhance the model's assembly capabilities.

Main Results

The tables below reproduce the final result tables from the paper in full, reporting both overall-average and best-object performance.

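Table 2: Main evaluation of state-of-the-art MLLMs on BC-Bench. PE = pose estimation, SR = success rate, LDU = LDraw units; ↑ higher is better, ↓ lower is better. "Overall" columns give the overall average; "Best" columns give best object performance.

| Model | Overall Selection Acc. (%) ↑ | Overall PE Trans. Err. (LDU) ↓ | Overall PE Rot. Err. (°) ↓ | Overall Step-Wise SR (%) ↑ | Best Selection Acc. (%) ↑ | Best PE Trans. Err. (LDU) ↓ | Best PE Rot. Err. (°) ↓ | Best Step-Wise SR (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Gemma-3-12B | 4.35 | 269.09 | 63.00 | 0.09 | 17.24 | 41.86 | 12.86 | 2.22 |
| InternVL-3.5-8B | 13.24 | 221.69 | 88.43 | 0.00 | 31.25 | 41.54 | 35.00 | 0.00 |
| Qwen-3-VL-8B | 22.76 | 210.14 | 62.47 | 0.36 | 55.17 | 43.81 | 12.86 | 4.44 |
| Qwen-3.5-VL-27B | 37.44 | 314.32 | 82.94 | 0.18 | 75.86 | 64.47 | 34.84 | 4.44 |
| GPT-5.4 | 43.88 | 310.78 | 74.85 | 0.18 | 93.10 | 67.94 | 40.00 | 2.22 |
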
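Table 3: Performance improvements on BC-Bench. PE = pose estimation, SR = success rate, LDU = LDraw units; ↑ higher is better, ↓ lower is better. "Overall" columns give the overall average; "Best" columns give best object performance. A "-" marks a cell with no value in the source table.

Performance Comparison of Learning Approaches for Model: Gemma-3-12B

| Approach | Overall Selection Acc. (%) ↑ | Overall PE Trans. Err. (LDU) ↓ | Overall PE Rot. Err. (°) ↓ | Overall Step-Wise SR (%) ↑ | Best Selection Acc. (%) ↑ | Best PE Trans. Err. (LDU) ↓ | Best PE Rot. Err. (°) ↓ | Best Step-Wise SR (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Direct Prompting | 4.35 | 269.09 | 63.00 | 0.09 | 17.24 | 41.86 | 12.86 | 2.22 |
| World Feedback (P) | - | 273.41 | 62.26 | 0.00 | - | 39.46 | 14.53 | 2.22 |
| Designer Supervision | 15.59 | 201.51 | 55.69 | 0.72 | 26.67 | 36.91 | 15.00 | 4.44 |
| World Feedback (L) | - | 170.33 | 51.49 | 1.99 | - | 27.56 | 12.86 | 9.07 |
| Brick-Composer | 17.95 | 123.69 | 45.43 | 4.35 | 52.87 | 24.63 | 12.86 | 18.75 |

Performance Comparison of Learning Approaches for Model: Qwen-3-VL-8B

| Approach | Overall Selection Acc. (%) ↑ | Overall PE Trans. Err. (LDU) ↓ | Overall PE Rot. Err. (°) ↓ | Overall Step-Wise SR (%) ↑ | Best Selection Acc. (%) ↑ | Best PE Trans. Err. (LDU) ↓ | Best PE Rot. Err. (°) ↓ | Best Step-Wise SR (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Direct Prompting | 22.76 | 210.14 | 62.47 | 0.36 | 55.17 | 43.81 | 12.86 | 4.44 |
| World Feedback (P) | - | 226.33 | 65.66 | 0.27 | - | 42.36 | 12.86 | 4.44 |
| Designer Supervision | 48.29 | 162.82 | 57.81 | 5.40 | 73.24 | 27.13 | 12.95 | 8.92 |
| World Feedback (L) | - | 137.26 | 52.65 | 6.24 | - | 21.46 | 11.64 | 15.36 |
| Brick-Composer | 68.21 | 65.63 | 37.97 | 14.27 | 90.65 | 14.29 | 0.00 | 41.63 |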
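
The pose estimation columns report a translation error in LDU and a rotation error in degrees. This page does not restate how the paper computes them, so the following is a minimal sketch under common conventions, not the benchmark's evaluation code: Euclidean distance for translation, and the geodesic angle between rotation matrices for rotation. The function names and the tolerance parameters of `step_success` are assumptions introduced for illustration, and the sketch ignores brick symmetries that can make several rotations equally valid.

```python
import math
from typing import Sequence

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]


def translation_error(t_pred: Vec, t_true: Vec) -> float:
    """Euclidean distance between predicted and target positions (same units as input)."""
    return math.dist(t_pred, t_true)


def rotation_error_deg(r_pred: Mat, r_true: Mat) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices.

    Uses trace(R_pred^T R_true) = sum_ij R_pred[i][j] * R_true[i][j],
    and angle = arccos((trace - 1) / 2).
    """
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_theta))


def step_success(pred_brick: str, true_brick: str,
                 t_err: float, r_err_deg: float,
                 trans_tol: float, rot_tol_deg: float) -> bool:
    """Assumed strict criterion: correct brick AND both pose errors within tolerance."""
    return (pred_brick == true_brick
            and t_err <= trans_tol
            and r_err_deg <= rot_tol_deg)


# Example: a prediction rotated 90 degrees about z and offset by (3, 4, 0).
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
ROT_Z_90 = ((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
t_err = translation_error((3.0, 4.0, 0.0), (0.0, 0.0, 0.0))
r_err = rotation_error_deg(ROT_Z_90, IDENTITY)
print(t_err, r_err)
```

Under a conjunctive criterion of this kind, a step counts as a success only when selection and both pose components are right at once, which is consistent with step-wise success rates sitting far below selection accuracy in the tables.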

Synthetic Experience Data

Figure: Examples of synthesized assembly configurations (steps 0 through 8) used for synthetic experience learning. Each structure is generated by incrementally attaching sampled bricks through feasible connection positions, while filtering invalid placements based on collision avoidance and structural connectivity. A density preference further encourages compact layouts, allowing the synthesized data to provide diverse and scalable supervision for brick selection and pose estimation.
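
The generation procedure described in this caption can be illustrated with a toy version on an integer grid. This is a sketch under strong simplifying assumptions, not the paper's generator: bricks are flat rectangles of unit cells rather than LDraw parts, a "feasible connection" means sitting directly above or below an occupied cell, and the density preference is modeled as sampling weights based on the number of face-adjacent occupied cells. All function names are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int, int]
Size = Tuple[int, int]  # footprint (width, depth); every toy brick is one cell tall

NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def brick_cells(origin: Cell, size: Size) -> Set[Cell]:
    """Cells covered by a brick whose minimum corner is at `origin`."""
    x0, y0, z0 = origin
    return {(x0 + dx, y0 + dy, z0) for dx in range(size[0]) for dy in range(size[1])}


def is_valid(cells: Set[Cell], occupied: Set[Cell]) -> bool:
    """Reject collisions, cells below the ground plane, and disconnected bricks."""
    if cells & occupied or any(z < 0 for _, _, z in cells):
        return False
    return any((x, y, z - 1) in occupied or (x, y, z + 1) in occupied
               for x, y, z in cells)


def density_score(cells: Set[Cell], occupied: Set[Cell]) -> int:
    """Number of face-adjacent occupied cells; higher means a more compact layout."""
    return sum((x + dx, y + dy, z + dz) in occupied
               for x, y, z in cells for dx, dy, dz in NEIGHBORS)


def synthesize(num_steps: int, sizes: List[Size], seed: int = 0):
    """Grow a structure by attaching one sampled brick per step."""
    rng = random.Random(seed)
    first_size = rng.choice(sizes)
    occupied = brick_cells((0, 0, 0), first_size)
    steps = [((0, 0, 0), first_size)]  # step 0 is the base brick
    for _ in range(num_steps):
        size = rng.choice(sizes)
        candidates: Dict[Cell, Set[Cell]] = {}
        for x, y, z in occupied:
            for dz in (-1, 1):  # try attaching below or above each occupied cell
                for ox in range(x - size[0] + 1, x + 1):
                    for oy in range(y - size[1] + 1, y + 1):
                        origin = (ox, oy, z + dz)
                        cells = brick_cells(origin, size)
                        if origin not in candidates and is_valid(cells, occupied):
                            candidates[origin] = cells
        if not candidates:
            break
        origins = list(candidates)
        weights = [1 + density_score(candidates[o], occupied) for o in origins]
        chosen = rng.choices(origins, weights=weights, k=1)[0]
        occupied |= candidates[chosen]
        steps.append((chosen, size))
    return steps, occupied


steps, occupied = synthesize(8, [(1, 1), (2, 1), (2, 2)], seed=0)
print(len(steps), len(occupied))
```

Each recorded step pairs a sampled brick with its placement, so a trajectory of this form yields one selection target and one pose target per step.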

Citation

@misc{liu2026brickcomposer,
  title  = {Brick-Composer: MLLMs Construct Everything from Building Blocks},
  author = {Liu, Jiateng and Li, Bingxuan and Wang, Zhenhailong and Wang, Rushi and Hong, Kaiwen and Qian, Cheng and Liu, Jiayu and Zhang, Denghui and Driggs-Campbell, Katherine and Li, Manling and Ji, Heng},
  year   = {2026},
  url    = {https://github.com/Lumos-Jiateng/Brick-Composer}
}