Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

Introduction

Generative modeling of crystals

Discovering new crystalline materials remains a difficult search problem and a central challenge in modern materials discovery [1]. The number of possible compositions and structures is enormous, while only a small fraction of candidates are thermodynamically competitive. Traditional structure-search strategies such as AIRSS and evolutionary crystal structure prediction can explore this space systematically [2, 3, 4]. In principle, first-principles calculations can assess whether a proposed material is meaningful. In practice, this quickly becomes too costly if the objective is broad exploration rather than detailed validation of a small set of structures.

Stable, unique, and novel crystal structures discovered by Crystalite. This visualization gives a qualitative preview of the kinds of candidates behind the quantitative S.U.N. results discussed later in the post.

Generative models are therefore attractive because they can learn to propose candidate crystals directly from data. However, crystal generation is not simply a matter of predicting discrete atom labels. Atoms occupy positions inside a periodic lattice, distances wrap across unit-cell boundaries, and small geometric errors can change the physical plausibility of a structure. This is the main reason why much of the recent literature has relied on geometry-aware and often equivariant graph neural networks [5, 6, 8, 9, 10].

Crystalite begins from a narrower question: how much of this geometric structure must be built into the backbone itself? Put differently, can a diffusion Transformer remain competitive if the inductive bias is placed more carefully? Recent diffusion-transformer results suggest that lighter backbones can indeed be competitive when the representation and geometry are handled well [10, 11].

In this post, we focus on two design choices that define Crystalite. The first is a chemically structured atom representation. The second is a lightweight way of injecting periodic geometry into attention without replacing the Transformer with a fully geometry-specific backbone.

Background

Equivariant GNNs for crystal modeling

A crystal is naturally described by three coupled objects: atom identities $ \mathbf{A} $, fractional coordinates $ \mathbf{F} $, and lattice geometry $ \mathbf{L} $:

$$ \mathcal{C} = (\mathbf{A}, \mathbf{F}, \mathbf{L}), \qquad \mathbf{A} \in \{0,1\}^{N \times N_Z}, \quad \mathbf{F} \in [0,1)^{N \times 3}, \quad \mathbf{L} \in \mathbb{R}^{3 \times 3}. $$

An important observation is that fractional coordinates should not be regarded as Cartesian coordinates in another form. A row of $ \mathbf{F} $ specifies the position of an atom within the unit cell relative to the lattice basis, whereas $ \mathbf{L} $ specifies the lattice basis itself in real space. As a result, the same fractional arrangement can correspond to substantially different Cartesian structures under different lattices, and real-space positions are only well defined when $ \mathbf{F} $ and $ \mathbf{L} $ are considered jointly.

Comparison between crystal positions in real space and fractional coordinates inside a unit cell — Fractional coordinates live inside the unit cell, while real-space positions are obtained only after the lattice basis is applied. This is why the coordinate channel and the lattice channel are inseparable in a crystal model.

This representation immediately explains why the problem is specialized. The unit cell is repeated periodically in all directions. Local environments matter, but only through the lattice that defines how the crystal repeats in real space. Geometric relations must be computed under the minimum-image convention rather than inside an isolated box.

Equivariant GNNs are well matched to this setting. They are designed to process neighborhoods, distances, directions, and symmetry transformations in a controlled way, which has enabled strong results in both crystal structure prediction and crystal generation. At the same time, incorporating equivariance in this way can make the resulting models architecturally complex and relatively slow at sampling time, when the denoiser must be evaluated repeatedly. Representative examples include CDVAE, DiffCSP, FlowMM, MatterGen, and OMatG [5, 6, 7, 8, 9].

Crystalite does not argue that geometry can be ignored. Rather, it asks whether geometry can be introduced in a more economical way. The model retains a standard diffusion Transformer backbone and concentrates its crystal-specific structure in the representation and the attention mechanism.

Representation

Chemistry-based atom encoding

The first modification concerns atom identity. In many crystal generators, atom types are represented by one-hot vectors. This is convenient, but chemically it is a poor geometry: every element is orthogonal to every other element. Sodium is no closer to lithium than it is to xenon.

This matters especially when diffusion is formulated over a continuous atom-type signal rather than with an explicitly discrete diffusion process. In that setting, one-hot atom identity provides no notion of chemical proximity: nearby vectors in representation space do not correspond to chemically similar elements, and chemically plausible substitutions are not encoded in the representation.

Crystalite replaces one-hot atom identity with Subatomic Tokenization, a compact descriptor built from periodic and electronic structure: period, group, block, and valence-shell occupancy. This gives the atom representation a more chemically meaningful structure.

Subatomic Tokenization is based on the idea that nearby chemistry should remain nearby in representation space. Here, the periodic table is useful because period, group, block, and valence structure directly inform the token design.

In a simplified form, the token for element $k$ can be written as

$$ \mathbf{h}_k = \big[ \mathrm{onehot}(r_k), \mathrm{onehot}(g_k), \mathrm{onehot}(b_k), s_k/2,\, p_k/6,\, d_k/10,\, f_k/14 \big]. $$

These descriptors are standardized, balanced across feature groups, optionally PCA-compressed, and then treated as continuous atom tokens. During sampling, a predicted token is decoded back to a valid element by nearest-token matching. This makes the denoising problem better aligned with chemical structure: nearby errors in token space correspond more naturally to chemically related species.

$$ \mathcal{C} = (\mathbf{H}, \mathbf{F}, \mathbf{y}), \qquad \mathbf{H} \in \mathbb{R}^{N \times d_H}, \quad \mathbf{F} \in [0,1)^{N \times 3}, \quad \mathbf{y} \in \mathbb{R}^{6}. $$

These are the three continuous channels denoised jointly by the model: atom tokens $ \mathbf{H} $, fractional coordinates $ \mathbf{F} $, and a compact latent description of the lattice $ \mathbf{y} $.

Method

Geometry-aware attention with GEM

The second key design choice concerns geometry. Crystals are periodic objects, so geometric quantities should respect the torus structure of fractional coordinates. The subtle point is that Crystalite uses two closely related constructions here: a wrapped fractional residual for the coordinate loss, and a metric-aware periodic-image search for the geometry-aware attention module introduced below.

$$ \boldsymbol{\delta}^{\mathrm{wrap}}_{ij} = \mathrm{wrap}(\mathbf{f}_i - \mathbf{f}_j), \qquad \mathrm{wrap}(\mathbf{u}) = \mathbf{u} - \mathrm{round}(\mathbf{u}), \qquad d^{\mathrm{wrap}}_{ij} = \left\| \boldsymbol{\delta}^{\mathrm{wrap}}_{ij} \mathbf{L} \right\|_2. $$

This wrapped residual is the right object for the coordinate loss because the model predicts fractional coordinates directly. For the geometry-aware attention bias, the goal is slightly different: we want the shortest periodic displacement under the lattice metric, not just a componentwise wrap. In practice, the model searches over a small set of nearby periodic images and keeps the one with the smallest real-space norm:

$$ \Delta \mathbf{f}_{ij}^{\star} = \arg\min_{\mathbf{r} \in \Omega_R} \left\| \left( \mathbf{f}_i - \mathbf{f}_j + \mathbf{r} \right)\mathbf{L} \right\|_2. $$

For orthogonal cells these two notions coincide, but for skewed lattices they do not have to. That is why the paper distinguishes between wrapped fractional residuals in the loss and metric-aware minimum-image geometry in the attention module. The backbone itself is still intentionally simple: one token per atom, one additional global token for the lattice, and a standard diffusion-conditioned Transformer trunk.

$$ \mathbf{t}_i^{\mathrm{atom}} = E_H(\mathbf{h}_i) + E_F(\mathbf{f}_i), \qquad \mathbf{t}^{\mathrm{lat}} = E_{\mathrm{lat}}(\mathbf{y}). $$

Overview of the Crystalite architecture — Crystalite keeps the architecture compact: one token per atom, one lattice token, a diffusion-conditioned Transformer trunk, and lightweight heads for the atom, coordinate, and lattice channels.

This gives the model a clean decomposition between local atomic information and global cell information. The lattice is parameterized through a lower-triangular latent, which keeps the representation unconstrained while ensuring positive diagonal entries:

$$ \mathbf{L}(\mathbf{y}) = \begin{bmatrix} e^{y_1} & 0 & 0 \\ y_2 & e^{y_3} & 0 \\ y_4 & y_5 & e^{y_6} \end{bmatrix}. $$

The geometry-specific piece of this backbone is the Geometric Enhancement Module (GEM). GEM computes periodic pairwise features from atom positions and the lattice, then feeds them into attention as a bias rather than replacing the Transformer with a heavier message-passing architecture.

Concretely, GEM adds a geometry-dependent bias directly to the attention logits:

$$ A_{\mathrm{geom}} = \frac{QK^\top}{\sqrt{d}} + B_{\mathrm{geom}}. $$

Diagram of the Geometric Enhancement Module — GEM computes minimum-image pair geometry under periodic boundary conditions and converts it into an additive attention bias. Geometry therefore enters the model directly at the level of token interactions.

The bias term $B_{\mathrm{geom}}$ is built from periodic pairwise geometry: metric-aware minimum-image displacements, normalized distances, Fourier or radial features, and a compact lattice descriptor. The resulting attention is still standard self-attention, but it is informed by crystal geometry at the point where token interactions are decided.

This is the central modeling idea. Crystalite does not attempt to remove geometric inductive bias. Instead, it introduces a softer and more modular form of that bias inside the attention mechanism.

Results

Crystal structure prediction and de novo generation results

We evaluate Crystalite in two settings. In crystal structure prediction (CSP), the composition is known and the model predicts the structure. In de novo generation, the model generates atom types, coordinates, and lattice jointly from noise.

Crystal structure prediction

The CSP results are particularly direct. Across the reported benchmarks, Crystalite achieves the strongest results in both match rate and RMSE. The RMSE improvements are especially notable, because they suggest better geometric refinement rather than merely better recovery of the correct structural mode.

Model	MP-20		MPTS-52		Alex-MP-20
Model	MR ↑	RMSE ↓	MR ↑	RMSE ↓	MR ↑	RMSE ↓
CDVAE	33.90	0.1045	5.34	0.2106	--	--
DiffCSP	51.49	0.0631	12.19	0.1786	--	--
FlowMM	61.39	0.0566	17.54	0.1726	--	--
CrystalFlow	62.02	0.0710	22.71	0.1548	--	--
KLDM	65.83	0.0517	23.93	0.1276	--	--
OMatG	63.75	0.0720	25.15	0.1931	64.71	0.1251
Crystalite	66.05	0.0329	31.49	0.0701	67.52	0.0335

CSP results across the three reported benchmarks. The strongest gains appear in RMSE, which is consistent with GEM primarily improving geometric refinement.

This interpretation matches the ablations in the paper: GEM has only a modest effect on match rate, but it consistently improves structural accuracy.

De novo generation

The de novo generation setting is more nuanced, because validity, novelty, uniqueness, and stability pull in different directions. For that reason, the most informative summary target is often S.U.N., which rewards structures that are simultaneously stable, unique, and novel. On this benchmark, Crystalite achieves the strongest S.U.N. result while remaining substantially faster than competing baselines.

Model	Quality and Diversity					Stability, Distribution, and Speed
Model	Struct. Val. ↑	Comp. Val. ↑	Unique ↑	Novel ↑	U.N. ↑	Stable ↑	S.U.N. ↑	wdist-ρ ↓	wdist N-ary ↓	Time / 1k ↓
FlowMM	93.03	83.15	97.44	85.00	83.99	46.05	31.64	1.389	0.075	1560
CrystalDiT	77.82	67.28	90.88	59.33	56.86	83.41	41.70	0.202	0.171	73.72
DiffCSP	99.93	82.10	96.90	89.53	87.89	50.28	38.60	0.192	0.344	237
MatterGen	99.78	83.72	98.10	91.14	90.26	51.70	42.29	0.088	0.184	2639
ADiT	99.52	90.15	90.25	59.80	56.91	76.90	36.76	0.231	0.089	84.81
Crystalite	99.61	81.94	95.33	79.15	77.12	70.97	48.55	0.046	0.125	22.36 / 5.14†

Main de novo generation results on 10,000 generated crystals, with stability-related metrics estimated using NequIP. In this evaluation, Crystalite achieves the strongest S.U.N. result while also being the fastest sampler by a large margin. The daggered timing denotes an optimized inference setting with FlashAttention and bfloat16.

External benchmarking: LeMat GenBench

We also evaluate Crystalite on LeMat GenBench [12], an external benchmark that is separate from our own evaluation. This provides an additional check that the main conclusions are not specific to a single benchmark. The pre-relaxed and non-pre-relaxed groups should be read within their own categories. On this benchmark, Crystalite achieves state-of-the-art results on most metrics, and particularly on SUN and MSUN.

Model	Valid	Unique	Novel	Stable	Metastable	SUN	MSUN	E Above Hull	Relax. RMSD
Model	(%) ↑	(%) ↑	(%) ↑	(%) ↑	(%) ↑	(%) ↑	(%) ↑	(eV) ↓	(Å) ↓
Pre-Relaxed Models
Crystalite	97.20	95.80	53.20	12.70	51.60	1.50	22.60	0.0905	0.1320
OMatG	96.40	95.20	51.20	11.60	49.80	1.00	18.00	0.0956	0.0759
MatterGen	95.70	95.10	70.50	2.00	33.40	0.20	15.00	0.1834	0.3878
PLaID++	96.00	77.80	24.20	12.40	60.70	1.00	7.60	0.0854	0.1286
WyFormer-DFT	95.20	95.00	66.40	3.70	24.80	0.40	7.80	0.2708	0.4173
WyFormer	93.40	93.00	66.40	0.50	15.70	0.10	1.90	0.4988	0.8121
Non-Pre-Relaxed Models
DiffCSP	95.70	94.80	66.20	2.30	29.80	0.10	8.50	0.2747	0.3794
DiffCSP++	95.30	95.10	62.00	1.00	26.40	0.20	5.00	0.4093	0.6933
SymmCD	73.40	73.00	47.00	1.40	18.60	0.10	2.40	0.8761	0.8720
CrystalFormer	69.90	69.40	31.80	1.40	28.80	0.00	3.10	0.7039	0.6585
ADiT	90.60	87.80	26.00	0.40	36.50	0.00	1.00	0.3333	0.3794
Crystal-GFN	51.70	51.70	51.70	0.00	0.00	0.00	0.00	2.0858	1.8665

LeMat GenBench results, shown separately for pre-relaxed and non-pre-relaxed models. Within the pre-relaxed group, Crystalite leads on validity, uniqueness, stable rate, SUN, and MSUN.

Large-scale generation

A good generative model should not only produce plausible samples, but continue to generate diverse candidates as the sampling budget grows. Evaluating that behavior at scale is only practical when generation is fast enough to make such studies feasible in the first place. This is exactly where Crystalite is useful: it is fast enough for large-scale generation, while preserving uniqueness and unique-and-novel rate more effectively as more crystals are sampled.

Conclusion

Balancing efficiency and expressivity

Crystalite shows that strong crystal generation does not require pushing all of the geometric inductive bias into a heavy backbone. Instead, chemical structure can be encoded in the atom representation, while periodic geometry can be placed where it matters most: in the loss and in attention through GEM.

In practice, that gives a useful balance: Crystalite remains easier to train, sample from, and extend than many geometry-heavy alternatives while still reaching state-of-the-art crystal structure prediction results and strong de novo generation performance. In the main generation comparison, it also achieves the highest S.U.N. score while sampling one to two orders of magnitude faster than leading GNN-based baselines, depending on the reference model and inference setting.

That speed matters for more than runtime alone. Fast sampling makes conditional generation, guidance, and steering much easier to use in practice, because adding search loops or external constraints no longer multiplies an already expensive denoising cost. In that sense, Crystalite is not only an efficient generator, but also a more practical foundation for controllable and scalable materials discovery workflows.

References

Selected references

Merchant et al. (2023). Scaling deep learning for materials discovery. Nature.
Pickard and Needs (2011). Ab initio random structure searching. Journal of Physics: Condensed Matter.
Oganov and Glass (2006). Crystal structure prediction using ab initio evolutionary techniques. The Journal of Chemical Physics.
Oganov et al. (2019). Structure prediction drives materials discovery. Nature Reviews Materials.
Xie et al. (2021). Crystal Diffusion Variational Autoencoder for Periodic Material Generation.
Jiao et al. (2024). Crystal Structure Prediction by Joint Equivariant Diffusion.
Miller et al. (2024). FlowMM: generating materials with Riemannian flow matching. ICML.
Hoellmer et al. (2025). Open Materials Generation with Stochastic Interpolants.
Zeni et al. (2025). A Generative Model for Inorganic Materials Design. Nature.
Joshi et al. (2025). All-atom Diffusion Transformers: Unified generative modelling of molecules and materials.
Yi et al. (2025). CrystalDiT: A Diffusion Transformer for Crystal Generation.
Betala et al. (2026). LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models.

Crystalite: A Lightweight Transformer forEfficient Crystal Modeling