DORA Explorer:

Improving the Exploration Ability of LLMs Without Training

Anonymous Authors

LLM agents struggle with exploration, often converging to sub-optimal solutions or getting stuck in loops. DORA Explorer (Diversity-Oriented Ranking of Actions) is a training-free framework that addresses this by generating diverse action candidates, scoring them via token log-probabilities, and selecting among them with a tunable exploration parameter. DORA achieves UCB-competitive performance on Multi-Armed Bandits and consistent gains across the Text Adventure Learning Environment Suite (TALES).

Method

DORA Explorer samples diverse action candidates, scores each by its token log-probability, and draws the action from a softmax over those scores whose sharpness is set by a tunable exploration parameter λ.

Algorithm 1 DORA Explorer

Input: history H, observation o_t, used-action set A_{o_t}

Hyperparameters: n_C, {τ_d, τ_λ, τ_C}, α

Output: selected action a*_t

 1: c ← [H; o_t]
 2: d ~ π_d(c, τ_d)                          ▷ Decide greedy or explore
 3: if d = EXPLORE then
 4:   λ ← λ_exp(t)  or  π_λ(c, τ_λ)          ▷ Schedule or policy
 5:   C ← ∅
 6:   C' ~ π_C(c, τ_C, n_C)                  ▷ Sample n_C candidates
 7:   for a ∈ C' do                          ▷ Filter previously used actions
 8:     if a ∉ A_{o_t} then
 9:       C ← C ∪ {a}
10:     end if
11:   end for
12:   s ← (Score(a, α))_{a∈C}
13:   p_λ ← Softmax(λ · s)
14:   a*_t ~ Categorical(p_λ)
15: else
16:   a*_t ~ π_g(c, τ = 0)                   ▷ Greedy action
17: end if
18: A_{o_t} ← A_{o_t} ∪ {a*_t}
19: return a*_t
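
Read the explore branch as: sample many candidates at high temperature, drop anything already tried at this observation, score what remains, and sample from a λ-sharpened softmax. Below is a minimal Python sketch of that branch; `llm.sample_actions` and `llm.logprob` (including its `alpha` keyword) are hypothetical hooks for a model wrapper, not an API from the paper.

```python
import math
import random

def softmax(scores, lam):
    """Temperature-scaled softmax; larger lam concentrates mass on the top score."""
    m = max(lam * s for s in scores)                  # subtract max for stability
    exps = [math.exp(lam * s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dora_explore_step(llm, history, obs, used_actions,
                      n_c=8, tau_c=1.0, lam=2.0, alpha=0.8):
    """One explore-branch step of Algorithm 1 (lines 4-14 and 18).

    used_actions maps each observation to the set of actions already
    tried there, e.g. a collections.defaultdict(set).
    """
    c = history + [obs]
    # Lines 6-11: sample diverse candidates, keep only unused ones.
    candidates = llm.sample_actions(c, n=n_c, temperature=tau_c)
    fresh = [a for a in candidates if a not in used_actions[obs]]
    if not fresh:                  # every candidate was tried: fall back
        fresh = candidates
    # Lines 12-14: score, sharpen with lam, sample the action.
    s = [llm.logprob(c, a, alpha=alpha) for a in fresh]
    p = softmax(s, lam)
    action = random.choices(fresh, weights=p, k=1)[0]
    used_actions[obs].add(action)  # line 18: record the chosen action
    return action
```

Small λ makes p_λ nearly uniform over the fresh candidates (exploration), while large λ concentrates mass on the top-scoring action (exploitation).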

Results on Multi-Armed Bandits

Arm-selection dynamics on a 5-armed bandit (the best arm is purple). On the left, DORA Explorer with a λ-scheduler progressively converges to the optimal arm, while on the right the LLM at temperature τ = 0 stays stuck on a suboptimal arm.


Results on TALES

A TextWorld cooking task from TALES. DORA Explorer systematically explores rooms, finds the cookbook, gathers ingredients, and completes the recipe (3/3), while the zero-shot agent wanders in loops across the Shed, Backyard, and Garden without ever reaching the Kitchen (0/3).


Findings

DORA explores early in MAB and converges quickly to the optimal arm.

Arms selected by DORA over a single episode. In the early timesteps the λ policy samples all arms, and as t grows λ increases and the agent shifts to exploiting the optimal arm (red).

[Plot: cumulative number of selections per arm vs. time step t, for the Red (optimal), Yellow, Blue, Green, and Purple arms.]

Optimal-arm selection across methods over a horizon of T = 200. DORA explores more in the early stages but selects the optimal arm more often as the episode progresses.

[Plot: optimal-arm selection fraction across baselines and DORA with varying τ.]

DORA achieves near-optimal performance on Multi-Armed Bandits.

Multi-Armed Bandits (MAB) provide a classical testbed for the exploration–exploitation trade-off. We study the hard instance with K = 5 arms, reward gap Δ = 0.2, and horizon T = 200. At temperatures τ ≤ 1 the model behaves overly greedily, often committing early to a suboptimal arm (suffix-failure frequency up to 95%); at τ > 1, the added randomness prevents consistent exploitation. DORA explores effectively in the early stages and rapidly converges to the optimal arm, achieving zero suffix-failure frequency and approaching the near-optimal UCB baseline.

Metric | UCB | TS | Greedy | ε-Greedy | τ=0 | τ=0.7 | τ=1 | τ=1.5 | τ_exp | DORA
Mean Avg Reward | 0.531 | 0.512 | 0.506 | 0.490 | 0.410 | 0.410 | 0.401 | 0.440 | 0.433 | 0.514
SuffFailFreq(T/2) | 0.02 | 0.00 | 0.42 | 0.30 | 0.90 | 0.90 | 0.95 | 0.25 | 0.10 | 0.00
Best Arm Fraction | 0.66 | 0.57 | 0.53 | 0.455 | 0.10 | 0.10 | 0.04 | 0.301 | 0.434 | 0.585
Cumulative Regret | 13.68 | 17.16 | 18.90 | 21.80 | 36.00 | 36.00 | 38.39 | 28.45 | 24.41 | 16.61

Table 1. Performance on the hard MAB instance (Llama-3.1 8B).
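
To make the setup concrete, the sketch below runs UCB1 on one plausible reading of the hard instance (a Bernoulli bandit whose best arm is separated from the rest by gap Δ = 0.2; the exact reward construction is an assumption) and reports the per-run indicator behind SuffFailFreq(T/2): whether the optimal arm is never pulled in the second half of the horizon.

```python
import math
import random

def run_ucb1(K=5, gap=0.2, T=200, seed=0):
    """UCB1 on an assumed hard instance: one arm at p = 0.5 + gap/2,
    the others at p = 0.5 - gap/2."""
    rng = random.Random(seed)
    best = rng.randrange(K)
    means = [0.5 + gap / 2 if k == best else 0.5 - gap / 2 for k in range(K)]
    counts, sums, choices = [0] * K, [0.0] * K, []
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                               # pull each arm once
        else:
            arm = max(range(K), key=lambda k: sums[k] / counts[k]
                      + math.sqrt(2 * math.log(t) / counts[k]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
    suffix_fail = best not in choices[T // 2:]        # SuffFail indicator
    regret = max(means) * T - sum(means[a] for a in choices)
    return suffix_fail, regret
```

Averaging `suffix_fail` and `regret` over many seeds yields aggregate metrics of the kind reported for UCB in Table 1.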

DORA improves overall task success on TALES.

DORA visits over 2× more unique states in TextWorld and ScienceWorld, enabling it to discover hidden objects and other task-relevant signals while avoiding repetitive loops. This translates into consistent performance gains across all TALES environments, including measurable progress in AlfWorld, where feedback is sparse and unhelpful.

Model | Setting | TextWorld | TW Express | AlfWorld | ScienceWorld | Jericho
Qwen-2.5 7B | Zero-shot | 29.20±1.92 | 48.93±1.69 | 0.00±0.00 | 13.49±0.12 | 0.66±0.04
Qwen-2.5 7B | Chain of Thought | 24.89±0.91 | 37.44±0.00 | 0.00±0.00 | 9.81±0.18 | 1.58±0.03
Qwen-2.5 7B | Tree of Thought | 11.39±0.83 | 45.90±0.21 | 0.00±0.00 | 10.10±0.00 | 1.40±0.00
Qwen-2.5 7B | Prompt Explore | 23.12±1.29 | 47.81±0.00 | 0.00±0.00 | 13.21±0.34 | 1.27±0.04
Qwen-2.5 7B | DORA (Ours) | 45.40±1.68 | 51.17±3.38 | 2.78±2.78 | 19.01±0.62 | 1.42±0.24
Llama-3.1 8B | Zero-shot | 40.74±0.67 | 48.29±1.04 | 0.00±0.00 | 15.53±0.50 | 2.21±0.14
Llama-3.1 8B | Chain of Thought | 37.72±1.16 | 24.72±2.50 | 0.00±0.00 | 7.81±0.87 | 2.22±0.22
Llama-3.1 8B | Tree of Thought | 15.06±0.69 | 15.52±0.21 | 0.00±0.00 | 8.64±1.64 | 2.05±0.15
Llama-3.1 8B | Prompt Explore | 43.07±0.72 | 52.77±2.08 | 0.00±0.00 | 13.70±0.59 | 1.60±0.21
Llama-3.1 8B | DORA (Ours) | 50.15±6.88 | 52.89±3.51 | 2.78±2.78 | 16.85±0.58 | 2.17±0.21
Mistral Small 22B | Zero-shot | 62.25±1.08 | 31.70±2.08 | 0.00±0.00 | 26.47±0.57 | 1.35±0.03
Mistral Small 22B | Chain of Thought | 31.43±2.22 | 25.16±0.28 | 0.00±0.00 | 9.63±0.35 | 1.43±0.13
Mistral Small 22B | Tree of Thought | 32.54±0.83 | 26.40±1.71 | 0.00±0.00 | 10.88±0.85 | 1.27±0.03
Mistral Small 22B | Prompt Explore | 50.60±0.69 | 42.46±0.79 | 0.00±0.00 | 21.48±0.48 | 1.70±0.05
Mistral Small 22B | DORA (Ours) | 74.28±2.99 | 43.12±1.21 | 2.78±2.78 | 32.43±1.15 | 2.92±0.66

Table 2. Performance across TALES environments, reported as mean normalised scores ± standard error over 3 runs.

DORA enables loop recovery with 5× to 20× higher rates.

Baseline methods frequently enter loops and rarely recover (recovery rates mostly below 1%). DORA both reduces the number of loops encountered and achieves 5× to 20× higher recovery rates across all TALES environments.

Method | TextWorld: Loops | Rec. | Rec.% | ScienceWorld: Loops | Rec. | Rec.% | AlfWorld: Loops | Rec. | Rec.%
Zero-shot | 612 | 3 | 0.5 | 2551 | 9 | 0.35 | 811 | 0 | 0.0
Prompt Explore | 663 | 5 | 0.8 | 2321 | 10 | 0.43 | 721 | 1 | 0.1
CoT | 673 | 24 | 3.6 | 2621 | 6 | 0.23 | 846 | 1 | 0.1
ToT | 783 | 8 | 1.0 | 2801 | 2 | 0.07 | 857 | 0 | 0.0
DORA (Auto) | 185 | 36 | 19.5 | 1234 | 45 | 3.6 | 111 | 12 | 10.8

Table 3. Loops encountered, loops recovered, and recovery rate (%) across TALES environments.
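
The loop statistics above presuppose a loop detector. One simple way to operationalise "the agent is looping", sketched below, is to flag a short repeating period in the recent observation sequence (e.g. Shed → Backyard → Shed → Backyard); the window and period thresholds are illustrative, not the paper's. Recovery can then be counted as exiting the repeated cycle within a fixed number of subsequent steps, with Rec.% = recoveries / loops.

```python
from collections import deque

def make_loop_detector(window=8, min_period=2):
    """Return a callback that flags when the last p observations repeat
    the p observations before them, for any period p up to window // 2."""
    recent = deque(maxlen=window)

    def observe(state_id):
        recent.append(state_id)
        seq = list(recent)
        for p in range(min_period, len(seq) // 2 + 1):
            if seq[-p:] == seq[-2 * p:-p]:  # last p states repeat
                return True
        return False

    return observe

detect = make_loop_detector()
print([detect(s) for s in ["Shed", "Backyard", "Shed", "Backyard"]])
# -> [False, False, False, True]
```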

DORA covers more states than baselines.

DORA visits significantly more unique states per task across all TALES environments, indicating greater coverage and information gathering.

Environment | Zero-shot | Prompt Explore | CoT | ToT | DORA (Auto)
TextWorld | 14.0 | 12.1 | 17.5 | 12.1 | 32.0
ScienceWorld | 6.5 | 8.9 | 8.23 | 4.6 | 20.23
AlfWorld | 2.67 | 3.67 | 3.25 | 3.67 | 27.92

Figure 1. Average number of unique states visited per task across TALES environments.

DORA consistently outperforms all temperature strategies.
Model | τ=0 | τ=0.3 | τ=0.7 | τ=1 | τ=1.5 | τ=2 | τ-policy | DORA
Llama-3.1 8B | 39.41 | 34.08 | 30.74 | 14.33 | 0.00 | 0.00 | 0.00 | 44.74
Qwen-2.5 7B | 27.65 | 23.65 | 26.74 | 17.65 | 21.83 | 12.83 | 10.33 | 44.62
Mistral Small 22B | 60.16 | 54.62 | 39.83 | 5.83 | 5.83 | 0.00 | 0.00 | 73.59

Table 4. Comparison of τ-sampling strategies against DORA on TextWorld.

α=0.8 yields the best confidence–consistency trade-off.
(α, β) | (1.0, 0.0) | (0.8, 0.2) | (0.6, 0.4) | (0.4, 0.6) | (0.2, 0.8) | (0.0, 1.0)
Mean Reward | 0.496 | 0.514 | 0.509 | 0.506 | 0.500 | 0.511
SuffFail | 0.05 | 0.00 | 0.05 | 0.05 | 0.05 | 0.00
Best Arm | 0.516 | 0.585 | 0.584 | 0.540 | 0.528 | 0.577
Regret | 19.34 | 16.61 | 16.63 | 18.40 | 18.90 | 16.92

Table 5. Effect of the α confidence–consistency weighting on MAB performance.
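
Table 5 suggests Score(a, α) blends two terms with weights α and β = 1 − α. A plausible reading, sketched below, takes "confidence" as the action's mean token log-probability and "consistency" as the action's vote share among the sampled candidates; both definitions are assumptions, not confirmed by the paper.

```python
from collections import Counter

def score(action, candidates, mean_logprob, alpha=0.8):
    """Score(a, alpha) = alpha * confidence + (1 - alpha) * consistency.

    mean_logprob is a hypothetical callable returning the mean token
    log-probability of the action under the model.
    """
    consistency = Counter(candidates)[action] / len(candidates)
    return alpha * mean_logprob(action) + (1 - alpha) * consistency

cands = ["go north", "go north", "open door", "go north"]
print(score("go north", cands, mean_logprob=lambda a: -0.7))
# 0.8 * (-0.7) + 0.2 * 0.75 = -0.41
```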

Auto-Explorer outperforms the λ-scheduler on complex tasks.
Model | λ-Sched | Auto
Qwen-2.5 7B | 5.83 | 42.95
Llama-3.1 8B | 21.00 | 41.89
Mistral Small 22B | 9.17 | 69.57

Table 6. λ-schedule versus Auto-Explore on TextWorld.
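
Algorithm 1 lets λ come either from a schedule λ_exp(t) or from a policy π_λ queried in context (the Auto variant above). A minimal schedule sketch, with an assumed exponential form and illustrative constants:

```python
def lambda_schedule(t, T=200, lam_min=0.5, lam_max=5.0):
    """Exponentially increasing lambda: near-uniform sampling over the
    candidates early in the episode, near-greedy selection late. The
    functional form and constants are illustrative; Algorithm 1 only
    requires some increasing lambda_exp(t)."""
    return lam_min * (lam_max / lam_min) ** (t / T)

print(lambda_schedule(0), lambda_schedule(200))  # 0.5 5.0
```

Table 6 indicates that letting the model choose λ from context (Auto) beats a fixed schedule on TextWorld, plausibly because the right moment to exploit depends on what has been observed rather than on t alone.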

DORA uses more tokens but achieves substantially better outcomes.

DORA Explorer uses ~2× more tokens than Prompt Explore (which uses the fewest), but this additional compute translates into significantly better task performance.

Method | Zero-shot | Prompt Explore | CoT | ToT | DORA Explorer
Avg. tokens per task | 0.20M | 0.19M | 0.28M | 0.30M | 0.36M

Figure 2. Average token usage per task on TALES TextWorld.