SAC(λ): Efficient Reinforcement Learning for Sparse-Reward Autonomous Car Racing using Imperfect Demonstrations

Heeseong Lee, Sungpyo Sagong, Minhyeong Lee, Jeongmin Lee, Dongjun Lee
Seoul National University
† indicates corresponding author

Abstract

Recent advances in Reinforcement Learning (RL) have demonstrated promising results in autonomous car racing. However, two fundamental challenges remain: sparse rewards, which hinder an efficient learning process, and the quality of demonstrations, which directly affects the effectiveness of RL from Demonstration (RLfD) approaches. To address these issues, we propose SAC(λ), a novel RLfD algorithm tailored for sparse-reward racing tasks with imperfect demonstrations. SAC(λ) introduces two key components: (1) a discriminator-augmented Q-function, which integrates prior knowledge from demonstrations into value estimation while maintaining the benefits of off-policy learning, and (2) a Positive-Unlabeled (PU) learning framework with adaptive prior adjustment, which enables the agent to progressively refine its understanding of positive behaviors while mitigating overfitting. Through extensive experiments in the Assetto Corsa simulator, we demonstrate that SAC(λ) significantly accelerates training, surpasses the provided demonstrations, and achieves superior lap times compared to existing RL and RLfD approaches.

Approach

Algorithm Overview

The above figure shows an overview of SAC(λ). On top of the original SAC algorithm (green shaded box), an additional value from the discriminator is added to the original Q-function (red dotted box). The objective of the discriminator is formulated as a positive-unlabeled learning problem and is learned from the given low-quality, narrowly distributed demonstrations.
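As a rough illustration of these two components, the PyTorch-style sketch below shows (i) a non-negative PU risk for the discriminator, treating demonstrations as positive samples and agent rollouts as unlabeled data, and (ii) a SAC critic target augmented with a λ-scaled discriminator bonus. The specific PU estimator, the form of the bonus, the prior-adjustment schedule (not shown), and all hyperparameter values are assumptions for illustration, not the SAC(λ) implementation itself.

```python
# Illustrative sketch only: the exact losses and scalings in SAC(lambda) may differ.
import torch
import torch.nn.functional as F

def pu_discriminator_loss(demo_logits, agent_logits, prior_pi):
    """Non-negative PU risk (in the spirit of Kiryo et al., 2017):
    demonstrations are positives, agent rollouts are unlabeled."""
    bce = F.binary_cross_entropy_with_logits
    risk_p_pos = bce(demo_logits, torch.ones_like(demo_logits))     # positives labeled positive
    risk_p_neg = bce(demo_logits, torch.zeros_like(demo_logits))    # positives labeled negative
    risk_u_neg = bce(agent_logits, torch.zeros_like(agent_logits))  # unlabeled labeled negative
    # Clamp keeps the estimated negative-class risk non-negative.
    neg_risk = torch.clamp(risk_u_neg - prior_pi * risk_p_neg, min=0.0)
    return prior_pi * risk_p_pos + neg_risk

def augmented_critic_target(reward, next_q_min, next_log_pi, disc_logit, done,
                            gamma=0.99, alpha=0.2, lam=0.1):
    """Assumed form of the discriminator-augmented target: a lambda-scaled
    discriminator bonus is added on top of the standard SAC soft target."""
    demo_bonus = lam * torch.sigmoid(disc_logit)        # bounded in [0, lam]
    soft_value = next_q_min - alpha * next_log_pi
    return reward + demo_bonus + gamma * (1.0 - done) * soft_value
```

Under this reading, λ directly scales how strongly the demonstration-derived signal enters value estimation, which is the quantity varied in the λ ablation reported below.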

Experimental setup

[Figure: Experimental setup on the two evaluation tracks]

We consider the lap-completion task in car racing, which aims to achieve the minimum lap time. We use a Ferrari 458 GT2 as the racing car and evaluate on two tracks: Silverstone 1967 (S67) and Monza (MNZ). The red dotted boxes indicate sections that require advanced driving strategies.

Interface

The interfacing framework is built on the Assetto Corsa (AC) APIs; the overall flow is shown in the figure above. First, necessary data such as vehicle velocity, acceleration, and contact flags are collected through the supported APIs. A real-time data parser then passes these data into our virtual RL environment, along with additional track data. The route manager provides local map data such as preview curvatures, slopes, and bank angles, and a 2D rangefinder is implemented. The agent interacts with this virtual environment and generates actions, which are passed through a virtual gamepad to deliver the control inputs to AC. We also provide a random initial-spawn function.
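To make this data flow concrete, the skeleton below wraps the interface as a Gymnasium-style environment. All object and method names here (telemetry, route, gamepad, rangefinder, and their methods), the observation layout, and the reward are hypothetical placeholders for illustration, not the actual AC API calls or our released code.

```python
# Illustrative skeleton of the interfacing flow; component names are placeholders.
import gymnasium as gym
import numpy as np

class AssettoCorsaEnv(gym.Env):
    def __init__(self, telemetry, route_manager, gamepad, rangefinder):
        self.telemetry = telemetry      # real-time parser over AC API data
        self.route = route_manager      # preview curvatures, slopes, bank angles
        self.gamepad = gamepad          # virtual gamepad bridging actions to AC
        self.rangefinder = rangefinder  # 2D rangefinder over track boundaries
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)       # steer, throttle/brake
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(32,), dtype=np.float32)  # assumed size

    def _observe(self):
        car = self.telemetry.latest()                    # velocity, acceleration, contact flags
        preview = self.route.local_map(car["position"])  # local track geometry ahead of the car
        ranges = self.rangefinder.scan(car["position"], car["heading"])
        return np.concatenate([car["state"], preview, ranges]).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.telemetry.respawn_random()                  # random initial spawn along the track
        return self._observe(), {}

    def step(self, action):
        self.gamepad.send(action)                        # control inputs delivered to AC
        obs = self._observe()
        lap_done = self.telemetry.lap_completed()
        reward = 1.0 if lap_done else 0.0                # sparse lap-completion reward (assumed)
        terminated = lap_done or self.telemetry.off_track()
        return obs, reward, terminated, False, {}
```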

We design the task so that it poses the following key challenges:

  • Sparse reward: the task requires a long and precise sequence of actions to finish successfully and is designed as a sparse-reward problem (a sketch of such a reward is given below).
  • Slow sample collection: the environment cannot be fast-forwarded or duplicated on a single machine.
  • Imperfect demonstrations: the given demonstrations are sub-optimal and narrowly distributed.
Our goal is to show that SAC(λ) can effectively utilize the given imperfect demonstrations to accelerate learning and achieve better performance than existing methods on long-horizon, sparse-reward tasks, specifically autonomous racing.
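As a concrete example of the first challenge, the sketch below shows a sparse, lap-completion-only reward: the agent receives a learning signal only after finishing an entire lap. The time-based terminal bonus and its scale are assumptions for illustration, not the paper's exact reward definition.

```python
# Sketch of a sparse lap-completion reward (assumed form, for illustration only).
def sparse_lap_reward(lap_completed: bool, lap_time_s: float,
                      reference_time_s: float = 300.0) -> float:
    if lap_completed:
        # The only signal arrives after a long, precise sequence of actions;
        # faster laps receive a larger terminal reward.
        return max(0.0, reference_time_s - lap_time_s)
    return 0.0  # zero reward everywhere else, hence a sparse-reward problem
```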

Result

We present the experimental results to address the following questions:

  • Training Efficiency: How effectively does SAC(λ) accelerate the early stages of learning?
  • Final Performance: Does SAC(λ) achieve superior performance after sufficient training steps?
  • Learned Behavior: Is the behavior learned by the agent qualitatively comparable to that of a human expert driver?
  • Effect of λ: How does the choice of λ affect the entire learning process?


Training Efficiency

[Figure: Episode returns during the initial stages of training]

The above figure shows the episode returns during the initial stages of training. Training efficiency is evaluated based on two criteria: (1) the number of training steps required for the agent to complete its first lap and (2) the lap time at that episode. Vertical dotted lines in the figure represent the average number of steps required for the first lap completion. Among all algorithms, SAC(λ) demonstrates the fastest learning, completing the first lap in fewer steps and achieving the highest returns, which indicates faster lap times. Notably, IL-based algorithms struggle with out-of-distribution (OOD) data, as they do not leverage information from online interaction. TRPOfD also fails to complete a lap, as it becomes trapped in a local minimum, as shown below.

[Figure: TRPOfD behavior, best model before collapse and local minimum after collapse]

Final Performance: lap time (s)

[Table: Best lap times (s) of SAC(λ), Demonstration, SACfD, SACBC, SAC-d, SACfD-d(L), and SACfD-d(H)]

The above table shows a comparison of the best lap times achieved after sufficient training steps. The lap time for the demonstrations is the average lap time of the provided data. Our approach achieves the fastest lap time of 1:29.767 and maintains higher speeds in all sections compared to the best demonstration and the other baseline methods.


Learned Behavior

[Figure: Learned behavior in challenging sections (Straight, Chicane, Sweeper, Corners, Kink, Esses)]

The above figure shows the agent’s trajectory for achieving the minimum lap time, with challenging sections of varying curvatures selected for detailed visualization. The agent effectively uses the full width of the track to minimize the curvature of its path and maximize speed, following an "out-in-out" trajectory—a technique commonly used by expert human drivers. By learning this optimal strategy, the agent attains higher speeds through these curve sections, resulting in improved overall performance.


Comparison with the dense-reward setting and the effect of λ

[Figure: (a) Best lap-time comparison against dense-reward baselines, (b) effect of λ on SAC(λ) performance]

The above figure shows evaluation results on the S67 track: (a) a comparison of the best lap times against algorithms trained with dense rewards and (b) an ablation study on the effect of different λ values on SAC(λ) performance. Even compared to algorithms trained with dense rewards, SAC(λ) demonstrates superior performance. As λ increases, the algorithm leverages the offline demonstrations more effectively, reaching peak performance at λ = 0.1. Beyond this point, although the initial learning speed improves further, performance declines in the later stages due to the stronger influence of IL.