Recent advances in Reinforcement Learning (RL) have demonstrated promising results in autonomous car racing. However, two fundamental challenges remain: sparse rewards, which hinder efficient learning, and the quality of demonstrations, which directly affects the effectiveness of RL from Demonstration (RLfD) approaches. To address these issues, we propose SAC(λ), a novel RLfD algorithm tailored for sparse-reward racing tasks with imperfect demonstrations. SAC(λ) introduces two key components: (1) a discriminator-augmented Q-function, which integrates prior knowledge from demonstrations into value estimation while maintaining the benefits of off-policy learning, and (2) a Positive-Unlabeled (PU) learning framework with adaptive prior adjustment, which enables the agent to progressively refine its understanding of positive behaviors while mitigating overfitting. Through extensive experiments in the Assetto Corsa simulator, we demonstrate that SAC(λ) significantly accelerates training, surpasses the provided demonstrations, and achieves lap times superior to existing RL and RLfD approaches.
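To make the first component concrete, below is a minimal sketch of how a discriminator score could augment the SAC critic target, with λ scaling its influence. The function name, the log-score form of the bonus, and the way it enters the soft Bellman target are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: discriminator-augmented SAC critic target.
# The log-score bonus and its placement in the target are assumptions.
import torch


def augmented_critic_target(reward, done, target_q1, target_q2, log_prob,
                            disc_score, gamma=0.99, alpha=0.2, lam=0.1):
    """Soft Bellman target with an extra discriminator bonus scaled by lambda.

    disc_score: discriminator output in (0, 1); higher means the
    (state, action) pair looks more like demonstration behavior.
    """
    # Standard SAC entropy-regularized target value.
    min_q = torch.min(target_q1, target_q2)
    soft_value = min_q - alpha * log_prob
    # Discriminator bonus: log-probability-style shaping term (assumed form).
    bonus = lam * torch.log(disc_score.clamp(min=1e-6))
    return reward + bonus + gamma * (1.0 - done) * soft_value
```

With λ = 0, this reduces to the ordinary SAC target; larger λ gives the demonstration-trained discriminator more influence on value estimation.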
The above figure shows an overview of SAC(λ). Building on the original SAC algorithm (green shaded box), an additional value term from the discriminator is added to the original Q-function (red dotted box). The discriminator's objective is formulated as a positive-unlabeled learning problem and is learned from the given low-quality, narrowly distributed demonstrations.
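The sketch below illustrates one way such a PU objective could be implemented: demonstrations are treated as positives, replay-buffer samples as unlabeled data, and the class prior is adjusted online. The network architecture, the non-negative PU risk estimator (Kiryo et al., 2017), and the performance-based prior schedule are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Discriminator(nn.Module):
    """Scores (state, action) pairs; trained with a positive-unlabeled objective."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)  # logits


def nn_pu_loss(pos_logits, unl_logits, prior):
    """Non-negative PU risk: demonstrations are positives, replay data is unlabeled."""
    loss_pos = F.binary_cross_entropy_with_logits(
        pos_logits, torch.ones_like(pos_logits))
    loss_pos_as_neg = F.binary_cross_entropy_with_logits(
        pos_logits, torch.zeros_like(pos_logits))
    loss_unl_as_neg = F.binary_cross_entropy_with_logits(
        unl_logits, torch.zeros_like(unl_logits))
    # Clamp the estimated negative risk at zero to avoid overfitting.
    negative_risk = loss_unl_as_neg - prior * loss_pos_as_neg
    return prior * loss_pos + torch.clamp(negative_risk, min=0.0)


def adapt_prior(prior, agent_return, demo_return, step=0.01, low=0.1, high=0.9):
    """Illustrative adaptive prior schedule (assumption): shrink the positive
    prior once the agent starts outperforming the demonstrations, so fewer
    replay samples are implicitly treated as positive behavior."""
    if agent_return > demo_return:
        return max(low, prior - step)
    return min(high, prior + step)
```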
We consider the lap completion task in car racing, which aims to minimize lap time. We use the Ferrari 458 GT2 as our racing car and evaluate on two tracks: Silverstone1967 (S67) and Monza (MNZ). The red dotted boxes indicate sections requiring advanced driving strategies.
The interfacing framework is built on the AC APIs; the overall flow is shown in the figure above. First, the necessary data, such as vehicle velocity, acceleration, and contact flags, are collected through the supported APIs. A real-time data parser then passes these data into our virtual RL environment, along with additional track data. The route manager parses local map data such as preview curvatures, slopes, and bank angles, and a 2D rangefinder is implemented. The agent interacts with this virtual environment and generates actions, which are passed through the virtual gamepad to transfer the control inputs to AC. We also provide a random initial spawn function.
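The skeleton below sketches this flow as a gym-style environment. All interface objects (TelemetryClient-style reader, virtual gamepad, route manager), their method names, and the observation sizes are hypothetical placeholders standing in for the actual AC API integration.

```python
# Minimal sketch of the interfacing flow, with hypothetical placeholder
# interfaces; the real implementation relies on Assetto Corsa's own APIs.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ACRacingEnv(gym.Env):
    """Wraps parsed AC telemetry and track data as an RL environment."""

    def __init__(self, telemetry, gamepad, route_manager, n_rays=20):
        super().__init__()
        self.telemetry = telemetry   # reads velocity, acceleration, contact flags (assumed interface)
        self.gamepad = gamepad       # sends steering / pedal inputs to AC (assumed interface)
        self.route = route_manager   # preview curvatures, slopes, bank angles, rangefinder (assumed interface)
        obs_dim = 6 + 30 + n_rays    # kinematics + track preview + 2D rangefinder (assumed sizes)
        self.observation_space = spaces.Box(-np.inf, np.inf, (obs_dim,), np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, (2,), np.float32)  # steering, throttle/brake

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.telemetry.respawn_random()  # random initial spawn (assumed helper)
        return self._observe(), {}

    def step(self, action):
        # Forward the agent's action to AC through the virtual gamepad.
        self.gamepad.send(steer=float(action[0]), pedal=float(action[1]))
        state = self.telemetry.read()
        reward = float(state.get("lap_completed", False))  # sparse lap-completion reward
        terminated = bool(state.get("off_track", False))
        return self._observe(), reward, terminated, False, {}

    def _observe(self):
        state = self.telemetry.read()
        preview = self.route.preview(state["position"])                       # curvature/slope/bank ahead
        rays = self.route.rangefinder(state["position"], state["heading"])    # 2D rangefinder distances
        kin = np.concatenate([state["velocity"], state["acceleration"]])
        return np.concatenate([kin, preview, rays]).astype(np.float32)
```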
The task poses the following key challenges:
We present the experimental results to address the following questions:
The above figure shows the episode returns during the initial stages of training. Training efficiency is evaluated on two criteria: (1) the number of training steps required for the agent to complete its first lap and (2) the lap time at that episode. Vertical dotted lines in the figure represent the average number of steps required for the first lap completion. Among all algorithms, SAC(λ) demonstrates the fastest learning, completing the first lap in fewer steps and achieving the highest returns, indicating faster lap times. Notably, IL-based algorithms struggle with out-of-distribution (OOD) data, as they do not leverage online interaction information. TRPOfD also fails to complete a lap, becoming trapped in a local minimum, as shown below.
The above table compares the best lap times achieved after sufficient training steps. The lap time listed for the demonstrations is the average lap time of the existing data. Our approach achieves the fastest lap time of 1:29.767 and maintains higher speeds in all sections compared to the best of the demonstrations and the other baseline methods.
The above figure shows the agent’s trajectory for achieving the minimum lap time, with challenging sections of varying curvatures selected for detailed visualization. The agent effectively uses the full width of the track to minimize the curvature of its path and maximize speed, following an "out-in-out" trajectory—a technique commonly used by expert human drivers. By learning this optimal strategy, the agent attains higher speeds through these curve sections, resulting in improved overall performance.
The above figure shows evaluation results on the S67 track: (a) a comparison of the best lap times against algorithms trained with dense rewards and (b) an ablation study on the effect of different λ values on SAC(λ) performance. Even compared to algorithms trained with dense rewards, SAC(λ) demonstrates superior performance. As λ increases, the algorithm leverages the offline demonstrations more effectively, reaching peak performance at λ = 0.1. Beyond this point, while the initial learning speed improves, performance declines in the later stages due to the stronger influence of IL.