POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP
License: CC BY 4.0
arXiv:2604.06938v1 [cs.CV] 08 Apr 2026
Jiyun Won 1, Heemin Yang 1, Woohyeok Kim 2, Jungseul Ok 1,2, Sunghyun Cho 1,2
POSTECH CSE 1 & GSAI 2
{w1jyun, heeminid, woohyeok, jungseul, s.cho}@postech.ac.kr
Abstract
Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at https://w1jyun.github.io/POS-ISP
1 Introduction
Image signal processors (ISPs) transform RAW sensor data captured by digital cameras into sRGB images suitable for human perception or machine vision. Conventional ISPs apply a fixed chain of operations such as white balance and tone mapping that are primarily designed to enhance image quality. While such fixed pipelines are suitable for general photography, they often fail to align with the preferences or objectives of specific tasks, ranging from visual appearance optimization to high-level vision tasks such as object detection and semantic segmentation. Although ISPs can be manually tuned by golden-eye experts, the process is time-consuming and difficult because it requires precise adjustment of many tightly coupled parameters for each task. As a result, it is difficult to achieve consistent and optimal performance across different objectives.
To obtain enhanced ISP pipelines for downstream tasks, data-driven approaches have recently been proposed that learn ISPs directly from data. Among these, modular approaches have attracted particular attention due to their practical advantages. They decompose the ISP pipeline into well-established operations such as white balance and denoising, and optimize the pipeline in a task-driven manner. This modular design is especially appealing because the operations are already integrated into imaging systems and have low computational complexity, making them suitable for practical deployment [5, 6]. However, despite this efficiency, optimizing modular ISPs remains difficult, since selecting the best sequence of modules and tuning parameters often requires non-differentiable search procedures.
To address this challenge, several approaches have recently explored neural architecture search (NAS) or reinforcement learning (RL) for modular ISP optimization [27, 21, 24]. While they resolve the issue of non-differentiable optimization, they also introduce new limitations. First, the NAS-based method [27] enables gradient-based optimization by mixing the outputs of candidate modules. However, the reliance on mixture training causes inconsistency at inference, where the modules are discretely selected. Second, RL-based methods [21, 24] model ISP optimization as a stepwise RL formulation that performs sequential decision-making at each intermediate stage of the ISP pipeline. Unfortunately, such a formulation requires repeated evaluations and relies on future reward estimation, resulting in unstable training and high computational overhead. This instability is a well-known issue in deep reinforcement learning, arising from the difficulty of stabilizing bootstrapped value estimation under function approximation [10, 23]. Moreover, since the decision process must be repeatedly evaluated at each stage to determine the next action, this stepwise formulation is structurally inefficient.
In this paper, we present POS-ISP, a novel RL framework for searching optimal modular ISP pipelines tailored to downstream tasks. Unlike existing RL methods that make stepwise decisions, POS-ISP performs sequence-level optimization by evaluating the entire pipeline with a single final reward. This formulation enables direct evaluation of the final result, avoiding unstable future reward estimation and leading to more stable optimization. It also captures dependencies between ISP modules, allowing the policy to consider the global pipeline context when predicting module sequences. Furthermore, POS-ISP predicts the entire pipeline in a single forward pass, significantly reducing memory and computation. Such efficiency is essential for ISPs deployed on mobile or edge devices, where they must function as lightweight pre-processing components.
To enable sequence-level, context-aware optimization, POS-ISP adopts a carefully designed network to predict the module sequence, named the sequence predictor, along with a parameter predictor for predicting module parameters. The sequence predictor is a recurrent policy network that predicts the entire module sequence by leveraging contextual information from preceding modules. At each recurrent step, the sequence predictor takes the previously selected module along with the hidden state, which contains contextual information of preceding modules, and predicts a probability distribution over the module candidates. Thanks to this context-aware and lightweight recurrent design, POS-ISP can predict the module sequence with reduced computational cost and memory overhead while considering the dependencies between the modules. In parallel, the parameter predictor predicts module parameters with a small encoder-decoder network conditioned on the input image, enabling image-adaptive parameter prediction. The predicted module sequence and its corresponding parameters together form a complete ISP pipeline, whose output image is evaluated based on task-driven performance.
We validate POS-ISP by measuring its task-specific performance after optimization for multiple tasks, including object detection, instance segmentation, and image enhancement. Extensive experiments demonstrate that POS-ISP outperforms other task-aware ISP optimization methods both quantitatively and qualitatively, with a lower computational cost and memory footprint.
Our main contributions can be summarized as follows:
• We introduce POS-ISP, a framework that performs sequence-level optimization of the ISP pipeline by predicting the entire pipeline in a single forward pass, directly optimizing the final task reward without relying on unstable stepwise supervision.
• We design a recurrent sequence predictor that enables sequence-level prediction while capturing inter-module dependencies for context-aware optimization.
• We evaluate POS-ISP on object detection, instance segmentation, and image enhancement. Extensive experiments demonstrate that POS-ISP achieves state-of-the-art performance with substantially reduced computational cost and memory usage.
2 Related Work
With the advancement of deep learning, several works have been proposed to replace conventional ISPs with end-to-end deep neural networks [14, 1, 2, 9, 26, 28]. They aim to design neural networks that learn RAW-to-RGB mappings. Thanks to strong image priors, they have shown promising performance in mimicking the ISPs.
Beyond directly learning RAW-to-RGB mappings, several works have explored optimizing ISP configurations for downstream tasks [22, 17]. Tseng et al. [22] train a differentiable proxy to approximate a black-box ISP and then optimize the ISP hyperparameters through this proxy to maximize downstream task performance. Motivated by this work, Qin et al. [17] introduced an attention-aware framework to better capture the important image regions when predicting ISP parameters. However, these approaches rely on neural networks to approximate ISP behavior or predict its parameters, which increases computational complexity.
In contrast, modular ISP designs have practical advantages in terms of interpretability and computational efficiency. Therefore, there have been several works that optimize modular ISP pipelines in a task-driven manner [21, 27, 24]. To resolve the non-differentiability of optimizing the ISP module sequence and parameters, ReconfigISP [27] employs differentiable proxy networks to approximate ISP modules, enabling gradient-based optimization of the sequence and parameters. During the architecture search, it assigns learnable weights to the modules at each step and mixes the module outputs based on the weights for differentiability. It then selects the module with the highest weight to construct the ISP pipeline for inference. However, the mismatch between soft selection during search and hard selection during inference leads to suboptimal performance.
Other approaches cast the ISP search problem as a sequential decision-making process and adopt a reinforcement learning (RL) framework. DRL-ISP [21] sequentially selects ISP modules to construct the pipeline, with a search space that includes both CNN-based modules and discretized variants of traditional ISP operators. AdaptiveISP [24] further extends this framework by searching module sequences in a discrete space while predicting module parameters in a continuous space. However, despite these advances, prior RL-based ISP search methods rely on an actor-critic framework, where a critic network estimates future rewards to guide the agent's decisions. This reliance on intermediate supervision often leads to suboptimal performance due to unstable critic optimization and also incurs substantial computational overhead, as decisions are made sequentially at each stage.
Unlike these stepwise RL approaches, our method performs sequence-level optimization in a single forward pass without intermediate supervision. This results in more stable training and improved computational efficiency.
3 POS-ISP
3.1 Problem Formulation
Figure 1: Overview of the proposed method. POS-ISP aims at constructing the ISP pipeline that best performs for the downstream task. The sequence predictor predicts the image processing module sequence based on the learned policy, and the parameter predictor estimates the corresponding parameters of each module.
Figure 2: Detailed architecture of the sequence predictor, which predicts the image processing module sequence based on the learned policy.
Fig. 1 shows the overall framework of POS-ISP. The goal of POS-ISP is to discover an ISP pipeline that transforms a RAW image $I_{\text{in}}$ into an sRGB image $I_{\text{out}}$, such that the output maximizes the performance of a target downstream task $\mathcal{T}$. Following conventional camera ISPs, our framework models an ISP pipeline as a sequence of image processing modules, each with its own internal parameters. In our experiments, we adopt the same set of modules as those used in AdaptiveISP [24], encompassing standard ISP operations such as white balance and tone mapping. The detailed list of modules is provided in the supplementary material.
Formally, we define a candidate set of ISP modules $\mathbb{M}=\{\mathcal{M}_{1},\cdots,\mathcal{M}_{n}\}$, where $n$ denotes the total number of available modules, and each module $\mathcal{M}_{i}$ is parameterized by its own parameters $\theta_{i}$. Based on this, we model the ISP pipeline of POS-ISP as:

$$I_{\text{out}}=\bigl(\mathcal{M}_{a_{k}}(\cdot;\theta_{a_{k}})\circ\cdots\circ\mathcal{M}_{a_{2}}(\cdot;\theta_{a_{2}})\circ\mathcal{M}_{a_{1}}(\cdot;\theta_{a_{1}})\bigr)(I_{\text{in}})=F(I_{\text{in}};\mathcal{A},\Theta), \tag{1}$$

where $a_{i}\in\{1,\dots,n\}$ is a module index, and $k$ denotes the number of modules in the pipeline. We denote the sequence of modules as $\mathcal{A}=(a_{1},\dots,a_{k})$, and the corresponding module parameters are represented as $\Theta=(\theta_{a_{1}},\dots,\theta_{a_{k}})$. For stability and tractability, we assume that each module is sampled at most once, i.e., $a_{i}\neq a_{j}$ if $i\neq j$.
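The composition in Eq. (1) can be sketched in a few lines of Python. The toy modules below (a gain and an offset acting on a list of pixel intensities) are illustrative assumptions, not the paper's actual ISP operators; the point is only that a pipeline is a module sequence $\mathcal{A}$ applied in order with its parameters $\Theta$.

```python
# Sketch of Eq. (1): modules M_{a_1}, ..., M_{a_k} applied in sequence,
# each with its own parameter. Toy modules and "image" for illustration only.

def gain(img, theta):
    # toy exposure module: scale intensities by a gain, clipped to [0, 1]
    return [min(1.0, p * theta) for p in img]

def offset(img, theta):
    # toy tone module: add a bias, clipped to [0, 1]
    return [max(0.0, min(1.0, p + theta)) for p in img]

MODULES = {1: gain, 2: offset}  # hypothetical candidate set M

def run_pipeline(img, A, Theta):
    """Apply the module sequence A with parameters Theta, in order."""
    for a, theta in zip(A, Theta):
        img = MODULES[a](img, theta)
    return img

I_in = [0.2, 0.4, 0.6]
I_out = run_pipeline(I_in, A=(1, 2), Theta=(1.5, 0.1))
```

Each module here takes and returns an image, so composing them is a simple left-to-right fold, matching the function composition in Eq. (1).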
With these definitions in place, our framework aims to find an optimal module sequence $\hat{\mathcal{A}}$ and the corresponding parameters $\hat{\Theta}$. Specifically, we design our framework to determine the optimal sequence at the task level and the optimal parameters at the image level. In other words, our framework finds a single sequence $\hat{\mathcal{A}}$ for a given downstream task $\mathcal{T}$ and then uses the sequence across all images for that task, while training a parameter predictor network that predicts parameters adapted to each input image.
This separation is motivated by two key considerations. First, conventional camera ISPs adopt a fixed sequence of modules (e.g., white balancing, tone mapping), with parameters that adapt to each image. This design reflects practical hardware constraints, as pipeline structures are typically embedded in silicon or firmware and cannot be reconfigured per image. By following this paradigm, our framework both aligns with real-world ISP design principles and enables the discovery of task-optimized pipelines that can be readily deployed in hardware. Second, the order of operations in an ISP pipeline is largely determined by the target downstream task. For example, tasks that rely on structural information (e.g., object detection) tend to place contrast and sharpening earlier in the sequence, whereas tasks targeting perceptual quality prioritize exposure and tone adjustments to achieve balanced brightness and color.
To find the optimal sequence $\hat{\mathcal{A}}$ and parameters $\hat{\Theta}$, our framework introduces two complementary components: a sequence predictor and a parameter predictor. The sequence predictor models a probability distribution over possible ISP operation sequences, while the parameter predictor predicts image-adaptive parameters conditioned on an input image. Instead of directly committing to a single sequence, we model a distribution to capture the inherent uncertainty and variability in pipeline design, enabling exploration of multiple candidate structures during training and preventing premature convergence to suboptimal solutions.
By jointly training these networks for a downstream task $\mathcal{T}$, we simultaneously learn the distribution of task-specific module sequences and image-adaptive parameters. After training, we select the most probable sequence for the target task and employ the parameter predictor to generate parameters tailored to each input image.
3.2 Network Architecture
POS-ISP constructs its task-adaptive ISP pipeline by predicting a task-specific module sequence from the sequence predictor and image-specific module parameters from the parameter predictor. The sequence predictor models the full distribution over module sequences instead of predicting only the most probable pipeline. This probabilistic formulation enables thorough exploration of diverse yet plausible pipelines during ISP search. The parameter predictor predicts the image-specific module parameters given an input image. In the following, we describe the networks in detail.
Sequence predictor
For an ISP sequence $\mathcal{A}=(a_{1},\cdots,a_{T})$, the sequence predictor models its probability as:

$$p(\mathcal{A})=\prod_{i=1}^{T}p(a_{i}\mid a_{<i}), \tag{2}$$

where $a_{0}$ is a special token that represents the start of the sequence. The sequence also includes a special token that terminates the sequence to allow pipelines of arbitrary length. To parameterize this distribution, the sequence predictor adopts a recurrent architecture based on Gated Recurrent Units (GRUs) [8], which are widely used in sequence modeling across domains such as natural language processing and recommendation systems, thanks to their efficiency and ability to capture sequential dependencies [20, 11, 19].
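The autoregressive factorization in Eq. (2) can be made concrete with a toy conditional table. The module names ('wb', 'tone') and probability values below are made up for illustration; the mechanics of multiplying per-step conditionals, including the terminating token, are what Eq. (2) describes.

```python
# Toy illustration of Eq. (2): p(A) is the product of per-step conditionals
# p(a_i | a_<i), with a special 'EOS' token ending the sequence.
import math

def sequence_log_prob(A, cond_prob):
    """log p(A) = sum_i log p(a_i | a_<i), including the terminating token.
    cond_prob maps a prefix tuple to a distribution over next choices."""
    logp, prefix = 0.0, ()
    for a in A + ('EOS',):
        logp += math.log(cond_prob[prefix][a])
        prefix = prefix + (a,)
    return logp

# hypothetical learned conditionals over two modules plus the end token
cond_prob = {
    (): {'wb': 0.7, 'tone': 0.3},
    ('wb',): {'tone': 0.6, 'EOS': 0.4},
    ('wb', 'tone'): {'EOS': 1.0},
}
lp = sequence_log_prob(('wb', 'tone'), cond_prob)  # log(0.7 * 0.6 * 1.0)
```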
Fig. 2 shows an overview of the sequence predictor. At the $i$-th recurrent step, the sequence predictor takes the previous module index $a_{i-1}$ as input and embeds it into a vector. Then, the GRU updates the hidden state $h_{i}$ using this embedding together with the previous hidden state $h_{i-1}$ (initialized as zeros). Here, $h_{i}$ encodes the past context $a_{<i}$, from which the predictor computes the probability distribution over module candidates.
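A minimal scalar sketch of the recurrent update, assuming the standard GRU equations (the actual predictor uses vector-valued GRUs with learned weight matrices; the scalar weights here, shared between gates for brevity, are illustrative only):

```python
# Scalar toy GRU step: h_i is computed from the embedding of a_{i-1} (x)
# and the previous hidden state h_{i-1}. Not the authors' implementation.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w=0.5, u=0.5):
    z = sigmoid(w * x + u * h_prev)              # update gate
    r = sigmoid(w * x + u * h_prev)              # reset gate (weights shared here)
    h_tilde = math.tanh(w * x + u * r * h_prev)  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_tilde      # interpolate old and new

h = 0.0  # hidden state initialized as zeros, as in the text
for x in [1.0, -0.5, 0.3]:  # stand-ins for embeddings of chosen modules
    h = gru_step(x, h)
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays bounded while accumulating context from all preceding modules.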
During the ISP search, we train the sequence predictor by sampling ISP pipelines from the learned policy and evaluating their task performance.
At each recurrent step, $a_{i}$ is sampled from the probability distribution $\pi(a_{i})$ and fed into the next step until the termination token is produced, forming a complete pipeline.
To balance exploration and exploitation during policy training, we apply a temperature-controlled sampling strategy [4], encouraging exploration in the early phase and progressively focusing on exploitation in later stages.
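Temperature-controlled sampling works by dividing the policy logits by a temperature before the softmax: a high temperature early in training flattens the distribution (exploration), while a low temperature later sharpens it toward the argmax (exploitation). The logits and temperature values below are illustrative assumptions; the schedule itself is described in the paper's supplementary material.

```python
# Softmax with temperature: tau >> 1 flattens the distribution,
# tau << 1 concentrates it on the highest logit.
import math

def softmax_with_temperature(logits, tau):
    scaled = [l / tau for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
hot = softmax_with_temperature(logits, tau=5.0)   # near-uniform: explore
cold = softmax_with_temperature(logits, tau=0.1)  # near-argmax: exploit
```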
After the search, the final ISP pipeline is generated using greedy decoding, where the highest-probability module is sequentially selected from $\pi(a_{i})$ until the termination token is reached.
Further details are provided in the supplementary material.
Parameter predictor
The parameter predictor predicts the module parameters $\Theta$ conditioned on the input image. A lightweight convolutional neural network (CNN)-based encoder processes a downsampled 64×64 input and extracts a compact feature representation, which is then passed through a decoder to produce the parameter sets for all modules in $\mathbb{M}$. When constructing the ISP pipeline, only the parameters corresponding to the selected sequence $\mathcal{A}=(a_{1},\dots,a_{k})$ are applied; that is, $\{\theta_{a_{1}},\dots,\theta_{a_{k}}\}$ are retrieved from $\Theta$ to form the final pipeline.
Although the parameter predictor can be conditioned on both the input image and the predicted module sequence, we empirically found that using only the image yields better performance. This is likely because sequence conditioning increases learning complexity, while image-only input acts as regularization. Importantly, as the sequence policy gradually converges toward high-performing pipelines, the parameter predictor, trained via task-driven feedback, naturally adapts to these dominant sequences. Even without direct access to the sequence, it learns to produce compatible parameters for frequently selected pipelines, enabling effective coordination and near-optimal performance.
3.3 ISP Search
Building on the sequence predictor and parameter predictor, we search for effective ISP pipelines tailored to a downstream task $\mathcal{T}$. During ISP search, both predictors are jointly trained, enabling the selection of the most probable pipeline and its associated parameters after search.
We formulate ISP search as an RL problem over discrete module sequences, while learning module parameters via differentiable optimization. An ISP pipeline is represented by a module sequence $\mathcal{A}$ together with its parameter set $\Theta$, yielding $F(\cdot;\mathcal{A},\Theta)$ that maps an input image $I_{\text{in}}$ to an output image $I_{\text{out}}$. Pipeline quality is assessed by applying $F$ to $I_{\text{in}}$ and measuring task performance on the output, which serves as the reward signal in the RL framework. Unlike a standard Markov decision process, our formulation does not involve explicit states or stepwise decisions: the sequence predictor generates the module sequence in a single forward pass, while the parameter predictor predicts the corresponding parameters conditioned on $I_{\text{in}}$. This design provides a terminal reward based on the performance of a fully-formed ISP, avoiding unstable reward estimation and enabling end-to-end optimization with improved stability.
We define the reward as the improvement in downstream task performance compared to the baseline input, with an additional penalty term to discourage degenerate solutions:
$$R(I_{\text{in}},\mathcal{A},\Theta)=\mathcal{L}_{\mathcal{T}}(I_{\text{in}})-\mathcal{L}_{\mathcal{T}}(I_{\text{out}})-P(I_{\text{out}}), \tag{3}$$

where $I_{\text{in}}$ is the input image, and $I_{\text{out}}=F(I_{\text{in}};\mathcal{A},\Theta)$ is the output processed by the ISP pipeline defined by $\mathcal{A}$ and $\Theta$. Here, $\mathcal{L}_{\mathcal{T}}$ denotes the loss function of the target task $\mathcal{T}$, and $P$ is a penalty term that prevents degenerate outputs. In the RL formulation, $I_{\text{in}}$ corresponds to the current state, while the choice of module sequence $\mathcal{A}$ and parameters $\Theta$ constitutes the action. The ISP pipeline $F$ represents the environment transition, producing the next state $I_{\text{out}}$. The reward $R$ thus measures how much the chosen action improves task performance relative to the baseline, while penalizing implausible results.
For example, when the task is object detection, $\mathcal{L}_{\mathcal{T}}$ is defined as the detection loss from a pretrained detector. In this case, the reward reflects the improvement in detection accuracy achieved by the ISP pipeline. More generally, $\mathcal{L}_{\mathcal{T}}$ can be adaptively defined for other tasks, allowing POS-ISP to tailor its pipeline to diverse objectives by directly optimizing task-relevant metrics.
For the penalty term $P$, in the detection and instance segmentation experiments, we adopt the intensity-based penalty from the truncation condition in AdaptiveISP [24], which discourages extreme pixel values:

$$P=\alpha_{1}\,[I_{\text{low}}-\bar{I}_{\text{out}}]_{+}+\alpha_{2}\,[\bar{I}_{\text{out}}-I_{\text{high}}]_{+}, \tag{4}$$

where $\bar{I}_{\text{out}}$ is the mean intensity of $I_{\text{out}}$, and $I_{\text{low}}$ and $I_{\text{high}}$ are the lower and upper bounds, set to $0.01$ and $0.9$ following AdaptiveISP. Here, $[x]_{+}$ is equivalent to $\max(0,x)$. This term serves as a soft regularizer that discourages extreme exposure shifts while preserving flexibility for optimization.
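Eqs. (3) and (4) transcribe directly into code. The bounds $I_{\text{low}}=0.01$ and $I_{\text{high}}=0.9$ follow the text; the weights $\alpha_{1}$, $\alpha_{2}$ and the loss values in the usage example are assumptions for illustration, since the section does not state them.

```python
# Intensity penalty of Eq. (4) and terminal reward of Eq. (3).
# alpha weights are assumed values; bounds follow the paper.

def relu(x):
    return max(0.0, x)  # [x]_+ = max(0, x)

def intensity_penalty(mean_out, alpha1=1.0, alpha2=1.0,
                      i_low=0.01, i_high=0.9):
    # penalize mean intensities below i_low or above i_high
    return alpha1 * relu(i_low - mean_out) + alpha2 * relu(mean_out - i_high)

def reward(task_loss_in, task_loss_out, penalty):
    # Eq. (3): improvement over the unprocessed input, minus the penalty
    return task_loss_in - task_loss_out - penalty

# a pipeline that lowers the task loss while keeping mean intensity in bounds
r = reward(task_loss_in=1.0, task_loss_out=0.6,
           penalty=intensity_penalty(0.5))
```

Note that a pipeline that simply brightens or darkens the image to game the detector would push $\bar{I}_{\text{out}}$ out of the bounds and be penalized, which is the degenerate case this term guards against.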
Given the reward, we train the sequence predictor and parameter predictor alternately. For the sequence predictor, we update it via the REINFORCE policy gradient method [25] to increase the likelihood of high-reward module sequences. Specifically, we define the learning objective for the sequence predictor as
$$\mathcal{L}_{\text{seq}}=-\hat{\mathbb{E}}_{\mathcal{A}\sim\pi}\left[R(I_{\text{in}},\mathcal{A},\Theta)\cdot\sum_{i=1}^{k}\log\pi(a_{i})\right], \tag{5}$$

where $\hat{\mathbb{E}}$ denotes the expectation over a mini-batch and $\pi(a_{i})$ denotes the probability of selecting $a_{i}$ at step $i$. $\Theta$ is computed from $I_{\text{in}}$ using the parameter predictor, i.e., $\Theta=\Theta(I_{\text{in}})$. This objective encourages the sequence predictor to assign higher probability to actions that yield higher rewards. On the other hand, the parameter predictor is trained by minimizing the following loss via backpropagation:
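As a numerical sketch, the REINFORCE objective in Eq. (5) is the negative of the reward-weighted sum of log-probabilities of the sampled actions, averaged over the mini-batch. The rewards and step probabilities below are made-up numbers; in practice each term would carry gradients through $\pi$.

```python
# Numerical sketch of Eq. (5): L_seq = -E_hat[ R * sum_i log pi(a_i) ].
# Batch entries are (reward, per-step selection probabilities); values
# are illustrative, not from the paper.
import math

def reinforce_loss(batch):
    total = 0.0
    for reward, step_probs in batch:
        log_prob_sum = sum(math.log(p) for p in step_probs)
        total += reward * log_prob_sum
    return -total / len(batch)  # empirical expectation over the mini-batch

batch = [(1.0, [0.5, 0.5]),    # high-reward sampled sequence
         (-0.5, [0.25, 0.5])]  # low-reward sampled sequence
loss = reinforce_loss(batch)
```

Minimizing this loss pushes probability mass toward sequences with positive reward and away from those with negative reward, which is exactly the score-function update the text describes.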
$$\mathcal{L}_{\text{param}}=\mathcal{L}_{\mathcal{T}}(I_{\text{out}})+P(I_{\text{out}}), \tag{6}$$

which is equivalent to maximizing the reward in Eq. 3.
After the search, we construct the final ISP pipeline as follows. First, the optimal module sequence $\hat{\mathcal{A}}=(\hat{a}_{1},\dots,\hat{a}_{k})$ is obtained from the trained sequence predictor by discretely selecting the module with the highest probability at each step until the termination token is reached. This sequence $\hat{\mathcal{A}}$ is fixed during inference. In parallel, the parameter predictor takes the input image $I_{\text{in}}$ and predicts parameters for all $n$ modules $(\hat{\theta}_{1},\dots,\hat{\theta}_{n})$. From these, the parameters corresponding to the selected sequence, $\hat{\Theta}=(\hat{\theta}_{a_{1}},\dots,\hat{\theta}_{a_{k}})$, are selected. Finally, the pipeline defined by $\hat{\mathcal{A}}$ and $\hat{\Theta}$ processes $I_{\text{in}}$ to produce the task-adapted output $I_{\text{out}}$.
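The inference-time procedure above (greedy decoding, then gathering only the selected modules' parameters from the full prediction) can be sketched as follows. The toy policy table, module names, and parameter values are hypothetical stand-ins for the trained predictors.

```python
# Greedy decoding of the module sequence, then parameter gathering.
# policy(prefix) returns a distribution over next choices; 'EOS' is the
# termination token. All names/values are illustrative.

def greedy_decode(policy, max_len=10, end_token='EOS'):
    seq, prefix = [], ()
    for _ in range(max_len):
        dist = policy(prefix)
        a = max(dist, key=dist.get)  # highest-probability choice at each step
        if a == end_token:
            break
        seq.append(a)
        prefix = prefix + (a,)
    return tuple(seq)

def policy(prefix):  # toy stand-in for the trained sequence predictor
    table = {(): {'wb': 0.8, 'EOS': 0.2},
             ('wb',): {'tone': 0.7, 'EOS': 0.3},
             ('wb', 'tone'): {'EOS': 1.0}}
    return table[prefix]

# parameter predictor outputs parameters for ALL n candidate modules ...
all_params = {'wb': 1.2, 'tone': 0.1, 'denoise': 3.0}
A_hat = greedy_decode(policy)
# ... but only those matching the selected sequence are used
Theta_hat = tuple(all_params[a] for a in A_hat)
```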
4 Experiments
Figure 3: Comparison of different ISP methods on object detection and instance segmentation tasks. Reference images are well-lit scenes from the LOD and LIS datasets, with brightness increased by 1.5× for visualization. More results are in the supplementary material.
Implementation details
We adopt the same candidate set of ISP modules as in AdaptiveISP [24]. Following their setting, we also assume that the input images are not completely raw sensor data, but have instead undergone only minimal preprocessing steps, such as defective pixel correction, black level correction, and demosaicking. These operations are standard prerequisites in camera pipelines, applied prior to any higher-level ISP modules, ensuring that our framework operates on minimally processed inputs while remaining aligned with prior work. Details on the ISP modules are provided in the supplementary material.
We implemented POS-ISP using PyTorch. We train the framework on resized image patches of size 512×512 with a batch size of 8 for 15,000 iterations. We use the Adam optimizer [15] with $\beta_{1}=0.9$ and $\beta_{2}=0.99$, and the learning rates are set to $1\times10^{-4}$ for the parameter predictor and $3\times10^{-5}$ for the sequence predictor with no learning rate scheduling. The training process takes approximately 24 hours on a single RTX A5000 GPU with 24GB VRAM.
4.1 Comparison
We compare POS-ISP with other task-driven ISP optimization methods on various downstream tasks, including object detection, instance segmentation, and image enhancement. For each downstream task, POS-ISP and competing methods are trained to find an optimal ISP pipeline that maximizes task performance. After training, the learned pipelines are applied to process test images, and the task performance is evaluated on the resulting outputs.
We benchmark against state-of-the-art approaches, including DRL-ISP [21], ReconfigISP [27], and AdaptiveISP [24]. These methods differ in their input/output configurations: AdaptiveISP and our framework operate on RAW images that have undergone basic preprocessing operations, whereas DRL-ISP takes a Bayer input and produces a Bayer output without demosaicking. ReconfigISP, in contrast, incorporates demosaicking as part of its candidate modules, taking a Bayer RAW image and producing an sRGB output. To ensure a fair comparison under a consistent setting, we adapted both ReconfigISP and DRL-ISP to accept RAW images processed by the same preprocessing operations as ours. Further implementation details are provided in the supplementary material.
In addition, we include the in-camera ISP as a baseline for comparison on object detection and instance segmentation tasks. The datasets we use provide both RAW and JPEG images, where the JPEGs are generated by in-camera ISPs embedded in commercial devices. This allows us to directly assess the performance of real camera ISPs alongside learned task-driven ISPs, highlighting how our method compares not only with prior research but also with the ISPs deployed in practice.
To maintain a fair comparison, we also align the training objectives across methods. RL-based methods such as AdaptiveISP and DRL-ISP use rewards derived from $\mathcal{L}_{\mathcal{T}}$, while ReconfigISP directly optimizes $\mathcal{L}_{\mathcal{T}}$ under its neural architecture search formulation.
Figure 4: Qualitative comparison on image enhancement. We use the images retouched by Expert C from the Adobe FiveK dataset as ground truth. Our method more closely matches the brightness and color tones of the ground truth.
Method           | LOD-Dark mAP@0.5:0.95 | mAP@0.5  | mAP@0.75 | LOD-All mAP@0.5:0.95 | mAP@0.5  | mAP@0.75
---------------- | --------------------- | -------- | -------- | -------------------- | -------- | --------
Input RAW        | 44.1                  | 67.7     | 47.5     | 53.6                 | 70.5     | 57.5
Camera ISP       | 37.6                  | 55.4     | 41.6     | 48.8                 | 64.5     | 52.2
DRL-ISP [21]     | 44.2                  | 67.8     | 48.4     | 52.8                 | 69.9     | 56.7
ReconfigISP [27] | 43.7                  | 66.7     | 47.8     | 51.1                 | 68.5     | 54.8
AdaptiveISP [24] | 47.2                  | 71.4     | 51.7     | 56.1                 | 72.8     | 60.6
Ours             | **47.8**              | **72.1** | **52.8** | **56.6**             | **73.1** | **60.9**

Table 1: Quantitative comparison on the object detection task. We highlight the best metrics in bold.
Object detection
We first evaluate the effectiveness on the object detection task. Following AdaptiveISP [24], we define the task loss $\mathcal{L}_{\mathcal{T}}$ as the sum of bounding box regression, objectness, and classification errors computed by the YOLOv3 [18] detector pretrained on the COCO dataset [16], with the detector parameters kept frozen during optimization. All methods are evaluated using the same detector to ensure consistency in the task loss definition.
For training and evaluation, we employ the LOD dataset [12], a real-world benchmark for low-light object detection. The dataset provides two subsets: LOD-Normal (well-lit) and LOD-Dark (low-light), each in JPEG and demosaicked RAW formats. Following AdaptiveISP, we primarily use LOD-Dark to evaluate the performance gains of task-driven ISP optimization methods, including ours, under challenging low-light conditions. In addition, we use LOD-All, which combines LOD-Normal and LOD-Dark, to assess robustness to images with varying illuminations.
Tab. 1 and Fig. 3 present the quantitative and qualitative results, respectively. In the table, "Input RAW" denotes images processed with the preprocessing operations assumed in our framework. The results show that the in-camera ISP performs worse than the input RAW images, underscoring its limitations for task-driven objectives. Task-driven ISP methods generally outperform the in-camera ISP, but ReconfigISP suffers from a mismatch between soft training and hard inference. RL-based methods surpass the in-camera ISP yet remain suboptimal due to unstable reward estimation and stepwise formulation. In contrast, POS-ISP achieves the highest accuracy by leveraging stable sequence-level optimization with accurate final rewards.
| Method | LIS-Dark mAP@0.5:0.95 | LIS-Dark mAP@0.5 | LIS-Dark mAP@0.75 | LIS-All mAP@0.5:0.95 | LIS-All mAP@0.5 | LIS-All mAP@0.75 |
|---|---|---|---|---|---|---|
| Input RAW | 27.8 | 45.6 | 27.9 | 32.6 | 52.3 | 33.0 |
| Camera ISP | 20.1 | 35.1 | 20.0 | 30.4 | 48.9 | 31.0 |
| DRL-ISP [21] | 27.1 | 44.7 | 27.4 | 23.6 | 40.1 | 23.8 |
| ReconfigISP [27] | 24.2 | 40.8 | 24.5 | 31.1 | 51.2 | 31.0 |
| AdaptiveISP [24] | 25.2 | 42.3 | 25.2 | 32.4 | 52.3 | 32.5 |
| Ours | **32.1** | **51.8** | **32.1** | **34.9** | **55.9** | **34.9** |

Table 2: Quantitative comparison on the instance segmentation task. We highlight the best metrics in bold.
Instance segmentation
We further evaluate POS-ISP on instance segmentation to assess its generalization to fine-grained vision tasks beyond object detection. Here, $\mathcal{L}_{\mathcal{T}}$ is the sum of the detection and mask losses from a YOLOv11-seg [13] model pretrained on the COCO dataset, with the segmentation model parameters kept frozen during optimization. Evaluation is conducted on the LIS dataset [7], which consists of well-lit images (LIS-Normal) and low-light images (LIS-Dark). As in the object detection experiments, we evaluate on LIS-Dark and LIS-All, where LIS-All is the union of LIS-Dark and LIS-Normal.
Tab. 2 and Fig. 3 present quantitative and qualitative comparisons, respectively. Unlike in object detection, task-driven ISP methods outperform the in-camera ISP but remain inferior to the input RAW images. This highlights the difficulty of optimizing ISP pipelines for dense prediction tasks, where pixel-level supervision produces complex and unstable training signals. In segmentation, small pixel deviations can disproportionately affect the reward, increasing variance and destabilizing learning. For RL-based approaches, such high-variance rewards complicate value estimation and can lead to error accumulation during updates. Our method avoids this issue by optimizing directly with task-level supervision rather than unstable value estimates, resulting in more stable training and better performance.
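To illustrate why no value estimation is needed, the sketch below gives a minimal sequence-level policy-gradient update in the spirit of REINFORCE [25]. For brevity it assumes a single shared logit vector over modules and reuses the pre-update probabilities at every step; this is a simplification of the actual predictor, not the paper's implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, chosen, reward, lr=0.1):
    """Sequence-level policy gradient: sample the whole pipeline first,
    then scale the log-likelihood gradient of every chosen module by a
    single terminal reward -- no per-step value estimates needed."""
    probs = softmax(logits)
    for a in chosen:
        for j in range(len(logits)):
            indicator = 1.0 if j == a else 0.0
            logits[j] += lr * reward * (indicator - probs[j])
    return logits
```

A positive terminal reward raises the probability of every module in the sampled pipeline; a negative one lowers it.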
Image enhancement
Lastly, we evaluate POS-ISP on image enhancement using the Adobe FiveK dataset [3]. We adopt the Expert C retouched images as the target style and define the task loss $\mathcal{L}_{\mathcal{T}}$ as the mean squared error (MSE) between the ISP output and the corresponding Expert C image. Qualitative results are presented in Fig. 4.
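The enhancement objective is plain pixel-wise MSE; a toy version over flat pixel lists (an illustrative stand-in for image tensors) looks like:

```python
def mse_task_loss(output, target):
    """Task loss for enhancement: mean squared error between the ISP
    output and the Expert C retouched target."""
    assert len(output) == len(target)
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)
```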
DRL-ISP brightens the image but leaves many regions underexposed. ReconfigISP improves brightness but produces overly saturated tones that deviate from the expert style. AdaptiveISP shows noticeable color and white-balance shifts, producing desaturated tones. In contrast, POS-ISP produces visually pleasing results that closely match the desired retouching style.
These results demonstrate the robustness of our approach and its ability to generalize beyond recognition tasks to perceptual quality enhancement. Additional results are provided in the supplementary material.
4.2 Optimization Stability
Figure 5: Optimization behavior. (a) Task score on the test set over training progress. (b) Policy entropy convergence (left) and relative likelihood of the final pipeline (right).
Training dynamics
We further analyze the optimization behavior during training. In Fig. 5-(a), we plot the test performance curves for object detection and instance segmentation on the LOD-All [12] and LIS-All [7] datasets, respectively, against the training progress, defined as the ratio between the checkpoint iteration and the total number of training iterations. POS-ISP improves steadily throughout training, whereas AdaptiveISP [24] either exhibits noticeable fluctuations or improves only marginally early in training, suggesting that our method has more stable optimization behavior.
The optimization statistics in Fig. 5-(b) further support this observation. The policy entropy steadily decreases during training, suggesting that the policy becomes increasingly confident in selecting pipelines. At the same time, the likelihood assigned to the pipeline selected by the final policy consistently increases, growing by approximately 20–60× across datasets. This likelihood is computed by summing the log probabilities of the selected modules to obtain the pipeline log-likelihood and exponentiating its difference from the initial log-likelihood. Together, these trends indicate that the policy progressively concentrates its probability mass on effective pipelines, resulting in stable optimization.
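The relative-likelihood computation described above can be sketched directly. The module counts and probabilities below are illustrative assumptions, not values taken from the paper.

```python
import math

def relative_likelihood(initial_log_probs, final_log_probs):
    """Pipeline log-likelihood = sum of per-module log probabilities;
    the relative likelihood exponentiates its change from the initial
    policy, as described in the text."""
    return math.exp(sum(final_log_probs) - sum(initial_log_probs))

# Hypothetical 5-module pipeline: the initial policy is uniform over 8
# modules; the final policy assigns 0.25 to each selected module.
initial = [math.log(1 / 8)] * 5
final = [math.log(0.25)] * 5
ratio = relative_likelihood(initial, final)  # 2^5 = 32x more likely
```

A value in this range is consistent with the roughly 20–60× growth reported in Fig. 5-(b).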
| Method | mAP@0.5:0.95 | mAP@0.5 | mAP@0.75 |
|---|---|---|---|
| DRL-ISP [21] | 44.00 ± 0.20 | 67.73 ± 0.47 | 47.77 ± 0.31 |
| AdaptiveISP [24] | 47.13 ± 0.25 | 71.17 ± 0.15 | 51.73 ± 0.57 |
| Ours | **47.80 ± 0.08** | **72.10 ± 0.16** | **52.67 ± 0.12** |

Table 3: Multi-seed quantitative comparison on the LOD-Dark object detection benchmark. We report mean ± standard deviation over three seeds.
Multi-seed comparison
We examine the robustness of the optimization across random seeds. On the LOD-Dark [12] object detection benchmark with a YOLOv3 [18] detector, the three-seed results (Tab. 3) show that POS-ISP consistently outperforms DRL-ISP [21] and AdaptiveISP [24], while exhibiting substantially smaller standard deviations. This indicates that our method achieves reproducible gains and consistently attains high performance across independent training runs.
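The mean ± standard deviation entries in Tab. 3 are ordinary sample statistics over seeds. The per-seed values below are hypothetical, chosen only to be consistent with the reported 47.80 ± 0.08; the actual per-seed results are not given in the text.

```python
import statistics

def summarize_seeds(values):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical per-seed mAP@0.5:0.95 values (not the actual seeds).
mean, std = summarize_seeds([47.72, 47.80, 47.88])  # approx. 47.80, 0.08
```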
| Method | Params (M) | MACs (M) | Peak GPU Memory (MB) | Runtime (ms) |
|---|---|---|---|---|
| DRL-ISP [21] | 6.57 | 155.3 | 1013.9 | 15.71 |
| AdaptiveISP [24] | 7.18 | 70.2 | 39.6 | 12.72 |
| Ours | **0.53** | **15.1** | **14.4** | **1.55** |

Table 4: Comparison of computational efficiency. All results are measured on a single NVIDIA RTX 2080 Ti with 512×512 inputs; module execution time is excluded from the runtime measurement. We highlight the best metrics in bold.
4.3 Computational Efficiency
We compare the inference efficiency of POS-ISP with the RL-based methods [21, 24]. For DRL-ISP [21] and AdaptiveISP [24], runtime is measured using pipelines of three and five modules, respectively, following their original settings. The results are summarized in Tab. 4. RL-based approaches incur considerable computational overhead because the controller must be executed repeatedly during pipeline construction. DRL-ISP further amplifies this cost by employing a heavy feature extractor, resulting in substantial computational and memory demands. AdaptiveISP reduces memory usage compared to DRL-ISP, but still relies on a relatively large controller and repeated inference, leading to non-negligible overhead. In contrast, POS-ISP is lightweight and efficient: by fixing the module sequence at inference time, it predicts only the module parameters in a single forward pass, eliminating repeated controller execution and iterative decision-making. This significantly reduces both computational cost and memory usage.
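The efficiency gap comes largely from how often the controller runs. The mock below counts controller invocations under the two schemes; the class and function names are our own illustration, not the paper's code.

```python
class CountingController:
    """Stand-in controller that records how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def decide(self, state):
        self.calls += 1
        return state  # placeholder decision

def stepwise_inference(controller, num_modules=5):
    # Step-wise RL (e.g., AdaptiveISP): one controller pass per module.
    state = 0
    for _ in range(num_modules):
        state = controller.decide(state)
    return controller.calls

def sequence_level_inference(controller):
    # POS-ISP-style inference: a single forward pass predicts the
    # parameters for the fixed module sequence.
    controller.decide(0)
    return controller.calls
```

With a five-module pipeline, the step-wise scheme runs the controller five times versus once for the sequence-level scheme.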
4.4 Ablation on Sequence Predictor
Our sequence predictor is designed to capture inter-module dependencies in ISP pipeline prediction. To evaluate its impact, we conduct an ablation study in which the sequence predictor is replaced with a learnable probability table. In this table, the element at row $i$ and column $j$ denotes the probability of selecting module $a_j$ at step $i$, thereby modeling each decision independently without considering contextual relationships. This design allows a direct examination of the benefits of exploiting inter-module dependencies when predicting module sequences. We use the same experimental settings described in Tab. 1.
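A sketch of the probability-table variant makes its limitation concrete: each step is sampled from its own row, so earlier choices cannot influence later ones. The sampling helper below is our own illustration, not the paper's code.

```python
import random

def sample_sequence(prob_table, rng=None):
    """Sample a module sequence from a learnable probability table.

    prob_table[i][j] is the probability of selecting module a_j at
    step i; every step is drawn independently of the others.
    """
    rng = rng or random.Random(0)
    seq = []
    for row in prob_table:
        r, acc = rng.random(), 0.0
        choice = len(row) - 1  # fallback guards against rounding error
        for j, p in enumerate(row):
            acc += p
            if r < acc:
                choice = j
                break
        seq.append(choice)
    return seq
```

A GRU-based predictor instead conditions each step on a hidden state summarizing the modules chosen so far, which is what allows it to model module order.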
As shown in Tab. 5, the probability-table variant (Tab. 5-(a)) fails to capture contextual relationships among modules, resulting in lower performance. By contrast, adopting a recurrent structure (Tab. 5-(b)) enables the model to learn inter-module dependencies, leading to improved performance by effectively modeling the influence of module order on sequence prediction.
| Sequence predictor | LOD-Dark mAP@0.5:0.95 | LOD-Dark mAP@0.5 | LOD-Dark mAP@0.75 | LIS-Dark mAP@0.5:0.95 | LIS-Dark mAP@0.5 | LIS-Dark mAP@0.75 |
|---|---|---|---|---|---|---|
| (a) Prob. table | 47.5 | 71.4 | 52.1 | 31.3 | 50.9 | 31.9 |
| (b) GRU (Ours) | **47.8** | **72.1** | **52.8** | **32.1** | **51.8** | **32.1** |

Table 5: Ablation study on the effect of recurrent sequence estimation. We highlight the best metrics in bold.
5 Conclusion
In this paper, we introduce POS-ISP, a task-aware ISP optimization framework that jointly learns module sequences and parameters for a target downstream task. By eliminating the unstable future reward estimation and stepwise decision process, our approach enables stable optimization while explicitly modeling inter-module dependencies. POS-ISP achieves state-of-the-art accuracy on object detection and instance segmentation with substantially lower computational and memory overhead, and also shows promising results on image enhancement.
Limitations and future work
Despite promising results, our framework still has some limitations. Increasing the number of candidate ISP modules enlarges the search space and may require longer training to converge. Moreover, the current system requires separate training for each downstream task, which limits scalability in multi-task settings: since each task relies on an independently optimized ISP policy, supporting multiple tasks increases system complexity. As future work, extending the framework to a unified model that jointly optimizes multiple tasks could improve scalability and broaden its applicability.
Acknowledgement
We thank Junpyo Seo for his assistance with the on-device deployment test. This work was supported by Samsung Electronics Co., Ltd (IO251210-14286-01), and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2026-RS-2024-00437866, No.RS-2019-II191906, Artificial Intelligence Graduate School Program (POSTECH)).
References
Afifi et al. [2021]
Mahmoud Afifi, Abdelrahman Abdelhamed, Abdullah Abuolaim, Abhijith Punnappurath, and Michael S. Brown.
CIE XYZ Net: Unprocessing images for low-level computer vision tasks.
IEEE TPAMI, 2021.
Brooks et al. [2019]
Tim Brooks, Ben Mildenhall, Tianfan Xue, Jiawen Chen, Dillon Sharlet, and Jonathan T. Barron.
Unprocessing images for learned raw denoising.
In CVPR, 2019.
Bychkovsky et al. [2011]
Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand.
Learning photographic global tonal adjustment with a database of input/output image pairs.
In CVPR, 2011.
Cesa-Bianchi et al. [2017]
Nicolò Cesa-Bianchi, Claudio Gentile, Gábor Lugosi, and Gergely Neu.
Boltzmann exploration done right.
In NeurIPS, 2017.
Chen et al. [2018]
Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun.
Learning to see in the dark.
In CVPR, 2018.
Chen et al. [2025]
Kai Chen, Jin Xiao, Leheng Zhang, Kexuan Shi, and Shuhang Gu.
Task-aware image signal processor for advanced visual perception.
arXiv, 2025.
Chen et al. [2023]
Linwei Chen, Ying Fu, Kaixuan Wei, Dezhi Zheng, and Felix Heide.
Instance segmentation in the dark.
IJCV, 2023.
Cho et al. [2014]
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio.
On the properties of neural machine translation: Encoder-decoder approaches.
arXiv, 2014.
Conde et al. [2022]
Marcos V. Conde, Steven McDonagh, Matteo Maggioni, Ales Leonardis, and Eduardo Pérez-Pellitero.
Model-based image signal processors via learnable dictionaries.
In AAAI, 2022.
Fujimoto et al. [2018]
Scott Fujimoto, Herke van Hoof, and David Meger.
Addressing function approximation error in actor-critic methods.
In ICML, 2018.
Hidasi et al. [2015]
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk.
Session-based recommendations with recurrent neural networks.
arXiv, 2015.
Hong et al. [2021]
Yang Hong, Kaixuan Wei, Linwei Chen, and Ying Fu.
Crafting object detection in very low light.
In BMVC, 2021.
Khanam and Hussain [2024]
Rahima Khanam and Muhammad Hussain.
YOLOv11: An overview of the key architectural enhancements.
arXiv, 2024.
Kim et al. [2024]
Woohyeok Kim, Geonu Kim, Junyong Lee, Seungyong Lee, Seung-Hwan Baek, and Sunghyun Cho.
ParamISP: Learned forward and inverse ISPs using camera parameters.
In CVPR, 2024.
Kingma and Ba [2014]
Diederik P. Kingma and Jimmy Ba.
Adam: A method for stochastic optimization.
arXiv, 2014.
Lin et al. [2014]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick.
Microsoft COCO: Common objects in context.
In ECCV, 2014.
Qin et al. [2022]
Haina Qin, Longfei Han, Juan Wang, Congxuan Zhang, Yanwei Li, Bing Li, and Weiming Hu.
Attention-aware learning for hyperparameter prediction in image processing pipelines.
In ECCV, 2022.
Redmon and Farhadi [2018]
Joseph Redmon and Ali Farhadi.
YOLOv3: An incremental improvement.
arXiv, 2018.
Ren et al. [2019]
Pengjie Ren, Zhumin Chen, Jing Li, Zhaochun Ren, Jun Ma, and Maarten de Rijke.
RepeatNet: A repeat-aware neural recommendation machine for session-based recommendation.
In AAAI, 2019.
Serban et al. [2016]
Iulian Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau.
Building end-to-end dialogue systems using generative hierarchical neural network models.
In AAAI, 2016.
Shin et al. [2022]
Ukcheol Shin, Kyunghyun Lee, and In So Kweon.
DRL-ISP: Multi-objective camera ISP with deep reinforcement learning.
In IROS, 2022.
Tseng et al. [2019]
Ethan Tseng, Felix Yu, Yuting Yang, Fahim Mannan, Karl St. Arnaud, Derek Nowrouzezahrai, Jean-François Lalonde, and Felix Heide.
Hyperparameter optimization in black-box image processing using differentiable proxies.
ACM TOG, 2019.
van Hasselt et al. [2018]
Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil.
Deep reinforcement learning and the deadly triad.
arXiv, 2018.
Wang et al. [2024]
Yujin Wang, Tianyi Xu, Fan Zhang, Tianfan Xue, and Jinwei Gu.
AdaptiveISP: Learning an adaptive image signal processor for object detection.
In NeurIPS, 2024.
Williams [1992]
Ronald J. Williams.
Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning, 1992.
Xing et al. [2021]
Yazhou Xing, Zian Qian, and Qifeng Chen.
Invertible image signal processing.
In CVPR, 2021.
Yu et al. [2021]
Ke Yu, Zexian Li, Yue Peng, Chen Change Loy, and Jinwei Gu.
ReconfigISP: Reconfigurable camera image processing pipeline.
In ICCV, 2021.
Zamir et al. [2020]
Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao.
CycleISP: Real image restoration via improved data synthesis.
In CVPR, 2020.