Quick Start
============
.. currentmodule:: qlib

QlibRL provides an example implementation of a single-asset order execution task. The following is an example config file for training with QlibRL.

.. code-block:: yaml

    simulator:
        # Each step contains 30 minutes.
        time_per_step: 30
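        # Upper bound of the trading volume. Should be null or a float between 0 and 1. If it is a float, the upper bound is computed as a percentage of the market volume.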
        vol_limit: null
    env:
        # Number of concurrent environment workers.
        concurrency: 1
        # dummy, subproc, or shmem. Corresponds to the `parallel mode in tianshou <https://tianshou.readthedocs.io/en/master/api/tianshou.env.html#vectorenv>`_.
        parallel_mode: dummy
    action_interpreter:
        class: CategoricalActionInterpreter
        kwargs:
            # Candidate actions. Either a list of length L: [a_1, a_2,..., a_L], or an integer n, in which case a list of length n+1 is generated automatically, i.e., [0, 1/n, 2/n,..., n/n].
            values: 14
            # Total number of steps (an upper-bound estimate).
            max_step: 8
        module_path: qlib.rl.order_execution.interpreter
    state_interpreter:
        class: FullHistoryStateInterpreter
        kwargs:
            # Number of dimensions in the data.
            data_dim: 6
            # Equal to the total number of records. For example, in SAOE per-minute data, data_ticks is the number of minutes in a day.
            data_ticks: 240
            # Total number of steps (an upper-bound estimate). For example, 390 minutes / 30 minutes per step = 13 steps.
            max_step: 8
            # Provider of the processed data.
            processed_data_provider:
                class: PickleProcessedDataProvider
                module_path: qlib.rl.data.pickle_styled
                kwargs:
                    data_dir: ./data/pickle_dataframe/feature
        module_path: qlib.rl.order_execution.interpreter
    reward:
        class: PAPenaltyReward
        kwargs:
            # Penalty for trading a large volume within a short time.
            penalty: 100.0
        module_path: qlib.rl.order_execution.reward
    data:
        source:
            order_dir: ./data/training_order_split
            data_dir: ./data/pickle_dataframe/backtest
            # Number of time indexes.
            total_time: 240
            # Start time index.
            default_start_time: 0
            # End time index.
            default_end_time: 240
            proc_data_dim: 6
        num_workers: 0
        queue_size: 20
    network:
        class: Recurrent
        module_path: qlib.rl.order_execution.network
    policy:
        class: PPO
        kwargs:
            lr: 0.0001
        module_path: qlib.rl.order_execution.policy
    runtime:
        seed: 42
        use_cuda: false
    trainer:
        max_epoch: 2
        # Number of times each collected batch is reused for policy updates.
        repeat_per_collect: 5
        earlystop_patience: 2
        # Number of episodes per collect during training.
        episode_per_collect: 20
        batch_size: 16
        # Perform validation every n iterations.
        val_every_n_epoch: 1
        checkpoint_path: ./checkpoints
        checkpoint_every_n_iters: 1
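
The two interpreter comments above describe simple arithmetic, which the following minimal sketch spells out. It is not QlibRL's implementation: ``candidate_actions`` and ``steps_upper_bound`` are hypothetical helper names introduced here only for illustration.

.. code-block:: python

    import math


    def candidate_actions(values):
        """Expand ``values`` the way the config comment describes.

        An integer n becomes [0, 1/n, 2/n, ..., n/n]; a list is used as given.
        """
        if isinstance(values, int):
            return [i / values for i in range(values + 1)]
        return list(values)


    def steps_upper_bound(total_minutes, time_per_step):
        """Upper-bound estimate of the number of steps in one trading window."""
        return math.ceil(total_minutes / time_per_step)


    actions = candidate_actions(14)
    print(len(actions), actions[0], actions[-1])
    print(steps_upper_bound(240, 30))
    print(steps_upper_bound(390, 30))

With ``values: 14`` the list has 15 entries running from 0 to 1. With 240 one-minute ticks and 30 minutes per step the estimate is 8 steps, which is consistent with ``max_step: 8`` in this config.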

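Before launching a run, it can help to check the few constraints that the comments above state explicitly. The sketch below works on an already-loaded config dictionary and is only an illustration: ``check_training_config`` is a hypothetical helper, not part of QlibRL, and treating the bounds 0 and 1 as inclusive is an assumption.

.. code-block:: python

    VALID_PARALLEL_MODES = {"dummy", "subproc", "shmem"}


    def check_training_config(config):
        """Return a list of problems; an empty list means none were found."""
        problems = []

        vol_limit = config["simulator"]["vol_limit"]
        if vol_limit is not None:
            # Assumption: the bounds 0 and 1 are treated as inclusive.
            if not isinstance(vol_limit, float) or not 0.0 <= vol_limit <= 1.0:
                problems.append(
                    f"vol_limit must be null or a float in [0, 1], got {vol_limit!r}"
                )

        mode = config["env"]["parallel_mode"]
        if mode not in VALID_PARALLEL_MODES:
            problems.append(
                f"parallel_mode must be one of {sorted(VALID_PARALLEL_MODES)}, got {mode!r}"
            )

        return problems


    good = {
        "simulator": {"time_per_step": 30, "vol_limit": None},
        "env": {"concurrency": 1, "parallel_mode": "dummy"},
    }
    bad = {
        "simulator": {"time_per_step": 30, "vol_limit": 1.5},
        "env": {"concurrency": 1, "parallel_mode": "thread"},
    }

    print(check_training_config(good))
    for problem in check_training_config(bad):
        print(problem)
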

And the config file for backtesting:

.. code-block:: yaml

    order_file: ./data/backtest_orders.csv
    start_time: "9:45"
    end_time: "14:44"
    qlib:
        provider_uri_1min: ./data/bin
        feature_root_dir: ./data/pickle
        # feature generated by today's information
        feature_columns_today: [
            "$open", "$high", "$low", "$close", "$vwap", "$volume",
        ]
        # feature generated by yesterday's information
        feature_columns_yesterday: [
            "$open_v1", "$high_v1", "$low_v1", "$close_v1", "$vwap_v1", "$volume_v1",
        ]
    exchange:
        # Expressions for the buy and sell stock limits.
        limit_threshold: ['$close == 0', '$close == 0']
        # Deal prices for buying and selling.
        deal_price: ["If($close == 0, $vwap, $close)", "If($close == 0, $vwap, $close)"]
    volume_threshold:
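        # Volume limit for both buying and selling. "cum" means the value is accumulated over time.
        all: ["cum", "0.2 * DayCumsum($volume, '9:45', '14:44')"]
        # Volume limit for buying.
        buy: ["current", "$close"]
        # Volume limit for selling. "current" means the value is real-time and does not accumulate over time.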
        sell: ["current", "$close"]
    strategies:
        30min:
            class: TWAPStrategy
            module_path: qlib.contrib.strategy.rule_strategy
            kwargs: {}
        1day:
            class: SAOEIntStrategy
            module_path: qlib.rl.order_execution.strategy
            kwargs:
                state_interpreter:
                    class: FullHistoryStateInterpreter
                    module_path: qlib.rl.order_execution.interpreter
                    kwargs:
                        max_step: 8
                        data_ticks: 240
                        data_dim: 6
                        processed_data_provider:
                            class: PickleProcessedDataProvider
                            module_path: qlib.rl.data.pickle_styled
                            kwargs:
                                data_dir: ./data/pickle_dataframe/feature
                action_interpreter:
                    class: CategoricalActionInterpreter
                    module_path: qlib.rl.order_execution.interpreter
                    kwargs:
                        values: 14
                        max_step: 8
                network:
                    class: Recurrent
                    module_path: qlib.rl.order_execution.network
                    kwargs: {}
                policy:
                    class: PPO
                    module_path: qlib.rl.order_execution.policy
                    kwargs:
                        lr: 1.0e-4
                        # Local path to the latest model. The model is generated during training, so run training first if you want to backtest with a trained policy. You can also remove this parameter to backtest with a randomly initialized policy.
                        weight_file: ./checkpoints/latest.pth
    # Number of concurrent environment workers.
    concurrency: 5
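
The ``deal_price`` and ``volume_threshold`` expressions above are evaluated by Qlib's expression engine. The plain-Python sketch below only illustrates one reading of them and is not how Qlib computes them. The helper names are hypothetical, and the input list is assumed to already be restricted to the 9:45 to 14:44 window.

.. code-block:: python

    from itertools import accumulate


    def deal_price(close, vwap):
        """Mirror of ``If($close == 0, $vwap, $close)``: fall back to VWAP when close is 0."""
        return vwap if close == 0 else close


    def cumulative_volume_cap(volumes, ratio=0.2):
        """Illustrative "cum" limit: ``ratio`` times the running total of market volume.

        Assumption: ``volumes`` holds per-bar market volume inside the trading window.
        """
        return [ratio * total for total in accumulate(volumes)]


    market_volumes = [100, 300, 600]
    caps = cumulative_volume_cap(market_volumes)
    for volume, cap in zip(market_volumes, caps):
        print(volume, cap)

    print(deal_price(close=0, vwap=10.2))
    print(deal_price(close=10.5, vwap=10.2))

Under this reading, a "cum" limit grows as market volume accumulates during the day, whereas a "current" limit is evaluated from the value at the current bar only.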

With the config files above, you can start training the agent with the following command:

.. code-block:: console

    $ python -m qlib.rl.contrib.train_onpolicy --config_path train_config.yml

After training, you can run the backtest with the following command:

.. code-block:: console

    $ python -m qlib.rl.contrib.backtest --config_path backtest_config.yml
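
If you prefer to drive both stages from a script, the two commands can be assembled and run in sequence. This is a minimal sketch using only the standard library: ``build_command`` and ``run_pipeline`` are hypothetical helpers, and the module names are passed without a ``.py`` suffix because ``python -m`` expects a module name rather than a file name.

.. code-block:: python

    import subprocess
    import sys


    def build_command(module, config_path):
        """Build a ``python -m <module> --config_path <file>`` command as an argument list."""
        return [sys.executable, "-m", module, "--config_path", config_path]


    train_cmd = build_command("qlib.rl.contrib.train_onpolicy", "train_config.yml")
    backtest_cmd = build_command("qlib.rl.contrib.backtest", "backtest_config.yml")


    def run_pipeline():
        """Train first, then backtest; stop if either step exits with an error."""
        subprocess.run(train_cmd, check=True)
        subprocess.run(backtest_cmd, check=True)


    print(" ".join(train_cmd[1:]))
    print(" ".join(backtest_cmd[1:]))

Calling ``run_pipeline()`` requires Qlib to be installed and both config files to exist in the working directory. Running training first matters because the backtest config points ``weight_file`` at a checkpoint produced by training.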

In this example, :class:`~qlib.rl.order_execution.simulator_qlib.SingleAssetOrderExecution` and :class:`~qlib.rl.order_execution.simulator_simple.SingleAssetOrderExecutionSimple` serve as example simulators, :class:`qlib.rl.order_execution.interpreter.FullHistoryStateInterpreter` and :class:`qlib.rl.order_execution.interpreter.CategoricalActionInterpreter` as example interpreters, :class:`qlib.rl.order_execution.policy.PPO` as an example policy, and :class:`qlib.rl.order_execution.reward.PAPenaltyReward` as an example reward function.
For the single-asset order execution task, developers who have already defined their own simulator, interpreters, reward function, or policy only need to modify the corresponding settings in the config files to launch the training and backtest pipelines.
Details of the example can be found `here <https://github.com/ssvip9527/qlib/blob/main-cn/examples/rl/README.md>`_.
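
Swapping in a custom component works because each config entry names a ``class``, a ``module_path``, and optional ``kwargs``. The sketch below illustrates that general pattern with the standard library only. It is not Qlib's code (Qlib ships its own helper, ``qlib.utils.init_instance_by_config``), and ``instantiate`` is a hypothetical name. A standard-library class stands in for a user-defined reward so that the snippet runs without Qlib.

.. code-block:: python

    import importlib


    def instantiate(config):
        """Import ``module_path``, look up ``class``, and call it with ``kwargs``."""
        module = importlib.import_module(config["module_path"])
        cls = getattr(module, config["class"])
        # An empty ``kwargs:`` entry loads as None, so fall back to no arguments.
        return cls(**(config.get("kwargs") or {}))


    # Stand-in entry: same shape as the ``reward`` section, but pointing at
    # ``fractions.Fraction`` instead of a Qlib class.
    demo_entry = {
        "class": "Fraction",
        "module_path": "fractions",
        "kwargs": {"numerator": 1, "denominator": 14},
    }

    component = instantiate(demo_entry)
    print(type(component).__name__, component)

To plug in your own reward under this pattern, you would point ``module_path`` at the module that defines it and ``class`` at its class name, leaving the rest of the config unchanged.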

In the future, we will provide more examples for different scenarios, such as RL-based portfolio construction.