FFmpeg OpenCL Acceleration

Someone recently submitted a patch to FFmpeg aiming to use VideoToolbox and OpenCL together on Apple devices. That's when I realized people actually do still use OpenCL on macOS.

OpenCL is an API for compute acceleration on heterogeneous devices. It was originally proposed by Apple and later became an open standard under the Khronos Group.

Khronos's best-known work is OpenGL/OpenGL ES. Vulkan is still getting off the ground (despite being eight years old...), and the group's other standards, such as OpenMAX and OpenSL ES, have never gained much traction.

OpenCL has three major versions: 1.0, 2.0, and 3.0. Yet Apple, the company that proposed OpenCL, has since deprecated it (along with OpenGL) in favor of Metal; Apple supports at most OpenCL 1.2. Deprecated does not mean gone: it still works today and is unlikely to have functionality removed in the short term, it simply will not receive updates.

On why OpenCL never became widespread, a developer who claims to have participated in drafting the standard wrote ("Why OpenCL is not widely used", reddit):

So, I actually worked on the original version of OpenCL for one of the original Khronos group partners, which meant I got a ringside seat to how parts of the standard was developed. Lots of REALLY smart people who (for the most part) were very dedicated to making something pretty awesome.

Unfortunately, this was right when GPU acceleration and several other new processor techniques were just taking off. The various partners had invested enormous sums into their respective technologies, and at some levels things got rather cutthroat. There were more than a few parts of the standard which were not exactly crippled, but designed in such a way to put a competitor at a disadvantage. If AMD needed a feature to enable some capacity but Nvidia and Apple didn’t, they might not get it, even though it didn’t harm anyone else. There were instances of this from all parties, but Nvidia already was gaining ground with Cuda and tended to swing the biggest stick. AMD and IBM had their own models as well, and I think Apple did as well. No one wanted to miss out on being in the “standard” but most of them had vested interests in seeing it be not-quite-as-good as their own proprietary stuff.

In short: a room full of smart people, driven by the KPIs of a handful of fiercely competing companies, kept tripping each other up; watching a rival's proposal pass hurt more than having your own rejected. The result was predictable.

Depressing history aside, OpenCL remains usable, and its cross-platform support is quite good. FFmpeg has filters implemented in OpenCL but none implemented in OpenGL (third-party implementations don't count).

Back to the patch that wanted to use macOS VideoToolbox and OpenCL together. Its author did not understand what hwaccel "derive" means in FFmpeg; the patch punched a hard-coded hole through the abstraction, and was rejected as unreasonable.

"Derive" roughly means to spawn one thing from another: FFmpeg can generate a second hardware device abstraction from an existing one. For example, I can create a VAAPI device for video decoding, then derive a Vulkan device from it; frames decoded by VAAPI can then be handed to Vulkan without a CPU copy. In other words, the purpose of derive is interoperation between different hardware acceleration APIs.
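In code, that VAAPI-to-Vulkan example maps onto FFmpeg's public hwcontext API roughly as follows. This is a hypothetical sketch (not from FFmpeg's source or the patch), assuming a system where both backends are available, with error handling reduced to early returns:

```c
#include <libavutil/hwcontext.h>

/* Sketch: create a VAAPI device for decoding, then derive a Vulkan
 * device from it so decoded frames can be shared without a CPU copy. */
static int make_derived_vulkan(AVBufferRef **vulkan_ref)
{
    AVBufferRef *vaapi_ref = NULL;
    int ret;

    /* The device the decoder will use. */
    ret = av_hwdevice_ctx_create(&vaapi_ref, AV_HWDEVICE_TYPE_VAAPI,
                                 NULL, NULL, 0);
    if (ret < 0)
        return ret;

    /* Derive: the Vulkan device wraps the same physical GPU, so frames
     * can later be mapped between the two APIs instead of copied. */
    ret = av_hwdevice_ctx_create_derived(vulkan_ref, AV_HWDEVICE_TYPE_VULKAN,
                                         vaapi_ref, 0);

    av_buffer_unref(&vaapi_ref); /* the derived device keeps its own ref */
    return ret;
}
```

The same pattern applies to any derivable pair of device types; which pairs actually work depends on each hwcontext's implementation.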

So, if I want to decode with VideoToolbox and process the frames with OpenCL, how do I get good performance? Copying VideoToolbox output down to the CPU and then uploading it back to the GPU through the OpenCL API is obviously not an optimized path.

Apple provides an OpenCL extension, the core of which is:

clCreateImageFromIOSurfaceWithPropertiesAPPLE

The basic flow is:

1. Get the IOSurfaceRef from the CVPixelBufferRef output by the VideoToolbox decoder

2. Call the clCreateImageFromIOSurfaceWithPropertiesAPPLE() interface to create a cl_mem, i.e. an OpenCL image, from the IOSurfaceRef

3. From there it is the generic OpenCL processing flow; the one thing to watch is the lifetimes of the CVPixelBufferRef and the cl_mem created from it
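The three steps above might be sketched like this for a single-plane pixel buffer (e.g. BGRA). Note this is an illustrative assumption, not verified code: the exact prototype of clCreateImageFromIOSurfaceWithPropertiesAPPLE and the shape of its properties list come from my reading of Apple's OpenCL extension headers, and planar formats like NV12 need one image per plane with a plane-selection property:

```c
#include <OpenCL/opencl.h>
#include <CoreVideo/CoreVideo.h>

/* Hypothetical sketch: wrap a decoder-output CVPixelBufferRef as an
 * OpenCL image without copying pixels. */
static cl_mem image_from_pixel_buffer(cl_context ctx,
                                      CVPixelBufferRef pixbuf,
                                      cl_int *err)
{
    /* Step 1: the IOSurface backing the decoder output. The pixel
     * buffer must be IOSurface-backed, which VideoToolbox output
     * normally is. */
    IOSurfaceRef surface = CVPixelBufferGetIOSurface(pixbuf);

    cl_image_format fmt  = { CL_BGRA, CL_UNORM_INT8 };
    cl_image_desc   desc = {
        .image_type   = CL_MEM_OBJECT_IMAGE2D,
        .image_width  = CVPixelBufferGetWidth(pixbuf),
        .image_height = CVPixelBufferGetHeight(pixbuf),
    };

    /* Step 2: wrap the IOSurface as a cl_mem; no pixel copy happens. */
    return clCreateImageFromIOSurfaceWithPropertiesAPPLE(
        ctx, CL_MEM_READ_ONLY, &fmt, &desc, surface, NULL, err);

    /* Step 3 is up to the caller: enqueue kernels on the returned
     * cl_mem, and keep the CVPixelBufferRef alive (retained) until the
     * cl_mem has been released. */
}
```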

With Apple's OpenCL extension in hand, adding VideoToolbox-to-OpenCL interop to FFmpeg roughly involves:

1. Adding the derive capability, i.e. deriving an OpenCL device from a VideoToolbox device. This is in practice an almost empty implementation (by contrast, deriving an OpenCL device from VAAPI, DRM, or D3D11 is fairly involved; browse the source if you are curious)

2. Adding the map from AV_PIX_FMT_VIDEOTOOLBOX to AV_PIX_FMT_OPENCL, which is exactly the process described above: creating an OpenCL image from a CVPixelBufferRef with clCreateImageFromIOSurfaceWithPropertiesAPPLE

FFmpeg hwcontext frame transfer/conversion falls into two categories, transfer and map:

1. transfer generally involves a copy; depending on the hwcontext and the frame, it may be CPU-to-GPU, GPU-to-CPU, or GPU-to-GPU

2. map generally avoids a copy, but not always. The map operation actually defines several flags:


/**
 * Flags to apply to frame mappings.
 */
enum {
    /**
     * The mapping must be readable.
     */
    AV_HWFRAME_MAP_READ      = 1 << 0,
    /**
     * The mapping must be writeable.
     */
    AV_HWFRAME_MAP_WRITE     = 1 << 1,
    /**
     * The mapped frame will be overwritten completely in subsequent
     * operations, so the current frame data need not be loaded.  Any values
     * which are not overwritten are unspecified.
     */
    AV_HWFRAME_MAP_OVERWRITE = 1 << 2,
    /**
     * The mapping must be direct.  That is, there must not be any copying in
     * the map or unmap steps.  Note that performance of direct mappings may
     * be much lower than normal memory.
     */
    AV_HWFRAME_MAP_DIRECT    = 1 << 3,
};

map's main use case is mapping a frame from one hardware acceleration API into another, such as mapping a VideoToolbox output frame to an OpenCL image here. Another common scenario is mapping a GPU frame for CPU read/write; whether a copy happens depends on the implementation.
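At the API level, both scenarios go through av_hwframe_map(). A hedged sketch of the first scenario (names and setup are assumed; in particular, this presumes the destination frames context was derived from the decoder's device):

```c
#include <libavutil/frame.h>
#include <libavutil/hwcontext.h>

/* Sketch: map a hardware frame (e.g. AV_PIX_FMT_VIDEOTOOLBOX) into an
 * OpenCL representation of the same underlying surface. */
static int map_to_opencl(AVFrame *dst, const AVFrame *src)
{
    /* Request the target representation before mapping. */
    dst->format = AV_PIX_FMT_OPENCL;

    /* DIRECT forbids hidden copies in the map/unmap steps; drop it if a
     * fallback copy would be acceptable. */
    return av_hwframe_map(dst, src,
                          AV_HWFRAME_MAP_READ | AV_HWFRAME_MAP_DIRECT);
}
```

On success, dst holds OpenCL image handles referring to the same pixels as src, and av_frame_unref(dst) later triggers the unmap.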

Example FFmpeg command line:

./ffmpeg -hwaccel videotoolbox \
  -hwaccel_output_format videotoolbox_vld \
  -i foo.mp4 \
  -vf hwmap=derive_device=opencl,transpose_opencl=dir=clock,hwmap,format=nv12 \
  -c:v hevc_videotoolbox \
  -c:a copy \
  -b:v 2M -tag:v hvc1 bar.mp4

The overall pipeline is VideoToolbox decode ==> OpenCL processing ==> VideoToolbox encode. One rough edge: there is no conversion from OpenCL image back to CVPixelBufferRef, i.e. the derive we implemented is one-way. Decoded frames go straight to OpenCL, but there is no direct OpenCL-to-encoder path. Fortunately, OpenCL can map its output images for CPU access, so there is some chance the OpenCL-to-encoder step avoids a CPU copy.
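That final CPU-visible step would use standard OpenCL mapping. A sketch, with illustrative names; whether the map is actually zero-copy depends on where the implementation placed the image:

```c
#include <OpenCL/opencl.h>

/* Sketch: map the OpenCL output image into host-visible memory so the
 * encoder (or any CPU consumer) can read it. */
static void *map_for_cpu(cl_command_queue queue, cl_mem image,
                         size_t width, size_t height, size_t *row_pitch)
{
    const size_t origin[3] = { 0, 0, 0 };
    const size_t region[3] = { width, height, 1 };
    cl_int err;

    /* Blocking map for reading; the returned pointer stays valid until
     * clEnqueueUnmapMemObject() is called on it. */
    return clEnqueueMapImage(queue, image, CL_TRUE, CL_MAP_READ,
                             origin, region, row_pitch, NULL,
                             0, NULL, NULL, &err);
}
```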

Let the unloved OpenCL shine a little longer!

My patch is here:

https://patchwork.ffmpeg.org/project/ffmpeg/list/?series=10879

Author: quink
Source: WeChat official account "Fun With FFmpeg"
Original: https://mp.weixin.qq.com/s/e38CTCZbk-FveQtGGKT_dw
