Android NEON Optimization: A Practical Example

Source: web

Date: 2022-12-26

Setting Up the Environment

First, create a new project that includes native (C++) code.

Then enable NEON support in the Gradle build file:

        externalNativeBuild {
            cmake {
                cppFlags "-std=c++14"
                arguments "-DANDROID_ARM_NEON=TRUE"
            }
        }

With this flag set, the project is built with NEON support.
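One way to confirm that the flag took effect is to check the `__ARM_NEON` macro, which ACLE-conforming compilers such as GCC and Clang define when NEON is available. The following is a minimal sketch under that assumption; `kBuiltWithNeon` and `neon_build_status` are hypothetical names and are not part of the original project.

```cpp
// Hypothetical helper (not in the original project): reports whether this
// translation unit was compiled with NEON enabled.
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
constexpr bool kBuiltWithNeon = true;
#else
constexpr bool kBuiltWithNeon = false;
#endif

// Returns a short human-readable status string, suitable for logging.
inline const char* neon_build_status() {
    return kBuiltWithNeon ? "NEON enabled" : "NEON not enabled";
}
```

Logging `neon_build_status()` once at startup (for example with the LOGD macro used later in this article) shows which path was compiled.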

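A First Test

The simplest NEON programming flow looks roughly like this: (1) load data into NEON registers, (2) perform the computation, (3) write the results from the NEON registers back to memory.

It is hard to explain without an example, so here is a very simple one: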

#include <jni.h>
#include <string>
#include <arm_neon.h>
#include <android/log.h>
#define LOG_TAG "TEST_NEON"
#define LOGD(...) __android_log_print(ANDROID_LOG_DEBUG, LOG_TAG, __VA_ARGS__)
#define LOGI(...) __android_log_print(ANDROID_LOG_INFO, LOG_TAG, __VA_ARGS__)
extern "C"{
void test()
{
    int16_t result[8];
    int8x8_t a = vdup_n_s8(121);
    int8x8_t b = vdup_n_s8(2);
    int16x8_t c;
    c = vmull_s8(a,b);
    vst1q_s16(result,c);
    for(int i=0;i<8;i++){
        LOGD("data[%d] is %d ",i,result[i]);
    }
}
JNIEXPORT jstring
JNICALL
Java_com_example_javer_myapplication_MainActivity_stringFromJNI(
        JNIEnv *env,
        jobject /* this */) {
    std::string hello = "Hello from C++";
    test();
    return env->NewStringUTF(hello.c_str());
}
}

Output:

09-07 12:03:08.335 11709-11709/? D/TEST_NEON:
data[0] is 242
data[1] is 242
data[2] is 242
data[3] is 242
data[4] is 242
data[5] is 242
data[6] is 242
data[7] is 242

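In this code, the test function multiplies two 64-bit NEON registers.

vdup is the duplicate instruction. Here it copies the 8-bit value 121 into every lane of a 64-bit register. A 64-bit register holds eight 8-bit values, so the NEON register behind a now contains eight copies of 121.

The product of two 8-bit values may need 16 bits, so the result has to be stored in a 128-bit register. The type int16x8_t represents such a 128-bit register (eight 16-bit lanes).

vmull_s8 multiplies a and b lane by lane and stores the widened result in c. Because c lives in a 128-bit NEON register, the result still has to be written back to memory.

vst1q_s16 writes the data in c back to the memory that result points to.

This is a simple test of NEON instructions, and it makes the principle behind NEON acceleration clear: eight 8-bit values are loaded into a 64-bit register at once, and a single instruction multiplies two vectors of eight 8-bit elements.

Does that make it nearly 8 times faster? Not quite, because loading the data and writing it back also take time.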
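To make the lane-by-lane behaviour concrete, here is a scalar model of the widening multiply in portable C++. It is only an illustration of the semantics described above, not how NEON is implemented; `vmull_s8_model` is a hypothetical name and the function uses no NEON instructions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of vmull_s8: multiply eight signed 8-bit lanes pairwise and
// widen each product to 16 bits.
inline std::array<int16_t, 8> vmull_s8_model(const std::array<int8_t, 8>& a,
                                             const std::array<int8_t, 8>& b) {
    std::array<int16_t, 8> c{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        // Widen before multiplying so that 121 * 2 = 242 is kept intact
        // instead of being truncated to the int8_t range [-128, 127].
        c[lane] = static_cast<int16_t>(static_cast<int16_t>(a[lane]) *
                                       static_cast<int16_t>(b[lane]));
    }
    return c;
}
```

With both inputs filled as in the test function (eight copies of 121 and eight copies of 2), every lane of the result is 242, matching the log output above.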

A Practical Attempt

Next, let's try a fairly simple piece of code that converts RGB to grayscale:

void normal_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    for (i=0; i<n; i++)
    {
        int r = *src++; // load red
        int g = *src++; // load green
        int b = *src++; // load blue
        // build weighted average:
        int y = (r*77)+(g*151)+(b*28);
        // undo the scale by 256 and write to memory:
        *dest++ = (y>>8);
    }
}
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    n/=8;
    for (i=0; i<n; i++)
    {
        uint16x8_t  temp;
        uint8x8x3_t rgb  = vld3_u8 (src);
        uint8x8_t result;
        temp = vmull_u8 (rgb.val[0],      rfac);
        temp = vmlal_u8 (temp,rgb.val[1], gfac);
        temp = vmlal_u8 (temp,rgb.val[2], bfac);
        result = vshrn_n_u16 (temp, 8);
        vst1_u8 (dest, result);
        src  += 8*3;
        dest += 8;
    }
}
void test1()
{
    // Prepare a test image, generated in software, laid out as rgb rgb ...
    uint32_t const array_size = 2048*2048;
    uint8_t * rgb = new uint8_t[array_size*3];
    for(uint32_t i=0;i<array_size;i++){
        rgb[i*3]=234;
        rgb[i*3+1]=94;
        rgb[i*3+2]=23;
    }
    // The grayscale image is 1/3 the size of the RGB image
    uint8_t * gray = new uint8_t[array_size];
    struct timeval tv1,tv2;
    gettimeofday(&tv1,NULL);
    normal_convert(gray,rgb,array_size);
    gettimeofday(&tv2,NULL);
    LOGD("pure cpu cost time:%ld",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec));
    gettimeofday(&tv1,NULL);
    neon_convert(gray,rgb,array_size);
    gettimeofday(&tv2,NULL);
    LOGD("neon cost time:%ld",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec));
    delete[] rgb;
    delete[] gray;
}
JNIEXPORT jstring
JNICALL
Java_com_example_javer_myapplication_MainActivity_stringFromJNI(
        JNIEnv *env,
        jobject /* this */) {
    std::string hello = "Hello from C++";
    test1();
    return env->NewStringUTF(hello.c_str());
}

I won't explain each instruction one by one; read the code alongside the NEON instruction set reference.
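One point that helps when reading the intrinsics: the three weights 77, 151 and 28 sum to 256, so shifting right by 8 brings the weighted sum back into the 0 to 255 range. The largest possible sum, 255 * 256 = 65280, still fits in a 16-bit lane, which is why a uint16x8_t accumulator is enough. The scalar sketch below checks both facts at compile time; the constant names and `gray_from_rgb` are hypothetical and only used for illustration.

```cpp
#include <cstdint>

constexpr int kRedWeight = 77;
constexpr int kGreenWeight = 151;
constexpr int kBlueWeight = 28;

// The weights are fixed-point fractions with a denominator of 256.
static_assert(kRedWeight + kGreenWeight + kBlueWeight == 256,
              "weights must sum to 256 for the >> 8 to normalise the result");
// The worst-case weighted sum must fit in an unsigned 16-bit lane.
static_assert(255 * (kRedWeight + kGreenWeight + kBlueWeight) <= UINT16_MAX,
              "weighted sum must not overflow a 16-bit accumulator");

// Scalar reference for one pixel, same arithmetic as normal_convert.
constexpr uint8_t gray_from_rgb(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint8_t>(
        (r * kRedWeight + g * kGreenWeight + b * kBlueWeight) >> 8);
}

// The test pixel used in this article (234, 94, 23) maps to gray level 128:
// 234*77 + 94*151 + 23*28 = 32856, and 32856 >> 8 = 128.
static_assert(gray_from_rgb(234, 94, 23) == 128, "unexpected gray level");
```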

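The plain C version takes 53 ms and the NEON intrinsics version takes 43 ms. The improvement is very limited and far from the nearly 8x that was expected. The main reason is that the assembly the compiler generates from the C code is not compact enough, which greatly reduces efficiency (this article does not state the compiler optimization level used, which strongly affects how well intrinsics compile). So next, the code is optimized with hand-written assembly.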
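The timings in this article come from single runs measured with gettimeofday. As a hedged aside, a slightly more robust way to compare the variants is to run each one several times after a warm-up call and keep the fastest run. The helper below is a minimal sketch of that idea using std::chrono; `best_time_us` is a hypothetical name, not part of the original code.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Calls fn once as a warm-up, then `runs` more times (runs must be >= 1),
// and returns the fastest timed run in microseconds.
template <typename Fn>
int64_t best_time_us(Fn&& fn, int runs = 5) {
    fn();  // warm-up: the first touch of freshly allocated buffers is not timed
    int64_t best = std::numeric_limits<int64_t>::max();
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const int64_t us =
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                .count();
        best = std::min(best, us);
    }
    return best;
}
```

It could be used as `best_time_us([&] { neon_convert(gray, rgb, array_size); })`, with the same lambda pattern for the other variants.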

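Adding Assembly Support to CMake

To compile assembly files with CMake, declare support for the assembly language in CMakeLists.txt by adding ENABLE_LANGUAGE(ASM), then add the assembly file to the library's sources. The complete CMakeLists.txt is shown here for reference: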

# For more information about using CMake with Android Studio, read the
# documentation: https://d.android.com/studio/projects/add-native-code.html
# Sets the minimum version of CMake required to build the native library.
cmake_minimum_required(VERSION 3.4.1)
# Creates and names a library, sets it as either STATIC
# or SHARED, and provides the relative paths to its source code.
# You can define multiple libraries, and CMake builds them for you.
# Gradle automatically packages shared libraries with your APK.
ENABLE_LANGUAGE(ASM)
add_library( # Sets the name of the library.
             native-lib
             # Sets the library as a shared library.
             SHARED
             # Provides a relative path to your source file(s).
             src/main/cpp/Neon.S
             src/main/cpp/native-lib.cpp
             )
# Searches for a specified prebuilt library and stores the path as a
# variable. Because CMake includes system libraries in the search path by
# default, you only need to specify the name of the public NDK library
# you want to add. CMake verifies that the library exists before
# completing its build.
find_library( # Sets the name of the path variable.
              log-lib
              # Specifies the name of the NDK library that
              # you want CMake to locate.
              log )
# Specifies libraries CMake should link to your target library. You
# can link multiple libraries, such as libraries you define in this
# build script, prebuilt third-party libraries, or system libraries.
target_link_libraries( # Specifies the target library.
                       native-lib
                       # Links the target library to the log library
                       # included in the NDK.
                       ${log-lib} )

Implementing the NEON Optimization in Assembly

Then declare the function in the cpp file:

void neon_asm_convert(uint8_t * dest, uint8_t * src,int n);

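Note that this declaration must sit inside the extern "C" block. Then implement the neon_asm_convert function in Neon.S (this is 32-bit ARM assembly, so it applies to the armeabi-v7a ABI):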

.globl neon_asm_convert
neon_asm_convert:
      # r0: Ptr to destination data
      # r1: Ptr to source data
      # r2: Iteration count:
      push        {r4-r5,lr}
      lsr         r2, r2, #3
      # build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5
  .loop:
      # load 8 pixels:
      vld3.8      {d0-d2}, [r1]!
      # do the weight average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5
      # shift and store:
      vshrn.u16   d6, q3, #8
      vst1.8      {d6}, [r0]!
      subs        r2, r2, #1
      bne         .loop
      pop         { r4-r5, pc }
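Both neon_convert and the assembly routine process pixels in groups of eight, so any leftover pixels are skipped when n is not a multiple of 8. In the assembly version, a call with n below 8 makes the loop counter start at zero and wrap around on the first `subs`. The wrapper below is a minimal sketch of one way to handle arbitrary lengths: vector blocks first, then a scalar tail. `convert_any_length` is a hypothetical name; on builds without NEON the guarded section is skipped and the scalar loop does all the work.

```cpp
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical wrapper (not in the original article): converts n RGB pixels
// to grayscale for any n >= 0.
inline void convert_any_length(uint8_t* dest, const uint8_t* src, int n) {
    int i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t rfac = vdup_n_u8(77);
    const uint8x8_t gfac = vdup_n_u8(151);
    const uint8x8_t bfac = vdup_n_u8(28);
    // Full blocks of eight pixels, same arithmetic as neon_convert.
    for (; i + 8 <= n; i += 8) {
        const uint8x8x3_t rgb = vld3_u8(src + 3 * i);
        uint16x8_t sum = vmull_u8(rgb.val[0], rfac);
        sum = vmlal_u8(sum, rgb.val[1], gfac);
        sum = vmlal_u8(sum, rgb.val[2], bfac);
        vst1_u8(dest + i, vshrn_n_u16(sum, 8));
    }
#endif
    // Scalar tail: the remaining 0..7 pixels (or every pixel without NEON).
    for (; i < n; ++i) {
        const int r = src[3 * i];
        const int g = src[3 * i + 1];
        const int b = src[3 * i + 2];
        dest[i] = static_cast<uint8_t>((r * 77 + g * 151 + b * 28) >> 8);
    }
}
```

The same pattern can wrap the assembly routine: call neon_asm_convert only when at least eight pixels are available, pass the count rounded down to a multiple of 8, and finish the remainder with the scalar loop.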

To verify that the results are correct, I wrote a dedicated comparison function:

int compare(uint8_t *a,uint8_t* b,int n)
{
    for(int i=0;i<n;i++){
        if(a[i]!=b[i]){
            return -1;
        }
    }
    return 0;
}
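When the buffers do differ, it is often useful to know where. As an optional variation, here is a sketch built on std::mismatch that returns the index of the first differing byte. `first_mismatch` is a hypothetical name, and its return convention differs from compare: -1 means the buffers are equal, any other value is the offending index.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Returns the index of the first byte where a and b differ,
// or -1 if the first n bytes are identical.
inline std::ptrdiff_t first_mismatch(const uint8_t* a, const uint8_t* b,
                                     std::size_t n) {
    const auto diff = std::mismatch(a, a + n, b);
    return diff.first == a + n ? -1 : diff.first - a;
}
```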

and printed its result after the elapsed time:

LOGD("neon c cost time:%ld,result is %d",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec),result);

Comparison of the three versions:

09-07 17:12:19.946 25861-25861/com.example.javer.myapplication D/TEST_NEON: pure cpu cost time:57073
09-07 17:12:20.012 25861-25861/com.example.javer.myapplication D/TEST_NEON: neon c cost time:45460,result is 0
09-07 17:12:20.034 25861-25861/com.example.javer.myapplication D/TEST_NEON: neon asm cost time:3397,result is 0
09-07 17:12:25.271 25861-25861/com.example.javer.myapplication D/TEST_NEON: pure cpu cost time:57404
09-07 17:12:25.336 25861-25861/com.example.javer.myapplication D/TEST_NEON: neon c cost time:45166,result is 0
09-07 17:12:25.359 25861-25861/com.example.javer.myapplication D/TEST_NEON: neon asm cost time:3493,result is 0

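In the end, the assembly version produces exactly the same result and is more than 16 times faster than the plain C version! I can hardly believe the gain is this large, yet the compared outputs are identical.

If there is a problem with the program, I would be grateful if someone could point it out.

Finally, the complete code. native-lib.cpp: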

#include <jni.h>
#include <string>
#include <arm_neon.h>
#include <android/log.h>
#include <sys/time.h>  // gettimeofday, struct timeval
#define LOG_TAG "TEST_NEON"
#define LOGD(...) __android_log_print(ANDROID_LOG_DEBUG, LOG_TAG, __VA_ARGS__)
#define LOGI(...) __android_log_print(ANDROID_LOG_INFO, LOG_TAG, __VA_ARGS__)
extern "C"{
void neon_asm_convert(uint8_t * dest, uint8_t * src,int n);
void test()
{
    int16_t result[8];
    int8x8_t a = vdup_n_s8(121);
    int8x8_t b = vdup_n_s8(2);
    int16x8_t c;
    c = vmull_s8(a,b);
    vst1q_s16(result,c);
    for(int i=0;i<8;i++){
        LOGD("data[%d] is %d ",i,result[i]);
    }
}
void normal_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    for (i=0; i<n; i++)
    {
        int r = *src++; // load red
        int g = *src++; // load green
        int b = *src++; // load blue
        // build weighted average:
        int y = (r*77)+(g*151)+(b*28);
        // undo the scale by 256 and write to memory:
        *dest++ = (y>>8);
    }
}
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    n/=8;
    for (i=0; i<n; i++)
    {
        uint16x8_t  temp;
        uint8x8x3_t rgb  = vld3_u8 (src);
        uint8x8_t result;
        temp = vmull_u8 (rgb.val[0],      rfac);
        temp = vmlal_u8 (temp,rgb.val[1], gfac);
        temp = vmlal_u8 (temp,rgb.val[2], bfac);
        result = vshrn_n_u16 (temp, 8);
        vst1_u8 (dest, result);
        src  += 8*3;
        dest += 8;
    }
}
int compare(uint8_t *a,uint8_t* b,int n)
{
    for(int i=0;i<n;i++){
        if(a[i]!=b[i]){
            return -1;
        }
    }
    return 0;
}
void test1()
{
    // Prepare a test image, generated in software, laid out as rgb rgb ...
    uint32_t const array_size = 2048*2048;
    uint8_t * rgb = new uint8_t[array_size*3];
    for(uint32_t i=0;i<array_size;i++){
        rgb[i*3]=234;
        rgb[i*3+1]=94;
        rgb[i*3+2]=23;
    }
    // The grayscale image is 1/3 the size of the RGB image
    uint8_t * gray_cpu = new uint8_t[array_size];
    uint8_t * gray_neon = new uint8_t[array_size];
    uint8_t * gray_neon_asm = new uint8_t[array_size];
    struct timeval tv1,tv2;
    gettimeofday(&tv1,NULL);
    normal_convert(gray_cpu,rgb,array_size);
    gettimeofday(&tv2,NULL);
    LOGD("pure cpu cost time:%ld",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec));
    gettimeofday(&tv1,NULL);
    neon_convert(gray_neon,rgb,array_size);
    gettimeofday(&tv2,NULL);
    // compare returns 0 on match and -1 on mismatch, so keep it as an int
    int result = compare(gray_cpu,gray_neon,array_size);
    LOGD("neon c cost time:%ld,result is %d",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec),result);
    gettimeofday(&tv1,NULL);
    neon_asm_convert(gray_neon_asm,rgb,array_size);
    gettimeofday(&tv2,NULL);
    result = compare(gray_cpu,gray_neon_asm,array_size);
    LOGD("neon asm cost time:%ld,result is %d",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec),result);
    delete[] rgb;
    delete[] gray_cpu;
    delete[] gray_neon;
    delete[] gray_neon_asm;
}
JNIEXPORT jstring
JNICALL
Java_com_example_javer_myapplication_MainActivity_stringFromJNI(
        JNIEnv *env,
        jobject /* this */) {
    std::string hello = "Hello from C++";
    test1();
    return env->NewStringUTF(hello.c_str());
}
}

Neon.S

.globl neon_asm_convert
neon_asm_convert:
      # r0: Ptr to destination data
      # r1: Ptr to source data
      # r2: Iteration count:
      push        {r4-r5,lr}
      lsr         r2, r2, #3
      # build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5
  .loop:
      # load 8 pixels:
      vld3.8      {d0-d2}, [r1]!
      # do the weight average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5
      # shift and store:
      vshrn.u16   d6, q3, #8
      vst1.8      {d6}, [r0]!
      subs        r2, r2, #1
      bne         .loop
      pop         { r4-r5, pc }
