RK3588 Edge Computing Performance Benchmarks

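Introduction

The RK3588 is Rockchip's new-generation flagship AI SoC, built on an 8nm process and equipped with an NPU rated at 6 TOPS. In agent-automation scenarios we need to run AI models on edge devices for tasks such as image recognition and object detection. This article measures how the RK3588 actually performs when controlling 64 concurrent streams.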

Hardware configuration

The test device was configured as follows:

Component	Spec
CPU	RK3588 (4xCortex-A76 + 4xCortex-A55)
NPU	6 TOPS INT8
Memory	8GB LPDDR4X
Storage	128GB eMMC
OS	Ubuntu 22.04 LTS

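Test scenarios

We designed three representative test scenarios:

Basic recognition: single-stream image classification
Concurrency test: object detection on 64 streams at once
Stress test: mixed workload (recognition + OCR + speech)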

Performance results

1. Single-stream inference performance

We first measure the baseline latency of a single model running on its own:

import time
import numpy as np
from rknn.api import RKNN

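def benchmark_single_inference(model_path, input_data, iterations=1000):
    """Benchmark single-stream inference latency."""
    rknn = RKNN()
    # Both calls return 0 on success
    if rknn.load_rknn(model_path) != 0:
        raise RuntimeError(f'failed to load {model_path}')
    if rknn.init_runtime() != 0:
        raise RuntimeError('failed to initialize the RKNN runtime')

    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        outputs = rknn.inference(inputs=[input_data])
        end = time.perf_counter()
        times.append((end - start) * 1000)  # convert to milliseconds

    rknn.release()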
    return {
        'mean': np.mean(times),
        'std': np.std(times),
        'p50': np.percentile(times, 50),
        'p95': np.percentile(times, 95),
        'p99': np.percentile(times, 99),
    }

Results:

Model	Input size	Mean (ms)	P95 (ms)	P99 (ms)
MobileNetV3	224x224	2.1	3.2	4.5
YOLOv5s	640x640	12.5	18.3	25.6
EfficientNet-B0	224x224	3.8	5.1	6.9
CRNN (OCR)	dynamic	8.2	12.4	18.7
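
As a reading aid, the mean latencies above can be converted into an upper bound on single-stream throughput. This is a minimal sketch that assumes inferences run strictly one after another with no pre- or post-processing cost; the benchmark itself reports no throughput figure, and `max_fps` is an illustrative helper name.

```python
# Mean single-stream latencies (ms) copied from the results table above.
MEAN_LATENCY_MS = {
    "MobileNetV3": 2.1,
    "YOLOv5s": 12.5,
    "EfficientNet-B0": 3.8,
    "CRNN (OCR)": 8.2,
}

def max_fps(mean_latency_ms: float) -> float:
    """Upper bound on inferences per second for strictly serial execution."""
    return 1000.0 / mean_latency_ms

if __name__ == "__main__":
    for name, ms in MEAN_LATENCY_MS.items():
        print(f"{name}: at most {max_fps(ms):.1f} inferences/sec")
```

Real pipelines will land below these bounds once image decoding, resizing, and result parsing are included.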

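2. 64-stream concurrency test

This is the scenario we care about most: controlling 64 phones from a single device while running AI inference for each of them: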

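import asyncio
from concurrent.futures import ThreadPoolExecutor

class ConcurrentInferenceManager:
    def __init__(self, infer_fn, max_concurrent=64):
        # infer_fn: blocking callable that takes one screenshot and returns
        # the inference result (for example a wrapper around rknn.inference)
        self.infer_fn = infer_fn
        self.max_concurrent = max_concurrent
        self.semaphore = asyncio.Semaphore(max_concurrent)
        # Blocking inference runs in worker threads so the event loop stays responsive
        self.executor = ThreadPoolExecutor(max_workers=max_concurrent)

    async def run_inference(self, screenshot):
        """Run the blocking inference call in the thread pool."""
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.executor, self.infer_fn, screenshot)

    async def process_phone(self, phone_id, screenshot):
        async with self.semaphore:
            # Each phone gets its own inference call
            result = await self.run_inference(screenshot)
            return (phone_id, result)

    async def run_all(self, screenshots: dict):
        """Process the screenshots from all phones concurrently."""
        tasks = [
            self.process_phone(phone_id, img)
            for phone_id, img in screenshots.items()
        ]
        return await asyncio.gather(*tasks)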

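Test configuration:

Concurrency: 64 streams
Model: YOLOv5s (optimized build)
Input resolution: 320x320
Batching: disabled (latency takes priority)

Results:

Metric	Value
Total throughput	142 images/sec
Mean per-stream latency	712 ms
P99 latency	1,250 ms
Mean CPU usage	45%
Mean NPU usage	78%
Memory usage	6.2 GB / 8 GB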
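
Throughput and latency figures like those in the table above could be gathered with a harness along the following lines. This is a minimal sketch, not the harness used for these measurements: `fake_inference` is a stand-in that only sleeps, and `measure` and `percentile` are hypothetical helper names. Latency is timed from submission, so it includes any time spent waiting for a semaphore slot.

```python
import asyncio
import math
import time

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

async def fake_inference(delay_s):
    """Stand-in for a real inference call; it only sleeps."""
    await asyncio.sleep(delay_s)

async def measure(num_streams, max_concurrent, delay_s=0.01):
    """Run one fake inference per stream and summarize throughput and latency."""
    semaphore = asyncio.Semaphore(max_concurrent)
    latencies_ms = []

    async def one_stream():
        submitted = time.perf_counter()
        async with semaphore:
            await fake_inference(delay_s)
        latencies_ms.append((time.perf_counter() - submitted) * 1000)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_stream() for _ in range(num_streams)))
    wall_s = time.perf_counter() - wall_start

    return {
        "count": len(latencies_ms),
        "throughput_per_s": len(latencies_ms) / wall_s,
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "p99_ms": percentile(latencies_ms, 99),
    }

if __name__ == "__main__":
    print(asyncio.run(measure(num_streams=64, max_concurrent=8)))
```

Swapping `fake_inference` for a real call keeps the bookkeeping unchanged, which makes it easy to compare different concurrency limits.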

3. Resource utilization analysis

Continuous monitoring gave the following picture of resource usage:

┌─────────────────────────────────────────────────────────────┐
│                    NPU utilization trend                    │
├─────────────────────────────────────────────────────────────┤
│ 100% │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│     │
│  80% │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│     │
│  60% │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│     │
│  40% │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│     │
│  20% │      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│     │
│   0% └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴┘     │
│      0s   10s   20s   30s   40s   50s   60s   70s   80s     │
│                                                             │
│  ▓▓▓ NPU usage    ░░░ CPU usage                             │
└─────────────────────────────────────────────────────────────┘

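Key findings:

The NPU is the bottleneck: under full load its utilization approaches 80%, making it the main constraint
Memory capacity: 8GB is somewhat tight for 64 concurrent streams
Thermals: after 30 minutes of sustained full load, slight frequency throttling appeared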
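
For reference, NPU utilization can be sampled on the board itself. The sketch below assumes the Rockchip NPU driver exposes per-core load at `/sys/kernel/debug/rknpu/load` as a string such as `NPU load:  Core0: 10%, Core1: 20%, Core2: 30%,`. The path, the format, and the need for root access depend on the kernel and driver version, so treat all three as assumptions. The function names and the sample string are illustrative, not data from this test.

```python
import re
from pathlib import Path

# Assumed debugfs location of the NPU load counter; may differ between kernels.
NPU_LOAD_PATH = Path("/sys/kernel/debug/rknpu/load")

def parse_npu_load(text: str) -> list[int]:
    """Extract per-core load percentages from the driver's load string."""
    return [int(value) for value in re.findall(r"(\d+)%", text)]

def read_npu_load(path: Path = NPU_LOAD_PATH) -> list[int]:
    """Read and parse the current NPU load; usually requires root."""
    return parse_npu_load(path.read_text())

def mean_load(per_core: list[int]) -> float:
    """Average load across cores, or 0.0 if nothing was parsed."""
    return sum(per_core) / len(per_core) if per_core else 0.0

if __name__ == "__main__":
    # Illustrative sample string, not a measurement from this test.
    sample = "NPU load:  Core0: 10%, Core1: 20%, Core2: 30%,"
    cores = parse_npu_load(sample)
    print(cores, mean_load(cores))
```

Calling `read_npu_load` once per second during a run and averaging the samples would give a figure comparable to the mean NPU usage reported above.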

Optimization recommendations

Based on these results, we suggest the following optimizations:

1. Model quantization

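from rknn.api import RKNN

# Quantize the FP32 model to INT8: weights shrink about 4x (8-bit instead of 32-bit)
rknn = RKNN()
rknn.config(mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]], target_platform='rk3588')
rknn.load_onnx(model='yolov5s.onnx')  # assumed source file; quantization starts from the FP32 export
rknn.build(do_quantization=True, dataset='./dataset.txt')  # assumed list of calibration image paths
rknn.export_rknn('yolov5s_int8.rknn')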

Before and after quantization:

Model	Precision	Model size	Inference speedup
YOLOv5s	FP32	27MB	1x (baseline)
YOLOv5s	INT8	7.2MB	2.8x
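
The drop from 27MB to 7.2MB is a 3.75x reduction, close to the 4x expected when each 32-bit weight is stored in 8 bits. To illustrate the idea, the sketch below applies generic asymmetric (affine) quantization to a handful of values. It is a textbook scheme chosen for illustration, not necessarily the algorithm the RKNN toolkit uses, and `quant_params`, `quantize`, and `dequantize` are hypothetical helper names.

```python
def quant_params(values, qmin=-128, qmax=127):
    """Compute the scale and zero point mapping the value range onto int8."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must include 0
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to int8 codes, clamping to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in codes]

if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.5]
    scale, zp = quant_params(weights)
    codes = quantize(weights, scale, zp)
    restored = dequantize(codes, scale, zp)
    # Each value now needs 1 byte instead of 4; the price is a small rounding error.
    print(codes, restored)
```

The rounding error per value is bounded by half the scale, which is why a representative calibration set matters: it determines the value range and therefore the scale.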

2. Dynamic resolution adjustment

def adaptive_resolution(screen_size, task_type):
    """Pick the model input resolution based on the task type."""
    resolutions = {
        'detection': (320, 320),  # low resolution for object detection
        'classification': (224, 224),
        'ocr': (640, 64),         # wide strip for OCR text lines
        'text_detection': (320, 320),
    }
    return resolutions.get(task_type, (320, 320))
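
`adaptive_resolution` returns only the target size; the screenshot still has to be fitted into it. One common approach, shown here as an assumption rather than the pipeline used in this test, is letterboxing: scale the screen uniformly and pad the remainder. `letterbox_params` is a hypothetical helper, and sizes are assumed to be (width, height) tuples.

```python
def letterbox_params(screen_size, target_size):
    """Return (scaled_w, scaled_h, pad_x, pad_y) for fitting a screen into a target.

    Sizes are (width, height) tuples. The screen is scaled uniformly so that
    it fits inside the target and is then centered; the pad values are the
    left and top offsets in pixels.
    """
    screen_w, screen_h = screen_size
    target_w, target_h = target_size
    scale = min(target_w / screen_w, target_h / screen_h)
    scaled_w = round(screen_w * scale)
    scaled_h = round(screen_h * scale)
    pad_x = (target_w - scaled_w) // 2
    pad_y = (target_h - scaled_h) // 2
    return scaled_w, scaled_h, pad_x, pad_y

if __name__ == "__main__":
    # A 1080x2400 phone screenshot fitted into the 320x320 detection input.
    print(letterbox_params((1080, 2400), (320, 320)))
```

Keeping the aspect ratio this way avoids distorting UI elements, at the cost of spending part of the input on padding for tall phone screens.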

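3. Task scheduling optimization

import asyncio

class TaskScheduler:
    def __init__(self, npu_units=2):
        self.npu_units = npu_units
        # One permit per NPU unit caps how many NPU tasks are in flight
        self.npu_semaphore = asyncio.Semaphore(npu_units)
        self.queue = asyncio.Queue()

    async def schedule(self, task):
        """Queue a task, taking an NPU permit first if the task needs the NPU."""
        if task.requires_npu:
            await self.npu_semaphore.acquire()
        await self.queue.put(task)

    def finish(self, task):
        """Return the NPU permit once a worker has completed the task."""
        if task.requires_npu:
            self.npu_semaphore.release()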
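
A scheduler like this also needs workers that drain the queue and hand NPU permits back when tasks finish. The self-contained sketch below shows one way to wire that up. `Task`, `Scheduler`, `run_demo`, and the sleep-based workload are placeholders rather than part of the original design, and here the permit is taken by the worker instead of at enqueue time.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    requires_npu: bool
    duration_s: float = 0.01  # placeholder for real work

class Scheduler:
    def __init__(self, npu_units=2):
        self.npu_semaphore = asyncio.Semaphore(npu_units)
        self.queue = asyncio.Queue()
        self.active_npu = 0  # NPU tasks currently running
        self.peak_npu = 0    # highest NPU concurrency observed

    async def worker(self):
        """Pull tasks forever; NPU tasks run only while holding a permit."""
        while True:
            task = await self.queue.get()
            try:
                if task.requires_npu:
                    async with self.npu_semaphore:
                        self.active_npu += 1
                        self.peak_npu = max(self.peak_npu, self.active_npu)
                        await asyncio.sleep(task.duration_s)
                        self.active_npu -= 1
                else:
                    await asyncio.sleep(task.duration_s)
            finally:
                self.queue.task_done()

async def run_demo(num_tasks=12, npu_units=2, num_workers=6):
    """Push a mix of NPU and CPU tasks through the scheduler; return peak NPU use."""
    scheduler = Scheduler(npu_units)
    workers = [asyncio.create_task(scheduler.worker()) for _ in range(num_workers)]
    for i in range(num_tasks):
        await scheduler.queue.put(Task(f"task-{i}", requires_npu=(i % 3 != 0)))
    await scheduler.queue.join()  # wait until every task has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return scheduler.peak_npu

if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

Because the semaphore is released by the `async with` block, a failing task cannot leak a permit, which is the main risk when acquire and release live in different places.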

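Conclusion

As an edge computing platform, the RK3588 performs well in agent-automation scenarios:

Strengths: strong NPU compute, well-controlled power draw, mature ecosystem
Limitation: memory capacity limits scaling to larger concurrency
Recommendation: 32 to 48 concurrent streams offers the best price-performance

For scenarios that need more than 64 concurrent streams, we recommend a distributed deployment across multiple RK3588 devices.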
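
As a rough illustration of such a distributed layout, phones can be split across boards so that no board exceeds a chosen per-device limit. The default of 48 below is taken from the upper end of the recommended range; `assign_phones` and the phone IDs are hypothetical.

```python
import math

def assign_phones(phone_ids, per_device_limit=48):
    """Split phone IDs into balanced groups, one group per RK3588 board."""
    if per_device_limit < 1:
        raise ValueError("per_device_limit must be at least 1")
    num_devices = math.ceil(len(phone_ids) / per_device_limit)
    groups = [[] for _ in range(num_devices)]
    for index, phone_id in enumerate(phone_ids):
        # Round-robin assignment keeps the group sizes within one of each other
        groups[index % num_devices].append(phone_id)
    return groups

if __name__ == "__main__":
    groups = assign_phones([f"phone-{i:03d}" for i in range(128)])
    print([len(g) for g in groups])
```

Balancing the groups, rather than filling each board to the limit, leaves every board some headroom for load spikes.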
