Detecting Objects Named in a Prompt with GroundingDINO

26 April 2025

Below is a complete implementation of matting_object_detect. It uses GroundingDINO to detect the objects named in a prompt and returns the detections:

import numpy as np
from PIL import Image
import torch
from groundingdino.util.inference import Model as GroundingDINO

def matting_object_detect(image_path: str, matting_prompt: str, 
                         box_threshold: float = 0.3, 
                         text_threshold: float = 0.25) -> list:
    """
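    Detect objects in an image that match the prompt, using GroundingDINO.
    
    Args:
        image_path: Path to the image file
        matting_prompt: Prompt string (format: "obj1 . obj2 . obj3 .")
        box_threshold: Box confidence threshold (default 0.3)
        text_threshold: Text-visual matching threshold (default 0.25)
    
    Returns:
        list: Detections, each a dict of the form {
            'phrase': the matched prompt phrase,
            'bbox': [x1, y1, x2, y2],
            'confidence': confidence score
        }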
    """
    # Initialize the model (assumes the model files already exist at these paths)
    model_config = "groundingdino/config/GroundingDINO_SwinT_OGC.py"
    model_weights = "groundingdino_swint_ogc.pth"
    grounding_dino = GroundingDINO(
        model_config_path=model_config,
        model_checkpoint_path=model_weights,
        device="cuda" if torch.cuda.is_available() else "cpu"
    )
    
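    # Load the image. PIL yields RGB, but predict_with_caption expects a BGR
    # array (OpenCV convention), so reverse the channel order.
    image_rgb = np.array(Image.open(image_path).convert("RGB"))
    image = np.ascontiguousarray(image_rgb[:, :, ::-1])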
    
    # Run detection; predict_with_caption returns a (detections, phrases) pair
    detections, phrases = grounding_dino.predict_with_caption(
        image=image,
        caption=matting_prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold
    )
    
    # Parse the detections
    results = []
    boxes = detections.xyxy  # boxes, NumPy array of shape [N, 4]
    confidences = detections.confidence  # scores, NumPy array of shape [N]
    
    # phrases is a list of N matched phrases, aligned with the boxes
    for box, conf, phrase in zip(boxes, confidences, phrases):
        results.append({
            'phrase': phrase,
            'bbox': box.tolist(),  # convert to a list [x1, y1, x2, y2]
            'confidence': float(conf)  # NumPy float to Python float
        })
    
    return results

# Usage example
if __name__ == "__main__":
    # Example arguments
    image_path = "test_image.jpg"
    matting_prompt = "cat . dog . backpack ."
    
    # Run detection
    detections = matting_object_detect(image_path, matting_prompt)
    
    # Print the results
    print(f"Detected {len(detections)} object(s):")
    for obj in detections:
        print(f"- {obj['phrase']}: confidence {obj['confidence']:.2f}, bbox {obj['bbox']}")
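
The function above rebuilds the model on every call, so the checkpoint is reloaded each time. One way to avoid that is to cache the constructed model with functools.lru_cache. This is a minimal sketch, not the library's own mechanism: the helper name load_grounding_dino is hypothetical, and only the Model constructor comes from GroundingDINO.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def load_grounding_dino(config_path: str, weights_path: str, device: str):
    """Build the GroundingDINO model once and reuse it on later calls.

    Hypothetical helper: only the Model constructor is GroundingDINO API.
    """
    # Imported lazily so this module can be imported without the package.
    from groundingdino.util.inference import Model as GroundingDINO

    return GroundingDINO(
        model_config_path=config_path,
        model_checkpoint_path=weights_path,
        device=device,
    )
```

Inside matting_object_detect, the GroundingDINO(...) construction could then be replaced by a call to this helper with the same three arguments.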

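Key points:

Input and output:
- Input: image path + prompt (required format: "obj1 . obj2 .")
- Output: a list of detected objects, each containing:
  - phrase: the matched prompt phrase
  - bbox: bounding box in pixel coordinates [x1,y1,x2,y2]
  - confidence: confidence score (0-1)
Parameter tuning:
- box_threshold: minimum confidence for a box to be kept (default 0.3); raise it for fewer, more certain boxes
- text_threshold: minimum text-visual match score for a prompt word to count as part of the matched phrase (default 0.25)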
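
Both thresholds are applied inside the model call, so changing them means running inference again. For quick experiments with a stricter cutoff, one option is to filter the returned list on its 'confidence' field afterwards. The sketch below assumes the result shape documented above; filter_by_confidence is a hypothetical helper, and the sample values are illustrative only.

```python
def filter_by_confidence(results: list, min_confidence: float) -> list:
    """Keep only detections whose confidence is at least min_confidence."""
    return [r for r in results if r["confidence"] >= min_confidence]


# Illustrative input in the shape returned by matting_object_detect
sample = [
    {"phrase": "cat", "bbox": [100.5, 150.2, 300.3, 400.8], "confidence": 0.87},
    {"phrase": "backpack", "bbox": [50.1, 200.4, 180.9, 350.7], "confidence": 0.65},
]

strict = filter_by_confidence(sample, 0.8)
print([r["phrase"] for r in strict])  # prints ['cat']
```

Post-filtering can only tighten the cutoff. Boxes that fell below box_threshold during inference are already gone, so loosening it still requires a new call.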

Example result:

[
    {
        'phrase': 'cat',
        'bbox': [100.5, 150.2, 300.3, 400.8],
        'confidence': 0.87
    },
    {
        'phrase': 'backpack', 
        'bbox': [50.1, 200.4, 180.9, 350.7],
        'confidence': 0.65
    }
]

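Usage notes:

Model files:
- Download the GroundingDINO model files in advance:
  - Config file: GroundingDINO_SwinT_OGC.py
  - Weights file: groundingdino_swint_ogc.pth
- Both are available through the official repository: https://github.com/IDEA-Research/GroundingDINO
Prompt format:
- Separate different objects with " . "
- Example: "person . bicycle . car ."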
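
When the object names come from a list rather than a hand-written string, the prompt can be assembled programmatically. This is a minimal sketch; build_prompt is a hypothetical helper, not part of GroundingDINO.

```python
def build_prompt(class_names: list) -> str:
    """Join class names into a 'name . name .' prompt string."""
    # Drop surrounding whitespace and skip empty entries
    cleaned = [name.strip() for name in class_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty class name is required")
    return " . ".join(cleaned) + " ."


print(build_prompt(["person", "bicycle", "car"]))  # prints: person . bicycle . car .
```
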
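Extensions:
- To visualize the results, draw the detection boxes with OpenCV
- Combine with the SAM model to obtain segmentation masks
Error handling:
- The function as written does not validate the image path or the model files; a missing or unreadable file raises an exception from the underlying library
- In real use, wrap the call in a try/except block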
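
One possible shape for that error handling is a thin wrapper that checks the image path first and turns any failure into an empty result plus a log entry. This is a sketch under stated assumptions: safe_detect is a hypothetical name, and the detection function is passed in as a parameter (expected to behave like matting_object_detect) so the wrapper itself does not depend on GroundingDINO.

```python
import logging
import os

logger = logging.getLogger(__name__)


def safe_detect(detect_fn, image_path: str, prompt: str) -> list:
    """Call detect_fn(image_path, prompt), returning [] if anything fails.

    detect_fn is expected to behave like matting_object_detect.
    """
    if not os.path.isfile(image_path):
        logger.error("Image not found: %s", image_path)
        return []
    try:
        return detect_fn(image_path, prompt)
    except Exception:
        # Covers missing model files, unreadable images, CUDA errors, etc.
        logger.exception("Detection failed for %s", image_path)
        return []
```

Returning an empty list keeps a batch pipeline running. Callers that need to tell "no objects found" apart from "detection failed" may prefer to re-raise instead.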

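The function can be dropped into an image-processing pipeline for tasks such as object detection and content moderation. The returned coordinates can be used directly for downstream cropping or segmentation.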
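
Because the returned boxes are floating-point pixel coordinates, cropping usually needs them rounded to integers and clamped to the image bounds first. The sketch below shows one way to do that; bbox_to_crop_box is a hypothetical helper, and rounding outward is a design choice made here so that the crop fully contains the box.

```python
import math


def bbox_to_crop_box(bbox: list, width: int, height: int) -> tuple:
    """Convert a float [x1, y1, x2, y2] box to integer pixel bounds.

    Rounds outward so the crop fully contains the box, then clamps
    the result to the image size.
    """
    x1, y1, x2, y2 = bbox
    left = max(0, math.floor(x1))
    top = max(0, math.floor(y1))
    right = min(width, math.ceil(x2))
    bottom = min(height, math.ceil(y2))
    return left, top, right, bottom


print(bbox_to_crop_box([100.5, 150.2, 300.3, 400.8], 640, 480))
# prints (100, 150, 301, 401)
```

The tuple follows the (left, upper, right, lower) order that PIL's Image.crop accepts, so it can be passed to crop on an image opened with Image.open.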
