How to compute a regex match's character offset within its line (rather than the global offset) in Java

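This article explains how to convert the global string offset returned by Java's `matcher.start()` into a column position relative to the start of the current line (the in-line offset), which fixes inaccurate positions when multi-line text is processed in batches.

When java.util.regex.Matcher runs over multi-line text (for example, a batch of 1000 lines read and joined into a single \n-separated string), match.start() returns the absolute character index counted from the beginning of the whole string, not the position of the match within its own line. For example: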

Line 1: The Project Gutenberg EBook of The Adventures...
Line 2: by Sir Arthur Conan Doyle

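Here Arthur begins at index 7 of line 2 (0-based; the 8th character when counting from 1). If line 1 together with its trailing newline occupies 71 characters, match.start() returns 71 + 7 = 78, which clearly cannot be used directly for line-level positioning.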
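
To make the gap between the two numbers concrete, here is a minimal sketch. The first line is a shortened, hypothetical stand-in (not the truncated text above), and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GlobalOffsetDemo {
    /** Returns the global start index of the first match of regex in text, or -1 if there is none. */
    static int firstGlobalStart(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String line1 = "The Adventures";             // hypothetical, shortened first line (14 chars)
        String line2 = "by Sir Arthur Conan Doyle";
        String batch = line1 + "\n" + line2;         // two lines joined the way a batch would be

        // Index counted from the start of the whole batch string.
        System.out.println(firstGlobalStart(batch, "Arthur")); // 22 = 14 chars + 1 newline + 7
        // Index counted from the start of line 2 only.
        System.out.println(line2.indexOf("Arthur"));           // 7
    }
}
```

The difference between the two values is exactly the length of everything before line 2, including the newline.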

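✅ The correct approach: convert the global offset to an in-line offset

The core idea: find the last newline before the match position start. The line begins one character after that newline, so the in-line offset is start - (index of that newline + 1).

String.lastIndexOf('\n', start) is a safe way to locate the preceding newline. It returns -1 when there is none, which covers a match on the first line: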

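public int getCharOffsetInLine(String text, int globalStart) {
    // Search backwards from the match position for the nearest '\n'.
    int lastNewline = text.lastIndexOf('\n', globalStart);
    if (lastNewline == -1) {
        return globalStart; // match is on the first line: in-line offset == global offset
    }
    return globalStart - lastNewline - 1; // the extra -1 skips the '\n' itself
}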
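
A quick way to check the helper is to run it against a small three-line string. The sketch below copies the same logic into a standalone class so that it compiles on its own; the class name and the sample text are assumptions made for illustration.

```java
final class LineOffsetCheck {
    // Same logic as getCharOffsetInLine above, repeated so this sketch is self-contained.
    static int getCharOffsetInLine(String text, int globalStart) {
        int lastNewline = text.lastIndexOf('\n', globalStart);
        return lastNewline == -1 ? globalStart : globalStart - lastNewline - 1;
    }

    public static void main(String[] args) {
        // "first line" is 10 characters, so line 2 starts at global index 11.
        String text = "first line\nby Sir Arthur Conan Doyle\nlast";
        int global = text.indexOf("Arthur");

        System.out.println(global);                            // 18
        System.out.println(getCharOffsetInLine(text, global)); // 7
        System.out.println(getCharOffsetInLine(text, 0));      // 0 (first line, no newline before it)
        System.out.println(getCharOffsetInLine(text, 37));     // 0 (first character of "last")
    }
}
```

One edge case is worth knowing: if globalStart points at a '\n' itself (index 10 in this sample), the helper returns -1. Keyword patterns will not normally match a newline, but if yours can, searching from globalStart - 1 instead avoids the negative result.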

Then call it from your matchV1 method:

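import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public List<OffsetResult> matchV1(String source, Integer line) {
    List<OffsetResult> result = new ArrayList<>();
    // keys is assumed to be the keyword collection defined elsewhere in your class.
    Matcher match = Pattern.compile(String.join("|", keys)).matcher(source);
    while (match.find()) {
        int globalStart = match.start(); // index into the whole batch string
        int charOffsetInLine = getCharOffsetInLine(source, globalStart);
        result.add(new OffsetResult(match.group(), line, charOffsetInLine));
    }
    return result;
}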

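⚠️ Notes:

Search for '\n' rather than System.lineSeparator(). Files.lines() splits on any of \n, \r and \r\n, and the batch is then re-joined with \n only. On Windows System.lineSeparator() is \r\n, so lastIndexOf("\r\n", ...) would never find anything in the joined string.

If the source string may still contain its original \r\n or lone \r line endings (for example, when the file is read as a whole instead of being re-joined), normalize it first with source = source.replace("\r\n", "\n").replace("\r", "\n") and compute everything against \n. Run the matcher on the normalized string as well, because normalization shifts global offsets.

The lineOffset field of OffsetResult currently receives the batch's starting line number (e.g. startLine=1000). To report the physical line the match is actually on, compute actualLine = line + countNewlinesBefore(source, globalStart) + 1, where countNewlinesBefore counts the \n characters in source.substring(0, globalStart). The + 1 yields a 1-based line number when line is the number of lines skipped before the batch; drop it if line is already 1-based.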
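
The notes above name countNewlinesBefore without showing it. A minimal sketch might look like the following; the class name, the normalizeNewlines and actualLine helpers, and the parameter name linesSkipped are assumptions introduced here, not part of the original code.

```java
final class LineNumberSketch {
    /** Rewrites \r\n and lone \r as \n so that only '\n' has to be searched for. */
    static String normalizeNewlines(String source) {
        return source.replace("\r\n", "\n").replace("\r", "\n");
    }

    /** Counts the '\n' characters in source[0, globalStart) without allocating a substring. */
    static int countNewlinesBefore(String source, int globalStart) {
        int count = 0;
        for (int i = 0; i < globalStart; i++) {
            if (source.charAt(i) == '\n') {
                count++;
            }
        }
        return count;
    }

    /**
     * 1-based physical line number of a match, assuming linesSkipped is the
     * number of lines that precede the batch in the file.
     */
    static int actualLine(int linesSkipped, String source, int globalStart) {
        return linesSkipped + countNewlinesBefore(source, globalStart) + 1;
    }
}
```

For a batch that starts after 1000 skipped lines, a match on the third line of the batch has two newlines before it, so actualLine returns 1003. The loop rescans the prefix for every match, which is fine for a 1000-line batch; for much larger strings, counting newlines incrementally between consecutive matches would avoid the repeated work.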

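✅ Alternative (not recommended for large texts)

If you prefer to match line by line, stream the file instead; each m.start() is then already relative to its own line, so no offset conversion is needed: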

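import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.*;
import java.util.stream.*;

public List<OffsetResult> matchByLines(String file, int startLine, int step) {
    // Compile the pattern once instead of once per line.
    Pattern pattern = Pattern.compile(String.join("|", keys));
    // 1-based number of the line being processed; relies on the stream staying sequential.
    AtomicInteger lineNo = new AtomicInteger(startLine);
    try (Stream<String> lines = Files.lines(Paths.get(file)).skip(startLine).limit(step)) {
        return lines
                .map(line -> {
                    int current = lineNo.incrementAndGet();
                    Matcher m = pattern.matcher(line);
                    List<OffsetResult> perLine = new ArrayList<>();
                    while (m.find()) {
                        // Record the actual line, not the batch start line.
                        perLine.add(new OffsetResult(m.group(), current, m.start()));
                    }
                    return perLine;
                })
                .flatMap(List::stream)
                .collect(Collectors.toList());
    } catch (IOException e) {
        log.error("Read error", e);
        return Collections.emptyList();
    }
}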
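
To try the line-by-line idea without a file, the same loop can be run over an in-memory list. This is only a sketch: Hit is a hypothetical stand-in for OffsetResult (whose real fields are not shown in this article), and quoting each key with Pattern.quote is a choice made here so that regex metacharacters in a keyword are matched literally. It uses List.of, so it needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PerLineMatchSketch {
    /** Hypothetical stand-in for OffsetResult: matched word, 1-based line, 0-based column. */
    static final class Hit {
        final String word;
        final int line;
        final int column;

        Hit(String word, int line, int column) {
            this.word = word;
            this.line = line;
            this.column = column;
        }

        @Override
        public String toString() {
            return word + "@" + line + ":" + column;
        }
    }

    /** Matches every key in every line; linesSkipped is the count of lines before this batch. */
    static List<Hit> matchLines(List<String> lines, int linesSkipped, List<String> keys) {
        Pattern pattern = Pattern.compile(
                keys.stream().map(Pattern::quote).collect(Collectors.joining("|")));
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = pattern.matcher(lines.get(i));
            while (m.find()) {
                // m.start() is relative to this line, so it is already the in-line offset.
                hits.add(new Hit(m.group(), linesSkipped + i + 1, m.start()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The Adventures", "by Sir Arthur Conan Doyle");
        System.out.println(matchLines(lines, 0, List.of("Arthur", "Doyle")));
        // prints [Arthur@2:7, Doyle@2:20]
    }
}
```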