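// Assumption: StrUtil is Hutool's cn.hutool.core.util.StrUtil; the original listing omits this import.
import cn.hutool.core.util.StrUtil;
import lombok.extern.slf4j.Slf4j;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.regex.Pattern;
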
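/**
 * Sensitive-word filter based on the DFA algorithm.
 *
 * @author lrs
 * @since 2020-10-16
 */
@Slf4j
public class SensitiveWordUtils {

    /**
     * Match rule: minimum match.
     * Example: with the word list ["代购", "代购商"] and the text "我是代购商",
     * the result is 我是[代购]商.
     */
    public static final int MIN_MATCHTYPE = 1;
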
    /**
     * Match rule: maximum match.
     * Example: with the word list ["代购", "代购商"] and the text "我是代购商",
     * the result is 我是[代购商].
     */
    public static final int MAX_MATCHTYPE = 2;

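    /**
     * The sensitive-word dictionary, stored as nested maps. Each single-character
     * key maps to a child map, and each child map carries an "isEnd" flag.
     */
    public static Map<String, Object> sensitiveWordMap;
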
    /**
     * Loads the sensitive-word dictionary.
     *
     * @param sensitiveWordSet the sensitive words to load
     */
    public static synchronized void load(Set<String> sensitiveWordSet) {
        initWords(sensitiveWordSet);
        log.info("== Loaded {} sensitive words ==", sensitiveWordSet.size());
    }

    /**
     * Initializes the dictionary and builds the DFA model.
     *
     * @param sensitiveWordSet the sensitive words to load
     */
    private static void initWords(Set<String> sensitiveWordSet) {
        // Pre-size the root map to reduce rehashing
        sensitiveWordMap = new HashMap<>(sensitiveWordSet.size());
        for (String word : sensitiveWordSet) {
            addWord(word, sensitiveWordMap);
        }
    }

    /**
     * Adds a sensitive word.
     *
     * @param word   the word to add
     * @param nowMap the dictionary loaded so far
     */
    @SuppressWarnings("unchecked")
    public static void addWord(String word, Map<String, Object> nowMap) {
        if (StrUtil.isBlank(word)) {
            return;
        }
        for (int i = 0; i < word.length(); i++) {
            String key = String.valueOf(word.charAt(i));
            Object wordMap = nowMap.get(key);
            if (wordMap != null) {
                // The key already exists: descend into it for the next character
                nowMap = (Map<String, Object>) wordMap;
            } else {
                // Otherwise create a child node, marked as "not the end of a word"
                Map<String, Object> newWordMap = new HashMap<>();
                newWordMap.put("isEnd", "0");
                nowMap.put(key, newWordMap);
                nowMap = newWordMap;
            }
            if (i == word.length() - 1) {
                // Last character: mark the end of a word
                nowMap.put("isEnd", "1");
            }
        }
    }

    /**
     * Removes a sensitive word. Other words that share a prefix with it are kept.
     *
     * @param word   the word to remove
     * @param nowMap the whole dictionary
     * @return true if the word was present and has been removed
     */
    @SuppressWarnings("unchecked")
    public static boolean removeWord(String word, Map<String, Object> nowMap) {
        if (StrUtil.isBlank(word) || nowMap == null) {
            return false;
        }
        // path.get(i) is the node reached before consuming character i
        List<Map<String, Object>> path = new ArrayList<>();
        Map<String, Object> node = nowMap;
        for (int i = 0; i < word.length(); i++) {
            path.add(node);
            node = (Map<String, Object>) node.get(String.valueOf(word.charAt(i)));
            if (node == null) {
                return false; // the word is not in the dictionary
            }
        }
        if (!"1".equals(node.get("isEnd"))) {
            return false; // only a prefix of longer words, not a word itself
        }
        node.put("isEnd", "0");
        // Prune nodes that no longer lead to any word, from the leaf upwards
        for (int i = word.length() - 1; i >= 0; i--) {
            String key = String.valueOf(word.charAt(i));
            Map<String, Object> parent = path.get(i);
            Map<String, Object> child = (Map<String, Object>) parent.get(key);
            if (child.size() > 1 || "1".equals(child.get("isEnd"))) {
                break; // still has children, or still ends a shorter word
            }
            parent.remove(key);
        }
        log.info("Removed sensitive word: {}", word);
        return true;
    }

    /**
     * Returns the dictionary. It is held locally here; this could be changed to Redis.
     *
     * @return the local sensitive-word dictionary
     */
    public static Map<String, Object> getSensitiveMap() {
        return sensitiveWordMap;
    }

    /**
     * Checks whether the text contains a sensitive word.
     *
     * @param txt       the text
     * @param matchType match rule: 1 = minimum match, 2 = maximum match
     * @return true if the text contains a sensitive word, false otherwise
     */
    public static boolean contains(String txt, int matchType) {
        Map<String, Object> sensitiveMap = getSensitiveMap();
        for (int i = 0; i < txt.length(); i++) {
            // A length greater than 0 means a sensitive word starts at index i
            if (checkWord(txt, i, matchType, sensitiveMap) > 0) {
                return true;
            }
        }
        return false;
    }

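    /**
     * Checks whether the text contains a sensitive word, using the minimum match rule.
     *
     * @param txt the text
     * @return true if the text contains a sensitive word, false otherwise
     */
    public static boolean contains(String txt) {
        return contains(txt, MIN_MATCHTYPE);
    }
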
    /**
     * Collects the sensitive words found in the text.
     *
     * @param txt       the text
     * @param matchType match rule: 1 = minimum match, 2 = maximum match
     * @return the matched sensitive words
     */
    public static Set<String> getSensitiveWord(String txt, int matchType) {
        Map<String, Object> sensitiveMap = getSensitiveMap();
        Set<String> resultSet = new HashSet<>();
        for (int i = 0; i < txt.length(); i++) {
            int length = checkWord(txt, i, matchType, sensitiveMap);
            if (length > 0) {
                resultSet.add(txt.substring(i, i + length));
                // Subtract 1 because the for loop increments i itself
                i = i + length - 1;
            }
        }
        return resultSet;
    }

    /**
     * Collects the sensitive words found in the text, using the minimum match rule.
     *
     * @param txt the text
     * @return the matched sensitive words
     */
    public static Set<String> getSensitiveWord(String txt) {
        return getSensitiveWord(txt, MIN_MATCHTYPE);
    }

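    /**
     * Replaces sensitive words character by character.
     *
     * @param txt          the text
     * @param replaceValue the replacement, repeated once per character of the word
     * @return the text after replacement
     */
    public static String replaceWordByOne(String txt, String replaceValue) {
        return replaceWord(txt, replaceValue, MAX_MATCHTYPE, false);
    }
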
    /**
     * Replaces each sensitive word as a whole.
     *
     * @param txt        the text
     * @param replaceStr the replacement string substituted for the entire word
     * @return the text after replacement
     */
    public static String replaceWordByAll(String txt, String replaceStr) {
        return replaceWord(txt, replaceStr, MAX_MATCHTYPE, true);
    }

    /**
     * Replaces sensitive words in the text.
     *
     * @param txt          the text
     * @param replaceValue the replacement
     * @param matchType    the match rule
     * @param isReplaceAll true: replaceValue replaces the whole word;
     *                     false: replaceValue replaces each character of the word
     * @return the text after replacement
     */
    public static String replaceWord(String txt, String replaceValue, int matchType, boolean isReplaceAll) {
        String resultTxt = txt;
        // Collect every sensitive word in the text
        Set<String> set = getSensitiveWord(txt, matchType);
        for (String word : set) {
            String replaceString = isReplaceAll ? replaceValue : getReplaceChars(replaceValue, word.length());
            // String.replace treats both arguments literally; replaceAll would
            // interpret the word as a regular expression
            resultTxt = resultTxt.replace(word, replaceString);
        }
        return resultTxt;
    }

    /**
     * Builds the replacement string.
     *
     * @param replaceChar the replacement character
     * @param length      the number of repetitions
     * @return replaceChar repeated length times
     */
    private static String getReplaceChars(Object replaceChar, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(replaceChar);
        }
        return sb.toString();
    }

    /**
     * Checks whether a sensitive word starts at the given index.
     *
     * @param txt          the text
     * @param beginIndex   the index to start from
     * @param matchType    the match rule
     * @param wordStoreMap the dictionary
     * @return the length of the matched sensitive word, or 0 if there is none
     */
    @SuppressWarnings("unchecked")
    private static int checkWord(String txt, int beginIndex, int matchType, Map<String, Object> wordStoreMap) {
        if (wordStoreMap == null) {
            return 0; // dictionary not loaded
        }
        // Length of the last complete word seen; counting every matched character
        // instead would over-report when the text only continues a longer word's prefix
        int matchLength = 0;
        Map<String, Object> nowMap = wordStoreMap;
        for (int i = beginIndex; i < txt.length(); i++) {
            nowMap = (Map<String, Object>) nowMap.get(String.valueOf(txt.charAt(i)));
            if (nowMap == null) {
                break; // no such key: stop
            }
            if ("1".equals(nowMap.get("isEnd"))) {
                matchLength = i - beginIndex + 1;
                // Minimum match returns at once; maximum match keeps looking
                if (MIN_MATCHTYPE == matchType) {
                    break;
                }
            }
        }
        return matchLength;
    }

    /**
     * Strips common special characters and spaces.
     *
     * @param str the text to clean
     * @return the text without those characters
     */
    public static String filterSpecialStr(String str) {
        String regEx = "[`~!@#$%^&*()+=|{}:;\\\\[\\\\].<>/?~!@#¥%……&*()——+|{}【】‘;:”“’。,、?']";
        Pattern pattern = Pattern.compile(regEx);
        return pattern.matcher(str).replaceAll("").replace(" ", "").trim();
    }

    /**
     * Reads sensitive-word files, one word per line.
     *
     * @param filePaths absolute file paths; several files are supported
     * @return the sensitive-word set
     */
    public static Set<String> readFile(List<String> filePaths) {
        Set<String> result = new HashSet<>();
        if (filePaths == null || filePaths.isEmpty()) {
            return result;
        }
        for (String filePath : filePaths) {
            File file = new File(filePath);
            int count = 0;
            if (!file.isFile()) {
                log.info("File not found: {}", filePath);
                continue;
            }
            // try-with-resources closes the reader for every file, even on failure
            try (BufferedReader bufferedReader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
                String lineTxt;
                while ((lineTxt = bufferedReader.readLine()) != null) {
                    if (StrUtil.isNotBlank(lineTxt)) {
                        result.add(lineTxt);
                        count++;
                    }
                }
            } catch (Exception e) {
                log.error("Failed to load sensitive-word file: {}", filePath, e);
            }
            log.info("Loaded file: {}, count={}", filePath, count);
        }
        return result;
    }

    public static void main(String[] args) {
        test();
    }

    public static void test() {
        List<String> fileList = new ArrayList<>();
        fileList.add("D:\\敏感词\\广告.txt");
        // fileList.add("D:\\敏感词\\色情类.txt");
        // fileList.add("D:\\敏感词\\涉枪涉爆违法信息关键词.txt");
        // fileList.add("D:\\敏感词\\政治类.txt");
        // fileList.add("D:\\敏感词\\网址.txt");
        Set<String> sensitiveWordSet = readFile(fileList);
        System.out.println("Sensitive words loaded: " + sensitiveWordSet.size());
        // Initialize the dictionary
        load(sensitiveWordSet);
        String content = "结束标志位结束代购商标职业志位结束标志敏感位结束标志位代购结束标志位结束苹果标";
        long beginTime = System.currentTimeMillis();
        boolean isContains = contains(content);
        Set<String> sensitiveWord = getSensitiveWord(content);
        System.out.println("Contains sensitive word = " + isContains);
        System.out.println("Sensitive words = " + sensitiveWord);
        System.out.println("Elapsed: " + (System.currentTimeMillis() - beginTime) + "ms");
        System.out.println(replaceWordByOne(content, "*"));
        System.out.println(replaceWordByAll(content, "替换整个词"));
    }
}
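
The difference between the minimum and maximum match rules can be tried without a word file. The sketch below is a stand-alone illustration, not part of `SensitiveWordUtils`: the class `MatchRuleSketch`, its `Node` type, and its method names are all hypothetical, and it uses a typed trie node in place of the nested `Map<String, Object>` above. It records the length of the last complete word seen and stops early only for the minimum rule.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-alone sketch of the two match rules; all names are illustrative. */
class MatchRuleSketch {

    /** One trie node: children keyed by character, plus an end-of-word flag. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    /** Builds a trie from the given words. */
    static Node build(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.isEnd = true;
        }
        return root;
    }

    /** Length of the word starting at begin, or 0 if none; longest selects the maximum rule. */
    static int matchLength(Node root, String txt, int begin, boolean longest) {
        int matched = 0;
        Node cur = root;
        for (int i = begin; i < txt.length(); i++) {
            cur = cur.children.get(txt.charAt(i));
            if (cur == null) {
                break;
            }
            if (cur.isEnd) {
                matched = i - begin + 1;   // remember the last complete word
                if (!longest) {
                    break;                 // minimum rule: stop at the first complete word
                }
            }
        }
        return matched;
    }

    /** Returns the first sensitive word found in txt, or null if there is none. */
    static String firstMatch(Node root, String txt, boolean longest) {
        for (int i = 0; i < txt.length(); i++) {
            int len = matchLength(root, txt, i, longest);
            if (len > 0) {
                return txt.substring(i, i + len);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node root = build("代购", "代购商");
        System.out.println(firstMatch(root, "我是代购商", false)); // minimum rule: 代购
        System.out.println(firstMatch(root, "我是代购商", true));  // maximum rule: 代购商
    }
}
```

With the word list ["代购", "代购商"] and the text "我是代购商", the minimum rule yields 代购 and the maximum rule yields 代购商, matching the examples given for `MIN_MATCHTYPE` and `MAX_MATCHTYPE`.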