How to remove Arabic diacritics and punctuation from a String in Java

5
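I am building an Arabic dictionary, and I get sentences like the following: String original = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'"; However, I cannot process such a sentence without first removing the diacritics and punctuation. I tried using ... (continued)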
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public static String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
} 

But it did not work.


3
What does "it did not work" mean? Does it fail to compile, or does it throw an exception? Does it return something unexpected? - Nodebody
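
A likely reason the snippet in the question returns the text unchanged: \p{InCombiningDiacriticalMarks} matches only the Unicode block U+0300..U+036F, while the Arabic vowel marks sit in the Arabic block (for example FATHA is U+064E). The following is a minimal sketch to check this; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class DiacriticBlockCheck {
    // The block targeted by the question's pattern (U+0300..U+036F).
    private static final Pattern COMBINING_BLOCK =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}");
    // General category Mn (non-spacing mark), independent of block or script.
    private static final Pattern NON_SPACING_MARK = Pattern.compile("\\p{Mn}");

    static boolean inCombiningBlock(String s) {
        return COMBINING_BLOCK.matcher(s).find();
    }

    static boolean hasNonSpacingMark(String s) {
        return NON_SPACING_MARK.matcher(s).find();
    }

    public static void main(String[] args) {
        String fatha = "\u064E"; // ARABIC FATHA
        String acute = "\u0301"; // COMBINING ACUTE ACCENT
        System.out.println(inCombiningBlock(acute));  // true
        System.out.println(inCombiningBlock(fatha));  // false
        System.out.println(hasNonSpacingMark(fatha)); // true
    }
}
```

So the block-based pattern never sees the Arabic marks, whereas a category-based pattern such as \p{Mn} does.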
5 Answers

11

Try this code; it works well in my project:

/**
 * ArabicNormalizer class
 * @author Ibrabel <ibrabel@gmail.com>
 */
public final class ArabicNormalizer {

    private String input;
    private final String output;

    /**
     * ArabicNormalizer constructor
     * @param input String
     */
    public ArabicNormalizer(String input){
        this.input=input;
        this.output=normalize();
    }

    /**
     * normalize Method
     * @return String
     */
    private String normalize(){

        //Remove honorific sign
        input=input.replaceAll("\u0610", "");//ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
        input=input.replaceAll("\u0611", "");//ARABIC SIGN ALAYHE ASSALLAM
        input=input.replaceAll("\u0612", "");//ARABIC SIGN RAHMATULLAH ALAYHE
        input=input.replaceAll("\u0613", "");//ARABIC SIGN RADI ALLAHOU ANHU
        input=input.replaceAll("\u0614", "");//ARABIC SIGN TAKHALLUS

        //Remove Quranic annotation
        input=input.replaceAll("\u0615", "");//ARABIC SMALL HIGH TAH
        input=input.replaceAll("\u0616", "");//ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
        input=input.replaceAll("\u0617", "");//ARABIC SMALL HIGH ZAIN
        input=input.replaceAll("\u0618", "");//ARABIC SMALL FATHA
        input=input.replaceAll("\u0619", "");//ARABIC SMALL DAMMA
        input=input.replaceAll("\u061A", "");//ARABIC SMALL KASRA
        input=input.replaceAll("\u06D6", "");//ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
        input=input.replaceAll("\u06D7", "");//ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
        input=input.replaceAll("\u06D8", "");//ARABIC SMALL HIGH MEEM INITIAL FORM
        input=input.replaceAll("\u06D9", "");//ARABIC SMALL HIGH LAM ALEF
        input=input.replaceAll("\u06DA", "");//ARABIC SMALL HIGH JEEM
        input=input.replaceAll("\u06DB", "");//ARABIC SMALL HIGH THREE DOTS
        input=input.replaceAll("\u06DC", "");//ARABIC SMALL HIGH SEEN
        input=input.replaceAll("\u06DD", "");//ARABIC END OF AYAH
        input=input.replaceAll("\u06DE", "");//ARABIC START OF RUB EL HIZB
        input=input.replaceAll("\u06DF", "");//ARABIC SMALL HIGH ROUNDED ZERO
        input=input.replaceAll("\u06E0", "");//ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
        input=input.replaceAll("\u06E1", "");//ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
        input=input.replaceAll("\u06E2", "");//ARABIC SMALL HIGH MEEM ISOLATED FORM
        input=input.replaceAll("\u06E3", "");//ARABIC SMALL LOW SEEN
        input=input.replaceAll("\u06E4", "");//ARABIC SMALL HIGH MADDA
        input=input.replaceAll("\u06E5", "");//ARABIC SMALL WAW
        input=input.replaceAll("\u06E6", "");//ARABIC SMALL YEH
        input=input.replaceAll("\u06E7", "");//ARABIC SMALL HIGH YEH
        input=input.replaceAll("\u06E8", "");//ARABIC SMALL HIGH NOON
        input=input.replaceAll("\u06E9", "");//ARABIC PLACE OF SAJDAH
        input=input.replaceAll("\u06EA", "");//ARABIC EMPTY CENTRE LOW STOP
        input=input.replaceAll("\u06EB", "");//ARABIC EMPTY CENTRE HIGH STOP
        input=input.replaceAll("\u06EC", "");//ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
        input=input.replaceAll("\u06ED", "");//ARABIC SMALL LOW MEEM

        //Remove tatweel
        input=input.replaceAll("\u0640", "");

        //Remove tashkeel
        input=input.replaceAll("\u064B", "");//ARABIC FATHATAN
        input=input.replaceAll("\u064C", "");//ARABIC DAMMATAN
        input=input.replaceAll("\u064D", "");//ARABIC KASRATAN
        input=input.replaceAll("\u064E", "");//ARABIC FATHA
        input=input.replaceAll("\u064F", "");//ARABIC DAMMA
        input=input.replaceAll("\u0650", "");//ARABIC KASRA
        input=input.replaceAll("\u0651", "");//ARABIC SHADDA
        input=input.replaceAll("\u0652", "");//ARABIC SUKUN
        input=input.replaceAll("\u0653", "");//ARABIC MADDAH ABOVE
        input=input.replaceAll("\u0654", "");//ARABIC HAMZA ABOVE
        input=input.replaceAll("\u0655", "");//ARABIC HAMZA BELOW
        input=input.replaceAll("\u0656", "");//ARABIC SUBSCRIPT ALEF
        input=input.replaceAll("\u0657", "");//ARABIC INVERTED DAMMA
        input=input.replaceAll("\u0658", "");//ARABIC MARK NOON GHUNNA
        input=input.replaceAll("\u0659", "");//ARABIC ZWARAKAY
        input=input.replaceAll("\u065A", "");//ARABIC VOWEL SIGN SMALL V ABOVE
        input=input.replaceAll("\u065B", "");//ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
        input=input.replaceAll("\u065C", "");//ARABIC VOWEL SIGN DOT BELOW
        input=input.replaceAll("\u065D", "");//ARABIC REVERSED DAMMA
        input=input.replaceAll("\u065E", "");//ARABIC FATHA WITH TWO DOTS
        input=input.replaceAll("\u065F", "");//ARABIC WAVY HAMZA BELOW
        input=input.replaceAll("\u0670", "");//ARABIC LETTER SUPERSCRIPT ALEF

        return input;
    }

    /**
     * @return the output
     */
    public String getOutput() {
        return output;
    }

    public static void main(String[] args) {
        String test = "كَلَّا لَا تُطِعْهُ وَاسْجُدْ وَاقْتَرِبْ ۩";
        System.out.println("Before: "+test);
        test=new ArabicNormalizer(test).getOutput();
        System.out.println("After: "+test);
    }
}

This is the only answer that worked for me! - Ahmed Nabil
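
The code points removed one by one above form a few contiguous ranges (U+0610..U+061A, U+064B..U+065F, U+06D6..U+06ED) plus U+0640 and U+0670, so the same set can probably be expressed as one precompiled character class. This is only a sketch of that idea, not the answerer's code; ArabicMarkStripper and strip are hypothetical names.

```java
import java.util.regex.Pattern;

final class ArabicMarkStripper {
    // Same code points as the long replaceAll list, written as ranges.
    private static final Pattern MARKS = Pattern.compile(
            "[\\u0610-\\u061A"     // honorific signs and small high annotation marks
            + "\\u0640"            // tatweel
            + "\\u064B-\\u065F"    // tashkeel
            + "\\u0670"            // superscript alef
            + "\\u06D6-\\u06ED]"); // Quranic annotation signs

    static String strip(String input) {
        return MARKS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // kaf + fatha, lam + shadda + fatha, alef, space, place-of-sajdah sign
        String test = "\u0643\u064E\u0644\u0651\u064E\u0627 \u06E9";
        System.out.println("Before: " + test);
        System.out.println("After: " + strip(test));
    }
}
```

Compiling the pattern once also avoids re-compiling a regex on every replaceAll call.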

6

Why not match on the Unicode categories Punctuation and Mark, Nonspacing?

You did not post the expected result and I cannot read Arabic :), but you could try this code:

String input = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
Pattern p = Pattern.compile("[\\p{P}\\p{Mn}]");
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println("found: " + m.group());
}
m.reset();
System.out.println("Replaced: " + m.replaceAll(" "));

Output:

found: '
found: َ
found: َ
found: َ
found: ُ
found: ً
found: :
found: َ
found: َ
found: َ
found: َ
found: َ
found: ّ
found: َ
found: َ
found: .
found: '
Replaced:  أ ب ن  ف لان ا  ع اب ه ور م اه بخ ل  ة س وء  

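I guess this is not quite the final result you expect, but I hope you can work from it. There is also a wealth of information available on Unicode categories, and I believe most of it applies to Java's Pattern.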

1
Thanks, this is actually exactly what I wanted, but with only \p{Mn} in the regex. - Firas252
The regex ("\p{Mn}") is enough to remove the diacritics. - itabdullah
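
Following the two comments above, here is a small sketch that removes the matches instead of replacing them with a space, so the letters of a word stay together. The class and method names are hypothetical. Note that this is category-based: tatweel (U+0640, category Lm) and the sajdah sign (U+06E9, category So) are neither Mn nor P and would be left in place.

```java
import java.util.regex.Pattern;

final class CategoryStripper {
    private static final Pattern MARKS = Pattern.compile("\\p{Mn}+");
    private static final Pattern MARKS_AND_PUNCT = Pattern.compile("[\\p{P}\\p{Mn}]+");

    // Removes non-spacing marks only; punctuation is kept.
    static String stripMarks(String s) {
        return MARKS.matcher(s).replaceAll("");
    }

    // Removes non-spacing marks and all punctuation characters.
    static String stripMarksAndPunctuation(String s) {
        return MARKS_AND_PUNCT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Quoted word: alef-with-hamza + fatha, beh + fatha, noon + fatha, full stop
        String input = "'\u0623\u064E\u0628\u064E\u0646\u064E.'";
        System.out.println(stripMarks(input));
        System.out.println(stripMarksAndPunctuation(input));
    }
}
```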

2

I found this approach works better. Thanks to joop for the tip.

import java.text.Normalizer;
import java.text.Normalizer.Form;

/**
 *
 * @author Ibbtek <http://ibbtek.altervista.org/>
 */
public class ArabicDiacritics {

    private String input;
    private final String output;

    /**
     * ArabicDiacritics constructor
     * @param input String
     */
    public ArabicDiacritics(String input){
        this.input=input;
        this.output=normalize();
    }

    /**
     * normalize Method
     * @return String
     */
    private String normalize(){

        input = Normalizer.normalize(input, Form.NFKD)
                .replaceAll("\\p{M}", "");

        return input;
    }

    /**
     * @return the output
     */
    public String getOutput() {
        return output;
    }

    public static void main(String[] args) {
        String test = "كَلَّا لَا تُطِعْهُ وَاسْجُدْ وَاقْتَرِبْ ۩";
        System.out.println("Before: "+test);
        test=new ArabicDiacritics(test).getOutput();
        System.out.println("After: "+test);
    }
}
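
One thing worth checking before using the NFKD approach for dictionary work: decomposition also splits precomposed letters, for example أ (U+0623) becomes ا (U+0627) followed by HAMZA ABOVE (U+0654), and \p{M} then removes the hamza along with the vowel marks. If that is not wanted, one possible variation (an assumption on my part, not something from the answer above) is to compose with NFC and remove only \p{Mn}. The names below are hypothetical.

```java
import java.text.Normalizer;

final class HamzaCheck {
    // The approach from the answer: decompose, then drop every mark.
    static String stripAfterNfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
    }

    // Variation: compose first so hamza letters stay precomposed,
    // then drop the remaining non-spacing marks.
    static String stripAfterNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        String alefHamzaFatha = "\u0623\u064E"; // alef with hamza above + fatha
        System.out.println(stripAfterNfkd(alefHamzaFatha)); // bare alef, U+0627
        System.out.println(stripAfterNfc(alefHamzaFatha));  // alef with hamza, U+0623
    }
}
```

Which behaviour is right depends on whether the dictionary should treat أ and ا as the same letter.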

1
You can use this:

String withDiacritics = "طَائِفِيّةٌ";
String withoutDiacritics = withDiacritics.replaceAll("(ّ)?(َ)?(ً)?(ُ)?(ٌ)?(ِ)?(ٍ)?(~)?(ْ)?", "");

The output will be: "طائفية".


Please reformat your code listing and remove the leading whitespace. - isaias-b
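
The optional groups in the pattern above cover the eight basic harakat, which are the contiguous code points U+064B..U+0652, plus a literal "~" that would also strip tildes from the text. A range written with escapes may be easier to read and to paste without losing a combining character. This is a sketch under that assumption; HarakatStripper is a made-up name.

```java
import java.util.regex.Pattern;

final class HarakatStripper {
    // U+064B..U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun
    private static final Pattern HARAKAT = Pattern.compile("[\\u064B-\\u0652]");

    static String strip(String s) {
        return HARAKAT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // The word from the answer above, written with escapes.
        String withDiacritics =
                "\u0637\u064E\u0627\u0626\u0650\u0641\u0650\u064A\u0651\u0629\u064C";
        System.out.println(strip(withDiacritics));
    }
}
```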

1
This is the solution that worked in my case. I tried many others, but none of them worked properly.
String diacless = Normalizer.normalize(textWithDiacritics, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
Log.d("diac_remove", "replaced: "+diacless);
