value - java string size in bytes




Java Unicode String length

I have tried various ways to count the characters in a Unicode string. It looks like a small problem, but I am badly stuck.

I am trying to get the length of the string str1 below. length() returns 6, but the string actually has 3 letters; moving the cursor across "குமார்" also steps through 3 characters.

Basically I want to measure the length and print each character, like "கு", "மா", "ர்".

public class one {
    public static void main(String[] args) {
        String str1 = new String("குமார்");
        System.out.print(str1.length());
    }
}

PS: The text is in Tamil.


This happens because of an encoding problem, so first change the text file encoding of your Java project with the following steps:

Right-click your project name => select Properties => select Resource => Text file encoding => choose Other and select UTF-8 as the encoding.

This will resolve your issue.


Found a solution to your problem.

Based on this SO answer I made a program that uses regex character classes to search for letters that may have optional modifiers. It splits your string into single (combined if necessary) characters and puts them into a list:

import java.util.*;
import java.lang.*;
import java.util.regex.*;

class Main
{
    public static void main (String[] args)
    {
        String s="குமார்";
        List<String> characters=new ArrayList<String>();
        Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
        Matcher matcher = pat.matcher(s);
        while (matcher.find()) {
            characters.add(matcher.group());            
        }

        // Test if we have the right characters and length
        System.out.println(characters);
        System.out.println("String length: " + characters.size());

    }
}

where \\p{L} means a Unicode letter, and \\p{M} means a Unicode mark.

The output of the snippet is:

[கு, மா, ர்]
String length: 3

See https://ideone.com/Apkapn for a working Demo


EDIT

I have now checked my regex against all valid Tamil letters taken from the tables in http://en.wikipedia.org/wiki/Tamil_script. It turns out that the current regex does not capture all letters correctly (every letter in the last row of the Grantha compound table is split into two letters), so I refined my regex to the following solution:

Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");

With this Pattern instead of the one above, you should be able to split your sentence into every valid Tamil letter (as long as Wikipedia's table is complete).

The code I used for checking is the following one:

String s = "ஃஅஆஇஈஉஊஎஏஐஒஓஔக்ககாகிகீகுகூகெகேகைகொகோகௌங்ஙஙாஙிஙீஙுஙூஙெஙேஙைஙொஙோஙௌச்சசாசிசீசுசூசெசேசைசொசோசௌஞ்ஞஞாஞிஞீஞுஞூஞெஞேஞைஞொஞோஞௌட்டடாடிடீடுடூடெடேடைடொடோடௌண்ணணாணிணீணுணூணெணேணைணொணோணௌத்ததாதிதீதுதூதெதேதைதொதோதௌந்நநாநிநீநுநூநெநேநைநொநோநௌப்பபாபிபீபுபூபெபேபைபொபோபௌம்மமாமிமீமுமூமெமேமைமொமோமௌய்யயாயியீயுயூயெயேயையொயோயௌர்ரராரிரீருரூரெரேரைரொரோரௌல்லலாலிலீலுலூலெலேலைலொலோலௌவ்வவாவிவீவுவூவெவேவைவொவோவௌழ்ழழாழிழீழுழூழெழேழைழொழோழௌள்ளளாளிளீளுளூளெளேளைளொளோளௌற்றறாறிறீறுறூறெறேறைறொறோறௌன்னனானினீனுனூனெனேனைனொனோனௌஶ்ஶஶாஶிஶீஶுஶூஶெஶேஶைஶொஶோஶௌஜ்ஜஜாஜிஜீஜுஜூஜெஜேஜைஜொஜோஜௌஷ்ஷஷாஷிஷீஷுஷூஷெஷேஷைஷொஷோஷௌஸ்ஸஸாஸிஸீஸுஸூஸெஸேஸைஸொஸோஸௌஹ்ஹஹாஹிஹீஹுஹூஹெஹேஹைஹொஹோஹௌக்ஷ்க்ஷக்ஷாக்ஷிக்ஷீக்ஷுக்ஷூக்ஷெக்ஷேக்ஷைஷொக்ஷோஷௌ";
List<String> characters = new ArrayList<String>();
Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");
Matcher matcher = pat.matcher(s);
while (matcher.find()) {
    characters.add(matcher.group());
}

System.out.println(characters);
System.out.println(characters.size() == 325);
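
To see what the refinement changes, here is a minimal sketch that runs both patterns over the single Grantha compound U+0B95 U+0BCD U+0BB7 (ksha). The class name PatternComparison and the helper split are made up for this illustration; only the two regexes come from the answer. The string is written with \u escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternComparison {
    // First attempt: a letter followed by any number of marks.
    static final Pattern SIMPLE = Pattern.compile("\\p{L}\\p{M}*");
    // Refined attempt: try the ksha conjunct first, then letter + optional mark.
    static final Pattern REFINED =
            Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");

    // Hypothetical helper: collect every match of the pattern, in order.
    static List<String> split(Pattern pattern, String text) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        String ksha = "\u0B95\u0BCD\u0BB7"; // ka + virama + ssa
        System.out.println(split(SIMPLE, ksha).size());  // 2: "ka + virama" and "ssa"
        System.out.println(split(REFINED, ksha).size()); // 1: the whole conjunct
    }
}
```

The refined pattern only special-cases this one conjunct, so any other multi-letter compound would still be split by the fallback alternative.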

By default, Notepad doesn't support UTF characters; it uses ANSI instead. However, your problem is not due to this.

Your program needs to know which encoding to use when reading or writing. There is no magic. You need to set the encoding explicitly (e.g. UTF-8). The FileReader constructor uses the platform's default encoding, which clearly won't work for you.

I think you need:

Reader reader = new InputStreamReader(new FileInputStream("c:/foo.txt"), "UTF-8");

Read file and write file which has characters in UTF - 8 (different language)
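
As a rough illustration of naming the encoding on both the write and the read side, the sketch below round-trips a Tamil string through a temporary file. The class name Utf8RoundTrip, the method roundTrip and the temp-file prefix are assumptions made for this example; it uses java.nio.file.Files rather than the InputStreamReader shown above, but the idea is the same.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {
    // Write the text as UTF-8, then read the first line back with the same explicit charset.
    static String roundTrip(String text) throws IOException {
        Path file = Files.createTempFile("tamil", ".txt"); // throwaway file for the demo
        try {
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                return reader.readLine();
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"; // the Tamil sample string
        System.out.println(original.equals(roundTrip(original))); // true
    }
}
```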


This turns out to be really ugly. I debugged your string, and it contains the following characters (with their hexadecimal code points):

க 0x0b95
ு 0x0bc1
ம 0x0bae
ா 0x0bbe
ர 0x0bb0
் 0x0bcd

So Tamil evidently uses diacritic-like sequences to build its characters, and unfortunately each element of a sequence counts as a separate entity.
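
A dump like the one above can be reproduced with a few lines of Java. This is only a sketch: the class name CodePointDump and the helpers isMark and countMarks are invented here, and "mark" means the Unicode general categories Mn, Mc and Me as reported by Character.getType.

```java
public class CodePointDump {
    // True for the three Unicode mark categories (Mn, Mc, Me).
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Count the code points that are combining marks.
    static long countMarks(String text) {
        return text.codePoints().filter(CodePointDump::isMark).count();
    }

    public static void main(String[] args) {
        String s = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"; // the Tamil sample string
        // One line per code point: hex value, rough class, official Unicode name.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X %-6s %s%n",
                cp,
                isMark(cp) ? "mark" : Character.isLetter(cp) ? "letter" : "other",
                Character.getName(cp)));
        System.out.println(countMarks(s)); // 3
    }
}
```

Three of the six code points are marks, which is why a count that skips marks arrives at 3.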

This is not a problem with UTF-8 / UTF-16, as erroneously claimed by other answers; it is inherent in the Unicode encoding of the Tamil language.

The suggested Normalizer does not work; it seems that Tamil has been designed by Unicode "experts" to explicitly use combining sequences which cannot be normalized. Aargh.
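
Both claims are easy to check. In the sketch below (the class name NormalizeCheck and the method nfcLength are made up for illustration), none of the standard measurements yields 3, and NFC normalization leaves the six code points untouched.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizeCheck {
    // Length in UTF-16 code units after NFC normalization.
    static int nfcLength(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC).length();
    }

    public static void main(String[] args) {
        String s = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"; // the Tamil sample string
        System.out.println(s.length());                                   // 6 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));              // 6 code points
        System.out.println(nfcLength(s));                                 // 6, nothing composes
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 18 bytes
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 12 bytes
    }
}
```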

My next idea is not to count characters, but glyphs, the visual representations of characters.

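import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import java.awt.geom.AffineTransform;
import java.text.Normalizer;

String str1 = Normalizer.normalize("குமார்", Normalizer.Form.NFC);

Font display = new Font("SansSerif", Font.PLAIN, 12);
GlyphVector vec = display.createGlyphVector(new FontRenderContext(new AffineTransform(), false, false), str1);

System.out.println(vec.getNumGlyphs());
// Assumes the font yields one glyph per char; a font that merges glyphs would need i < vec.getNumGlyphs().
for (int i = 0; i < str1.length(); i++)
    System.out.printf("%s %s %s %n", str1.charAt(i), Integer.toHexString((int) str1.charAt(i)), vec.getGlyphVisualBounds(i).getBounds2D().toString());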

The result:

க b95 [x=0.0,y=-6.0,w=7.0,h=6.0]
ு bc1 [x=8.0,y=-6.0,w=7.0,h=4.0]
ம bae [x=17.0,y=-6.0,w=6.0,h=6.0]
ா bbe [x=23.0,y=-6.0,w=5.0,h=6.0]
ர bb0 [x=30.0,y=-6.0,w=4.0,h=8.0]
் bcd [x=31.0,y=-9.0,w=1.0,h=2.0]

As the glyphs overlap, you need to use Java's character-type functions, as in the other solution.

SOLUTION:

I am using this link: http://www.venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf

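public static int getTamilStringLength(String tamil) {
    // Count the dependent (combining) signs, then subtract them from the raw length.
    int dependentCharacterLength = 0;
    for (int index = 0; index < tamil.length(); index++) {
        char code = tamil.charAt(index);
        if (code == 0x0B82)                         // anusvara
            dependentCharacterLength++;
        else if (code >= 0x0BBE && code <= 0x0BC8)  // vowel signs AA to AI
            dependentCharacterLength++;
        else if (code >= 0x0BCA && code <= 0x0BCD)  // vowel signs O, OO, AU and the virama (pulli)
            dependentCharacterLength++;
        else if (code == 0x0BD7)                    // AU length mark
            dependentCharacterLength++;
        // U+0BD0 (Tamil Om) is a letter, so it is deliberately not counted as dependent.
    }
    return tamil.length() - dependentCharacterLength;
}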

You need to exclude the combining characters from the count.


How to split Tamil characters in a string in PHP

I think you should be able to use the grapheme_extract function to iterate over the combined characters (which are technically called "grapheme clusters").

Alternatively, if you prefer the regex approach, I think you can use this:

preg_match_all('/\pL\pM*|./u', $str, $results)

where \pL means a Unicode "letter", and \pM means a Unicode "mark".

(Disclaimer: I have not tested either of these approaches.)


how to get character length of the unicode along with space in java

The code below worked for me. There were three issues that I fixed:

  1. I added a check for spaces to your regular expression.
  2. I added a check for punctuation to your regular expression.
  3. I pasted the string from your comment into the string in your code. They weren't the same!

Here's the code:

public static void main(String[] args) {
    String s = "பாரதீய ஜனதா இளைஞர் அணி தலைவர் அனுராக்சிங் தாகூர் எம்.பி. நேற்று தேர்தல் ஆணையர் வி.சம்பத்";
    List<String> characters = new ArrayList<String>();
    Pattern pat = Pattern.compile("\\p{P}|\\p{L}\\p{M}*| ");
    Matcher matcher = pat.matcher(s);
    while (matcher.find()) {
        characters.add(matcher.group());
    }
    // Test if we have the right characters and length
    int i = 1;
    for (String character : characters) {
        System.out.println(String.format("%d = [%s]", i++, character));
    }
    System.out.println("Characters Size: " + characters.size());
}

It's probably worth pointing out that your code is remarkably similar to the solution to this SO question. One comment on that solution in particular led me to discover the missing check for punctuation in your code and to notice that the string from your comment didn't match the string in your code.
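
One limitation of listing spaces and punctuation explicitly is that anything else (digits, symbols) is silently skipped. A possible variation, borrowing the catch-all alternative from the PHP regex earlier on this page, is sketched below. The class name FallbackSplit, the helper split and the sample text with trailing digits are assumptions made for this example and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FallbackSplit {
    // A letter with its marks, otherwise any single code point
    // (DOTALL lets '.' match line terminators as well).
    static final Pattern UNIT = Pattern.compile("\\p{L}\\p{M}*|.", Pattern.DOTALL);

    // Hypothetical helper: split the text into display units.
    static List<String> split(String text) {
        List<String> units = new ArrayList<>();
        Matcher matcher = UNIT.matcher(text);
        while (matcher.find()) {
            units.add(matcher.group());
        }
        return units;
    }

    public static void main(String[] args) {
        // The last word of the sample sentence (vi.sampath), followed by " 12".
        String s = "\u0BB5\u0BBF.\u0B9A\u0BAE\u0BCD\u0BAA\u0BA4\u0BCD 12";
        System.out.println(split(s).size()); // 9: five letters, '.', ' ', '1', '2'
    }
}
```

With this pattern every code point of the input ends up in exactly one unit, so nothing is dropped from the count.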