Malayalam Unicode - സ്വതന്ത്ര വിജ്ഞാന ജനാധിപത്യ സഖ്യം (Democratic Alliance for Knowledge Freedom)


Draft of An Approach Document on Issues of Malayalam Unicode Encoding


(for Discussion within DAKF)

Various interests are at play in the formation of the Malayalam Unicode standard. This demands that we address the issues of the standard. These issues might have been solved had there been consensus among the various players trying to influence the Unicode authority on Malayalam.

Even before the introduction of Unicode version 5.1, the Malayalam Unicode standard was discussed in various fora, including the Indic language mailing list of the Unicode Consortium (indic@unicode.org). However, version 5.1 and now 5.2 have been released without any consensus being reached in those discussions. This has led to controversy, with a section of users and developers not accepting Unicode version 5.1 or later. As those who did not accept it include maintainers of Unicode fonts widely used with GNU/Linux distributions such as Debian, Fedora and Ubuntu, default GNU/Linux installations still adhere to version 5.0 of Unicode. This creates serious problems for the portability of Malayalam documents across platforms.

It is high time for DAKF to intervene in these issues and empower the Govt. of Kerala to sort out problems that have long been damaging the evolution of Malayalam digital content. Some of the major issues are outlined below for discussion within DAKF, so that we can reach a consensus among ourselves.

Issues of Joiners


  • ZWJ (zero width joiner) and ZWNJ (zero width non-joiner) fall in the Unicode space of formatting control characters. In general, they are used to change the default rendering of sequences of Unicode characters.
  • The Unicode standard specifies that applications may strip these characters for certain kinds of processing, for example domain name processing and sorting. Stripping the joiners changes the meaning of the text if they have been used in forming (collating) a character.
  • In domain name processing, ZWJ and ZWNJ are mapped to the empty string, which reduces the possibility of spoofing.
  • ZWNJ is used in certain words such as ദൃക്‌സാക്ഷി, between ക് and സ, to keep the word from rendering as ദൃക്സാക്ഷി. In such cases a replacement has to be recommended. One possibility is to use ് + ് instead of ് + ZWNJ.
----
Jagan Nadh commented on 3rd Nov 2009
Substituting ZWNJ with ് may cause some other problems if the behaviour is not defined. 
If such a recommendation is made, we have to say that the behaviour of the character will be the same as that of ZWNJ. 
If I am correct, in ISCII we used a similar technique. 
While adopting this we have to ensure that it doesn’t do any harm to NLP applications like spell checkers, 
i.e. the same word with the new non-joiner and without the non-joiner should be recognised as the same.
----
Anilkumar commented on 3rd Nov 2009 
----
Definitely, the behaviour has to be defined in Unicode and properly implemented in all fonts and rendering engines. 
As ് is neither a Unicode control character nor defined outside the Malayalam space, it won’t create the problems that ZWNJ does. 
What kind of problem is expected in a spell checker? Please explain it with an example. 
----
Jagan Nadh commented on 3rd Nov 2009
Yes, right. I was referring to the problem that a spell checker or morphological analyzer may encounter with this. 
The words ദൃക്‌സാക്ഷി and ദൃക്സാക്ഷി should be recognized as the same word by any NLP application.
----
  • It is not healthy to support joiners for backward compatibility, as it creates unnecessary complexity for all text processing applications (sorting, searching). Hence the approach should be to migrate the available Malayalam content that uses joiners (i.e. Malayalam documents in Unicode 5.0 or earlier) to newer versions by substituting the new replacements for the joiners.
Jagan Nadh commented on 3rd Nov 2009
----
I think backward compatibility is Unicode policy
----
Anilkumar commented on 3rd Nov 2009 
----
Even though backward compatibility is defined as a policy of Unicode, 
it can be ensured only for the characters and code points already present in the old version. 
When a new character or code point is introduced in a new version, 
backward compatibility cannot be ensured at the Unicode level. 
----
Jagan Nadh commented on 3rd Nov 2009
----
It is not an easy task. There will be lakhs of pages in the 5.0 standard. 
So either we have to forget about that data or we have to keep backward compatibility. 
----
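
The exchange above about spell checkers can be made concrete with a small Python sketch. It assumes an application that simply drops the joiners before lookup; the helper name `strip_joiners` is hypothetical, and the example word from the text is spelled with escapes so that the otherwise invisible ZWNJ can be seen.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def strip_joiners(text: str) -> str:
    """Drop ZWJ and ZWNJ, as an application may do before sorting or lookup."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


# The example word from the text, with and without ZWNJ after the first "ka + virama".
with_zwnj = "\u0d26\u0d43\u0d15\u0d4d\u200c\u0d38\u0d3e\u0d15\u0d4d\u0d37\u0d3f"
without_zwnj = "\u0d26\u0d43\u0d15\u0d4d\u0d38\u0d3e\u0d15\u0d4d\u0d37\u0d3f"

# As stored, the two spellings are different strings ...
print(with_zwnj == without_zwnj)                 # False
# ... but they share one lookup key once the joiner is stripped.
print(strip_joiners(with_zwnj) == without_zwnj)  # True
```

The same stripping also removes the ZWJ from a 5.0-style chillu sequence, which is the kind of semantic change the first bullets of this section warn about.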

Issues related to Malayalam Chillus


  • Chillus are alternative forms of consonants (വ്യജ്ഞനങ്ങള്‍) without a vowel, used at word-end position. There are five commonly used chillus (of the consonants ണ, ന, ല, ള and ര), which are included in the official Malayalam alphabet, though chillus of other consonants are also possible.
Jagan Nadh commented on 3rd Nov 2009
----
Are chillus pure consonants? It would be better to consult the opinion of an expert.
----
Anilkumar commented on 3rd Nov 2009 
----
It says, “Chillus are alternative forms of pure consonants without vowel”. 
If you still dispute the word "pure", it can be dropped.
----
Jagan Nadh commented on 3rd Nov 2009
----
I don't have any dispute regarding this. 
I think we can define a chillu as a consonant that can stand alone without a vowel. 
I think describing it as an alternative form will lead to the invention of new chillus :-) 
----
  • Up to and including Unicode 5.0, there were no chillus in the Malayalam character table, so a chillu could not be written as a single Unicode character. This led to chillus being defined as a combination of two or three Unicode characters.
  • Microsoft, without paying any attention to language rules or future impact, introduced a method of using ZWJ to produce the chillus ( ല + ് + ZWJ = ല്‍ ). As their software was widely used, this way of forming chillus became widely accepted.
Jagan Nadh commented on 3rd Nov 2009
----
It was due to the blind acceptance of the ISCII standard in the case of Malayalam. 
In ISCII the sequence consonant + virama + nukta was used to produce a chillu. 
They replaced the nukta with ZWJ. 
----
Anilkumar commented on 3rd Nov 2009 
----
However, ZWJ cannot be equated with the nukta in ISCII.
----
  • The rendering engines in Swathanthra software (Pango, ICU, Qt, etc.) have also adopted the same method of rendering chillus.
  • There are some inherent errors in this method of rendering chillus. The most important is that it uses ZWJ, which lies outside the Malayalam character space in Unicode, in the common space. This creates problems in sorting order and searching. The mere use of ZWJ / ZWNJ causes no problem, as each is a Unicode character. However, when a character outside the Malayalam space is used to form a Malayalam character, it creates abnormal behaviour in sorting and searching.
  • As these issues were pointed out from various quarters, and after long discussion, the Unicode Consortium introduced the commonly used chillus and the chillu of ക (ക്‍) in version 5.1, so that each can be represented by a single code point (Unicode character) within the Malayalam space. Hence they are known as “ആണവചില്ലു്” (atomic chillus).
  • This introduction of chillus into the Unicode tables solved most of the issues of chillu encoding. However, another set of issues arose with this change.
Jagan Nadh commented on 3rd Nov 2009
----
Especially the new definition for the conjunct sequence ന്റ .
----
Anilkumar commented on 3rd Nov 2009 
----
I feel that with the introduction of the half form of റ്റ in version 5.2, the issue of ന്റ will be solved. 
However, it creates new problems of backward compatibility, which can be solved through content migration.
----
Jagan Nadh commented on 3rd Nov 2009
----
I feel that the introduction of the half form of റ്റ is not required. 
Anyhow, it is not going to solve the NTA problem either.
----
  • Backward compatibility with version 5.0 is lost, so all documents created with version 5.0 or earlier need reprocessing. This can be done with scanning and character re-mapping software such as Payyans or Padma. However, identifying all such Malayalam documents in Unicode 5.0 is a herculean task, so somewhere we have to compromise and accept the backward incompatibility.
Jagan Nadh commented on 3rd Nov 2009
----
I think backward compatibility is there. 
Whenever Unicode introduces a new standard, it will be backward compatible. 
But the minimum requirement is that vendors should support the backward compatibility.
----
Anilkumar commented on 3rd Nov 2009 
----
As said earlier, 
the backward compatibility policy of Unicode is of no help when a new character is introduced. 
So the claim of backward compatibility by the Unicode Consortium cannot be taken for granted.
----
  • Issues of backward compatibility could have been solved to a certain extent by introducing a canonical equivalence between the chillus of version 5.0 and “ആണവചില്ലു്”. However, that did not happen, as most of the discussion on these issues ended merely in a tug of war between various interests.
  • The rendering of chillus other than the five commonly used ones is still not solved. The use of ZWJ for those chillus cannot be allowed, on the same grounds on which “ആണവചില്ലു്” was introduced. If any other method is used for them (and it should not be another method of forming the five chillus already introduced), it creates duplication in rendering.
Jagan Nadh commented on 3rd Nov 2009
----
It is a specific feature of the Malayalam language. 
----
  • In earlier practice chillus were used only in word-end position. However, through various adaptations in the language they began to be used in word-medial positions of certain compound words as well, for example തേന്‍മാവു് (തേന്മാവു്), കല്‍പന (കല്പന), വെണ്‍മ (വെണ്മ) etc.
  • There is an argument that chillus can be the first letter of conjuncts (കൂട്ടക്ഷരം). However, under Malayalam language rules this is not possible. For the first letter of a conjunct, the normal vowel-less form of the consonant is to be used. For example, no conjunct formed as (ന്‍ + റ) exists; however, some have wrongly used it in place of ന്റ (which is ന + ് + റ ).
  • The introduction of “ആണവചില്ലു്” in the Malayalam Unicode table does not negate any Malayalam language rule. It is still merely an alternate form of a consonant without a vowel and cannot be used as the first letter of a conjunct.
  • However, there are plenty of reasons to accept “ആണവചില്ലു്”, as it solves many of the earlier issues. The new problems it creates need to be solved mainly through migration of old documents in version 5.0 or earlier.
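
The migration proposed in this section could look roughly like the following Python sketch. It is not how Payyans or Padma work; it is a minimal illustration that assumes the 5.0-style sequence consonant + virama + ZWJ and the six chillu code points added in Unicode 5.1 (U+0D7A to U+0D7F). The table name `OLD_TO_ATOMIC` and the function `migrate_chillus` are hypothetical, and the choice of ര (rather than റ) as the base of CHILLU RR is an assumption.

```python
import unicodedata

VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA (chandrakkala)
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# 5.0-style sequence (consonant + virama + ZWJ) -> atomic chillu of 5.1
OLD_TO_ATOMIC = {
    "\u0d23" + VIRAMA + ZWJ: "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28" + VIRAMA + ZWJ: "\u0d7b",  # NA  -> CHILLU N
    "\u0d30" + VIRAMA + ZWJ: "\u0d7c",  # RA  -> CHILLU RR (assumed base)
    "\u0d32" + VIRAMA + ZWJ: "\u0d7d",  # LA  -> CHILLU L
    "\u0d33" + VIRAMA + ZWJ: "\u0d7e",  # LLA -> CHILLU LL
    "\u0d15" + VIRAMA + ZWJ: "\u0d7f",  # KA  -> CHILLU K
}


def migrate_chillus(text: str) -> str:
    """Rewrite every 5.0-style chillu sequence as its atomic code point."""
    for old, atomic in OLD_TO_ATOMIC.items():
        text = text.replace(old, atomic)
    return text


old_word = "\u0d24\u0d23\u0d32\u0d4d\u200d"  # ta, nna, la + virama + ZWJ
new_word = migrate_chillus(old_word)

print([f"U+{ord(c):04X}" for c in new_word])
print(unicodedata.name(new_word[-1]))

# No canonical equivalence exists, so normalization does not unify the two forms.
nfc_equal = unicodedata.normalize("NFC", old_word) == unicodedata.normalize("NFC", new_word)
print(nfc_equal)  # False
```

Because normalization leaves both spellings untouched, any equivalence between old and new chillus has to be supplied by a mapping of this kind, in the migration tool or in the application.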

Issues related to Domain Names


  • Domain names are used to access a networked computer or virtual computer. Each domain name translates to an IP address. Domain names are most widely used for website addresses, as IP addresses are difficult to remember.
  • In the public space, i.e. the Internet, a domain name should be unique, so that there is no ambiguity in connecting to the appropriate computer by name. In other words, duplication of domain names is not allowed in the public space.
  • The uniqueness of domain names is maintained using the Punycode of the domain names. Punycode is an instance of a general encoding syntax by which a string of Unicode characters can be transformed uniquely and reversibly into a smaller, restricted character set.
  • ZWJ, ZWNJ and other control characters of the Unicode tables are not included in the restricted character set used with Punycode, so domain name processing filters out those characters. Hence, for Unicode version 5.0, the Punycode of a chillu is the same as that of the consonant without a vowel. In other words, the Punycode of സര്‍ക്കാര്‍ is the same as that of സര്ക്കാര് (similarly, the Punycodes of തൊഴില്‍ and തൊഴില് are the same, those of തണല്‍ and തണല് are the same, and those of നന്മ, നന്‍മ and നന്‌മ are the same).
  • On the other hand, if a string with a given Punycode can be rendered in two different ways, spoofing becomes possible, i.e. we may be led to different websites depending on the rendering engine used in our browser. This situation must not be allowed in any case, as it leads to e-crime, especially when society is increasingly moving to e-transactions.
  • Another possibility is that a person using Unicode version 5.0 based rendering and fonts who tries to access www.സര്‍ക്കാര്‍.com will land on www.സര്ക്കാര്.com, and if www.സര്ക്കാര്.com is a spoofing site, that person will be in trouble. This may be avoided by using fonts strictly compatible with the latest accepted Unicode standard.
Jagan Nadh commented on 3rd Nov 2009
----
When IDN is implemented, all possibility of this kind of spoofing will be eliminated. 
With IDN in action, a name submitted for registration will be examined at different levels to ensure that 
the domain name does not conflict with any other site that is already registered. 
This analysis includes visual similarity as well as encoding similarity, etc.
----
Anilkumar commented on 3rd Nov 2009 
----
In that case the IDN registration rules also have to be redefined with the introduction of new versions of Unicode. 
Even then there will be issues with already registered sites, 
so the use of fonts and rendering engines strictly compatible with the latest accepted Unicode standard has to be ensured.
----
Jagan Nadh commented on 3rd Nov 2009
----
At the time of IDN implementation, the then latest Unicode standard will be taken into consideration. 
So all the rules will be modified accordingly by the competent authority. 
----
  • It is better not to permit domain name registration with “ആണവചില്ലു്” until Unicode version 5.0 has been completely replaced, or else there should be some canonical equivalence between the old chillu and “ആണവചില്ലു്” in Punycode formation or in the font itself. In domain registrations with high security implications, it is essential that an automatic folding mechanism is available for such common cases of similar and ambiguous sequences, so that if one is registered the other is not available for registration.
Jagan Nadh commented on 3rd Nov 2009
----
I think we have to revisit this argument. 
Canonical equivalence should be discussed in detail, in terms of both linguistics and technology.
----
Anilkumar commented on 3rd Nov 2009 
----
Surely we can have more debate on it. Please initiate it with more details.
----
  • It is also particularly important to strictly follow a standard in Malayalam Unicode Character rendering to avoid spoofing.
Jagan Nadh commented on 3rd Nov 2009
----
IDN has a native mechanism to identify and prevent any attempt to register a site for spoofing.
----
Anilkumar commented on 3rd Nov 2009 
----
All such native mechanisms rely on rules based on existing standards. 
Whenever there are changes in the standards, such mechanisms also have to be revisited.
----
Jagan Nadh commented on 3rd Nov 2009
----
At the time of IDN implementation, the then latest Unicode standard will be taken into consideration. 
So all the rules will be modified accordingly by the competent authority. 
----
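
The Punycode collision described in this section, and the folding mechanism it asks for, can be sketched in Python. The sketch assumes the mapping step of RFC 3454 (table B.1, "commonly mapped to nothing"), which Python exposes as `stringprep.in_table_b1`, and the standard `punycode` codec. The helpers `map_to_nothing`, `to_punycode` and `registration_key` are hypothetical names, and the folding of atomic chillus to consonant + virama is only one possible rule, not an existing registry policy.

```python
import stringprep

VIRAMA = "\u0d4d"

# Atomic chillu -> base consonant (RA as the base of CHILLU RR is an assumption)
ATOMIC_TO_BASE = {
    "\u0d7a": "\u0d23", "\u0d7b": "\u0d28", "\u0d7c": "\u0d30",
    "\u0d7d": "\u0d32", "\u0d7e": "\u0d33", "\u0d7f": "\u0d15",
}


def map_to_nothing(label: str) -> str:
    """Drop characters from RFC 3454 table B.1, which includes ZWJ and ZWNJ."""
    return "".join(ch for ch in label if not stringprep.in_table_b1(ch))


def to_punycode(label: str) -> bytes:
    """Punycode of a label after the joiners have been mapped to nothing."""
    return map_to_nothing(label).encode("punycode")


def registration_key(label: str) -> str:
    """Hypothetical folding key: labels with equal keys would block each other."""
    folded = map_to_nothing(label)
    for chillu, base in ATOMIC_TO_BASE.items():
        folded = folded.replace(chillu, base + VIRAMA)
    return folded


# The example label from the text in three spellings.
sarkar_zwj = "\u0d38\u0d30\u0d4d\u200d\u0d15\u0d4d\u0d15\u0d3e\u0d30\u0d4d\u200d"  # 5.0 chillus
sarkar_plain = "\u0d38\u0d30\u0d4d\u0d15\u0d4d\u0d15\u0d3e\u0d30\u0d4d"            # no joiners
sarkar_atomic = "\u0d38\u0d7c\u0d15\u0d4d\u0d15\u0d3e\u0d7c"                        # 5.1 chillus

print(to_punycode(sarkar_zwj) == to_punycode(sarkar_plain))     # True: same domain
print(to_punycode(sarkar_atomic) == to_punycode(sarkar_plain))  # False: a second domain
print(registration_key(sarkar_atomic) == registration_key(sarkar_zwj))  # True
```

With a key of this kind, a registry could refuse the atomic-chillu label once either older spelling is registered, which is the behaviour the bullet on folding describes.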

Issues related to certain collation


  • The ambiguity in version 5.0 between the rendering of ന്‍റ in ഹെന്‍റി (ന്‍ + റ, pronounced nRa and not a conjunct, exactly like ന്‍സ, ന്‍ക) and of ന്റ in എന്റെ (which is ന് + half of റ്റ, pronounced nta) was, however, cleared up with the introduction of “ആണവചില്ലു്”.
  • Steps should be taken to stop the incorrect usage of certain conjuncts, at least in government documents. For example, മ്പ = മ + ് + പ and not ന + ് + പ.
Jagan Nadh commented on 3rd Nov 2009
----
I think the Malayalam Script Standardisation Committee gives some recommendations on this. 
----
  • Unicode should implement collation based on linguistic rules. Treating characters merely on the basis of their representational form is not enough.
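
The spellings compared in the first bullet of this section differ only in characters that are hard to see on screen, so it can help to list their code points. The Python sketch below uses a hypothetical helper `describe`; writing the non-conjunct case as CHILLU N followed by റ is an assumption drawn from the text above, not a statement of what the Unicode standard prescribes.

```python
import unicodedata


def describe(text: str) -> list:
    """Return 'U+XXXX NAME' for each code point of the string."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


n_rra_50 = "\u0d28\u0d4d\u200d\u0d31"  # 5.0: na + virama + ZWJ + rra (chillu, no conjunct)
nta_50 = "\u0d28\u0d4d\u0d31"          # 5.0: na + virama + rra (the conjunct "nta")
n_rra_51 = "\u0d7b\u0d31"              # 5.1: atomic chillu n + rra (assumed spelling)

for label, seq in [("5.0 chillu + rra", n_rra_50),
                   ("5.0 conjunct", nta_50),
                   ("5.1 chillu + rra", n_rra_51)]:
    print(label, describe(seq))

# Under 5.0 the two readings differ only by a joiner that may be stripped.
print(n_rra_50.replace("\u200d", "") == nta_50)  # True
# With the atomic chillu, the difference survives joiner stripping.
print(n_rra_51.replace("\u200d", "") == nta_50)  # False
```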

Backward compatibility


  • Whenever a change is made to the Unicode standard, there should be backward compatibility with the earlier versions. However, this is not ensured in most cases, which causes many problems.
  • In certain cases it may not be practical to ensure backward compatibility. In such cases the concerned authority may publish a migration plan for moving to the new version. In certain cases, implementing canonical equivalence in the fonts themselves may help to clear up the incompatibility between versions.
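
A migration plan needs some way of finding the documents that still use the older encoding. The following Python sketch is a heuristic only: it assumes that a Malayalam consonant (U+0D15 to U+0D39) followed by virama and ZWJ marks a pre-5.1 chillu, and that U+0D7A to U+0D7F are the atomic chillus. The function name `chillu_style` and its return labels are hypothetical.

```python
import re

# consonant + virama + ZWJ: the pre-5.1 way of writing a chillu (heuristic)
OLD_CHILLU = re.compile("[\u0d15-\u0d39]\u0d4d\u200d")
# atomic chillu code points introduced in Unicode 5.1
ATOMIC_CHILLU = re.compile("[\u0d7a-\u0d7f]")


def chillu_style(text: str) -> str:
    """Classify a text by the chillu encoding it uses."""
    has_old = bool(OLD_CHILLU.search(text))
    has_atomic = bool(ATOMIC_CHILLU.search(text))
    if has_old and has_atomic:
        return "mixed"
    if has_old:
        return "pre-5.1"
    if has_atomic:
        return "atomic"
    return "none"


old_doc = "\u0d24\u0d23\u0d32\u0d4d\u200d"  # ends in la + virama + ZWJ
new_doc = "\u0d24\u0d23\u0d7d"              # ends in CHILLU L

print(chillu_style(old_doc))            # pre-5.1
print(chillu_style(new_doc))            # atomic
print(chillu_style(old_doc + new_doc))  # mixed
```

Documents reported as "pre-5.1" or "mixed" would be the candidates for re-mapping; texts without any chillu are unaffected by the version change.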

Input methods


  • The currently approved standard for Malayalam input is Inscript. The keyboard layout for this standard needs timely updates to incorporate changes in the Unicode standard.
  • There are other input methods used by the user community, so the use of Inscript shall not be imposed on users.
  • However, it shall be ensured that all input methods in use are strictly compatible with the latest accepted Unicode standard, at least in public offices.
  • Training in the Malayalam Inscript input method may be included in the curriculum of IT@School and Akshaya, if it is not already there.

Unicode Fonts


  • The use of fonts supporting different versions of Unicode without backward compatibility makes spoofing possible. For example, if a person using Unicode version 5.0 based rendering and fonts tries to access സര്‍ക്കാര്‍.blogspot.com, he will land on സര്ക്കാര്.blogspot.com, and if സര്ക്കാര്.blogspot.com is a spoofing site, that person will be in trouble. Hence it must be ensured that Unicode fonts strictly compatible with the latest accepted standard are used, at least in public offices.
  • A monitoring mechanism may be set up to verify and certify that Unicode fonts strictly adhere to the Unicode standard and to additional guidelines set by the concerned authority.
Jagan Nadh commented on 3rd Nov 2009
----
This was one of the aims behind establishing a computational linguistics team @C-DIT. 
There was a group called COWMAC, formed a couple of years ago, which is now dead. 
This group has to be revived for the same purpose. 
----
  • Certain expressive spellings used in literary documents, such as വരുന്നുുുുു, may be made possible by adding rules in fonts.
  • At present only a limited number of Unicode fonts are available.
  • DAKF may take the initiative to release new Malayalam Unicode fonts in various categories, such as document fonts, decorative fonts and handwriting fonts.
  • The confusion over the use of old lipi and new lipi with the Unicode standard has to be cleared up. The Unicode standard defines a set of glyphs and code points for basic characters, and collation rules. The glyphs of conjuncts are defined in fonts. So whether we get old lipi or new lipi is determined entirely by the fonts used. Under the Unicode standard both old lipi and new lipi are possible.

Software used in Public office


  • A monitoring mechanism may be set up to verify and certify that linguistics software, office software and fonts strictly adhere to the Unicode standard and to additional guidelines set by the concerned authority.
  • The already existing Unicode Encoding Committee of the Govt. of Kerala may be expanded with wider representation for this purpose.

Representation to Unicode Consortium


  • The Government of Kerala may negotiate with the Unicode Consortium for a single-point authority to give recommendations and suggestions on the Malayalam Unicode standard.
Jagan Nadh commented on 3rd Nov 2009
----
At present it happens this way: Unicode sends documents for comments to the Central Govt., 
and the Central Govt. forwards them to the state. 
So even now the state has a role, but it is not direct. 
The intention of the Tamil Nadu govt. in becoming a Unicode Consortium member is different. 
They are arguing for all-character encoding. 
If we can solve all the problems through proper discussion, we can avoid forming a body 
for Unicode in govt. and spending money on membership, travel, etc. 
----
  • This authority may hold periodic or occasional meetings with all interested / concerned parties to formulate such recommendations and suggestions. This body can work in a democratic way.
  • This body can conduct seminars, workshops and open forums to discuss various issues that may arise from time to time.
  • Wiki pages, a discussion forum website and a mailing list may be maintained by this body for collecting feedback from users.
  • This body may set up a help centre to address the concerns raised by Malayalam Unicode users.
  • This body may discuss and finalise the norms for preparing a comprehensive character repertoire, and prepare a comprehensive list of historically justified and linguistically valid glyphs comprising vowels, consonants, signs and conjuncts.

Malayalam Unicode Character Repertoire


  • The very basis of Unicode is to encode only basic characters, each with a single code point (a single Unicode character), and to generate conjuncts by combining these basic characters. Hence it is important that the set of basic characters, namely vowels and consonants, be fixed by an authorized body for Malayalam, instead of on the whims and fancies of certain individuals or groups.
  • The introduction of new characters such as the half form of റ്റ and alveolar ന, now in the Unicode pipeline, is very relevant for digitizing historical documents and hence a welcome step. However, there should be some mechanism to prevent dual rendering of certain words once they are introduced.
Jagan Nadh commented on 3rd Nov 2009
----
I think this is not needed. 
AR used these characters just as examples, and they are not used in our language.
But phonetically they are there. Correct me if I am wrong.
I feel we should not restrict the scope for a closer match between writing and pronunciation, just for current convenience.
----
  • There should be specific norms for conjunct forms, i.e., if two consonants form a conjunct in the language, then it should be treated as a conjunct in Unicode and its visual appearance should be fixed.
  • The order of the above basic characters and glyphs (alphabetical order) should also be standardised.
  • The way consonants join to form conjuncts has to be standardised. The different ways now used in Malayalam are:

    o subjoining the second consonant under the first consonant
    o joining the second consonant on the right of the first consonant
    o combination with complete change of form

  • A consonant forms conjuncts with many other consonants. Standardizing some of those combinations while rejecting others and splitting them with chandrakkala, even though all the combinations are widely used in the language, should be avoided. For example, certain fonts allow സ്ന, സ്ത and സ്ഥ as conjuncts, while സ്പ, സ്ക and സ്മ are rendered as split characters. Those conjuncts which have evolved historically and are found to be in common use should be accepted.
Retrieved from "http://dakf.in/index.php?title=Malayalam_Unicode"