WriteLine unicode?

JoshK(Posted 2015) [#1]
I'm looking to switch all my code over to use Unicode encoding. Is there a way to make WriteLine / ReadLine write two bytes per character?


Brucey(Posted 2015) [#2]
Unicode is not 2 bytes.
It could be sequences of 1-byte units (up to four bytes per character), sequences of 2-byte units (also up to four bytes per character), or a fixed 4 bytes, all depending on which UTF-n you are using.
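
To make those sizes concrete, here is a small sketch in Python (not BlitzMax; the sample characters and the helper name `encoded_sizes` are picked only for illustration). It prints how many bytes one character needs in UTF-8, UTF-16 and UTF-32:

```python
# Sample characters written as escapes so the file's own encoding does not matter.
samples = [
    "A",           # U+0041, Latin capital A
    "\u00e9",      # U+00E9, e with acute accent
    "\u0416",      # U+0416, Cyrillic capital Zhe
    "\u4e2d",      # U+4E2D, a CJK ideograph
    "\U0001F600",  # U+1F600, an emoji outside the 16-bit range
]

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # "-le" variant: no byte order mark is added
        len(ch.encode("utf-32-le")),
    )

for ch in samples:
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

Only UTF-32 has the same width for every sample; UTF-8 ranges from 1 to 4 bytes and UTF-16 is either 2 or 4.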


Mithril(Posted 2015) [#3]
This may help out:

https://en.wikipedia.org/wiki/UTF-8

But this can be a complex issue to deal with, especially on the console.
Also, if you plan on using the Windows headers, remember to add this:

#ifdef _UNICODE
#ifndef UNICODE
#define UNICODE
#endif
#endif

_UNICODE is for the C runtime and MFC, and UNICODE is for the Windows headers. Otherwise you may run into strange errors while compiling.


Henri(Posted 2015) [#4]
BlitzMax doesn't have full Unicode support (it's limited to 2 bytes per character). I'm not sure how hard it would be to implement.

EDIT: Maybe there could be a new String type called UString which would hold variable length UTF-8 encoded characters.

-Henri


Yasha(Posted 2015) [#5]
Simplest approach (probably ridiculously inefficient):

- get byte ptr to string data, create memory buffer for output
- have libiconv convert the string from UCS-2 (or whatever BlitzMax uses internally) to UTF-8
- write bytes
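
The three steps above can be sketched roughly as follows. This is Python rather than BlitzMax, it uses Python's built-in codecs in place of libiconv, and it assumes the string is available as a list of 16-bit code units in little-endian order; the function names are made up for the illustration:

```python
import io

def utf16_units_to_utf8(units):
    """Convert a sequence of 16-bit code units to UTF-8 bytes."""
    # Step 1: lay the units out as a raw byte buffer (little-endian assumed).
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    # Step 2: reinterpret the buffer as UTF-16 and re-encode it as UTF-8.
    return raw.decode("utf-16-le").encode("utf-8")

def write_line_utf8(stream, units):
    """Step 3: write the converted bytes followed by a CR+LF line ending."""
    stream.write(utf16_units_to_utf8(units) + b"\r\n")

# Six Cyrillic letters (a Russian greeting) as 16-bit code units.
units = [0x041F, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442]
out = io.BytesIO()  # stands in for a file stream
write_line_utf8(out, units)
print(len(out.getvalue()))  # 6 letters * 2 bytes each + 2 bytes for CR+LF
```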

You probably want to rethink using a WriteLine/ReadLine oriented approach at all for this, though, if that's what you're doing. Mixing in custom byte formats with whatever else Max is writing will lead to pain down the line as you have to keep track of complications... better to either use a purely binary format (WriteString/ReadString will dump Max's internal representations of the string data in a not-necessarily-readable way; the encoding is irrelevant because it's not meant for others to read), or a purely text format (JSON/XML/etc; have the writer/reader treat it as a single long string to be built/parsed in-memory, so there's only one conversion operation as it's all dumped/read in a single go). If the writing of strings is mixed with the writing of other forms of data (the rest of the Write* commands), Unicode is worthless; nobody who doesn't know the exact format can read the file anyway!


Grisu(Posted 2015) [#6]
Libiconv is a nightmare. Couldn't get it to work with tag ids properly.


JoshK(Posted 2015) [#7]
I don't understand how this is any harder than writing a short instead of a byte.

Local s:String = "InsertAnyRussianOrChineseTextHere"

For Local n:Int=0 To s.length-1
	Print s[n]
Next


Would I ever need any more than two bytes? I just need all languages to work.


Yasha(Posted 2015) [#8]
The point is that fixed-size two-byte characters are not real Unicode. They're the obsolete UCS-2 format; BlitzMax is using them for fast indexing, not because they're necessarily correct. True Unicode is variable-width. There are over a million code points if you want to support all languages. Code points are also not the same thing as characters (characters can have multiple representations, modifiers, etc.).
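
As a rough illustration of why two bytes are not always enough (Python, not BlitzMax; `utf16_units` is a helper invented for this sketch): a code point above U+FFFF does not fit in one 16-bit unit, so UTF-16 stores it as a pair of units, which a fixed-width UCS-2 reader would see as two separate "characters".

```python
def utf16_units(text):
    """Return the list of 16-bit code units UTF-16 uses for the given text."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

bmp_char = "\u4e2d"         # U+4E2D, fits in a single 16-bit unit
astral_char = "\U0001F600"  # U+1F600, lies outside the 16-bit range

print([hex(u) for u in utf16_units(bmp_char)])     # one unit
print([hex(u) for u in utf16_units(astral_char)])  # two units (a surrogate pair)
```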

Doing what you're thinking of will work for most real-world text - kinda has to since Max doesn't support anything bigger. It'll just be interpreted as UTF-16. However, most users and applications expect and handle UTF-8. If you're going to do this, you may as well do it right from the start rather than taking advantage of a hack (in this case, the hack being that BlitzMax's internal representation happens to line up correctly enough to part of a Unicode format that you could just dump it as one and expect it to work 99% of the time). If you do it the "simple" way, you will run into complications later. If you do things a more complicated way, your application has the potential to be able to still emit nice text that renders correctly in web browsers and the like, even if it goes beyond BlitzMax's own ability to manipulate as native strings.


JoshK(Posted 2015) [#9]
Wow, that looks like something a government organization creates to justify their own existence.


Mithril(Posted 2015) [#10]
:) Welcome to the wonderful world of localization. hehehe.


Yasha(Posted 2015) [#11]
Ehh, writing is a hard problem. If you care about it.

Put it this way: are 'a' and 'á' the same letter? BlitzMax says no. Unicode gives you the option of saying yes, and handling things in a way that's closer to the linguistic meaning of the character. There are plenty of situations where this might be useful in applications that are actually oriented towards text.
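
A small sketch of what that option looks like in practice, in Python using the standard `unicodedata` module (not BlitzMax; the accented letter and the helper `strip_marks` are chosen only for illustration). The same accented letter can be stored as one code point or as a base letter plus a combining mark, and normalization lets a program treat the two as equal or drop the mark altogether:

```python
import unicodedata

composed = "\u00e1"     # accented a as one precomposed code point
decomposed = "a\u0301"  # plain "a" followed by COMBINING ACUTE ACCENT

# A plain code-unit comparison treats the two spellings as different strings.
print(composed == decomposed)

# After normalizing both to the same form (NFC) they compare equal.
print(unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed))

def strip_marks(text):
    """Decompose the text, then drop combining marks to get the base letters."""
    parts = unicodedata.normalize("NFD", text)
    return "".join(c for c in parts if not unicodedata.combining(c))

print(strip_marks(composed))
```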

What's your attitude towards this kind of issue? If you reckon you can afford for your application to take the "don't care, use a canonical representation or get out" attitude - which TBH most non-text-oriented applications probably can, if e.g. all you're doing is asking for a filename - then the "simple" handling option is going to cover 99% of your cases and most of the rest will have a resolution, although they might expose some weirdness to the end user (if e.g. they copy-and-paste some non-normalized text). But barring really annoying edge cases, doing it "right" probably isn't strictly necessary for such programs.

If your application is intended to take a serious attitude towards editing/rendering text (like an actual text editor or web browser), then this isn't going to be appropriate for obvious reasons.


Henri(Posted 2015) [#12]
@In general
It all comes down to the user requirement specification. If MaxGUI text controls are needed for I/O, then since Windows controls only support wide strings (2 bytes per character), UTF-8 text would need converting before being displayed in a box.

@Shorts
If WriteLine would be used to write into a text stream, then the easiest way (IMO) would be to imitate the WriteLine function: convert the string to a "short pointer", iterate from 0 to the string length, use WriteShort to write every short to the stream, and then add "carriage return + line feed" shorts. Not sure if LoadText / SaveText would work as well.
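
The shape of that idea, sketched in Python rather than BlitzMax: `struct.pack("<H", ...)` stands in for WriteShort, little-endian byte order is an assumption, and both function names are invented for the sketch. A matching reader is included to show that the line ending also has to be read back as shorts:

```python
import io
import struct

def write_line_shorts(stream, text):
    """Write each 16-bit code unit of text, then CR and LF as 16-bit values."""
    raw = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    for unit in units:
        stream.write(struct.pack("<H", unit))  # one "short" per code unit
    stream.write(struct.pack("<HH", 13, 10))   # carriage return + line feed

def read_line_shorts(stream):
    """Read 16-bit units until a line feed (or end of stream); return the text."""
    units = []
    while True:
        chunk = stream.read(2)
        if len(chunk) < 2:
            break
        (unit,) = struct.unpack("<H", chunk)
        if unit == 10:
            break
        if unit != 13:
            units.append(unit)
    raw = b"".join(struct.pack("<H", u) for u in units)
    return raw.decode("utf-16-le")

buf = io.BytesIO()  # stands in for a file stream
write_line_shorts(buf, "\u4e2d\u6587")
buf.seek(0)
print(read_line_shorts(buf) == "\u4e2d\u6587")
```

A file written this way has no byte order mark, so another program has to be told it is little-endian UTF-16 before it can read it.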

-Henri


JoshK(Posted 2015) [#13]
Thanks, that was a very thoughtful explanation.


impixi(Posted 2015) [#14]
An old article but still relevant:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)


JoshK(Posted 2015) [#15]
Wow. So in other words it's 2015 and we still haven't figured out a reliable way to store text information.


Mithril(Posted 2015) [#16]
If you really want to dig into it, I suppose this is it:

http://www.unicode.org


impixi(Posted 2015) [#17]
In this day and age of cheap, voluminous memory and fast 32-/64-bit wide buses, the obvious thing to do is just encode a character as a 32bit unsigned integer and map every character in every language to an agreed upon unique number. But that would be too easy...


JoshK(Posted 2015) [#18]
It reminds me of Collada, where it has to be complicated for the organization to justify their own activity.


Mithril(Posted 2015) [#19]
@impixi: You are of course right in theory, but the problem isn't really how you choose to store things. It is more about retrieving information from an outside source, and communicating with it.


TomToad(Posted 2015) [#20]
TTextStream will read/write UTF-8 and UTF-16 characters.
The example below writes characters U+1E6C through U+1E71 to Example.txt as a UTF-8 string. Then it reads the file back twice: first as single bytes, to show that UTF-8 encoding is actually used, and a second time as text, to show that it can be read correctly.