Discussion:
How many people need/want to send Unicode files?
Chip
2005-02-16 06:21:35 UTC
I admit that I am slow, sometimes slower than a snail in a blizzard. The
recent link to a Japanese version of Blat, based on version 1.8.2, prompted
me to investigate Unicode files and what it would take to support this
format in the message body file, as opposed to alternate text or other forms
of text input. Tonight I looked in previous messages for any mention of
UTF-8 or Unicode, to gauge how much need there might be for Unicode support.

There is a way to send these files already, by using the -attach option.
However, sometimes this is not very convenient, and has led to some creative
actions to "convert" Unicode to plain text. If Blat could send Unicode
files automagically, how many folks would actually benefit from this?

It appears that all, or nearly all, Latin based languages do not need
Unicode text formatting, but that Microsoft prefers to store double byte
character sets as Unicode. If the message body file, that elusive first
argument, has a Byte Order Marker in the first two or four bytes, Blat could
use this to identify the file as Unicode and convert it to UTF-8 for
transmission.

I have a personal distaste for sending message bodies encoded with base64,
since this often is used to hide spam content from the casual viewer. Blat
is coded in such a way that message bodies do not use base64, this is
reserved for attachments. Due to this limitation, UTF-8 bytes will be sent
as quoted printable, which unfortunately takes more bytes than base64. I
could change the code to create an override condition whereby UTF-8 data
would be sent encoded with base64 so the message is smaller.
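To put rough numbers on the size trade-off between quoted-printable and base64, here is a minimal Python sketch. It is not Blat code; the helper name `encoded_sizes` and the sample strings are made up for illustration. It suggests that base64 wins when nearly every byte is non-ASCII, while quoted-printable stays smaller for mostly-ASCII text.

```python
import base64
import quopri

def encoded_sizes(text: str) -> dict:
    """Return the byte counts of `text` as raw UTF-8, quoted-printable, and base64."""
    raw = text.encode("utf-8")
    return {
        "raw": len(raw),
        "quoted-printable": len(quopri.encodestring(raw)),
        "base64": len(base64.encodebytes(raw)),
    }

# Hypothetical sample bodies.
japanese = "こんにちは、世界。" * 20      # every character takes three UTF-8 bytes
mostly_ascii = "The café is open. " * 20  # one accented letter per sentence

print("japanese:    ", encoded_sizes(japanese))
print("mostly ascii:", encoded_sizes(mostly_ascii))
```

Quoted-printable spends three characters on every non-ASCII byte, whereas base64 spends four characters on every three bytes regardless of content, so the break-even point depends on how much of the body is plain ASCII.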


Chip
--
Homepage:
http://www.blat.net
ykai
2005-02-17 00:06:12 UTC
Post by Chip
I admit that I am slow, sometimes slower than a snail in a blizzard. The
recent link to a Japanese version of Blat, based on version 1.8.2, prompted
me to investigate Unicode files and what it would take to support this
format in the message body file, as opposed to alternate text or other forms
of text input. Tonight I looked in previous messages for any mention of
UTF-8 or Unicode, to gauge how much need there might be for Unicode support.
http://www.imc.org/mail-i18n.html formulates recommendations for MUAs
regarding internationalization.
You might have a look into this, if not already done.
Post by Chip
There is a way to send these files already, by using the -attach option.
Ouch
Post by Chip
However, sometimes this is not very convenient, and has led to some creative
actions to "convert" Unicode to plain text. If Blat could send Unicode
files automagically, how many folks would actually benefit from this?
It appears that all, or nearly all, Latin based languages do not need
Unicode text formatting, but that Microsoft prefers to store double byte
character sets as Unicode. If the message body file, that elusive first
argument, has a Byte Order Marker in the first two or four bytes, Blat could
use this to identify the file as Unicode and convert it to UTF-8 for
transmission.
UTF-8 is also "Unicode". I think you consider UTF-16 as being "more Unicode"
than UTF-8, since Notepad first used this tag (save as... "Unicode" -> UTF-16
file). UTF-16 though is all "binary", admittedly.

UTF-8 has a three-byte Byte Order Mark (BOM)
http://www.unicode.org/faq/utf_bom.html#22

As I understand it, UTF-8 could be used directly with
'Content-type: text/plain; charset="utf-8"'
As far as I can see, no additional content-transfer-encoding is recommended
(relying on 8BITMIME, see ch.5 of above document).
Post by Chip
I have a personal distaste for sending message bodies encoded with base64,
since this often is used to hide spam content from the casual viewer. Blat
is coded in such a way that message bodies do not use base64, this is
reserved for attachments. Due to this limitation, UTF-8 bytes will be sent
as quoted printable, which unfortunately takes more bytes than base64. I
could change the code to create an override condition whereby UTF-8 data
would be sent encoded with base64 so the message is smaller.
UTF-8 will be mainly "clear-text" even when viewed with a non-Unicode-aware
reader, since (7Bit-)ASCII is a proper subset of UTF-8.
(One has a hard job setting up spam messages in languages using the Roman
alphabet with only accented characters and the like to be used for obscurity
:-))
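
A small Python sketch of the "mainly clear-text" point, assuming a reader that treats the bytes as Latin-1 (the sample sentence is invented): the ASCII characters come through untouched and only the accented letter is garbled.

```python
# Hypothetical message text with one non-ASCII character.
text = "Backup for the Zürich office finished: 12 MB"

raw = text.encode("utf-8")           # bytes as they would travel in the mail
legacy_view = raw.decode("latin-1")  # what a non-Unicode-aware reader might display

print(legacy_view)

# Every ASCII character appears unchanged and in the same order.
ascii_survives = (
    [c for c in text if ord(c) < 128] == [c for c in legacy_view if ord(c) < 128]
)
print("ASCII preserved:", ascii_survives)
```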
Phillip Lynch
2005-02-17 00:49:35 UTC
If I really need Unicode content to be received, I always send it as an
attachment, as I can be assured that the attachment won't be changed
during transmission and can be viewed by the recipient using Notepad etc.

However, for the most part, I just want the recipient to be able to read
the text normally. In this situation, I just type the document to another
(eg type backup01.log > backup01.txt) and send the resultant non unicode
document.

In effect, I don't want to be able to send Unicode documents as the
message content through Blat.
Post by Chip
There is a way to send these files already, by using the -attach option.
Chip
2005-02-17 01:12:26 UTC
<snipped>
Post by ykai
Post by Chip
However, sometimes this is not very convenient, and has led to some creative
actions to "convert" Unicode to plain text. If Blat could send Unicode
files automagically, how many folks would actually benefit from this?
It appears that all, or nearly all, Latin based languages do not need
Unicode text formatting, but that Microsoft prefers to store double byte
character sets as Unicode. If the message body file, that elusive first
argument, has a Byte Order Marker in the first two or four bytes, Blat could
use this to identify the file as Unicode and convert it to UTF-8 for
transmission.
UTF-8 is also "Unicode". I think you consider UTF-16 as being "more Unicode"
than UTF-8, since Notepad first used this tag (save as... "Unicode" -> UTF-16
file). UTF-16 though is all "binary", admittedly.
UTF-8 has a three-byte Byte Order Mark (BOM)
http://www.unicode.org/faq/utf_bom.html#22
As I understand it, UTF-8 could be used directly with
'Content-type: text/plain; charset="utf-8"'
As far as I can see, no additional content-transfer-encoding is recommended
(relying on 8BITMIME, see ch.5 of above document).
<snipped the rest>

The use of the terms "Unicode" and "UTF-8" in my message meant different
things. The term Unicode as I used it applies to 16-bit and 32-bit binary
data, while UTF-8 is meant for 8-bit data often suitable for data
communications. I wanted to make the distinction between 8-bit data (UTF-8)
and 16-/32-bit data, because how I treat this data is different. I did not
intend to say Unicode is any more or less Unicode than UTF-8. :)

I am aware of the three byte BOM, but since this already flags the message
as 8-bit data, there is nothing for Blat to do before it can transmit the
data. In other words, there are no binary zeroes that have to be removed or
converted, all extended bytes have already been converted before Blat read
the file. Users need to be aware that if their data is _already_ in UTF-8
format, they need only to use "-charset utf-8" on the command line for the
message to be sent and received as UTF-8 data. If the user's mail server
will accept 8 bit data, by publishing "250-8BITMIME", then Blat could send
the UTF-8 data as-is without converting to quoted-printable, if
the -8bitmime option is also used.
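
The decision described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not Blat's actual code: the function names and the `allow_8bit` parameter (standing in for the -8bitmime option) are hypothetical, and the "7bit" check ignores line-length limits.

```python
def server_extensions(ehlo_lines):
    """Collect extension keywords from EHLO reply lines such as '250-8BITMIME'."""
    keywords = set()
    for line in ehlo_lines:
        rest = line[4:].split()  # drop the '250-' or '250 ' prefix
        if rest:
            keywords.add(rest[0].upper())
    return keywords

def choose_transfer_encoding(ehlo_lines, body: bytes, allow_8bit: bool) -> str:
    """Pick a Content-Transfer-Encoding for an already-UTF-8 body."""
    if all(b < 0x80 for b in body):
        return "7bit"  # pure ASCII needs no encoding (simplified check)
    if allow_8bit and "8BITMIME" in server_extensions(ehlo_lines):
        return "8bit"  # server advertised 8BITMIME and the user opted in
    return "quoted-printable"  # safe fallback for 7-bit transports

# Hypothetical server reply.
ehlo = ["250-mail.example.com", "250-8BITMIME", "250 SIZE 10240000"]
body = "Grüße".encode("utf-8")
print(choose_transfer_encoding(ehlo, body, allow_8bit=True))
print(choose_transfer_encoding(ehlo, body, allow_8bit=False))
```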

The purpose of a -unicode command line option is to identify the message
file specifically as 16- or 32-bit Unicode. If Blat cannot find a 16-bit or
32-bit BOM at the start of the file, then the -unicode option will tell Blat
that it must expect the message to be 16-bit Unicode. If Blat finds a
16-bit or 32-bit BOM, Blat will use the correct data size whether or not
the -unicode option was given, and convert the file to UTF-8.
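
A minimal Python sketch of that detection logic, under stated assumptions: `body_to_utf8` and `unicode_flag` are hypothetical names (the flag stands in for -unicode), and little-endian is assumed when no BOM is present because that is what Windows tools usually write. It is an illustration, not the code going into Blat.

```python
import codecs

# Check the 32-bit BOMs first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def body_to_utf8(data: bytes, unicode_flag: bool = False) -> bytes:
    """Convert a 16-/32-bit Unicode body to UTF-8; pass 8-bit data through."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # A BOM decides the data size whether or not the flag was given.
            return data[len(bom):].decode(encoding).encode("utf-8")
    if unicode_flag:
        # No BOM, but the caller says the file is 16-bit Unicode.
        # Assumption: little-endian byte order.
        return data.decode("utf-16-le").encode("utf-8")
    # Plain 8-bit data (including UTF-8 with its own BOM) is left untouched.
    return data

sample = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
print(body_to_utf8(sample))
```

One known ambiguity in this approach: a UTF-16 LE file whose first character after the BOM is U+0000 would be mistaken for UTF-32 LE. Malformed input raises `UnicodeDecodeError`, which the sketch does not handle.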

One next question is whether or not this support should be a compile time
option, or built in for all to use? I lean to having it available always.
--
Chip
Corp Library
2005-02-17 04:27:26 UTC
One next question is whether or not this support should be a compile time option, or built in for all to use?
If you're collecting votes, my vote(s) <g> are to have this
functionality built-in. I encourage others interested in this
functionality to vote (and vote often!).

Thank you,

Malcolm
Rick Nakroshis
2005-02-17 11:52:36 UTC
Post by Chip
One next question is whether or not this support should be a compile
time option, or built in for all to use?
If you're collecting votes, my vote(s) <g> are to have this
functionality built-in. I encourage others interested in this
functionality to vote (and vote often!).
Vote early, vote often, eh?

I for one see this as an unnecessary complication to the program. Blat
wasn't designed to be everything to everyone, and I'd hate to see the
program get bogged down trying to implement a complicated and rarely used
feature.

Rick
Chip
2005-02-17 13:23:13 UTC
Post by Rick Nakroshis
Post by Chip
One next question is whether or not this support should be a compile
time option, or built in for all to use?
If you're collecting votes, my vote(s) <g> are to have this
functionality built-in. I encourage others interested in this
functionality to vote (and vote often!).
Vote early, vote often, eh?
I for one see this as an unnecessary complication to the program. Blat
wasn't designed to be everything to everyone, and I'd hate to see the
program get bogged down trying to implement a complicated and rarely used
feature.
Rick
It's actually very simple to implement. It only takes effect if the user
specified -unicode on the command line, or if a Byte Order Marker is found
at the start of the message file. Without those two conditions, conversion
to UTF-8 would not happen.

Chip
L.Willms
2005-02-17 13:06:56 UTC
Post by ykai
As I understand it, UTF-8 could be used directly with
'Content-type: text/plain; charset="utf-8"'
which can be specified by the -charset option.

As with other character sets, ISO-8859-1, ISO-8859-2, ISO-8859-15,
KOI8-R or what have you, BLAT is not prepared to analyse the document
and set the character set accordingly.

When I want to send a Unicode-Document, it might already be encoded
as UTF-8 anyway.


Yours,
Lüko Willms
-----------------------------------------------
Frankfurt/Main
Hamilton, Robert L
2005-02-17 12:54:36 UTC
K-I-S-S

bobh

-----Original Message-----
From: Rick Nakroshis [mailto:***@smart.net]
Sent: Thursday, February 17, 2005 5:53 AM
To: ***@yahoogroups.com
Subject: Re: [blat] How many people need/want to send Unicode files?
Post by Chip
One next question is whether or not this support should be a compile
time option, or built in for all to use?
If you're collecting votes, my vote(s) <g> are to have this
functionality built-in. I encourage others interested in this
functionality to vote (and vote often!).
Vote early, vote often, eh?

I for one see this as an unnecessary complication to the program. Blat
wasn't designed to be everything to everyone, and I'd hate to see the
program get bogged down trying to implement a complicated and rarely used
feature.

Rick
Romson Christer
2005-02-17 13:07:14 UTC
One next question is whether or not this support should be a compile time option, or built in for all to use?
I haven't been following this discussion closely. It should of course be possible to have a file with Unicode text in any Unicode encoding and send it as a body with Blat with a text/plain; charset=utf-whatever content type. I suppose the -charset option does that for us today? Blat will look at the binary data, negotiate with the SMTP server and apply a suitable transfer encoding. If I'm mistaken and Blat cannot send such message bodies, then I vote for building this in for everyone to use.

If this is about looking at the file and trying to deduce the charset from the data, then I'm skeptical. It's not realistically possible to look at files and deduce the correct content type. Trying to do so is Bad. You can look up extensions in the registry and find out the type and subtype (.txt -> text/plain), but that doesn't tell us about charsets. For text/plain messages, the charset should always be specified on the command line. Assuming that it's ascii, windows-1252, latin/1 or anything else is not keeping things clean and simple, it's keeping them incorrect!

Christer Romson
Chip
2005-02-17 13:53:49 UTC
Post by Romson Christer
Post by Chip
One next question is whether or not this support should be a compile
time option, or built in for all to use?
I haven't been following this discussion closely. It should of course be
possible to have a file with unicode text in any unicode encoding and send
it as a body with Blat with a text/plain; charset=utf-whatever content
type. I suppose the -charset does that for us today? Blat will look at the
binary data, negotiate with the smtp server and apply a suitable transfer
encoding. If I'm mistaken and Blat cannot send such message bodies, then I
vote for building this in for everyone to use.
If this is about looking at the file and trying to deduce the charset from
the data, then I'm skeptical. It's not realistically possible to look at
files and deduce the correct content type. Trying to do so is Bad. You can
look up extensions in the registry and find out the type and subtype
(.txt -> text/plain), but that doesn't tell us about charsets. For
text/plain messages, the charset should always be specified on the command
line. Assuming that it's ascii, windows-1252, latin/1 or anything else is
not keeping things clean and simple, it's keeping them incorrect!
Christer Romson
This is not about trying to deduce any character sets used in Unicode
files. The character set is not actually available, the very bytes
themselves define the character set. This is all about taking 16- and
32-bit 'characters' and transmitting them in a format suitable for 7-/8-bit
email without the need for user action.

My initial crack at writing this code is to convert 16-/32-bit Unicode data
to UTF-8 data before any connection to the server is established. Tonight,
or this weekend, I will move my initial coding to a separate source file so
the message body can be converted to either UTF-8 or UTF-7 as dictated by
the server's 250- response.
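
One way to read that plan, sketched in Python: pick UTF-8 when the server's 250- lines advertise 8BITMIME, and fall back to UTF-7 (which is pure 7-bit ASCII) otherwise. The function name `encode_for_server` and the sample replies are hypothetical, and this is only a guess at the intended behaviour. UTF-8 sent as quoted-printable would be another valid route to a 7-bit server.

```python
def encode_for_server(text: str, ehlo_lines):
    """Return (charset, body_bytes) chosen from the server's EHLO reply lines."""
    supports_8bit = any(
        line[4:].strip().upper().startswith("8BITMIME")  # skip '250-' / '250 '
        for line in ehlo_lines
    )
    charset = "utf-8" if supports_8bit else "utf-7"
    return charset, text.encode(charset)

# Hypothetical replies from two different servers.
modern = ["250-mail.example.com", "250 8BITMIME"]
legacy = ["250 mail.example.com"]

for reply in (modern, legacy):
    charset, body = encode_for_server("日本語 text", reply)
    print(charset, body)
```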

Chip
Romson Christer
2005-02-17 14:27:39 UTC
One next question is whether or not this support should be a compile time option, or built in for all to use?
I haven't been following this discussion closely. If this is about looking at the file and trying to deduce the charset from the data, then I'm skeptical.
This is not about trying to deduce any character sets used in Unicode files. The character set is not actually available, the very bytes themselves define the character set. This is all about taking 16- and 32-bit 'characters' and transmitting them in a format suitable for 7-/8-bit email without the need for user action.
So you have a byte stream, and want to transfer it on top of a line oriented system designed for 7-bit data with a maximum of 1024 7-bit characters per line? And some of the implementations are known to be broken? But sending letters, digits and a few punctuation characters, (no more than 60 or so per line) is known to be safe almost all of the time?

You could have a system where you encode every byte with hex value as two 7-bit 'characters' and add some kind of line continuation character to handle the line length limit. This should be easily decoded by the receiving end. You'd have to publish a specification and hope that it gets wide acceptance. Actually, someone might already have done that? Maybe we could re-use their work?
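
The scheme described here already exists as the MIME quoted-printable transfer encoding. A short Python sketch using the standard `quopri` module (the sample text is invented) shows the "=XX" hex escapes and the "=" soft line breaks:

```python
import quopri

# Hypothetical 8-bit payload: UTF-8 text with accented characters.
data = " ".join(["Grüße"] * 30).encode("utf-8")

encoded = quopri.encodestring(data)
lines = encoded.split(b"\n")

print(lines[0].decode("ascii"))  # hex escapes such as =C3=BC, ending in '='
print("longest line:", max(len(line) for line in lines))
print("7-bit only:", all(b < 0x80 for b in encoded))
print("round trip:", quopri.decodestring(encoded) == data)
```

Each non-ASCII byte becomes "=" followed by two hex digits, and a trailing "=" marks a line break that the receiving end removes again.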
My initial crack at writing this code is to convert 16-/32-bit Unicode data to UTF-8 data before any connection to the server is established. I will move my initial coding so the message body can be converted to either UTF-8 or UTF-7 as dictated by the server's 250- response.
I fail to see why Blat should convert the data to UTF-8. Treat it as a byte stream, and use the 250 response to convert it to 8-bit mime or quoted printable. This sounds like exactly the kind of problem that Transfer Encodings are meant to solve.

Christer Romson
c***@att.net
2005-02-18 01:28:13 UTC
Post by L.Willms
Post by ykai
As I understand it, UTF-8 could be used directly with
'Content-type: text/plain; charset="utf-8"'
which can be specified by the -charset option.
As with other character sets, ISO-8859-1, ISO-8859-2, ISO-8859-15,
KOI8-R or what have you, BLAT is not prepared to analyse the document
and set the character set accordingly.
When I want to send a Unicode-Document, it might already be encoded
as UTF-8 anyway.
Yours,
Lüko Willms
Correct, Blat is not able to analyse messages for character sets. Unicode files do not specify a character set internally; the characters themselves do that by their values.

If your Unicode document is already UTF-8 or UTF-7, then you do not need to use a -unicode command line option. Instead, you _should_ use -charset utf-8 or -charset utf-7, depending on which format your document is in. As you wrote "might already be encoded", this means that it may or may not be UTF-8, in which case allowing Blat to make that determination would be in your best interest.

Windows system logs might be in 16-bit Unicode format. Certainly regedit outputs in 16-bit format. If someone wanted to email those files, and was not aware these are in Unicode format, current versions of Blat will not handle these files as the message body. The proposal is for Blat to examine only the first four bytes to determine if the message is Unicode 16-bit or 32-bit, and convert this message to UTF-8 or UTF-7 as necessary so the file can be transmitted.

Chip