As programs increasingly handle multi-byte text, and in particular non-ASCII text such as Chinese, determining a file's character encoding becomes a real problem. Most byte and character streams decode with one fixed default encoding, typically the platform (or source-file) encoding: if that default is UTF-8, text is decoded as UTF-8. When a file in any other encoding is read without its charset being specified, the result is easily mojibake.
The usual remedy is the decorator constructors
InputStreamReader(InputStream in, String charsetName)
or
OutputStreamWriter(OutputStream out, String charsetName)
which let the caller pass the charset explicitly. This does not discover the file's encoding automatically; it only works when the caller already knows the encoding and supplies it.
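As a minimal sketch of the explicit-charset decorators (the class and method names `CharsetDecorators` and `roundTrip` are illustrative, not from any library): text written through an `OutputStreamWriter` with one charset can only be read back correctly by an `InputStreamReader` given the same charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class CharsetDecorators {
    // Round-trip a string through a byte stream, passing the charset
    // to the reader/writer decorators explicitly.
    public static String roundTrip(String text, String charsetName) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, charsetName)) {
            w.write(text);
        }
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes.toByteArray()), charsetName))) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        String s = "中文测试";
        // GBK encodes these characters in 2 bytes each; UTF-8 would use 3.
        System.out.println(roundTrip(s, "GBK").equals(s)); // true
    }
}
```

If the reader were constructed with a different charset than the writer (say, decoding GBK bytes as UTF-8), the returned string would be garbled, which is exactly the failure mode described above.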
For Chinese text the practical distinction is between the GBK family (GB2312, GB18030) and the UTF family. UTF-encoded files often announce their encoding in their first few bytes: the file starts with a BOM (Byte Order Mark) that identifies the encoding. The common BOMs are:
BOMs:
- 00 00 FE FF = UTF-32, big-endian
- FF FE 00 00 = UTF-32, little-endian
- EF BB BF = UTF-8
- FE FF = UTF-16, big-endian
- FF FE = UTF-16, little-endian
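The table above can be turned directly into a detection function; this is a standalone sketch (the class name `BomTable` is illustrative) of the same matching logic the wrappers below use. Note that the 4-byte FF FE 00 00 (UTF-32LE) pattern must be tested before the 2-byte FF FE (UTF-16LE) pattern, since the latter is a prefix of the former.

```java
public class BomTable {
    // Return the charset named by a leading BOM, or null when no BOM is present.
    public static String detect(byte[] head) {
        if (head.length >= 4 && head[0] == 0x00 && head[1] == 0x00
                && head[2] == (byte) 0xFE && head[3] == (byte) 0xFF) return "UTF-32BE";
        // UTF-32LE must be checked before UTF-16LE: FF FE is a prefix of FF FE 00 00.
        if (head.length >= 4 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE
                && head[2] == 0x00 && head[3] == 0x00) return "UTF-32LE";
        if (head.length >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) return "UTF-8";
        if (head.length >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) return "UTF-16BE";
        if (head.length >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) return "UTF-16LE";
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF})); // UTF-8
        System.out.println(detect(new byte[]{0x61, 0x62, 0x63})); // null (plain ASCII, no BOM)
    }
}
```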
Below are two stream wrappers that perform this detection automatically. The first is a character-based Reader:
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

/**
 * http://www.unicode.org/unicode/faq/utf_bom.html
 * BOMs:
 *   00 00 FE FF = UTF-32, big-endian
 *   FF FE 00 00 = UTF-32, little-endian
 *   EF BB BF    = UTF-8
 *   FE FF       = UTF-16, big-endian
 *   FF FE       = UTF-16, little-endian
 *
 * Win2k Notepad:
 *   Unicode format = UTF-16LE
 *
 * @author Semantic Wang
 */
public class UnicodeReader extends Reader {

    private static final int BOM_SIZE = 4;

    private final PushbackInputStream pbin;
    private final String defaultEnc;
    private InputStreamReader reader = null;

    /**
     * @param in inputstream to be read
     */
    public UnicodeReader(InputStream in) {
        this(in, "GBK");
    }

    /**
     * @param in inputstream to be read
     * @param defaultEnc default encoding if stream does not have BOM marker.
     *        Give NULL to use the system default.
     */
    public UnicodeReader(InputStream in, String defaultEnc) {
        pbin = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    /**
     * Get stream encoding or NULL if stream is uninitialized. Call init() or
     * read() method to initialize it.
     */
    public String getEncoding() {
        if (reader == null)
            return null;
        return reader.getEncoding();
    }

    /**
     * Read ahead four bytes and check for BOM marks. Extra bytes are unread
     * back to the stream; only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (reader != null)
            return;

        byte[] bom = new byte[BOM_SIZE];
        int n = pbin.read(bom, 0, bom.length);
        String encoding;
        int unread;

        if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB)
                && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // Unicode BOM mark not found, unread all bytes
            encoding = defaultEnc;
            unread = n;
        }

        if (unread > 0)
            pbin.unread(bom, (n - unread), unread);

        // Use given encoding; null means the platform default
        if (encoding == null) {
            reader = new InputStreamReader(pbin);
        } else {
            reader = new InputStreamReader(pbin, encoding);
        }
    }

    public void close() throws IOException {
        init();
        reader.close();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        init();
        return reader.read(cbuf, off, len);
    }
}
The second is a byte-based InputStream:
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * @author Semantic Wang
 */
public class UnicodeInputStream extends InputStream {

    private static final int BOM_SIZE = 4;

    private final PushbackInputStream pbin;
    private final String defaultEnc;
    private boolean isInited = false;
    private String encoding;

    public UnicodeInputStream(InputStream in) {
        this(in, "GBK");
    }

    public UnicodeInputStream(InputStream in, String defaultEnc) {
        pbin = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    public String getEncoding() {
        if (!isInited) {
            try {
                init();
            } catch (IOException ex) {
                IllegalStateException ise = new IllegalStateException(
                        "Init method failed.");
                ise.initCause(ex); // was initCause(ise): a self-cause that discarded the IOException
                throw ise;
            }
        }
        return encoding;
    }

    /**
     * Read ahead four bytes and check for BOM marks. Extra bytes are unread
     * back to the stream; only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (isInited)
            return;

        byte[] bom = new byte[BOM_SIZE];
        int n = pbin.read(bom, 0, bom.length);
        int unread;

        if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB)
                && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // Unicode BOM mark not found, unread all bytes
            encoding = defaultEnc;
            unread = n;
        }

        if (unread > 0)
            pbin.unread(bom, (n - unread), unread);

        isInited = true;
    }

    public void close() throws IOException {
        init();
        pbin.close();
    }

    public int read() throws IOException {
        // init() must run before the first read, otherwise the BOM bytes
        // are returned to the caller instead of being skipped.
        init();
        return pbin.read();
    }
}
Finally, they are used like this:
InputStream in = new FileInputStream(fileName);
BufferedReader reader = new BufferedReader(new UnicodeReader(in));
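To see why the wrapper matters, the following self-contained sketch (file name and class name are illustrative) writes a UTF-8 file with a BOM and reads it back with a plain InputStreamReader. Java's UTF-8 decoder does not strip the BOM, so the three bytes EF BB BF decode to a leading U+FEFF character that leaks into the text; UnicodeReader exists to consume those bytes before decoding starts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomRoundTrip {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("bom-demo", ".txt");
        // Write a UTF-8 BOM followed by UTF-8 encoded Chinese text,
        // the layout Windows Notepad produces for "UTF-8" files.
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] body = "中文".getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = Files.newOutputStream(f)) {
            out.write(bom);
            out.write(body);
        }
        // A plain InputStreamReader does not skip the BOM: the first
        // decoded char is U+FEFF, not the first real character.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                Files.newInputStream(f), StandardCharsets.UTF_8))) {
            String line = r.readLine();
            System.out.println(line.charAt(0) == '\uFEFF'); // true
        }
        Files.delete(f);
    }
}
```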