I create the following table in an UTF8 character set Oracle database.
create table t_charset(col1 varchar2(10), col2 varchar2(20));
The I insert two rows under different NLS_LANG settings. The first column is the client NLS_LANG setting, the second column contains a Chinese word (same value for both two rows, two chars in GBK encoding), and then check the physical storage with Oracle DUMP function.
SQL> col col2 format a40
SQL> select col1, dump(col2) col2 from t_charset;
COL1 COL2
---------- ----------------------------------------
ZHS16GBK Typ=1 Len=6: 228,184,173,229,155,189
UTF8 Typ=1 Len=4: 214,208,185,250
Be cause we want to identify out which rows are not stored as UTF8 encoding, Oracle csscan can do this work, but I will use "
SQL> SELECT CONVERT(COL2,'UTF8','UTFE') FROM T_CHARSET WHERE COL1='UTF8';
*
ERROR at line 1:
ORA-12703: this character set conversion is not supported
For the second row, every thing is ok.
SQL> SELECT CONVERT(COL2,'UTF8','UTFE') FROM T_CHARSET WHERE COL1='ZHS16GBK';
CONVERT(COL2,'UTF8','UTFE')
-----------------------------------------------
......
So Oracle csscan just scan all the character values and try to do the conversion, if error then record the table name and ROWID into it's repository tables. After csscan run, you can get all the bad rows and do a fix job. If I use csscan in this case, it will identify out the second row for me.
