[mysql] What is collation?

2024. 6. 10. 12:16

What is collation?

String data types in a database (CHAR, VARCHAR, TEXT, and so on) have two attributes: a character set and a collation.

A character set is the set of rules for how each character is encoded when it is stored on the computer,

and a collation is the set of rules applied to values stored under a particular character set when

comparing values in a search (WHERE clause),
comparing characters with each other for sorting (ORDER BY) and similar operations,
and building indexes.

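For example,

for the int type, 123 < 345 is an unambiguous comparison, and
for the date type, 2013-01-01 < 2022-01-01 is just as clear,

but for strings it is not obvious

whether '가' or '나' is greater,
whether 'a' or 'A' is greater, or
how '가' and 'ㄱㅏ' should be compared.

A collation can be thought of as the agreed set of answers to these questions.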
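
The three questions above can be made concrete outside the database. The following is a minimal Python sketch, not MySQL's implementation: it assumes plain code-point comparison as a stand-in for a binary collation and `str.casefold` as a stand-in for a case-insensitive one. It also assumes the decomposed form of '가' is written with conjoining jamo (U+1100 U+1161), which Unicode normalization can compose.

```python
import unicodedata

# Plain code-point comparison: '가' (U+AC00) sorts before '나' (U+B098).
print('가' < '나')

# Without extra rules 'A' (U+0041) is smaller than 'a' (U+0061) ...
print('A' < 'a')
# ... while a case-insensitive rule treats the two as equal.
print('a'.casefold() == 'A'.casefold())

# '가' as one precomposed syllable vs. a sequence of conjoining jamo.
composed = '\uac00'          # 가
decomposed = '\u1100\u1161'  # assumed decomposed spelling: choseong kiyeok + jungseong a

# Code point by code point the two strings differ ...
print(composed == decomposed)
# ... but they are equal once the jamo sequence is composed (NFC).
print(unicodedata.normalize('NFC', decomposed) == composed)
```

Which of these answers applies in a query is exactly what the chosen collation decides.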

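Representative collation types

utf8mb4_bin
- Sorts by the stored binary value: the encoded bytes of each character are compared directly, with no linguistic rules applied
- A is 41 and a is 61 (hex), so in ascending order A~Z sorts before a~z.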
  - SELECT HEX(WEIGHT_STRING('A' collate utf8mb4_bin)) as 'A', HEX(WEIGHT_STRING('a' collate utf8mb4_bin)) as 'a' FROM dual;
utf8mb4_general_ci
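- A simple, fast collation
- Does not handle all of Unicode correctly, but is widely used in practice
  - Characters outside the Basic Multilingual Plane (BMP) may sort incorrectly
  - The commonly used Chinese (C), Japanese (J), and Korean (K) characters, collectively CJK, are inside the BMP, so it is also widely used in Korea
- Case insensitive: A is 41 and a is also 41, so A and a compare as equal and sort interleaved.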
  - SELECT HEX(WEIGHT_STRING('A' collate utf8mb4_general_ci)) as 'A', HEX(WEIGHT_STRING('a' collate utf8mb4_general_ci)) as 'a' FROM dual;
utf8mb4_unicode_ci
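- Follows the Unicode Collation Algorithm (UCA); it applies many more rules, so it is slower than utf8mb4_general_ci
  - In Korean, English, Chinese, and Japanese environments it generally produces the same result as utf8mb4_general_ci
  - The sort order of less common characters can differ
- Case insensitive: A is 0E33 and a is also 0E33, so A and a compare as equal and sort interleaved.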
  - SELECT HEX(WEIGHT_STRING('A' collate utf8mb4_unicode_ci)) as 'A', HEX(WEIGHT_STRING('a' collate utf8mb4_unicode_ci)) as 'a' FROM dual;
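
The difference between a `_bin` ordering and a `_ci` ordering can be sketched in Python. This is an illustration under assumptions, not MySQL's weight tables: sorting on the raw UTF-8 bytes stands in for utf8mb4_bin, and sorting on a case-folded key stands in for a case-insensitive collation.

```python
words = ['b', 'A', 'a', 'B']

# Binary-style ordering: compare the UTF-8 encoded bytes directly.
binary_order = sorted(words, key=lambda s: s.encode('utf-8'))

# Case-insensitive-style ordering: compare a case-folded key instead.
# Python's sort is stable, so words with equal keys keep their input order.
ci_order = sorted(words, key=str.casefold)

print('A'.encode('utf-8').hex(), 'a'.encode('utf-8').hex())  # 41 61
print(binary_order)  # ['A', 'B', 'a', 'b']
print(ci_order)      # ['A', 'a', 'b', 'B']
```

In the case-insensitive result 'A' and 'a' have the same key, so their relative order depends only on the input order, which matches the "interleaved" behavior described above.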

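How to read a collation name, briefly

ex. utf8mb4_0900_ai_ci

utf8mb4 : each character takes 1~4 bytes (mb4 : up to 4 bytes supported); a word for a region or language may follow immediately to make the collation more specific
- utf8mb3 stores at most 3 bytes per character and is deprecated in MySQL 8
0900 : follows the Unicode Collation Algorithm (UCA) version 9.0.0 standard
- Added in MySQL 8; applies finer-grained sorting rules
ai : accent insensitive (accent sensitivity could not be specified in earlier versions; added in MySQL 8.0)
ci : case insensitive (upper and lower case are not distinguished)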
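
The "1~4 bytes" in utf8mb4 is a property of the UTF-8 encoding itself, so it can be checked in Python without a database. A small sketch, with sample characters chosen only for illustration:

```python
# One sample character per UTF-8 length; escapes avoid any ambiguity in the source file.
samples = [
    'A',           # U+0041, ASCII
    '\u00e9',      # U+00E9, Latin small letter e with acute
    '\uac00',      # U+AC00, Hangul syllable GA
    '\U0001F600',  # U+1F600, an emoji outside the BMP
]

for ch in samples:
    encoded = ch.encode('utf-8')
    in_bmp = ord(ch) <= 0xFFFF  # the BMP is the range U+0000..U+FFFF
    print(f'U+{ord(ch):04X} bytes={len(encoded)} in_bmp={in_bmp}')

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]
```

Only the last character needs 4 bytes, and it is also the only one outside the BMP, which is why a 3-byte character set such as utf8mb3 cannot store it.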

The following suffixes can also appear:

Suffix	Meaning	Notes
_ai	Accent-insensitive	added in MySQL 8
_as	Accent-sensitive	added in MySQL 8
_ci	Case-insensitive
_cs	Case-sensitive
_ks	Kana-sensitive	added in MySQL 8; whether Japanese hiragana and katakana are distinguished
_bin	Binary

Since MySQL 8.0.1, utf8mb4_0900_ai_ci is the default collation.

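The pad attribute of collations, added in MySQL 8

Earlier versions always used PAD SPACE; MySQL 8 adds a NO PAD attribute
- Most collations are still PAD SPACE; the collations based on UCA 9.0.0 (such as utf8mb4_0900_ai_ci) are NO PAD
With NO PAD, spaces at the end of a string are included in the comparison (on the premise that trailing spaces are meaningful); with PAD SPACE they are ignored
Comparison and sorting may therefore not behave as intended, so check which pad attribute the collation in use has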
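
The effect of the two pad attributes on an equality check can be sketched as follows. The helper names `equal_pad_space` and `equal_no_pad` are hypothetical and only model the rule for trailing spaces; they do not reproduce the rest of a collation's behavior.

```python
def equal_pad_space(a: str, b: str) -> bool:
    """PAD SPACE-style comparison: trailing spaces are not significant."""
    return a.rstrip(' ') == b.rstrip(' ')


def equal_no_pad(a: str, b: str) -> bool:
    """NO PAD-style comparison: trailing spaces count like any other character."""
    return a == b


print(equal_pad_space('abc', 'abc  '))  # True
print(equal_no_pad('abc', 'abc  '))     # False
```

A value stored with trailing spaces can therefore match a lookup under one collation and miss under another.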

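The collation can differ from one connection to the next

Depending on the JDBC connector version, the collation of each connection can differ
In Workbench too, it can vary under outside influence unless it is set explicitly
So it is best either to fix the collation at the connector level or to run the statement below on connect, so that the session always uses the same collation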

-- Set the character set and collation for the session
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;

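Connecting with a different collation can cause errors when storing data

The error below can occur when doing any of the following with strings:

joining tables,
comparing or filtering values, or
performing string operations (concat, etc.)

Error Code: 1267. Illegal mix of collations (utf8_general_ci, IMPLICIT) and (utf8_unicode_ci, IMPLICIT) for operation '='

This happens because the collation of the local (connection) settings differs from the server's, or because the collations set on the tables, columns, and so on differ from each other, so the collations need to be checked.

~~This should never happen, but~~ if the tables or columns do have different collations, specify the collation in the join condition as below.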

SELECT *
FROM table1
JOIN table2 ON table1.name = table2.description COLLATE utf8_general_ci;

How to check the collation values of the database currently in use

mysql8

show variables where variable_name like '%collation%';

Variable_name                |Value             |
-----------------------------+------------------+
collation_connection         |utf8mb4_0900_ai_ci|
collation_database           |utf8mb4_general_ci|
collation_server             |utf8mb4_general_ci|
default_collation_for_utf8mb4|utf8mb4_0900_ai_ci|


# List all available collations
SHOW COLLATION WHERE Charset = 'utf8mb4';


Collation                 |Charset|Id |Default|Compiled|Sortlen|Pad_attribute|
--------------------------+-------+---+-------+--------+-------+-------------+
utf8mb4_0900_ai_ci        |utf8mb4|255|Yes    |Yes     |      0|NO PAD       |
utf8mb4_0900_as_ci        |utf8mb4|305|       |Yes     |      0|NO PAD       |
utf8mb4_0900_as_cs        |utf8mb4|278|       |Yes     |      0|NO PAD       |
utf8mb4_0900_bin          |utf8mb4|309|       |Yes     |      1|NO PAD       |
utf8mb4_bg_0900_ai_ci     |utf8mb4|318|       |Yes     |      0|NO PAD       |
... (omitted)

mysql5

show variables where variable_name like '%collation%';

Variable_name       |Value             |
--------------------+------------------+
collation_connection|utf8mb4_general_ci|
collation_database  |utf8mb4_general_ci|
collation_server    |utf8mb4_general_ci|


# List all available collations
SHOW COLLATION WHERE Charset = 'utf8mb4';

Collation             |Charset|Id |Default|Compiled|Sortlen|
----------------------+-------+---+-------+--------+-------+
utf8mb4_general_ci    |utf8mb4| 45|Yes    |Yes     |      1|
utf8mb4_bin           |utf8mb4| 46|       |Yes     |      1|
utf8mb4_unicode_ci    |utf8mb4|224|       |Yes     |      8|
utf8mb4_icelandic_ci  |utf8mb4|225|       |Yes     |      8|

How to change the collation

1) Database level

ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

2) Table level

ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci ;

3) Column level

ALTER TABLE my_table MODIFY name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

4) Session level

SET collation_connection = 'utf8mb4_general_ci' ;

5) Query level (the same applies to INSERT statements)

SELECT 
    @_now := now(),
    @ver := '1.17.0' collate utf8mb4_general_ci , 
    @domain := 'my domain' collate utf8mb4_general_ci , 
    @port_ := '1234'
;

6) Server level

Either pass the --character-set-server or --collation-server option when starting the MySQL server,
or set the option values in the server.cnf file

[mysqld]
character-set-server=latin1
collation-server=latin1_swedish_ci

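Conclusion

Even within the same character set, the collation determines things such as whether upper and lower case are distinguished in English, whether hiragana and katakana are distinguished in Japanese, and whether Hangul consonants and combined characters are distinguished
This in turn affects sorting accuracy and search speed
When upgrading from MySQL 5 to 8, keep in mind that the collation settings can affect sorting and related behavior
utf8mb4_0900_ai_ci, the default collation in MySQL 8.0, uses UTF-8 with up to 4 bytes per character, follows the UCA 9.0.0 rules, and does not distinguish accents, case, hiragana and katakana, or Hangul consonants and combined characters

See the official documentation for each version for the details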

https://dev.mysql.com/doc/refman/8.4/en/charset.html

https://dev.mysql.com/doc/refman/5.7/en/charset.html
