271. Encode and Decode Strings

job 인터뷰/코테(Matroid) 준비

271. Encode and Decode Strings

hyunkookim 2025. 4. 17. 04:54

Design an algorithm to encode a list of strings to a string. The encoded string is then sent over the network and is decoded back to the original list of strings.

Machine 1 (sender) has the function:

string encode(vector<string> strs) {
  // ... your code
  return encoded_string;
}

Machine 2 (receiver) has the function:

vector<string> decode(string s) {
  //... your code
  return strs;
}

So Machine 1 does:

string encoded_string = encode(strs);

and Machine 2 does:

vector<string> strs2 = decode(encoded_string);

strs2 in Machine 2 should be the same as strs in Machine 1.

Implement the encode and decode methods.

You are not allowed to solve the problem using any serialize methods (such as eval).

Example 1:

Input: dummy_input = ["Hello","World"]
Output: ["Hello","World"]
Explanation:
Machine 1:
Codec encoder = new Codec();
String msg = encoder.encode(strs);
Machine 1 ---msg---> Machine 2

Machine 2:
Codec decoder = new Codec();
String[] strs = decoder.decode(msg);

Example 2:

Input: dummy_input = [""]
Output: [""]

Constraints:

1 <= strs.length <= 200
0 <= strs[i].length <= 200
strs[i] contains any possible characters out of 256 valid ASCII characters.

Follow up: Could you write a generalized algorithm to work on any possible set of characters?

✅ 문장 1:

"strs[i] contains any possible characters out of 256 valid ASCII characters."

👉 해석:
strs[i]는 총 256개의 모든 ASCII 문자 중 어떤 것이든 포함할 수 있다.

즉, 알파벳, 숫자, 공백뿐만 아니라
줄바꿈 (\n), 탭 (\t), 심지어 #, :, @ 같은 특수문자도 포함될 수 있다는 뜻이야.

👉 중요한 이유:
encode()에서 어떤 특정 문자(예: #)를 구분자로 쓰면,
그 문자가 원래 문자열에 들어 있을 수 있기 때문에 문자열 경계를 잘못 해석할 위험이 있음.

그래서 길이 + 구분자 + 문자열 방식이 안전한 거야.
길이를 보고 어디까지가 문자열인지 정확히 알 수 있거든.

✅ 문장 2:

"Follow up: Could you write a generalized algorithm to work on any possible set of characters?"

👉 해석:
추가 과제:
"모든 문자 집합에서도 잘 작동하는 일반화된 알고리즘을 만들 수 있겠어?"

예를 들어:

Unicode 문자 (이모지, 한글, 한자 등 포함)
ASCII 외 다른 문자셋
바이너리 값 등도 처리할 수 있는 인코딩/디코딩 방식

👉 요점:

지금 방식은 ASCII 문자(1바이트 단위)에서는 잘 작동하지만,
모든 언어/문자셋/유니코드까지 고려하면 더 범용적인 방식을 생각해봐야 할 수도 있다는 이야기야.

내 코드

class Codec:
    def encode(self, strs: List[str]) -> str:
        """Encodes a list of strings to a single string.
        """
        res = ""
        for s in strs:
            res += str(len(s))+"#"+s
        # print(strs)
        # print(res)
        return res
        

    def decode(self, s: str) -> List[str]:
        """Decodes a single string to a list of strings.
        """
        # print(s)
        res = []
        l = r= 0
        
        while r < len(s):
            if s[r] == "#":
                length = int(s[l:r])                
                res.append(s[r+1:r+1+length])
                l = r = r+1+length
            r +=1
        return res
    


# Your Codec object will be instantiated and called as such:
# codec = Codec()
# codec.decode(codec.encode(strs))

솔루션

✅ 수정 버전 (동작 동일, 구조 개선):

class Codec:
    def encode(self, strs: List[str]) -> str:
        # 문자열 길이 + 구분자 + 문자열을 붙여서 하나의 문자열로 만든다
        return ''.join(f"{len(s)}#{s}" for s in strs)

    def decode(self, s: str) -> List[str]:
        res = []
        i = 0
        while i < len(s):
            # '#' 위치를 찾아서 길이 부분 추출
            j = i
            while s[j] != '#':
                j += 1
            length = int(s[i:j])
            j += 1  # '#' 다음 위치로 이동
            res.append(s[j:j+length])
            i = j + length  # 다음 문자열 시작점으로 이동
        return res

🔍 동작 예시:

codec = Codec()
encoded = codec.encode(["Hello", "World"])
print(encoded)  # 출력 예: 5#Hello5#World

decoded = codec.decode(encoded)
print(decoded)  # 출력: ['Hello', 'World']

✅ 네 코드도 잘 작동하고 있고, 아래와 같은 입력도 다 잘 처리돼:

빈 문자열 포함: [""]
특수 문자 포함: ["he#llo", "wo#rld"]
문자열 길이 0~200

🔍 입력 문자열에 # 이 포함된 예시:

✅ 테스트 예제:

strs = ["a#b", "##", "c"]

각 문자열엔 #이 실제로 내용 중에 포함돼 있어.

✅ encode 결과:

Codec.encode(strs)는 이렇게 동작해:

"a#b" → 길이 3 → "3#a#b"  
"##"  → 길이 2 → "2###"  
"c"   → 길이 1 → "1#c"

최종 인코딩 결과:
"3#a#b2###1#c"

✅ decode 단계에서 코드 작동 방식:

이제 이 인코딩된 문자열 "3#a#b2###1#c"을 디코드 해볼게.

s = "3#a#b2###1#c"

코드 흐름:

i = 0, j를 이동시켜서 첫 # 위치 찾음 → s[0:1:2] = "3" → 길이 3
- 다음 3글자 추출: s[2+1 : 2+1+3] = "a#b"
- 결과에 "a#b" 추가 → res = ["a#b"]
i = 6, s[6:7] = "2" → 길이 2
- 다음 2글자 추출: s[8:10] = "##"
- 결과에 "##" 추가 → res = ["a#b", "##"]
i = 10, s[10] = "1" → 길이 1
- 다음 1글자 추출: s[12:13] = "c"
- 결과에 "c" 추가 → res = ["a#b", "##", "c"]

✅ 최종 출력:

decoded = ["a#b", "##", "c"]

즉, 입력과 완전히 동일하게 복원됨 ✅

🔑 왜 잘 작동하냐면?

우리가 #을 구분자로 쓰지만,
항상 그 앞에 있는 **"길이 정보"**를 먼저 읽기 때문에,
#이 문자열 안에 있어도 상관없어.
- 어디까지가 한 문자열인지 "길이"로 정확히 잘라내니까!

✅ Unicode-safe Encode/Decode 버전 (모든 문자셋 대응 가능)

이번엔 유니코드 문자 (예: 이모지, 한글, 한자 등)도 안전하게 인코딩/디코딩할 수 있는 버전을 알려줄게.
아까 ASCII 기반 버전은 str(len(s)) + '#' + s 형식이라서, 모든 문자열을 직접 다루었지.
하지만 완전히 일반화된 버전을 원하면, 바이트 처리(byte-safe) 방식으로 가는 게 좋아.

class Codec:
    def encode(self, strs: List[str]) -> str:
        # 모든 문자열을 UTF-8 바이트로 변환 후, 길이+구분자+인코딩된 값으로 저장
        encoded_bytes = b''
        for s in strs:
            b = s.encode('utf-8')
            encoded_bytes += str(len(b)).encode() + b'#' + b
        return encoded_bytes.decode('latin1')  # 1:1 문자 매핑 가능한 인코딩

    def decode(self, s: str) -> List[str]:
        res = []
        b = s.encode('latin1')  # 다시 바이트로 복원
        i = 0
        while i < len(b):
            j = i
            while b[j:j+1] != b'#':
                j += 1
            length = int(b[i:j])
            j += 1  # '#' 다음으로 이동
            res.append(b[j:j+length].decode('utf-8'))
            i = j + length
        return res

🔍 왜 이게 안전한가?

포인트	설명
UTF-8 인코딩	유니코드 문자들을 바이트로 안전하게 변환 (한글, 이모지 모두 가능)
latin1	바이트값을 문자로 1:1 대응할 수 있는 문자셋 (decode-safe)
len(b)	진짜 바이트 길이를 기반으로 자르기 때문에 잘못된 분할 방지

✅ 예시 테스트

codec = Codec()
data = ["안녕", "😊", "hello", "𠜎𠜱𠝹"]
encoded = codec.encode(data)
print("Encoded:", encoded)
decoded = codec.decode(encoded)
print("Decoded:", decoded)

✅ 출력:

Decoded: ['안녕', '😊', 'hello', '𠜎𠜱𠝹']

✅ base64 기반 Unicode-safe 버전

이번에는 네트워크 전송용으로 안전하게 쓸 수 있게, 인코딩 결과를 base64 문자열로 바꾸는 버전을 만들어볼게.

이 방식은 특히 아래 상황에서 유용해:

바이너리 데이터도 포함할 수 있음 (Unicode 포함)
네트워크/웹 전송 시 문제가 안 됨 (URL-safe 가능)
문자 인코딩 충돌 방지

import base64

class Codec:
    def encode(self, strs: List[str]) -> str:
        # 모든 문자열을 utf-8 바이트로 변환하고 길이 + '#' 붙임
        raw_bytes = b''
        for s in strs:
            b = s.encode('utf-8')
            raw_bytes += str(len(b)).encode() + b'#' + b
        # 바이트 전체를 base64 인코딩
        return base64.b64encode(raw_bytes).decode('utf-8')

    def decode(self, s: str) -> List[str]:
        # base64 디코딩 후 원래 바이트로 복원
        raw_bytes = base64.b64decode(s.encode('utf-8'))
        res = []
        i = 0
        while i < len(raw_bytes):
            j = i
            while raw_bytes[j:j+1] != b'#':
                j += 1
            length = int(raw_bytes[i:j])
            j += 1
            res.append(raw_bytes[j:j+length].decode('utf-8'))
            i = j + length
        return res

✅ 예제 실행:

codec = Codec()
data = ["안녕", "😊", "hello", "𠜎𠜱𠝹", "한글#영어"]
encoded = codec.encode(data)
print("Encoded:", encoded)

decoded = codec.decode(encoded)
print("Decoded:", decoded)

🔐 장점 요약:

기능	설명
utf-8 인코딩	모든 유니코드 문자 지원
base64	안전한 문자열 전송 (이진 전송 가능)
# 앞에 길이 정보	문자열 구분 안전하게 처리
.encode('utf-8') + .decode('utf-8')	네트워크에서 통신 가능한 문자 형태 유지

'job 인터뷰 > 코테(Matroid) 준비' 카테고리의 다른 글

653. Two Sum IV - Input is a BST (0)	2025.04.17
Grocery Shopping List 구현 (1)	2025.04.17
557. Reverse Words in a String III (0)	2025.04.17
239. Sliding Window Maximum ★ (0)	2025.04.16
Matroid 관련 LeetCode 문제 번호 정리 (0)	2025.04.15

현재글271. Encode and Decode Strings

hyunkookim 님의 블로그

hyunkookim 님의 블로그 입니다.

heap, BST, 누적합, neetcode150, 오블완, 슬라이딩윈도우, top150, BFS, tree, dfs, dp, 우선순위큐, Coding Test, 티스토리챌린지, neetcode, Two pointers, array / string, Backtracking, LeetCode, leetcode75,

Today :
Yesterday :

hyunkookim 님의 블로그