Leetcode 443 String Compression 的 map reduce followup

1.将input s分成多个substring, 分给每个mapper slave

2: mapper slave得到 input: (s, begIdx, endIdx), 扫一遍s，output key-value pairs: (char, idx of char in original s)

3: 用char 做key 把key-value pair 分组成 (char, list of indices of char in original s)。

4: reducer slave拿到(char, list of indices of char in original s)的input, 用LC 128. Longest Consecutive Sequence的思路，拼凑出每个char的consecutive intervals。

1. 最后把intervals按beg index 排序，得到 a:[0,0], p:[1,2], l:[3,3], e:[4,4], b:[5,5], e:[6,8]，return ap2lebe3

``````import java.util.*;
public class StringCompression{
public static String compress(String original) {
if (original == null || original.length() == 0) {
return original;
}
int n = original.length();

StringBuilder compressed = new StringBuilder();
int count = 1; // record char count
int len = 1; // count total length
Character c1 = original.charAt(0);
while (len < n) {
// increment character count if no new character occur
if (c1 == original.charAt(len)) {
count ++;
} else { //otherwise update character and refresh count
compressed.append(c1);
compressed.append(Integer.toString(count));
c1 = original.charAt(len);
count = 1;
}
len ++;
}
//add last character and count to the end
compressed.append(c1);
compressed.append(Integer.toString(count));
return compressed.toString();
}

public static String restore(String compressed) {
if (compressed == null || compressed.length() == 0) {
return compressed;
}
int n = compressed.length();

StringBuilder original = new StringBuilder(); // store result string
StringBuilder count = new StringBuilder(); //temporary store character count
int len = 1;
Character c1 = compressed.charAt(0);
while (len <= n) {
// if meet new character or end of string
if((len < n && !Character.isDigit(compressed.charAt(len))) || len == n) {
if (count.length() > 0) {
int cnt = Integer.parseInt(count.toString());
for (int i = 0; i < cnt; i++) {
original.append(c1);
}
}
// if not finished, refresh count and character cache
if (len < n) {
c1 = compressed.charAt(len);
count = new StringBuilder();
}
} else {
//increment character count otherwise
count.append(compressed.charAt(len));
}
len ++;
}

return original.toString();
}

public static void main(String[] args)
{

String input1 = "aabbbbbbbbbbbbbbbbbbbccc";
String compressed1 = compress(input1);
String original1 = restore(compressed1);
System.out.println("input string: " + input1);
System.out.println("compressed string: " + compressed1);
System.out.println("restored original string: " + original1);

String input2 = "a";
String compressed2 = compress(input2);
String original2 = restore(compressed2);
System.out.println("input string: " + input2);
System.out.println("compressed string: " + compressed2);
System.out.println("restored original string: " + original2);

String input3 = "abbbbbbbbbbbbbbbb";
String compressed3 = compress(input3);
String original3 = restore(compressed3);
System.out.println("input string: " + input3);
System.out.println("compressed string: " + compressed3);
System.out.println("restored original string: " + original3);
}
}
``````

``````input string: aabbbbbbbbbbbbbbbbbbbccc
compressed string: a2b19c3
restored original string: aabbbbbbbbbbbbbbbbbbbccc
input string: a
compressed string: a1
restored original string: a
input string: abbbbbbbbbbbbbbbb
compressed string: a1b16
restored original string: abbbbbbbbbbbbbbbb
``````

followup的第一部分（压缩），input切分可以按固定长度切分（比如len = 8），对substring调用 compress方法，存储输出string，最后reduce的时候按照subsequence number遇到前一段substring 结尾和 后一段substring 开头是相同char的情况下，对输出的string进行修改再拼接（比如前一段为a1b7 ，后一段为 b8， 拼接成a1b15）

1 Like

0123456789 (input char array index)
aaabbbbccc (input char array content)

01234 56789
aaabb | bbccc

LongWritable对应Long，Text对应String，V2Object就是Java的Object。

Mapper input <K1, V1> 的type为<LongWritable, Text>
K1: Array segment offset (a.k.a. start index of array segment)
V1: Array segment content

Mapper1拿到<0L, “aaabb”>，把aaabb按连续相同字符切分成aaa和bb。
aaa在original input char array的offset是从0到2，可以对字符a压缩成a3。我们把a作为K2，V2Object {compressedStr: “a3”, startIndex: 0, endIndex: 2}作为V2。

``````    K2: Current character in this continuous array segment with same char repeating
V2: V2Object {
// compressed string of this array segment, for example,
// "a3" (compress of "aaa"), "b2" (compress of "b2")
String compressedStr;
long startIndex;
long endIndex;
}
``````

``````    public static class V2Object implements Comparable<V2Object> {
...
@Override
public int compareTo(V2Object o) {
// 注意这里按startIndex排序
return Long.compare(this.startIndex, o.startIndex);
}
...
}
``````

<“a”, {compressedStr: “a3”, startIndex: 0, endIndex: 2}>
<“b”, {compressedStr: “b2”, startIndex: 3, endIndex: 4}>
Mapper2输入<5L, “bbccc”>, 输出：
<“b”, {compressedStr: “b2”, startIndex: 5, endIndex: 6}>
<“c”, {compressedStr: “c3”, startIndex: 7, endIndex: 9}>

<“a”, [{compressedStr: “a3”, startIndex: 0, endIndex: 2}]>
<“b”, [{compressedStr: “b2”, startIndex: 3, endIndex: 4}, {compressedStr: “b2”, startIndex: 5, endIndex: 6}]>
Reducer2输入：
`<"c", [{compressedStr: "c3", startIndex: 7, endIndex: 9}]>`

K3: startIndex of current interval (V2Object) in the V2Object array after merge interval step
V3: compressedStr, e.g. “b4”

<0, “a3”>
<3, “b4”>
Reducer2会输出:
<7, “c3”>

0 a3
3 b4
7 c3

01234 56789
aaabb | bbccc

mapper 处理完就变成 <a3b2, 1> 和 <b2c3, 2>。