手搓布隆过滤器

布隆过滤器的核心思想是：

使用一个大的 位数组（BitSet）
使用 多个哈希函数（HashFunction）
插入元素时，对元素进行多次哈希并将对应的位数组位置置为true（遍历哈希函数，便于使用不同的种子来把一个要哈希的对象变为多个int值放入bitset中）
查询元素时，只要任意一个对应位为0，就确定该元素一定不存在；如果所有位都为1，可能存在（存在误判）

1
import java.util.BitSet;
2

3
public class MyBloomFilter {
4
    private static final int DEFAULT_SIZE = Integer.MAX_VALUE;
5
    private static final int MIN_SIZE = 1000;
6
    private static int SIZE = DEFAULT_SIZE;
7
    //hash函数种子因子
8
    private static final int[] HASH_SEEDS = new int[]{3, 5, 7, 11, 13, 17, 19, 23, 29, 31};
9

10
    private BitSet bitSet = null;
11

12
    //HASH函数
13
    private HashFunction[] hashFunctions = new HashFunction[HASH_SEEDS.length];
14

15
    public MyBloomFilter() {
16
        init();
17
    }
18

19
    public MyBloomFilter(int size) {
20
        if (size >= MIN_SIZE) {
21
            SIZE = size;
22
        }
23
        init();
24
    }
25

26
    private void init() {
27
        bitSet = new BitSet(SIZE);
28
        for (int i = 0; i < HASH_SEEDS.length; i++) {
29
            hashFunctions[i] = new HashFunction(SIZE, HASH_SEEDS[i]);
30
        }
31
    }
32

33
    public void add(Object value) {
34
        for (HashFunction f : hashFunctions) {
35
            bitSet.set(f.hash(value), true);
36
        }
37
    }
38

39
    public boolean contains(Object value){
40
        boolean result = true;
41
        for (HashFunction f : hashFunctions) {
42
            result = result && bitSet.get(f.hash(value));
43

44
            if (!result){
45
                return result;
46
            }
47
        }
48
        return result;
49
    }
50

51
    public static class HashFunction {
52
        private int size;
53
        private int seed;
54

55
        public HashFunction(int size, int seed) {
56
            this.size = size;
57
            this.seed = seed;
58
        }
59

60
        public int hash(Object value){
61
            if (value == null){
62
                return 0;
63
            }
64
            int hash1 = value.hashCode();
65

66
            int hash2 = hash1 >>> 16;
67

68
            int combine = hash1 ^ hash2;
69

70
            return Math.abs(combine * seed) % size;
71

72
        }
73
    }
74

75
    public static void main(String[] args) {
76
        Integer num1 = 31312;
77
        Integer num2 = 312312;
78

79
        MyBloomFilter bloomFilter = new MyBloomFilter();
80

81
        System.out.println(bloomFilter.contains(num1) + " " + bloomFilter.contains(num2));
82

83
        bloomFilter.add(num1);
84
        bloomFilter.add(num2);
85

86
        System.out.println(bloomFilter.contains(num1) + " " + bloomFilter.contains(num2));
87

88
        System.out.println(bloomFilter.contains(312432));
89
    }
90
}

详解关键

1
int hash1 = value.hashCode();
2

3
int hash2 = hash1 >>> 16;
4

5
int combine = hash1 ^ hash2;
6

7
return Math.abs(combine * seed) % size;

这段代码是布隆过滤器中哈希函数的核心，用于将任意对象转化为在 [0, size) 范围内的整数索引。

1
int hash1 = value.hashCode();

✅ 获取原始哈希值#

使用 Java 的 hashCode() 方法得到对象的哈希值。
这是 Java 中所有对象都具备的，通常能反映其数据特征。

1
int hash2 = hash1 >>> 16;

✅ 提取高位信息，做混淆处理#

将 hash1 无符号右移 16 位，得到高位部分。
这么做是为了将高位信息与低位结合，增强哈希的分布性（也叫“扰动函数”）。

类似思路也出现在 HashMap 的源码中：

1
// HashMap.java 中对 hashCode 的扰动函数
2
static final int hash(Object key) {
3
    int h;
4
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
5
}

1
int combine = hash1 ^ hash2;

✅ 异或操作，打乱模式#

将原始哈希值与高位进行异或，增强随机性，避免哈希冲突集中。
异或操作本质上是一种 位级扰动（bit-level mixing）。

1
return Math.abs(combine * seed) % size;

✅ 与种子结合，压缩到位图范围#

乘上 seed：
- 使用不同的种子值，可以生成多个不同的哈希函数。
- 这是布隆过滤器中实现多个独立哈希函数的一种方式。
取绝对值：
- 保证哈希结果是非负的。
模 size：
- 将哈希结果限定到 BitSet 的合法索引范围 [0, size)。

🔎 整体目的是：#

增强分布性和抗冲突性（扰动 hash1）
通过种子值构造出多个不同的哈希函数
最终将结果落入 BitSet 有效范围内

✅ 总结#

步骤	目的
`hashCode()`	获取对象哈希值
`>>> 16` + `^`	扰动，提高分布性
`* seed`	构造多个哈希函数
`Math.abs(...) % size`	映射到 BitSet 的有效索引范围

这是一种 轻量级但分布性还可以的哈希函数构造方式，适合用于布隆过滤器中快速构建多个 hash 函数。如果想进一步优化精度，可以考虑使用 MurmurHash, FNV, Guava.hashing() 等更高质量的哈希算法。