使用 Node.js 和 Redis 探索 Bloom Filter 的魅力-编程类技能-免费个人简历,免费PPT模板,免费Excle模板,免费网页模板分享-林夕设计家园

在正确的用例中，布隆过滤器看起来就像魔法。这是一个大胆的说法，但在本教程中，我们将探讨这种奇怪的数据结构、如何最好地使用它，以及一些使用 Redis 和 Node.js 的实际示例。

布隆过滤器是一种概率性、单向数据结构。在这种情况下，“过滤器”一词可能会令人困惑；过滤器意味着它是一个活跃的事物，一个动词，但将它视为存储，一个名词可能更容易。使用简单的布隆过滤器，您可以做两件事：

添加一个项目。检查之前是否未添加过某个项目。

这些是需要理解的重要限制 - 您无法删除项目，也无法在布隆过滤器中列出项目。此外，您无法确定过去是否已将某个项目添加到过滤器中。这就是布隆过滤器的概率性质发挥作用的地方——误报是可能的，但误报则不然。如果过滤器设置正确，误报的可能性会非常小。

存在布隆过滤器的变体，它们添加了其他功能，例如删除或缩放，但它们也增加了复杂性和限制。在继续了解变体之前，首先了解简单的布隆过滤器非常重要。本文仅介绍简单的布隆过滤器。

有了这些限制，您可以获得许多好处：固定大小、基于哈希的加密和快速查找。

当您设置布隆过滤器时，您需要为其指定一个大小。此大小是固定的，因此如果过滤器中有一项或十亿项，它永远不会增长超过指定的大小。当您向过滤器添加更多项目时，误报的可能性就会增加。如果您指定了较小的过滤器，则与使用较大的过滤器相比，误报率会增加得更快。

布隆过滤器建立在单向散列的概念之上。与正确存储密码非常相似，布隆过滤器使用哈希算法来确定传入其中的项目的唯一标识符。哈希本质上是无法逆转的，并且由看似随机的字符串表示。因此，如果有人获得了布隆过滤器的访问权限，它不会直接泄露任何内容。

最后，布隆过滤器速度很快。与其他方法相比，该操作涉及的比较次数要少得多，并且可以轻松存储在内存中，从而防止影响性能的数据库命中。

现在您已经了解了布隆过滤器的限制和优点，让我们来看看可以使用它们的一些情况。

设置

我们将使用 Redis 和 Node.js 来说明 Bloom 过滤器。 Redis 是 Bloom 过滤器的存储介质；它速度快，位于内存中，并且有一些特定的命令（GETBIT、SETBIT），可以提高实施效率。我假设您的系统上安装了 Node.js、npm 和 Redis。您的 Redis 服务器应该在 localhost 上的默认端口上运行，以便我们的示例正常工作。

在本教程中，我们不会从头开始实现过滤器；而是从头开始实现过滤器。相反，我们将重点关注 npm 中预构建模块的实际用途：bloom-redis。 bloom-redis 有一组非常简洁的方法：add、contains 和 clear。

独特的用户名

用户名，尤其是在 URL 中标识用户的用户名，需要是唯一的。如果您构建一个允许用户更改用户名的应用程序，那么您可能需要一个从未使用过的用户名，以避免用户名混淆和被攻击。

如果没有布隆过滤器，您需要引用一个包含曾经使用过的每个用户名的表，而大规模时这可能会非常昂贵。布隆过滤器允许您在用户每次采用新名称时添加一个项目。当用户检查用户名是否被占用时，您所需要做的就是检查布隆过滤器。它将能够绝对确定地告诉您所请求的用户名是否先前已添加。过滤器可能会错误地返回用户名已被使用，而实际上用户名尚未被使用，但这只是为了谨慎起见，不会造成任何真正的伤害（除了用户可能无法声明“k3w1d00d47”） .

为了说明这一点，让我们使用 Express 构建一个快速的 REST 服务器。首先，创建 package.json 文件，然后运行以下终端命令。

npm 安装bloom-redis --save

npm install express --save

npm install redis --save

bloom-redis 的默认选项大小设置为 2 MB。出于谨慎考虑，这是错误的，但它相当大。设置布隆过滤器的大小至关重要：太大会浪费内存，太小则误报率会太高。确定大小所涉及的数学非常复杂，超出了本教程的范围，但幸运的是，有一个布隆过滤器大小计算器可以完成这项工作，而无需破解教科书。

现在，创建 app.js 如下：

var

Bloom = require(bloom-redis),

express = require(express),

redis = require(redis),

app,

client,

filter;

//setup our Express server

app = express();

//create the connection to Redis

client = redis.createClient();

filter = new Bloom.BloomFilter({

client : client, //make sure the Bloom module uses our newly created connection to Redis

key : username-bloom-filter, //the Redis key

//calculated size of the Bloom filter.

//This is where your size / probability trade-offs are made

//http://hur.st/bloomfilter?n=100000&p=1.0E-6

size : 2875518, // ~350kb

numHashes : 20

});

app.get(/check, function(req,res,next) {

//check to make sure the query string has username

if (typeof req.query.username === undefined) {

//skip this route, go to the next one - will result in a 404 / not found

next(route);

} else {

filter.contains(

req.query.username, // the username from the query string

function(err, result) {

if (err) {

next(err); //if an error is encountered, send it to the client

} else {

res.send({

username : req.query.username,

//if the result is false, then we know the item has *not* been used

//if the result is true, then we can assume that the item has been used

status : result ? used : free

});

}

);

}

});

app.get(/save,function(req,res,next) {

if (typeof req.query.username === undefined) {

next(route);

} else {

//first, we need to make sure that its not yet in the filter

filter.contains(req.query.username, function(err, result) {

if (err) { next(err); } else {

if (result) {

//true result means it already exists, so tell the user

res.send({ username : req.query.username, status : not-created });

} else {

//well add the username passed in the query string to the filter

filter.add(

req.query.username,

function(err) {

//The callback arguments to `add` provides no useful information, so well just check to make sure that no error was passed

if (err) { next(err); } else {

res.send({

username : req.query.username, status : created

});

}

);

}

});

}

});

app.listen(8010);

登录后复制

要运行此服务器：node app.js。转到浏览器并将其指向：https://localhost:8010/check?username=kyle。响应应该是：{"username":"kyle","status":"free"}。

现在，让我们通过将浏览器指向 http://localhost:8010/save?username=kyle 来保存该用户名。响应将是：{"username":"kyle","status":"created"}。如果返回地址 http://localhost:8010/check?username=kyle，响应将是 {"username":"kyle","status ":"已使用"}.同样，返回 http://localhost:8010/save?username=kyle 将导致 {"username":"kyle","status":"not -创建“}。

从终端中，您可以看到过滤器的大小： redis-cli strlen 用户名-bloom-filter。

现在，对于一项，它应该显示 338622。

现在，继续尝试使用 /save 路由添加更多用户名。您想尝试多少就尝试多少。

如果您再次检查尺寸，您可能会发现尺寸略有增加，但并非每次添加都会增加。很好奇，对吧？在内部，布隆过滤器在保存在 username-bloom 的字符串中的不同位置设置各个位（1/0）。然而，这些并不是连续的，因此如果您在索引 0 处设置一位，然后在索引 10,000 处设置一位，则两者之间的所有内容都将为 0。对于实际用途，一开始了解每个操作的精确机制并不重要，只需知道这一点即可这是正常的，您在 Redis 中的存储永远不会超过您指定的值。

新鲜内容

网站上的新鲜内容可以吸引用户回头客，那么如何每次都向用户展示新的内容呢？使用传统的数据库方法，您可以向表中添加一个包含用户标识符和故事标识符的新行，然后在决定显示一段内容时查询该表。正如您可能想象的那样，您的数据库将增长得非常快，尤其是随着用户和内容的增长。

在这种情况下，漏报（例如不显示看不见的内容）的后果非常小，这使得布隆过滤器成为一个可行的选择。乍一看，您可能认为每个用户都需要一个布隆过滤器，但我们将使用用户标识符和内容标识符的简单串联，然后将该字符串插入到我们的过滤器中。这样我们就可以对所有用户使用单个过滤器。

在此示例中，让我们构建另一个显示内容的基本 Express 服务器。每次您访问路由 /show-content/any-username （any-username 是任何 URL 安全值）时，都会显示一条新内容，直到该网站没有内容。在示例中，内容是古腾堡计划前十名书籍的第一行。

我们需要再安装一个 npm 模块。从终端运行： npm install async --save

您的新 app.js 文件：

var

async = require(async),

Bloom = require(bloom-redis),

express = require(express),

redis = require(redis),

app,

client,

filter,

// From Project Gutenberg - opening lines of the top 10 public domain ebooks

// https://www.gutenberg.org/browse/scores/top

openingLines = {

pride-and-prejudice :

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.,

alices-adventures-in-wonderland :

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it }

登录后复制

如果您仔细注意开发工具中的往返时间，您会发现使用用户名请求单个路径的次数越多，所需的时间就越长。虽然检查过滤器需要固定的时间，但在本例中，我们正在检查是否存在更多项目。布隆过滤器能够告诉您的信息有限，因此您正在测试每个项目是否存在。当然，在我们的示例中，它相当简单，但测试数百个项目效率很低。

过时数据

在此示例中，我们将构建一个小型 Express 服务器，它将执行两件事：通过 POST 接受新数据，并显示当前数据（使用 GET 请求）。当新数据被 POST 到服务器时，应用程序将检查它是否存在于过滤器中。如果它不存在，我们会将其添加到 Redis 中的集合中，否则我们将返回 null。 GET 请求将从 Redis 获取它并将其发送到客户端。

这与前两种情况不同，误报是不行的。我们将使用布隆过滤器作为第一道防线。考虑到布隆过滤器的属性，我们只能确定某些东西不在过滤器中，因此在这种情况下，我们可以继续让数据进入。如果布隆过滤器返回的数据可能在过滤器中，我们将根据实际数据源进行检查。

那么，我们得到了什么？我们获得了不必每次都检查实际来源的速度。在数据源速度较慢的情况下（外部 API、小型数据库、平面文件的中间），确实需要提高速度。为了演示速度，我们在示例中添加 150 毫秒的实际延迟。我们还将使用 console.time / console.timeEnd 来记录 Bloom 过滤器检查和非 Bloom 过滤器检查之间的差异。

在此示例中，我们还将使用极其有限的位数：仅 1024。它很快就会填满。当它填满时，它将显示越来越多的误报 - 您会看到响应时间随着误报率的填满而增加。

该服务器使用与之前相同的模块，因此将 app.js 文件设置为：

100

101

102

103

104

var

async = require(async),

Bloom = require(bloom-redis),

bodyParser = require(body-parser),

express = require(express),

redis = require(redis),

app,

client,

filter,

currentDataKey = current-data,

usedDataKey = used-data;

app = express();

client = redis.createClient();

filter = new Bloom.BloomFilter({

client : client,

key : stale-bloom-filter,

//for illustration purposes, this is a super small filter. It should fill up at around 500 items, so for a production load, youd need something much larger!

size : 1024,

numHashes : 20

});

app.post(

bodyParser.text(),

function(req,res,next) {

var

used;

console.log(POST -, req.body); //log the current data being posted

console.time(post); //start measuring the time it takes to complete our filter and conditional verification process

//async.series is used to manage multiple asynchronous function calls.

async.series([

function(cb) {

filter.contains(req.body, function(err,filterStatus) {

if (err) { cb(err); } else {

used = filterStatus;

cb(err);

}

});

function(cb) {

if (used === false) {

//Bloom filters do not have false negatives, so we need no further verification

cb(null);

} else {

//it *may* be in the filter, so we need to do a follow up check

//for the purposes of the tutorial, well add a 150ms delay in here since Redis can be fast enough to make it difficult to measure and the delay will simulate a slow database or API call

setTimeout(function() {

console.log(possible false positive);

client.sismember(usedDataKey, req.body, function(err, membership) {

if (err) { cb(err); } else {

//sismember returns 0 if an member is not part of the set and 1 if it is.

//This transforms those results into booleans for consistent logic comparison

used = membership === 0 ? false : true;

cb(err);

}

});

}, 150);

}

function(cb) {

if (used === false) {

console.log(Adding to filter);

filter.add(req.body,cb);

} else {

console.log(Skipped filter addition, [false] positive);

cb(null);

}

function(cb) {

if (used === false) {

client.multi()

.set(currentDataKey,req.body) //unused data is set for easy access to the current-data key

.sadd(usedDataKey,req.body) //and added to a set for easy verification later

.exec(cb);

} else {

cb(null);

}

function(err, cb) {

if (err) { next(err); } else {

console.timeEnd(post); //logs the amount of time since the console.time call above

res.send({ saved : !used }); //returns if the item was saved, true for fresh data, false for stale data.

}

);

});

app.get(/,function(req,res,next) {

//just return the fresh data

client.get(currentDataKey, function(err,data) {

if (err) { next(err); } else {

res.send(data);

}

});

app.listen(8012);

登录后复制

由于使用浏览器 POST 到服务器可能会很棘手，所以让我们使用curl 来测试。

curl --data“您的数据放在这里”--header“内容类型：text/plain”http://localhost:8012/

可以使用快速 bash 脚本来显示填充整个过滤器的外观：

#!/bin/bash

for i in `seq 1 500`;

curl --data “data $i" --header "Content-Type: text/plain" http://localhost:8012/

done

登录后复制

观察填充或完整的过滤器很有趣。由于这个很小，你可以使用 redis-cli 轻松查看。通过在添加项目之间从终端运行 redis-cli get stale-filter ，您将看到各个字节增加。完整的过滤器将为每个字节 � 。此时，过滤器将始终返回正值。

结论

布隆过滤器并不是万能的解决方案，但在适当的情况下，布隆过滤器可以为其他数据结构提供快速、有效的补充。

以上就是使用 Node.js 和 Redis 探索 Bloom Filter 的魅力的详细内容，更多请关注php中文网其它相关文章！

知识&宝库

使用 Node.js 和 Redis 探索 Bloom Filter 的魅力

设置

独特的用户名

新鲜内容

过时数据

结论

最新文章

微信小程序导航

linux查看服务器配置命令

小程序转发

云服务器免费

三丰云服务器

裸金属服务器

什么是saas

小程序传参

百度翻译智能小程序

ssh远程连接服务器