2

I have data like the following:

id  | src_ip         | dst_ip       | port |
----|----------------|--------------|------|
256 | 192.168.61.200 | 10.75.64.222 | 80   |
257 | 192.168.61.200 | 10.75.64.222 | 81   |
258 | 10.65.72.207   | 10.75.64.223 | 80   |
259 | 10.75.254.27   | 10.75.64.223 | 82   |
260 | 10.75.254.27   | 10.75.64.224 | 80   |
261 | 10.75.254.27   | 10.75.64.230 | 80   |
262 | 10.75.254.27   | 10.75.64.230 | 81   |
263 | 10.75.254.27   | 10.75.64.230 | 82   |
264 | 10.75.254.27   | 10.75.64.230 | 83   |
265 | 10.75.254.27   | 10.75.64.230 | 84   |

I built the schema and loaded the same data here: http://sqlfiddle.com/#!9/1185c

I'm planning to group the rows in a way to reproduce some rules for my firewall. Given the firewall supports ranges for ports or IPs I want to query this table and group as many rows as I can to produce valid rules.

There are a few different ways to group rows, but I want to start grouping by consecutive dst_ip on the same port.

Given the table presented above, I would like to get something like this:

 dst_ip       | max_consecutive | port |
--------------|-----------------|------|
 10.75.64.222 | 10.75.64.224    | 80   |
 10.75.64.222 | 10.75.64.222    | 81   |
 10.75.64.223 | 10.75.64.223    | 82   |
 10.75.64.230 | 10.75.64.230    | 80   |
 10.75.64.230 | 10.75.64.230    | 81   |
 10.75.64.230 | 10.75.64.230    | 82   |
 10.75.64.230 | 10.75.64.230    | 83   |
 10.75.64.230 | 10.75.64.230    | 84   |

Note: max_consecutive could be the number of consecutive rows, too, not necessarily the upper IP.

Actually, if I can go deeper, I would like to group by port, too. That means the last 5 rows of the result will be summarized in just one like

 10.75.64.230 | 10.75.64.230    | 80   | 84

I'm currently using MySQL, but I can move to PostgreSQL if it is necessary.

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
emancu
  • 43
  • 6
  • This looks like a classic group by to me, but I do not quite get what results you expect. Can you be more specific and add the expected full result for your example ? – Clément Prévost May 03 '15 at 21:42
  • Would you continue a series across byte boundaries? I.e. is the series 10.75.63.254, 10.75.63.255 continued with 10.75.64.0? – Erwin Brandstetter May 04 '15 at 07:36
  • No, I don't care that scenario right now. Because it will depend on the mask too, so I prefer to ignore it. – emancu May 04 '15 at 16:57

1 Answers1

2

In Postgres (tested with v9.3) you can use the dedicated inet data type, to store IPv4 addresses with only 7 bytes (or IPv6 with 19 bytes) and with automatic integrity checks and dedicated functions and type casts etc.

Schema

The translated schema could look like this:

CREATE TABLE log (
  id serial PRIMARY KEY
, dst_port int
, src_ip inet
, dst_ip inet
);
CREATE INDEX ON log (dst_port);
CREATE INDEX ON log (src_ip);

I moved to dst_port int to the 2nd position to optimize alignment / padding:

Now we can use standard window functions (not possible in MySQL).

Step 1: Fold groups of consecutive dst_ip for same (dst_port)

One special difficulty: The aggregate function min() / max() are not yet implemented for inet in Postgres 9.4. Both are in the upcoming Postgres 9.5!

So I substituted with DISTINCT ON in the first step:

SELECT DISTINCT ON (dst_port, ip_grp)
       dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM  (
   SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                        ORDER BY dst_ip) AS ip_grp
   FROM   log
   ORDER  BY dst_port, dst_ip
   ) sub
ORDER  BY dst_port, ip_grp, dst_ip;

Result as desired - with a count of rows (could be upper IP as well).

You can subtract/add integer from/to the inet type. By subtracting the row_number() all consecutive rows get the same grp - the value of grp is irrelevant, just the fast that it's the same per partition (dst_port).

Then we can GROUP BY ... - or in this special case DISTINCT ON dst_port, ip_grp. I use another window function to get the count ip_ct in the same step: count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct.

Note that consecutive IPs can cross byte boundaries (see my comment to question).

Detailed explanation for this technique:

Step 2: Fold groups of consecutive dst_port for same (dst_ip, ip_ct)

SELECT dst_ip, ip_ct, min(dst_port) AS dst_port, count(*) AS port_ct
FROM  (
   SELECT *, dst_port - row_number() OVER (PARTITION BY dst_ip, ip_ct
                                           ORDER BY dst_port) AS port_grp
   FROM  (
      SELECT DISTINCT ON (dst_port, ip_grp)
             dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
      FROM  (
         SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                              ORDER BY dst_ip) AS ip_grp
         FROM   log
         ORDER  BY dst_port, dst_ip
         ) sub1
      ORDER  BY dst_port, ip_grp, dst_ip
      ) sub2
   ) sub3
GROUP  BY 1, 2, port_grp
ORDER  BY 1, 3, 2;

Basically, repeat the same logic like in the first step, applied to the result of the first step.
But now you have to group on ip_ct additionally. And this time, you can use the simpler min(dst_port), since the port number is a plain integer.

SQL Fiddle demonstrating all.

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
  • This looks really great. The only thing is for port grouping, I would like to have the first row of the second query to be like: 10.75.64.222 | 3 | 80 | 2 where port_ct value is 2 given the same ip has 80 and 81 – emancu May 04 '15 at 17:04
  • @emancu: But that would be wrong / ambiguous. (10.75.64.223, 81), (10.75.64.223, 81) and (10.75.64.224, 81), (10.75.64.224, 81) are missing. – Erwin Brandstetter May 04 '15 at 17:43
  • I'm not following you :(. It shouldn't be ambiguous. Given a src_ip we say there are 3 consecutive src_ip with the same port, and at the same time there are 2 consecutive ports with the same src_ip – emancu May 05 '15 at 15:11
  • @emancu: and at the same time there are 2 consecutive ports with the same src_ip .. but only for src_ip 10.75.64.222, not for .223 and .224. My last comment was inaccurate, should be: (10.75.64.223, 81), and (10.75.64.224, 81) are missing. The row represents an aggregated range of 3 IPs, not just the one. – Erwin Brandstetter May 11 '15 at 03:54