One Statistics Professor Was Just Banned By Google: Here Is His Story

Discussion in 'privacy general' started by mirimir, Aug 21, 2017.

  1. mirimir

    mirimir Registered Member

    Joined:
    Oct 1, 2011
    Posts:
    6,768
    That's bigger than the Bitcoin blockchain ;)
     
  2. Stefan Froberg

    Stefan Froberg Registered Member

    Joined:
    Jul 30, 2014
    Posts:
    218
    Uh... isn't the blockchain size currently 153 GB or something?
    https://bitinfocharts.com
    And the blocks can continue growing indefinitely, right?
    While that 83 GB is the max possible with that binary format (and will in reality be less because not all IP addresses have a hostname).

    EDIT: Bitcoin reminds me ... maybe I should use SHA256 too instead of SHA1 ....
    Or better yet, add a field to that binary file that tells which hash method was used, as
    a way to be future-proof as old hashing methods get broken and new ones come along ....
     
    Last edited: Aug 28, 2017
  3. deBoetie

    deBoetie Registered Member

    Joined:
    Aug 7, 2013
    Posts:
    1,329
    Location:
    UK
    Personally, I'd trim this to the top say 1M visited sites, which through Pareto's law effects will comprise 99.999% or something of the total DNS requests. The remainder can be handled through DNS.
     
  4. mirimir

    mirimir Registered Member

    Joined:
    Oct 1, 2011
    Posts:
    6,768
    I think that it's ~127 GB now. It's been a while since I checked :eek:

    https://blockchain.info/charts/blocks-size
    Yep, they sure can :eek:
    That's a fair point.
    :)
     
  5. RockLobster

    RockLobster Registered Member

    Joined:
    Nov 8, 2007
    Posts:
    875
    I think as long as it can fit on a thumb drive, it's good.
     
  6. Stefan Froberg

    Stefan Froberg Registered Member

    Joined:
    Jul 30, 2014
    Posts:
    218
    Some status of the mapping:
    All the following countries have now been fully mapped

    AD AE AF AG AI AL AM AO AP AR AS AT AW AX AZ
    BA BB BD BF BG BH BI BJ BL BM BN BO BQ BS BT BW BY BZ
    CD CF CG CI CK CM CR CU CV CW CY
    DJ DM DO DZ
    EC EE ER ET
    FJ FM FO
    GA GD GE GF GG GH GI GL GM GN GP GQ GT GU GW GY
    HN HR HT
    IM IO IQ IS
    JE JM JO
    KG KH KI KM KN KP KW KY
    LA LB LC LI LK LR LS LU LV LY
    MC MD ME MF MG MH MK ML MM MN MO MP MQ MR MS MT MU MV MW MZ

    total hostname/IP combinations: 33 197 647
    total compressed size: 47 MB

    Starting mapping the rest of the countries...
    Hmmmm...some of the remaining countries should have been completed long ago but show 99.9% complete...
    There must be something wrong with nmap...or maybe my mapping script that runs them in parallel....
    Need to test my own program soon.
     
  7. Stefan Froberg

    Stefan Froberg Registered Member

    Joined:
    Jul 30, 2014
    Posts:
    218
    175/195 of countries (89%) now fully mapped.
    compressed size: 242 MB

    going to be interesting ... ;)

    EDIT: If the data is right, then some observations can already be made.
    I now have only 35 930 643 hostname/IP combinations from all those completed countries,
    and when I look at them, most of them are rather small countries.
    So the remaining 20 countries will cover the rest of the Internet.
     
    Last edited: Aug 31, 2017
  8. Stefan Froberg

    Stefan Froberg Registered Member

    Joined:
    Jul 30, 2014
    Posts:
    218
    Dang, jet lag is still bothering me, but I managed to write some code for portable DNS.
    About 3/4 of the countries are now finished, and so far there are about 71 million hostname/IP combos and the uncompressed size is about 3.1 GB. My estimate is that after the rest of the big countries finish, the size and number of lines might be double or triple that. So all the uncompressed CSV files should fit on one 8 GB thumb drive, or at least on one 16 GB thumb drive easily.
    The data for a few of the countries was corrupted, so I have to start mapping them again ... *sigh*
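
    The size estimate above can be checked with some quick arithmetic. This is only a sketch using the figures quoted in the post, and it assumes decimal gigabytes (1 GB = 10**9 bytes); the variable names are made up for illustration.

```python
# Figures quoted in the post (assumption: decimal GB, 1 GB = 10**9 bytes).
records = 71_000_000
csv_bytes = 3.1e9

bytes_per_record = csv_bytes / records   # average length of one CSV line
doubled_gb = 2 * csv_bytes / 1e9         # "double that" estimate
tripled_gb = 3 * csv_bytes / 1e9         # "triple that" estimate

print(f"average CSV line: {bytes_per_record:.1f} bytes")
print(f"projected size: {doubled_gb:.1f} to {tripled_gb:.1f} GB")
```

    With these numbers the doubled estimate stays under 8 GB, while the tripled one would need the 16 GB drive.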

    Switching from the CSV text file format to my binary file format seems, in general, to reduce the size to about 3/4 of the original or so. Search speed is, well, okay (it now takes about 30 s to find the very last record out of 13 million records), but I'm sure it can be improved by using memory-mapped files, Intel SIMD extensions for comparing SHA hashes, and maybe parallel reading. (It could also be that the background parallel mappings still going on affected the reading speed.)


    I will start converting the CSV files to my own binary file format (please see below) and give those away for free, but the original raw CSV files (which show the data as human-readable hostname/IP combos) I will not give away for free. They just took so much time and effort to generate, and I'm still trying to improve the mapping process to make it faster/better.

    Here is the SF (hehe) file format for those who want to use it in their own apps:
    Code:
    2 BYTE signature (value 'SF')
    1 BYTE version (value '1')
    1 BYTE hash algorithm used (value '0' = SHA256, '1' = SHA384, '2' = SHA512, etc.....keep adding as new ones come)
    1 DWORD number of records in the file
    
    After the header follows the actual data, which has the simple format of:
    Code:
    N BYTES of hashed hostname + 1 DWORD representation of the IP address, for each record.
    
    So for example, after reading the header, if the hash algo is the default '0' (SHA256), then
    you will know that the actual data for each record will be: a 32 BYTE (256 bits / 8 bits per byte) hash value + a DWORD IP value.
    Keep in mind that the hash values used in this binary file are not meant to be human readable (i.e. hex strings) but are stored as plain bytes. If you need to present them in human-readable format, convert each byte to two hex digits, padding with '0' where needed. That will give you the familiar 64-character hex string for SHA256.
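
    A minimal sketch of reading this format in Python. The post does not state the byte order or how the header values are encoded, so this assumes little-endian DWORDs and ASCII characters for the version and hash-algorithm bytes; `parse_sf` and `DIGEST_SIZES` are hypothetical names, not part of any released tool.

```python
import ipaddress
import struct

# Digest sizes in bytes for the hash codes listed in the header description.
# Assumption: the codes are stored as ASCII characters ('0', '1', '2').
DIGEST_SIZES = {b"0": 32, b"1": 48, b"2": 64}

# Assumption: DWORDs are 32-bit unsigned little-endian; the post does not say.
# Fields: signature, version, hash code, record count (8 bytes in total).
HEADER = struct.Struct("<2s c c I")


def parse_sf(data: bytes):
    """Return (version, hash_code, records) for an SF table held in memory."""
    signature, version, hash_code, count = HEADER.unpack_from(data, 0)
    if signature != b"SF":
        raise ValueError("not an SF file")
    digest_size = DIGEST_SIZES[hash_code]
    # One record is the raw digest followed by the IP as a DWORD.
    record = struct.Struct(f"<{digest_size}sI")
    records = []
    offset = HEADER.size
    for _ in range(count):
        digest, ip = record.unpack_from(data, offset)
        records.append((digest, ipaddress.IPv4Address(ip)))
        offset += record.size
    return version, hash_code, records
```

    For display, `digest.hex()` on a 32-byte SHA256 value yields the 64-character hex string mentioned above.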

    Searching for a hostname in the binary file(s) is pretty trivial.
    Check which hashing algo was used in the file(s), hash the hostname with that algo, and then search the file(s) for the resulting hash value.
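
    The search steps could look roughly like this in Python. It is a sketch under assumptions the post does not confirm: little-endian DWORDs, ASCII header codes, and hostnames lowercased and ASCII-encoded before hashing. `lookup` and `HASHES` are illustrative names.

```python
import hashlib
import struct

# Assumption: header codes are ASCII characters mapping to these hashlib names.
HASHES = {b"0": "sha256", b"1": "sha384", b"2": "sha512"}
HEADER_SIZE = 8  # 2-byte signature + version + hash code + DWORD record count


def lookup(table: bytes, hostname: str):
    """Linear scan for hostname; returns the IP as an int, or None."""
    hash_code = table[3:4]
    (count,) = struct.unpack_from("<I", table, 4)  # assumed little-endian
    hasher = hashlib.new(HASHES[hash_code])
    # Assumption: hostnames were normalized to lowercase ASCII before hashing.
    hasher.update(hostname.lower().encode("ascii"))
    wanted = hasher.digest()
    record_size = len(wanted) + 4
    for i in range(count):
        start = HEADER_SIZE + i * record_size
        if table[start:start + len(wanted)] == wanted:
            (ip,) = struct.unpack_from("<I", table, start + len(wanted))
            return ip
    return None
```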

    upload_2017-9-18_1-39-16.png

    upload_2017-9-18_1-39-46.png

    EDIT: SF data for download:
    https://www.orwell1984.today/sf/
    OR
    http://ukp5un24mpxbqcpu.onion/sf

    EDIT2: In the future, the non-country TLDs must be ripped out from these tables into their own tables, so that when the user gives example.com, only the .com table is searched.
    But for now you can already use these tables for searching hostnames with country TLDs.
     
    Last edited: Sep 17, 2017
  9. Stefan Froberg

    Stefan Froberg Registered Member

    Joined:
    Jul 30, 2014
    Posts:
    218
    Okay, the reading code uses memory-mapped files now and the reading speed improvement is quite nice.
    Previously, it took about 30s to search the last record from 13 million records.
    Now, it takes about 12s.

    upload_2017-9-27_11-44-14.png

    Next stop: Intel SIMD extensions for maybe improving hash comparison (maybe)

    Edit: Huuuups! Wrong, older version of the code :oops:
    Here is the right code results when using memory mapped files:

    upload_2017-9-27_12-14-7.png

    639 milliseconds to find the last record out of 13 million records.
    I think this starts to be pretty acceptable :)
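
    For readers who want to try the memory-mapped approach, here is a rough Python sketch. It is not the NoDNSReader code (which is only shown as screenshots); it assumes a SHA256 table, little-endian DWORDs, and ASCII hostnames, and `mmap_lookup` is a made-up name.

```python
import hashlib
import mmap
import struct

HEADER_SIZE = 8          # signature (2) + version (1) + hash code (1) + count (4)
DIGEST_SIZE = 32         # assumes the table uses SHA256 (hash code '0')
RECORD_SIZE = DIGEST_SIZE + 4


def mmap_lookup(path: str, hostname: str):
    """Scan a memory-mapped SF table; returns the IP as an int, or None."""
    wanted = hashlib.sha256(hostname.encode("ascii")).digest()
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as table:
        (count,) = struct.unpack_from("<I", table, 4)  # assumed little-endian
        for i in range(count):
            start = HEADER_SIZE + i * RECORD_SIZE
            # Slicing the map reads straight from the page cache.
            if table[start:start + DIGEST_SIZE] == wanted:
                (ip,) = struct.unpack_from("<I", table, start + DIGEST_SIZE)
                return ip
    return None
```

    Timings in Python will differ from the figures above, since those came from the author's own program.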

    But will still try the Intel SIMD thing and maybe parallel reading ...
     
    Last edited: Sep 27, 2017
  10. Stefan Froberg

    Stefan Froberg Registered Member

    Joined:
    Jul 30, 2014
    Posts:
    218
    Status: 225 709 385
    That's right, my poor little overworked VPS server has mapped over 200 million hostname/IP combos by now, and the raw, uncompressed size is just 9.9 GB. I'm pretty confident now that
    the whole big "I" can be fitted onto one thumb drive :)

    upload_2017-10-12_19-1-14.png

    Here's what's left to map

    upload_2017-10-12_19-4-55.png
     
  11. mirimir

    mirimir Registered Member

    Joined:
    Oct 1, 2011
    Posts:
    6,768
    OK, so a NoDNSReader lookup from each binary country.sf takes on the order of, at most, hundreds of msec. So, do I understand that you plan parallel lookup from all foo.sf? Or is there some way to quickly get country, and just search that one? Sorry if you've explained that already :oops:
     
  12. Stefan Froberg

    Stefan Froberg Registered Member

    Joined:
    Jul 30, 2014
    Posts:
    218
    Well, the way I'm envisioning how this would work: when the user types the URL into his/her browser, the NoDNS proxy program (I will start writing code for it soon, once I'm happy with the lookup results the NoDNSReader test program is giving me) will catch the outgoing DNS request, check the TLD, and then do the lookup from the corresponding memory-mapped file.

    So for example, if I type www.example.fi then the NoDNS proxy would do a lookup from the table FI.sf.
    If I type www.example.fr then from FR.sf and so on.
    Just like the real DNS system does. :)
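
    The TLD-to-table routing described here can be sketched in a few lines of Python. The function name `table_for` is hypothetical; the only convention taken from the post is that a table is named after the upper-cased TLD plus `.sf`.

```python
def table_for(hostname: str) -> str:
    """Map a hostname to the table file named after its TLD, e.g. FI.sf."""
    # Drop a trailing root dot ("example.fi.") before splitting into labels.
    labels = hostname.rstrip(".").split(".")
    if len(labels) < 2 or not labels[-1]:
        raise ValueError(f"no TLD in {hostname!r}")
    return labels[-1].upper() + ".sf"
```

    So `table_for("www.example.fi")` picks `FI.sf` and `table_for("www.example.fr")` picks `FR.sf`.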

    Even though the lookup should now take 1 second or less (at least it did for AR.sf; we will see how long it takes to look up the last record from RU.sf or CN.sf, for example ....), I'm still not happy and would like to try some parallel reading within the same table.
    So for example, if the final RU.sf or CN.sf ends up being extremely huge, the lookup process would be split evenly between cores on multi-core machines. So if the user's machine has just 2 cores, then the table would be memory mapped and core #1 would start looking from the start of the table while core #2 would start looking from the middle of the table (it's trivial to find the offset for this because the number of records is in the table metadata).

    It would really be like doing a binary search on a memory-mapped file, but obviously, the more cores one has, the faster the lookup will be done.
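
    The split-between-cores idea can be illustrated like this. It is a sketch only, assuming SHA256 records; `split_ranges`, `scan` and `parallel_lookup` are invented names. Note that threads in standard CPython will not actually speed up this kind of CPU-bound scan, so the code shows the partitioning logic rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

HEADER_SIZE = 8
DIGEST_SIZE = 32                 # assumes SHA256 records
RECORD_SIZE = DIGEST_SIZE + 4


def split_ranges(count: int, workers: int):
    """Divide record indices 0..count-1 into contiguous, near-equal ranges."""
    base, extra = divmod(count, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def scan(table, wanted: bytes, first: int, last: int):
    """Linear scan of records [first, last); returns the 4 raw IP bytes or None."""
    for i in range(first, last):
        off = HEADER_SIZE + i * RECORD_SIZE
        if table[off:off + DIGEST_SIZE] == wanted:
            return bytes(table[off + DIGEST_SIZE:off + RECORD_SIZE])
    return None


def parallel_lookup(table, count: int, wanted: bytes, workers: int = 2):
    """Give each worker its own slice of the table and return the first hit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan, table, wanted, a, b)
                   for a, b in split_ranges(count, workers)]
        for future in futures:
            hit = future.result()
            if hit is not None:
                return hit
    return None
```

    With 2 workers the second range starts at the middle record, as in the description. A true binary search would additionally need the records sorted by hash value, which the post does not state.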

    But maybe there is no need for parallel lookup within tables, because all these country tables also include TLDs like .net, .com and so on, and in the final finished tables I will need to rip them out into their own table files (like NET.sf, ORG.sf, COM.sf and so on ...).
    That of course will also make all the country TLD tables (RU.sf, CA.sf and so on) smaller.

    So, in the finished, ready-to-use product, the table lookup speed should not be a problem and neither should the combined size of all the tables.
    And because of the way the .sf data files are made (hashed hostnames), there is even some privacy added. :)

    For example, if the NoDNS proxy program sees that there are new updated tables available from the user-selectable update server, and someone somehow manages to grab those updated tables during download, then all they would get is hashed gibberish.

    EDIT: In the final product, I'm envisioning that the user-selectable update server would allow two download modes. 1: through a raw naked IP address (like http://1.2.3.4), which has the advantage of being totally and truly DNS-less but unfortunately has no encryption, or 2: through SSL, which of course adds encryption but has the downside of sending one DNS request (at least if the server is not in the tables). No matter which download mode is chosen, the program must compare the checksum of the remote tables with that of the tables the user actually got once the download finishes, in case someone tries to intentionally corrupt the lookup process.
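
    The checksum comparison could be sketched as follows. The post does not say which checksum would be used, so SHA256 is an assumption here, and `file_sha256` and `verify_table` are hypothetical helper names.

```python
import hashlib
import hmac


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a downloaded table in chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_table(path: str, expected_hex: str) -> bool:
    """True if the local file matches the checksum published by the server."""
    return hmac.compare_digest(file_sha256(path), expected_hex.lower())
```

    In the plain-IP download mode the published checksum travels over the same unencrypted channel, so it would need to be obtained or signed in some way an attacker cannot alter for this check to help.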
     
    Last edited: Oct 12, 2017 at 4:49 PM
  13. mirimir

    mirimir Registered Member

    Joined:
    Oct 1, 2011
    Posts:
    6,768
    Thanks. I wasn't thinking, when I asked about countries :oops:
     
  14. reasonablePrivacy

    reasonablePrivacy Registered Member

    Joined:
    Oct 7, 2017
    Posts:
    21
    Location:
    Some country in the European Union
  15. RockLobster

    RockLobster Registered Member

    Joined:
    Nov 8, 2007
    Posts:
    875
    Yes, I learned my lesson when a similar thing happened to my Microsoft Outlook email that I had been using for over ten years.
    They locked it because I tried to access it from a new device.
    I hadn't forgotten my password and I knew the answer to the secret question, but it was so long since I set it up that I no longer had the alternate email. So they refused to let me access my own email.
    I have never used any of the tech corps' services for anything important ever since. It gives them too much control.
     