How to Read Every Line in the File and Subtract It in New File


  • I have one document with over 200k lines: DocA.txt
    I have another document with about 100 lines: DocB.txt

    I would like to subtract the 100 lines in DocB.txt from all the lines in DocA.txt

    Example

    Let's say DocA.txt looks like this:
    Giraffe
    Crocodile
    Ant
    Panther
    Elephant
    Mosquito
    Zebra
    Lion
    Butterfly
    Antelope

    …and DocB.txt looks like this:
    Ant
    Mosquito
    Butterfly

    Is there a way, natively in N++ or by using a plugin, to make sure all lines present in DocB.txt are removed from DocA.txt so the result looks like this?
    Giraffe
    Crocodile
    Panther
    Elephant
    Zebra
    Lion
    Antelope

    Note that a value like "Ant" did not interfere with a value like "Antelope", only the entire line should be taken into account. Any help would be greatly appreciated :)
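    For readers who want the operation outside Notepad++: the requested whole-line subtraction is a few lines of Python. This is a minimal sketch using the animal names from the example above:

```python
# Remove every line of DocB from DocA, comparing whole lines only,
# so "Ant" in DocB does not touch "Antelope" in DocA.
doc_a = ["Giraffe", "Crocodile", "Ant", "Panther", "Elephant",
         "Mosquito", "Zebra", "Lion", "Butterfly", "Antelope"]
doc_b = ["Ant", "Mosquito", "Butterfly"]

to_remove = set(doc_b)  # set membership test is O(1) per line
result = [line for line in doc_a if line not in to_remove]
print(result)
# ['Giraffe', 'Crocodile', 'Panther', 'Elephant', 'Zebra', 'Lion', 'Antelope']
```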


  • @Timmy-M

    This thread will probably get you going.
    If not, there is a link to yet-another-thread in there that might help.
    You'd have to put the two files together into a single file, hopefully that is not a deal-breaker for you.


  • @Timmy-M

    just in case you are using the python script plugin and the
    lines in document A are unique (or you don't bother if they are unique afterwards)
    you could use something like this

                  s1 = set(editor1.getText().splitlines())
                  s2 = set(editor2.getText().splitlines())
                  editor1.setText('\r\n'.join(s1 - s2))

    which turns [screenshot of Document A] into [screenshot of the result]

    Document A needs to be in View0 and Document B needs to be in View1

    Cheers
    Claudia


  • Hello, @timmy-M, @claudia-frank and All,

    Sorry, again, Claudia… !

    I think, Timmy, that your goal can, easily, be accomplished with a regex S/R ! So :

    • First, open a new tab ( Ctrl + N )

    • Then, copy all the Doc A contents in this new tab

    • Now, at the end of the list, add a new line, beginning with, at least, 3 dashes ( --- ) and followed by a normal line-break

    • Then, copy all the Doc B contents, after this line of dashes

    • Open the Replace dialog ( Ctrl + H )

    • Type in the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?)|^---.+ , in the Find what: zone

    • Leave the Replace with: zone EMPTY

    • Tick the Wrap around option

    • Select the Regular expression search mode

    • Finally, click on the Replace All button

    Et voilà !


    Notes :

    • This regex is looking for a complete line, which is repeated, somewhere, later on, in the file

    • When a match occurs, this duplicate line is, then, deleted

    • Finally, the string --- followed by the totality of the Doc B contents, is also caught and deleted

    • Note also that IF your list, in Doc A , already contains duplicate names, they are suppressed, as well ;-))
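    This recipe can be reproduced outside Notepad++ with Python's re module. A sketch only: Python has no \R (plain \n stands in here) and no mid-pattern flag switch (a scoped (?s:...) group stands in):

```python
import re

# Doc A lines, then a "---" separator, then Doc B lines, as in the recipe.
text = ("Giraffe\nCrocodile\nAnt\nPanther\nElephant\nMosquito\n"
        "Zebra\nLion\nButterfly\nAntelope\n"
        "---\n"
        "Ant\nMosquito\nButterfly\n")

# Branch 1 deletes any line that is repeated later in the file;
# branch 2 deletes the dashes line and everything after it.
pattern = r'^(.+)\n(?=(?s:.*\n\1\n?))|^---(?s:.+)'
print(re.sub(pattern, '', text, flags=re.M))
```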

    Best Regards,

    guy038


  • Thanks @Scott-Sumner, @Claudia-Frank and @guy038 for your suggestions.
    I'll be honest with you, when I see "Python", the monkey in my head starts jumping around screaming so that might not be such a good idea for me :)

    I tried the step-by-step that @guy038 wrote here but that reduced a file with 255.507 lines to one single line and it stated "Replace All: 5 occurrences were replaced".

    Sorry for being a rookie in all of this but "regex" means "regular expression", right? I don't need to install a plug-in or something? And by "followed by a normal line-break", you just mean pressing enter (another new line), right?
    I've made a screenshot of the replace box to verify if I've set everything correct: photos (dot) app (dot) goo (dot) gl (slash) FlszelbbwReIXJWl1

    To clarify my intention a bit more: I'm trying to apply a whitelist on a hosts file. The structure of the text looks like this:
    0.0.0.0 www.notepad-plus-plus.org
    0.0.0.0 www.google.com
    Etc.


  • @Timmy-M said:

    the monkey in my head starts jumping around screaming

    :-D

    So if I take the advice I gave earlier and look at the other thread, and use the technique from there, here's what I get on your test data:

    [Imgur screenshot]

    …which seems to meet your request because I can then execute the command to delete all bookmarked lines.

    Of course, perhaps this doesn't work on your real data. It seems like sometimes the regular expression (a.k.a. "regex") engine can have trouble with a search of this sort on "big" data, or maybe it just seems this way when an inefficient regex is used.


  • Hello, @timmy-M, @claudia-frank, @scott-sumner and All,

    I did some tests with different data configurations and, indeed, sometimes, the previous regex finds a unique match which is the totality of the file contents :-((. I suppose that this behavior is due to the amount of characters, matched by the .* part, in the lookahead (?=.*\R\1\R?) , which can be quite important, in some huge files. So, my previous regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?) is not a reliable one, because it seems to work, only, on small or middle size files !

    Refer to similar issues ( catastrophic backtracking ) at the address, below :

    https://www.regular-expressions.info/catastrophic.html


    Then I tried, ( hard ! ), these two last days, to find out a new regex, which would work for your specific task ( Subtract B from A ). My idea was to find a way to get each line of Document A close to every item of document B !

    I just presume that, even if your Doc A is quite important, in number of lines, your Doc B , containing the different words to get rid of, is rather small with a limited number of lines ( I mean… less than 100 , about ! ). But, this new regex, does NOT change, at all, the structure of your Doc A :

    • It keeps possible duplicates, in Doc A

    • It does not perform any sort operation


    To test this new regex, I decided to start with the license.txt file, generated by the N++ v7.5.6 install. With different regexes, I extracted all words, one per line, without any punctuation or symbol character. I obtained a list of 2654 words/lines.

    And for readability, I lowercased any word, also ! So this list begins and ends, as below :

                  copying
                  describes
                  the
                  terms
                  under
                  which
                  notepad++
                  is
                  distributed
                  .....
                  .....
                  .....
                  possibility
                  of
                  such
                  damages
                  end
                  of
                  terms
                  and
                  conditions

    Then, I duplicated 19 times this list , in order to get a file of 50426 lines ( So each word has 18 duplicates, further on ! ). This file will be used as Doc A

    Now, I decided that every word, without digits, containing 2 or 3 letters, in Doc A ( so 60 words ) will be part of Doc B . I also added 9 other words, frequently used in License.txt. Finally, the contents of Doc B are :

                  a act add all an and any are as ask at be but by can do don end fee for get gnu gpl has he ho how if in inc is it its law ma may new no not of on one or our out run say see she so the to two up usa use way we who you conditions program license copyright software modification modifications free distribution

    So, the main idea is to change the Doc B list of words, into a single line, with the simple regex :

    SEARCH \R

    REPLACE ,

    After replacement, the Doc B should contain a unique line :

                  ,a,act,add,all,an,and,any,are,as,ask,at,be,but,by,can,do,don,end,fee,for,get,gnu,gpl,has,he,ho,how,if,in,inc,is,it,its,law,ma,may,new,no,not,of,on,one,or,our,out,run,say,see,she,so,the,to,two,up,usa,use,way,we,who,you,conditions,program,license,copyright,software,modification,modifications,free,distribution,

    Important :

    • After replacement, your list of words MUST begin and end with a comma ( , ). If not, just add a comma symbol, at beginning and/or end of the line !

    • From now on, I strongly advise you to unset the Word Wrap feature ( View > Word wrap ). Indeed, navigation among huge files is really easier with that option off ;-)

    Now, in Doc A , use the following regex S/R :

    SEARCH $

    REPLACE ,a,act,add,all,an,and,any,are,as,ask,at,be,but,by,can,do,don,end,fee,for,get,gnu,gpl,has,he,ho,how,if,in,inc,is,it,its,law,ma,may,new,no,not,of,on,one,or,our,out,run,say,see,she,so,the,to,two,up,usa,use,way,we,who,you,conditions,program,license,copyright,software,modification,modifications,free,distribution,

    After replacement ( 5s with my laptop ), the list of Doc B words, in one line, is added at the end of each of the 50426 lines of Doc A ,


    Finally, here is, below, the new regex S/R, which will delete any line of Doc A , also contained in Doc B :

    SEARCH (?-s)^(.+?,)(.*,)?\1.*\R|,.+

    REPLACE Leave EMPTY

    After replacement ( 1m 15s with my laptop ), I obtained a list of 25764 words/lines, in the same initial order as in Doc A , containing, at least, 4 letters and, also, different from the 9 words, below :

                  conditions program license copyright software modification modifications free distribution

    Notes :

    • You, certainly, have understood, that, this time, each initial word , of Doc A , at beginning of line, is only compared with a possible identical word, in the SAME current line, only ! If so, the entire line, with its line-break, is deleted. If not, the part ,.+ , after the alternation symbol ( | ), which represents the list of Doc B words, is only deleted :-))

    • So, I supposed that we can rely on that regex, even in case of a very large file. Of course, the replace operation will be longer but safe ! Last News : I gave a try with a new Doc A , which is 10 times bigger than the initial one. So, a file of 159,275,670 bytes, with 504,260 lines ! And, after the final replacement, with the regex below :

    SEARCH (?-s)^(.+?,)(.*,)?\1.*\R|,.+

    REPLACE Leave EMPTY

    I was left with 257,640 lines ( so, as expected, 10 times the previous result ), after 12m 27s of processing, while, simultaneously, listening to music on the Net ! Now, I'm quite sure that it's a reliable regex !
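    The append-then-delete trick can be imitated in miniature with Python. This is a sketch with a made-up two-word Doc B, where \n plays the role of \R:

```python
import re

doc_a = ["giraffe", "ant", "panther", "mosquito"]
doc_b = ["ant", "mosquito"]

# Step 1 : collapse Doc B into one comma-delimited line ( the \R -> , S/R ).
b_line = "," + ",".join(doc_b) + ","

# Step 2 : append that line to every Doc A line ( the $ -> list S/R ).
text = "".join(word + b_line + "\n" for word in doc_a)

# Step 3 : the final regex. A line whose leading word reappears in its own
# appended list is deleted entirely; otherwise only the list part ( ,.+ ) goes.
result = re.sub(r'^(.+?,)(.*,)?\1.*\n|,.+', '', text, flags=re.M)
print(result)  # "giraffe\npanther\n"
```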


    So, Timmy, if you get some spare time, just give a try to this method and tell me if you get positive results !

    Cheers,

    guy038


  • Hi @guy038 !

    Thank you so much for your efforts, they are quite admirable.

    I believe we're getting closer here but when taking your "replace $ with [line of values]" step (55 seconds on my laptop) it stops at line 161,454.
    To mitigate this, I've cut all data from line 161,455 to a new document and ran the replace again. This time, it stopped at line 48,192.
    At this point I noticed a strange phenomenon: I couldn't delete any text from these files anymore so I decided to start over. A new S/R attempt stopped at line 128,653 and everything after that came out garbled (showing "LF" blocks).
    I believe this is related to the punctuation inherent to the content. To better understand this, maybe it's easier to just take a look at the actual data. Therefore I have uploaded the files on Gdrive:
    DocA.txt
    DocB.txt
    Warning: do not attempt to visit any of the websites mentioned within these files as they may contain malicious content!
    DocA is a hosts file that blocks unwanted servers. DocB holds safe Facebook servers that are supposed to be deleted from DocA.

    I hope that helps. If this is too difficult to achieve or too much effort has gone into this already, you really shouldn't bother. I'll do it manually each time if I must, no worries :)

    Thanks
    Timmy


  • @Timmy-M

    It may be time to consider the "screaming monkey"…

    I tried @Claudia-Frank 's Pythonscript code on your documents and, I don't know, I either didn't wait long enough, or something else went wrong, but I ended up having to kill Notepad++ to put a stop to whatever it (or the PS plugin) was doing.

    I thought about trying @guy038's regular expression on the data, but then I thought, "this is too much work; no one will really be willing to do this, let alone remember it".

    Then…it may be time to bite the monkey and give this task to an external scripting language. Starting with @Claudia-Frank 's Pythonscript code, I came up with the following Python3 code that runs on your data files so quickly that I didn't even think to time it:

                  with open('DocA.txt', 'r', encoding='utf-8') as filea:
                      linesa = filea.readlines()
                  with open('DocB.txt', 'r', encoding='utf-8') as fileb:
                      setb = set(fileb.readlines())

                  linesa = sorted(linesa)
                  seta = set(linesa)

                  linesc = sorted(list(seta - setb))

                  with open('DocA_sorted.txt', 'w', encoding='utf-8') as filea:
                      filea.write(''.join(linesa))
                  with open('DocC.txt', 'w', encoding='utf-8') as filec:
                      filec.write(''.join(linesc))

    I sorted the lines because that makes comparing DocA_sorted and DocC (where C = B - A) easy when attempting to validate the results.

    Sometimes it's just best to get outside the confines of Notepad++ for a task, and I'm sure Perl or AWK or "insert your favorite programming language here" works just as well as Python3 for this problem.
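    A side note: the sort above is only there to ease validation. If DocA's original line order should survive, a plain filter keeps it. A sketch using throwaway temp files in place of the real DocA.txt / DocB.txt:

```python
import os
import tempfile

# Build stand-in files so the sketch is self-contained.
d = tempfile.mkdtemp()
with open(os.path.join(d, 'DocA.txt'), 'w', encoding='utf-8') as f:
    f.write("0.0.0.0 bad.example\n0.0.0.0 safe.example\n0.0.0.0 worse.example\n")
with open(os.path.join(d, 'DocB.txt'), 'w', encoding='utf-8') as f:
    f.write("0.0.0.0 safe.example\n")

with open(os.path.join(d, 'DocA.txt'), encoding='utf-8') as fa:
    linesa = fa.readlines()
with open(os.path.join(d, 'DocB.txt'), encoding='utf-8') as fb:
    setb = set(fb.readlines())

# No sorting: just drop any DocA line that also appears in DocB.
with open(os.path.join(d, 'DocC.txt'), 'w', encoding='utf-8') as fc:
    fc.writelines(line for line in linesa if line not in setb)

with open(os.path.join(d, 'DocC.txt'), encoding='utf-8') as fc:
    print(fc.read())  # the two unwanted lines remain, in their original order
```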


  • @Scott-Sumner

    tried it 10 times in a row without a problem. Takes < 1sec to do the job.

    Thank you
    Claudia


  • @Claudia-Frank

    Interesting…still don't know what was going wrong with it for me…but I'm not inclined to spend any more time to find out. :-)

    Some tasks just sort of feel right outside Notepad++, and for me this is one of them. I guess if there was a no-brainer menu function that did it I'd feel differently, but I dunno, do you think we'll ever see that? :-)

    And BTW, I should have said C = A - B above.


  • Hi, @timmy-M, @claudia-frank, @scott-sumner and All,

    As Scott said, I do think that an external scripting language is the best way to solve your goal ! And I'm convinced that both Claudia's and Scott's scripts work just fine !

    But your stubborn servant just gave a try to a regex solution, which is, of course, longer and less elegant than a script :-((


    Timmy, I could, correctly, download your two files DocA.txt and DocB.txt , without any trouble !

    • The hosts file, DocA.txt , is a Unix UTF-8 encoded file, containing 255,386 lines

    • The Safe Facebook servers file, DocB.txt , is a Windows UTF-8 BOM encoded file, containing only 113 lines

    Now, if we try to change all the lines of your DocB.txt , containing the safe sites, into a unique line, we get a line of 3,701 bytes. But, if we were going to add 3,701 bytes to each line of DocA.txt to simulate my 2nd method, in my previous post, the total size of DocA.txt would have been increased, up to 908 Mo almost ( 6,788,298 + 255,386 * 3,701 = 951,971,884 bytes ! ) Obviously, this resulting file would be too big to handle while running even simple regexes :-((
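    The size estimate checks out (a quick sketch of the arithmetic):

```python
doc_a_bytes = 6_788_298   # current size of DocA.txt
doc_a_lines = 255_386     # lines in DocA.txt
b_line_bytes = 3_701      # DocB.txt collapsed into a single line

total = doc_a_bytes + doc_a_lines * b_line_bytes
print(total)           # 951971884
print(total / 2**20)   # about 908 (MiB)
```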

    So, …we need to find out a third way !

    To begin with, I copied DocA.txt as DocC.txt and DocB.txt as DocD.txt

    Then, I did some verifications on these two files :

    • I normalized all letters of these files in lower-case ( with Ctrl + A and Ctrl + U )

    • I performed a classical sort ( Edit > Line Operations > Sort Lines Lexicographically Ascending )

    Note : Sort comes after the Ctrl + U command, as the sort is a true Unicode sort !

    • Then I ran the regex ^(.+\R)\1+ to search possible duplicate consecutive lines

      • DocD.txt did not contain any duplicate line

      • DocC.txt contained 118 duplicates, which appeared, due to the previous normalization to lower-case. Then, in order to delete them, I used the regex S/R :

    SEARCH ^(.+\R)\1+

    REPLACE \1

    So, DocC.txt , now, contained 255,268 lines
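    In Python terms, the ^(.+\R)\1+ S/R behaves like this sketch ( \n standing in for \R ):

```python
import re

# Sorted text with runs of identical lines; keep one copy of each run.
text = "ant\nant\nbee\ncat\ncat\ncat\n"
deduped = re.sub(r'^(.+\n)\1+', r'\1', text, flags=re.M)
print(deduped)  # "ant\nbee\ncat\n"
```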

    • I noticed that, in DocD.txt , many occurrences of addresses ended with "facebook.com" or "fbcdn.net", for example. Then, after copying DocD.txt contents, in a new tab, I decided to have an idea of the different sites, regarded as safe, with the regex S/R, below :

    SEARCH ^.{8}.+[.-]([^.\r\n]+\.[^.\r\n]+)$

    REPLACE \1\t\t\t\t\t$0

    After a classical sort, I obtained the following list :

                  amazonaws.com     0.0.0.0 ec2-34-193-80-93.compute-1.amazonaws.com
                  facebook.com      0.0.0.0 0-act.channel.facebook.com
                  facebook.com      0.0.0.0 0-edge-chat.facebook.com
                  facebook.com      0.0.0.0 1-act.channel.facebook.com
                  facebook.com      0.0.0.0 1-edge-chat.facebook.com
                  facebook.com      0.0.0.0 2-act.channel.facebook.com
                  facebook.com      0.0.0.0 2-edge-chat.facebook.com
                  facebook.com      0.0.0.0 3-act.channel.facebook.com
                  facebook.com      0.0.0.0 3-edge-chat.facebook.com
                  facebook.com      0.0.0.0 4-act.channel.facebook.com
                  facebook.com      0.0.0.0 4-edge-chat.facebook.com
                  facebook.com      0.0.0.0 5-act.channel.facebook.com
                  facebook.com      0.0.0.0 5-edge-chat.facebook.com
                  facebook.com      0.0.0.0 6-act.channel.facebook.com
                  facebook.com      0.0.0.0 6-edge-chat.facebook.com
                  facebook.com      0.0.0.0 act.channel.facebook.com
                  facebook.com      0.0.0.0 api-read.facebook.com
                  facebook.com      0.0.0.0 api.ak.facebook.com
                  facebook.com      0.0.0.0 api.connect.facebook.com
                  facebook.com      0.0.0.0 app.logs-facebook.com
                  facebook.com      0.0.0.0 ar-ar.facebook.com
                  facebook.com      0.0.0.0 attachments.facebook.com
                  facebook.com      0.0.0.0 b-api.facebook.com
                  facebook.com      0.0.0.0 b-graph.facebook.com
                  facebook.com      0.0.0.0 b.static.ak.facebook.com
                  facebook.com      0.0.0.0 badge.facebook.com
                  facebook.com      0.0.0.0 beta-chat-01-05-ash3.facebook.com
                  facebook.com      0.0.0.0 bigzipfiles.facebook.com
                  facebook.com      0.0.0.0 channel-ecmp-05-ash3.facebook.com
                  facebook.com      0.0.0.0 channel-staging-ecmp-05-ash3.facebook.com
                  facebook.com      0.0.0.0 channel-testing-ecmp-05-ash3.facebook.com
                  facebook.com      0.0.0.0 check4.facebook.com
                  facebook.com      0.0.0.0 check6.facebook.com
                  facebook.com      0.0.0.0 creative.ak.facebook.com
                  facebook.com      0.0.0.0 d.facebook.com
                  facebook.com      0.0.0.0 de-de.facebook.com
                  facebook.com      0.0.0.0 developers.facebook.com
                  facebook.com      0.0.0.0 edge-chat.facebook.com
                  facebook.com      0.0.0.0 edge-mqtt-mini-shv-01-lax3.facebook.com
                  facebook.com      0.0.0.0 edge-mqtt-mini-shv-02-lax3.facebook.com
                  facebook.com      0.0.0.0 edge-star-shv-01-lax3.facebook.com
                  facebook.com      0.0.0.0 edge-star-shv-02-lax3.facebook.com
                  facebook.com      0.0.0.0 error.facebook.com
                  facebook.com      0.0.0.0 es-la.facebook.com
                  facebook.com      0.0.0.0 fr-fr.facebook.com
                  facebook.com      0.0.0.0 graph.facebook.com
                  facebook.com      0.0.0.0 hi-in.facebook.com
                  facebook.com      0.0.0.0 inyour-slb-01-05-ash3.facebook.com
                  facebook.com      0.0.0.0 it-it.facebook.com
                  facebook.com      0.0.0.0 ja-jp.facebook.com
                  facebook.com      0.0.0.0 messages-facebook.com
                  facebook.com      0.0.0.0 mqtt.facebook.com
                  facebook.com      0.0.0.0 orcart.facebook.com
                  facebook.com      0.0.0.0 origincache-starfacebook-ai-01-05-ash3.facebook.com
                  facebook.com      0.0.0.0 pixel.facebook.com
                  facebook.com      0.0.0.0 profile.ak.facebook.com
                  facebook.com      0.0.0.0 pt-br.facebook.com
                  facebook.com      0.0.0.0 s-static.ak.facebook.com
                  facebook.com      0.0.0.0 s-static.facebook.com
                  facebook.com      0.0.0.0 secure-profile.facebook.com
                  facebook.com      0.0.0.0 ssl.connect.facebook.com
                  facebook.com      0.0.0.0 star.c10r.facebook.com
                  facebook.com      0.0.0.0 star.facebook.com
                  facebook.com      0.0.0.0 static.ak.connect.facebook.com
                  facebook.com      0.0.0.0 static.ak.facebook.com
                  facebook.com      0.0.0.0 staticxx.facebook.com
                  facebook.com      0.0.0.0 touch.facebook.com
                  facebook.com      0.0.0.0 upload.facebook.com
                  facebook.com      0.0.0.0 vupload.facebook.com
                  facebook.com      0.0.0.0 vupload2.vvv.facebook.com
                  facebook.com      0.0.0.0 www.login.facebook.com
                  facebook.com      0.0.0.0 zh-cn.facebook.com
                  facebook.com      0.0.0.0 zh-tw.facebook.com
                  fbcdn.net         0.0.0.0 b.static.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 creative.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 ent-a.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 ent-b.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 ent-c.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 ent-d.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 ent-e.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 external.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 origincache-ai-01-05-ash3.fbcdn.net
                  fbcdn.net         0.0.0.0 photos-a.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 photos-b.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 photos-c.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 photos-d.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 photos-e.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 photos-f.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 photos-g.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 photos-h.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 profile.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 s-external.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 s-static.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-a-lax.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-a-sin.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-a.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-b-lax.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-b-sin.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-b.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-c.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-d.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-e.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent-mxp.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 scontent.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 sphotos-a.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 static.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 video.xx.fbcdn.net
                  fbcdn.net         0.0.0.0 vthumb.ak.fbcdn.net
                  fbcdn.net         0.0.0.0 xx-fbcdn-shv-01-lax3.fbcdn.net
                  fbcdn.net         0.0.0.0 xx-fbcdn-shv-02-lax3.fbcdn.net
                  net23.net         0.0.0.0 scontent-vie-224-xx-fbcdn.net23.net
                  net23.net         0.0.0.0 scontent-vie-73-xx-fbcdn.net23.net
                  net23.net         0.0.0.0 scontent-vie-75-xx-fbcdn.net23.net

    And a quick examination shows that all these 113 sites, considered as safe, have an address ending with one of the 4 values, below :

                  amazonaws.com facebook.com fbcdn.net net23.net

    Thus, from the remaining 255,268 lines, of DocC.txt , only those whose address ends with amazonaws.com, facebook.com, fbcdn.net, or net23.net have to be compared with the list of the 113 safe servers !

    • So, in DocC.txt , I bookmarked all lines, matching the regex (amazonaws|facebook)\.com$|(fbcdn|net23)\.net$ . I obtained 301 bookmarked results that I copied to the clipboard with the option Search -> Bookmark -> Copy Bookmarked lines

    • Then, I pasted the 301 bookmarked lines, at the very beginning of DocD.txt , followed with a line of, at least, 3 dashes, as a separator. Finally, DocD.txt contained 301 results to verify + 1 line of dashes + 113 safe sites, that is to say a total of 415 lines

    Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?) , I marked all the lines which have a duplicate, further on, in DocD.txt => 106 lines bookmarked !

    And using the command Search > Bookmark > Remove Unmarked lines, I was left with a DocD.txt file, containing 106 lines, only, which are, both, in DocA.txt and DocB.txt . Then, these lines / sites needed to be deleted from the DocC.txt file !

    To that purpose :

    • I added these 106 lines at the end of the DocC.txt file, giving a file with 255,374 lines

    • I did a last sort operation ( Edit > Line Operations > Sort Lines Lexicographically Ascending )

    => After sort, these 106 sites should appear, each, as a block of two consecutive duplicate lines

    And, finally, with the regex S/R :

    SEARCH ^(.+\R)\1+

    REPLACE Leave EMPTY

    I got rid of these safe servers and obtained a DocC.txt file of 255,162 lines ( 255,374 - 106 * 2 ) which contains, only, unwanted servers !
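    For the record, the whole third way reduces to a suffix pre-filter plus a set lookup. A Python sketch with toy data (for the real files, DocB.txt would need the utf-8-sig encoding because of its BOM):

```python
import re

# Toy stand-ins for the lower-cased, deduplicated DocC and the safe list DocD.
doc_c = ["0.0.0.0 ads.example.org",
         "0.0.0.0 graph.facebook.com",
         "0.0.0.0 tracker.fbcdn.net"]
safe = {"0.0.0.0 graph.facebook.com"}

# Only addresses ending in one of the four watched suffixes can be safe,
# so every other line skips the comparison entirely.
watched = re.compile(r'(amazonaws|facebook)\.com$|(fbcdn|net23)\.net$')
result = [l for l in doc_c if not (watched.search(l) and l in safe)]
print(result)  # ads.example.org and tracker.fbcdn.net survive
```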


    Now, if you do not want to repeat all the above process, by yourself, just send me an e-mail to :

    And I'll send you, back, this DocC.txt ( = DocA.txt - DocB.txt ) of 255,162 lines

    Cheers,

    guy038

    P.S. :

    Timmy, to be exhaustive, and regarding your initial files :

    DocB.txt contains 113 lines / safe servers :

    • 7 lines, not present in DocA.txt

    • 106 lines, present in DocA.txt


    And DocA.txt contains 255,386 lines :

    • 118 duplicate lines , ( due to case normalization ), which have been deleted

    • 106 lines which are, both, present in DocA.txt and DocB.txt and have been deleted, too !

    • 254,967 lines, with an end of line, different from amazonaws.com, facebook.com, fbcdn.net, and net23.net , which have to remain in DocA.txt

    • 195 lines, with an end of line, equal to amazonaws.com, facebook.com, fbcdn.net, or net23.net but not present in DocB.txt . So, they must remain in DocA.txt , too !


    Hence, the final DocA.txt ( = my DocC.txt ) which contains 255,162 unwanted servers ( 254,967 + 195 )


  • "Replace All: 106 occurrences were replaced."

    @guy038 You sir, are a regex Legend! :D

    Of course I have taken every step you've explained, it's the least I could do after you've put in such effort. I even found a minor mistake in your explanation (I think):

    Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate, further on, in DocD.txt

    Should be bookmarked (with regular marking, the "Remove Unmarked Lines" removes all lines).

    @Scott-Sumner @Claudia-Frank
    I guess it's time to face that monkey then… I installed the Python Script plugin, navigated to Python Script > New Script and created ScreamingMonkey.py
    Now, selecting Run Previous Script (ScreamingMonkey) doesn't do anything. I've done some searching and found that I should install Python first from python.org/downloads/, should I? Version 3?
    Before installing something I know nothing about, I'd like to verify with you guys whether that is the correct course of action. I hope that's okay :)


  • @Timmy-M

    no, there is no need to install an additional python package,
    as python script plugin delivers its own.

    What you need to do is to open your two files like I've shown above as the
    script will reference them by using editor1 and editor2 and click on ScreamingMonkey.
    Once it has been run you can then call it again by using Run Previous Script.

    What do you see if you try to open the python script console?
    Plugin->PythonScript->Show Console.

    If nothing happens, no new window opens, then how did
    you install it, via plugin manager or via the msi package?

    The msi installation is preferred as it was reported that the installation via plugin manager sometimes doesn't copy all needed files.

    Thanks
    Claudia


  • Hello, @timmy-m and All,

    Ah, yes ! You're right, although what I wrote was right, too ! Indeed, I used the Remove Unmarked lines , in order to change DocD.txt , first => So, there remained 106 lines that I, then, added to DocC.txt contents.

    But you, certainly, used this easier solution, below :



    Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?) , I marked all the lines which have a duplicate, further on, in DocD.txt => 106 lines bookmarked !

    And using the command Search > Bookmark > Copy Bookmarked lines, I put, in the clipboard, these 106 lines, only, which are, both, in DocA.txt and DocB.txt and have to be deleted from the DocC.txt file !

    To that purpose :

    • I paste these 106 lines ( Ctrl + V ) at the end of the DocC.txt file, giving a file with 255,374 lines


    Best Regards,

    guy038


  • @Timmy-M

    Regarding "marked versus bookmarked":

    Not @guy038 's fault… The Notepad++ user interface is confusing on this point; it uses the terminology "mark" often when it should (IMO) use "bookmark". Alternatively, if it used "redmark" and "bookmark" exclusively there would be no confusion, but I suppose using "red" wouldn't totally be right as it is just the default color and can be changed by the user.


  • I used the msi install method. First updated N++ to latest, then installed Python plugin.

    @Claudia-Frank said:

    What do you see if you try to open the python script console?
    Plugin->PythonScript->Show Console.

    ---------------------------.
    Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
    Initialisation took 31ms
    Ready.
    Traceback (most recent call last):
    File "%AppData%\Roaming\Notepad++\plugins\Config\PythonScript\scripts\ScreamingMonkey.py", line 1, in <module>
    with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
    TypeError: 'encoding' is an invalid keyword argument for this function
    Traceback (most recent call last):
    File "%AppData%\Roaming\Notepad++\plugins\Config\PythonScript\scripts\ScreamingMonkey.py", line 1, in <module>
    with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
    TypeError: 'encoding' is an invalid keyword argument for this function
    ---------------------------.

    Eh… Lil' help please? That monkey's screaming really loud right now O_O


  • @Timmy-M

    but that is not my script.
    Scotts script assumes you run it under python3 as it uses the encoding argument
    which is not available in python2, which is used by python script.

    Did you try my 3 lines code posted above?

    If you still want to use Scotts code and execute it via python script plugin, then
    get rid of the encoding='utf-8' so something like

                  with open('DocA.txt', 'r') as filea ...

    but you have to ensure that DocA.txt can be found and python script's
    working directory is set to the directory where notepad++.exe is.
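    An alternative to dropping the argument: io.open accepts encoding on Python 2 as well as Python 3, so a sketch like the following would run unchanged under the plugin's bundled Python 2 (a temp file is used here to keep the example self-contained):

```python
import io
import os
import tempfile

# Python 2's builtin open() has no encoding parameter (hence the TypeError);
# io.open() takes one on both Python 2 and Python 3.
path = os.path.join(tempfile.mkdtemp(), 'DocA.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'0.0.0.0 example.com\n')
with io.open(path, 'r', encoding='utf-8') as filea:
    linesa = filea.readlines()
print(linesa)  # ['0.0.0.0 example.com\n']
```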

    Cheers
    Claudia


  • @Timmy-M

    Yea, for sure go with @Claudia-Frank 's script, since you seem to want to stay more "within" Notepad++!


  • @Claudia-Frank said:

    Did you try my 3 lines code posted above?

    I have and eventually succeeded! On my first dozen attempts (turns out I really suck at taming monkeys) I only kept reorganizing the order of all entries in either DocA or DocB, depending on which one was selected, without any entry getting removed.
    After trying a bunch of stuff, I found that I had to use View > Move/Clone Current Document > Move to Other View on Doc B. I just couldn't get my head around "view0" and "view1" at first but didn't want to bother you with such a mundane question either ^_^
    DocA's order does get messed up eventually but that's easily fixed with a regular sort. Maybe that sort can be implemented in the script?

    Anyway, it all worked out! Thank you very much @Scott-Sumner, @Claudia-Frank and @guy038 for this learning experience. You're a dream team ;)

    Source: https://community.notepad-plus-plus.org/topic/15436/subtract-document-b-from-a
