Linuxhelp _ Programming in Linux _ gawk help

Posted by: mkingiii Oct 24 2006, 04:20 AM

I am sure this is a relatively simple problem, but I am a novice at Linux and need the answer quickly, as it is critical to other work. I need to find and replace the same text in multiple files. After a quick search of Linux commands, I found a site that says gawk can find and replace text. However, the site does not give an example of the full sequence of commands, only that gawk can be used to do it. I would greatly appreciate it if someone could provide an example of the full command sequence to find and replace text. Thank you for helping.

Posted by: michaelk Oct 24 2006, 07:29 AM

There are many tools that are capable of replacing text.


sed example
sed -e 's/oldstuff/newstuff/g' inputFileName > outputFileName

Posted by: mkingiii Oct 24 2006, 04:22 PM

Thanks, that helped a lot. However, I now have a follow-up question. I am able to run the replace command, but I am unable to keep the initial file name. I would like to find and replace the text in the original file, because the file name is important: I call on it in later programs (programs I can't modify). When I try this, it keeps the original text and puts the modified text below it. Is there some way to perform the find and replace in the original file?

Posted by: mkingiii Oct 24 2006, 05:10 PM

Actually, I think I have solved the problem. I am using the following command:

awk 'gsub(/this/, "that")' test >> testtmp ; mv testtmp test

This appears to find and replace the text while keeping the original file name. If this has some kind of problem I don't know about yet, please let me know. Otherwise I will start employing this to modify the many files I need. Thanks again for your help.

Well, looks like I have answered my own question again. I tried one more test of the command line. This time I added text that did not include "this". The resulting test file only contained "that", it had deleted any line that did not originally have "this".
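The lines disappear because awk treats a bare expression as a pattern: gsub() returns the number of substitutions it made, so a line with zero replacements evaluates as false and is never printed. Below is a minimal sketch of one way to keep every line. The scratch directory and sample text are made up for illustration; only the file names test and testtmp come from the command above.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf 'this line matches\nno match here\n' > "$workdir/test"

# gsub() now runs inside an action block on every line; the trailing 1 is an
# always-true pattern whose default action prints the line, changed or not.
# A single > starts the temporary file fresh instead of appending to it.
awk '{ gsub(/this/, "that") } 1' "$workdir/test" > "$workdir/testtmp" &&
    mv "$workdir/testtmp" "$workdir/test"

cat "$workdir/test"
```

The redirection uses > rather than >> so that a leftover testtmp from an earlier run cannot end up prepended to the result.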

Posted by: mkingiii Oct 24 2006, 05:44 PM

Ok I think I have it this time. I am using the following command:

sed -e 's/YSN/LYS/g' test > test2 ; mv test2 test

This appears to be doing what I want. It replaces all instances of YSN with LYS, keeps any lines that do not contain YSN, and overwrites the original file. If there is something wrong with this, as there was with my first attempt, please let me know. Otherwise I think I have it. Thanks for your help.
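One possible refinement, offered as a sketch rather than a required change: joining the two commands with && instead of ; means mv only runs when sed exits successfully, so a failed sed cannot replace the original with an empty or partial file. The scratch directory and sample lines below are invented for the demonstration.

```shell
# Scratch directory and sample data (hypothetical, for demonstration only).
workdir=$(mktemp -d)
printf 'YSN first\nALA second\nYSN YSN third\n' > "$workdir/test"

# Write the edited text to a temporary file next to the original, then
# rename it over the original only if sed reported success.
sed -e 's/YSN/LYS/g' "$workdir/test" > "$workdir/test2" &&
    mv "$workdir/test2" "$workdir/test"

cat "$workdir/test"
```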

Posted by: mkingiii Oct 24 2006, 06:24 PM

Ok here is a sample of the actual command I tried to use:

sed -e 's/YSN/LSY/g' ./aa6/prod/50/0/final.pdb.gz >> store ; mv store ./aa6/prod/50/0/final.pdb.gz

However, nothing happened. The file was unchanged. What is the problem now? I would appreciate some assistance, thanks.

Posted by: michaelk Oct 24 2006, 09:13 PM

The command should work as posted. Without actually knowing the contents of the file, I assume from the gz that it is a compressed archive. Replacing text in an archive might leave the actual uncompressed file corrupted.

Is YSN text in the archive file itself or in a file contained within final.pdb.gz?

I just noticed the ./ which is a shortcut for the current working directory. Assuming what is posted is the complete path, use:
sed -e 's/YSN/LSY/g' /aa6/prod/50/0/final.pdb.gz >> store ; mv store /aa6/prod/50/0/final.pdb.gz

I still question why you would replace text in an archive file.

Posted by: mkingiii Oct 24 2006, 11:07 PM

I am performing molecular modeling of proteins. The files contain data relevant to the protein for each amino acid in the protein, in a table format. Part of the data on the protein is the three-letter designation for each amino acid in the sequence. I wanted to make the protein neutral, so I had to modify a lysine residue to get rid of its positive charge. Lysine is normally given the abbreviation LYS. When I modified the lysine I called it LYSN, for lysine neutral. The program I use to construct the protein, AMBER, mandated that the amino acid code name have only three letters, so it shortened LYSN to YSN. However, I am now experiencing a problem. The analysis programs I use somehow make use of the three-letter designation and are not recognizing YSN. I have gone in manually using emacs and replaced YSN with LYS in one file to confirm that this is the problem. Following the conversion to LYS, I was able to use the analytical programs on the file. The analytical programs are part of a commercial package, so I cannot readily alter them. So what I need to do is go into the files and change YSN to LYS. I have thousands of files to alter, so I really can't do them all using emacs.

Yes, the files are compressed. They are red, although I am not entirely sure what the color means. When I open them in emacs it says unzipping.

What I listed is the complete path from the directory I am trying to work from. However, this directory itself is a subdirectory. I actually think I tried removing the . but I believe I got an error message about something not existing. I am not currently at the computer, but I will try again later and post the exact error message.

I really appreciate your taking the time to provide assistance. Thank you.

Posted by: michaelk Oct 25 2006, 11:43 AM

Since your file is an archive, sed will not be able to replace text the way emacs does. A bash script will work nicely to automate the process. However, without knowing some details I cannot provide anything specific. Does the archive contain many files or just one?

To uncompress the file:
gunzip final.pdb.gz
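sed only sees the compressed bytes of a .gz file, so the literal text YSN is most likely not there to match, which would explain why the earlier attempt left the file unchanged. Here is a minimal sketch of editing the contents as a stream, assuming the file is a plain gzip file holding a single text file (not a tar archive). The scratch directory, the sample records and the temporary name store.gz are made up for illustration.

```shell
# Build a small gzip file to stand in for final.pdb.gz (sample data only).
workdir=$(mktemp -d)
printf 'ATOM 1 N YSN 1\nATOM 2 CA ALA 2\n' | gzip > "$workdir/final.pdb.gz"

# Decompress to a stream, edit the text, recompress into a temporary file,
# then rename it over the original. The && tests the exit status of the
# last command in the pipeline (the final gzip).
gzip -dc "$workdir/final.pdb.gz" | sed -e 's/YSN/LYS/g' | gzip > "$workdir/store.gz" &&
    mv "$workdir/store.gz" "$workdir/final.pdb.gz"

# Show the edited contents.
gzip -dc "$workdir/final.pdb.gz"
```

This keeps the original file name and never leaves an uncompressed copy on disk.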

Posted by: mkingiii Oct 25 2006, 01:56 PM

The archive ./aa6/prod/50/0/final.pdb.gz contains just one file (final.pdb.gz). All the files I need to modify have a similar designation:


Posted by: mkingiii Oct 25 2006, 08:39 PM

Ok I think I have it finally.

I ran the following command sequence:
gunzip ./aa1/prod/0/1/final.pdb.gz ; perl -pi -e 's/YSN/LYS/g' ./aa1/prod/0/1/final.pdb ; gzip ./aa1/prod/0/1/final.pdb

This appears to have worked. I tested it and I was able to run the analysis programs. So thanks for all your help. All I need to do now is figure out a way to implement this. What I have is a file containing a list of files to be modified, for example


Any ideas on how I can easily modify this file to have the aforementioned command sequence?

Posted by: mkingiii Oct 26 2006, 03:10 PM

I guess what I would want to know is if there is a way to create a generic program file containing the command sequence I need to employ with place holders for the specific files from the lists. Something like:

gunzip xxx.gz ; perl -pi -e 's/YSN/LYS/g' xxx ; gzip xxx

Then tell this program file to run the commands on the contents of the files containing the lists.
