Keyword Searching explained

as used by the Norcross Group

If you are an attorney, before you do anything else, please go to this page  and read the editorial in the January 2007 issue by Craig Ball.

Direct links within this document:
General Searching Questions
It is suggested that you review these links in the order they are listed here.

FILE DATES

In the simplest terms: These are general rules, but Microsoft advises that any or all of the file dates can be modified by any program.
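To illustrate the point that any program can modify file dates, here is a minimal Python sketch that back-dates a scratch file using the standard library. The helper names (set_file_times, read_modified_time) and the chosen date are our own for illustration only. This is not a description of any forensic tool.

```python
import os
import tempfile
from datetime import datetime, timezone

def set_file_times(path, when):
    """Set both the last-accessed and last-modified times of `path` to `when`."""
    ts = when.timestamp()
    os.utime(path, (ts, ts))  # (access time, modification time)

def read_modified_time(path):
    """Return the last-modified time recorded by the file system, in UTC."""
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)

# Create a scratch file, then back-date it to 1 January 2001 (arbitrary date).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"example content")
    scratch_path = handle.name

backdated = datetime(2001, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
set_file_times(scratch_path, backdated)
recorded = read_modified_time(scratch_path)
os.remove(scratch_path)
print(recorded.isoformat())
```

The file was created moments ago, yet its recorded modified date is whatever the program chose. This is why file dates should be treated as general indicators and not as proof.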


Special section on file dates explained from:
Microsoft explanation of file dates or More common explanation
Here is the text from one of the above pages:
Notice that MOVES affect the dates differently than COPIES.

Also, when dealing with MOVES or COPIES to an external USB device, there is no way of telling that the file was placed on the external device. IF (and that is a big IF) the user opened a file and a link (.lnk) file was created which pointed to the external device, the .lnk file may contain enough information to indicate the operation. BUT a simple copy or move leaves no footprint on the drive to indicate the action to the external device.

"File properties with regards to the date and time stamps"

CAVEAT: These definitions are provided in plain English to assist individuals in deciding what type of "image" they are going to need. The definitions are not designed or intended to be used in any legal documents or legal proceedings.

Cluster:   The allocation unit of the file system. A group of sectors (512 bytes each) is logically combined to form a cluster. The cluster is the way the operating system addresses file contents. Clusters can contain from one to 64 sectors of data.

 

Data : Data is information that resides on the hard drive or other storage medium (e.g. USB "thumb drive", memory card, etc). It includes 100% of the storage area of the drive. It is not restricted to the data or information which is usually visible to the casual computer user.

File slack: Space between the logical end of the file and the end of the last allocation unit (cluster) for that file. File slack may contain remnants of files and other data that at one time resided in that allocation unit, and have been deleted or moved. The reason file slack exists is that the current file data (e.g. 1000 bytes) does not take up the entire cluster (e.g. 32000 bytes), and thus residual data (31000 bytes) is left by prior files and is visible (only by forensic analysis). Some operating systems and programs take special steps to make certain any or all of the file slack is wiped when writing data to the disk.

Keyword Search: The process of using specialized software to search for a list of keywords or phrases provided by the requesting party. The most common search is a keyword search of files to locate those files containing the supplied words. Depending on the type of image/clone available, in addition to the user visible files, a more in depth search may be able to search within slack, free space, zip files, and possibly e-mail files. Keyword searches generally produce a substantial number (in the thousands) of "hits" which need to be reviewed for relevance. Keyword lists should be carefully thought out in advance, as someone will most probably have to review thousands of hits, most of which are non-responsive. Keywords that are industry generic should definitely NOT be included in any keyword search request. (You wouldn't search for zipper on a drive from the garment industry, or contract on a drive from a real estate agent.)

 

Meta Data: Is "data" about data. Generally includes file attributes such as the date and time stamps associated with the file. Can also include file size, and folder or path location and name. Meta Data when used in conjunction with Microsoft Office products generally refers to internally stored information not readily visible to the casual user. Depending on the current version and setup of the Office environment, this internal Meta data may contain additional file processing dates, author(s), print characteristics and other internally stored information.

 

Sector: The smallest addressable data storage area on a hard drive. Generally a sector contains 512 bytes of user data. Operating systems generally do not access data by a sector; they access data at the cluster level. Specialized software is needed to access data at the sector level.

 

Unallocated space: Allocation units (sectors or clusters) not assigned to active files within a file system. When a file is deleted the area of the disk that the file resides on is marked as free and is available for future use. The data or content of the file is NOT overwritten. Until this data area is overwritten, this "unallocated" area may contain any residual file data that has not yet been overwritten.
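The File slack definition above uses a 1000 byte file in a 32000 byte cluster. The arithmetic can be sketched in a few lines of Python. The function names are ours, and the sizes are the round numbers from the definition, not values taken from any particular file system.

```python
def clusters_needed(file_bytes, cluster_bytes):
    """Number of whole clusters allocated to a file (rounded up)."""
    return -(-file_bytes // cluster_bytes)  # ceiling division

def slack_bytes(file_bytes, cluster_bytes):
    """Bytes between the logical end of the file and the end of its last cluster."""
    allocated = clusters_needed(file_bytes, cluster_bytes) * cluster_bytes
    return allocated - file_bytes

print(slack_bytes(1000, 32000))   # 31000, as in the definition above
print(clusters_needed(32001, 32000))  # 2: one extra byte costs a whole cluster
```

Whatever previously occupied those 31000 bytes is still there unless something wiped it, which is why slack can produce keyword hits.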


Some Common Questions:


Can online (yahoo, AOL and similar) type e-mails be recovered?

Usually only remnants of these e-mails are seen. Since the online e-mails are read as web pages, unless the documents are specifically saved to the hard drive, the web pages viewed remain only as long as other web pages on the system. The longevity of the web pages (cache) on the system is usually user defined. Usually to get "content" of the e-mails you need to contact the e-mail service provider. (Yahoo, Microsoft, AOL etc).


When a link in a report is "clicked" on, why is it not being displayed properly?

File content may be internal metadata, partially extracted, not fully intact and incomplete. In this case, it may be necessary to open the file with "WORDPAD" or "NOTEPAD" to view any text within the file. (Many of the files fall into this category)

The "file" is a remnant of one type of file, but is mistakenly listed as another type of file, and the OS cannot properly render the file.

The reviewer's computer does not have an acceptable viewer for the file listed. It was listed because a reviewer may determine that the filename may refer to file content of confidential information.

The file may be a partial Internet file, which is a remnant of a web page.
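One way to see why a file listed as one type will not open as that type is to look at its first few bytes. Many file formats begin with a well-known signature. The sketch below checks three common signatures. The function name and the example filename are hypothetical, and real tools check many more signatures than this.

```python
# A few well-known file signatures found at the very start of a file.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip container (also used by docx/xlsx)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "legacy Office document (doc/xls/ppt)",
}

def sniff_type(data):
    """Guess a file type from its leading bytes, or return 'unknown'."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"

# Hypothetical extracted item: named as a spreadsheet, but the content
# that now sits in its clusters begins like a PDF.
listed_name = "budget.xls"
extracted_bytes = b"%PDF-1.4 remnants of some other file"
print(listed_name, "actually looks like:", sniff_type(extracted_bytes))
```

When the listed name and the sniffed type disagree, or the type is unknown, opening the item in WORDPAD or NOTEPAD as suggested above is usually the only way to see any text it holds.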


Can you do a live search of a drive, and what is searched for?

Yes, we can do a live search on a hard drive. You must realize that a "live" search will only search in files which are normally available to the Microsoft Explorer program, (and minimal access to some system and hidden files). Live searching DOES NOT search freespace or slack space. In some instances it will search within files being held in the recycle bin, but NOT in deleted files that have been removed from the recycle bin. Generally a rule of thumb for the files being searched is only files containing text information will provide useful hits. Outlook pst files, data bases, spreadsheets, pdf files, and other files which format their content to "binary" data will not produce valid hits. If however, you are looking for keywords within Word, html or other text documents our string search program can usually search at about 1-2 gigabytes per minute.
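The following Python sketch suggests why a simple string search hits in plain text but not in files that store their content in another form. It is only an illustration: the sample sentence is made up, and UTF-16 and zlib compression stand in here for the many ways "binary" formats can store text. It is not the string search program referred to above.

```python
import zlib

def contains_keyword(raw, keyword):
    """Case-insensitive search for an ASCII keyword in raw file bytes."""
    return keyword.lower().encode("ascii") in raw.lower()

text = "The zipper contract was signed on Friday. " * 10

as_plain_text = text.encode("ascii")          # like a .txt or .html file
as_utf16 = text.encode("utf-16-le")           # two bytes per character
as_compressed = zlib.compress(as_plain_text)  # like content stored compressed

for label, raw in [("plain text", as_plain_text),
                   ("utf-16", as_utf16),
                   ("compressed", as_compressed)]:
    print(label, contains_keyword(raw, "contract"))
```

The same words are present in all three, but only the plain text form contains the keyword as a run of ordinary bytes. The other forms need software that understands the format before they can be searched.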

What do you get when a keyword hit is found in drive free space?

Drive freespace (DFS) is the multi-gigabytes worth of data on the drive that is not currently assigned to any file. DFS contains remnants of items that once were files, and now have been deleted, and possibly partially overwritten. No date, filename, or other identifiable information is generally associated with the content of DFS. When a keyword is found within DFS, any surrounding text "the NORCROSS GROUP" may be related, and it may not. Only a visual/manual review of the surrounding content will determine its relevance.

Review of the surrounding text will also include a lot of noise data, and in most cases the extraction of recognizable text around the keyword is a manual process. The extraction of keyword hits in DFS should only be requested as a last resort, and the requestor should consider and be aware of the extremely slow process it is. If you took notice of the word NORCROSS above, this is an example of what you might find in DFS when the keyword NORCROSS was searched for. The keyword is there, but surrounding text amounts to unrelated noise.
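As a rough picture of what a hit in DFS looks like, the Python sketch below finds a keyword in a block of raw bytes and prints a small window of whatever surrounds it. The function name, the window size and the sample bytes are all made up for illustration. Real free space is far larger and far noisier.

```python
import re

def hits_with_context(raw, keyword, radius=16):
    """Yield (offset, surrounding_text) for each keyword hit in raw bytes.

    Bytes that are not printable ASCII are shown as '.'.
    """
    needle = re.escape(keyword.encode("ascii"))
    for match in re.finditer(needle, raw, flags=re.IGNORECASE):
        start = max(0, match.start() - radius)
        end = min(len(raw), match.end() + radius)
        window = raw[start:end]
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in window)
        yield match.start(), text

# Made-up remnant of free space: noise bytes around a fragment of old text.
free_space = (b"\x00\x07qz\xff\x13" + b"surrounding text the NORCROSS GROUP may be"
              + b"\x01\x02\x9ctmp\x00\x00")

for offset, context in hits_with_context(free_space, "norcross"):
    print(offset, repr(context))
```

Note what is missing from the result: no filename, no date, no owner. Deciding whether the surrounding text belongs with the keyword is left entirely to the reviewer.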


User Generated Files

User Generated Files are generally those files which the normal computer user will create during their day to day operation. In most cases it means, the documents, spreadsheets, data bases, pdf, zip, pictures, e-mail files, temporary internet files etc. that the user takes a positive action to create. On a normal system, these files would be located in the "Documents and Settings" folder. There would not usually be any executable or "binary" files in this grouping. Searching these types of files for a casual user produces more efficient output for review. If we process User Generated Files, we generally eliminate all or most of the easily identified programs, and other "system" type files.

User generated files also include files that may now be deleted. (whether recoverable or not.)

Deleted or Recoverable Deleted Files

Deleted files are just that, the files that the forensic software can identify as being deleted. This is more in-depth and inclusive than what the user would see in their recycle bin. However, because it is more inclusive, there may be a substantial number of files for which the forensic software can identify the filename as being deleted, but the file itself MAY NOT be recoverable.

Deleted files often are not recoverable because the actual data of the file has been partially or totally overwritten, but the residual directory entry is still available for review. (See Recovering Deleted File Content below.)

The file is not known to be fully recoverable until the recovery process is actually attempted.

Deleted files can contain all the types of files already mentioned: "system files, user files, temporary internet type files, etc". There is no easy distinction of the type of file, until it is actually recovered. However, the location or directory or filename which belongs to the file may give indication of its worthiness for searching, and recoverability.

Email Stores (pst, ost files, exchange edb)

These are searched using a different process, and may produce a totally different output than searching through other user files. Because e-mails are searched differently, we can often produce them in a standard .eml format. Requestors should be confident that they are asking for the correct type of e-mail search. Should the search be global across all users/custodians, or should it target one custodian (when dealing with Exchange server edb files)?
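For readers who have not seen one, an .eml file is simply a text file holding one e-mail message with its headers. The Python sketch below builds one with the standard library and then searches it by keyword, optionally limited to one custodian. The function names and the example.com addresses are hypothetical, and this is not the process used on pst, ost or edb stores.

```python
from email import message_from_string, policy
from email.message import EmailMessage

def build_eml(sender, recipient, subject, body):
    """Return the text of a simple .eml message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg.as_string()

def eml_matches(eml_text, keyword, custodian=None):
    """True if the keyword is in the subject or body.

    If `custodian` is given, the message must also be from or to that address.
    """
    msg = message_from_string(eml_text, policy=policy.default)
    if custodian is not None:
        parties = f"{msg['From']} {msg['To']}".lower()
        if custodian.lower() not in parties:
            return False
    haystack = f"{msg['Subject']}\n{msg.get_content()}".lower()
    return keyword.lower() in haystack

eml = build_eml("alice@example.com", "bob@example.com",
                "Lease renewal", "Please review the draft lease.")
print(eml_matches(eml, "lease"))                                 # global search
print(eml_matches(eml, "lease", custodian="carol@example.com"))  # one custodian
```

The same keyword gives different results depending on whether the search is global or targeted, which is why the question should be settled before the search is requested.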

What files should be searched?

When requesting a keyword search, the requestor should be well aware of which files he/she is asking to have searched. A typical hard drive will have thousands (more likely hundreds of thousands) of files on the drive. This file list is made up of operating system files (usually binary files), e-mail files, temporary internet browsing files, internet e-mail residue, "user" generated files (such as the typical word document or spreadsheet) and a significant amount of unused free space. If the search is performed on all the files on the system, a significant number of the hits will be located in files which are of no relevance to the case, and often are binary system files which the normal or casual reviewer will not be able to view or understand.

Keyword searches in files that are not user generated file types often produce a high percentage of the hits, and those hits are totally useless to anyone. It is therefore incumbent on the requestor to properly and accurately convey to the analyst what they wish to accomplish with the keyword searches, and determine if it is possible or preferable to restrict the keyword searches to specific file types. This will reduce the amount of "noise" hits that are produced, and make the reviewer's life a lot simpler.
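Restricting a search to specific file types can be pictured as a simple filter on the file listing. In the Python sketch below, the list of extensions and the example paths are assumptions for illustration. The right list for a given case is something to agree on with the analyst.

```python
from pathlib import PureWindowsPath

# Hypothetical list of "user generated" file types for this example.
USER_FILE_EXTENSIONS = {".doc", ".xls", ".pdf", ".txt", ".htm", ".html", ".zip"}

def select_user_files(paths):
    """Keep only the paths whose extension is on the agreed list."""
    return [p for p in paths
            if PureWindowsPath(p).suffix.lower() in USER_FILE_EXTENSIONS]

listing = [
    r"C:\WINDOWS\system32\kernel32.dll",
    r"C:\Documents and Settings\jsmith\My Documents\Lease.DOC",
    r"C:\Documents and Settings\jsmith\My Documents\rates.xls",
    r"C:\Program Files\SomeApp\app.exe",
]
for path in select_user_files(listing):
    print(path)
```

Only the document and the spreadsheet survive the filter, so any keyword hits inside the .dll and .exe files never reach the reviewer.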

What type of output do you need?

Decide what your next review or processing step is going to be. Then discuss the type of output you need.

A simple extraction of "Live Files", whether user generated or all inclusive of system type files, should be thoroughly thought out. As mentioned above, there may be hundreds of thousands of live files on a system, some of which are binary and of no use to the reviewer. If you ask for live files, or even recovered deleted files, do you want them in their original tree/folder structure? Do you wish for us to attempt to maintain the date/time meta data of the files? Do you want a listing of the files? If you want a listing, do you want it in a spreadsheet format or in a text file, realizing that older versions of Excel have a limit of 65,536 rows per sheet? What are you going to use the listing for, and do you, the reviewer, plan on reprocessing it yourself?

Do you want the files on the drive to be renamed adding a pseudo unique bates number to the filename for uniqueness?
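Renaming with a pseudo unique number can be sketched as follows. The prefix, the number width and the underscore layout are purely hypothetical choices for this example. The actual numbering scheme is whatever the requestor and analyst agree on.

```python
from pathlib import PureWindowsPath

def bates_names(filenames, prefix="DOC", start=1, width=7):
    """Pair each original path with a numbered output filename."""
    renamed = []
    for number, original in enumerate(filenames, start=start):
        name = PureWindowsPath(original).name
        renamed.append((original, f"{prefix}{number:0{width}d}_{name}"))
    return renamed

# Two different files that share the same filename in different folders.
extracted = [r"jsmith\My Documents\report.doc", r"jsmith\Desktop\report.doc"]
for original, new_name in bates_names(extracted):
    print(new_name, "<-", original)
```

The numbering keeps the two report.doc files apart once they are extracted into a single folder, and the pairing of old and new names can double as the file listing.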

With e-mails, is a file with the ".eml" format acceptable? Is a Summation data base load file needed to go along with the extracted e-mails?


The keyword search process

The Norcross Group uses a number of processes and programs to conduct keyword searches. The search can cover three distinct sections of the drive. The first section, and the most often requested, is the live files. These are files which the casual user can see when exploring the file system. These live files contain user (generated) files and system files (which are often of no use to the average request). The next group is the deleted files, some of which may be recoverable and others not. The third segment is the freespace, which is not associated with any file or other item which the casual user would have knowledge of.

The most common search is done on live files, with the major emphasis on user files. The emphasis on user files eliminates a lot of extraneous hits and targets only those files which the user created. Within the user files are sub categories like documents, spreadsheets, e-mails etc., which can be used to refine the searches.

Then, do you want duplicates eliminated? If so, what logic is to be used to eliminate duplicates: by hash, filename, content, or e-mail header? Content and e-mail header may mean different things to different analysis software.
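Of the options listed, elimination by hash is the easiest to picture. Two files with identical content produce the same hash value regardless of their names. The Python sketch below uses SHA-256 from the standard library. The choice of SHA-256, the function name and the sample files are our assumptions for the example, since different tools use different hash algorithms.

```python
import hashlib

def deduplicate_by_hash(files):
    """Split files into unique items and duplicates, judged by content hash.

    `files` maps a filename to its content as bytes. Returns a list of unique
    filenames and a list of (duplicate_name, name_of_first_copy) pairs.
    """
    first_seen = {}
    unique, duplicates = [], []
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest in first_seen:
            duplicates.append((name, first_seen[digest]))
        else:
            first_seen[digest] = name
            unique.append(name)
    return unique, duplicates

sample = {
    "lease.doc": b"Draft lease, version 1",
    "Copy of lease.doc": b"Draft lease, version 1",   # same content, new name
    "lease_v2.doc": b"Draft lease, version 2",
}
unique, duplicates = deduplicate_by_hash(sample)
print(unique)
print(duplicates)
```

Note that hash elimination only removes exact copies. A file that differs by a single character is a different file as far as the hash is concerned.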

Depending on the need of the requestor and the post processing use of the output, the results of the keyword searching can be provided in a user "clickable" html browser format, or "raw" files which can be provided to a processor to create traditional load files used by legal reviewers.

Both live file and recovered deleted file extracts can be provided in this fashion. However, the final use of the data needs to be known in advance so as to choose the right search tool (and output) for the job.

At the Physical Level (by sector):

Before going further, discussing the keyword searching process, it is useful to understand how data is located on hard drives. Generally a hard drive is comprised of sectors, which each contain 512 bytes (characters) of data.

 

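These sectors are grouped into more logical areas called clusters. Generally between 4 and 8 sectors make up a cluster, so about 2,000 to 4,000 bytes reside in each cluster. (A 64-sector cluster, the largest size mentioned in the definitions above, holds about 32,000 bytes.)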
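The relationship between sectors and clusters is simple multiplication, sketched here in Python using the 512 byte sector size stated above. The function names and the 40 gigabyte drive size are ours, chosen only for the example.

```python
SECTOR_BYTES = 512  # bytes of user data per sector, as stated above

def cluster_bytes(sectors_per_cluster):
    """Size of one cluster in bytes."""
    return sectors_per_cluster * SECTOR_BYTES

def clusters_on_drive(drive_bytes, sectors_per_cluster):
    """How many whole clusters fit on a drive of the given size."""
    return drive_bytes // cluster_bytes(sectors_per_cluster)

for sectors in (4, 8, 64):
    print(sectors, "sectors per cluster =", cluster_bytes(sectors), "bytes")

# A hypothetical 40 gigabyte drive with 8-sector clusters.
print(clusters_on_drive(40 * 10**9, 8), "clusters")
```

Even a modest drive holds millions of clusters, each of which can carry remnants of earlier files.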

 

The entire hard drive consisting of multi-gigabytes of storage is made up of enough sectors/clusters to allow data to fill up the drive.

 

Directory entries, which store the filenames, information relating to the physical location (cluster areas) on the drive, and other meta data about the files, are part of the file recovery process, but not necessarily the searching process, so we'll leave the directory association till later.

 

Let's try an analogy. Picture a large blackboard being the entire drive, with lines across it so you can write information on the blackboard. Each cell in the table below would be a cluster made up of 8 sectors where any file could write data.

 

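Cluster  1: If we write
Cluster  2: information onto
Cluster  3: the drive, we
Cluster  4: would take up
Cluster  5: some clusters
Cluster  6: for the files.
Cluster  7: freespace on the
Cluster  8: rest of the XX
Cluster  9: Gigabytes of drive
Cluster 10: which will
Cluster 11: eventually be
Cluster 12: filled with data
Cluster 13: as will areas of the
Cluster 14: drive necessary to
Cluster 15: accommodate the
Cluster 16: directory entries
Cluster 17: relating to the files
Remaining clusters: (empty)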

If we write information onto the drive, we would take up some clusters for the files.

 

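Cluster  1: If we write
Cluster  2: information onto
Cluster  3: the drive, we
Cluster  4: would take up
Cluster  5: some clusters
Cluster  6: for the files.
Cluster  7: Now we write
Cluster  8: another file after
Cluster  9: the first one.
Cluster 10: And so on.
Cluster 11: Until we get all
Cluster 12: the data we need
Cluster 13: on the hard drive.
Remaining clusters: (empty)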

Now we write another file after the first one.

And so on with a third file. Until we get all the data we need on the hard drive.

 

If we were to do a keyword search for "clusters" and/or "data" we would at this point be able to identify the complete files containing these words. We could recover them intact and all is well.

 

If we were to erase the first file, and not write anything more to the drive, we could still recover all of its contents with the keyword hit. However, what if part of that file was overwritten with new data?

 

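Cluster  1: We are going
Cluster  2: to overwrite only
Cluster  3: part of the file
Cluster  4: with new data
Cluster  5: some clusters
Cluster  6: for the files.
Cluster  7: Now we write
Cluster  8: another file after
Cluster  9: the first one.
Cluster 10: And so on.
Cluster 11: Until we get all
Cluster 12: the data we need
Cluster 13: on the hard drive.
Cluster 14: which will
Cluster 15: eventually lead to
Cluster 16: a lot of deleted
Cluster 17: and partially
Cluster 18: overwritten data
Cluster 19: also referred to
Cluster 20: quite simply as
Cluster 21: NOISE
Remaining clusters: (empty)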

The keyword "clusters" would still hit, but we would only be able to intelligently recover that small amount of information ("some clusters") in the last cluster that used to belong to the file. This area of recovered data is called file slack. The content of file slack may be quite small and contain little information relevant to the case.
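The blackboard analogy can be acted out in a few lines of Python. This is a toy model only: the drive is a plain list with one piece of text per cluster, the 26 cluster size is arbitrary, and clusters are numbered from 0 here, not from 1. Nothing in it reflects how a real file system or forensic program is built.

```python
def write_file(drive, start, pieces):
    """Write one piece of text into each cluster, beginning at `start`."""
    for offset, piece in enumerate(pieces):
        drive[start + offset] = piece

def search(drive, keyword):
    """Return the cluster numbers whose current content contains the keyword."""
    return [i for i, content in enumerate(drive) if keyword in content]

drive = [""] * 26  # an empty "blackboard"

first_file = ["If we write", "information onto", "the drive, we",
              "would take up", "some clusters", "for the files."]
write_file(drive, 0, first_file)
before = search(drive, "clusters")

# The first file is deleted (nothing is erased), then a shorter new file
# overwrites only its first four clusters.
write_file(drive, 0, ["We are going", "to overwrite only",
                      "part of the file", "with new data"])
after = search(drive, "clusters")
leftover = drive[4:6]

print(before, after, leftover)
```

The keyword hits in the same cluster before and after the overwrite, but afterwards only the leftover pieces of the original file remain around it. The beginning of the file, which would have given the hit its meaning, is gone.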

 

Next, let's assume that this file deletion, and overwriting of the available space, is done over and over by the user. The clusters will contain remnants of many different files that have resided in those areas. Effectively this residual data that is seen is nothing but "noise" and it becomes difficult to distinguish the noise (left behind over time as a result of many files occupying the same clusters) from relevant information.

 

This is especially true when the user makes use of online web based mail programs. The mail viewed is normally not saved to the drive, but only remnants of the web page are written, and constantly being overwritten and replaced. Or, suppose that all the data in a large portion of the drive is deleted, and as much as 26 megabytes of data is actually free to be written to and re-written to over and over.

 

 

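Cluster  1: We are going
Cluster  2: to overwrite only
Cluster  3: part of the file
Cluster  4: with new data
Cluster  5: some clusters
Cluster  6: for the files.
Cluster  7: Now we write
Cluster  8: stuff from other
Cluster  9: files
Cluster 10: And so on.
Cluster 11: Until we get all
Cluster 12: the data we need
Cluster 13: and so on
Cluster 14: This entire area
Cluster 15: of the drive is now
Cluster 16: unused space
Cluster 17: because files that
Cluster 18: once resided here
Cluster 19: are now deleted,
Cluster 20: but their data is
Cluster 21: residual.
Cluster 22: gmail web mail
Cluster 23: residual message
Remaining clusters: (empty)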

If we were to look for the phrase "gmail web mail" and this keyword appeared in unused space (which is likely), then imagine all the erroneous surrounding text one would encounter. Any useful information relating to the original Gmail message is extremely difficult to identify in the surrounding text, and even more difficult to extract.

 

 

Here is a scenario that happens all too often, and the reviewers have trouble understanding why the forensic software says it retrieved one type of file, but the software can't render it properly, and the keyword found bears no association to the file content retrieved.

 

 

Assume at this time, the file, a spreadsheet (.xls) that contains the content: "We are going to overwrite only part of the file with new data" was deleted. The space is free to re-use.

 

At a later time, because these clusters were free to write information to, another, shorter file was written to these clusters, say a pdf file. Now, we have a pdf file with content in the first 4 clusters of what used to be a spreadsheet, with slack containing our keyword "clusters". The directory entries of both the pdf and the spreadsheet are still intact, because they were written to different folders. So there are references to two files pointing to the same location on the drive.

 

Next, the current, pdf file is deleted, thus making that entire group of clusters free to be written to again.

 

When we do a keyword search for the word "clusters", the forensic program finds a filename which points to that group of clusters. In our case, it will associate the hit with the spreadsheet filename that at one time used those clusters of the drive, even though some of the data has been overwritten many times by other files residing in the same location.

We are advised that the area of the drive containing our keyword hit belongs to the filename that was the spreadsheet. The forensic program extracts the data/file and associates a spreadsheet filename with the data it extracts. When the reviewer clicks on the file that was extracted, if anything is displayed at all, it is usually trash because the operating system is trying to display remnants of one type of file, as another type of file, only because our keyword at some point in time belonged to a long deleted and overwritten file that is now incorrectly associated with a spreadsheet.
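The scenario above can be reproduced with a toy model in Python. The drive and the directory are plain dictionaries, the filenames (budget.xls, memo.pdf) and cluster contents are invented, and clusters are numbered from 0. It is meant only to show how a stale directory entry leads to the wrong content being extracted, not how any forensic program works internally.

```python
drive = {}       # cluster number -> content currently stored there
directory = {}   # filename -> {"clusters": [...], "deleted": bool}

def save(name, start, pieces):
    """Write a file to consecutive clusters and record a directory entry."""
    clusters = list(range(start, start + len(pieces)))
    for number, piece in zip(clusters, pieces):
        drive[number] = piece
    directory[name] = {"clusters": clusters, "deleted": False}

def delete(name):
    """Mark the entry deleted. The data itself stays on the drive."""
    directory[name]["deleted"] = True

def files_claiming(cluster):
    """Filenames whose directory entry points at this cluster."""
    return [n for n, entry in directory.items() if cluster in entry["clusters"]]

def extract(name):
    """Read whatever is NOW in the clusters the directory entry points to."""
    return [drive[c] for c in directory[name]["clusters"]]

save("budget.xls", 0, ["sheet part 1", "sheet part 2", "sheet part 3",
                       "sheet part 4", "some clusters", "for the files."])
delete("budget.xls")
save("memo.pdf", 0, ["%PDF part 1", "pdf part 2", "pdf part 3", "pdf part 4"])
delete("memo.pdf")

hit_cluster = next(c for c, content in drive.items() if "clusters" in content)
print(hit_cluster, files_claiming(hit_cluster))
print(extract("budget.xls"))
```

The hit is attributed to budget.xls, the only entry that claims that cluster, but the extracted "spreadsheet" begins with four clusters of pdf content. A spreadsheet program asked to open it will show trash, or nothing at all.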

If a keyword is found or displayed in any file EXCEPT a currently existing file the chances of recovering the entire content of the original file containing that keyword are usually not good.

Simply put, finding keywords on the drive is relatively straightforward. Determining their relevance to the case is difficult. And if the keyword is in an area of the drive not "ASSIGNED" to a file, the extraction of useful surrounding text is difficult and very time consuming. It is effectively accomplished by a cut and paste process for each keyword hit.

Selection of the Keywords:

It is best not to use industry generic keywords when searching. If the computer belongs to someone in the garment industry, don't look for the word zipper; don't look for the word contract on a computer owned by a real estate agent, etc. Don't look for the e-mail address of the owner of the computer, as it is all over the place, and don't look for simple 3 and 4 letter acronyms unless they are very unique. Remember that the operating system alone may have 30,000 to 40,000 files, and your keyword may be found in many of them.
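A keyword list can be given a rough screening before it is submitted. The Python sketch below flags keywords that are very short or that turn up in a large share of a small sample of files. The function name, the 5 character minimum and the 25% threshold are arbitrary assumptions for the example, not recommended values.

```python
def screen_keywords(keywords, file_texts, max_hit_ratio=0.25, min_length=5):
    """Return {keyword: [reasons]} for keywords likely to flood reviewers."""
    warnings = {}
    for keyword in keywords:
        reasons = []
        if len(keyword) < min_length:
            reasons.append("short keyword or acronym")
        hits = sum(1 for text in file_texts if keyword.lower() in text.lower())
        if file_texts and hits / len(file_texts) > max_hit_ratio:
            reasons.append(f"found in {hits} of {len(file_texts)} sample files")
        if reasons:
            warnings[keyword] = reasons
    return warnings

# A made-up sample from a garment industry computer.
sample_files = ["zipper order form", "zipper supplier list",
                "Zipper pricing 2006", "lunch menu"]
flagged = screen_keywords(["zipper", "ACME", "Northfield"], sample_files)
for keyword, reasons in flagged.items():
    print(keyword, "->", "; ".join(reasons))
```

A keyword that is flagged is not necessarily wrong, but it deserves a second look before someone has to review thousands of hits on it.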

Recovering Deleted File Content

When a file is deleted, the operating system marks the clusters (sectors) as available for writing. The deleted files' data is not overwritten until the operating system makes use of the clusters/sectors where the file was located. New files will overwrite this area as the operating system determines it needs the space for new files.

Even when the (deleted) file data is overwritten by a new file, the operating system may not make use of the entire area used by the previous file (if the new file is smaller in size, it may require less cluster space), and so "all" of the original data may not be completely overwritten. Some of the erased file's original data may still be found on the drive.

Also, in some instances, if residual directory clusters are found, original file name references may be located in this deleted directory cluster. But the actual data associated with the file names located in the directory cluster may already be overwritten. This causes some confusion, because in some cases the forensic software may see the original (deleted) file name in the directory cluster, but the data area pointed to does not currently hold data which was in the original file. The data is the data of a newly created file residing in the clusters vacated by the deleted file.

A complicating factor to the recovery of deleted file data (which includes directory cluster data), is that the original file data when it is written to the disk may not be written to contiguous clusters/sectors which leads to the recovery of remnants and not complete files.

If the original file was fragmented, and some of the data in the fragmented cluster locations is now overwritten by new file data, the recovered data may only be a small portion of the original content. (Usually the first part of the deleted file is overwritten first, so any "recovered" data does not appear to be correct or related to the filename found.) The viewing of only partial file content may be difficult because the program which is usually assigned to view the file (e.g. WORD opens .doc files, EXCEL opens .xls files) may have problems rendering the incomplete content. In fact, that is almost always the case when only a partial recovery of data is performed. Some types of files cannot be rendered at all if not completely recovered.
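The effect of fragmentation on recovery can be shown with another toy model in Python. Here the drive is a list of clusters and a file is stored in clusters 2, 7 and 3, in that order. The "recover assuming contiguous" function is a deliberate simplification that we made up to show what happens when the chain of clusters is no longer known. It is not a description of how any particular recovery tool behaves.

```python
def read_chain(drive, chain):
    """Read a file by following its known chain of cluster numbers."""
    return "".join(drive[c] for c in chain)

def recover_assuming_contiguous(drive, start, cluster_count):
    """Guess at a deleted file by reading consecutive clusters from `start`."""
    return "".join(drive[c] for c in range(start, start + cluster_count))

drive = ["?"] * 10

# A fragmented file: its three pieces are NOT in neighbouring clusters.
true_chain = [2, 7, 3]
for cluster, piece in zip(true_chain, ["The lease ", "terms are ", "attached."]):
    drive[cluster] = piece
drive[4] = "<other file>"  # cluster 4 belongs to some unrelated file

with_chain = read_chain(drive, true_chain)
without_chain = recover_assuming_contiguous(drive, 2, 3)
print(with_chain)
print(without_chain)
```

With the chain, the file reads correctly. Without it, the guess skips the middle of the file and picks up part of an unrelated one, which is the kind of remnant a viewer program will have trouble rendering.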

Recovering fully intact data from deleted files is not always 100% possible or accurate. There are many technical reasons. The two most probable are: another file has overwritten all or part of the data, or the original file was fragmented and the forensic software cannot recreate the chain of clusters to assemble all the parts of an original fragmented file. There are other technical reasons beyond the scope of this discussion. Hopefully these discussions were not too technical. They definitely were not intended to be too technical or to be used in any legal documents or proceedings.

 
