One for you clever techy people...

I have a stack of hard copy documents with an extensive amount of data that needs to be transferred into excel.  How can this be done?? 

No clue anymore, give a few a ring on Monday and see if you get lucky with a bright undergrad on holiday.  OCR isn’t clever enough to know what data to put into what cell.

Not at all, I neither a trainee or a paralegal. It's for a private research project I'm doing.

Thought about the pdf converter route. Tried it but it does not break it down into cells in excel, just lumps all the data into one cell, which isn't useful or helpful.

 

 

 

You can work the scan into Excel directly through Microsoft as well. Access Microsoft Office tools and open Microsoft Office Document Imaging. Click "File" and "Open" and then select the document. Look at the lower right corner of the screen and find the eyeball icon. Hover your mouse over this icon, and it should read "OCR" or "Optical Character Recognition." Click the icon to run the scan through the character recognition process. Next, highlight the entire document, copy it and then paste it into Excel. You can still opt to paste the results into a Word document first to work through the data before entering it into Excel.

If you can get it into excel or a digital format at all then you just need to do some data-wrangling on it. Download data-cleaner.

https://datacleaner.org/

It’s fairly easy for non technical people to use I would say. Are the columns clearly defined? Weird, for such a good product their website is unbelievably shit on a phone

Thanks all. Some helpful suggestions that I'll look into. 

I have about a ream of paper, with the details on both side at around 20 entries per sheet. I did a practice simply copying and pasting from pdf (once I scanned it) to excel and it just put the data into one cell - just as it looks on paper. 

May look to outsource but reluctant to as it may be price prohibitive. I do around 60wpm but don't have the time to painstainkingly do it myself.

 

Helloooooooooo me again.

Update: so, I've somehow painstakingly scanned all that ream + of paper. I have used https://online2pdf.com/ to try to convert into excel - which it does - but not uniformedly. All the information is in the excel spreadsheet but not in any order.

My next question is, how can I get them in order? For example in columns that say: Name, Company, Address etc. Can it be done? Is there a formula that can do this? Or is it a case of manually going through the tens of thousands of sells and retyping the required information??

@Effi: thanks for the steer. It seems that the Abbyscan method in the youtube link basically does the same thing as the online thingy I used. Plus datacleaner is now defunct.

@Wellington: I tried the OCR method and it basically separated each and every word of the document into their own separate cells?

Any pointers and steers greatly appreciated :-)

 

 

 

Gosh.

Do you know NOTHING about Excel? What are you? A lawyer?

Anyway,  1) select everything in the spreadsheet 2) Click on Data in the menu bar. 3) Click on Sort 4) Choose which column you want to sort on 5) say whether or not you have column labels in the first row 6) Click Ok 7) Have a glass of wine

 

1)I suggest that you set up a simple spreadsheet with half a dozen rows and columns and test sorting on multiple columns.

2) Clean up before you sort

3) Be very careful if you select less than everything. Sorting on a subset of columns wrecks your data. So think carefully

@Elfffi: I know a fair amount about excel, use it most every day, but not of this complexity. The data currently in the spreadsheet is not in the form that could benefit from the basic steps you indicate above - which I am familar with.

The data I need, needs to be essentially ripped from the spreadsheets as they currently sit so randomly within it as follows:

There are over 950 lines of data (produced from just 1 of the 15 pdfs I converted, so this is just the begining;

There is 1 cell of data per line . I only need the data from 2 lines in  every 5 or 6 lines of the 950. How can I rip the data from every other 2 lines/cells from the whole of the excel spread sheet?

There is only 1 column in the whole of the document (I can label it 'name' for argument sake).

 

 

 

 

I'm sure you'll get there, but what point are you trying to show / prove. There maybe easier ways to skip there if you start from the end, rather than the beginning oddly enough and just take selective data.

There is 1 cell of data per line . I only need the data from 2 lines in  every 5 or 6 lines of the 950.

So nothing in the spreadsheet identifies which cells you want?  How do you expect to select every five or every six? Presumably not at random.

What chambo said. This almost certainly won’t be worth the time and effort you are putting into it.

Does also beg the question why you can’t just ask for soft copies of whatever it is you’re reviewing. 

'So nothing in the spreadsheet identifies which cells you want?  How do you expect to select every five or every six? Presumably not at random.'

In the entire spreadsheet the are 2 words per plus information which identifies the information randomly set out that follows so for example:

line1: 'holder name' (what I need), followed by a mass of unwanted wordage

line2: a mass of unwanted wordage related to 1

line3: a mass of unwanted wordage related to 1 and 2

line4: a mass of unwanted wordage related to 1,2 and 3

line5: 'address' (what I need) followed by unwanted wordage related to 1,2,3 and 4

and so it goes on for nearly 1000 lines

I need 'holder name' and 'address' per every 5 or 6 lines. Can these be extracted or ripped from the spreadsheet to create a separate spreadsheet of two columns, one that says 'holder name' and the other that says 'address'. 

The institution that provided the information does not provide soft copies (I created my own soft copies from the ream of double sided hard copies they provided)...thus....

 

Question: I want to copy every 5th cell into a new column in an Excel worksheet. How can I do this?

Answer: You can use the Offset function for this.

  • Select the cell where you want the output to start, let’s say C1
  • Enter: =OFFSET(Sheet1!$A$1,(ROW()-1)*5,0)  ,  where A1 is the first cell to copy and 5 the number of cells to skip for the next cell to copy
  • Now you can drag this function in C1 down

untitled

@Elfffi: ok so, this is the formula input as per my document: 

=OFFSET(Sheet1!$A$5,(ROW()-1)*5,0) BUT whereas you have numbers 1,2, and 3 in your C column, my result in my column shows zeros '0' when I drag down the cells. Am I on the wrong track? 

@Zero Gravitas: Partially correct. Legally obtained, they are not the individuals personal addresses and the institution does provide it in *soft* copy.

It doesn’t have to be a personal (I.e. home) address to constitute personal information. It would be interesting to understand how you’ve legally obtained this info (obvs don’t expect you to actually answer on the board, just saying be careful, this sounds dodgy AF tbh)

Mis

For my test spreadsheet, your formula copied into C1, C2, C3 enters into those cells the values of A5, A10, A15.

Does your Excel version compute formulae automatically? Or do you need to press F2 or something?

@ZG: LOL! I'd have to be some kind of moron to involve myself in any illegal practices in plain sight or otherwise...I'm so up to my eyeballs in gdpr at work it would be a pity.

@Elfffi: Let me have another go at it and see where I land. Also I'm looking into engaging a 'data scraper' off of PeoplePerHour but it may well still be price prohibitive. 

 

I just cannot make sense of this tbh.

Every 5 or 6 lines?

what separates the cells? It sounds like white space. Is there anything else in the address field that identifies it as different to the unwanted ones? Eg occurrence of commas? It’s a really hard question if you just want to separate particular sections of plain text from others

I think I've cracked it! Sort off.

PPH were interesting, various offers saying it can be done the whole job could take upwards of £800 - and that was with me transferring the pdfs into excel first. SO! what to do? - Rethink.

1) I converted the pdfs into excel

2) I search and find the entire excel sheet for entries that say 'licence holder' (for example) and replace it with nothing (so only the actual name of the 'licence holder' remains

3) Besides the 'licence holder' name (in the same cell) still remains a shed load of blurb that I don't need, so, I use the 'delimit' function to split the excel sheet into two columns: 1) 'licence holder' 2) 'info I need'

4) So now I have two columns with what I want, HOWEVER, there is still some data in some of the cells in column 2 that I do not need so for those, I'm having to manually delete each of them, line by line.

I think from looking at youtube, there is a function called VBA that could probably do this? (it involves macros etc), but to be honest I just don't have the time to learn it at the moment. 

I probably managed to get a clean data of around 350 entries, but if I continue with the above system it will take me approximately a month to get through.

Thoughts?

 

My first thought is to smile at you referring to VBA as a function (it isn’t, it’s a coding language first introduced in Microsoft Excel 5.0 and is now a part of all MS Office applications.  The acronym stands for Visual Basic for Applications.  It is an event driven coding language).

My second thought is, why don’t you just throw money at this rather than waste your own valuable time?  Is a month of your spare time worth less than £800?

@Teclis@: hahaha! yes I'm like totally not tech savvy! 

Unfortunately I'm not in a financial position to throw money at it as I'd have liked. Doing it myself's not so bad...at least I'll know it's as accurate as damnit.

'there is still some data in some of the cells in column 2 that I do not need'

As Tecco says, you can do anything in VBA, ... ... except read minds.  

'Some data I do not need' is not an implementable requirement.

If you can define some rules e.g. I don't want any numeric data, or text containing capitals, or longer than 26 characters, or on 'this list', or whatever, then a solution will be possible. But 'that I do not need' does not work as a requirement.

You may need regular expressions rather than VBA, but unless you are doing something really complex, you'll be better off with Excel's native string functions such as  LEN(), LENGTH(), TRIM(), CONCATENATE() etc

A further thought to this is that you could make all the rules into a procedure which you can then execute via an onclick event in VBA by shoving a button onto the spreadsheet itself to publish the results onto a new worksheet. 

Would probably be somewhat more time productive than trying to have everything manually manipulate based on live formula of =() type functions.