3
votes
4answers
90 views
How to read a 4GB file on a 32-bit system
In my case I have different files; let's assume that I have a >4GB file with data. I want to read that file line by line and process each line. One of my restrictions is that the software has to be run on 32bit ...
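A minimal sketch of the line-by-line approach described in this question, written in Python as an assumption (the excerpt does not say which language the asker uses). It iterates over the file object so that only a buffered chunk is held in memory at any time; the function name `process_lines` and the tiny temporary file standing in for the real >4GB file are hypothetical. Whether offsets beyond 4GB work on a given 32-bit platform depends on that platform's large-file support, which this sketch does not check.

```python
import os
import tempfile


def process_lines(path, handle_line):
    """Stream a text file line by line, calling handle_line on each line.

    The file is never loaded whole: iteration over the file object reads
    it in buffered chunks. Returns the number of lines processed.
    """
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            handle_line(line.rstrip("\n"))
            count += 1
    return count


# Demonstration on a small temporary file (a stand-in for the large file).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    demo_path = tmp.name

seen = []
n_lines = process_lines(demo_path, seen.append)
os.remove(demo_path)
```

Passing the per-line work in as a callback keeps the reading loop independent of whatever processing each line needs.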
1
vote
1answer
23 views
Rounding with awk -0.0
I am using awk (in a pipe) to round floating-point values in a CSV file:
awk '{$0=sprintf("%.2f",$1)}1'
This basically works, but it has the problem that it produces both "0.00" and "-0.00" ...
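The "-0.00" output arises because a small negative value rounds to negative zero, which keeps its sign when formatted. One way to avoid it, sketched here in Python rather than awk (so this is a stand-in technique, not the asker's pipeline), is to round first and replace any zero result with positive zero before formatting. The function name `format_rounded` is hypothetical.

```python
def format_rounded(value, ndigits=2):
    """Round value and format it, never emitting a negative zero."""
    rounded = round(float(value), ndigits)
    # -0.0 == 0 is True, so this also catches negative zero.
    if rounded == 0:
        rounded = 0.0
    return f"{rounded:.{ndigits}f}"


small_negative = format_rounded(-0.001)
real_negative = format_rounded(-0.006)
from_text = format_rounded("1.239")
```

Values that round to a genuinely nonzero negative number keep their sign; only the zero case is normalized.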
1
vote
2answers
39 views
read in first 3 (out of 8) columns using lapply() in R
I know there have been similar questions answered, but I cannot apply them to the following:
I have text files I am trying to read into R:
filelist = list.files(pattern = paste0("*_",str_sub(stock1, ...
0
votes
0answers
19 views
clustering missing value indicator values to capture missing value patterns
I am doing some data preparation with Python using Pandas. I am working with a dataset that has about 80 variables with missing values, and I want to capture any patterns of missingness to cut down ...
-2
votes
1answer
23 views
How to send JSON data via curl?
I have some simple code, like this:
import json
from bottle import route, request,run
@route('/process_json',methods='POST')
def data_process():
    data = json.loads(request.data)
    username = ...
0
votes
1answer
19 views
Mapping financial data from multiple vendors to match internal formats and naming convention
I have a concern which I believe might be a good subject for the archives, as I imagine many people may encounter a similar problem at some point in their careers. I'm looking for any/all suggestions, ...
2
votes
1answer
52 views
Extracting an html table in another language using R
I am using R to extract HTML Tables from a website.
However, the HTML table is in Hindi and the text is displayed as Unicode codes.
Is there any way I can set/install the font family and get ...
-2
votes
0answers
18 views
Parse large data files with javascript
I want to extract data from large plaintext files, which come from electronic structure simulations, for a browser-based visualization.
The files are extremely inhomogeneous since log messages and actual ...
1
vote
4answers
401 views
data processing pipeline tool for research
I'm wondering if there is a tool for automating complex data processing pipelines on large datasets. Sort of like shell command piping (e.g. cmd1 | cmd2 | cmd3 > file), but supporting more than ...
0
votes
0answers
21 views
Node.js data processing distribution
I need a strategy to distribute data processing using Node.js. I'm trying to figure out whether using a worker pool and isolating groups of tasks in these workers is the best way, or using a ...
0
votes
0answers
33 views
What are the available missing value treatment methods in Weka?
Currently I am only able to find three types of missing value treatment methods under the "Preprocess" stage in Weka. They are the "ReplaceMissingValues", "ReplaceMissingWithUserConstant" and ...
1
vote
0answers
25 views
Is there a way to check if an integer's string representation contains a zero by using bitwise operations?
In C# I've been trying to come up with an interesting way to basically accomplish the following, but without using the string representation.
private static bool HasZeroDigit(int value)
{
    string ...
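For comparison, a string-free check can be sketched with repeated division by 10. Note that this is plain integer arithmetic, not the bitwise approach the question asks about, and it is written in Python rather than C# purely for illustration. The name `has_zero_digit` mirrors the `HasZeroDigit` signature above, and treating the value 0 as containing a zero digit is an assumption based on its decimal representation "0".

```python
def has_zero_digit(value: int) -> bool:
    """Return True if the decimal representation of value contains a 0."""
    value = abs(value)
    if value == 0:
        # The number 0 is written as the single digit "0".
        return True
    while value > 0:
        value, digit = divmod(value, 10)
        if digit == 0:
            return True
    return False


checks = [has_zero_digit(n) for n in (0, 105, 123, -10, 999)]
```

Each loop iteration peels off the lowest decimal digit, so the work is proportional to the number of digits.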
0
votes
3answers
56 views
Low level file processing in ruby/python
I hope this question hasn't already been answered, but I can't seem to figure out the right search term.
First some background:
I have text data files that are tabular and can easily climb into ...
0
votes
1answer
81 views
data processing pipeline python
I am working on the following problem. Let's say I have data (say, RGB image values as integers) in a file, one per line. I want to read 10000 of these lines and make a frame object (image frame containing ...
1
vote
1answer
25 views
Generate popular subjects from collection of post titles
I have a content aggregator website. I'd like to process the post titles to generate a list of the most popular post subjects. A subject could be "software development"; however, an important point is ...
3
votes
1answer
75 views
Custom Floating Point Representation
I'm trying to write a parser that will read a particular file type, and I need to map the different data types to C# equivalents. Most of them aren't that difficult, but I'm having trouble wrapping my ...
1
vote
1answer
52 views
Pandas Dataframe selecting groups with minimal cardinality
I have a problem where I need to take groups of rows from a data frame where the number of items in a group exceeds a certain number (cutoff). For those groups, I need to take some head rows and the ...
0
votes
2answers
50 views
Lexicon dictionary for synonym words
There are a few dictionaries available for natural language processing, such as dictionaries of positive and negative words.
Is there any dictionary available which contains a list of synonyms for all ...
0
votes
1answer
132 views
Data processing with adding columns dynamically in Python Pandas Dataframe
I have the following problem.
Let's say this is my CSV:
id f1 f2 f3
1 4 5 5
1 3 1 0
1 7 4 4
1 4 3 1
1 1 4 6
2 2 6 0
..........
So, I have rows which can be grouped by id.
I want to ...
11
votes
3answers
426 views
Plotting many lines as a heatmap
I have a large number (~1000) of files from a data logger that I am trying to process.
If I wanted to plot the trend from a single one of these log files I could do it using
...
0
votes
1answer
27 views
How to use supervised machine learning methods on inputs of varying dimensions?
Basically, I am dealing with training and test data sets (a bunch of arrays) of unequal lengths, like these:
a: {true, [1, 3, 4, 5, 5, 8, 10, 10]}
b: {true, [1, 3, 25, 18, 1, 10]}
c: {false, [1, 8 ...
1
vote
3answers
104 views
Need Better Algorithm to Scrub SQL Server Table with Java
I need to scrub an SQL Server table on a regular basis, but my solution is taking ridiculously long (about 12 minutes for 73,000 records).
My table has 4 fields:
id1
id2
val1
val2
For every group of ...
0
votes
1answer
57 views
How to write awk command to group line data and dump to file
A data file consists of multiple lines of data. A quick look at the data file:
./gc_string/datadata.distr 10 1273377106 2
./gc_string/datadata.distr 10 -540812264 2
...
1
vote
0answers
38 views
MATLAB remove lead and lag data from variable
I am working in a MATLAB loop with data variables (in column form) that are pulled in from Excel files. Each iteration of the loop opens a new file and repeats the process. Inside each file, I have ...
0
votes
2answers
63 views
Data normalization for new inputs into a trained neural network
I have created a backpropagation neural network and coded it in Q with a kdb+ database.
I pre-process the data going into the network by normalizing it into the range [0,1]; the network is ...
0
votes
1answer
55 views
Can Datomic simplify querying data contained in dynamically accessed HTML documents?
I need to write an API which would provide access to data being served as HTML documents from a web server. I need my users to be able to perform queries over the data.
Say on a web site there is ...
0
votes
2answers
123 views
How to check whether there is any error in DhtmlxGrid using Dataprocessor?
I want to send data from DhtmlxGrid in my MVC project. I have set some basic validation on grid cells, which is working fine. But before submitting I want to check whether there is any error in the grid. ...
4
votes
1answer
41 views
PHP Data Processing Failing With Ambiguous Error
The user requests a Product Category and says what quantity they want of it, e.g. Sugar 7 lbs.
In the search results, from the database, I have the following items (which are each different products):
...
0
votes
1answer
80 views
Processing lots of data in python, should I use multiple threads/processes?
I am writing a program to process a huge file (~1.5GB). I am running Python 2.7 on a Windows 7 computer with a pretty good CPU (8 cores). Would it be more efficient in any way to use multiple threads ...
0
votes
1answer
77 views
When to apply Data whitening
Data Whitening (feature scaling and mean normalization) is very useful when we use features that represent different characteristics and are on very different scales (e.g. number of rooms in a house ...
1
vote
1answer
93 views
Can I speed up a large data set operation in SQLite / Python?
I have a data set in the size range 1-5 billion 'box' objects stored in an SQLite database file in the format:
[x1,y1,z1,x2,y2,z2,box_id]
and currently I have an operation in a Python script that ...
2
votes
1answer
112 views
I Need To Search a “dirty” text file in R and count the instances of a certain character
The data is called homicides.txt
I need to make a function count <- function(cause=NULL)
which returns a certain integer
There are only a few acceptable causes, which if not present the function is ...
0
votes
1answer
97 views
What is a Data warehouse in this use case
I'm trying to figure out the difference (between tools/services/programs) between Data Warehouse, Clustered Data Processing and the tools/infrastructure for querying a Data Warehouse
So let's say I ...
0
votes
1answer
27 views
Speeding up document processing and loading into database
I have a few million documents. What I am trying to do is simple: process the documents to extract the information I need and load it into a database. I am doing it in Python and using SQLAlchemy. ...
0
votes
0answers
91 views
Appropriate data processing design pattern?
I'm looking for an appropriate design pattern to accomplish the following:
I want to extract some information from some "ComplexDataObject" (e.g. an Image) and save the relevant information in a more ...
2
votes
2answers
396 views
Aggregate Functions over a List in Java
I have a list of Java objects and I need to reduce it by applying aggregate functions, like a SELECT over a database.
NOTE: The data were calculated from multiple databases and service calls. I expect ...
1
vote
2answers
65 views
OrderBy when a parent value may be null
Assume I want to order table q in T, by column q.As.OrderByDescending(p => p.Beginning).FirstOrDefault().B.C.
However, q.As.OrderByDescending(p => p.Beginning).FirstOrDefault() or a.B may be ...
1
vote
3answers
456 views
Hibernate out of memory exception while processing large collection of elements
I am trying to process a collection of heavyweight elements (images). The size of the collection varies between 8000 and 50000 entries. But for some reason, after processing 1800-1900 entries my program falls ...
16
votes
3answers
450 views
How to smooth a curve in the right way?
Let's assume we have a dataset which might be given approximately by
import numpy as np
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
Therefore we have a variation of ...
0
votes
1answer
26 views
Running code (loop) server side and retrieving output later on
I am trying to write a simple program that keeps track of some internet data. I can get the data from a public JSON object, so that's not really the problem. I would like to automate the process as much ...
0
votes
0answers
53 views
Trying to process raw string (rank of countries by GDP) with python for other uses
I'm pretty new to this so sorry if this is a dumb question.
I'm trying to sort some data. Here's a rank of countries by GDP for example, that I'd like to find percentages of, add up certain amounts ...
0
votes
1answer
138 views
calculate min/avg/max/std-dev for ICMP time stamp data from hping [closed]
What's the best way to calculate min/avg/max/std-dev for some random data in shell?
What if one has several columns per line, and needs to calculate the statistics for each one?
Sample input (based ...
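As a rough illustration of the per-column statistics asked about here, the following Python sketch (Python rather than shell, so it is an alternative to the approach the question requests) computes min, average, max and population standard deviation for each whitespace-separated column. The function name `column_stats` and the sample numbers are hypothetical; the question's actual sample input is truncated and is not reproduced.

```python
import statistics


def column_stats(lines):
    """Compute min/avg/max/std-dev for each whitespace-separated column.

    Blank lines are skipped. Every remaining line is assumed to have the
    same number of numeric fields. The standard deviation is the
    population form (divide by N).
    """
    rows = [[float(field) for field in line.split()]
            for line in lines if line.strip()]
    columns = list(zip(*rows))
    return [
        {
            "min": min(col),
            "avg": statistics.fmean(col),
            "max": max(col),
            "std": statistics.pstdev(col),
        }
        for col in columns
    ]


# Hypothetical two-column input.
stats = column_stats(["1 10", "2 20", "", "3 30"])
```

Swapping `statistics.pstdev` for `statistics.stdev` would give the sample standard deviation (divide by N-1) instead.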
2
votes
2answers
668 views
solutions for cleaning/manipulating big data (currently using Stata)
I'm currently using a 10% sample of a very large dataset (10 vars, over 300m rows) which amounts to over 200 GB of data when stored in .dta format for the full dataset. Stata is able to handle ...
1
vote
2answers
435 views
Conditional merge for CSV files using python (pandas)
I am trying to merge >=2 files with the same schema.
The files will contain duplicate entries but rows won't be identical, for example:
file1:
store_id,address,phone
9191,9827 Park st,999999999
...
0
votes
2answers
137 views
Is a relational database appropriate for SAS like processing?
Currently I have a program that processes raw data in SAS, running queries like the following:
/*this code joins the details onto the spine, selecting the details
that have the lowest value2 that ...
1
vote
1answer
241 views
C# Signal Processing Plotting Rapid Data
I have a circuit that sends me two different kinds of data from sensors. The data arrives in packets. First comes '$', which separates one packet from another. After '$' it sends 16 bytes of microphone data and 1 byte ...
0
votes
2answers
83 views
Tools to do data processing from Java
I've got a legacy system that uses SAS to ingest raw data from the database, cleanse and consolidate it, and then score the outputted documents.
I want to move to a Java or similar object ...
1
vote
2answers
129 views
Convert python dictionary to flowchart
I have a program that will generate a very large dictionary-style list that would look something like this:
{"a":"b",
"b":"c",
"C":"d",
"d":"b",
"d":"e"}
I would like to create a program using ...
0
votes
1answer
265 views
How to handle time series data with other attributes in machine learning?
I'm working on a binary classification problem in which each data instance has several time series of different metrics, and there are also some other attributes. How should I deal with the time series: treat ...
0
votes
1answer
114 views
How do I perform Koyck lag transformations in PMML?
I'm using PMML to transfer my models (that I develop in R) between different platforms. One issue I often face is that, given input data, I need to do a lot of pre-processing. Most of the time this is rather ...