File control tuples

Bhupesh Chawda
Hi,

Emitting file information from a file-based source, such as a file input
operator in Malhar, seems like a good feature to provide. For instance, it is
useful for any downstream operator to know that a data tuple belongs to a
certain file.


We propose to add the capability for the abstract file input operator to emit
file control tuples. These control tuples can include filenames as well as
any metadata that the user wishes to include along with them.

To link this metadata to each tuple, we can add another port to the input
operator which would carry the metadata along with the actual tuple. We
can try to reduce the amount of metadata that goes with each tuple by
having some sort of meta encoding in the control tuple.
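
One way to picture the "meta encoding" idea is the rough Java sketch below. It
is only an illustration under assumptions: the class names FileControlTuple,
TaggedTuple and FileControlSketch, the integer file id, and the "lineCount"
metadata key are all hypothetical and are not part of Malhar. The full
metadata travels once in a control tuple, and each data tuple carries only a
small reference to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration only; none of these exist in Malhar.
final class FileControlTuple {
  final int fileId;                    // short reference carried by data tuples
  final String fileName;
  final Map<String, String> metadata;  // any user-supplied metadata

  FileControlTuple(int fileId, String fileName, Map<String, String> metadata) {
    this.fileId = fileId;
    this.fileName = fileName;
    this.metadata = metadata;
  }
}

final class TaggedTuple {
  final int fileId;      // points back to a FileControlTuple
  final String payload;  // the actual data, here one line of the file

  TaggedTuple(int fileId, String payload) {
    this.fileId = fileId;
    this.payload = payload;
  }
}

final class FileControlSketch {
  private int nextFileId = 0;

  // Emits one control tuple for the file, then one tagged tuple per line.
  List<Object> emitFile(String fileName, List<String> lines) {
    List<Object> out = new ArrayList<>();
    int id = nextFileId++;
    Map<String, String> meta = new LinkedHashMap<>();
    meta.put("lineCount", String.valueOf(lines.size())); // assumed example metadata
    out.add(new FileControlTuple(id, fileName, meta));
    for (String line : lines) {
      out.add(new TaggedTuple(id, line));
    }
    return out;
  }

  public static void main(String[] args) {
    FileControlSketch source = new FileControlSketch();
    for (Object t : source.emitFile("a.txt", Arrays.asList("x", "y"))) {
      if (t instanceof FileControlTuple) {
        System.out.println("control: " + ((FileControlTuple) t).fileName);
      } else {
        System.out.println("data: " + ((TaggedTuple) t).payload);
      }
    }
  }
}
```

With this shape the per-tuple overhead is a single int, whatever the size of
the metadata map.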


~ Bhupesh


_______________________________________________________

Bhupesh Chawda

E: [hidden email] | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org

Re: File control tuples

Pramod Immaneni
+1

On Fri, Jun 2, 2017 at 12:06 PM, Bhupesh Chawda <[hidden email]> wrote:

> [...]

Re: File control tuples

Thomas Weise-2
Administrator
In reply to this post by Bhupesh Chawda
How does this relate to the batch control tuples work?

With a separate port, how can a downstream operator relate the metadata to
the tuples emitted from the primary port?

Re: File control tuples

Sanjay Pujare
The way this is supposed to work: if the downstream operator wants both data
and metadata, it connects to the new port ("...which would carry the
metadata along with the actual tuple..."); otherwise it connects to the
legacy port, which continues to behave the same way. Note that each tuple on
the new port has metadata as well.
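
A rough sketch of that arrangement, for illustration only: the two ports are
modeled here as plain java.util.function.Consumer fields rather than real Apex
output ports, and the names LineWithFile, TwoPortReaderSketch, legacyPort and
metadataPort are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical tuple type for the new port: the data plus its file reference.
final class LineWithFile {
  final String fileName;
  final String line;

  LineWithFile(String fileName, String line) {
    this.fileName = fileName;
    this.line = line;
  }
}

// Ports are modeled as Consumers; this is not the Apex port API.
final class TwoPortReaderSketch {
  Consumer<String> legacyPort = t -> { };          // unconnected by default
  Consumer<LineWithFile> metadataPort = t -> { };  // unconnected by default

  // Every line goes to both ports; only the new port carries the file name.
  void emit(String fileName, String line) {
    legacyPort.accept(line);
    metadataPort.accept(new LineWithFile(fileName, line));
  }

  public static void main(String[] args) {
    TwoPortReaderSketch reader = new TwoPortReaderSketch();
    List<String> plain = new ArrayList<>();
    reader.legacyPort = plain::add; // an existing consumer keeps working unchanged
    reader.metadataPort = t -> System.out.println(t.fileName + ": " + t.line);
    reader.emit("a.txt", "hello");
    System.out.println("legacy port saw " + plain);
  }
}
```

A downstream operator that only needs the data keeps reading from the legacy
port, so existing applications would not have to change.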

+1 for the proposal.

On Sat, Jun 3, 2017 at 9:37 AM, Thomas Weise <[hidden email]> wrote:

> How does this relate to the batch control tuples work?
>
> With a separate port, how can a downstream operator relate the metadata to
> the tuples emitted from the primary port?

Re: File control tuples

Bhupesh Chawda
In reply to this post by Thomas Weise-2
This is not specific to the batch work. This is a more generic
functionality which even streaming applications can benefit from.

The separate port is for both the actual tuple as well as the metadata.

~ Bhupesh


Re: File control tuples

Thomas Weise-2
Administrator
The extra port seems unnecessary unless you are planning to associate each
individual data tuple with a file reference (similar to WindowedTuple)?

On Sat, Jun 3, 2017 at 8:50 PM, Bhupesh Chawda <[hidden email]>
wrote:

> This is not specific to the batch work. This is a more generic
> functionality which even streaming applications can benefit from.
>
> The separate port is for both the actual tuple as well as the metadata.

Re: File control tuples

Bhupesh Chawda
Yes, the idea is to associate every tuple with a reference to the metadata
sent in the control tuple.
That way, even with a partitioned input operator, the downstream operator can
distinguish between tuples from different files.
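
As a rough illustration of the downstream side, here is a sketch that resolves
each tuple's reference against the control tuples it has seen. It rests on two
assumptions that are not settled in this thread: the reference is unique across
partitions (a string such as "p0-0" is used here), and a file's control tuple
arrives before its data tuples. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical downstream operator that counts data tuples per source file.
final class FileAwareCounterSketch {
  private final Map<String, String> fileNameByRef = new HashMap<>();
  private final Map<String, Integer> countByFile = new LinkedHashMap<>();

  // A control tuple announces a file and the reference its data tuples will use.
  void onControlTuple(String ref, String fileName) {
    fileNameByRef.put(ref, fileName);
  }

  // A data tuple carries only the reference; the file name is looked up.
  void onDataTuple(String ref, String payload) {
    String fileName = fileNameByRef.get(ref);
    if (fileName == null) {
      // Assumption: control tuples always precede their data tuples.
      throw new IllegalStateException("no control tuple seen for ref " + ref);
    }
    countByFile.merge(fileName, 1, Integer::sum);
  }

  Map<String, Integer> counts() {
    return countByFile;
  }

  public static void main(String[] args) {
    FileAwareCounterSketch op = new FileAwareCounterSketch();
    // Tuples from two partitions arrive interleaved.
    op.onControlTuple("p0-0", "a.txt");
    op.onControlTuple("p1-0", "b.txt");
    op.onDataTuple("p0-0", "line 1 of a");
    op.onDataTuple("p1-0", "line 1 of b");
    op.onDataTuple("p0-0", "line 2 of a");
    System.out.println(op.counts());
  }
}
```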

~ Bhupesh

On Jun 3, 2017 21:30, "Thomas Weise" <[hidden email]> wrote:

> The extra port seems unnecessary unless you are planning to associate each
> individual data tuple with a file reference (similar to WindowedTuple)?
